# Slurm Cluster Deployment Guide (CentOS 7)

Cluster node plan:

| Node | Hostname | IP Address | Role |
| --- | --- | --- | --- |
| Management | manage01 | 10.1.0.64 | Slurm controller, NFS server, MariaDB |
| Login | login01 | 10.1.0.65 | User login, job submission |
| Compute | compute01 | 10.1.0.66 | Job execution |
| Compute | compute02 | 10.1.0.67 | Job execution |

## 1. Cluster Base Initialization

### 1.1 Set Hostnames

Run on each node:

```bash
# manage01
hostnamectl set-hostname manage01

# login01
hostnamectl set-hostname login01

# compute01
hostnamectl set-hostname compute01

# compute02
hostnamectl set-hostname compute02
```

### 1.2 Configure hosts Resolution

On all nodes, edit /etc/hosts and append:

```
10.1.0.64 manage01
10.1.0.65 login01
10.1.0.66 compute01
10.1.0.67 compute02
```

Verify:

```bash
ping manage01
ping login01
ping compute01
ping compute02
```
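
To check all four entries in one pass, a small loop (a convenience sketch, not part of the original procedure) reports any host that does not answer:

```bash
# Ping each node once; -W 2 caps the wait at two seconds per host
for h in manage01 login01 compute01 compute02; do
  ping -c 1 -W 2 "$h" >/dev/null 2>&1 && echo "$h OK" || echo "$h FAILED"
done
```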

### 1.3 Disable the Firewall and Related Services

Run on all nodes:

```bash
# Disable the firewall
systemctl disable --now firewalld

# Disable dnsmasq
systemctl disable --now dnsmasq

# Disable NetworkManager
systemctl disable --now NetworkManager
```

### 1.4 Disable SELinux

```bash
# Disable immediately (until reboot)
setenforce 0

# Disable permanently
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/selinux/config
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/sysconfig/selinux
sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/sysconfig/selinux

# Verify after a reboot
reboot
getenforce   # should print "Disabled"
```

### 1.5 Disable Swap

```bash
swapoff -a
sysctl -w vm.swappiness=0

# Disable permanently: comment out every swap entry in /etc/fstab
sed -ri '/^[^#]*swap/s@^@#@' /etc/fstab

# Verify
free -h
```

### 1.6 Configure Yum Repositories (Aliyun Mirror)

```bash
# Back up the existing repo files
mkdir -p /etc/yum.repos.d/bak
mv /etc/yum.repos.d/*.repo /etc/yum.repos.d/bak/

# Download the Aliyun repo file
curl -o /etc/yum.repos.d/CentOS-Base.repo \
    https://mirrors.aliyun.com/repo/Centos-7.repo

# Fix up the repo file (CentOS 7 is EOL, so point at vault/mirror URLs)
sed -i 's/mirrorlist.centos.org/vault.centos.org/g' /etc/yum.repos.d/CentOS-Base.repo
sed -i 's/\$releasever/7/g' /etc/yum.repos.d/CentOS-Base.repo
sed -i 's/http:\/\/mirror.centos.org/https:\/\/mirrors.aliyun.com/g' /etc/yum.repos.d/CentOS-Base.repo

# Rebuild the cache
yum clean all
yum makecache
```

### 1.7 Configure Time Synchronization (NTP)

```bash
yum install ntpdate -y

# Set the time zone
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
echo 'Asia/Shanghai' >/etc/timezone

# Sync time (public server)
ntpdate time2.aliyun.com

# Sync time (campus/intranet server)
ntpdate vineyard.pku.edu.cn

# Re-sync every 5 minutes via cron
echo "*/5 * * * * /usr/sbin/ntpdate -u time2.aliyun.com > /dev/null 2>&1" >> /var/spool/cron/root
crontab -l
```
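
ntpdate is deprecated, and chronyd is the stock time daemon on CentOS 7. As an alternative to the cron job above, a minimal chrony setup might look like the sketch below (it assumes the same time2.aliyun.com server; adjust for intranet-only nodes):

```bash
yum install chrony -y

# Comment out the default pool servers and add the Aliyun server instead
sed -i 's/^server /#server /' /etc/chrony.conf
echo 'server time2.aliyun.com iburst' >> /etc/chrony.conf

systemctl enable --now chronyd
chronyc sources -v   # confirm the source is reachable and selected
```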

### 1.8 Set Up Passwordless SSH (run on manage01)

```bash
yum install sshpass -y
mkdir -p /extend/shell

cat >/extend/shell/fenfa_pub.sh << 'EOF'
#!/bin/bash
PASS=sskj2025

if [ ! -f ~/.ssh/id_rsa ]; then
    ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
fi

for ip in 65 66 67; do
    echo "Sending public key to 10.1.0.$ip..."
    sshpass -p $PASS ssh-copy-id -o StrictHostKeyChecking=no 10.1.0.$ip
done
EOF

chmod +x /extend/shell/fenfa_pub.sh
/extend/shell/fenfa_pub.sh

# Verify passwordless login
ssh login01
ssh compute01
ssh compute02
```
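
Each manual ssh above drops you into a remote shell. A non-interactive check (a convenience sketch) covers all three nodes at once; BatchMode makes ssh fail instead of prompting for a password:

```bash
for h in login01 compute01 compute02; do
  ssh -o BatchMode=yes "$h" hostname || echo "passwordless login to $h FAILED"
done
```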

## 2. NFS Shared Storage

### 2.1 Server Configuration (manage01)

Install the NFS and RPC services:

```bash
yum install -y nfs-utils rpcbind
```

Create the shared directory:

```bash
mkdir /data
chmod 755 /data
```

Edit /etc/exports:

```
/data *(rw,sync,insecure,no_subtree_check,no_root_squash)
```

Export options:

| Option | Meaning |
| --- | --- |
| rw | Allow read and write |
| sync | Write changes to disk synchronously |
| insecure | Accept connections from non-reserved ports |
| no_subtree_check | Disable subtree checking |
| no_root_squash | Keep root privileges for the client's root user |

Start the services:

```bash
systemctl start rpcbind
systemctl start nfs-server
systemctl enable rpcbind
systemctl enable nfs-server
```

Verify:

```bash
showmount -e localhost
# Expected output: /data *
```
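
If /etc/exports is changed later, the export table can be reloaded without restarting the service:

```bash
exportfs -ra            # re-read /etc/exports and re-export everything
showmount -e localhost  # confirm the updated export list
```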

### 2.2 Client Configuration (login01, compute01, compute02)

Install and mount:

```bash
yum install nfs-utils -y

# List the server's exports
showmount -e manage01

# Create the mount point and mount
mkdir /data
mount manage01:/data /data -o proto=tcp -o nolock

# Verify
df -h
```

Configure automatic mounting at boot. Edit /etc/fstab and append:

```
manage01:/data /data nfs rw,auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
```

fstab options:

| Option | Meaning |
| --- | --- |
| rw | Mount read-write |
| auto | Mount automatically at boot |
| nofail | Do not block boot if the mount fails |
| noatime | Do not record access times |
| nolock | Disable file locking |
| intr | Allow operations to be interrupted |
| tcp | Use the TCP transport |
| actimeo=1800 | Cache attributes for 1800 seconds |

Test the fstab entry:

```bash
mount -a   # no errors means the entry is valid
```

### 2.3 Create the Shared Directory Layout

```bash
mkdir /data/home       # user home directories
mkdir /data/software   # shared software installs
```

### 2.4 NFS Functional Test

```bash
# Write from any node
echo "hello nfs server" > /data/test.txt

# Read back on the other nodes
cat /data/test.txt
```
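
With the passwordless SSH from step 1.8 in place, the read-back can be driven from manage01 in one pass (a convenience sketch):

```bash
for h in login01 compute01 compute02; do
  echo -n "$h: "
  ssh "$h" cat /data/test.txt
done
```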

## 3. Slurm Cluster Deployment

### 3.1 Install the Munge Authentication Service

Munge authenticates communication between cluster nodes; the munge user's UID and GID must be identical on every node.

Create the munge user (all nodes):

```bash
groupadd -g 1108 munge
useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
```

Seed the entropy pool (manage01):

```bash
yum install -y rng-tools
rngd -r /dev/urandom
```

Edit /usr/lib/systemd/system/rngd.service:

```
[Service]
ExecStart=/sbin/rngd -f -r /dev/urandom
```

```bash
systemctl daemon-reload
systemctl start rngd
systemctl enable rngd
```

Install Munge (all nodes):

```bash
yum install epel-release -y
yum install munge munge-libs munge-devel -y
```

Generate and distribute the key (manage01):

```bash
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

# Distribute to the other nodes
scp -p /etc/munge/munge.key root@login01:/etc/munge/
scp -p /etc/munge/munge.key root@compute01:/etc/munge/
scp -p /etc/munge/munge.key root@compute02:/etc/munge/
```

Set permissions and start (all nodes):

```bash
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key

systemctl start munge
systemctl enable munge
```

Verify Munge:

```bash
munge -n                           # generate a credential locally
munge -n | unmunge                 # decode locally
munge -n | ssh compute01 unmunge   # decode on a remote node
```
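
The same remote check can be looped over every node (a convenience sketch); a failure usually means a mismatched munge.key or clock skew between nodes:

```bash
for h in login01 compute01 compute02; do
  munge -n | ssh "$h" unmunge >/dev/null && echo "$h: munge OK" || echo "$h: munge FAILED"
done
```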

### 3.2 Install MariaDB (manage01)

```bash
yum -y install mariadb-server
systemctl start mariadb
systemctl enable mariadb
```

Set the root password and create the accounting database:

```bash
# Generate a random 16-character root password
ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
mysql -u root -e "UPDATE mysql.user SET Password=PASSWORD('${ROOT_PASS}') WHERE User='root'; FLUSH PRIVILEGES;"
mysql -uroot -p"${ROOT_PASS}" -e "create database slurm_acct_db;"
echo "MariaDB root password: $ROOT_PASS"
```

Create the Slurm database user:

```bash
mysql -uroot -p"$ROOT_PASS"
```

Then, inside the MySQL shell:

```sql
create user slurm;
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
flush privileges;
```
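
SchedMD's accounting documentation recommends raising MariaDB's InnoDB buffer pool, log size, and lock-wait timeout for slurmdbd. A sketch follows; the values are illustrative and should be sized to the host:

```bash
systemctl stop mariadb

cat >/etc/my.cnf.d/innodb.cnf <<'EOF'
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
EOF

# On MariaDB 5.5 (the CentOS 7 default) a changed innodb_log_file_size requires
# removing the old redo logs while stopped; they are recreated at the new size.
rm -f /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1

systemctl start mariadb
```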

### 3.3 Create the Slurm User (all nodes)

```bash
groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
```

### 3.4 Install Slurm Build Dependencies (all nodes)

```bash
yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel python3 -y
```

### 3.5 Build the Slurm RPMs (manage01)

```bash
wget https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2 --no-check-certificate
yum install rpm-build -y
rpmbuild -ta --nodeps slurm-22.05.3.tar.bz2

# Distribute the RPMs: create the target directory on each node, then copy
for h in login01 compute01 compute02; do
  ssh "$h" mkdir -p /root/rpmbuild/RPMS
  scp -r /root/rpmbuild/RPMS/x86_64 root@"$h":/root/rpmbuild/RPMS/
done
```

### 3.6 Install Slurm (all nodes)

```bash
cd /root/rpmbuild/RPMS/x86_64/
yum localinstall slurm-*
```

### 3.7 Configure Slurm (manage01)

Copy the configuration templates:

```bash
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
```

#### slurm.conf

cgroup.conf keeps its default contents; change slurm.conf to:

```
################################################
# CONTROL #
################################################
ClusterName=cluster
SlurmctldHost=manage01
SlurmctldPort=6817
SlurmdPort=6818
SlurmUser=slurm

################################################
# LOGGING & OTHER PATHS #
################################################
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

################################################
# ACCOUNTING #
################################################
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage01
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd

################################################
# JOBS #
################################################
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux

################################################
# SCHEDULING & ALLOCATION #
################################################
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

################################################
# TIMERS #
################################################
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

################################################
# OTHER #
################################################
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SwitchType=switch/none
TaskPlugin=task/affinity

################################################
# NODES #
################################################
NodeName=manage01 NodeAddr=10.1.0.64 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7823 State=UNKNOWN
NodeName=login01 NodeAddr=10.1.0.65 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7823 State=UNKNOWN
NodeName=compute0[1-2] NodeAddr=10.1.0.6[6-7] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7823 State=UNKNOWN

################################################
# PARTITIONS #
################################################
PartitionName=compute Nodes=compute0[1-2] Default=YES MaxTime=INFINITE State=UP
```
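
The CPUs/Sockets/Cores/RealMemory figures in the NODES block must match the real hardware, or nodes will drift into an invalid state. Running `slurmd -C` on each machine prints a ready-made NodeName line for that host:

```bash
slurmd -C
# e.g.: NodeName=compute01 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7823
```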

#### slurmdbd.conf

```
# Authentication
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2

# slurmDBD
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid

# Database
StorageType=accounting_storage/mysql
StoragePass=123456
StorageUser=slurm
StorageLoc=slurm_acct_db
```

DebugLevel values, from least to most verbose: quiet → fatal → error → info → verbose → debug … debug5.

Distribute the configuration files (strictly, slurmdbd.conf is only read by slurmdbd on manage01):

```bash
scp -r /etc/slurm/*.conf root@login01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute02:/etc/slurm/
```

### 3.8 Create Directories and Set Permissions

Run on all nodes:

```bash
mkdir -p /var/spool/slurmd
chown slurm: /var/spool/slurmd
mkdir -p /var/log/slurm
chown slurm: /var/log/slurm
```

Management node only:

```bash
mkdir -p /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
```

### 3.9 Start the Slurm Services

Management node (manage01):

```bash
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf

systemctl start slurmdbd
systemctl enable slurmdbd

systemctl start slurmctld
systemctl enable slurmctld
```

All nodes:

```bash
systemctl start slurmd
systemctl enable slurmd
```

### 3.10 Verify the Cluster

```bash
scontrol show config
sinfo
scontrol show partition
scontrol show node
```

Submit a test job:

```bash
srun -N2 hostname
scontrol show jobs
squeue -a
```
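
srun runs interactively; most users will go through sbatch instead, so it is worth exercising that path too. A minimal batch script (a sketch; it writes to the /data NFS share set up in section 2):

```bash
cat > /data/hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --output=/data/hello_%j.out
srun hostname
EOF

sbatch /data/hello.sbatch
squeue                   # watch it run
cat /data/hello_*.out    # should list compute01 and compute02
```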

### 3.11 Configure QOS

Slurm ships with a single QOS, normal. Integration with OpenSCOW expects three: low, normal, and high:

```bash
# List existing QOS
sacctmgr show qos

# Create the new QOS entries
sacctmgr -i create qos name=low
sacctmgr -i create qos name=high

# Set priorities
sacctmgr -i modify qos name=normal set Priority=1000
sacctmgr -i modify qos name=high set Priority=2000
sacctmgr -i modify qos name=low set Priority=500

# Attach the QOS list to a user
sacctmgr modify user name={username} set qos=low,high,normal defaultQOS=low
```
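
Confirm the priorities took effect:

```bash
sacctmgr show qos format=Name,Priority
```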

### 3.12 Bootstrap Accounts and Users (optional)

```bash
# Create an account
sacctmgr add account name=a_admin

# Create a user under it
sacctmgr add user name=demo_admin account=a_admin partition=compute qos=low,high,normal defaultQOS=low

# Inspect the associations
sacctmgr show ass format="Cluster,Account,User,Partition,QOS"
```

## 4. Handling Common Errors

| Error | Fix |
| --- | --- |
| slurmdbd.conf file should be 600 is 644 | `chmod 600 /etc/slurm/slurmdbd.conf && systemctl restart slurmdbd` |
| slurmdbd.conf not owned by SlurmUser root!=slurm | `chown slurm: /etc/slurm/slurmdbd.conf && systemctl restart slurmdbd` |

## 5. Deployment Checklist

| Check | Command |
| --- | --- |
| Hostname | `hostname` |
| hosts resolution | `ping manage01` |
| SELinux | `getenforce` |
| Swap | `free -h` |
| Time sync | `date` |
| Passwordless SSH | `ssh compute01` |
| NFS server status | `systemctl status nfs-server` |
| NFS mount | `df -h \| grep data` |
| Munge status | `systemctl status munge` |
| Slurm controller | `systemctl status slurmctld` |
| Slurm compute daemon | `systemctl status slurmd` |
| Cluster nodes | `sinfo` |
