Complete Slurm Cluster Deployment Guide (CentOS 7)
Cluster node plan:

| Node | Hostname | IP address | Role |
|---|---|---|---|
| Management node | manage01 | 10.1.0.64 | Slurm controller, NFS server, MariaDB |
| Login node | login01 | 10.1.0.65 | User login, job submission |
| Compute node | compute01 | 10.1.0.66 | Job execution |
| Compute node | compute02 | 10.1.0.67 | Job execution |
1. Cluster Base Initialization

1.1 Set hostnames

Run the matching command on each node:

```bash
hostnamectl set-hostname manage01    # on 10.1.0.64
hostnamectl set-hostname login01     # on 10.1.0.65
hostnamectl set-hostname compute01   # on 10.1.0.66
hostnamectl set-hostname compute02   # on 10.1.0.67
```
1.2 Configure hosts resolution

On all nodes, edit /etc/hosts and append:

```
10.1.0.64 manage01
10.1.0.65 login01
10.1.0.66 compute01
10.1.0.67 compute02
```
Verify:

```bash
ping manage01
ping login01
ping compute01
ping compute02
```
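The /etc/hosts additions can also be scripted idempotently, so re-running the step never duplicates entries. A sketch shown here against a scratch file (`/tmp/hosts.demo` is just for illustration; use /etc/hosts on the real nodes):

```bash
# Append each IP/hostname mapping only if the hostname is not already present
HOSTS_FILE=/tmp/hosts.demo          # substitute /etc/hosts on real nodes
touch "$HOSTS_FILE"
while read -r ip name; do
    grep -qw "$name" "$HOSTS_FILE" || echo "$ip $name" >> "$HOSTS_FILE"
done <<'EOF'
10.1.0.64 manage01
10.1.0.65 login01
10.1.0.66 compute01
10.1.0.67 compute02
EOF
```

Running it a second time changes nothing, because the `grep -qw` guard skips names that already resolve.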
1.3 Disable the firewall and related services

On all nodes:

```bash
systemctl disable --now firewalld
systemctl disable --now dnsmasq
systemctl disable --now NetworkManager
```
1.4 Disable SELinux

```bash
setenforce 0
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/selinux/config
sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/selinux/config
sed -i 's#SELINUX=enforcing#SELINUX=disabled#g' /etc/sysconfig/selinux
sed -i 's#SELINUX=permissive#SELINUX=disabled#g' /etc/sysconfig/selinux
reboot
# After the reboot, confirm it reports Disabled:
getenforce
```
1.5 Disable swap

```bash
swapoff -a
sysctl -w vm.swappiness=0
sed -ri '/^[^#]*swap/s@^@#@' /etc/fstab
free -h
```
1.6 Configure Yum repositories (Aliyun mirror)

```bash
mkdir -p /etc/yum.repos.d/bak
mv /etc/yum.repos.d/*.repo /etc/yum.repos.d/bak/
curl -o /etc/yum.repos.d/CentOS-Base.repo \
  https://mirrors.aliyun.com/repo/Centos-7.repo
sed -i 's/mirrorlist.centos.org/vault.centos.org/g' /etc/yum.repos.d/CentOS-Base.repo
sed -i 's/\$releasever/7/g' /etc/yum.repos.d/CentOS-Base.repo
sed -i 's/http:\/\/mirror.centos.org/https:\/\/mirrors.aliyun.com/g' /etc/yum.repos.d/CentOS-Base.repo
yum clean all
yum makecache
```
1.7 Configure time synchronization (NTP)

```bash
yum install ntpdate -y
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
echo 'Asia/Shanghai' > /etc/timezone
ntpdate time2.aliyun.com
ntpdate vineyard.pku.edu.cn
echo "*/5 * * * * /usr/sbin/ntpdate -u time2.aliyun.com > /dev/null 2>&1" >> /var/spool/cron/root
crontab -l
```
1.8 Configure passwordless SSH (run on manage01)

```bash
yum install sshpass -y
mkdir -p /extend/shell
cat > /extend/shell/fenfa_pub.sh << 'EOF'
#!/bin/bash
PASS=sskj2025
if [ ! -f ~/.ssh/id_rsa ]; then
    ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
fi
for ip in 65 66 67; do
    echo "Sending public key to 10.1.0.$ip ..."
    sshpass -p $PASS ssh-copy-id -o StrictHostKeyChecking=no 10.1.0.$ip
done
EOF
chmod +x /extend/shell/fenfa_pub.sh
/extend/shell/fenfa_pub.sh

# Verify passwordless login:
ssh login01
ssh compute01
ssh compute02
```
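Many later steps must be repeated on every node. With the passwordless SSH above in place, a small helper on manage01 can run a command across all other nodes; `allnodes.sh` and its path are hypothetical names, a sketch rather than part of the original procedure:

```bash
# Write a helper that runs one command on every other node over SSH
cat > /tmp/allnodes.sh <<'EOF'
#!/bin/bash
# Usage: allnodes.sh '<command>'
for h in login01 compute01 compute02; do
    echo "== $h =="
    ssh root@"$h" "$1"
done
EOF
chmod +x /tmp/allnodes.sh
```

Example use: `/tmp/allnodes.sh 'systemctl enable --now munge'`.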
2. NFS Shared Storage

2.1 Server configuration (manage01)

Install the NFS and RPC services:

```bash
yum install -y nfs-utils rpcbind
```
Create and configure the shared directory:

```bash
mkdir /data
chmod 755 /data
```
Edit /etc/exports:

```
/data *(rw,sync,insecure,no_subtree_check,no_root_squash)
```
Export options:

| Option | Description |
|---|---|
| rw | Allow read/write access |
| sync | Write to disk synchronously |
| insecure | Allow connections from non-reserved ports |
| no_subtree_check | Disable subtree checking |
| no_root_squash | Preserve client root privileges |
Start the services:

```bash
systemctl start rpcbind
systemctl start nfs-server
systemctl enable rpcbind
systemctl enable nfs-server
```
Verify with `showmount -e localhost`; the export list should include `/data`.
2.2 Client configuration (login01, compute01, compute02)

Install and mount:

```bash
yum install nfs-utils -y
showmount -e manage01
mkdir /data
mount manage01:/data /data -o proto=tcp -o nolock
df -h
```
Configure mounting at boot. Edit /etc/fstab and append:

```
manage01:/data /data nfs rw,auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
```
fstab mount options:

| Option | Description |
|---|---|
| rw | Mount read/write |
| auto | Mount at boot |
| nofail | A failed mount does not block system startup |
| noatime | Do not record access times |
| nolock | Disable file locking |
| intr | Allow operations to be interrupted |
| tcp | Use the TCP protocol |
| actimeo=1800 | Attribute cache timeout of 1800 seconds |
Test the fstab entry without rebooting: `umount /data && mount -a && df -h /data`.
2.3 Create the shared directory layout

```bash
mkdir /data/home
mkdir /data/software
```
2.4 NFS functional test

```bash
# On the server:
echo "hello nfs server" > /data/test.txt
# On any client:
cat /data/test.txt
```
3. Slurm Cluster Deployment

3.1 Install the Munge authentication service

Munge authenticates communication between cluster nodes; the munge UID/GID must be identical on all nodes.
Create the munge user (all nodes):

```bash
groupadd -g 1108 munge
useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
```
Seed the entropy pool (manage01):

```bash
yum install -y rng-tools
rngd -r /dev/urandom
```
Edit /usr/lib/systemd/system/rngd.service:

```
[Service]
ExecStart=/sbin/rngd -f -r /dev/urandom
```

```bash
systemctl daemon-reload
systemctl start rngd
systemctl enable rngd
```
Install Munge (all nodes):

```bash
yum install epel-release -y
yum install munge munge-libs munge-devel -y
```
Generate and distribute the key (manage01):

```bash
/usr/sbin/create-munge-key -r
dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
scp -p /etc/munge/munge.key root@login01:/etc/munge/
scp -p /etc/munge/munge.key root@compute01:/etc/munge/
scp -p /etc/munge/munge.key root@compute02:/etc/munge/
```
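The `dd` command above writes exactly 1024 random bytes, which is the key length Munge expects here. A quick size sanity check before distributing the key, demonstrated against a scratch file (`/tmp/munge.key.demo` is just for illustration):

```bash
# Generate a demo key the same way the guide does and confirm its size
dd if=/dev/urandom of=/tmp/munge.key.demo bs=1 count=1024 2>/dev/null
wc -c /tmp/munge.key.demo   # should report 1024 bytes
```

If the size on any node differs after copying, the scp step was incomplete and Munge authentication will fail.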
Set permissions and start (all nodes):

```bash
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl start munge
systemctl enable munge
```
Verify Munge:

```bash
munge -n
munge -n | unmunge
munge -n | ssh compute01 unmunge
```
3.2 Install MariaDB (manage01)

```bash
yum -y install mariadb-server
systemctl start mariadb
systemctl enable mariadb
```
Set the root password and create the accounting database:

```bash
ROOT_PASS=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16)
mysql -u root -e "UPDATE mysql.user SET Password=PASSWORD('${ROOT_PASS}') WHERE User='root'; FLUSH PRIVILEGES;"
mysql -uroot -p"${ROOT_PASS}" -e "create database slurm_acct_db;"
echo "MariaDB root password: $ROOT_PASS"
```
Create the Slurm database user:

```bash
mysql -uroot -p"$ROOT_PASS"
```

At the MySQL prompt:

```sql
create user slurm;
grant all on slurm_acct_db.* TO 'slurm'@'localhost' identified by '123456' with grant option;
flush privileges;
```
3.3 Create the Slurm user (all nodes)

```bash
groupadd -g 1109 slurm
useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
```
3.4 Install Slurm build dependencies (all nodes)

```bash
yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel python3 -y
```
3.5 Build the Slurm RPM packages (manage01)

```bash
wget https://download.schedmd.com/slurm/slurm-22.05.3.tar.bz2 --no-check-certificate
yum install rpm-build -y
rpmbuild -ta --nodeps slurm-22.05.3.tar.bz2
mkdir -p /root/rpmbuild/RPMS/
scp -r /root/rpmbuild/RPMS/x86_64 root@login01:/root/rpmbuild/RPMS/x86_64
scp -r /root/rpmbuild/RPMS/x86_64 root@compute01:/root/rpmbuild/RPMS/x86_64
scp -r /root/rpmbuild/RPMS/x86_64 root@compute02:/root/rpmbuild/RPMS/x86_64
```
3.6 Install Slurm (all nodes)

```bash
cd /root/rpmbuild/RPMS/x86_64/
yum localinstall slurm-*
```
3.7 Configure Slurm (manage01)

Copy the configuration templates:

```bash
cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
```
cgroup.conf keeps the default configuration; edit slurm.conf as follows:
```
ClusterName=cluster
SlurmctldHost=manage01
SlurmctldPort=6817
SlurmdPort=6818
SlurmUser=slurm
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=manage01
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd

JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=123456
JobCompPort=3306
JobCompType=jobcomp/mysql
JobCompUser=slurm
JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SwitchType=switch/none
TaskPlugin=task/affinity

NodeName=manage01 NodeAddr=10.1.0.64 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7823 State=UNKNOWN
NodeName=login01 NodeAddr=10.1.0.65 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7823 State=UNKNOWN
NodeName=compute0[1-2] NodeAddr=10.1.0.6[6-7] CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=7823 State=UNKNOWN
PartitionName=compute Nodes=compute0[1-2] Default=YES MaxTime=INFINITE State=UP
```
slurmdbd.conf:

```
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StoragePass=123456
StorageUser=slurm
StorageLoc=slurm_acct_db
```
DebugLevel values, from least to most verbose: quiet → fatal → error → info → verbose → debug ~ debug5.
Distribute the configuration files:

```bash
scp -r /etc/slurm/*.conf root@login01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute01:/etc/slurm/
scp -r /etc/slurm/*.conf root@compute02:/etc/slurm/
```
3.8 Create directories and set permissions

On all nodes:

```bash
mkdir -p /var/spool/slurmd
chown slurm: /var/spool/slurmd
mkdir -p /var/log/slurm
chown slurm: /var/log/slurm
```
On the management node only:

```bash
mkdir -p /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
```
3.9 Start the Slurm services

Management node (manage01):

```bash
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
systemctl start slurmdbd
systemctl enable slurmdbd
systemctl start slurmctld
systemctl enable slurmctld
```
All nodes:

```bash
systemctl start slurmd
systemctl enable slurmd
```
3.10 Verify the cluster

```bash
scontrol show config
sinfo
scontrol show partition
scontrol show node
```
Submit a test job:

```bash
srun -N2 hostname
scontrol show jobs
squeue -a
```
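Besides `srun`, a batch script exercises the same `compute` partition defined in slurm.conf. A minimal sketch (the filename and job name are arbitrary):

```bash
# Write a minimal batch job targeting the 'compute' partition from slurm.conf
cat > /tmp/test.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun hostname
EOF
# Submit and watch it:
#   sbatch /tmp/test.sbatch
#   squeue -a
```

If both compute nodes are healthy, the output file should list compute01 and compute02.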
3.11 Configure QOS

Slurm ships with only the normal QOS. Integrating with OpenSCOW requires three: low, normal, and high:

```bash
sacctmgr show qos
sacctmgr -i create qos name=low
sacctmgr -i create qos name=high
sacctmgr -i modify qos name=normal set Priority=1000
sacctmgr -i modify qos name=high set Priority=2000
sacctmgr -i modify qos name=low set Priority=500
sacctmgr modify user name={username} set qos=low,high,normal defaultQOS=low
```
3.12 Initialize accounts and users (optional)

```bash
sacctmgr add account name=a_admin
sacctmgr add user name=demo_admin account=a_admin partition=compute qos=low,high,normal defaultQOS=low
sacctmgr show ass format="Cluster,Account,User,Partition,QOS"
```
4. Common Errors

| Error message | Fix |
|---|---|
| slurmdbd.conf file should be 600 is 644 | `chmod 600 /etc/slurm/slurmdbd.conf && systemctl restart slurmdbd` |
| slurmdbd.conf not owned by SlurmUser root!=slurm | `chown slurm: /etc/slurm/slurmdbd.conf && systemctl restart slurmdbd` |
5. Deployment Checklist

| Check | Command |
|---|---|
| Hostname | `hostname` |
| hosts resolution | `ping manage01` |
| SELinux | `getenforce` |
| Swap | `free -h` |
| Time sync | `date` |
| Passwordless SSH | `ssh compute01` |
| NFS server status | `systemctl status nfs-server` |
| NFS mount | `df -h \| grep data` |
| Munge status | `systemctl status munge` |
| Slurm controller | `systemctl status slurmctld` |
| Slurm compute | `systemctl status slurmd` |
| Cluster nodes | `sinfo` |
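The checklist can be rolled into a one-shot script to run on each node after deployment. A sketch using only the services and paths named in this guide (`/tmp/cluster-check.sh` is an arbitrary location):

```bash
# Write a health-check script summarizing the deployment checklist
cat > /tmp/cluster-check.sh <<'EOF'
#!/bin/bash
echo "hostname : $(hostname)"
echo "selinux  : $(getenforce 2>/dev/null || echo n/a)"
echo "swap     : $(free -h | awk '/Swap/{print $2}')"
echo "nfs /data: $(df -h /data 2>/dev/null | awk 'NR==2{print $1}')"
for svc in munge slurmctld slurmd nfs-server; do
    printf "%-10s: %s\n" "$svc" "$(systemctl is-active "$svc" 2>/dev/null)"
done
EOF
chmod +x /tmp/cluster-check.sh
```

Note that slurmctld and nfs-server are expected to be active only on manage01; on the other nodes those two lines will read inactive.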