
This is an essential guide to using Slurm. An overview and a detailed introduction are available online.

Submission Hosts

There are several hosts, horse01, ml46, ml47, and ml48, from which users can submit jobs to the cluster. Only horse01 is equipped with GPUs, as the worker nodes are, so GPU programs can be validated there before submission. Never run heavy processing on these submission hosts.

Worker Nodes

  • Direct ssh login is rejected.
  • ssh logins are accepted for monitoring/debugging only while one of the user's jobs is running on the node (see the example after this list).
  • Interactive shells must be scheduled through Slurm.
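
For example, assuming one of your jobs is already running on a node, you can look up that node and ssh in briefly for monitoring; the node name below is a placeholder:

squeue -u $USER -o "%i %P %j %T %N"  # find the node(s) where your jobs are running
ssh NODE_NAME                        # accepted only while your job is running on that node
nvidia-smi                           # e.g. check GPU utilization, then exit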

Partitions/Queues

  • [debug]: high-priority queue for jobs that should finish soon
  • [batch | long][_lp]: queues for short- or long-running jobs; the _lp suffix stands for low priority, to avoid preemption
  • [scavenger]: no resource limit, but jobs run at a priority that allows them to be killed by, or to preempt, other running jobs
  • [gpu | gpu_scav]: GPU counterparts of the long and scavenger partitions above
  • Priorities
    • CPU partitions precedence: debug > batch = long > batch_lp = long_lp > scavenger
    • GPU partitions precedence: gpu > gpu_scav
  • No automatic checkpointing if a job is killed due to preemption
    • check the SLURM_RESTART_COUNT environment variable for the number of restarts (see the sketch after this list)
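
A minimal sketch of handling such restarts in a job script, assuming the program itself writes and reloads a checkpoint file (the checkpoint path and training command below are hypothetical):

#!/bin/bash
#SBATCH -p scavenger
#SBATCH --requeue                      # allow Slurm to requeue the job after preemption
CKPT=checkpoint.pt                     # hypothetical checkpoint file written by the program
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ] && [ -f "$CKPT" ]; then
    echo "Restart #$SLURM_RESTART_COUNT: resuming from $CKPT"
    python train.py --resume "$CKPT"   # hypothetical resume command
else
    python train.py --checkpoint "$CKPT"
fi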

Usage

sbatch submits a job to a partition in the following format.

sbatch [OPTIONS(0)...] [ : [OPTIONS(n)...]] script(0) [args(0)...]

-p, --partition [batch | long | ...]: the partition to submit the job to, defaults to `batch`
-n, --ntasks: the maximum number of tasks to allocate for the job, defaults to 1
-N, --nodes=<min[-max]>: the number of nodes assigned to the job and set to SLURM_JOB_NODES, defaults to 1-1
-c, --cpus-per-task=<n>: the number of CPUs assigned to each task on the same node, defaults to 1
-t, --time=<[days-]hours[:minutes[:seconds]]>: the running time limit, defaults to the partition's limit
--mem=<size[units]>: CPU memory per node to allocate
--mem-per-cpu=<size[units]>: memory per CPU to allocate
-J, --job-name=<jobname>: the name of the job, defaults to the script filename
-o, --output=<filename pattern>: file to which stdout will be written, defaults to slurm-%j.out
-e, --error=<filename pattern>: file to which stderr will be written, defaults to the same file as stdout
-C, --constraint=<list>: node features such as IB for InfiniBand and haswell for Intel Haswell processors, defaults to none
--mail-type=<[NONE, BEGIN, END, FAIL, REQUEUE, ALL]>: job events that trigger an email notification
--mail-user=<[email protected]>: the email address to send notifications to
-w, --nodelist=<node name list>: specific nodes to allocate to the job
-x, --exclude=<node name list>: explicitly exclude certain nodes from the resources granted to the job

Note that the --mem option only specifies CPU memory. To request a minimum amount of GPU memory, specify one of the available GPU models with the -C option instead. Refer to the Node feature tags section of the ITS introduction for the available GPU models.
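
For instance, a GPU submission combining several of these options might look like the following sketch; the script name train.sbatch is a placeholder and GPUMODEL_TITANX stands for one of the available GPU model tags (as used in the interactive example later in this guide), while --gres=gpu:1 requests one GPU as described under Practices below:

sbatch -p gpu -n 1 -c 4 --mem=16G -t 2-00:00:00 \
       --gres=gpu:1 -C GPUMODEL_TITANX \
       -J demo -o demo-%j.out train.sbatch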

sacct displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.

sacct -o reqmem,maxrss,averss,elapsed -j JOBID

scontrol to view and modify Slurm configuration and state.

scontrol update job JOBID MinMemoryNode=NEWMEM TimeLimit=NEWTIME

Practices

Wrap your job in a bash script for sbatch to submit. If a GPU is required, be sure to specify --gres=gpu:n with n >= 1, either on the command line or in the script. For instance,

#!/bin/bash
#SBATCH --partition gpu     # the job will be submitted to the gpu partition
#SBATCH --cpus-per-task=4   # 4 CPUs per task
#SBATCH --mem=16G           # 16GB of CPU memory
#SBATCH --gres=gpu:1        # 1 GPU is allocated for the job

#SBATCH --job-name=ml_job   # the name of the job. Default is the name of the script file
#SBATCH --output=ml-%j.out  # file to which stdout will be written (%j is replaced with the job id). Default slurm-%j.out
#SBATCH --error=ml-%j.err   # file to which stderr will be written (%j is replaced with the job id). Default  same as output

#SBATCH --mail-type=ALL # send email for all job events (begin, end, fail, requeue, etc.)
#SBATCH [email protected]  # email address to send notifications to

Submit to multiple partitions, including scavenger, so that the job can restart with higher priority if it is preempted.

sbatch -p long,scavenger -t 7-00:00:00 --mem=16G longjob.sbatch

Submit an array of similar jobs parameterized by SLURM_ARRAY_TASK_ID; the optional %2 suffix limits the number of array tasks running simultaneously to two.

sbatch --array=0-6[%2] -n1 -c1 trainsvms.sbatch reuters_train.dta  reuters_model "0.001 0.01 0.1 1 10 100 1000"

where trainsvms.sbatch is specified as follows:

#!/bin/bash
training_set_file=$1
model_file=$2
C_values=($3) #list of C values to train svms for
C=${C_values[$SLURM_ARRAY_TASK_ID]}  # use the C value corresponding to the index of this job in the job array
train_svm -o ${model_file}_${C}  -C ${C}  ${training_set_file} 

Schedule an interactive shell for a short time.

srun --pty -p gpu --gres=gpu:1 -C GPUMODEL_TITANX --mem=16G  /bin/bash -l

Monitoring Jobs

squeue to view information about jobs located in the Slurm scheduling queue.

squeue         # list all the running and pending jobs
squeue -l      # same as above but give more information
squeue -u alex # list all the jobs belonging to user alex
squeue -p gpu  # list all jobs submitted to partition gpu
squeue -P      # list all the running and pending jobs sorted in order of priority
squeue -j 3343 # give information for job id 3343

squeue --array -j 3424 # show information about all jobs in the array with job ID 3424
squeue -j 3424_3       # show information about task 3 of job array 3424

sacct provides more detailed information about running, pending, and finished jobs.

sacct --format JobId,MaxRSS,UserCPU,SystemCPU,TotalCPU,State
scontrol show job 2232 -dd

Canceling Jobs

scancel to signal jobs or job steps that are under the control of Slurm.

scancel 23343 22345 32342
scancel 3424       # if 3424 is a job array, this cancels all the jobs in the array
scancel 3424_[1-5] # cancel jobs 1-5 in the job array 3424
scancel 3424_3     # cancel job 3 in the job array 3424

Modifying Jobs

scontrol monitors and modifies jobs after they have been submitted.

scontrol hold 2332 3343 45643                # hold jobs (i.e. keep these jobs pending, do not schedule them)
scontrol release 2332 3343 45643             # release the hold
scontrol update job 2332 partition=batch     # change the partition of job 2332 to batch
scontrol update job 2332 MinMemoryNode=2G    # change the requested memory for a pending job
scontrol update job 2332 TimeLimit=00:20:00  # change the requested time 

scontrol hold 3424                           # hold all jobs in the job array
scontrol update job 3424_4 TimeLimit=1:30:00 # update the time limit of job 4 in the job array 

scontrol update job 2332 nice=100            # decrease the priority of job 2332 by 100.
scontrol requeue $JOBID                      # cancel the job and put it back in the queue to restart later

Get information about the nodes and partitions with sinfo and scontrol.

sinfo                    # information about partitions
sinfo -l                 # more detailed information about partitions
sinfo -N                 # node-oriented information
sinfo -N -l              # more detailed information about nodes, including any available features
scontrol show nodes      # get information about nodes
scontrol show partitions # get information about partitions
scontrol show config     # show the configuration parameters of SLURM