A cluster is a bunch of normal-ish PCs (called nodes) connected to each other, used to perform computation to solve problems in science, engineering and other domains.
It usually consists of a head node, which allows users to log in and schedule jobs. These jobs are put into a queue, and will be run when resources are available.
The filesystem, i.e., your home directory, is often shared on a cluster; this means that you can access the same files from every node in the cluster, and you do not have to manually copy files between machines. It also means that if you run very disk-heavy operations, it could affect the disk speeds and experience of other users.
Importantly, you should never run actual code on the head node, as it may crash it and make everyone else's lives much harder. Always run code via `srun` or `sbatch` (scroll down for usage and examples).
This is very much focused on a slurm-based cluster, but the general comments should be applicable more broadly.
If you use python, you'll probably need to install things, like conda / pip packages. Consider installing conda (https://conda.io/projects/conda/en/latest/user-guide/install/linux.html) and using `conda install` for everything. I personally prefer using `conda` to create environments, and installing packages using `pip`.
To install conda (via the Miniconda installer) and get up and running, you can do the following. (Note: if the head node does not have internet access, you can prefix the download command with `srun`.)
Get the latest download link here: https://www.anaconda.com/products/individual#linux (the commands below use Miniconda, a minimal version of the installer).
# Download the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run the installer, and follow instructions / prompts
bash Miniconda3-latest-Linux-x86_64.sh
# Then, conda should be in your shell
# (you might just need to run `source ~/.bashrc`)
Then, you should be able to use conda normally, e.g. `conda create`, `conda activate`, `conda install`, etc.
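For example, a minimal setup might look like this (the environment name, python version and package are just placeholder choices):
# create and activate a new environment
conda create -n myproject python=3.10
conda activate myproject
# then install packages into it with pip
pip install numpy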
For other pieces of software (e.g. for more traditional HPC research), you can try to compile it from source, or check whether it already exists (e.g. as a `module`).
It is generally easier to do the bulk of one's development locally, and then only move the code to a cluster once it is mature, and you are ready to run experiments. This is not always possible though; for instance, if your code uses private data only accessible from the cluster, or if your local machine is not powerful enough to develop on.
Use Visual Studio Code to develop your projects. It has a particularly handy remote extension that allows you to connect to a remote machine and code on it as if it were your own.
Install the Python extension for autocomplete, documentation and syntax highlighting. Also install the git extension.
Important: Run `killall -9 -u <user> node` every time you close VS Code and are done with it. VS Code creates loads of processes that run on the head node, and it does not kill them automatically; this means that if many people open VS Code often, it can degrade everyone's experience.
- Goal: Be able to run `ssh cluster` instead of `ssh <username>@<ip>` every time.
On your local machine, edit the `.ssh/config` file, and add the following (again with the correct IP and username):
Host cluster
HostName XX.XX.XX.XX
User <username>
Now in VS Code, you should be able to open the command palette, connect to a new host, and use the `cluster` preconfigured host instead of having to always specify the username and IP.
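Other ssh-based tools pick up the same alias, so (sticking with the `cluster` host above; the file name is made up) you can now do things like:
ssh cluster                 # instead of ssh <username>@XX.XX.XX.XX
scp results.txt cluster:~/  # scp / rsync understand the alias too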
Generally, for projects that are reasonably-sized (for instance, preliminary experiments/exploration of a particular research idea), coding everything in one file should be sufficient. A few rules of thumb:
- Instead of copy-pasting identical or similar pieces of code, factor them into functions.
- Commenting your code is helpful, both for other people reading it and for your future self. Comments don't have to be exhaustive: giving the reasons why you do something a particular way, or links to docs explaining what a library's functions do, is usually enough.
- If your single file becomes larger than 500-1000 lines, then it may be worthwhile to investigate splitting your code into multiple files.
For Python, I prefer having `.py` files with all of my code, and running them directly using `python`.
Git is a very helpful tool, see here for more. A quick TL;DR:
- Git allows you to store all of your files, and keep exact track of the different versions of each file.
- Using Github, you can sync your git repositories to the cloud, allowing easy access from multiple locations (e.g. on a cluster and locally) or to multiple people (e.g. everyone working on the same project).
There are a few terms that are helpful to understand:
- Repository: A single project, where files are tracked by git.
- Remote: Basically where the git repository is stored online; this is most often a particular location on Github.
- Commit: Every time you do something, you can commit the changes, which makes git record the current state of your files (so you can get back to exactly this repository state in the future).
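As a rough sketch of the day-to-day flow (the file name and commit message are placeholders):
git add model.py                    # stage the files you changed
git commit -m "Add training loop"   # record a snapshot of the staged files
git push                            # sync your commits to the remote (e.g. Github)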
Two important things to note:
- Github does not allow large files (> 100MB), so do not try to commit and push one of these; it will make you sad.
- A `.gitignore` file is important to have; it allows you to specify which files should not be tracked by git. Use cases are:
  - Ignoring large files to avoid accidentally pushing them to Github
  - Ignoring sensitive files, containing information such as passwords or PII
Generally, you can create a `.gitignore` file and add the following:
See here and here for more about gitignore.
The following file means: ignore all `*.db` files, the single file `/path/to/my/sensitive/file.txt`, and all files in all directories that match the wildcard `*patient_data*csv` (here `*` matches anything and `**/` means that it should look in all directories and not just one).
*.db
/path/to/my/sensitive/file.txt
**/*patient_data*csv
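If you want to double-check whether (and why) a particular file is ignored, `git check-ignore -v` prints the rule that matches it; for instance, with the file above (the file name here is hypothetical):
git check-ignore -v data/patient_data_2021.csv
# prints something like: .gitignore:3:**/*patient_data*csv    data/patient_data_2021.csv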
To clone an external repository, you can run `git clone <url>`, e.g.
git clone https://github.com/facebookresearch/dino
Slurm is a way to manage a cluster (not the only way; PBS is another). The main purpose of these tools is to allow efficient sharing of lots of resources (i.e., computers) among lots of users (e.g. an entire university/department/country). In these types of systems, there are a few important components:
- Head Node / Login Node: Often the machine you log in to, and submit all your jobs from. In general, this node should not be used to run compute-intensive tasks. There are generally between one and a handful of these.
- Compute Node: Compute nodes are the bulk of what makes up a cluster. They are what actually perform the computational tasks, and there are generally lots of them (anywhere from tens to many thousands).
- Partitions: Partitions are ways to split up the set of compute nodes into different subsets for one reason or another. One reason is to separate different hardware, e.g. having a `gpu` partition where all the nodes with GPUs are, and a `highmem` partition containing only nodes with lots of memory. Another reason for partitions could be to control access, e.g. the `laba` partition or `labb` partition can only be accessed by members of the respective lab.
- Queues: A queue is basically a list of all jobs that are currently running, and those that are scheduled to run. For each job, it generally contains the user, partition, duration, etc.
- Users: Clusters are shared resources, and there are different users using the same hardware; each user has their own account, with its associated permissions.
sinfo
If you run `sinfo`, you can see how many nodes are available and how many are being used, e.g.:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 3-00:00:00 1 down* mscluster58
batch* up 3-00:00:00 3 drain mscluster[12,18,21]
batch* up 3-00:00:00 32 alloc mscluster[11,13-17,19-20,22-45]
batch* up 3-00:00:00 12 idle mscluster[46-57]
biggpu up 3-00:00:00 3 idle mscluster[10,59-60]
stampede up 3-00:00:00 40 idle mscluster[61-100]
Each of the partitions lists the number of nodes, as well as how many have a particular status. The statuses are:
- `down`: Unavailable
- `drain`: Similar to `down`; you cannot use these nodes
- `alloc`: Someone's job is running on these, so they are busy
- `idle`: These nodes are free and can be used
Generally, if the partition you want to run on has lots of `idle` nodes, waiting time should be quite low; conversely, if all nodes are `alloc` or `down`, you will likely have to wait.
squeue
To see all of the queued and running jobs, you can run `squeue`:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
137849 batch job1 abc PD 0:00 4 (launch failed requeued held)
137862 batch job2 abc PD 0:00 1 (launch failed requeued held)
137863 batch job3 abc PD 0:00 1 (launch failed requeued held)
137864 batch job4 abc PD 0:00 1 (launch failed requeued held)
138407 batch job5 def R 9:51:40 1 mscluster11
On large clusters, the output may be very long and somewhat hard to parse. All you need to know is that jobs with a time of 0:00 are queued, whereas jobs with a nonzero time are currently running; you can also see which nodes these jobs are running on.
If you care only about your own jobs, you can run one of:
squeue --me
squeue -u <username>
squeue | grep <username>
srun
`srun` allows you to run single commands on a particular partition, for instance: `srun -p stampede hostname` or `srun -p stampede python my_model.py`.
I often use `srun` for interactive jobs, meaning that you are given shell access to a compute node, on which you can interactively run commands. This is very useful for debugging or interactively developing something without using the head node.
To run an interactive job, you can run:
srun -N 1 -p <partition> --pty bash
Then you should be on a compute node, and you can run your desired commands.
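A quick sanity check that you are actually on a compute node (the partition and node name below are just examples, taken from the sinfo output earlier):
srun -N 1 -p batch --pty bash
hostname   # should print a compute node's name, e.g. mscluster11, not the head node's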
sbatch
Often, after developing something, we want to run large-scale jobs, or many experiments in parallel. `srun` does not scale as well for these use cases, but `sbatch` does. It effectively allows you to submit a `bash` script that will be run on a compute node.
- The idea here is that you use a job script that specifies what you want to run and how it should be configured (this example is from the Wits HPC course page). In all the examples below, change the partition name and home directory / username.
- Save this as `myfile.batch` (remember to change the username field below):
#!/bin/bash
# specify a partition
#SBATCH -p stampede
# specify number of nodes
#SBATCH -N 1
# specify number of cores
##SBATCH -n 2
# specify the wall clock time limit for the job hh:mm:ss
#SBATCH -t 00:10:00
# specify the job name
#SBATCH -J test-job
# specify the filename to be used for writing output
# NOTE: You must replace the <username> with your own account name!!
#SBATCH -o /home/<username>/my_output_file_slurm.%N.%j.out
# specify the filename for stderr
#SBATCH -e /home/<username>/my_error_file_slurm.%N.%j.err

echo ------------------------------------------------------
echo -n 'Job is running on node ' $SLURM_JOB_NODELIST
echo ------------------------------------------------------
echo SLURM: sbatch is running on $SLURM_SUBMIT_HOST
echo SLURM: job ID is $SLURM_JOB_ID
echo SLURM: submit directory is $SLURM_SUBMIT_DIR
echo SLURM: number of nodes allocated is $SLURM_JOB_NUM_NODES
echo SLURM: number of cores is $SLURM_NTASKS
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------

# From here you can run anything, any normal shell script
# e.g.
# cd ~/mycode
# python mymodel.py
# but for simplicity, use a simple bash command
cd ~
echo "Hello, we are doing a job now"
- And then you can schedule that using `sbatch myfile.batch`
- The output and error files will be in `/home/<username>/*.out` and `/home/<username>/*.err`
- An even simpler one would be (save as e.g. `myfile.batch`); a short usage sketch follows this list:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/home/YOURUSERNAMEHERE/result.txt
#SBATCH --ntasks=1
# increase the time here if you need more than 10 minutes to run your job.
#SBATCH --time=10:00
#SBATCH --partition=batch

# TODO run any commands here.
/bin/hostname
sleep 60
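Putting it together, a typical submit-and-check loop might look like this (the job ID is made up, and the output path matches the simpler script above):
sbatch myfile.batch   # prints something like: Submitted batch job 138500
squeue --me           # check whether the job is pending (PD) or running (R)
cat ~/result.txt      # once it finishes, inspect the output file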
Some clusters have different modules, which are basically ways to manage different versions of particular pieces of software. This is a very basic introduction, but see here for more.
The following commands are useful:
- `module avail`: Which modules are available to load
- `module list`: Which modules are currently loaded
- `module load <name>`: Load a particular module.
- `module unload <name>`: Unload a particular module.
Generally, if you run something like `python` and the error is `command not found`, then you could check if a module called `python` (or a versioned one, e.g. `python/3.11`) exists, and then load it.
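For example, a plausible session might be (the exact module name is hypothetical; use whatever `module avail` actually lists on your cluster):
module avail              # see which modules are installed
module load python/3.11   # hypothetical name; pick one from the avail list
python --version          # the command should now exist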
`tmux` is similar to `screen`, and is a way to run long commands over ssh without fearing that your connection will drop and kill the job.
See here for more details, and here is a great beginner tutorial.
Generally, the workflow is:
ssh machine
Then, you can open a `tmux` session:
tmux
You should see a green bar at the bottom indicating that you are in a `tmux` session.
Once you are in a session, you can do the following:
Commands are prefixed with `Ctrl-B` and then some other key. The way to do this is to hold the control key, press the b key, and then let go of them before pressing the next key.
- `C-b %` and `C-b "` split panes (e.g. control b and then either `%` or `"` (shift '))
- `C-b <arrow>` (arrow being one of the 4 arrow keys) moves around between panes
- `C-b d` detaches, i.e. hides the session (it is still open though)
- `tmux a` (when you are not in tmux) opens the most recent tmux session
There are a few use cases of tmux:
- Long running commands. Suppose you have an interactive `srun` job and it takes 10+ hours to run. If you do this directly in ssh, then as soon as your ssh connection drops (e.g. because your laptop powers down or you lose connection), the job will terminate. If you run the `srun` inside a `tmux` session, this does not happen (see the sketch after this list).
- Having multiple terminals in the same ssh session. You can split the terminal into different panes, so you can do different things without having to ssh multiple times.
- The tmux session persists, so if you have a laptop and a desktop, both of these can open the same remote tmux session.
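Putting the first use case together, a long job might look like this (the partition and script names are placeholders):
tmux                               # start a new session
srun -p batch python my_model.py   # kick off the long job inside tmux
# press C-b d to detach; it is now safe to disconnect or close your laptop
tmux a                             # later, reattach and check on the job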
`rsync` is like `cp`, but can transfer files between remote machines.
To copy `my_code_directory` from your local machine to the cluster, you can run the following (on your local machine):
rsync -r my_code_directory <username>@XX.XX.XX.XX:~/
Then the code will be found on the cluster in the directory `~/my_code_directory`.
To get the results back, you can perform the reverse of the operation above (again, run this on your local machine):
rsync -r <username>@XX.XX.XX.XX:~/my_code_directory/results .
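A few rsync flags are worth knowing, and rsync also understands the ssh alias from earlier (assuming you set up the `cluster` host in `.ssh/config`):
# -a preserves permissions and timestamps, -v is verbose, -z compresses in transit
rsync -avz my_code_directory cluster:~/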
Vim is a way to edit files in a terminal, and can be very powerful once you are used to it. See here for more. Or, run `vimtutor` in your terminal to learn more in an interactive and fun way!
The TL;DR is that if you press escape, you can type `:` and different letters to do different things. `:w` (and enter) saves the file, `:q` exits vim (`:wq` saves and exits, whereas `:q!` exits without saving).
There are also helpful motions, such as (after pressing escape): `gg` goes to the top of the file and `G` goes to the bottom. When you actually want to write text, press `i` and start typing. Remember to press escape before trying to use any of the `:` commands.
If you are halfway through typing a command or file path, pressing tab often autocompletes it; if there are multiple possible completions, you may need to press tab multiple times to see them.
If you press `ctrl + r`, you can type in a few letters of a command, and the most recent command matching that will be shown. Pressing enter runs it, and pressing `ctrl + r` again searches further back in time.
In Linux, commands can be chained using a `|` character (pronounced "pipe"). This is used as follows:
command1 | command2
And it basically runs command1, and passes its output as input to command2.
Often, we do this to filter output, e.g. `squeue | grep gpu` runs `squeue`, and filters it to only return the lines that contain the three characters "gpu" (the grep command selects only lines that match its argument, "gpu" in this case).
We often use this with `head` (return the first 10 lines) and `tail` (return the last 10 lines), or `less` (which allows you to nicely scroll and search through text).
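For example, combining these with the slurm commands from earlier:
squeue | head -n 5   # show only the first 5 lines of the queue
sinfo | less         # scroll through the partition list; press q to quit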
If you are unsure how to use a particular command, e.g. `srun`, you can run
srun --help
This is quite long, so I like to pipe it to less, i.e., run
srun --help | less
which puts you in a scrollable mode. In `less`, you can also search, by typing `/` followed by your search term.
As an alternative to `--help`, you can run `man <command>` (where `man` is short for manual), e.g. `man srun`.
To go further, consider reading the documentation of the particular program/cluster management system you use. It also helps to become familiar with the linux terminal/bash shell; here is a guide with exercises at the end. Here is a beginner's intro to Python.
Otherwise, the best way to improve is to practice and to get more experience.