A quick reference for accessing NYU's High Performance Computing Prince cluster.
The official wiki is here. This is an unofficial cheat sheet, written as a reminder for MARL, with a focus on Python.
There are four file systems: /home, /scratch, /BEEGFS, and /archive.
/BEEGFS and /scratch are file systems mounted on Prince and connected to the compute nodes, where files can be uploaded faster. Note that their contents are periodically flushed.
[NYUNetID@log-0 ~]$ cd /scratch/NYUNetID
[NYUNetID@log-0 ~]$ pwd
/scratch/NYUNetID
Use /home for environments and code. Use /BEEGFS and /scratch for storing data and program output during computation.
The compute nodes can't see /archive.
We also have access to /archive/m/marl, which is for storing data.
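The storage guidance above can be captured in a small Python helper. This is only a sketch: the purpose labels and both function names are hypothetical, and the per-user layout under /home is an assumption modeled on the /scratch/NYUNetID example.

```python
from pathlib import PurePosixPath

# Purpose -> file system, following the guidance above.
# The purpose labels are hypothetical names chosen for this sketch.
FILESYSTEM_FOR = {
    "code": "/home",
    "environments": "/home",
    "job_data": "/scratch",
    "job_output": "/scratch",
}


def storage_dir(purpose, netid):
    """Return the per-user directory suggested for a given purpose.

    Assumes each file system has one directory per NetID, as in
    /scratch/NYUNetID.
    """
    if purpose not in FILESYSTEM_FOR:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return PurePosixPath(FILESYSTEM_FOR[purpose]) / netid


def visible_from_compute_node(path):
    """Return False for paths under /archive, which compute nodes can't see."""
    return PurePosixPath(path).parts[:2] != ("/", "archive")


print(storage_dir("job_output", "NYUNetID"))          # /scratch/NYUNetID
print(visible_from_compute_node("/archive/m/marl"))   # False
```

A job script could call `visible_from_compute_node` on its input paths before submission to catch data that still lives only in /archive.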
The module command lets you load and manage multiple versions and configurations of software packages.
To see available package environments:
module avail
To load a module:
module load [package name]
For example, to use TensorFlow with GPU support:
module load cudnn/8.0v6.0
module load cuda/8.0.44
module load tensorflow/python3.6/1.3.0
To check what is currently loaded:
module list
To remove all packages:
module purge
To get helpful information about the package:
module show torch/gnu/20170504
This will print something like:
--------------------------------------------------------------------------------------------------------------------------------------------------
/share/apps/modulefiles/torch/gnu/20170504.lua:
--------------------------------------------------------------------------------------------------------------------------------------------------
whatis("Torch: a scientific computing framework with wide support for machine learning algorithms that puts GPUs first")
whatis("Name: torch version: 20170504 compilers: gnu")
load("cmake/intel/3.7.1")
load("cuda/8.0.44")
load("cudnn/8.0v5.1")
load("magma/intel/2.2.0")
...
The load(...) lines are the dependencies that are also loaded when you load the package.
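If you want the dependency list programmatically, the load(...) lines can be pulled out of captured `module show` text with a regular expression. A minimal sketch, assuming the output looks like the sample above; the function name `module_dependencies` is hypothetical, and capturing the text from the `module` command is left to the reader.

```python
import re
from typing import List

# Matches lines such as: load("cuda/8.0.44")
LOAD_LINE = re.compile(r'^\s*load\("([^"]+)"\)\s*$')


def module_dependencies(module_show_output: str) -> List[str]:
    """Return the module names found in load("...") lines, in order."""
    deps = []
    for line in module_show_output.splitlines():
        match = LOAD_LINE.match(line)
        if match:
            deps.append(match.group(1))
    return deps


sample = '''\
whatis("Name: torch version: 20170504 compilers: gnu")
load("cmake/intel/3.7.1")
load("cuda/8.0.44")
load("cudnn/8.0v5.1")
load("magma/intel/2.2.0")
'''
print(module_dependencies(sample))
# ['cmake/intel/3.7.1', 'cuda/8.0.44', 'cudnn/8.0v5.1', 'magma/intel/2.2.0']
```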
You can submit batch jobs on Prince to schedule work. This requires writing custom bash scripts. Batch jobs are best for longer runs; interactive mode is better for short jobs and troubleshooting.
To run in interactive mode:
[NYUNetID@log-0 ~]$ srun --pty /bin/bash
This will run the default mode: a single CPU core and 2GB memory for 1 hour.
To request more CPUs:
[NYUNetID@log-0 ~]$ srun -n4 -t2:00:00 --mem=4000 --pty /bin/bash
[NYUNetID@c26-16 ~]$
That will request 4 tasks (CPU cores, not compute nodes) for 2 hours with 4000 MB (about 4 GB) of memory.
To exit a request:
[NYUNetID@c26-16 ~]$ exit
[NYUNetID@log-0 ~]$
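The interactive requests above differ only in their resource flags, so they can be assembled from parameters. The sketch below is illustrative: `srun_command` is a hypothetical helper, and it uses the long option names (`--ntasks`, `--time`, `--mem`) that correspond to the short `-n` and `-t` forms shown above.

```python
import shlex
from typing import List, Optional


def srun_command(ntasks: int = 1,
                 time: Optional[str] = None,
                 mem_mb: Optional[int] = None,
                 gpus: int = 0,
                 shell: str = "/bin/bash") -> List[str]:
    """Build the argument list for an interactive srun session."""
    cmd = ["srun"]
    if ntasks != 1:
        cmd.append(f"--ntasks={ntasks}")
    if time:
        cmd.append(f"--time={time}")
    if mem_mb:
        cmd.append(f"--mem={mem_mb}")      # megabytes by default
    if gpus:
        cmd.append(f"--gres=gpu:{gpus}")
    cmd += ["--pty", shell]
    return cmd


args = srun_command(ntasks=4, time="2:00:00", mem_mb=4000)
print(" ".join(shlex.quote(a) for a in args))
# srun --ntasks=4 --time=2:00:00 --mem=4000 --pty /bin/bash
```

The function only builds the command; paste the printed line into a login-node terminal to start the session.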
To request a GPU:
[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash
[NYUNetID@gpu-25 ~]$ nvidia-smi
Mon Oct 23 17:49:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:12:00.0 Off | 0 |
| N/A 37C P8 29W / 149W | 0MiB / 11439MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
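From Python, the same check can be made by asking nvidia-smi for machine-readable output instead of the table above. A minimal sketch: the helper names `parse_gpu_csv` and `visible_gpus` are hypothetical, and the code assumes nvidia-smi is on the PATH of the GPU node (it returns an empty list otherwise).

```python
import shutil
import subprocess
from typing import List, Tuple

# Ask for one CSV line per GPU, e.g. "Tesla K80, 11439 MiB".
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]


def parse_gpu_csv(text: str) -> List[Tuple[str, str]]:
    """Turn 'name, memory' lines into (name, memory) tuples."""
    gpus = []
    for line in text.splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus.append((name, memory))
    return gpus


def visible_gpus() -> List[Tuple[str, str]]:
    """Return the GPUs nvidia-smi reports, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


print(parse_gpu_csv("Tesla K80, 11439 MiB\n"))  # [('Tesla K80', '11439 MiB')]
```

Calling `visible_gpus()` at the start of a training script gives an early, readable failure when a job lands on a node without the GPU it expected.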
You can write a script that will be executed when the resources you requested become available.
A simple CPU demo:
#!/bin/bash
## 1) Job settings
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=5:00:00
#SBATCH --mem=2GB
#SBATCH --job-name=CPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=<your-email-address>
#SBATCH --output=slurm_%j.out
## 2) Everything from here on is going to run:
cd /scratch/NYUNetID/demos
python demo.py
Request GPU:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --time=10:00:00
#SBATCH --mem=3GB
#SBATCH --job-name=GPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=<your-email-address>
#SBATCH --output=slurm_%j.out
cd /scratch/NYUNetID/trainSomething
source activate ML
python train.py
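Because the CPU and GPU scripts above share most of their header, they can be generated from a few parameters. This is a sketch rather than an official template: `sbatch_script` is a hypothetical helper, its defaults copy the CPU demo, and the mail options are left out so you can add your own address.

```python
from typing import Iterable


def sbatch_script(job_name: str,
                  commands: Iterable[str],
                  time: str = "5:00:00",
                  mem: str = "2GB",
                  cpus: int = 1,
                  gpus: int = 0,
                  output: str = "slurm_%j.out") -> str:
    """Return the text of a single-node batch script."""
    lines = [
        "#!/bin/bash",  # must be the first line of the script
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --cpus-per-task={cpus}",
    ]
    if gpus:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += [
        f"#SBATCH --time={time}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output={output}",
        "",
    ]
    lines += list(commands)
    return "\n".join(lines) + "\n"


script = sbatch_script(
    "GPUDemo",
    ["cd /scratch/NYUNetID/trainSomething", "source activate ML", "python train.py"],
    time="10:00:00", mem="3GB", gpus=4,
)
print(script)
```

Write the returned text to a file such as myscript.s before submitting it.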
Submit your job with:
sbatch myscript.s
Monitor the job:
squeue -u $USER
More info here
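Submission and monitoring can also be driven from Python. On success sbatch prints a line of the form "Submitted batch job <id>", which the sketch below parses. The helper names are hypothetical, the job number 123456 is made up for illustration, and `submit` only works where sbatch is installed (that is, on the cluster).

```python
import re
import subprocess

JOB_ID = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the numeric job ID from sbatch's confirmation line."""
    match = JOB_ID.search(sbatch_stdout)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script_path: str) -> int:
    """Run sbatch on a script and return the job ID (cluster only)."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


print(parse_job_id("Submitted batch job 123456\n"))  # 123456
```

The returned ID can then be passed to `squeue -j <id>` to watch just that job.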
To copy data between your workstation and the NYU HPC clusters, you must set up and start an SSH tunnel.
What is a tunnel?
"A tunnel is a mechanism used to ship a foreign protocol across a network that normally wouldn't support it."1
- In your home directory on your local computer, create a folder called ~/.ssh if you don't already have one:
mkdir ~/.ssh
- Set the permissions on that folder:
chmod 700 ~/.ssh
- Inside that folder, create a new file called config:
touch ~/.ssh/config
- Open that file in any text editor and add this:
# first we create the tunnel, with instructions to pass incoming
# packets on ports 8025 and 8026 through it and to specific
# locations
Host hpcgwtunnel
HostName gw.hpc.nyu.edu
ForwardX11 no
LocalForward 8025 dumbo.hpc.nyu.edu:22
LocalForward 8026 prince.hpc.nyu.edu:22
User NetID
# next we create an alias for incoming packets on the port. The
# alias corresponds to where the tunnel forwards these packets
Host dumbo
HostName localhost
Port 8025
ForwardX11 yes
User NetID
Host prince
HostName localhost
Port 8026
ForwardX11 yes
User NetID
Be sure to replace NetID with your NYU NetID.
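To avoid editing three User lines by hand, the configuration above can be rendered from a template. A minimal sketch: `render_ssh_config` is a hypothetical helper and "abc123" is a made-up NetID. The indentation under each Host line is only for readability.

```python
CONFIG_TEMPLATE = """\
Host hpcgwtunnel
  HostName gw.hpc.nyu.edu
  ForwardX11 no
  LocalForward 8025 dumbo.hpc.nyu.edu:22
  LocalForward 8026 prince.hpc.nyu.edu:22
  User {netid}

Host dumbo
  HostName localhost
  Port 8025
  ForwardX11 yes
  User {netid}

Host prince
  HostName localhost
  Port 8026
  ForwardX11 yes
  User {netid}
"""


def render_ssh_config(netid: str) -> str:
    """Fill the NetID into every User line of the tunnel configuration."""
    if not netid or netid == "NetID":
        raise ValueError("pass your own NYU NetID, not the placeholder")
    return CONFIG_TEMPLATE.format(netid=netid)


print(render_ssh_config("abc123"))  # "abc123" is a made-up NetID
```

The function only returns text, so nothing in an existing ~/.ssh/config is overwritten; append the output yourself once you have checked it.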
Running Jupyter Notebook on NYU HPC in 3 Clicks
To copy data between your workstation and the NYU HPC clusters, you must set up and start an SSH tunnel. (See previous step)
- Create a tunnel
ssh hpcgwtunnel
Once executed you'll see something like this:
Last login: Wed Nov 8 12:15:48 2017 from 74.65.201.238
cv965@hpc-bastion1~>$
This will use the settings in ~/.ssh/config to create a tunnel. The tunnel must stay open while you transfer files, so leave this terminal tab open and continue in a new tab.
- Transfer files
Using rsync is preferred. See the rsync wiki for more details.
rsync [options] source [source] destination
- -a "archive" mode: permissions and timestamps of the source are replicated at the destination.
- -v "verbose".
- -n "dry run": don't actually do anything, just indicate what would be done.
- -C follow CVS ignore conventions.
- -r "recursive".
- -u "update": skip files that are newer at the destination.
From your workstation to Prince:
- A File:
scp /Users/local/data.txt NYUNetID@prince:/scratch/NYUNetID/path/
- A Folder:
scp -r /Users/local/path NYUNetID@prince:/scratch/NYUNetID/path/
From Prince to your workstation:
- A File:
scp NYUNetID@prince:/scratch/NYUNetID/path/data.txt /Users/local/path/
- A Folder:
scp -r NYUNetID@prince:/scratch/NYUNetID/path /Users/local/path/
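The transfer commands above can be assembled in Python, which makes it easy to do a dry run first. This is a sketch under assumptions: `rsync_command` and `on_prince` are hypothetical helpers, and the host name `prince` relies on the alias defined in ~/.ssh/config, so the tunnel must already be open.

```python
from typing import List


def on_prince(path: str, host: str = "prince") -> str:
    """Prefix a cluster path with the SSH alias from ~/.ssh/config."""
    return f"{host}:{path}"


def rsync_command(source: str, destination: str,
                  dry_run: bool = False, update: bool = False) -> List[str]:
    """Build an rsync argument list using archive and verbose modes."""
    flags = "-av"
    if dry_run:
        flags += "n"   # only report what would be transferred
    if update:
        flags += "u"   # skip files that are newer at the destination
    return ["rsync", flags, source, destination]


upload = rsync_command("/Users/local/path/",
                       on_prince("/scratch/NYUNetID/path/"),
                       dry_run=True)
print(" ".join(upload))
# rsync -avn /Users/local/path/ prince:/scratch/NYUNetID/path/
```

A trailing slash on the source tells rsync to copy the directory's contents rather than the directory itself. Swap the two paths to download instead of upload.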
Create a ~/.screenrc file and append this gist.