Instructions for using clusters at Virginia Tech
Wiki page: https://mlp.ece.vt.edu/wiki/doku.php/computing
- Download the Anaconda x86 installer and install it. If you use Python 2, choose Anaconda2; Anaconda3 is for Python 3. (You can choose not to append the line to .bashrc.)
- Download the cudnn package (cuDNN v6.0 Library for Linux); you will need to sign up for an account.
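For example, a minimal sketch of unpacking cudnn into the directory the init script below expects (the archive name assumes cuDNN v6.0 for CUDA 8.0; adjust it to your download):
mkdir -p ~/install
tar xzvf cudnn-8.0-linux-x64-v6.0.tgz -C ~/install
# this creates ~/install/cuda/include and ~/install/cuda/lib64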
- Set up initialization bash files (following is Yuliang's setup)
- /home/ylzou/.bashrc:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
source /home/ylzou/install/init.sh
- /home/ylzou/install/init.sh:
#!/bin/bash
HOST_NAME=`hostname | sed -e 's/\..*$//'`
# Python3.6
ANACONDA_BIN=/home/ylzou/anaconda3/bin
ANACONDA_LIB=/home/ylzou/anaconda3/lib
NVCC=/usr/local/cuda-8.0/bin
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
CMAKE_PATH=/home/ylzou/anaconda3/bin/cmake
# Add paths for python
export PATH=$CMAKE_PATH:$NVCC:$ANACONDA_BIN:$PATH
# cudnn-8.0-v6.0
export CUDA_ROOT=/home/ylzou/install/cuda
export LD_LIBRARY_PATH=/home/ylzou/install/cuda/lib64:$LD_LIBRARY_PATH
# Add gpu_lock package
export PATH=$PATH:/srv/share/gpu_lock
export PYTHONPATH=$PYTHONPATH:/srv/share/gpu_lock
(Remember to re-connect to the cluster after you set up the init scripts.)
- (Optional) Set up remote editing with PyCharm, following the instructions here.
NOTE: I use Python3.6, cuda-8.0, and cudnn-8.0-6.0. You should modify the files above according to your choice.
- Check https://pytorch.org/ and choose the conda installation command. (Yuliang's choice: conda install pytorch torchvision cuda80 -c soumith)
- Check that you can run the following snippet in Python:
import torch
import torchvision
A = torch.Tensor(1)              # create a 1-element tensor
A.cuda()                         # move it to the GPU; fails if CUDA is not set up correctly
torch.backends.cudnn.version()   # should return the cudnn version
Reference: https://discuss.pytorch.org/t/error-when-using-cudnn/577/3
Follow the instructions here
- Download a distribution archive of bazel (bazel-[VERSION]-dist.zip), and unzip it in an empty folder you create.
- Run
bash ./compile.sh
This will create a bazel binary in output/bazel. Note that we cannot move it to the /usr/local/bin directory, since an old version already lives there.
- Clone the TensorFlow repo:
git clone https://github.com/tensorflow/tensorflow
- Modify the configure file in the TensorFlow folder: comment out line166~line174, so that the script does not check the version of the default bazel, which is too old to use.
- Run
./configure
Use the default settings most of the time, except for the cuda- and cudnn-related ones, for which you should check your initialization bash file for reference. (Yuliang's settings: /home/ylzou/install/init.sh)
- Build a pip package
[path/to/your/bazel] build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
- Get the .whl file by running
bazel-bin/tensorflow/tools/pip_package/build_pip_package [/directory/to/store/this/file]
- Install the pip package
pip install [/directory/to/your/whl/file]
- Leave the TensorFlow directory, and test if you can import the package
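A quick import test might look like this (a minimal sketch; run it outside the TensorFlow source tree so Python does not pick up the local package):
cd ~
python -c "import tensorflow as tf; print(tf.__version__)"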
(Only for Python 3.6; other versions should be able to install via Anaconda.)
- Clone the repo:
git clone https://github.com/Itseez/opencv.git
- Check out and patch:
cd opencv
git checkout 3.1.0 && git format-patch -1 10896129b39655e19e4e7c529153cb5c2191a1db && git am < 0001-GraphCut-deprecated-in-CUDA-7.5-and-removed-in-8.0.patch
- Setup build directory
mkdir build
cd build
cmake -DBUILD_TIFF=ON -DBUILD_opencv_java=OFF -DWITH_CUDA=OFF -DENABLE_AVX=ON -DWITH_OPENGL=ON -DWITH_OPENCL=ON -DWITH_IPP=ON -DWITH_TBB=ON -DWITH_EIGEN=ON -DWITH_V4L=ON -DWITH_VTK=OFF -DBUILD_TESTS=OFF -DBUILD_PERF_TESTS=OFF -DCMAKE_BUILD_TYPE=RELEASE -DBUILD_opencv_python2=OFF -DCMAKE_INSTALL_PREFIX=$(python3 -c "import sys; print(sys.prefix)") -DPYTHON3_EXECUTABLE=$(which python3) -DPYTHON3_INCLUDE_DIR=$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") -DPYTHON3_PACKAGES_PATH=$(python3 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") ..
- Make
make -j32
make install
- Check:
python
>>> import cv2
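Alternatively, a one-line check that also prints the installed version (a small sketch, assuming the python3 build above):
python3 -c "import cv2; print(cv2.__version__)"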
Reference:
- Jinwoo's script
- https://www.scivision.co/anaconda-python-opencv3/
- For ffmpeg support, you can use: conda install -c menpo ffmpeg=3.1.3
sudo bash
sudo reboot
E.g., after fukushima reboots, run the following on the slurm master machine (marr):
sudo munged
sudo service slurm restart
sudo scontrol update node=fukushima state=resume
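To verify that the node came back, a quick check (a sketch):
sinfo -n fukushima
# the node state should no longer be down or drained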
Access to all compute engines (aside from interactive nodes) is controlled via the job scheduler. You can follow the instructions here
- Write a shell script for submission of jobs on NewRiver. This is a .sh file Chen uses. You can modify it appropriately.
#!/bin/bash
#
# Annotated example for submission of jobs on NewRiver
#
# Syntax
# '#' denotes a comment
# '#PBS' denotes a PBS directive that is applied during execution
#
# More info
# https://secure.hosting.vt.edu/www.arc.vt.edu/computing/newriver/#examples
#
# Chen Gao
# Aug 16, 2017
#
# Account under which to run the job
#PBS -A vllab_2017
# Access group. Do not change this line.
#PBS -W group_list=newriver
# Set some system parameters (Resource Request)
#
# NewRiver has the following hardware:
# a. 100 24-core, 128 GB Intel Haswell nodes
# b. 16 24-core, 512 GB Intel Haswell nodes
# c. 8 24-core, 512 GB Intel Haswell nodes with 1 Nvidia K80 GPU
# d. 2 60-core, 3 TB Intel Ivy Bridge nodes
# e. 39 28-core, 512 GB Intel Broadwell nodes with 2 Nvidia P100 GPU
#
# Resources can be requested by specifying the number of nodes, cores, memory, GPUs, etc
# Examples:
# Request 2 nodes with 24 cores each
# #PBS -l nodes=2:ppn=24
# Request 4 cores (on any number of nodes)
# #PBS -l procs=4
# Request 12 cores with 20gb memory per core
# #PBS -l procs=12,pmem=20gb
# Request 2 nodes with 24 cores each and 20gb memory per core (will give two 512gb nodes)
# #PBS -l nodes=2:ppn=24,pmem=20gb
# Request 2 nodes with 24 cores per node and 1 gpu per node
# #PBS -l nodes=2:ppn=24:gpus=1
# Request 2 cores with 1 gpu each
# #PBS -l procs=2,gpus=1
#PBS -l procs=12,pmem=16gb,walltime=2:20:00:00
# Set Queue name
# normal_q for production jobs on all Haswell nodes (nr003-nr126)
# largemem_q for jobs on the two 3TB, 60-core Ivy Bridge servers (nr001-nr002)
# dev_q for development/debugging jobs on Haswell nodes. These jobs must be short but can be large.
# vis_q for visualization jobs on K80 GPU nodes (nr019-nr027). These jobs must be both short and small.
# open_q for jobs not requiring an allocation. These jobs must be both short and small.
# p100_normal_q for production jobs on P100 GPU nodes
# p100_dev_q for development/debugging jobs on P100 GPU nodes. These jobs must be short but can be large.
# For more on queues as policies, see https://www.arc.vt.edu/newriver#policy
#PBS -q normal_q
# Send emails to -M when
# a : a job aborts
# b : a job begins
# e : a job ends
#PBS -M <PID>@vt.edu
#PBS -m bea
# Add any modules you might require. This example adds matlab module.
# Use 'module avail' command to see a list of available modules.
#
module load matlab
# Navigate to the directory from which this script was executed
cd /home/chengao/BIrdDetection/Chen_code
# Below here enter the commands to start your job. A few examples are provided below.
# Some useful variables set by the job:
# $PBS_O_WORKDIR Directory from which the job was submitted
# $PBS_NODEFILE File containing list of cores available to the job
# $PBS_GPUFILE File containing list of GPUs available to the job
# $PBS_JOBID Job ID (e.g., 107619.master.cluster)
# $PBS_NP Number of cores allocated to the job
### If running a MATLAB job ###
#
# Open a MATLAB instance and call Rich_new()
#matlab -nodisplay -r "addpath('Chen_code'); Rich_new;exit"
### If running a TensorFlow job ###
#
- To submit your job to the queuing system, use the qsub command. For example, if your script is in "JobScript.qsub", the command would be:
qsub ./JobScript.qsub
- This will return your job name in the form xxxxxx.master.cluster, where xxxxxx is the job number.
- To check a job’s status, use the checkjob command:
checkjob -v xxxxxx
- To check resource usage on the nodes available to a running job, use:
jobload xxxxxx
- To remove a job from the queue, or stop a running job, use the qdel command:
qdel xxxxxx
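For example, a typical session might look like this (using the job number from the example above):
qsub ./JobScript.qsub    # prints 107619.master.cluster
checkjob -v 107619       # check the job's status
jobload 107619           # check resource usage on its nodes
qdel 107619              # remove or stop the job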
To start an interactive job on a GPU node, use the interact command, e.g.:
interact -q p100_dev_q -lnodes=1:ppn=28:gpus=2 -A vllab_2017
# .bashrc snippet that picks the right Anaconda install depending on the cluster
serv_name=$(hostname)
if [[ $serv_name == *"hu"* ]];
then
# Set up Huckleberry Dependencies
export PATH="/home/user_name/miniconda2/bin:$PATH"
else
# Set up Newriver Dependencies
export PATH="/home/user_name/anaconda2/bin:$PATH"
fi
WARNING: DO NOT follow the instructions from ARC. They do not work, because cudnn is not visible on the GPU nodes.
0. Connect to a GPU node.
- Install Anaconda python of your choice.
module purge
module load cuda/8.0.44
- Download cudnn from here.
- Add LD_LIBRARY_PATH to your .bashrc file. Jinwoo's example:
# set the server name
serv_name=$(hostname)
if [[ $serv_name == *"hu"* ]];
then
# for PowerAI (huckleberry)
# added by Miniconda2 4.3.14 installer
export PATH="/home/jinchoi/pkg/miniconda2/bin:$PATH"
else
# for newriver
# added by Anaconda2 4.4.0 installer
export PATH="/home/jinchoi/pkg/anaconda2_nr/bin:$PATH"
export LD_LIBRARY_PATH=/home/jinchoi/lib/cuda/lib64:$LD_LIBRARY_PATH
fi
source ~/.bashrc
so that the OS can locate your cudnn directory.
- Follow the official TensorFlow installation procedure provided here.
- Enjoy!
- Install Anaconda and make your conda environment
- Install from source (ver 2.4.13). Installing a recent OpenCV with the Deep Neural Network module might be tricky. If you do not use the OpenCV DNN module, just install 2.4.13 without DNN.
$git clone https://github.com/Itseez/opencv.git
$cd opencv
$git checkout 2.4
$mkdir build
$cd build
- Do cmake. The following is the cmake command I used. You may want to change the PATH variables according to your miniconda installation path.
$cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_INSTALL_PREFIX=~/pkg/opencv_2.4.13_build/ \
-D INSTALL_PYTHON_EXAMPLES=ON \
-D INSTALL_C_EXAMPLES=OFF \
-D PYTHON_EXECUTABLE=/home/jinchoi/pkg/miniconda2/envs/tensorflow/bin/python \
-D PYTHON_PACKAGES_PATH=/home/jinchoi/pkg/miniconda2/envs/tensorflow/lib \
-D BUILD_EXAMPLES=ON ..
- Do make
$make -j32
- Set up your paths in the .bashrc file. The following are the paths in my .bashrc file:
export LD_LIBRARY_PATH=/home/jinchoi/pkg/opencv/build/lib/:$LD_LIBRARY_PATH
export INCLUDE_PATH=/home/jinchoi/pkg/opencv/include:$INCLUDE_PATH
export PYTHONPATH=/home/jinchoi/pkg/opencv/build/lib:$PYTHONPATH
export PYTHONPATH=/home/jinchoi/pkg/opencv/include:$PYTHONPATH
- Enjoy!
$python
>>> import cv2
If you don't see any errors, you are good to go.
Please fully utilize all the GPUs when you are submitting jobs to PowerAI. Each GPU node on PowerAI consists of 4 GPUs. If you submit a job naively, it will use only one GPU but will block other people from using that node, which is very inefficient. So please run 4 jobs per GPU node. This matters because people outside the lab have started to use PowerAI.
So when you have 4 different models to train, please DO NOT run sbatch model1.sh, sbatch model2.sh, sbatch model3.sh, sbatch model4.sh, unless each of your jobs requires more than 16GB of GPU memory.
Instead, please run sbatch model1.sh, ssh to the node your model1 is assigned to, and then run your three other models in the background on that node using nohup, screen, or whatever you prefer, as sketched below.
As far as we know, this is the best way to submit multiple jobs on a single GPU node. If you have a more elegant way to submit 4 different jobs on a single GPU node, please let us know.
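For example, a minimal sketch (script and node names are hypothetical):
sbatch model1.sh     # submit the first job
squeue -u $USER      # find the node it was assigned to, e.g. pwr001
ssh pwr001
nohup bash model2.sh > model2.log 2>&1 &
nohup bash model3.sh > model3.log 2>&1 &
nohup bash model4.sh > model4.log 2>&1 &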
You can ask James McClure if you have questions. Or you can ask Jinwoo.
For general instructions on how to access unix systems, you can check this link.
- Make an account. You may ask Jia-Bin to do this.
- Just ssh to huckleberry1.arc.vt.edu with your PID:
ssh [email protected]
- Enjoy!
There are two ways to access off campus.
- Install Pulse (VPN client) from here
- Turn on Pulse
- SSH to the huckleberry as if you are on campus. (refer to 1)
- Enjoy!
- ssh to one of the CVMLP clusters with port number 2222 (you need port 2222 to access the CVMLP clusters from off campus), e.g.
ssh -p 2222 <your_pid>@marr.ece.vt.edu
- ssh to the PowerAI as if you are on campus. (refer to 1)
- From personal computer:
$ scp -r <userid>@godel.ece.vt.edu:/srv/share/lab_helpful_files/ ~/
- Change the username in the ~/lab_helpful_files/config file to your CVL account username, and move the file to ~/.ssh:
$ mv ~/lab_helpful_files/config ~/.ssh/
- Add the following lines to the “config” file
Host huck
Hostname huckleberry1.arc.vt.edu
Port 22
User jinchoi
You should change User to <your_pid>. You may change huck to whatever alias you want to use.
- Generate a key pair:
$ ssh-keygen -t rsa
(press Enter through all the prompts)
- Make sure your home directory on the server has a .ssh folder: log in and check whether
$ cd ~/.ssh
works; if not, type
$ mkdir .ssh
- Copy your public key to the server:
$ scp ~/.ssh/id_rsa.pub <userid>@huckleberry1.arc.vt.edu:~/.ssh/
- On the PowerAI server (huckleberry):
$ cd ~/.ssh/
$ cat id_rsa.pub >> authorized_keys2
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys2
- Now you can type the following to connect to PowerAI from your PC (If you are off-campus, you need to use VPN)
$ ssh huck
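With the alias in place, file transfers also become shorter, e.g. (the file name here is hypothetical):
$ scp ./experiment.py huck:~/src/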
You can set up a remote editing environment using an sftp connection. This example uses Atom + Remote FTP, but you can do similar things with other editors + sftp plug-ins.
- First setup your password-less ssh environment. Follow the instructions in 2.
- On your local machine, choose a project directory to sync your source codes.
- Install RemoteFTP: go to Settings->Install, type RemoteFTP, and install it.
- Write a .ftpconfig file in the chosen directory as follows.
{
"protocol": "sftp",
"host": "huckleberry1.arc.vt.edu", // string - Hostname or IP address of the server. Default: 'localhost'
"port": 22, // integer - Port number of the server. Default: 22
"user": "jinchoi", // string - Username for authentication. Default: (none)
"remote": "/home/jinchoi/src/",
"privatekey": "/Users/jwC/.ssh/id_rsa" // string - Path to the private key file (in OpenSSH format). Default: (none)
}
For the "user", "remote", and "privatekey" fields, you should fill in your own settings. You may use the VPN client if you are off campus and want to use PowerAI. If you are off campus and want to use the CVMLP clusters, you can simply use port number 2222.
- Connect to the server using "Packages->RemoteFTP->Connect"
- Enjoy!
The current PowerAI GPU nodes do not have internet access, and it is not clear how to set up a Jupyter Notebook environment there. This link might be helpful, but it has not been tested yet.
The below instructions can be followed when connected to CVMLP/university wifi.
- SSH into the PowerAI cluster
- Launch the notebook using:
$jupyter notebook --no-browser --port=7777
You should get an output like "The Jupyter Notebook is running at: https://localhost:7777/?token=104f6f1af5b7fdd761f28f5746c35b47f89d00698157ce85".
- Open a new terminal on the local machine and use the following port-forwarding command.
$ssh -N -L localhost:7777:localhost:7777 [email protected]
- Now open the browser, visit localhost:7777, and enter the token key from the link where your notebook is running.
You should submit GPU jobs only using slurm. You can follow the instructions here.
In addition to the instructions, here is some more useful information.
Our group members can use priority_q when submitting either interactive or batch jobs on PowerAI. Instead of submitting jobs to normal_q (crowded, with limited walltime), we can submit jobs to priority_q. In priority_q, we have a relaxed walltime restriction, and at least 40% of the computation cycles of PowerAI are guaranteed to be allocated to priority_q.
scontrol show jobid -dd <jobid>
It will show you which .sh file you used for that job ID. Sometimes you need this information.
This is a train.sh file Jinwoo uses. You can modify it appropriately.
#!/bin/bash -l
#SBATCH -p normal_q # you can also use priority_q
#SBATCH -N 1
#SBATCH -t 144:00:00
#SBATCH -J c3d-full
#SBATCH -o ./log/C3D-RGB-Full_Training_UCF101_lr_0.001_batch_256_320k_full_img_size.log
hostname
echo $CUDA_VISIBLE_DEVICES
module load cuda
source /opt/DL/tensorflow/bin/tensorflow-activate
export PYTHONHOME="/home/jinchoi/pkg/miniconda2/envs/tensorflow"
srun python ./tools/train_net.py --device gpu --device_id 0 --imdb UCF101_RGB_1_split_0_TRAIN --cfg experiments/cfgs/c3d_rgb_detect_lr_0.001.yml --network C3D_detect_train --iters 3200000
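To submit the script, something like the following should work (a sketch; make sure the ./log directory exists, since the -o path above points into it):
sbatch train.sh
squeue -u $USER    # check that the job is queued/running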
Each GPU node on PowerAI consists of 4 GPUs, but there are no instructions on how to submit multiple jobs (e.g. 4 different jobs) to one GPU node.
James says you can use CUDA_VISIBLE_DEVICES to do this, but it has not been tested yet.
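If you try it, an untested sketch might look like the following (script names are hypothetical), run directly on the GPU node:
CUDA_VISIBLE_DEVICES=0 nohup python train0.py > log0.txt 2>&1 &
CUDA_VISIBLE_DEVICES=1 nohup python train1.py > log1.txt 2>&1 &
CUDA_VISIBLE_DEVICES=2 nohup python train2.py > log2.txt 2>&1 &
CUDA_VISIBLE_DEVICES=3 nohup python train3.py > log3.txt 2>&1 &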
The PowerAI architecture is ppc64le: there is no standard Anaconda installer for this architecture. Instead, install Miniconda, which contains only conda and python. You can find one here. Download it and install it in your custom directory on the PowerAI clusters. Then you can install other modules using conda install … or pip install ... and so on.
Open your .bashrc file:
$ vi .bashrc
and add the following: export PATH="/home/chengao/miniconda2/bin:$PATH"
Update (08/24/2017): It looks like Continuum now supports standard Anaconda for the ppc64le architecture. Check this out: https://www.continuum.io/downloads#linux. Not tested yet, though.
- Basically, follow this instruction. However, I didn't get the dependencies with apt-get, as I don't have sudo permission. I'm describing the setup I used, which led to a successful installation.
- Make an ffmpeg source dir:
mkdir ~/pkg/ffmpeg_sources
- Compile the required dependencies: libx264, libx265, libfdk-aac, libopus, libvpx (yasm and libmp3lame are not needed).
- Do the ffmpeg build and install:
$cd ~/ffmpeg_sources
$wget https://ffmpeg.org/releases/ffmpeg-snapshot.tar.bz2
$tar xjvf ffmpeg-snapshot.tar.bz2
$cd ffmpeg
$PATH="$HOME/bin:$PATH" PKG_CONFIG_PATH="$HOME/pkg/ffmpeg_build/lib/pkgconfig" ./configure --prefix="$HOME/pkg/ffmpeg_build" --pkg-config-flags="--static" --extra-cflags="-I$HOME/pkg/ffmpeg_build/include" --extra-ldflags="-L$HOME/pkg/ffmpeg_build/lib" --bindir="$HOME/bin" --enable-gpl --enable-libfdk-aac --enable-libfreetype --enable-libopus --enable-libvpx --enable-libx264 --enable-libx265 --enable-nonfree
$PATH="$HOME/bin:$PATH" make
$make install
$hash -r
- The installation is now complete, and ffmpeg is ready for use. Your newly compiled FFmpeg programs are in ~/bin. Add ~/bin to the $PATH variable in your .bashrc (see the example below).
- Enjoy!
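For example, the .bashrc line might look like:
export PATH="$HOME/bin:$PATH"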
References: [1] https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu
- Install miniconda and make your conda environment
- Install from source (ver 3.1.0)
$git clone https://github.com/Itseez/opencv.git
This approach (installing from source, not from a zip file) avoids a known issue.
- Check out and patch:
$cd opencv
$git checkout 3.1.0 && git format-patch -1 10896129b39655e19e4e7c529153cb5c2191a1db && git am < 0001-GraphCut-deprecated-in-CUDA-7.5-and-removed-in-8.0.patch
- Manually update two source files according to this. This step avoids the "_FPU_SINGLE declaration error".
- Go to the opencv root dir
$cd opencv
$mkdir build
$cd build
- Do cmake. The following is the cmake command I used. You may want to change the PATH variables according to your miniconda installation path.
$cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_INSTALL_PREFIX=~/pkg/opencv_3.1.0_build/ \
-D INSTALL_PYTHON_EXAMPLES=ON \
-D INSTALL_C_EXAMPLES=OFF \
-D OPENCV_EXTRA_MODULES_PATH=~/pkg/opencv_contrib-3.1.0/modules \
-D PYTHON_EXECUTABLE=/home/jinchoi/pkg/miniconda2/envs/tensorflow/bin/python \
-D PYTHON_PACKAGES_PATH=/home/jinchoi/pkg/miniconda2/envs/tensorflow/lib \
-D BUILD_EXAMPLES=ON ..
- Do make
$make -j32
- Set up your paths in the .bashrc file. The following are the paths in my .bashrc file:
export LD_LIBRARY_PATH=/home/jinchoi/pkg/opencv/build/lib/:$LD_LIBRARY_PATH
export INCLUDE_PATH=/home/jinchoi/pkg/opencv/include:$INCLUDE_PATH
export PYTHONPATH=/home/jinchoi/pkg/opencv/build/lib:$PYTHONPATH
export PYTHONPATH=/home/jinchoi/pkg/opencv/include:$PYTHONPATH
- Enjoy!
$python
>>> import cv2
If you don't see any errors, you are good to go.
References:
[1] opencv/opencv#6677
[2] https://github.com/opencv/opencv/pull/6982/commits/0df9cbc954c61fca0993b563c2686f9710978b08
[3] https://www.pyimagesearch.com/2016/10/24/ubuntu-16-04-how-to-install-opencv/
You can load the pre-installed TensorFlow as follows:
module load cuda
source /opt/DL/tensorflow/bin/tensorflow-activate
Enjoy!
PyTorch installation is quite simple. Clone the sources, fulfill the dependencies, and there you go!
git clone https://github.com/pytorch/pytorch.git
export CMAKE_PREFIX_PATH=[anaconda root directory]
conda install numpy pyyaml setuptools cmake cffi
cd pytorch
python setup.py install
Done!
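A quick sanity check (a sketch; run it outside the pytorch source directory so Python does not import the local folder):
cd ~
python -c "import torch; print(torch.__version__)"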
No one has successfully installed a custom Caffe on PowerAI yet. There are problems installing dependencies such as glog, gflags, and google protobuf.