malteos/getting-started

Getting started

PRs Welcome

This repository contains tutorials, scripts, examples etc. for getting started with your machine learning / NLP project.

The information is mainly tailored to users of DFKI's PEGASUS system.

Software development

IDE

Debugger

One of the key features of any good IDE is its debugging support. A debugger makes it much easier to find and fix bugs, so you no longer need to litter your code with print statements.

Tutorials for debuggers:

GitHub Copilot

Coding best practices

Get familiar with coding standards and best practices! This improves your code considerably and makes it much easier to maintain.

You can use automated tools to enforce coding styles:

Remote server

Modern machine learning requires computing resources that your local machine usually lacks, so you need to connect to a remote server to run your experiments.

SSH

SSH keys

SSH config

Example

An SSH config may contain entries like the one below. Replace <dfki_account> with your DFKI account and <pegasus_account> with your PEGASUS account.

# PEGASUS via SSH-Gate
Host pegasus.dfki  # a custom hostname
    User <pegasus_account>
    HostName login2.pegasus.kl.dfki.de  # change this to a different login node if needed
    ProxyJump <dfki_account>@sshgate.sb.dfki.de

With such a config, you can simply connect to PEGASUS by typing ssh pegasus.dfki.
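
The alias is not limited to interactive logins: scp and rsync read the same SSH config, so file transfers go through the gateway as well. A minimal sketch, assuming the pegasus.dfki alias from above; the helper name `pegasus_push`, the `DRY_RUN` switch, and the target path are illustrative and not part of any PEGASUS tooling.

```shell
# Hypothetical helper: sync a local directory to PEGASUS via the "pegasus.dfki" alias.
# Usage: pegasus_push <local_dir> <remote_dir>
# Set DRY_RUN=1 to print the rsync command instead of running it.
pegasus_push() {
    local_dir="$1"
    remote_dir="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "rsync -avz ${local_dir} pegasus.dfki:${remote_dir}"
    else
        rsync -avz "${local_dir}" "pegasus.dfki:${remote_dir}"
    fi
}

# Example: only prints the command, nothing is transferred.
DRY_RUN=1 pegasus_push ./my_project "/netscratch/<pegasus_account>/"
```

Checking the printed command first is a cheap way to catch a wrong target path before any data moves.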

SSH proxy

An SSH connection can be used as a proxy to access resources on the intranet:

# replace <proxy_port> with a port number, e.g. 8001
ssh -D <proxy_port> pegasus.dfki

This creates a SOCKS proxy (see https://ma.ttias.be/socks-proxy-linux-ssh-bypass-content-filters/).

Enable this proxy via your system settings or browser settings, or use a proxy browser plugin like FoxyProxy:

Use the following settings:

  • Proxy Type: SOCKS5
  • Proxy IP address: localhost
  • Proxy port: <proxy_port>
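
Command-line tools can use the same tunnel without changing any browser settings. A minimal sketch, assuming curl is installed and the `ssh -D` session from above is open on port 8001; the helper name `intranet_curl` and the URL placeholder are hypothetical.

```shell
PROXY_PORT="${PROXY_PORT:-8001}"  # must match the port given to `ssh -D`

# Hypothetical helper: run curl through the SOCKS5 proxy.
# --socks5-hostname lets the proxy resolve host names, so intranet-only names work.
intranet_curl() {
    curl --socks5-hostname "localhost:${PROXY_PORT}" "$@"
}

# Usage (with the `ssh -D` session open in another terminal):
#   intranet_curl -I https://<intranet_host>/
```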

tmux

Your connection to a remote server might be lost, so it is important to keep sessions on the remote server alive independently of your own connection. tmux provides this and many other features that make working on remote servers much easier.

Alternatives to tmux are: screen, ...
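
A typical tmux workflow fits in a few commands. The sketch below assumes tmux is installed on the login node; the helper name `tmux_session` and the session name `train` are just examples.

```shell
# Attach to the named session if it exists, otherwise create it.
tmux_session() {
    tmux new-session -A -s "${1:-train}"
}

# Typical workflow on the remote server:
#   tmux_session train    # start or re-attach
#   ... start your job, then press Ctrl-b d to detach ...
#   tmux ls               # list sessions after reconnecting
#   tmux_session train    # pick up where you left off
```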

Environment

.bashrc example

The .bashrc in your home directory is loaded every time you start a bash session. It is a good place to define global environment variables, e.g., cache paths or login credentials. For example:

# append to "~/.bashrc"

# shortcuts
alias ll="ls -l"

# PIP cache
# https://projects.dfki.uni-kl.de/km-publications/web/ML/core/hpc-doc/posts/pypi-cache/
export PIP_INDEX_URL=https://pypi-cache/index
export PIP_TRUSTED_HOST=pypi-cache
export PIP_NO_CACHE_DIR=true

# Huggingface
export HF_LOGIN=<your huggingface login>
export HF_PASSWORD=<your huggingface API key>
export HF_DATASETS_CACHE="/netscratch/$USER/datasets/hf_datasets_cache/"
export TRANSFORMERS_CACHE="/netscratch/$USER/datasets/transformers_cache/"

# Weights & Biases
export WANDB_API_KEY=<your WANDB API key>
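
The cache variables only help if the directories they point to exist and are writable. A small sketch that creates them once; `CACHE_ROOT` and `USER_NAME` are hypothetical variables introduced here so the snippet can also be tried outside PEGASUS, where it falls back to a temporary directory.

```shell
# Pick the cache root: /netscratch/<user> on PEGASUS, a temp directory elsewhere.
USER_NAME="${USER:-$(id -un)}"
if [ -d "/netscratch/${USER_NAME}" ]; then
    CACHE_ROOT="/netscratch/${USER_NAME}/datasets"
else
    CACHE_ROOT="$(mktemp -d)/datasets"
fi

export HF_DATASETS_CACHE="${CACHE_ROOT}/hf_datasets_cache/"
export TRANSFORMERS_CACHE="${CACHE_ROOT}/transformers_cache/"

# Create both directories (no error if they already exist).
mkdir -p "$HF_DATASETS_CACHE" "$TRANSFORMERS_CACHE"
echo "Caches live under: ${CACHE_ROOT}"
```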

Python environments (conda / virtualenv / ...)

Slurm

Read the PEGASUS documentation. It should provide all necessary information. For other questions, please use the cluster chat.

Some potentially useful commands:

# starts an interactive job with pytorch (8hrs time limit)
srun -K \
  --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_23.07-py3.sqsh \
  --container-workdir="$(pwd)" \
  --container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"$(pwd)":"$(pwd)" \
  --time 08:00:00 --pty bash

# list your current jobs
squeue --me
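
For longer runs, the interactive srun call can be wrapped in a batch script so the job queues without an open terminal. This is a sketch only: the file name `train_job.sh`, the job name, the log file pattern, and `train.py` are hypothetical, and partition and resource flags should be taken from the PEGASUS documentation.

```shell
# Write a hypothetical batch script; adjust names and limits to your project.
cat > train_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=my-train
#SBATCH --time=08:00:00
#SBATCH --output=slurm-%j.out

srun -K \
  --container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_23.07-py3.sqsh \
  --container-workdir="$(pwd)" \
  --container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"$(pwd)":"$(pwd)" \
  python train.py
EOF

# Check the script for syntax errors before submitting.
bash -n train_job.sh && echo "train_job.sh looks syntactically fine"

# Then submit and monitor (on a PEGASUS login node):
#   sbatch train_job.sh
#   squeue --me
#   scancel <job_id>
```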

Docker & containers

PEGASUS uses enroot for containers. If you have prebuilt Docker images, you can convert them as follows:

srun -p $ANY_PARTITION \
  enroot import \
  -o /netscratch/$USER/enroot/malteos_eulm_latest.sqsh \
  'docker:https://ghcr.io#malteos/eulm:latest'
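
The output file name in this example mirrors the image reference with `/` and `:` replaced by `_`, the same pattern as the PyTorch image path in the Slurm section. A small sketch that derives such a name; the function name `sqsh_name` is hypothetical, and the naming scheme is a convention inferred from the examples here, not something enroot requires.

```shell
# Hypothetical helper: turn an image reference into a .sqsh file name
# by replacing "/" and ":" with "_".
sqsh_name() {
    printf '%s.sqsh\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

sqsh_name "malteos/eulm:latest"               # prints malteos_eulm_latest.sqsh
sqsh_name "nvcr.io/nvidia/pytorch:23.07-py3"  # prints nvcr.io_nvidia_pytorch_23.07-py3.sqsh
```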

Build custom images with Podman:

sbin/podman_build.sh

Jupyter notebooks

You can run Jupyter notebooks on PEGASUS:

# start interactive compute job
# --container-save=$EVAL_DEV_IMAGE 
srun \
  --container-mounts=/netscratch:/netscratch,/home/$USER:/home/$USER \
  --container-image=$IMAGE \
  --container-workdir=$(pwd) -p RTX6000 --time 08:00:00 --pty /bin/bash

# run in compute job
echo "Jupyter starting at ... https://${HOSTNAME}.kl.dfki.de:8880" && jupyter notebook --ip=0.0.0.0 --port=8880 \
    --allow-root --no-browser --config /home/$USER/.jupyter/jupyter_notebook_config.json \
    --notebook-dir /netscratch/$USER/experiments

# start with fixed token (for VSCode -> "Specify Jupyter connection")
JUPYTER_TOKEN=yoursecrettoken jupyter notebook
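
A fixed token should still be hard to guess. A minimal sketch that generates a random one from /dev/urandom with standard tools only; the variable name matches the JUPYTER_TOKEN example above, while the token length is an arbitrary choice.

```shell
# Generate a 32-character hex token from 16 random bytes.
JUPYTER_TOKEN="$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')"
export JUPYTER_TOKEN
echo "Token: ${JUPYTER_TOKEN}"

# Then, inside the compute job:
#   jupyter notebook --ip=0.0.0.0 --port=8880 --no-browser
```

Paste the printed token into the VSCode connection dialog when it asks for the server URL or token.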

Other useful links & resources
