
Add docker compose and change containerized setup instructions to use it (EleutherAI#1113)

* Add pythia 14M config

* Create 31M.yml

* Add docker compose, update readme docker instructions to utilize it

* Add logging limits to docker-compose files

* Change data mount from /gpt-neox/data to /data/

This prevents possible errors if the user already has a /data/ directory in their /gpt-neox/ folder

* Update README.md

Formats the commands in the changed sections as code blocks

* Make the docker-compose spinup tidier

* Avoid config bloat by only providing the updated paths

* Apply precommit

---------

Co-authored-by: Quentin Anthony <[email protected]>
segyges and Quentin-Anthony committed Jan 9, 2024
1 parent f14782a commit e6e944a
Showing 7 changed files with 134 additions and 11 deletions.
10 changes: 5 additions & 5 deletions Dockerfile
```
@@ -26,11 +26,11 @@ LABEL org.opencontainers.image.base.name="docker.io/nvidia/cuda:11.7.1-devel-ubu
#### System package (uses default Python 3 version in Ubuntu 20.04)
RUN apt-get update -y && \
    apt-get install -y \
        git python3.9 python3-dev libpython3-dev python3-pip sudo pdsh \
        htop llvm-9-dev tmux zstd software-properties-common build-essential autotools-dev \
        nfs-common pdsh cmake g++ gcc curl wget vim less unzip htop iftop iotop ca-certificates ssh \
        rsync iputils-ping net-tools libcupti-dev libmlx4-1 infiniband-diags ibutils ibverbs-utils \
        rdmacm-utils perftest rdma-core nano && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3 1 && \
    update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1 && \
    pip install --upgrade pip && \
```
64 changes: 61 additions & 3 deletions README.md
@@ -225,11 +225,11 @@ You can then kick off a training run with `sbatch my_sbatch_script.sh`

### Containerized Setup

We also provide a Dockerfile and docker-compose configuration if you prefer to run NeoX in a container.

To run the container, you need appropriate GPU drivers, an up-to-date installation of Docker, and the [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). To check that your installation is working, you can run NVIDIA's "sample workload":

```
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

If that runs successfully, export `NEOX_DATA_PATH` and `NEOX_CHECKPOINT_PATH` in your environment to specify your data directory and a directory for storing and loading checkpoints:

```
export NEOX_DATA_PATH=/mnt/sda/data/enwiki8  # or wherever your data is stored on your system
export NEOX_CHECKPOINT_PATH=/mnt/sda/checkpoints
```

Then, from the gpt-neox directory, you can build the image and run a shell in a container with:

```
docker compose run gpt-neox bash
```

After the build, you should get a shell in the container's home directory, with your data and checkpoint directories mounted:
```
mchorse@537851ed67de:~$ echo $(pwd)
/home/mchorse
mchorse@537851ed67de:~$ ls -al
total 48
drwxr-xr-x 1 mchorse mchorse 4096 Jan 8 05:33 .
drwxr-xr-x 1 root root 4096 Jan 8 04:09 ..
-rw-r--r-- 1 mchorse mchorse 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 mchorse mchorse 3972 Jan 8 04:09 .bashrc
drwxr-xr-x 4 mchorse mchorse 4096 Jan 8 05:35 .cache
drwx------ 3 mchorse mchorse 4096 Jan 8 05:33 .nv
-rw-r--r-- 1 mchorse mchorse 807 Feb 25 2020 .profile
drwxr-xr-x 2 root root 4096 Jan 8 04:09 .ssh
drwxrwxr-x 8 mchorse mchorse 4096 Jan 8 05:35 chk
drwxrwxrwx 6 root root 4096 Jan 7 17:02 data
drwxr-xr-x 11 mchorse mchorse 4096 Jan 8 03:52 gpt-neox
```

For a long-running job, you should run

```
docker compose up -d
```

to run the container in detached mode, and then, in a separate terminal session, run

```
docker compose exec gpt-neox bash
```

You can then run any job you want from inside the container, for example kicking off a training run as sketched below.
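A minimal sketch of launching training from inside the container, assuming the working directory is `/home/mchorse/gpt-neox` and your data is in place; the config files named here are illustrative and can be swapped for your own:

```
# Merge a model config with the container path overrides (illustrative configs)
python ./deepy.py train.py configs/pythia/14M.yml configs/docker/paths.yml
```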

Concerns when running for a long time or in detached mode include:
- You will have to terminate the container manually when you are no longer using it.
- If you want processes to continue running after your shell session ends, you will need to background them.
- If you want logging, make sure to pipe logs to disk or set up wandb; a sketch follows this list.
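For example, a job started from an exec shell can be backgrounded with its logs piped to the mounted checkpoint directory. This is a sketch, not part of the commit, and the config arguments are illustrative:

```
# Background the job and keep logs on the host via the mounted checkpoint dir
nohup python ./deepy.py train.py configs/pythia/14M.yml configs/docker/paths.yml \
  > /home/mchorse/chk/train.log 2>&1 &
```

When you are done, shut the container down from the host with `docker compose down`.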

If you prefer to run the prebuilt container image from Docker Hub, you can run the docker compose commands with `-f docker-compose-dockerhub.yml` instead, e.g.,

```
docker compose -f docker-compose-dockerhub.yml run gpt-neox bash
```
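Docker Compose pulls the image automatically if it is not present locally; to refresh it explicitly, the standard pull subcommand works with the same file flag:

```
docker compose -f docker-compose-dockerhub.yml pull
```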

## Usage
12 changes: 12 additions & 0 deletions configs/docker/paths.yml
@@ -0,0 +1,12 @@
```
{
  "train-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
  "valid-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
  "test-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],

  "tokenizer-type": "HFTokenizer",
  "vocab-file": "/home/mchorse/data/tokenizers/20B_tokenizer.json",

  "save": "/home/mchorse/chk/",
  "load": "/home/mchorse/chk/",
  "checkpoint_validation_with_forward_pass": False
}
```
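These container-internal paths resolve to your host's `NEOX_DATA_PATH` and `NEOX_CHECKPOINT_PATH` mounts. A quick check from inside the container that the expected files are in place; this is illustrative and assumes you have downloaded the Pile data and the 20B tokenizer into your data directory:

```
# Verify the mounted tokenizer and checkpoint directory are visible
ls /home/mchorse/data/tokenizers/20B_tokenizer.json /home/mchorse/chk
```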
2 changes: 1 addition & 1 deletion configs/pythia/14M.yml
```
@@ -14,7 +14,7 @@
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",

  "attention-config": [[["flash"], 6]],

  "scaled-upper-triang-masked-softmax-fusion": true,
```
4 changes: 2 additions & 2 deletions configs/pythia/31M.yml
```
@@ -14,7 +14,7 @@
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",

  "attention-config": [[["flash"], 6]],

  "scaled-upper-triang-masked-softmax-fusion": true,
```

```
@@ -54,7 +54,7 @@
  # activation checkpointing
  "checkpoint-activations": false,
  "checkpoint-num-layers": 1,
  "partition-activations": false,
  "synchronize-each-layer": true,

  # regularization
```
25 changes: 25 additions & 0 deletions docker-compose-dockerhub.yml
@@ -0,0 +1,25 @@
```
version: '3'
services:
  gpt-neox:
    command: nvidia-smi -q --loop=10
    image: leogao2/gpt-neox:main
    shm_size: 1g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    logging:
      options:
        max-size: "100m"
        max-file: "3"
    volumes:
      - ${NEOX_DATA_PATH}:/home/mchorse/data
      - ${NEOX_CHECKPOINT_PATH}:/home/mchorse/chk
      - .:/home/mchorse/gpt-neox
```
28 changes: 28 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,28 @@
```
version: '3'
services:
  gpt-neox:
    command: nvidia-smi -q --loop=10
    image: gpt-neox
    build:
      context: .
      dockerfile: Dockerfile
    shm_size: 1g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    logging:
      options:
        max-size: "100m"
        max-file: "3"
    volumes:
      - ${NEOX_DATA_PATH}:/home/mchorse/data
      - ${NEOX_CHECKPOINT_PATH}:/home/mchorse/chk
      - .:/home/mchorse/gpt-neox
```
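With either compose file in place, a quick way to confirm GPU visibility from the service itself, as a smoke test that is not part of this commit, is:

```
# Start a one-off container for the gpt-neox service and check the GPUs
docker compose run --rm gpt-neox nvidia-smi
```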
