
Add docker compose and change containerized setup instructions to use it (EleutherAI#1113)

* Add pythia 14M config

* Create 31M.yml

* Add docker compose, update readme docker instructions to utilize it

* Add logging limits to docker-compose files

* Change data mount from /gpt-neox/data to /data/

This prevents possible errors if the user already has a /data/ directory in their /gpt-neox/ folder

* Update README.md

Formats the commands in the changed sections as code blocks

* Make the docker-compose spinup tidier

* Avoid config bloat by only providing the updated paths

* Apply precommit

---------

Co-authored-by: Quentin Anthony <[email protected]>
segyges and Quentin-Anthony committed Jan 9, 2024
1 parent f14782a commit e6e944a
Showing 7 changed files with 134 additions and 11 deletions.
10 changes: 5 additions & 5 deletions Dockerfile
```
@@ -26,11 +26,11 @@ LABEL org.opencontainers.image.base.name="docker.io/nvidia/cuda:11.7.1-devel-ubu
#### System package (uses default Python 3 version in Ubuntu 20.04)
RUN apt-get update -y && \
    apt-get install -y \
        git python3.9 python3-dev libpython3-dev python3-pip sudo pdsh \
        htop llvm-9-dev tmux zstd software-properties-common build-essential autotools-dev \
        nfs-common pdsh cmake g++ gcc curl wget vim less unzip htop iftop iotop ca-certificates ssh \
        rsync iputils-ping net-tools libcupti-dev libmlx4-1 infiniband-diags ibutils ibverbs-utils \
        rdmacm-utils perftest rdma-core nano && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3 1 && \
    update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1 && \
    pip install --upgrade pip && \
```
64 changes: 61 additions & 3 deletions README.md
@@ -225,11 +225,11 @@ You can then kick off a training run with `sbatch my_sbatch_script.sh`

### Containerized Setup

We also provide a Dockerfile and docker-compose configuration if you prefer to run NeoX in a container.

To run the container, you need appropriate GPU drivers, an up-to-date installation of Docker, and the [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). To check that your installation is working, you can run NVIDIA's "sample workload":

```
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

If that runs successfully, export `NEOX_DATA_PATH` and `NEOX_CHECKPOINT_PATH` in your environment to specify your data directory and a directory for storing and loading checkpoints:

```
export NEOX_DATA_PATH=/mnt/sda/data/enwiki8  # or wherever your data is stored on your system
export NEOX_CHECKPOINT_PATH=/mnt/sda/checkpoints
```

Then, from the gpt-neox directory, you can build the image and run a shell in a container with:

```
docker compose run gpt-neox bash
```

After the build, you should get a shell in the container's home directory, with your data and checkpoint directories mounted:
```
mchorse@537851ed67de:~$ echo $(pwd)
/home/mchorse
mchorse@537851ed67de:~$ ls -al
total 48
drwxr-xr-x 1 mchorse mchorse 4096 Jan 8 05:33 .
drwxr-xr-x 1 root root 4096 Jan 8 04:09 ..
-rw-r--r-- 1 mchorse mchorse 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 mchorse mchorse 3972 Jan 8 04:09 .bashrc
drwxr-xr-x 4 mchorse mchorse 4096 Jan 8 05:35 .cache
drwx------ 3 mchorse mchorse 4096 Jan 8 05:33 .nv
-rw-r--r-- 1 mchorse mchorse 807 Feb 25 2020 .profile
drwxr-xr-x 2 root root 4096 Jan 8 04:09 .ssh
drwxrwxr-x 8 mchorse mchorse 4096 Jan 8 05:35 chk
drwxrwxrwx 6 root root 4096 Jan 7 17:02 data
drwxr-xr-x 11 mchorse mchorse 4096 Jan 8 03:52 gpt-neox
```

For a long-running job, you should run

```
docker compose up -d
```

to run the container in detached mode, and then, in a separate terminal session, run

```
docker compose exec gpt-neox bash
```

You can then run any job you want from inside the container, for example kicking off a training run as sketched below.
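A minimal sketch of launching training from inside the container, assuming the working directory is `/home/mchorse/gpt-neox` and your data is in place; the config files named here are illustrative and can be swapped for your own:

```
# Merge a model config with the container path overrides (illustrative configs)
python ./deepy.py train.py configs/pythia/14M.yml configs/docker/paths.yml
```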

Concerns when running for a long time or in detached mode include:
- You will have to terminate the container manually when you are no longer using it.
- If you want processes to continue running after your shell session ends, you will need to background them.
- If you want logging, make sure to pipe logs to disk or set up wandb; a sketch follows this list.
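For example, a job started from an exec shell can be backgrounded with its logs piped to the mounted checkpoint directory. This is a sketch, not part of the commit, and the config arguments are illustrative:

```
# Background the job and keep logs on the host via the mounted checkpoint dir
nohup python ./deepy.py train.py configs/pythia/14M.yml configs/docker/paths.yml \
  > /home/mchorse/chk/train.log 2>&1 &
```

When you are done, shut the container down from the host with `docker compose down`.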

If you prefer to run the prebuilt container image from Docker Hub, you can run the docker compose commands with `-f docker-compose-dockerhub.yml` instead, e.g.,

```
docker compose -f docker-compose-dockerhub.yml run gpt-neox bash
```
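Docker Compose pulls the image automatically if it is not present locally; to refresh it explicitly, the standard pull subcommand works with the same file flag:

```
docker compose -f docker-compose-dockerhub.yml pull
```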

## Usage
12 changes: 12 additions & 0 deletions configs/docker/paths.yml
@@ -0,0 +1,12 @@
```
{
  "train-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
  "valid-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],
  "test-data-paths": ["/home/mchorse/data/pile_deduped/pile_0.87_deduped_text_document"],

  "tokenizer-type": "HFTokenizer",
  "vocab-file": "/home/mchorse/data/tokenizers/20B_tokenizer.json",

  "save": "/home/mchorse/chk/",
  "load": "/home/mchorse/chk/",
  "checkpoint_validation_with_forward_pass": False
}
```
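These container-internal paths resolve to your host's `NEOX_DATA_PATH` and `NEOX_CHECKPOINT_PATH` mounts. A quick check from inside the container that the expected files are in place; this is illustrative and assumes you have downloaded the Pile data and the 20B tokenizer into your data directory:

```
# Verify the mounted tokenizer and checkpoint directory are visible
ls /home/mchorse/data/tokenizers/20B_tokenizer.json /home/mchorse/chk
```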
2 changes: 1 addition & 1 deletion configs/pythia/14M.yml
```
@@ -14,7 +14,7 @@
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",

  "attention-config": [[["flash"], 6]],

  "scaled-upper-triang-masked-softmax-fusion": true,
```
4 changes: 2 additions & 2 deletions configs/pythia/31M.yml
```
@@ -14,7 +14,7 @@
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",

  "attention-config": [[["flash"], 6]],

  "scaled-upper-triang-masked-softmax-fusion": true,
```

```
@@ -54,7 +54,7 @@
  # activation checkpointing
  "checkpoint-activations": false,
  "checkpoint-num-layers": 1,
  "partition-activations": false,
  "synchronize-each-layer": true,

  # regularization
```
25 changes: 25 additions & 0 deletions docker-compose-dockerhub.yml
@@ -0,0 +1,25 @@
```
version: '3'
services:
  gpt-neox:
    command: nvidia-smi -q --loop=10
    image: leogao2/gpt-neox:main
    shm_size: 1g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    logging:
      options:
        max-size: "100m"
        max-file: "3"
    volumes:
      - ${NEOX_DATA_PATH}:/home/mchorse/data
      - ${NEOX_CHECKPOINT_PATH}:/home/mchorse/chk
      - .:/home/mchorse/gpt-neox
```
28 changes: 28 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,28 @@
```
version: '3'
services:
  gpt-neox:
    command: nvidia-smi -q --loop=10
    image: gpt-neox
    build:
      context: .
      dockerfile: Dockerfile
    shm_size: 1g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    logging:
      options:
        max-size: "100m"
        max-file: "3"
    volumes:
      - ${NEOX_DATA_PATH}:/home/mchorse/data
      - ${NEOX_CHECKPOINT_PATH}:/home/mchorse/chk
      - .:/home/mchorse/gpt-neox
```
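With either compose file in place, a quick way to confirm GPU visibility from the service itself, as a smoke test that is not part of this commit, is:

```
# Start a one-off container for the gpt-neox service and check the GPUs
docker compose run --rm gpt-neox nvidia-smi
```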
