An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.


Deploying GPT-NeoX API

This doc has notes on:

  1. Getting things set up
  2. How this repo differs from the official NeoX one
  3. Original NeoX README.md

Setting up API on Vast AI

Note! I couldn't find a way to reach the public IP address of my Vast AI server.

Set the base image in Vast AI to exactly this: nvidia/cuda:11.1.1-devel-ubuntu20.04

Full setup, including installations and weight downloads, takes about two hours.

When you SSH into the Vast instance, prefix commands with sudo.

nvcc --version # cuda version

lsb_release -a # linux os version

To split the screen into panes with tmux:

ctrl + B, followed by "

To switch between panes:

ctrl + B, followed by o

To close a pane:

exit

Script to run in your new Vast AI instance

cd /home

wget -q https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
pip3 --version
rm -rf get-pip.py

sudo apt update
sudo apt install -y python3-pip python3-dev build-essential libssl-dev libffi-dev
sudo apt install -y python3-setuptools
sudo apt install -y virtualenv
sudo apt install -y git
sudo apt install -y wget
sudo apt install -y vim


git clone https://github.com/adam-jb/gpt-neox
cd /home/gpt-neox



virtualenv env_gpt_neox --python=python3
source env_gpt_neox/bin/activate

pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
pip install pip --upgrade
yes Y | sudo apt install libopenmpi-dev
pip install -r requirements/requirements.txt
python /home/gpt-neox/megatron/fused_kernels/setup.py install



# At one point this was needed for the merge and isn't in requirements.txt.
# Commented out as it seems fine without it, but kept just in case.
# pip install PyYaml tqdm

# Needed to run the model; some of these are not in requirements.txt
# (some are, but weren't installed by the step above).
# Commented out as it seems fine without them, but kept just in case.
# pip install shortuuid sentencepiece best-download


## Some of these are included in requirements.txt, so can be ignored.
# Commented out as it seems fine without them, but kept just in case.
#pip install wandb==0.10.28
#pip install transformers~=4.16.0
#pip install lm_eval==0.2.0


# Install EleutherAI's fork of DeepSpeed. It needs to be this specific version to run.
# Commented out as it seems fine without it, but kept just in case.
#pip install git+https://github.com/EleutherAI/DeeperSpeed.git@eb7f5cff36678625d23db8a8fe78b4a93e5d2c75#egg=deepspeed
#pip install mpi4py


# needed for API hosting
pip install flask flask-sse gunicorn



# The main model errors if a later protobuf version is used, so pin this one
pip install protobuf==3.20


# Downloading weights: this takes an hour or so
# YOU MAY WISH TO DOWNLOAD DIFFERENT WEIGHTS
wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P 20B_checkpoints



# change params: CHANGE THE NUMBER OF TOKENS THE MODEL WILL RETURN HERE
python update_megatron_config_export.py


# merge weights so model can run on 1 gpu
# CHANGE THE FOLDER FROM 20B_checkpoints to another input if you tuned your own weights
python tools/merge.py -d 20B_checkpoints -o checkpoints_merged -s 150000 -mp 1 -pp 1


# overwrite the config file, which specifies the correct location of the vocab file
cp config_merged.yaml checkpoints_merged/configs/config.yml



# Rerun this so the model can be called as described below
python /home/gpt-neox/megatron/fused_kernels/setup.py install


# All being well, all is now ready to run 

To run the API itself (see the $$$ section below for example queries):

# BEFORE doing this, open a second tmux pane with ctrl+b then '%'; switch between panes with ctrl+b then 'o'
# This lets you query the API from the other pane while Flask runs in this one
export FLASK_APP=flask_api_model
flask run --host=0.0.0.0

$$$ To query the API from another console on the same machine:

curl http://127.0.0.1:5000

curl http://127.0.0.1:5000/multi/anuj+was+having+a+heck+of+a+day

curl http://127.0.0.1:5000/multi/tildy+was+having+a+heck+of+a+day+.+Tell+us+what+she+did+next

curl http://127.0.0.1:5000/multi/adam+has+a+big+career+decision+and+he+will+make+the+best+choice+which+is

curl http://127.0.0.1:5000/multi/write+an+important+story+about+dead+skin+with+two+main+characters+beginning+in+london

curl http://127.0.0.1:5000/multi/write+a+rap+in+the+style+of+kanye+west

curl http://127.0.0.1:5000/multi/write+a+hiphop+rap+in+the+style+of+donald+trump
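
The same endpoints can also be hit from Python. Below is a minimal sketch using the requests library (an assumption: it is not installed by the setup script above, so run pip install requests first); the prompt is joined with '+' signs exactly as in the curl examples:

# query_api.py -- illustrative sketch, not part of the repo.
import requests

BASE_URL = "http://127.0.0.1:5000"

def query(prompt: str) -> str:
    # The /multi/ endpoint takes the prompt with words joined by '+'.
    path = "/multi/" + "+".join(prompt.split())
    response = requests.get(BASE_URL + path, timeout=600)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(requests.get(BASE_URL).text)   # health check; expect {"we await":"your json"}
    print(query("anuj was having a heck of a day"))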

Notice these are internal-facing queries only. Hosting on 0.0.0.0 should open the Flask API up to the world via the public IP address, but I had no luck with this. It might be that Vast applies some network restrictions of its own, which would be a real challenge.

If you get an error when trying to run this after interrupting a process, use this to list the current processes:

ps ax

And look for something starting with the description " home/gpt-neox/env_gpt_neox/bin/python /h...... "

Or anything else that looks like Flask, PyTorch, etc.

Check whether any other Python processes are still running; if so, end them with:

kill -9 [PROCESS ID]

To get your public IP address:

apt install net-tools

curl ifconfig.me

To test that you can reach the server (you should see this response: {"we await":"your json"}):

http://[YOUR IP]:5000/

Then query it:

http://[YOUR IP]:5000/multi/anuj+was+having+a+heck+of+a+day

To check whether the model can be held in memory, open a Python console with:

python3

Then copy and paste everything in chat_with_gpt.py into the console to have an interactive conversation with GPT.

To run the model with one input:

echo "Anuj was having a wonderful day. Tell us what he did in as much detail as possible." > prompt.txt

python ./deepy.py generate.py checkpoints_merged/configs/config.yml -i prompt.txt -o sample_outputs.txt

cat sample_outputs.txt

echo "Model works if line above is something that looks like text"

Notes on using the JSON config

get_deepspeed_main_args() in arguments.py brings all the arguments together.

It gets the DeepSpeed args from NeoXArgsDeepspeedRunner(), which is in /megatron/neox_arguments/deepspeed_args.py.

It also calls get_parent_class_value_dict(), which is likewise defined in arguments.py.

NON-EXHAUSTIVE list of changes I made to the EleutherAI NeoX repo:

Replaced the merge script: /home/gpt-neox/tools/merge.py with gcp_neox_support_files/merge.py

Replaced the generate script: generate.py (main dir) with generate.py from the support files. This gives the option of running interactive prompts when (I think) config.yml is changed.

Added to main dir: megatron_config_export.json

Moved gcp_neox_support_files/new_text_generation_utils.py to megatron/text_generation_utils.py

Added chat_with_gpt.py to the main dir

Added flask_model.py to the main dir

Replaced megatron/neox_arguments/arguments.py with new_arguments.py (the one made in Adam's Mac DreamPress folder in June 2022)


GPT-NeoX

This repository records EleutherAI's work-in-progress for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations.

We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training. Additionally, we hope to train and open source a 175B parameter GPT-3 replication along the way. Please note, however, that this is a research codebase that is primarily designed for performance over ease of use. We endeavour to make it as easy to use as is feasible, but if there's anything in the readme that is unclear or you think you've found a bug, please open an issue.

If you are interested in contributing, please join our Discord and head to the #gpt-neox channel. We're working with cloud compute provider CoreWeave for training, and hope to release the weights of smaller models as we progress up to 175B parameters.

For those looking for a TPU-centric codebase, we recommend Mesh Transformer JAX.


Pretrained Models

GPT-NeoX-20B

GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile. Technical details about GPT-NeoX-20B can be found in our whitepaper. The configuration file for this model is both available at ./configs/20B.yml and included in the download links below.

Download Links

Slim weights - (No optimizer states, for inference or finetuning, 39GB)

To download from the command line to a folder named 20B_checkpoints, use the following command:

wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P 20B_checkpoints

Full weights - (Including optimizer states, 268GB)

To download from the command line to a folder named 20B_checkpoints, use the following command:

wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/full_weights/ -P 20B_checkpoints

Weights can alternatively be downloaded using a BitTorrent client. Torrent files can be downloaded here: slim weights, full weights.

We additionally have 150 checkpoints saved throughout training, one every 1,000 steps. We are working on figuring out how to best serve these at scale, but in the meanwhile people interested in working with the partially trained checkpoints can email us at [email protected] to arrange access.

Quick Start

Environment and Dependencies

Host Setup

First make sure you are in an environment with Python 3.8 or later with an appropriate version of PyTorch 1.8 or later installed.

To install the remaining basic dependencies, run:

pip install -r requirements/requirements.txt
python ./megatron/fused_kernels/setup.py install # optional if not using fused kernels

from the repository root.

Warning: Our codebase relies on DeeperSpeed, our fork of the DeepSpeed library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.

Containerized Setup

We also provide a Dockerfile if you prefer to run NeoX in a container. To use this option, first build an image named gpt-neox from the repository root directory with docker build -t gpt-neox -f Dockerfile .. We also host pre-built images on Docker Hub at leogao2/gpt-neox.

You can then run a container based on this image. For instance, the below snippet mounts the cloned repository (gpt-neox) directory to /gpt-neox in the container and uses nvidia-docker to make four GPUs (numbers 0-3) accessible to the container. As noted by the NCCL documentation, both --shm-size=1g and --ulimit memlock=-1 are important to prevent Docker from allocating too little shared memory.

nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD,dst=/gpt-neox gpt-neox

Using a Pretrained Model

GPT-NeoX-20B (currently the only pretrained model we provide) is a very large model. The weights alone take up around 40GB in GPU memory and, due to the tensor parallelism scheme as well as the high memory usage, you will need at minimum 2 GPUs with a total of ~45GB of GPU VRAM to run inference, and significantly more for training. Unfortunately, it is not yet possible to run the model on a single consumer GPU.

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. For more details on the configuration file, see Configuration. The configuration file for GPT-NeoX-20B is at ./configs/20B.yml - but you may need to edit some fields to specify where your model and tokenizer are saved. In the config file edit the following fields:

  "vocab-file": "./20B_checkpoints/20B_tokenizer.json",
  "save": "./20B_checkpoints",
  "load": "./20B_checkpoints",

changing ./20B_checkpoints to the path to the root folder of the downloaded checkpoints. If the checkpoints exist at ./20B_checkpoints you can leave this as is.

Depending on the number of GPUs you're using, you may also need to change the parallelism settings. To run inference on the 20B model on 2 GPUs, change:

   "pipe-parallel-size": 4,

to:

   "pipe-parallel-size": 1,

If you're using 8 GPUs, you can leave this unchanged.
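
If you prefer to script these edits, below is a minimal sketch using PyYAML (an assumption: pip install pyyaml; note that PyYAML drops any comments in the config, so hand-editing is safer if you want to keep them):

# patch_20b_config.py -- illustrative sketch, not part of the repo.
import yaml

CHECKPOINT_DIR = "./20B_checkpoints"   # root folder of the downloaded checkpoints

with open("./configs/20B.yml") as f:
    config = yaml.safe_load(f)

config["vocab-file"] = f"{CHECKPOINT_DIR}/20B_tokenizer.json"
config["save"] = CHECKPOINT_DIR
config["load"] = CHECKPOINT_DIR
config["pipe-parallel-size"] = 1       # e.g. for 2-GPU inference, per the note above

with open("./configs/20B_local.yml", "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)

You would then pass ./configs/20B_local.yml to deepy.py in place of ./configs/20B.yml.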

All functionality (inference included) should be launched in parallel using deepy.py, a wrapper around the DeepSpeed launcher.

We currently offer three main functions:

  1. train.py is used for training and finetuning models.
  2. evaluate.py is used to evaluate a trained model using the language model evaluation harness.
  3. generate.py is used to sample text from a trained model.

and can be launched with:

./deepy.py [script.py] [./path/to/config_1.yml] [./path/to/config_2.yml] ... [./path/to/config_n.yml]

E.G To generate text unconditionally with the GPT-NeoX-20B model, you can use the following:

./deepy.py generate.py ./configs/20B.yml

Or optionally pass in a text file (e.g. prompt.txt) to use as the prompt. This should be a plain .txt file with each prompt separated by newline characters; also pass in the path to an output file.

./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt

To reproduce our evaluation numbers on, for example, lambada and PIQA use:

./deepy.py evaluate.py ./configs/20B.yml --eval_tasks lambada piqa

You can add an arbitrary list of evaluation tasks here, for details of all tasks available, see lm-evaluation-harness.

For more details on each entry point, see the Training and Finetuning, Inference and Evaluation sections.

Configuration

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yaml files in configs, including one for GPT-NeoX-20B, and example configuration files for other model sizes.

These files are generally complete, but non-optimal. For example, depending on your specific GPU configuration, you may need to change some settings such as pipe-parallel-size, model-parallel-size to increase or decrease the degree of parallelisation, train_micro_batch_size_per_gpu or gradient-accumulation-steps to modify batch size related settings, or the zero_optimization dict to modify how optimizer states are parallelised across workers.

For a more detailed guide to all the features available and how to configure them, see the configuration README, and for documentation of every possible argument, see configs/neox_arguments.md.

Datasets

Preconfigured Datasets

Several preconfigured datasets are available, including most components from the Pile, as well as the Pile train set itself, for straightforward tokenization using the prepare_data.py entry point.

E.G, to download and tokenize the Enron emails corpus with the GPT2 Tokenizer, saving them to ./data you can run:

python prepare_data.py -d ./data

or with the GPT-NeoX-20B tokenizer (assuming you have it saved at ./20B_checkpoints/20B_tokenizer.json):

python prepare_data.py -d ./data -t HFTokenizer --vocab-file ./20B_checkpoints/20B_tokenizer.json

The tokenized data will be saved out to two files: [data-dir]/[dataset-name]/[dataset-name]_text_document.bin & [data-dir]/[dataset-name]/[dataset-name]_text_document.idx. You will need to add the prefix that both these files share to your training configuration file under the data-path field. E.G:

  "data-path": "./data/enron/enron_text_document",

Using Custom Data

To prepare your own dataset for training with custom data, format it as one large jsonl-formatted file with each item in the list of dictionaries being a separate document. The document text should be grouped under one JSON key, i.e. "text". Any auxiliary data stored in other fields will not be used.
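
As a concrete illustration of that format, the snippet below writes a tiny jsonl file with one JSON object per line, each holding its document under the "text" key (the file name and documents are placeholders):

# make_dataset.py -- illustrative sketch of the expected jsonl layout.
import json

documents = [
    "First document, as a single string of text.",
    "Second document. Each document becomes one line of the jsonl file.",
]

with open("./data/mydataset.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps({"text": doc}) + "\n")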

Next make sure to download the GPT2 tokenizer vocab, and merge files from the following links:

Or use the 20B tokenizer (for which only a single Vocab file is needed):

(alternatively, you can provide any tokenizer file that can be loaded by Huggingface's tokenizers library with the Tokenizer.from_pretrained() command)

You can now pretokenize your data using tools/preprocess_data.py, the arguments for which are detailed below:

usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
                          [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS] [--log-interval LOG_INTERVAL]

optional arguments:
  -h, --help            show this help message and exit

input data:
  --input INPUT         Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma separated list
  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
                        Space-separated list of keys to extract from the jsonl. Default: text
  --num-docs NUM_DOCS   Optional: Number of documents in the input data (if known) for an accurate progress bar.

tokenizer:
  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
                        What type of tokenizer to use.
  --vocab-file VOCAB_FILE
                        Path to the vocab file
  --merge-file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append-eod          Append an <eod> token to the end of a document.
  --ftfy                Use ftfy to clean text

output data:
  --output-prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset-impl {lazy,cached,mmap}
                        Dataset implementation to use. Default: mmap

runtime:
  --workers WORKERS     Number of worker processes to launch
  --log-interval LOG_INTERVAL
                        Interval between progress updates

For example:

python tools/preprocess_data.py \
            --input ./data/mydataset.jsonl.zst \
            --output-prefix ./data/mydataset \
            --vocab ./data/gpt2-vocab.json \
            --merge-file gpt2-merges.txt \
            --dataset-impl mmap \
            --tokenizer-type GPT2BPETokenizer \
            --append-eod

You would then run training with the following settings added to your configuration file:

  "data-path": "data/mydataset/mydataset",

Training and Finetuning

Training is launched using deepy.py, a wrapper around DeepSpeed's launcher, which launches the same script in parallel across many GPUs / nodes.

The general usage pattern is:

python ./deepy.py train.py [path/to/config1.yml] [path/to/config2.yml] ...

You can pass in an arbitrary number of configs which will all be merged at runtime.

You can also optionally pass in a config prefix, which will assume all your configs are in the same folder and append that prefix to their path.

E.G:

python ./deepy.py train.py -d configs small.yml local_setup.yml

This will deploy the train.py script on all nodes with one process per GPU. The worker nodes and number of GPUs are specified in the /job/hostfile file (see parameter documentation), or can simply be passed in as the num_gpus arg if running on a single node setup.

Although this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g configs/small.yml) and the data path parameters in another (e.g configs/local_setup.yml).
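
For illustration, the split might look roughly like this (the field names appear elsewhere in this README; the values are placeholders):

# configs/small.yml -- model parameters
   "pipe-parallel-size": 1,
   "model-parallel-size": 1,

# configs/local_setup.yml -- data and checkpoint paths
   "data-path": "./data/enron/enron_text_document",
   "vocab-file": "./data/gpt2-vocab.json",
   "save": "./checkpoints",
   "load": "./checkpoints",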

Inference

We support three types of generation from a pretrained model:

  1. Unconditional generation
  2. Conditional generation based on an input read from a file
  3. Interactive generation, which allows for multiple rounds of back-and-forth between a user and the language model via a command line interface

All three types of text generation can be launched via python ./deepy.py generate.py -d configs small.yml local_setup.yml text_generation.yml with the appropriate values set in configs/text_generation.yml.
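
The snippet below is only an illustration of the kind of fields involved (generation type, sampling parameters); treat the names as assumptions and check configs/text_generation.yml itself for the authoritative list:

  "text-gen-type": "interactive",
  "maximum_tokens": 256,
  "temperature": 0.9,
  "top_p": 0.0,
  "top_k": 0,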

Evaluation

GPT-NeoX supports evaluation on downstream tasks through the language model evaluation harness.

To evaluate a trained model on the evaluation harness, simply run:

python ./deepy.py evaluate.py -d configs your_configs.yml --eval_tasks task1 task2 ... taskn

where --eval_tasks is a list of evaluation tasks followed by spaces, e.g --eval_tasks lambada hellaswag piqa sciq. For details of all tasks available, refer to the lm-evaluation-harness repo.

Monitoring

In addition to storing logs locally, we provide built-in support for two popular experiment monitoring frameworks: Weights & Biases and TensorBoard.

Weights & Biases

EleutherAI is currently using Weights & Biases to record our experiments. If you are logged into Weights & Biases on your machine—you can do this by executing wandb login—your runs will automatically be recorded. There are two optional fields associated with Weights & Biases: wandb_group allows you to name the run group and wandb_team allows you to assign your runs to an organization or team account.
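
For example, you might add the following to a configuration file (the values are placeholders):

  "wandb_group": "neox-20b-api",
  "wandb_team": "my-team",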

TensorBoard

We also support using TensorBoard via the tensorboard-dir field. Dependencies required for TensorBoard monitoring can be found in and installed from ./requirements/requirements-tensorboard.txt.

Administrative Notes

Citing GPT-NeoX

If you have found GPT-NeoX helpful in your work, you can cite this repository as

@software{gpt-neox,
  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Wang, Phil and Weinbach, Samuel},
  title = {{GPT-NeoX}: Large Scale Autoregressive Language Modeling in PyTorch},
  url = {https://github.com/eleutherai/gpt-neox},
  year = {2021}
}

To cite our 20 billion parameter model, please use

@article{gpt-neox-20b,
  title={{GPT-NeoX-20B}: An Open-Source Autoregressive Language Model},
  author={Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel},
  year={2022}
}

Licensing

This repository hosts code that is part of EleutherAI's GPT-NeoX project. Copyright © 2021, EleutherAI contributors (in alphabetical order): Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Phil Wang, Samuel Weinbach. Licensed under the Apache License:

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

This repository is based off code written by NVIDIA that is licensed under the Apache License, Version 2.0. In accordance with the Apache License, all files that are modifications of code originally written by NVIDIA maintain a NVIDIA copyright header. All files that do not contain such a header are original to EleutherAI contributors. When the NVIDIA code has been modified from its original version, that fact is noted in the copyright header. All derivative works of this repository must preserve these headers under the terms of the Apache License.

For full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing please email us at [email protected].

Acknowledgements

We run our experiments on a Kubernetes cluster generously provided by CoreWeave.
