GPT-NeoX

This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and accelerate research into large-scale training.

For those looking for a TPU-centric codebase, we recommend Mesh Transformer JAX.

If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend the Hugging Face transformers library instead, which supports GPT-NeoX models.

Quick Start
- Environment and Dependencies
- Usage
Configuration
Datasets
- Preconfigured Datasets
- Using Custom Data
Training and Finetuning
- Select Pretrained Models
Inference
Evaluation
Exporting to Hugging Face
Monitoring
- Weights & Biases
- TensorBoard
Administrative Notes

Quick Start

Environment and Dependencies

Host Setup

First, make sure you are in an environment with Python 3.8 and PyTorch 1.8 or later installed. Note: Some of the libraries that GPT-NeoX depends on have not been updated to be compatible with Python 3.10+. Python 3.9 appears to work, but this codebase has been developed and tested on Python 3.8.
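The version constraints above can be checked before installing anything else. The following is a minimal sketch and not part of the repository: the helper names parse_major_minor and environment_problems are hypothetical, and flagging Python older than 3.8 is an assumption, since the text only states that 3.8 is the tested version.

```python
def parse_major_minor(version_text):
    """Return (major, minor) from a version string such as '1.13.1+cu117'."""
    # Drop a local build suffix like "+cu117", then keep the first two fields.
    core = version_text.split("+")[0]
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def environment_problems(python_version, torch_version_text):
    """List human-readable problems; an empty list means the checks passed."""
    problems = []
    py = tuple(python_version[:2])
    if py >= (3, 10):
        problems.append("Python 3.10+ is not supported by some dependencies")
    elif py < (3, 8):
        # Assumption: anything older than the tested version is flagged.
        problems.append("Python 3.8 is the tested version")
    if parse_major_minor(torch_version_text) < (1, 8):
        problems.append("PyTorch 1.8 or later is required")
    return problems
```

In a real environment you would call environment_problems(sys.version_info, torch.__version__) and stop if the returned list is not empty.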

To install the remaining basic dependencies, run:

pip install -r requirements/requirements.txt
python ./megatron/fused_kernels/setup.py install # optional if not using fused kernels

from the repository root.

Warning: Our codebase relies on DeeperSpeed, our fork of the DeepSpeed library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.

Flash Attention

To use Flash-Attention, install the additional dependencies in ./requirements/requirements-flashattention.txt and set the attention type in your configuration accordingly (see configs). This can provide significant speed-ups over regular attention on certain GPU architectures, including Ampere GPUs (such as A100s); see the repository for more details.

Containerized Setup

We also provide a Dockerfile if you prefer to run NeoX in a container. To use this option, first build an image named gpt-neox from the repository root directory with:

docker build -t gpt-neox -f Dockerfile .

We also host pre-built images on Docker Hub at leogao2/gpt-neox.

You can then run a container based on this image. For instance, the snippet below mounts the cloned repository directory (gpt-neox) to /gpt-neox in the container and uses nvidia-docker to make four GPUs (numbers 0-3) accessible to the container. As noted by the NCCL documentation, both --shm-size=1g and --ulimit memlock=-1 are important to prevent Docker from allocating too little shared memory.

nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD,dst=/gpt-neox gpt-neox

Usage

All functionality (inference included) should be launched using deepy.py, a wrapper around the deepspeed launcher.

We currently offer three main functions:

- train.py is used for training and finetuning models.
- evaluate.py is used to evaluate a trained model using the language model evaluation harness.
- generate.py is used to sample text from a trained model.

which can be launched with:

./deepy.py [script.py] [./path/to/config_1.yml] [./path/to/config_2.yml] ... [./path/to/config_n.yml]

For example, to generate text unconditionally with the GPT-NeoX-20B model, you can use the following:

./deepy.py generate.py ./configs/20B.yml

Or optionally pass in a text file (e.g. prompt.txt) to use as the prompt. This should be a plain .txt file with each prompt separated by newline characters. Also pass in the path to an output file:

./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt
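Because prompts are separated by newline characters, a prompt that itself contains a newline would be read as two prompts. The following is a minimal sketch of preparing such a file and assembling the command shown above. The helpers write_prompt_file and generate_command are hypothetical and not part of the repository; only the -i and -o flags and the file names come from the example.

```python
from pathlib import Path


def write_prompt_file(prompts, path):
    """Write one prompt per line, rejecting prompts that span several lines."""
    for prompt in prompts:
        if "\n" in prompt:
            raise ValueError(f"prompt contains a newline: {prompt!r}")
    Path(path).write_text("\n".join(prompts), encoding="utf-8")


def generate_command(config, prompt_file, output_file):
    """Build the argument list for the generate.py invocation shown above."""
    return [
        "./deepy.py", "generate.py", config,
        "-i", str(prompt_file),
        "-o", str(output_file),
    ]
```

The returned list can be passed to subprocess.run from the repository root, which avoids shell quoting issues with unusual file names.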

To reproduce our evaluation numbers on, for example, TriviaQA and PIQA use:

./deepy.py evaluate.py ./configs/20B.yml --eval_tasks triviaqa piqa

You can add an arbitrary list of evaluation tasks here. For details of all available tasks, see lm-evaluation-harness.

For more details on each entry point, see the Training and Finetuning, Inference, and Evaluation sections.

Configuration

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yaml files in configs, including one for GPT-NeoX-20B, and example configuration files for other model sizes.

These files are generally complete, but non-optimal. For example, depending on your specific GPU configuration, you may need to change pipe-parallel-size or model-parallel-size to increase or decrease the degree of parallelisation, train_micro_batch_size_per_gpu or gradient-accumulation-steps to modify batch-size-related settings, or the zero_optimization dict to modify how optimizer states are parallelised across workers.
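To see how these settings interact, it can help to work out the effective global batch size by hand. The sketch below assumes the usual DeepSpeed relation, in which the global batch size is the micro batch size per GPU times the gradient accumulation steps times the number of data-parallel replicas, and that the GPUs left over after pipeline and model parallelism form those replicas. The function names are hypothetical, and both parallel sizes are assumed to be at least 1.

```python
def data_parallel_size(num_gpus, pipe_parallel_size, model_parallel_size):
    """Number of data-parallel replicas left after pipeline and model parallelism."""
    gpus_per_replica = pipe_parallel_size * model_parallel_size
    if gpus_per_replica <= 0 or num_gpus % gpus_per_replica != 0:
        raise ValueError(
            "num_gpus must be a positive multiple of "
            "pipe_parallel_size * model_parallel_size"
        )
    return num_gpus // gpus_per_replica


def global_batch_size(num_gpus, pipe_parallel_size, model_parallel_size,
                      micro_batch_size_per_gpu, gradient_accumulation_steps):
    """Samples consumed per optimizer step across the whole job."""
    replicas = data_parallel_size(num_gpus, pipe_parallel_size, model_parallel_size)
    return micro_batch_size_per_gpu * gradient_accumulation_steps * replicas
```

For example, 8 GPUs with pipe-parallel-size 2 and model-parallel-size 2 leave 2 data-parallel replicas, so a micro batch size of 4 with 8 accumulation steps gives a global batch size of 4 * 8 * 2 = 64.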

For a more detailed guide to all the features available and how to configure them, see the configuration README, and for documentation of every possible argument, see configs/neox_arguments.md.

Datasets

Preconfigured Datasets

Several preconfigured datasets are available, including most components from the Pile, as well as the Pile train set itself, for straightforward tokenization using the prepare_data.py entry point.

For example, to download and tokenize the Enron emails corpus with the GPT2 Tokenizer, saving the result to ./data, you can run:

python prepare_data.py -d ./data

or with the GPT-NeoX-20B tokenizer (assuming you have it saved at ./20B_checkpoints/20B_tokenizer.json):

python prepare_data.py -d ./data -t HFTokenizer --vocab-file ./20B_checkpoints/20B_tokenizer.json

The tokenized data will be saved out to two files: [data-dir]/[dataset-name]/[dataset-name]_text_document.bin and [data-dir]/[dataset-name]/[dataset-name]_text_document.idx. You will need to add the prefix that both these files share to your training configuration file under the data-path field. For example:

  "data-path": "./data/enron/enron_text_document",

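Since data-path is a shared prefix rather than a real file name, a typo in it is easy to miss until training starts. The following is a minimal sketch of a check that both files exist for a given prefix. The helper missing_dataset_files is hypothetical and not part of the repository; only the .bin and .idx suffixes come from the description above.

```python
from pathlib import Path


def missing_dataset_files(data_path):
    """Return the .bin/.idx files that do not exist for the given prefix."""
    # Append the suffix to the string: the prefix may itself contain dots,
    # so Path.with_suffix would not be safe here.
    expected = [Path(f"{data_path}{suffix}") for suffix in (".bin", ".idx")]
    return [path for path in expected if not path.is_file()]
```

Calling missing_dataset_files("./data/enron/enron_text_document") returns an empty list when both files are present.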
Using Custom Data

To prepare your own dataset for training with custom data, format it as one large jsonl-formatted file, with each line holding one JSON object that represents a separate document. The document text should be grouped under one JSON key, i.e. "text". Any auxiliary data stored in other fields will not be used.
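The format described above can be produced with the standard library alone. This is a minimal sketch: the helper write_jsonl is hypothetical, and it assumes the documents are already available as Python strings.

```python
import json


def write_jsonl(documents, path):
    """Write each document as one JSON object per line under the "text" key."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for text in documents:
            # json.dumps escapes embedded newlines, so each document
            # stays on a single line of the output file.
            handle.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count
```

For instance, write_jsonl(["first document", "second\ndocument"], "mydataset.jsonl") writes two lines, one per document.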

Next, make sure to download the GPT2 tokenizer vocab and merge files from the following links:

Or use the 20B tokenizer (for which only a single Vocab file is needed):

Vocab: https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json

(alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the Tokenizer.from_pretrained() command)

You can now pretokenize your data using tools/preprocess_data.py, the arguments for which are detailed below:

usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
                          [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS] [--log-interval LOG_INTERVAL]

optional arguments:
  -h, --help            show this help message and exit

input data:
  --input INPUT         Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma separated list
  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
                        space separate listed of keys to extract from jsonl. Defa
  --num-docs NUM_DOCS   Optional: Number of documents in the input data (if known) for an accurate progress bar.

tokenizer:
  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
                        What type of tokenizer to use.
  --vocab-file VOCAB_FILE
                        Path to the vocab file
  --merge-file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append-eod          Append an <eod> token to the end of a document.
  --ftfy                Use ftfy to clean text

output data:
  --output-prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset-impl {lazy,cached,mmap}
                        Dataset implementation to use. Default: mmap

runtime:
  --workers WORKERS     Number of worker processes to launch
  --log-interval LOG_INTERVAL
                        Interval between progress updates

For example:

python tools/preprocess_data.py \
            --input ./data/mydataset.jsonl.zst \
            --output-prefix ./data/mydataset \
            --vocab-file ./data/gpt2-vocab.json \
            --merge-file gpt2-merges.txt \
            --dataset-impl mmap \
            --tokenizer-type GPT2BPETokenizer \
            --append-eod

You would then run training with the following settings added to your configuration file:

  "data-path": "data/mydataset_text_document",

Training and Finetuning

Training is launched using deepy.py, a wrapper around DeepSpeed's launcher, which launches the same script in parallel across many GPUs / nodes.

The general usage pattern is:

python ./deepy.py train.py [path/to/config1.yml] [path/to/config2.yml] ...

You can pass in an arbitrary number of configs which will all be merged at runtime.

You can also optionally pass in a config prefix, which assumes all your configs are in the same folder and prepends that prefix to their paths.

For example:

python ./deepy.py train.py -d configs small.yml local_setup.yml

This will deploy the train.py script on all nodes with one process per GPU. The worker nodes and number of GPUs are specified in the /job/hostfile file (see parameter documentation), or can simply be passed in as the num_gpus arg if running on a single-node setup.
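To estimate how many processes a launch will start, you can total the GPUs listed in the hostfile. The sketch below assumes the hostfile follows the DeepSpeed convention of one "hostname slots=N" entry per line; the helper parse_hostfile is hypothetical, and skipping blank lines and "#" comments is an assumption rather than documented behavior.

```python
def parse_hostfile(text):
    """Map each hostname to its GPU count from 'hostname slots=N' lines."""
    hosts = {}
    for raw_line in text.splitlines():
        # Assumption: anything after "#" is a comment and blank lines are skipped.
        line = raw_line.split("#")[0].strip()
        if not line:
            continue
        hostname, slots_field = line.split()
        key, _, value = slots_field.partition("=")
        if key != "slots":
            raise ValueError(f"expected 'slots=N', got {slots_field!r}")
        hosts[hostname] = int(value)
    return hosts
```

With one process per GPU, sum(parse_hostfile(text).values()) gives the total number of training processes.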

Although this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g. configs/small.yml) and the data path parameters in another (e.g. configs/local_setup.yml).
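The config prefix and the runtime merge described in this section can be pictured with plain dictionaries. This is a sketch under stated assumptions, not the launcher's implementation: the helpers resolve_config_paths and merge_configs are hypothetical, and treating a key that appears in two files as an error is an assumption, so check the configuration README for the actual behavior.

```python
import os


def resolve_config_paths(config_dir, names):
    """Prepend the config prefix (the -d argument) to each config file name."""
    return [os.path.join(config_dir, name) for name in names]


def merge_configs(configs):
    """Combine already-parsed config dicts into a single dict."""
    merged = {}
    for config in configs:
        # Assumption: a key defined in two files is reported, not overridden.
        duplicates = merged.keys() & config.keys()
        if duplicates:
            raise ValueError(f"duplicate keys: {sorted(duplicates)}")
        merged.update(config)
    return merged
```

Keeping model parameters and data paths in separate files, as suggested above, means the two dictionaries never share keys, so the merge is unambiguous.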
