GPT-NeoX

This repository records EleutherAI's work in progress on training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations.

We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models and to accelerate research into large-scale training. Additionally, we hope to train and open-source a 175B-parameter GPT-3 replication along the way.

For more info on our progress, please join our Discord and head to the #gpt-neo channel. We're working with the cloud compute provider CoreWeave for training, and we hope to release the weights of smaller models as we progress up to 175B parameters.

If you're looking for our TPU codebase, see GPT-Neo.

GPT-NeoX is under active development.

Features:

3D Parallelism

  • GPT-NeoX offers full 3D parallelism (data, model, and pipeline parallelism) via DeepSpeed, allowing you to scale model training to hundreds of billions of parameters across multiple GPUs, as sketched below.
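
    A minimal sketch of what the relevant settings might look like in a config file. The key names here ("model-parallel-size", "pipe-parallel-size") are assumptions to check against the configs folder; the data-parallel degree is then inferred from the total number of GPUs:

      # hypothetical excerpt from a .yaml config file; key names are assumptions
      "model-parallel-size": 2,  # tensor (model) parallel degree
      "pipe-parallel-size": 4,   # pipeline parallel degree; data parallelism fills the remaining GPUs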

Model Structure

  • Positional Encodings:

    • Choose between T5-style relative positional encodings (RPE), a learned positional embedding added to the input (GPT-2-style), sinusoidal positional encodings, and no positional encodings at all (which recent research has found can even outperform the other options in autoregressive models).
  • Sparsity:

    • DeepSpeed's sparse attention kernels are supported, but they do not work with CUDA 11.0+ and require a specific hardware setup (V100s/RTX 2080s). Add "sparsity": "all" to your config to use sparse attention on all layers, or "sparsity": "interspersed" to use it on every other layer.
  • Norms:

    • A recent Google paper has shown that layernorm may not be the best option for transformer models. We offer a choice of layernorm, scalenorm, and RMSNorm, easily configured by changing a single line in your config file (see the example snippet after this list).
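
    For illustration, each of the choices above maps to a single config field. Only the "sparsity" key is taken from the description above; the "pos-emb" and "norm" key names are assumptions to check against the configuration readme:

      # hypothetical excerpt from a .yaml config file
      "pos-emb": "none",           # assumed key; e.g. "learned", "sinusoidal", "rpe", or "none"
      "norm": "rmsnorm",           # assumed key; e.g. "layernorm", "scalenorm", or "rmsnorm"
      "sparsity": "interspersed",  # sparse attention on every other layer ("all" = every layer)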

Optimizers

  • NeoX supports the Adam, CPUAdam, 1-bit Adam, and SM3 optimizers, as well as DeepSpeed's Zero Redundancy Optimizer (ZeRO).

  • Zero Redundancy Optimizer (ZeRO):

    • ZeRO stage 1 works seamlessly with NeoX, while ZeRO stage 2 requires pipeline parallelism to be set to 0. We are additionally working on integrating ZeRO 3 into the codebase. Turning on ZeRO is as simple as adding one field to your configuration file (see the sketch below).
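
    A minimal sketch of what that field might look like, using DeepSpeed's standard "zero_optimization" block alongside an optimizer block. Treat the exact values, and whether these settings live directly in your NeoX YAML file or in a separate DeepSpeed config, as assumptions to check against the configuration readme:

      # hypothetical excerpt; DeepSpeed-style optimizer and ZeRO settings
      "optimizer": {
        "type": "Adam",
        "params": { "lr": 0.0006, "betas": [0.9, 0.95], "eps": 1.0e-8 }
      },
      "zero_optimization": {
        "stage": 1  # stage 2 additionally requires pipeline parallelism to be set to 0
      },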

Straightforward configuration

  • Other libraries such as Megatron-LM require you to configure them using command-line arguments, which can be difficult to work with and iterate on. We instead offer straightforward configuration via .yaml files, which lets you launch training runs across hundreds of GPUs with a single-line bash script.
  • Additionally, we hope to make data preparation easier on the user by providing scripts to automatically download and pretokenize a number of large-scale datasets.

Getting Started

Our codebase relies on DeeperSpeed, our fork of the DeepSpeed library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before installing from requirements.txt. Failure to do so may cause other repositories that rely on DeepSpeed to break.

First, make sure you are in an environment with torch>=1.7.1 installed. Then run pip install -r requirements.txt. You may need to change the version of cupy-cudaxxx in requirements.txt to match your machine's CUDA version.
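
For example, a typical setup with conda might look like the following. The environment name and Python version are placeholders, and the right cupy-cudaXXX package depends on the CUDA version reported by nvidia-smi (e.g. cupy-cuda102 for CUDA 10.2, cupy-cuda111 for CUDA 11.1):

    # create and activate an isolated environment (the name is arbitrary)
    conda create -n gpt-neox python=3.8 -y
    conda activate gpt-neox

    # install a torch build matching your CUDA version, then the remaining requirements
    pip install "torch>=1.7.1"
    pip install -r requirements.txt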

Finally, certain features rely on apex, which you can install with the command below:

pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@e2083df5eb96643c61613b9df48dd4eea6b07690

We also host a Docker image on Docker Hub at leogao2/gpt-neox, which enables easy multi-node training.
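
As a rough example of using the image (the tag and mount point below are assumptions; check Docker Hub for the available tags):

    # pull the image (defaults to the :latest tag)
    docker pull leogao2/gpt-neox

    # start an interactive container with GPU access, mounting the current checkout
    docker run --gpus all -it -v $(pwd):/gpt-neox leogao2/gpt-neox bash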

Configuration and parameters

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher; for examples, see the configs folder.
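
As a sketch of what a launch looks like (the entry-point script name and config file name here are placeholders; check the configs folder and the configuration readme for the exact invocation in your checkout):

    # hypothetical launch: deepy.py wraps the DeepSpeed launcher and forwards the YAML config
    ./deepy.py train.py configs/my_model.yml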

For a full list of parameters and documentation see the configuration readme.

Datasets

Once you've installed all the requirements and set up your model configuration, the next step is obtaining and preprocessing your dataset.

For demonstration purposes, we've hosted the Enron Emails corpus and made it available for download. Running python prepare_data.py will download the tokenizer files and dataset, pretokenize the dataset, and save everything into a folder named ./data.

In the future we will also be adding a single command to preprocess our 800GB language modelling dataset, The Pile, and all its constituent datasets.

To prepare your own dataset for training, format it as one large jsonl file, with each line containing a JSON dictionary that represents a separate document. The document text should be stored under a single JSON key, e.g. "text".
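
For instance, a two-document mydataset.jsonl (a hypothetical filename) would contain one JSON object per line:

    {"text": "Full text of the first document..."}
    {"text": "Full text of the second document..."}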

Next, make sure to download the GPT-2 tokenizer vocab and merge files from the following links:

We plan to integrate HuggingFace's Tokenizers library soon to make this process smoother.

You can now pretokenize your data using tools/preprocess_data.py.

Usage:

preprocess_data.py [-h] --input INPUT [--json-keys JSON_KEYS [JSON_KEYS ...]]
                   [--split-sentences] [--keep-newlines]
                   --tokenizer-type {BertWordPieceLowerCase,BertWordPieceCase,GPT2BPETokenizer}
                   [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod]
                   --output-prefix OUTPUT_PREFIX [--dataset-impl {lazy,cached,mmap}]
                   [--workers WORKERS] [--log-interval LOG_INTERVAL]

input data:
  --input INPUT         Path to input JSON
  --json-keys JSON_KEYS [JSON_KEYS ...]
                        Space-separated list of keys to extract from the JSON. Default: "text".
  --split-sentences     Split documents into sentences.
  --keep-newlines       Keep newlines between sentences when splitting.

tokenizer:
  --tokenizer-type {GPT2BPETokenizer}
                        What type of tokenizer to use.
  --vocab-file VOCAB_FILE
                        Path to the vocab file
  --merge-file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append-eod          Append an <eod> token to the end of a document.

output data:
  --output-prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset-impl {lazy,cached,mmap}

runtime:
  --workers WORKERS     Number of worker processes to launch
  --log-interval LOG_INTERVAL
                        Interval between progress updates

For example:

python tools/preprocess_data.py \
            --input data/mydataset.jsonl \
            --output-prefix data/mydataset \
            --vocab-file data/gpt2-vocab.json \
            --dataset-impl mmap \
            --tokenizer-type GPT2BPETokenizer \
            --merge-file data/gpt2-merges.txt \
            --append-eod

You would then run training with the following settings added to your configuration file:

  "data-path": "data/mydataset/mydataset",

Training