GPT-NeoX

This repository records EleutherAI's work-in-progress for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations.

If you are looking for our TPU codebase, see GPT-Neo.

GPT-NeoX is under active development and rough around the edges. It is a complicated beast that will take time and patience to get working in any specific environment.

Getting Started

Our codebase relies on DeeperSpeed, a custom modification to the DeepSpeed library. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before installing from requirements.txt. Failure to do so may cause other repositories that rely on DeepSpeed to break.
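
For example, a typical setup with Anaconda might look something like the following (the environment name and Python version are arbitrary choices for illustration, not project requirements):

conda create -n gpt-neox python=3.8
conda activate gpt-neox
pip install -r requirements.txt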

Datasets

Once you've installed requirements.txt, the next step is obtaining and processing data. For demonstrative purposes we have hosted the Enron Emails corpus and made it available for download. Running python prepare_data.py will download and process the dataset for language modeling. To use your own data, extend the DataDownloader class in tools/corpa.py and register the new class in the DATA_DOWNLOADERS dict. Once this is done, you can add prepare_dataset(dataset_name) to process_data.py to load your data.
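
As a rough sketch of what that might look like (the attribute names and URL below are placeholders for illustration, not the actual DataDownloader interface - check tools/corpa.py for the real one):

# In tools/corpa.py -- hypothetical example; the real DataDownloader interface may differ.
class MyCorpus(DataDownloader):
    name = "my_corpus"                                  # placeholder attribute
    urls = ["https://example.com/my_corpus.jsonl.zst"]  # placeholder download URL

# Register the new class so prepare_dataset("my_corpus") can find it.
DATA_DOWNLOADERS["my_corpus"] = MyCorpus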

TO DO: Make a table showing the datasets currently available for download. List the name, size on disk (compressed), actual size, and number of tokens.

Training

GPT-NeoX is launched using the deepy.py script, which is in the root folder of this repo. You also need to ensure that the repo root directory is added to the Python path so that the megatron folder is importable.

Example usage:

./deepy.py pretrain_gpt2.py -d configs pretrain_gpt2.yml local_setup.yml

This will:

  • Deploy the pretrain_gpt2.py script on all nodes, with one process per GPU. The worker nodes and number of GPUs are specified in the /job/hostfile file (see parameter documentation). The worker processes are deployed by default using pdsh.
  • Load the model parameters from the config file configs/pretrain_gpt2.yml (the -d configs argument sets configs/ as the configuration directory).
  • Load the data path parameters from the config file configs/local_setup.yml. If you are an EleutherAI member and using the Kubernetes cluster, the eleutherai_cluster.yml config should be used instead.

Further examples are contained in the examples folder.

Configuration and parameters

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher - for examples see the configs folder and the examples folder. For a full list of parameters and documentation, see the corresponding readme.

Features

Model Structure

Positional Encodings:

Sparsity: Sparse attention kernels are supported, but they require model parallelism to be turned off. This is subject to change with future updates to DeepSpeed.

Optimizers

Zero Redundancy Optimizer (ZeRO): ZeRO stage 1 works seamlessly with NeoX, while ZeRO stage 2 does not, as it requires disabling pipeline parallelism because the two features conflict over gradient checkpointing.
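
For reference, ZeRO is controlled through DeepSpeed's zero_optimization settings, which in NeoX would be set via the YAML configs passed to deepy.py rather than in code. A minimal illustrative fragment, expressed here as a Python dict:

# Illustrative DeepSpeed-style settings only; in GPT-NeoX these live in the YAML configs.
ds_zero_settings = {
    "zero_optimization": {
        "stage": 1,  # stage 1 (optimizer state partitioning) is the configuration that works with NeoX
    },
}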

ZeRO-Offloading: ZeRO-Offloading requires ZeRO stage 2 and is hence not supported.

1-Bit Adam:

Memory Optimizations

Data Parallel: Data parallelism is a ubiquitous technique in deep learning in which each input batch of training data is split among the data parallel workers. It is integrated into NeoX.

Model Parallel: Model parallelism is a broad class of techniques that partitions the individual layers of the model across workers. Model parallelism is built into NeoX, as it is a part of Megatron-LM.

Pipeline Parallel: Pipeline parallelism divides the layers of the model into stages that can be processed in parallel. It is integrated into DeepSpeed itself.

Mixed Precision Training: Mixed precision training computes some operations in FP16 and others in FP32, for example computing the forward pass and the gradients in FP16 while updating the weights in FP32. Mixed precision training is integrated into DeepSpeed as well.
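
As a loose illustration of the idea only (this is not DeepSpeed's or NeoX's actual implementation, and it assumes a CUDA-capable GPU is available):

import torch

# Toy sketch: fp16 forward/backward with an fp32 master copy of the weights.
master = torch.nn.Parameter(torch.randn(512, 512))     # fp32 master weights (kept on CPU here)
opt = torch.optim.SGD([master], lr=1e-3)
loss_scale = 1024.0                                    # static loss scaling to avoid fp16 underflow

x = torch.randn(8, 512, dtype=torch.float16, device="cuda")
w16 = master.detach().half().cuda().requires_grad_()   # fp16 working copy used in the forward pass
loss = (x @ w16).float().sum() * loss_scale            # accumulate the loss in fp32 and scale it
loss.backward()
master.grad = (w16.grad.float() / loss_scale).cpu()    # unscale and copy gradients back to fp32
opt.step()                                             # the weight update itself happens in fp32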

Monitoring

EleutherAI is currently using Weights & Biases to record experiments. If you are logged into Weights & Biases on your machine (you can do this by executing wandb login), your runs will automatically be recorded. Additionally, set the config parameter wandb_team if you would like the run to be added to an organisation/team account.

Eleuther Cluster

We run our experiments on a Kubernetes cluster generously provided by CoreWeave. The /kubernetes/ directory contains code designed to facilitate work on our server. If you are an EleutherAI member, see the corresponding README for information about how to use our cluster.

Licensing

Copyright (c) 2021 Stella Biderman, Sid Black, Josh Levy-Kramer, Michael Pieler, and Shivanshu Purohit.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This repository is based off code written by NVIDIA, HuggingFace, and Google Research that is licensed under the Apache License, Version 2.0. In accordance with the Apache License, all files that are modifications of code originally written by another party maintain their copyright header. When a file has been modified from its original version, that fact is noted in the copyright header. All derivative works of this repository must preserve these headers under the terms of the Apache License. A small amount of code in this repository is based on code written by Facebook and licensed under the MIT License. This too is marked in the headers.

For full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing please email us at [email protected].