Commit 62e9440: Merge branch 'main' into opt-embeddings-fix

Quentin-Anthony committed Jan 15, 2023 · 2 parents e1e3842 + 375de3f
Showing 27 changed files with 97 additions and 61 deletions.
22 changes: 11 additions & 11 deletions README.md
@@ -7,7 +7,7 @@ This repository records [EleutherAI](https://www.eleuther.ai)'s library for trai

For those looking for a TPU-centric codebase, we recommend [Mesh Transformer JAX](https://github.com/kingoflolz/mesh-transformer-jax).

- **If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the HuggingFace `transformers` library instead which supports GPT-NeoX models.**
+ **If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the Hugging Face `transformers` library instead which supports GPT-NeoX models.**

# Contents

@@ -24,7 +24,7 @@ For those looking for a TPU-centric codebase, we recommend [Mesh Transformer JAX
- [Training and Finetuning](#training-and-finetuning)
- [Inference](#inference)
- [Evaluation](#evaluation)
- - [Exporting to HuggingFace](#exporting-to-huggingface)
+ - [Exporting to Hugging Face](#exporting-to-hugging-face)
- [Monitoring](#monitoring)
* [Weights & Biases](#wandb)
* [TensorBoard](#tensorboard)
@@ -146,10 +146,10 @@ To reproduce our evaluation numbers on, for example, TriviaQA and PIQA use:

You can add an arbitrary list of evaluation tasks here; for details of all tasks available, see [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

For more details on each entry point, see the [Training and Finetuning](#training-and-finetuning), [Inference](#inference) and [Evaluation](#evaluation)
# Configuration

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yaml files in [configs](./configs/), including one for GPT-NeoX-20B, and example configuration files for other model sizes.

These files are generally complete, but non-optimal. For example, depending on your specific GPU configuration, you may need to change some settings such as `pipe-parallel-size`, `model-parallel-size` to increase or decrease the degree of parallelisation, `train_micro_batch_size_per_gpu` or `gradient-accumulation-steps` to modify batch size related settings, or the `zero_optimization` dict to modify how optimizer states are parallelised across workers.
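A quick sketch of how the batch-size settings above interact, assuming DeepSpeed's usual semantics (the names mirror the YAML keys; the values are made up for illustration):

```python
# Global batch size under DeepSpeed-style semantics: micro batch per GPU
# times gradient accumulation steps times the number of data-parallel replicas.
num_gpus = 16
pipe_parallel_size = 2
model_parallel_size = 2
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 8

# GPUs not consumed by pipeline/model parallelism act as data-parallel replicas.
data_parallel_size = num_gpus // (pipe_parallel_size * model_parallel_size)  # 4
global_batch_size = (
    train_micro_batch_size_per_gpu
    * gradient_accumulation_steps
    * data_parallel_size
)
print(global_batch_size)  # 128
```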

@@ -192,7 +192,7 @@ Or use the 20B tokenizer (for which only a single Vocab file is needed):

- Vocab: https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json

- (alternatively, you can provide any tokenizer file that can be loaded by Huggingface's tokenizers library with the `Tokenizer.from_pretrained()` command)
+ (alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the `Tokenizer.from_pretrained()` command)

You can now pretokenize your data using `tools/preprocess_data.py`, the arguments for which are detailed below:

@@ -277,7 +277,7 @@ Although this is not strictly necessary, we find it useful to define the model p

# Inference

- **For most uses we recommend deploying models trained using the GPT-NeoX library via the HuggingFace Transformers library which is better optimized for inference.**
+ **For most uses we recommend deploying models trained using the GPT-NeoX library via the Hugging Face Transformers library which is better optimized for inference.**

We support three types of generation from a pretrained model:
1. Unconditional generation
@@ -298,22 +298,22 @@ python ./deepy.py evaluate.py -d configs your_configs.yml --eval_tasks task1 tas

where `--eval_tasks` is a list of evaluation tasks followed by spaces, e.g. `--eval_tasks lambada hellaswag piqa sciq`. For details of all tasks available, refer to the [lm-evaluation-harness repo](https://github.com/EleutherAI/lm-evaluation-harness).

- # Exporting to HuggingFace
+ # Exporting to Hugging Face

- GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints are not compatible out of the box with other deep learning libraries. To make models easily loadable and shareable with end users, and for further exporting to various other frameworks, GPT-NeoX supports checkpoint conversion to the [HuggingFace Transformers](https://arxiv.org/abs/1910.03771) GPTNeoXModel format.
+ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints are not compatible out of the box with other deep learning libraries. To make models easily loadable and shareable with end users, and for further exporting to various other frameworks, GPT-NeoX supports checkpoint conversion to the [Hugging Face Transformers](https://arxiv.org/abs/1910.03771) GPTNeoXModel format.

- To convert a NeoX checkpoint to Huggingface-loadable format, run:
+ To convert a NeoX checkpoint to Hugging Face-loadable format, run:
```bash
python ./tools/convert_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yml --output_dir hf_model/save/location
```
- Then to upload a model to [the Huggingface Hub](https://huggingface.co/), run:
+ Then to upload a model to [the Hugging Face Hub](https://huggingface.co/), run:
```bash
huggingface-cli login
python ./tools/upload.py
```
and input the requested information, including your HF Hub user token.

- Note, however, that this compatibility is not one-to-one, and only certain configurations from GPT-NeoX are supported in the Huggingface GPTNeoXModel class. Advanced features such as alternative positional embeddings may require new Transformers modeling code and new conversion script tweaks.
+ Note, however, that this compatibility is not one-to-one, and only certain configurations from GPT-NeoX are supported in the Hugging Face GPTNeoXModel class. Advanced features such as alternative positional embeddings may require new Transformers modeling code and new conversion script tweaks.
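As a minimal sketch of consuming a converted checkpoint (an illustration, not part of this commit; it assumes the conversion above wrote model and tokenizer files to `hf_model/save/location`):

```python
# Load a converted GPT-NeoX checkpoint with the Hugging Face auto classes.
# If the conversion script did not write tokenizer files, load the tokenizer
# from wherever those files live instead.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("hf_model/save/location")
tokenizer = AutoTokenizer.from_pretrained("hf_model/save/location")

inputs = tokenizer("GPT-NeoX is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))
```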

# Monitoring

2 changes: 1 addition & 1 deletion configs/1-3B.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/125M.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/13B.yml
@@ -80,7 +80,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/175B.yml
@@ -78,7 +78,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/19M.yml
@@ -74,7 +74,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 1000,
"checkpoint-factor": 1000,
"eval-interval": 100000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/2-7B.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/20B.yml
@@ -94,7 +94,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 500,
"checkpoint-factor": 500, # this variable previously called `save-interval`
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/350M.yml
@@ -78,7 +78,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/49M.yml
@@ -80,7 +80,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 1000,
"checkpoint-factor": 1000,
"eval-interval": 100000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/6-7B.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/760M.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/800M.yml
@@ -74,7 +74,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 1000,
"checkpoint-factor": 1000,
"eval-interval": 40000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/bf16_125M.yml
@@ -74,7 +74,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/bnb_125M.yml
@@ -73,7 +73,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/gmlp_small.yml
@@ -60,7 +60,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

15 changes: 12 additions & 3 deletions configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

- Default = 317555a
+ Default = 075a525

current git hash of repository

@@ -923,6 +923,15 @@ Text Generation arguments



+ - **prompt_end**: str
+
+ Default =
+
+ a single prompt's end. Defaults to newline
- **sample_input_file**: str

Default = None
@@ -1155,10 +1164,10 @@ Training Arguments

Acts as a multiplier on either the "log" or "linear" checkpoint spacing.

With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
steps [20, 40, 60, 80, 100].

With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
steps [1, 2, 4, 8, 16, 32, 64, 100].

Note that the last checkpoint step is always saved.
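The schedule can be sketched in a few lines that mirror the loop in `megatron/neox_arguments/arguments.py` later in this diff (simplified: `extra_save_iters` and the always-saved final step are left out):

```python
def save_steps(checkpoint_factor, checkpoint_scale, train_iters):
    # Start at checkpoint_factor, so steps 0 and 1 are never saved, then grow
    # linearly or geometrically until train_iters is reached.
    save_iters = set()
    step = checkpoint_factor
    while step < train_iters:
        save_iters.add(step)
        if checkpoint_scale == "log":
            step *= checkpoint_factor
        elif checkpoint_scale == "linear":
            step += checkpoint_factor
    return sorted(save_iters)

print(save_steps(20, "linear", 100))  # [20, 40, 60, 80]; step 100 is saved as the last step
print(save_steps(2, "log", 100))      # [2, 4, 8, 16, 32, 64]; plus step 100
```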
2 changes: 1 addition & 1 deletion configs/slurm_125M.yml
@@ -51,7 +51,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"log-interval": 100,
1 change: 1 addition & 0 deletions configs/text_generation.yml
@@ -6,6 +6,7 @@

# Params for all
"maximum_tokens": 102,
"prompt_end": "\n",
"temperature": 1.0,
"top_p": 0.0,
"top_k": 0,
2 changes: 2 additions & 0 deletions generate.py
@@ -62,6 +62,7 @@ def main():
input_file=neox_args.sample_input_file,
output_file=neox_args.sample_output_file,
maximum_tokens=neox_args.maximum_tokens,
+ prompt_end=neox_args.prompt_end,
recompute=neox_args.recompute,
temperature=neox_args.temperature,
top_k=neox_args.top_k,
@@ -75,6 +76,7 @@
recompute=neox_args.recompute,
temperature=neox_args.temperature,
maximum_tokens=neox_args.maximum_tokens,
+ prompt_end=neox_args.prompt_end,
top_k=neox_args.top_k,
top_p=neox_args.top_p,
)
10 changes: 5 additions & 5 deletions megatron/model/transformer.py
@@ -663,15 +663,15 @@ def forward(self, x, attention_mask, layer_past=None):
# to save communication time (we can do a single allreduce after we add mlp / attn outputs).
# due to a bug, the two layernorms are not tied in GPT-NeoX-20B. This is non-desirable, but
# we preserve the functionality for backwards compatibility

residual = x
+ # applies the correct normalization depending on if the norms are tied
if self.gpt_j_tied:
- x1, x2 = self.input_layernorm(x), self.post_attention_layernorm(x)
- else:
x = self.input_layernorm(x)
x1, x2 = x, x
+ else:
+ x1, x2 = self.input_layernorm(x), self.post_attention_layernorm(x)

# attention operator
attention_output, attention_bias = self.attention(
x1, attention_mask, layer_past=layer_past
@@ -699,7 +699,7 @@ def forward(self, x, attention_mask, layer_past=None):
)

# output = (x + attn(ln(x)) + mlp(ln(x))
output = residual + self.reduce(output)
else:
# pseudocode:
# x = x + attn(ln1(x))
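Distilled for readability, the corrected control flow in the hunk above amounts to the following sketch (an illustration, ignoring the fused allreduce, bias terms, and dropout in the real layer):

```python
def gpt_j_parallel_residual(x, input_ln, post_attn_ln, attn, mlp, gpt_j_tied):
    # output = x + attn(ln1(x)) + mlp(ln2(x)); with gpt_j_tied=True a single
    # normalized activation feeds both branches, while gpt_j_tied=False keeps
    # two separate LayerNorms (the GPT-NeoX-20B behavior preserved for
    # backwards compatibility).
    residual = x
    if gpt_j_tied:
        x = input_ln(x)
        x1, x2 = x, x
    else:
        x1, x2 = input_ln(x), post_attn_ln(x)
    return residual + attn(x1) + mlp(x2)
```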
16 changes: 10 additions & 6 deletions megatron/neox_arguments/arguments.py
@@ -741,18 +741,18 @@ def calculate_derived(self):
save_iters = set(self.extra_save_iters)
else:
save_iters = set()
step = self.checkpoint_factor  # don't save step 0 or 1
while step < self.train_iters:
save_iters.add(step)
if self.checkpoint_scale == "log":
step *= self.checkpoint_factor
elif self.checkpoint_scale == "linear":
step += self.checkpoint_factor

save_iters = list(save_iters)
save_iters.sort()

self.update_values(
{
"save_iters": save_iters,
@@ -848,7 +848,7 @@ def calculate_derived(self):
if self.sparsity_config is None:
# Can't have a default value as an empty dict so need to set it here
self.update_value("sparsity_config", {})

# Adding equal dataset weights if none are provided
if self.train_data_paths and (self.train_data_weights is None):
self.train_data_weights = [1.0] * len(self.train_data_paths)
@@ -947,7 +947,11 @@ def validate_values(self):
raise ValueError(error_message)
return False

- if self.save is not None and self.checkpoint_factor is None and self.extra_save_iters is None:
+ if (
+ self.save is not None
+ and self.checkpoint_factor is None
+ and self.extra_save_iters is None
+ ):
error_message = (
self.__class__.__name__
+ ".validate_values() checkpoint_factor or extra_save_iters must be defined if save is defined"
11 changes: 8 additions & 3 deletions megatron/neox_arguments/neox_args.py
@@ -332,7 +332,7 @@ class NeoXArgsModel(NeoXArgsTemplate):
x = ln(x)
x = x + attn(x) + mlp(x)
"""

gpt_j_tied: bool = False
"""
If false, we use
@@ -785,10 +785,10 @@ class NeoXArgsTraining(NeoXArgsTemplate):
"""
Acts as a multiplier on either the "log" or "linear" checkpoint spacing.
With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
steps [20, 40, 60, 80, 100].
With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
steps [1, 2, 4, 8, 16, 32, 64, 100].
Note that the last checkpoint step is always saved.
@@ -1008,6 +1008,11 @@ class NeoXArgsTextgen(NeoXArgsTemplate):
maximum number of tokens to be generated
"""

+ prompt_end: str = "\n"
+ """
+ a single prompt's end. Defaults to newline
+ """

sample_input_file: str = None
"""
Get input from file instead of interactive mode, each line is an input.
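Per its docstring, `prompt_end` marks where one prompt stops when prompts are read from a file; a hedged sketch of that interpretation (a hypothetical helper, not code from this commit, and the library's actual parsing may differ):

```python
def read_prompts(path, prompt_end="\n"):
    # Split the sample input file on the prompt delimiter. With a delimiter
    # other than "\n", a single prompt can span multiple lines.
    with open(path, "r") as f:
        text = f.read()
    return [p for p in text.split(prompt_end) if p.strip()]

# e.g. prompts = read_prompts("sample_input.txt", prompt_end="\n")
```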