Commit 62e9440: Merge branch 'main' into opt-embeddings-fix

Quentin-Anthony committed Jan 15, 2023 · 2 parents e1e3842 + 375de3f
Showing 27 changed files with 97 additions and 61 deletions.
22 changes: 11 additions & 11 deletions README.md
@@ -7,7 +7,7 @@ This repository records [EleutherAI](https://www.eleuther.ai)'s library for trai

For those looking for a TPU-centric codebase, we recommend [Mesh Transformer JAX](https://github.com/kingoflolz/mesh-transformer-jax).

- **If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the HuggingFace `transformers` library instead which supports GPT-NeoX models.**
+ **If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you use the Hugging Face `transformers` library instead which supports GPT-NeoX models.**

# Contents

@@ -24,7 +24,7 @@ For those looking for a TPU-centric codebase, we recommend [Mesh Transformer JAX
- [Training and Finetuning](#training-and-finetuning)
- [Inference](#inference)
- [Evaluation](#evaluation)
- - [Exporting to HuggingFace](#exporting-to-huggingface)
+ - [Exporting to Hugging Face](#exporting-to-hugging-face)
- [Monitoring](#monitoring)
* [Weights & Biases](#wandb)
* [TensorBoard](#tensorboard)
@@ -146,10 +146,10 @@ To reproduce our evaluation numbers on, for example, TriviaQA and PIQA use:

You can add an arbitrary list of evaluation tasks here; for details of all tasks available, see [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

For more details on each entry point, see the [Training and Finetuning](#training-and-finetuning), [Inference](#inference) and [Evaluation](#evaluation)
# Configuration

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yaml files in [configs](./configs/), including one for GPT-NeoX-20B, and example configuration files for other model sizes.

These files are generally complete, but non-optimal. For example, depending on your specific GPU configuration, you may need to change some settings such as `pipe-parallel-size`, `model-parallel-size` to increase or decrease the degree of parallelisation, `train_micro_batch_size_per_gpu` or `gradient-accumulation-steps` to modify batch size related settings, or the `zero_optimization` dict to modify how optimizer states are parallelised across workers.
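A quick sketch of how the batch-size settings above interact, assuming DeepSpeed's usual semantics (the names mirror the YAML keys; the values are made up for illustration):

```python
# Global batch size under DeepSpeed-style semantics: micro batch per GPU
# times gradient accumulation steps times the number of data-parallel replicas.
num_gpus = 16
pipe_parallel_size = 2
model_parallel_size = 2
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 8

# GPUs not consumed by pipeline/model parallelism act as data-parallel replicas.
data_parallel_size = num_gpus // (pipe_parallel_size * model_parallel_size)  # 4
global_batch_size = (
    train_micro_batch_size_per_gpu
    * gradient_accumulation_steps
    * data_parallel_size
)
print(global_batch_size)  # 128
```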

@@ -192,7 +192,7 @@ Or use the 20B tokenizer (for which only a single Vocab file is needed):

- Vocab: https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/20B_tokenizer.json

- (alternatively, you can provide any tokenizer file that can be loaded by Huggingface's tokenizers library with the `Tokenizer.from_pretrained()` command)
+ (alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the `Tokenizer.from_pretrained()` command)

You can now pretokenize your data using `tools/preprocess_data.py`, the arguments for which are detailed below:

@@ -277,7 +277,7 @@ Although this is not strictly necessary, we find it useful to define the model p

# Inference

- **For most uses we recommend deploying models trained using the GPT-NeoX library via the HuggingFace Transformers library which is better optimized for inference.**
+ **For most uses we recommend deploying models trained using the GPT-NeoX library via the Hugging Face Transformers library which is better optimized for inference.**

We support three types of generation from a pretrained model:
1. Unconditional generation
@@ -298,22 +298,22 @@ python ./deepy.py evaluate.py -d configs your_configs.yml --eval_tasks task1 tas

where `--eval_tasks` is a list of evaluation tasks followed by spaces, e.g. `--eval_tasks lambada hellaswag piqa sciq`. For details of all tasks available, refer to the [lm-evaluation-harness repo](https://github.com/EleutherAI/lm-evaluation-harness).

- # Exporting to HuggingFace
+ # Exporting to Hugging Face

- GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints are not compatible out of the box with other deep learning libraries. To make models easily loadable and shareable with end users, and for further exporting to various other frameworks, GPT-NeoX supports checkpoint conversion to the [HuggingFace Transformers](https://arxiv.org/abs/1910.03771) GPTNeoXModel format.
+ GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints are not compatible out of the box with other deep learning libraries. To make models easily loadable and shareable with end users, and for further exporting to various other frameworks, GPT-NeoX supports checkpoint conversion to the [Hugging Face Transformers](https://arxiv.org/abs/1910.03771) GPTNeoXModel format.

- To convert a NeoX checkpoint to Huggingface-loadable format, run:
+ To convert a NeoX checkpoint to Hugging Face-loadable format, run:
```bash
python ./tools/convert_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yml --output_dir hf_model/save/location
```
- Then to upload a model to [the Huggingface Hub](https://huggingface.co/), run:
+ Then to upload a model to [the Hugging Face Hub](https://huggingface.co/), run:
```bash
huggingface-cli login
python ./tools/upload.py
```
and input the requested information, including your HF Hub user token.

- Note, however, that this compatibility is not one-to-one, and only certain configurations from GPT-NeoX are supported in the Huggingface GPTNeoXModel class. Advanced features such as alternative positional embeddings may require new Transformers modeling code and new conversion script tweaks.
+ Note, however, that this compatibility is not one-to-one, and only certain configurations from GPT-NeoX are supported in the Hugging Face GPTNeoXModel class. Advanced features such as alternative positional embeddings may require new Transformers modeling code and new conversion script tweaks.
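As a minimal sketch of consuming a converted checkpoint (an illustration, not part of this commit; it assumes the conversion above wrote model and tokenizer files to `hf_model/save/location`):

```python
# Load a converted GPT-NeoX checkpoint with the Hugging Face auto classes.
# If the conversion script did not write tokenizer files, load the tokenizer
# from wherever those files live instead.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("hf_model/save/location")
tokenizer = AutoTokenizer.from_pretrained("hf_model/save/location")

inputs = tokenizer("GPT-NeoX is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))
```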

# Monitoring

2 changes: 1 addition & 1 deletion configs/1-3B.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/125M.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/13B.yml
@@ -80,7 +80,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/175B.yml
@@ -78,7 +78,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/19M.yml
@@ -74,7 +74,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 1000,
"checkpoint-factor": 1000,
"eval-interval": 100000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/2-7B.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/20B.yml
@@ -94,7 +94,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 500,
"checkpoint-factor": 500, # this variable previously called `save-interval`
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/350M.yml
@@ -78,7 +78,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/49M.yml
@@ -80,7 +80,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 1000,
"checkpoint-factor": 1000,
"eval-interval": 100000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/6-7B.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/760M.yml
@@ -79,7 +79,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/800M.yml
@@ -74,7 +74,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 1000,
"checkpoint-factor": 1000,
"eval-interval": 40000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/bf16_125M.yml
@@ -74,7 +74,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/bnb_125M.yml
@@ -73,7 +73,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

2 changes: 1 addition & 1 deletion configs/gmlp_small.yml
@@ -60,7 +60,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,

15 changes: 12 additions & 3 deletions configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

- Default = 317555a
+ Default = 075a525

current git hash of repository

@@ -923,6 +923,15 @@ Text Generation arguments



+ - **prompt_end**: str
+
+ Default =
+
+ a single prompt's end. Defaults to newline
- **sample_input_file**: str

Default = None
@@ -1155,10 +1164,10 @@ Training Arguments

Acts as a multiplier on either the "log" or "linear" checkpoint spacing.

With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
steps [20, 40, 60, 80, 100].

With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
steps [1, 2, 4, 8, 16, 32, 64, 100].

Note that the last checkpoint step is always saved.
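The schedule can be sketched in a few lines that mirror the loop in `megatron/neox_arguments/arguments.py` later in this diff (simplified: `extra_save_iters` and the always-saved final step are left out):

```python
def save_steps(checkpoint_factor, checkpoint_scale, train_iters):
    # Start at checkpoint_factor, so steps 0 and 1 are never saved, then grow
    # linearly or geometrically until train_iters is reached.
    save_iters = set()
    step = checkpoint_factor
    while step < train_iters:
        save_iters.add(step)
        if checkpoint_scale == "log":
            step *= checkpoint_factor
        elif checkpoint_scale == "linear":
            step += checkpoint_factor
    return sorted(save_iters)

print(save_steps(20, "linear", 100))  # [20, 40, 60, 80]; step 100 is saved as the last step
print(save_steps(2, "log", 100))      # [2, 4, 8, 16, 32, 64]; plus step 100
```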
2 changes: 1 addition & 1 deletion configs/slurm_125M.yml
@@ -51,7 +51,7 @@
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"log-interval": 100,
1 change: 1 addition & 0 deletions configs/text_generation.yml
@@ -6,6 +6,7 @@

# Params for all
"maximum_tokens": 102,
"prompt_end": "\n",
"temperature": 1.0,
"top_p": 0.0,
"top_k": 0,
2 changes: 2 additions & 0 deletions generate.py
@@ -62,6 +62,7 @@ def main():
input_file=neox_args.sample_input_file,
output_file=neox_args.sample_output_file,
maximum_tokens=neox_args.maximum_tokens,
+ prompt_end=neox_args.prompt_end,
recompute=neox_args.recompute,
temperature=neox_args.temperature,
top_k=neox_args.top_k,
@@ -75,6 +76,7 @@
recompute=neox_args.recompute,
temperature=neox_args.temperature,
maximum_tokens=neox_args.maximum_tokens,
+ prompt_end=neox_args.prompt_end,
top_k=neox_args.top_k,
top_p=neox_args.top_p,
)
10 changes: 5 additions & 5 deletions megatron/model/transformer.py
@@ -663,15 +663,15 @@ def forward(self, x, attention_mask, layer_past=None):
# to save communication time (we can do a single allreduce after we add mlp / attn outputs).
# due to a bug, the two layernorms are not tied in GPT-NeoX-20B. This is non-desirable, but
# we preserve the functionality for backwards compatibility

residual = x
+ # applies the correct normalization depending on if the norms are tied
if self.gpt_j_tied:
- x1, x2 = self.input_layernorm(x), self.post_attention_layernorm(x)
- else:
x = self.input_layernorm(x)
x1, x2 = x, x
+ else:
+ x1, x2 = self.input_layernorm(x), self.post_attention_layernorm(x)

# attention operator
attention_output, attention_bias = self.attention(
x1, attention_mask, layer_past=layer_past
@@ -699,7 +699,7 @@ def forward(self, x, attention_mask, layer_past=None):
)

# output = (x + attn(ln(x)) + mlp(ln(x))
output = residual + self.reduce(output)
else:
# pseudocode:
# x = x + attn(ln1(x))
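Distilled for readability, the corrected control flow in the hunk above amounts to the following sketch (an illustration, ignoring the fused allreduce, bias terms, and dropout in the real layer):

```python
def gpt_j_parallel_residual(x, input_ln, post_attn_ln, attn, mlp, gpt_j_tied):
    # output = x + attn(ln1(x)) + mlp(ln2(x)); with gpt_j_tied=True a single
    # normalized activation feeds both branches, while gpt_j_tied=False keeps
    # two separate LayerNorms (the GPT-NeoX-20B behavior preserved for
    # backwards compatibility).
    residual = x
    if gpt_j_tied:
        x = input_ln(x)
        x1, x2 = x, x
    else:
        x1, x2 = input_ln(x), post_attn_ln(x)
    return residual + attn(x1) + mlp(x2)
```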
16 changes: 10 additions & 6 deletions megatron/neox_arguments/arguments.py
@@ -741,18 +741,18 @@ def calculate_derived(self):
save_iters = set(self.extra_save_iters)
else:
save_iters = set()
step = self.checkpoint_factor  # don't save step 0 or 1
while step < self.train_iters:
save_iters.add(step)
if self.checkpoint_scale == "log":
step *= self.checkpoint_factor
elif self.checkpoint_scale == "linear":
step += self.checkpoint_factor

save_iters = list(save_iters)
save_iters.sort()

self.update_values(
{
"save_iters": save_iters,
@@ -848,7 +848,7 @@ def calculate_derived(self):
if self.sparsity_config is None:
# Can't have a default value as an empty dict so need to set it here
self.update_value("sparsity_config", {})

# Adding equal dataset weights if none are provided
if self.train_data_paths and (self.train_data_weights is None):
self.train_data_weights = [1.0] * len(self.train_data_paths)
@@ -947,7 +947,11 @@ def validate_values(self):
raise ValueError(error_message)
return False

- if self.save is not None and self.checkpoint_factor is None and self.extra_save_iters is None:
+ if (
+ self.save is not None
+ and self.checkpoint_factor is None
+ and self.extra_save_iters is None
+ ):
error_message = (
self.__class__.__name__
+ ".validate_values() checkpoint_factor or extra_save_iters must be defined if save is defined"
11 changes: 8 additions & 3 deletions megatron/neox_arguments/neox_args.py
@@ -332,7 +332,7 @@ class NeoXArgsModel(NeoXArgsTemplate):
x = ln(x)
x = x + attn(x) + mlp(x)
"""

gpt_j_tied: bool = False
"""
If false, we use
@@ -785,10 +785,10 @@ class NeoXArgsTraining(NeoXArgsTemplate):
"""
Acts as a multiplier on either the "log" or "linear" checkpoint spacing.
With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="linear"`, `checkpoint-factor=20`, and `train-iters=100`, checkpoints will be saved at
steps [20, 40, 60, 80, 100].
With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
With `checkpoint-scale="log"`, `checkpoint-factor=2`, and `train-iters=100`, checkpoints will be saved at
steps [1, 2, 4, 8, 16, 32, 64, 100].
Note that the last checkpoint step is always saved.
@@ -1008,6 +1008,11 @@ class NeoXArgsTextgen(NeoXArgsTemplate):
maximum number of tokens to be generated
"""

+ prompt_end: str = "\n"
+ """
+ a single prompt's end. Defaults to newline
+ """

sample_input_file: str = None
"""
Get input from file instead of interactive mode, each line is an input.
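Per its docstring, `prompt_end` marks where one prompt stops when prompts are read from a file; a hedged sketch of that interpretation (a hypothetical helper, not code from this commit, and the library's actual parsing may differ):

```python
def read_prompts(path, prompt_end="\n"):
    # Split the sample input file on the prompt delimiter. With a delimiter
    # other than "\n", a single prompt can span multiple lines.
    with open(path, "r") as f:
        text = f.read()
    return [p for p in text.split(prompt_end) if p.strip()]

# e.g. prompts = read_prompts("sample_input.txt", prompt_end="\n")
```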