
Negative document indices caused by 64 bit integer stored in a 32 bit integer array. #493

Closed
pwstegman opened this issue Jan 24, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@pwstegman
Contributor

pwstegman commented Jan 24, 2022

Describe the bug

While training on The Pile, I was getting errors from sparse attention claiming that the sequence length wasn't divisible by the block size, despite using a sequence length of 8192 and a block size of 16. This was caused by negative document indices in the dataset, which produced unexpected sample lengths (see the screenshot in the Screenshots section). The negative document indices were in turn caused by a wraparound at the 32 bit signed integer limit; more detail is in the Proposed solution section.

To Reproduce

I'm using the docker image leogao2/gpt-neox:sha-6dc7645. My training script is:

#!/bin/bash
python deepy.py train.py --conf_dir configs local_setup.yml sparse.yml 13B.yml

Configs are included in the Environment section.

Expected behavior

Each sample should be exactly 8193 tokens.

Proposed solution

In short, I traced the issue to this function:

py::array build_sample_idx(const py::array_t<int32_t>& sizes_,

It keeps looping until the target number of samples is reached:

while (sample_index <= num_samples) {

There are only ~200 million documents, but since the loop covers multiple epochs, and there may be multiple documents per sample, the document index quickly went past the 32 bit signed integer limit of 2^31 - 1 (about 2.1 billion). The document index variable itself is a 64 bit signed integer:

int64_t doc_idx_index = 0;

However, it's being stored in a 32 bit signed integer array, and this is where the wraparound to negative values happens:

int32_t* sample_idx = new int32_t[2*(num_samples+1)];

sample_idx[2 * sample_index] = doc_idx_index;
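
As a standalone illustration (not code from the repository), the snippet below shows how an int64_t value just past INT32_MAX silently becomes negative once it is narrowed into an int32_t slot, which matches the negative indices I was seeing:

#include <cstdint>
#include <iostream>

int main() {
    // An int64_t document index just past the 32 bit signed limit (2^31 - 1).
    int64_t doc_idx_index = 2147483648LL;
    int32_t sample_idx[2];
    // Narrowing to int32_t wraps around on typical two's-complement platforms.
    sample_idx[0] = static_cast<int32_t>(doc_idx_index);
    std::cout << sample_idx[0] << std::endl;  // prints -2147483648
    return 0;
}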

To solve this, I propose:

  1. The array should be updated to a 64 bit signed integer array.
  2. Before being stored in the array, the document index should be taken modulo the number of documents.

I can submit a PR to take care of both of these if this sounds reasonable.
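
For concreteness, here is a rough, self-contained sketch of what those two changes could look like, reusing the variable names from the snippets above (num_docs is a placeholder for however the document count is exposed in that function; the actual patch may differ):

#include <cstdint>
#include <iostream>

int main() {
    const int64_t num_samples = 4;
    const int64_t num_docs = 200000000;  // ~200m documents, as mentioned above

    // 1. A 64 bit signed array, so large document indices fit without wrapping.
    int64_t* sample_idx = new int64_t[2 * (num_samples + 1)];

    // A running index that has already crossed the int32 limit.
    int64_t doc_idx_index = 2200000123LL;
    int64_t sample_index = 0;

    // 2. Store the index modulo the number of documents (num_docs is a placeholder).
    sample_idx[2 * sample_index] = doc_idx_index % num_docs;

    std::cout << sample_idx[0] << std::endl;  // prints 123, a valid document index
    delete[] sample_idx;
    return 0;
}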

Screenshots

Here's one screenshot I took which highlights the core of the issue:

[screenshot omitted]

Environment (please complete the following information):

  • GPUs: 8x A100-SXM-80GB
  • Configs: See below

13B.yml

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 4,

   # model settings
   "num-layers": 40,
   "hidden-size": 5120,
   "num-attention-heads": 40,
   "seq-length": 8192,
   "max-position-embeddings": 8192,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": true,
   "bias-gelu-fusion": true,

   
   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0000001,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },

   # ZeRO
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "round_robin_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 1,
   "data-impl": "mmap",
   "split": "949,50,1",
   "gradient_accumulation_steps": 128,

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.1,
   "hidden-dropout": 0.1,
   "attention-dropout": 0.1,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 3200000,
   "lr-decay-iters": 3200000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.00004,
   "min_lr": 0.00000001,
   "save-interval": 80,
   "eval-interval": 20,
   "eval-iters": 1,

   # logging
   "log-interval": 1,
   "steps_per_print": 1,
   "keep-last-n-checkpoints": 6,
   "wall_clock_breakdown": true,

   "override_lr_scheduler": true,
   "use_checkpoint_lr_scheduler": false,
   "finetune": true
}

local_setup.yml

{
  "data-path": "/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document",

  "vocab-file": "/mnt/4TBNVME/data/gpt2-vocab.json",
  "merge-file": "/mnt/4TBNVME/data/gpt2-merges.txt",

  "save": "/mnt/4TBNVME/checkpoints",
  "load": "/mnt/4TBNVME/checkpoints",
  "checkpoint_validation_with_forward_pass": False,
  
  "tensorboard-dir": "/mnt/4TBNVME/tensorboard_logs/run19",
  "log-dir": "/mnt/4TBNVME/gptneox_logs/run19",
  "use_wandb": False,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox"
}

sparse.yml

# Add this to your config for sparse attention every other layer
{
  "attention_config": [[["local", "local"], "all"]],

  # sparsity config:
  # (these are the defaults for local sliding window sparsity, training will work without this here, but it's left in for
  # illustrative purposes)
  # see https://www.deepspeed.ai/tutorials/sparse-attention/#how-to-config-sparsity-structures for
  # more detailed config instructions and available parameters

  "sparsity_config": {
    "block": 16, # block size
    "num_local_blocks": 32,
  }
}

Additional context

None

pwstegman added the bug label Jan 24, 2022
@StellaAthena
Member

This seems like a generally good idea, though I'm very intrigued by some of your config choices. You're finetuning a 13B parameter model with a sequence length of 8192? And doing more than 10 epochs on the Pile?

@pwstegman
Contributor Author

Great, I'll get to work on a PR! Realistically I won't be able to train for 10 epochs; I was just changing parameters in the config and running small tests, and I stumbled across this bug by accident. I am curious whether the model can learn to process long-form text, though, hence the 8192 sequence length.

Unrelated, I noticed that model checkpoints are stored in a way that is specific to the 3D parallelism config. Is it possible to take a checkpoint that used a model parallelism of 4 and update it to a model parallelism of 8? I was thinking it should be possible to write a conversion script that copies all the weights over into the right locations, but wasn't sure if something like that already existed.

@StellaAthena
Member

That is a functionality we are currently exploring. It is unfortunately non-trivial :/

@StellaAthena
Member

@pwstegman did you ever solve this issue?

@StellaAthena
Member

@haileyschoelkopf @ShivanshuPurohit @Quentin-Anthony this was the issue that you independently discovered and then patched, right?

@Quentin-Anthony
Member

Yeah this should be fixed by #835
