
Negative document indices caused by 64 bit integer stored in a 32 bit integer array. #493

Closed
pwstegman opened this issue Jan 24, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@pwstegman
Contributor

pwstegman commented Jan 24, 2022

Describe the bug

While training on The Pile, I was getting errors from sparse attention claiming that the sequence length wasn't divisible by the block size, despite using a sequence length of 8192 and a block size of 16. This was caused by negative document indices in the dataset, which produced unexpected sample lengths (see the screenshot in the Screenshots section). The negative document indices were in turn caused by a wraparound at the 32 bit signed integer limit; more detail is in the Proposed solution section.

To Reproduce

I'm using the docker image leogao2/gpt-neox:sha-6dc7645. My training script is:

#!/bin/bash
python deepy.py train.py --conf_dir configs local_setup.yml sparse.yml 13B.yml

Configs are included in the Environment section.

Expected behavior

Each sample should be exactly 8193 tokens.

Proposed solution

In short, I traced the issue to this function:

py::array build_sample_idx(const py::array_t<int32_t>& sizes_,

It keeps looping until the target number of samples is reached:

while (sample_index <= num_samples) {

There are only ~200 million documents, but since the loop covers multiple epochs, and there may be multiple documents per sample, the document index quickly went past the 32 bit signed integer limit of 2^31 - 1 (about 2.1 billion). The document index variable itself is a 64 bit signed integer:

int64_t doc_idx_index = 0;

However, it's being stored in a 32 bit signed integer array, and this is where the wraparound to negative values happens:

int32_t* sample_idx = new int32_t[2*(num_samples+1)];

sample_idx[2 * sample_index] = doc_idx_index;
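
As a standalone illustration (not code from the repository), the snippet below shows how an int64_t value just past INT32_MAX silently becomes negative once it is narrowed into an int32_t slot, which matches the negative indices I was seeing:

#include <cstdint>
#include <iostream>

int main() {
    // An int64_t document index just past the 32 bit signed limit (2^31 - 1).
    int64_t doc_idx_index = 2147483648LL;
    int32_t sample_idx[2];
    // Narrowing to int32_t wraps around on typical two's-complement platforms.
    sample_idx[0] = static_cast<int32_t>(doc_idx_index);
    std::cout << sample_idx[0] << std::endl;  // prints -2147483648
    return 0;
}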

To solve this, I propose:

  1. The array should be updated to a 64 bit signed integer array.
  2. Before being stored in the array, the document index should be taken modulo the number of documents.

I can submit a PR to take care of both of these if this sounds reasonable.
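
For concreteness, here is a rough, self-contained sketch of what those two changes could look like, reusing the variable names from the snippets above (num_docs is a placeholder for however the document count is exposed in that function; the actual patch may differ):

#include <cstdint>
#include <iostream>

int main() {
    const int64_t num_samples = 4;
    const int64_t num_docs = 200000000;  // ~200m documents, as mentioned above

    // 1. A 64 bit signed array, so large document indices fit without wrapping.
    int64_t* sample_idx = new int64_t[2 * (num_samples + 1)];

    // A running index that has already crossed the int32 limit.
    int64_t doc_idx_index = 2200000123LL;
    int64_t sample_index = 0;

    // 2. Store the index modulo the number of documents (num_docs is a placeholder).
    sample_idx[2 * sample_index] = doc_idx_index % num_docs;

    std::cout << sample_idx[0] << std::endl;  // prints 123, a valid document index
    delete[] sample_idx;
    return 0;
}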

Screenshots

Here's one screenshot I took which highlights the core of the issue:

[screenshot omitted]

Environment (please complete the following information):

  • GPUs: 8x A100-SXM-80GB
  • Configs: See below

13B.yml

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 4,

   # model settings
   "num-layers": 40,
   "hidden-size": 5120,
   "num-attention-heads": 40,
   "seq-length": 8192,
   "max-position-embeddings": 8192,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": true,
   "bias-gelu-fusion": true,

   
   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0000001,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },

   # ZeRO
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "round_robin_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 1,
   "data-impl": "mmap",
   "split": "949,50,1",
   "gradient_accumulation_steps": 128,

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.1,
   "hidden-dropout": 0.1,
   "attention-dropout": 0.1,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 3200000,
   "lr-decay-iters": 3200000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.00004,
   "min_lr": 0.00000001,
   "save-interval": 80,
   "eval-interval": 20,
   "eval-iters": 1,

   # logging
   "log-interval": 1,
   "steps_per_print": 1,
   "keep-last-n-checkpoints": 6,
   "wall_clock_breakdown": true,

   "override_lr_scheduler": true,
   "use_checkpoint_lr_scheduler": false,
   "finetune": true
}

local_setup.yml

{
  "data-path": "/mnt/4TBNVME/data/the_pile_preprocessed/train/the_pile_text_document",

  "vocab-file": "/mnt/4TBNVME/data/gpt2-vocab.json",
  "merge-file": "/mnt/4TBNVME/data/gpt2-merges.txt",

  "save": "/mnt/4TBNVME/checkpoints",
  "load": "/mnt/4TBNVME/checkpoints",
  "checkpoint_validation_with_forward_pass": False,
  
  "tensorboard-dir": "/mnt/4TBNVME/tensorboard_logs/run19",
  "log-dir": "/mnt/4TBNVME/gptneox_logs/run19",
  "use_wandb": False,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox"
}

sparse.yml

# Add this to your config for sparse attention every other layer
{
  "attention_config": [[["local", "local"], "all"]],

  # sparsity config:
  # (these are the defaults for local sliding window sparsity, training will work without this here, but it's left in for
  # illustrative purposes)
  # see https://www.deepspeed.ai/tutorials/sparse-attention/#how-to-config-sparsity-structures for
  # more detailed config instructions and available parameters

  "sparsity_config": {
    "block": 16, # block size
    "num_local_blocks": 32,
  }
}

Additional context

None

pwstegman added the bug label Jan 24, 2022
@StellaAthena
Member

This seems like a generally good idea, though I'm very intrigued by some of your config choices. You're finetuning a 13B parameter model with a sequence length of 8192? And doing more than 10 epochs on the Pile?

@pwstegman
Contributor Author

Great, I'll get to work on a PR! Realistically I won't be able to train for 10 epochs; I was just changing parameters in the config and running small tests, and I stumbled across this bug by accident. I am curious whether the model can learn to process long-form text, though, hence the 8192 sequence length.

Unrelated, I noticed that model checkpoints are stored in a way that is specific to the 3D parallelism config. Is it possible to take a checkpoint that used a model parallelism of 4 and update it to a model parallelism of 8? I was thinking it should be possible to write a conversion script that copies all the weights over into the right locations, but wasn't sure if something like that already existed.

@StellaAthena
Member

That is a functionality we are currently exploring. It is unfortunately non-trivial :/

@StellaAthena
Member

@pwstegman did you ever solve this issue?

@StellaAthena
Member

@haileyschoelkopf @ShivanshuPurohit @Quentin-Anthony this was the issue that you independently discovered and then patched, right?

@Quentin-Anthony
Member

Yeah this should be fixed by #835
