Skipped 50 iterations in a row due to Overflow - Exiting training. #482

Closed
ScTof opened this issue Dec 16, 2021 · 4 comments
Labels
bug (Something isn't working)

Comments

ScTof commented Dec 16, 2021

Describe the bug
Hi, I'm getting an error trying to follow the "Quick Start" guide (https://github.com/EleutherAI/gpt-neox#quick-start):
Skipped 50 iterations in a row due to Overflow - Exiting training.

The error occurs right after checkpoint global_step90000 is created. This is the traceback:

Traceback (most recent call last):
File "train.py", line 27, in
pretrain(neox_args=neox_args)
File "~/gpt-neox/megatron/training.py", line 103, in pretrain
iteration = train(
File "~/gpt-neox/megatron/training.py", line 562, in train
overflow_monitor.check(skipped_iter) # check for repeated overflow
File "~/gpt-neox/megatron/utils.py", line 353, in check
raise Exception(
Exception: Skipped 50 iterations in a row due to Overflow - Exiting training.
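
For context, with fp16 dynamic loss scaling enabled (as in small.yml), any iteration whose gradients overflow is skipped and the loss scale is lowered; the check in megatron/utils.py then aborts once too many consecutive iterations have been skipped. A minimal sketch of that pattern (illustrative only, not the actual gpt-neox implementation):

    class OverflowMonitor:
        """Abort training after n consecutive skipped (overflowed) iterations."""

        def __init__(self, n=50):
            self.n = n
            self.consecutive_skips = 0

        def check(self, skipped_iter):
            # skipped_iter is 1 when the fp16 optimizer skipped this step, else 0
            self.consecutive_skips = self.consecutive_skips + 1 if skipped_iter else 0
            if self.consecutive_skips >= self.n:
                raise Exception(
                    f"Skipped {self.n} iterations in a row due to Overflow - Exiting training."
                )

So the exception is only a symptom: every recent step overflowed in fp16, which usually means the loss or the weights are diverging rather than that the monitor itself is misbehaving.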

To Reproduce

  1. I'm using an Anaconda environment with Python 3.8.12 and PyTorch 1.10.0 (configured for GPU).

  2. I executed the following two commands from the repository root:
    pip install -r requirements/requirements.txt
    python ./megatron/fused_kernels/setup.py install

  3. Then I downloaded preconfigured datasets by executing:
    python ./prepare_data.py

  4. Afterwards I got the enron data by running:
    python prepare_data.py enron -t CharLevelTokenizer -d ./data/

  5. Finally, I ran the pretraining module as:
    python ./deepy.py train.py -d configs small.yml local_setup.yml

I didn't make any changes to the pre-existing config files.

Could somebody tell me what I'm doing wrong? Thanks in advance!

ScTof added the bug label Dec 16, 2021
pwstegman (Contributor) commented Dec 16, 2021

I ran into a similar issue in the past. Are the model weights in the latest checkpoint very large? It might be that the weights are shooting way up and causing overflows. Adding L2 regularization can help prevent this. In small.yml, you can change line 56 to:

"weight-decay": 0.01

I'm not sure if this will override the value in the checkpoint though. One quick test would be to try a negative weight decay and see if it throws an error. If it does, then you'll know it's reading the weight decay from the updated config and not the checkpoint.
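
To check whether the weights in the latest checkpoint really have blown up, you can inspect the per-layer state files that DeepSpeed writes, e.g. checkpoints/global_step90000/layer_02-model_00-model_states.pt. A rough sketch (hypothetical helper, not part of the repo; adjust the path to your own checkpoint):

    import torch

    # Load one of the per-layer state files saved by DeepSpeed and report the
    # largest absolute value of each tensor, to see whether the parameters
    # have grown unusually large.
    path = "checkpoints/global_step90000/layer_02-model_00-model_states.pt"
    state = torch.load(path, map_location="cpu")
    for name, tensor in state.items():
        if torch.is_tensor(tensor):
            print(f"{name}: max |w| = {tensor.float().abs().max().item():.3e}")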

Edit: Did some testing on my end. I got this error when changing the weight-decay param (while loading from checkpoint):

ValueError: loaded state dict has a different number of parameter groups
Traceback
NeoXArgs.from_ymls() ['configs/small.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
  attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
  attention_dropout ............... 0.0.........................updated
  batch_size ...................... 4...........................updated
  checkpoint_activations .......... True........................updated
  clip_grad ....................... 1.0.........................updated
  config_files .................... {'small.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe-parallel-size": 1,\n   "model-parallel-size": 1,\n\n   # model settings\n   "num-layers": 12,\n   "hidden-size": 768,\n   "num-attention-heads": 12,\n   "seq-length": 2048,\n   "max-position-embeddings": 2048,\n   "norm": "layernorm",\n   "pos-emb": "rotary",\n   "no-weight-tying": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled-upper-triang-masked-softmax-fusion": false,\n   "bias-gelu-fusion": false,\n\n\n   # optimizer settings\n   "optimizer": {\n     "type": "Adam",\n     "params": {\n       "lr": 0.0006,\n       "betas": [0.9, 0.999],\n       "eps": 1.0e-8,\n     }\n   },\n   "zero_optimization": {\n    "stage": 0,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n    "cpu_offload": False\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 4,\n   "data-impl": "mmap",\n   "split": "949,50,1",\n\n   # activation checkpointing\n   "checkpoint-activations": true,\n   "checkpoint-num-layers": 1,\n   "partition-activations": true,\n   "synchronize-each-layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight-decay": 0.5,\n   "hidden-dropout": 0.0,\n   "attention-dropout": 0.0,\n\n   # "no-load-optim": false,\n\n   # precision settings\n   "fp16": { \n     "enabled": true,\n     "loss_scale": 0,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. training settings\n   "train-iters": 320000,\n   "lr-decay-iters": 320000,\n   "distributed-backend": "nccl",\n   "lr-decay-style": "cosine",\n   "warmup": 0.01,\n   "save-interval": 10,\n   "eval-interval": 1000,\n   "eval-iters": 10,\n\n   # logging\n   "log-interval": 100,\n   "steps_per_print": 10,\n   "keep-last-n-checkpoints": 4,\n   "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  "data-path": "data/enron/enron_text_document",\n  \n  # or for weighted datasets: \n  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab-file": "data/gpt2-vocab.json",\n  "merge-file": "data/gpt2-merges.txt",\n\n  "save": "checkpoints",\n  "load": "checkpoints",\n  "checkpoint_validation_with_forward_pass": False,\n  \n  "tensorboard-dir": "tensorboard",\n  "log-dir": "logs",\n  "use_wandb": False,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}\n'}updated
  data_impl ....................... mmap........................updated
  data_path ....................... data/enron/enron_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  gas ............................. 1...........................updated
  global_num_gpus ................. 4...........................updated
  gradient_clipping ............... 1.0.........................updated
  hidden_dropout .................. 0.0.........................updated
  hidden_size ..................... 768.........................updated
  is_pipe_parallel ................ True........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints.................updated
  log_dir ......................... logs........................updated
  log_interval .................... 100.........................updated
  lr .............................. 0.0006......................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... data/gpt2-merges.txt........updated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 12..........................updated
  num_layers ...................... 12..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  partition_activations ........... True........................updated
  pipe_parallel_size .............. 1...........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints.................updated
  save_interval ................... 10..........................updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  split ........................... 949,50,1....................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard.................updated
  train_batch_size ................ 16..........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 4...........................updated
  use_wandb ....................... False.......................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... data/gpt2-vocab.json........updated
  wall_clock_breakdown ............ True........................updated
  wandb_group ..................... PGUomanNCJsAJwjxHxxMQH_23td7yzpupdated
  weight_decay .................... 0.5.........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 0...........................updated
  activation ...................... gelu........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_softmax_in_fp32 ....... False.......................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_validation_with_forward_pass  False................default
  contiguous_checkpointing ........ False.......................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_mpi ................... False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  fp16_lm_cross_entropy ........... False.......................default
  fp32_allreduce .................. False.......................default
  git_hash ........................ 49e60fe.....................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method ..................... normal......................default
  init_method_std ................. 0.02........................default
  iteration ....................... None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  min_lr .......................... 0.0.........................default
  min_scale ....................... 1.0.........................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  no_load_optim ................... False.......................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 0...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  output_layer_init_method ........ scaled_normal...............default
  output_layer_parallelism ........ row.........................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile_backward ................ False.......................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  sample_input_file ............... None........................default
  sample_output_file .............. None........................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  soft_prompt_tuning .............. None........................default
  sparse_gradients ................ False.......................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  text_gen_type ................... None........................default
  tokenizer_type .................. GPT2BPETokenizer............default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weighted_sampler_alpha .......... 0.3.........................default
  world_size ...................... None........................default
  zero_allow_untested_optimizer ... False.......................default
---------------- end of arguments ----------------
[2021-12-16 15:58:05,404] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-12-16 15:58:05,405] [INFO] [runner.py:366:main] cmd = /mnt/workspace/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 0, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe-parallel-size\": 1,\n   \"model-parallel-size\": 1,\n\n   # model settings\n   \"num-layers\": 12,\n   \"hidden-size\": 768,\n   \"num-attention-heads\": 12,\n   \"seq-length\": 2048,\n   \"max-position-embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos-emb\": \"rotary\",\n   \"no-weight-tying\": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled-upper-triang-masked-softmax-fusion\": false,\n   \"bias-gelu-fusion\": false,\n\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.0006,\n       \"betas\": [0.9, 0.999],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"zero_optimization\": {\n    \"stage\": 0,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n    \"cpu_offload\": False\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data-impl\": \"mmap\",\n   \"split\": \"949,50,1\",\n\n   # activation checkpointing\n   \"checkpoint-activations\": true,\n   \"checkpoint-num-layers\": 1,\n   \"partition-activations\": 
true,\n   \"synchronize-each-layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight-decay\": 0.5,\n   \"hidden-dropout\": 0.0,\n   \"attention-dropout\": 0.0,\n\n   # \"no-load-optim\": false,\n\n   # precision settings\n   \"fp16\": { \n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train-iters\": 320000,\n   \"lr-decay-iters\": 320000,\n   \"distributed-backend\": \"nccl\",\n   \"lr-decay-style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"save-interval\": 10,\n   \"eval-interval\": 1000,\n   \"eval-iters\": 10,\n\n   # logging\n   \"log-interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep-last-n-checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data-path\": \"data/enron/enron_text_document\",\n  \n  # or for weighted datasets: \n  # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab-file\": \"data/gpt2-vocab.json\",\n  \"merge-file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n  \n  \"tensorboard-dir\": \"tensorboard\",\n  \"log-dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "save_interval": 10, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.5, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "PGUomanNCJsAJwjxHxxMQH_23td7yzp", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}
[2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE libnccl-dev=2.8.4-1+cuda11.1
[2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NCCL_VERSION 2.8.4-1
[2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_VERSION 2.8.4-1
[2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE libnccl2=2.8.4-1+cuda11.1
[2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME libnccl-dev
[2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_NAME libnccl2
[2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION 2.8.4-1
[2021-12-16 15:58:06,363] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-12-16 15:58:06,363] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-12-16 15:58:06,363] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-12-16 15:58:06,363] [INFO] [launch.py:104:main] dist_world_size=4
[2021-12-16 15:58:06,363] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
[2021-12-16 15:58:09,760] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> setting tensorboard ...
> initializing torch distributed ...
[2021-12-16 15:58:09,776] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-16 15:58:09,779] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-16 15:58:09,823] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 1
MPU DP: [0, 1, 2, 3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2021-12-16 15:58:10,838] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/mnt/workspace/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/mnt/workspace/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3}
[2021-12-16 15:58:15,327] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
[2021-12-16 15:58:15,356] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
> learning rate decay style: cosine
[2021-12-16 15:58:15,358] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
DeepSpeed is enabled.
[2021-12-16 15:58:15,359] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
[2021-12-16 15:58:15,359] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-16 15:58:15,361] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-16 15:58:16,710] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2021-12-16 15:58:16,710] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer
[2021-12-16 15:58:16,710] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam
[2021-12-16 15:58:16,710] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2021-12-16 15:58:16,751] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2021-12-16 15:58:16,751] [INFO] [engine.py:498:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2021-12-16 15:58:16,751] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = 
[2021-12-16 15:58:16,751] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[[0.9, 0.999], [0.9, 0.999]]
[2021-12-16 15:58:16,751] [INFO] [config.py:759:print] DeepSpeedEngine configuration:
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print]   allreduce_always_fp32 ........ False
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print]   amp_enabled .................. False
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print]   amp_params ................... False
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print]   checkpoint_tag_validation_enabled  True
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] checkpoint_tag_validation_fail False
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] disable_allgather ............ False
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] dump_state ................... False
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] elasticity_enabled ........... False
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 3,
"detailed": true
}
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] fp16_enabled ................. True
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] fp16_type .................... fp16
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] global_rank .................. 0
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] gradient_accumulation_steps .. 1
[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] gradient_clipping ............ 1.0
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] gradient_predivide_factor .... 1.0
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] initial_dynamic_scale ........ 4294967296
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] loss_scale ................... 0
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] memory_breakdown ............. False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] optimizer_legacy_fusion ...... False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] optimizer_name ............... adam
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] optimizer_params ............. {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] pld_enabled .................. False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] pld_params ................... False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] precision .................... torch.float16
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] prescale_gradients ........... False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] scheduler_name ............... None
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] scheduler_params ............. None
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] sparse_attention ............. None
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] sparse_gradients_enabled ..... False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] steps_per_print .............. 10
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] tensorboard_enabled .......... False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] tensorboard_job_name ......... DeepSpeedJobName
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] tensorboard_output_path ......
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] train_batch_size ............. 16
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] train_micro_batch_size_per_gpu 4
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] wall_clock_breakdown ......... True
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] world_size ................... 4
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] zero_allow_untested_optimizer False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] zero_config .................. {
"stage": 0,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+12,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false
}
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] zero_enabled ................. False
[2021-12-16 15:58:16,753] [INFO] [config.py:763:print] zero_optimization_stage ...... 0
[2021-12-16 15:58:16,754] [INFO] [config.py:765:print] json = {
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 4,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1e-08
}
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 0,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"contiguous_gradients": true,
"cpu_offload": false
},
"wall_clock_breakdown": true
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3686349391937256 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.4023325443267822 seconds
Time to load utils op: 0.40233945846557617 seconds
Loading extension module utils...
Time to load utils op: 0.4020814895629883 seconds
[2021-12-16 15:58:17,158] [INFO] [engine.py:84:__init__] CONFIG: micro_batches=1 micro_batch_size=4
[2021-12-16 15:58:17,304] [INFO] [engine.py:141:__init__] RANK=0 STAGE=0 LAYERS=17 [0, 17) STAGE_PARAMS=162322944 (162.323M) TOTAL_PARAMS=162322944 (162.323M) UNIQUE_PARAMS=162322944 (162.323M)

number of parameters on model parallel rank 0: 162322944
total params: 162,322,944
[2021-12-16 15:58:18,619] [INFO] [engine.py:1551:_load_checkpoint] rank: 1 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 15:58:18,619] [INFO] [engine.py:1551:_load_checkpoint] rank: 0 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 15:58:18,619] [INFO] [engine.py:1551:_load_checkpoint] rank: 2 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 15:58:18,619] [INFO] [engine.py:1551:_load_checkpoint] rank: 3 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 15:58:19,816] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=0 file=checkpoints/global_step30/layer_00-model_00-model_states.pt
[2021-12-16 15:58:19,831] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=2 file=checkpoints/global_step30/layer_02-model_00-model_states.pt
[2021-12-16 15:58:19,840] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=3 file=checkpoints/global_step30/layer_03-model_00-model_states.pt
[2021-12-16 15:58:19,851] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=4 file=checkpoints/global_step30/layer_04-model_00-model_states.pt
[2021-12-16 15:58:19,861] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=5 file=checkpoints/global_step30/layer_05-model_00-model_states.pt
[2021-12-16 15:58:19,871] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=6 file=checkpoints/global_step30/layer_06-model_00-model_states.pt
[2021-12-16 15:58:19,881] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=7 file=checkpoints/global_step30/layer_07-model_00-model_states.pt
[2021-12-16 15:58:19,891] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=8 file=checkpoints/global_step30/layer_08-model_00-model_states.pt
[2021-12-16 15:58:19,901] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=9 file=checkpoints/global_step30/layer_09-model_00-model_states.pt
[2021-12-16 15:58:19,912] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=10 file=checkpoints/global_step30/layer_10-model_00-model_states.pt
[2021-12-16 15:58:19,922] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=11 file=checkpoints/global_step30/layer_11-model_00-model_states.pt
[2021-12-16 15:58:19,932] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=12 file=checkpoints/global_step30/layer_12-model_00-model_states.pt
[2021-12-16 15:58:19,942] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=13 file=checkpoints/global_step30/layer_13-model_00-model_states.pt
[2021-12-16 15:58:19,943] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=15 file=checkpoints/global_step30/layer_15-model_00-model_states.pt
[2021-12-16 15:58:20,019] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=16 file=checkpoints/global_step30/layer_16-model_00-model_states.pt
Traceback (most recent call last):
File "train.py", line 27, in
pretrain(neox_args=neox_args)
File "/mnt/workspace/gpt-neox/megatron/training.py", line 82, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(
File "/mnt/workspace/gpt-neox/megatron/training.py", line 422, in setup_model_and_optimizer
neox_args.iteration = load_checkpoint(
File "/mnt/workspace/gpt-neox/megatron/checkpointing.py", line 192, in load_checkpoint
checkpoint_name, state_dict = model.load_checkpoint(neox_args.load,
File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1523, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1562, in _load_checkpoint
self.optimizer.load_state_dict(
File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/runtime/fp16/fused_optimizer.py", line 468, in load_state_dict
self.optimizer.load_state_dict(state_dict["optimizer_state_dict"])
File "/mnt/workspace/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 141, in load_state_dict
raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups
(The same traceback is printed by each of the other three ranks.)
Killing subprocess 5760
Killing subprocess 5761
Killing subprocess 5762
Killing subprocess 5763
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in
main()
File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/mnt/workspace/venv/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true}', '--megatron_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 0, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small.yml": "# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 1,\n \"model-parallel-size\": 1,\n\n # model settings\n \"num-layers\": 12,\n \"hidden-size\": 768,\n \"num-attention-heads\": 12,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"no-weight-tying\": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n \"scaled-upper-triang-masked-softmax-fusion\": false,\n \"bias-gelu-fusion\": false,\n\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.0006,\n \"betas\": [0.9, 0.999],\n \"eps\": 1.0e-8,\n }\n },\n \"zero_optimization\": {\n \"stage\": 0,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 500000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 500000000,\n \"contiguous_gradients\": True,\n \"cpu_offload\": False\n },\n\n # batch / data settings\n \"train_micro_batch_size_per_gpu\": 4,\n \"data-impl\": \"mmap\",\n \"split\": \"949,50,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": true,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.5,\n \"hidden-dropout\": 0.0,\n \"attention-dropout\": 0.0,\n\n # \"no-load-optim\": false,\n\n # precision 
settings\n \"fp16\": { \n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. training settings\n \"train-iters\": 320000,\n \"lr-decay-iters\": 320000,\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"save-interval\": 10,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 100,\n \"steps_per_print\": 10,\n \"keep-last-n-checkpoints\": 4,\n \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n \"data-path\": \"data/enron/enron_text_document\",\n \n # or for weighted datasets: \n # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"train-data-weights\": [1., 2.],\n # \"test-data-weights\": [2., 1.],\n # \"valid-data-weights\": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n # WARNING: setting this to True will override any user provided weights\n # \"weight_by_num_documents\": false,\n # \"weighted_sampler_alpha\": 0.3,\n\n \"vocab-file\": \"data/gpt2-vocab.json\",\n \"merge-file\": \"data/gpt2-merges.txt\",\n\n \"save\": \"checkpoints\",\n \"load\": \"checkpoints\",\n \"checkpoint_validation_with_forward_pass\": False,\n \n \"tensorboard-dir\": \"tensorboard\",\n \"log-dir\": \"logs\",\n \"use_wandb\": False,\n \"wandb_host\": \"https://api.wandb.ai\",\n \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "save_interval": 10, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.5, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "PGUomanNCJsAJwjxHxxMQH_23td7yzp", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}']' returned non-zero exit status 1.
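
For what it's worth, that ValueError comes from torch.optim.Optimizer.load_state_dict, which requires the saved state and the current optimizer to have the same number of parameter groups. A tiny standalone reproduction (illustrative only, unrelated to how gpt-neox builds its parameter groups):

    import torch

    model = torch.nn.Linear(4, 4)

    # Optimizer state saved with a single parameter group ...
    saved_state = torch.optim.Adam(model.parameters(), lr=1e-3).state_dict()

    # ... cannot be loaded into an optimizer built with two parameter groups.
    opt = torch.optim.Adam(
        [
            {"params": [model.weight], "weight_decay": 0.01},
            {"params": [model.bias], "weight_decay": 0.0},
        ],
        lr=1e-3,
    )
    opt.load_state_dict(saved_state)
    # ValueError: loaded state dict has a different number of parameter groups

Presumably changing weight-decay changes how the parameters are grouped relative to what the checkpoint stored, hence the mismatch.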

So I tried adding this to the config:

"no-load-optim": true

That resulted in:

AssertionError: must provide optimizer during init in order to use backward
Traceback
NeoXArgs.from_ymls() ['configs/small.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
-------------------- arguments --------------------
  attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated
  attention_dropout ............... 0.0.........................updated
  batch_size ...................... 4...........................updated
  checkpoint_activations .......... True........................updated
  clip_grad ....................... 1.0.........................updated
  config_files .................... {'small.yml': '# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   "pipe-parallel-size": 1,\n   "model-parallel-size": 1,\n\n   # model settings\n   "num-layers": 12,\n   "hidden-size": 768,\n   "num-attention-heads": 12,\n   "seq-length": 2048,\n   "max-position-embeddings": 2048,\n   "norm": "layernorm",\n   "pos-emb": "rotary",\n   "no-weight-tying": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   "scaled-upper-triang-masked-softmax-fusion": false,\n   "bias-gelu-fusion": false,\n\n\n   # optimizer settings\n   "optimizer": {\n     "type": "Adam",\n     "params": {\n       "lr": 0.0006,\n       "betas": [0.9, 0.999],\n       "eps": 1.0e-8,\n     }\n   },\n   "zero_optimization": {\n    "stage": 0,\n    "allgather_partitions": True,\n    "allgather_bucket_size": 500000000,\n    "overlap_comm": True,\n    "reduce_scatter": True,\n    "reduce_bucket_size": 500000000,\n    "contiguous_gradients": True,\n    "cpu_offload": False\n  },\n\n   # batch / data settings\n   "train_micro_batch_size_per_gpu": 4,\n   "data-impl": "mmap",\n   "split": "949,50,1",\n\n   # activation checkpointing\n   "checkpoint-activations": true,\n   "checkpoint-num-layers": 1,\n   "partition-activations": true,\n   "synchronize-each-layer": true,\n\n   # regularization\n   "gradient_clipping": 1.0,\n   "weight-decay": 0.5,\n   "hidden-dropout": 0.0,\n   "attention-dropout": 0.0,\n\n   "no-load-optim": true,\n\n   # precision settings\n   "fp16": { \n     "enabled": true,\n     "loss_scale": 0,\n     "loss_scale_window": 1000,\n     "hysteresis": 2,\n     "min_loss_scale": 1\n   },\n\n   # misc. training settings\n   "train-iters": 320000,\n   "lr-decay-iters": 320000,\n   "distributed-backend": "nccl",\n   "lr-decay-style": "cosine",\n   "warmup": 0.01,\n   "save-interval": 10,\n   "eval-interval": 1000,\n   "eval-iters": 10,\n\n   # logging\n   "log-interval": 100,\n   "steps_per_print": 10,\n   "keep-last-n-checkpoints": 4,\n   "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n  "data-path": "data/enron/enron_text_document",\n  \n  # or for weighted datasets: \n  # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n  # "train-data-weights": [1., 2.],\n  # "test-data-weights": [2., 1.],\n  # "valid-data-weights": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # "weight_by_num_documents": false,\n  # "weighted_sampler_alpha": 0.3,\n\n  "vocab-file": "data/gpt2-vocab.json",\n  "merge-file": "data/gpt2-merges.txt",\n\n  "save": "checkpoints",\n  "load": "checkpoints",\n  "checkpoint_validation_with_forward_pass": False,\n  \n  "tensorboard-dir": "tensorboard",\n  "log-dir": "logs",\n  "use_wandb": False,\n  "wandb_host": "https://api.wandb.ai",\n  "wandb_project": "neox"\n}\n'}updated
  data_impl ....................... mmap........................updated
  data_path ....................... data/enron/enron_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  gas ............................. 1...........................updated
  global_num_gpus ................. 4...........................updated
  gradient_clipping ............... 1.0.........................updated
  hidden_dropout .................. 0.0.........................updated
  hidden_size ..................... 768.........................updated
  is_pipe_parallel ................ True........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints.................updated
  log_dir ......................... logs........................updated
  log_interval .................... 100.........................updated
  lr .............................. 0.0006......................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... data/gpt2-merges.txt........updated
  no_load_optim ................... True........................updated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 12..........................updated
  num_layers ...................... 12..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  partition_activations ........... True........................updated
  pipe_parallel_size .............. 1...........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints.................updated
  save_interval ................... 10..........................updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  split ........................... 949,50,1....................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard.................updated
  train_batch_size ................ 16..........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 4...........................updated
  use_wandb ....................... False.......................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... data/gpt2-vocab.json........updated
  wall_clock_breakdown ............ True........................updated
  wandb_group ..................... Mf6keriq8236v5DsKyvGvG_fxq3a3ahupdated
  weight_decay .................... 0.5.........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 0...........................updated
  activation ...................... gelu........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_softmax_in_fp32 ....... False.......................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_validation_with_forward_pass  False................default
  contiguous_checkpointing ........ False.......................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_mpi ................... False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  fp16_lm_cross_entropy ........... False.......................default
  fp32_allreduce .................. False.......................default
  git_hash ........................ 49e60fe.....................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method ..................... normal......................default
  init_method_std ................. 0.02........................default
  iteration ....................... None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  min_lr .......................... 0.0.........................default
  min_scale ....................... 1.0.........................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 0...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  output_layer_init_method ........ scaled_normal...............default
  output_layer_parallelism ........ row.........................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile_backward ................ False.......................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  sample_input_file ............... None........................default
  sample_output_file .............. None........................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  soft_prompt_tuning .............. None........................default
  sparse_gradients ................ False.......................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  text_gen_type ................... None........................default
  tokenizer_type .................. GPT2BPETokenizer............default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weighted_sampler_alpha .......... 0.3.........................default
  world_size ...................... None........................default
  zero_allow_untested_optimizer ... False.......................default
---------------- end of arguments ----------------
[2021-12-16 16:00:00,007] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-12-16 16:00:00,007] [INFO] [runner.py:366:main] cmd = /mnt/workspace/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 0, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small.yml": "# GPT-2 pretraining setup\n{\n   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n   # across the node boundaries )\n   \"pipe-parallel-size\": 1,\n   \"model-parallel-size\": 1,\n\n   # model settings\n   \"num-layers\": 12,\n   \"hidden-size\": 768,\n   \"num-attention-heads\": 12,\n   \"seq-length\": 2048,\n   \"max-position-embeddings\": 2048,\n   \"norm\": \"layernorm\",\n   \"pos-emb\": \"rotary\",\n   \"no-weight-tying\": true,\n\n   # these should provide some speedup but takes a while to build, set to true if desired\n   \"scaled-upper-triang-masked-softmax-fusion\": false,\n   \"bias-gelu-fusion\": false,\n\n\n   # optimizer settings\n   \"optimizer\": {\n     \"type\": \"Adam\",\n     \"params\": {\n       \"lr\": 0.0006,\n       \"betas\": [0.9, 0.999],\n       \"eps\": 1.0e-8,\n     }\n   },\n   \"zero_optimization\": {\n    \"stage\": 0,\n    \"allgather_partitions\": True,\n    \"allgather_bucket_size\": 500000000,\n    \"overlap_comm\": True,\n    \"reduce_scatter\": True,\n    \"reduce_bucket_size\": 500000000,\n    \"contiguous_gradients\": True,\n    \"cpu_offload\": False\n  },\n\n   # batch / data settings\n   \"train_micro_batch_size_per_gpu\": 4,\n   \"data-impl\": \"mmap\",\n   \"split\": \"949,50,1\",\n\n   # activation checkpointing\n   \"checkpoint-activations\": true,\n   \"checkpoint-num-layers\": 1,\n   \"partition-activations\": 
true,\n   \"synchronize-each-layer\": true,\n\n   # regularization\n   \"gradient_clipping\": 1.0,\n   \"weight-decay\": 0.5,\n   \"hidden-dropout\": 0.0,\n   \"attention-dropout\": 0.0,\n\n   \"no-load-optim\": true,\n\n   # precision settings\n   \"fp16\": { \n     \"enabled\": true,\n     \"loss_scale\": 0,\n     \"loss_scale_window\": 1000,\n     \"hysteresis\": 2,\n     \"min_loss_scale\": 1\n   },\n\n   # misc. training settings\n   \"train-iters\": 320000,\n   \"lr-decay-iters\": 320000,\n   \"distributed-backend\": \"nccl\",\n   \"lr-decay-style\": \"cosine\",\n   \"warmup\": 0.01,\n   \"save-interval\": 10,\n   \"eval-interval\": 1000,\n   \"eval-iters\": 10,\n\n   # logging\n   \"log-interval\": 100,\n   \"steps_per_print\": 10,\n   \"keep-last-n-checkpoints\": 4,\n   \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n  \"data-path\": \"data/enron/enron_text_document\",\n  \n  # or for weighted datasets: \n  # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n  # \"train-data-weights\": [1., 2.],\n  # \"test-data-weights\": [2., 1.],\n  # \"valid-data-weights\": [0.5, 0.4],\n\n  # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n  # WARNING: setting this to True will override any user provided weights\n  # \"weight_by_num_documents\": false,\n  # \"weighted_sampler_alpha\": 0.3,\n\n  \"vocab-file\": \"data/gpt2-vocab.json\",\n  \"merge-file\": \"data/gpt2-merges.txt\",\n\n  \"save\": \"checkpoints\",\n  \"load\": \"checkpoints\",\n  \"checkpoint_validation_with_forward_pass\": False,\n  \n  \"tensorboard-dir\": \"tensorboard\",\n  \"log-dir\": \"logs\",\n  \"use_wandb\": False,\n  \"wandb_host\": \"https://api.wandb.ai\",\n  \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "save_interval": 10, "no_load_optim": true, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.5, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "Mf6keriq8236v5DsKyvGvG_fxq3a3ah", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}
[2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE libnccl-dev=2.8.4-1+cuda11.1
[2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NCCL_VERSION 2.8.4-1
[2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_VERSION 2.8.4-1
[2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE libnccl2=2.8.4-1+cuda11.1
[2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME libnccl-dev
[2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_NAME libnccl2
[2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION 2.8.4-1
[2021-12-16 16:00:00,942] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-12-16 16:00:00,942] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-12-16 16:00:00,942] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-12-16 16:00:00,942] [INFO] [launch.py:104:main] dist_world_size=4
[2021-12-16 16:00:00,942] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-12-16 16:00:04,255] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
[2021-12-16 16:00:04,374] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> setting tensorboard ...
> initializing torch distributed ...
[2021-12-16 16:00:04,410] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2021-12-16 16:00:04,417] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
> initializing model parallel with size 1
MPU DP: [0, 1, 2, 3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0]
MPU MP: [1]
MPU MP: [2]
MPU MP: [3]
> setting random seeds to 1234 ...
[2021-12-16 16:00:05,453] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/mnt/workspace/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/mnt/workspace/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3}
[2021-12-16 16:00:09,950] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
[2021-12-16 16:00:09,977] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-16 16:00:09,978] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
DeepSpeed is enabled.
[2021-12-16 16:00:09,979] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
[2021-12-16 16:00:09,979] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-16 16:00:09,980] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-16 16:00:11,332] [INFO] [config.py:759:print] DeepSpeedEngine configuration:
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   allreduce_always_fp32 ........ False
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   amp_enabled .................. False
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   amp_params ................... False
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   checkpoint_tag_validation_enabled  True
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   checkpoint_tag_validation_fail  False
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   disable_allgather ............ False
[2021-12-16 16:00:11,333] [INFO] [config.py:763:print]   dump_state ................... False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   elasticity_enabled ........... False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
}
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   fp16_enabled ................. True
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   fp16_type .................... fp16
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   global_rank .................. 0
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   gradient_accumulation_steps .. 1
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   gradient_clipping ............ 1.0
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   gradient_predivide_factor .... 1.0
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   initial_dynamic_scale ........ 4294967296
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   loss_scale ................... 0
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   memory_breakdown ............. False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   optimizer_legacy_fusion ...... False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   optimizer_name ............... adam
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   optimizer_params ............. {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   pld_enabled .................. False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   pld_params ................... False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   precision .................... torch.float16
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   prescale_gradients ........... False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   scheduler_name ............... None
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   scheduler_params ............. None
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   sparse_attention ............. None
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   sparse_gradients_enabled ..... False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   steps_per_print .............. 10
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   tensorboard_enabled .......... False
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   tensorboard_output_path ......
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   train_batch_size ............. 16
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   train_micro_batch_size_per_gpu  4
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   wall_clock_breakdown ......... True
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   world_size ................... 4
[2021-12-16 16:00:11,334] [INFO] [config.py:763:print]   zero_allow_untested_optimizer  False
[2021-12-16 16:00:11,335] [INFO] [config.py:763:print]   zero_config .................. {
    "stage": 0,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5.000000e+08,
    "allgather_partitions": true,
    "allgather_bucket_size": 5.000000e+08,
    "overlap_comm": true,
    "load_from_fp32_weights": true,
    "elastic_checkpoint": true,
    "offload_param": null,
    "offload_optimizer": null,
    "sub_group_size": 1.000000e+12,
    "prefetch_bucket_size": 5.000000e+07,
    "param_persistence_threshold": 1.000000e+05,
    "max_live_parameters": 1.000000e+09,
    "max_reuse_distance": 1.000000e+09,
    "gather_fp16_weights_on_model_save": false
}
[2021-12-16 16:00:11,335] [INFO] [config.py:763:print]   zero_enabled ................. False
[2021-12-16 16:00:11,335] [INFO] [config.py:763:print]   zero_optimization_stage ...... 0
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
[2021-12-16 16:00:11,335] [INFO] [config.py:765:print]   json = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.0006,
            "betas": [0.9, 0.999],
            "eps": 1e-08
        }
    },
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 0,
        "allgather_partitions": true,
        "allgather_bucket_size": 5.000000e+08,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5.000000e+08,
        "contiguous_gradients": true,
        "cpu_offload": false
    },
    "wall_clock_breakdown": true
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3736841678619385 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 0.4020102024078369 seconds
Loading extension module utils...
Time to load utils op: 0.40239429473876953 seconds
Time to load utils op: 0.4019603729248047 seconds
[2021-12-16 16:00:11,739] [INFO] [engine.py:84:__init__] CONFIG: micro_batches=1 micro_batch_size=4
[2021-12-16 16:00:11,895] [INFO] [engine.py:141:__init__] RANK=0 STAGE=0 LAYERS=17 [0, 17) STAGE_PARAMS=162322944 (162.323M) TOTAL_PARAMS=162322944 (162.323M) UNIQUE_PARAMS=162322944 (162.323M)
 > number of parameters on model parallel rank 0: 162322944
 > total params: 162,322,944
[2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 3 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 2 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 0 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 1 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt
[2021-12-16 16:00:14,495] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=0 file=checkpoints/global_step30/layer_00-model_00-model_states.pt
[2021-12-16 16:00:14,505] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=2 file=checkpoints/global_step30/layer_02-model_00-model_states.pt
[2021-12-16 16:00:14,513] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=3 file=checkpoints/global_step30/layer_03-model_00-model_states.pt
[2021-12-16 16:00:14,521] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=4 file=checkpoints/global_step30/layer_04-model_00-model_states.pt
[2021-12-16 16:00:14,529] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=5 file=checkpoints/global_step30/layer_05-model_00-model_states.pt
[2021-12-16 16:00:14,537] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=6 file=checkpoints/global_step30/layer_06-model_00-model_states.pt
[2021-12-16 16:00:14,546] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=7 file=checkpoints/global_step30/layer_07-model_00-model_states.pt
[2021-12-16 16:00:14,556] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=8 file=checkpoints/global_step30/layer_08-model_00-model_states.pt
[2021-12-16 16:00:14,566] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=9 file=checkpoints/global_step30/layer_09-model_00-model_states.pt
[2021-12-16 16:00:14,575] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=10 file=checkpoints/global_step30/layer_10-model_00-model_states.pt
[2021-12-16 16:00:14,584] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=11 file=checkpoints/global_step30/layer_11-model_00-model_states.pt
[2021-12-16 16:00:14,593] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=12 file=checkpoints/global_step30/layer_12-model_00-model_states.pt
[2021-12-16 16:00:14,602] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=13 file=checkpoints/global_step30/layer_13-model_00-model_states.pt
[2021-12-16 16:00:14,602] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=15 file=checkpoints/global_step30/layer_15-model_00-model_states.pt
[2021-12-16 16:00:14,677] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=16 file=checkpoints/global_step30/layer_16-model_00-model_states.pt
 > validated currently set args with arguments in the checkpoint ...
  successfully loaded checkpoints/global_step30/mp_rank_00_model_states.pt
Loading checkpoint and starting from iteration 30
> building train, validation, and test datasets ...
    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
 > dataset split:
    train:
     document indices in [0, 491014) total of 491014 documents
    validation:
     document indices in [491014, 516884) total of 25870 documents
    test:
     document indices in [516884, 517401) total of 517 documents
 > loading doc-idx mapping from data/enron/enron_text_document_train_indexmap_5120000ns_2048sl_1234s_doc_idx.npy
 > loading sample-idx mapping from data/enron/enron_text_document_train_indexmap_5120000ns_2048sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from data/enron/enron_text_document_train_indexmap_5120000ns_2048sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 5172595
    total number of epochs: 38
WARNING: shuffle index length (5172593) is not equal to sample index length (5172594)
WARNING: shuffle index length (5172593) is not equal to sample index length (5172594)
WARNING: shuffle index length (5172593) is not equal to sample index length (5172594)
WARNING: shuffle index length (5172593) is not equal to sample index length (5172594)
loading doc-idx mapping from data/enron/enron_text_document_valid_indexmap_51360ns_2048sl_1234s_doc_idx.npy
loading sample-idx mapping from data/enron/enron_text_document_valid_indexmap_51360ns_2048sl_1234s_sample_idx.npy
loading shuffle-idx mapping from data/enron/enron_text_document_valid_indexmap_51360ns_2048sl_1234s_shuffle_idx.npy
WARNING: shuffle index length (54205) is not equal to sample index length (54206)
    loaded indexed file in 0.001 seconds
WARNING: shuffle index length (54205) is not equal to sample index length (54206)
total number of samples: 54207
total number of epochs: 8

WARNING: shuffle index length (54205) is not equal to sample index length (54206)
WARNING: shuffle index length (54205) is not equal to sample index length (54206)

loading doc-idx mapping from data/enron/enron_text_document_test_indexmap_160ns_2048sl_1234s_doc_idx.npy
loading sample-idx mapping from data/enron/enron_text_document_test_indexmap_160ns_2048sl_1234s_sample_idx.npy
loading shuffle-idx mapping from data/enron/enron_text_document_test_indexmap_160ns_2048sl_1234s_shuffle_idx.npy
WARNING: shuffle index length (201) is not equal to sample index length (202)
    loaded indexed file in 0.001 seconds
WARNING: shuffle index length (201) is not equal to sample index length (202)

total number of samples: 203
total number of epochs: 2

WARNING: shuffle index length (201) is not equal to sample index length (202)
WARNING: shuffle index length (201) is not equal to sample index length (202)
setting training data start iteration to 30
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 4973.65 | train/valid/test data iterators: 1664.46
training ...
[2021-12-16 16:00:16,672] [INFO] [checkpointing.py:405:forward] Activation Checkpointing Information
[2021-12-16 16:00:16,672] [INFO] [checkpointing.py:406:forward] ----Partition Activations True, CPU CHECKPOINTING False
[2021-12-16 16:00:16,672] [INFO] [checkpointing.py:409:forward] ----contiguous Memory Checkpointing False with 12 total layers
[2021-12-16 16:00:16,673] [INFO] [checkpointing.py:412:forward] ----Synchronization True
[2021-12-16 16:00:16,673] [INFO] [checkpointing.py:413:forward] ----Profiling False
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/mnt/workspace/gpt-neox/megatron/training.py", line 103, in pretrain
    iteration = train(
  File "/mnt/workspace/gpt-neox/megatron/training.py", line 555, in train
    loss_dict, skipped_iter = train_step(
  File "/mnt/workspace/gpt-neox/megatron/training.py", line 462, in train_step
    reduced_loss = train_step_pipe(
  File "/mnt/workspace/gpt-neox/megatron/training.py", line 511, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 305, in train_batch
    self._exec_schedule(sched)
  File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 718, in _exec_backward_pass
    assert self.optimizer is not None, "must provide optimizer during "
AssertionError: must provide optimizer during init in order to use backward
Killing subprocess 6433
Killing subprocess 6434
Killing subprocess 6435
Killing subprocess 6436
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in <module>
    main()
  File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/mnt/workspace/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/mnt/workspace/venv/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true}', '--megatron_config', '{"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 0, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small.yml": "# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 1,\n \"model-parallel-size\": 1,\n\n # model settings\n \"num-layers\": 12,\n \"hidden-size\": 768,\n \"num-attention-heads\": 12,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"no-weight-tying\": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n \"scaled-upper-triang-masked-softmax-fusion\": false,\n \"bias-gelu-fusion\": false,\n\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.0006,\n \"betas\": [0.9, 0.999],\n \"eps\": 1.0e-8,\n }\n },\n \"zero_optimization\": {\n \"stage\": 0,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 500000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 500000000,\n \"contiguous_gradients\": True,\n \"cpu_offload\": False\n },\n\n # batch / data settings\n \"train_micro_batch_size_per_gpu\": 4,\n \"data-impl\": \"mmap\",\n \"split\": \"949,50,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": true,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.5,\n \"hidden-dropout\": 0.0,\n \"attention-dropout\": 0.0,\n\n \"no-load-optim\": true,\n\n # precision 
settings\n \"fp16\": { \n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. training settings\n \"train-iters\": 320000,\n \"lr-decay-iters\": 320000,\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"save-interval\": 10,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 100,\n \"steps_per_print\": 10,\n \"keep-last-n-checkpoints\": 4,\n \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n \"data-path\": \"data/enron/enron_text_document\",\n \n # or for weighted datasets: \n # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"train-data-weights\": [1., 2.],\n # \"test-data-weights\": [2., 1.],\n # \"valid-data-weights\": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n # WARNING: setting this to True will override any user provided weights\n # \"weight_by_num_documents\": false,\n # \"weighted_sampler_alpha\": 0.3,\n\n \"vocab-file\": \"data/gpt2-vocab.json\",\n \"merge-file\": \"data/gpt2-merges.txt\",\n\n \"save\": \"checkpoints\",\n \"load\": \"checkpoints\",\n \"checkpoint_validation_with_forward_pass\": False,\n \n \"tensorboard-dir\": \"tensorboard\",\n \"log-dir\": \"logs\",\n \"use_wandb\": False,\n \"wandb_host\": \"https://api.wandb.ai\",\n \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "save_interval": 10, "no_load_optim": true, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.5, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "Mf6keriq8236v5DsKyvGvG_fxq3a3ah", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4}']' returned non-zero exit status 1.

@StellaAthena
Member

90,000 steps is a lot for a dataset as small as Enron. You're doing multiple epochs and the model is diverging. I recommend setting weight decay to 0.1, or even higher, if you want to train this model to completion. In general, I recommend using weight decay unless you've deduplicated your data.
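
For concreteness, here is what the regularization block of configs/small.yml would look like with that change (a sketch only; 0.1 is a starting point to tune, not a universally correct value):

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0.1,
   "hidden-dropout": 0.0,
   "attention-dropout": 0.0,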

Perhaps this was a bad idea, but the quick start was not particularly intended to be run as-is for all 320,000 steps. That would be nearly 10 epochs on the Enron data, which is quite excessive and unlikely to produce a particularly good model. We used Enron in the demo because it's small and can therefore be downloaded and tokenized quickly, not because it's a good starter dataset.

What would you like to see in the quick start guide? Any advice about how to stop others from falling into this pitfall would be appreciated.

@sdtblck
Contributor

sdtblck commented Dec 17, 2021

This is not a bug - this is completely intended behaviour.

The configs in this repo are not necessarily ready to go: they're there for example purposes, and depending on your dataset and model configuration you'll need to do some tuning. I suggest you have a read of how mixed precision training works.
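
As a rough sketch of the mechanism (simplified Python, not the actual DeepSpeed code): with fp16 dynamic loss scaling, a step whose gradients overflow is skipped and the loss scale is halved, and the scale is only raised again after a window of clean steps. If every step overflows, no parameter updates happen at all, which is exactly what the "Skipped 50 iterations in a row" guard detects.

def dynamic_loss_scale_step(overflowed, scale, good_steps, window=1000, min_scale=1.0):
    """Return (new_scale, new_good_steps, skipped) for one training step."""
    if overflowed:
        # inf/nan gradients: skip the optimizer update and shrink the loss scale
        return max(scale / 2.0, min_scale), 0, True
    good_steps += 1
    if good_steps >= window:
        # a full window of clean steps: try a larger scale again
        return scale * 2.0, 0, False
    return scale, good_steps, False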

Stella made some good points; another thing you can try is changing the dynamic loss scaling parameters.
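
For example (again only a sketch, with illustrative rather than tuned numbers), the relevant knobs all live in the fp16 block of small.yml:

   "fp16": {
     "enabled": true,
     "loss_scale": 0,            # 0 keeps dynamic scaling; a fixed value (e.g. 4096) disables it
     "loss_scale_window": 2000,  # wait longer before trying to raise the scale again
     "hysteresis": 2,
     "min_loss_scale": 1
   },

If the model itself is diverging, though, loss scale settings alone are unlikely to save it, so fixing the regularization first is the more important change.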

@sdtblck sdtblck closed this as completed Dec 17, 2021
@ScTof
Author

ScTof commented Dec 17, 2021

@StellaAthena @pwstegman Thanks for your helpful feedback. The program is currently running with a weight decay of 0.15. So far there are no errors; let's see whether it runs to completion.

@StellaAthena You asked what I would like to see in the quick start. What would be helpful is a pre-made config file that works well with the given dataset and is referenced in the quick start guide. Basically, design the guide so that it can be followed blindly to train a first model.
