
The main branch is broken. #240

Closed
StellaAthena opened this issue Apr 18, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@StellaAthena
Member

StellaAthena commented Apr 18, 2021

main seems to be broken.

Running with pp=0 has a memory leak and produces the error:

File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1473, in _load_checkpoint
File "/home/mchorse/gpt-neox/megatron/model/gpt2_model.py", line 150, in load_state_dict
strict=load_module_strict)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1374, in load_module_state_dict
if self._language_model_key in state_dict:
TypeError: argument of type 'NoneType' is not iterable
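The TypeError means the state_dict reaching the membership check was None before `in` ran, i.e. the checkpoint load produced nothing for this rank. A minimal sketch of the failure mode and a defensive guard (the function name and key below are stand-ins, not the actual DeepSpeed signature):

```python
def load_module_state_dict(state_dict, key="language_model"):
    """Hypothetical sketch of the failing check, with a None guard added."""
    if state_dict is None:
        # `key in None` is exactly what raises:
        # TypeError: argument of type 'NoneType' is not iterable
        return False
    return key in state_dict
```

With the guard, a missing checkpoint degrades to a "not found" result instead of a crash.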

Running with pp > 0 produces the error:

Traceback (most recent call last):
File "pretrain_gpt2.py", line 182, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'}, extra_args_provider=neox_args)
File "/home/mchorse/gpt-neox/megatron/training.py", line 91, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
File "/home/mchorse/gpt-neox/megatron/training.py", line 274, in setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
File "/home/mchorse/gpt-neox/megatron/checkpointing.py", line 242, in load_checkpoint
checkpoint_name, state_dict = model.load_checkpoint(load_dir)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1441, in load_checkpoint
load_lr_scheduler_states=load_lr_scheduler_states)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1473, in _load_checkpoint
strict=load_module_strict)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/pipe/engine.py", line 1123, in load_module_state_dict
self.module.load_state_dir(load_dir=self._curr_ckpt_path, strict=strict)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/pipe/module.py", line 563, in load_state_dir
strict=strict)
File "/home/mchorse/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ParallelTransformerLayerPipe:
Missing key(s) in state_dict: "input_layernorm.scale", "post_attention_layernorm.scale".
Unexpected key(s) in state_dict: "input_layernorm.weight", "input_layernorm.bias", "post_attention_layernorm.weight", "post_attention_layernorm.bias".
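The missing/unexpected pairs suggest the checkpoint was saved with layernorms parameterized as weight/bias, while the current module expects a scale-only parameter. For illustration only (a hypothetical remapping sketch, not the actual resolution; the key names come straight from the error message above):

```python
def remap_layernorm_keys(state_dict):
    """Rename the two layernorms' .weight -> .scale and drop their biases,
    so an old-format checkpoint matches a scale-only layernorm module."""
    weight_keys = ("input_layernorm.weight", "post_attention_layernorm.weight")
    bias_keys = ("input_layernorm.bias", "post_attention_layernorm.bias")
    out = {}
    for key, value in state_dict.items():
        if key in weight_keys:
            out[key.replace(".weight", ".scale")] = value
        elif key in bias_keys:
            continue  # the target module has no bias parameter for these norms
        else:
            out[key] = value
    return out
```

Dropping the bias tensors silently loses information, so a remap like this is only safe if the new layernorm really has no bias term.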

Also, we should really add apex to the Docker image...

I'm going to sleep and won't have time to work on this tomorrow, as I'm helping my parents move. Maybe Monday?

@StellaAthena StellaAthena added the bug Something isn't working label Apr 18, 2021
@sdtblck
Contributor

sdtblck commented Apr 18, 2021

False alarm ⏰

@sdtblck sdtblck closed this as completed Apr 18, 2021