
The main branch is broken. #240

Closed
StellaAthena opened this issue Apr 18, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@StellaAthena
Member

StellaAthena commented Apr 18, 2021

main seems to be broken.

Running with pp=0 has a memory leak and produces the error:

File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1473, in _load_checkpoint
File "/home/mchorse/gpt-neox/megatron/model/gpt2_model.py", line 150, in load_state_dict
strict=load_module_strict)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1374, in load_module_state_dict
if self._language_model_key in state_dict:
TypeError: argument of type 'NoneType' is not iterable
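The TypeError means the state_dict reaching the membership check was None before `in` ran, i.e. the checkpoint load produced nothing for this rank. A minimal sketch of the failure mode and a defensive guard (the function name and key below are stand-ins, not the actual DeepSpeed signature):

```python
def load_module_state_dict(state_dict, key="language_model"):
    """Hypothetical sketch of the failing check, with a None guard added."""
    if state_dict is None:
        # `key in None` is exactly what raises:
        # TypeError: argument of type 'NoneType' is not iterable
        return False
    return key in state_dict
```

With the guard, a missing checkpoint degrades to a "not found" result instead of a crash.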

Running with pp > 0 produces the error:

Traceback (most recent call last):
File "pretrain_gpt2.py", line 182, in <module>
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'}, extra_args_provider=neox_args)
File "/home/mchorse/gpt-neox/megatron/training.py", line 91, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
File "/home/mchorse/gpt-neox/megatron/training.py", line 274, in setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
File "/home/mchorse/gpt-neox/megatron/checkpointing.py", line 242, in load_checkpoint
checkpoint_name, state_dict = model.load_checkpoint(load_dir)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1441, in load_checkpoint
load_lr_scheduler_states=load_lr_scheduler_states)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 1473, in _load_checkpoint
strict=load_module_strict)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/pipe/engine.py", line 1123, in load_module_state_dict
self.module.load_state_dir(load_dir=self._curr_ckpt_path, strict=strict)
File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/pipe/module.py", line 563, in load_state_dir
strict=strict)
File "/home/mchorse/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ParallelTransformerLayerPipe:
Missing key(s) in state_dict: "input_layernorm.scale", "post_attention_layernorm.scale".
Unexpected key(s) in state_dict: "input_layernorm.weight", "input_layernorm.bias", "post_attention_layernorm.weight", "post_attention_layernorm.bias".
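The missing/unexpected pairs suggest the checkpoint was saved with layernorms parameterized as weight/bias, while the current module expects a scale-only parameter. For illustration only (a hypothetical remapping sketch, not the actual resolution; the key names come straight from the error message above):

```python
def remap_layernorm_keys(state_dict):
    """Rename the two layernorms' .weight -> .scale and drop their biases,
    so an old-format checkpoint matches a scale-only layernorm module."""
    weight_keys = ("input_layernorm.weight", "post_attention_layernorm.weight")
    bias_keys = ("input_layernorm.bias", "post_attention_layernorm.bias")
    out = {}
    for key, value in state_dict.items():
        if key in weight_keys:
            out[key.replace(".weight", ".scale")] = value
        elif key in bias_keys:
            continue  # the target module has no bias parameter for these norms
        else:
            out[key] = value
    return out
```

Dropping the bias tensors silently loses information, so a remap like this is only safe if the new layernorm really has no bias term.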

Also, we should really add apex to the Docker image...

I'm going to sleep and won't have time to work on this tomorrow, as I'm helping my parents move. Maybe Monday?

@StellaAthena StellaAthena added the bug Something isn't working label Apr 18, 2021
@sdtblck
Contributor

sdtblck commented Apr 18, 2021

False alarm ⏰

@sdtblck sdtblck closed this as completed Apr 18, 2021