Add checkpoint saving / loading #90

Merged
merged 22 commits into main from save_checkpoints on Jan 28, 2021

Conversation

@sdtblck (Contributor) commented Jan 24, 2021

does what it says on the tin

will require #89 to be merged first
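
For readers skimming this thread, here is a minimal sketch of what generic checkpoint saving and loading look like with the DeepSpeed engine API, which the repo trains through; it is not necessarily what this PR implements, and the directory name, tag, and client_state contents are illustrative assumptions.

```python
import deepspeed

# Sketch only: generic DeepSpeed checkpointing, not necessarily this PR's exact code.
# `model` and `ds_config` (a DeepSpeed config dict) are assumed to exist already.
model_engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)

# Save: writes model/optimizer state plus arbitrary client state under <save_dir>/<tag>/.
model_engine.save_checkpoint("checkpoints",                      # assumed directory
                             tag="global_step1000",              # assumed tag
                             client_state={"iteration": 1000})   # extra state worth restoring

# Load: returns (load_path, client_state); load_path is None if no checkpoint was found.
load_path, client_state = model_engine.load_checkpoint("checkpoints", tag="global_step1000")
if load_path is not None:
    start_iteration = client_state.get("iteration", 0)
```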

@StellaAthena (Member) commented:

I have resolved all of the conflicts but wish to test this again before approving to make sure I didn't fuck anything up.

@StellaAthena (Member) commented Jan 25, 2021

This appears to run on a single node on the server, but not on multiple nodes. Both nodes in the multi-node run (10.140.23.126 and 10.141.113.174) raised the same error:

    super().__init__(layers=spec, loss_fn=loss_fn, num_stages=num_stages, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 177, in __init__
    self._partition_layers(method=partition_method)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 363, in _partition_layers
    param_counts = self._count_layer_params()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 262, in _count_layer_params
    l = layer.build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 68, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/app/gpt_neox/gpt_neox.py", line 211, in __init__
    self.token_emb = nn.Embedding(num_tokens, dim)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 109, in __init__
    self.weight = Parameter(torch.Tensor(num_embeddings, embedding_dim))
TypeError: new() received an invalid combination of arguments - got (NoneType, int), but expected one of:
 * (*, torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!, !int!)
 * (torch.Storage storage)
 * (Tensor other)
 * (tuple of ints size, *, torch.device device)
 * (object data, *, torch.device device)

I double-checked that my conflict resolution didn't mess anything up, and besides deleting some blank lines and changing the batch size, the only change to the branch was the dropped comma that I fixed in 6b2c439.

Per my DeepSpeed issue from yesterday, a dev told me (microsoft/DeepSpeed#690 (comment)) that the error I was facing was due to the incorrect keyword.
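
For context on the TypeError in the log above: torch.nn.Embedding fails with exactly this kind of message when its num_embeddings argument arrives as None, which matches the NoneType in the error. A minimal sketch (the dim value and the real vocabulary size are made up):

```python
import torch.nn as nn

dim = 512          # hypothetical embedding dimension
num_tokens = None  # what gpt_neox.py apparently received on the multi-node run

try:
    # Same call as the failing line: self.token_emb = nn.Embedding(num_tokens, dim)
    nn.Embedding(num_tokens, dim)
except TypeError as e:
    # e.g. "new() received an invalid combination of arguments - got (NoneType, int), ..."
    print(e)

# With a real vocabulary size the layer constructs fine.
token_emb = nn.Embedding(50257, dim)
```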
@StellaAthena (Member) commented Jan 26, 2021

Further testing revealed the following:

enwik8:

  • sh scripts/train_enwiki8.sh throws the error AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size: 1028 != 128 * 1 * 8 when run on one node (see the batch-size sketch after this comment). <- This was a config issue.
  • sh scripts/train_enwiki8.sh throws the error AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size: 1028 != 32 * 1 * 32 when run on multiple nodes.
  • sh scripts/train_enwiki8_pipeline.sh runs on one node.
  • sh scripts/train_enwik8_pipeline.sh throws the error RuntimeError: CUDA error: invalid device ordinal when run on multiple nodes.

GPT-3 Small

  • sh scripts/train_gpt3small.sh requires the installation of the cupy (not a problem) and mpi4py (a problem) packages when run on one node. <- This was an issue caused by 1-Bit Adam
  • sh scripts/train_gpt3small.sh hangs when run on multiple nodes.
  • sh scripts/train_gpt3small_pipeline.sh runs on one node.
  • sh scripts/train_gpt3small_pipeline.sh hangs on multiple nodes.

EDIT: Errors that are struck through have been fixed.
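
On the batch-size assertion above: DeepSpeed enforces train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size. A sketch of that arithmetic with the numbers from the single-node error; the config dict is illustrative, not the repo's actual file.

```python
# DeepSpeed's batch-size consistency check, spelled out with the numbers from the
# single-node error above (the config dict is illustrative, not the repo's actual file).
ds_config = {
    "train_batch_size": 1028,               # what the config asked for
    "train_micro_batch_size_per_gpu": 128,  # micro_batch_per_gpu in the error message
    "gradient_accumulation_steps": 1,
}
world_size = 8                              # one node with 8 GPUs

expected = (ds_config["train_micro_batch_size_per_gpu"]
            * ds_config["gradient_accumulation_steps"]
            * world_size)                   # 128 * 1 * 8 = 1024

if ds_config["train_batch_size"] != expected:
    # This is the condition behind "AssertionError: Check batch related parameters. ..."
    # The multi-node failure is the same check with world_size 32: 1028 != 32 * 1 * 32.
    print(f"{ds_config['train_batch_size']} != {expected}")    # prints: 1028 != 1024
```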

@StellaAthena (Member) commented:

The above is with ZeRO 1. While I haven't rigorously tested with ZeRO 2, gpt3small OOM'd on one node with ZeRO 2.
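
For reference, the ZeRO stage being compared here is just one field in the DeepSpeed config; an illustrative fragment (not the repo's actual settings):

```python
# Illustrative only: the ZeRO stage is a single field in the DeepSpeed config.
# Stage 1 partitions optimizer state across data-parallel ranks; stage 2 additionally
# partitions gradients. The gpt3small OOM above was observed with the stage-2 setting.
zero1_config = {"zero_optimization": {"stage": 1}}
zero2_config = {"zero_optimization": {"stage": 2}}
```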

@StellaAthena (Member) left a review comment:

Multinode training is broken, but we don't think it's the fault of this code. @leogao2 is looking into setting up a continuous-integration Docker environment, which he thinks will solve the problem. In any event, this shouldn't make anything worse.

@StellaAthena merged commit 4aee002 into main on Jan 28, 2021
1T or BUST automation moved this from In progress to Done on Jan 28, 2021
@StellaAthena deleted the save_checkpoints branch on January 28, 2021 at 23:03
Successfully merging this pull request may close these issues.

Pipeline parallelism and gradient checkpointing (edit: and ZeRO 2!) don’t work together