Add checkpoint saving / loading #90

Merged
merged 22 commits into main from save_checkpoints on Jan 28, 2021

Conversation

@sdtblck (Contributor) commented Jan 24, 2021

does what it says on the tin

will require #89 to be merged first
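
For readers skimming this thread, here is a minimal sketch of what generic checkpoint saving and loading look like with the DeepSpeed engine API, which the repo trains through; it is not necessarily what this PR implements, and the directory name, tag, and client_state contents are illustrative assumptions.

```python
import deepspeed

# Sketch only: generic DeepSpeed checkpointing, not necessarily this PR's exact code.
# `model` and `ds_config` (a DeepSpeed config dict) are assumed to exist already.
model_engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)

# Save: writes model/optimizer state plus arbitrary client state under <save_dir>/<tag>/.
model_engine.save_checkpoint("checkpoints",                      # assumed directory
                             tag="global_step1000",              # assumed tag
                             client_state={"iteration": 1000})   # extra state worth restoring

# Load: returns (load_path, client_state); load_path is None if no checkpoint was found.
load_path, client_state = model_engine.load_checkpoint("checkpoints", tag="global_step1000")
if load_path is not None:
    start_iteration = client_state.get("iteration", 0)
```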

@StellaAthena (Member) commented:

I have resolved all of the conflicts but wish to test this again before approving to make sure I didn't fuck anything up.

@StellaAthena (Member) commented Jan 25, 2021

This appears to run on a single node on the server, but not on multiple nodes. Both nodes in the multi-node run (10.140.23.126 and 10.141.113.174) raised the same error:

    super().__init__(layers=spec, loss_fn=loss_fn, num_stages=num_stages, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 177, in __init__
    self._partition_layers(method=partition_method)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 363, in _partition_layers
    param_counts = self._count_layer_params()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 262, in _count_layer_params
    l = layer.build()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/module.py", line 68, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/app/gpt_neox/gpt_neox.py", line 211, in __init__
    self.token_emb = nn.Embedding(num_tokens, dim)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/sparse.py", line 109, in __init__
    self.weight = Parameter(torch.Tensor(num_embeddings, embedding_dim))
TypeError: new() received an invalid combination of arguments - got (NoneType, int), but expected one of:
 * (*, torch.device device)
      didn't match because some of the arguments have invalid types: (!NoneType!, !int!)
 * (torch.Storage storage)
 * (Tensor other)
 * (tuple of ints size, *, torch.device device)
 * (object data, *, torch.device device)

I double-checked that my conflict resolution didn't mess anything up, and besides deleting some blank lines and changing the batch size, the only change to the branch was the dropped comma that I fixed in 6b2c439.

Per my DeepSpeed issue from yesterday, a dev told me (microsoft/DeepSpeed#690 (comment)) that the error I was facing was due to the incorrect keyword.
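
For context on the TypeError in the log above: torch.nn.Embedding fails with exactly this kind of message when its num_embeddings argument arrives as None, which matches the NoneType in the error. A minimal sketch (the dim value and the real vocabulary size are made up):

```python
import torch.nn as nn

dim = 512          # hypothetical embedding dimension
num_tokens = None  # what gpt_neox.py apparently received on the multi-node run

try:
    # Same call as the failing line: self.token_emb = nn.Embedding(num_tokens, dim)
    nn.Embedding(num_tokens, dim)
except TypeError as e:
    # e.g. "new() received an invalid combination of arguments - got (NoneType, int), ..."
    print(e)

# With a real vocabulary size the layer constructs fine.
token_emb = nn.Embedding(50257, dim)
```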
@StellaAthena (Member) commented Jan 26, 2021

Further testing revealed the following:

enwik8:

  • sh scripts/train_enwiki8.sh throws the error AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size: 1028 != 128 * 1 * 8 when run on one node (see the batch-size sketch after this comment). <- This was a config issue.
  • sh scripts/train_enwiki8.sh throws the error AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size: 1028 != 32 * 1 * 32 when run on multiple nodes.
  • sh scripts/train_enwiki8_pipeline.sh runs on one node.
  • sh scripts/train_enwik8_pipeline.sh throws the error RuntimeError: CUDA error: invalid device ordinal when run on multiple nodes.

GPT-3 Small

  • sh scripts/train_gpt3small.sh requires the installation of the cupy (not a problem) and mpi4py (a problem) packages when run on one node. <- This was an issue caused by 1-Bit Adam
  • sh scripts/train_gpt3small.sh hangs when run on multiple nodes.
  • sh scripts/train_gpt3small_pipeline.sh runs on one node.
  • sh scripts/train_gpt3small_pipeline.sh hangs on multiple nodes.

EDIT: Errors that are struck through have been fixed.
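
On the batch-size assertion above: DeepSpeed enforces train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size. A sketch of that arithmetic with the numbers from the single-node error; the config dict is illustrative, not the repo's actual file.

```python
# DeepSpeed's batch-size consistency check, spelled out with the numbers from the
# single-node error above (the config dict is illustrative, not the repo's actual file).
ds_config = {
    "train_batch_size": 1028,               # what the config asked for
    "train_micro_batch_size_per_gpu": 128,  # micro_batch_per_gpu in the error message
    "gradient_accumulation_steps": 1,
}
world_size = 8                              # one node with 8 GPUs

expected = (ds_config["train_micro_batch_size_per_gpu"]
            * ds_config["gradient_accumulation_steps"]
            * world_size)                   # 128 * 1 * 8 = 1024

if ds_config["train_batch_size"] != expected:
    # This is the condition behind "AssertionError: Check batch related parameters. ..."
    # The multi-node failure is the same check with world_size 32: 1028 != 32 * 1 * 32.
    print(f"{ds_config['train_batch_size']} != {expected}")    # prints: 1028 != 1024
```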

@StellaAthena (Member) commented:

The above is with ZeRO 1. While I haven't rigorously tested with ZeRO 2, gpt3small OOM'd on one node with ZeRO 2.
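
For reference, the ZeRO stage being compared here is just one field in the DeepSpeed config; an illustrative fragment (not the repo's actual settings):

```python
# Illustrative only: the ZeRO stage is a single field in the DeepSpeed config.
# Stage 1 partitions optimizer state across data-parallel ranks; stage 2 additionally
# partitions gradients. The gpt3small OOM above was observed with the stage-2 setting.
zero1_config = {"zero_optimization": {"stage": 1}}
zero2_config = {"zero_optimization": {"stage": 2}}
```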

@StellaAthena (Member) left a review comment:

Multinode training is broken, but we don't think it's the fault of this code. @leogao2 is looking into setting up a continuous-integration Docker environment, which he thinks will solve the problem. In any event, this shouldn't make anything worse.

@StellaAthena merged commit 4aee002 into main on Jan 28, 2021
1T or BUST automation moved this from In progress to Done on Jan 28, 2021
@StellaAthena deleted the save_checkpoints branch on January 28, 2021 at 23:03
Successfully merging this pull request may close these issues.

Pipeline parallelism and gradient checkpointing (edit: and ZeRO 2!) don’t work together