Ensure Checkpoint Saving / Loading works correctly #151

Closed

sdtblck opened this issue Mar 2, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@sdtblck
Contributor

sdtblck commented Mar 2, 2021

@ShivanshuPurohit reported some problems with loading from a checkpoint. Someone should take a look and make sure everything works correctly there.

@StellaAthena StellaAthena added the bug Something isn't working label Mar 2, 2021
@MicPie
Contributor

MicPie commented Mar 4, 2021

Minimal adaptations to configs/pretrain_gpt2.yml to get a minimal example that reproduces the checkpoint-loading error:

Change lines https://github.com/EleutherAI/gpt-neox/blob/main/configs/pretrain_gpt2.yml#L27-L28 to:

   "log-interval": 1,
   "save-interval": 1,

The interesting part of the error message:

Traceback (most recent call last):
  File "pretrain_gpt2.py", line 184, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/home/mchorse/gpt-neox/megatron/training.py", line 92, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/home/mchorse/gpt-neox/megatron/training.py", line 295, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/home/mchorse/gpt-neox/megatron/checkpointing.py", line 214, in load_checkpoint
    checkpoint_name, state_dict = model.load_checkpoint(load_dir)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 1331, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 1360, in _load_checkpoint
    checkpoint = torch.load(load_path, map_location=lambda storage, loc: storage)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 585, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
[2021-03-04 19:23:56,216] [INFO] [engine.py:1359:_load_checkpoint] rank: 0 loading checkpoint: checkpoints/gpt2_345m_ds/global_step40/mp_rank_00_model_states.pt
RuntimeError: [enforce fail at inline_container.cc:150] . PytorchStreamReader failed reading zip archive: failed finding central directory

This can be reproduced in a Python REPL:

>>> import torch
>>> ckpt = torch.load("mp_rank_00_model_states.pt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 585, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at inline_container.cc:150] . PytorchStreamReader failed reading zip archive: failed finding central directory
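
The "failed finding central directory" error usually means the .pt file is a truncated zip archive, i.e. the write was cut off before PyTorch finished serializing. As an illustrative sketch (hypothetical file name, not one of the actual checkpoints), truncating a freshly saved file reproduces the same error:

import torch

# Save a tiny state dict, then cut off the end of the file to simulate an interrupted write.
torch.save({"w": torch.zeros(4)}, "demo.pt")
data = open("demo.pt", "rb").read()
with open("demo.pt", "wb") as f:
    f.write(data[: len(data) // 2])  # the zip central directory lives at the end of the file and is now missing

torch.load("demo.pt")  # RuntimeError: ... failed finding central directory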

If I rerun the steps above with line https://github.com/EleutherAI/gpt-neox/blob/main/megatron/checkpointing.py#L139 commented out, I can reload the checkpoint in the REPL.

This seems to affect only the last saved checkpoint; earlier checkpoints can be loaded in the REPL. The number of the last working checkpoint seems to be the one recorded in latest_checkpointed_iteration.txt. This needs further verification, as the problem is not always reproducible.
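
To narrow down which checkpoints are affected, a rough sketch (the glob pattern assumes the checkpoints/gpt2_345m_ds save path from the traceback above; adjust to your own save directory) that tries to load every saved model-states file and flags the ones that fail:

import glob
import torch

# Attempt to deserialize each *_model_states.pt file on CPU and report failures.
for path in sorted(glob.glob("checkpoints/gpt2_345m_ds/global_step*/*_model_states.pt")):
    try:
        torch.load(path, map_location="cpu")
        print("ok     ", path)
    except RuntimeError as err:
        print("corrupt", path, "-", err)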

@sdtblck
Contributor Author

sdtblck commented Mar 4, 2021

I can't reproduce this; it works fine for me.

I have changed

"log-interval": 1,
"save-interval": 1,

to

   "log-interval": 10,
   "save-interval": 10,

to ensure I'm not interrupting the checkpoint-saving process.

Also, on multi-node clusters we should make sure the save and load paths point to the shared disk /mnt/ssd-cluster/...
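
For example (assuming the usual save / load keys in the config; the path below is the one from the log output):

   "save": "/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds",
   "load": "/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds",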

It seems to work fine for me:

10.140.11.7: [2021-03-04 21:02:51,637] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=0 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_00-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:51,723] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=2 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_02-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:51,823] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=3 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_03-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:51,956] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=4 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_04-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:52,070] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=5 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_05-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:52,190] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=6 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_06-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:52,308] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=7 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_07-model_00-model_states.pt
10.140.11.7:  > using checkpoint value 0.00015 for learning rate
10.140.11.7:  > using checkpoint value 1e-05 for minimum learning rate
10.140.11.7:  > using checkpoint value 3200.0 for warmup iterations
10.140.11.7:  > using checkpoint value 320000 for total number of iterations
10.140.11.7:  > using checkpoint value cosine for decay style
10.140.11.7: could not find arguments in the checkpoint ...
10.140.11.7:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_00_model_states.pt
10.140.11.7:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_01_model_states.pt
10.141.250.254:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_03_model_states.pt
10.141.250.254:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_02_model_states.pt

@joshlk
Member

joshlk commented Mar 5, 2021

@MicPie are you using a central directory to store checkpoints? It worked fine when I did this. I have added suggested save locations in PR #158

@MicPie
Contributor

MicPie commented Mar 6, 2021

The fix EleutherAI/DeeperSpeed@12a2480 by sdtblck solved the problem.

(Run pip uninstall deepspeed; pip install -e git+git:https://github.com/EleutherAI/DeeperSpeed.git@cac19a86b67e6e98b9dca37128bc01e50424d9e9#egg=deepspeed to switch to the updated DeeperSpeed version.)
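
A quick way to confirm the editable install is the one being picked up (deepspeed exposes __version__, and __file__ shows where it is imported from):

import deepspeed
print(deepspeed.__version__, deepspeed.__file__)  # should point into the DeeperSpeed checkout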
