Ensure Checkpoint Saving / Loading works correctly #151

Closed

sdtblck opened this issue Mar 2, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@sdtblck
Contributor

sdtblck commented Mar 2, 2021

@ShivanshuPurohit reported some problems with loading from a checkpoint. Someone should take a look and make sure everything works correctly there.

@StellaAthena StellaAthena added the bug Something isn't working label Mar 2, 2021
@MicPie
Contributor

MicPie commented Mar 4, 2021

Minimal adaptations to configs/pretrain_gpt2.yml to get a minimal example that reproduces the checkpoint-loading error:

Change lines https://github.com/EleutherAI/gpt-neox/blob/main/configs/pretrain_gpt2.yml#L27-L28 to:

   "log-interval": 1,
   "save-interval": 1,

The interesting part of the error message:

Traceback (most recent call last):
  File "pretrain_gpt2.py", line 184, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/home/mchorse/gpt-neox/megatron/training.py", line 92, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/home/mchorse/gpt-neox/megatron/training.py", line 295, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/home/mchorse/gpt-neox/megatron/checkpointing.py", line 214, in load_checkpoint
    checkpoint_name, state_dict = model.load_checkpoint(load_dir)
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 1331, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/src/deepspeed/deepspeed/runtime/engine.py", line 1360, in _load_checkpoint
    checkpoint = torch.load(load_path, map_location=lambda storage, loc: storage)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 585, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
[2021-03-04 19:23:56,216] [INFO] [engine.py:1359:_load_checkpoint] rank: 0 loading checkpoint: checkpoints/gpt2_345m_ds/global_step40/mp_rank_00_model_states.pt
RuntimeError: [enforce fail at inline_container.cc:150] . PytorchStreamReader failed reading zip archive: failed finding central directory

This can be reproduced in a Python REPL:

>>> import torch
>>> ckpt = torch.load("mp_rank_00_model_states.pt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 585, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: [enforce fail at inline_container.cc:150] . PytorchStreamReader failed reading zip archive: failed finding central directory
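
The "failed finding central directory" error usually means the .pt file is a truncated zip archive, i.e. the write was cut off before PyTorch finished serializing. As an illustrative sketch (hypothetical file name, not one of the actual checkpoints), truncating a freshly saved file reproduces the same error:

import torch

# Save a tiny state dict, then cut off the end of the file to simulate an interrupted write.
torch.save({"w": torch.zeros(4)}, "demo.pt")
data = open("demo.pt", "rb").read()
with open("demo.pt", "wb") as f:
    f.write(data[: len(data) // 2])  # the zip central directory lives at the end of the file and is now missing

torch.load("demo.pt")  # RuntimeError: ... failed finding central directory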

If I rerun the steps above with line https://github.com/EleutherAI/gpt-neox/blob/main/megatron/checkpointing.py#L139 commented out, I can reload the checkpoint in the REPL.

This seems to affect only the last saved checkpoint; earlier checkpoints can be loaded in the REPL. The number of the last working checkpoint seems to be the one recorded in latest_checkpointed_iteration.txt. This needs further verification, as the problem is not always reproducible.
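
To narrow down which checkpoints are affected, a rough sketch (the glob pattern assumes the checkpoints/gpt2_345m_ds save path from the traceback above; adjust to your own save directory) that tries to load every saved model-states file and flags the ones that fail:

import glob
import torch

# Attempt to deserialize each *_model_states.pt file on CPU and report failures.
for path in sorted(glob.glob("checkpoints/gpt2_345m_ds/global_step*/*_model_states.pt")):
    try:
        torch.load(path, map_location="cpu")
        print("ok     ", path)
    except RuntimeError as err:
        print("corrupt", path, "-", err)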

@sdtblck
Contributor Author

sdtblck commented Mar 4, 2021

I can't reproduce this; it works fine for me.

I have changed

"log-interval": 1,
"save-interval": 1,

to

   "log-interval": 10,
   "save-interval": 10,

to ensure I'm not interrupting the checkpoint-saving process.

Also, on multi-node clusters we should make sure the save and load paths point to the shared disk /mnt/ssd-cluster/...
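
For example (assuming the usual save / load keys in the config; the path below is the one from the log output):

   "save": "/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds",
   "load": "/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds",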

It seems to work fine for me:

10.140.11.7: [2021-03-04 21:02:51,637] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=0 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_00-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:51,723] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=2 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_02-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:51,823] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=3 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_03-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:51,956] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=4 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_04-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:52,070] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=5 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_05-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:52,190] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=6 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_06-model_00-model_states.pt
10.140.11.7: [2021-03-04 21:02:52,308] [INFO] [module.py:565:load_state_dir] RANK=0 Loaded layer=7 file=/mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/layer_07-model_00-model_states.pt
10.140.11.7:  > using checkpoint value 0.00015 for learning rate
10.140.11.7:  > using checkpoint value 1e-05 for minimum learning rate
10.140.11.7:  > using checkpoint value 3200.0 for warmup iterations
10.140.11.7:  > using checkpoint value 320000 for total number of iterations
10.140.11.7:  > using checkpoint value cosine for decay style
10.140.11.7: could not find arguments in the checkpoint ...
10.140.11.7:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_00_model_states.pt
10.140.11.7:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_01_model_states.pt
10.141.250.254:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_03_model_states.pt
10.141.250.254:   successfully loaded /mnt/ssd-cluster/data/checkpoints/gpt2_345m_ds/global_step50/mp_rank_02_model_states.pt

@joshlk
Member

joshlk commented Mar 5, 2021

@MicPie are you using a central directory to store checkpoints? It worked fine when I did this. I have added suggested save locations in PR #158

@MicPie
Contributor

MicPie commented Mar 6, 2021

The fix EleutherAI/DeeperSpeed@12a2480 by sdtblck solved the problem.

(Run pip uninstall deepspeed; pip install -e git+git:https://github.com/EleutherAI/DeeperSpeed.git@cac19a86b67e6e98b9dca37128bc01e50424d9e9#egg=deepspeed to switch to the updated DeeperSpeed version.)
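
A quick way to confirm the editable install is the one being picked up (deepspeed exposes __version__, and __file__ shows where it is imported from):

import deepspeed
print(deepspeed.__version__, deepspeed.__file__)  # should point into the DeeperSpeed checkout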
