Ensure Checkpoint Saving / Loading works correctly #151
Minimal adaptation: change the lines at https://github.com/EleutherAI/gpt-neox/blob/main/configs/pretrain_gpt2.yml#L27-L28 to:
The interesting part of the error message:
This can be reproduced in a Python REPL:
If I rerun the steps above with line https://github.com/EleutherAI/gpt-neox/blob/main/megatron/checkpointing.py#L139 commented out, I can reload the checkpoint in the REPL. This seems to affect only the last saved checkpoint; earlier checkpoints can be loaded in the REPL. The number of the last working checkpoint seems to be saved in the file
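The failure mode described above (only the most recently written checkpoint is unloadable) is consistent with a save that was cut off partway through, leaving a truncated file on disk. A minimal sketch of that behavior, using plain `pickle` as a stand-in for the actual torch/DeepSpeed serialization (the dict contents here are made up for illustration):

```python
import pickle

# a toy "checkpoint" payload; the keys and values are hypothetical
state = {"step": 100, "weights": [0.1] * 8}
data = pickle.dumps(state)

# simulate an interrupted save: only part of the bytes reach disk
truncated = data[: len(data) // 2]

try:
    pickle.loads(truncated)
except Exception as e:
    # loading the truncated stream fails, while the full stream is fine
    print("truncated load failed:", type(e).__name__)

restored = pickle.loads(data)
print("full load ok, step =", restored["step"])
```

This is only an analogy for why the last checkpoint alone would be corrupt while earlier ones load fine; the real error in this issue comes from the torch checkpoint loader.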
I can't reproduce this; for me it works fine. I have changed
to
to ensure I'm not interrupting the checkpoint saving process. And on multi-node clusters we should ensure we change
It seems to work fine for me:
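If interrupting the save is indeed what corrupts the checkpoint, one common defense is to write the file atomically: write to a temporary file in the same directory, then rename it over the target, so an interrupted run never leaves a half-written checkpoint under the final name. A minimal sketch (this is not the gpt-neox/DeepSpeed implementation; the function name and use of `pickle` are illustrative):

```python
import os
import pickle
import tempfile

def save_checkpoint_atomically(state, path):
    """Write `state` to `path` so that a crash mid-save cannot
    leave a truncated file at `path`."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit the disk
        # atomic on POSIX: readers see either the old file or the new one
        os.replace(tmp, path)
    except BaseException:
        os.remove(tmp)  # clean up the partial temporary file
        raise
```

`os.replace` is atomic within a filesystem on POSIX, which is why the rename-over pattern works; on a shared filesystem in a multi-node cluster the same idea applies but the rename guarantees depend on the filesystem.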
The fix EleutherAI/DeeperSpeed@12a2480 by @sdtblck solved the problem.
@ShivanshuPurohit reported some problems with loading from a checkpoint. Someone should take a look and make sure everything works correctly there.