Add checkpoint saving / loading #90
Conversation
checkpoint_dirs = natural_sort(checkpoint_dir) rather than natural_sort(checkpoint_dirs)
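For context, here is a minimal sketch of the pattern this comment is about. Only the names `natural_sort`, `checkpoint_dir`, and `checkpoint_dirs` come from the comment; the helper body and directory layout below are assumptions for illustration, not the PR's actual code. The point of a natural sort here is that checkpoint folders are ordered by their embedded step number rather than lexicographically, and it must be applied to the list of entries, not to the parent path string:

```python
import os
import re

def natural_sort(items):
    # Order strings so embedded integers compare numerically,
    # e.g. "global_step9" before "global_step10" (a plain sort
    # would put "global_step10" first).
    def key(s):
        return [int(tok) if tok.isdigit() else tok.lower()
                for tok in re.split(r"(\d+)", s)]
    return sorted(items, key=key)

checkpoint_dir = "checkpoints"  # hypothetical parent path
# Sort the list of entries (checkpoint_dirs), not the path string
# (checkpoint_dir) -- the mix-up flagged in the review comment.
checkpoint_dirs = natural_sort(os.listdir(checkpoint_dir))
latest = checkpoint_dirs[-1] if checkpoint_dirs else None
```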
I have resolved all of the conflicts but wish to test this again before approving to make sure I didn't fuck anything up.
This appears to run on one node on the server but not on multiple nodes. The error was:
I double checked that my conflict resolution didn't mess anything up, and besides deleting some blank lines and changing the batch size, the only change to the branch was the dropped comma that I fixed in 6b2c439.
Per my DeepSpeed issue yesterday, I was told by a dev (microsoft/DeepSpeed#690 (comment)) that the error I was facing was due to an incorrect keyword.
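The thread doesn't quote the keyword in question, so as context only, here is a sketch of the DeepSpeed engine-level checkpoint calls that a PR like this typically wraps. The directory name, tag, and `client_state` contents below are hypothetical; `model_engine` is the engine returned by `deepspeed.initialize()`:

```python
# "checkpoints", the tag, and the client_state values are made up here.

# Save: tag names the checkpoint subdirectory; client_state carries
# arbitrary extra metadata to be restored on load.
model_engine.save_checkpoint("checkpoints",
                             tag="global_step1000",
                             client_state={"iteration": 1000})

# Load: returns (load_path, client_state); load_path is None when no
# checkpoint is found. A mistyped keyword argument on a call like this
# surfaces as a TypeError, consistent with the kind of keyword mistake
# described above.
load_path, client_state = model_engine.load_checkpoint(
    "checkpoints", tag="global_step1000")
```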
Further testing revealed the following:

enwik8:
GPT-3 Small
[per-run results not recoverable]

EDIT: Errors that are struck through have been fixed.
The above is with ZeRO 1. While I haven't rigorously tested with ZeRO 2, gpt3small OOM'd on one node with ZeRO 2.
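For reference, the ZeRO stage is selected in the DeepSpeed config. A minimal, hypothetical fragment (the values are illustrative, not this PR's actual settings):

```python
# Hypothetical DeepSpeed config fragment; values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        # Stage 1 partitions optimizer states across data-parallel
        # ranks; stage 2 additionally partitions gradients. The runs
        # above used stage 1; the stage-2 run OOM'd on one node.
        "stage": 1,
    },
}
```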
Multinode training is broken, but we don’t think it’s the fault of this code. @leogao2 is looking into doing a continuous integration Docker environment which he thinks will solve the problem. In any event, this shouldn’t break anything worse.
does what it says on the tin
will require #89 to be merged first