Take ZeRO 3 for a test drive #171
Comments
The code is implemented, but I'm running into some compatibility issues due to our mods. They shouldn’t be hard to resolve.
Note: Do not use scientific notation in the config file. The config monster parses it as a string instead of a number (cc: @joshlk).
Current error is
I thought that this was connected to the line
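Regarding the scientific-notation note in this comment: one possible workaround, sketched here under assumptions, is to coerce such strings back to floats after the config has been loaded. The helper name `coerce_scientific_strings` and the example keys (`optimizer`, `lr`, `betas`, `name`) are hypothetical and not taken from this codebase; how the config monster actually parses values is not shown in this thread.

```python
import re

# Matches exponent-form numbers such as "1e-4", "6E5", or "-2.5e+3".
# Plain words that merely contain an "e" (for example "e3") do not match.
_SCI_NOTATION = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)[eE][+-]?\d+$")


def coerce_scientific_strings(value):
    """Recursively convert strings like "1e-4" to floats; leave other values alone.

    Hypothetical helper: a post-processing step applied to an already-parsed
    config (nested dicts and lists), not part of any existing parser.
    """
    if isinstance(value, dict):
        return {k: coerce_scientific_strings(v) for k, v in value.items()}
    if isinstance(value, list):
        return [coerce_scientific_strings(v) for v in value]
    if isinstance(value, str) and _SCI_NOTATION.match(value.strip()):
        return float(value)
    return value


# Example config with a learning rate that was (wrongly) parsed as a string.
config = {
    "optimizer": {"params": {"lr": "6e-4", "betas": [0.9, 0.999]}},
    "name": "e3",  # an ordinary string that must stay a string
}
fixed = coerce_scientific_strings(config)
```

The simpler fix is still the one in the note: write `0.0006` rather than `6e-4` in the config file, so no post-processing is needed.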
Further testing indicates that this is a pipeline parallelism problem:
Okay, so the problem with turning PP off is that this line returns
Compared with the MSFT repository, I see that they always use weight tying, which avoids this problem entirely. This portion of our code is otherwise the same. Turning off weight tying for us produces a new error
I haven't looked into this at all, but it seems like this might be fixed by forcing weight tying to be on in a less ad hoc fashion.
We made significant changes to the codebase and MSFT made significant changes to DS, so I am effectively starting from scratch. Wheeeeeeee
We are deprioritizing ZeRO 3 based on feedback from NVIDIA. This may be implemented in the future, but not for a while, I suspect.
Is your feature request related to a problem? Please describe.
Model too smol
Describe the solution you'd like
DeepSpeed ZeRO-3 is finally public! Let’s take it for a test drive (remember to turn pipeline parallelism off) and see if we can get it to run.
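As a rough illustration of the "turn pipeline parallelism off" reminder, here is a minimal sketch of a pre-flight check on a config dict. It assumes the DeepSpeed-style `zero_optimization.stage` key for the ZeRO stage; the `pipe_parallel_size` key and the convention that `0` means "off" are assumptions made for this example, and `check_zero3_config` is a hypothetical helper, not an existing function.

```python
def check_zero3_config(config):
    """Return True if the config requests ZeRO stage 3 with pipeline parallelism off.

    Raises ValueError when ZeRO stage 3 is combined with pipeline parallelism.
    Assumption: "pipe_parallel_size" is a hypothetical key where 0 means "off".
    """
    zero_stage = config.get("zero_optimization", {}).get("stage", 0)
    pp_size = config.get("pipe_parallel_size", 0)
    if zero_stage == 3 and pp_size > 0:
        raise ValueError(
            "ZeRO stage 3 requested with pipeline parallelism on "
            f"(pipe_parallel_size={pp_size}); turn pipeline parallelism off first."
        )
    return zero_stage == 3


# A config that follows the reminder: stage 3, pipeline parallelism off.
ok_config = {"zero_optimization": {"stage": 3}, "pipe_parallel_size": 0}
is_zero3 = check_zero3_config(ok_config)
```

Failing early like this would only catch the config mistake; it says nothing about the compatibility issues discussed in the comments.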
Describe alternatives you've considered
Use Pipeline Parallelism
Additional context
It probably doesn’t work or needs to be modded to work because DeepSpeed :works-internally: