Weight mismatch when using DeepSpeed ZeRO Stage-3 and a pretrained CodeGen model #22017
Comments
Hey, there is something wrong indeed:
Ah, sorry, you were right to ignore the mismatches! Yes, there is a special argument for initialising your model with DeepSpeed in transformers, but it does not support DeepSpeed ZeRO Stage 3. The documentation mentions this.
For non-HF-Trainer integration, please see:
I fixed your program so that it works:
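The fixed script itself was not captured in this thread, so the following is only a minimal sketch of the documented non-Trainer ZeRO-3 pattern such a fix would follow. The checkpoint name `Salesforce/codegen-350M-mono`, the optimizer choice, and the `ds_config` values are assumptions rather than values from the original script, and the `HfDeepSpeedConfig` import path is the one used in transformers 4.26.x:

```python
# Minimal sketch of non-Trainer DeepSpeed ZeRO-3 initialization.
# Assumed (not from the original script): checkpoint name, optimizer
# settings, and the rest of ds_config. Launch with the deepspeed launcher,
# e.g.: deepspeed --num_gpus 8 train.py
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig  # path as of 4.26.x

ds_config = {
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "train_micro_batch_size_per_gpu": 1,
}

# Crucial step: create (and keep alive) an HfDeepSpeedConfig *before*
# calling from_pretrained(). This tells transformers that ZeRO-3 is active,
# so the pretrained weights are loaded and sharded correctly instead of the
# model ending up with randomly initialized parameters.
dschf = HfDeepSpeedConfig(ds_config)  # noqa: F841 -- must stay referenced

model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```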
BTW, when you use DeepSpeed offload with LION, it will be slow; you want DeepSpeed's Adam instead, or turn offload off. You shouldn't need offload with 8 GPUs and a model this small, unless you were just using it for a repro case, and even then 8 GPUs is a lot of sharding. The DeepSpeed team is working on flagging this incompatibility here: microsoft/DeepSpeed#2971. Also make sure to enable gradient checkpointing, which will save you a ton of GPU memory at a small cost in speed (this one is unrelated to DeepSpeed).
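A sketch of those two suggestions, reusing the assumed dict-style `ds_config` from the snippet above (the optimizer values are again illustrative):

```python
# Either turn optimizer offload off entirely ("none" is also the default,
# so simply omitting the key works too)...
ds_config["zero_optimization"]["offload_optimizer"] = {"device": "none"}
# ...or, if you do keep offload, use DeepSpeed's Adam/AdamW rather than
# LION, which the comment above notes is slow with offload:
ds_config["optimizer"] = {"type": "AdamW", "params": {"lr": 1e-5}}

# Gradient checkpointing (a standard transformers API, unrelated to
# DeepSpeed): large activation-memory savings for a small slowdown.
model.gradient_checkpointing_enable()
```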
Thanks very much. The problem has been solved.
System Info
transformers version: 4.26.1

Who can help?
@stas @ArthurZucker @younesbelkada
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
The model should be properly initialized with the pretrained weights when using DeepSpeed ZeRO Stage-3. As it stands, the model parameters appear to be randomly initialized instead.
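One way to check which of the two you got is to gather the sharded parameters and compare them against a plain reference copy of the checkpoint. This sketch assumes the `engine` and the checkpoint name from the earlier snippet:

```python
# Sketch: verify that the ZeRO-3 model really holds the pretrained weights
# rather than random ones. Assumes `engine` from deepspeed.initialize().
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Plain single-process reference copy of the (assumed) checkpoint.
ref = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

# Under ZeRO-3 each parameter is sharded and shows up locally with shape
# torch.Size([0]); GatheredParameters temporarily reassembles them.
params = list(engine.module.parameters())
with deepspeed.zero.GatheredParameters(params):
    for (name, p), (_, rp) in zip(
        engine.module.named_parameters(), ref.named_parameters()
    ):
        if not torch.allclose(
            p.detach().float().cpu(), rp.detach().float(), atol=1e-3
        ):
            print(f"mismatch in {name}")  # randomly initialized weights show up here
```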