pytorch gradient checkpointing is much better than deepspeed ! #63
You can train a larger batch size in two ways:
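The two ways themselves were not preserved in this thread. As an illustration (not necessarily what was meant here), DeepSpeed controls effective batch size through its JSON config, where the per-GPU micro batch size and gradient accumulation steps multiply together (values below are illustrative, assuming a single GPU):

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "train_batch_size": 32
}
```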
What is the sequence length you are using, and what are your hidden dimensions? Please take a look at lines 390-398 in the following file for an example of activation checkpointing.
Thanks @samyam for the feedback. In the original code, I was using both gradient accumulation and activation checkpointing (re-materialization/re-computation). Gradient accumulation is not the issue between my code and the DeepSpeed code; the problem is the activation checkpointing. I assumed that if I disabled it and activated DeepSpeed, I would get the same results. Apparently, this is not the case. I will re-integrate gradient checkpointing and see whether there is a benefit to using DeepSpeed in this case or not.
Yes, DeepSpeed does not do activation checkpointing automatically. You have to add that in the model. Please keep us posted on whether this solves your issue.
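For reference, "adding it in the model" typically means wrapping a submodule's forward pass with `torch.utils.checkpoint.checkpoint`, so its activations are recomputed during backward instead of being stored. A minimal sketch (the module and names are illustrative, not from this issue):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical feed-forward block; not the issue author's actual model.
class Block(torch.nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.ReLU(),
            torch.nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Activations inside self.ff are discarded after forward and
        # recomputed during backward, trading compute for memory.
        return checkpoint(self.ff, x, use_reentrant=False)

block = Block()
x = torch.randn(4, 16, requires_grad=True)
out = block(x)
out.sum().backward()  # recomputation happens here
```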
Closing the issue. Please let us know if there are further questions or issues!
Hello,
I have a script that trains a 12-layer transformer model (about 85 million parameters) using gradient checkpointing. It was working with a local batch size of 32 per Nvidia Titan GPU.
I tried to use DeepSpeed instead and I am always getting OOM, even with a batch size of 8.
Minimal code:
Initialization:
Training:
Original Transformer code with gradient checkpointing:
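The original snippet was not preserved in this thread. A hedged sketch of what a gradient-checkpointed 12-layer encoder commonly looks like (class and dimensions are illustrative, not the issue author's code):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative stand-in for the issue's 12-layer transformer:
# each layer is checkpointed so only layer boundaries are stored.
class CheckpointedEncoder(torch.nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=12):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            # Each layer's internal activations are recomputed in backward.
            x = checkpoint(layer, x, use_reentrant=False)
        return x

model = CheckpointedEncoder()
x = torch.randn(2, 8, 64)   # (batch, seq_len, d_model)
y = model(x)
y.mean().backward()
```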
The working batch size for the dataloader is only 4.
Any idea how I can achieve the same batch size with DeepSpeed as with gradient checkpointing?