
[DDP] DDP bucket memory release during fwd step #128696

Open
lichenlu opened this issue Jun 14, 2024 · 5 comments
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@lichenlu

lichenlu commented Jun 14, 2024

🚀 The feature, motivation and pitch

The DDP bucket always resides in GPU HBM, and its size equals the sum of the sizes of all the module's weight gradients.
During the forward stage and optimizer stage, this memory is wasted.
I wonder whether this memory could be released and re-allocated at runtime for training larger models such as Stable Diffusion?
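
To make the sizing concrete, below is a minimal sketch (my addition, not from the issue) that estimates the bucket footprint as the sum of the trainable parameters' gradient sizes; the two-layer model and the bucket_cap_mb value are placeholders, not the reporter's actual setup.

```python
# Minimal sketch (assumption: the bucket footprint is roughly the sum of all
# trainable parameters' gradient sizes; the model below is a stand-in).
import torch
import torch.nn as nn

def estimate_bucket_memory_bytes(model: nn.Module) -> int:
    # Every trainable parameter gets a same-sized gradient slot in some DDP bucket.
    return sum(p.numel() * p.element_size()
               for p in model.parameters() if p.requires_grad)

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda()
print(f"~{estimate_bucket_memory_bytes(model) / 2**20:.1f} MiB held by DDP buckets")

# After torch.distributed.init_process_group, wrapping the model allocates the
# buckets up front and keeps them alive across forward and optimizer steps:
# ddp_model = torch.nn.parallel.DistributedDataParallel(
#     model, gradient_as_bucket_view=True, bucket_cap_mb=25)
```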

Alternatives

No response

Additional context

No response

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @robieta @chaekit @aaronenyeshi @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise

@malfet malfet added the oncall: profiler profiler-related issues (cpu, gpu, kineto) label Jun 14, 2024
@davidberard98 davidberard98 added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 14, 2024
@sraikund16 sraikund16 removed the oncall: profiler profiler-related issues (cpu, gpu, kineto) label Jun 14, 2024
@weifengpy weifengpy added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 14, 2024
@weifengpy
Contributor

weifengpy commented Jun 14, 2024

Hi @lichenlu, this is a really good question! After discussing with @awgu, this is indeed a design question for DDP.

Just curious about your workload: in your case, does memory peak during forward rather than backward? I want to make sure that releasing DDP buckets would actually reduce the peak memory.
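
A rough way to check this (my sketch, not part of the original comment) is to reset the peak-memory counter around each phase; `ddp_model`, `batch`, and `loss_fn` are assumed names from the training loop.

```python
# Sketch: reset the peak-allocated counter around each phase to see where the peak is.
import torch

def report_peak(tag: str) -> None:
    torch.cuda.synchronize()
    print(f"{tag}: peak allocated {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
    torch.cuda.reset_peak_memory_stats()

# Inside the training loop (ddp_model, batch, loss_fn are assumed to exist):
# torch.cuda.reset_peak_memory_stats()
# out = ddp_model(batch)
# report_peak("forward")
# loss_fn(out).backward()
# report_peak("backward")
```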

[Ignore the original comment below, which assumes DDP freezes buckets]
You might already know the PyTorch CUDA caching allocator (CCA), so excuse me for repeating the logic:

  • DDP buckets are tensors with device='cuda'.
  • Once a tensor with device='cuda' is created, the PyTorch CUDA caching allocator ALWAYS reserves that memory. Even after the tensor is freed, the CCA still reserves it so other tensors can reuse it (a small illustration follows below).
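
A small illustration of this CCA behaviour (my addition, not from the original comment):

```python
# Deleting a CUDA tensor lowers "allocated" bytes, but the memory stays "reserved"
# in the caching allocator until torch.cuda.empty_cache() is called.
import torch

def show(tag: str) -> None:
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.0f} MiB")

t = torch.empty(256 * 2**20 // 4, device="cuda")  # ~256 MiB of float32
show("after alloc")
del t
show("after del")            # allocated drops, reserved stays
torch.cuda.empty_cache()
show("after empty_cache")    # reserved memory is returned to the driver
```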

The way you describe it ("During the forward stage and optimizer stage, this memory is wasted"), are you actually looking for FSDP? FSDP is still data parallelism, but it is more memory efficient than DDP: it saves memory by sharding parameters, gradients, and optimizer states to 1/NGPU per rank. https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md
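
For reference, a minimal FSDP sketch (my addition, assuming a single-node torchrun launch with an NCCL backend; the two-layer model is a placeholder for the actual workload):

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer states are sharded
# across ranks (~1/NGPU each) instead of being fully replicated as in DDP.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda()
fsdp_model = FSDP(model)  # default: the whole model becomes one shard unit

opt = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
out = fsdp_model(torch.randn(8, 4096, device="cuda"))
out.sum().backward()
opt.step()

dist.destroy_process_group()
```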

@weifengpy
Contributor

weifengpy commented Jun 14, 2024

Or do you need GPU memory for other things outside forward/backward/optim, but still within the PyTorch CUDA caching allocator? Just to confirm how much value it brings to free the DDP buckets outside of backward.

@lichenlu
Author

lichenlu commented Jun 17, 2024

Yes, in our case the memory peaks during forward. Our model is a Stable Diffusion model, and the peak memory occurs at the VAE encoder stage.
The VAE only needs to do the forward computation; it does not need any gradient computation.
If the DDP bucket could be released after the allreduce and rebuilt before the next backward step, that would help reduce the peak memory.
However, this may cause another problem: gradient_as_bucket_view would no longer work, and the memory release/alloc cycle may add some overhead.
So I think it could be an option for the user, if possible.
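
To make the pitch concrete, here is a purely hypothetical sketch of what such an opt-in could look like; release_buckets_after_reduce is not a real DistributedDataParallel argument, and the single-process gloo group exists only so the snippet runs standalone.

```python
# Hypothetical illustration only: `release_buckets_after_reduce` does NOT exist in
# torch.nn.parallel.DistributedDataParallel; it just names the requested behaviour
# (free bucket storage after allreduce, reallocate it before the next backward).
import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # single-process demo group

model = nn.Linear(1024, 1024)
ddp_model = DDP(
    model,
    gradient_as_bucket_view=False,  # the hypothetical option would conflict with bucket views
    # release_buckets_after_reduce=True,  # hypothetical opt-in flag described in this issue
)

dist.destroy_process_group()
```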

@weifengpy weifengpy added triage review and removed triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jun 17, 2024
@weifengpy
Contributor


Thanks for explaining this in detail. I will bring the proposal to the team for discussion.

@wconstab
Contributor

It seems like this is a reasonable feature request on the surface.

However, it adds complexity to the DDP logic and would probably need one more 'flag' that users have to set and we have to maintain.

This feature is positioned as a way to 'squeeze in' a large model with DDP. FSDP is our offering for data parallelism for large models. If FSDP could be used instead of DDP, this might be a non-issue. Would FSDP be an option for this case?

@weifengpy weifengpy added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Jun 17, 2024