
[DDP] DDP bucket memory release during fwd step #128696

Open
lichenlu opened this issue Jun 14, 2024 · 5 comments
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@lichenlu

lichenlu commented Jun 14, 2024

🚀 The feature, motivation and pitch

The DDP bucket always resides in GPU HBM, and its size equals the sum of the sizes of all the module's weight gradients.
During the forward stage and optimizer stage, this memory is wasted.
I wonder whether this memory could be released and re-allocated at runtime for training larger models such as Stable Diffusion?
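
To make the sizing concrete, below is a minimal sketch (my addition, not from the issue) that estimates the bucket footprint as the sum of the trainable parameters' gradient sizes; the two-layer model and the bucket_cap_mb value are placeholders, not the reporter's actual setup.

```python
# Minimal sketch (assumption: the bucket footprint is roughly the sum of all
# trainable parameters' gradient sizes; the model below is a stand-in).
import torch
import torch.nn as nn

def estimate_bucket_memory_bytes(model: nn.Module) -> int:
    # Every trainable parameter gets a same-sized gradient slot in some DDP bucket.
    return sum(p.numel() * p.element_size()
               for p in model.parameters() if p.requires_grad)

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda()
print(f"~{estimate_bucket_memory_bytes(model) / 2**20:.1f} MiB held by DDP buckets")

# After torch.distributed.init_process_group, wrapping the model allocates the
# buckets up front and keeps them alive across forward and optimizer steps:
# ddp_model = torch.nn.parallel.DistributedDataParallel(
#     model, gradient_as_bucket_view=True, bucket_cap_mb=25)
```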

Alternatives

No response

Additional context

No response

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @robieta @chaekit @aaronenyeshi @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise

@malfet malfet added the oncall: profiler profiler-related issues (cpu, gpu, kineto) label Jun 14, 2024
@davidberard98 davidberard98 added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 14, 2024
@sraikund16 sraikund16 removed the oncall: profiler profiler-related issues (cpu, gpu, kineto) label Jun 14, 2024
@weifengpy weifengpy added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 14, 2024
@weifengpy
Contributor

weifengpy commented Jun 14, 2024

Hi @lichenlu, this is a really good question! After discussing with @awgu, this is indeed a design question for DDP.

Just curious about your workload: in your case, does memory peak during forward rather than backward? I want to make sure that releasing DDP buckets would actually reduce the peak memory.
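
A rough way to check this (my sketch, not part of the original comment) is to reset the peak-memory counter around each phase; `ddp_model`, `batch`, and `loss_fn` are assumed names from the training loop.

```python
# Sketch: reset the peak-allocated counter around each phase to see where the peak is.
import torch

def report_peak(tag: str) -> None:
    torch.cuda.synchronize()
    print(f"{tag}: peak allocated {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
    torch.cuda.reset_peak_memory_stats()

# Inside the training loop (ddp_model, batch, loss_fn are assumed to exist):
# torch.cuda.reset_peak_memory_stats()
# out = ddp_model(batch)
# report_peak("forward")
# loss_fn(out).backward()
# report_peak("backward")
```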

[Ignore the original comment below, which assumes DDP freezes buckets]
You might already know the PyTorch CUDA caching allocator (CCA), so excuse me for repeating the logic:

  • DDP buckets are tensors with device='cuda'.
  • Once a tensor with device='cuda' is created, the PyTorch CUDA caching allocator ALWAYS reserves that memory. Even after the tensor is freed, the CCA still reserves it so other tensors can reuse it (a small illustration follows below).
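
A small illustration of this CCA behaviour (my addition, not from the original comment):

```python
# Deleting a CUDA tensor lowers "allocated" bytes, but the memory stays "reserved"
# in the caching allocator until torch.cuda.empty_cache() is called.
import torch

def show(tag: str) -> None:
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.0f} MiB")

t = torch.empty(256 * 2**20 // 4, device="cuda")  # ~256 MiB of float32
show("after alloc")
del t
show("after del")            # allocated drops, reserved stays
torch.cuda.empty_cache()
show("after empty_cache")    # reserved memory is returned to the driver
```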

The way you describe it ("During the forward stage and optimizer stage, this memory is wasted"), are you actually looking for FSDP? FSDP is still data parallelism, but it is more memory efficient than DDP: it saves memory by sharding parameters, gradients, and optimizer states to 1/NGPU per rank. https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md
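
For reference, a minimal FSDP sketch (my addition, assuming a single-node torchrun launch with an NCCL backend; the two-layer model is a placeholder for the actual workload):

```python
# Minimal FSDP sketch: parameters, gradients, and optimizer states are sharded
# across ranks (~1/NGPU each) instead of being fully replicated as in DDP.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).cuda()
fsdp_model = FSDP(model)  # default: the whole model becomes one shard unit

opt = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
out = fsdp_model(torch.randn(8, 4096, device="cuda"))
out.sum().backward()
opt.step()

dist.destroy_process_group()
```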

@weifengpy
Contributor

weifengpy commented Jun 14, 2024

Or do you need GPU memory for other things outside forward/backward/optim, but still within the PyTorch CUDA caching allocator? Just to confirm how much value it brings to free the DDP buckets outside of backward.

@lichenlu
Author

lichenlu commented Jun 17, 2024

Yes, in our case the memory peaks during forward. Our model is a Stable Diffusion model, and the peak memory occurs at the VAE encoder stage.
The VAE only needs to do the forward computation; it does not need any gradient computation.
If the DDP bucket could be released after the allreduce and rebuilt before the next backward step, that would help reduce the peak memory.
However, this may cause another problem: gradient_as_bucket_view would no longer work, and the memory release/alloc cycle may add some overhead.
So I think it could be an option for the user, if possible.
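
To make the pitch concrete, here is a purely hypothetical sketch of what such an opt-in could look like; release_buckets_after_reduce is not a real DistributedDataParallel argument, and the single-process gloo group exists only so the snippet runs standalone.

```python
# Hypothetical illustration only: `release_buckets_after_reduce` does NOT exist in
# torch.nn.parallel.DistributedDataParallel; it just names the requested behaviour
# (free bucket storage after allreduce, reallocate it before the next backward).
import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # single-process demo group

model = nn.Linear(1024, 1024)
ddp_model = DDP(
    model,
    gradient_as_bucket_view=False,  # the hypothetical option would conflict with bucket views
    # release_buckets_after_reduce=True,  # hypothetical opt-in flag described in this issue
)

dist.destroy_process_group()
```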

@weifengpy weifengpy added triage review and removed triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jun 17, 2024
@weifengpy
Contributor


Thanks for explaining this in detail. I will bring the proposal to the team for discussion.

@wconstab
Contributor

It seems like this is a reasonable feature request on the surface.

However, it adds complexity to the DDP logic and would probably need one more 'flag' that users have to set and we have to maintain.

This feature is positioned as a way to 'squeeze in' a large model with DDP. FSDP is our offering for data parallelism for large models. If FSDP could be used instead of DDP, this might be a non-issue. Would FSDP be an option for this case?

@weifengpy weifengpy added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Jun 17, 2024