
[FSDP] Move the flattened tensors back to GPU to prevent CPU OOM #124008

Closed

Conversation

@exhyy commented Apr 13, 2024

I encountered a CPU OOM issue when resuming from a checkpoint with FSDP.optim_state_dict_to_load. See https://github.com/huggingface/accelerate/blob/5ca095a34fede7c988af8c193eb0c0d199750845/src/accelerate/utils/fsdp_utils.py#L208 for details.

I noticed that _flatten_tensor_optim_state and _flatten_zero_dim_tensor_optim_state create tensors on CPU, which leads to the OOM. I tried creating the tensors directly on GPU instead, but that caused an OOM on the GPU.

My solution is to move the flattened tensors back to the device FSDP is running on. This works well for me with PyTorch 2.1.1. The current main branch does not appear to have fixed this issue yet.
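For illustration, the idea can be sketched as a small helper that walks a flattened optimizer state dict and moves its tensor values to the target device. This is a minimal sketch, not the PR's actual diff; the function name and structure are hypothetical, and it assumes the state dict is a (possibly nested) dict of tensors and scalars.

```python
import torch


def move_flat_optim_state_to_device(flat_state, device):
    """Hypothetical helper: recursively move tensor values in a
    flattened optimizer state dict to `device` (e.g. the GPU that
    FSDP is running on), so the flattened copies do not pile up
    in host memory."""
    moved = {}
    for key, value in flat_state.items():
        if isinstance(value, torch.Tensor):
            moved[key] = value.to(device)
        elif isinstance(value, dict):
            moved[key] = move_flat_optim_state_to_device(value, device)
        else:
            # Non-tensor entries (e.g. step counters) are kept as-is.
            moved[key] = value
    return moved


# Usage sketch: move a toy state dict to a chosen device.
target = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state = {"step": 3, "exp_avg": torch.zeros(4),
         "nested": {"exp_avg_sq": torch.ones(2)}}
state = move_flat_optim_state_to_device(state, target)
```

Whether this trades CPU OOM for higher GPU pressure depends on how long the moved copies stay alive, which is the open question in the review below.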

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @rohan-varma

pytorch-bot bot commented Apr 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124008

Note: links to docs will display an error until the docs builds have completed.

✅ No Failures

As of commit 60ba71a with merge base da7db5d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla bot commented Apr 13, 2024

CLA Signed


The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Apr 13, 2024
@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 15, 2024
@zou3519 zou3519 requested a review from awgu April 15, 2024 15:22
@awgu (Contributor) commented Apr 15, 2024

cc: @fegin since you are more familiar with this code

I am not fully convinced by this solution though 🤔 I may need to understand it more.

@exhyy (Author) commented Apr 20, 2024

Sorry for the late reply.
I am not fully convinced by this solution either. My main concern is that I am unsure whether it leads to an increase in peak GPU memory.
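The peak-memory concern above can be checked empirically. As a minimal sketch (the helper name is hypothetical, and it assumes a CUDA build of PyTorch), one can reset CUDA's peak-memory counter, run the state-dict conversion, and read the peak back:

```python
import torch


def peak_gpu_memory_of(fn, device="cuda"):
    """Hypothetical measurement helper: run `fn` and return the peak
    GPU memory it allocated, in bytes. Returns None when CUDA is
    unavailable, so the sketch degrades gracefully on CPU-only hosts."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats(device)
    fn()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device)


# Usage sketch: compare peak usage with and without moving the
# flattened tensors back to GPU by wrapping each variant in a lambda.
peak = peak_gpu_memory_of(lambda: torch.zeros(1))
if peak is not None:
    print(f"peak allocated: {peak} bytes")
```

If the peak with the fix stays close to the baseline, the flattened copies are likely freed promptly and the trade-off is acceptable.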


Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 19, 2024
@github-actions github-actions bot closed this Jul 19, 2024
Labels
oncall: distributed (add this issue/PR to distributed oncall triage queue) · open source · release notes: distributed (fsdp) (release notes category) · Stale · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

4 participants