[FSDP] Runtime Error on Checkpoint Loading for optimizer state #129110

jeejakp12 · 2024-06-20T04:18:17Z

When loading an optimizer checkpoint, tensors are created on CUDA even when another backend is in use. This happens because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device.

In _alloc_tensor, the empty tensor is created using device = cast(torch.device, _get_device_module(device_type).current_device()). That call returns only the device index, so by default the empty tensor is created on CUDA. This change uses torch.device(device_type, device_module(device_type).current_device()) instead, to get a device that carries both the type and the index.
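A minimal sketch of the idea, not the actual patch. The helper names `current_device_for` and `alloc_empty` are hypothetical, and the `getattr(torch, device_type)` lookup is an assumption standing in for `_get_device_module`. It pairs the bare index returned by `current_device()` with the backend type, so the empty tensor is allocated on the intended backend rather than on whatever a lone ordinal resolves to.

```python
import torch


def current_device_for(device_type: str) -> torch.device:
    """Return a device that carries both the backend type and the index.

    Hypothetical helper for illustration only.
    """
    if device_type == "cpu":
        # The CPU backend has no per-device index to look up.
        return torch.device("cpu")
    # Assumption: the backend module is reachable as torch.<device_type>
    # (for example torch.cuda or torch.xpu).
    device_module = getattr(torch, device_type)
    # current_device() returns only an integer index. Passing that integer
    # alone to torch.device() drops the backend type, which is the problem
    # described above, so the type is supplied explicitly here.
    return torch.device(device_type, device_module.current_device())


def alloc_empty(size, dtype: torch.dtype, device_type: str) -> torch.Tensor:
    """Allocate an uninitialized tensor on the current device of a backend."""
    return torch.empty(size, dtype=dtype, device=current_device_for(device_type))
```

For example, `alloc_empty((2, 3), torch.float32, "cpu")` returns a tensor whose `device.type` is `"cpu"`. On a non-CUDA accelerator the same call would keep that accelerator's type instead of falling back to the default for a bare ordinal.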

Fixes #ISSUE_NUMBER

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC @MeetVadakkanchery @mhorowitz

pytorch-bot · 2024-06-20T04:18:19Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129110

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (6 Unrelated Failures)

As of commit 81cd818 with merge base 7128504:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 1, 2, linux.rocm.gpu) (gh) (disabled by #129390)
    distributed/_tools/test_fsdp2_mem_tracker.py::TestTrackerFullyShard1DTrainingCore::test_tracker_multi_group_eager
periodic / win-vs2019-cuda11.8-py3 / test (default, 1, 4, windows.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
    RuntimeError: doctests 1/1 failed!
periodic / win-vs2019-cuda11.8-py3 / test (default, 2, 4, windows.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
    profiler\test_cpp_thread.py::CppThreadTest::test_profile_memory
periodic / win-vs2019-cuda11.8-py3 / test (default, 3, 4, windows.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
    test_testing.py::TestImports::test_not_import_sympy
periodic / win-vs2019-cuda11.8-py3 / test (force_on_cpu, 1, 1, windows.4xlarge.nonephemeral) (gh) (similar failure)
    RuntimeError: doctests 1/1 failed!

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) (gh) (#130257)
    test_testing.py::TestImports::test_not_import_sympy

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jeejakp12 · 2024-07-02T08:46:15Z

@pytorchbot

jeejakp12 · 2024-07-02T08:47:46Z

@pytorchbot rebase

pytorchmergebot · 2024-07-02T08:49:06Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-07-02T08:49:10Z

Successfully rebased origin/jeeja_fsdp_checkpoint_fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout origin/jeeja_fsdp_checkpoint_fix && git pull --rebase)

jeejakp12 · 2024-07-03T18:36:09Z

@pytorchbot rebase

pytorchmergebot · 2024-07-03T18:37:30Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-07-03T18:37:34Z

Successfully rebased origin/jeeja_fsdp_checkpoint_fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout origin/jeeja_fsdp_checkpoint_fix && git pull --rebase)

jeejakp12 · 2024-07-04T06:22:35Z

@pytorchbot rebase

pytorchmergebot · 2024-07-04T06:24:10Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-07-04T06:24:13Z

Successfully rebased origin/jeeja_fsdp_checkpoint_fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout origin/jeeja_fsdp_checkpoint_fix && git pull --rebase)

jeejakp12 · 2024-07-05T06:30:18Z

@pytorchbot rebase

pytorchmergebot · 2024-07-05T06:31:50Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

When loading an optimizer checkpoint, tensors are created on CUDA even when another backend is in use. This happens because, by default, a torch.device() constructed from a single device ordinal is treated as a CUDA device. In _alloc_tensor, the empty tensor is created using device = cast(torch.device, _get_device_module(device_type).current_device()). That call returns only the device index, so by default the empty tensor is created on CUDA. This change uses torch.device(device_type, device_module(device_type).current_device()) instead, to get a device that carries both the type and the index. Signed-off-by: Jeeja <[email protected]>

pytorchmergebot · 2024-07-05T06:31:53Z

Successfully rebased origin/jeeja_fsdp_checkpoint_fix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout origin/jeeja_fsdp_checkpoint_fix && git pull --rebase)

jeejakp12 · 2024-07-05T15:36:15Z

@wz337, can you please help review the change? Thanks, Jeeja

jeejakp12 · 2024-07-08T10:18:08Z

@pytorchbot rebase

pytorchmergebot · 2024-07-08T10:19:37Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-07-08T10:19:39Z

Tried to rebase and push PR #129110, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

fegin · 2024-07-08T17:44:21Z

@pytorchbot merge

pytorchmergebot · 2024-07-08T17:46:10Z

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team

Raised by workflow job

LucasLLC · 2024-07-08T18:44:47Z

@pytorchbot merge

pytorchmergebot · 2024-07-08T18:46:22Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed labels Jun 20, 2024

pytorchbot added the open source label Jun 20, 2024

fegin requested a review from wz337 June 20, 2024 20:31

fegin added the ciflow/trunk and ciflow/periodic labels Jun 20, 2024

fegin approved these changes Jun 20, 2024

jeejakp12 force-pushed the origin/jeeja_fsdp_checkpoint_fix branch 3 times, most recently from c758aa9 to e477573 June 27, 2024 15:16

pytorchmergebot force-pushed the origin/jeeja_fsdp_checkpoint_fix branch from e477573 to 060e339 July 2, 2024 08:49

pytorchmergebot force-pushed the origin/jeeja_fsdp_checkpoint_fix branch from 060e339 to dd87e13 July 3, 2024 18:37

pytorchmergebot force-pushed the origin/jeeja_fsdp_checkpoint_fix branch from dd87e13 to 2a69d5e July 4, 2024 06:24

pytorchmergebot force-pushed the origin/jeeja_fsdp_checkpoint_fix branch from 2a69d5e to 81cd818 July 5, 2024 06:31

pytorchmergebot added the merging label Jul 8, 2024

pytorchmergebot removed the merging label Jul 8, 2024

LucasLLC added the topic: not user facing label Jul 8, 2024

pytorchmergebot added the merging label Jul 8, 2024

pytorchmergebot closed this in 22c809a Jul 8, 2024

pytorchmergebot added Merged and removed merging labels Jul 8, 2024
