[dtensor][debug] add operation tracing to comm_mode #129017

sinhaanshul · 2024-06-19T00:21:48Z

Stack from ghstack (oldest at bottom):

[dtensor][be] Reduced redundant LOC by creating functions to set up models used in example #129613
[dtensor][debug] Added forward and backward differentiation for module level tracing #129602
-> [dtensor][debug] add operation tracing to comm_mode #129017

Summary
I have added a more detailed module tracker that now includes the collective counts and the operations that occur in each submodule, making it easier for users to debug. The trace now also includes the input shape and sharding of each operation's DTensor arguments. As with the module collective tracing, the user has the option to log the tracing table to an output.txt file. I have not included the example output for the transformer because it is too long. The expected output for MLP_operation_tracing is shown below:

Test Plan

torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
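Both commands pick an example by name through the `-e` flag. As a rough illustration of how such a selector could be wired up, here is a minimal sketch using `argparse`. The registry, the `run_*` functions, and `select_example` are hypothetical stand-ins; the actual example script may be organized differently.

```python
import argparse


# Hypothetical example bodies: the real examples build sharded models and
# print tracing tables; here they only return a marker string.
def run_mlp_operation_tracing() -> str:
    return "ran MLP_operation_tracing"


def run_transformer_operation_tracing() -> str:
    return "ran transformer_operation_tracing"


# Hypothetical registry mapping the names passed to -e onto callables.
EXAMPLES = {
    "MLP_operation_tracing": run_mlp_operation_tracing,
    "transformer_operation_tracing": run_transformer_operation_tracing,
}


def select_example(argv):
    """Parse -e <name> and return the matching example callable."""
    parser = argparse.ArgumentParser(description="comm_mode example selector (sketch)")
    parser.add_argument("-e", dest="example", required=True, choices=sorted(EXAMPLES))
    args = parser.parse_args(argv)
    return EXAMPLES[args.example]


if __name__ == "__main__":
    import sys

    print(select_example(sys.argv[1:])())
```

Restricting `-e` to known names via `choices` makes a mistyped example name fail early with a usage message instead of silently doing nothing.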

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot · 2024-06-19T00:21:52Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129017


✅ You can merge normally! (4 Unrelated Failures)

As of commit 9aa0b16 with merge base ec284d3:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-13) (gh) (similar failure)
test_mps.py::TestMPS::test_mps_allocator_module
trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14) (gh) (similar failure)
test_mps.py::TestMPS::test_mps_allocator_module

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_smoketest_perf, 1, 1, linux.24xl.spr-metal) (gh) (#126993)
Process completed with exit code 1.
linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-build / build (gh) (#129931)
The process '/usr/bin/git' failed with exit code 128

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: aab2b50c790e49482b65a97c9471dd2f5e84a2b9 Pull Request resolved: #129017

XilunWu · 2024-06-21T08:32:12Z

Let's convert this PR to Draft since it's still WIP.


ghstack-source-id: 03c3ffaf2b533ff8980484f2f3076faf6ac76124 Pull Request resolved: #129017

tianyu-l

This looks very interesting! Maybe similarly we can print out the placements of module parameters if they are DTensor?
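A minimal sketch of what this suggestion could look like follows. It is written against plain Python objects so it runs without a distributed setup. It assumes each DTensor parameter exposes `placements` and `device_mesh` attributes. The helper name `describe_placements` is hypothetical, and in real code one would likely pass `model.named_parameters()` and test `isinstance(param, DTensor)` instead of using `getattr`.

```python
from types import SimpleNamespace


def describe_placements(named_params):
    """Return one line per parameter that carries sharding information.

    Parameters without a `placements` attribute (plain tensors in this
    sketch) are skipped.
    """
    lines = []
    for name, param in named_params:
        placements = getattr(param, "placements", None)
        if placements is None:
            continue  # not a DTensor-like object
        mesh = getattr(param, "device_mesh", None)
        lines.append(f"{name}: placements={tuple(placements)}, mesh={mesh}")
    return lines


# Stand-in objects for illustration only; these are not real DTensors.
fake_params = [
    ("net1.weight", SimpleNamespace(placements=("Shard(dim=0)",), device_mesh="mesh of 4 ranks")),
    ("norm.bias", object()),
]

for line in describe_placements(fake_params):
    print(line)
```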


wz337 · 2024-06-26T23:38:13Z

Curious what your decision is on printing out the mesh for the DTensor.


XilunWu

LGTM! This change is a good foundation for future extension. Please address the comment in a future PR.

XilunWu · 2024-07-01T19:54:07Z

torch/distributed/_tensor/debug/comm_mode.py

+ ansi_escape = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")
+ table = ansi_escape.sub("", self.generate_operation_tracing_table())
+
+ with open("output.txt", "w") as log_file:


Let's make the filename an argument. Also in log_module_tracing_table_to_file.
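One way the requested change might look is sketched below as a standalone function rather than a `CommDebugMode` method. The regular expression is the one from the diff. The function name `log_table_to_file`, the `file_name` parameter, and its `"output.txt"` default are assumptions for illustration, not the merged API.

```python
import os
import re
import tempfile

# Same pattern as in the diff: matches ANSI CSI escape sequences such as
# the color codes used when the table is printed to a terminal.
ANSI_ESCAPE = re.compile(r"\x1B\[[0-?]*[ -/]*[@-~]")


def log_table_to_file(table: str, file_name: str = "output.txt") -> None:
    """Strip ANSI escapes from `table` and write the result to `file_name`."""
    plain = ANSI_ESCAPE.sub("", table)
    with open(file_name, "w") as log_file:
        log_file.write(plain)


# Usage: write a colored string to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "operation_tracing.txt")
    log_table_to_file("\x1b[1;31mFORWARD PASS\x1b[0m\n  aten.addmm", file_name=path)
    with open(path) as f:
        print(f.read())
```

Keeping `"output.txt"` as the default would preserve the existing behavior for callers that do not pass a filename.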

sinhaanshul · 2024-07-01T21:00:20Z

@pytorchbot merge

pytorchmergebot · 2024-07-01T21:01:49Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-07-02T03:00:47Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

sinhaanshul · 2024-07-02T19:03:15Z

@pytorchbot merge

pytorchmergebot · 2024-07-02T19:04:45Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


Update

c5bd14a

[ghstack-poisoned]

This was referenced Jun 19, 2024

[dtensor][example] add functionality allowing users to choose which example they'd to run #128720

Closed

[dtensor][debug] added logging module tracing table to file feature #128721

Closed

sinhaanshul mentioned this pull request Jun 19, 2024

[dtensor][test] test case suite for comm_mode features #128729

Closed

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jun 19, 2024

sinhaanshul mentioned this pull request Jun 19, 2024

[dtensor][debug] fixing CommDebugMode module collective tracing #128887

Closed

sinhaanshul added a commit that referenced this pull request Jun 19, 2024

[dtensor][debug] add operation tracing to comm_mode

a649d09

ghstack-source-id: aab2b50c790e49482b65a97c9471dd2f5e84a2b9 Pull Request resolved: #129017

XilunWu self-requested a review June 21, 2024 08:31

XilunWu marked this pull request as draft June 21, 2024 08:32

Update

60aa9f6

[ghstack-poisoned]

Update

4fdc4f4

[ghstack-poisoned]

Update

4616e23

[ghstack-poisoned]

Update

be768e6

[ghstack-poisoned]

Update

6562cb4

[ghstack-poisoned]

Update

9b9acaf

[ghstack-poisoned]

Update

caabfea

[ghstack-poisoned]

sinhaanshul added a commit that referenced this pull request Jun 25, 2024

[dtensor][debug] add operation tracing to comm_mode

02fed83

ghstack-source-id: 03c3ffaf2b533ff8980484f2f3076faf6ac76124 Pull Request resolved: #129017

sinhaanshul added the topic: not user facing label Jun 25, 2024

sinhaanshul marked this pull request as ready for review June 25, 2024 22:22

sinhaanshul requested review from wz337 and tianyu-l June 25, 2024 22:22

tianyu-l reviewed Jun 26, 2024


Update

ba70339

[ghstack-poisoned]

This was referenced Jun 26, 2024

[dtensor][debug] Added forward and backward differentiation for module level tracing #129602

Closed

[dtensor][be] Reduced redundant LOC by creating functions to set up models used in example #129613

Closed

Update

9aa0b16

[ghstack-poisoned]

XilunWu approved these changes Jul 1, 2024


pytorch-bot bot added the ciflow/trunk label Jul 1, 2024

pytorchmergebot added the merging label Jul 1, 2024

pytorchmergebot added the Merged label Jul 2, 2024

pytorchmergebot closed this in 1f6c1fc Jul 2, 2024

pytorchmergebot removed the merging label Jul 2, 2024

github-actions bot deleted the gh/sinhaanhsul/24/head branch August 2, 2024 01:56
