Update start_, end_ and retired only for the right entry when retire a work #128948

HOOLoLo · 2024-06-18T11:53:56Z

Fixes #128805
If the buffer size of NCCLTraceBuffer is 10 and the process group has recorded 11 works, the entry for work 0 has already been overwritten by work 10. So when the watchdog retires work 0, the start_ and end_ of entry 0 (which now belongs to work 10) shouldn't be set to nullptr.
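The overwrite scenario above can be sketched with a small stand-alone ring buffer. This is only an illustration under stated assumptions, not the actual NCCLTraceBuffer code: `TraceRing`, `Entry`, `record`, `retire` and `slot` are hypothetical names, and plain `const int*` pointers stand in for the start and end events.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for one trace-buffer slot.
struct Entry {
  std::size_t id = 0;
  const int* start = nullptr;  // stands in for the start event
  const int* end = nullptr;    // stands in for the end event
  bool retired = false;
};

// Hypothetical fixed-capacity ring: work `id` lives in slot `id % capacity`.
class TraceRing {
 public:
  explicit TraceRing(std::size_t capacity) : capacity_(capacity) {
    assert(capacity_ > 0);  // the modulo below requires a non-zero capacity
  }

  // Records a new work and returns its id. Once the ring is full, the
  // oldest slot is overwritten.
  std::size_t record(const int* start, const int* end) {
    const std::size_t id = next_id_++;
    Entry fresh;
    fresh.id = id;
    fresh.start = start;
    fresh.end = end;
    if (entries_.size() < capacity_) {
      entries_.push_back(fresh);
    } else {
      entries_[id % capacity_] = fresh;
    }
    return id;
  }

  // Retires work `id`. Returns false and leaves the slot untouched when
  // the slot is missing or has been reused by a newer work.
  bool retire(std::size_t id) {
    const std::size_t index = id % capacity_;
    if (index >= entries_.size()) {
      return false;
    }
    Entry& entry = entries_[index];
    if (entry.id != id) {
      return false;  // stale retire: the slot now belongs to another work
    }
    entry.start = nullptr;
    entry.end = nullptr;
    entry.retired = true;
    return true;
  }

  const Entry& slot(std::size_t index) const { return entries_.at(index); }

 private:
  std::size_t capacity_;
  std::size_t next_id_ = 0;
  std::vector<Entry> entries_;
};
```

With a capacity of 10 and 11 recorded works, slot 0 holds work 10, so `retire(0)` in this sketch returns false and leaves the slot's `start`, `end` and `retired` fields alone.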

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot · 2024-06-18T11:53:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128948


✅ No Failures

As of commit 66361dc with merge base 90d5a6f ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab

Thanks for the fix, lgtm!

Nit: were you able to make a test case fail reliably without the fix? If so, you should also add the test case.

Cc @c-p-i-o

wconstab · 2024-06-19T05:43:25Z

torch/csrc/distributed/c10d/TraceUtils.h

@@ -657,9 +659,6 @@ struct NCCLTraceBuffer {
 entry->duration_ = duration.value();
 }
 }
-
- entry->retired_ = true;


We should be holding guard.lock() when we reach this line and we should have re-verified id matches event id.

The proposed change looks like it should also be ok to me, but I don't understand why the issue can happen unless something is wrong with our locking.

OK, I see it now. What I said holds only if 'can compute duration' is true, which is the case only when timing is enabled. The fix looks right to me.
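The pattern discussed in this review thread can be sketched as follows. This is a simplified illustration, not the code in TraceUtils.h: `Slot`, `Buffer`, `retire_unchecked` and `retire_checked` are hypothetical names, and the `measured` argument stands in for a duration computed from the events. The first variant re-verifies the id only inside the timing branch, so with timing disabled the final write reaches whatever work currently occupies the slot. The second variant checks the id before any write.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical, simplified slot and buffer types.
struct Slot {
  std::size_t id = 0;
  std::optional<float> duration;
  bool retired = false;
};

struct Buffer {
  std::mutex mutex;
  std::vector<Slot> slots;
  bool timing_enabled = false;
};

// Variant A: the id is re-verified only in the timing branch, so the
// unconditional write below can hit a slot owned by a newer work.
inline void retire_unchecked(Buffer& buf, std::size_t id, float measured) {
  std::lock_guard<std::mutex> guard(buf.mutex);
  assert(!buf.slots.empty());
  Slot& slot = buf.slots[id % buf.slots.size()];
  if (buf.timing_enabled) {
    if (slot.id != id) {
      return;
    }
    slot.duration = measured;
  }
  slot.retired = true;  // no id check on this path when timing is off
}

// Variant B: verify ownership first, then update the slot.
inline void retire_checked(Buffer& buf, std::size_t id, float measured) {
  std::lock_guard<std::mutex> guard(buf.mutex);
  assert(!buf.slots.empty());
  Slot& slot = buf.slots[id % buf.slots.size()];
  if (slot.id != id) {
    return;  // the slot was reused; leave it alone
  }
  if (buf.timing_enabled) {
    slot.duration = measured;
  }
  slot.retired = true;
}
```

In this sketch the two variants behave the same when timing is enabled, and they differ only when it is disabled.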

HOOLoLo · 2024-06-19T11:50:59Z

> Thanks for the fix, lgtm!
>
> Nit- were you able to make a test case fail reliably without the fix? If so you should also add the test case.
>
> Cc @c-p-i-o

Test case added.

wconstab · 2024-06-19T13:29:48Z

Btw @HOOLoLo, we have been developing scripts to analyze the dumped flight recorder data. I can share an incomplete version now, or else we'll release it at some point. If you have any particular use case or feature request for it, please let us know.

HOOLoLo · 2024-06-20T02:06:51Z

@pytorchbot merge

pytorch-bot · 2024-06-20T02:06:55Z

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

HOOLoLo · 2024-06-20T02:22:05Z

> has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

> Btw @HOOLoLo we have been developing scripts to analyze the dumped flight recorder data, I can share an incomplete version now or else we'll release it at some point. if you have any particular use case or feature request for it please let us know.

@wconstab Oh, thanks. I don't have any feature requests right now. Could you approve this PR again? I pressed the re-request review button by mistake.

c-p-i-o · 2024-06-20T18:55:43Z

The following change will also take care of this issue:
#126969
Basically, the Entry struct wasn't being fully initialized on every work.

HOOLoLo · 2024-06-21T02:06:34Z

@pytorchbot merge

pytorchmergebot · 2024-06-21T02:08:36Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-06-21T04:06:39Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (default, 2, 2, linux.rocm.gpu)

Details for Dev Infra team

Raised by workflow job

HOOLoLo · 2024-06-21T06:27:54Z

@pytorchbot merge

pytorchmergebot · 2024-06-21T06:29:33Z

Merge started


pytorchmergebot · 2024-06-21T06:29:38Z

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

HOOLoLo · 2024-06-21T11:15:31Z

@pytorchbot merge

pytorchmergebot · 2024-06-22T04:17:00Z

Merge started


pytorchmergebot · 2024-06-22T04:17:06Z

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

BC Lint
pull

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

HOOLoLo · 2024-06-24T02:13:34Z

@pytorchbot merge

pytorchmergebot · 2024-06-24T02:15:14Z

Merge started


pytorchmergebot · 2024-06-24T02:15:20Z

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

BC Lint
pull

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers


HOOLoLo · 2024-06-26T07:22:10Z

@pytorchbot merge

pytorchmergebot · 2024-06-26T07:24:48Z

Merge started


pytorchmergebot · 2024-06-26T07:24:54Z

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

HOOLoLo · 2024-06-26T08:07:59Z

@c-p-i-o Hi, this PR has three workflows awaiting approval. Can you help me?

c-p-i-o · 2024-06-26T18:27:16Z

I clicked "Approve and run"! Let's see if it works!

c-p-i-o · 2024-06-26T18:31:44Z

@pytorchbot merge

pytorchmergebot · 2024-06-26T18:33:35Z

Merge started


pytorchmergebot · 2024-06-26T21:04:24Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

clee2000 · 2024-06-26T21:04:48Z

@pytorchbot merge

Cancelled the workflow to fix something else, sorry about that

pytorchmergebot · 2024-06-26T21:06:27Z

Merge started


pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jun 18, 2024

pytorchbot added the open source label Jun 18, 2024

wconstab approved these changes Jun 19, 2024


HOOLoLo force-pushed the fix-entry-state branch from 4284d9d to 36ed017 Compare June 19, 2024 11:49

HOOLoLo force-pushed the fix-entry-state branch 2 times, most recently from 4f32348 to 13f46e8 Compare June 19, 2024 12:03

HOOLoLo requested a review from wconstab June 19, 2024 12:06

c-p-i-o self-requested a review June 20, 2024 18:55

c-p-i-o approved these changes Jun 20, 2024


pytorch-bot bot added the ciflow/trunk label Jun 21, 2024

pytorchmergebot added the merging label Jun 21, 2024

pytorchmergebot removed the merging label Jun 21, 2024

HOOLoLo force-pushed the fix-entry-state branch from e0ee084 to 7f7f228 Compare June 21, 2024 06:26

pytorchmergebot added the merging label Jun 21, 2024

pytorchmergebot removed the merging label Jun 21, 2024

pytorchmergebot added the merging label Jun 22, 2024

pytorchmergebot removed the merging label Jun 22, 2024

pytorchmergebot added the merging label Jun 24, 2024

pytorchmergebot removed the merging label Jun 24, 2024

HOOLoLo force-pushed the fix-entry-state branch 2 times, most recently from 6ab5003 to 346a8fa Compare June 25, 2024 11:40

Update start_, end_ and retired only for the right entry when retire …

66361dc

…a work

HOOLoLo force-pushed the fix-entry-state branch from 346a8fa to 66361dc Compare June 26, 2024 02:35

pytorchmergebot added the merging label Jun 26, 2024

pytorchmergebot removed the merging label Jun 26, 2024

pytorchmergebot added the merging label Jun 26, 2024

pytorchmergebot added the Merged label Jun 26, 2024

pytorchmergebot closed this in 5ad2ad5 Jun 26, 2024

pytorchmergebot removed the merging label Jun 26, 2024
