Update start_, end_ and retired only for the right entry when retiring a work #128948
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128948
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 66361dc with merge base 90d5a6f. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for the fix, LGTM!
Nit: were you able to make a test case fail reliably without the fix? If so, you should also add the test case.
cc @c-p-i-o
@@ -657,9 +659,6 @@ struct NCCLTraceBuffer {
        entry->duration_ = duration.value();
      }
    }

    entry->retired_ = true;
We should be holding guard.lock() when we reach this line, and we should have re-verified that the id matches the event id.
The proposed change also looks OK to me, but I don't understand why the issue can happen unless something is wrong with our locking.
OK, I see it now. What I said is only true if "can compute duration" is true, which is only the case when timing is enabled. The fix looks right to me.
Test case added
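The invariant the reviewer describes can be sketched in isolation. The following is a hypothetical, heavily simplified ring buffer (not the actual NCCLTraceBuffer code; all names are illustrative): both record() and retire_id() take the lock, and retire_id() re-verifies that the slot's stored id still matches the id being retired before mutating it, so a stale retire after a wrap-around overwrite becomes a no-op.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Simplified sketch of the locking/id-recheck pattern under discussion.
struct Entry {
  std::size_t id_ = 0;
  bool retired_ = false;
};

struct RingBuffer {
  explicit RingBuffer(std::size_t cap) : entries_(cap) {}

  // Record a new work; on wrap-around this overwrites the oldest slot.
  std::size_t record() {
    std::lock_guard<std::mutex> guard(mutex_);
    std::size_t id = next_id_++;
    entries_[id % entries_.size()] = Entry{id, false};
    return id;
  }

  // Retire a work: re-check, under the lock, that the slot still belongs
  // to this id before touching it.
  void retire_id(std::size_t id) {
    std::lock_guard<std::mutex> guard(mutex_);
    Entry& entry = entries_[id % entries_.size()];
    if (entry.id_ != id) {
      return;  // slot was overwritten by a newer work; leave it alone
    }
    entry.retired_ = true;
  }

  std::vector<Entry> entries_;
  std::size_t next_id_ = 0;
  std::mutex mutex_;
};
```

With a capacity of 10 and 11 recorded works, retiring work 0 after its slot has been reused by work 10 leaves the slot untouched.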
Force-pushed from 4f32348 to 13f46e8
Btw @HOOLoLo, we have been developing scripts to analyze the dumped flight recorder data. I can share an incomplete version now, or else we'll release it at some point. If you have any particular use case or feature request for it, please let us know.
@pytorchbot merge
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permission to run it, or because skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
@wconstab Oh, thanks. I don't have any new feature requests right now. Can you approve this PR again? I pressed the "request review" button again by mistake.
The following change will also take care of this change:
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (default, 2, 2, linux.rocm.gpu). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 2 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Force-pushed from 6ab5003 to 346a8fa
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
@c-p-i-o Hi, this PR has three workflows awaiting approval. Can you help me?
I clicked "Approve and run"! Let's see if it works!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Cancelled the workflow to fix something else, sorry about that.
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #128805
If the buffer size of NCCLTraceBuffer is 10 and the PG has recorded 11 works, the entry for work 0 will have been overwritten by work 10. So when the watchdog retires work 0, the start_ and end_ of entry 0 shouldn't be set to nullptr.
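The wrap-around scenario above can be demonstrated with a minimal sketch. This is a hypothetical simplification, not the real NCCLTraceBuffer: each ring slot remembers the id of the work that last wrote it, and has_events_ stands in for the non-null start_/end_ event pointers; retiring only clears the slot if it still belongs to the work being retired.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplification of the ring-buffer overwrite scenario.
struct Entry {
  std::size_t id_ = 0;
  bool has_events_ = false;  // stands in for non-null start_/end_
};

// Record work `id` into a ring of entries, overwriting on wrap-around.
void record(std::vector<Entry>& ring, std::size_t id) {
  ring[id % ring.size()] = Entry{id, true};
}

// Retire work `id`: only clear its events if the slot still belongs to
// it, so a stale retire after an overwrite becomes a no-op.
bool retire(std::vector<Entry>& ring, std::size_t id) {
  Entry& slot = ring[id % ring.size()];
  if (slot.id_ != id) {
    return false;  // slot was reused by a newer work
  }
  slot.has_events_ = false;
  return true;
}
```

With a ring of 10 and works 0 through 10 recorded, `0 % 10 == 10 % 10`, so work 10 sits in work 0's slot; a retire of work 0 that ignored the id check would wrongly clear work 10's events.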
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k