Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update start_, end_ and retired only for the right entry when retire a work #128948

Closed
wants to merge 1 commit into from

Conversation

HOOLoLo
Copy link
Contributor

@HOOLoLo HOOLoLo commented Jun 18, 2024

Fixes #128805
If the buffer size of NCCLTraceBuffer is 10 and the pg has recorded 11 works, the entry of the work 0 will have been overwritten by the work 10, so when watchdog retire the work 0, the start_ and end_ of the entry 0 shouldn't be set to nullptr.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

Copy link

pytorch-bot bot commented Jun 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128948

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 66361dc with merge base 90d5a6f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 18, 2024
Copy link
Contributor

@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, lgtm!

Nit- were you able to make a test case fail reliably without the fix? If so you should also add the test case.

Cc @c-p-i-o

@@ -657,9 +659,6 @@ struct NCCLTraceBuffer {
entry->duration_ = duration.value();
}
}

entry->retired_ = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be holding guard.lock() when we reach this line and we should have re-verified id matches event id.

The proposed change looks like it should also be ok to me, but I don't understand why the issue can happen unless something is wrong with our locking.

Ok I see it now. What I said is only true if 'can compute duration' is true, which is only true if timing is enabled. The fix looks right to me.

@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 19, 2024

Thanks for the fix, lgtm!

Nit- were you able to make a test case fail reliably without the fix? If so you should also add the test case.

Cc @c-p-i-o

Test case added

@HOOLoLo HOOLoLo force-pushed the fix-entry-state branch 2 times, most recently from 4f32348 to 13f46e8 Compare June 19, 2024 12:03
@HOOLoLo HOOLoLo requested a review from wconstab June 19, 2024 12:06
@wconstab
Copy link
Contributor

Btw @HOOLoLo we have been developing scripts to analyze the dumped flight recorder data, I can share an incomplete version now or else we'll release it at some point. if you have any particular use case or feature request for it please let us know.

@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 20, 2024

@pytorchbot merge

Copy link

pytorch-bot bot commented Jun 20, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 20, 2024

has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

Btw @HOOLoLo we have been developing scripts to analyze the dumped flight recorder data, I can share an incomplete version now or else we'll release it at some point. if you have any particular use case or feature request for it please let us know.

@wconstab Oh, thanks i haven't encountered any new feature request right now. Can u approve this PR again? I pressed the request review again button by mistake.

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Jun 20, 2024

The following change will also take care of this change:
#126969
Basically, the Entrystruct wasn't being fully initialized correctly on every work.

@c-p-i-o c-p-i-o self-requested a review June 20, 2024 18:55
@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 21, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 21, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm6.1-py3.8 / test (default, 2, 2, linux.rocm.gpu)

Details for Dev Infra team Raised by workflow job

@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 21, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 21, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 24, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@HOOLoLo HOOLoLo force-pushed the fix-entry-state branch 2 times, most recently from 6ab5003 to 346a8fa Compare June 25, 2024 11:40
@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 26, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@HOOLoLo
Copy link
Contributor Author

HOOLoLo commented Jun 26, 2024

@c-p-i-o Hi, this pr has three workflows awaiting approval. Can you help me?

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Jun 26, 2024

I clicked "Approve and run"! Let's see if it works!

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Jun 26, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@clee2000
Copy link
Contributor

@pytorchbot merge

Cancelled the workflow to fix something else, sorry about that

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C10d]: Work state in dump trace file is not deterministic.
6 participants