[c10d][simple] increase the default heartbeat timeout to be larger #128751

Open
wants to merge 1 commit into
base: gh/shuqiangzhang/34/base

shuqiangzhang
Contributor

@shuqiangzhang shuqiangzhang commented Jun 14, 2024

Stack from ghstack (oldest at bottom):

Summary:
In multiple cases, we were seeing the monitor heartbeat timeout triggered by
commAbort, which is undesirable as it hides the real timeout reason
(e.g., commAbort) and misleads debugging toward other types of hangs (e.g.,
CUDA calls).
Test Plan:
CI

Tags:

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot bot commented Jun 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128751

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 5d331c5 with merge base 0344f95:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "oncall: distributed" (Add this issue/PR to distributed oncall triage queue) and "release notes: distributed (c10d)" (release notes category) labels Jun 14, 2024
shuqiangzhang added a commit that referenced this pull request Jun 14, 2024
ghstack-source-id: ba2bd5e7b92cbcc51672dc3c932057865b0a5a73
Pull Request resolved: #128751
  heartbeatTimeoutInSec_ =
-     getCvarInt(TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC, 60 * 10 /*10 Mins*/);
+     getCvarInt(TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC, 60 * 11 /*11 Mins*/);
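For readers unfamiliar with the pattern in the diff, here is a minimal sketch of "read an integer from an environment variable, else use a default". The helper `envIntOr` is hypothetical and written only for illustration; it is not c10d's `getCvarInt`, whose actual signature and parsing rules may differ. The only thing taken from the PR is the variable name and the new default of `60 * 11` seconds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <exception>
#include <string>

// Hypothetical stand-in for an env-var lookup with a default (NOT c10d's
// getCvarInt): returns the integer value of the environment variable `name`,
// or `fallback` when the variable is unset, empty, or not a plain integer.
int envIntOr(const char* name, int fallback) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') {
    return fallback;
  }
  try {
    const std::string text(raw);
    std::size_t parsed = 0;
    const int value = std::stoi(text, &parsed);
    // Reject trailing garbage such as "660s".
    return parsed == text.size() ? value : fallback;
  } catch (const std::exception&) {
    // std::stoi throws on non-numeric or out-of-range input.
    return fallback;
  }
}

// Mirrors the shape of the changed line: 11 minutes unless overridden.
int heartbeatTimeoutSec() {
  return envIntOr("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", 60 * 11);
}
```

Under this sketch, exporting `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=720` before launching would yield 720, and leaving it unset would yield 660.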
Contributor

I am OK with the changes, but just to be safe: are there any workloads we can use to verify that this indeed resolves the race condition? I mean, why not 12 mins or longer?

@wconstab
Contributor

Similar question to @fduwjj, it seems pretty unlikely that a 10% increase is a magic bullet unless we have some other 10 min timeout in the system that we are racing against.
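
The concern in the two review comments can be made concrete with a toy model. This is only a sketch under stated assumptions: two timers start at the same instant, each may fire up to `jitter` later than its nominal deadline, and the competing 10-minute timeout is the reviewer's hypothetical, not something this PR confirms. `FirstReporter` and `firstToFire` are hypothetical names, not part of c10d.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model: which of two timers reports first when both start at
// the same moment and each may fire up to `jitter` late.
enum class FirstReporter { Collective, Heartbeat, Ambiguous };

FirstReporter firstToFire(std::chrono::seconds collectiveTimeout,
                          std::chrono::seconds heartbeatTimeout,
                          std::chrono::seconds jitter) {
  assert(jitter.count() >= 0);
  const auto gap = heartbeatTimeout - collectiveTimeout;
  if (gap > jitter) {
    return FirstReporter::Collective;  // heartbeat margin exceeds the jitter
  }
  if (-gap > jitter) {
    return FirstReporter::Heartbeat;  // heartbeat fires clearly earlier
  }
  return FirstReporter::Ambiguous;  // deadlines too close: a race
}
```

In this model, 600 s against 600 s is a race for any jitter, 600 s against 660 s is resolved only while the jitter stays under 60 s, and a larger margin such as 720 s tolerates proportionally more delay. Whether the real system behaves this way is exactly what the reviewers are asking the author to verify.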

3 participants