[c10d][simple] increase the default heartbeat timeout to be larger #128751

shuqiangzhang · 2024-06-14T22:24:56Z

Stack from ghstack (oldest at bottom):

-> [c10d][simple] increase the default heartbeat timeout to be larger #128751

Summary:
In multiple cases, we were seeing the monitor heartbeat timeout triggered by
commAbort, which is undesirable as it hides the real timeout reason
(e.g., commAbort) and misleads debugging toward other types of hangs (e.g.,
CUDA calls).
Test Plan:
CI

Tags:

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot · 2024-06-14T22:24:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128751

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 5d331c5 with merge base 0344f95:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary: In multiple cases, we were seeing the monitor heartbeat timeout triggered by commAbort, which is undesirable as it hides the real timeout reason (e.g., commAbort) and misleads debugging toward other types of hangs (e.g., CUDA calls).
Test Plan: CI
Tags:
ghstack-source-id: ba2bd5e7b92cbcc51672dc3c932057865b0a5a73
Pull Request resolved: #128751

fduwjj · 2024-06-17T18:53:44Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

 heartbeatTimeoutInSec_ =
- getCvarInt(TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC, 60 * 10 /*10 Mins*/);
+ getCvarInt(TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC, 60 * 11 /*11 Mins*/);


I am OK with the change, but just to be safe: are there any workloads we can use to verify that this indeed resolves the race condition? I mean, why not 12 minutes or longer?

wconstab · 2024-06-21T13:54:05Z

Similar question to @fduwjj: it seems pretty unlikely that a 10% increase is a magic bullet unless we have some other 10-minute timeout in the system that we are racing against.
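For anyone who wants to try other values while this is discussed: the diff only changes the fallback passed to getCvarInt, so the effective timeout should still be overridable through the TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC environment variable without a rebuild. Below is a minimal sketch of that "environment variable with a default" pattern. The helper name envIntOrDefault is hypothetical and is not the c10d implementation; the real getCvarInt may differ in signature and error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an env-var lookup with a compiled-in default.
// This is NOT the c10d getCvarInt; it only illustrates the pattern.
int envIntOrDefault(const char* name, int fallback) {
  assert(name != nullptr);
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') {
    return fallback;  // variable unset or empty: use the default
  }
  try {
    std::size_t pos = 0;
    const std::string text(raw);
    const int value = std::stoi(text, &pos);
    // Reject trailing garbage such as "720abc".
    return pos == text.size() ? value : fallback;
  } catch (const std::exception&) {
    // std::stoi throws on non-numeric or out-of-range input.
    return fallback;
  }
}
```

With the variable unset, envIntOrDefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", 60 * 11) returns the new 11-minute fallback; exporting TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=720 before launch would make the same call return 720, which is one way to try the 12-minute setting on a real workload.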

pytorch-bot bot added the "oncall: distributed" and "release notes: distributed (c10d)" labels Jun 14, 2024

shuqiangzhang requested review from wconstab and fduwjj June 14, 2024 22:25

fduwjj reviewed Jun 17, 2024
