-
Notifications
You must be signed in to change notification settings - Fork 22k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TCPStore: improve connect and retry logic #129261
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129261
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (2 Unrelated Failures)As of commit d435632 with merge base 93a33bf (): FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
// client's first query for validation | ||
validate(); | ||
C10D_WARNING( | ||
"TCP client failed to connect/validate to host {}:{} retrying (try={}): {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include timeout and/or retry in this message?
Perhaps in the exception thrown above too to indicate we're giving after that timeout period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added timeout and added another message when we timeout
03aa955
to
e601d18
Compare
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, some additional thoughts
e601d18
to
00a9aef
Compare
Merge failedReason: New commits were pushed while merging. Please rerun the merge command. Details for Dev Infra teamRaised by workflow job |
00a9aef
to
a77e5a4
Compare
client_ = detail::TCPClient::connect(addr_, opts); | ||
// TCP connection established | ||
C10D_DEBUG("TCP client connected to host {}:{}", addr_.host, addr_.port); | ||
C10D_WARNING( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to WARN
here? or shall we keep it as INFO
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking any retry is notable, can always decrease it if there's any complaints. We warn on the equivalent retry loop in socket.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the fix & optimization!
// success | ||
break; | ||
} catch (const c10::DistNetworkError& ex) { | ||
if (deadline < std::chrono::steady_clock::now()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the backoff object can "know" about the deadline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it but decided against it -- but maybe that's the cleaner implementation. Could also wrap the entire retry block in the backoff instead
Still a bit on the fence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to keep it as is -- abstracting away some of this logic would be nice but I'm not sure there's a clean way to report errors etc in the way we want if we don't own the loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Minor comments, non blocking.
a77e5a4
to
d435632
Compare
@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally) |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times.
This PR does a few things:
sleep(std::chrono::milliseconds(numWorkers))
on init which can add significant delays to startup. This is no longer necessary per @XilunWu [c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer #116141Test plan:
Will do internal testing with some large scale jobs to ensure TCPStore works correctly.
At 4k scale: 4x improvement
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang