
TCPStore: improve connect and retry logic #129261

Closed

@d4l3k wants to merge 1 commit into main from d4l3k/tcpstore_validate_retry

Conversation

@d4l3k (Collaborator) commented Jun 21, 2024

We've been facing issues where TCPStore can successfully connect but then fail in the validate() function: when the listen backlog queue overflows (combined with connection resets being enabled), the freshly accepted connection gets reset, and long init times make this more likely.

This PR does a few things: it retries the full connect + validate sequence with exponential backoff (via the new Backoff.cpp) until the connect deadline is reached, and it improves the log messages emitted on each retry (see the review hunks below and the sketch that follows).
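A minimal sketch of that retry pattern, assuming hypothetical helper names (`connectWithRetry`, `connectAndValidate`); the real implementation lives in TCPStore.cpp and Backoff.cpp:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Retry the whole connect + validate sequence as one unit, with
// exponential backoff, until a deadline derived from the connect timeout.
void connectWithRetry(
    const std::function<void()>& connectAndValidate,
    std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  std::chrono::milliseconds delay{10};
  while (true) {
    try {
      connectAndValidate(); // connect() and validate() retried together
      return;               // success
    } catch (const std::exception&) {
      if (std::chrono::steady_clock::now() >= deadline) {
        throw; // out of time: surface the last error to the caller
      }
      std::this_thread::sleep_for(delay);
      // Double the delay each attempt, capped at 1 second.
      delay = std::min(delay * 2, std::chrono::milliseconds{1000});
    }
  }
}
```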

Test plan:

```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```

Will do internal testing with some large scale jobs to ensure TCPStore works correctly.

At 4k scale: ~4x improvement (1.98 s vs 8.20 s in the runs below; 8.20 / 1.98 ≈ 4.1).

```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py    (pytorch-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    1.98 secs    fish           external
   usr time    0.93 secs   91.00 micros    0.93 secs
   sys time    1.98 secs  954.00 micros    1.97 secs

tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10    (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py    (torchdrive-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    8.20 secs    fish           external
   usr time    2.15 secs    0.00 micros    2.15 secs
   sys time    2.76 secs  843.00 micros    2.76 secs
```
The test script (tcpstore_large_test.py):

```python
import time
from multiprocessing import Pool

import torch.distributed as dist

WORLD_SIZE = 10000


def run(rank):
    # Only log every WORLD_SIZE/10-th rank to keep the output readable.
    should_log = rank % (WORLD_SIZE // 10) == 0
    if should_log:
        print(f"started {rank}")
    store = dist.TCPStore(
        host_name="devvm4382.nao0.facebook.com",
        port=29500,
        world_size=WORLD_SIZE,
        is_master=rank == 0,
        use_libuv=True,
    )
    if should_log:
        print(f"init {rank}")
    store.set(f"key{rank}", "1234")
    if should_log:
        print(f"set {rank}")
    del store


def noop(rank):
    pass


print("starting pool")
with Pool(WORLD_SIZE) as pool:
    # Warm the pool with a no-op first so worker spawn time is excluded
    # from the measured TCPStore init time.
    pool.map(noop, range(WORLD_SIZE), 1)
    print("pool hot")
    start = time.time()
    pool.map(run, range(WORLD_SIZE), 1)
    print("run finished", time.time() - start)
```
```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py    (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang


pytorch-bot bot commented Jun 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129261

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d435632 with merge base 93a33bf:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 21, 2024
In torch/csrc/distributed/c10d/TCPStore.cpp:

```cpp
// client's first query for validation
validate();
C10D_WARNING(
    "TCP client failed to connect/validate to host {}:{} retrying (try={}): {}",
```


Include timeout and/or retry in this message?

Perhaps in the exception thrown above too, to indicate we're giving up after that timeout period.

@d4l3k (Collaborator, Author) replied:

Added timeout, and added another message for when we time out.
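For illustration, a hypothetical shape of the resolved messages, in the style of the hunk above; the exact strings, variable names, and the use of C10D_ERROR are assumptions, not the landed code:

```cpp
// Retry warning, now carrying the timeout (assumed wording):
C10D_WARNING(
    "TCP client failed to connect/validate to host {}:{} - retrying (try={}, timeout={}ms): {}",
    addr_.host, addr_.port, retry, opts.timeout.count(), ex.what());

// And a separate message once the deadline passes and we give up (assumed):
C10D_ERROR(
    "TCP client failed to connect/validate to host {}:{} - giving up after {}ms: {}",
    addr_.host, addr_.port, opts.timeout.count(), ex.what());
```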

@d4l3k d4l3k force-pushed the d4l3k/tcpstore_validate_retry branch from 03aa955 to e601d18 Compare June 21, 2024 20:21
@d4l3k (Collaborator, Author) commented Jun 21, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 21, 2024
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@kurman (Contributor) left a comment:


lgtm, some additional thoughts

torch/csrc/distributed/c10d/TCPStore.cpp (outdated; resolved)
torch/csrc/distributed/c10d/TCPStore.cpp (outdated; resolved)
@d4l3k d4l3k force-pushed the d4l3k/tcpstore_validate_retry branch from e601d18 to 00a9aef Compare June 21, 2024 23:54
@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Raised by workflow job (details for Dev Infra team).

@d4l3k d4l3k force-pushed the d4l3k/tcpstore_validate_retry branch from 00a9aef to a77e5a4 Compare June 24, 2024 20:38
@d4l3k d4l3k changed the title TCPStore: retry on validate errors TCPStore: improve connect and retry logic Jun 24, 2024
In torch/csrc/distributed/c10d/TCPStore.cpp:

```cpp
client_ = detail::TCPClient::connect(addr_, opts);
// TCP connection established
C10D_DEBUG("TCP client connected to host {}:{}", addr_.host, addr_.port);
C10D_WARNING(
```
Reviewer (Contributor):

Want to WARN here? Or shall we keep it as INFO?

@d4l3k (Collaborator, Author) replied:

I was thinking any retry is notable; we can always decrease it if there are complaints. We warn on the equivalent retry loop in socket.cpp.

@XilunWu (Contributor) left a comment:

LGTM! Thanks for the fix & optimization!

torch/csrc/distributed/c10d/Backoff.cpp (outdated; resolved)
```cpp
      // success
      break;
    } catch (const c10::DistNetworkError& ex) {
      if (deadline < std::chrono::steady_clock::now()) {
```
Reviewer (Contributor):

I wonder if the backoff object can "know" about the deadline

@d4l3k (Collaborator, Author) replied:

I thought about it but decided against it; maybe that's the cleaner implementation, though. Could also wrap the entire retry block in the backoff instead.

Still a bit on the fence.

@d4l3k (Collaborator, Author) added:

I think I'm going to keep it as is. Abstracting away some of this logic would be nice, but I'm not sure there's a clean way to report errors etc. in the way we want if we don't own the loop.
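A sketch of the split being discussed, with assumed names (the real interface is in Backoff.cpp): the backoff object only knows how long to sleep next, while the caller owns the loop, the deadline check, and the error reporting.

```cpp
#include <algorithm>
#include <chrono>

// Illustration only: the backoff computes intervals but never decides
// when to stop; that responsibility stays with the caller, as in the
// hunk above.
class ExponentialBackoff {
 public:
  std::chrono::milliseconds nextBackoff() {
    auto current = current_;
    // Double the interval for next time, capped at max_.
    current_ = std::min(current_ * 2, max_);
    return current;
  }

 private:
  std::chrono::milliseconds current_{10};
  std::chrono::milliseconds max_{1000};
};
```

Keeping the deadline outside the backoff lets each retry site log its own context and throw its own error type, which matches the decision above.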

@c-p-i-o (Contributor) left a comment:

LGTM. Minor comments, non-blocking.

@d4l3k d4l3k force-pushed the d4l3k/tcpstore_validate_retry branch from a77e5a4 to d435632 Compare June 24, 2024 23:12
@facebook-github-bot (Contributor) commented:

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@d4l3k d4l3k deleted the d4l3k/tcpstore_validate_retry branch June 25, 2024 19:26
Labels: ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)