
[C10D] Avoid lazily creating P2P communicators #129147

Open
wants to merge 10 commits into base: gh/wconstab/309/base

Conversation

wconstab
Contributor

@wconstab wconstab commented Jun 20, 2024

Stack from ghstack (oldest at bottom):

Users that opt into eager initialization (enabled by passing device_id
to init_process_group) will now be able to take advantage of reusing
the existing communicator for the process group for send/recv ops rather
than creating new 2-rank communicators for every pair of ranks
performing send/recv.

Existing users not passing device_id to init_process_group will now get
a warning suggesting they do so, but they will still get the
functionality they have today: automatic creation of pair-wise
communicators.
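
For illustration, a minimal sketch of the opt-in path described above (assuming a torchrun-style launch that sets LOCAL_RANK and the usual rendezvous environment variables):

```python
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Passing device_id opts into eager NCCL communicator creation, so later
# send/recv calls can reuse the process group's communicator instead of
# lazily creating a 2-rank communicator per peer pair.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)
```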

Fixes #129140

Test plan

I didn't figure out a good way to unit test this change (specifically, to make sure we avoid creating extra communicators when we opt into the eager init path).

In the meantime, I've locally verified that a script issuing a send/recv gets the WARNING printed about the fallback path. If I modify the script either to pass device_id=torch.device("cuda:{local_rank}") to init_process_group or to issue an allreduce before the send/recv, in both cases the warning about the fallback path does not appear.
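
For reference, a hypothetical reconstruction of the kind of script described above (launched with torchrun on at least 2 ranks; not the actual script used for the verification):

```python
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Variant 1: lazy init -- expect the fallback WARNING on the send/recv below.
dist.init_process_group(backend="nccl")
# Variant 2: eager init -- no fallback warning expected.
# dist.init_process_group(
#     backend="nccl", device_id=torch.device(f"cuda:{local_rank}"))

t = torch.ones(1, device=f"cuda:{local_rank}")

# Variant 3: an allreduce before the send/recv initializes the PG
# communicator, so the send/recv also avoids the fallback path.
# dist.all_reduce(t)

if dist.get_rank() == 0:
    dist.send(t, dst=1)
elif dist.get_rank() == 1:
    dist.recv(t, src=0)

dist.destroy_process_group()
```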

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang @d4l3k

Differential Revision: D58842474


pytorch-bot bot commented Jun 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129147

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 3 Unrelated Failures

As of commit 0958986 (merge base could not be retrieved; please contact dev infra):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 20, 2024
@wconstab wconstab requested review from kwen2501, chipturner, fegin, H-Huang, pavanbalaji and eqy and removed request for kwen2501 June 20, 2024 17:43
Contributor

@pavanbalaji pavanbalaji left a comment


LGTM!

@wconstab wconstab added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 20, 2024
@wconstab
Contributor Author

@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wconstab wconstab requested a review from kwen2501 June 21, 2024 00:48
@wconstab
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • Distributed (mrshenli, pritamdamania87, zhaojuanmao, rohan-varma, wanchaol, ...)
  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@nvcastet

Thank you for improving the p2p communicator creation path! I was just looking at it a few weeks ago.
I agree that it will help with a lot of use cases.

Could it be an issue for the case where we want to overlap p2p communications? Having a single communicator will just serialize the comm ops issued on that communicator.
I know it is not recommended to overlap 2 NCCL ops, since if it is not properly architected, it can lead to deadlocks.
For apps relying on overlap of p2p comms, this PR would create a regression, right?

@nvcastet

Alternatively, those apps would need to update their code to properly create sub-groups for pairs where there is overlap.
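
For illustration, a rough sketch of that workaround, assuming a ring-neighbor pairing (the pairing logic and names here are illustrative, not taken from any particular application):

```python
import torch.distributed as dist

# Assumes the default NCCL process group is already initialized and
# world_size >= 2.
rank = dist.get_rank()
world_size = dist.get_world_size()

# new_group() must be called by every rank, with the same arguments and
# in the same order, even by ranks that are not members of the pair.
pair_groups = {}
for r in range(world_size):
    peer = (r + 1) % world_size
    if peer == r:
        continue
    key = tuple(sorted((r, peer)))
    if key not in pair_groups:
        pair_groups[key] = dist.new_group(ranks=list(key))

# p2p ops that need to overlap then target the pair-specific group, e.g.:
#   next_rank = (rank + 1) % world_size
#   dist.isend(t, dst=next_rank,
#              group=pair_groups[tuple(sorted((rank, next_rank)))])
```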

@wconstab
Contributor Author

Could it be an issue for the case we want to overlap p2p communications? Having a single communicator will just serialize the comm ops of that communicator.

Can you be more specific about what the P2P comms need to overlap with?

C10D already puts comms on a separate stream from compute, so it should be possible for these to overlap. Is the goal for P2P ops to overlap with other communication ops in the same PG?

@nvcastet

nvcastet commented Jun 24, 2024

Is the goal for P2P ops to overlap with other communication ops in the same PG?

Correct.

For example, the megatron-lm interleaved pipeline schedule will overlap send/recv ops targeting different peers using the same PG.
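
For context, a rough sketch of that pattern (not Megatron-LM's actual code): ungrouped isend/irecv to different pipeline neighbors that the schedule expects to run concurrently.

```python
import torch
import torch.distributed as dist

# Assumes an already-initialized NCCL process group with world_size >= 2
# and that torch.cuda.set_device() has been called for this rank.
rank = dist.get_rank()
world_size = dist.get_world_size()
prev_rank = (rank - 1) % world_size
next_rank = (rank + 1) % world_size

send_buf = torch.randn(1024, device="cuda")
recv_buf = torch.empty_like(send_buf)

# Issued without batch_isend_irecv / NCCL grouping, so each call is a
# separate NCCL op. Whether the two ops actually overlap (rather than
# serialize) depends on whether they land on separate communicators and
# streams -- which is exactly what this thread is discussing.
reqs = [
    dist.irecv(recv_buf, src=prev_rank),
    dist.isend(send_buf, dst=next_rank),
]
for req in reqs:
    req.wait()
```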

@pavanbalaji
Contributor

Is the goal for P2P ops to overlap with other communication ops in the same PG?

Correct.

For example, the megatron-lm interleaved pipeline schedule will overlap send/recv ops targeting different peers using the same PG.

You can use the same NCCL communicator (PyTorch PG) but issue different P2P operations on different streams. That won't be serialized.

@wconstab
Contributor Author

wconstab commented Jun 25, 2024

I think the issue is that c10d manages the stream used for p2p ops, and it's bundled together 1:1 with the nccl communicator today.

I amended my RFC to account for this: #129140

@nvcastet do you think this amendment would solve your issue? I can try to make a PR to do this if so.

edit: I updated this PR to attempt to decouple the nccl comm from the nccl stream. It might be fairly straightforward to do this, but I need to re-examine it with fresh eyes and I assume I may have missed something.

wconstab added a commit that referenced this pull request Jun 25, 2024
Users that opt into eager initialization (enabled by passing device_id
to init_process_group) will now be able to take advantage of reusing
the existing communicator for the process group for send/recv ops rather
than creating new 2-rank communicators for every pair of ranks
performing send/recv.

Existing users not passing device_id to init_process_group will now get
a warning suggesting they do so, but they will still get the
functionality they have today: automatic creation of pair-wise
communicators.

When reusing an existing communicator, a dedicated nccl stream will
still be used for each pair of P2P ranks so that pair-wise comm ops can
overlap with each other rather than being serialized on a single stream
per PG.

Fixes #129140

ghstack-source-id: 3db38c68ea6a4947ef4a3f9fa61fc4865513f63c
Pull Request resolved: #129147
@nvcastet

@pavanbalaji @wconstab
Unfortunately, to overlap 2 NCCL comm ops, you need at least these 2 conditions:

  • Use different NCCL communicators
  • Place ops on different CUDA streams

An NCCL communicator will serialize the ops even if they are put on different streams, because they compete for the communicator's internal resources (internal staging buffers, etc.).

@nvcastet

So to preserve overlap behavior, we would still need to create those p2p communicators in the PG.

The only other option I see (besides the obvious one of shelving this RFE/PR for now) to avoid those extra communicators is to have an explicit config setting on the process group that disables the creation of p2p communicators (and to document that unbatched p2p ops of this PG will be serialized with that setting).

As a side note, the NCCL team is actively working on reducing communicator init cost, so I would not be surprised to see improvements in upcoming releases.

@@ -1993,7 +1993,8 @@ std::shared_ptr<NCCLComm> ProcessGroupNCCL::getNCCLComm(
     OpType opType,
     int p2pRank,
     bool isSendRecvSelf,
-    std::optional<const std::string> streamKey) {
+    std::optional<const std::string> streamKey,
+    bool onlyCached) {
Collaborator


Thoughts on having a getOrCreateNCCLComm and then just a getNCCLComm? It's a bit unintuitive that this function does both, and splitting the behavior might be better than adding a bool to an already complicated function signature.


// Note on keys
// devKey identifies this gpu device and is used for accessing a nccl
// Communicator for this PG per device p2pKey identifies a pair of ranks doing
Collaborator


Missing period/new line between device and p2pKey?

Contributor Author


Thanks. lintrunner totally hosed me here.

  // First let NCCL streams wait for input tensors allocation streams
- syncStream(device, ncclEvents_[key], ncclStream);
+ syncStream(device, ncclEvents_[p2pKey], ncclStream);
Collaborator


In the old logic this is conditionally the p2pKey or the devKey -- is it intentional to always use the p2p key now?

Contributor Author


Yes, it is intentional to always use the p2pKey for the stream, based on the (wrong) assumption that using the same comm but a different stream would allow overlap between p2p ops involving different peers.

But I suspect I missed something: I probably should have kept this as devKey for batched p2p ops and only made this p2pKey for true p2p ops.

@pavanbalaji
Contributor

pavanbalaji commented Jul 22, 2024

@pavanbalaji @wconstab Unfortunately, to overlap 2 NCCL comm ops, you need at least these 2 conditions:

  • Use different NCCL communicators
  • Place ops on different CUDA streams

An NCCL communicator will serialize the ops even if they are put on different streams, because they compete for the communicator's internal resources (internal staging buffers, etc.).

Hi @nvcastet - we should discuss this. It's not clear why NCCL needs to serialize point-to-point operations on the same communicator. I understand that collective operations need to be serialized, but p2p operations should be independent of each other. NCCL should be able to handle internal resources correctly in such cases. Is there a technical reason for this restriction or is it just an artifact of the current implementation? If it's an artifact of the current implementation, PyTorch shouldn't be working around that. We should fix it in NCCL.

@nvcastet

nvcastet commented Jul 22, 2024

@pavanbalaji

It's not clear why NCCL needs to serialize point-to-point operations on the same communicator.

An NCCL communicator will serialize ungrouped ops because they share internal resources (net buffers, etc.).
For the megatron-lm use case mentioned earlier, we don't group p2p ops, in order to get finer-grained overlapping.

@pavanbalaji
Contributor

@pavanbalaji

It's not clear why NCCL needs to serialize point-to-point operations on the same communicator.

An NCCL communicator will serialize ungrouped ops because they share internal resources (net buffers, etc.). For the megatron-lm use case mentioned earlier, we don't group p2p ops, in order to get finer-grained overlapping.

Hi @nvcastet - This seems overly restrictive and is different from what other communication libraries (such as MPI) provide. Creating a new communicator for every point-to-point pair that we need to talk to is very expensive with respect to the number of resources used (and performance in some cases).

@nvcastet

You only need to create a new communicator for point-to-point if you are going to overlap it with another NCCL op.
Those are the current semantics of the NCCL library, which is what we need to work with for this PR.
I would encourage you to move the discussion to the NCCL repo by opening a discussion/RFE there so that the NCCL engineers can scope your proposal.

Labels
ciflow/trunk Trigger trunk jobs on your pull request oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants