[C10D] Avoid lazily creating P2P communicators #129147
base: gh/wconstab/309/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129147
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures, 3 Unrelated Failures as of commit 0958986 (failed to retrieve merge base; please contact dev infra).
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM!
@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
Thank you for improving the p2p communicator creation path! I was just looking at it a few weeks ago. Could it be an issue for cases where we want to overlap p2p communications? Having a single communicator will just serialize that communicator's comm ops.
Alternatively, those apps would need to update their code to properly create sub-groups for the pairs where there is overlap.
Can you be more specific about what the P2P comms need to overlap with? C10D already puts comms on a separate stream from compute, so it should be possible for these to overlap. Is the goal for P2P ops to overlap with other communication ops in the same PG?
Correct. For example, the Megatron-LM interleaved pipeline schedule will overlap send/receive ops targeting different peers using the same PG.
You can use the same NCCL communicator (PyTorch PG) but issue different P2P operations on different streams. That won't be serialized.
I think the issue is that c10d manages the stream used for p2p ops, and it's bundled 1:1 with the NCCL communicator today. I amended my RFC to account for this: #129140. @nvcastet, do you think this amendment would solve your issue? I can try to make a PR to do this if so.
Edit: I updated this PR to attempt to decouple the NCCL comm from the NCCL stream. It might be fairly straightforward to do, but I need to re-examine it with fresh eyes, and I assume I may have missed something.
Users that opt into eager initialization (enabled by passing device_id to init_process_group) will now be able to reuse the existing communicator for the process group for send/recv ops, rather than creating new 2-rank communicators for every pair of ranks performing send/recv.

Existing users not passing device_id to init_process_group will now get a warning suggesting they do so, but they will still get the functionality they have today: automatic creation of pair-wise communicators.

When reusing an existing communicator, a dedicated NCCL stream will still be used for each pair of P2P ranks, so that pair-wise comm ops can overlap with each other rather than being serialized on a single stream per PG.

Fixes #129140

ghstack-source-id: 3db38c68ea6a4947ef4a3f9fa61fc4865513f63c
Pull Request resolved: #129147
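A minimal sketch of the reuse policy described above, modeled in plain Python (not the actual C10D implementation; all names and key formats are invented for illustration): with eager init, send/recv reuses the process-group communicator but still gets a dedicated per-pair stream; without it, a 2-rank communicator is lazily created per pair.

```python
# Hypothetical model of the communicator-reuse policy in this PR.
# None of these names correspond to the real ProcessGroupNCCL API.

class ProcessGroupModel:
    def __init__(self, eager_device=None):
        self.comms = {}     # keyed by a devKey or a p2pKey
        self.streams = {}   # keyed by p2pKey: one stream per rank pair
        if eager_device is not None:
            # Eager init: one communicator for the whole PG on this device.
            self.comms[f"dev:{eager_device}"] = "pg-comm"

    def comm_for_send_recv(self, device, my_rank, peer_rank):
        lo, hi = sorted((my_rank, peer_rank))
        p2p_key = f"p2p:{lo}:{hi}"
        # A dedicated stream per pair, so pair-wise ops can overlap.
        self.streams.setdefault(p2p_key, f"stream-{p2p_key}")
        dev_key = f"dev:{device}"
        if dev_key in self.comms:
            # Eager path: reuse the PG-wide communicator.
            return self.comms[dev_key]
        # Fallback path: lazily create a 2-rank communicator per pair.
        return self.comms.setdefault(p2p_key, f"pair-comm-{p2p_key}")
```

With eager init the model holds exactly one communicator regardless of how many peers a rank talks to, while the fallback path accumulates one per pair.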
@pavanbalaji @wconstab
An NCCL communicator will serialize the ops even if they are put on different streams (because they compete for the NCCL communicator's internal resources: internal staging buffers, etc.).
So to preserve overlap behavior, we would still need to create those p2p communicators in the PG. The only other option I see (besides the obvious one of putting this RFE/PR on the shelf for now) to avoid those extra communicators is to have an explicit config setting on the process group to disable the creation of p2p communicators (and documenting that unbatched p2p ops of this PG will be serialized with that setting). As a side note, the NCCL team is actively working on reducing communicator init cost, so I would not be surprised to see improvements in upcoming releases.
@@ -1993,7 +1993,8 @@ std::shared_ptr<NCCLComm> ProcessGroupNCCL::getNCCLComm(
     OpType opType,
     int p2pRank,
     bool isSendRecvSelf,
-    std::optional<const std::string> streamKey) {
+    std::optional<const std::string> streamKey,
+    bool onlyCached) {
Thoughts on having a getOrCreateNCCLComm and then just a getNCCLComm? It's a bit unintuitive that this function does both, and splitting the behavior might be better than adding a bool to an already complicated function signature.
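The reviewer's suggested split can be sketched in a few lines, here modeled as a plain Python cache (a hypothetical stand-in for the C++ getNCCLComm cache, not the real API): a pure getter that never creates, and a separate get-or-create, instead of an `onlyCached` flag.

```python
# Hypothetical illustration of splitting "lookup" from "lookup-or-create".

class CommCache:
    def __init__(self):
        self._cache = {}

    def get_nccl_comm(self, key):
        # Pure lookup: returns the cached communicator, or None on a miss.
        return self._cache.get(key)

    def get_or_create_nccl_comm(self, key):
        # Lookup that creates (and caches) the communicator on a miss.
        if key not in self._cache:
            self._cache[key] = f"comm-{key}"
        return self._cache[key]
```

Call sites that only want an already-initialized communicator (the `onlyCached=true` case) then read unambiguously, with no boolean parameter to misinterpret.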
// Note on keys
// devKey identifies this gpu device and is used for accessing a nccl
// Communicator for this PG per device p2pKey identifies a pair of ranks doing
Missing period/new line between `device` and `p2pKey`?
thanks. lintrunner totally hosed me here.
// First let NCCL streams wait for input tensors allocation streams
- syncStream(device, ncclEvents_[key], ncclStream);
+ syncStream(device, ncclEvents_[p2pKey], ncclStream);
In the old logic this is conditionally the p2pKey or the devKey -- is it intentional to always use the p2p key now?
Yes, it is intentional to always use the p2pKey for the stream, based on the (wrong) assumption that using the same comm but different streams would allow overlap between p2p ops involving different peers.
But I suspect I missed something; I probably should have kept this as devKey for batched-p2p ops and only made it p2pKey for true p2p ops.
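The devKey/p2pKey distinction the author describes can be sketched as a small key-selection function (a hypothetical illustration; the key formats and function name are invented, not the real C10D scheme): batched p2p ops behave like collectives and sync on a per-device key, while individual send/recv ops get a per-pair key so each pair's stream can overlap.

```python
# Hypothetical sketch of the stream-key selection discussed above.

def stream_key(device_index, my_rank, peer_rank, batched_p2p):
    if batched_p2p:
        # Batched p2p behaves like a collective: one stream per device.
        return f"dev:{device_index}"
    # True p2p: a dedicated stream per rank pair, ordered so that both
    # ranks of the pair compute the same key.
    lo, hi = sorted((my_rank, peer_rank))
    return f"p2p:{lo}:{hi}"
```

Sorting the ranks makes the key symmetric, so the sender and receiver of a pair land on the same stream entry.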
Hi @nvcastet - we should discuss this. It's not clear why NCCL needs to serialize point-to-point operations on the same communicator. I understand that collective operations need to be serialized, but p2p operations should be independent of each other. NCCL should be able to handle internal resources correctly in such cases. Is there a technical reason for this restriction, or is it just an artifact of the current implementation? If it's an artifact of the current implementation, PyTorch shouldn't be working around that. We should fix it in NCCL.
An NCCL communicator will serialize ungrouped ops because they share internal resources (net buffers, etc.).
Hi @nvcastet - This seems overly restrictive and is different from what other communication libraries (such as MPI) provide. Creating a new communicator for every point-to-point pair that we need to talk to is very expensive with respect to the number of resources used (and performance in some cases).
You only need to create a new communicator for pt-to-pt if you are going to overlap it with another NCCL Op. |
Stack from ghstack (oldest at bottom):
Users that opt into eager initialization (enabled by passing device_id
to init_process_group) will now be able to reuse the existing
communicator for the process group for send/recv ops, rather than
creating new 2-rank communicators for every pair of ranks performing
send/recv.
Existing users not passing device_id to init_process_group will now get
a warning suggesting they do so, but they will still get the
functionality they have today: automatic creation of pair-wise
communicators.
Fixes #129140
Test plan
I didn't figure out a good way to unit test this change (specifically, to make sure we avoid creating extra communicators when we opt into the eager init path).
In the meantime, I've locally verified that a script that issues a send/recv gets the WARNING printed about the fallback path, and if I modify the script to either pass
device_id=torch.device(f"cuda:{local_rank}")
to init_process_group or issue an allreduce before the send/recv, in both cases the warning about the fallback path does not appear.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang @d4l3k
Differential Revision: D58842474