[RFC][C10D] Avoid creating new nccl communicator for each P2P pair #129140
Labels
oncall: distributed
Add this issue/PR to distributed oncall triage queue
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Motivation
Inside
ProcessGroupNCCL::pointToPoint
(which is the impl for all send/recv and batched send/recv operations), we have logic that creates (and caches) a unique nccl communicator for any pair of sender/recvr. This seems pointless, since the send/recv op could use the same communicator as is used for all collectives on this PG, and creating new communicators is expensive both in nccl resources and runtime.The reasons for this are (1) legacy behavior for PGNccl was to lazily initialize nccl communicators on first collective, so we cannot be sure that the collective communicator for the PG has been created yet upon the first call to PointToPoint, and (2) in pointToPoint we do not require all ranks in the PG to collectively call the pointToPoint API (which we do require for any collective API), and as such we cannot lazily create the full collective communicator.
Proposal
The proposal takes advantage of an existing feature called 'eager initialization'. When users call
init_process_group
, they can optionally pass a 'device_id' argument, which then enables/causes the initialization to happen eagerly. This means the root communicator is created and can be used by the first operation rather than the first operation having to create it. Additionally, any call tonew_group
would eagerly call comm_split if it is available in NCCL, or eagerly create a new communicator otherwise. Hence for all pytorch PGs, we can assume the 'PG communicator' exists at the time of a call to pointToPoint in this mode.For (2), I think the easiest way to implement this is to modify
ProcessGroupNCCL::getNCCLComm
to accept an argument 'cached_only' which defaults to false. When set to true, getNCCLComm(cached_only=True) would return an existing communicator or nullopt. This would let pointToPoint throw a warning and then create the backup pairwise communicator for the legacy case.Amended Proposal to address comm stream and overlap
For P2P ops, we want to allow using a separate cuda stream from the stream that collective ops or p2p ops between other peers run on, facilitating overlap between those comm operations. This can be enabled by decoupling 'stream' and 'ncclcomm', both of which are controlled by ProcessGroupNCCL today and bundled together. The proposal there is to unbundle them.
*at all call sites where we issue collectives, we'll need to audit which key is used to access the stream cache and use the p2p one where appropriate.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang @d4l3k
The text was updated successfully, but these errors were encountered: