
Add some CPU collectives to the NCCL TL #570

Open · wants to merge 3 commits into master
Conversation

pavanbalaji

What

Adds support for CPU collectives to the NCCL TL (only a subset is covered for now).

Why?

Currently, UCC requires two TLs (NCCL and UCP) to fully support NVIDIA GPU platforms. This patch allows the NCCL TL to support CPU collectives too, so we do not need to rely on two different TLs for full coverage.

How?

Stages CPU data through the GPU and makes NCCL calls on the GPU-resident data.
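For illustration, the staging pattern for a broadcast looks roughly like the sketch below. The function and buffer names (bcast_cpu_buffer, scratch) are hypothetical, not the PR's actual code:

static ncclResult_t bcast_cpu_buffer(void *cpu_buf, size_t len, int root,
                                     void *scratch, ncclComm_t comm,
                                     cudaStream_t stream)
{
    /* Hedged sketch; "scratch" is a device-resident buffer large enough
     * to hold the payload. Requires <cuda_runtime.h> and <nccl.h>. */
    ncclResult_t st;

    /* Stage the host data into GPU memory so NCCL sees a device buffer. */
    cudaMemcpyAsync(scratch, cpu_buf, len, cudaMemcpyHostToDevice, stream);
    st = ncclBroadcast(scratch, scratch, len, ncclChar, root, comm, stream);
    if (st != ncclSuccess) {
        return st;
    }
    /* Copy the result back out to the original CPU buffer. */
    cudaMemcpyAsync(cpu_buf, scratch, len, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    return ncclSuccess;
}

In the actual patch the staging buffer is fixed in size, so large payloads are processed in multiple rounds (see the memory-pool commit below).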

We are going to create additional scratch buffers for other collectives. So instead of a generic name such as "scratch_buf", rename the existing buffer to "barrier_scratch", so it is clearer what it is for.

Currently, UCC/NCCL does not support CPU buffers. Any communication with CPU buffers would need to go through the UCC/UCP backend, thus requiring support for multiple backends for coverage on a single platform. This patch adds support for bcast and allgatherv (only the p2p algorithm of allgatherv).

This patch adds support for using fixed-buffer-length memory pools for CPU collectives in NCCL. This reduces memory usage and bounds peak memory so that it scales with the number of concurrent CPU collectives rather than with the sizes of the collectives themselves.
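The lifecycle of a pooled scratch buffer can be sketched as follows. ucc_mpool_get/ucc_mpool_put are UCC's pool entry points (ucc_mpool_put appears later in this diff); the pool name cpu_coll_pool is illustrative:

    /* Each CPU collective borrows one fixed-size element from the pool and
     * returns it when done, so peak memory is roughly
     *   (element size) x (number of in-flight CPU collectives),
     * independent of the collectives' message sizes. */
    task->cpu_coll_scratch_buf = ucc_mpool_get(&team->cpu_coll_pool);
    if (!task->cpu_coll_scratch_buf) {
        return UCC_ERR_NO_MEMORY;
    }
    /* ... drive the collective in UCC_TL_NCCL_SCRATCH_BUF_SIZE-sized rounds ... */
    ucc_mpool_put(task->cpu_coll_scratch_buf);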
@swx-jenkins3

Can one of the admins verify this patch?

    if (cu_st != cudaSuccess) {
        return UCC_ERR_NO_MEMORY;
    }

@SeyedMir commented on Jul 20, 2022:

RDMA and IPC are not supported with managed memory (on most platforms), so there will be another on-device staging copy.

@vspetrov (Collaborator) left a comment:

Generic comment regarding code style:
Local variables should be grouped at the beginning of the function, with initialized variables first, followed by one empty line after the declarations.

Please run git-clang-format using our clang style file.

Use ucc_rank_t instead of int whenever declaring a rank/peer variable.
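For example, under that convention (illustrative function):

static ucc_status_t example_fn(ucc_tl_nccl_team_t *team)
{
    ucc_rank_t size = UCC_TL_TEAM_SIZE(team); /* initialized vars first */
    ucc_rank_t peer;                          /* uninitialized vars after */

    /* one empty line after the local declarations, then the body */
    for (peer = 0; peer < size; peer++) {
        /* per-peer work */
    }
    return UCC_OK;
}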

@@ -37,6 +37,8 @@
#include "utils/profile/ucc_profile_off.h"
#endif

#define UCC_TL_NCCL_SCRATCH_BUF_SIZE (1024 * 1024)
Collaborator:

Why not a parameter?
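For reference, UCC TLs usually expose such sizes through their config tables; a hypothetical entry (the scratch_buf_size field does not exist in this PR) might look like:

    /* Hypothetical entry for ucc_tl_nccl_lib_config_table; "1m" would keep
     * the current 1 MiB default while making it overridable at runtime. */
    {"SCRATCH_BUF_SIZE", "1m",
     "Size of the device scratch buffer used to stage CPU collectives",
     ucc_offsetof(ucc_tl_nccl_lib_config_t, scratch_buf_size),
     UCC_CONFIG_TYPE_MEMUNITS},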

    void      *cpu_sbuf;
    void      *staged_sbuf;
    uintptr_t  sbuf_len;

Collaborator:

Remove the empty lines.

    void      *first_peer_cpu_rbuf;
    uintptr_t  first_peer_len;

    int        last_peer_rank;
Collaborator:

int -> ucc_rank_t
(during initialization set to UCC_RANK_INVALID instead of -1)

    uintptr_t  last_peer_len;
} window_bounds_t;

#define MIN(a, b) (((a) < (b)) ? (a) : (b))
Collaborator:

We have ucc_min and ucc_max in ucc_math.h; please use those instead.
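i.e., something like (variable names illustrative):

    #include "utils/ucc_math.h"

    len = ucc_min(remaining, UCC_TL_NCCL_SCRATCH_BUF_SIZE);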


    /* sbuf setup */
    uintptr_t sbuf_start = 0;
    for (int peer = 0; peer < UCC_TL_TEAM_RANK(team); peer++) {
Collaborator:

Use ucc_rank_t for peer, and please move the declaration to the beginning of the function.

{
    ucc_coll_task_t    *coll_task = (ucc_coll_task_t *)data;
    ucc_tl_nccl_task_t *task      = ucc_derived_of(coll_task,
                                                   ucc_tl_nccl_task_t);
Collaborator:

window_bounds_t win; belongs with the variable declarations, and the empty line goes after them.

    for (int peer = 0; peer < UCC_TL_TEAM_SIZE(team); peer++) {
        total_bytes += ucc_coll_args_get_count(args, args->dst.info_v.counts,
                                               peer) * rdt_size;
    }
    int num_rounds = total_bytes / UCC_TL_NCCL_SCRATCH_BUF_SIZE +
Collaborator:

I would cache num_rounds on the task as well, to avoid repeated recalculation.

                     !!(total_bytes % UCC_TL_NCCL_SCRATCH_BUF_SIZE);

    if (task->cpu_coll_round == num_rounds) {
        ucc_mpool_put(task->cpu_coll_scratch_buf);
Collaborator:

Why isn't the mpool_put done in collective_finalize?
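A sketch of what that would look like, assuming the finalize callback takes the usual coll_task argument (names follow the snippets above where visible, otherwise illustrative):

    /* Illustrative: release the pooled scratch buffer in finalize rather
     * than in the progress/completion path. */
    static ucc_status_t ucc_tl_nccl_coll_finalize(ucc_coll_task_t *coll_task)
    {
        ucc_tl_nccl_task_t *task = ucc_derived_of(coll_task,
                                                  ucc_tl_nccl_task_t);

        if (task->cpu_coll_scratch_buf != NULL) {
            ucc_mpool_put(task->cpu_coll_scratch_buf);
            task->cpu_coll_scratch_buf = NULL;
        }
        return UCC_OK;
    }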

{
    cudaError_t cu_st;

    cu_st = cudaMallocManaged((void **)chunk_p, *size_p, cudaMemAttachGlobal);
Collaborator:

As we discussed during the last WG meeting, let's use explicitly host-mapped memory instead of managed memory.
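With explicitly host-mapped (pinned) memory, the chunk allocation above would become something like:

    /* Sketch: pinned host memory mapped into the device address space,
     * instead of cudaMallocManaged. */
    cu_st = cudaHostAlloc((void **)chunk_p, *size_p, cudaHostAllocMapped);
    if (cu_st != cudaSuccess) {
        return UCC_ERR_NO_MEMORY;
    }
    /* The device-side alias comes from cudaHostGetDevicePointer(); on UVA
     * platforms it is the same address as the host pointer. */

This sidesteps the managed-memory limitations noted earlier (RDMA/IPC support), since the allocation is ordinary pinned host memory that the device can also address.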
