
Add some CPU collectives to the NCCL TL #570

Open · wants to merge 3 commits into master
Conversation

pavanbalaji

What

Adds support for CPU collectives to the NCCL TL (only a subset is covered for now).

Why?

Currently, UCC requires two TLs (NCCL and UCP) to fully support NVIDIA GPU platforms. This patch allows the NCCL TL to support CPU collectives too, so we do not need to rely on two different TLs for full coverage.

How?

Stages CPU data through the GPU and makes NCCL calls on the GPU-resident data.
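For illustration, the staging pattern for a broadcast looks roughly like the sketch below. The function and buffer names (bcast_cpu_buffer, scratch) are hypothetical, not the PR's actual code:

static ncclResult_t bcast_cpu_buffer(void *cpu_buf, size_t len, int root,
                                     void *scratch, ncclComm_t comm,
                                     cudaStream_t stream)
{
    /* Hedged sketch; "scratch" is a device-resident buffer large enough
     * to hold the payload. Requires <cuda_runtime.h> and <nccl.h>. */
    ncclResult_t st;

    /* Stage the host data into GPU memory so NCCL sees a device buffer. */
    cudaMemcpyAsync(scratch, cpu_buf, len, cudaMemcpyHostToDevice, stream);
    st = ncclBroadcast(scratch, scratch, len, ncclChar, root, comm, stream);
    if (st != ncclSuccess) {
        return st;
    }
    /* Copy the result back out to the original CPU buffer. */
    cudaMemcpyAsync(cpu_buf, scratch, len, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    return ncclSuccess;
}

In the actual patch the staging buffer is fixed in size, so large payloads are processed in multiple rounds (see the memory-pool commit below).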

We are going to create additional scratch buffers for other collectives. So instead of a generic name such as "scratch_buf", rename the existing buffer to "barrier_scratch", so it is clearer what it is for.

Currently, UCC/NCCL does not support CPU buffers. Any communication with CPU buffers would need to go through the UCC/UCP backend, thus requiring support for multiple backends for coverage on a single platform. This patch adds support for bcast and allgatherv (only the p2p algorithm of allgatherv).

This patch adds support for using fixed-buffer-length memory pools for CPU collectives in NCCL. This reduces memory usage and bounds peak memory so that it scales with the number of concurrent CPU collectives rather than with the sizes of the collectives themselves.
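The lifecycle of a pooled scratch buffer can be sketched as follows. ucc_mpool_get/ucc_mpool_put are UCC's pool entry points (ucc_mpool_put appears later in this diff); the pool name cpu_coll_pool is illustrative:

    /* Each CPU collective borrows one fixed-size element from the pool and
     * returns it when done, so peak memory is roughly
     *   (element size) x (number of in-flight CPU collectives),
     * independent of the collectives' message sizes. */
    task->cpu_coll_scratch_buf = ucc_mpool_get(&team->cpu_coll_pool);
    if (!task->cpu_coll_scratch_buf) {
        return UCC_ERR_NO_MEMORY;
    }
    /* ... drive the collective in UCC_TL_NCCL_SCRATCH_BUF_SIZE-sized rounds ... */
    ucc_mpool_put(task->cpu_coll_scratch_buf);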
@swx-jenkins3

Can one of the admins verify this patch?

    if (cu_st != cudaSuccess) {
        return UCC_ERR_NO_MEMORY;
    }

@SeyedMir commented on Jul 20, 2022:

RDMA and IPC are not supported with managed memory (on most platforms), so there will be another on-device staging copy.

@vspetrov (Collaborator) left a comment:

Generic comment regarding code style:
Local variables should be grouped at the beginning of the function, with initialized variables first, followed by one empty line after the declarations.

Please run git-clang-format using our clang style file.

Use ucc_rank_t instead of int whenever declaring a rank/peer variable.
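For example, under that convention (illustrative function):

static ucc_status_t example_fn(ucc_tl_nccl_team_t *team)
{
    ucc_rank_t size = UCC_TL_TEAM_SIZE(team); /* initialized vars first */
    ucc_rank_t peer;                          /* uninitialized vars after */

    /* one empty line after the local declarations, then the body */
    for (peer = 0; peer < size; peer++) {
        /* per-peer work */
    }
    return UCC_OK;
}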

@@ -37,6 +37,8 @@
#include "utils/profile/ucc_profile_off.h"
#endif

#define UCC_TL_NCCL_SCRATCH_BUF_SIZE (1024 * 1024)
Collaborator:

Why not a parameter?
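For reference, UCC TLs usually expose such sizes through their config tables; a hypothetical entry (the scratch_buf_size field does not exist in this PR) might look like:

    /* Hypothetical entry for ucc_tl_nccl_lib_config_table; "1m" would keep
     * the current 1 MiB default while making it overridable at runtime. */
    {"SCRATCH_BUF_SIZE", "1m",
     "Size of the device scratch buffer used to stage CPU collectives",
     ucc_offsetof(ucc_tl_nccl_lib_config_t, scratch_buf_size),
     UCC_CONFIG_TYPE_MEMUNITS},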

    void      *cpu_sbuf;
    void      *staged_sbuf;
    uintptr_t  sbuf_len;

Collaborator:

Remove the empty lines.

    void      *first_peer_cpu_rbuf;
    uintptr_t  first_peer_len;

    int        last_peer_rank;
Collaborator:

int -> ucc_rank_t
(during initialization set to UCC_RANK_INVALID instead of -1)

    uintptr_t  last_peer_len;
} window_bounds_t;

#define MIN(a, b) (((a) < (b)) ? (a) : (b))
Collaborator:

We have ucc_min and ucc_max in ucc_math.h; please use those instead.
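i.e., something like (variable names illustrative):

    #include "utils/ucc_math.h"

    len = ucc_min(remaining, UCC_TL_NCCL_SCRATCH_BUF_SIZE);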


    /* sbuf setup */
    uintptr_t sbuf_start = 0;
    for (int peer = 0; peer < UCC_TL_TEAM_RANK(team); peer++) {
Collaborator:

Use ucc_rank_t for peer, and please move the declaration to the beginning of the function.

{
    ucc_coll_task_t    *coll_task = (ucc_coll_task_t *)data;
    ucc_tl_nccl_task_t *task      = ucc_derived_of(coll_task,
                                                   ucc_tl_nccl_task_t);
Collaborator:

window_bounds_t win; belongs with the variable declarations, and the empty line goes after them.

    for (int peer = 0; peer < UCC_TL_TEAM_SIZE(team); peer++) {
        total_bytes += ucc_coll_args_get_count(args, args->dst.info_v.counts,
                                               peer) * rdt_size;
    }
    int num_rounds = total_bytes / UCC_TL_NCCL_SCRATCH_BUF_SIZE +
Collaborator:

I would cache num_rounds on the task as well, to avoid repeated recalculation.

                     !!(total_bytes % UCC_TL_NCCL_SCRATCH_BUF_SIZE);

    if (task->cpu_coll_round == num_rounds) {
        ucc_mpool_put(task->cpu_coll_scratch_buf);
Collaborator:

Why isn't the mpool_put done in collective_finalize?
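A sketch of what that would look like, assuming the finalize callback takes the usual coll_task argument (names follow the snippets above where visible, otherwise illustrative):

    /* Illustrative: release the pooled scratch buffer in finalize rather
     * than in the progress/completion path. */
    static ucc_status_t ucc_tl_nccl_coll_finalize(ucc_coll_task_t *coll_task)
    {
        ucc_tl_nccl_task_t *task = ucc_derived_of(coll_task,
                                                  ucc_tl_nccl_task_t);

        if (task->cpu_coll_scratch_buf != NULL) {
            ucc_mpool_put(task->cpu_coll_scratch_buf);
            task->cpu_coll_scratch_buf = NULL;
        }
        return UCC_OK;
    }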

{
    cudaError_t cu_st;

    cu_st = cudaMallocManaged((void **)chunk_p, *size_p, cudaMemAttachGlobal);
Collaborator:

As we discussed during the last WG meeting, let's use explicitly host-mapped memory instead of managed memory.
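With explicitly host-mapped (pinned) memory, the chunk allocation above would become something like:

    /* Sketch: pinned host memory mapped into the device address space,
     * instead of cudaMallocManaged. */
    cu_st = cudaHostAlloc((void **)chunk_p, *size_p, cudaHostAllocMapped);
    if (cu_st != cudaSuccess) {
        return UCC_ERR_NO_MEMORY;
    }
    /* The device-side alias comes from cudaHostGetDevicePointer(); on UVA
     * platforms it is the same address as the host pointer. */

This sidesteps the managed-memory limitations noted earlier (RDMA/IPC support), since the allocation is ordinary pinned host memory that the device can also address.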
