Add some CPU collectives to the NCCL TL #570
base: master
Conversation
We are going to create additional scratch buffers for other collectives. So instead of a generic name such as "scratch_buf", rename the existing buffer to "barrier_scratch", so it's clearer what it's for.
Currently, UCC/NCCL does not support CPU buffers. Any communication with CPU buffers would need to go through the UCC/UCP backend, thus requiring support for multiple backends for coverage on a single platform. This patch adds support for bcast and allgatherv (only the p2p algorithm of allgatherv).
This patch adds support to use fixed buffer length memory pools for CPU collectives in NCCL. This allows us to reduce the amount of memory usage and restricts the peak memory used to scale with the number of concurrent CPU collectives, rather than the size of the collectives themselves.
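The fixed-buffer-length pool idea can be sketched as a simple freelist: chunks of a fixed size are reused across collectives, so peak memory grows with the number of concurrent CPU collectives, not with message sizes. This is an illustrative sketch only; the names (`pool_t`, `pool_get`, `pool_put`) are hypothetical and not the actual UCC mpool API, and error handling is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE (1024 * 1024) /* fixed staging-chunk size */

typedef struct chunk {
    struct chunk *next;
    char          data[CHUNK_SIZE];
} chunk_t;

typedef struct pool {
    chunk_t *free_list;  /* chunks available for reuse */
    size_t   allocated;  /* total chunks ever allocated */
} pool_t;

static void *pool_get(pool_t *p)
{
    chunk_t *c = p->free_list;

    if (c != NULL) {
        p->free_list = c->next;          /* reuse a released chunk */
    } else {
        c = (chunk_t *)malloc(sizeof(*c)); /* grow only when nothing is free */
        p->allocated++;
    }
    return c->data;
}

static void pool_put(pool_t *p, void *buf)
{
    chunk_t *c = (chunk_t *)((char *)buf - offsetof(chunk_t, data));

    c->next      = p->free_list;
    p->free_list = c;
}
```

Because released chunks are recycled before the pool grows, back-to-back collectives share the same staging memory instead of each allocating their own.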
Can one of the admins verify this patch?
if (cu_st != cudaSuccess) {
    return UCC_ERR_NO_MEMORY;
}
RDMA and IPC are not supported with managed memory (on most platforms), so there will be another on-device staging.
Generic comment regarding code style:
local variables should be grouped together at the beginning of the function, with initialized vars going first, then one extra empty line after the local vars.
plz run "git-clang-format" using our clang style file.
use ucc_rank_t instead of int whenever declaring a "rank/peer".
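A minimal sketch of these style points, with `ucc_rank_t` and `UCC_RANK_INVALID` typedef'd locally only to keep the example self-contained (in UCC they come from the public headers; the hypothetical `sum_counts_below` helper is for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the UCC definitions, assumed here for a runnable example. */
typedef uint32_t ucc_rank_t;
#define UCC_RANK_INVALID ((ucc_rank_t)-1)

static uint64_t sum_counts_below(ucc_rank_t rank, const uint64_t *counts)
{
    uint64_t   total = 0; /* initialized locals first */
    ucc_rank_t peer;      /* uninitialized locals after, then a blank line */

    for (peer = 0; peer < rank; peer++) {
        total += counts[peer];
    }
    return total;
}
```

Using `ucc_rank_t` for `peer` keeps rank arithmetic consistently unsigned, and `UCC_RANK_INVALID` replaces ad-hoc `-1` sentinels.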
@@ -37,6 +37,8 @@
#include "utils/profile/ucc_profile_off.h"
#endif

#define UCC_TL_NCCL_SCRATCH_BUF_SIZE (1024 * 1024)
why not a parameter?
void *cpu_sbuf;
void *staged_sbuf;
uintptr_t sbuf_len;
remove empty lines.
void *first_peer_cpu_rbuf;
uintptr_t first_peer_len;

int last_peer_rank;
int -> ucc_rank_t
(during initialization set to UCC_RANK_INVALID instead of -1)
uintptr_t last_peer_len;
} window_bounds_t;

#define MIN(a, b) (((a) < (b)) ? (a) : (b))
we have ucc_min and ucc_max in ucc_math.h
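The helpers the reviewer points to can be modeled as below; `ucc_min`/`ucc_max` live in `utils/ucc_math.h`, and the definitions here are plain-macro stand-ins that may differ in detail from upstream. The `clamp_to_scratch` helper is hypothetical, showing a typical use in this patch:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-ins for the macros in utils/ucc_math.h. */
#define ucc_min(_a, _b) (((_a) < (_b)) ? (_a) : (_b))
#define ucc_max(_a, _b) (((_a) > (_b)) ? (_a) : (_b))

/* Clamp the bytes handled in one round to the fixed scratch-buffer size. */
static size_t clamp_to_scratch(size_t remaining, size_t scratch_size)
{
    return ucc_min(remaining, scratch_size);
}
```

Reusing the shared helpers avoids redefining `MIN` per translation unit.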

/* sbuf setup */
uintptr_t sbuf_start = 0;
for (int peer = 0; peer < UCC_TL_TEAM_RANK(team); peer++) {
ucc_rank_t peer. plz move the declaration to the beginning of the function.
{
    ucc_coll_task_t *coll_task = (ucc_coll_task_t *)data;
    ucc_tl_nccl_task_t *task = ucc_derived_of(coll_task, ucc_tl_nccl_task_t);
window_bounds_t win; goes together with the variable declarations, and the empty line goes after them.
for (int peer = 0; peer < UCC_TL_TEAM_SIZE(team); peer++) {
    total_bytes += ucc_coll_args_get_count(args, args->dst.info_v.counts, peer) * rdt_size;
}
int num_rounds = total_bytes / UCC_TL_NCCL_SCRATCH_BUF_SIZE +
I would cache num_rounds on the task as well, to avoid repeated recalculation.
                 !!(total_bytes % UCC_TL_NCCL_SCRATCH_BUF_SIZE);

if (task->cpu_coll_round == num_rounds) {
    ucc_mpool_put(task->cpu_coll_scratch_buf);
why isn't mpool_put done in collective_finalize?
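The num_rounds expression above is a ceiling division; a self-contained sketch (macro value taken from the diff) shows an equivalent form, and, per the review comment, the result can be computed once and cached on the task rather than recomputed on every progress call:

```c
#include <assert.h>
#include <stdint.h>

#define UCC_TL_NCCL_SCRATCH_BUF_SIZE (1024 * 1024)

/* Rounds needed to move total_bytes through the fixed-size scratch buffer.
 * Equivalent to total / SIZE + !!(total % SIZE) as written in the diff. */
static uint64_t num_rounds(uint64_t total_bytes)
{
    return (total_bytes + UCC_TL_NCCL_SCRATCH_BUF_SIZE - 1) /
           UCC_TL_NCCL_SCRATCH_BUF_SIZE;
}
```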
{
    cudaError_t cu_st;

    cu_st = cudaMallocManaged((void **)chunk_p, *size_p, cudaMemAttachGlobal);
as we discussed during the last WG, let's use explicit host-mapped memory instead of managed.
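A host-mapped replacement for the `cudaMallocManaged` call above might look like the following sketch (this is a hedged illustration, not the final implementation; `chunk_p`/`size_p` are taken from the diff context, and some error paths are trimmed):

```c
/* Allocate pinned host memory that is also mapped into the device
 * address space (zero-copy), instead of managed memory. */
cudaError_t cu_st;
void *host_ptr, *dev_ptr;

cu_st = cudaHostAlloc(&host_ptr, *size_p, cudaHostAllocMapped);
if (cu_st != cudaSuccess) {
    return UCC_ERR_NO_MEMORY;
}
/* Device-visible alias of the same host allocation (flags must be 0). */
cu_st = cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);
if (cu_st != cudaSuccess) {
    cudaFreeHost(host_ptr);
    return UCC_ERR_NO_MEMORY;
}
*chunk_p = host_ptr;
```

Unlike managed memory, a host-mapped allocation has a fixed physical location, which sidesteps the RDMA/IPC limitations mentioned earlier in the review.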
What
Adds support for CPU collectives to NCCL TL (only covers a subset for now).
Why?
Currently UCC requires two TLs (NCCL and UCP) to fully support NVIDIA GPU platforms. This patch allows the NCCL TL to support CPU collectives too, so we no longer need to rely on two different TLs for full coverage.
How?
Stages CPU data through the GPU and makes NCCL calls on the GPU-resident data.
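For a bcast, the staging flow described above might look roughly like this sketch (hypothetical variable names; `staged_buf` is a device-resident scratch chunk, and error handling is omitted):

```c
/* Stage the CPU buffer onto the device, run the NCCL collective on the
 * device-resident copy, then copy the result back to the CPU buffer. */
cudaMemcpyAsync(staged_buf, cpu_buf, len, cudaMemcpyHostToDevice, stream);
ncclBroadcast(staged_buf, staged_buf, len, ncclInt8, root, comm, stream);
cudaMemcpyAsync(cpu_buf, staged_buf, len, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
```

With the fixed-size scratch pool, buffers larger than one chunk would loop this sequence over windows of the data rather than staging the whole message at once.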