UCT/CUDA_IPC: Fabric memory support #9982
Conversation
/azp run UCX PR
Azure Pipelines successfully started running 1 pipeline(s).
uct_cuda_ipc_rem_mpool_cache_t uct_cuda_ipc_rem_mpool_cache;

static ucs_status_t
uct_cuda_ipc_get_rem_mpool_cache(CUmemFabricHandle *fabric_handle,
can we reuse the existing IPC remote cache instead of adding a new one?
The IPC mechanism with memory pools involves two steps in the import phase: one expensive step that imports the memory pool itself, so the local GPU can access the locally mapped pool, and a low-overhead step that imports the remote pointer. The exporting process is expected to make several allocations and deallocations from the memory pool, and it packs the associated pointer and memory pool handle in the rkey. The importing process checks for the remote memory pool handle and performs a mapping only when one is not already present.
The existing IPC remote cache uses the handle as key, returns a VA as value, and directly uses offsets from the VA range for copy purposes. This doesn't translate to the remote memory pool cache, because here we need the imported pool and must combine it with the exported pointer to import the associated VA. Hence there isn't a one-to-one mapping between the two remote caches.
maybe we could still unite them, if the hash key will contain the handle type?
@yosefe Today, we use the exporter-side base pointer to look up the cache and perform the import operation if the base pointer isn't found. This works fine for legacy cuda-ipc because there is one independent backing for a given allocation; two allocations can never share a common backing. So performing the expensive import operation for each new unique base pointer is justified (we also need to check that the handle for a given base pointer matches the cached handle, to detect VA recycling).
For mempools, many sub-allocations share one common backing (the memory pool). The first time we encounter a base pointer corresponding to a sub-allocation from the mempool, we import the whole mempool and then use that imported mempool together with the exported data pointer to get the local translation of the sub-allocation's base pointer on the importer side. If another sub-allocation from the same mempool is encountered at the importing process and we use the existing IPC remote cache logic, we would re-import an already imported mempool (paying an unnecessary cost) to get the local translation. Instead, it is better to check whether a given mempool has already been imported. Therefore we add a new cache that takes the exported handle as input and returns the imported handle as output. This way, any new sub-allocation from an already imported memory pool avoids the import operation; only the data pointer is imported. Since the old cache takes a base pointer as input and the new cache takes the exporter handle as input, I don't see an immediate way of unifying them, as a common cache needs a common key type.
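A rough sketch of the two-level lookup described above, with illustrative names (the fixed-size table and the `import_mempool` stub stand in for khash and `cuMemPoolImportFromShareableHandle`; this is not the UCX code):

```c
#include <assert.h>
#include <string.h>

#define MPOOL_CACHE_SIZE   16
#define FABRIC_HANDLE_SIZE 64 /* CUmemFabricHandle is an opaque 64-byte blob */

typedef struct {
    unsigned char handle[FABRIC_HANDLE_SIZE]; /* exported handle (key)   */
    void         *imported_mpool;             /* imported pool (value)   */
    int           used;
} mpool_cache_entry_t;

static mpool_cache_entry_t mpool_cache[MPOOL_CACHE_SIZE];
static int                 import_count; /* counts expensive pool imports */

/* Stand-in for the expensive pool-import step */
static void *import_mempool(const unsigned char *handle)
{
    (void)handle;
    import_count++;
    return (void *)(size_t)import_count; /* dummy local pool object */
}

/* Return the imported pool for a handle, importing only on a miss */
static void *mpool_cache_get(const unsigned char *handle)
{
    int i;
    for (i = 0; i < MPOOL_CACHE_SIZE; i++) {
        if (mpool_cache[i].used &&
            !memcmp(mpool_cache[i].handle, handle, FABRIC_HANDLE_SIZE)) {
            return mpool_cache[i].imported_mpool; /* hit: no re-import */
        }
    }
    for (i = 0; i < MPOOL_CACHE_SIZE; i++) {
        if (!mpool_cache[i].used) {
            memcpy(mpool_cache[i].handle, handle, FABRIC_HANDLE_SIZE);
            mpool_cache[i].imported_mpool = import_mempool(handle);
            mpool_cache[i].used           = 1;
            return mpool_cache[i].imported_mpool;
        }
    }
    return NULL; /* cache full; real code would grow the hash */
}
```

With this shape, a second sub-allocation from the same pool hits the cache and skips the expensive import; only the cheap pointer import remains.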
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
status = UCT_CUDADRV_FUNC_LOG_ERR(cuPointerGetAttribute(
        &allowed_handle_types,
        CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES,
        (CUdeviceptr)addr));
if ((status != UCS_OK) ||
    (!(allowed_handle_types & fabric_type))) {
    status = UCS_ERR_NO_RESOURCE;
    goto err;
}
why is this needed?
Because the user may allocate a memory pool without fabric capabilities, in which case the cuda-ipc UCT cannot support fabric-handle-based IPC.
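The allowed-handle-types attribute is a bitmask, so the capability test is a bitwise-AND membership check. A minimal sketch (the enum values mirror CUDA's `CUmemAllocationHandleType` flags for illustration; names are not the UCX code):

```c
#include <assert.h>

typedef enum {
    HANDLE_TYPE_NONE     = 0x0,
    HANDLE_TYPE_POSIX_FD = 0x1, /* CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR */
    HANDLE_TYPE_FABRIC   = 0x8  /* CU_MEM_HANDLE_TYPE_FABRIC */
} handle_type_t;

/* 1 if fabric-based IPC can be used for the allocation, 0 otherwise.
 * Note the bitwise AND: OR-ing the mask with a nonzero flag would make
 * the negated test always pass, defeating the capability filter. */
static int supports_fabric_ipc(unsigned allowed_handle_types)
{
    return (allowed_handle_types & HANDLE_TYPE_FABRIC) != 0;
}
```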
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
key = ucs_malloc(sizeof(*key), "uct_cuda_ipc_lkey_t");
if (key == NULL) {
    return UCS_ERR_NO_MEMORY;
}

memset(key, 0, sizeof(*key));
consider using calloc instead of malloc, if it is really needed
Will do
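For reference, the reviewer's suggestion amounts to a single zero-initializing allocation instead of malloc followed by memset; in UCX this would be `ucs_calloc(1, sizeof(*key), "uct_cuda_ipc_lkey_t")`. A plain-C sketch with an illustrative stand-in struct:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for uct_cuda_ipc_lkey_t, not the real layout */
typedef struct {
    int    dev_num;
    size_t b_len;
} lkey_stub_t;

static lkey_stub_t *alloc_lkey(void)
{
    /* calloc zeroes the memory, replacing malloc + memset */
    return (lkey_stub_t *)calloc(1, sizeof(lkey_stub_t));
}
```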
src/uct/cuda/cuda_ipc/cuda_ipc_md.h
Outdated
CUmemFabricHandle vmm;     /* VMM export handle */
CUmemFabricHandle mempool; /* MallocAsync handle */
do we really need two fields of the same type? Can we use some common name for them?
Can have a common variable fabric_handle. Will change this.
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
    goto err;
}

if ((status == UCS_OK) && (mempool != 0)) {
could mempool be 0 even if status is OK?
Yes. mempool is 0 when the buffer points to memory that wasn't allocated with cudaMallocAsync. Strictly speaking we don't need this check, because all such memory should already be filtered out by the IPC capability check; I left it in place in case some corner case is missed.
so maybe convert it to assertion to show this isn't expected?
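A minimal sketch of the difference being discussed, with hypothetical names (not the UCX code): the defensive branch silently tolerates mempool == 0, while the assertion makes the "already filtered out" assumption explicit in debug builds.

```c
#include <assert.h>

/* Defensive variant: the unexpected case is handled at runtime */
static int pack_mempool_checked(unsigned long long mempool)
{
    if (mempool == 0) {
        return 0; /* the corner-case path the reviewer suggests removing */
    }
    return 1;
}

/* Assertion variant: documents that mempool == 0 cannot happen here */
static int pack_mempool_asserted(unsigned long long mempool)
{
    /* non-mallocAsync memory should be rejected by the capability check */
    assert(mempool != 0);
    return 1;
}
```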
a couple of minor styling comments
src/uct/cuda/cuda_ipc/cuda_ipc_md.h
Outdated
/**
 * @brief cuda ipc region registered for exposure
 */
typedef struct {
-   CUipcMemHandle           ph; /* Memory handle of GPU memory */
+   uct_cuda_ipc_md_handle_t ph; /* Memory handle of GPU memory */
minor: imo better to either keep comments alignment or remove extra spaces here
src/uct/cuda/cuda_ipc/cuda_ipc_md.h
Outdated
CUmemPoolPtrExportData ptr;
uct_cuda_ipc_key_handle_t handle_type;
maybe worth aligning these two fields, because previous fields (lines 66-67) are aligned
Suggested change:
-CUmemPoolPtrExportData ptr;
-uct_cuda_ipc_key_handle_t handle_type;
+CUmemPoolPtrExportData    ptr;
+uct_cuda_ipc_key_handle_t handle_type;
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
#if HAVE_CUDA_FABRIC
    /* cuda_ipc can handle VMM, mallocasync, and legacy pinned device so need to
     * pack appropriate handle */
    legacy_handle = &key->ph.handle.legacy;
minor: maybe move it to line 78? i.e. initialize it only if legacy_capable
@@ -79,8 +79,10 @@ uct_cuda_ipc_iface_is_reachable_v2(const uct_iface_h tl_iface,
                                   const uct_iface_is_reachable_params_t *params)
{
    return uct_iface_is_reachable_params_addrs_valid(params) &&
#if !HAVE_CUDA_FABRIC
maybe we can introduce an env var which would manage MNNVL enablement (i.e. make devices on different nodes unreachable if it is disabled)
it seems that after this change we will create cuda_ipc endpoints on ALL UCP endpoints, including systems without MNNVL, just because we compiled with a recent enough CUDA version. This will lead to extra resource consumption, an increase in rkey size, attempts to unpack cuda_ipc keys even on x86 setups, etc. IMO we need to at least check that MNNVL is supported on the system, and ideally check whether the remote peer is really reachable via MNNVL.
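One way the env-var gating discussed here could look, as a hedged sketch: a tri-state knob combined with a runtime support check. The variable name `UCX_CUDA_IPC_ENABLE_MNNVL` and the yes/no/try semantics are assumptions for illustration (the PR adds the real knob in a follow-up).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { MNNVL_NO, MNNVL_YES, MNNVL_TRY } mnnvl_mode_t;

/* Parse a hypothetical tri-state env var; "try" (or unset) defers to the
 * runtime support check, "yes"/"no" force the decision. */
static mnnvl_mode_t mnnvl_mode_from_env(void)
{
    const char *val = getenv("UCX_CUDA_IPC_ENABLE_MNNVL");
    if ((val == NULL) || !strcmp(val, "try")) {
        return MNNVL_TRY;
    }
    if (!strcmp(val, "y") || !strcmp(val, "yes")) {
        return MNNVL_YES;
    }
    return MNNVL_NO;
}

/* Cross-node peers are considered reachable only when MNNVL is not
 * disabled and the fabric is actually supported on this system
 * (system_supported stubs the runtime capability query). */
static int mnnvl_enabled(mnnvl_mode_t mode, int system_supported)
{
    return (mode != MNNVL_NO) && system_supported;
}
```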
return uct_cuda_ipc_open_memhandle_legacy(key->ph.handle.legacy,
                                          mapped_addr);
minor: suggested change (align the continuation line):
return uct_cuda_ipc_open_memhandle_legacy(key->ph.handle.legacy,
                                          mapped_addr);
        *imported_mpool = kh_val(&uct_cuda_ipc_rem_mpool_cache.hash, khiter);
        *key_present    = 1;
    } else {
        ucs_error("unable to use cuda_ipc remote_cache hash");
Suggested change:
-        ucs_error("unable to use cuda_ipc remote_cache hash");
+        ucs_error("unable to use cuda_ipc remote_cache hash: %d", khret);
@Akshay-Venkatesh also, as we discussed, let's remove the caching from this PR.
@yosefe I've removed the cache now. Please let me know if the changes look good. As discussed offline, I'll create a follow-up PR to add the mempool cache logic back and MNNVL enablement through a user env var once this PR is approved.
@yosefe As checks are passing and there aren't more comments, let me know if I can squash the commits. Thanks
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
        key->ph.handle_type = UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM;
        ucs_trace("packed vmm fabric handle for %p", addr);
        goto common_path;
    } else {
so we assume that if cuMemRetainAllocationHandle failed, it's a mempool?
can we check it explicitly by some pointer query, or as part of cuPointerGetAttribute?
in any case we need to add documentation in the code that describes the logic (at a high level)
We conclude it is a mempool if it's not legacy IPC memory (i.e. cudaMalloc) and it's not cuMemCreate/VMM memory (because cuMemRetainAllocationHandle failed).
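The classification-by-elimination described above can be sketched as a small decision function. The enum and the boolean inputs are illustrative stand-ins for the real CUDA queries (legacy IPC check, `cuMemRetainAllocationHandle` result); this is not the UCX code:

```c
#include <assert.h>

typedef enum {
    KEY_HANDLE_TYPE_LEGACY,  /* cudaMalloc: classic CUipcMemHandle path */
    KEY_HANDLE_TYPE_VMM,     /* cuMemCreate-backed VMM memory           */
    KEY_HANDLE_TYPE_MEMPOOL  /* cudaMallocAsync mempool, by elimination */
} key_handle_type_t;

static key_handle_type_t
classify_allocation(int is_legacy, int retain_alloc_handle_ok)
{
    if (is_legacy) {
        return KEY_HANDLE_TYPE_LEGACY;
    }
    if (retain_alloc_handle_ok) {
        /* cuMemRetainAllocationHandle succeeded: VMM allocation */
        return KEY_HANDLE_TYPE_VMM;
    }
    /* neither legacy nor VMM: assume mempool (the logic under review) */
    return KEY_HANDLE_TYPE_MEMPOOL;
}
```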
@yosefe the latest changes have
uct_device_type_t dev_type = UCT_DEVICE_TYPE_SHM;
#if HAVE_CUDA_FABRIC
uct_cuda_ipc_md_t *ipc_md = ucs_derived_of(md, uct_cuda_ipc_md_t);
ipc_md -> md
md is already declared.
per convention in other places, let's rename the parameter from md to uct_md (and ipc_md to md)
@@ -94,7 +94,7 @@ static int uct_cuda_ipc_iface_is_coherent()
        return 0;
    }

-   return coherent;
+   return (coherent && (md->enable_mnnvl != UCS_NO));
pls remove the external ( )
uct_cuda_ipc_open_memhandle_mempool(uct_cuda_ipc_rkey_t *key,
                                    CUdeviceptr *mapped_addr)
{
    CUmemAccessDesc access_desc = {};
(whitespace-only suggestion on the access_desc declaration)
    goto release_va_range;
}

cuCtxGetDevice(&access_desc.location.id);
we need to check error code here
    return status;
}

cuCtxGetDevice(&access_desc.location.id);
we need to check the error code here; also maybe it is worth creating a small function that initializes access_desc? It could be reused by uct_cuda_ipc_open_memhandle_vmm.
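A sketch of the small helper suggested here. The CUDA types are stubbed with minimal stand-ins so the sketch is self-contained; the real code would use `CUmemAccessDesc` from cuda.h, call `cuCtxGetDevice` for the device id, and return a `ucs_status_t`:

```c
#include <assert.h>

/* Minimal stand-ins for the CUDA driver types (illustrative only) */
typedef enum { CU_MEM_LOCATION_TYPE_DEVICE = 1 } CUmemLocationType;
typedef enum { CU_MEM_ACCESS_FLAGS_PROT_READWRITE = 3 } CUmemAccess_flags;
typedef struct { CUmemLocationType type; int id; } CUmemLocation;
typedef struct { CUmemLocation location; CUmemAccess_flags flags; } CUmemAccessDesc;

/* Hypothetical helper: fill in a read-write access descriptor for the
 * given device, propagating failure instead of ignoring it (device_id < 0
 * stands in for a failed cuCtxGetDevice call). */
static int cuda_ipc_init_access_desc(CUmemAccessDesc *desc, int device_id)
{
    if (device_id < 0) {
        return -1; /* real code: return a UCS error status */
    }
    desc->location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc->location.id   = device_id;
    desc->flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    return 0;
}
```

Both `uct_cuda_ipc_open_memhandle_mempool` and `uct_cuda_ipc_open_memhandle_vmm` could then share this one checked initialization path.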
@@ -134,8 +269,7 @@ static void uct_cuda_ipc_cache_invalidate_regions(uct_cuda_ipc_cache_t *cache,
        ucs_error("failed to remove address:%p from cache (%s)",
                  (void *)region->key.d_bptr, ucs_status_string(status));
    }
-   UCT_CUDADRV_FUNC_LOG_ERR(
-       cuIpcCloseMemHandle((CUdeviceptr)region->mapped_addr));
+   uct_cuda_ipc_close_memhandle(region);
why not check the return code?
addressed comments
@yosefe shall I squash commits?
@yosefe and @brminich I created a branch #10036 by applying https://github.com/openucx/ucx/pull/9982.diff on top of the current main branch. Squashing the commits in this branch proved painful after the merges in between. Hope it's ok to ignore this and just merge #10036 into the main branch.
@Akshay-Venkatesh it's better to squash and rebase in the current PR. Moving the changes to a new PR makes it harder to review the changes and track history.
force-pushed from a59aafa to cbf2539
@brminich provided the fix for https://redmine.mellanox.com/issues/4005305 and IMB/OMB tests are passing now. I've pushed the commit. Let me know if the changes look good and if a further squash is needed. FYI, some fixes were also needed in Open MPI's collective and memory-detection layers; I've started PRs for those, and those fixes are needed for the OMB/IMB collective tests to pass.
👍 besides a minor comment
status = uct_cuda_ipc_close_memhandle(region);
if (status != UCS_OK) {
    ucs_error("failed to close memhandle for base addr:%p (%s)",
minor: maybe also print region->key.ph.handle_type
int reachable;

reachable = uct_iface_is_reachable_params_addrs_valid(params) &&
            (getpid() != *(pid_t*)params->iface_addr) &&
            uct_iface_scope_is_reachable(tl_iface, params);

#if HAVE_CUDA_FABRIC
if (uct_cuda_ipc_iface_is_mnnvl_supported(md)) {
    return reachable;
}
#endif

return reachable &&
       (ucs_get_system_id() == *((const uint64_t*)params->device_addr));
Suggested change:
-    int reachable;
-
-    reachable = uct_iface_is_reachable_params_addrs_valid(params) &&
-                (getpid() != *(pid_t*)params->iface_addr) &&
-                uct_iface_scope_is_reachable(tl_iface, params);
-
-#if HAVE_CUDA_FABRIC
-    if (uct_cuda_ipc_iface_is_mnnvl_supported(md)) {
-        return reachable;
-    }
-#endif
-
-    return reachable &&
-           (ucs_get_system_id() == *((const uint64_t*)params->device_addr));
+    if (!uct_iface_is_reachable_params_addrs_valid(params) ||
+        (getpid() == *(pid_t*)params->iface_addr) ||
+        !uct_iface_scope_is_reachable(tl_iface, params)) {
+        return 0;
+    }
+
+#if HAVE_CUDA_FABRIC
+    return uct_cuda_ipc_iface_is_mnnvl_supported(md);
+#else
+    return ucs_get_system_id() == *((const uint64_t*)params->device_addr);
+#endif
this does not seem to be correct. If we have HAVE_CUDA_FABRIC and MNNVL is not supported (e.g. disabled by an env var), this function will always return 0
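A sketch of the corrected logic this comment asks for: common checks first, then MNNVL gating that still falls back to the same-system check when the fabric is compiled in but unavailable at runtime. The predicates are stubbed as struct fields for illustration; names map to the real calls in the comments.

```c
#include <assert.h>

typedef struct {
    int addrs_valid;     /* uct_iface_is_reachable_params_addrs_valid() */
    int same_pid;        /* getpid() == remote pid (self-connection)    */
    int scope_reachable; /* uct_iface_scope_is_reachable()              */
    int mnnvl_supported; /* uct_cuda_ipc_iface_is_mnnvl_supported()     */
    int same_system;     /* ucs_get_system_id() == remote system id     */
} reach_params_t;

static int cuda_ipc_is_reachable(const reach_params_t *p)
{
    if (!p->addrs_valid || p->same_pid || !p->scope_reachable) {
        return 0;
    }
    if (p->mnnvl_supported) {
        return 1; /* cross-node peers reachable over the NVLink fabric */
    }
    /* no MNNVL: fall back to single-node IPC, peers must share a system */
    return p->same_system;
}
```

Unlike the suggestion above, a peer on the same system stays reachable even when MNNVL is disabled.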
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Outdated
@@ -320,16 +402,21 @@ uct_cuda_ipc_md_open(uct_component_t *component, const char *md_name,
        .mem_attach         = ucs_empty_function_return_unsupported,
        .detect_memory_type = ucs_empty_function_return_unsupported
    };
    uct_md_t *md;
remove the blank line
error looks relevant
@brminich that
@yosefe gentle ping to check if latest commits look good |
return reachable &&
       (ucs_get_system_id() == *((const uint64_t*)params->device_addr));
/* Not fabric capable or multi-node nvlink disabled, so iface has to be on
add a blank line before
@@ -281,8 +281,14 @@ static void uct_cuda_ipc_cache_invalidate_regions(uct_cuda_ipc_cache_t *cache,

    status = uct_cuda_ipc_close_memhandle(region);
    if (status != UCS_OK) {
#if HAVE_CUDA_FABRIC
        ucs_error("failed to close memhandle for base addr:%p type:%d (%s)",
can we put only the access to region->key.ph.handle_type in the ifdef?
force-pushed from 0385a85 to a3b7625
What/Why?
Second part of #9787 and follow-up to #9867. This implements IPC support for fabric-handle-associated memory using the newer import/export API and caches import operations similarly to legacy IPC handles.
Note: need to solve https://redmine.mellanox.com/issues/4005305 before merging