
UCT/CUDA_IPC: Fabric memory support #9982

Merged

Conversation

@Akshay-Venkatesh (Contributor) commented Jun 28, 2024

What/Why?

Second part of #9787 and follow-up to #9867. This implements IPC support for fabric-handle-associated memory using the newer import/export API, and caches import operations similarly to legacy IPC handles.

Note: need to solve https://redmine.mellanox.com/issues/4005305 before merging
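For background, a minimal sketch of the exporter side of the newer import/export API mentioned above, assuming a fabric-capable mempool; the function name and error handling are illustrative, not the PR's actual code:

#include <cuda.h>

/* Hedged sketch: export a fabric handle for the pool (once) plus
 * per-allocation pointer export data, to be packed into the rkey. */
static CUresult export_mempool_alloc(CUdeviceptr ptr, CUmemoryPool mpool,
                                     CUmemFabricHandle *fh_out,
                                     CUmemPoolPtrExportData *pdata_out)
{
    CUresult res;

    /* one-time, per-pool: export the pool as a fabric shareable handle */
    res = cuMemPoolExportToShareableHandle(fh_out, mpool,
                                           CU_MEM_HANDLE_TYPE_FABRIC, 0);
    if (res != CUDA_SUCCESS) {
        return res;
    }

    /* per-allocation: export the pointer itself (low overhead) */
    return cuMemPoolExportPointer(pdata_out, ptr);
}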

@yosefe (Contributor) commented Jul 4, 2024

/azp run UCX PR


Azure Pipelines successfully started running 1 pipeline(s).

uct_cuda_ipc_rem_mpool_cache_t uct_cuda_ipc_rem_mpool_cache;

static ucs_status_t
uct_cuda_ipc_get_rem_mpool_cache(CUmemFabricHandle *fabric_handle,
Contributor

can we reuse the existing IPC remote cache instead of adding a new one?

Contributor Author

The IPC mechanism with memory pools involves two steps in the import phase: one expensive step to import the actual memory pool, so that the local GPU can access the locally mapped memory pool, and another low-overhead step to import the remote pointer. The exporting process is expected to make several allocations and deallocations from the memory pool, and it packs the associated pointer and memory pool handle in the rkey. The importing process checks for the remote memory pool handle and performs a mapping only when it is not present.

The existing IPC remote cache uses the handle as key, returns a VA as value, and directly uses offsets from the VA range for copy purposes. This doesn't translate to a remote memory pool cache, because here we need the imported memory pool and must use it together with the exported pointer to import the associated VA. Hence there isn't a one-to-one mapping between the two remote caches.
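A minimal sketch of the two-step import described above, assuming the rkey carries the pool's CUmemFabricHandle and the per-pointer CUmemPoolPtrExportData; the function name and caller-side caching of mpool_p are illustrative:

#include <cuda.h>

/* Hedged sketch of the importer side: step 1 (expensive) maps the remote
 * mempool once; step 2 (cheap) resolves each exported pointer within it. */
static CUresult import_mempool_ptr(CUmemFabricHandle *fh,
                                   CUmemPoolPtrExportData *pdata,
                                   CUmemoryPool *mpool_p, /* cached by caller */
                                   CUdeviceptr *mapped_addr)
{
    CUresult res;

    if (*mpool_p == NULL) {
        /* step 1: import the whole pool (once per remote memory pool) */
        res = cuMemPoolImportFromShareableHandle(mpool_p, fh,
                                                 CU_MEM_HANDLE_TYPE_FABRIC, 0);
        if (res != CUDA_SUCCESS) {
            return res;
        }
    }

    /* step 2: translate the exported pointer into a locally usable VA */
    return cuMemPoolImportPointer(mapped_addr, *mpool_p, pdata);
}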

Contributor

maybe we could still unite them, if the hash key contains the handle type?

Contributor Author

@yosefe Today, we use the exporter-side base pointer to look up the cache, and perform the import operation if the base pointer isn't found. This works fine for legacy cuda-ipc because there is one independent backing for a given allocation; two allocations can never have a common backing. So performing the expensive import operation for each new unique base pointer is justified (we also need to check that the handle for a given base pointer matches the cached handle, to detect VA recycling).

For mempools, many sub-allocations share the same common backing (the memory pool). The first time we encounter a base pointer corresponding to a sub-allocation from the mempool, we import the whole mempool, and then use that imported mempool together with the exported data pointer to get a local translation of the sub-allocation's base pointer on the importer side. If another sub-allocation from the same mempool is encountered at the importing process and we use the existing IPC remote cache logic, we will re-import an already imported mempool (paying an unnecessary cost) to get the local translation. Instead, it is better to check whether a given mempool has already been imported. Therefore we have a new cache that takes the exported handle as input and returns the imported handle as output. This way, any new sub-allocation from an already imported memory pool avoids the import operation, and only the data pointer is imported. Since the old cache takes a base pointer as input and the new cache takes an exporter handle as input, I don't see an immediate way of unifying them, as we'd need a common key type for a common cache.
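A sketch of such a cache using klib-style khash keyed by a digest of the exported fabric handle; the digest, names, and absence of locking are assumptions for illustration, not the PR's actual implementation:

#include <stdint.h>
#include <string.h>
#include <cuda.h>
#include "khash.h" /* klib; UCX carries its own copy under src/ucs/datastruct */

KHASH_MAP_INIT_INT64(rem_mpool, CUmemoryPool)

static khash_t(rem_mpool) rem_mpool_hash; /* zero-initialized is valid */

/* Hypothetical digest: fold the opaque 64-byte fabric handle into a 64-bit
 * key. A real implementation must also compare full keys on collision. */
static uint64_t fabric_handle_key(const CUmemFabricHandle *fh)
{
    uint64_t words[sizeof(*fh) / sizeof(uint64_t)];
    uint64_t key = 0;
    size_t i;

    memcpy(words, fh, sizeof(words));
    for (i = 0; i < sizeof(words) / sizeof(words[0]); i++) {
        key ^= words[i];
    }
    return key;
}

static CUresult rem_mpool_lookup_or_import(CUmemFabricHandle *fh,
                                           CUmemoryPool *mpool_p)
{
    uint64_t key    = fabric_handle_key(fh);
    khiter_t khiter = kh_get(rem_mpool, &rem_mpool_hash, key);
    CUresult res;
    int khret;

    if (khiter != kh_end(&rem_mpool_hash)) {
        *mpool_p = kh_val(&rem_mpool_hash, khiter); /* hit: skip the import */
        return CUDA_SUCCESS;
    }

    /* miss: pay the expensive import once per remote pool */
    res = cuMemPoolImportFromShareableHandle(mpool_p, fh,
                                             CU_MEM_HANDLE_TYPE_FABRIC, 0);
    if (res == CUDA_SUCCESS) {
        khiter = kh_put(rem_mpool, &rem_mpool_hash, key, &khret);
        kh_val(&rem_mpool_hash, khiter) = *mpool_p;
    }
    return res;
}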

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)
Comment on lines 87 to 95
    status = UCT_CUDADRV_FUNC_LOG_ERR(cuPointerGetAttribute(
            &allowed_handle_types,
            CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES,
            (CUdeviceptr)addr));
    if ((status != UCS_OK) || !(allowed_handle_types & fabric_type)) {
        status = UCS_ERR_NO_RESOURCE;
        goto err;
    }
Contributor

why is this needed?

Contributor Author

Because the user may allocate a memory pool without fabric capabilities, in which case the cuda-ipc UCT cannot support fabric-handle-based IPC.
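For reference, a user-side sketch of allocating a fabric-capable mempool (what the check above requires); this is standard CUDA driver API usage, not code from the PR:

#include <cuda.h>

/* Hedged sketch: create a mempool whose allocations can be exported as
 * fabric handles; pools created without this flag fail the check above. */
static CUresult create_fabric_mempool(int dev, CUmemoryPool *mpool_p)
{
    CUmemPoolProps props = {0};

    props.allocType     = CU_MEM_ALLOCATION_TYPE_PINNED;
    props.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    props.location.id   = dev;
    props.handleTypes   = CU_MEM_HANDLE_TYPE_FABRIC;

    return cuMemPoolCreate(mpool_p, &props);
}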

src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)

    key = ucs_malloc(sizeof(*key), "uct_cuda_ipc_lkey_t");
    if (key == NULL) {
        return UCS_ERR_NO_MEMORY;
    }

    memset(key, 0, sizeof(*key));
Contributor

consider using calloc instead of malloc if it is really needed

Contributor Author

Will do
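A sketch of the suggested change, assuming UCX's tracked ucs_calloc with the usual (count, size, name) signature:

    /* zero-initializing allocation replaces the malloc + memset pair */
    key = ucs_calloc(1, sizeof(*key), "uct_cuda_ipc_lkey_t");
    if (key == NULL) {
        return UCS_ERR_NO_MEMORY;
    }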

Comment on lines 67 to 68
    CUmemFabricHandle vmm;     /* VMM export handle */
    CUmemFabricHandle mempool; /* MallocAsync handle */
Contributor

do we really need two fields of the same type? Can we use some common name for them?

Contributor Author

Can have a common variable fabric_handle. Will change this.
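A sketch of the unified field; the surrounding struct layout (including the handle_type discriminator seen later in this thread) is assumed:

typedef struct uct_cuda_ipc_md_handle {
    uct_cuda_ipc_key_handle_t handle_type;   /* legacy / VMM / mempool */
    union {
        CUipcMemHandle        legacy;        /* legacy cudaMalloc handle */
        CUmemFabricHandle     fabric_handle; /* shared by VMM and mempool */
    } handle;
} uct_cuda_ipc_md_handle_t;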

@Akshay-Venkatesh (Contributor Author)

@yosefe / @brminich looks like tests are failing for reasons outside the PR logic. Would you mind restarting the failing tests?

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)

@Akshay-Venkatesh (Contributor Author)

@yosefe and @brminich let me know if the changes look good and if you have other comments about the PR.

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
goto err;
}

if ((status == UCS_OK) && (mempool != 0)) {
Contributor

could mempool be 0 even if status is OK?

Contributor Author

Yes. mempool is 0 when the buffer points to any memory that wasn't allocated using cudaMallocAsync. Strictly speaking we don't need this check, because all such memory should be filtered out by the IPC capability check. I decided to leave it in place in case some corner case is missed.

Contributor

so maybe convert it to an assertion to show this isn't expected?
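A sketch of the suggested assertion, using UCX's ucs_assertv (active in debug builds) in place of the defensive branch; the variable names are assumed:

    /* not expected: the IPC capability check should already have filtered
     * out memory that was not allocated from a mempool */
    ucs_assertv(mempool != 0, "addr %p is not backed by a mempool", addr);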

src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)
@brminich (Contributor) left a comment

a couple of minor styling comments

/**
 * @brief cuda ipc region registered for exposure
 */
typedef struct {
    CUipcMemHandle ph;           /* Memory handle of GPU memory */
    uct_cuda_ipc_md_handle_t ph; /* Memory handle of GPU memory */
Contributor

minor: imo better to either keep comments alignment or remove extra spaces here

src/uct/cuda/cuda_ipc/cuda_ipc_md.h (resolved)
Comment on lines 69 to 70
CUmemPoolPtrExportData ptr;
uct_cuda_ipc_key_handle_t handle_type;
Contributor

maybe worth aligning these two fields, because previous fields (lines 66-67) are aligned

Suggested change
CUmemPoolPtrExportData ptr;
uct_cuda_ipc_key_handle_t handle_type;
CUmemPoolPtrExportData    ptr;
uct_cuda_ipc_key_handle_t handle_type;

#if HAVE_CUDA_FABRIC
    /* cuda_ipc can handle VMM, mallocAsync, and legacy pinned device memory,
     * so we need to pack the appropriate handle */
    legacy_handle = &key->ph.handle.legacy;
Contributor

minor: maybe move it to line 78? i.e. initialize it only if legacy_capable

src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)
@@ -79,8 +79,10 @@ uct_cuda_ipc_iface_is_reachable_v2(const uct_iface_h tl_iface,
const uct_iface_is_reachable_params_t *params)
{
return uct_iface_is_reachable_params_addrs_valid(params) &&
#if !HAVE_CUDA_FABRIC
Contributor

maybe we can introduce an env var which would manage MNNVL enablement (i.e. make devices on different nodes unreachable if it is disabled)
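A sketch of such a variable as a UCX ternary config entry; the field and table names are assumptions (the PR later adds an enable_mnnvl variable along these lines):

/* in the MD config table; exposed to users as UCX_CUDA_IPC_ENABLE_MNNVL */
{"ENABLE_MNNVL", "try",
 "Enable multi-node NVLINK (MNNVL) support: y - on, n - off, try - detect",
 ucs_offsetof(uct_cuda_ipc_md_config_t, enable_mnnvl),
 UCS_CONFIG_TYPE_TERNARY},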

Contributor

it seems that after this change, we will create cuda_ipc endpoints on ALL UCP endpoints, including systems without MNNVL, just because we compiled with a recent enough CUDA version. This will lead to extra resource consumption, an increase in rkey size, trying to unpack cuda_ipc keys even on x86 setups, etc. IMO we need to at least check that MNNVL is supported on the system, and ideally check whether the remote peer is really reachable by MNNVL.

Comment on lines 294 to 295
return uct_cuda_ipc_open_memhandle_legacy(key->ph.handle.legacy,
mapped_addr);
Contributor

minor:

Suggested change
return uct_cuda_ipc_open_memhandle_legacy(key->ph.handle.legacy,
mapped_addr);
return uct_cuda_ipc_open_memhandle_legacy(key->ph.handle.legacy,
                                          mapped_addr);

*imported_mpool = kh_val(&uct_cuda_ipc_rem_mpool_cache.hash, khiter);
*key_present = 1;
} else {
ucs_error("unable to use cuda_ipc remote_cache hash");
Contributor

Suggested change
ucs_error("unable to use cuda_ipc remote_cache hash");
ucs_error("unable to use cuda_ipc remote_cache hash: %d", khret);

src/uct/cuda/cuda_ipc/cuda_ipc_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_md.h Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_md.h Outdated Show resolved Hide resolved
@yosefe (Contributor) commented Jul 14, 2024

@Akshay-Venkatesh also as we discussed let's remove the caching from this PR

@Akshay-Venkatesh (Contributor Author) commented Jul 15, 2024

@Akshay-Venkatesh also as we discussed let's remove the caching from this PR

@yosefe I've removed the cache now. Please let me know if the changes look good. As discussed offline, I'll create a follow-up PR to add the mempool cache logic back, plus MNNVL enablement through a user env var, once this PR is approved.

@Akshay-Venkatesh (Contributor Author)

@yosefe As checks are passing and there aren't more comments, let me know if I can squash the commits. Thanks

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (resolved)
key->ph.handle_type = UCT_CUDA_IPC_KEY_HANDLE_TYPE_VMM;
ucs_trace("packed vmm fabric handle for %p", addr);
goto common_path;
} else {
Contributor

so we assume that if cuMemRetainAllocationHandle failed, it's a mempool?
can we check it explicitly by some pointer query or as part of cuPointerGetAttribute?
in any case we need to add documentation in the code that describes the logic (at a high level)

Contributor Author

We conclude it's a mempool if it's not legacy IPC memory (viz. cudaMalloc), and it's not cuMemCreate/VMM memory (because cuMemRetainAllocationHandle failed).
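A sketch of that elimination order; the attribute and function names are real CUDA driver API, while the surrounding flow (and the `addr` variable) is paraphrased from the discussion, not copied from the PR:

/* Hedged sketch: classify an allocation for IPC handle packing */
unsigned legacy_capable = 0;
CUmemGenericAllocationHandle alloc_handle;

cuPointerGetAttribute(&legacy_capable,
                      CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE,
                      (CUdeviceptr)addr);
if (legacy_capable) {
    /* cudaMalloc memory: pack a legacy CUipcMemHandle */
} else if (cuMemRetainAllocationHandle(&alloc_handle, addr) == CUDA_SUCCESS) {
    /* cuMemCreate/VMM memory: export a fabric handle for this allocation */
    cuMemRelease(alloc_handle);
} else {
    /* neither legacy nor VMM: conclude mempool (cudaMallocAsync) memory */
}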

src/uct/cuda/cuda_ipc/cuda_ipc_md.h (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.h (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.h (resolved)
@Akshay-Venkatesh (Contributor Author)

@yosefe the latest changes have the enable_mnnvl env var added back and the necessary reachability checks. Let me know if the changes look good. Thanks

src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c (resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c (outdated; resolved)
uct_device_type_t dev_type = UCT_DEVICE_TYPE_SHM;
#if HAVE_CUDA_FABRIC
uct_cuda_ipc_md_t *ipc_md = ucs_derived_of(md, uct_cuda_ipc_md_t);
Contributor

ipc_md -> md

Contributor Author

md is already declared.

Contributor

per convention in other places, let's rename the parameter from md to uct_md (and ipc_md to md)

src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (outdated; resolved)
src/uct/cuda/cuda_ipc/cuda_ipc_md.h (outdated; resolved)
@@ -94,7 +94,7 @@ static int uct_cuda_ipc_iface_is_coherent()
return 0;
}

return coherent;
return (coherent && (md->enable_mnnvl != UCS_NO));
Contributor

pls remove the external ( )


uct_cuda_ipc_open_memhandle_mempool(uct_cuda_ipc_rkey_t *key,
                                    CUdeviceptr *mapped_addr)
{
    CUmemAccessDesc access_desc = {};
Contributor

Suggested change
CUmemAccessDesc access_desc = {};
CUmemAccessDesc access_desc = {};

goto release_va_range;
}

cuCtxGetDevice(&access_desc.location.id);
Contributor

we need to check error code here

return status;
}

cuCtxGetDevice(&access_desc.location.id);
Contributor

we need to check the error code here
also maybe it is worth creating a small function initializing access_desc? It can be reused by uct_cuda_ipc_open_memhandle_vmm
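A sketch of such a helper, assuming the UCT_CUDADRV_FUNC_LOG_ERR wrapper seen elsewhere in this file and read-write access flags; the function name is illustrative:

/* Hedged sketch: shared init for the access descriptor, with the error
 * check the reviewer asked for; reusable by the mempool and VMM paths */
static ucs_status_t
uct_cuda_ipc_init_access_desc(CUmemAccessDesc *access_desc)
{
    access_desc->location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access_desc->flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    return UCT_CUDADRV_FUNC_LOG_ERR(
            cuCtxGetDevice(&access_desc->location.id));
}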

@@ -134,8 +269,7 @@ static void uct_cuda_ipc_cache_invalidate_regions(uct_cuda_ipc_cache_t *cache,
ucs_error("failed to remove address:%p from cache (%s)",
(void *)region->key.d_bptr, ucs_status_string(status));
}
UCT_CUDADRV_FUNC_LOG_ERR(
cuIpcCloseMemHandle((CUdeviceptr)region->mapped_addr));
uct_cuda_ipc_close_memhandle(region);
Contributor

why not check the return code?

@Akshay-Venkatesh (Contributor Author) left a comment

addressed comments

brminich
brminich previously approved these changes Jul 31, 2024
yosefe
yosefe previously approved these changes Jul 31, 2024
@Akshay-Venkatesh (Contributor Author)

@yosefe shall I squash commits?

@Akshay-Venkatesh (Contributor Author)

@yosefe and @brminich I created branch #10036 by applying https://github.com/openucx/ucx/pull/9982.diff on top of the current main branch. Squashing the commits in this branch proved painful after the merges in between. I hope it's OK to ignore this and just merge #10036 into the main branch.

@yosefe (Contributor) commented Aug 1, 2024

@Akshay-Venkatesh it's better to squash and rebase in the current PR. Moving the changes to a new PR makes it harder to review the changes and track history.

@Akshay-Venkatesh (Contributor Author)

@yosefe

@brminich provided the fix for https://redmine.mellanox.com/issues/4005305 and IMB/OMB tests are passing now. I've pushed the commit now. Let me know if the changes look good and if there is a further squash needed.

FYI, there were some fixes needed in the openmpi collective layers and memory-detection layers as well. I've started PRs for those; the fixes are needed for the OMB/IMB collective tests to pass.

@Akshay-Venkatesh (Contributor Author) commented Aug 14, 2024

@yosefe / @brminich errors look unrelated. Would you mind restarting the tests?

@brminich (Contributor) left a comment

👍 besides a minor comment


status = uct_cuda_ipc_close_memhandle(region);
if (status != UCS_OK) {
ucs_error("failed to close memhandle for base addr:%p (%s)",
Contributor

minor: maybe also print region->key.ph.handle_type

Comment on lines 109 to 122
    int reachable;

    reachable = uct_iface_is_reachable_params_addrs_valid(params) &&
                (getpid() != *(pid_t*)params->iface_addr) &&
                uct_iface_scope_is_reachable(tl_iface, params);

#if HAVE_CUDA_FABRIC
    if (uct_cuda_ipc_iface_is_mnnvl_supported(md)) {
        return reachable;
    }
#endif

    return reachable &&
           (ucs_get_system_id() == *((const uint64_t*)params->device_addr));
Contributor

Suggested change
    int reachable;

    reachable = uct_iface_is_reachable_params_addrs_valid(params) &&
                (getpid() != *(pid_t*)params->iface_addr) &&
                uct_iface_scope_is_reachable(tl_iface, params);

#if HAVE_CUDA_FABRIC
    if (uct_cuda_ipc_iface_is_mnnvl_supported(md)) {
        return reachable;
    }
#endif

    return reachable &&
           (ucs_get_system_id() == *((const uint64_t*)params->device_addr));
    if (!uct_iface_is_reachable_params_addrs_valid(params) ||
        (getpid() == *(pid_t*)params->iface_addr) ||
        !uct_iface_scope_is_reachable(tl_iface, params)) {
        return 0;
    }

#if HAVE_CUDA_FABRIC
    return uct_cuda_ipc_iface_is_mnnvl_supported(md);
#else
    return ucs_get_system_id() == *((const uint64_t*)params->device_addr);
#endif

Contributor

this does not seem to be correct. If we have HAVE_CUDA_FABRIC and MNNVL is not supported (e.g. disabled by env var), this function will always return 0
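A sketch of a variant addressing this critique, which mirrors what the merged code in the snippet above already does:

/* Hedged sketch: fall back to the same-system check instead of returning
 * 0 when MNNVL is unsupported or disabled. */
if (!uct_iface_is_reachable_params_addrs_valid(params) ||
    (getpid() == *(pid_t*)params->iface_addr) ||
    !uct_iface_scope_is_reachable(tl_iface, params)) {
    return 0;
}

#if HAVE_CUDA_FABRIC
if (uct_cuda_ipc_iface_is_mnnvl_supported(md)) {
    return 1; /* MNNVL can reach peers on other nodes */
}
#endif

return ucs_get_system_id() == *((const uint64_t*)params->device_addr);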

@@ -320,16 +402,21 @@ uct_cuda_ipc_md_open(uct_component_t *component, const char *md_name,
.mem_attach = ucs_empty_function_return_unsupported,
.detect_memory_type = ucs_empty_function_return_unsupported
};
uct_md_t *md;

Contributor

remove space line

brminich
brminich previously approved these changes Aug 15, 2024
@brminich (Contributor)

error looks relevant

libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I/__w/1/s/contrib/../src/uct/cuda -I../../.. -DCPU_FLAGS= -I/__w/1/s/contrib/../src -I/home/swx-azure-svc_azpcontainer/85835-20240814.18/build -I/home/swx-azure-svc_azpcontainer/85835-20240814.18/build/src -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-language-extension-token -fno-finite-math-only -Wno-recommended-option -Wno-c99-extensions -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement -MT cuda_ipc/libuct_cuda_la-cuda_ipc_cache.lo -MD -MP -MF cuda_ipc/.deps/libuct_cuda_la-cuda_ipc_cache.Tpo -c /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c  -fPIC -DPIC -DUCX_SHARED_LIB -o cuda_ipc/.libs/libuct_cuda_la-cuda_ipc_cache.o
In file included from /__w/1/s/contrib/../src/ucs/debug/log.h:14:0,
                 from /__w/1/s/contrib/../src/uct/base/uct_iface.h:20,
                 from /__w/1/s/contrib/../src/uct/cuda/base/cuda_md.h:9,
                 from /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_md.h:10,
                 from /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.h:14,
                 from /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c:11:
/__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c: In function 'uct_cuda_ipc_cache_invalidate_regions':
/__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c:285:65: error: 'uct_cuda_ipc_md_handle_t' has no member named 'handle_type'
                       (void *)region->key.d_bptr, region->key.ph.handle_type,
                                                                 ^
/__w/1/s/contrib/../src/ucs/debug/log_def.h:35:84: note: in definition of macro 'ucs_log_component'
                              (ucs_log_level_t)(_level), _comp_log_config, _fmt, ## __VA_ARGS__); \
                                                                                    ^
/__w/1/s/contrib/../src/ucs/debug/log_def.h:44:37: note: in expansion of macro 'ucs_log'
 #define ucs_error(_fmt, ...)        ucs_log(UCS_LOG_LEVEL_ERROR, _fmt, ## __VA_ARGS__)
                                     ^
/__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c:284:13: note: in expansion of macro 'ucs_error'
             ucs_error("failed to close memhandle for base addr:%p type:%d (%s)",
             ^
/__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_cache.c: At top level:
cc1: error: unrecognized command line option "-Wno-c99-extensions" [-Werror]

@Akshay-Venkatesh (Contributor Author)

error looks relevant


@brminich that handle_type field was missing a guard. Can you please approve/reject again?

brminich
brminich previously approved these changes Aug 15, 2024
@Akshay-Venkatesh (Contributor Author)

@yosefe gentle ping to check if latest commits look good


return reachable &&
(ucs_get_system_id() == *((const uint64_t*)params->device_addr));
/* Not fabric capable or multi-node nvlink disabled, so iface has to be on
Contributor

space line before

@@ -281,8 +281,14 @@ static void uct_cuda_ipc_cache_invalidate_regions(uct_cuda_ipc_cache_t *cache,

status = uct_cuda_ipc_close_memhandle(region);
if (status != UCS_OK) {
#if HAVE_CUDA_FABRIC
ucs_error("failed to close memhandle for base addr:%p type:%d (%s)",
Contributor

can we put only the access to region->key.ph.handle_type in the ifdef?
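A sketch of what the reviewer suggests: one ucs_error call, with only the field access under the ifdef (the fallback value is a hypothetical placeholder):

        int handle_type;

#if HAVE_CUDA_FABRIC
        handle_type = region->key.ph.handle_type;
#else
        handle_type = -1; /* placeholder when fabric support is compiled out */
#endif
        ucs_error("failed to close memhandle for base addr:%p type:%d (%s)",
                  (void*)region->key.d_bptr, handle_type,
                  ucs_status_string(status));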

@yosefe yosefe enabled auto-merge August 20, 2024 16:16
@Akshay-Venkatesh (Contributor Author)

@yosefe @brminich These errors look unrelated.

@yosefe yosefe merged commit f29ba60 into openucx:master Aug 22, 2024
142 checks passed