-
Notifications
You must be signed in to change notification settings - Fork 412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP/WIREUP/IB: provide a reason when a resource's lane is unreachable #9995
base: master
Are you sure you want to change the base?
UCP/WIREUP/IB: provide a reason when a resource's lane is unreachable #9995
Conversation
d8857c2
to
6fba9fb
Compare
Please fix the PR title to match the main commit (conventions + capitals). |
Please edit PR description:
|
src/ucp/wireup/wireup.c
Outdated
if (info_str != NULL) { | ||
params.field_mask |= UCT_IFACE_IS_REACHABLE_FIELD_INFO_STRING | | ||
UCT_IFACE_IS_REACHABLE_FIELD_INFO_STRING_LENGTH; | ||
params.info_string = info_str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolved
src/uct/base/uct_iface.c
Outdated
va_start(args, fmt); | ||
|
||
if (info_str_buf && info_str_buf_len > 0 && info_str_buf[0] == '\0') { | ||
snprintf(info_str_buf, info_str_buf_len, fmt, args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ucs_snprintf_safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, it might be called several times in a single flow (e.g uct_ib_iface_dev_addr_is_reachable
then in the caller uct_ib_iface_is_reachable_v2
). The strings overwrite each other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gleon99 This is why I am checking info_str_buf
is empty (info_str_buf[0] == '\0'
), so it won't override an existing reason. I assume the first time info_str_buf
is filled happens in the deepest level of this check (== the most descriptive one), yet I want to provide a more basic reason in case it wasn't filled earlier in the flow.
@yosefe it answers your other question about why we need info_str_buf[0] == '\0'
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pls post example error messages from UCP level?
Maybe add guest for it?
src/ucp/wireup/select.c
Outdated
ucs_diag(UCT_TL_RESOURCE_DESC_FMT" : unreachable [%s]", | ||
UCT_TL_RESOURCE_DESC_ARG(resource), wireup_info); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can wireup_info be empty string here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can't be empty unless we choose to not return info string in some cases (such invalid params)..
At first I was making sure it's filled in every scenario but you're right that the code doesn't ensure that
src/uct/ib/rc/accel/rc_mlx5_iface.c
Outdated
@@ -713,6 +713,9 @@ uct_rc_mlx5_iface_is_reachable_v2(const uct_iface_h tl_iface, | |||
iface_addr, IFACE_ADDR, NULL); | |||
/* Check hardware tag matching compatibility */ | |||
if ((iface_addr != NULL) && (my_type != *(uint8_t*)iface_addr)) { | |||
uct_iface_populate_info_str_buf( | |||
params, "incompatible hardware tag matching"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to print what is not compatible
8b05c4f
to
335d7a3
Compare
…f fails to connect)
src/uct/ib/mlx5/rc/rc_mlx5_iface.c
Outdated
@@ -647,6 +646,7 @@ uct_rc_mlx5_iface_is_reachable_v2(const uct_iface_h tl_iface, | |||
uint8_t my_type = uct_rc_mlx5_iface_get_address_type(tl_iface); | |||
uint8_t remote_type; | |||
const uct_iface_addr_t *iface_addr; | |||
static const char *tm_type_to_str[] = {"basic", "tag matching"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls move it before my_type since initialized vars should be before uninitialized ones
37d0f41
to
da2d2fb
Compare
src/uct/ib/base/ib_iface.c
Outdated
return 0; | ||
} | ||
|
||
if (!uct_ib_iface_dev_addr_is_reachable(iface, device_addr)) { | ||
if (!uct_ib_iface_dev_addr_is_reachable(iface, device_addr, params)) { | ||
uct_iface_fill_info_str_buf(params, "unreachable ib device address"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ib
-> IB
src/uct/base/uct_iface.c
Outdated
@@ -289,6 +307,15 @@ int uct_iface_is_reachable_v2(const uct_iface_h tl_iface, | |||
const uct_iface_is_reachable_params_t *params) | |||
{ | |||
const uct_base_iface_t *iface = ucs_derived_of(tl_iface, uct_base_iface_t); | |||
char *info_str_buf = UCS_PARAM_VALUE(UCT_IFACE_IS_REACHABLE_FIELD, params, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align =
src/uct/base/uct_iface.c
Outdated
INFO_STRING_LENGTH, 0); | ||
va_list ap; | ||
|
||
va_start(ap, fmt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead:
if (buf == NULL) {
return;
}
va_start(..);
..
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's a good idea
src/uct/ib/mlx5/rc/rc_mlx5_iface.c
Outdated
const uct_iface_addr_t *iface_addr; | ||
|
||
iface_addr = UCS_PARAM_VALUE(UCT_IFACE_IS_REACHABLE_FIELD, params, | ||
iface_addr, IFACE_ADDR, NULL); | ||
remote_type = *(uint8_t*)iface_addr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe dereference only after checking != NULL
below?
remote_type = *(uint8_t*)iface_addr; | ||
|
||
/* Check hardware tag matching compatibility */ | ||
if ((iface_addr != NULL) && (my_type != remote_type)) { | ||
if ((iface_addr != NULL) && | ||
(remote_type = *(uint8_t*)iface_addr != my_type)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gleon99 if it's too messy I can just dereference twice..
What
During the wireup process, provide a reason when a resource's lane is unreachable.
Why
Users need more information about why a device is not reachable after an unsuccessful connection establishment. This information should be passed from UCT to UCP.
How ?
Pass a string to select/search_lane functions so in case the wireup process fails, the reason will be printed out in an upper layer.