Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/WIREUP/IB: provide a reason when a resource's lane is unreachable #9995

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

amastbaum
Copy link
Collaborator

@amastbaum amastbaum commented Jul 7, 2024

What

During the wireup process, provide a reason when a resource's lane is unreachable.

Why

Users need more information about why a device is not reachable after an unsuccessful connection establishment. This information should be passed from UCT to UCP.

How ?

Pass a string to select/search_lane functions so in case the wireup process fails, the reason will be printed out in an upper layer.

@amastbaum amastbaum added WIP-DNM Work in progress / Do not review and removed WIP-DNM Work in progress / Do not review labels Jul 7, 2024
@amastbaum amastbaum requested a review from gleon99 July 11, 2024 07:52
@amastbaum amastbaum force-pushed the specify_a_reason_for_wireup_failure branch from d8857c2 to 6fba9fb Compare July 11, 2024 08:01
@amastbaum amastbaum marked this pull request as ready for review July 11, 2024 08:55
@gleon99
Copy link
Contributor

gleon99 commented Jul 11, 2024

Please fix the PR title to match the main commit (conventions + capitals).

@gleon99
Copy link
Contributor

gleon99 commented Jul 11, 2024

Please edit PR description:

  1. Capitals, grammar, styling.
  2. Add the "Why?" section, as per our conventions.

@amastbaum amastbaum changed the title print out a reason when a resource's lane is unreachable UCP/WIREUP/IB: provide a reason when a resource's lane is unreachable Jul 11, 2024
src/ucp/wireup/select.c Outdated Show resolved Hide resolved
if (info_str != NULL) {
params.field_mask |= UCT_IFACE_IS_REACHABLE_FIELD_INFO_STRING |
UCT_IFACE_IS_REACHABLE_FIELD_INFO_STRING_LENGTH;
params.info_string = info_str;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align =

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolved

src/uct/ib/rc/accel/rc_mlx5_iface.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/ucp/wireup/select.c Outdated Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
src/ucp/wireup/select.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
va_start(args, fmt);

if (info_str_buf && info_str_buf_len > 0 && info_str_buf[0] == '\0') {
snprintf(info_str_buf, info_str_buf_len, fmt, args);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucs_snprintf_safe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, it might be called several times in a single flow (e.g uct_ib_iface_dev_addr_is_reachable then in the caller uct_ib_iface_is_reachable_v2). The strings overwrite each other?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gleon99 This is why I am checking info_str_buf is empty (info_str_buf[0] == '\0'), so it won't override an existing reason. I assume the first time info_str_buf is filled happens in the deepest level of this check (== the most descriptive one), yet I want to provide a more basic reason in case it wasn't filled earlier in the flow.

@yosefe it answers your other question about why we need info_str_buf[0] == '\0' .

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pls post example error messages from UCP level?
Maybe add guest for it?

src/ucp/wireup/select.c Outdated Show resolved Hide resolved
Comment on lines 640 to 641
ucs_diag(UCT_TL_RESOURCE_DESC_FMT" : unreachable [%s]",
UCT_TL_RESOURCE_DESC_ARG(resource), wireup_info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can wireup_info be empty string here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't be empty unless we choose to not return info string in some cases (such invalid params)..
At first I was making sure it's filled in every scenario but you're right that the code doesn't ensure that

src/ucp/wireup/select.c Outdated Show resolved Hide resolved
src/ucp/wireup/select.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/uct/ib/dc/dc_mlx5.c Outdated Show resolved Hide resolved
@@ -713,6 +713,9 @@ uct_rc_mlx5_iface_is_reachable_v2(const uct_iface_h tl_iface,
iface_addr, IFACE_ADDR, NULL);
/* Check hardware tag matching compatibility */
if ((iface_addr != NULL) && (my_type != *(uint8_t*)iface_addr)) {
uct_iface_populate_info_str_buf(
params, "incompatible hardware tag matching");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to print what is not compatible

@amastbaum amastbaum force-pushed the specify_a_reason_for_wireup_failure branch from 8b05c4f to 335d7a3 Compare July 16, 2024 10:44
src/uct/base/uct_iface.h Outdated Show resolved Hide resolved
src/uct/base/uct_iface.h Outdated Show resolved Hide resolved
src/ucp/wireup/select.c Outdated Show resolved Hide resolved
src/ucp/wireup/select.c Outdated Show resolved Hide resolved
src/ucp/wireup/select.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/uct/ib/base/ib_iface.c Outdated Show resolved Hide resolved
src/uct/ib/dc/dc_mlx5.c Outdated Show resolved Hide resolved
src/uct/ib/rc/accel/rc_mlx5_iface.c Outdated Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
src/uct/base/uct_iface.c Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
src/uct/ib/mlx5/rc/rc_mlx5_iface.c Outdated Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
@@ -647,6 +646,7 @@ uct_rc_mlx5_iface_is_reachable_v2(const uct_iface_h tl_iface,
uint8_t my_type = uct_rc_mlx5_iface_get_address_type(tl_iface);
uint8_t remote_type;
const uct_iface_addr_t *iface_addr;
static const char *tm_type_to_str[] = {"basic", "tag matching"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move it before my_type since initialized vars should be before uninitialized ones

@amastbaum amastbaum force-pushed the specify_a_reason_for_wireup_failure branch from 37d0f41 to da2d2fb Compare July 22, 2024 06:42
return 0;
}

if (!uct_ib_iface_dev_addr_is_reachable(iface, device_addr)) {
if (!uct_ib_iface_dev_addr_is_reachable(iface, device_addr, params)) {
uct_iface_fill_info_str_buf(params, "unreachable ib device address");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ib -> IB

@@ -289,6 +307,15 @@ int uct_iface_is_reachable_v2(const uct_iface_h tl_iface,
const uct_iface_is_reachable_params_t *params)
{
const uct_base_iface_t *iface = ucs_derived_of(tl_iface, uct_base_iface_t);
char *info_str_buf = UCS_PARAM_VALUE(UCT_IFACE_IS_REACHABLE_FIELD, params,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align =

INFO_STRING_LENGTH, 0);
va_list ap;

va_start(ap, fmt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead:

if (buf == NULL) {
    return;
}

va_start(..);
..

wdyt?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a good idea

const uct_iface_addr_t *iface_addr;

iface_addr = UCS_PARAM_VALUE(UCT_IFACE_IS_REACHABLE_FIELD, params,
iface_addr, IFACE_ADDR, NULL);
remote_type = *(uint8_t*)iface_addr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe dereference only after checking != NULL below?

Comment on lines 653 to 656
remote_type = *(uint8_t*)iface_addr;

/* Check hardware tag matching compatibility */
if ((iface_addr != NULL) && (my_type != remote_type)) {
if ((iface_addr != NULL) &&
(remote_type = *(uint8_t*)iface_addr != my_type)) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gleon99 if it's too messy I can just dereference twice..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants