providers/rxe: Add tracepoint for post send #1384
Conversation
Force-pushed d98d5ea to 9f559bd
I have mixed feelings about this tool being part of rdma-core, and before we deep dive into code review, it is worth understanding whether rdma-core is the right place for it. @jgunthorpe ???
I struggled with this while developing it, a kernel proc entry (like TCP
I think the general concept of some tooling to support introspection is a fine topic for rdma-core. But I don't think this implementation is compatible with the design philosophy. Could you instead use something like the new Linux trace toolkit support to just add some interesting tracepoints and reconstruct the data you want from there instead?
Hi @rleon @jgunthorpe
@pizhenwei There is LTTng support in rdma-core that you can use to add additional tracepoints with data that you're interested in; then you can analyze it offline as @jgunthorpe suggested, or use one of many open-source tools that are capable of displaying LTTng traces.
Thanks!
Hi @jgunthorpe @rleon @mrgolin
Force-pushed b559179 to 2a04429
Hi @jgunthorpe @rleon @mrgolin |
libibverbs/verbs.h (Outdated)
{
	return qp->context->ops.post_send(qp, wr, bad_wr);
}
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
Making ibv_post_send non-inline would break some existing applications that rely on dlopen to load libibverbs at runtime. For example:
I suppose the dlopen style would still work
static inline ncclResult_t wrap_ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr) {
	/* returns 0 on success, or the value of errno on failure (which indicates the failure reason) */
	int ret = qp->context->ops.post_send(qp, wr, bad_wr);
	if (ret != IBV_SUCCESS) {
		WARN("ibv_post_send() failed with error %s, Bad WR %p, First WR %p", strerror(ret), wr, *bad_wr);
		return ncclSystemError;
	}
	return ncclSuccess;
}
but it bypasses the LTTng tracepoint.
Frankly, the dlopen style always carries the risk of symbols changing; is this good practice?
As far as I can see, we could hide the ops data structure in the future and use exported functions instead.
Using dlopen to load libibverbs seems to be a de facto way to decouple the compilation environment from the runtime environment, which makes it possible to build a single executable that can run across different versions of libibverbs. And as far as I know, libibverbs has for a long time introduced only incremental API changes without modifying existing interfaces in its header, which is good news for the dlopen method.
BTW, making ibv_post_send non-inline will introduce an extra function call site, which may add small but non-negligible overhead to the data path, especially when the IO workload is very high. I guess this is why ibv_post_send is an inline function in the first place.
Actually, a function call is not expensive anymore on a modern computer:
# cat inline.c
void __attribute__ ((noinline)) foo(void)
{
}

int main()
{
	for (int i = 0; i < 1000000000; i++)
		foo();
	return 0;
}
# gcc inline.c -O0 -o inline
# time ./inline
real 0m1.559s
user 0m1.553s
sys 0m0.004s
An empty function call takes about 1.6 ns on an Intel i7-11700K.
PING @jgunthorpe @rleon
This is not valuable enough a functionality to risk the ABI, please don't change things. Put your tracepoint in the drivers instead.
This is not valuable enough a functionality to risk the ABI, please don't change things. Put your tracepoint in the drivers instead.
Hi @jgunthorpe
I have several questions:
- A function call is not expensive anymore on a modern computer, so is there any plan to hide the ibv operations and expose functions instead in the future?
- As you suggested, I'd put this tracepoint into both the mlx5 and rxe providers (I'm using this IB adapter in production, and rxe for test). If someday another vendor's IB adapter is used, I'll add a similar tracepoint to its provider driver. Is this a better solution?
@jgunthorpe Could you please give me any hint?
Put the tracepoint in the driver function
FAILED: providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o
Force-pushed c250e90 to b8f7110
Hi @rleon
Generally, the ibv_wr_opcode_str() helper function converts an ibv WR
opcode to a string for debug/tracing.

Another little change in this commit:
- sort 'ibv_cmd_modify_cq'/'verbs_init_cq' by alphabet.

Signed-off-by: zhenwei pi <[email protected]>
Add initial support for RXE LTTng tracing using compiled-out macros
if the LTTng flag isn't provided.

Once LTTng records tracepoints, an example of a python script to
analyze per-QP traffic:

#!/usr/bin/env python3
# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
# Copyright (c) 2023, Bytedance. All rights reserved. See COPYING file

import argparse
import bt2
import collections
import sys

def stats(args):
    stats = {}
    tcm_it = bt2.TraceCollectionMessageIterator(args.tracepoints)
    for msg in tcm_it:
        if type(msg) is not bt2._EventMessageConst:
            continue

        event = msg.event
        if event.cls.name != "rdma_core_rxe:post_send":
            continue

        # device filter
        dev = str(event.payload_field["dev"])
        if args.device != None and args.device != dev:
            continue

        # QPN filter
        qpn = event.payload_field['src_qp_num']
        if args.qpn != None and args.qpn != qpn:
            continue

        # opcode filter
        opcode = event.payload_field['opcode']
        if args.opcode != None and args.opcode != opcode:
            continue

        key = dev + "-" + str(qpn)
        qpstats = stats.get(key)
        if qpstats is None:
            qpstats = {}
            qpstats["dev"] = dev
            qpstats["qpn"] = qpn
            qpstats["bytes"] = 0
            qpstats["ops"] = 0
            stats[key] = qpstats

        bytes = event.payload_field['bytes']
        qpstats["bytes"] += bytes
        qpstats["ops"] += 1

        qpstat = qpstats.get(opcode)
        if qpstat is None:
            qpstats[opcode] = bytes
        else:
            qpstats[opcode] += bytes

    # sorted by total bytes of a QP
    sorted_stats = sorted(stats.items(),
                          key=lambda item: (item[1].get("bytes", 0)),
                          reverse=True)

    # show header
    print("%16s %10s %10s %10s" % ("Device", "QPN", "Bytes", "Operations"))

    # show QP stats
    for qpstats in sorted_stats:
        stats = qpstats[1]
        print("%16s %10d %10d %10d" % (stats["dev"], stats["qpn"],
                                       stats["bytes"], stats["ops"]))

def parse_arg():
    examples = """LTTng tracing examples:
    # lttng create rdma-core-stats-session
    # lttng list --userspace
    # lttng enable-event --userspace rdma_core_ibverbs:post_send
    # lttng start
    # ...
    # lttng destroy
    # ibv_stats -t /path/to/lttng/tracepoints
"""
    parser = argparse.ArgumentParser(
        description="Summarize IB verbs statistics",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=examples)
    parser.add_argument("-t", "--tracepoints",
                        help="specify the LTTng tracepoints directory")
    parser.add_argument("-d", "--device", help="specify IB device filter")
    parser.add_argument("-q", "--qpn", type=int,
                        help="specify source QP number filter")
    parser.add_argument("-o", "--opcode", help="specify opcode filter")
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_arg()
    if args.tracepoints is None:
        print("Missing LTTng tracepoints directory(use -t/--tracepoints)")
        sys.exit(0)
    stats(args)

Signed-off-by: zhenwei pi <[email protected]>
Add initial support for MLX5 LTTng tracing using compiled-out macros
if the LTTng flag isn't provided.

Signed-off-by: zhenwei pi <[email protected]>
@rleon Force-pushed a new version as you suggested.
I don't have an objection to this new version.
Hi,
I'd like to introduce LTTng-based tracepoints for RXE into rdma-core for troubleshooting/performance monitoring.
A sample python script to analyze the traces is included in the commit message.