Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

providers/rxe: Add tracepoint for post send #1384

Merged
merged 3 commits into from
Nov 3, 2023

Conversation

pizhenwei
Copy link
Contributor

@pizhenwei pizhenwei commented Aug 30, 2023

Hi,
I'd like to introduce LTTng based tracepoints for RXE into rdma-core for trouble shooting/performance monitor..

An sample of python script:

#!/usr/bin/env python3
# SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
# Copyright (c) 2023, Bytedance. All rights reserved.  See COPYING file

import argparse
import bt2
import collections
import sys


def stats(args):
	stats = {}

	tcm_it = bt2.TraceCollectionMessageIterator(args.tracepoints)
	for msg in tcm_it:
		if type(msg) is not bt2._EventMessageConst:
			continue

		event = msg.event
		if event.cls.name != "rdma_core_rxe:post_send":
			continue

		# device filter
		dev = str(event.payload_field["dev"])
		if args.device != None and args.device != dev:
			continue;

		# QPN filter
		qpn = event.payload_field['src_qp_num']
		if args.qpn != None and args.qpn != qpn:
			continue;

		# opcode filter
		opcode = event.payload_field['opcode']
		if args.opcode != None and args.opcode != opcode:
			continue;

		key = dev + "-" + str(qpn)
		qpstats = stats.get(key)
		if qpstats is None:
			qpstats = {}
			qpstats["dev"] = dev
			qpstats["qpn"] = qpn
			qpstats["bytes"] = 0
			qpstats["ops"] = 0
			stats[key] = qpstats

		bytes = event.payload_field['bytes']
		qpstats["bytes"] += bytes
		qpstats["ops"] += 1

		qpstat = qpstats.get(opcode)
		if qpstat is None:
			qpstats[opcode] = bytes
		else:
			qpstats[opcode] += bytes

	# sorted by total bytes of a QP
	sorted_stats = sorted(stats.items(),
		key=lambda item: (item[1].get("bytes", 0)),
		reverse=True)

	# show header
	print("%16s %10s %10s %10s" % ("Device", "QPN", "Bytes", "Operations"))

	# show QP stats
	for qpstats in sorted_stats:
		stats = qpstats[1]
		print("%16s %10d %10d %10d" %
			(stats["dev"], stats["qpn"], stats["bytes"], stats["ops"]))

def parse_arg():
	examples = """LTTng tracing examples:
	# lttng create rdma-core-stats-session
	# lttng list --userspace
	# lttng enable-event --userspace rdma_core_ibverbs:post_send
	# lttng start
	# ...
	# lttng destroy
	# ibv_stats -t /path/to/lttng/tracepoints
	"""

	parser = argparse.ArgumentParser(
			description="Summarize IB verbs statistics",
			formatter_class=argparse.RawDescriptionHelpFormatter,
			epilog=examples)

	parser.add_argument("-t", "--tracepoints",
		help="specify the LTTng tracepoints directory")

	parser.add_argument("-d", "--device",
		help="specify IB device filter")

	parser.add_argument("-q", "--qpn", type=int,
		help="specify source QP number filter")

	parser.add_argument("-o", "--opcode",
 		help="specify opcode filter")

	return parser.parse_args()


if __name__ == '__main__':
	args = parse_arg()

	if args.tracepoints is None:
		print("Missing LTTng tracepoints directory(use -t/--tracepoints)")
		sys.exit(0)

	stats(args)


@pizhenwei pizhenwei force-pushed the qp-stats branch 5 times, most recently from d98d5ea to 9f559bd Compare September 5, 2023 02:46
@rleon
Copy link
Member

rleon commented Sep 5, 2023

I have mixed feelings about this tool being part of rdma-core and before we deep dive into code review, it is worth to understand if rdma-core is right place for it.

@jgunthorpe ???

@pizhenwei
Copy link
Contributor Author

I have mixed feelings about this tool being part of rdma-core and before we deep dive into code review, it is worth to understand if rdma-core is right place for it.

@jgunthorpe ???

I struggled for this during developing this, a kernel proc entry(like TCP /proc/net/tcp) or netlink would be better. But this needs a long step to achieve, as far as I can see, they need both ib-core framework and HCA drivers to support this.

@jgunthorpe
Copy link
Member

I think the general concept of some tooling to support introspection is a fine topic for rdma-core

But, I don't think this implementation is that compatible with the design philosophy. Could you instead use something like the new Linux trace toolkit support to just add some interesting tracepoints and reconstruct the data you want from there instead?

@pizhenwei
Copy link
Contributor Author

I think the general concept of some tooling to support introspection is a fine topic for rdma-core

But, I don't think this implementation is that compatible with the design philosophy. Could you instead use something like the new Linux trace toolkit support to just add some interesting tracepoints and reconstruct the data you want from there instead?

Hi @rleon @jgunthorpe
I notice that some RDMA providers(from rdma-core) kick hardware by writing doorbell register in the user space, this means we can NOT hook kernel functions by kprobe or add tracepoints in kernel. And yes, this is the motivation that I add this tool in rdma-core.

@mrgolin
Copy link
Contributor

mrgolin commented Sep 7, 2023

@pizhenwei There is LTTng support in rdma-core that you can use to add additional tracepoints with data that you're interested in, then you can analyze it offline as @jgunthorpe suggested or use one of many opensource tools that are capable of displaying LTTng traces.
See: #1351

@pizhenwei
Copy link
Contributor Author

pizhenwei commented Sep 7, 2023

@pizhenwei There is LTTng support in rdma-core that you can use to add additional tracepoints with data that you're interested in, then you can analyze it offline as @jgunthorpe suggested or use one of many opensource tools that are capable of displaying LTTng traces. See: #1351

Thanks!

@pizhenwei
Copy link
Contributor Author

Hi @jgunthorpe @rleon @mrgolin
I force pushed a new version which is re-implemented by LTTNG, would you please review this version?
ibv_top is missing currently, once this version is reviewed as acceptable, I'd continue to rework this tool.

@pizhenwei pizhenwei reopened this Sep 12, 2023
@pizhenwei pizhenwei force-pushed the qp-stats branch 2 times, most recently from b559179 to 2a04429 Compare September 14, 2023 02:14
@pizhenwei
Copy link
Contributor Author

Hi @jgunthorpe @rleon @mrgolin
LTTng based 'ibv_stats' could be tested in this PR.

{
return qp->context->ops.post_send(qp, wr, bad_wr);
}
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making ibv_post_send non-inline would break some existing applications that relay on dlopen to load libibverbs at runtime. For example:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the dlopen style would still work

static inline ncclResult_t wrap_ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr) {
  int ret = qp->context->ops.post_send(qp, wr, bad_wr); /*returns 0 on success, or the value of errno on failure (which indicates the failure reason)*/
  if (ret != IBV_SUCCESS) {
    WARN("ibv_post_send() failed with error %s, Bad WR %p, First WR %p", strerror(ret), wr, *bad_wr);
    return ncclSystemError;
  }
  return ncclSuccess;
}

but it works around the LTTng tracepoint.

Frankly, the dlopen style always has the risk on symbol changing, is this good practice?

As far as I can see, we could hide the ops data structure in the future and use exported functions instead.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using dlopen to load libibverbs seems to be a de facto way to decouple the compilation environment from the runtime environment, which can generate a single executable file that is able to run across different versions of libibverbs. And as far as I know, for a long time libibverbs has only introduced incremental API changes and has not modified existing interfaces in its header, which is a good news for dlopen method.

BTW, making ibv_post_send non-inline will introduce an extra function call site, which may add small but non-negligible overhead to the data path, especially when the IO workload is very high. I guess this is the reason why ibv_post_send is an inline function in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, function call is not expensive anymore on a modern computer.

# cat inline.c 
void __attribute__ ((noinline)) foo(void)
{
}

int main()
{
	for (int i = 0; i < 1000000000; i++)
		foo();

	return 0;
}

# gcc inline.c -O0 -o inline
# time ./inline 

real	0m1.559s
user	0m1.553s
sys	0m0.004s

An empty function call uses about 1.6ns on Intel i7-11700K

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not valuable enough a functionality to risk the ABI, please don't change things. Put your tracepoint in the drivers instead.

Copy link
Contributor Author

@pizhenwei pizhenwei Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not valuable enough a functionality to risk the ABI, please don't change things. Put your tracepoint in the drivers instead.

Hi @jgunthorpe
I have several questions:

  • A function call is not expensive anymore on a modern computer, hence is there any plan to hide the ibv operations, and expose functions instead in the future?
  • As you suggested, I'd put this tracepoint into both mlx5 and rxe provider(I'm using this IB adapter in production, and rxe for test). If someday another vendor IB adapter is used, I'll add a similar tracepoint into its provider driver. Is this a better solution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not valuable enough a functionality to risk the ABI, please don't change things. Put your tracepoint in the drivers instead.

Hi @jgunthorpe I have several questions:

  • A function call is not expensive anymore on a modern computer, hence is there any plan to hide the ibv operations, and expose functions instead in the future?
  • As you suggested, I'd put this tracepoint into both mlx5 and rxe provider(I'm using this IB adapter in production, and rxe for test). If someday another vendor IB adapter is used, I'll add a similar tracepoint into its provider driver. Is this a better solution?

@jgunthorpe Could you please give me any hint?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the tracepoint in the driver function

@pizhenwei pizhenwei changed the title libibverbs: support QP stats providers/rxe: Add tracepoint for post send Sep 28, 2023
@rleon
Copy link
Member

rleon commented Oct 16, 2023

FAILED: providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o
/usr/bin/clang-15 -DVERBS_DEBUG -Dmlx5_EXPORTS -I/__w/1/s/build-clang/include -I/usr/include/libnl3 -I/usr/include/drm -Wall -Wextra -Wno-sign-compare -Wno-unused-parameter -Wmissing-prototypes -Wmissing-declarations -Wwrite-strings -Wformat=2 -Wcast-function-type -Wformat-nonliteral -Wdate-time -Wnested-externs -Wshadow -Wno-gnu-variable-sized-type-not-at-end -Wstrict-prototypes -Wold-style-definition -Werror -Wredundant-decls -g -fPIC -std=gnu11 -MD -MT providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o -MF providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o.d -o providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o -c /__w/1/s/providers/mlx5/qp.c
/__w/1/s/providers/mlx5/qp.c:827:15: error: variable 'length' set but not used [-Werror,-Wunused-but-set-variable]
unsigned int length;
^
1 error generated.

@pizhenwei pizhenwei force-pushed the qp-stats branch 2 times, most recently from c250e90 to b8f7110 Compare October 16, 2023 09:03
@pizhenwei
Copy link
Contributor Author

FAILED: providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o /usr/bin/clang-15 -DVERBS_DEBUG -Dmlx5_EXPORTS -I/__w/1/s/build-clang/include -I/usr/include/libnl3 -I/usr/include/drm -Wall -Wextra -Wno-sign-compare -Wno-unused-parameter -Wmissing-prototypes -Wmissing-declarations -Wwrite-strings -Wformat=2 -Wcast-function-type -Wformat-nonliteral -Wdate-time -Wnested-externs -Wshadow -Wno-gnu-variable-sized-type-not-at-end -Wstrict-prototypes -Wold-style-definition -Werror -Wredundant-decls -g -fPIC -std=gnu11 -MD -MT providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o -MF providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o.d -o providers/mlx5/CMakeFiles/mlx5.dir/qp.c.o -c /__w/1/s/providers/mlx5/qp.c /__w/1/s/providers/mlx5/qp.c:827:15: error: variable 'length' set but not used [-Werror,-Wunused-but-set-variable] unsigned int length; ^ 1 error generated.

Hi @rleon
I play a trick to avoid this compiling warning, please correct me if you have a better suggestion.
I tested RXE on ubuntu-2204( in a virtual machine), but I could not build LTTNG on a debian9 host with mlx5 NIC(the lower LTTNG version leads compiling error), so I did not test mlx5 on a real server.

providers/mlx5/qp.c Outdated Show resolved Hide resolved
libibverbs/libibverbs.map.in Outdated Show resolved Hide resolved
Generally, ibv_wr_opcode_str() helper function converts ibv wr opcode
to string for debug/tracing.

Another little change in this commit:
- sort 'ibv_cmd_modify_cq'/'verbs_init_cq' by alphabet.

Signed-off-by: zhenwei pi <[email protected]>
Add initial support for RXE LTTng tracing using compiled out macros if
LTTng flag isn't provided.

Once LTTng records tracepoints, an example of python script to analyze
per-qp traffic:

 #!/usr/bin/env python3
 # SPDX-License-Identifier: (GPL-2.0 OR Linux-OpenIB)
 # Copyright (c) 2023, Bytedance. All rights reserved.  See COPYING file

import argparse
import bt2
import collections
import sys

def stats(args):
	stats = {}

	tcm_it = bt2.TraceCollectionMessageIterator(args.tracepoints)
	for msg in tcm_it:
		if type(msg) is not bt2._EventMessageConst:
			continue

		event = msg.event
		if event.cls.name != "rdma_core_rxe:post_send":
			continue

		# device filter
		dev = str(event.payload_field["dev"])
		if args.device != None and args.device != dev:
			continue;

		# QPN filter
		qpn = event.payload_field['src_qp_num']
		if args.qpn != None and args.qpn != qpn:
			continue;

		# opcode filter
		opcode = event.payload_field['opcode']
		if args.opcode != None and args.opcode != opcode:
			continue;

		key = dev + "-" + str(qpn)
		qpstats = stats.get(key)
		if qpstats is None:
			qpstats = {}
			qpstats["dev"] = dev
			qpstats["qpn"] = qpn
			qpstats["bytes"] = 0
			qpstats["ops"] = 0
			stats[key] = qpstats

		bytes = event.payload_field['bytes']
		qpstats["bytes"] += bytes
		qpstats["ops"] += 1

		qpstat = qpstats.get(opcode)
		if qpstat is None:
			qpstats[opcode] = bytes
		else:
			qpstats[opcode] += bytes

	# sorted by total bytes of a QP
	sorted_stats = sorted(stats.items(),
		key=lambda item: (item[1].get("bytes", 0)),
		reverse=True)

	# show header
	print("%16s %10s %10s %10s" % ("Device", "QPN", "Bytes", "Operations"))

	# show QP stats
	for qpstats in sorted_stats:
		stats = qpstats[1]
		print("%16s %10d %10d %10d" %
			(stats["dev"], stats["qpn"], stats["bytes"], stats["ops"]))

def parse_arg():
	examples = """LTTng tracing examples:
	# lttng create rdma-core-stats-session
	# lttng list --userspace
	# lttng enable-event --userspace rdma_core_ibverbs:post_send
	# lttng start
	# ...
	# lttng destroy
	# ibv_stats -t /path/to/lttng/tracepoints
	"""

	parser = argparse.ArgumentParser(
			description="Summarize IB verbs statistics",
			formatter_class=argparse.RawDescriptionHelpFormatter,
			epilog=examples)

	parser.add_argument("-t", "--tracepoints",
		help="specify the LTTng tracepoints directory")

	parser.add_argument("-d", "--device",
		help="specify IB device filter")

	parser.add_argument("-q", "--qpn", type=int,
		help="specify source QP number filter")

	parser.add_argument("-o", "--opcode",
 		help="specify opcode filter")

	return parser.parse_args()

if __name__ == '__main__':
	args = parse_arg()

	if args.tracepoints is None:
		print("Missing LTTng tracepoints directory(use -t/--tracepoints)")
		sys.exit(0)

	stats(args)

Signed-off-by: zhenwei pi <[email protected]>
Add initial support for MLX5 LTTng tracing using compiled out macros
if LTTng flag isn't provided.

Signed-off-by: zhenwei pi <[email protected]>
@pizhenwei
Copy link
Contributor Author

@rleon Force pushed a new version as you suggested.

@jgunthorpe
Copy link
Member

I don't have an objection to this new version

@rleon rleon merged commit 8b631c2 into linux-rdma:master Nov 3, 2023
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants