cpudist: Summarize task on-CPU time as histograms #585
This is the initial implementation of the tool itself, which uses the sched:sched_switch tracepoint to probe task switches. This is a slightly more robust approach than using a kernel function, such as finish_task_switch. When native BCC support for tracepoints is introduced, this tool can be reimplemented and reliance on the current Tracepoint module can be removed.
Oh, I forgot to add that this is the first tool (I think) that was implemented using a kernel tracepoint instead of a kprobe. I think it works just fine. And of course, when we get proper tracepoint support in BCC, we could update this tool to require Linux 4.7 and use BPF tracepoint support instead of the hack we have right now.
lgtm, though it would be great if you can add proper tracepoint support sooner rather than later :)
@4ast I'd be delighted to, but I will need a walkthrough ... As far as I can tell, there is a new BPF program type for tracepoints now. What would be required on the bcc side to use that?
on the bcc side it just needs to support the new prog_type. fancy clang bits can stay as-is for now, though all auto probe_read insertion will fail, since there is no 'struct pt_regs *ctx' anymore. that can be done later.
Sure, but where do I get the tracepoint struct from? Will the user provide a probe function that takes void* and casts it? Should the clang side generate the struct?
eventually yes, clang rewriter can and probably should (to avoid mistakes) generate appropriate input struct, but for now manual struct is fine. Just as samples/bpf/offwaketime_kern.c does:
no type casting necessary.
Thanks Sasha, looks great! Good usage: I like the choice and use of -p, -P, -L, etc. Yes, it's pretty close to offcpudist #500, so maybe that can be added as an option (maybe "-o"; I'd usually reserve that for output files, but these tools generally just do STDOUT). For discoverability, we could later on add an offcpudist symlink. I tried coming up with a tracepoint example (#567), but it was hard going to process arguments without the new 4.7 stuff. This tool, cpudist, looks much better because you aren't reading args. So I think I'd include this tool as is, and move it to old & update it to 4.7 later on. But I'll still resist adding many more tracepoint scripts until the 4.7 stuff. LGTM.
@brendangregg The usage is all yours -- it's pretty much a verbatim copy from Do you prefer -o or -O? I thought the capital O would be more appropriate, no mixup with -o for output, and capitalized because it's a significant behavior change. But it's very subjective obviously. I'll try to add bcc tracepoint support in the next few days. Need to set up latest 4.7, had some problems building bcc on Fedora with 4.7rc5 lately.
We can go with -O. I guess at some point we might have a tool that uses -o. (e.g., imagine a tool that traced network packets and wrote a tcpdump file!). I think -O, and proper tracepoints, can be added later. This works and is useful on older kernels. Another later addition could be to add execnames to the -P output, which could be read from /proc during output (man page can say these are best effort, and may be inaccurate).
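A minimal sketch of the best-effort execname lookup suggested above, assuming the name is read from /proc/<pid>/comm at output time. The helper name comm_for_pid, the proc_root parameter, and the "[unknown]" placeholder are hypothetical and not part of cpudist; a short-lived task may already have exited (or its PID may have been reused) by the time the histogram is printed, which is why the result can be missing or inaccurate.

```python
import os


def comm_for_pid(pid, proc_root="/proc"):
    """Best-effort lookup of a task's command name at output time.

    Hypothetical helper: returns a placeholder when the task has
    already exited or its /proc entry cannot be read.
    """
    path = os.path.join(proc_root, str(pid), "comm")
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        # Covers a missing entry, permission errors, and similar failures.
        return "[unknown]"
```

Passing proc_root explicitly is only there to make the helper easy to exercise against a temporary directory; a real tool would presumably always read /proc.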
I think this is ok to merge, as there'll be follow-ons. // @4ast
* cpudist: summarize on-CPU time per task as a histogram

  This is the initial implementation of the tool itself, which uses the sched:sched_switch tracepoint to probe task switches. This is a slightly more robust approach than using a kernel function, such as finish_task_switch. When native BCC support for tracepoints is introduced, this tool can be reimplemented and reliance on the current Tracepoint module can be removed.

* cpudist: add man page and examples
This summarizes task on-CPU time as histograms and displays it with a variety of switches, modeled on runqlat. Basic example:

With this, we can close #499.
I don't know what the relationship between #500 and #499 should be. @brendangregg I think you wanted these as separate tools for discoverability -- but the implementation can be 99% shared. Basically we'd have to reverse the histogram update location: update the histogram when the task gets on the CPU, instead of when it's getting off. Do you think it's worth the effort of duplicating the tool itself, its man page, and its examples file? Or can we maybe turn it into a switch, e.g. -O, which would reverse the meaning to off-CPU time instead of on-CPU time?
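To illustrate why the two tools can share nearly all of their logic, here is a user-space sketch in plain Python (not BPF, and not the tool's actual code). It assumes each task switch is available as a (timestamp, prev_pid, next_pid) tuple; the function names, the off_cpu flag, and the event format are all hypothetical. The only difference between the two modes is which side of the switch closes an interval and which side opens one.

```python
from collections import Counter, defaultdict


def summarize(events, off_cpu=False):
    """Collect per-PID durations from time-ordered switch events.

    events: iterable of (timestamp, prev_pid, next_pid) tuples.
    off_cpu=False measures time spent on the CPU; off_cpu=True
    measures time spent off the CPU.
    """
    start = {}
    durations = defaultdict(list)
    for ts, prev_pid, next_pid in events:
        # On-CPU: the task leaving closes an interval, the task arriving opens one.
        # Off-CPU: the roles are simply reversed.
        closing, opening = (next_pid, prev_pid) if off_cpu else (prev_pid, next_pid)
        if closing in start:
            durations[closing].append(ts - start.pop(closing))
        start[opening] = ts
    return dict(durations)


def log2_histogram(values):
    """Count values per power-of-two slot (slot n holds 2**n .. 2**(n+1) - 1)."""
    return Counter(max(int(v), 1).bit_length() - 1 for v in values)


events = [(0, 0, 1), (10, 1, 2), (25, 2, 1), (30, 1, 0)]
on = summarize(events)                 # {1: [10, 5], 2: [15]}
off = summarize(events, off_cpu=True)  # {1: [15], 0: [30]}
```

With the sample events, PID 1 runs for 10 and then 5 time units and is off the CPU for 15 in between, which is what the two calls report.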