cpudist: Summarize task on-CPU time as histograms #585

Merged 2 commits on Jun 29, 2016

goldshtn
Collaborator

This summarizes task on-CPU time as histograms and displays it with a variety of switches, modeled on runqlat. Basic example:

# ./cpudist.py
Tracing on-CPU time... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 1        |                                        |
         4 -> 7          : 1        |                                        |
         8 -> 15         : 13       |**                                      |
        16 -> 31         : 187      |****************************************|
        32 -> 63         : 89       |*******************                     |
        64 -> 127        : 26       |*****                                   |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 1        |                                        |
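For readers unfamiliar with the output format, here is a minimal sketch of how power-of-two bucketing of this kind can be computed. It is a plain-Python illustration of the layout shown above, not the BPF code the tool uses; the function names are made up for this example.

```python
from collections import Counter


def log2_bucket(value):
    """Return the power-of-two bucket index for a non-negative integer.

    Bucket 0 holds 0..1, bucket 1 holds 2..3, bucket 2 holds 4..7, and so on.
    """
    return max(value.bit_length() - 1, 0)


def bucket_range(index):
    """Return the inclusive (low, high) range covered by a bucket index."""
    low = 0 if index == 0 else 1 << index
    high = (1 << (index + 1)) - 1
    return low, high


def print_hist(samples, unit="usecs", width=40):
    """Print a histogram in a layout similar to the output above."""
    counts = Counter(log2_bucket(s) for s in samples)
    if not counts:
        return
    peak = max(counts.values())
    print("     %-20s: count     distribution" % unit)
    for index in range(max(counts) + 1):
        low, high = bucket_range(index)
        # Scale each bar relative to the fullest bucket.
        stars = "*" * (counts[index] * width // peak)
        print("%10d -> %-10d : %-8d |%-*s|" % (low, high, counts[index], width, stars))


if __name__ == "__main__":
    print_hist([3, 5, 9, 17, 20, 25, 40, 70])
```

Each sample lands in the bucket whose range contains it, so a 20 usec on-CPU interval is counted in the `16 -> 31` row.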

With this, we can close #499.

I'm not sure what the relationship between #500 and #499 should be. @brendangregg I think you wanted these as separate tools for discoverability -- but the implementation can be 99% shared. Basically we'd have to reverse the histogram update location: update the histogram when the task gets on the CPU, instead of when it's getting off. Do you think it's worth the effort of duplicating the tool itself, its man page, and its examples file? Or can we maybe turn it into a switch, e.g. -O, which would reverse the meaning to off-CPU time instead of on-CPU time?
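To make the "reverse the update location" idea concrete, here is a rough userspace sketch. It assumes a simplified stream of `(timestamp_us, prev_pid, next_pid)` switch events; that tuple format and the `account` function are hypothetical and only stand in for what the sched_switch probe sees.

```python
def account(events, offcpu=False):
    """Return (pid, duration) pairs from (timestamp_us, prev_pid, next_pid) events.

    With offcpu=False, a duration is recorded when a task is switched out
    (time since it was switched in). With offcpu=True, it is recorded when a
    task is switched in (time since it was switched out).
    """
    start = {}      # pid -> timestamp at which the measured interval began
    durations = []  # (pid, duration) pairs, in the order intervals ended
    for ts, prev_pid, next_pid in events:
        if offcpu:
            begins, ends = prev_pid, next_pid
        else:
            begins, ends = next_pid, prev_pid
        if ends in start:
            durations.append((ends, ts - start.pop(ends)))
        start[begins] = ts
    return durations


if __name__ == "__main__":
    # Tasks 1 and 2 alternate on a single CPU.
    events = [(0, 1, 2), (30, 2, 1), (50, 1, 2)]
    print("on-CPU: ", account(events))
    print("off-CPU:", account(events, offcpu=True))
```

The only difference between the two modes is which side of the switch starts the timer and which side stops it, which is why a single switch could plausibly cover both tools.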

This is the initial implementation of the tool itself, which uses
the sched:sched_switch tracepoint to probe task switches. This is
a slightly more robust approach than using a kernel function, such
as finish_task_switch.

When native BCC support for tracepoints is introduced, this tool
can be reimplemented and reliance on the current Tracepoint module
can be removed.
@goldshtn
Collaborator Author

Oh, I forgot to add that this is the first tool (I think) that was implemented using a kernel tracepoint instead of a kprobe. I think it works just fine. And of course, when we get proper tracepoint support in BCC, we could update this tool to require Linux 4.7 and use BPF tracepoint support instead of the hack we have right now.

@4ast
Member

4ast commented Jun 28, 2016

lgtm, though it would be great if you can add proper tracepoint support sooner rather than later :)

@goldshtn
Collaborator Author

@4ast I'd be delighted to, but I will need a walkthrough ... As far as I can tell, there is a new BPF program type for tracepoints now. What would be required on the bcc side to use that?

@4ast
Member

4ast commented Jun 28, 2016

on bcc side just needs to support new prog_type. fancy clang bits can stay as-is for now, though all auto probe_read insertion will be failing, since there is no 'struct pt_regs *ctx' anymore. that can be done later.

@goldshtn
Collaborator Author

Sure, but where do I get the tracepoint struct from? Will the user provide a probe function that takes void* and casts it? Should the clang side generate the struct?

@4ast
Member

4ast commented Jun 28, 2016

eventually yes, clang rewriter can and probably should (to avoid mistakes) generate appropriate input struct, but for now manual struct is fine. Just as samples/bpf/offwaketime_kern.c does:

struct sched_switch_args {
        unsigned long long pad;
        char prev_comm[16];
        int prev_pid;
        int prev_prio;
        long long prev_state;
        char next_comm[16];
        int next_pid;
        int next_prio;
};
SEC("tracepoint/sched/sched_switch")
int oncpu(struct sched_switch_args *ctx)
{
        /* record previous thread sleep time */
        u32 pid = ctx->prev_pid;

no type casting necessary.
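As a side note on why a hand-written struct is error-prone, the layout can be inspected from userspace. The sketch below mirrors the struct quoted above with ctypes and prints each field's offset; the class name is made up, and the offsets assume a typical 64-bit Linux ABI. It does not read anything from the kernel.

```python
import ctypes


class SchedSwitchArgs(ctypes.Structure):
    """Userspace mirror of the struct quoted above, for inspecting its layout."""
    _fields_ = [
        # Assumption: the 8-byte pad stands in for the tracepoint's common
        # header fields, which the probe does not read.
        ("pad", ctypes.c_ulonglong),
        ("prev_comm", ctypes.c_char * 16),
        ("prev_pid", ctypes.c_int),
        ("prev_prio", ctypes.c_int),
        ("prev_state", ctypes.c_longlong),
        ("next_comm", ctypes.c_char * 16),
        ("next_pid", ctypes.c_int),
        ("next_prio", ctypes.c_int),
    ]


if __name__ == "__main__":
    for name, _ in SchedSwitchArgs._fields_:
        field = getattr(SchedSwitchArgs, name)
        print("%-10s offset=%2d size=%2d" % (name, field.offset, field.size))
    print("total size:", ctypes.sizeof(SchedSwitchArgs))
```

If a field is declared with the wrong width or in the wrong order, every later offset shifts, which is the kind of mistake a generated struct would avoid.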

@brendangregg
Member

Thanks Sasha, looks great! Good usage: I like the choice and use of -p, -P, -L, etc.

Yes, it's pretty close to offcpudist #500, so maybe that can be added as an option (maybe "-o"; I'd usually reserve that for output files, but these tools generally just do STDOUT). For discoverability, we could later on add an offcpudist symlink.

I tried coming up with a tracepoint example (#567), but it was hard going to process arguments without the new 4.7 stuff. This tool, cpudist, looks much better because you aren't reading args. So I think I'd include this tool as is, and later on move it to old and update it for 4.7. But I'll still resist adding many more tracepoint scripts until the 4.7 stuff.

LGTM.

@goldshtn
Collaborator Author

@brendangregg The usage is all yours -- it's pretty much a verbatim copy from runqlat :-)

Do you prefer -o or -O? I thought the capital O would be more appropriate, no mixup with -o for output, and capitalized because it's a significant behavior change. But it's very subjective obviously.

I'll try to add bcc tracepoint support in the next few days. Need to set up latest 4.7, had some problems building bcc on Fedora with 4.7rc5 lately.

@brendangregg
Member

We can go with -O. I guess at some point we might have a tool that uses -o. (eg, imagine a tool that traced network packets and wrote a tcpdump file!).
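A possible shape for such a switch, sketched with argparse. The long option names (`--offcpu`, `--pid`) and the meaning of `-p` as a PID filter are assumptions for illustration, loosely following runqlat; they are not taken from this pull request.

```python
import argparse


def build_parser():
    """Build a hypothetical option parser with an -O switch."""
    parser = argparse.ArgumentParser(
        description="Summarize on-CPU (or off-CPU) time per task as a histogram")
    # Capital O, so that a lowercase -o stays free for a future output-file option.
    parser.add_argument("-O", "--offcpu", action="store_true",
                        help="measure off-CPU time instead of on-CPU time")
    # Assumed meaning: restrict tracing to a single process ID.
    parser.add_argument("-p", "--pid", type=int,
                        help="trace this PID only")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("mode:", "off-CPU" if args.offcpu else "on-CPU", "pid:", args.pid)
```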

I think -O, and proper tracepoints, can be added later. This works and is useful on older kernels.

Another later addition could be to add execnames to the -P output, which could be read from /proc during output (man page can say these are best effort, and may be inaccurate).
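A best-effort lookup along these lines might look like the sketch below. The helper names and the "[unknown]" placeholder are hypothetical; the point is only that the name is read at output time and may be missing or stale if the process has exited or its PID has been reused.

```python
import os


def comm_for_pid(pid, proc_root="/proc"):
    """Best-effort lookup of a process name; returns "[unknown]" on any failure."""
    path = os.path.join(proc_root, str(pid), "comm")
    try:
        with open(path, errors="replace") as f:
            return f.read().strip() or "[unknown]"
    except OSError:
        # The process may have exited between tracing and printing.
        return "[unknown]"


def label(pid):
    """Format a per-process histogram header such as 'pid = 1234 (bash)'."""
    return "pid = %d (%s)" % (pid, comm_for_pid(pid))


if __name__ == "__main__":
    print(label(os.getpid()))
```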

@brendangregg
Member

I think this is ok to merge, as there'll be follow-ons. // @4ast

@4ast 4ast merged commit 40975ab into iovisor:master Jun 29, 2016
@goldshtn goldshtn mentioned this pull request Jun 29, 2016
3 tasks
abirchall pushed a commit to abirchall/bcc that referenced this pull request Jul 11, 2016
* cpudist: summarize on-CPU time per task as a histogram

* cpudist: add man page and examples
Successfully merging this pull request may close these issues.

tool request: cpudist