Simple futex contention bcc script #1268
This is the starting point for a more comprehensive lock analysis tool, a la Solaris' lockstat. This initial version only traces the futex system call and measures the time blocked, or spent acquiring futex-related kernel spinlocks, due to user-space lock contention. Results are reported per process and per lock. Several shortcomings of this approach:
* Only captures contention in locking constructs that use futexes. Applications and libraries might only use futexes as a last resort.
* Does not distinguish between mutexes, condition variables, semaphores, etc.
* Does not capture user-space contention or the performance impact of cacheline bouncing.
* Etc. etc. You get the idea.
tools/lockstat.py:

    if (cmd != FUTEX_WAIT)
        return 0;

    u32 pid = bpf_get_current_pid_tgid();
For kprobes, passing a pid to the kernel does not guarantee that the BPF program only gets called for that pid. kprobes are tricky: sometimes you do not have a user process associated with them at all. The same can be true for kernel tracepoints.
Therefore, if the user specifies a pid, the best practice is to place a check at the beginning of the function:

    if (pid != PREDEFINED_PID)
        return 0;
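In bcc tools this check is typically injected into the BPF C source text from the Python side before compilation. A minimal sketch of that substitution pattern, assuming a hypothetical placeholder name `FILTER_PID` (not code from this PR; the `BPF()` load step is commented out so the substitution itself runs anywhere):

```python
# Sketch: inject a pid filter into BPF C source before loading it with bcc.
# FILTER_PID is a hypothetical placeholder; the probe body is illustrative.

bpf_text = """
int trace_futex_entry(struct pt_regs *ctx, u32 *uaddr, int op) {
    u32 pid = bpf_get_current_pid_tgid();
    FILTER_PID
    // ... record entry timestamp keyed by pid ...
    return 0;
}
"""

def apply_pid_filter(text, pid=None):
    """Replace the placeholder with an early-return check, or drop it."""
    if pid is not None:
        check = "if (pid != %d) { return 0; }" % pid
    else:
        check = ""
    return text.replace("FILTER_PID", check)

prog = apply_pid_filter(bpf_text, pid=1234)
# from bcc import BPF
# b = BPF(text=prog)   # would compile and load the filtered program
```

The key point is that the filter runs inside the BPF program itself, so events from other pids (or from contexts with no user process) are rejected at the probe rather than relied upon to be pre-filtered by the kernel.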
OK, curious... I can understand why you wouldn't have a pid at all probe points (e.g. hardware interrupt, network RX), but what's the reason it's sometimes unavailable in this case? It's a system call, so it has to involve a process?
I think the kernel probably just wants to be uniform between kprobes and regular kernel tracepoints.
Syscalls (the sys_enter_* and sys_exit_* tracepoints) are processed specially in the kernel. I am not sure whether they honor the pid or not. I am working on a patch to enable sys_enter/exit_* BPF tracepoints and will comment later once I find out.
I think these are two different topics we are discussing here:
- For filtering on PID (more context: RFC: Remove cpu / pid parameters from probe interfaces #1278), we'll need to do so from the BPF program, i.e., as @yonghong-song mentioned:

      if (pid != PREDEFINED_PID)
          return 0;

- For availability of PID and COMM in kprobes: in my experience (please correct me if I'm wrong), I always get a valid PID (which might be 0) and COMM name (which might be kswapper/kworker), even for kernel functions possibly not running from a user process.
Right, makes sense. I obviously hadn't tested that case - there's no code in there to deal with the -p option :/
I will push an updated version.
Yonghong: run_command_get_pid is used in the latest commit :) I've now tested the different options a fair amount.
Note to reviewers: I posted this mainly to get both high-level and detailed feedback (this is my first attempt at using bcc / BPF).
Sorry, I may have missed the usage of run_command_get_pid... The code looks good to me. All tools in the tools directory have a corresponding example.txt to show how the tool should be used. You mentioned that there will be additional patches down the road to make the tool more useful. I suggest that at some point, once you feel the tool is ready for real-world use, you add the example file.
The call was in the first commit, and I added the function in the third commit, so that's why you didn't see it :)
Yes, we need a lockstat tool. In general, please try to follow https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md . If you think it makes a good programming example of bcc/BPF so far, then I'd put it under /examples/tracing. If you feel it's useful and safe for people to run, as root, on production Netflix/Facebook instances (Netflix installs all of these by default on all our instances), then it can be in /tools, but it also needs an _examples.txt and man page. And the man page should pull no punches regarding overhead: have an "OVERHEAD" section that gives the end user an expectation before they run it. Also list caveats in the man page (like you did above). If it's just unfinished and in development, it may not need to be in bcc yet. You can file an issue as an RFC and link to where it is, to get our feedback that way. (I've put my in-development stuff here https://github.com/brendangregg/BPF-tools , which reminds me I should upload what I'm currently working on...)
To answer why lockstat hasn't been done yet: no particular reason. I've been writing/adding tools as I hit issues.
Brendan, thanks - CONTRIBUTING-SCRIPTS.md is very useful.
[buildbot, ok to test]
@gdankel Are you planning to add more commits?
@goldshtn Yes, planning on more commits. It's down my list of priorities atm. I will post a new patch when I have something solid and slightly more interesting.
This is quite interesting. In Android we use futexes for condition variables quite a bit. It then becomes quite important that something waiting on a condition isn't falsely considered waiting on a lock. I see you mentioned this in the caveats. One (non-generic) way is to look for signatures in the stack to identify the usage. Does the equivalent Solaris tool address this issue?
@joelagnel Yes, filtering by signatures in the stack is the way I do it now. I am thinking of progressing this in the following way:
@gdankel Cool, looking forward to seeing it and also contributing / applying it to Android kernels.
@joelagnel Yes, unfortunately DataLA is Intel-specific; I'm not aware of a similar feature for ARM, but I'm not too familiar with ARM PMUs. Something similar can probably be done on ARM, but without the per-lock breakdown. DataLA gives you the address of the locks when sampling, so it gives you an idea of which particular locks contribute most to synchronization costs. But in many cases, the call stack is sufficient. And yes, sampling is done using perf_event_open, using BPF to read the values. There is some advantage to doing it all in the BPF program, but it can also be done separately and combined in post-processing.
@gdankel Sounds great. For the address of locks, would uaddr and pid suffice? In any case, if you have some early code for checking common stack signatures in order to identify locking scenarios, I would love to see it. A tool like this is high on my priority list, so I will start hacking on it soon enough. I could also just start using this PR as a starting point.
@joelagnel Sure, uaddr and pid are fine. uaddr is recorded from sys_futex tracing but is then also combined with the DataLA samples for a more complete picture. I think we're talking about the same thing for stack signatures. I also use them to filter out condition variables etc. But the code I have for that is pretty rudimentary right now, not much more than looking for "lock" and a couple of other things in the first five or so user frames. It should be customizable via an argument or config file. I'll try to drop some code next week.
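The rudimentary signature matching described above can be sketched as a small helper that scans the top user-space frames for substrings such as "lock". Everything below (the signature lists, frame depth, and function name) is a hypothetical illustration under stated assumptions, not code from this PR:

```python
# Sketch: classify a futex wait as lock contention by scanning the top
# user-space stack frames for known symbol signatures.
# The signature lists and frame depth are illustrative assumptions.

LOCK_SIGNATURES = ("lock", "mutex")   # substrings that suggest a lock path
SKIP_SIGNATURES = ("cond", "sem")     # condition variables, semaphores, ...
FRAMES_TO_CHECK = 5                   # "first five or so user frames"

def is_lock_contention(frames):
    """Return True if the top frames look like a lock acquisition path."""
    for sym in frames[:FRAMES_TO_CHECK]:
        name = sym.lower()
        if any(s in name for s in SKIP_SIGNATURES):
            return False              # waiting on a condvar/semaphore, not a lock
        if any(s in name for s in LOCK_SIGNATURES):
            return True
    return False
```

For example, a stack topped by `__lll_lock_wait` / `pthread_mutex_lock` would match, while one topped by `pthread_cond_wait` would be filtered out; making the lists configurable via an argument or config file, as suggested above, is the natural next step.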
Closing this - superseded by new pull request #1697 |
The current lockstat.py traces three kernel functions, mutex_lock_enter(), mutex_unlock_enter() and mutex_lock_return(), for kernel locking statistics. There are other efforts to get user lock stats by tracing e.g. pthread locking primitives. For example, Sasha Goldshtein's linux-tracing-workshop (https://github.com/goldshtn/linux-tracing-workshop) is referenced in bcc/docs/tutorial_bcc_python_developer.md; it has a tool called lockstat.py which traces pthread_mutex_init to collect some lock statistics for userspace locks. In bcc, in the past, we also had an effort to gather userspace lock statistics under the same name, lockstat.py (#1268). In the future, bcc could have a lockstat tool tracing userspace locks, so let us rename the current lockstat.py to klockstat.py to clearly express its scope.