Simple futex contention bcc script #1268
This is the starting point for a more comprehensive lock analysis tool, a la Solaris' lockstat. This initial version only traces the futex system call and measures the time blocked, or spent acquiring futex-related kernel spinlocks, due to user-space lock contention. Results are reported per process and per lock. Several shortcomings of this approach:
* Only captures contention in locking constructs that use futexes. Applications and libraries might only use futexes as a last resort.
* Does not distinguish between mutexes, condition variables, semaphores, etc.
* Does not capture user-space contention or the performance impact of cacheline bouncing.
* Etc. etc. You get the idea.
tools/lockstat.py:

    if (cmd != FUTEX_WAIT)
        return 0;

    u32 pid = bpf_get_current_pid_tgid();
For kprobes, passing a pid to the kernel does not guarantee that the BPF program only gets called for that pid. kprobes are tricky: sometimes you do not have a user process associated with them at all. The same can be true for kernel tracepoints.
Therefore, if the user specifies a pid, the best practice is to place a check at the beginning of the function:

    if (pid != PREDEFINED_PID)
        return 0;
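In bcc tools this check is typically injected into the BPF C source text from the Python side before compilation. A minimal sketch of that substitution pattern, assuming a hypothetical placeholder name `FILTER_PID` (not code from this PR; the `BPF()` load step is commented out so the substitution itself runs anywhere):

```python
# Sketch: inject a pid filter into BPF C source before loading it with bcc.
# FILTER_PID is a hypothetical placeholder; the probe body is illustrative.

bpf_text = """
int trace_futex_entry(struct pt_regs *ctx, u32 *uaddr, int op) {
    u32 pid = bpf_get_current_pid_tgid();
    FILTER_PID
    // ... record entry timestamp keyed by pid ...
    return 0;
}
"""

def apply_pid_filter(text, pid=None):
    """Replace the placeholder with an early-return check, or drop it."""
    if pid is not None:
        check = "if (pid != %d) { return 0; }" % pid
    else:
        check = ""
    return text.replace("FILTER_PID", check)

prog = apply_pid_filter(bpf_text, pid=1234)
# from bcc import BPF
# b = BPF(text=prog)   # would compile and load the filtered program
```

The key point is that the filter runs inside the BPF program itself, so events from other pids (or from contexts with no user process) are rejected at the probe rather than relied upon to be pre-filtered by the kernel.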
OK, curious... I can understand why you wouldn't have a pid at all probe points (e.g. hardware interrupt, network RX), but what's the reason it's sometimes unavailable in this case? It's a system call, so it has to involve a process?
I think the kernel probably just wants to be uniform between kprobes and regular kernel tracepoints.
Syscalls (the sys_enter_* and sys_exit_* tracepoints) are processed specially in the kernel. I am not sure whether they honor the pid or not. I am working on a patch to enable sys_enter/exit_* BPF tracepoints and will comment later once I find out.
I think these are two different topics we are discussing here:
- For filtering on PID (more context: RFC: Remove cpu / pid parameters from probe interfaces #1278), we'll need to do so from the BPF program, i.e., as @yonghong-song mentioned:

      if (pid != PREDEFINED_PID)
          return 0;

- For availability of PID and COMM in kprobes: in my experience (please correct me if I'm wrong), I always get a valid PID (which might be 0) and COMM name (which might be kswapper/kworker), even for kernel functions possibly not running from a user process.
Right, makes sense. I obviously hadn't tested that case - there's no code in there to deal with the -p option :/
I will push an updated version.
Yonghong: run_command_get_pid is used in the latest commit :) I've now tested the different options a fair amount.
Note to reviewers: I posted this mainly to get both high-level and detailed feedback (this is my first attempt at using bcc / BPF).
Sorry, I may have missed the usage of run_command_get_pid... The code looks good to me. All tools in the tools directory have a corresponding example.txt to show how the tool should be used. You mentioned that there will be additional patches down the road to make the tool more useful. I suggest that at some point, once you feel the tool is ready for real-world use, you add the example file.
The call was in the first commit, and I added the function in the third commit, so that's why you didn't see it :)
Yes, we need a lockstat tool. In general, please try to follow https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md . If you think it makes a good programming example of bcc/BPF so far, then I'd put it under /examples/tracing. If you feel it's useful and safe for people to run, as root, on production Netflix/Facebook instances (Netflix installs all of these by default on all our instances), then it can be in /tools, but it also needs an _examples.txt and man page. And the man page should pull no punches regarding overhead: have an "OVERHEAD" section that gives the end user an expectation before they run it. Also list caveats in the man page (like you did above). If it's just unfinished and in development, it may not need to be in bcc yet. You can file an issue as an RFC and link to where it is, to get our feedback that way. (I've put my in-development stuff here https://github.com/brendangregg/BPF-tools , which reminds me I should upload what I'm currently working on...)
To answer why lockstat hasn't been done yet: no particular reason. I've been writing/adding tools as I hit issues.
Brendan, thanks - CONTRIBUTING-SCRIPTS.md is very useful.
[buildbot, ok to test]
@gdankel Are you planning to add more commits?
@goldshtn Yes, planning on more commits. It's down my list of priorities atm. I will post a new patch when I have something solid and slightly more interesting.
This is quite interesting. In Android we use futexes for condition variables quite a bit. It then becomes quite important that something waiting on a condition isn't falsely considered waiting on a lock. I see you mentioned this in the caveats. One (non-generic) way is to look for signatures in the stack to identify the usage. Does the equivalent Solaris tool address this issue?
@joelagnel Yes, filtering by signatures in the stack is the way I do it now. I am thinking of progressing this in the following way:
@gdankel Cool, looking forward to seeing it and also contributing / applying it to Android kernels.
@joelagnel Yes, unfortunately DataLA is Intel-specific; I'm not aware of a similar feature for ARM, but I'm not too familiar with ARM PMUs. Something similar can probably be done on ARM, but without the per-lock breakdown. DataLA gives you the address of the locks when sampling, so it gives you an idea of which particular locks contribute most to synchronization costs. But in many cases, the call stack is sufficient. And yes, sampling is done using perf_event_open, using BPF to read the values. There is some advantage to doing it all in the BPF program, but it can also be done separately and combined in post-processing.
@gdankel Sounds great. For the address of locks, would uaddr and pid suffice? In any case, if you have some early code for checking common stack signatures in order to identify locking scenarios, I would love to see it. A tool like this is high on my priority list, so I will start hacking on it soon enough. I could also just start using this PR as a starting point.
@joelagnel Sure, uaddr and pid are fine. uaddr is recorded from sys_futex tracing but is then also combined with the DataLA samples for a more complete picture. I think we're talking about the same thing for stack signatures. I also use them to filter out condition variables etc. But the code I have for that is pretty rudimentary right now, not much more than looking for "lock" and a couple of other things in the first five or so user frames. It should be customizable via an argument or config file. I'll try to drop some code next week.
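The rudimentary signature matching described above can be sketched as a small helper that scans the top user-space frames for substrings such as "lock". Everything below (the signature lists, frame depth, and function name) is a hypothetical illustration under stated assumptions, not code from this PR:

```python
# Sketch: classify a futex wait as lock contention by scanning the top
# user-space stack frames for known symbol signatures.
# The signature lists and frame depth are illustrative assumptions.

LOCK_SIGNATURES = ("lock", "mutex")   # substrings that suggest a lock path
SKIP_SIGNATURES = ("cond", "sem")     # condition variables, semaphores, ...
FRAMES_TO_CHECK = 5                   # "first five or so user frames"

def is_lock_contention(frames):
    """Return True if the top frames look like a lock acquisition path."""
    for sym in frames[:FRAMES_TO_CHECK]:
        name = sym.lower()
        if any(s in name for s in SKIP_SIGNATURES):
            return False              # waiting on a condvar/semaphore, not a lock
        if any(s in name for s in LOCK_SIGNATURES):
            return True
    return False
```

For example, a stack topped by `__lll_lock_wait` / `pthread_mutex_lock` would match, while one topped by `pthread_cond_wait` would be filtered out; making the lists configurable via an argument or config file, as suggested above, is the natural next step.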
Closing this - superseded by new pull request #1697 |
The current lockstat.py traces three kernel functions, mutex_lock_enter(), mutex_unlock_enter() and mutex_lock_return(), for kernel locking statistics. There are other efforts to get user lock stats by tracing e.g. pthread locking primitives. For example, Sasha Goldshtein's linux-tracing-workshop (https://github.com/goldshtn/linux-tracing-workshop) is referenced in bcc/docs/tutorial_bcc_python_developer.md; it has a tool called lockstat.py which traces pthread_mutex_init to collect some lock statistics for userspace locks. In bcc, in the past, we also had an effort to gather userspace lock statistics under the same name, lockstat.py (#1268). In the future, bcc could have a lockstat tool tracing userspace locks, so let us rename the current lockstat.py to klockstat.py to clearly express its scope.