Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple futex contention bcc script #1268

Closed
wants to merge 3 commits into from
Closed

Conversation

gdankel
Copy link

@gdankel gdankel commented Jul 28, 2017

This is the starting point for a more comprehensive lock analysis
tool ala Solaris' lockstat.

This initial version only traces the futex system call and measures
the time blocked or acquiring futex related kernel spinlocks due
to user-space lock contention.

Report results per process and per lock.

Several shortcomings of this approach:

  • Only captures contention in locking constructs using futexes.
    Applications and libraries might only use futexes as a last resort.
  • Does not distinguish between mutexes, conditional variables and
    semaphores etc.
  • Does not capture user-space contention or performance impact due
    to cacheline bouncing.
  • Etc etc. You get the idea.

This is the starting point for a more comprehensive lock analysis
tool ala Solaris' lockstat.

This initial version only traces the futex system call and measures
the time blocked or acquiring futex related kernel spinlocks due
to user-space lock contention.

Report results per process and per lock.

Several shortcomings of this approach:
* Only captures contention in locking constructs using futexes.
  Applications and libraries might only use futexes as a last resort.
* Does not distinguish between mutexes, conditional variables and
  semaphores etc.
* Does not capture user-space contention or performance impact due
  to cacheline bouncing.
* Etc etc. You get the idea.
Copy link
Collaborator

@yonghong-song yonghong-song left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (cmd != FUTEX_WAIT)
return 0;

u32 pid = bpf_get_current_pid_tgid();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For kprobe, passing pid to kernel does not guarantee that the bpf program only gets called with that pid. This is because kprobe is tricky, sometimes you do not have a user process associated at all. The same could be true for kernel tracepoints.

Therefore, if user specifies a pid, the best practice is to place a check in the beginning of function
if (pid != PREDEFINED_PID)
return 0;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK curious... I can understand why you wouldn't have a pid at all points (e.g. h/w interrupt, network rx), but what's the reason it's not available sometimes in this case? It's a system call, so has to involve a process?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think kernel probably just want to be uniform for kprobes and regular kernel tracepoints.

syscalls (sys_enter_* and sys_exit_* tracepoints) are special processed in kernel. Not sure whether it honors pid or not. I am working on a patch to enable sys_enter/exit_* bpf tracepoints and will comment later once I found it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are two different topics we are discussing here:

if (pid != PREDEFINED_PID)
    return 0;
  • For availability of PID and COMM in kprobes, from my experience (please correct me if I'm wrong) I would always get valid PID number (might be 0) and COMM name (might be kswapper / kworker) even for those Kernel function possibly running not from a user Process.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense. I obviously hadn't tested that case - there's no code in there to deal with -p option :/
I will push an updated version.

@gdankel
Copy link
Author

gdankel commented Aug 3, 2017

Yonghong: run_command_get_pid is used in the latest commit :) I've now tested the different options a fair amount.

@gdankel
Copy link
Author

gdankel commented Aug 3, 2017

Note to reviewers: I posted this mainly to get both high-level feedback and detailed feedback (this is my first attempt at using bcc / bpf).
I have a good idea about which direction to take this tool, and this is only the first small warm-up step to get used to the process.
If anyone has comments / suggestions / ideas, now is a good time to chime in.
One obvious question is, why has such a tool not yet been developed? Maybe there's not much perceived value on top of what already exists?

@yonghong-song
Copy link
Collaborator

yonghong-song commented Aug 3, 2017

Sorry, I may miss the usage of run_command_get_pid....

The code looks good to me. All tools in tools directories have corresponding example.txt to show how the tool should be used. You mentioned that there will be additional patches down the road to make the tool more useful. I guess that at some point, once you feel the tool is ready for real use case run, you should add the example file.

@gdankel
Copy link
Author

gdankel commented Aug 3, 2017

The call was in the first commit, I added the function in the third commit, so that's why you didn't see it :)

@brendangregg
Copy link
Member

Yes, we need a lockstat tool. In general, please try to follow https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRIPTS.md .

If you think so far it makes a good programming example of bcc/BPF, then I'd put it under /examples/tracing.

If you feel it's useful and safe for people to run, as root, on production Netflix/Facebook instances (Netflix installs all of these by default on all our instances), then it can be in /tools, but it also needs an _examples.txt and man page. And the man page should pull no punches regarding overhead: have an "OVERHEAD" section that gives the end user an expectation before they run it. Also list caveats in the man page (ilke you did above).

If it's just unfinished and in development, it may not need to be in bcc yet. You can file an issue as a RFC and link to where it is, to get our feedback that way. (I've put my in-development stuff here https://github.com/brendangregg/BPF-tools , which reminds me I should upload what I'm currently working on...)

@brendangregg
Copy link
Member

To answer why lockstat hasn't been done yet. No particular reason. I've been writing/adding tools as I hit issues.

@gdankel
Copy link
Author

gdankel commented Aug 4, 2017

Brendan, thanks - CONTRIBUTING-SCRIPTS.md is very useful.
The aim is to make this a production tool, but only after a couple of more commits to flesh it out.
I might follow your in-development approach...

@goldshtn
Copy link
Collaborator

[buildbot, ok to test]

@goldshtn
Copy link
Collaborator

@gdankel Are you planning to add more commits?

@gdankel
Copy link
Author

gdankel commented Nov 21, 2017

@goldshtn Yes, planning on more commits. Down my list of priorities atm. I will post a new patch when I have something solid and slightly more interesting.

@joelagnel
Copy link
Contributor

This is quite interesting. In Android we use futexes for conditional vars quite a bit. It then becomes quite important that something waiting isn't falsely considered waiting on a lock. I see you mentioned this in the caveats. One (non-generic) way is to look for signatures in the stack to find usage. Does the equivalent Solaris tool address this issue?

@gdankel
Copy link
Author

gdankel commented Feb 28, 2018

@joelagnel Yes filtering by signatures in the stack is the way I do it now. I am thinking of progressing this in the following way:

  1. Create a bcc script that does minimal user-space processing and just dumps one or more output files with varying level of detail (depending on what is requested). This output will contain a bunch of stats per lock and per stack (including requested number of stack frame symbols). I will also likely separate out the BPF C code into its own file so it can be used by non-bcc code.

  2. Create a couple of tools to analyze the data. One will be similar to lockstat. Another might generate output for other visualization tools. These will have many more options to allow processing and viewing the data in different ways.

  3. Add user-space profiling capabilities. Tracing user-level locking functions is painful but I've been experimenting with sampling locked loads using the DataLA facility to identify locked loads and then also sampling cycles (precise mode) to assign costs to these.

@joelagnel
Copy link
Contributor

@gdankel Cool, looking forward to seeing it and also contributing / applying to Android kernels. DataLA I think is Intel specific. Can you not just use perf_events for sampling? Perhaps I don't understand point 3. enough without more background on the profiling you're doing.

@gdankel
Copy link
Author

gdankel commented Feb 28, 2018

@joelagnel Yes unfortunately DataLA is Intel specific, I'm not aware of a similar feature for ARM, but I'm not too familiar with ARM PMUs. Something similar can probably be done on ARM, but without per-lock breakdown. DataLA gives you the the address of the locks when sampling, so it gives you an idea of which particular locks contribute most to synchronization costs. But in many cases, the call stack is sufficient.

And yes, sampling is done using perf_event_open, and using BPF to read the values. There is some advantage to doing it all in the BPF program, but it can also be done separately and combined in post-processing.

@joelagnel
Copy link
Contributor

joelagnel commented Mar 1, 2018

@gdankel Sounds great. For address of locks, can the uaddr and PID combination not suffice to identify a lock? I guess it may not for inter-process futexes but I'm not sure how common that is in practice.
I was talking about stack signatures more for the purpose of identifying how a futex is used (cond.var vs lock, etc).

In any case, if you have some early code for checking common stack signatures in-order to identify locking scenarios, I would love to see it. A tool like this is high on my priority list, so I will start hacking on it soon enough. I could also just start using this PR as a starting point.
thanks!

@gdankel
Copy link
Author

gdankel commented Mar 2, 2018

@joelagnel sure, uaddr and pid is fine. Uaddr is recorded from sys_futex tracing but then also combined with the DataLA samples for a more complete picture.

I think we're talking about the same thing for stack signatures. I also use them to filter out conditional variables etc. But the code I have for that is pretty rudimentary right now, not much more than looking for "lock" and a couple of other things in the first five or so user frames. It should be customizable via an argument or config file.

I'll try to drop some code next week.

@gdankel
Copy link
Author

gdankel commented Apr 21, 2018

Closing this - superseded by new pull request #1697

@gdankel gdankel closed this Apr 21, 2018
yonghong-song added a commit that referenced this pull request Dec 5, 2019
The current lockstat.py is tracing three kernel functions
  mutex_lock_enter(), mutex_unlock_enter(), mutex_lock_return()
for kernel locking statistics.

There are some other efforts trying to get user lock stats by
tracing e.g. pthread locking primitives. For example,
Sasha Goldshtein's linux-tracing-workshop
   https://github.com/goldshtn/linux-tracing-workshop
is referenced in bcc/docs/tutorial_bcc_python_developer.md.
It has a tool called lockstat.py which traces pthread_mutex_init
to collect some lock statistics for userspace locks.

In bcc, in the past, we also had an effort to gather userspace
lock statistics with the same name lockstat.py.
   #1268

In the future, bcc could have a lockstat tool tracing userspace
locks. So let us rename the current lockstat.py to klockstat.py
to clearly express its scope.
yonghong-song added a commit that referenced this pull request Dec 5, 2019
The current lockstat.py is tracing three kernel functions
  mutex_lock_enter(), mutex_unlock_enter(), mutex_lock_return()
for kernel locking statistics.

There are some other efforts trying to get user lock stats by
tracing e.g. pthread locking primitives. For example,
Sasha Goldshtein's linux-tracing-workshop
   https://github.com/goldshtn/linux-tracing-workshop
is referenced in bcc/docs/tutorial_bcc_python_developer.md.
It has a tool called lockstat.py which traces pthread_mutex_init
to collect some lock statistics for userspace locks.

In bcc, in the past, we also had an effort to gather userspace
lock statistics with the same name lockstat.py.
   #1268

In the future, bcc could have a lockstat tool tracing userspace
locks. So let us rename the current lockstat.py to klockstat.py
to clearly express its scope.
CrackerCat pushed a commit to CrackerCat/bcc that referenced this pull request Jul 31, 2024
The current lockstat.py is tracing three kernel functions
  mutex_lock_enter(), mutex_unlock_enter(), mutex_lock_return()
for kernel locking statistics.

There are some other efforts trying to get user lock stats by
tracing e.g. pthread locking primitives. For example,
Sasha Goldshtein's linux-tracing-workshop
   https://github.com/goldshtn/linux-tracing-workshop
is referenced in bcc/docs/tutorial_bcc_python_developer.md.
It has a tool called lockstat.py which traces pthread_mutex_init
to collect some lock statistics for userspace locks.

In bcc, in the past, we also had an effort to gather userspace
lock statistics with the same name lockstat.py.
   iovisor#1268

In the future, bcc could have a lockstat tool tracing userspace
locks. So let us rename the current lockstat.py to klockstat.py
to clearly express its scope.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants