
funccount: Fix on-CPU hang when attaching to SyS_* #780

Merged: 4 commits into iovisor:master, Oct 26, 2016

Conversation

goldshtn
Collaborator

This is a possible fix for the funccount issue (instead of #777). First, it makes sure we don't attach to the same kprobe more than once by fixing a non-uniqueness bug in `BPF.get_kprobe_functions`. Second, it modifies funccount to use `lookup_or_init` and `Table.clear` instead of `lookup` and `Table.zero`.

`BPF.get_kprobe_functions` does not filter duplicates, and
as a result may return the same function name more than
once if it appears in /sys/kernel/debug/tracing/available_filter_functions
more than once. Change the function's behavior to filter
out duplicates before returning, so we don't end up
attaching the same kprobe more than once.
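
A minimal sketch of the deduplication idea this commit describes, assuming a simplified `get_kprobe_functions` that reads available_filter_functions directly (the real bcc implementation also handles blacklisted functions; names and layout here are illustrative only):

    import re

    def get_kprobe_functions(event_re):
        # The same function name can appear more than once in
        # available_filter_functions (e.g. SyS_sigsuspend), so collect the
        # matches into a set to attach each kprobe at most once.
        fns = set()
        with open("/sys/kernel/debug/tracing/available_filter_functions") as f:
            for line in f:
                fn = line.split()[0]
                if re.match(event_re, fn):
                    fns.add(fn)
        return sorted(fns)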
Avoiding the prepopulation of the location cache allows us
to get rid of the `zero()` call at the end of each interval,
which would hang the program at full CPU. Replaced the
prepopulation with a `lookup_or_init` and the `zero()` call
with a call to `clear()`.
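
As a rough illustration of the flow this commit describes (variable names are assumed, not the tool's exact code): the BPF program creates counter entries on demand with `lookup_or_init()`, and the Python side empties the table with `clear()` after each print instead of zeroing a prepopulated cache:

    from time import sleep

    # b = BPF(text=...); counts = b["counts"], assumed keyed by instruction
    # pointer. Entries are created on demand by lookup_or_init() in the
    # probes, so nothing is prepopulated from user space.
    while True:
        sleep(interval)
        for k, v in sorted(counts.items(), key=lambda kv: kv[1].value):
            print("%-36s %8d" % (b.ksym(k.value), v.value))
        counts.clear()   # drop every entry instead of calling zero()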
@goldshtn
Collaborator Author

@drzaeus77 I faintly recall that you said something about an issue with recursive map updates. Does this apply here? Can you please refresh my memory?

@brendangregg
Member

Thanks @goldshtn, I'm happy to close #777 if this is better. I'm trying to remember the original issues. #415? Here's that test case with your new funccount:

# ./funccount.py '*spin*'
Tracing 17 functions for "*spin*"... Hit Ctrl-C to end.


^C
FUNC                                    COUNT
mutex_spin_on_owner.isra.1                  1
rwsem_spin_on_owner                         9
mutex_optimistic_spin                      87
__pv_queued_spin_lock_slowpath            258
_raw_spin_unlock_bh                       534
_raw_spin_lock_bh                         620
_raw_spin_trylock                        1964
queued_spin_unlock_wait                  5398
_raw_spin_lock_irq                       8959
_raw_spin_lock_irqsave                  21163
_raw_spin_unlock_irqrestore             27018
_raw_spin_lock                         160796
Detaching...

Which works. I can even add a "-i 1".

@brendangregg
Member

Does this work for you? (regarding avoiding duplicates):

# ./funccount.py 'SyS_*'
write(/sys/kernel/debug/tracing/kprobe_events, "p:kprobes/p_SyS_sigsuspend SyS_sigsuspend") failed: Device or resource busy

That one is listed in available_filter_functions twice.

The problem with recursive maps, I think, was fixed in the kernel by https://lkml.org/lkml/2016/3/8/24 "bpf: prevent kprobe+bpf deadlocks". That patch is in the latest kernel, but isn't in, say, the current Ubuntu Xenial 4.4. I'm not sure whether this tool still has the necessary workarounds (now that it's switching back to lookup_or_init()). I still haven't paged #415 back into my brain -- it feels like years ago!

I did test this version on 4.9-rc1 for hours with different workloads, and it was fine.

@goldshtn
Collaborator Author

@brendangregg Yes, I don't get that error (tested on FC23, 4.9). Did you remember to `make && sudo make install` BCC? The duplicate fix is in the BPF module.

@drzaeus77
Collaborator

I'm also fuzzy on this one, but I feel like this approach may also result in a hang, depending on how lookup_or_init() races with clear(). I think it may still be possible for lookup_or_init() to inject a new entry into a particular bucket of a map faster than user space can clear that same bucket.

Can you try the original code, but with a list() around self.trace_functions.items()? I think what is happening is that the call to zero() is moving items in the same bucket to the back, leading further calls to items() to find the same entry; in other words, userspace is mutating the list it is iterating while it zeroes each entry.

@goldshtn
Collaborator Author

@drzaeus77 But the code that does the prepopulation doesn't hang. It's the call to counts.zero() that hangs, which doesn't call .items(). Did you mean I should replace the call to counts.zero() with a loop like the prepopulation code, one that calls list(...items()) and then initializes each key to .Leaf()?

@goldshtn
Collaborator Author

@drzaeus77 Also, wait, trace_functions.items() isn't a BPF map, so how does putting list(...) around it help?

@goldshtn
Collaborator Author

@drzaeus77 But I have another idea, which is even closer to what @4ast wanted in the first place. With this version of funccount, we know the number of locations in advance. So I'll simply declare a BPF array large enough, and we won't need any dynamic lookups or clears.
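
As a hypothetical sketch of what that could look like (not necessarily what funccount ended up with; the table name, program names, and generation scheme are placeholders): since the set of traced functions is known before attaching, a BPF array can be sized to match and one tiny counting function emitted per location:

    # functions: list of function names matched before attaching
    bpf_text = "BPF_ARRAY(counts, u64, %d);\n" % len(functions)
    for i, fn in enumerate(functions):
        bpf_text += """
    int trace_%d(struct pt_regs *ctx) {
        int loc = %d;                        // fixed slot for this function
        u64 *val = counts.lookup(&loc);
        if (val) (*val)++;
        return 0;
    }
    """ % (i, i)

    b = BPF(text=bpf_text)
    for i, fn in enumerate(functions):
        b.attach_kprobe(event=fn, fn_name="trace_%d" % i)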

@drzaeus77
Collaborator

Right, probably in table.py it needs to be:

    def zero(self):
        for k in list(self.keys()):  # snapshot all keys before any update
            self[k] = self.Leaf()    # then zero each entry in place

Whether the underlying table is a map, array, or something else doesn't matter, since the kernel API exposed is map-like, and Python wraps it with an iterator interface. Using list() just collects all of the elements up front, rather than letting the loop's own updates perturb the iterator as it goes.

Honestly this particular race seems to me like it should be fixed in the kernel, but I don't think that is trivial to do, so we should work around it.

To put it in a nutshell, the way zero() is implemented requires that we deal with this racy behavior; otherwise we'll keep hitting it. That said, we can by all means fix up funccount to be more optimal if possible.

@drzaeus77
Collaborator

To complete the story, the bug stems from a particular kernel behavior:

Suppose a BPF map contains many entries, two of which happen to hash to the same bucket, that is, hash(A) == hash(B) == H1:

[...]
[ H1 ] = A -> B
[...]

Calling zero() on the map causes updates to be issued in sequence:

update(A, 0), which creates a new collision chain [ H1 ] = B -> A
update(B, 0), which creates [ H1 ] = A -> B

But in between these two updates, the item iterator tries to find which entry follows B, and finds A again, thus creating a loop. So my workaround suggestion is to collect all of the items before any update occurs, which is of course not great for large lists, but should avoid cyclic iterators.
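
A small illustration of that difference (plain sketch operating on a bcc-style table object, not bcc internals): the unsafe variant walks the live key iterator while updating, which can cycle if an update reorders a bucket's collision chain; the safe variant snapshots the keys first:

    def zero_unsafe(table):
        # get_next_key follows the live collision chain; an update that moves
        # the current element can make the walk revisit keys forever
        for k in table.keys():
            table[k] = table.Leaf()

    def zero_safe(table):
        # collect every key up front, then update; the walk can no longer be
        # perturbed by the updates it performs
        for k in list(table.keys()):
            table[k] = table.Leaf()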

Because we know the number of probes in advance before
attaching them, we can simply preinitialize a fixed-size
array instead of using a BPF map. This avoids potential
deadlocks/hangs/race conditions with the Python program
and internally in the kernel. See also iovisor#415, iovisor#665, iovisor#233
for more discussion.
To avoid a potential race with the key zeroing modifying
the next hash key retrieved by the loop in `Table.zero()`,
retrieve all the keys in user space first before starting
the zeroing loop. See discussion on iovisor#780.

Tested with `funccount 'SyS_*' -i 1` while running a heavy
read/write test application (`dd`) in the background for
several minutes with no visible issues.
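
A hedged sketch of the per-interval reset the first commit message describes, assuming `counts` is a fixed-size table with integer keys that is fully populated before attaching (names are illustrative): every slot already exists, so the reset is an in-place write of each known key and never depends on get_next_key ordering:

    from time import sleep

    # keys of a fixed-size array-style table are just the indexes 0..N-1,
    # known before any probe is attached
    keys = [counts.Key(i) for i in range(len(functions))]

    while True:
        sleep(interval)
        for i, k in enumerate(keys):
            print("%-36s %8d" % (functions[i], counts[k].value))
            counts[k] = counts.Leaf()    # reset this slot in place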
@goldshtn
Collaborator Author

Right. So I'm going to update this branch to migrate funccount to use an array, and then fix zero().

@brendangregg
Member

@goldshtn ah thanks, I'd missed a make install. The duplicate issue is now fixed.

I'll try to get a Xenial 4.4 build environment set up for testing (one that lacks that kernel patch).

@brendangregg
Member

Kernel hang. This is Ubuntu Xenial with a 4.4 kernel:

Linux version 4.4.0-31-generic (buildd@lgw01-16) (gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2.1) ) #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 (Ubuntu 4.4.0-31.50-generic 4.4.13)

My version in #777 does not induce the hang.

@goldshtn
Collaborator Author

Right. I'll be moving back to lookup with map zeroing in funccount, so that should be fixed after I push. For other generic tools, this is concerning.

@4ast
Member

4ast commented Oct 25, 2016

Calling zero() on the map causes an update to be called as follows:
update(A, 0)
Which creates a new collision chain [ H1 ] = B -> A
update(B, 0)
creates [ H1 ] = A -> B
But in between these two updates, the item iterator tries to find which entry follows B, and finds A again, thus creating a loop

Sounds plausible. If that's the case, it should be easy to reproduce: create a map with two elements hashing into the same bucket, then

    while (...) {
        get_next(key, next_key)
        update(key, 0)
        key = next_key
    }

should loop forever, right? Once there is a testcase, we can do something in the kernel. Not sure how to avoid this in the kernel without changing user space, though...
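
In bcc terms, a user-space reproduction along those lines might look like the following (a hypothetical test; the map is sized small to make bucket collisions likely, and whether it actually loops depends on the kernel's hash-map update behavior on the affected versions):

    from bcc import BPF

    b = BPF(text="BPF_HASH(t, u64, u64, 64);")   # map only, no programs attached
    t = b["t"]
    for i in range(64):                           # fill it so buckets collide
        t[t.Key(i)] = t.Leaf(i)
    for k in t.keys():                            # live get_next_key walk...
        t[k] = t.Leaf(0)                          # ...updating as we go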

@brendangregg
Member

Since this didn't hang on 4.9... @4ast, is this the patchset that fixes it in-kernel, or is this a different issue?

  bpf: prevent kprobe+bpf deadlocks
  bpf: introduce percpu_freelist
  bpf: pre-allocate hash map elements
  bpf: check for reserved flag bits in array and stack maps
  bpf: convert stackmap to pre-allocation
  samples/bpf: make map creation more verbose
  samples/bpf: move ksym_search() into library
  samples/bpf: add map_flags to bpf loader
  samples/bpf: test both pre-alloc and normal maps
  samples/bpf: add bpf map stress test
  samples/bpf: stress test bpf_get_stackid
  samples/bpf: add map performance test

Maybe Canonical can backport for Xenial users.

@drzaeus77
Collaborator

LGTM

@goldshtn
Collaborator Author

@brendangregg @drzaeus77 Pushed two additional commits: one changes funccount to use a fixed-size prepopulated map and clear it key-by-key; the other also changes Table.zero() to use list(...keys) per Brenden's suggestion.

@4ast
Member

4ast commented Oct 25, 2016

lgtm @drzaeus77

@brendangregg
Member

(root) /mnt/src/5/bcc/tools # ./funccount.py -i 1 '*spin*'
Tracing 16 functions for "*spin*"... Hit Ctrl-C to end.

FUNC                                    COUNT
mutex_optimistic_spin                       2
rwsem_spin_on_owner                         2
__pv_queued_spin_lock_slowpath             23
_raw_spin_unlock_bh                        85
_raw_spin_lock_bh                         106
_raw_spin_trylock                         356
_raw_spin_lock_irq                       2611
_raw_spin_unlock_irqrestore              6071
_raw_spin_lock_irqsave                   6580
_raw_spin_lock                          92633
__pv_queued_spin_unlock                210885

FUNC                                    COUNT
mutex_optimistic_spin                       1
__pv_queued_spin_lock_slowpath             31
_raw_spin_unlock_bh                       120
_raw_spin_lock_bh                         142
_raw_spin_trylock                         310
_raw_spin_lock_irq                       2240
_raw_spin_lock_irqsave                   4973
_raw_spin_unlock_irqrestore              5551
_raw_spin_lock                          33314
__pv_queued_spin_unlock                 40988
^C
FUNC                                    COUNT
mutex_optimistic_spin                       1
rwsem_spin_on_owner                         1
__pv_queued_spin_lock_slowpath             24
_raw_spin_unlock_bh                        81
_raw_spin_lock_bh                          98
_raw_spin_trylock                         274
_raw_spin_lock_irq                       2091
_raw_spin_lock_irqsave                   4845
_raw_spin_unlock_irqrestore              5414
_raw_spin_lock                          33281
__pv_queued_spin_unlock                 40585
Detaching...
(root) /mnt/src/5/bcc/tools # uname -a
Linux bgregg-xenial-test-i-01f808d4259a41b59 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Ok, that test worked on 4.4. I also checked 4.9.

(... really need to give funccount a "-d duration" to make testing not need a human Ctrl-C (and change the -d for debug to -D). I was hacking it in with a "timeout -s2 5" prefix.)

However:

(root) /mnt/src/5/bcc/tools # ./funccount.py -i 1 'bash:r*'
Tracing 317 functions for "bash:r*"... Hit Ctrl-C to end.

FUNC                                    COUNT
read_octal                                  1
realloc                                     1
readonly_builtin                            2
read_tty_modified                           2
redirection_expand                          2
[...]
^C
FUNC                                    COUNT
read_octal                                  1
realloc                                     1
readonly_builtin                            2
read_tty_modified                           2
redirection_expand                          2
run_exit_trap                               2
reset_parser                                2
restore_input_line_state                    2
restore_parser_state                        2
reader_loop                                 3
run_return_trap                             3
restore_pipestatus_array                    4
remember_args                               4
reset_signal_handlers                       5
reap_dead_jobs                              6
remove_quoted_escapes                      13
run_unwind_frame                           16
reset_terminating_signals                  20
restore_original_signals                   22
reset_internal_getopt                      63
run_debug_trap                            147
read_command                              183
remove_quoted_nulls                       346
run_pending_traps                         645
Detaching...

(root) /mnt/src/5/bcc/tools # ./funccount.py -i 1 'bash:r*'
perf_event_open(/sys/kernel/debug/tracing/events/uprobes/p__bin_bash_0xb2540/id): Cannot allocate memory
Failed to attach BPF to uprobe

(root) /mnt/src/5/bcc/tools # ./reset-trace.sh -v
Reseting tracing state...

Checking /sys/kernel/debug/tracing/kprobe_events
Checking /sys/kernel/debug/tracing/uprobe_events
Needed to reset /sys/kernel/debug/tracing/uprobe_events
uprobe_events, before (line enumerated):
     1  p:uprobes/p__bin_bash_0xb2540 /bin/bash:0x00000000000b2540
uprobe_events, after (line enumerated):

Checking /sys/kernel/debug/tracing/trace
Checking /sys/kernel/debug/tracing/current_tracer
Checking /sys/kernel/debug/tracing/set_ftrace_filter
Checking /sys/kernel/debug/tracing/set_graph_function
Checking /sys/kernel/debug/tracing/set_ftrace_pid
Checking /sys/kernel/debug/tracing/events/enable
Checking /sys/kernel/debug/tracing/tracing_thresh
Checking /sys/kernel/debug/tracing/tracing_on

Done.

So a Ctrl-C had left uprobes still initialized. The reset-trace.sh -v output shows how it was (that's a cat -nv).

@brendangregg
Member

No, it's actually broken, not just still-initialized uprobes:

dmesg
[  255.276131] ------------[ cut here ]------------
[  255.276139] WARNING: CPU: 0 PID: 7843 at /build/linux-dcxD3m/linux-4.4.0/mm/page_counter.c:26 page_counter_cancel+0x34/0x40()
[  255.276141] Modules linked in: binfmt_misc zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) xfs x86_pkg_temp_thermal coretemp ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd
[  255.276167] CPU: 0 PID: 7843 Comm: funccount.py Tainted: P           O    4.4.0-31-generic #50-Ubuntu
[  255.276168]  0000000000000200 00000000c146c932 ffff8801d4373a88 ffffffff813f1143
[  255.276171]  0000000000000000 ffffffff81cce100 ffff8801d4373ac0 ffffffff81081102
[  255.276173]  ffff8800ff20d8b8 000000000000001f ffff8800ff20d800 ffff8801d6410e80
[  255.276175] Call Trace:
[  255.276180]  [<ffffffff813f1143>] dump_stack+0x63/0x90
[  255.276183]  [<ffffffff81081102>] warn_slowpath_common+0x82/0xc0
[  255.276185]  [<ffffffff8108124a>] warn_slowpath_null+0x1a/0x20
[  255.276187]  [<ffffffff811f9ca4>] page_counter_cancel+0x34/0x40
[  255.276189]  [<ffffffff811f9da2>] page_counter_uncharge+0x22/0x40
[  255.276190]  [<ffffffff811fbef8>] drain_stock.isra.33+0x38/0xa0
[  255.276192]  [<ffffffff811fc990>] try_charge+0x5e0/0x640
[  255.276195]  [<ffffffff81200bdb>] mem_cgroup_try_charge+0x6b/0x1b0
[  255.276198]  [<ffffffff81187d39>] uprobe_write_opcode+0x209/0x4c0
[  255.276202]  [<ffffffff8116f7e3>] ? uprobe_perf_filter+0x43/0x50
[  255.276203]  [<ffffffff8118809a>] set_orig_insn+0x1a/0x20
[  255.276205]  [<ffffffff811884a6>] register_for_each_vma+0x406/0x430
[  255.276207]  [<ffffffff81188962>] uprobe_apply+0x72/0xc0
[  255.276209]  [<ffffffff8116ff60>] uprobe_perf_close+0x90/0xd0
[  255.276211]  [<ffffffff81170882>] trace_uprobe_register+0xd2/0x200
[  255.276214]  [<ffffffff81162eea>] perf_trace_destroy+0x2a/0x50
[  255.276216]  [<ffffffff81179ec9>] tp_perf_event_destroy+0x9/0x10
[  255.276218]  [<ffffffff81180de9>] _free_event+0xb9/0x220
[  255.276219]  [<ffffffff81181013>] put_event+0xc3/0x100
[  255.276221]  [<ffffffff81181070>] perf_release+0x10/0x20
[  255.276224]  [<ffffffff8120ef84>] __fput+0xe4/0x220
[  255.276226]  [<ffffffff8120f0fe>] ____fput+0xe/0x10
[  255.276229]  [<ffffffff8109ec1f>] task_work_run+0x7f/0xa0
[  255.276231]  [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
[  255.276233]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[  255.276237]  [<ffffffff8182dc90>] int_ret_from_sys_call+0x25/0x8f
[  255.276238] ---[ end trace 724b0e71eca6554e ]---

@drzaeus77
Collaborator

@brendangregg 4.4.0-31 is still broken w.r.t. uprobes; the memory leak is fixed at about 4.4.0-42. This is what made the 16.04 buildbot need a reboot once a week or so.

drzaeus77 merged commit 191bac6 into iovisor:master Oct 26, 2016
@brendangregg
Member

OK, I'll start digging, but if you know the patch description offhand that'd be useful; I just need to always check it's merged in the kernels we're deploying.

@drzaeus77
Collaborator

@brendangregg
Member

@drzaeus77 thanks; so it's just accounting, not an actual leak.

@drzaeus77
Collaborator

It's pretty bad. The accounting mismatch results in serious failures; in some cases I saw an oops when I tried to do a kexec after hitting it.
