
funccount: Fix on-CPU hang when attaching to SyS_* #780

Merged: 4 commits into iovisor:master, Oct 26, 2016

Conversation

goldshtn
Collaborator

This is a possible fix for the funccount issue (instead of #777). First, it makes sure we don't attach to the same kprobe more than once by fixing a non-uniqueness bug in `BPF.get_kprobe_functions`. Second, it modifies funccount to use `lookup_or_init` and `Table.clear` instead of `lookup` and `Table.zero`.

`BPF.get_kprobe_functions` does not filter duplicates, and
as a result may return the same function name more than
once if it appears in /sys/kernel/debug/tracing/available_filter_functions
more than once. Change the function's behavior to filter
out duplicates before returning, so we don't end up
attaching the same kprobe more than once.
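
A minimal sketch of the deduplication idea this commit describes, assuming a simplified `get_kprobe_functions` that reads available_filter_functions directly (the real bcc implementation also handles blacklisted functions; names and layout here are illustrative only):

    import re

    def get_kprobe_functions(event_re):
        # The same function name can appear more than once in
        # available_filter_functions (e.g. SyS_sigsuspend), so collect the
        # matches into a set to attach each kprobe at most once.
        fns = set()
        with open("/sys/kernel/debug/tracing/available_filter_functions") as f:
            for line in f:
                fn = line.split()[0]
                if re.match(event_re, fn):
                    fns.add(fn)
        return sorted(fns)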
Avoiding the prepopulation of the location cache allows us
to get rid of the `zero()` call at the end of each interval,
which would hang the program at full CPU. Replaced the
prepopulation with a `lookup_or_init` and the `zero()` call
with a call to `clear()`.
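
As a rough illustration of the flow this commit describes (variable names are assumed, not the tool's exact code): the BPF program creates counter entries on demand with `lookup_or_init()`, and the Python side empties the table with `clear()` after each print instead of zeroing a prepopulated cache:

    from time import sleep

    # b = BPF(text=...); counts = b["counts"], assumed keyed by instruction
    # pointer. Entries are created on demand by lookup_or_init() in the
    # probes, so nothing is prepopulated from user space.
    while True:
        sleep(interval)
        for k, v in sorted(counts.items(), key=lambda kv: kv[1].value):
            print("%-36s %8d" % (b.ksym(k.value), v.value))
        counts.clear()   # drop every entry instead of calling zero()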
@goldshtn
Collaborator Author

@drzaeus77 I faintly recall that you said something about an issue with recursive map updates. Does this apply here? Can you please refresh my memory?

@brendangregg
Member

Thanks @goldshtn, I'm happy to close #777 if this is better. I'm trying to remember the original issues. #415? Here's that test case with your new funccount:

# ./funccount.py '*spin*'
Tracing 17 functions for "*spin*"... Hit Ctrl-C to end.


^C
FUNC                                    COUNT
mutex_spin_on_owner.isra.1                  1
rwsem_spin_on_owner                         9
mutex_optimistic_spin                      87
__pv_queued_spin_lock_slowpath            258
_raw_spin_unlock_bh                       534
_raw_spin_lock_bh                         620
_raw_spin_trylock                        1964
queued_spin_unlock_wait                  5398
_raw_spin_lock_irq                       8959
_raw_spin_lock_irqsave                  21163
_raw_spin_unlock_irqrestore             27018
_raw_spin_lock                         160796
Detaching...

Which works. I can even add a "-i 1".

@brendangregg
Member

Does this work for you? (regarding avoiding duplicates):

# ./funccount.py 'SyS_*'
write(/sys/kernel/debug/tracing/kprobe_events, "p:kprobes/p_SyS_sigsuspend SyS_sigsuspend") failed: Device or resource busy

That one is listed in available_filter_functions twice.

The problem with recursive maps, I think, was fixed in the kernel by https://lkml.org/lkml/2016/3/8/24 "bpf: prevent kprobe+bpf deadlocks". That patch is in the latest kernel, but isn't in, say, the current Ubuntu Xenial 4.4. I'm not sure whether this tool still has the necessary workarounds (now that it's switching back to lookup_or_init()). I still haven't paged #415 back into my brain -- it feels like years ago!

I did test this version on 4.9-rc1 for hours with different workloads, and it was fine.

@goldshtn
Collaborator Author

@brendangregg Yes, I don't get that error (tested on FC23, 4.9). Did you remember to `make && sudo make install` BCC? The duplicate fix is in the BPF module.

@drzaeus77
Collaborator

I'm also fuzzy on this one, but I feel like this approach may also result in a hang, depending on how lookup_or_init() races with clear(). I think it may still be possible for lookup_or_init() to inject a new entry into a particular bucket of a map faster than user space can clear that same bucket.

Can you try the original code, but with a list() around self.trace_functions.items()? I think what is happening is that the call to zero() is moving items in the same bucket to the back, leading further calls to items() to find the same entry; in other words, userspace is mutating the list it is iterating while it zeroes each entry.

@goldshtn
Collaborator Author

@drzaeus77 But the code that does the prepopulation doesn't hang. It's the call to counts.zero() that hangs, which doesn't call .items(). Did you mean I should replace the call to counts.zero() with a loop like the prepopulation code, one that calls list(...items()) and then initializes each key to .Leaf()?

@goldshtn
Collaborator Author

@drzaeus77 Also, wait, trace_functions.items() isn't a BPF map, so how does putting list(...) around it help?

@goldshtn
Collaborator Author

@drzaeus77 But I have another idea, which is even closer to what @4ast wanted in the first place. With this version of funccount, we know the number of locations in advance. So I'll simply declare a BPF array large enough, and we won't need any dynamic lookups or clears.
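
As a hypothetical sketch of what that could look like (not necessarily what funccount ended up with; the table name, program names, and generation scheme are placeholders): since the set of traced functions is known before attaching, a BPF array can be sized to match and one tiny counting function emitted per location:

    # functions: list of function names matched before attaching
    bpf_text = "BPF_ARRAY(counts, u64, %d);\n" % len(functions)
    for i, fn in enumerate(functions):
        bpf_text += """
    int trace_%d(struct pt_regs *ctx) {
        int loc = %d;                        // fixed slot for this function
        u64 *val = counts.lookup(&loc);
        if (val) (*val)++;
        return 0;
    }
    """ % (i, i)

    b = BPF(text=bpf_text)
    for i, fn in enumerate(functions):
        b.attach_kprobe(event=fn, fn_name="trace_%d" % i)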

@drzaeus77
Collaborator

Right, probably in table.py it needs to be:

    def zero(self):
        for k in list(self.keys()):  # snapshot all keys before any update
            self[k] = self.Leaf()    # then zero each entry in place

Whether the underlying table is a map, array, or something else doesn't matter, since the kernel API exposed is map-like, and Python wraps it with an iterator interface. Using list() just collects all of the elements up front, rather than letting the loop's own updates perturb the iterator as it goes.

Honestly this particular race seems to me like it should be fixed in the kernel, but I don't think that is trivial to do, so we should work around it.

To put it in a nutshell, the way zero() is implemented requires that we deal with this racy behavior; otherwise we'll keep hitting it. That said, we can by all means fix up funccount to be more optimal if possible.

@drzaeus77
Collaborator

To complete the story, the bug stems from a particular kernel behavior:

Suppose a BPF map contains many entries, two of which happen to hash to the same bucket, that is, hash(A) == hash(B) == H1:

[...]
[ H1 ] = A -> B
[...]

Calling zero() on the map causes updates to be issued in sequence:

update(A, 0), which creates a new collision chain [ H1 ] = B -> A
update(B, 0), which creates [ H1 ] = A -> B

But in between these two updates, the item iterator tries to find which entry follows B, and finds A again, thus creating a loop. So my workaround suggestion is to collect all of the items before any update occurs, which is of course not great for large lists, but should avoid cyclic iterators.
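
A small illustration of that difference (plain sketch operating on a bcc-style table object, not bcc internals): the unsafe variant walks the live key iterator while updating, which can cycle if an update reorders a bucket's collision chain; the safe variant snapshots the keys first:

    def zero_unsafe(table):
        # get_next_key follows the live collision chain; an update that moves
        # the current element can make the walk revisit keys forever
        for k in table.keys():
            table[k] = table.Leaf()

    def zero_safe(table):
        # collect every key up front, then update; the walk can no longer be
        # perturbed by the updates it performs
        for k in list(table.keys()):
            table[k] = table.Leaf()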

Because we know the number of probes in advance before
attaching them, we can simply preinitialize a fixed-size
array instead of using a BPF map. This avoids potential
deadlocks/hangs/race conditions with the Python program
and internally in the kernel. See also iovisor#415, iovisor#665, iovisor#233
for more discussion.
To avoid a potential race with the key zeroing modifying
the next hash key retrieved by the loop in `Table.zero()`,
retrieve all the keys in user space first before starting
the zeroing loop. See discussion on iovisor#780.

Tested with `funccount 'SyS_*' -i 1` while running a heavy
read/write test application (`dd`) in the background for
several minutes with no visible issues.
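
A hedged sketch of the per-interval reset the first commit message describes, assuming `counts` is a fixed-size table with integer keys that is fully populated before attaching (names are illustrative): every slot already exists, so the reset is an in-place write of each known key and never depends on get_next_key ordering:

    from time import sleep

    # keys of a fixed-size array-style table are just the indexes 0..N-1,
    # known before any probe is attached
    keys = [counts.Key(i) for i in range(len(functions))]

    while True:
        sleep(interval)
        for i, k in enumerate(keys):
            print("%-36s %8d" % (functions[i], counts[k].value))
            counts[k] = counts.Leaf()    # reset this slot in place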
@goldshtn
Collaborator Author

Right. So I'm going to update this branch to migrate funccount to use an array, and then fix zero().

@brendangregg
Member

@goldshtn ah thanks, I'd missed a make install. The duplicate issue is now fixed.

I'll try to get a Xenial 4.4 build environment set up for testing (one that lacks that kernel patch).

@brendangregg
Member

Kernel hang. This is Ubuntu Xenial with a 4.4 kernel:

Linux version 4.4.0-31-generic (buildd@lgw01-16) (gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2.1) ) #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 (Ubuntu 4.4.0-31.50-generic 4.4.13)

My version in #777 does not induce the hang.

@goldshtn
Collaborator Author

Right. I'll be moving back to lookup with map zeroing in funccount, so that should be fixed after I push. For other generic tools, this is concerning.

@4ast
Member

4ast commented Oct 25, 2016

Calling zero() on the map causes an update to be called as follows:
update(A, 0)
Which creates a new collision chain [ H1 ] = B -> A
update(B, 0)
creates [ H1 ] = A -> B
But in between these two updates, the item iterator tries to find which entry follows B, and finds A again, thus creating a loop

Sounds plausible. If that's the case, it should be easy to reproduce: create a map with two elements hashing into the same bucket, then

    while (...) {
        get_next(key, next_key)
        update(key, 0)
        key = next_key
    }

should loop forever, right? Once there is a testcase, we can do something in the kernel. Not sure how to avoid this in the kernel without changing user space, though...
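
In bcc terms, a user-space reproduction along those lines might look like the following (a hypothetical test; the map is sized small to make bucket collisions likely, and whether it actually loops depends on the kernel's hash-map update behavior on the affected versions):

    from bcc import BPF

    b = BPF(text="BPF_HASH(t, u64, u64, 64);")   # map only, no programs attached
    t = b["t"]
    for i in range(64):                           # fill it so buckets collide
        t[t.Key(i)] = t.Leaf(i)
    for k in t.keys():                            # live get_next_key walk...
        t[k] = t.Leaf(0)                          # ...updating as we go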

@brendangregg
Member

Since this didn't hang on 4.9... @4ast, is this the patchset that fixes it in-kernel, or is this a different issue?

  bpf: prevent kprobe+bpf deadlocks
  bpf: introduce percpu_freelist
  bpf: pre-allocate hash map elements
  bpf: check for reserved flag bits in array and stack maps
  bpf: convert stackmap to pre-allocation
  samples/bpf: make map creation more verbose
  samples/bpf: move ksym_search() into library
  samples/bpf: add map_flags to bpf loader
  samples/bpf: test both pre-alloc and normal maps
  samples/bpf: add bpf map stress test
  samples/bpf: stress test bpf_get_stackid
  samples/bpf: add map performance test

Maybe Canonical can backport for Xenial users.

@drzaeus77
Collaborator

LGTM

@goldshtn
Collaborator Author

@brendangregg @drzaeus77 Pushed two additional commits: one changes funccount to use a fixed-size prepopulated map and clear it key-by-key; the other also changes Table.zero() to use list(...keys) per Brenden's suggestion.

@4ast
Member

4ast commented Oct 25, 2016

lgtm @drzaeus77

@brendangregg
Member

(root) /mnt/src/5/bcc/tools # ./funccount.py -i 1 '*spin*'
Tracing 16 functions for "*spin*"... Hit Ctrl-C to end.

FUNC                                    COUNT
mutex_optimistic_spin                       2
rwsem_spin_on_owner                         2
__pv_queued_spin_lock_slowpath             23
_raw_spin_unlock_bh                        85
_raw_spin_lock_bh                         106
_raw_spin_trylock                         356
_raw_spin_lock_irq                       2611
_raw_spin_unlock_irqrestore              6071
_raw_spin_lock_irqsave                   6580
_raw_spin_lock                          92633
__pv_queued_spin_unlock                210885

FUNC                                    COUNT
mutex_optimistic_spin                       1
__pv_queued_spin_lock_slowpath             31
_raw_spin_unlock_bh                       120
_raw_spin_lock_bh                         142
_raw_spin_trylock                         310
_raw_spin_lock_irq                       2240
_raw_spin_lock_irqsave                   4973
_raw_spin_unlock_irqrestore              5551
_raw_spin_lock                          33314
__pv_queued_spin_unlock                 40988
^C
FUNC                                    COUNT
mutex_optimistic_spin                       1
rwsem_spin_on_owner                         1
__pv_queued_spin_lock_slowpath             24
_raw_spin_unlock_bh                        81
_raw_spin_lock_bh                          98
_raw_spin_trylock                         274
_raw_spin_lock_irq                       2091
_raw_spin_lock_irqsave                   4845
_raw_spin_unlock_irqrestore              5414
_raw_spin_lock                          33281
__pv_queued_spin_unlock                 40585
Detaching...
(root) /mnt/src/5/bcc/tools # uname -a
Linux bgregg-xenial-test-i-01f808d4259a41b59 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Ok, that test worked on 4.4. I also checked 4.9.

(... really need to give funccount a "-d duration" to make testing not need a human Ctrl-C (and change the -d for debug to -D). I was hacking it in with a "timeout -s2 5" prefix.)

However:

(root) /mnt/src/5/bcc/tools # ./funccount.py -i 1 'bash:r*'
Tracing 317 functions for "bash:r*"... Hit Ctrl-C to end.

FUNC                                    COUNT
read_octal                                  1
realloc                                     1
readonly_builtin                            2
read_tty_modified                           2
redirection_expand                          2
[...]
^C
FUNC                                    COUNT
read_octal                                  1
realloc                                     1
readonly_builtin                            2
read_tty_modified                           2
redirection_expand                          2
run_exit_trap                               2
reset_parser                                2
restore_input_line_state                    2
restore_parser_state                        2
reader_loop                                 3
run_return_trap                             3
restore_pipestatus_array                    4
remember_args                               4
reset_signal_handlers                       5
reap_dead_jobs                              6
remove_quoted_escapes                      13
run_unwind_frame                           16
reset_terminating_signals                  20
restore_original_signals                   22
reset_internal_getopt                      63
run_debug_trap                            147
read_command                              183
remove_quoted_nulls                       346
run_pending_traps                         645
Detaching...

(root) /mnt/src/5/bcc/tools # ./funccount.py -i 1 'bash:r*'
perf_event_open(/sys/kernel/debug/tracing/events/uprobes/p__bin_bash_0xb2540/id): Cannot allocate memory
Failed to attach BPF to uprobe

(root) /mnt/src/5/bcc/tools # ./reset-trace.sh -v
Reseting tracing state...

Checking /sys/kernel/debug/tracing/kprobe_events
Checking /sys/kernel/debug/tracing/uprobe_events
Needed to reset /sys/kernel/debug/tracing/uprobe_events
uprobe_events, before (line enumerated):
     1  p:uprobes/p__bin_bash_0xb2540 /bin/bash:0x00000000000b2540
uprobe_events, after (line enumerated):

Checking /sys/kernel/debug/tracing/trace
Checking /sys/kernel/debug/tracing/current_tracer
Checking /sys/kernel/debug/tracing/set_ftrace_filter
Checking /sys/kernel/debug/tracing/set_graph_function
Checking /sys/kernel/debug/tracing/set_ftrace_pid
Checking /sys/kernel/debug/tracing/events/enable
Checking /sys/kernel/debug/tracing/tracing_thresh
Checking /sys/kernel/debug/tracing/tracing_on

Done.

So a Ctrl-C had left uprobes still initialized. The reset-trace.sh -v output shows how it was (that's a cat -nv).

@brendangregg
Member

No, it's actually broken, not just still-initialized uprobes:

dmesg
[  255.276131] ------------[ cut here ]------------
[  255.276139] WARNING: CPU: 0 PID: 7843 at /build/linux-dcxD3m/linux-4.4.0/mm/page_counter.c:26 page_counter_cancel+0x34/0x40()
[  255.276141] Modules linked in: binfmt_misc zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) xfs x86_pkg_temp_thermal coretemp ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd
[  255.276167] CPU: 0 PID: 7843 Comm: funccount.py Tainted: P           O    4.4.0-31-generic #50-Ubuntu
[  255.276168]  0000000000000200 00000000c146c932 ffff8801d4373a88 ffffffff813f1143
[  255.276171]  0000000000000000 ffffffff81cce100 ffff8801d4373ac0 ffffffff81081102
[  255.276173]  ffff8800ff20d8b8 000000000000001f ffff8800ff20d800 ffff8801d6410e80
[  255.276175] Call Trace:
[  255.276180]  [<ffffffff813f1143>] dump_stack+0x63/0x90
[  255.276183]  [<ffffffff81081102>] warn_slowpath_common+0x82/0xc0
[  255.276185]  [<ffffffff8108124a>] warn_slowpath_null+0x1a/0x20
[  255.276187]  [<ffffffff811f9ca4>] page_counter_cancel+0x34/0x40
[  255.276189]  [<ffffffff811f9da2>] page_counter_uncharge+0x22/0x40
[  255.276190]  [<ffffffff811fbef8>] drain_stock.isra.33+0x38/0xa0
[  255.276192]  [<ffffffff811fc990>] try_charge+0x5e0/0x640
[  255.276195]  [<ffffffff81200bdb>] mem_cgroup_try_charge+0x6b/0x1b0
[  255.276198]  [<ffffffff81187d39>] uprobe_write_opcode+0x209/0x4c0
[  255.276202]  [<ffffffff8116f7e3>] ? uprobe_perf_filter+0x43/0x50
[  255.276203]  [<ffffffff8118809a>] set_orig_insn+0x1a/0x20
[  255.276205]  [<ffffffff811884a6>] register_for_each_vma+0x406/0x430
[  255.276207]  [<ffffffff81188962>] uprobe_apply+0x72/0xc0
[  255.276209]  [<ffffffff8116ff60>] uprobe_perf_close+0x90/0xd0
[  255.276211]  [<ffffffff81170882>] trace_uprobe_register+0xd2/0x200
[  255.276214]  [<ffffffff81162eea>] perf_trace_destroy+0x2a/0x50
[  255.276216]  [<ffffffff81179ec9>] tp_perf_event_destroy+0x9/0x10
[  255.276218]  [<ffffffff81180de9>] _free_event+0xb9/0x220
[  255.276219]  [<ffffffff81181013>] put_event+0xc3/0x100
[  255.276221]  [<ffffffff81181070>] perf_release+0x10/0x20
[  255.276224]  [<ffffffff8120ef84>] __fput+0xe4/0x220
[  255.276226]  [<ffffffff8120f0fe>] ____fput+0xe/0x10
[  255.276229]  [<ffffffff8109ec1f>] task_work_run+0x7f/0xa0
[  255.276231]  [<ffffffff81003242>] exit_to_usermode_loop+0xc2/0xd0
[  255.276233]  [<ffffffff81003c6e>] syscall_return_slowpath+0x4e/0x60
[  255.276237]  [<ffffffff8182dc90>] int_ret_from_sys_call+0x25/0x8f
[  255.276238] ---[ end trace 724b0e71eca6554e ]---

@drzaeus77
Collaborator

@brendangregg 4.4.0-31 is still broken w.r.t. uprobes; the memory leak is fixed at about 4.4.0-42. This is what made the 16.04 buildbot need a reboot once a week or so.

drzaeus77 merged commit 191bac6 into iovisor:master Oct 26, 2016
@brendangregg
Member

OK, I'll start digging, but if you know the patch description offhand that'd be useful; I just need to always check it's merged in the kernels we're deploying.

@drzaeus77
Collaborator

@brendangregg
Member

@drzaeus77 thanks; so it's just accounting, not an actual leak.

@drzaeus77
Collaborator

It's pretty bad. The accounting mismatch results in serious failures; in some cases I saw an oops when I tried to do a kexec after hitting it.
