Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE: Add a map permission check for mmap #13

Closed
stephensmalley opened this issue Nov 17, 2016 · 1 comment
Closed

RFE: Add a map permission check for mmap #13

stephensmalley opened this issue Nov 17, 2016 · 1 comment

Comments

@stephensmalley
Copy link
Member

Add a 'map' check on mmap so that we can distinguish memory mapped access (since it has different implications for revocation) When a file is opened and then read or written via syscalls like read(2)/write(2), we revalidate access on each read/write operation via selinux_file_permission() and therefore can revoke access if the process context, the file context, or the policy changes in such a manner that access is no longer allowed. When a file is opened and then memory mapped via mmap(2) and then subsequently read or written directly in memory, we presently have no way to revalidate or revoke access. The purpose of a separate map permission check on mmap(2) is to permit policy to prohibit memory mapping of specific files for which we need to ensure that every access is revalidated, particularly useful for scenarios where we expect the file to be relabeled at runtime in order to reflect state changes (e.g. cross-domain solution, assured pipeline without data copying).

@pcmoore pcmoore changed the title Add a map permission check for mmap RFE: Add a map permission check for mmap Nov 18, 2016
pcmoore pushed a commit that referenced this issue May 17, 2017
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <[email protected]>
Cc: Hannes Sowa <[email protected]>
Signed-off-by: Jon Maxwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
@stephensmalley
Copy link
Member Author

Resolved by 6941857

pcmoore pushed a commit that referenced this issue Jul 6, 2017
Commit c5f6ce9 tries to address multiple resets but fails as
work_busy doesn't involve any synchronization and can fail.  This is
reproducible easily as can be seen by WARNING below which is triggered
with line:

WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

Allowing multiple resets can result in multiple controller removal as
well if different conditions inside nvme_reset_work fail and which
might deadlock on device_release_driver.

[  480.327007] WARNING: CPU: 3 PID: 150 at drivers/nvme/host/pci.c:1900 nvme_reset_work+0x36c/0xec0
[  480.327008] Modules linked in: rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast...
[  480.327044]  btusb videobuf2_core ghash_clmulni_intel snd_hwdep cfg80211 acer_wmi hci_uart..
[  480.327065] CPU: 3 PID: 150 Comm: kworker/u16:2 Not tainted 4.12.0-rc1+ #13
[  480.327065] Hardware name: Acer Predator G9-591/Mustang_SLS, BIOS V1.10 03/03/2016
[  480.327066] Workqueue: nvme nvme_reset_work
[  480.327067] task: ffff880498ad8000 task.stack: ffffc90002218000
[  480.327068] RIP: 0010:nvme_reset_work+0x36c/0xec0
[  480.327069] RSP: 0018:ffffc9000221bdb8 EFLAGS: 00010246
[  480.327070] RAX: 0000000000460000 RBX: ffff880498a98128 RCX: dead000000000200
[  480.327070] RDX: 0000000000000001 RSI: ffff8804b1028020 RDI: ffff880498a98128
[  480.327071] RBP: ffffc9000221be50 R08: 0000000000000000 R09: 0000000000000000
[  480.327071] R10: ffffc90001963ce8 R11: 000000000000020d R12: ffff880498a98000
[  480.327072] R13: ffff880498a53500 R14: ffff880498a98130 R15: ffff880498a98128
[  480.327072] FS:  0000000000000000(0000) GS:ffff8804c1cc0000(0000) knlGS:0000000000000000
[  480.327073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  480.327074] CR2: 00007ffcf3c37f78 CR3: 0000000001e09000 CR4: 00000000003406e0
[  480.327074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  480.327075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  480.327075] Call Trace:
[  480.327079]  ? __switch_to+0x227/0x400
[  480.327081]  process_one_work+0x18c/0x3a0
[  480.327082]  worker_thread+0x4e/0x3b0
[  480.327084]  kthread+0x109/0x140
[  480.327085]  ? process_one_work+0x3a0/0x3a0
[  480.327087]  ? kthread_park+0x60/0x60
[  480.327102]  ret_from_fork+0x2c/0x40
[  480.327103] Code: e8 5a dc ff ff 85 c0 41 89 c1 0f.....

This patch addresses the problem by using state of controller to
decide whether reset should be queued or not as state change is
synchronizated using controller spinlock.  Also cancel_work_sync is
used to make sure remove cancels the reset_work and waits for it to
finish.  This patch also changes return value from -ENODEV to more
appropriate -EBUSY if nvme_reset fails to change state.

Fixes: c5f6ce9 ("nvme: don't schedule multiple resets")
Signed-off-by: Rakesh Pandit <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
stephensmalley pushed a commit to stephensmalley/selinux-kernel that referenced this issue Apr 17, 2018
Patch series "kexec_file, x86, powerpc: refactoring for other
architecutres", v2.

This is a preparatory patchset for adding kexec_file support on arm64.

It was originally included in a arm64 patch set[1], but Philipp is also
working on their kexec_file support on s390[2] and some changes are now
conflicting.

So these common parts were extracted and put into a separate patch set
for better integration.  What's more, my original patch#4 was split into
a few small chunks for easier review after Dave's comment.

As such, the resulting code is basically identical with my original, and
the only *visible* differences are:

 - renaming of _kexec_kernel_image_probe() and  _kimage_file_post_load_cleanup()

 - change one of types of arguments at prepare_elf64_headers()

Those, unfortunately, require a couple of trivial changes on the rest
(SELinuxProject#1, SELinuxProject#6 to SELinuxProject#13) of my arm64 kexec_file patch set[1].

Patch SELinuxProject#1 allows making a use of purgatory optional, particularly useful
for arm64.

Patch SELinuxProject#2 commonalizes arch_kexec_kernel_{image_probe, image_load,
verify_sig}() and arch_kimage_file_post_load_cleanup() across
architectures.

Patches SELinuxProject#3-SELinuxProject#7 are also intended to generalize parse_elf64_headers(),
along with exclude_mem_range(), to be made best re-use of.

[1] http:https://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/561182.html
[2] http:https://lkml.iu.edu//hypermail/linux/kernel/1802.1/02596.html

This patch (of 7):

On arm64, crash dump kernel's usable memory is protected by *unmapping*
it from kernel virtual space unlike other architectures where the region
is just made read-only.  It is highly unlikely that the region is
accidentally corrupted and this observation rationalizes that digest
check code can also be dropped from purgatory.  The resulting code is so
simple as it doesn't require a bit ugly re-linking/relocation stuff,
i.e.  arch_kexec_apply_relocations_add().

Please see:

   http:https://lists.infradead.org/pipermail/linux-arm-kernel/2017-December/545428.html

All that the purgatory does is to shuffle arguments and jump into a new
kernel, while we still need to have some space for a hash value
(purgatory_sha256_digest) which is never checked against.

As such, it doesn't make sense to have trampline code between old kernel
and new kernel on arm64.

This patch introduces a new configuration, ARCH_HAS_KEXEC_PURGATORY, and
allows related code to be compiled in only if necessary.

[[email protected]: fix trivial screwup]
  Link: http:https://lkml.kernel.org/r/[email protected]
Link: http:https://lkml.kernel.org/r/[email protected]
Signed-off-by: AKASHI Takahiro <[email protected]>
Acked-by: Dave Young <[email protected]>
Tested-by: Dave Young <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
pcmoore pushed a commit that referenced this issue Mar 18, 2019
Marc Gonzalez reported the following kmemleak crash:

  Unable to handle kernel paging request at virtual address ffffffc021e00000
  Mem abort info:
    ESR = 0x96000006
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000006
    CM = 0, WnR = 0
  swapper pgtable: 4k pages, 39-bit VAs, pgdp = (____ptrval____) [ffffffc021e00000] pgd=000000017e3ba803, pud=000000017e3ba803, pmd=0000000000000000
  Internal error: Oops: 96000006 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 6 PID: 523 Comm: kmemleak Tainted: G S      W         5.0.0-rc1 #13
  Hardware name: Qualcomm Technologies, Inc. MSM8998 v1 MTP (DT)
  pstate: 80000085 (Nzcv daIf -PAN -UAO)
  pc : scan_block+0x70/0x190
  lr : scan_block+0x6c/0x190
  Process kmemleak (pid: 523, stack limit = 0x(____ptrval____))
  Call trace:
   scan_block+0x70/0x190
   scan_gray_list+0x108/0x1c0
   kmemleak_scan+0x33c/0x7c0
   kmemleak_scan_thread+0x98/0xf0
   kthread+0x11c/0x120
   ret_from_fork+0x10/0x1c
  Code: f9000fb4 d503201f 97ffffd2 35000580 (f9400260)

The crash happens when a no-map area is allocated in
early_init_dt_alloc_reserved_memory_arch().  The allocated region is
registered with kmemleak, but it is then removed from memblock using
memblock_remove() that is not kmemleak-aware.

Replacing memblock_phys_alloc_range() with memblock_find_in_range()
makes sure that the allocated memory is not added to kmemleak and then
memblock_remove()'ing this memory is safe.

As a bonus, since memblock_find_in_range() ensures the allocation in the
specified range, the bounds check can be removed.

[[email protected]: of: fix parameters order for call to memblock_find_in_range()]
  Link: http:https://lkml.kernel.org/r/20190221112619.GC32004@rapoport-lnx
Link: http:https://lkml.kernel.org/r/20190213181921.GB15270@rapoport-lnx
Fixes: 3f0c820 ("drivers: of: add initialization code for dynamic reserved memory")
Signed-off-by: Mike Rapoport <[email protected]>
Acked-by: Marek Szyprowski <[email protected]>
Acked-by: Prateek Patel <[email protected]>
Tested-by: Marc Gonzalez <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Frank Rowand <[email protected]>
Cc: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
stephensmalley pushed a commit to stephensmalley/selinux-kernel that referenced this issue Aug 26, 2020
Right now, igc_ptp_reset() is called from igc_reset(), which is called
from igc_probe() before igc_ptp_init() has a chance to run. It is
detected as an attempt to use an spinlock without registering its key
first. See log below.

To avoid this problem, simplify the initialization: igc_ptp_init() is
only called from igc_probe(), and igc_ptp_reset() is only called from
igc_reset().

[    2.736332] INFO: trying to register non-static key.
[    2.736902] input: HDA Intel PCH Front Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[    2.737513] the code is fine but needs lockdep annotation.
[    2.737513] turning off the locking correctness validator.
[    2.737515] CPU: 8 PID: 239 Comm: systemd-udevd Tainted: G            E     5.8.0-rc7+ SELinuxProject#13
[    2.737515] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ULTRA/Z390 AORUS ULTRA-CF, BIOS F7 03/14/2019
[    2.737516] Call Trace:
[    2.737521]  dump_stack+0x78/0xa0
[    2.737524]  register_lock_class+0x6b1/0x6f0
[    2.737526]  ? lockdep_hardirqs_on_prepare+0xca/0x160
[    2.739177]  ? _raw_spin_unlock_irq+0x24/0x50
[    2.739179]  ? trace_hardirqs_on+0x1c/0xf0
[    2.740820]  __lock_acquire+0x56/0x1ff0
[    2.740823]  ? __schedule+0x30c/0x970
[    2.740825]  lock_acquire+0x97/0x3e0
[    2.740830]  ? igc_ptp_reset+0x35/0xf0 [igc]
[    2.740833]  ? schedule_hrtimeout_range_clock+0xb7/0x120
[    2.742507]  _raw_spin_lock_irqsave+0x3a/0x50
[    2.742512]  ? igc_ptp_reset+0x35/0xf0 [igc]
[    2.742515]  igc_ptp_reset+0x35/0xf0 [igc]
[    2.742519]  igc_reset+0x96/0xd0 [igc]
[    2.744148]  igc_probe+0x68f/0x7d0 [igc]
[    2.745796]  local_pci_probe+0x3d/0x70
[    2.745799]  pci_device_probe+0xd1/0x190
[    2.745802]  really_probe+0x15a/0x3f0
[    2.759936]  driver_probe_device+0xe1/0x150
[    2.759937]  device_driver_attach+0xa8/0xb0
[    2.761786]  __driver_attach+0x89/0x150
[    2.761786]  ? device_driver_attach+0xb0/0xb0
[    2.761787]  ? device_driver_attach+0xb0/0xb0
[    2.761788]  bus_for_each_dev+0x66/0x90
[    2.765012]  bus_add_driver+0x12e/0x1f0
[    2.765716]  driver_register+0x8b/0xe0
[    2.766418]  ? 0xffffffffc0230000
[    2.767119]  do_one_initcall+0x5a/0x310
[    2.767826]  ? kmem_cache_alloc_trace+0xe9/0x200
[    2.768528]  do_init_module+0x5c/0x260
[    2.769206]  __do_sys_finit_module+0x93/0xe0
[    2.770048]  do_syscall_64+0x46/0xa0
[    2.770716]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    2.771396] RIP: 0033:0x7f83534589e0
[    2.772073] Code: 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 2e 2e 2e 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 80 24 0d 00 f7 d8 64 89 01 48
[    2.772074] RSP: 002b:00007ffd31d0ed18 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    2.774854] RAX: ffffffffffffffda RBX: 000055d52816aba0 RCX: 00007f83534589e0
[    2.774855] RDX: 0000000000000000 RSI: 00007f83535b982f RDI: 0000000000000006
[    2.774855] RBP: 00007ffd31d0ed60 R08: 0000000000000000 R09: 00007ffd31d0ed30
[    2.774856] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[    2.774856] R13: 0000000000020000 R14: 00007f83535b982f R15: 000055d527f5e120

Fixes: 5f29580 ("igc: Add basic skeleton for PTP")
Signed-off-by: Vinicius Costa Gomes <[email protected]>
Reviewed-by: Andre Guedes <[email protected]>
Tested-by: Aaron Brown <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
pcmoore pushed a commit that referenced this issue Nov 14, 2021
When I do fuzz test for bonding device interface, I got the following
use-after-free Calltrace:

==================================================================
BUG: KASAN: use-after-free in bond_enslave+0x1521/0x24f0
Read of size 8 at addr ffff88825bc11c00 by task ifenslave/7365

CPU: 5 PID: 7365 Comm: ifenslave Tainted: G            E     5.15.0-rc1+ #13
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
Call Trace:
 dump_stack_lvl+0x6c/0x8b
 print_address_description.constprop.0+0x48/0x70
 kasan_report.cold+0x82/0xdb
 __asan_load8+0x69/0x90
 bond_enslave+0x1521/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f19159cf577
Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 78
RSP: 002b:00007ffeb3083c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffeb3084bca RCX: 00007f19159cf577
RDX: 00007ffeb3083ce0 RSI: 0000000000008990 RDI: 0000000000000003
RBP: 00007ffeb3084bc4 R08: 0000000000000040 R09: 0000000000000000
R10: 00007ffeb3084bc0 R11: 0000000000000246 R12: 00007ffeb3083ce0
R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffeb3083cb0

Allocated by task 7365:
 kasan_save_stack+0x23/0x50
 __kasan_kmalloc+0x83/0xa0
 kmem_cache_alloc_trace+0x22e/0x470
 bond_enslave+0x2e1/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 7365:
 kasan_save_stack+0x23/0x50
 kasan_set_track+0x20/0x30
 kasan_set_free_info+0x24/0x40
 __kasan_slab_free+0xf2/0x130
 kfree+0xd1/0x5c0
 slave_kobj_release+0x61/0x90
 kobject_put+0x102/0x180
 bond_sysfs_slave_add+0x7a/0xa0
 bond_enslave+0x11b6/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Last potentially related work creation:
 kasan_save_stack+0x23/0x50
 kasan_record_aux_stack+0xb7/0xd0
 insert_work+0x43/0x190
 __queue_work+0x2e3/0x970
 delayed_work_timer_fn+0x3e/0x50
 call_timer_fn+0x148/0x470
 run_timer_softirq+0x8a8/0xc50
 __do_softirq+0x107/0x55f

Second to last potentially related work creation:
 kasan_save_stack+0x23/0x50
 kasan_record_aux_stack+0xb7/0xd0
 insert_work+0x43/0x190
 __queue_work+0x2e3/0x970
 __queue_delayed_work+0x130/0x180
 queue_delayed_work_on+0xa7/0xb0
 bond_enslave+0xe25/0x24f0
 bond_do_ioctl+0x3e0/0x450
 dev_ifsioc+0x2ba/0x970
 dev_ioctl+0x112/0x710
 sock_do_ioctl+0x118/0x1b0
 sock_ioctl+0x2e0/0x490
 __x64_sys_ioctl+0x118/0x150
 do_syscall_64+0x35/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88825bc11c00
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 0 bytes inside of
 1024-byte region [ffff88825bc11c00, ffff88825bc12000)
The buggy address belongs to the page:
page:ffffea00096f0400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x25bc10
head:ffffea00096f0400 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
raw: 057ff00000010200 ffffea0009a71c08 ffff888240001968 ffff88810004dbc0
raw: 0000000000000000 00000000000a000a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88825bc11b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88825bc11b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88825bc11c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff88825bc11c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88825bc11d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Put new_slave in bond_sysfs_slave_add() will cause use-after-free problems
when new_slave is accessed in the subsequent error handling process. Since
new_slave will be put in the subsequent error handling process, remove the
unnecessary put to fix it.
In addition, when sysfs_create_file() fails, if some files have been crea-
ted successfully, we need to call sysfs_remove_file() to remove them.
Since there are sysfs_create_files() & sysfs_remove_files() can be used,
use these two functions instead.

Fixes: 7afcaec (bonding: use kobject_put instead of _del after kobject_add)
Signed-off-by: Huang Guobin <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Nov 18, 2021
The exit function fixes a memory leak with the src field as detected by
leak sanitizer. An example of which is:

Indirect leak of 25133184 byte(s) in 207 object(s) allocated from:
    #0 0x7f199ecfe987 in __interceptor_calloc libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x55defe638224 in annotated_source__alloc_histograms util/annotate.c:803
    #2 0x55defe6397e4 in symbol__hists util/annotate.c:952
    #3 0x55defe639908 in symbol__inc_addr_samples util/annotate.c:968
    #4 0x55defe63aa29 in hist_entry__inc_addr_samples util/annotate.c:1119
    #5 0x55defe499a79 in hist_iter__report_callback tools/perf/builtin-report.c:182
    #6 0x55defe7a859d in hist_entry_iter__add util/hist.c:1236
    #7 0x55defe49aa63 in process_sample_event tools/perf/builtin-report.c:315
    #8 0x55defe731bc8 in evlist__deliver_sample util/session.c:1473
    #9 0x55defe731e38 in machines__deliver_event util/session.c:1510
    #10 0x55defe732a23 in perf_session__deliver_event util/session.c:1590
    #11 0x55defe72951e in ordered_events__deliver_event util/session.c:183
    #12 0x55defe740082 in do_flush util/ordered-events.c:244
    #13 0x55defe7407cb in __ordered_events__flush util/ordered-events.c:323
    #14 0x55defe740a61 in ordered_events__flush util/ordered-events.c:341
    #15 0x55defe73837f in __perf_session__process_events util/session.c:2390
    #16 0x55defe7385ff in perf_session__process_events util/session.c:2420
    ...

Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kajol Jain <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Martin Liška <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
pcmoore pushed a commit that referenced this issue Oct 17, 2022
While setting up a new lab, I accidentally misconfigured the
Ethernet port for a system that tried an NFS mount using RoCE.
This made the NFS server unreachable. The following WARNING
popped on the NFS client while waiting for the mount attempt to
time out:

kernel: workqueue: WQ_MEM_RECLAIM xprtiod:xprt_rdma_connect_worker [rpcrdma] is flushing !WQ_MEM_RECLAI>
kernel: WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:2628 check_flush_dependency+0xbf/0xca
kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs 8021q garp stp mrp llc rfkill rpcrdma>
kernel: CPU: 0 PID: 100 Comm: kworker/u8:8 Not tainted 6.0.0-rc1-00002-g6229f8c054e5 #13
kernel: Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0b 06/12/2017
kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
kernel: RIP: 0010:check_flush_dependency+0xbf/0xca
kernel: Code: 75 2a 48 8b 55 18 48 8d 8b b0 00 00 00 4d 89 e0 48 81 c6 b0 00 00 00 48 c7 c7 65 33 2e be>
kernel: RSP: 0018:ffffb562806cfcf8 EFLAGS: 00010092
kernel: RAX: 0000000000000082 RBX: ffff97894f8c3c00 RCX: 0000000000000027
kernel: RDX: 0000000000000002 RSI: ffffffffbe3447d1 RDI: 00000000ffffffff
kernel: RBP: ffff978941315840 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 00000000000008b0 R11: 0000000000000001 R12: ffffffffc0ce3731
kernel: R13: ffff978950c00500 R14: ffff97894341f0c0 R15: ffff978951112eb0
kernel: FS:  0000000000000000(0000) GS:ffff97987fc00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007f807535eae8 CR3: 000000010b8e4002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  __flush_work.isra.0+0xaf/0x188
kernel:  ? _raw_spin_lock_irqsave+0x2c/0x37
kernel:  ? lock_timer_base+0x38/0x5f
kernel:  __cancel_work_timer+0xea/0x13d
kernel:  ? preempt_latency_start+0x2b/0x46
kernel:  rdma_addr_cancel+0x70/0x81 [ib_core]
kernel:  _destroy_id+0x1a/0x246 [rdma_cm]
kernel:  rpcrdma_xprt_connect+0x115/0x5ae [rpcrdma]
kernel:  ? _raw_spin_unlock+0x14/0x29
kernel:  ? raw_spin_rq_unlock_irq+0x5/0x10
kernel:  ? finish_task_switch.isra.0+0x171/0x249
kernel:  xprt_rdma_connect_worker+0x3b/0xc7 [rpcrdma]
kernel:  process_one_work+0x1d8/0x2d4
kernel:  worker_thread+0x18b/0x24f
kernel:  ? rescuer_thread+0x280/0x280
kernel:  kthread+0xf4/0xfc
kernel:  ? kthread_complete_and_exit+0x1b/0x1b
kernel:  ret_from_fork+0x22/0x30
kernel:  </TASK>

SUNRPC's xprtiod workqueue is WQ_MEM_RECLAIM, so any workqueue that
one of its work items tries to cancel has to be WQ_MEM_RECLAIM to
prevent a priority inversion. The internal workqueues in the
RDMA/core are currently non-MEM_RECLAIM.

Jason Gunthorpe says this about the current state of RDMA/core:
> If you attempt to do a reconnection/etc from within a RECLAIM
> context it will deadlock on one of the many allocations that are
> made to support opening the connection.
>
> The general idea of reclaim is that the entire task context
> working under the reclaim is marked with an override of the gfp
> flags to make all allocations under that call chain reclaim safe.
>
> But rdmacm does allocations outside this, eg in the WQs processing
> the CM packets. So this doesn't work and we will deadlock.
>
> Fixing it is a big deal and needs more than poking WQ_MEM_RECLAIM
> here and there.

So we will change the ULP in this case to avoid the use of
WQ_MEM_RECLAIM where possible. Deadlocks that were possible before
are not fixed, but at least we no longer have a false sense of
confidence that the stack won't allocate memory during memory
reclaim.

Suggested-by: Leon Romanovsky <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
pcmoore pushed a commit that referenced this issue Oct 17, 2022
Poll CQ functions shouldn't sleep as they are called in atomic context.
The following splat appears once the mlx5_aso_poll_cq() is used in such
flow.

 BUG: scheduling while atomic: swapper/17/0/0x00000100
 Modules linked in: sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core fuse [last unloaded: mlx5_core]
 CPU: 17 PID: 0 Comm: swapper/17 Tainted: G        W          6.0.0-rc2+ #13
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x34/0x44
  __schedule_bug.cold+0x47/0x53
  __schedule+0x4b6/0x670
  ? hrtimer_start_range_ns+0x28d/0x360
  schedule+0x50/0x90
  schedule_hrtimeout_range_clock+0x98/0x120
  ? __hrtimer_init+0xb0/0xb0
  usleep_range_state+0x60/0x90
  mlx5_aso_poll_cq+0xad/0x190 [mlx5_core]
  mlx5e_ipsec_aso_update_curlft+0x81/0xb0 [mlx5_core]
  xfrm_timer_handler+0x6b/0x360
  ? xfrm_find_acq_byseq+0x50/0x50
  __hrtimer_run_queues+0x139/0x290
  hrtimer_run_softirq+0x7d/0xe0
  __do_softirq+0xc7/0x272
  irq_exit_rcu+0x87/0xb0
  sysvec_apic_timer_interrupt+0x72/0x90
  </IRQ>
  <TASK>
  asm_sysvec_apic_timer_interrupt+0x16/0x20
 RIP: 0010:default_idle+0x18/0x20
 Code: ae 7d ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 8b 05 b5 30 0d 01 85 c0 7e 07 0f 00 2d 0a e3 53 00 fb f4 <c3> 0f 1f 80 00 00 00 00 0f 1f 44 00 00 65 48 8b 04 25 80 ad 01 00
 RSP: 0018:ffff888100883ee0 EFLAGS: 00000242
 RAX: 0000000000000001 RBX: ffff888100849580 RCX: 4000000000000000
 RDX: 0000000000000001 RSI: 0000000000000083 RDI: 000000000008863c
 RBP: 0000000000000011 R08: 00000064e6977fa9 R09: 0000000000000001
 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  default_idle_call+0x37/0xb0
  do_idle+0x1cd/0x1e0
  cpu_startup_entry+0x19/0x20
  start_secondary+0xfe/0x120
  secondary_startup_64_no_verify+0xcd/0xdb
  </TASK>
 softirq: huh, entered softirq 8 HRTIMER 00000000a97c08cb with preempt_count 00000100, exited with 00000000?

Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 3, 2023
Ido Schimmel says:

====================
bridge: mcast: Extensions for EVPN

tl;dr
=====

This patchset creates feature parity between user space and the kernel
and allows the former to install and replace MDB port group entries with
a source list and associated filter mode. This is required for EVPN use
cases where multicast state is not derived from snooped IGMP/MLD
packets, but instead derived from EVPN routes exchanged by the control
plane in user space.

Background
==========

IGMPv3 [1] and MLDv2 [2] differ from earlier versions of the protocols
in that they add support for source-specific multicast. That is, hosts
can advertise interest in listening to a particular multicast address
only from specific source addresses or from all sources except for
specific source addresses.

In kernel 5.10 [3][4], the bridge driver gained the ability to snoop
IGMPv3/MLDv2 packets and install corresponding MDB port group entries.
For example, a snooped IGMPv3 Membership Report that contains a single
MODE_IS_EXCLUDE record for group 239.10.10.10 with sources 192.0.2.1,
192.0.2.2, 192.0.2.20 and 192.0.2.21 would trigger the creation of these
entries:

 # bridge -d mdb show
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.21 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.20 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.2 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 src 192.0.2.1 temp filter_mode include proto kernel  blocked
 dev br0 port veth1 grp 239.10.10.10 temp filter_mode exclude source_list 192.0.2.21/0.00,192.0.2.20/0.00,192.0.2.2/0.00,192.0.2.1/0.00 proto kernel

While the kernel can install and replace entries with a filter mode and
source list, user space cannot. It can only add EXCLUDE entries with an
empty source list, which is sufficient for IGMPv2/MLDv1, but not for
IGMPv3/MLDv2.

Use cases where the multicast state is not derived from snooped packets,
but instead derived from routes exchanged by the user space control
plane require feature parity between user space and the kernel in terms
of MDB configuration. Such a use case is detailed in the next section.

Motivation
==========

RFC 7432 [5] defines a "MAC/IP Advertisement route" (type 2) [6] that
allows NVE switches in the EVPN network to advertise and learn
reachability information for unicast MAC addresses. Traffic destined to
a unicast MAC address can therefore be selectively forwarded to a single
NVE switch behind which the MAC is located.

The same is not true for IP multicast traffic. Such traffic is simply
flooded as BUM to all NVE switches in the broadcast domain (BD),
regardless if a switch has interested receivers for the multicast stream
or not. This is especially problematic for overlay networks that make
heavy use of multicast.

The issue is addressed by RFC 9251 [7] that defines a "Selective
Multicast Ethernet Tag Route" (type 6) [8] which allows NVE switches in
the EVPN network to advertise multicast streams that they are interested
in. This is done by having each switch suppress IGMP/MLD packets from
being transmitted to the NVE network and instead communicate the
information over BGP to other switches.

As far as the bridge driver is concerned, the above means that the
multicast state (i.e., {multicast address, group timer, filter-mode,
(source records)}) for the VXLAN bridge port is not populated by the
kernel from snooped IGMP/MLD packets (they are suppressed), but instead
by user space. Specifically, by the routing daemon that is exchanging
EVPN routes with other NVE switches.

Changes are obviously also required in the VXLAN driver, but they are
the subject of future patchsets. See the "Future work" section.

Implementation
==============

The user interface is extended to allow user space to specify the filter
mode of the MDB port group entry and its source list. Replace support is
also added so that user space would not need to remove an entry and
re-add it only to edit its source list or filter mode, as that would
result in packet loss. Example usage:

 # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent \
	source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
 # bridge -d -s mdb show
 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
 dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00

The netlink interface is extended with a few new attributes in the
RTM_NEWMDB request message:

[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_SET_ENTRY ]
	struct br_mdb_entry
[ MDBA_SET_ENTRY_ATTRS ]
	[ MDBE_ATTR_SOURCE ]
		struct in_addr / struct in6_addr
	[ MDBE_ATTR_SRC_LIST ]		// new
		[ MDBE_SRC_LIST_ENTRY ]
			[ MDBE_SRCATTR_ADDRESS ]
				struct in_addr / struct in6_addr
		[ ...]
	[ MDBE_ATTR_GROUP_MODE ]	// new
		u8
	[ MDBE_ATTR_RTPORT ]		// new
		u8

No changes are required in RTM_NEWMDB responses and notifications, as
all the information can already be dumped by the kernel today.

Testing
=======

Tested with existing bridge multicast selftests: bridge_igmp.sh,
bridge_mdb_port_down.sh, bridge_mdb.sh, bridge_mld.sh,
bridge_vlan_mcast.sh.

In addition, added many new test cases for existing as well as for new
MDB functionality.

Patchset overview
=================

Patches #1-#8 are non-functional preparations for the core changes in
later patches.

Patches #9-#10 allow user space to install (*, G) entries with a source
list and associated filter mode. Specifically, patch #9 adds the
necessary kernel plumbing and patch #10 exposes the new functionality to
user space via a few new attributes.

Patch #11 allows user space to specify the routing protocol of new MDB
port group entries so that a routing daemon could differentiate between
entries installed by it and those installed by an administrator.

Patch #12 allows user space to replace MDB port group entries. This is
useful, for example, when user space wants to add a new source to a
source list. Instead of deleting a (*, G) entry and re-adding it with an
extended source list (which would result in packet loss), user space can
simply replace the current entry.

Patches #13-#14 add tests for existing MDB functionality as well as for
all new functionality added in this patchset.

Future work
===========

The VXLAN driver will need to be extended with an MDB so that it could
selectively forward IP multicast traffic to NVE switches with interested
receivers instead of simply flooding it to all switches as BUM.

The idea is to reuse the existing MDB interface for the VXLAN driver in
a similar way to how the FDB interface is shared between the bridge and
VXLAN drivers.

From command line perspective, configuration will look as follows:

 # bridge mdb add dev br0 port vxlan0 grp 239.1.1.1 permanent \
	filter_mode exclude source_list 198.50.100.1,198.50.100.2

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
	filter_mode include source_list 198.50.100.3,198.50.100.4 \
	dst 192.0.2.1 dst_port 4789 src_vni 2

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent \
	filter_mode exclude source_list 198.50.100.1,198.50.100.2 \
	dst 192.0.2.2 dst_port 4789 src_vni 2

Where the first command is enabled by this set, but the next two will be
the subject of future work.

From netlink perspective, the existing PF_BRIDGE/RTM_*MDB messages will
be extended to the VXLAN driver. This means that a few new attributes
will be added (e.g., 'MDBE_ATTR_SRC_VNI') and that the handlers for
these messages will need to move to net/core/rtnetlink.c. The rtnetlink
code will call into the appropriate driver based on the ifindex
specified in the ancillary header.

iproute2 patches can be found here [9].

Changelog
=========

Since v1 [10]:

* Patch #12: Remove extack from br_mdb_replace_group_sg().
* Patch #12: Change 'nlflags' to u16 and move it after 'filter_mode' to
  pack the structure.

Since RFC [11]:

* Patch #6: New patch.
* Patch #9: Use an array instead of a list to store source entries.
* Patch #10: Use an array instead of list to store source entries.
* Patch #10: Drop br_mdb_config_attrs_fini().
* Patch #11: Reject protocol for host entries.
* Patch #13: New patch.
* Patch #14: New patch.

[1] https://datatracker.ietf.org/doc/html/rfc3376
[2] https://www.rfc-editor.org/rfc/rfc3810
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6af52ae2ed14a6bc756d5606b29097dfd76740b8
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68d4fd30c83b1b208e08c954cd45e6474b148c87
[5] https://datatracker.ietf.org/doc/html/rfc7432
[6] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
[7] https://datatracker.ietf.org/doc/html/rfc9251
[8] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1
[9] https://github.com/idosch/iproute2/commits/submit/mdb_v1
[10] https://lore.kernel.org/netdev/[email protected]/
[11] https://lore.kernel.org/netdev/[email protected]/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 3, 2023
We need to check if we have a OS prefix, otherwise we stumble on a
metric segv that I'm now seeing in Arnaldo's tree:

  $ gdb --args perf stat -M Backend true
  ...
  Performance counter stats for 'true':

          4,712,355      TOPDOWN.SLOTS                    #     17.3 % tma_core_bound

  Program received signal SIGSEGV, Segmentation fault.
  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
  77      ../sysdeps/x86_64/multiarch/strlen-evex.S: No such file or directory.
  (gdb) bt
  #0  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
  #1  0x00007ffff74749a5 in __GI__IO_fputs (str=0x0, fp=0x7ffff75f5680 <_IO_2_1_stderr_>)
  #2  0x0000555555779f28 in do_new_line_std (config=0x555555e077c0 <stat_config>, os=0x7fffffffbf10) at util/stat-display.c:356
  #3  0x000055555577a081 in print_metric_std (config=0x555555e077c0 <stat_config>, ctx=0x7fffffffbf10, color=0x0, fmt=0x5555558b77b5 "%8.1f", unit=0x7fffffffbb10 "%  tma_memory_bound", val=13.165355724442199) at util/stat-display.c:380
  #4  0x00005555557768b6 in generic_metric (config=0x555555e077c0 <stat_config>, metric_expr=0x55555593d5b7 "((CYCLE_ACTIVITY.STALLS_MEM_ANY + EXE_ACTIVITY.BOUND_ON_STORES) / (CYCLE_ACTIVITY.STALLS_TOTAL + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) + EXE_ACTIVITY.BOUND_ON_STORES))"..., metric_events=0x555555f334e0, metric_refs=0x555555ec81d0, name=0x555555f32e80 "TOPDOWN.SLOTS", metric_name=0x555555f26c80 "tma_memory_bound", metric_unit=0x55555593d5b1 "100%", runtime=0, map_idx=0, out=0x7fffffffbd90, st=0x555555e9e620 <rt_stat>) at util/stat-shadow.c:934
  #5  0x0000555555778cac in perf_stat__print_shadow_stats (config=0x555555e077c0 <stat_config>, evsel=0x555555f289d0, avg=4712355, map_idx=0, out=0x7fffffffbd90, metric_events=0x555555e078e8 <stat_config+296>, st=0x555555e9e620 <rt_stat>) at util/stat-shadow.c:1329
  #6  0x000055555577b6a0 in printout (config=0x555555e077c0 <stat_config>, os=0x7fffffffbf10, uval=4712355, run=325322, ena=325322, noise=4712355, map_idx=0) at util/stat-display.c:741
  #7  0x000055555577bc74 in print_counter_aggrdata (config=0x555555e077c0 <stat_config>, counter=0x555555f289d0, s=0, os=0x7fffffffbf10) at util/stat-display.c:838
  #8  0x000055555577c1d8 in print_counter (config=0x555555e077c0 <stat_config>, counter=0x555555f289d0, os=0x7fffffffbf10) at util/stat-display.c:957
  #9  0x000055555577dba0 in evlist__print_counters (evlist=0x555555ec3610, config=0x555555e077c0 <stat_config>, _target=0x555555e01c80 <target>, ts=0x0, argc=1, argv=0x7fffffffe450) at util/stat-display.c:1413
  #10 0x00005555555fc821 in print_counters (ts=0x0, argc=1, argv=0x7fffffffe450) at builtin-stat.c:1040
  #11 0x000055555560091a in cmd_stat (argc=1, argv=0x7fffffffe450) at builtin-stat.c:2665
  #12 0x00005555556b1eea in run_builtin (p=0x555555e11f70 <commands+336>, argc=4, argv=0x7fffffffe450) at perf.c:322
  #13 0x00005555556b2181 in handle_internal_command (argc=4, argv=0x7fffffffe450) at perf.c:376
  #14 0x00005555556b22d7 in run_argv (argcp=0x7fffffffe27c, argv=0x7fffffffe270) at perf.c:420
  #15 0x00005555556b26ef in main (argc=4, argv=0x7fffffffe450) at perf.c:550
  (gdb)

Fixes: f123b2d ("perf stat: Remove prefix argument in print_metric_headers()")
Signed-off-by: Ian Rogers <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Athira Jajeev <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: James Clark <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Xing Zhengjun <[email protected]>
Link: http:https://lore.kernel.org/lkml/CAP-5=fUOjSM5HajU9TCD6prY39LbX4OQbkEbtKPPGRBPBN=_VQ@mail.gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 11, 2023
With latest clang18, I hit test_progs failures for the following test:

  #13/2    bpf_cookie/multi_kprobe_link_api:FAIL
  #13/3    bpf_cookie/multi_kprobe_attach_api:FAIL
  #13      bpf_cookie:FAIL
  #75      fentry_fexit:FAIL
  #76/1    fentry_test/fentry:FAIL
  #76      fentry_test:FAIL
  #80/1    fexit_test/fexit:FAIL
  #80      fexit_test:FAIL
  #110/1   kprobe_multi_test/skel_api:FAIL
  #110/2   kprobe_multi_test/link_api_addrs:FAIL
  #110/3   kprobe_multi_test/link_api_syms:FAIL
  #110/4   kprobe_multi_test/attach_api_pattern:FAIL
  #110/5   kprobe_multi_test/attach_api_addrs:FAIL
  #110/6   kprobe_multi_test/attach_api_syms:FAIL
  #110     kprobe_multi_test:FAIL

For example, for #13/2, the error messages are:

  [...]
  kprobe_multi_test_run:FAIL:kprobe_test7_result unexpected kprobe_test7_result: actual 0 != expected 1
  [...]
  kprobe_multi_test_run:FAIL:kretprobe_test7_result unexpected kretprobe_test7_result: actual 0 != expected 1

clang17 does not have this issue.

Further investigation shows that kernel func bpf_fentry_test7(), used in
the above tests, is inlined by the compiler although it is marked as
noinline.

  int noinline bpf_fentry_test7(struct bpf_fentry_test_t *arg)
  {
        return (long)arg;
  }

It is known that for simple functions like the above (e.g. just returning
a constant or an input argument), the clang compiler may still do inlining
for a noinline function. Adding 'asm volatile ("")' in the beginning of the
bpf_fentry_test7() can prevent inlining.

Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Tested-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
stephensmalley pushed a commit to stephensmalley/selinux-kernel that referenced this issue May 2, 2024
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.

net_ratelimit mechanism can be used to limit the dumping rate.

PID: 33036    TASK: ffff949da6f20000  CPU: 23   COMMAND: "vhost-32980"
 #0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
 SELinuxProject#1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
 SELinuxProject#2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
 SELinuxProject#3 [fffffe00003fced0] do_nmi at ffffffff8922660d
 SELinuxProject#4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
    [exception RIP: io_serial_in+20]
    RIP: ffffffff89792594  RSP: ffffa655314979e8  RFLAGS: 00000002
    RAX: ffffffff89792500  RBX: ffffffff8af428a0  RCX: 0000000000000000
    RDX: 00000000000003fd  RSI: 0000000000000005  RDI: ffffffff8af428a0
    RBP: 0000000000002710   R8: 0000000000000004   R9: 000000000000000f
    R10: 0000000000000000  R11: ffffffff8acbf64f  R12: 0000000000000020
    R13: ffffffff8acbf698  R14: 0000000000000058  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 SELinuxProject#5 [ffffa655314979e8] io_serial_in at ffffffff89792594
 SELinuxProject#6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
 SELinuxProject#7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
 SELinuxProject#8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
 SELinuxProject#9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
 SELinuxProject#10 [ffffa65531497ac8] console_unlock at ffffffff89316124
 SELinuxProject#11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
 SELinuxProject#12 [ffffa65531497b68] printk at ffffffff89318306
 SELinuxProject#13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
 SELinuxProject#14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
 SELinuxProject#15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
 SELinuxProject#16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
 SELinuxProject#17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
 SELinuxProject#18 [ffffa65531497f10] kthread at ffffffff892d2e72
 SELinuxProject#19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f

Fixes: ef3db4a ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants