libct/nsenter: namespace the bindfd shuffle #3599

corhere · 2022-09-09T22:48:40Z

Processes can watch /proc/self/mounts or /mountinfo, and the kernel will notify them whenever the namespace's mount table is modified. The notified process still needs to read and parse the mountinfo to determine what changed once notified. Many such processes, including udisksd and systemd < v248, make no attempt to rate-limit their mountinfo notifications. This tends to not be a problem on many systems, where mount tables are small and mounting and unmounting is uncommon. Every runC exec which successfully uses the try_bindfd container-escape mitigation performs two mount()s and one umount() in the host's mount namespace, causing any mount-watching processes to wake up and parse the mountinfo file three times in a row. Consequently, using 'exec' health checks on containers has a larger-than-expected impact on system load when such mount-watching daemons are running. Furthermore, the size of the mount table in the host's mount namespace tends to be proportional to the number of OCI containers as a unique mount is required for the rootfs of each container. Therefore, on systems with mount-watching processes, the system load increases quadratically with the number of running containers which use health checks!

Prevent runC from incidentally modifying the host's mount namespace for container-escape mitigations by setting up the mitigation in a private mount namespace.

libcontainer/nsenter/cloned_binary.c

kolyshkin · 2022-09-10T01:45:04Z

Looks good overall, and it's amazing that it works! Left some nits.

libcontainer/nsenter/clone.c

libcontainer/nsenter/cloned_binary.c

libcontainer/nsenter/ipc.h

kolyshkin

LGTM, and thanks!

kolyshkin · 2022-09-28T22:12:32Z

@AkihiroSuda @cyphar PTAL

cyphar · 2022-10-06T05:07:28Z

libcontainer/nsenter/cloned_binary.c

+ /*
+ * The kernel refuses to bind-mount from the magic symlink when the
+ * process has been cloned (or unshared) into a new mount namespace,
+ * at least on Linux 4.4.
+ */
+ char linkbuf[PATH_MAX + 1] = { 0 };
+ ssize_t linkpathlen = readlink("/proc/self/exe", linkbuf, sizeof(linkbuf));


If we cannot use /proc/self/exe (or /proc/$runc_pid/exe) then we can't really do this. It's not safe to be operating on random paths that you get from resolving /proc/self/exe (and more importantly, this won't be doable once we start work on porting runc to libpathrs). (Obviously this is running "on the host" but I just don't feel comfortable with resolving magiclinks because they aren't symlinks at all and any similarity to symlinks is a very leaky illusion.)

However, I'm not sure it's actually true that you cannot use the magiclink as a mount source -- based on a quick test with unshare (Linux 5.19.10) it works without issue and I'm not sure what would've caused it to fail in older kernel versions (nd_jump_link really doesn't care about what you're doing with the path and the mount infrastructure doesn't care about the source IIRC). You probably cannot use it as a mount target but that shouldn't be an issue. What problem did you run into when using it as a mount source?

On both Linux 4.4.0-1122-aws and Linux 5.15.0-33-generic (Ubuntu) any attempts to bind-mount /proc/self/exe in an unshared mount namespace yield EINVAL.

ubuntu@ubuntu2204:~$ sudo strace -f -e trace=mount ./runc run --bundle bundle foobar strace: Process 62293 attached strace: Process 62292 attached strace: Process 62294 attached strace: Process 62295 attached strace: Process 62296 attached strace: Process 62298 attached [pid 62298] mount("none", "/", NULL, MS_REC|MS_SLAVE, NULL) = 0 [pid 62298] mount("/proc/self/exe", "/run/runc/foobar/runc.G52pod", 0x563c345027e2, MS_BIND, 0x563c345027e2) = -1 EINVAL (Invalid argument) [pid 62298] +++ exited with 22 +++ [...]

mount(8) canonicalizes the source path with realpath(3) which is why it appeared to work in your test.

ubuntu@ubuntu2204:~$ sudo unshare --mount root@ubuntu2204:/home/ubuntu# strace -e trace=readlink,mount mount --bind /proc/self/exe selfexe readlink("/proc", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument) readlink("/proc/self", "62403", 1023) = 5 readlink("/proc/62403", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument) readlink("/proc/62403/exe", "/usr/bin/mount", 1023) = 14 readlink("/usr", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument) readlink("/usr/bin", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument) readlink("/usr/bin/mount", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument) readlink("/home/ubuntu/selfexe", 0x7fff7a3b6420, 1023) = -1 EINVAL (Invalid argument) mount("/usr/bin/mount", "/home/ubuntu/selfexe", 0x563f8a351840, MS_BIND, NULL) = 0 +++ exited with 0 +++

Minimal reproducer:

#define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <sched.h> #include <sys/mount.h> int main(int argc, char** argv) { if (argc > 1) { puts("Unsharing mount namespace..."); if (unshare(CLONE_NEWNS) < 0) { perror("unshare"); exit(EXIT_FAILURE); } } if (mount("/proc/self/exe", "selfexe", "", MS_BIND, "") < 0) { perror("mount"); exit(EXIT_FAILURE); } int ret = system("/bin/sh"); if (ret < 0) { perror("system"); exit(EXIT_FAILURE); } return ret; }

$ gcc main.c && touch selfexe && sudo ./a.out unshare Unsharing mount namespace... mount: Invalid argument $ sudo ./a.out # exit $ sudo umount selfexe

Indirecting the mount through /proc/self/fd/<n> also yields EINVAL, regardless of whether the file descriptor to /proc/self/exe was opened before or after the mount namespace was unshared, with or without O_PATH. My best guess of what's happening is that /proc/self/exe resolves to a struct path value which, being magic, has its mnt->mnt_ns set to the parent mount namespace, which trips the check_mnt() validation here. If my interpretation is correct, there is no way to successfully bind-mount a process' executable across namespaces without going through path resolution machinery in the process' current mount namespace.

Would you feel more comfortable if there was more assurance that the path resolved to the same file as /proc/self/exe @cyphar? Here's what I'm thinking:

open(O_PATH) both /proc/self/exe and realpath("/proc/self/exe")

Check that the file descriptors refer to the same file by fstat'ing both and testing whether they have the same st_dev and st_ino values

Bind-mount the /proc/self/fd/<n> for the realpath fd

@cyphar WDYT? ^^^

If readlink(/proc/self/exe) is considered to be unsafe, my naive approach would be to use os.Args[0], with or without additional checks using fstat, kcmp with KCMP_FILE etc.

Yes, you were both quite right -- I had forgotten about the annoying way mount(8) handles paths and that (of course) since we are creating a mountns you would run afoul of the check_mnt checks for sources.

The future libpathrs issue is still a problem but I think we need to drop this bindfd code before then anyway, so we can cross that bridge if we ever get to it.

The current stat-based approach is okay. I still am not the biggest fan of all of this, but to fix it completely requires the safe procfs stuff I'm working on with libpathrs, and current runc is also vulnerable to the same issues.

libcontainer/nsenter/cloned_binary.c

changweige · 2023-05-11T02:24:10Z

@cyphar ptal would be good to see this moving 🙏

+1
@cyphar I personally feel like this patch is a strong enhancement and benefits users with large-scale K8s deployment. So PTAL. Thanks

AkihiroSuda · 2023-05-11T02:26:43Z

Needs rebase

adamqqqplay · 2023-05-11T03:13:09Z

This issue is also affecting us, if it can be fixed we will be very grateful.

neersighted · 2023-05-30T13:21:56Z

This should help with docker/for-linux#679, as my (currently untested) recollection is that systemd doesn't log mounts outside the root namespace.

corhere · 2023-05-30T20:25:20Z

This PR has languished for far too long without any feedback from the maintainer who requested changes, even after I made changes which directly address the reason that changes were requested. @crosbymichael could you please take a look?

LastNight1997 · 2023-06-12T08:13:40Z

anything update? we are strongly looking forward to this.

Hunter1016 · 2023-06-14T02:10:15Z

libcontainer/nsenter/cloned_binary.c

+ * save an unnecessary context switch as we'd just be
+ * waiting for the child process to exit anyway.
+ */
+ CLONE_NEWNS | CLONE_VFORK, (void *)args);


In my tests, significant sys cpu usage by runc processes is observed, which seems come out of creating and removing a new mountns each time. Especially on removing, fierce lock competition appeared on a global seqlock, mount_lock, which is used to protect changes to the vfsmount hash. Related kernel stack is as follows.

lock_mount_hash
mntput_no_expire
mntput
namespace_unlock
drop_collected_mounts
put_mnt_ns
free_nsproxy
switch_task_namespaces
exit_task_namespaces
do_exit

@thaJeztah this maybe a serious problem need to take into consideration. the overall impact maybe even higher.

Unfortunately the only alternative is to remove bindfd (which I would like to do!) but will probably have to wait until runc 1.3 since it could cause regressions. The cost being described is the cost of creating and destroying a new mount namespace and there really isn't any way to avoid the lock contention -- though it should be noted that the previous setup would've also caused contention on mount_lock (it was creating a mount, after all).

@cyphar what about instead of create/destroy the mountns everytime exec executed, create one long running mountns once containerd started and paas the handle of mountns to runc when it was forked, the benefit of this approach is

no need to create/destroy mountns everytime

the handle of mountns is optional, it's better for backward compatibility

I encountered a similar issue, and this PR did not solve my problem; on the contrary, from a black-box perspective, it has exacerbated my problem. My scenario is to deploy 100 pods on a single node, each with a container.

Before the modification, due to the high CPU usage of systemd, a large number of pods failed to deploy. However, there is a chance that the pods can successfully restart. After the merger of this PR, the CPU usage of systemd indeed decreased. But the hang up phenomenon still exists.

A large number of pods still fail to deploy, and retries also fail. Not using try_bindfd but using memfd works very well, all pods are deployed successfully for the first time. Even deploying 170 pods at the same time will still succeed.

I think try_bindfd should die and be replaced with just the memfd code.

@cyphar ya, you have your point, thanks for clarify, maybe the best option at this moment is to avoid exec probe for massive pods

@cyphar ya, you have your point, thanks for clarify, maybe the best option at this moment is to avoid exec probe for massive pods
For services migrating from virtual machines to containers, changing the probe from exec mode to non-exec mode would be a substantial amount of work.

@113xiaoji yes, we have the same problem here, I have tried another option to setup a new mountns and "move the containerd and kubelet into the mountns", you can refer to this article https://docs.openshift.com/container-platform/4.12/scalability_and_performance/optimization/optimizing-cpu-usage.html

@cyphar If our situation is rather urgent and we cannot wait for the 1.3.0 version, we are considering releasing our own version, in which we will eliminate the try_bindfd related logic to fit our business scenario. We would like to hear your thoughts on this. As for memory overhead, could you detail how it works? Does it only consume the host's memory, or does it also consume the container's memory? Does each invocation of runc init result in memory consumption? Besides memory overhead, do you foresee any other serious issues that we should be aware of?

It contributes to the container's memory accounting limit, but only while the runc binary is actually running -- the runc process is generally very short-lived (it only exists when it needs to do container operations, once a container process is spawned it disappears -- at least, this is how it works when you use it in the context of Docker/cri-o/containerd/Kubernetes). Each invocation of runc causes a ~10MB (the size of the runc binary) memory allocation in whatever container's cgroup is associated with that runc invocation.

The patch to remove try_bindfd is very trivial, you can see the patch in #3509 (I discovered some kernel bugs in memfd_create when I was working on this, I was going to send it this morning...)

The thing is, creating a container involves doing many mounts already, so I'm still a little surprised that this one mount is causing so many performance issues. I suspect that the the host bind-mount would be okay if it weren't for systemd ingesting every mount event on the system, but adding the short-lived namespace might be triggering a different contention issue.

cyphar

I think try_bindfd should die and be replaced with just the memfd code (which eventually will no longer be needed when I finish working on the magic-link hardening work to block these re-opening attacks -- though we will need to keep it around for a long while...).

However, looking at this PR again, it probably makes sense to carry this (even if we end up dropping bindfd in runc 1.3.0). The extra CPU cost of creating a new mount namespace for such a trivial task kind of sucks, but them's the breaks (you could even argue this makes it clearer that we should drop bindfd sooner).

Sadly the proposal I had a while ago for removing the cost of memfds doesn't work (I figured out this out a while ago but it's probably a good idea to write that down somewhere) -- you cannot bind-mount memfds at all because they are all in the MNT_NS_INTERNAL namespace.

...from nsexec.c so they can be used in cloned_binary.c. Signed-off-by: Cory Snider <[email protected]>

...so the compiler can warn about mismatches between the format string and varargs. Signed-off-by: Cory Snider <[email protected]>

Modify receive_fd() and send_fd() so they can be more readily reused in cloned_binary.c. Change receive_fd() to have a single responsibility: receiving and returning a single file descriptor over a UNIX domain socket. Make send_fd() useable in precarious execution contexts such as a clone(CLONE_VFORK|CLONE_VM) "thread" where allocating heap memory or calling exit() would be dangerous. Signed-off-by: Cory Snider <[email protected]>

Signed-off-by: Cory Snider <[email protected]>

Processes can watch /proc/self/mounts or /mountinfo, and the kernel will notify them whenever the namespace's mount table is modified. The notified process still needs to read and parse the mountinfo to determine what changed once notified. Many such processes, including udisksd and SystemD < v248, make no attempt to rate-limit their mountinfo notifications. This tends to not be a problem on many systems, where mount tables are small and mounting and unmounting is uncommon. Every runC exec which successfully uses the try_bindfd container-escape mitigation performs two mount()s and one umount() in the host's mount namespace, causing any mount-watching processes to wake up and parse the mountinfo file three times in a row. Consequently, using 'exec' health checks on containers has a larger-than-expected impact on system load when such mount-watching daemons are running. Furthermore, the size of the mount table in the host's mount namespace tends to be proportional to the number of OCI containers as a unique mount is required for the rootfs of each container. Therefore, on systems with mount-watching processes, the system load increases *quadratically* with the number of running containers which use health checks! Prevent runC from incidentally modifying the host's mount namespace for container-escape mitigations by setting up the mitigation in a temporary mount namespace. Signed-off-by: Cory Snider <[email protected]>

113xiaoji · 2023-07-06T08:49:12Z

try_bindfd

I think try_bindfd should die and be replaced with just the memfd code (which eventually will no longer be needed when I finish working on the magic-link hardening work to block these re-opening attacks -- though we will need to keep it around for a long while...).

However, looking at this PR again, it probably makes sense to carry this (even if we end up dropping bindfd in runc 1.3.0). The extra CPU cost of creating a new mount namespace for such a trivial task kind of sucks, but them's the breaks (you could even argue this makes it clearer that we should drop bindfd sooner).

Sadly the proposal I had a while ago for removing the cost of memfds doesn't work (I figured out this out a while ago but it's probably a good idea to write that down somewhere) -- you cannot bind-mount memfds at all because they are all in the MNT_NS_INTERNAL namespace.

cannot agree more

corhere changed the title ~~libct/nsenter: namespace the bundfd shuffle~~ libct/nsenter: namespace the bindfd shuffle Sep 9, 2022

corhere force-pushed the bindfd-in-private branch 2 times, most recently from be35225 to 2e73e34 Compare September 9, 2022 23:03

kolyshkin reviewed Sep 10, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

kolyshkin reviewed Sep 10, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

kolyshkin reviewed Sep 10, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

kolyshkin reviewed Sep 10, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

kolyshkin reviewed Sep 10, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Show resolved Hide resolved

kolyshkin reviewed Sep 10, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

kolyshkin added this to the 1.2.0 milestone Sep 10, 2022

AkihiroSuda reviewed Sep 11, 2022

View reviewed changes

libcontainer/nsenter/clone.c Outdated Show resolved Hide resolved

AkihiroSuda reviewed Sep 11, 2022

View reviewed changes

libcontainer/nsenter/clone.c Outdated Show resolved Hide resolved

corhere force-pushed the bindfd-in-private branch 2 times, most recently from 703a133 to 6621d55 Compare September 12, 2022 22:31

corhere commented Sep 12, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

AkihiroSuda previously approved these changes Sep 12, 2022

View reviewed changes

kolyshkin reviewed Sep 14, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

kolyshkin reviewed Sep 14, 2022

View reviewed changes

libcontainer/nsenter/cloned_binary.c Outdated Show resolved Hide resolved

kolyshkin reviewed Sep 14, 2022

View reviewed changes

libcontainer/nsenter/ipc.h Outdated Show resolved Hide resolved

corhere dismissed AkihiroSuda’s stale review via 121002c September 15, 2022 20:33

corhere force-pushed the bindfd-in-private branch from 6621d55 to 121002c Compare September 15, 2022 20:33

kolyshkin previously approved these changes Sep 21, 2022

View reviewed changes

kolyshkin mentioned this pull request Sep 28, 2022

mount: add enhanced mount functionality to support run container in userns with host network #3613

Open

cyphar requested changes Oct 6, 2022

View reviewed changes

corhere dismissed kolyshkin’s stale review via ee114af October 25, 2022 20:50

corhere force-pushed the bindfd-in-private branch 2 times, most recently from ee114af to e2b1a5a Compare October 25, 2022 20:57

corhere requested a review from cyphar October 26, 2022 16:33

corhere force-pushed the bindfd-in-private branch from 2ea2444 to a0d8203 Compare May 30, 2023 20:15

AkihiroSuda approved these changes Jun 12, 2023

View reviewed changes

AkihiroSuda enabled auto-merge June 12, 2023 08:48

AkihiroSuda requested a review from kolyshkin June 12, 2023 08:48

AkihiroSuda disabled auto-merge June 12, 2023 08:48

AkihiroSuda requested a review from thaJeztah June 12, 2023 08:49

Hunter1016 reviewed Jun 14, 2023

View reviewed changes

cyphar mentioned this pull request Jul 4, 2023

ro-bind /proc/self/exe before copying cause systemd's cpu extremely high #3925

Closed

cyphar approved these changes Jul 4, 2023

View reviewed changes

corhere added 5 commits July 4, 2023 17:56

chore(libct/nsenter): extract utility code

35fddfd

...from nsexec.c so they can be used in cloned_binary.c. Signed-off-by: Cory Snider <[email protected]>

libct/nsenter: annotate write_log() prototype

890dcee

...so the compiler can warn about mismatches between the format string and varargs. Signed-off-by: Cory Snider <[email protected]>

libct/nsenter: set FD_CLOEXEC on received fd

3b191ff

Signed-off-by: Cory Snider <[email protected]>

cyphar force-pushed the bindfd-in-private branch from a0d8203 to 017d699 Compare July 4, 2023 07:56

cyphar closed this in 35eff7c Jul 4, 2023

cyphar merged commit 35eff7c into opencontainers:main Jul 4, 2023
36 checks passed

neersighted mentioned this pull request Jul 4, 2023

systemd logs filled with mount unit entries if healtcheck is enabled docker/for-linux#679

Closed

corhere deleted the bindfd-in-private branch July 5, 2023 00:36

rata mentioned this pull request Jul 5, 2023

Support idmap mounts for volumes #3717

Merged

5 tasks

This was referenced Aug 5, 2023

runc: release 1.2.0-rc.1 #3963

Closed

Consider revert #3931 nsexec: cloned_binary: remove bindfd logic entirely #3973

Closed

cyphar mentioned this pull request Mar 14, 2024

release 1.2.0-rc.1 #4221

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libct/nsenter: namespace the bindfd shuffle #3599

libct/nsenter: namespace the bindfd shuffle #3599

corhere commented Sep 9, 2022 •

edited

Loading

kolyshkin commented Sep 10, 2022

kolyshkin left a comment

kolyshkin commented Sep 28, 2022

cyphar Oct 6, 2022

corhere Oct 6, 2022

kolyshkin Oct 17, 2022

cyphar Jul 4, 2023 •

edited

Loading

changweige commented May 11, 2023

AkihiroSuda commented May 11, 2023

adamqqqplay commented May 11, 2023

neersighted commented May 30, 2023 •

edited

Loading

corhere commented May 30, 2023

LastNight1997 commented Jun 12, 2023

Hunter1016 Jun 14, 2023

xujihui1985 Jul 5, 2023

cyphar Jul 5, 2023 •

edited

Loading

xujihui1985 Jul 5, 2023 •

edited

Loading

113xiaoji Jul 6, 2023

xujihui1985 Jul 7, 2023 •

edited

Loading

113xiaoji Jul 7, 2023

xujihui1985 Jul 7, 2023

113xiaoji Jul 7, 2023 •

edited

Loading

cyphar Jul 7, 2023

cyphar left a comment

113xiaoji commented Jul 6, 2023

libct/nsenter: namespace the bindfd shuffle #3599

libct/nsenter: namespace the bindfd shuffle #3599

Conversation

corhere commented Sep 9, 2022 • edited Loading

kolyshkin commented Sep 10, 2022

kolyshkin left a comment

Choose a reason for hiding this comment

kolyshkin commented Sep 28, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyphar Jul 4, 2023 • edited Loading

Choose a reason for hiding this comment

changweige commented May 11, 2023

AkihiroSuda commented May 11, 2023

adamqqqplay commented May 11, 2023

neersighted commented May 30, 2023 • edited Loading

corhere commented May 30, 2023

LastNight1997 commented Jun 12, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyphar Jul 5, 2023 • edited Loading

Choose a reason for hiding this comment

xujihui1985 Jul 5, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xujihui1985 Jul 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

113xiaoji Jul 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyphar left a comment

Choose a reason for hiding this comment

113xiaoji commented Jul 6, 2023

corhere commented Sep 9, 2022 •

edited

Loading

cyphar Jul 4, 2023 •

edited

Loading

neersighted commented May 30, 2023 •

edited

Loading

cyphar Jul 5, 2023 •

edited

Loading

xujihui1985 Jul 5, 2023 •

edited

Loading

xujihui1985 Jul 7, 2023 •

edited

Loading

113xiaoji Jul 7, 2023 •

edited

Loading