-
Notifications
You must be signed in to change notification settings - Fork 18.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[20.10 backport] seccomp: add support for "clone3" syscall in default policy #42836
Conversation
Doh, |
hmmmm yes. ISTR there was a recent change in runc related to that (at least, I recall a PR about ignoring defaults) |
Yup; fix for that is in v1.0.2. But that seems to indicate we may not need this patch / backport? opencontainers/runc#3173 |
Now that I'm reading that error message with a fresh eye, I realize it's complaining that |
Scratch that; looking at the implementation there, I see it does take |
eee326b
to
0a709f7
Compare
I think I found what was wrong. 🤞 |
0a709f7
to
df18e28
Compare
Got it -- the only failure on the last run was |
FWIW, @lucaskanashiro tested this and reported in https://bugs.launchpad.net/cloud-images/+bug/1943049/comments/25 that it does fix the problem 😄 |
This is a backport of 9f6b562, adapted to avoid the refactoring that happened in d927397. Original commit message is as follows: > If no seccomp policy is requested, then the built-in default policy in > dockerd applies. This has no rule for "clone3" defined, nor any default > errno defined. So when runc receives the config it attempts to determine > a default errno, using logic defined in its commit: > > opencontainers/runc@7a8d716 > > As explained in the above commit message, runc uses a heuristic to > decide which errno to return by default: > > [quote] > The solution applied here is to prepend a "stub" filter which returns > -ENOSYS if the requested syscall has a larger syscall number than any > syscall mentioned in the filter. The reason for this specific rule is > that syscall numbers are (roughly) allocated sequentially and thus newer > syscalls will (usually) have a larger syscall number -- thus causing our > filters to produce -ENOSYS if the filter was written before the syscall > existed. > [/quote] > > Unfortunately clone3 appears to one of the edge cases that does not > result in use of ENOSYS, instead ending up with the historical EPERM > errno. > > Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use > clone3 by default. If it sees ENOSYS then it will automatically > fallback to using clone. Any other errno is treated as a fatal > error. Thus when docker seccomp policy triggers EPERM from clone3, > no fallback occurs and programs are thus unable to spawn threads. > > The clone3 syscall is much more complicated than clone, most notably its > flags are not exposed as a directly argument any more. Instead they are > hidden inside a struct. This means that seccomp filters are unable to > apply policy based on values seen in flags. Thus we can't directly > replicate the current "clone" filtering for "clone3". We can at least > ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone" > at which point we can filter on flags. Signed-off-by: Tianon Gravi <[email protected]> Co-authored-by: Daniel P. Berrangé <[email protected]>
df18e28
to
567c01f
Compare
Re-reviewing this, it felt silly to still have the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Hi, Can you please consider merging this change and expediting a release containing the fix? This clone3 issue has caused a lot of problems over the past couple months. Thanks you for your work. I apologize if I'm being too pushy. -- Maxwell 😀 |
Following commit migrates CI to latest stable `f35`. * Make sure we are using latest docker: Since support for `clone3` was added in default seccomp profile was backported via moby/moby#42836. * Bump podman to `v3.4.0` for `f35` * New versions of `clang-check` adds a check for `-Wunused-but-set-variable` so mark variables with `__attribute__ ((unused))` where its needed. Signed-off-by: Aditya Rajan <[email protected]>
Following commit migrates CI to latest stable `f35`. * Make sure we are using latest docker: Since support for `clone3` was added in default seccomp profile was backported via moby/moby#42836. * Bump podman to `v3.4.0` for `f35` and skip newly added tests which are not supported in this env. * New versions of `clang-check` adds a check for `-Wunused-but-set-variable` so mark variables with `__attribute__ ((unused))` where its needed. Signed-off-by: Aditya Rajan <[email protected]>
Following commit migrates CI to latest stable `f35`. * Make sure we are using latest docker: Since support for `clone3` was added in default seccomp profile was backported via moby/moby#42836. * Bump podman to `v3.4.0` for `f35` and skip newly added tests which are not supported in this env. Signed-off-by: Aditya Rajan <[email protected]>
@zhsj brought https://buildd.debian.org/status/fetch.php?pkg=docker.io&arch=mips64el&ver=20.10.10%2Bdfsg1-1&stamp=1636015943&raw=0 to my attention in which
I think this test will probably fail on MIPS arches on the version that made it into master, too (for the same reason). Edit: I've raised this further up the chain at opencontainers/runtime-spec#1087 (comment) now |
On the plus side, at least this only affects MIPS* and sparc64? 😬 $ git grep -inE 'ENOSYS[[:space:]]+=' | grep linux | column -t
unix/zerrors_linux_386.go:588: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_amd64.go:588: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_arm.go:594: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_arm64.go:585: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_mips.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_mips64.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_mips64le.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_mipsle.go:591: ENOSYS = syscall.Errno(0x59)
unix/zerrors_linux_ppc.go:646: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_ppc64.go:650: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_ppc64le.go:650: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_riscv64.go:575: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_s390x.go:650: ENOSYS = syscall.Errno(0x26)
unix/zerrors_linux_sparc64.go:639: ENOSYS = syscall.Errno(0x5a) Edit: I've raised this further up the chain at opencontainers/runtime-spec#1087 (comment) now |
Ahm.. yes, so we could adjust the test and use a different moby/profiles/seccomp/default_linux.go Lines 613 to 619 in 2b70006
|
But I guess the "portable" way would be to use a string for this, and have the conversion done by the runtime that sets the actual profile |
…MIPS The value of ENOSYS differs between MIPS and non-MIPS architectures. While this is not problematic for the embedded seccomp profile, it prevents the profile from being saved as a portable JSON file that can be used for both architectures. To work around this situation, we include conditional rules for both arches. and hard-code the value for ENOSYS in both. For more details, refer to moby#42836 (comment) and opencontainers/runtime-spec#1087 (comment) A test was added, which can be run with: go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/ === RUN TestLoadConditionalClone3 === RUN TestLoadConditionalClone3/clone3_default_amd64 === RUN TestLoadConditionalClone3/clone3_default_mips === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_mips --- PASS: TestLoadConditionalClone3 (0.01s) --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s) PASS ok github.com/docker/docker/profiles/seccomp 0.015s Signed-off-by: Sebastiaan van Stijn <[email protected]>
…MIPS The value of ENOSYS differs between MIPS and non-MIPS architectures. While this is not problematic for the embedded seccomp profile, it prevents the profile from being saved as a portable JSON file that can be used for both architectures. To work around this situation, we include conditional rules for both arches. and hard-code the value for ENOSYS in both. For more details, refer to moby#42836 (comment) and opencontainers/runtime-spec#1087 (comment) A test was added, which can be run with: go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/ === RUN TestLoadConditionalClone3 === RUN TestLoadConditionalClone3/clone3_default_amd64 === RUN TestLoadConditionalClone3/clone3_default_mips === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 === RUN TestLoadConditionalClone3/clone3_cap_sys_admin_mips --- PASS: TestLoadConditionalClone3 (0.01s) --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s) --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s) PASS ok github.com/docker/docker/profiles/seccomp 0.015s Signed-off-by: Sebastiaan van Stijn <[email protected]>
Fedora containers fail with QEMU 6.1.0 and Docker 20.10.9. It can be worked around with option '--security-opt seccomp=unconfined', but that is not allowed on GitHub Actions. Anyway, it is fixed in Docker 20.10.10. * moby/moby#42963 * moby/moby#42836
glibc 2.34 uses the clone3 syscall, which is not part of the seccomp filters that moby ships on older versions. While as a workaround you might be able to run containers with `--privileged`, it's the better call to just run a more recent Docker runtime. References: - docker/buildx#772 - moby/buildkit#2379 - moby/moby#42836 - NixOS/nixpkgs#170900
glibc 2.34 uses the clone3 syscall, which is not part of the seccomp filters that moby ships on older versions. While as a workaround you might be able to run containers with `--privileged`, it's the better call to just run a more recent Docker runtime. References: - docker/buildx#772 - moby/buildkit#2379 - moby/moby#42836 - NixOS/nixpkgs#170900
This is a backport of 9f6b562 (#42681, see also #42680), adapted to avoid the refactoring that happened in d927397 (#42005).
fixes #42963
fixes #42876