seccomp filter breaks latest glibc (in fedora rawhide) by blocking clone3 with EPERM #42680

Closed
berrange opened this issue Jul 27, 2021 · 2 comments · Fixed by #42681

Description
I have a docker built with seccomp running on a Fedora 35 host. Attempting to run commands inside a container with the registry.fedoraproject.org/fedora:rawhide image results in programs failing to spawn threads.

eg

$ docker run -it registry.fedoraproject.org/fedora:rawhide  curl google.com
curl: (6) getaddrinfo() thread failed to start

Tracing the container "curl" process I can see

clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7f000ec6d910, parent_tid=0x7f000ec6d910, exit_signal=0, stack=0x7f000e46d000, stack_size=0x7ffe00, tls=0x7f000ec6d640}, 88) = -1 EPERM (Operation not permitted)
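
For readers unfamiliar with the call shape in this trace: clone3 takes a pointer to an argument struct plus that struct's size (the trailing 88), so the flags live in memory rather than in a syscall argument register. The following is a minimal Python sketch of that layout, assuming the field order of `struct clone_args` from `<linux/sched.h>`; the helper names are hypothetical and nothing here issues a real syscall.

```python
import struct

# Field order of struct clone_args as declared in <linux/sched.h>;
# every field is a 64-bit unsigned integer.
CLONE_ARGS_FIELDS = (
    "flags", "pidfd", "child_tid", "parent_tid", "exit_signal",
    "stack", "stack_size", "tls", "set_tid", "set_tid_size", "cgroup",
)
CLONE_ARGS_FORMAT = "=" + "Q" * len(CLONE_ARGS_FIELDS)

# Flag values from <linux/sched.h>.
CLONE_VM = 0x00000100
CLONE_THREAD = 0x00010000


def pack_clone_args(**values):
    """Pack keyword arguments into the byte layout of struct clone_args."""
    return struct.pack(
        CLONE_ARGS_FORMAT,
        *(values.get(name, 0) for name in CLONE_ARGS_FIELDS),
    )


def flags_from_memory(buf):
    """Read the flags field back; this requires access to the struct's memory."""
    return struct.unpack_from("=Q", buf, 0)[0]


args = pack_clone_args(flags=CLONE_VM | CLONE_THREAD, stack_size=0x7FFE00)
print(len(args))  # the size value passed as the second syscall argument
print(hex(flags_from_memory(args)))
```

A seccomp filter is only given the syscall number and the raw argument values (here a pointer and a size), and it cannot follow the pointer, which is why a filter cannot read the flags of a clone3 call.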

The latest glibc now attempts to use 'clone3' by default. For backwards compatibility it checks for the ENOSYS errno and falls back to "clone". Any other errno, including EPERM, is treated as a fatal error.

The default seccomp filter installed by docker is causing EPERM and so this breaks the glibc fallback.
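
The fallback behaviour described above can be illustrated with a small Python simulation. This is only a sketch of the decision logic as described in this report, not glibc's actual code; `create_thread` and `failing_with` are hypothetical names, and the "syscalls" are plain Python callables.

```python
import errno
import os


def failing_with(code):
    """Build a stand-in syscall that always fails with the given errno."""
    def fake_syscall():
        raise OSError(code, os.strerror(code))
    return fake_syscall


def create_thread(clone3, clone):
    """Try clone3 first; fall back to clone only when clone3 reports ENOSYS."""
    try:
        return clone3()
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            raise  # any other errno (for example EPERM) is fatal, no fallback
    return clone()


# ENOSYS from clone3: the caller silently falls back to clone.
print(create_thread(failing_with(errno.ENOSYS), lambda: "thread via clone"))

# EPERM from clone3: the error propagates and no thread is created.
try:
    create_thread(failing_with(errno.EPERM), lambda: "thread via clone")
except OSError as exc:
    print("fatal:", errno.errorcode[exc.errno])
```

Under this model a filter that answers clone3 with EPERM gives the caller no signal that an older interface might still work, which matches the curl failure shown earlier.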

Explicitly passing the default seccomp profile config makes it work, despite not allowing clone3

$ wget https://raw.githubusercontent.com/docker/labs/master/security/seccomp/seccomp-profiles/default.json -O profile2.json
$ docker run --security-opt seccomp=profile2.json -it registry.fedoraproject.org/fedora:rawhide  curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
..snip...

Tracing again shows clone3 now returns ENOSYS

clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7f098bf8a910, parent_tid=0x7f098bf8a910, exit_signal=0, stack=0x7f098b78a000, stack_size=0x7ffe00, tls=0x7f098bf8a640}, 88) = -1 ENOSYS (Function not implemented)

I expect this difference in behaviour is as a result of the heuristics implemented for choosing EPERM vs ENOSYS in runc with opencontainers/runc@7a8d716
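
To make the suspected heuristic concrete, here is a rough Python sketch of the rule as the runc commit message states it: ENOSYS is returned only when the requested syscall number is larger than every syscall number mentioned in the filter. The syscall numbers are the x86_64 values; the two profile contents are hypothetical stand-ins chosen for illustration, not the real contents of either profile, and `default_errno` is not a runc function.

```python
import errno

# x86_64 syscall numbers from the kernel's syscall table.
SYSCALL_NR = {"clone": 56, "statx": 332, "clone3": 435, "faccessat2": 439}

# Hypothetical profiles: neither has a rule for clone3.
OLDER_PROFILE = {"clone", "statx"}                # written before clone3 existed
NEWER_PROFILE = {"clone", "statx", "faccessat2"}  # mentions a syscall newer than clone3


def default_errno(syscall, filter_syscalls):
    """Errno for a syscall that matches no rule, following the stated heuristic."""
    highest = max(SYSCALL_NR[name] for name in filter_syscalls)
    if SYSCALL_NR[syscall] > highest:
        return errno.ENOSYS
    return errno.EPERM


print(errno.errorcode[default_errno("clone3", OLDER_PROFILE)])  # ENOSYS
print(errno.errorcode[default_errno("clone3", NEWER_PROFILE)])  # EPERM
```

If the built-in profile mentions any syscall numbered above clone3 while the older downloaded profile does not, this rule would produce exactly the EPERM versus ENOSYS difference seen in the two traces. That is an inference from the heuristic, not something verified against either profile here.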

Also it is impossible to run docker build

$ cat test.dkr 
FROM registry.fedoraproject.org/fedora:rawhide

RUN curl google.com

$ docker build -f test.dkr  .
Sending build context to Docker daemon  2.048kB
Step 1/2 : FROM registry.fedoraproject.org/fedora:rawhide
 ---> 887689ee223e
Step 2/2 : RUN curl google.com
 ---> Running in a370ae01f27e
curl: (6) getaddrinfo() thread failed to start
The command '/bin/sh -c curl google.com' returned a non-zero code: 6

and seccomp can't be overridden to make it work

$ docker build --security-opt seccomp=~/profile2.json -f test.dkr  .
Sending build context to Docker daemon  2.048kB
Error response from daemon: The daemon on this platform does not support setting security options on build

Steps to reproduce the issue:

  1. Install docker 20.10.7, with seccomp enabled in build
  2. docker run -it registry.fedoraproject.org/fedora:rawhide curl google.com

Describe the results you received:
curl: (6) getaddrinfo() thread failed to start

Describe the results you expected:
Dump of google.com

Output of docker version:

Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        f0df350
 Built:             Mon Jul 26 16:34:29 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       b0f5bc3
  Built:            Thu Jul 22 00:00:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.3
  GitCommit:        
 runc:
  Version:          1.0.1
  GitCommit:        4fc6f22
 docker-init:
  Version:          0.19.0
  GitCommit:        

Output of docker info:

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 78
  Running: 1
  Paused: 0
  Stopped: 77
 Images: 3
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: journald
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: /usr/libexec/docker/docker-init
 containerd version: 
 runc version: 4fc6f22
 init version: 
 Security Options:
  seccomp
   Profile: default
  selinux
  cgroupns
 Kernel Version: 5.14.0-0.rc2.20210721git8cae8cd89f05.24.fc35.x86_64
 Operating System: Fedora Linux 35 (Server Edition Prerelease)
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 7.438GiB
 Name: fedora
 ID: GQBM:HCKW:MKVM:Y5RK:HXPA:ZCCY:EXPA:FQBS:S4ZN:HRL5:5PSZ:KK7B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

Additional environment details (AWS, VirtualBox, physical, etc.):
Virtual machine running Fedora 35. Also seen in GitLab CI when using 'docker:dind' for builds

berrange added a commit to berrange/moby that referenced this issue Jul 27, 2021
If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:

  opencontainers/runc@7a8d716

As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:

[quote]
  The solution applied here is to prepend a "stub" filter which returns
  -ENOSYS if the requested syscall has a larger syscall number than any
  syscall mentioned in the filter. The reason for this specific rule is
  that syscall numbers are (roughly) allocated sequentially and thus newer
  syscalls will (usually) have a larger syscall number -- thus causing our
  filters to produce -ENOSYS if the filter was written before the syscall
  existed.
[/quote]

Unfortunately clone3 appears to be one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fall back to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.

The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a direct argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.

Fixes: moby#42680
Signed-off-by: Daniel P. Berrangé <[email protected]>
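
For anyone maintaining a custom profile, the approach in this commit message (an explicit rule that answers clone3 with ENOSYS) could be sketched as below. This is not the actual change made to dockerd's built-in profile. It assumes the JSON seccomp profile layout with `names`, `action` and `errnoRet` keys, and that the Docker version in use honours `errnoRet`; `add_clone3_enosys_rule` is a hypothetical helper.

```python
import errno
import json


def add_clone3_enosys_rule(profile):
    """Return a copy of a seccomp profile with an ENOSYS rule for clone3."""
    patched = dict(profile)
    rule = {
        "names": ["clone3"],
        "action": "SCMP_ACT_ERRNO",
        # Assumption: the runtime reads errnoRet as the errno to return.
        "errnoRet": errno.ENOSYS,
    }
    # Copy the rule list so the caller's profile is left unmodified.
    patched["syscalls"] = list(profile.get("syscalls", [])) + [rule]
    return patched


# Hypothetical minimal profile used only to exercise the helper.
base_profile = {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}
print(json.dumps(add_clone3_enosys_rule(base_profile), indent=2))
```

The patched dictionary could then be written to a file and passed with `--security-opt seccomp=<file>` on `docker run`. As noted in the issue, `docker build` rejected that option on the reporter's setup.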
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this issue Jul 30, 2021
Fixes: moby/moby#42680
Signed-off-by: Daniel P. Berrangé <[email protected]>
Upstream-commit: 9f6b562dd12ef7b1f9e2f8e6f2ab6477790a6594
Component: engine
mrc0mmand added a commit to mrc0mmand/restraint that referenced this issue Aug 25, 2021
Current Docker version on Ubuntu 20.04 used by GH Actions suffers from
an incompatibility with newer glibc [0] used by Fedora Rawhide, causing
Rawhide containers in CI to fail with:

```
Errors during downloading metadata for repository 'fedora-cisco-openh264':
  - Curl error (6): Couldn't resolve host name for https://mirrors.fedoraproject.org/metalink?repo=fedora-cisco-openh264-rawhide&arch=x86_64 [getaddrinfo() thread failed to start]
```

glibc 2.34 and later tries to use the clone3 syscall (for
hardware-assisted security hardening on x86_64), and falls back to clone
on ENOSYS. However, with the current seccomp profile Docker returns EPERM
instead, which is considered a "hard" fail.

A fix [1] has been merged upstream, but until then let's run the CI Docker
containers without any seccomp profiles to allow Rawhide jobs to do their job.
(I tried to disable seccomp only for the Rawhide jobs, but I couldn't procure
any solution which wouldn't make my eyes bleed...)

[0] moby/moby#42680
[1] moby/moby#42681
UncombedCoconut added a commit to naev/naev-infrastructure that referenced this issue Aug 29, 2021
tonistiigi pushed a commit to tonistiigi/docker that referenced this issue Sep 28, 2021
Fixes: moby#42680
Signed-off-by: Daniel P. Berrangé <[email protected]>
(cherry picked from commit 9f6b562)
ssssam referenced this issue in flatpak/flatpak Oct 9, 2021
clone3() can be used to implement clone() with CLONE_NEWUSER, allowing
a sandboxed process to get CAP_SYS_ADMIN in a new namespace and
manipulate its root directory. We need to block this so that AF_UNIX-based
socket servers (X11, Wayland, etc.) can rely on
/proc/PID/root/.flatpak-info existing for all Flatpak-sandboxed apps.

Partially fixes GHSA-67h7-w3jq-vh4q.

Thanks: an anonymous reporter
Signed-off-by: Simon McVittie <[email protected]>
fishilico added a commit to fishilico/shared that referenced this issue Oct 31, 2021
Recently, glibc broke with seccomp again: syscall "clone3" is used by
glibc 2.34 moby/moby#42680

This was fixed in Docker 20.10.10, moby/moby#42836
which was packaged in Arch Linux in October 2021
domq pushed a commit to epfl-si/wp-ops that referenced this issue Jun 14, 2023
- Because [some morons don't know the difference between `EPERM` and `ENOSYS`](moby/moby#42680), we have to change our `wp-receptor` build strategy from (no pun intended) building on top of the `receptor` Docker image, to using `ghcr.io/ansible/awx` ; as the latter at least has git already installed
- To add the `receptor` binary (self-contained, thanks Golang!) on top, use `{{ shellmacro_poor_mans_curl }}` twice in a shell pipeline, yow! to query the GitHub API, find the suitable binary release of `ansible/receptor`, download and untar it

As a further upside of this change, we are no longer pinned to v1.3.0 for reasons of  Docker image format.
domq pushed a commit to epfl-si/wp-ops that referenced this issue Jun 16, 2023
domq pushed a commit to epfl-si/wp-ops that referenced this issue Jun 17, 2023
- Because [some morons don't know the difference between `EPERM` and `ENOSYS`](moby/moby#42680), we have to change our `wp-receptor` build strategy from (no pun intended) building on top of the `receptor` Docker image, to using `ghcr.io/ansible/awx` ; as the latter at least has git already installed.
- To add the `receptor` binary (self-contained, thanks Golang!) on top, introduce `{{ shellmacro_poor_mans_curl }}` and use it twice in a shell pipeline, yow! to query the GitHub API, find the suitable binary release of `ansible/receptor`, download and untar it.

As a further upside of this change, we are no longer pinned to receptor v1.3.0 for reasons of  Docker image format.
@JefriReynaldi

$ docker run -it registry.fedoraproject.org/fedora:rawhide curl google.com
curl: (6) getaddrinfo() thread failed to start
