Incorrectly handled clock_gettime64 syscall #8326

Open
1 of 2 tasks
igorsnunes opened this issue Sep 2, 2020 · 7 comments

@igorsnunes

igorsnunes commented Sep 2, 2020

  • I have tried with the latest version of my channel (Stable or Edge)
  • I have uploaded Diagnostics
  • Diagnostics ID:

Expected behavior

Actual behavior

Information

Hi everyone,

While running my application on an i386 Debian image (bullseye), I constantly receive "Operation not permitted" from the clock_gettime system call. The error only happens with a newer version of libc6 (2.31-1), and once it starts happening my application eventually stops working.

After some investigation I found that in newer versions of glibc, clock_gettime() first tries the clock_gettime64() syscall. Running my application under strace, I can see that the clock_gettime64() call returns EPERM, and this specific error code is what breaks the application. The problem is that glibc expects ENOSYS, which indicates that the kernel does not implement the syscall. In that case libc falls back to another implementation of clock_gettime and returns the correct value; if EPERM is returned instead, libc treats it as an error.
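The fallback described above can be sketched in C. This is only a model of the logic as described in this report, not glibc's actual code: the struct names, the function-pointer types, and the three stub "syscalls" are hypothetical and exist so the behavior can be exercised without a real kernel.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the 64-bit and 32-bit timespec layouts. */
struct ts64 { int64_t tv_sec; int64_t tv_nsec; };
struct ts32 { int32_t tv_sec; int32_t tv_nsec; };

/* Each "syscall" returns 0 on success, or -1 with errno set. */
typedef int (*gettime64_fn)(int clock_id, struct ts64 *out);
typedef int (*gettime32_fn)(int clock_id, struct ts32 *out);

/* Try the 64-bit call first; fall back to the 32-bit call only on ENOSYS. */
int gettime_with_fallback(int clock_id, struct ts64 *out,
                          gettime64_fn try64, gettime32_fn try32)
{
    int rc = try64(clock_id, out);
    if (rc == 0 || errno != ENOSYS)
        return rc; /* success, or a "real" error such as EPERM */

    struct ts32 legacy;
    rc = try32(clock_id, &legacy);
    if (rc == 0) {
        out->tv_sec = legacy.tv_sec;
        out->tv_nsec = legacy.tv_nsec;
    }
    return rc;
}

/* Stub: kernel without clock_gettime64 and no seccomp interference. */
int stub64_enosys(int clock_id, struct ts64 *out)
{
    (void)clock_id; (void)out;
    errno = ENOSYS;
    return -1;
}

/* Stub: seccomp filter rejecting the call with EPERM. */
int stub64_eperm(int clock_id, struct ts64 *out)
{
    (void)clock_id; (void)out;
    errno = EPERM;
    return -1;
}

/* Stub: working 32-bit clock_gettime returning a fixed, made-up time. */
int stub32_ok(int clock_id, struct ts32 *out)
{
    (void)clock_id;
    out->tv_sec = 1600000000;
    out->tv_nsec = 42;
    return 0;
}
```

With stub64_enosys the fallback runs and the caller gets a valid time; with stub64_eperm the function returns -1 with errno set to EPERM and the 32-bit path is never tried, which matches the failure seen under strace.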

I can work around this issue by running the container with the "--privileged" flag, or by creating a seccomp profile that has the following configuration:

"defaultAction": "SCMP_ACT_TRACE"

This means ENOSYS is returned by default, instead of EPERM.

The --privileged flag bypasses seccomp and allows every syscall to be handled by the kernel (and apparently the kernel returns the correct code).

Question: why is "clock_gettime64" not matched by any seccomp profile (including the default one used by the engine)? The only way I managed to make this syscall return ENOSYS through a seccomp profile was to set defaultAction to SCMP_ACT_TRACE. As far as I can see, this is not good practice; the correct default action would be SCMP_ACT_ERRNO. Below are the two approaches I tried in my seccomp profile that didn't work:

Explicitly allowing clock_gettime64:
{
"names": ["clock_gettime64"],
"action": "SCMP_ACT_ALLOW",
"args": [],
"comment": "",
"includes": {},
"excludes": {}
}

Explicitly setting the behavior of clock_gettime64 to TRACE:

{
"names": ["clock_gettime64"],
"action": "SCMP_ACT_TRACE",
"args": [],
"comment": "",
"includes": {},
"excludes": {}
}
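One possible explanation for why neither rule has any effect can be modelled in a few lines of C. This is a toy model, not libseccomp or Docker code. It assumes that a rule whose syscall name the seccomp library cannot resolve to a number is silently dropped, and that the default action returns EPERM. The table contents and function names are hypothetical; the numbers 265 and 403 are the i386 syscall numbers for clock_gettime and clock_gettime64.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct syscall_entry { const char *name; int nr; };

/* Hypothetical name tables for an older and a newer seccomp library (i386). */
const struct syscall_entry OLD_TABLE[] = {
    { "clock_gettime", 265 },
};
const struct syscall_entry NEW_TABLE[] = {
    { "clock_gettime", 265 },
    { "clock_gettime64", 403 },
};

/* Resolve a syscall name to its number, or -1 if the library does not know it. */
int resolve_name(const struct syscall_entry *table, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].nr;
    return -1;
}

/* Returns 0 if syscall `nr` is allowed, otherwise the errno the filter yields. */
int filter_decision(const struct syscall_entry *table, size_t n,
                    const char *const *allow, size_t n_allow, int nr)
{
    for (size_t i = 0; i < n_allow; i++) {
        int resolved = resolve_name(table, n, allow[i]);
        if (resolved < 0)
            continue; /* assumption: unknown names are skipped, not reported */
        if (resolved == nr)
            return 0;
    }
    return EPERM; /* assumed default action: SCMP_ACT_ERRNO with EPERM */
}
```

Under these assumptions, a profile that lists clock_gettime64 still yields EPERM for syscall 403 when evaluated against OLD_TABLE, and allows it against NEW_TABLE.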

As shown at https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1868720 , this might be related to an older version of libseccomp installed on the host. Is there a way to get this information from the host Linux system used by Docker Desktop?

Please let me know if I am missing something in my analysis.

PS: Some documentation used for this analysis:
https://gitlab.alpinelinux.org/alpine/aports/-/issues/11774
https://lwn.net/Articles/795128/
https://docs.docker.com/engine/security/seccomp/
https://bugs.launchpad.net/ubuntu/+source/libseccomp/+bug/1868720

  • Windows Version: Windows 10 Pro, version 1903
  • Docker Desktop Version: 2.3.0.4 (46911)
  • Are you running inside a virtualized Windows e.g. on a cloud server or on a mac VM: No

Steps to reproduce the behavior

Compile the following code for 32 bits, i.e. "gcc -m32":

#include <stdio.h>
#include <time.h>
#include <fcntl.h>

int main () {
	struct timespec tp;
	
	if (clock_gettime(CLOCK_REALTIME, &tp) == -1) {
		perror("clock_gettime");
	}
	else {
		printf("clock_gettime success: %ld\n", tp.tv_nsec);
	}
	return 0;
}

Run the binary in a Docker image with a newer version of libc in an i386 environment. For example:

docker run --entrypoint bash -v "C:\path_to_bin:/path_to_bin" -it i386/debian:bullseye

@stephen-turner
Contributor

Thanks for the report, @igorsnunes. I found a seemingly relevant patch in version 19.03.9 of the upstream engine: moby/moby@v19.03.8...v19.03.9. But we have engine 19.03.12 in Desktop 2.3.0.4 so maybe there's something else we need to do in our VM. We'll take a look.

@djs55

djs55 commented Sep 22, 2020

Hi @igorsnunes -- we believe we need to upgrade the version of libseccomp bundled inside Docker Desktop. We currently link libseccomp statically into the dockerd binary. The simplest solution therefore is to bump the version inside the build environment -- which is in progress -- but unfortunately this is the build environment used to build other Linux packages and the change has knock-on effects for other architectures like armhf so it may take a while to fix.

We're also considering switching Docker Desktop to a dynamically-linked dockerd, which would allow Desktop to bump the libseccomp version in our Dockerfile without worrying about effects on armhf (for now, anyway). However, this is quite a big change to our build process too, so it will take a while.

In summary

  • we're working on it, but it might take a while
  • I'll add a known issue to the docs referencing this issue
  • we'll keep this issue open and keep you informed as we make progress.

Thanks again for your report. I was hoping there was a quick fix available but unfortunately I've failed to find one.

/lifecycle frozen

djs55 added a commit to djs55/docker.github.io that referenced this issue Sep 22, 2020
usha-mandya added a commit to docker/docs that referenced this issue Sep 22, 2020
* Docker Desktop release notes: add clock_gettime64 known issue

See docker/for-win#8326

Signed-off-by: David Scott <[email protected]>

* Minor style update

* minor style update

* Minor style updte

* Minor style update

Co-authored-by: Usha Mandya <[email protected]>
@raxvan

raxvan commented Sep 24, 2020

Hello, just for information: I found this issue while searching for reasons why clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &now_time); returns an incorrect time in now_time.tv_nsec. With the same code posted by @igorsnunes (except for CLOCK_PROCESS_CPUTIME_ID) I'm getting more or less random values in tv_nsec. I'm using Docker Engine with WSL 2.
Running the image with --privileged solves the issue.

@igorsnunes
Author

Thanks @djs55 and @stephen-turner . I'll keep following your updates.

@microhobby

With a v5.x kernel the issue does not occur:

image

If I'm not mistaken, these syscalls were added in kernel v5.1. So I hope that the next update of the WSL 2 kernel, which is planned to use the v5.4 LTS kernel, will solve this.
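A quick way to check this from inside a container is to compare the kernel release string against 5.1. The following is a minimal sketch that takes 5.1 as the threshold stated above; the function names are hypothetical. Note that uname() in a container reports the kernel of the host (here, the Docker Desktop VM or WSL 2), since containers share that kernel.

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Returns 1 if the release string (e.g. "5.4.72") is 5.1 or newer,
 * 0 if it is older, and -1 if it cannot be parsed as "major.minor".
 */
int release_has_time64(const char *release)
{
    int major = 0, minor = 0;
    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    return (major > 5 || (major == 5 && minor >= 1)) ? 1 : 0;
}

/* Same check applied to the running kernel; returns -1 if uname() fails. */
int running_kernel_has_time64(void)
{
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    return release_has_time64(u.release);
}
```

A result of 0 only means the kernel predates these syscalls, in which case an unfiltered call is expected to fail with ENOSYS and libc can fall back; it says nothing about what the seccomp filter does with the call.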

@disconnect3d

disconnect3d commented Apr 15, 2021

I am running into the same issue on macOS with the (currently latest) Docker Desktop 3.2.2: the clock_gettime64 syscall returns EPERM.

...and this can be worked around with --security-opt seccomp=unconfined, so it is related to seccomp blocking the syscall. It seems that Docker whitelisted this syscall in its default seccomp policy a year ago, but for some reason this is not used in Docker Desktop? Why?

Anyway, the log below shows this (container run with default flags + --cap-add=SYS_PTRACE).

root@72bbc100bb69:/# cat a.c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <signal.h>
#include <time.h>
#include <fcntl.h>

int main() {
   struct timespec tp;
   syscall(SYS_clock_gettime64, 0, &tp);
}
root@72bbc100bb69:/# gcc -m32 a.c
root@72bbc100bb69:/# strace ./a.out
execve("./a.out", ["./a.out"], 0x7ffc77fd7d50 /* 8 vars */) = 0
strace: [ Process PID=19 runs in 32 bit mode. ]
brk(NULL)                               = 0x58289000
arch_prctl(0x3001 /* ARCH_??? */, 0xffef8c28) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f60000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=21704, ...}) = 0
mmap2(NULL, 21704, PROT_READ, MAP_PRIVATE, 3, 0) = 0xf7f5a000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib32/libc.so.6", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\360\1\0004\0\0\0"..., 512) = 512
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205\327\273-\255\17\201r\321\300,\3\21\240\fF"..., 96, 468) = 96
fstat64(3, {st_mode=S_IFREG|0755, st_size=2002268, ...}) = 0
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205\327\273-\255\17\201r\321\300,\3\21\240\fF"..., 96, 468) = 96
mmap2(NULL, 2010892, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xf7d6f000
mmap2(0xf7d8c000, 1409024, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d000) = 0xf7d8c000
mmap2(0xf7ee4000, 458752, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x175000) = 0xf7ee4000
mmap2(0xf7f54000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e4000) = 0xf7f54000
mmap2(0xf7f58000, 7948, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf7f58000
close(3)                                = 0
set_thread_area({entry_number=-1, base_addr=0xf7f61100, limit=0x0fffff, seg_32bit=1, contents=0, read_exec_only=0, limit_in_pages=1, seg_not_present=0, useable=1}) = 0 (entry_number=12)
mprotect(0xf7f54000, 8192, PROT_READ)   = 0
mprotect(0x5662b000, 4096, PROT_READ)   = 0
mprotect(0xf7f92000, 4096, PROT_READ)   = 0
munmap(0xf7f5a000, 21704)               = 0
clock_gettime64(CLOCK_REALTIME, 0xffef8c14) = -1 EPERM (Operation not permitted)
exit_group(0)                           = ?
+++ exited with 0 +++
root@72bbc100bb69:/#

With --security-opt seccomp=unconfined (which I don't recommend) it returns ENOSYS (as expected):

clock_gettime64(CLOCK_REALTIME, 0xfffad424) = -1 ENOSYS (Function not implemented)

@djs55

djs55 commented Apr 29, 2021

I believe this issue is fixed on Docker Desktop 3.3.1 with the newer runc:

Screenshot 2564-04-29 at 09 26 29
