Newer Glibc use faccessat2 to implement faccessat #16739

mscherer · 2020-08-15T16:13:07Z

cf https://repo.or.cz/glibc.git/commit/3d3ab573a5f3071992cbc4f57d50d1d29d55bde2

This cause breakage on Fedora Rawhide: https://bugzilla.redhat.com/show_bug.cgi?id=1869030

cf https://repo.or.cz/glibc.git/commit/3d3ab573a5f3071992cbc4f57d50d1d29d55bde2 This cause breakage on Fedora Rawhide: https://bugzilla.redhat.com/show_bug.cgi?id=1869030

keszybz · 2020-08-15T17:05:03Z

LGTM.

fweimer · 2020-08-15T18:04:46Z

This would work automatically if the filter returned ENOSYS for unknown system calls, like the kernel does.

poettering · 2020-08-16T13:10:00Z

This would work automatically if the filter returned ENOSYS for unknown system calls, like the kernel does.

Well, we implement something that we advertise as sandbox. i.e. it's about permissions for us, hence we usually bind this to EPERM. This is configurable in unit files though. To me returning ENOSYS would be more about pretending something didn#t exist that exists, while EPERM (and not EACESS) appears to be the right error to use when we prohibit access to something, which is really what we do here.

I think it would be smart if glibc would only use new syscalls if explicitly asked for them, i.e. unless the specified flags require so just stick to faccessat() instead of faccessat2(). I mean, this otherwise creates needless churn, not just here but various other places where seccomp sandboxing is implemented... needlessly using new syscalls where the old calls would be entirely equivalent will also fuckup strace for a moment and everything else really that tracks syscalls, and isn't updated in sync with the new kernel.

I mean, I am all for making use of new syscalls where there's benefit in it, but just calling a new syscalls instead of an old one, if both are 100% equivalent just smells like unnecessary churn to me.

fweimer · 2020-08-16T18:35:17Z

while EPERM (and not EACESS) appears to be the right error to use when we prohibit access to something, which is really what we do here.

The problem is that EPERM is also returned by implemented system calls, so it's impossible to use as a trigger for fallback. This only works with ENOSYS.

I think it would be smart if glibc would only use new syscalls if explicitly asked for them, i.e. unless the specified flags require so just stick to faccessat() instead of faccessat2().

I think in this case, faccessat was actually called with AT_EACCESS, previously triggering the broken emulation.

keszybz · 2020-08-17T13:11:27Z

I started thinking that we should indeed switch to ENOSYS as the error, and it seems doable, but not so simple.

On first blush, it seems reasonable to use ENOSYS whenever a new syscall is added. A program making use of the syscall is expected to expect that it might not be available, so it will usually include a fallback for ENOSYS. But it much less useful for old system calls, since programs expect that the syscall is available on all possible kernels. And for such widely-available "old" systems calls, handling is more likely to be implemented for EPERM than for ENOSYS. In systemd we filter many "old" syscalls, in particular many syscalls we qualify as "obsolete". Switching from EPERM to ENOSYS everywhere would be an unacceptable compat break in our interface.

But I think this could be made to work in a limited fashion: for any given version of systemd, record what was the maximum allocated syscall number (this would sadly have to be per-architecture...). For any syscall that is newer than that return ENOSYS. For the rest, return EPERM as we currently do.

This will give a fairly stable behaviour: when the kernel is upgraded beneath systemd, we behave as if we were still on the old kernel until the filters are upgraded. And when the filters are upgraded by means of installing a new systemd version, we assume that all new syscalls have been added to the filters, so that is then OK to return EPERM if a syscall is not allowed.

WDYT?

poettering · 2020-08-17T14:29:36Z

It's more complicated than that, i.e. the syscall numbering space is not contiguous, and numbers different on all archs. I think the more recently added syscalls use the same numbering on all archs, but even there it's not in historical order. for example facessat2() was added in 5.8 iirc as syscall nr 439. But close_range() is being added in 5.9 as syscall 438.

poettering · 2020-08-17T14:31:08Z

The problem is that EPERM is also returned by implemented system calls, so it's impossible to use as a trigger for fallback. This only works with ENOSYS.

Hmm, there are also syscalls that return ENOSYS even when implemented iirc, ioctl() for example and other multiplexed stuff iirc.

I think in this case, faccessat was actually called with AT_EACCESS, previously triggering the broken emulation.

But openat() vs. open(). IIRC glibc only uses openat() these days on x86-64 even if the dirfd is not used... But I guess the ship for that one definitely is sailed...

keszybz · 2020-08-17T17:03:38Z

This commit is now stable-246.2.

fweimer · 2020-08-17T19:03:41Z

On first blush, it seems reasonable to use ENOSYS whenever a new syscall is added. A program making use of the syscall is expected to expect that it might not be available, so it will usually include a fallback for ENOSYS. But it much less useful for old system calls, since programs expect that the syscall is available on all possible kernels. And for such widely-available "old" systems calls, handling is more likely to be implemented for EPERM than for ENOSYS. In systemd we filter many "old" syscalls, in particular many syscalls we qualify as "obsolete". Switching from EPERM to ENOSYS everywhere would be an unacceptable compat break in our interface.

Yes, that seems reasonable.

But I think this could be made to work in a limited fashion: for any given version of systemd, record what was the maximum allocated syscall number (this would sadly have to be per-architecture...). For any syscall that is newer than that return ENOSYS. For the rest, return EPERM as we currently do.

Do you mean in the systemd sources? Yes, that would work, because it will be stable across rebuilds with different kernel headers. It's probably not going to be a single maximum number for architectures with multiple system call tables (e.g., lp64 vs lp32).

In glibc, we use our own system call tables which get regularly sync'ed against the kernel, after review. It was the only way for us to get a stable implementation of the Y2038 system calls, otherwise the system call profile would change depending on the kernel header version. It was relatively easy for us to implement this because we already had scripts for building cross-toolchains, so the system call table maintenance was a small add-on on top of that.

brauner · 2020-08-18T12:43:27Z

It's more complicated than that, i.e. the syscall numbering space is not contiguous, and numbers different on all archs. I think the more recently added syscalls use the same numbering on all archs, but even there it's not in historical order. for example

Depending on what architectures you care about this needs to be qualified:

alpha is still special
ia64 has a 1024 offset
mips has a 4000/5000/6000 offset depending on exact abi

For example

#ifndef __NR_clone3
	#if defined __alpha__
		#define __NR_clone3 545
	#elif defined _MIPS_SIM
		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
			#define __NR_clone3 (435 + 4000)
		#endif
		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
			#define __NR_clone3 (435 + 6000)
		#endif
		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
			#define __NR_clone3 (435 + 5000)
		#endif
	#elif defined __ia64__
		#define __NR_clone3 (435 + 1024)
	#else
		#define __NR_clone3 435
	#endif
#endif

facessat2() was added in 5.8 iirc as syscall nr 439. But close_range() is being added in 5.9 as syscall 438.

keszybz · 2020-08-19T06:56:55Z

After looking into implementing this, I don't think we can reasonably do this with our current toolling. It is fairly easy to get a list of all syscalls. There is an arbitrary number of unknown syscalls, if we want to return ENOSYS for the unknown ones, we need to use ENOSYS as the default action. This means that that we need to enumerate the other ones or generate some smart rule to catch them all.

It is possible to express this concisely, because the syscall number table is almost completely filled. For example, for amd64 we'd use EPERM when n <= 181 || 186 <= n <= 235 || 237 <= n <= 334 || 424 <= n <= 439. It is reasonable to assume that syscalls would be added at the end, possibly with an occasional gap, but the expression would always remain reasonably short.

But libseccomp does not allow comparison operators on syscall numbers, so we would need to enumerate all syscalls. This would mean that we'd create very lengthy filter scripts, which would create a high overhead on all syscalls, and I don't think we want to do that.

To make this work reasonably, libseccomp would need to support comparison operators also for the syscall number, or alternatively, we'd need to use something else than libseccomp to generate the filter. I'll open an RFE against libseccomp to support this.

poettering · 2020-08-19T08:12:46Z

sounds as if one wants a "gperf" that generates BPF code...

topimiettinen · 2020-08-19T08:27:45Z

My RFE for libseccomp asks for range ops, wouldn't that work?

srd424 · 2020-08-19T08:36:32Z

Just found this discussion .. looks like a mirror of the one on the mailing list! As for range ops, I don't think libseccomp currently supports checking the syscall number itself with an operator, does it? Only syscall arguments.

poettering · 2020-08-19T11:00:11Z

so i just realized thta we basically maintain a list of all known syscalls anyway, in syscall_filter_sets[]. On my machine it lists all syscalls the kernel knows except for three. One group in particular is basically a list of system calls we know: @system-service. My guess is that it contains > 80% of all known syscalls anyway, and we advocate that people use it for their services.

If somebody uses @system-service we end up asking libseccomp for a filter that includes every single syscall listed in there in O(n) work. I kinda hope that libseccomp in the backend actually tries to merge these syscall entries into something smarter than just a list of equality checks. (If it didn't it would be pretty useless a library if you ask me, but I didn't check that).

Now we could define a new special group @known or so, which simply includes all syscalls. And then if you install an allowlist syscall filter and use @system-service, what actually happens in the bg is that we allow everything in @system-service, EPERM everything in @known that is not in @system-service, and ENOSYS everything else.

I think effectively this wouldn't really make complexity of the filters much worse... Because if we have 80% of the syscalls in the filter or 100% doesn't really matter...

srd424 · 2020-08-19T11:37:25Z

I kinda hope that libseccomp in the backend actually tries to merge these syscall entries into something smarter than just a list of equality checks. (If it didn't it would be pretty useless a library if you ask me, but I didn't check that).

Looking at the code, which is a bit over my head, it seems to construct a binary tree, and convert that into BPF, which I guess is a sensible approach.

srd424 · 2020-08-19T11:44:20Z

Might need to think about performance a bit - does glibc remember getting ENOSYS for a particular syscall? I assume it would do so to avoid unnecessary context switches.

srd424 · 2020-08-19T11:51:00Z

Doesn't look like it does - but I suppose "upgrade your software then" is a reasonable response to suboptimal performance :)

srd424 · 2020-08-19T12:17:24Z

[Apologies for spam] It also occurred to me that if the ENOSYS handling ever gets pushed down into libseccomp, that should make it safe to invert large allowlists into small denylists?

keszybz · 2020-08-19T13:03:10Z

Note that the kernel does some optimization too. It might be able to coalesce some entries... Even if it doesn't do that now, it could in the future.

srd424 · 2020-08-19T15:06:29Z

Looking at the code, which is a bit over my head, it seems to construct a binary tree, and convert that into BPF, which I guess is a sensible approach.

This seems to be a new thing in libseccomp 2.5.x, and it's off by default - we might want to consider turning on SCMP_FLTATR_CTL_OPTIMIZE if it's available?

On Linux the major C libraries expect that syscalls that are blocked from running in the container runtime return ENOSYS to allow fallbacks to be used. Returning EPERM by default is not useful particularly for syscalls that would return EPERM for actual access restrictions e.g. the new faccessat2. The runtime-spec should set the standard and recommend ENOSYS be returned just like a kernel would that doesn't support that syscall. This allows C runtimes to fall back on other possible implementations given the userspace policies. Please see the upstream discussions: https://lwn.net/Articles/738694/ - Discusses fragility of syscall filtering. opencontainers/runc#2151 - glibc and musl request ENOSYS return for unknown syscalls. systemd/systemd#16739 - Discusses systemd-nspawn breakage with faccessat2. systemd/systemd#16819 - General policy for systemd-nspawn to return ENOSYS. seccomp/libseccomp#286 - Block unknown syscalls and erturn ENOSYS.

On Linux the major C libraries expect that syscalls that are blocked from running in the container runtime return ENOSYS to allow fallbacks to be used. Returning EPERM by default is not useful particularly for syscalls that would return EPERM for actual access restrictions e.g. the new faccessat2. The runtime-spec should set the standard and recommend ENOSYS be returned just like a kernel would that doesn't support that syscall. This allows C runtimes to fall back on other possible implementations given the userspace policies. Please see the upstream discussions: https://lwn.net/Articles/738694/ - Discusses fragility of syscall filtering. opencontainers/runc#2151 - glibc and musl request ENOSYS return for unknown syscalls. systemd/systemd#16739 - Discusses systemd-nspawn breakage with faccessat2. systemd/systemd#16819 - General policy for systemd-nspawn to return ENOSYS. seccomp/libseccomp#286 - Block unknown syscalls and return ENOSYS.

Newer Glibc use faccessat2 to implement faccessat

7c9c627

cf https://repo.or.cz/glibc.git/commit/3d3ab573a5f3071992cbc4f57d50d1d29d55bde2 This cause breakage on Fedora Rawhide: https://bugzilla.redhat.com/show_bug.cgi?id=1869030

keszybz added the seccomp label Aug 15, 2020

keszybz added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Aug 15, 2020

poettering merged commit bcf08ac into systemd:master Aug 16, 2020

keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Aug 17, 2020

keszybz mentioned this pull request Aug 18, 2020

List of projects which system calls table helped hrw/syscalls-table#12

Open

keszybz mentioned this pull request Aug 19, 2020

RFE: distinguish unknown syscalls seccomp/libseccomp#286

Open

keszybz mentioned this pull request Aug 22, 2020

Return ENOSYS in nspawn for "unknown" syscalls #16819

Merged

srd424 mentioned this pull request Aug 25, 2020

Seccomp blocks install of "libc6" in Ubuntu 20.04 "focal" image on armhf (arm32v7) moby/moby#40734

Open

dschwoerer mentioned this pull request Nov 20, 2020

Fedora rawhide test failing due to out of date shell boutproject/BOUT-dev#2143

Closed

mrc0mmand mentioned this pull request Feb 8, 2021

Update 60-evdev.hwdb #18493

Merged

koalo mentioned this pull request Jun 14, 2021

Return ENOSYS for "unknown" syscalls when executing services #19922

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Newer Glibc use faccessat2 to implement faccessat #16739

Newer Glibc use faccessat2 to implement faccessat #16739

mscherer commented Aug 15, 2020

keszybz commented Aug 15, 2020

fweimer commented Aug 15, 2020

poettering commented Aug 16, 2020

fweimer commented Aug 16, 2020

keszybz commented Aug 17, 2020

poettering commented Aug 17, 2020

poettering commented Aug 17, 2020 •

edited

Loading

keszybz commented Aug 17, 2020 •

edited

Loading

fweimer commented Aug 17, 2020

brauner commented Aug 18, 2020

keszybz commented Aug 19, 2020 •

edited

Loading

poettering commented Aug 19, 2020

topimiettinen commented Aug 19, 2020

srd424 commented Aug 19, 2020

poettering commented Aug 19, 2020

srd424 commented Aug 19, 2020

srd424 commented Aug 19, 2020

srd424 commented Aug 19, 2020

srd424 commented Aug 19, 2020

keszybz commented Aug 19, 2020

srd424 commented Aug 19, 2020

Newer Glibc use faccessat2 to implement faccessat #16739

Newer Glibc use faccessat2 to implement faccessat #16739

Conversation

mscherer commented Aug 15, 2020

keszybz commented Aug 15, 2020

fweimer commented Aug 15, 2020

poettering commented Aug 16, 2020

fweimer commented Aug 16, 2020

keszybz commented Aug 17, 2020

poettering commented Aug 17, 2020

poettering commented Aug 17, 2020 • edited Loading

keszybz commented Aug 17, 2020 • edited Loading

fweimer commented Aug 17, 2020

brauner commented Aug 18, 2020

keszybz commented Aug 19, 2020 • edited Loading

poettering commented Aug 19, 2020

topimiettinen commented Aug 19, 2020

srd424 commented Aug 19, 2020

poettering commented Aug 19, 2020

srd424 commented Aug 19, 2020

srd424 commented Aug 19, 2020

srd424 commented Aug 19, 2020

srd424 commented Aug 19, 2020

keszybz commented Aug 19, 2020

srd424 commented Aug 19, 2020

poettering commented Aug 17, 2020 •

edited

Loading

keszybz commented Aug 17, 2020 •

edited

Loading

keszybz commented Aug 19, 2020 •

edited

Loading