Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Newer Glibc use faccessat2 to implement faccessat #16739

Merged
merged 1 commit into from
Aug 16, 2020

Conversation

mscherer
Copy link
Contributor

@keszybz
Copy link
Member

keszybz commented Aug 15, 2020

LGTM.

@keszybz keszybz added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Aug 15, 2020
@fweimer
Copy link

fweimer commented Aug 15, 2020

This would work automatically if the filter returned ENOSYS for unknown system calls, like the kernel does.

@poettering
Copy link
Member

This would work automatically if the filter returned ENOSYS for unknown system calls, like the kernel does.

Well, we implement something that we advertise as sandbox. i.e. it's about permissions for us, hence we usually bind this to EPERM. This is configurable in unit files though. To me returning ENOSYS would be more about pretending something didn#t exist that exists, while EPERM (and not EACESS) appears to be the right error to use when we prohibit access to something, which is really what we do here.

I think it would be smart if glibc would only use new syscalls if explicitly asked for them, i.e. unless the specified flags require so just stick to faccessat() instead of faccessat2(). I mean, this otherwise creates needless churn, not just here but various other places where seccomp sandboxing is implemented... needlessly using new syscalls where the old calls would be entirely equivalent will also fuckup strace for a moment and everything else really that tracks syscalls, and isn't updated in sync with the new kernel.

I mean, I am all for making use of new syscalls where there's benefit in it, but just calling a new syscalls instead of an old one, if both are 100% equivalent just smells like unnecessary churn to me.

@poettering poettering merged commit bcf08ac into systemd:master Aug 16, 2020
@fweimer
Copy link

fweimer commented Aug 16, 2020

while EPERM (and not EACESS) appears to be the right error to use when we prohibit access to something, which is really what we do here.

The problem is that EPERM is also returned by implemented system calls, so it's impossible to use as a trigger for fallback. This only works with ENOSYS.

I think it would be smart if glibc would only use new syscalls if explicitly asked for them, i.e. unless the specified flags require so just stick to faccessat() instead of faccessat2().

I think in this case, faccessat was actually called with AT_EACCESS, previously triggering the broken emulation.

@keszybz
Copy link
Member

keszybz commented Aug 17, 2020

I started thinking that we should indeed switch to ENOSYS as the error, and it seems doable, but not so simple.

On first blush, it seems reasonable to use ENOSYS whenever a new syscall is added. A program making use of the syscall is expected to expect that it might not be available, so it will usually include a fallback for ENOSYS. But it much less useful for old system calls, since programs expect that the syscall is available on all possible kernels. And for such widely-available "old" systems calls, handling is more likely to be implemented for EPERM than for ENOSYS. In systemd we filter many "old" syscalls, in particular many syscalls we qualify as "obsolete". Switching from EPERM to ENOSYS everywhere would be an unacceptable compat break in our interface.

But I think this could be made to work in a limited fashion: for any given version of systemd, record what was the maximum allocated syscall number (this would sadly have to be per-architecture...). For any syscall that is newer than that return ENOSYS. For the rest, return EPERM as we currently do.

This will give a fairly stable behaviour: when the kernel is upgraded beneath systemd, we behave as if we were still on the old kernel until the filters are upgraded. And when the filters are upgraded by means of installing a new systemd version, we assume that all new syscalls have been added to the filters, so that is then OK to return EPERM if a syscall is not allowed.

WDYT?

@poettering
Copy link
Member

It's more complicated than that, i.e. the syscall numbering space is not contiguous, and numbers different on all archs. I think the more recently added syscalls use the same numbering on all archs, but even there it's not in historical order. for example facessat2() was added in 5.8 iirc as syscall nr 439. But close_range() is being added in 5.9 as syscall 438.

@poettering
Copy link
Member

poettering commented Aug 17, 2020

The problem is that EPERM is also returned by implemented system calls, so it's impossible to use as a trigger for fallback. This only works with ENOSYS.

Hmm, there are also syscalls that return ENOSYS even when implemented iirc, ioctl() for example and other multiplexed stuff iirc.

I think in this case, faccessat was actually called with AT_EACCESS, previously triggering the broken emulation.

But openat() vs. open(). IIRC glibc only uses openat() these days on x86-64 even if the dirfd is not used... But I guess the ship for that one definitely is sailed...

@keszybz keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Aug 17, 2020
@keszybz
Copy link
Member

keszybz commented Aug 17, 2020

This commit is now stable-246.2.

@fweimer
Copy link

fweimer commented Aug 17, 2020

On first blush, it seems reasonable to use ENOSYS whenever a new syscall is added. A program making use of the syscall is expected to expect that it might not be available, so it will usually include a fallback for ENOSYS. But it much less useful for old system calls, since programs expect that the syscall is available on all possible kernels. And for such widely-available "old" systems calls, handling is more likely to be implemented for EPERM than for ENOSYS. In systemd we filter many "old" syscalls, in particular many syscalls we qualify as "obsolete". Switching from EPERM to ENOSYS everywhere would be an unacceptable compat break in our interface.

Yes, that seems reasonable.

But I think this could be made to work in a limited fashion: for any given version of systemd, record what was the maximum allocated syscall number (this would sadly have to be per-architecture...). For any syscall that is newer than that return ENOSYS. For the rest, return EPERM as we currently do.

Do you mean in the systemd sources? Yes, that would work, because it will be stable across rebuilds with different kernel headers. It's probably not going to be a single maximum number for architectures with multiple system call tables (e.g., lp64 vs lp32).

In glibc, we use our own system call tables which get regularly sync'ed against the kernel, after review. It was the only way for us to get a stable implementation of the Y2038 system calls, otherwise the system call profile would change depending on the kernel header version. It was relatively easy for us to implement this because we already had scripts for building cross-toolchains, so the system call table maintenance was a small add-on on top of that.

@brauner
Copy link
Contributor

brauner commented Aug 18, 2020

It's more complicated than that, i.e. the syscall numbering space is not contiguous, and numbers different on all archs. I think the more recently added syscalls use the same numbering on all archs, but even there it's not in historical order. for example

Depending on what architectures you care about this needs to be qualified:

  • alpha is still special
  • ia64 has a 1024 offset
  • mips has a 4000/5000/6000 offset depending on exact abi

For example

#ifndef __NR_clone3
	#if defined __alpha__
		#define __NR_clone3 545
	#elif defined _MIPS_SIM
		#if _MIPS_SIM == _MIPS_SIM_ABI32	/* o32 */
			#define __NR_clone3 (435 + 4000)
		#endif
		#if _MIPS_SIM == _MIPS_SIM_NABI32	/* n32 */
			#define __NR_clone3 (435 + 6000)
		#endif
		#if _MIPS_SIM == _MIPS_SIM_ABI64	/* n64 */
			#define __NR_clone3 (435 + 5000)
		#endif
	#elif defined __ia64__
		#define __NR_clone3 (435 + 1024)
	#else
		#define __NR_clone3 435
	#endif
#endif

facessat2() was added in 5.8 iirc as syscall nr 439. But close_range() is being added in 5.9 as syscall 438.

@keszybz
Copy link
Member

keszybz commented Aug 19, 2020

After looking into implementing this, I don't think we can reasonably do this with our current toolling. It is fairly easy to get a list of all syscalls. There is an arbitrary number of unknown syscalls, if we want to return ENOSYS for the unknown ones, we need to use ENOSYS as the default action. This means that that we need to enumerate the other ones or generate some smart rule to catch them all.

It is possible to express this concisely, because the syscall number table is almost completely filled. For example, for amd64 we'd use EPERM when n <= 181 || 186 <= n <= 235 || 237 <= n <= 334 || 424 <= n <= 439. It is reasonable to assume that syscalls would be added at the end, possibly with an occasional gap, but the expression would always remain reasonably short.

But libseccomp does not allow comparison operators on syscall numbers, so we would need to enumerate all syscalls. This would mean that we'd create very lengthy filter scripts, which would create a high overhead on all syscalls, and I don't think we want to do that.

To make this work reasonably, libseccomp would need to support comparison operators also for the syscall number, or alternatively, we'd need to use something else than libseccomp to generate the filter. I'll open an RFE against libseccomp to support this.

@poettering
Copy link
Member

sounds as if one wants a "gperf" that generates BPF code...

@topimiettinen
Copy link
Contributor

My RFE for libseccomp asks for range ops, wouldn't that work?

@srd424
Copy link
Contributor

srd424 commented Aug 19, 2020

Just found this discussion .. looks like a mirror of the one on the mailing list! As for range ops, I don't think libseccomp currently supports checking the syscall number itself with an operator, does it? Only syscall arguments.

@poettering
Copy link
Member

so i just realized thta we basically maintain a list of all known syscalls anyway, in syscall_filter_sets[]. On my machine it lists all syscalls the kernel knows except for three. One group in particular is basically a list of system calls we know: @system-service. My guess is that it contains > 80% of all known syscalls anyway, and we advocate that people use it for their services.

If somebody uses @system-service we end up asking libseccomp for a filter that includes every single syscall listed in there in O(n) work. I kinda hope that libseccomp in the backend actually tries to merge these syscall entries into something smarter than just a list of equality checks. (If it didn't it would be pretty useless a library if you ask me, but I didn't check that).

Now we could define a new special group @known or so, which simply includes all syscalls. And then if you install an allowlist syscall filter and use @system-service, what actually happens in the bg is that we allow everything in @system-service, EPERM everything in @known that is not in @system-service, and ENOSYS everything else.

I think effectively this wouldn't really make complexity of the filters much worse... Because if we have 80% of the syscalls in the filter or 100% doesn't really matter...

@srd424
Copy link
Contributor

srd424 commented Aug 19, 2020

I kinda hope that libseccomp in the backend actually tries to merge these syscall entries into something smarter than just a list of equality checks. (If it didn't it would be pretty useless a library if you ask me, but I didn't check that).

Looking at the code, which is a bit over my head, it seems to construct a binary tree, and convert that into BPF, which I guess is a sensible approach.

@srd424
Copy link
Contributor

srd424 commented Aug 19, 2020

Might need to think about performance a bit - does glibc remember getting ENOSYS for a particular syscall? I assume it would do so to avoid unnecessary context switches.

@srd424
Copy link
Contributor

srd424 commented Aug 19, 2020

Doesn't look like it does - but I suppose "upgrade your software then" is a reasonable response to suboptimal performance :)

@srd424
Copy link
Contributor

srd424 commented Aug 19, 2020

[Apologies for spam] It also occurred to me that if the ENOSYS handling ever gets pushed down into libseccomp, that should make it safe to invert large allowlists into small denylists?

@keszybz
Copy link
Member

keszybz commented Aug 19, 2020

Note that the kernel does some optimization too. It might be able to coalesce some entries... Even if it doesn't do that now, it could in the future.

@srd424
Copy link
Contributor

srd424 commented Aug 19, 2020

Looking at the code, which is a bit over my head, it seems to construct a binary tree, and convert that into BPF, which I guess is a sensible approach.

This seems to be a new thing in libseccomp 2.5.x, and it's off by default - we might want to consider turning on SCMP_FLTATR_CTL_OPTIMIZE if it's available?

codonell added a commit to codonell/runtime-spec that referenced this pull request Nov 17, 2020
On Linux the major C libraries expect that syscalls that are
blocked from running in the container runtime return ENOSYS
to allow fallbacks to be used. Returning EPERM by default is
not useful particularly for syscalls that would return EPERM
for actual access restrictions e.g. the new faccessat2.

The runtime-spec should set the standard and recommend ENOSYS
be returned just like a kernel would that doesn't support that
syscall. This allows C runtimes to fall back on other possible
implementations given the userspace policies.

Please see the upstream discussions:
https://lwn.net/Articles/738694/
- Discusses fragility of syscall filtering.
opencontainers/runc#2151
- glibc and musl request ENOSYS return for unknown syscalls.
systemd/systemd#16739
- Discusses systemd-nspawn breakage with faccessat2.
systemd/systemd#16819
- General policy for systemd-nspawn to return ENOSYS.
seccomp/libseccomp#286
- Block unknown syscalls and erturn ENOSYS.
codonell added a commit to codonell/runtime-spec that referenced this pull request Nov 17, 2020
On Linux the major C libraries expect that syscalls that are
blocked from running in the container runtime return ENOSYS
to allow fallbacks to be used. Returning EPERM by default is
not useful particularly for syscalls that would return EPERM
for actual access restrictions e.g. the new faccessat2.

The runtime-spec should set the standard and recommend ENOSYS
be returned just like a kernel would that doesn't support that
syscall. This allows C runtimes to fall back on other possible
implementations given the userspace policies.

Please see the upstream discussions:
https://lwn.net/Articles/738694/
- Discusses fragility of syscall filtering.
opencontainers/runc#2151
- glibc and musl request ENOSYS return for unknown syscalls.
systemd/systemd#16739
- Discusses systemd-nspawn breakage with faccessat2.
systemd/systemd#16819
- General policy for systemd-nspawn to return ENOSYS.
seccomp/libseccomp#286
- Block unknown syscalls and return ENOSYS.
@mrc0mmand mrc0mmand mentioned this pull request Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

7 participants