Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE: allow adding an exact syscall number for an architecture #259

Open
Xyene opened this issue Jun 25, 2020 · 3 comments
Open

RFE: allow adding an exact syscall number for an architecture #259

Xyene opened this issue Jun 25, 2020 · 3 comments

Comments

@Xyene
Copy link
Contributor

Xyene commented Jun 25, 2020

Hi,

I have a usage of libseccomp that involves a 64-bit process setting up a filter for and then launching a 32-bit process, with SCMP_ACT_TRACE and using ptrace for syscall emulation.

The code already has to maintain a syscall name to per-arch ID (similar to libseccomp's syscall.csv in spirit), since it falls back to pure ptrace if seccomp is not available. This means that in order to use libseccomp's handy SCMP_SYS helpers, we'd need to maintain another mapping to use only on the seccomp path. Possible, but icky. We already have the raw mappings; might as well use them.

Reading the docs, I would have expected the following code to set up a filter allowing fork for x86, and causing traps for all other syscalls (most error-checking elided for brevity):

#include <stdio.h>
#include <string.h>
#include <seccomp.h>

void pfc(scmp_filter_ctx ctx) {
    seccomp_export_pfc(ctx, 2);
    fprintf(stderr, "-----\n");
}

int main(int argc, char **argv) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRACE(0));
    seccomp_arch_add(ctx, SCMP_ARCH_X86);
    seccomp_arch_remove(ctx, SCMP_ARCH_NATIVE);
    pfc(ctx);

    int rc;
    // fork is syscall 2 on x86
    if ((rc = seccomp_rule_add_exact(ctx, SCMP_ACT_ALLOW, 2, 0))) {
        fprintf(stderr, "seccomp_rule_add: %s\n", strerror(-rc));
    }

    pfc(ctx);
    return 0;
}

Compiling and running this example produces:

$ gcc -o test test.c -lseccomp                                                
$ file test                                                                 
test: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=85c3d0a91296a9107f390c1124e69f1d45387134, for GNU/Linux 3.2.0, not stripped
$ ./test
#
# pseudo filter code start
#
# filter for arch x86 (1073741827)
if ($arch == 1073741827)
  # default action
  action TRACE(0);
# invalid architecture action
action KILL;
#
# pseudo filter code end
#
-----
#
# pseudo filter code start
#
# filter for arch x86 (1073741827)
if ($arch == 1073741827)
  # filter for syscall "open" (5) [priority: 65535]
  if ($syscall == 5)
    action ALLOW;
  # default action
  action TRACE(0);
# invalid architecture action
action KILL;
#
# pseudo filter code end
#
-----

It looks like libseccomp is taking the syscall 2, seeing that it is open on x86-64, then mapping that to x86 syscall 5 in the filter. At the same time, the x64 arch has been removed from the seccomp context, and only x86 remains.

This may or may not be a helpful thing to do (personally, I find it confusing, but that's not a good reason to change something :), but at the very least I couldn't find anything in the documentation describing this behavior. The closest I found was in man 3 seccomp_rule_add (apologies if I've missed something obvious):

While it is possible to specify the syscall value directly using the
standard __NR_syscall values, in order to ensure proper operation
across multiple architectures it is highly recommended to use the
SCMP_SYS() macro instead. See the EXAMPLES section below.

I think the wording, in particularly that of "...to specify the syscall value directly..." doesn't indicate that the syscall number will be translated in any way. My expectation, at least, was that if I call seccomp_rule_add with syscall number N, then syscall number N appears in all branches of architectures currently in the context.

This made it an excellent foot-gun — instead of allowing a rather harmless fork (syscall 2 on x86), the filter instead allows opening anything (syscall 5 on x86).

If this is in fact the intended behavior here (and reading through the code makes it pretty clear it is), I'd be happy to submit a patch clarifying that bit of the documentation. At the same time, having a way to add raw syscall numbers, without this remapping, would be very useful in cases where SCMP_ACT_TRACE is being used. Maintaining a mapping to libseccomp pseudosyscall numbers in such cases seems less than ideal. If this is something you'd consider having upstream, I can also send in a patch for review.

@pcmoore pcmoore added this to the v2.5.0 milestone Jun 25, 2020
@pcmoore pcmoore added enhancement and removed bug labels Jun 25, 2020
@pcmoore pcmoore removed this from the v2.5.0 milestone Jun 25, 2020
@pcmoore
Copy link
Member

pcmoore commented Jun 25, 2020

Hi @Xyene, thanks for the RFE and/or BUG. I'm thinking this may be both, but given the subject line I think it's reasonable to keep it as a RFE.

Yes, you are correct in that libseccomp always interprets syscalls in the context of the native ABI. This is partly due to historical reasons where people wanted to be able to use the __NR_xxxx syscall defines (as you found out in the manpages), but the reality is it also saves callers from having to manage their own per-ABI syscall tables, and it helps provide a consistent library API in the face of differing and/or multiple ABIs in a filter. Changing this behavior would break compatibility with the libseccomp v2.y.z releases so it's unlikely to happen soon, and even if we could I'm not sure it is the correct thing to do for the library.

However, I agree that we should be doing a better job about making this behavior clear in the documentation. @Xyene would you be willing to put together a patch/PR for the manpages? If not, that's fine but please let us know so @drakenclimber or I can work on it. It would be good to get the doc fixes in for the v2.5.0 release.

Beyond that, I suppose we could create a new API that would take raw syscall numbers and pass them through to the filter untranslated; although I haven't given that too much thought yet. Since this hits at the issue's subject line, let's leave this open for now as a way of discussing and tracking this.

@Xyene
Copy link
Contributor Author

Xyene commented Jun 25, 2020

Changing this behavior would break compatibility with the libseccomp v2.y.z releases so it's unlikely to happen soon, and even if we could I'm not sure it is the correct thing to do for the library.

Agreed, I don't think changing something so fundamental (but subtle) is a good idea :)

@Xyene would you be willing to put together a patch/PR for the manpages?

Will do so soon.

Beyond that, I suppose we could create a new API that would take raw syscall numbers and pass them through to the filter untranslated

I think this could be an opportunity to kill two birds with one stone.

As far as I can tell, there's currently no way to specify a disjoint set of syscalls per architecture. In the x86-on-x64 case, regardless of how the architectures are added to the context, at least some syscall rules will be "inherited" by the other. There's no way I see to have an x86-only ruleset and an x64-only one in the same context.

Since there are plenty of bits still left in an int even after considering pseudosyscall numbers, one proposal would be something like this:

// Allow syscall #2 on x86
seccomp_rule_add_exact(ctx, SCMP_ACT_ALLOW, SCMP_SYS_ARCH(2, SCMP_ARCH_X86), 0);
// Allow read syscall on x86, same resolution mechanism as SCMP_SYS
seccomp_rule_add_exact(ctx, SCMP_ACT_ALLOW, SCMP_SYS_ARCH(read, SCMP_ARCH_X86), 0);

@pcmoore
Copy link
Member

pcmoore commented Jun 25, 2020

As far as I can tell, there's currently no way to specify a disjoint set of syscalls per architecture. In the x86-on-x64 case, regardless of how the architectures are added to the context, at least some syscall rules will be "inherited" by the other. There's no way I see to have an x86-only ruleset and an x64-only one in the same context.

Create two filters, one for x86 and one for x86_64, and add whatever rules you want to each, independent of each other and when you are done you merge them back together with a call to seccomp_merge(...). This was the main motivation for adding seccomp_merge(...) to the libseccomp API.

Xyene added a commit to Xyene/libseccomp that referenced this issue Jun 25, 2020
Xyene added a commit to Xyene/libseccomp that referenced this issue Jun 25, 2020
Xyene added a commit to Xyene/libseccomp that referenced this issue Jul 11, 2020
Xyene added a commit to Xyene/libseccomp that referenced this issue Jul 11, 2020
libseccomp performs a translation step when adding a raw syscall value
to a multi-architecture filter. For instance, when adding __NR_open
(syscall value 2 on x86-64) to a filter containing x86 and x86-64 where
the native ABI is x86-64, the x86 BPF branch will use the value 5
(__NR_open on x86).

This commit adds explicit documentation for the translation step.

Refs seccomp#259.

Signed-off-by: Tudor Brindus <[email protected]>
pcmoore pushed a commit that referenced this issue Jul 14, 2020
libseccomp performs a translation step when adding a raw syscall value
to a multi-architecture filter. For instance, when adding __NR_open
(syscall value 2 on x86-64) to a filter containing x86 and x86-64 where
the native ABI is x86-64, the x86 BPF branch will use the value 5
(__NR_open on x86).

This commit adds explicit documentation for the translation step.

Refs #259.

Signed-off-by: Tudor Brindus <[email protected]>
Acked-by: Tom Hromatka <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants