
feat(ext/ffi): Implement FFI fast-call trampoline with Dynasmrt #15305

Merged

Conversation

arnauorriols
Member

@arnauorriols arnauorriols commented Jul 25, 2022

Fixes #15306

In order to use V8's fast API for the FFI functions, Deno needs to wrap them in a trampoline function that performs the following transformations (at the time of writing):

  • remove the initial argument (the receiver) and shift the remaining arguments one place to the left
  • cast 8/16-bit integer arguments to 32 bits
  • cast 8/16-bit integer return values to 32 bits
  • unwrap pointers passed as Uint8Arrays
  • wrap 64-bit return values in an Int32Array

Currently this trampoline is JIT-compiled using TCC. However, using TCC is problematic, among other reasons because of the increased build complexity and complicated platform support. This PR implements Deno's own JIT compilation of the trampoline using dynasmrt.
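To make the transformations concrete, here is a hand-written sketch in plain Rust of what the generated trampoline does for a hypothetical add_u8 symbol. This is illustrative only: the real trampoline is machine code emitted at runtime with dynasmrt, and the names below are made up.

```rust
use std::ffi::c_void;

// The native FFI symbol: operates on 8-bit integers.
extern "C" fn add_u8(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

// V8's fast API invokes the function with a leading receiver argument and
// 32-bit integers. The trampoline drops the receiver, narrows the arguments
// back to 8 bits, and widens the 8-bit return value to 32 bits.
extern "C" fn trampoline(_receiver: *const c_void, a: u32, b: u32) -> u32 {
    add_u8(a as u8, b as u8) as u32
}

fn main() {
    // 250 + 10 wraps around to 4 in u8 arithmetic.
    println!("{}", trampoline(std::ptr::null(), 250, 10));
}
```

The real code does the same adaptation directly in registers, with no extra function-call overhead.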

Status

  • SysV PoC
  • Aarch64-Apple PoC
  • Win64 fastcall PoC
  • design consolidation
  • initial review iterations
  • Pointer support via FastApiTypedArray<Uint8>
  • 64bit return values support via FastApiTypedArray<Int32>
  • final review iterations
  • final design evaluation
  • decision on V8's Aarch64-Apple bug

Open Questions

  • V8 does not currently follow Apple's policy of packing stack arguments at their natural alignment, and instead aligns all arguments to 8-byte boundaries.
    A decision needs to be taken:
    1. Leave it broken and wait for V8 to fix the bug
    2. Adapt to the V8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well
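To illustrate the divergence, here is a small hypothetical helper (not code from this PR) that computes stack-argument offsets under the two policies:

```rust
// Computes the stack offset of each argument given its size in bytes, either
// packed at natural alignment (Apple's AArch64 policy) or with every argument
// promoted to its own 8-byte slot (V8's current behavior).
fn stack_offsets(sizes: &[usize], pad_each_to_8: bool) -> Vec<usize> {
    let mut offset = 0;
    sizes
        .iter()
        .map(|&size| {
            let slot = if pad_each_to_8 { 8 } else { size };
            // Round the running offset up to the slot's alignment.
            offset = (offset + slot - 1) / slot * slot;
            let this = offset;
            offset += slot;
            this
        })
        .collect()
}

fn main() {
    // Two u8 arguments followed by a u32, all passed on the stack:
    println!("{:?}", stack_offsets(&[1, 1, 4], false)); // Apple: packed
    println!("{:?}", stack_offsets(&[1, 1, 4], true)); // V8: 8-byte slots
}
```

Under Apple's rules the three arguments land at offsets 0, 1, and 4; under V8's current behavior they land at 0, 8, and 16, so a trampoline targeting one layout reads garbage when the caller uses the other.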

Benchmarks

Linux (SysV AMD64)

./target/release/deno bench --allow-ffi --allow-read --unstable --v8-flags=--allow-natives-syntax test_ffi/tests/bench.js
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
runtime: deno 1.24.3 (x86_64-unknown-linux-gnu)

file:///home/orriols/dev/deno/deno/test_ffi/tests/bench.js
benchmark                              time (avg)             (min … max)       p75       p99      p995
------------------------------------------------------------------------- -----------------------------
nop()                                2.23 ns/iter    (2.18 ns … 66.29 ns)   2.19 ns   2.64 ns   3.21 ns
hash()                               51.7 ns/iter  (50.83 ns … 113.54 ns)  51.68 ns   60.7 ns  62.43 ns
c string                            115.4 ns/iter (107.48 ns … 178.86 ns) 118.03 ns 134.04 ns 138.99 ns
add_u32()                            4.65 ns/iter    (4.58 ns … 15.62 ns)   4.61 ns   5.24 ns   5.86 ns
return_buffer()                      7.13 ns/iter    (6.76 ns … 16.18 ns)   7.23 ns   7.93 ns   8.66 ns
add_u64()                            7.45 ns/iter     (6.8 ns … 13.68 ns)   7.64 ns   8.34 ns   9.09 ns
return_u64()                        12.71 ns/iter      (11.26 ns … 77 ns)  12.87 ns  17.27 ns  18.37 ns
return_i64()                        32.07 ns/iter   (29.51 ns … 98.87 ns)   33.1 ns  41.63 ns  43.02 ns
nop_u8()                             4.81 ns/iter    (4.58 ns … 69.47 ns)   4.61 ns  12.22 ns  12.46 ns
nop_i8()                             4.64 ns/iter    (4.59 ns … 13.22 ns)    4.6 ns   5.25 ns   6.26 ns
nop_u16()                            4.64 ns/iter    (4.58 ns … 68.63 ns)    4.6 ns   5.18 ns   5.59 ns
nop_i16()                            4.65 ns/iter    (4.58 ns … 69.59 ns)    4.6 ns   5.31 ns   6.31 ns
nop_u32()                            4.66 ns/iter    (4.58 ns … 11.05 ns)   4.62 ns   5.21 ns   5.93 ns
nop_i32()                            4.66 ns/iter    (4.58 ns … 66.25 ns)   4.62 ns   5.18 ns   5.92 ns
nop_u64()                            4.73 ns/iter    (4.58 ns … 68.84 ns)   4.62 ns   7.35 ns  10.99 ns
nop_i64()                            4.66 ns/iter    (4.58 ns … 68.63 ns)   4.61 ns   5.24 ns   5.94 ns
nop_usize() number                   4.87 ns/iter     (4.58 ns … 14.5 ns)    4.6 ns   9.81 ns  10.03 ns
nop_usize() bigint                 102.51 ns/iter   (97.4 ns … 122.63 ns) 103.91 ns 118.02 ns 120.04 ns
nop_isize() number                   4.65 ns/iter    (4.58 ns … 10.51 ns)   4.62 ns   5.21 ns   5.63 ns
nop_isize() bigint                 104.71 ns/iter  (98.99 ns … 150.36 ns) 106.03 ns  118.6 ns 119.96 ns
nop_f32()                            4.85 ns/iter    (4.58 ns … 22.27 ns)   4.62 ns  12.22 ns  12.25 ns
nop_f64()                            4.65 ns/iter    (4.58 ns … 11.08 ns)   4.62 ns   5.21 ns   5.86 ns
nop_buffer()                         5.27 ns/iter    (5.02 ns … 11.02 ns)   5.29 ns   6.01 ns   6.61 ns
nop_buffer() number                 79.67 ns/iter  (76.11 ns … 143.57 ns)  80.05 ns  89.09 ns  93.91 ns
return_u8()                          5.12 ns/iter    (5.02 ns … 14.17 ns)   5.04 ns   7.42 ns   8.03 ns
return_i8()                          5.13 ns/iter    (5.02 ns … 69.47 ns)   5.04 ns   6.83 ns   9.62 ns
return_u16()                         5.09 ns/iter     (5.02 ns … 16.7 ns)   5.04 ns   5.93 ns   6.82 ns
return_i16()                         5.08 ns/iter    (5.02 ns … 11.85 ns)   5.04 ns   5.68 ns    6.3 ns
return_u32()                         4.42 ns/iter     (4.36 ns … 13.3 ns)   4.38 ns   4.95 ns   5.21 ns
return_i32()                         4.41 ns/iter    (4.36 ns … 10.75 ns)   4.38 ns   4.94 ns   5.23 ns
return_usize()                      12.38 ns/iter   (11.28 ns … 30.75 ns)  12.36 ns  17.05 ns  19.19 ns
return_isize()                      29.98 ns/iter   (27.53 ns … 55.83 ns)  30.05 ns  39.01 ns  40.75 ns
return_f32()                         4.88 ns/iter      (4.8 ns … 70.7 ns)   4.84 ns   5.44 ns   6.18 ns
return_f64()                         4.87 ns/iter     (4.8 ns … 67.36 ns)   4.83 ns   5.49 ns   6.03 ns

Apple M1 (Aarch64 Apple)

Note: these benchmarks are run on a dedicated M1 host in AWS.

./target/release/deno bench --allow-ffi --allow-read --unstable --v8-flags=--allow-natives-syntax test_ffi/tests/bench.js
cpu: Apple M1
runtime: deno 1.24.3 (aarch64-apple-darwin)

file:///Users/ec2-user/deno/test_ffi/tests/bench.js
benchmark                              time (avg)             (min … max)       p75       p99      p995
------------------------------------------------------------------------- -----------------------------
nop()                                1.92 ns/iter    (1.87 ns … 17.51 ns)   1.88 ns   2.52 ns   2.52 ns
hash()                              51.51 ns/iter   (51.29 ns … 63.67 ns)  51.36 ns  55.73 ns  56.68 ns
c string                            86.41 ns/iter  (82.39 ns … 102.08 ns)  88.64 ns  95.93 ns 100.74 ns
add_u32()                            4.31 ns/iter     (4.29 ns … 6.74 ns)   4.31 ns   4.43 ns   4.58 ns
return_buffer()                       6.7 ns/iter    (6.68 ns … 12.73 ns)   6.69 ns   6.97 ns   7.14 ns
add_u64()                            6.89 ns/iter    (6.86 ns … 10.57 ns)   6.88 ns   7.31 ns   7.35 ns
return_u64()                        11.28 ns/iter   (10.96 ns … 18.48 ns)  10.99 ns  14.74 ns  15.16 ns
return_i64()                        29.03 ns/iter   (28.01 ns … 42.13 ns)  28.82 ns  32.77 ns   33.1 ns
nop_u8()                              4.4 ns/iter    (4.37 ns … 10.18 ns)    4.4 ns   4.67 ns   4.84 ns
nop_i8()                             4.25 ns/iter     (4.22 ns … 8.21 ns)   4.25 ns   4.52 ns   4.67 ns
nop_u16()                            4.25 ns/iter     (4.22 ns … 8.22 ns)   4.25 ns   4.52 ns   4.69 ns
nop_i16()                            4.25 ns/iter     (4.23 ns … 8.46 ns)   4.25 ns   4.52 ns   4.54 ns
nop_u32()                            4.25 ns/iter     (4.22 ns … 8.78 ns)   4.25 ns   4.52 ns   4.69 ns
nop_i32()                            4.25 ns/iter     (4.22 ns … 8.34 ns)   4.24 ns   4.54 ns   4.67 ns
nop_u64()                            4.26 ns/iter     (4.22 ns … 8.44 ns)   4.25 ns   4.52 ns   4.69 ns
nop_i64()                            4.45 ns/iter     (4.41 ns … 8.55 ns)   4.44 ns   4.54 ns   4.62 ns
nop_usize() number                    4.3 ns/iter     (4.22 ns … 9.15 ns)   4.33 ns   4.58 ns   4.68 ns
nop_usize() bigint                 112.71 ns/iter (110.96 ns … 122.07 ns) 112.26 ns 121.81 ns 122.03 ns
nop_isize() number                   4.25 ns/iter     (4.22 ns … 8.76 ns)   4.24 ns   4.52 ns   4.69 ns
nop_isize() bigint                 118.03 ns/iter (113.17 ns … 127.47 ns) 118.07 ns 125.56 ns 126.31 ns
nop_f32()                            4.25 ns/iter     (4.22 ns … 8.48 ns)   4.24 ns   4.51 ns   4.68 ns
nop_f64()                             4.3 ns/iter     (4.27 ns … 8.87 ns)    4.3 ns   4.56 ns   4.72 ns
nop_buffer()                          5.3 ns/iter     (5.28 ns … 9.57 ns)    5.3 ns   5.57 ns   5.74 ns
nop_buffer() number                 93.79 ns/iter     (93 ns … 100.13 ns)     94 ns  96.16 ns  97.26 ns
return_u8()                          4.33 ns/iter     (4.22 ns … 8.33 ns)   4.33 ns    4.6 ns   4.77 ns
return_i8()                          4.32 ns/iter     (4.22 ns … 7.91 ns)   4.33 ns   4.59 ns   4.76 ns
return_u16()                         4.55 ns/iter     (4.46 ns … 8.62 ns)   4.58 ns   4.84 ns   4.94 ns
return_i16()                         4.49 ns/iter     (4.41 ns … 8.34 ns)   4.52 ns    4.7 ns    4.8 ns
return_u32()                         4.33 ns/iter      (4.23 ns … 8.4 ns)   4.33 ns   4.35 ns   4.59 ns
return_i32()                         4.24 ns/iter     (4.21 ns … 8.78 ns)   4.24 ns    4.5 ns   4.52 ns
return_usize()                      11.17 ns/iter   (10.95 ns … 18.68 ns)     11 ns  15.35 ns  15.98 ns
return_isize()                       30.1 ns/iter   (29.05 ns … 39.26 ns)  29.74 ns   35.8 ns  36.17 ns
return_f32()                         4.38 ns/iter     (4.36 ns … 7.28 ns)   4.38 ns   4.65 ns   4.73 ns
return_f64()                         4.58 ns/iter     (4.54 ns … 9.03 ns)   4.58 ns   4.85 ns   5.03 ns

Windows 11 (Windows64 fastcall)

 ./target/release/deno bench --allow-ffi --allow-read --unstable --v8-flags=--allow-natives-syntax test_ffi/tests/bench.js
cpu: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
runtime: deno 1.24.3 (x86_64-pc-windows-msvc)

file:///C:/Users/arnau/deno/test_ffi/tests/bench.js
benchmark                              time (avg)             (min … max)       p75       p99      p995
------------------------------------------------------------------------- -----------------------------
nop()                                2.29 ns/iter    (2.18 ns … 10.76 ns)   2.25 ns   3.06 ns   3.39 ns
hash()                              52.09 ns/iter   (51.07 ns … 66.79 ns)  51.83 ns  61.27 ns  62.87 ns
c string                           119.97 ns/iter  (115.3 ns … 239.23 ns) 119.69 ns 137.92 ns 214.37 ns
add_u32()                            4.45 ns/iter     (4.36 ns … 9.41 ns)   4.38 ns   4.98 ns   5.25 ns
return_buffer()                      7.42 ns/iter    (6.77 ns … 17.61 ns)   7.55 ns   8.31 ns   8.81 ns
add_u64()                            7.43 ns/iter    (6.78 ns … 70.22 ns)   7.57 ns   8.51 ns   9.23 ns
return_u64()                        12.76 ns/iter   (11.21 ns … 77.78 ns)   12.5 ns  18.19 ns  20.38 ns
return_i64()                        33.95 ns/iter   (30.92 ns … 97.96 ns)   35.3 ns   42.6 ns  44.49 ns
nop_u8()                             4.44 ns/iter    (4.36 ns … 14.45 ns)   4.38 ns   4.87 ns   5.17 ns
nop_i8()                             4.48 ns/iter    (4.36 ns … 68.27 ns)    4.4 ns   5.53 ns      7 ns
nop_u16()                            4.45 ns/iter    (4.36 ns … 11.38 ns)   4.39 ns   4.87 ns   5.03 ns
nop_i16()                            4.47 ns/iter    (4.36 ns … 16.19 ns)   4.38 ns   5.96 ns   7.18 ns
nop_u32()                            4.45 ns/iter     (4.36 ns … 9.37 ns)   4.38 ns   4.92 ns   5.23 ns
nop_i32()                            4.46 ns/iter    (4.36 ns … 68.58 ns)   4.38 ns   5.08 ns    5.5 ns
nop_u64()                            4.44 ns/iter    (4.36 ns … 10.15 ns)   4.38 ns   4.92 ns   5.18 ns
nop_i64()                            4.46 ns/iter    (4.36 ns … 13.01 ns)   4.47 ns   5.03 ns   5.25 ns
nop_usize() number                   4.45 ns/iter    (4.36 ns … 12.06 ns)   4.38 ns   4.93 ns   5.26 ns
nop_usize() bigint                 159.57 ns/iter (153.78 ns … 186.44 ns) 161.55 ns 177.07 ns 179.94 ns
nop_isize() number                   4.44 ns/iter    (4.36 ns … 10.87 ns)   4.38 ns   4.89 ns   5.11 ns
nop_isize() bigint                  160.6 ns/iter (154.85 ns … 278.81 ns) 162.69 ns 185.48 ns 218.49 ns
nop_f32()                            4.45 ns/iter    (4.36 ns … 13.01 ns)   4.38 ns    4.9 ns   5.18 ns
nop_f64()                            4.47 ns/iter    (4.36 ns … 12.82 ns)   4.42 ns   4.94 ns   5.19 ns
nop_buffer()                          5.1 ns/iter    (4.92 ns … 13.11 ns)   5.25 ns   5.79 ns   6.06 ns
nop_buffer() number                135.39 ns/iter  (132.2 ns … 151.68 ns) 135.77 ns 147.07 ns 149.18 ns
return_u8()                           4.9 ns/iter     (4.8 ns … 12.65 ns)   4.82 ns   5.43 ns   5.69 ns
return_i8()                          4.89 ns/iter     (4.8 ns … 10.81 ns)   4.82 ns    5.4 ns   5.63 ns
return_u16()                         4.91 ns/iter     (4.8 ns … 14.32 ns)   4.82 ns   5.49 ns   5.87 ns
return_i16()                         4.89 ns/iter      (4.8 ns … 9.89 ns)   4.82 ns   5.37 ns   5.66 ns
return_u32()                         4.23 ns/iter    (4.14 ns … 13.29 ns)   4.16 ns   4.67 ns   4.92 ns
return_i32()                         4.25 ns/iter     (4.14 ns … 9.72 ns)   4.16 ns   5.32 ns   6.83 ns
return_usize()                      12.57 ns/iter   (11.39 ns … 30.23 ns)  12.51 ns  18.43 ns  20.62 ns
return_isize()                      33.38 ns/iter    (30.9 ns … 58.53 ns)  33.56 ns  42.58 ns  47.39 ns
return_f32()                          4.9 ns/iter     (4.79 ns … 9.07 ns)   4.91 ns   5.49 ns   5.66 ns
return_f64()                         4.68 ns/iter    (4.58 ns … 11.45 ns)   4.62 ns   5.15 ns   5.36 ns

Comparison with TCC (rev 5beec3f)

Performance is marginally better than with TCC, particularly in those cases where the trampoline tail-calls into the FFI function.
[image: benchmark comparison chart, dynasmrt vs TCC]

@CLAassistant

CLAassistant commented Jul 25, 2022

CLA assistant check
All committers have signed the CLA.


impl Trampoline {
fn ptr(&self) -> *const c_void {
&self.0[0] as *const u8 as *const c_void


The type update should be upstreamed?

Collaborator

Yes, I think so. Eventually :)


I mean, please do open an issue there referencing what's already open in /deno. I'll get around to it when I'm free (unless someone else does it first).


Collaborator

@aapoalas aapoalas left a comment

Looks pretty damn awesome!

@ry ry requested a review from aslilac July 25, 2022 15:25
Member

@ry ry left a comment

This is a very interesting PR - thank you!

We want to wait until we support Uint8Array in FastAPI before landing this one (which should be soon) because it will require a bit more complex C code. I'm sure we can figure out how to also do this in ASM.

Have you given any thought to ARM?

@arnauorriols
Member Author

We want to wait until we support Uint8Array in FastAPI before landing this one (which should be soon) because it will require a bit more complex C code. I'm sure we can figure out how to also do this in ASM.

Absolutely. There's still work to do to get this PR ready anyway.

Have you given any thought to ARM?

The intent is to implement support for Apple Silicon as well as Windows x64, but I haven't really delved into it yet. First I want to see how the SysV AMD64 implementation consolidates after the feedback.

Collaborator

@aapoalas aapoalas left a comment

Absolutely amazing stuff!

match (arg_i, arg) {
// Conventionally, many compilers expect 8 and 16 bit arguments to be sign/zero extended to 32 bits
// See https://stackoverflow.com/a/36760539/2623340
(1, Unsigned(B)) => dynasm!(self.assembler; movzx edi, sil),
Collaborator

Would it be possible to get rid of some of the repetition by getting the mov command from one match:

let instruction = match arg {
  Unsigned(B) | Unsigned(W) => "movzx",
  Signed(B) | Signed(W) => "movsx",
  _ => "mov",
};

and maybe the involved registers from a function as well? Then the match here could be something like:

// instruction from above
let (destination_register, source_register) = get_integer_registers(arg_i, arg);
match (arg_i, arg) {
  (1 | 2 | 3 | 4 | 5, _) => dynasm!(self.assembler; instruction destination_register, source_register),
  (6, _) => dynasm!(self.assembler; instruction destination_register, source_register[rsp + rsp_offset]), // Here trying to piggyback on source_register to contain "BYTE " or "WORD " or ""
  (_, _) => dynasm!(...), // same kind of piggyback here
}

Not sure if the macro can handle this, though.

Member Author

Not sure if the macro can handle this, though.

Not like this; you'd need to define a wrapping macro. I'll give it a go anyway, after aarch64 & win though; first I want to see how these implementations end up looking.

In any case, for something as fragile as this, I tend to think the plain boring design might be better. Maybe if we keep adding more logic it becomes unmanageable but I don't think we are there yet.

Collaborator

Alright, sounds good.

Member Author

I've been thinking about this lately. I'm not sure about the net gain of working out a macro to abstract the match. I do however think that a couple of helper macros can improve the readability significantly by making the code more dense:
https://github.com/arnauorriols/deno/compare/refactor/ffi-trampoline-plain-asm...arnauorriols:deno:alternative-dense-dynasm?expand=1

What's your opinion?

Collaborator

I do think there's a lot of repetition and it makes reading the code for correctness a bit hard, what with having to jump between two to four axes of "arguments" (length of data, type of data (signed, unsigned, float), destination register, source register).

No two subsequent lines are generally the same, as the register names change based on the type and length of data even when it's the "same" register. An advanced macro might abstract out a lot of this, allowing lines to be read in a more "autopilot" way: "All 0th integers move from register X to Y, all unsigned integers use zero-extended moves, signed integers use sign-extended moves, ..."

However, such a magic macro would just move the complexity elsewhere. So, I'm ambivalent on the true need for such a macro.

The dense macros do definitely improve the outlook of the code a lot while not really hiding much anything, so I can definitely get behind that.

}

dynasm!(self.assembler
; sub rsp, stack_size as i32
Collaborator

On Godbolt, rustc seems to use push and pop for stack allocation, presumably as long as there is no need to push parameters onto the stack. So this would be the case when the return value has to be cast from 8 or 16 bits to 32 bits.

Would it be worth adding that as a special case, or should such stuff be left out as micro-optimisation?

The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments temporarily into xmmN (where N is the number of float args) to make stack arg shifting quicker. But that's the kind of optimisation that is probably entirely unnecessary for us :D

@ghost ghost Jul 29, 2022

The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments temporarily into xmmN (where N is the number of float args) to make stack arg shifting quicker. But that's the kind of optimisation that is probably entirely unnecessary for us :D

Invalid unless we assume SSE extension on the target platform. (Which is what Bun does, but Deno so far has not, as far as I'm aware)

Member Author

@arnauorriols arnauorriols Jul 29, 2022

Godbolt rustc compiler seems to use push and pop for stack allocation, presumably as long as there is no need to push parameters onto the stack. So this would be the case when the return value has to be cast from 8 or 16 bits to 32 bits.

AFAIK Rustc (and many others, with the notable exception of GCC) optimize by using push instead of the normative stack allocation in the prologue as long as no SSE instruction is/can be used. This is because stack instructions (PUSH, POP, CALL, RET) are optimized since the addition of the stack engine in Pentium M and newer CPUs, taking fewer uops than sub + mov. Once SSE is used to move packed data, sub is needed anyway so I guess the optimization is no longer significant. See https://godbolt.org/z/PeKaxzjK6

The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments [...] to make stack arg shifting quicker.

Correct. In a nutshell, it's moving 128 bits of data with a single uop. In the Godbolt linked above, trampoline2 moves parameters k and l together with movups.

into xmmN (where N is the number of float args)

No, N is just part of the register name. When passing floats as parameters, N does have some semantics, in the sense that the first float goes in XMM0, the second in XMM1, etc. But when moving data in/out of the register as a temporary operation, the N does not have any significance. In the previously linked Godbolt, it uses xmm2 because it is the first available xmm register (xmm0 and xmm1 are used for parameters). Unless I misunderstood what you meant?

But that's the kind of optimisation that is probably entirely unnecessary for us

Not sure if necessary or not in the long run, but I do think that this first iteration should focus on a simple and solid implementation, and worry about optimization in the future. It is not like TinyCC applies any of these optimizations anyway.

Invalid unless we assume SSE extension on the target platform. (Which is what Bun does, but Deno so far has not, as far as I'm aware)

Keep in mind that this implementation is specific to AMD64, which guarantees SSE & SSE2 instructions.

Collaborator

AFAIK Rustc (and many others, with the notable exception of GCC) optimize by using push instead of the normative stack allocation in the prologue as long as no SSE instruction is/can be used. This is because stack instructions (PUSH, POP, CALL, RET) are optimized since the addition of the stack engine in Pentium M and newer CPUs, taking fewer uops than sub + mov. Once SSE is used to move packed data, sub is needed anyway so I guess the optimization is no longer significant. See https://godbolt.org/z/PeKaxzjK6

Alright. Sounds good to me to leave a relatively insignificant complication out.

no, N is just part of the register name. When passing floats as parameters, N does have some semantic in the sense that the first float goes to XMM0, the second to XMM1, etc. But when moving data in/out of the register as a temporary operation the N does not have any significance. In the previously linked GodBolt, it uses xxm2 because it is the first available xmm register (xmm0 and xmm1) are used for parameters. Unless I misunderstood what you meant?

Yeah I just meant that it's indeed using the first available xmm register: If there are 0 float args then xmm0 can be used, otherwise 1 or 2 or ... based on the number of float params.

Not sure if necessary or not in the long run, but I do think that this first iteration should focus on a simple and solid implementation, and worry about optimization in the future. It is not like TinyCC applies any of these optimizations anyway.

Sounds good to me. Also, I think it's very unlikely we'll run into many C APIs with this many parameters.


function add10U8Fast(a, b, c, d, e, f, g, h, i, j) { return symbols.add_10_u8(a, b, c, d, e, f, g, h, i, j); };

%PrepareFunctionForOptimization(add10U8Fast);
Collaborator

Consider making this into an assert helper function like this:

function assertOptimizes(fn, callback) {
  %PrepareFunctionForOptimization(fn);
  console.log(callback());
  %OptimizeFunctionOnNextCall(fn);
  console.log(callback());
  assertFastCall(fn);
}

// SysV ABI comment
assertOptimizes(add10U8Fast, () => add10U8Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9));

Member Author

Yeah, saw your suggestion in Discord, I'll definitely apply it.

Member Author

What do you think about naming this function testOptimized instead of assertOptimizes? The reason is that it also tests the function itself, not only whether it has been optimized.

@arnauorriols
Member Author

arnauorriols commented Aug 10, 2022

Have you given any thought to ARM?

aarch64-apple-darwin and x86_64-pc-windows-msvc are now ready to review

Collaborator

@aapoalas aapoalas left a comment

Awesome stuff! Only some minor nitpicks that I could find or understand :)

// (this applies to register parameters, as stack parameters are not padded in Apple)
(1, Signed(B)) => dynasm!(self.assembler; .arch aarch64; sxtb w0, w1),
(1, Unsigned(B)) => {
dynasm!(self.assembler; .arch aarch64; and w0, w1, 0xFF)
Collaborator

Is "and w0, w1, 0xFF" somehow better / faster than "uxtb w0, w1"?

@ghost ghost Aug 11, 2022

LLVM prefers the extension.

They both take a single instruction cycle, i.e. performance won't matter.

Member Author

Yes, both have the same latency and throughput and use the same execution pipelines. I'm using "and" following the example of Rustc and Clang: https://godbolt.org/z/1jb4j8vsh

LLVM prefers the extension.

What do you mean?

@ghost ghost Aug 14, 2022

LLVM prefers the extension.

What do you mean?

It's irrelevant, but I was under the impression that Rustc didn't prefer the bitwise AND - but that's my experience from 32-bit Arm, not 64-bit Arm :)

I suppose one may need to understand how bitwise truncation works, vs. how the extension instructions work, so maybe the latter should be preferred for that reason? But I have no opinion here.

Member Author

@arnauorriols arnauorriols Aug 15, 2022

I suppose one may need to understand how bitwise truncation works, vs. how the extension instructions work

According to the Arm documentation it's essentially the same, except that "and" does not have to deal with the rotation part.
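A quick check (in Rust, standing in for the two AArch64 instructions) that masking with 0xFF and zero-extending the low byte are indeed equivalent:

```rust
fn main() {
    // `uxtb w0, w1` zero-extends the low byte of w1 into w0;
    // `and w0, w1, 0xFF` masks everything but the low byte.
    // Both yield the same 32-bit result for any input.
    for w1 in [0u32, 0x7F, 0x80, 0xFF, 0x1234_56FF, 0xFFFF_FF01] {
        let uxtb = (w1 as u8) as u32; // zero-extension of the low byte
        let and = w1 & 0xFF;          // bitwise mask
        assert_eq!(uxtb, and);
    }
    println!("equivalent");
}
```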

// > Function arguments may consume slots on the stack that are not multiples of 8 bytes.
// > If the total number of bytes for stack-based arguments is not a multiple of 8 bytes,
// > insert padding on the stack to maintain the 8-byte alignment requirements.
// TODO: V8 does not currently follow this Apple's policy, and instead aligns all arguments to 8 Byte boundaries.
Collaborator

Seems like this might be pretty quick on the chopping block for V8 to fix, so I vote we allow it to be broken and wait for V8 to fix it.

Member Author

That's my personal opinion as well

@bartlomieju
Member

@arnauorriols please rebase.

@piscisaureus please review

Collaborator

@aapoalas aapoalas left a comment

LGTM, with my limited knowledge of assembly anyway.

Comment on lines 390 to 397
let mut int_params = 0u32;
let mut sse_params = 0u32;
for param in params {
match param {
NativeType::F32 | NativeType::F64 => sse_params += 1,
_ => int_params += 1,
}
}
@ghost ghost Aug 16, 2022
Nit: this seems like it could also be done by summing one register type, and subtracting from the length?
Maybe

Suggested change
let mut int_params = 0u32;
let mut sse_params = 0u32;
for param in params {
match param {
NativeType::F32 | NativeType::F64 => sse_params += 1,
_ => int_params += 1,
}
}
let sse_params =
params.iter().filter(|param| matches!(param, NativeType::F32 | NativeType::F64)).count();
let int_params = params.len() - sse_params;

I'm just going off of a "there's a lot of mutability here, so maybe some functional code could help" thinking; it's your choice.

Member Author
I get what you mean, but in this case in particular, I think the mutable approach is more expressive.
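For readers following along, the suggested functional form can be made self-contained like this (`NativeType` here is a stand-in enum, not the PR's actual type):

```rust
// Stand-in for ext/ffi's NativeType, reduced to a few illustrative variants.
#[allow(dead_code)]
enum NativeType {
    U8,
    U32,
    F32,
    F64,
    Pointer,
}

// Count SSE (floating point) parameters, derive integer parameters from the
// total, as the review suggestion proposes.
fn count_params(params: &[NativeType]) -> (usize, usize) {
    let sse_params = params
        .iter()
        .filter(|p| matches!(p, NativeType::F32 | NativeType::F64))
        .count();
    (sse_params, params.len() - sse_params)
}

fn main() {
    let params = [
        NativeType::U8,
        NativeType::F32,
        NativeType::Pointer,
        NativeType::F64,
    ];
    assert_eq!(count_params(&params), (2, 2)); // (sse, int)
    println!("ok");
}
```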

ext/ffi/fast_call.rs Outdated Show resolved Hide resolved
ext/ffi/fast_call.rs Show resolved Hide resolved
ext/ffi/fast_call.rs Show resolved Hide resolved
ext/ffi/fast_call.rs Outdated Show resolved Hide resolved
let mut fast_call_alloc = None;

let func = if fast_call::is_compatible(sym) {
let trampoline = fast_call::compile_trampoline(sym);

Huh, I would have liked to have suggested making compile_trampoline return an Option, but I guess that you can use compile_trampoline without is_compatible, right?

Member Author

I guess that you can use compile_trampoline without is_compatible, right?

Not really. You can always use the <Compiler>::compile method if you want to compile without is_compatible. compile_trampoline's role is essentially to dispatch based on platform, and it is definitely coupled to is_compatible.

I'm not particularly happy either with the current duplication in fast_call::is_compatible() and fast_call::compile_trampoline(). However, from an API point of view I think the if approach is more explicit (in a good way). Here's the alternative with the Option:

fast_api.rs

pub(crate) fn compile_trampoline(sym: &Symbol) -> Option<Trampoline> {
  if sym.can_callback {
    // Callbacks are not allowed in Fast API FFI
    None
  } else if cfg!(all(target_arch = "x86_64", target_family = "unix")) {
    Some(SysVAmd64::compile(sym))
  } else if cfg!(all(target_arch = "x86_64", target_family = "windows")) {
    Some(Win64::compile(sym))
  } else if cfg!(all(target_arch = "aarch64", target_vendor = "apple")) {
    Some(Aarch64Apple::compile(sym))
  } else {
    None
  }
}

lib.rs

  let func = match fast_call::compile_trampoline(sym) {
    Some(trampoline) => {
      let func = builder.build_fast(
        scope,
        &fast_call::make_template(sym, &trampoline),
        None,
      );
      fast_call_alloc = Some(Box::into_raw(Box::new(trampoline)));
      func
    }
    None => builder.build(scope),
  };

I'm good either way tbh, you or anybody else make the call.

test_ffi/tests/test.js Show resolved Hide resolved
Collaborator

@aapoalas aapoalas left a comment
Awesome work and a Herculean effort! LGTM, what with my limited knowledge.

ext/ffi/fast_call.rs Outdated Show resolved Hide resolved
NativeType::F64 => self.move_float(Double),
NativeType::U8 => self.move_integer(U(B)),
NativeType::U16 => self.move_integer(U(W)),
NativeType::U32 | NativeType::Void => self.move_integer(U(DW)),
Collaborator
NativeType::Void here seems unnecessary and could be an unreachable!() instead, I think?

ext/ffi/fast_call.rs Outdated Show resolved Hide resolved
Member

@littledivy littledivy left a comment

@arnauorriols I've looked through fast_api.rs, awesome work LGTM! Thanks @aapoalas @phosra for making the review easier.

Just to summarise for other readers, this PR notably:

  • Enables fast api FFI optimizations on Windows x64
  • Enables FFI on environments like SELinux
  • Removes dependency on tinycc

0\n\
0\n\
78\n\
45\n\
Collaborator

Output in the test is 78 on this row as well. Is this correct?

Member Author

@arnauorriols arnauorriols Sep 5, 2022

This is the consequence of:

deno/test_ffi/tests/test.js

Lines 483 to 488 in 73f05c7

// TODO: this test currently fails in aarch64-apple-darwin (also in branch main!). The reason is v8 does not follow Apple's custom
// ABI properly (aligns arguments to 8 byte boundaries instead of the natural alignment of the parameter type).
// A decision needs to be taken:
// 1. leave it broken and wait for v8 to fix the bug
// 2. Adapt to v8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well
testOptimized(addManyU16Fast, () => addManyU16Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12));

The assertion should expect 78. The problem is that on aarch64-apple-darwin the test outputs 45 due to the aforementioned V8 bug.

The bug has already been reported in https://bugs.chromium.org/p/v8/issues/detail?id=13171.

Considering that this test will unequivocally detect when a new version of V8 fixes the bug I am now slightly leaning towards option 2. However, CI does not run tests on aarch64-apple-darwin...

Opinions?
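The two numbers in the test output can be checked arithmetically. The callee sums its thirteen `u16` arguments (0 through 12), so a correct call returns 78; the observed 45 is numerically the sum with a shortfall of 33 (= 10 + 11 + 12), i.e. as if the last three values never arrive. A small check (the interpretation of exactly which loads go wrong is an assumption, not a verified trace):

```rust
fn main() {
    // Expected: sum of all thirteen arguments 0..=12.
    let expected: u16 = (0u16..=12).sum();
    assert_eq!(expected, 78);

    // Observed on aarch64-apple-darwin: equals the sum of 0..=9.
    let observed: u16 = (0u16..=9).sum();
    assert_eq!(observed, 45);

    // The shortfall is exactly the last three arguments.
    assert_eq!(expected - observed, 10 + 11 + 12);
    println!("ok");
}
```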

Collaborator

Hmm... If the adaptation is easy to do and easy to remove, then I guess option 2 would not be bad ... Two wrongs do make a right sometimes.

Still, I wouldn't really mind if choice 1 was made either. That would put pressure on V8 to fix their stuff, as I'm sure a lot of people would come complaining to Deno repo about FFI being broken on their Macs and we could just guide them to the V8 issue to voice their displeasure.

@ghost ghost Sep 5, 2022

... and we could just guide them to the V8 issue to voice their displeasure.

And yet, maybe that wouldn't be the most respectful way to treat their bug tracking forums? :)

How about we adapt, so code works today, and have a "TODO" comment with the code to be exchanged for after the V8 issue is fixed?

Member

Ok, let's go with 1 and do 2 in a follow-up? I don't want to block this PR any further; the goal right now is to move away from tinycc.

Member Author

lot of people would come complaining to Deno repo about FFI being broken on their Macs and we could just guide them to the V8 issue to voice their displeasure.

This is true, and by default I would choose this approach. But I fear the solution on V8's side might not be as straightforward as once anticipated (according to @devsnek), and I intuitively think option 2 is actually easier than option 1 (I would argue against removing the test, and we cannot leave a broken test for the folks with M1s, so we are left with a conditional "expected failure", somehow). Also, it always kinda feels a bit like using the users (and their Deno DX) as pawns. After all, for many the story usually ends at "Deno FFI is broken in my Mac".

Member Author

@arnauorriols arnauorriols Sep 6, 2022

Implementing V8 calling convention is trivial:

-        let size_original = param.size();
+        // let size_original = param.size();
+        let size_original = 8;

I just need to clarify if V8 expects 8/16 bit stack parameters to be extended, considering they are 8-byte aligned. I won't have time till tonight to get to it, so if anybody else wants to take the lead, be my guest!

The more laborious part of adapting to the V8 calling convention would have been in Aarch64Apple::allocate_stack(), but funny story, I had it wrong to begin with 🤦 (the kind of wrong that matches V8's expectations)

Member Author

pushed changes for option 2: adapt to V8 calling convention.

@arnauorriols
Member Author

Before merging this PR, a decision must be taken (and documented) on the open question described in the PR summary:

V8 does not currently follow this Apple's policy, and instead aligns all arguments to 8 Byte boundaries.
A decision needs to be taken:

  1. leave it broken and wait for v8 to fix the bug
  2. Adapt to v8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well

Discussion in #15305 (comment).

@littledivy
Member

my ocd is telling me to give this new JIT assembler a name. any suggestions?

Contributor

@bnoordhuis bnoordhuis left a comment

Lightly reviewed but very nice work!

ext/ffi/lib.rs Outdated Show resolved Hide resolved
Comment on lines 483 to 488
// TODO: this test currently fails in aarch64-apple-darwin (also in branch main!). The reason is v8 does not follow Apple's custom
// ABI properly (aligns arguments to 8 byte boundaries instead of the natural alignment of the parameter type).
// A decision needs to be taken:
// 1. leave it broken and wait for v8 to fix the bug
// 2. Adapt to v8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well
testOptimized(addManyU16Fast, () => addManyU16Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12));
Contributor

Just to make sure I understand this right: the comment is saying that V8 should align naturally but instead aligns on 8 byte boundaries?

That's indeed a V8 bug: https://github.com/v8/v8/blob/0472d5a5aa95ff96438f875bb2f0d1b9a37e046e/src/compiler/wasm-compiler.cc#L7350-L7359

I've opened https://bugs.chromium.org/p/v8/issues/detail?id=13264

Collaborator

Yes, that is correct. An issue already exists: https://bugs.chromium.org/p/v8/issues/detail?id=13171

NativeType::U64 => fast_api::Type::Uint64,
NativeType::ISize => fast_api::Type::Int64,
NativeType::USize | NativeType::Pointer | NativeType::Function => {
fast_api::Type::Uint64
Contributor

I want to double-check: this is (or would be if they existed) incorrect for 32 bits builds?

Collaborator

Yes

Comment on lines 177 to 179
for param in &sym.parameter_types {
compiler.move_left(param)
}
Contributor

You don't have to shift if you leave in the receiver in the function prototype (zeroed out or not). In the simple case that means you can elide the trampoline completely.

I'm thinking of fn square(_: *const (), x: u32) -> u32 { x * x } - seems like a reasonable trade-off to me.
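The trade-off being proposed can be sketched as follows: if the native function's own signature carried a leading, ignored receiver slot, no argument shifting would be needed and the trampoline could be elided entirely. `square` here is the hypothetical example from the comment, not an API in the PR:

```rust
// Hypothetical callee whose signature reserves the first slot for V8's
// receiver; the function simply ignores it, so no trampoline is needed to
// shift the remaining arguments left.
fn square(_recv: *const (), x: u32) -> u32 {
    x * x
}

fn main() {
    // V8 would pass the receiver in the first argument position; the callee
    // never touches it.
    assert_eq!(square(std::ptr::null(), 7), 49);
    println!("ok");
}
```

As the follow-up comments note, this only works when Deno controls the function prototype; it cannot be imposed on existing C libraries like SQLite.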

Collaborator

Since this is FFI work, calling native library C APIs we can't just decide to leave the receiver in the function prototype. Sqlite's C API won't change for us even if we decide that it will.

Contributor

Okay, fair enough. I'm thinking of fast API calls into deno's runtime, where we control the function prototype. On the other hand, we probably wouldn't use FFI for that (or maybe...)

Collaborator

Yeah, there's no reason to use FFI for those calls or to manually write assembly, as then we can just have Rust functions as the fast functions which will be even better optimized than our hand-rolled trampoline ASM.

Member

^ thats what #[op] is for...AOT fast api calls codegen for deno's runtime.

Comment on lines +180 to +183
if !compiler.is_recv_arg_overridden() {
// the receiver object should never be expected. Avoid its unexpected or deliberate leak
compiler.zero_first_arg();
}
Contributor

Perhaps not necessary? The receiver has either been shifted out because it was the first integer/pointer argument, or it's inaccessible in the callback because there are only floating point arguments.

Collaborator

I believe this was a conscious decision by the implementer:

  1. Always have the same kind of code in the compile() function, even if some of it is not strictly necessary for that particular architecture. This way all architecture code looks similar.
  2. Explicitly zero out the receiver even if there are only floating point arguments, thus making sure that a C API cannot get its grubby hands on the pointer number even if it lies about its function signature.

Member Author

@arnauorriols arnauorriols Sep 6, 2022

I'm starting to second-guess my stubbornness about this 😆 My main argument is indeed the point 2 explained by @aapoalas above. I'm imagining scenarios where the trust between the JavaScript code/developer (the one declaring the function signature) and the native code/developer is not 100% established. It's a remote scenario without even a clear actionable vulnerability, but it is also a trivial, inexpensive instruction.

There doesn't even need to be maliciousness to get value out of this. Consider the developer experience in the case where, for example, the native API expects only floats in one version, and adds an integral parameter in a newer version. If the JS dev fails to update the FFI declaration, the failure might potentially be much harder to debug without the explicit zeroing.

Contributor

I suppose that makes sense but in the scenario you're describing you'd ideally poison all unused parameter-passing registers, not just the first one.


What could the foreign code even do with the JS pointer? Fill it with garbage bytes? Access OOB within the JS memory heap? Okay, that sounds dangerous, but FFI is really unsafe anyways, and can simply segfault, or do anything that JS/Deno can do already, unless I'm mistaken?

The foreign code wouldn't have access to the JS context or any of Deno's internals? (Technically, as they are already within the same addressing space, it's possible to access these anyways)

I'm just doubtful that it's much of a concern.

// functions returning 64 bit integers have the out array appended as their last parameter,
// and it is a *FastApiTypedArray<Int32>
match self.integer_params {
// rdi is always V8 receiver
Contributor

rdi has been zeroed by the time this function is called, right?

Member Author

@arnauorriols arnauorriols Sep 7, 2022

Yeah, it really isn't the most eloquent of the comments. I've elaborated it a bit, what do you think now?

@bnoordhuis
Contributor

Apropos the arm64 macos stack alignment issue: I would suggest to stick with V8's behavior for now. Leave some pointers in the code to make it easy to switch once V8 fixes the bug.

(I may give it a try. I don't have a M1 or M2 but it's straightforward enough in theory.)

@piscisaureus
Member

Apropos the arm64 macos stack alignment issue: I would suggest to stick with V8's behavior for now. Leave some pointers in the code to make it easy to switch once V8 fixes the bug.

I agree

@devsnek
Member

devsnek commented Sep 6, 2022

@bnoordhuis i've tried a few different things including modifying StackSlot to emit different alignments but that seems to be hitting an issue with the stack pointer not being padded to 8 byte alignment which leads me to think the slot allocator needs to be refactored a bit here... best of luck.

Member

@littledivy littledivy left a comment

LGTM again.

@littledivy littledivy merged commit 8bdc3c2 into denoland:main Sep 7, 2022
@kt3k kt3k mentioned this pull request Sep 9, 2022
4 tasks
Successfully merging this pull request may close these issues.

Can't use FFI on distros using SELinux
10 participants