feat(ext/ffi): Implement FFI fast-call trampoline with Dynasmrt #15305
Conversation
```rust
impl Trampoline {
  fn ptr(&self) -> *const c_void {
    &self.0[0] as *const u8 as *const c_void
```
The type update should be upstreamed?
Yes, I think so. Eventually :)
I mean, please do open an issue there referencing what's already open in /deno? I'll get around to it when I'm free (unless someone else does first).
Looks pretty damn awesome!
This is a very interesting PR - thank you!
We want to wait until we support Uint8Array in FastAPI before landing this one (which should be soon) because it will require a bit more complex C code. I'm sure we can figure out how to also do this in ASM.
Have you given any thought to ARM?
Absolutely. There's still work to do to get this PR ready anyway. The intent is to implement support for Apple Silicon as well as Windows x64, but I haven't really delved into it yet. First I want to see how the SysV AMD64 implementation consolidates after the feedback.
Absolutely amazing stuff!
ext/ffi/fast_call.rs
Outdated
```rust
match (arg_i, arg) {
  // Conventionally, many compilers expect 8 and 16 bit arguments to be sign/zero extended to 32 bits
  // See https://stackoverflow.com/a/36760539/2623340
  (1, Unsigned(B)) => dynasm!(self.assembler; movzx edi, sil),
```
Would it be possible to get rid of some of the repetition by getting the mov command from one match:

```rust
let instruction = match arg {
  Unsigned(B) | Unsigned(W) => "movzx",
  Signed(B) | Signed(W) => "movsx",
  _ => "mov",
};
```
and maybe the involved registers from a function as well? Then the match here could be something like:
```rust
// instruction from above
let (destination_register, source_register) = get_integer_registers(arg_i, arg);
match (arg_i, arg) => {
  (1 | 2 | 3 | 4 | 5, _) => dynasm!(self.assembler; instruction destination_register, source_register),
  (6, _) => dynasm!(self.assembler; instruction destination_register, source_register [rsp + rsp_offset]), // Here trying to piggyback on source_register to contain "BYTE " or "WORD " or ""
  (_, _) => dynasm!(...), // same kind of piggyback here
}
```
Not sure if the macro can handle this, though.
> Not sure if the macro can handle this, though.
Not like this; you need to define a wrapping macro. I'll give it a go anyway, though only after aarch64 & win: first I want to see how these implementations end up looking.
In any case, for something as fragile as this, I tend to think the plain boring design might be better. Maybe if we keep adding more logic it becomes unmanageable but I don't think we are there yet.
Alright, sounds good.
I've been thinking about this lately. I'm not sure about the net gain of working out a macro to abstract the match. I do, however, think that a couple of helper macros can improve the readability significantly by making the code more dense:
https://github.com/arnauorriols/deno/compare/refactor/ffi-trampoline-plain-asm...arnauorriols:deno:alternative-dense-dynasm?expand=1
What's your opinion?
I do think there's a lot of repetition and it makes reading the code for correctness a bit hard, what with having to jump between two to four axes of "arguments" (length of data, type of data (signed, unsigned, float), destination register, source register).
No two subsequent lines are generally the same, as the register names change based on the type and length of data even when it's the "same" register. An advanced macro might abstract out a lot of this, allowing lines to be read in a more "autopilot" way: "All 0th integers move from register X to Y, all unsigned integers use zero-extended moves, signed integers use sign-extended moves, ..."
However, such a magic macro would just move the complexity elsewhere. So, I'm ambivalent on the true need for such a macro.
The dense macros do definitely improve the outlook of the code a lot while not really hiding much anything, so I can definitely get behind that.
ext/ffi/fast_call.rs
Outdated
```rust
}

dynasm!(self.assembler
  ; sub rsp, stack_size as i32
```
The rustc compiler on Godbolt seems to use push and pop for stack allocation, presumably as long as there is no need to push parameters onto the stack. So this would be the case when the return value has to be cast from 8 or 16 bits to 32 bits.
Would it be worth adding that as a special case, or should we rather leave such stuff out as micro-optimisation?
The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments temporarily into xmmN (where N is the number of float args) to make stack arg shifting quicker. But that's the kind of optimisation that is probably entirely unnecessary for us :D
> The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments temporarily into xmmN (where N is the number of float args) to make stack arg shifting quicker. But that's the kind of optimisation that is probably entirely unnecessary for us :D
Invalid unless we assume SSE extension on the target platform. (Which is what Bun does, but Deno so far has not, as far as I'm aware)
> Godbolt rustc compiler seems to use push and pop for stack allocation, presumably as long as there is no need to push parameters onto the stack. So this would be the case when the return value has to be cast from 8 or 16 bits to 32 bits.
AFAIK Rustc (and many others, with the notable exception of GCC) optimizes by using push instead of the normative stack allocation in the prologue as long as no SSE instruction is/can be used. This is because stack instructions (PUSH, POP, CALL, RET) have been optimized since the addition of the stack engine in Pentium M and newer CPUs, taking fewer uops than `sub` + `mov`. Once SSE is used to move packed data, `sub` is needed anyway, so I guess the optimization is no longer significant. See https://godbolt.org/z/PeKaxzjK6
> The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments [...] to make stack arg shifting quicker.
Correct. In a nutshell, it's moving 128 bits of data with a single uop. In the Godbolt linked above, trampoline2 moves parameters k and l together with `movups`.
> into xmmN (where N is the number of float args)
No, N is just part of the register name. When passing floats as parameters, N does have some semantics in the sense that the first float goes to XMM0, the second to XMM1, etc. But when moving data in/out of the register as a temporary operation the N does not have any significance. In the previously linked Godbolt, it uses xmm2 because it is the first available xmm register (xmm0 and xmm1 are used for parameters). Unless I misunderstood what you meant?
> But that's the kind of optimisation that is probably entirely unnecessary for us
Not sure if necessary or not in the long run, but I do think that this first iteration should focus on a simple and solid implementation, and worry about optimization in the future. It is not like TinyCC applies any of these optimizations anyway.
> Invalid unless we assume SSE extension on the target platform. (Which is what Bun does, but Deno so far has not, as far as I'm aware)
Keep in mind that this implementation is specific for AMD64, which includes SSE & SSE2 instructions.
> AFAIK Rustc (and many others, with the notable exception of GCC) optimize by using push instead of the normative stack allocation in the prologue as long as no SSE instruction is/can be used. This is because stack instructions (PUSH, POP, CALL, RET) are optimized since the addition of the stack engine in Pentium M and newer CPUs, taking fewer uops than `sub` + `mov`. Once SSE is used to move packed data, `sub` is needed anyway so I guess the optimization is no longer significant. See https://godbolt.org/z/PeKaxzjK6
Alright. Sounds good to me to leave a relatively insignificant complication out.
> No, N is just part of the register name. When passing floats as parameters, N does have some semantics in the sense that the first float goes to XMM0, the second to XMM1, etc. But when moving data in/out of the register as a temporary operation the N does not have any significance. In the previously linked Godbolt, it uses xmm2 because it is the first available xmm register (xmm0 and xmm1 are used for parameters). Unless I misunderstood what you meant?
Yeah I just meant that it's indeed using the first available xmm register: If there are 0 float args then xmm0 can be used, otherwise 1 or 2 or ... based on the number of float params.
> Not sure if necessary or not in the long run, but I do think that this first iteration should focus on a simple and solid implementation, and worry about optimization in the future. It is not like TinyCC applies any of these optimizations anyway.
Sounds good to me. Also, I think it's very unlikely we'll run into many C APIs with this many parameters.
test_ffi/tests/test.js
Outdated
```javascript
function add10U8Fast(a, b, c, d, e, f, g, h, i, j) { return symbols.add_10_u8(a, b, c, d, e, f, g, h, i, j); }

%PrepareFunctionForOptimization(add10U8Fast);
```
Consider making this into an assert helper function like this:

```javascript
function assertOptimizes(fn, callback) {
  %PrepareFunctionForOptimization(fn);
  console.log(callback());
  %OptimizeFunctionOnNextCall(fn);
  console.log(callback());
  assertFastCall(fn);
}

// SysV ABI comment
assertOptimizes(add10U8Fast, () => add10U8Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9));
```
Yeah, saw your suggestion in Discord, I'll definitely apply it.
What do you think about naming this function `testOptimized` instead of `assertOptimizes`? The reason is that it also tests the function itself, not only whether it has been optimized.
Awesome stuff! Only some minor nitpicks that I could find or understand :)
ext/ffi/fast_call.rs
Outdated
```rust
// (this applies to register parameters, as stack parameters are not padded in Apple)
(1, Signed(B)) => dynasm!(self.assembler; .arch aarch64; sxtb w0, w1),
(1, Unsigned(B)) => {
  dynasm!(self.assembler; .arch aarch64; and w0, w1, 0xFF)
```
Is `and w0, w1, 0xFF` better / faster somehow than `uxtb w0, w1`?
LLVM prefers the extension.
They both take a single instruction cycle, i.e. performance won't matter.
Yes, both have the same latency and throughput and use the same execution pipelines. I'm using `and` following Rustc's and Clang's example: https://godbolt.org/z/1jb4j8vsh
> LLVM prefers the extension.

What do you mean?
> LLVM prefers the extension.

> What do you mean?
It's irrelevant, but I was under the impression that Rustc didn't prefer the bitwise AND - but that's my experience from 32-bit Arm, not 64-bit Arm :)
I suppose one may need to understand how bitwise truncation works, vs. how the extension instructions work, so maybe the latter should be preferred for that reason? But I have no opinion here.
> I suppose one may need to understand how bitwise truncation works, vs. how the extension instructions work

According to the Arm documentation it's essentially the same, except that `and` does not have to deal with the rotation part.
ext/ffi/fast_call.rs
Outdated
// > Function arguments may consume slots on the stack that are not multiples of 8 bytes. | ||
// > If the total number of bytes for stack-based arguments is not a multiple of 8 bytes, | ||
// > insert padding on the stack to maintain the 8-byte alignment requirements. | ||
// TODO: V8 does not currently follow this Apple's policy, and instead aligns all arguments to 8 Byte boundaries. |
Seems like this might be pretty quick on the chopping block for V8 to fix, so I vote we allow it to be broken and wait for V8 to fix it.
That's my personal opinion as well
@arnauorriols please rebase. @piscisaureus please review
LGTM, with my limited knowledge of assembly anyway.
ext/ffi/fast_call.rs
Outdated
```rust
let mut int_params = 0u32;
let mut sse_params = 0u32;
for param in params {
  match param {
    NativeType::F32 | NativeType::F64 => sse_params += 1,
    _ => int_params += 1,
  }
}
```
Nit: this seems like it could also be done by summing one register type, and subtracting from the length?
Maybe
```diff
-let mut int_params = 0u32;
-let mut sse_params = 0u32;
-for param in params {
-  match param {
-    NativeType::F32 | NativeType::F64 => sse_params += 1,
-    _ => int_params += 1,
-  }
-}
+let sse_params =
+  params.iter().filter(|param| matches!(param, NativeType::F32 | NativeType::F64)).count();
+let int_params = params.len() - sse_params;
```
I'm just going off of a "there's a lot of mutability here, so maybe some functional code could help" thinking; it's your choice.
I get what you mean, but in this case in particular, I think the mutable approach is more expressive.
```rust
let mut fast_call_alloc = None;

let func = if fast_call::is_compatible(sym) {
  let trampoline = fast_call::compile_trampoline(sym);
```
Huh, I would have liked to have suggested making `compile_trampoline` return an `Option`, but I guess that you can use `compile_trampoline` without `is_compatible`, right?
> I guess that you can use compile_trampoline without is_compatible, right?

Not really. You can always use the `<Compiler>::compile` method if you want to compile without `is_compatible`. `compile_trampoline`'s role is essentially to dispatch based on platform, and it is definitely coupled to `is_compatible`.
I'm not particularly happy either with the current duplication in `fast_call::is_compatible()` and `fast_call::compile_trampoline()`. However, from an API point of view I think the `if` approach is more explicit (in a good way). Here's the alternative with the Option:
fast_api.rs
```rust
pub(crate) fn compile_trampoline(sym: &Symbol) -> Option<Trampoline> {
  if sym.can_callback {
    // Callbacks are not allowed in Fast API FFI
    None
  } else if cfg!(all(target_arch = "x86_64", target_family = "unix")) {
    Some(SysVAmd64::compile(sym))
  } else if cfg!(all(target_arch = "x86_64", target_family = "windows")) {
    Some(Win64::compile(sym))
  } else if cfg!(all(target_arch = "aarch64", target_vendor = "apple")) {
    Some(Aarch64Apple::compile(sym))
  } else {
    None
  }
}
```
lib.rs
```rust
let func = match fast_call::compile_trampoline(sym) {
  Some(trampoline) => {
    let func = builder.build_fast(
      scope,
      &fast_call::make_template(sym, &trampoline),
      None,
    );
    fast_call_alloc = Some(Box::into_raw(Box::new(trampoline)));
    func
  }
  None => builder.build(scope),
};
```
I'm good either way tbh, you or anybody else make the call.
Awesome work and a Herculean effort! LGTM, what with my limited knowledge.
ext/ffi/fast_call.rs
Outdated
```rust
NativeType::F64 => self.move_float(Double),
NativeType::U8 => self.move_integer(U(B)),
NativeType::U16 => self.move_integer(U(W)),
NativeType::U32 | NativeType::Void => self.move_integer(U(DW)),
```
`NativeType::Void` here seems unnecessary and could be an `unreachable!()` instead, I think?
@arnauorriols I've looked through `fast_api.rs`, awesome work, LGTM! Thanks @aapoalas @phosra for making the review easier.
Just to summarise for other readers, this PR notably:
- Enables fast api FFI optimizations on Windows x64
- Enables FFI on environments like SELinux
- Removes dependency on tinycc
test_ffi/tests/integration_tests.rs
Outdated
```rust
0\n\
0\n\
78\n\
45\n\
```
Output in the test is 78 on this row as well. Is this correct?
This is the consequence of lines 483 to 488 in 73f05c7:

```javascript
// TODO: this test currently fails in aarch64-apple-darwin (also in branch main!). The reason is v8 does not follow Apple's custom
// ABI properly (aligns arguments to 8 byte boundaries instead of the natural alignment of the parameter type).
// A decision needs to be taken:
// 1. leave it broken and wait for v8 to fix the bug
// 2. Adapt to v8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well
testOptimized(addManyU16Fast, () => addManyU16Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12));
```
The assertion should expect 78. The problem is that on aarch64-apple-darwin the test outputs 45 due to the aforementioned V8 bug.
The bug has already been reported in https://bugs.chromium.org/p/v8/issues/detail?id=13171.
Considering that this test will unequivocally detect when a new version of V8 fixes the bug, I am now slightly leaning towards option 2. However, CI does not run tests on aarch64-apple-darwin...
Opinions?
Hmm... If the adaptation is easy to do and easy to remove, then I guess option 2 would not be bad ... Two wrongs do make a right sometimes.
Still, I wouldn't really mind if choice 1 was made either. That would put pressure on V8 to fix their stuff, as I'm sure a lot of people would come complaining to Deno repo about FFI being broken on their Macs and we could just guide them to the V8 issue to voice their displeasure.
> ... and we could just guide them to the V8 issue to voice their displeasure.
And yet, maybe that wouldn't be the most respectful way to treat their bug tracking forums? :)
How about we adapt, so code works today, and have a "TODO" comment with the code to be exchanged for after the V8 issue is fixed?
OK, let's go with 1 and do 2 in a follow-up? I don't want to block this PR any further; the goal right now is to move away from tinycc.
> lot of people would come complaining to Deno repo about FFI being broken on their Macs and we could just guide them to the V8 issue to voice their displeasure.

This is true, and by default I would choose this approach. But I fear the solution on the V8 side might not be as straightforward as anticipated (according to @devsnek), and I intuitively think option 2 is actually easier than option 1 (I would argue against removing the test, and we cannot leave a broken test for the folks with M1s, so we are left with a conditional "expected failure", somehow). Also, it always kinda feels a bit like using the users (and their Deno DX) as pawns. After all, for many the story usually ends at "Deno FFI is broken on my Mac".
Implementing the V8 calling convention is trivial:

```diff
- let size_original = param.size();
+ // let size_original = param.size();
+ let size_original = 8;
```
I just need to clarify if V8 expects 8/16 bit stack parameters to be extended, considering they are 8-byte aligned. I won't have time till tonight to get to it, so if anybody else wants to take the lead, be my guest!
The more laborious part of adapting to the V8 calling convention would have been in `Aarch64Apple::allocate_stack()`, but funny story, I had it wrong to begin with 🤦 (the kind of wrong that matches V8's expectations)
Pushed changes for option 2: adapt to the V8 calling convention.
Before merging this PR, a decision must be taken (and documented) on the open question described in the PR summary:
Discussion in #15305 (comment).
My OCD is telling me to give this new JIT assembler a name. Any suggestions?
Lightly reviewed but very nice work!
test_ffi/tests/test.js
Outdated
```javascript
// TODO: this test currently fails in aarch64-apple-darwin (also in branch main!). The reason is v8 does not follow Apple's custom
// ABI properly (aligns arguments to 8 byte boundaries instead of the natural alignment of the parameter type).
// A decision needs to be taken:
// 1. leave it broken and wait for v8 to fix the bug
// 2. Adapt to v8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well
testOptimized(addManyU16Fast, () => addManyU16Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12));
```
Just to make sure I understand this right: the comment is saying that V8 should align naturally but instead aligns on 8 byte boundaries?
That's indeed a V8 bug: https://github.com/v8/v8/blob/0472d5a5aa95ff96438f875bb2f0d1b9a37e046e/src/compiler/wasm-compiler.cc#L7350-L7359
I've opened https://bugs.chromium.org/p/v8/issues/detail?id=13264
Yes, that is correct. An issue already exists: https://bugs.chromium.org/p/v8/issues/detail?id=13171
```rust
NativeType::U64 => fast_api::Type::Uint64,
NativeType::ISize => fast_api::Type::Int64,
NativeType::USize | NativeType::Pointer | NativeType::Function => {
  fast_api::Type::Uint64
```
I want to double-check: this is (or would be if they existed) incorrect for 32 bits builds?
Yes
ext/ffi/fast_call.rs
Outdated
```rust
for param in &sym.parameter_types {
  compiler.move_left(param)
}
```
You don't have to shift if you leave the receiver in the function prototype (zeroed out or not). In the simple case that means you can elide the trampoline completely.
I'm thinking of `fn square(_: *const (), x: u32) -> u32 { x * x }`; seems like a reasonable trade-off to me.
Since this is FFI work, calling native library C APIs we can't just decide to leave the receiver in the function prototype. Sqlite's C API won't change for us even if we decide that it will.
Okay, fair enough. I'm thinking of fast API calls into deno's runtime, where we control the function prototype. On the other hand, we probably wouldn't use FFI for that (or maybe...)
Yeah, there's no reason to use FFI for those calls or to manually write assembly, as then we can just have Rust functions as the fast functions which will be even better optimized than our hand-rolled trampoline ASM.
^ That's what `#[op]` is for... AOT fast API call codegen for Deno's runtime.
```rust
if !compiler.is_recv_arg_overridden() {
  // the receiver object should never be expected. Avoid its unexpected or deliberate leak
  compiler.zero_first_arg();
}
```
Perhaps not necessary? The receiver has either been shifted out because it was the first integer/pointer argument, or it's inaccessible in the callback because there are only floating point arguments.
I believe this was a conscious decision by the implementer:
- Always have the same kind of code in the `compile()` function, even if some of it is not strictly necessary for that particular architecture. This way all architecture code looks similar.
- Explicitly zero out the receiver even if there are only floating point arguments, thus making sure that a C API cannot get its grubby hands on the pointer number even if it lies about its function signature.
I'm starting to second-guess my stubbornness about this 😆 My main argument is indeed point 2 explained by @aapoalas above. I'm imagining scenarios where the trust between the JavaScript code/developer (the one declaring the function signature) and the native code/developer is not 100% established. It's a remote scenario without even a clear actionable vulnerability, but it is also a trivial, inexpensive instruction.
There doesn't even need to be maliciousness to get value out of this. Consider the developer experience in the case where, for example, the native API expects only floats in one version, and adds an integral parameter on a newer version. If the js dev fails to update the FFI declaration the failure might potentially be much harder to debug without the explicit zeroing.
I suppose that makes sense but in the scenario you're describing you'd ideally poison all unused parameter-passing registers, not just the first one.
What could the foreign code even do with the JS pointer? Fill it with garbage bytes? Access OOB within the JS memory heap? Okay, that sounds dangerous, but FFI is really unsafe anyways, and can simply segfault, or do anything that JS/Deno can do already, unless I'm mistaken?
The foreign code wouldn't have access to the JS context or any of Deno's internals? (Technically, as they are already within the same addressing space, it's possible to access these anyways)
I'm just doubtful that it's much of a concern.
ext/ffi/fast_call.rs
Outdated
```rust
// functions returning 64 bit integers have the out array appended as their last parameter,
// and it is a *FastApiTypedArray<Int32>
match self.integer_params {
  // rdi is always V8 receiver
```
rdi has been zeroed by the time this function is called, right?
Yeah, it really isn't the most eloquent of the comments. I've elaborated it a bit, what do you think now?
Apropos the arm64 macOS stack alignment issue: I would suggest to stick with V8's behavior for now. Leave some pointers in the code to make it easy to switch once V8 fixes the bug. (I may give it a try. I don't have an M1 or M2 but it's straightforward enough in theory.)
I agree
@bnoordhuis I've tried a few different things including modifying
LGTM again.
Fixes #15306
In order to use the fast API of V8 for the FFI functions, Deno needs to wrap them in a trampoline function that performs the following transformations (at the time of writing):
Currently this trampoline is JIT compiled using tcc. However, using tcc is problematic, among other reasons because of the increased build complexity and complicated platform support. This PR implements Deno's own JIT using dynasmrt.
Status
FastApiTypedArray<Uint8>
FastApiTypedArray<Int32>
Open Questions
A decision needs to be taken:
Benchmarks
Linux (SysV AMD64)
Apple M1 (Aarch64 Apple)
Note: these benchmarks are run on a M1 dedicated host in AWS.
Windows 11 (Windows64 fastcall)
Comparison with TCC (rev 5beec3f)
Performance is marginally better than using TCC, particularly in those cases where the trampoline tail-calls to the FFI function.
![image](https://user-images.githubusercontent.com/4871949/185955206-f49e8bdd-716c-42b2-9562-7984138117f4.png)