feat(ext/ffi): Implement FFI fast-call trampoline with Dynasmrt #15305
Conversation
```rust
impl Trampoline {
  fn ptr(&self) -> *const c_void {
    &self.0[0] as *const u8 as *const c_void
```
The type update should be upstreamed?
Yes, I think so. Eventually :)
I mean, please do open an issue there referencing what's already open in /deno? I'll get around to it when I'm free (unless someone else does first).
Looks pretty damn awesome!
This is a very interesting PR - thank you!
We want to wait until we support Uint8Array in FastAPI before landing this one (which should be soon) because it will require a bit more complex C code. I'm sure we can figure out how to also do this in ASM.
Have you given any thought to ARM?
Absolutely. There's still work to do to get this PR ready anyway. The intent is to implement support for Apple Silicon as well as Windows x64, but I haven't really delved into it yet. First I want to see how the SysV AMD64 implementation consolidates after the feedback.
Absolutely amazing stuff!
ext/ffi/fast_call.rs
Outdated
```rust
match (arg_i, arg) {
  // Conventionally, many compilers expect 8 and 16 bit arguments to be sign/zero extended to 32 bits
  // See https://stackoverflow.com/a/36760539/2623340
  (1, Unsigned(B)) => dynasm!(self.assembler; movzx edi, sil),
```
Would it be possible to get rid of some of the repetition by getting the mov command from one match:

```rust
let instruction = match arg {
  Unsigned(B) | Unsigned(W) => "movzx",
  Signed(B) | Signed(W) => "movsx",
  _ => "mov",
};
```
and maybe the involved registers from a function as well? Then the match here could be something like:
```rust
// instruction from above
let (destination_register, source_register) = get_integer_registers(arg_i, arg);
match (arg_i, arg) => {
  (1 | 2 | 3 | 4 | 5, _) => dynasm!(self.assembler; instruction destination_register, source_register),
  (6, _) => dynasm!(self.assembler; instruction destination_register, source_register [rsp + rsp_offset]), // Here trying to piggyback on source_register to contain "BYTE " or "WORD " or ""
  (_, _) => dynasm!(...), // same kind of piggyback here
}
```
Not sure if the macro can handle this, though.
> Not sure if the macro can handle this, though.
Not like this; you need to define a wrapping macro. I'll give it a go anyway, though only after aarch64 & win: first I want to see how these implementations end up looking.
In any case, for something as fragile as this, I tend to think the plain boring design might be better. Maybe if we keep adding more logic it becomes unmanageable but I don't think we are there yet.
Alright, sounds good.
I've been thinking about this lately. I'm not sure about the net gain of working out a macro to abstract the match. I do, however, think that a couple of helper macros can improve the readability significantly by making the code more dense:
https://github.com/arnauorriols/deno/compare/refactor/ffi-trampoline-plain-asm...arnauorriols:deno:alternative-dense-dynasm?expand=1
What's your opinion?
I do think there's a lot of repetition and it makes reading the code for correctness a bit hard, what with having to jump between two to four axes of "arguments" (length of data, type of data (signed, unsigned, float), destination register, source register).
No two subsequent lines are generally the same, as the register names change based on the type and length of data even when it's the "same" register. An advanced macro might abstract out a lot of this, allowing lines to be read in a more "autopilot" way: "All 0th integers move from register X to Y, all unsigned integers use zero-extended moves, signed integers use sign-extended moves, ..."
However, such a magic macro would just move the complexity elsewhere. So, I'm ambivalent on the true need for such a macro.
The dense macros do definitely improve the outlook of the code a lot while not really hiding much anything, so I can definitely get behind that.
ext/ffi/fast_call.rs
Outdated
```rust
}

dynasm!(self.assembler
  ; sub rsp, stack_size as i32
```
The rustc compiler on Godbolt seems to use push and pop for stack allocation, presumably as long as there is no need to push parameters onto the stack. So this would be the case when the return value has to be cast from 8 or 16 bits to 32 bits.
Would it be worth adding that as a special case, or should we rather leave such stuff out as micro-optimisation?
The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments temporarily into xmmN (where N is the number of float args) to make stack arg shifting quicker. But that's the kind of optimisation that is probably entirely unnecessary for us :D
> The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments temporarily into xmmN (where N is the number of float args) to make stack arg shifting quicker. But that's the kind of optimisation that is probably entirely unnecessary for us :D
Invalid unless we assume SSE extension on the target platform. (Which is what Bun does, but Deno so far has not, as far as I'm aware)
> Godbolt rustc compiler seems to use push and pop for stack allocation, presumably as long as there is no need to push parameters onto the stack. So this would be the case when the return value has to be cast from 8 or 16 bits to 32 bits.
AFAIK Rustc (and many others, with the notable exception of GCC) optimizes by using push instead of the normative stack allocation in the prologue as long as no SSE instruction is/can be used. This is because stack instructions (PUSH, POP, CALL, RET) have been optimized since the addition of the stack engine in Pentium M and newer CPUs, taking fewer uops than `sub` + `mov`. Once SSE is used to move packed data, `sub` is needed anyway, so I guess the optimization is no longer significant. See https://godbolt.org/z/PeKaxzjK6
> The rustc compiler also does some funky things with using movups to (if I understand this correctly) move multiple sequential integer arguments [...] to make stack arg shifting quicker.
Correct. In a nutshell, it's moving 128 bits of data with a single uop. In the Godbolt linked above, trampoline2 moves parameters k and l together with `movups`.
> into xmmN (where N is the number of float args)
No, N is just part of the register name. When passing floats as parameters, N does have some semantics in the sense that the first float goes to XMM0, the second to XMM1, etc. But when moving data in/out of the register as a temporary operation the N does not have any significance. In the previously linked Godbolt, it uses xmm2 because it is the first available xmm register (xmm0 and xmm1 are used for parameters). Unless I misunderstood what you meant?
> But that's the kind of optimisation that is probably entirely unnecessary for us
Not sure if necessary or not in the long run, but I do think that this first iteration should focus on a simple and solid implementation, and worry about optimization in the future. It is not like TinyCC applies any of these optimizations anyway.
> Invalid unless we assume SSE extension on the target platform. (Which is what Bun does, but Deno so far has not, as far as I'm aware)
Keep in mind that this implementation is specific for AMD64, which includes SSE & SSE2 instructions.
> AFAIK Rustc (and many others, with the notable exception of GCC) optimize by using push instead of the normative stack allocation in the prologue as long as no SSE instruction is/can be used. This is because stack instructions (PUSH, POP, CALL, RET) are optimized since the addition of the stack engine in Pentium M and newer CPUs, taking fewer uops than `sub` + `mov`. Once SSE is used to move packed data, `sub` is needed anyway so I guess the optimization is no longer significant. See https://godbolt.org/z/PeKaxzjK6
Alright. Sounds good to me to leave a relatively insignificant complication out.
> No, N is just part of the register name. When passing floats as parameters, N does have some semantics in the sense that the first float goes to XMM0, the second to XMM1, etc. But when moving data in/out of the register as a temporary operation the N does not have any significance. In the previously linked Godbolt, it uses xmm2 because it is the first available xmm register (xmm0 and xmm1 are used for parameters). Unless I misunderstood what you meant?
Yeah I just meant that it's indeed using the first available xmm register: If there are 0 float args then xmm0 can be used, otherwise 1 or 2 or ... based on the number of float params.
> Not sure if necessary or not in the long run, but I do think that this first iteration should focus on a simple and solid implementation, and worry about optimization in the future. It is not like TinyCC applies any of these optimizations anyway.
Sounds good to me. Also, I think it's very unlikely we'll run into many C APIs with this many parameters.
test_ffi/tests/test.js
Outdated
```javascript
function add10U8Fast(a, b, c, d, e, f, g, h, i, j) { return symbols.add_10_u8(a, b, c, d, e, f, g, h, i, j); }

%PrepareFunctionForOptimization(add10U8Fast);
```
Consider making this into an assert helper function like this:

```javascript
function assertOptimizes(fn, callback) {
  %PrepareFunctionForOptimization(fn);
  console.log(callback());
  %OptimizeFunctionOnNextCall(fn);
  console.log(callback());
  assertFastCall(fn);
}

// SysV ABI comment
assertOptimizes(add10U8Fast, () => add10U8Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9));
```
Yeah, saw your suggestion in Discord, I'll definitely apply it.
What do you think about naming this function `testOptimized` instead of `assertOptimizes`? The reason is that it also tests the function itself, not only whether it has been optimized.
Awesome stuff! Only some minor nitpicks that I could find or understand :)
ext/ffi/fast_call.rs
Outdated
```rust
// (this applies to register parameters, as stack parameters are not padded in Apple)
(1, Signed(B)) => dynasm!(self.assembler; .arch aarch64; sxtb w0, w1),
(1, Unsigned(B)) => {
  dynasm!(self.assembler; .arch aarch64; and w0, w1, 0xFF)
```
Is `and w0, w1, 0xFF` better / faster somehow than `uxtb w0, w1`?
LLVM prefers the extension.
They both take a single instruction cycle, i.e. performance won't matter.
Yes, both have the same latency and throughput and use the same execution pipelines. I'm using `and` following Rustc's and Clang's example: https://godbolt.org/z/1jb4j8vsh
> LLVM prefers the extension.

What do you mean?
> LLVM prefers the extension.

> What do you mean?
It's irrelevant, but I was under the impression that Rustc didn't prefer the bitwise AND - but that's my experience from 32-bit Arm, not 64-bit Arm :)
I suppose one may need to understand how bitwise truncation works, vs. how the extension instructions work, so maybe the latter should be preferred for that reason? But I have no opinion here.
> I suppose one may need to understand how bitwise truncation works, vs. how the extension instructions work

According to the Arm documentation it's essentially the same, except that `and` does not have to deal with the rotation part.
ext/ffi/fast_call.rs
Outdated
// > Function arguments may consume slots on the stack that are not multiples of 8 bytes. | ||
// > If the total number of bytes for stack-based arguments is not a multiple of 8 bytes, | ||
// > insert padding on the stack to maintain the 8-byte alignment requirements. | ||
// TODO: V8 does not currently follow this Apple's policy, and instead aligns all arguments to 8 Byte boundaries. |
Seems like this might be pretty quick on the chopping block for V8 to fix, so I vote we allow it to be broken and wait for V8 to fix it.
That's my personal opinion as well
@arnauorriols please rebase. @piscisaureus please review
LGTM, with my limited knowledge of assembly anyway.
ext/ffi/fast_call.rs
Outdated
```rust
let mut int_params = 0u32;
let mut sse_params = 0u32;
for param in params {
  match param {
    NativeType::F32 | NativeType::F64 => sse_params += 1,
    _ => int_params += 1,
  }
}
```
Nit: this seems like it could also be done by summing one register type, and subtracting from the length?
Maybe
```diff
-let mut int_params = 0u32;
-let mut sse_params = 0u32;
-for param in params {
-  match param {
-    NativeType::F32 | NativeType::F64 => sse_params += 1,
-    _ => int_params += 1,
-  }
-}
+let sse_params =
+  params.iter().filter(|param| matches!(param, NativeType::F32 | NativeType::F64)).count();
+let int_params = params.len() - sse_params;
```
I'm just going off of a "there's a lot of mutability here, so maybe some functional code could help" thinking; it's your choice.
I get what you mean, but in this case in particular, I think the mutable approach is more expressive.
```rust
let mut fast_call_alloc = None;

let func = if fast_call::is_compatible(sym) {
  let trampoline = fast_call::compile_trampoline(sym);
```
Huh, I would have liked to have suggested making `compile_trampoline` return an `Option`, but I guess that you can use `compile_trampoline` without `is_compatible`, right?
> I guess that you can use compile_trampoline without is_compatible, right?

Not really. You can always use the `<Compiler>::compile` method if you want to compile without `is_compatible`. `compile_trampoline`'s role is essentially to dispatch based on platform, and it is definitely coupled to `is_compatible`.
I'm not particularly happy either with the current duplication in `fast_call::is_compatible()` and `fast_call::compile_trampoline()`. However, from an API point of view I think the `if` approach is more explicit (in a good way). Here's the alternative with the Option:
fast_api.rs
```rust
pub(crate) fn compile_trampoline(sym: &Symbol) -> Option<Trampoline> {
  if sym.can_callback {
    // Callbacks are not allowed in Fast API FFI
    None
  } else if cfg!(all(target_arch = "x86_64", target_family = "unix")) {
    Some(SysVAmd64::compile(sym))
  } else if cfg!(all(target_arch = "x86_64", target_family = "windows")) {
    Some(Win64::compile(sym))
  } else if cfg!(all(target_arch = "aarch64", target_vendor = "apple")) {
    Some(Aarch64Apple::compile(sym))
  } else {
    None
  }
}
```
lib.rs
```rust
let func = match fast_call::compile_trampoline(sym) {
  Some(trampoline) => {
    let func = builder.build_fast(
      scope,
      &fast_call::make_template(sym, &trampoline),
      None,
    );
    fast_call_alloc = Some(Box::into_raw(Box::new(trampoline)));
    func
  }
  None => builder.build(scope),
};
```
I'm good either way tbh, you or anybody else make the call.
Awesome work and a Herculean effort! LGTM, what with my limited knowledge.
ext/ffi/fast_call.rs
Outdated
```rust
NativeType::F64 => self.move_float(Double),
NativeType::U8 => self.move_integer(U(B)),
NativeType::U16 => self.move_integer(U(W)),
NativeType::U32 | NativeType::Void => self.move_integer(U(DW)),
```
`NativeType::Void` here seems unnecessary and could be an `unreachable!()` instead, I think?
@arnauorriols I've looked through `fast_api.rs`, awesome work, LGTM! Thanks @aapoalas @phosra for making the review easier.
Just to summarise for other readers, this PR notably:
- Enables fast api FFI optimizations on Windows x64
- Enables FFI on environments like SELinux
- Removes dependency on tinycc
test_ffi/tests/integration_tests.rs
Outdated
```rust
0\n\
0\n\
78\n\
45\n\
```
Output in the test is 78 on this row as well. Is this correct?
This is the consequence of lines 483 to 488 in 73f05c7:

```javascript
// TODO: this test currently fails in aarch64-apple-darwin (also in branch main!). The reason is v8 does not follow Apple's custom
// ABI properly (aligns arguments to 8 byte boundaries instead of the natural alignment of the parameter type).
// A decision needs to be taken:
// 1. leave it broken and wait for v8 to fix the bug
// 2. Adapt to v8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well
testOptimized(addManyU16Fast, () => addManyU16Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12));
```
The assertion should expect 78. The problem is that on aarch64-apple-darwin the test outputs 45 due to the aforementioned V8 bug.
The bug has already been reported in https://bugs.chromium.org/p/v8/issues/detail?id=13171.
Considering that this test will unequivocally detect when a new version of V8 fixes the bug, I am now slightly leaning towards option 2. However, CI does not run tests on aarch64-apple-darwin...
Opinions?
Hmm... If the adaptation is easy to do and easy to remove, then I guess option 2 would not be bad ... Two wrongs do make a right sometimes.
Still, I wouldn't really mind if choice 1 was made either. That would put pressure on V8 to fix their stuff, as I'm sure a lot of people would come complaining to Deno repo about FFI being broken on their Macs and we could just guide them to the V8 issue to voice their displeasure.
> ... and we could just guide them to the V8 issue to voice their displeasure.
And yet, maybe that wouldn't be the most respectful way to treat their bug tracking forums? :)
How about we adapt, so code works today, and have a "TODO" comment with the code to be exchanged for after the V8 issue is fixed?
OK, let's go with 1 and do 2 in a follow-up? I don't want to block this PR any further; the goal right now is to move away from tinycc.
> lot of people would come complaining to Deno repo about FFI being broken on their Macs and we could just guide them to the V8 issue to voice their displeasure.

This is true, and by default I would choose this approach. But I fear the solution on the V8 side might not be as straightforward as anticipated (according to @devsnek), and I intuitively think option 2 is actually easier than option 1 (I would argue against removing the test, and we cannot leave a broken test for the folks with M1s, so we are left with a conditional "expected failure", somehow). Also, it always kinda feels a bit like using the users (and their Deno DX) as pawns. After all, for many the story usually ends at "Deno FFI is broken on my Mac".
Implementing the V8 calling convention is trivial:

```diff
- let size_original = param.size();
+ // let size_original = param.size();
+ let size_original = 8;
```
I just need to clarify if V8 expects 8/16 bit stack parameters to be extended, considering they are 8-byte aligned. I won't have time till tonight to get to it, so if anybody else wants to take the lead, be my guest!
The more laborious part of adapting to the V8 calling convention would have been in `Aarch64Apple::allocate_stack()`, but funny story, I had it wrong to begin with 🤦 (the kind of wrong that matches V8's expectations)
Pushed changes for option 2: adapt to the V8 calling convention.
Before merging this PR, a decision must be taken (and documented) on the open question described in the PR summary:
Discussion in #15305 (comment).
My OCD is telling me to give this new JIT assembler a name. Any suggestions?
Lightly reviewed but very nice work!
test_ffi/tests/test.js
Outdated
```javascript
// TODO: this test currently fails in aarch64-apple-darwin (also in branch main!). The reason is v8 does not follow Apple's custom
// ABI properly (aligns arguments to 8 byte boundaries instead of the natural alignment of the parameter type).
// A decision needs to be taken:
// 1. leave it broken and wait for v8 to fix the bug
// 2. Adapt to v8 bug and follow its ABI instead of Apple's. When V8 fixes the implementation, we'll have to fix it here as well
testOptimized(addManyU16Fast, () => addManyU16Fast(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12));
```
Just to make sure I understand this right: the comment is saying that V8 should align naturally but instead aligns on 8 byte boundaries?
That's indeed a V8 bug: https://github.com/v8/v8/blob/0472d5a5aa95ff96438f875bb2f0d1b9a37e046e/src/compiler/wasm-compiler.cc#L7350-L7359
I've opened https://bugs.chromium.org/p/v8/issues/detail?id=13264
Yes, that is correct. An issue already exists: https://bugs.chromium.org/p/v8/issues/detail?id=13171
```rust
NativeType::U64 => fast_api::Type::Uint64,
NativeType::ISize => fast_api::Type::Int64,
NativeType::USize | NativeType::Pointer | NativeType::Function => {
  fast_api::Type::Uint64
```
I want to double-check: this is (or would be if they existed) incorrect for 32 bits builds?
Yes
ext/ffi/fast_call.rs
Outdated
```rust
for param in &sym.parameter_types {
  compiler.move_left(param)
}
```
You don't have to shift if you leave the receiver in the function prototype (zeroed out or not). In the simple case that means you can elide the trampoline completely.
I'm thinking of `fn square(_: *const (), x: u32) -> u32 { x * x }`; seems like a reasonable trade-off to me.
Since this is FFI work, calling native library C APIs we can't just decide to leave the receiver in the function prototype. Sqlite's C API won't change for us even if we decide that it will.
Okay, fair enough. I'm thinking of fast API calls into deno's runtime, where we control the function prototype. On the other hand, we probably wouldn't use FFI for that (or maybe...)
Yeah, there's no reason to use FFI for those calls or to manually write assembly, as then we can just have Rust functions as the fast functions which will be even better optimized than our hand-rolled trampoline ASM.
^ That's what `#[op]` is for... AOT fast API call codegen for Deno's runtime.
```rust
if !compiler.is_recv_arg_overridden() {
  // the receiver object should never be expected. Avoid its unexpected or deliberate leak
  compiler.zero_first_arg();
}
```
Perhaps not necessary? The receiver has either been shifted out because it was the first integer/pointer argument, or it's inaccessible in the callback because there are only floating point arguments.
I believe this was a conscious decision by the implementer:
- Always have the same kind of code in the `compile()` function, even if some of it is not strictly necessary for that particular architecture. This way all architecture code looks similar.
- Explicitly zero out the receiver even if there are only floating point arguments, thus making sure that a C API cannot get its grubby hands on the pointer number even if it lies about its function signature.
I'm starting to second-guess my stubbornness about this 😆 My main argument is indeed point 2 explained by @aapoalas above. I'm imagining scenarios where the trust between the JavaScript code/developer (the one declaring the function signature) and the native code/developer is not 100% established. It's a remote scenario without even a clear actionable vulnerability, but it is also a trivial, inexpensive instruction.
There doesn't even need to be maliciousness to get value out of this. Consider the developer experience in the case where, for example, the native API expects only floats in one version, and adds an integral parameter on a newer version. If the js dev fails to update the FFI declaration the failure might potentially be much harder to debug without the explicit zeroing.
I suppose that makes sense but in the scenario you're describing you'd ideally poison all unused parameter-passing registers, not just the first one.
What could the foreign code even do with the JS pointer? Fill it with garbage bytes? Access OOB within the JS memory heap? Okay, that sounds dangerous, but FFI is really unsafe anyways, and can simply segfault, or do anything that JS/Deno can do already, unless I'm mistaken?
The foreign code wouldn't have access to the JS context or any of Deno's internals? (Technically, as they are already within the same addressing space, it's possible to access these anyways)
I'm just doubtful that it's much of a concern.
ext/ffi/fast_call.rs
Outdated
```rust
// functions returning 64 bit integers have the out array appended as their last parameter,
// and it is a *FastApiTypedArray<Int32>
match self.integer_params {
  // rdi is always V8 receiver
```
rdi has been zeroed by the time this function is called, right?
Yeah, it really isn't the most eloquent of the comments. I've elaborated it a bit, what do you think now?
Apropos the arm64 macOS stack alignment issue: I would suggest to stick with V8's behavior for now. Leave some pointers in the code to make it easy to switch once V8 fixes the bug. (I may give it a try. I don't have an M1 or M2 but it's straightforward enough in theory.)
I agree
@bnoordhuis I've tried a few different things including modifying
LGTM again.
Fixes #15306
In order to use the fast API of V8 for the FFI functions, Deno needs to wrap them in a trampoline function that performs the following transformations (at the time of writing):
Currently this trampoline is JIT compiled using tcc. However, using tcc is problematic, among other reasons because of the increased build complexity and complicated platform support. This PR implements Deno's own JIT using dynasmrt.
Status
FastApiTypedArray<Uint8>
FastApiTypedArray<Int32>
Open Questions
A decision needs to be taken:
Benchmarks
Linux (SysV AMD64)
Apple M1 (Aarch64 Apple)
Note: these benchmarks are run on a M1 dedicated host in AWS.
Windows 11 (Windows64 fastcall)
Comparison with TCC (rev 5beec3f)
Performance is marginally better than using TCC, particularly in those cases where the trampoline tail-calls to the FFI function.
![image](https://user-images.githubusercontent.com/4871949/185955206-f49e8bdd-716c-42b2-9562-7984138117f4.png)