-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(ext/ffi): use fast api calls for 64bit return types #15313
Conversation
ext/ffi/prelude.h
Outdated
// https://source.chromium.org/chromium/chromium/src/+/main:v8/include/v8-fast-api-calls.h;l=336 | ||
struct FastApiTypedArray { | ||
uintptr_t length_; | ||
uint32_t* data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I guess it's not an option.
ext/ffi/00_ffi.js
Outdated
if (needsUnpacking && !isNonBlocking) { | ||
const call = this.symbols[symbol]; | ||
const parameters = symbols[symbol].parameters; | ||
const v = new Int32Array(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be a bad idea to have this be a general buffer for the entire dlopen, so that all calls use the retval memory? Definitely a minor optimisation memory optimisation, though.
ext/ffi/jit_trampoline.rs
Outdated
@@ -82,16 +95,31 @@ pub(crate) fn codegen(sym: &crate::Symbol) -> String { | |||
c += native_arg_to_c(ty); | |||
let _ = write!(c, " p{i}"); | |||
} | |||
if needs_unwrap { | |||
let _ = write!(c, ", struct FastApiTypedArray *p_ret"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
SysV ABI at least passes return value pointers as the 0th parameter. We can't quite get that far due to the receiver object (unless we piggy-back on that somehow) but I'd still maybe put the return value buffer as the first argument, instead of being the very last one. Just because.
But it doesn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, Windows ABI does the same for return value pointers now that I think about it.
@@ -811,11 +821,46 @@ fn make_sync_fn<'s>( | |||
// SAFETY: The pointer will not be deallocated until the function is | |||
// garbage collected. | |||
let symbol = unsafe { &*(external.value() as *const Symbol) }; | |||
let needs_unwrap = match needs_unwrap(symbol.result_type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bool-like name here is a bit confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the needs_unwrap(symbol.result_type)
call & match and the is_i64(symbol.result_type)
call can be taken outside the closure, if wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would mean leaking it with the Symbol. V8 function closures cannot capture environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can split apart into multiple branches, with different specialised function pointers passed to the v8::FunctionTemplate::builder
?
I would say to hoist them outside of the function.
This would be three different paths, for needs_unwrap.is_some()
, false == is_i64
, true == is_i64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs_unwrap is cheap, happy to have 3 paths if there is a performance benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also like to presume that it's cheap, but there's also args.get(symbol.parameter_types.len() as i32)
, which may be better optimised by it... but another PR can measure differences and consider this - no reason to halt the PR on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from the possible confusion with BigInt64Array
/ CType::Int64
leading to the probably needless use of Uint32Array
as the return value buffer, this LGTM.
I would personally still place the return value pointer as the first parameter just so it's similar to other ABIs but it shouldn't really matter, and shifting arguments over would require changes to the ffi_parse_args
, perhaps in the form of an offset
parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C trampoline seems to be wrong.
Also, I do think it would be better to have the return value pointer as the first argument (not counting receiver object), meaning adding an offset
parameter to parse_ffi_args
function for the slow path. This will make the manual assembly generation easier.
${ | ||
isI64(resultType) | ||
? `const n1 = Number(b[0])` | ||
: `const n1 = vui[0] + 2 ** 32 * vui[1]` // Faster path for u64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is BigUint64Array
avoided? It's slower? If so, has a similar attempt been done for i64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it is ~8ns slower on my machine.
has a similar attempt been done for i64?
yes but there was no perf benefit compared to BigUint64Array. Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but there was no perf benefit compared to BigUint64Array. Any suggestions?
I'd recommend measuring the performance of unconditionally using BigUint64Array
, but conditionally applying BigInt.asIntN(64, x)
to the extracted u64 BigInt
. V8 optimises BigInt.asIntN(64, x)
/BigInt.asUintN(64, x)
to native 64-bit integers, but I'm unaware of whether it optimises handling of 64-bit input.
If you find that it yields no gain, discard the suggestion and resolve this.
ext/ffi/00_ffi.js
Outdated
"b", | ||
"call", | ||
`return function (${params.join(", ")}) { | ||
call(${params.join(", ")}${parameters.length > 0 ? ", " : ""}vi); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ${parameters.length > 0 ? ", " : ""}vi
may be avoided by pushing "vi"
to params
after mapping?
Co-authored-by: Phosra <[email protected]>
@@ -83,20 +86,35 @@ pub(crate) fn codegen(sym: &crate::Symbol) -> String { | |||
c += native_arg_to_c(ty); | |||
let _ = write!(c, " p{i}"); | |||
} | |||
if needs_unwrap { | |||
let _ = write!(c, ", struct FastApiTypedArray* const p_ret"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also change the void* recv
to be void* const recv
while we're here, since it similarly cannot be rewritten for any meaningful effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now curious as to whether the extra text and validation could have a negative impact on performance? Every "const " is an extra 2x6 bytes per function - that could add a minor amount of extra reallocations and copies for our Rust side, and some extra processing for the C compiler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These compilations are done once per symbol at Deno.dlopen()
time, which is most likely done once. The dlopen()
call isn't performance critical, most users probably wouldn't realise even if it took a whole second.
More impactful text removals would be eg. using u8
etc. type names in the prelude (This would actually be more proper in some sense, as _t
type names are reserved for official types to avoid name clashes. Of course, here we're just copying the types so it's not really intruding.), using a shorter name for FastApiTypedArray
etc. None of this is likely to make much of a performance difference.
Testing out compiling with and without const
s on Godbolt doesn't show any meaningful difference, though there of course a majority of the time is probably just spent in the HTTP requests etc. Still, I doubt a few consts are going to make a meaningful difference in the dlopen
performance, and we've already seen that not having them allowed for a surprising kind of (potential) bug to be introduced in the C code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I suppose that it's comparable to say WebAssembly.instantiate
- rarely does one care about the initialisation time.
if needs_unwrap { | ||
// <return_type> r = func(p0, p1, ...); | ||
// ((<return_type>*)p_ret->data)[0] = r; | ||
let _ = write!(c, " {ffi_ret} r = {call_s}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs two spaces before {ffi_ret}
I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how the non-wrapped path is indented so consistency.
const vi = new Int32Array(2); | ||
const vui = new Uint32Array(vi.buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vi
is only used on the Rust side as a polymorphic [u8]
? If so, the Uint32Array
can be allocated instead, and the vi
may be discarded here, using vui
in its place.
Although, as we're logically handling a 64-bit value, it may be more expressive to use BigUint64Array(1)
.
} | ||
|
||
if (needsUnpacking && !isNonBlocking) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
if (needsUnpacking && !isNonBlocking) { | |
} else if (needsUnpacking) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm leaving an approval as the remaining comments are mostly about moving stuff around and adding safety comments. We can address those after releasing this patch in v1.24.1 tonight
e2f8e04
to
a603454
Compare
Optimize 64bit integer / bigint types to go through the fast call.