Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(ext/ffi): use fast api calls for 64bit return types #15313

Merged
merged 21 commits into from
Jul 28, 2022

Conversation

littledivy
Copy link
Member

@littledivy littledivy commented Jul 26, 2022

Optimize 64bit integer / bigint types to go through the fast call.

ext/ffi/00_ffi.js Outdated Show resolved Hide resolved
// https://source.chromium.org/chromium/chromium/src/+/main:v8/include/v8-fast-api-calls.h;l=336
struct FastApiTypedArray {
uintptr_t length_;
uint32_t* data;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess it's not an option.

ext/ffi/00_ffi.js Outdated Show resolved Hide resolved
if (needsUnpacking && !isNonBlocking) {
const call = this.symbols[symbol];
const parameters = symbols[symbol].parameters;
const v = new Int32Array(2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be a bad idea to have this be a general buffer for the entire dlopen, so that all calls use the retval memory? Definitely a minor optimisation memory optimisation, though.

@@ -82,16 +95,31 @@ pub(crate) fn codegen(sym: &crate::Symbol) -> String {
c += native_arg_to_c(ty);
let _ = write!(c, " p{i}");
}
if needs_unwrap {
let _ = write!(c, ", struct FastApiTypedArray *p_ret");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

SysV ABI at least passes return value pointers as the 0th parameter. We can't quite get that far due to the receiver object (unless we piggy-back on that somehow) but I'd still maybe put the return value buffer as the first argument, instead of being the very last one. Just because.

But it doesn't really matter.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, Windows ABI does the same for return value pointers now that I think about it.

@littledivy littledivy marked this pull request as ready for review July 27, 2022 08:28
@littledivy littledivy changed the title experiment(ext/ffi): use fast api calls with 64bit types perf(ext/ffi): use fast api calls for 64bit return types Jul 27, 2022
ext/ffi/00_ffi.js Outdated Show resolved Hide resolved
@@ -811,11 +821,46 @@ fn make_sync_fn<'s>(
// SAFETY: The pointer will not be deallocated until the function is
// garbage collected.
let symbol = unsafe { &*(external.value() as *const Symbol) };
let needs_unwrap = match needs_unwrap(symbol.result_type) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bool-like name here is a bit confusing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the needs_unwrap(symbol.result_type) call & match and the is_i64(symbol.result_type) call can be taken outside the closure, if wanted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would mean leaking it with the Symbol. V8 function closures cannot capture environment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay.

Copy link

@ghost ghost Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can split apart into multiple branches, with different specialised function pointers passed to the v8::FunctionTemplate::builder?
I would say to hoist them outside of the function.
This would be three different paths, for needs_unwrap.is_some(), false == is_i64, true == is_i64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs_unwrap is cheap, happy to have 3 paths if there is a performance benefit.

Copy link

@ghost ghost Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to presume that it's cheap, but there's also args.get(symbol.parameter_types.len() as i32), which may be better optimised by it... but another PR can measure differences and consider this - no reason to halt the PR on this.

ext/ffi/lib.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@aapoalas aapoalas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the possible confusion with BigInt64Array / CType::Int64 leading to the probably needless use of Uint32Array as the return value buffer, this LGTM.

I would personally still place the return value pointer as the first parameter just so it's similar to other ABIs but it shouldn't really matter, and shifting arguments over would require changes to the ffi_parse_args, perhaps in the form of an offset parameter.

ext/ffi/jit_trampoline.rs Outdated Show resolved Hide resolved
ext/ffi/jit_trampoline.rs Outdated Show resolved Hide resolved
ext/ffi/jit_trampoline.rs Outdated Show resolved Hide resolved
@ry ry requested a review from piscisaureus July 27, 2022 17:38
Copy link
Collaborator

@aapoalas aapoalas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C trampoline seems to be wrong.

Also, I do think it would be better to have the return value pointer as the first argument (not counting receiver object), meaning adding an offset parameter to parse_ffi_args function for the slow path. This will make the manual assembly generation easier.

ext/ffi/jit_trampoline.rs Outdated Show resolved Hide resolved
ext/ffi/00_ffi.js Outdated Show resolved Hide resolved
ext/ffi/jit_trampoline.rs Outdated Show resolved Hide resolved
${
isI64(resultType)
? `const n1 = Number(b[0])`
: `const n1 = vui[0] + 2 ** 32 * vui[1]` // Faster path for u64
Copy link

@ghost ghost Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is BigUint64Array avoided? It's slower? If so, has a similar attempt been done for i64?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is ~8ns slower on my machine.

has a similar attempt been done for i64?

yes but there was no perf benefit compared to BigUint64Array. Any suggestions?

Copy link

@ghost ghost Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but there was no perf benefit compared to BigUint64Array. Any suggestions?

I'd recommend measuring the performance of unconditionally using BigUint64Array, but conditionally applying BigInt.asIntN(64, x) to the extracted u64 BigInt. V8 optimises BigInt.asIntN(64, x)/BigInt.asUintN(64, x) to native 64-bit integers, but I'm unaware of whether it optimises handling of 64-bit input.

If you find that it yields no gain, discard the suggestion and resolve this.

"b",
"call",
`return function (${params.join(", ")}) {
call(${params.join(", ")}${parameters.length > 0 ? ", " : ""}vi);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ${parameters.length > 0 ? ", " : ""}vi may be avoided by pushing "vi" to params after mapping?

ext/ffi/jit_trampoline.rs Outdated Show resolved Hide resolved
@littledivy littledivy requested a review from ry July 28, 2022 04:17
ext/ffi/jit_trampoline.rs Outdated Show resolved Hide resolved
test_ffi/src/lib.rs Outdated Show resolved Hide resolved
@@ -83,20 +86,35 @@ pub(crate) fn codegen(sym: &crate::Symbol) -> String {
c += native_arg_to_c(ty);
let _ = write!(c, " p{i}");
}
if needs_unwrap {
let _ = write!(c, ", struct FastApiTypedArray* const p_ret");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also change the void* recv to be void* const recv while we're here, since it similarly cannot be rewritten for any meaningful effect.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now curious as to whether the extra text and validation could have a negative impact on performance? Every "const " is an extra 2x6 bytes per function - that could add a minor amount of extra reallocations and copies for our Rust side, and some extra processing for the C compiler?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These compilations are done once per symbol at Deno.dlopen() time, which is most likely done once. The dlopen() call isn't performance critical, most users probably wouldn't realise even if it took a whole second.

More impactful text removals would be eg. using u8 etc. type names in the prelude (This would actually be more proper in some sense, as _t type names are reserved for official types to avoid name clashes. Of course, here we're just copying the types so it's not really intruding.), using a shorter name for FastApiTypedArray etc. None of this is likely to make much of a performance difference.

Testing out compiling with and without consts on Godbolt doesn't show any meaningful difference, though there of course a majority of the time is probably just spent in the HTTP requests etc. Still, I doubt a few consts are going to make a meaningful difference in the dlopen performance, and we've already seen that not having them allowed for a surprising kind of (potential) bug to be introduced in the C code.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I suppose that it's comparable to say WebAssembly.instantiate - rarely does one care about the initialisation time.

if needs_unwrap {
// <return_type> r = func(p0, p1, ...);
// ((<return_type>*)p_ret->data)[0] = r;
let _ = write!(c, " {ffi_ret} r = {call_s}");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs two spaces before {ffi_ret} I think?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how the non-wrapped path is indented so consistency.

ext/ffi/lib.rs Show resolved Hide resolved
Comment on lines +345 to +346
const vi = new Int32Array(2);
const vui = new Uint32Array(vi.buffer);
Copy link

@ghost ghost Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vi is only used on the Rust side as a polymorphic [u8]? If so, the Uint32Array can be allocated instead, and the vi may be discarded here, using vui in its place.
Although, as we're logically handling a 64-bit value, it may be more expressive to use BigUint64Array(1).

Comment on lines 340 to +342
}

if (needsUnpacking && !isNonBlocking) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
if (needsUnpacking && !isNonBlocking) {
} else if (needsUnpacking) {

Copy link
Member

@bartlomieju bartlomieju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm leaving an approval as the remaining comments are mostly about moving stuff around and adding safety comments. We can address those after releasing this patch in v1.24.1 tonight

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants