[performance] [swift] Add systematic inlining #7253
Conversation
@hassila Thanks for this PR. I've looked through it and noticed that you added `@inline(__always)` to some public APIs. Can we remove those, since we would prefer not to inline public APIs? ~ link to reasoning.
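The distinction the review is drawing can be sketched as follows - a hypothetical `Buffer` type (names are illustrative, not the actual FlatBuffers API), where the public surface uses `@inlinable` while `@inline(__always)` is reserved for internal helpers:

```swift
public struct Buffer {
    // @usableFromInline lets the @inlinable body below reference
    // this internal property from inlined copies in other modules.
    @usableFromInline
    internal var storage: [UInt8] = []

    // Public API: @inlinable exposes the body across the module
    // boundary so the optimizer *may* inline it, without forcing
    // inlining (and code growth) at every call site.
    @inlinable
    public var count: Int { storage.count }

    // Internal helper: @inline(__always) is lower risk here since
    // the symbol never escapes the module.
    @inline(__always)
    internal mutating func append(_ byte: UInt8) {
        storage.append(byte)
    }
}
```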
However, while you are at it, there are some other functions that might benefit from this PR, like:

- `func reallocate(_ size: Int, writerSize: Int, alignment: Int)`
- `func add(loc: FieldLoc)`
- `VTableStorage.clear()`
- `ByteBuffer.Storage.initialize(for size: Int)`

It would be interesting to see the performance when the public functions are removed and these are added instead.
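A sketch of what annotating those internal helpers might look like (the signatures echo the list above, but the types and bodies here are placeholders, not the real FlatBuffers implementation):

```swift
// Placeholder type standing in for the real FieldLoc.
struct FieldLoc {
    var offset: Int
    var position: Int
}

final class VTableStorage {
    private var fields: [FieldLoc] = []

    var count: Int { fields.count }

    // Internal hot-path helpers: forcing inlining here avoids a
    // call per field without touching the public API surface.
    @inline(__always)
    func add(loc: FieldLoc) {
        fields.append(loc)
    }

    @inline(__always)
    func clear() {
        fields.removeAll(keepingCapacity: true)
    }
}
```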
This is a reference to all the work that has been done on the Swift lib in terms of performance.
It seems the major reason is to avoid excessive code blowup if used without care? I'd argue that in the case of ByteBuffer, the public API consists of very small functions that should run a fairly low risk of causing such regressions (while it seems to give a real performance benefit) - most functions are just a couple of lines. I can remove them if you want, but I would ask you to at least consider it for such small functions (I agree it may be a bad idea to apply it indiscriminately).

I can see the following difference for ByteBuffer inlining, quite significant (20%!):

With ByteBuffer inlining: 88 ns (I got it down a little bit more thanks to a smaller initial FBB size - that also pushed the C++ test down to 59 ns, as a reference)
I tried those; no major differences (a couple of them are also public as far as I can see, so I didn't try those).
Yeah, basically. Especially if the project contains multiple schemas.
The inlining for the
This might change with the benchmark test I appended below. What benchmarks are you running? Can you run the following benchmark test suite on both this branch and master on your device, so we can have a more in-depth understanding of what steps should be taken?
I'll try that separately, but see the results from your benchmark suite below - even more significant improvements than in my simple test for some of them (between 6 and 33%) :-)
So the changes are to the ByteBuffer and the non-public functions only in the FlatBufferBuilder, and we got those test results?
I turned on inlining for the four additional ones you suggested, with similar results - but there is some variation between runs too. I would suggest increasing the number of iterations at least 10x for that benchmark to get more stable numbers.
No, that was the PR as-is; I will look at yet another version.
Yes, the 3M structs is just a huge benchmark test that I basically only run once, since we know allocation and rewriting the buffer is a bit slower in Swift than in C++.
I'll structure a few different tests and try to break the numbers down. Would you be open to a PR changing the run count of your benchmark so the total runtime is closer to 1 s in magnitude? That makes for more stable and comparable results.
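The iteration-count point can be illustrated with a minimal timing harness (illustrative only - not the project's benchmark code): pick the iteration count so total wall time lands near one second, which averages away per-run scheduler noise.

```swift
import Dispatch

// Returns the mean time per call in nanoseconds over `iterations`
// runs of `body`. Minimal sketch; a real harness would also do
// warmup runs and report variance across repetitions.
func meanNanoseconds(iterations: Int, _ body: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { body() }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / Double(iterations)
}

// For a ~100 ns body, on the order of 10 million iterations gives
// ~1 s of total runtime. `sink` keeps the optimizer from deleting
// the loop body.
var sink: UInt64 = 0
let perCall = meanNanoseconds(iterations: 1_000_000) { sink &+= 1 }
```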
Yeah. Go for it!
Please see #7255
Ok, I have run three setups and compared. First, current master as the reference benchmark level. Then one setup where everything in ByteBuffer is inlined explicitly. Then one setup where everything in both ByteBuffer and FlatBufferBuilder is inlined explicitly.

I'm editing the output for clarity so we only look at the time spent for each test (I manually picked 100 iterations, which is more than I got dynamically, just to have extra runs for more robust results).

Performance delta from master -> ByteBuffer inline:
So on average the tests are 9% faster. Then moving to inlining FBB completely too gives a clear improvement on all tests (also on my standalone test with small structs it gives 8% or so):
With both these changes applied, compared to current master as the baseline:
Overall a 16% improvement, with improvements ranging from 5% to 34%. [edit: misread the results - aggressive inlining is best for all tests] What do you think - is it acceptable? I'd push a new update with all changes (I think there are a couple of missing inlines in the current PR, as I added the ones you suggested here too).
My reasoning why inlining is probably ok in FlatBuffers is that most use cases I can think of would funnel call sites to FBB to a single place per model entity, so we would only inline a single copy. Granted, more complex models might cause more copies, but looking at the actual code that is inlined, the functions are usually just a few lines (with the exception of
Looking at the benchmarks, it seems that it's worth it for now to just inline everything. We can keep it inlined, since most people who will use this are going to use it on macOS/iOS/server side and rarely on web/embedded systems, so we should be safe for now. And if people complain about the binary size, we can iterate over this to find the right balance.
Agreed - I may have a larger test suite in the future and will definitely be happy to iterate further to strike a balance if needed.
Ok, good news - nothing really changed. I removed just a few of the larger functions from inlining (see the latest force push of the PR), and this is master vs. the current PR:
I also still see the drop on my small test, down to 92 ns (~15%), which is nice. If you are happy with it, I'd say it's good to go now.
(A different future discussion point is the use of `@inlinable` - https://forums.swift.org/t/when-should-both-inlinable-and-inline-always-be-used/37375/2 - but that's a separate discussion.)
LGTM
When reading this, it seems to me that
It seems that this is the RFC in question, and it seems pretty interesting.
@hassila Thanks for the contribution.
Thanks, will have a look.
…atBufferBuilder (google#7253) Co-authored-by: Joakim Hassila <[email protected]>
I did some more performance testing following up on the previous PR: by adding systematic inlining in ByteBuffer and FlatBufferBuilder, I managed to push it down further, closer to C++ performance - from 107 ns to 94 ns per message (C++ is at 71 ns or so) - so around a 13% improvement with this PR.
It would be interesting to hear your results if you test this and the previous change with some additional benchmarks; for my test case it's a significant improvement.