
[performance] [swift] Add systematic inlining #7253

Merged
merged 1 commit into from
Apr 20, 2022
Conversation

hassila
Contributor

@hassila hassila commented Apr 18, 2022

Did some more performance testing following up on the previous PR. By adding systematic inlining in ByteBuffer and FlatBufferBuilder, I managed to push latency further down, closer to C++ performance: from 107 ns to 94 ns per message (C++ is around 71 ns), so roughly a 13% improvement with this PR.

It would be interesting to hear your results if you test this and the previous change against any additional benchmarks you have; for my test case it's a significant improvement.
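To make the change concrete, here is a minimal sketch (hypothetical type, not the actual flatbuffers source) of the kind of annotation this PR adds to small hot-path functions:

```swift
// Hypothetical sketch: @inline(__always) asks the compiler to inline a
// function at every call site, removing call overhead on hot paths.
// Bodies this small carry little risk of code-size blowup.
struct MiniByteBuffer {
    private var storage: [UInt8] = []

    @inline(__always)
    mutating func push(_ byte: UInt8) {
        storage.append(byte)
    }

    var count: Int { storage.count }
}

var buffer = MiniByteBuffer()
for b in 0..<10 { buffer.push(UInt8(b)) }
print(buffer.count) // prints "10"
```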

@github-actions github-actions bot added the swift label Apr 18, 2022
Collaborator

@mustiikhalil mustiikhalil left a comment


@hassila Thanks for this PR. I've looked through it and noticed that you added @inline(__always) to some public APIs. Can we remove those, since we would prefer not to inline public APIs? ~ link to reasoning.

However, since you are at it, there are some other functions that might benefit from this PR, like:

  • func reallocate(_ size: Int, writerSize: Int, alignment: Int)
  • func add(loc: FieldLoc)
  • VTableStorage.clear()
  • ByteBuffer.Storage.initialize(for size: Int)

It would be interesting to see the performance when the public functions are removed and these are added instead.
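The pattern this suggests, sketched with hypothetical names (not the actual flatbuffers source): keep @inline(__always) on internal helpers and leave the public entry points un-annotated, so client call sites don't grow when the API is used widely.

```swift
public struct Builder {
    private var bytes: [UInt8] = []
    public init() {}

    // Internal helper: a safe inlining candidate, invisible to clients.
    @inline(__always)
    private mutating func ensureCapacity(_ extra: Int) {
        bytes.reserveCapacity(bytes.count + extra)
    }

    // Public entry point left un-annotated; inlining decisions for it
    // stay with the optimizer rather than being forced on every caller.
    public mutating func append(_ value: UInt8) {
        ensureCapacity(1)
        bytes.append(value)
    }

    public var count: Int { bytes.count }
}

var b = Builder()
b.append(7)
b.append(9)
print(b.count) // prints "2"
```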

@mustiikhalil
Collaborator

This is a reference to all the work that has been done on the Swift lib in terms of performance.

#5798

@hassila
Contributor Author

hassila commented Apr 18, 2022

@hassila Thanks for this PR. I've looked through it and noticed that you added @inline(__always) to some public APIs. Can we remove those, since we would prefer not to inline public APIs? ~ link to reasoning.

It seems the major reason is to avoid excessive code blowup if used without care?

I'd argue that in the case of ByteBuffer, the public API consists of very small functions that should run fairly low risk of causing such regressions (while inlining seems to give a real performance benefit); most functions are just a couple of lines. I can remove the annotations if you want, but I'd ask you to at least consider them for such small functions (I agree it may be a bad idea to apply this indiscriminately).

I can see the following difference for ByteBuffer inlining, quite significant (20%!):

With ByteBuffer inlining: 88 ns
Without inlining: 107 ns

(I got it down a little bit more thanks to a smaller initial FlatBufferBuilder size; that also pushed the C++ test down to 59 ns as a reference.)

However, since you are at it, there are some other functions that might benefit from this PR, like:

  • func reallocate(_ size: Int, writerSize: Int, alignment: Int)
  • func add(loc: FieldLoc)
  • VTableStorage.clear()
  • ByteBuffer.Storage.initialize(for size: Int)

It would be interesting to see the performance when the public functions are removed and these are added instead.

I tried those; no major differences (a couple of them are also public as far as I can see, so I didn't try those).

@mustiikhalil
Collaborator

It seems the major reason is to avoid excessive code blowup if used without care?

Yeah, basically. Especially if the project contains multiple schemas.

I can see the following difference for ByteBuffer inlining, quite significant (20%!):
With ByteBuffer inlining: 88 ns
Without inlining: 107 ns

The inlining in ByteBuffer we can definitely keep (even the public ones, since they are one-liners). However, I was more worried about the public functions in FlatBufferBuilder, since those are the ones that might expand the binary if misused.

I tried those, no major differences (a couple of them are also public as far as I can see, so didn't try those)

This might change with the benchmark test I appended below.

What benchmarks are you running? Could you run the following benchmark test suite on both this branch and master on your machine, so we can get a more in-depth understanding of what steps should be taken?

@hassila
Contributor Author

hassila commented Apr 18, 2022

I tried those; no major differences (a couple of them are also public as far as I can see, so I didn't try those).

This might change with the benchmark test I appended below.

I'll try that separately, but see the results from your benchmark suite below; even more significant improvements than in my simple test for some of them (between 6% and 33%). :-)

What benchmarks are you running? Could you run the following benchmark test suite on both this branch and master on your machine, so we can get a more in-depth understanding of what steps should be taken?

---------------------

Building for production...
[3/3] Linking FlatBuffers.Benchmarks.swift
Build complete! (1.99s)
done 500_000: in 0.03937615156173706
done 10 str: in 0.00013124942779541016
done 100 str: in 0.00027724504470825193
done 3M strc: in 0.25692498683929443
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.039376s	| 
-------------------------------------
|	10 str		|		0.000131s	| 
-------------------------------------
|	3M strc		|		0.256925s	| 
-------------------------------------
|	100 str		|		0.000277s	| 
-------------------------------------

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (master)> 

106 ns

---------------------

Building for production...
[3/3] Linking FlatBuffers.Benchmarks.swift
Build complete! (2.22s)
done 500_000: in 0.03720535039901733
done 10 str: in 0.00010554790496826172
done 100 str: in 0.0003162503242492676
done 3M strc: in 0.24083900451660156
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.037205s	| 
-------------------------------------
|	10 str		|		0.000106s	| 
-------------------------------------
|	3M strc		|		0.240839s	| 
-------------------------------------
|	100 str		|		0.000187s	| 
-------------------------------------

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-optimization-inlining)> 

95 ns

---------------------
Performance differences:

-------------------------------------
|	Name		  |		Elapsed time factor (lower is faster in branch)	|
-------------------------------------
|	500_000		|		0.94	| 
-------------------------------------
|	10 str		|		0.81	| 
-------------------------------------
|	3M strc		|		0.94	| 
-------------------------------------
|	100 str		|		0.67	| 
-------------------------------------
---------------------

@mustiikhalil
Collaborator

So the changes are to the ByteBuffer and only the non-public functions in the FlatBufferBuilder, and we got those test results?
That sounds really good!

@hassila
Contributor Author

hassila commented Apr 18, 2022

I turned on inlining for the four additional functions you suggested; similar results, but there is some variation between runs too. I would suggest increasing the number of iterations at least 10x for that benchmark to get more stable numbers.

  • When I did that, I could see that those 4 actually regressed performance (the numbers below are not comparable with the previous run, as I increased iterations 10x, except for structs which was 3x):
4 additional inlines vs branch:
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.423372s	| 
-------------------------------------
|	10 str		|		0.000955s	| 
-------------------------------------
|	3M strc		|		0.862539s	| 
-------------------------------------
|	100 str		|		0.002244s	| 
-------------------------------------

vs branch as is:
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.406291s	| 
-------------------------------------
|	10 str		|		0.000934s	| 
-------------------------------------
|	3M strc		|		0.823144s	| 
-------------------------------------
|	100 str		|		0.001985s	| 
-------------------------------------

@hassila
Contributor Author

hassila commented Apr 18, 2022

So the changes are to the ByteBuffer and only the non-public functions in the FlatBufferBuilder, and we got those test results?
That sounds really good!

No, that was the PR as-is; I will look at yet another version.

@mustiikhalil
Collaborator

Yes, the 3M structs test is just a huge benchmark that I kind of only run once, since we know allocating and rewriting the buffer is a bit slower in Swift than in C++.

@hassila
Contributor Author

hassila commented Apr 18, 2022

I'll structure a few different tests and try to break the numbers down. Would you be open to a PR changing the run count of your benchmark so it's closer to 1 s in magnitude? That makes for more stable and comparable results.
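As a sketch of what I mean (hypothetical harness, not the project's benchmark code): run enough iterations that the total elapsed time approaches a second, then report the mean, which is far more stable than timing a handful of runs.

```swift
import Dispatch

// Minimal timing-harness sketch (illustrative only): the mean over many
// iterations smooths out scheduler noise and timer granularity.
func measure(iterations: Int, _ body: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { body() }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / Double(iterations) // mean ns per iteration
}

// The &+= and the printed sink keep the optimizer from eliding the loop body.
var sink = 0
let nsPerOp = measure(iterations: 1_000_000) { sink &+= 1 }
print("mean \(nsPerOp) ns/op (sink = \(sink))")
```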

@mustiikhalil
Collaborator

Yeah. Go for it!

@hassila
Contributor Author

hassila commented Apr 19, 2022

Yeah. Go for it!

Please see #7255

@hassila
Contributor Author

hassila commented Apr 19, 2022

OK, I have run three setups and compared them: first, current master as the reference benchmark level.

Then one setup where everything in ByteBuffer is inlined explicitly.

Then one setup where everything in both ByteBuffer and FlatBufferBuilder is inlined explicitly.

I'm editing the output for clarity so we only look at the time spent for each test (I manually picked 100 iterations, which is more than I got dynamically, just to have extra runs for more robust results):

Performance delta from master -> ByteBuffer inlined:

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-inline-bytebuffer)> python3 ./compare.py master.json bytebuffer_inlined.json
benchmark             column          master bytebuffer_inlined      %
----------------------------------------------------------------------
10Strings             time       14758250.00        10587021.00  28.26
100Strings            time       28954958.50        26783479.00   7.50
FlatBufferBuilder.add time       82753104.50        83566000.00  -0.98
structs               time       80878063.00        82333687.50  -1.80
----------------------------------------------------------------------
                      time                                        9.12

So on average the tests are 9% faster.

Then, moving to inlining FlatBufferBuilder completely too gives a clear improvement on all tests (on my standalone test with small structs it also gives 8% or so):

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-inline-bytebuffer)> python3 ./compare.py bytebuffer_inlined.json all_inlined.json
benchmark             column     bytebuffer_inlined all_inlined       %
-----------------------------------------------------------------------
10Strings             time              10587021.00  9716541.50    8.22
100Strings            time              26783479.00 24986292.00    6.71
FlatBufferBuilder.add time              83566000.00 77523312.00    7.23
structs               time              82333687.50 76435729.50    7.16
-----------------------------------------------------------------------
                      time                                         7.33

With both these changes applied, compared to current master as the baseline:

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-inline-bytebuffer)> python3 ./compare.py master.json all_inlined.json
benchmark             column          master all_inlined      %
---------------------------------------------------------------
10Strings             time       14758250.00  9716541.50  34.16
100Strings            time       28954958.50 24986292.00  13.71
FlatBufferBuilder.add time       82753104.50 77523312.00   6.32
structs               time       80878063.00 76435729.50   5.49
---------------------------------------------------------------
                      time                                15.78

Overall a 16% improvement, with individual improvements ranging from 5% to 34%.

[edit: I misread the results; aggressive inlining is best for all tests.]
We can see we win on all tests when adding the aggressive inlining in FlatBufferBuilder, and there is a clear overall improvement vs. master even with all inlining in place.

What do you think, is it acceptable? I'd push a new update with all changes (I think there are a couple of missing inlines in the current PR, as I also added the ones you suggested here).

@hassila
Contributor Author

hassila commented Apr 19, 2022

My reasoning for why inlining is probably OK in FlatBuffers is that most use cases I can think of would funnel call sites into FlatBufferBuilder at a single place per model entity, so we would only inline a single copy. Granted, more complex models might cause more copies, but looking at the actual code that is inlined, the functions are usually just a few lines (with the exception of endTable(at:) — maybe we can remove the inline for that one?).

@mustiikhalil
Collaborator

Looking at the benchmarks, it seems worth it for now to just inline everything. We can keep it inlined, since most people who use this will probably use it on macOS/iOS/server-side and rarely on web/embedded systems, so we should be safe for now. And if people complain about the binary size, we can iterate over this again to find the right balance.

  • How much would performance drop if we remove the inlining at endTable(at:)?

@hassila
Contributor Author

hassila commented Apr 20, 2022

Looking at the benchmarks, it seems worth it for now to just inline everything. We can keep it inlined, since most people who use this will probably use it on macOS/iOS/server-side and rarely on web/embedded systems, so we should be safe for now. And if people complain about the binary size, we can iterate over this again to find the right balance.

Agreed; I may have a larger test suite in the future and will definitely be happy to iterate further to strike a balance if needed.

  • How much would performance drop if we remove the inlining at endTable(at:)?

OK, good news: nothing, really. I removed just a few of the larger functions from inlining (see the latest force-push of the PR), and this is master vs. the current PR:

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-optimization-inlining)> python3 ./compare.py master.json next.json
benchmark             column          master        next     %
--------------------------------------------------------------
10Strings             time       14758250.00  9685729.00 34.37
100Strings            time       28954958.50 25886229.50 10.60
FlatBufferBuilder.add time       82753104.50 77747895.50  6.05
structs               time       80878063.00 75260562.50  6.95
--------------------------------------------------------------
                      time                               15.37

I also still see the drop on my small test, down to 92 ns (~15%), which is nice.

If you are happy with it, I'd say it's good to go now.

@hassila
Contributor Author

hassila commented Apr 20, 2022

(A different future discussion point is the use of @inlinable - https://forums.swift.org/t/when-should-both-inlinable-and-inline-always-be-used/37375/2 - a separate discussion though.)

Collaborator

@mustiikhalil mustiikhalil left a comment


LGTM

@mustiikhalil
Collaborator

(A different future discussion point is the use of @inlinable - https://forums.swift.org/t/when-should-both-inlinable-and-inline-always-be-used/37375/2 - a separate discussion though.)

Reading this, it seems to me that @inlinable is mainly used when we have binary frameworks; not 100% sure though.
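A small sketch of the distinction as I understand it (illustrative names, not a definitive account): @inlinable publishes a function's body as part of the module interface so clients, including those of a binary framework, can inline it across the module boundary, whereas @inline(__always) forces inlining wherever the optimizer can already see the body.

```swift
public struct Counter {
    public var value: Int = 0
    public init() {}

    // @inlinable: the body becomes part of the module's public interface,
    // so code in *other* modules can inline and specialize this call.
    @inlinable
    public mutating func bump() { value &+= 1 }

    // @inline(__always): forced inlining, but only where the body is
    // visible (within this module, unless also marked @inlinable).
    @inline(__always)
    public mutating func reset() { value = 0 }
}

var c = Counter()
c.bump()
c.bump()
print(c.value) // prints "2"
c.reset()
print(c.value) // prints "0"
```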

@mustiikhalil
Collaborator

It seems that this is the RFC in question, and it looks pretty interesting.

@mustiikhalil mustiikhalil merged commit a45f564 into google:master Apr 20, 2022
@mustiikhalil
Collaborator

@hassila Thanks for the contribution.

@hassila
Contributor Author

hassila commented Apr 20, 2022

It seems that this is the RFC in question and it seems pretty interesting

Thanks, will have a look.

mustiikhalil pushed a commit to mustiikhalil/flatbuffers that referenced this pull request Apr 20, 2022
@hassila hassila deleted the swift-optimization-inlining branch April 20, 2022 07:02