
[performance] [swift] Add systematic inlining #7253

Merged
merged 1 commit into from
Apr 20, 2022
Conversation

hassila
Contributor

@hassila hassila commented Apr 18, 2022

Did some more performance testing following up on the previous PR. By adding systematic inlining in ByteBuffer and FlatBufferBuilder, I managed to push latency further down, closer to C++ performance: from 107 ns to 94 ns per message (C++ is around 71 ns), so roughly a 13% improvement with this PR.

It would be interesting to hear your results if you test this and the previous change against any additional benchmarks you have; for my test case it's a significant improvement.
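To make the change concrete, here is a minimal sketch (hypothetical type, not the actual flatbuffers source) of the kind of annotation this PR adds to small hot-path functions:

```swift
// Hypothetical sketch: @inline(__always) asks the compiler to inline a
// function at every call site, removing call overhead on hot paths.
// Bodies this small carry little risk of code-size blowup.
struct MiniByteBuffer {
    private var storage: [UInt8] = []

    @inline(__always)
    mutating func push(_ byte: UInt8) {
        storage.append(byte)
    }

    var count: Int { storage.count }
}

var buffer = MiniByteBuffer()
for b in 0..<10 { buffer.push(UInt8(b)) }
print(buffer.count) // prints "10"
```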

@github-actions github-actions bot added the swift label Apr 18, 2022
Collaborator

@mustiikhalil mustiikhalil left a comment


@hassila Thanks for this PR. I've looked through it and noticed that you added @inline(__always) to some public APIs. Can we remove those, since we would prefer not to inline public APIs? ~ link to reasoning.

However, since you are at it, there are some other functions that might benefit from this PR, like:

  • func reallocate(_ size: Int, writerSize: Int, alignment: Int)
  • func add(loc: FieldLoc)
  • VTableStorage.clear()
  • ByteBuffer.Storage.initialize(for size: Int)

It would be interesting to see the performance when the public functions are removed and these are added instead.
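The pattern this suggests, sketched with hypothetical names (not the actual flatbuffers source): keep @inline(__always) on internal helpers and leave the public entry points un-annotated, so client call sites don't grow when the API is used widely.

```swift
public struct Builder {
    private var bytes: [UInt8] = []
    public init() {}

    // Internal helper: a safe inlining candidate, invisible to clients.
    @inline(__always)
    private mutating func ensureCapacity(_ extra: Int) {
        bytes.reserveCapacity(bytes.count + extra)
    }

    // Public entry point left un-annotated; inlining decisions for it
    // stay with the optimizer rather than being forced on every caller.
    public mutating func append(_ value: UInt8) {
        ensureCapacity(1)
        bytes.append(value)
    }

    public var count: Int { bytes.count }
}

var b = Builder()
b.append(7)
b.append(9)
print(b.count) // prints "2"
```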

@mustiikhalil
Collaborator

This is a reference to all the work that has been done on the Swift lib in terms of performance.

#5798

@hassila
Contributor Author

hassila commented Apr 18, 2022

@hassila Thanks for this PR. I've looked through it and noticed that you added @inline(__always) to some public APIs. Can we remove those, since we would prefer not to inline public APIs? ~ link to reasoning.

It seems the major reason is to avoid excessive code blowup if used without care?

I'd argue that in the case of ByteBuffer, the public API consists of very small functions that should run fairly low risk of causing such regressions (while inlining seems to give a real performance benefit); most functions are just a couple of lines. I can remove the annotations if you want, but I'd ask you to at least consider them for such small functions (I agree it may be a bad idea to apply this indiscriminately).

I can see the following difference for ByteBuffer inlining, quite significant (20%!):

With ByteBuffer inlining: 88 ns
Without inlining: 107 ns

(I got it down a little bit more thanks to a smaller initial FlatBufferBuilder size; that also pushed the C++ test down to 59 ns as a reference.)

However, since you are at it, there are some other functions that might benefit from this PR, like:

  • func reallocate(_ size: Int, writerSize: Int, alignment: Int)
  • func add(loc: FieldLoc)
  • VTableStorage.clear()
  • ByteBuffer.Storage.initialize(for size: Int)

It would be interesting to see the performance when the public functions are removed and these are added instead.

I tried those; no major differences (a couple of them are also public as far as I can see, so I didn't try those).

@mustiikhalil
Collaborator

It seems the major reason is to avoid excessive code blowup if used without care?

Yeah, basically. Especially if the project contains multiple schemas.

I can see the following difference for ByteBuffer inlining, quite significant (20%!):
With ByteBuffer inlining: 88 ns
Without inlining: 107 ns

The inlining in ByteBuffer we can definitely keep (even the public ones, since they are one-liners). However, I was more worried about the public functions in FlatBufferBuilder, since those are the ones that might expand the binary if misused.

I tried those, no major differences (a couple of them are also public as far as I can see, so didn't try those)

This might change with the benchmark test I appended below.

What benchmarks are you running? Could you run the following benchmark test suite on both this branch and master on your machine, so we can get a more in-depth understanding of what steps should be taken?

@hassila
Contributor Author

hassila commented Apr 18, 2022

I tried those; no major differences (a couple of them are also public as far as I can see, so I didn't try those).

This might change with the benchmark test I appended below.

I'll try that separately, but see the results from your benchmark suite below; even more significant improvements than in my simple test for some of them (between 6% and 33%). :-)

What benchmarks are you running? Could you run the following benchmark test suite on both this branch and master on your machine, so we can get a more in-depth understanding of what steps should be taken?

---------------------

Building for production...
[3/3] Linking FlatBuffers.Benchmarks.swift
Build complete! (1.99s)
done 500_000: in 0.03937615156173706
done 10 str: in 0.00013124942779541016
done 100 str: in 0.00027724504470825193
done 3M strc: in 0.25692498683929443
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.039376s	| 
-------------------------------------
|	10 str		|		0.000131s	| 
-------------------------------------
|	3M strc		|		0.256925s	| 
-------------------------------------
|	100 str		|		0.000277s	| 
-------------------------------------

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (master)> 

106 ns

---------------------

Building for production...
[3/3] Linking FlatBuffers.Benchmarks.swift
Build complete! (2.22s)
done 500_000: in 0.03720535039901733
done 10 str: in 0.00010554790496826172
done 100 str: in 0.0003162503242492676
done 3M strc: in 0.24083900451660156
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.037205s	| 
-------------------------------------
|	10 str		|		0.000106s	| 
-------------------------------------
|	3M strc		|		0.240839s	| 
-------------------------------------
|	100 str		|		0.000187s	| 
-------------------------------------

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-optimization-inlining)> 

95 ns

---------------------
Performance differences:

-------------------------------------
|	Name		  |		Elapsed time factor (lower is faster in branch)	|
-------------------------------------
|	500_000		|		0.94	| 
-------------------------------------
|	10 str		|		0.81	| 
-------------------------------------
|	3M strc		|		0.94	| 
-------------------------------------
|	100 str		|		0.67	| 
-------------------------------------
---------------------

@mustiikhalil
Collaborator

So the changes are to the ByteBuffer and only the non-public functions in the FlatBufferBuilder, and we got those test results?
That sounds really good!

@hassila
Contributor Author

hassila commented Apr 18, 2022

I turned on inlining for the four additional functions you suggested; similar results, but there is some variation between runs too. I would suggest increasing the number of iterations at least 10x for that benchmark to get more stable numbers.

  • When I did that, I could see that those 4 actually regressed performance (the numbers below are not comparable with the previous run, as I increased iterations 10x, except for structs which was 3x):
4 additional inlines vs branch:
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.423372s	| 
-------------------------------------
|	10 str		|		0.000955s	| 
-------------------------------------
|	3M strc		|		0.862539s	| 
-------------------------------------
|	100 str		|		0.002244s	| 
-------------------------------------

vs branch as is:
-------------------------------------
|	Name		  |		Scores		|
-------------------------------------
|	500_000		|		0.406291s	| 
-------------------------------------
|	10 str		|		0.000934s	| 
-------------------------------------
|	3M strc		|		0.823144s	| 
-------------------------------------
|	100 str		|		0.001985s	| 
-------------------------------------

@hassila
Contributor Author

hassila commented Apr 18, 2022

So the changes are to the ByteBuffer and only the non-public functions in the FlatBufferBuilder, and we got those test results?
That sounds really good!

No, that was the PR as-is; I will look at yet another version.

@mustiikhalil
Collaborator

Yes, the 3M structs test is just a huge benchmark that I kind of only run once, since we know allocating and rewriting the buffer is a bit slower in Swift than in C++.

@hassila
Contributor Author

hassila commented Apr 18, 2022

I'll structure a few different tests and try to break the numbers down. Would you be open to a PR changing the run count of your benchmark so it's closer to 1 s in magnitude? That makes for more stable and comparable results.
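As a sketch of what I mean (hypothetical harness, not the project's benchmark code): run enough iterations that the total elapsed time approaches a second, then report the mean, which is far more stable than timing a handful of runs.

```swift
import Dispatch

// Minimal timing-harness sketch (illustrative only): the mean over many
// iterations smooths out scheduler noise and timer granularity.
func measure(iterations: Int, _ body: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { body() }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / Double(iterations) // mean ns per iteration
}

// The &+= and the printed sink keep the optimizer from eliding the loop body.
var sink = 0
let nsPerOp = measure(iterations: 1_000_000) { sink &+= 1 }
print("mean \(nsPerOp) ns/op (sink = \(sink))")
```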

@mustiikhalil
Collaborator

Yeah. Go for it!

@hassila
Contributor Author

hassila commented Apr 19, 2022

Yeah. Go for it!

Please see #7255

@hassila
Contributor Author

hassila commented Apr 19, 2022

OK, I have run three setups and compared them: first, current master as the reference benchmark level.

Then one setup where everything in ByteBuffer is inlined explicitly.

Then one setup where everything in both ByteBuffer and FlatBufferBuilder is inlined explicitly.

I'm editing the output for clarity so we only look at the time spent for each test (I manually picked 100 iterations, which is more than I got dynamically, just to have extra runs for more robust results):

Performance delta from master -> ByteBuffer inlined:

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-inline-bytebuffer)> python3 ./compare.py master.json bytebuffer_inlined.json
benchmark             column          master bytebuffer_inlined      %
----------------------------------------------------------------------
10Strings             time       14758250.00        10587021.00  28.26
100Strings            time       28954958.50        26783479.00   7.50
FlatBufferBuilder.add time       82753104.50        83566000.00  -0.98
structs               time       80878063.00        82333687.50  -1.80
----------------------------------------------------------------------
                      time                                        9.12

So on average the tests are 9% faster.

Then, moving to inlining FlatBufferBuilder completely too gives a clear improvement on all tests (on my standalone test with small structs it also gives 8% or so):

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-inline-bytebuffer)> python3 ./compare.py bytebuffer_inlined.json all_inlined.json
benchmark             column     bytebuffer_inlined all_inlined       %
-----------------------------------------------------------------------
10Strings             time              10587021.00  9716541.50    8.22
100Strings            time              26783479.00 24986292.00    6.71
FlatBufferBuilder.add time              83566000.00 77523312.00    7.23
structs               time              82333687.50 76435729.50    7.16
-----------------------------------------------------------------------
                      time                                         7.33

With both these changes applied, compared to current master as the baseline:

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-inline-bytebuffer)> python3 ./compare.py master.json all_inlined.json
benchmark             column          master all_inlined      %
---------------------------------------------------------------
10Strings             time       14758250.00  9716541.50  34.16
100Strings            time       28954958.50 24986292.00  13.71
FlatBufferBuilder.add time       82753104.50 77523312.00   6.32
structs               time       80878063.00 76435729.50   5.49
---------------------------------------------------------------
                      time                                15.78

Overall a 16% improvement, with individual improvements ranging from 5% to 34%.

[edit: I misread the results; aggressive inlining is best for all tests.]
We can see we win on all tests when adding the aggressive inlining in FlatBufferBuilder, and there is a clear overall improvement vs. master even with all inlining in place.

What do you think, is it acceptable? I'd push a new update with all changes (I think there are a couple of missing inlines in the current PR, as I also added the ones you suggested here).

@hassila
Contributor Author

hassila commented Apr 19, 2022

My reasoning for why inlining is probably OK in FlatBuffers is that most use cases I can think of would funnel call sites into FlatBufferBuilder at a single place per model entity, so we would only inline a single copy. Granted, more complex models might cause more copies, but looking at the actual code that is inlined, the functions are usually just a few lines (with the exception of endTable(at:) — maybe we can remove the inline for that one?).

@mustiikhalil
Collaborator

Looking at the benchmarks, it seems worth it for now to just inline everything. We can keep it inlined, since most people who use this will probably use it on macOS/iOS/server-side and rarely on web/embedded systems, so we should be safe for now. And if people complain about the binary size, we can iterate over this again to find the right balance.

  • How much would performance drop if we remove the inlining at endTable(at:)?

@hassila
Contributor Author

hassila commented Apr 20, 2022

Looking at the benchmarks, it seems worth it for now to just inline everything. We can keep it inlined, since most people who use this will probably use it on macOS/iOS/server-side and rarely on web/embedded systems, so we should be safe for now. And if people complain about the binary size, we can iterate over this again to find the right balance.

Agreed; I may have a larger test suite in the future and will definitely be happy to iterate further to strike a balance if needed.

  • How much would performance drop if we remove the inlining at endTable(at:)?

OK, good news: nothing, really. I removed just a few of the larger functions from inlining (see the latest force-push of the PR), and this is master vs. the current PR:

hassila@max ~/D/G/p/f/t/FlatBuffers.Benchmarks.swift (swift-optimization-inlining)> python3 ./compare.py master.json next.json
benchmark             column          master        next     %
--------------------------------------------------------------
10Strings             time       14758250.00  9685729.00 34.37
100Strings            time       28954958.50 25886229.50 10.60
FlatBufferBuilder.add time       82753104.50 77747895.50  6.05
structs               time       80878063.00 75260562.50  6.95
--------------------------------------------------------------
                      time                               15.37

I also still see the drop on my small test, down to 92 ns (~15%), which is nice.

If you are happy with it, I'd say it's good to go now.

@hassila
Contributor Author

hassila commented Apr 20, 2022

(A different future discussion point is the use of @inlinable - https://forums.swift.org/t/when-should-both-inlinable-and-inline-always-be-used/37375/2 - a separate discussion though.)

Collaborator

@mustiikhalil mustiikhalil left a comment


LGTM

@mustiikhalil
Collaborator

(A different future discussion point is the use of @inlinable - https://forums.swift.org/t/when-should-both-inlinable-and-inline-always-be-used/37375/2 - a separate discussion though.)

Reading this, it seems to me that @inlinable is mainly used when we have binary frameworks; not 100% sure though.
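A small sketch of the distinction as I understand it (illustrative names, not a definitive account): @inlinable publishes a function's body as part of the module interface so clients, including those of a binary framework, can inline it across the module boundary, whereas @inline(__always) forces inlining wherever the optimizer can already see the body.

```swift
public struct Counter {
    public var value: Int = 0
    public init() {}

    // @inlinable: the body becomes part of the module's public interface,
    // so code in *other* modules can inline and specialize this call.
    @inlinable
    public mutating func bump() { value &+= 1 }

    // @inline(__always): forced inlining, but only where the body is
    // visible (within this module, unless also marked @inlinable).
    @inline(__always)
    public mutating func reset() { value = 0 }
}

var c = Counter()
c.bump()
c.bump()
print(c.value) // prints "2"
c.reset()
print(c.value) // prints "0"
```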

@mustiikhalil
Collaborator

It seems that this is the RFC in question, and it looks pretty interesting.

@mustiikhalil mustiikhalil merged commit a45f564 into google:master Apr 20, 2022
@mustiikhalil
Collaborator

@hassila Thanks for the contribution.

@hassila
Contributor Author

hassila commented Apr 20, 2022

It seems that this is the RFC in question and it seems pretty interesting

Thanks, will have a look.

mustiikhalil pushed a commit to mustiikhalil/flatbuffers that referenced this pull request Apr 20, 2022
@hassila hassila deleted the swift-optimization-inlining branch April 20, 2022 07:02