Optimize CancellableContinuationImpl.invokeOnCancellation(..) for Segments #3084

Merged · 3 commits merged into develop on Feb 28, 2023

Conversation

@ndkoval (Member) commented on Dec 14, 2021

The current semaphore implementation uses Segment-s for storing waiting continuations. Moreover, the upcoming new channel and mutex algorithms also use segments to store waiters. When suspending, a cancellation handler should be provided via cont.invokeOnCancellation { ... }; it cleans up the corresponding slot in the segment and physically removes the segment from the linked list once it becomes full of cancelled cells. However, this cancellation handler requires an allocation every time.
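
For illustration, here is a minimal Kotlin sketch of the allocating pattern described above; WaiterSegment, storeWaiter, cleanSlot, and waitInSegment are hypothetical names for this sketch, not the actual kotlinx.coroutines internals.

```kotlin
import kotlinx.coroutines.suspendCancellableCoroutine

// Hypothetical segment of waiter slots; the real Segment class also links
// to its neighbours and unlinks itself once all cells are cancelled.
class WaiterSegment {
    private val slots = arrayOfNulls<Any>(SEGMENT_SIZE)

    fun storeWaiter(index: Int, waiter: Any) {
        slots[index] = waiter
    }

    // Cleans the slot on cancellation; physical removal of a fully
    // cancelled segment from the linked list is omitted here.
    fun cleanSlot(index: Int) {
        slots[index] = null
    }

    companion object {
        const val SEGMENT_SIZE = 16
    }
}

// The slow path of a suspending operation: store the continuation in the
// segment and register a cleanup handler. The lambda passed to
// invokeOnCancellation is a fresh object on every suspension; this is
// the allocation the PR eliminates.
suspend fun waitInSegment(segment: WaiterSegment, index: Int) =
    suspendCancellableCoroutine<Unit> { cont ->
        segment.storeWaiter(index, cont)
        cont.invokeOnCancellation { segment.cleanSlot(index) }
        // A releasing party later resumes the stored continuation (omitted).
    }
```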

To reduce the memory pressure, we can store the segment along with the slot index directly in CancellableContinuationImpl, as a cancellation handler instruction, thereby eliminating the allocations for the corresponding cancellation handlers. For this purpose, we:

  1. Allow storing a Segment in the state field, similarly to CancelHandler. On cancellation, the Segment.invokeOnCancellation(index, cause) function is called.
  2. Store the slot index in the existing decision integer field, extending its purpose correspondingly (see the sketch after this list).
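
As an illustration of this scheme, here is a hedged Kotlin sketch; the names CancellableContinuationSketch, decisionAndIndex, and NO_INDEX are simplified stand-ins for the actual CancellableContinuationImpl internals, not the real implementation.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicReference

// The segment now receives cancellation callbacks directly,
// replacing the per-waiter lambda.
abstract class Segment {
    abstract fun invokeOnCancellation(index: Int, cause: Throwable?)
}

class CancellableContinuationSketch {
    // The state field may now hold a Segment, similarly to CancelHandler.
    private val state = AtomicReference<Any?>(null)

    // The existing integer "decision" field additionally carries the slot
    // index; the real code packs both values into a single int.
    private val decisionAndIndex = AtomicInteger(NO_INDEX)

    // Registers the segment + index pair instead of allocating a handler.
    fun invokeOnCancellation(segment: Segment, index: Int) {
        decisionAndIndex.set(index)
        state.set(segment)
    }

    // On cancellation, the stored segment cleans the corresponding slot.
    fun cancel(cause: Throwable?) {
        val s = state.get()
        if (s is Segment) s.invokeOnCancellation(decisionAndIndex.get(), cause)
    }

    private companion object {
        const val NO_INDEX = -1
    }
}
```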

The benchmark below (see the comment) shows a significant allocation rate reduction.

@ndkoval force-pushed the optimize-invoke-on-cancellation branch from aeeea94 to e903287 on December 14, 2021 at 17:57
@ndkoval (Member, Author) commented on Dec 14, 2021

As the semaphore leverages this optimization, I added a simple sequential benchmark to show the impact (its shape is sketched below). The results follow.
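
For context, a hedged JMH sketch of what such a sequential semaphore-as-mutex benchmark might look like; the class name, iteration count, and yield-based interleaving are assumptions for illustration, not the exact benchmark added in this PR.

```kotlin
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.yield
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.BenchmarkMode
import org.openjdk.jmh.annotations.Mode
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
open class SequentialSemaphoreAsMutexBenchmarkSketch {
    private val iterations = 1_000_000 // assumed iteration count

    @Benchmark
    fun benchmark() = runBlocking {
        val s = Semaphore(permits = 1) // one permit => the semaphore acts as a mutex
        val second = launch {
            repeat(iterations) { s.acquire(); yield(); s.release() }
        }
        repeat(iterations) {
            // The yields interleave the two coroutines so that acquire()
            // regularly suspends and registers a cancellation handler --
            // the allocation this PR optimizes away.
            s.acquire(); yield(); s.release()
        }
        second.join()
    }
}
```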

WITHOUT optimization:

Benchmark                                                                   Mode  Cnt         Score           Error   Units
SequentialSemaphoreAsMutexBenchmark.benchmark                               avgt   10         0.123 ±         0.007    s/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate                avgt   10       340.849 ±        20.797  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate.norm           avgt   10  64500711.033 ±         4.138    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space       avgt   10       389.221 ±       620.649  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space.norm  avgt   10  74868326.400 ± 119718113.259    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.count                     avgt   10         5.000                  counts
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.time                      avgt   10        57.000                      ms

WITH optimization:

Benchmark                                                                   Mode  Cnt         Score           Error   Units
SequentialSemaphoreAsMutexBenchmark.benchmark                               avgt   10         0.123 ±        0.004    s/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate                avgt   10       213.820 ±       10.688  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate.norm           avgt   10  40500711.033 ±        4.138    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space       avgt   10       157.394 ±      501.946  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space.norm  avgt   10  30303846.400 ± 96795349.649    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Old_Gen          avgt   10        ≈ 10⁻⁴                 MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Old_Gen.norm     avgt   10        25.778 ±      123.241    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.count                     avgt   10         2.000                 counts
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.time                      avgt   10        22.000                     ms

@ndkoval marked this pull request as ready for review on December 15, 2021 at 10:54
@ndkoval force-pushed the optimize-invoke-on-cancellation branch from 001e766 to 1092fca on August 3, 2022 at 19:13
@qwwdfsad (Collaborator) left a comment

I'm quite okay with the general idea. Please hold off on merging it, though; I'll evaluate it once the channels are properly reviewed.

@ndkoval (Member, Author) commented on Nov 28, 2022

Will be delivered along with #3103

@ndkoval closed this on Nov 28, 2022
@ndkoval reopened this on Feb 6, 2023
@ndkoval force-pushed the optimize-invoke-on-cancellation branch from 1092fca to 638760f on February 10, 2023 at 20:10
@ndkoval changed the base branch from develop to fast-channels on February 10, 2023 at 20:14
@ndkoval force-pushed the optimize-invoke-on-cancellation branch 2 times, most recently from c7042e2 to f8af950 on February 13, 2023 at 14:48
@ndkoval changed the base branch from fast-channels to develop on February 13, 2023 at 14:48
@ndkoval (Member, Author) commented on Feb 13, 2023

Let's keep the separation into two commits: the first fixes/adds benchmarks, and the second optimizes the cancellation-handling mechanism.

@qwwdfsad (Collaborator)

Could you please show before/after on ChannelSinkBenchmark?

@ndkoval force-pushed the optimize-invoke-on-cancellation branch 3 times, most recently from 9ed3b1b to 7317fe5 on February 15, 2023 at 12:57
@ndkoval (Member, Author) commented on Feb 15, 2023

See the results below, measured on my laptop (MacBook Pro 16-inch, 2021, Apple M1 Max, 64 GB; OpenJDK 64-Bit Server VM, Zulu 19.32+13-CA).

WITHOUT the optimization:

Benchmark                                                                             Mode  Cnt        Score         Error   Units
ChannelSinkBenchmark.channelPipeline                                                  avgt    5        1.375 ±       0.023   ms/op
ChannelSinkBenchmark.channelPipeline:·gc.alloc.rate.norm                              avgt    5   668370.560 ±     256.516    B/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal                                    avgt    5        1.756 ±       0.012   ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal:·gc.alloc.rate.norm                avgt    5   668468.144 ±     296.998    B/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals                                   avgt    5        2.501 ±       0.123   ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals:·gc.alloc.rate.norm               avgt    5  1668726.477 ±     115.551    B/op
ChannelSinkNoAllocationsBenchmark.channelPipeline                                     avgt    5        6.081 ±       0.140   ms/op
ChannelSinkNoAllocationsBenchmark.channelPipeline:·gc.alloc.rate.norm                 avgt    5  3426068.483 ±     354.374    B/op

WITH the optimization:

Benchmark                                                                              Mode  Cnt        Score         Error   Units
ChannelSinkBenchmark.channelPipeline                                                   avgt    5        1.248 ±       0.013   ms/op
ChannelSinkBenchmark.channelPipeline:·gc.alloc.rate.norm                               avgt    5   488344.851 ±     129.550    B/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal                                     avgt    5        1.681 ±       0.031   ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal:·gc.alloc.rate.norm                 avgt    5   488460.940 ±     258.884    B/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals                                    avgt    5        2.518 ±       0.027   ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals:·gc.alloc.rate.norm                avgt    5  1493167.804 ±     116.414    B/op
ChannelSinkNoAllocationsBenchmark.channelPipeline                                      avgt    5        5.971 ±       0.760   ms/op
ChannelSinkNoAllocationsBenchmark.channelPipeline:·gc.alloc.rate.norm                  avgt    5  1025957.479 ±     274.255    B/op    

@ndkoval (Member, Author) commented on Feb 15, 2023

These benchmarks do not show a performance improvement, but they clearly show reduced allocations (more than 3x fewer on ChannelSinkNoAllocationsBenchmark).

@qwwdfsad (Collaborator)

Nice! I'm looking into that

Commit (message truncated): …annelSinkBenchmark` that supports buffered channels and pre-allocates elements.

Signed-off-by: Nikita Koval <[email protected]>
@ndkoval force-pushed the optimize-invoke-on-cancellation branch from 7317fe5 to 8202abf on February 23, 2023 at 13:09
@qwwdfsad (Collaborator) left a comment

LGTM. I'll wait for the tests to run, then merge.

@qwwdfsad merged commit 2da6817 into develop on Feb 28, 2023
@qwwdfsad deleted the optimize-invoke-on-cancellation branch on February 28, 2023 at 14:05