Comparing MPSC Channel Performance

Let's compare thingbuf::mpsc's performance properties with those of other popular multi-producer, single-consumer (MPSC) channel implementations.

Before we can make a meaningful analysis of such a comparison, however, we need to identify the characteristics of an MPSC channel that are important, and why they're important. The primary properties we'll be looking at are the channel's throughput, latency, and memory use.

Throughput

This is probably the first thing we think of when we think about measuring a channel's performance: how many messages can we send and receive on the channel in a fixed period of time? This gives us a sense of the channel's overall time performance.

Some important things to note about measuring channel throughput:

  • No real application will saturate a channel's maximum throughput. Here's something that, in retrospect, will seem quite obvious: all software must do something beyond just sending and receiving MPSC channel messages. In a benchmark, the thread or task that receives messages from a channel can simply discard them without doing anything, but in a real program...we're sending those messages to the receiver for a reason, so it must do some actual work with the messages we send it.

    If the receiver is not able to process messages as fast as they are produced, the channel will fill up and the senders will have to slow down, regardless of how efficient the actual channel implementation is. In practice, the bottleneck will almost always be the receiver processing messages, rather than the channel's maximum throughput.

    The reason we measure throughput is not because we expect to saturate the channel, but because it may give us a sense of the overall overhead introduced by a channel implementation.

  • Throughput will depend on contention. In almost all cases, the time it takes to receive n messages will increase based on the number of threads concurrently sending messages. Therefore, we must measure throughput with varying numbers of senders in order to analyze how contention affects throughput.

    Importantly, if we have a sense of how contended a channel will be, it may influence our choice of implementation! Some implementations might outperform others at low contention, but lose throughput more rapidly as contention increases. Others might have worse throughput at low contention, but suffer less as the number of senders increases. If we know in advance the maximum number of senders in our program, this can influence which implementation is best suited for our use case!

  • We can measure throughput by sending a fixed number of messages. We defined throughput as the number of messages sent over a fixed time interval, but measuring how long it takes to send a fixed number of messages gives us the same information: n messages in t seconds is a throughput of n/t messages per second. In practice, it's much easier to write benchmarks that send n messages and record how long that takes than to record how many messages can be sent in s seconds. A sketch of such a benchmark follows this list.
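
To make that concrete: the following is a rough sketch of this style of benchmark, using tokio::sync::mpsc purely as an example. The channel, its capacity, the message type, and the task structure are all arbitrary choices here, and the real benchmarks discussed below are more careful about runtime setup and measurement.

```rust
use std::time::{Duration, Instant};

use tokio::sync::mpsc;

/// Send `total` messages from `senders` concurrent tasks and measure how long
/// it takes the receiver to drain all of them.
async fn measure_throughput(senders: usize, total: usize) -> Duration {
    let (tx, mut rx) = mpsc::channel::<i32>(256);
    let per_sender = total / senders;

    let start = Instant::now();
    for _ in 0..senders {
        let tx = tx.clone();
        tokio::spawn(async move {
            for i in 0..per_sender {
                // If the channel is full, this waits for capacity (backpressure).
                tx.send(i as i32).await.unwrap();
            }
        });
    }
    // Drop the original handle so `recv` returns `None` once every sender is done.
    drop(tx);

    let mut received = 0;
    while rx.recv().await.is_some() {
        received += 1;
    }
    assert_eq!(received, per_sender * senders);
    start.elapsed()
}
```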

Latency

TODO(eliza): write this part

Memory

All the channel implementations we will be discussing here are bounded. This means they enforce a limit on the maximum number of messages that may be in flight at any given point in time. When this bound is reached, backpressure is exerted on senders, telling them to either wait or shed load until the channel has capacity for new messages.
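
From a sender's point of view, those two responses to backpressure look roughly like the following sketch. It assumes tokio::sync::mpsc, but every bounded channel discussed here exposes some equivalent of both operations.

```rust
use tokio::sync::mpsc::{self, error::TrySendError};

async fn backpressure(tx: &mpsc::Sender<u64>) {
    // Waiting: `send` only completes once the channel has a free slot.
    tx.send(1).await.expect("receiver dropped");

    // Shedding load: `try_send` never waits. If the channel is full, the caller
    // gets the message back and can drop it, log it, or retry later.
    match tx.try_send(2) {
        Ok(()) => { /* message enqueued immediately */ }
        Err(TrySendError::Full(_msg)) => { /* channel full: shed load */ }
        Err(_closed) => { /* receiver is gone */ }
    }
}
```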

Why are we only discussing bounded channels? Because unbounded queues are evil and should never be used[^1]. The primary reason we want to avoid the use of unbounded queues in our software is that they often result in programs running out of memory. Each message in a channel occupies memory. If the receiver cannot consume messages as fast as they are produced, an unbounded queue will allow producers to continue producing messages, using more and more memory. Eventually, the program probably gets OOMkilled.
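
As a caricature of that failure mode, consider a sketch like this one, which happens to use the futures crate's unbounded channel but applies equally to any unbounded queue:

```rust
use futures::channel::mpsc;

fn fill_memory_forever() {
    let (tx, _rx) = mpsc::unbounded::<Vec<u8>>();
    loop {
        // An unbounded send never reports "full"; each message just consumes
        // more heap, and if the receiver can't keep up (here, it never runs at
        // all), memory use grows until the process is killed.
        if tx.unbounded_send(vec![0u8; 1024]).is_err() {
            break; // the receiver was dropped
        }
    }
}
```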

If we are choosing to use a bounded channel in our program, we probably care about limiting the maximum amount of memory that is used by the channel. Naïvely, we might assume that all bounded channels have bounded memory use — if the channel will only store a fixed number of messages, it will never allocate more memory than is needed to store those messages...right?

HAHAHAHA, WRONG. When a bounded channel is full, backpressure is exerted on senders. Typically, this means that sending threads will block (in a synchronous channel) or sending tasks will yield (in an asynchronous channel) until there is capacity remaining to send their messages. Because these are multiple-producer channels, this means that a variable number of threads or tasks might be waiting for capacity to send messages, and the channel must have some way of keeping track of those waiters in order to wake them up again when there is capacity for their messages.

Depending on how a channel is implemented, storing these waiting tasks or threads may require additional memory allocations. If it does, then once the channel is at capacity, the amount of memory it consumes may keep growing as more and more waiters are added to the list of threads or tasks waiting for capacity. In other words, some "bounded" channels are, in a sense, not bounded at all! If we are concerned about memory exhaustion, we need to ask whether the channel's memory use is proportional to the number of waiters.

There are other aspects of a channel's memory use that we might want to pay attention to. For example, we might want to know whether memory is ever released over the channel's lifetime. Depending on how a channel stores the list of waiting senders, it may or may not deallocate that memory as the number of waiting senders decreases.[^2]

Whether or not this matters depends on our particular workload. Imagine we have a workload that's bursty: the number of messages sent over a given time interval typically falls in some range, but occasionally, the number of messages spikes dramatically — and the number of waiting senders increases as well. In that case, a channel that never releases allocated memory for waiters will continue to use the amount of space required for the highest number of senders it has ever seen over its lifetime, even if that much memory is not required most of the time. Similarly, if a channel lasts for the entire runtime of our program, we might care about whether the memory it uses ever goes down, whereas if the channels are frequently created and destroyed, it might be fine for them to not release memory until they are dropped.

Heap allocations can also directly impact throughput and latency. Interestingly, the performance impact of heap allocations is often lower in microbenchmarks than it is in real applications. If we run a benchmark where a number of messages are sent over a channel, the channel is the only part of that program allocating memory while the benchmark runs. Its allocations will likely follow a particular pattern; they might all be of similar size, and so on. This is generally a favorable condition for most allocator implementations! Allocators love it when you make a bunch of same-sized allocations and deallocate them in a predictable pattern. In real software, though, other parts of the program are also allocating and deallocating memory, and the pattern of allocations and deallocations becomes much less regular. And, increased contention in the allocator may also impact its performance for other, unrelated parts of the program that are concurrently allocating memory. Therefore, microbenchmarks tend not to capture the complete performance impact of heap allocations, both in the code being benchmarked and on unrelated parts of the software.

Finally, in embedded systems and other memory-constrained environments, the system may only have a very limited amount of memory available, so we may want to limit the amount of heap allocations our channel performs. In some use cases, we may not be able to make heap allocations at all. This will almost certainly limit which implementations we can use.

Performance Comparison — Async

The following benchmark measures the time taken to send 1000 i32s on an asynchronous channel, as the number of sending tasks increases. It compares the performance of several popular asynchronous MPSC channel implementations (tokio::sync::mpsc, async_std::channel, and futures::channel::mpsc) with thingbuf::mpsc.

[Benchmark results graph]

The benchmark was performed with the following versions of each crate:

Each benchmark was run on a separate tokio multi-threaded runtime, in sequence.

Description of the benchmark environment
:# eliza at noctis in thingbuf
:; rustc --version
rustc 1.57.0 (f1edd0429 2021-11-29)

:# eliza at noctis in thingbuf
:; uname -a
Linux noctis 5.14.9 #1-NixOS SMP Thu Sep 30 08:13:08 UTC 2021 x86_64 GNU/Linux

:# eliza at noctis in thingbuf
:; lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          24
On-line CPU(s) list:             0-23
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 9 3900X 12-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         3800.000
CPU max MHz:                     4672.0698
CPU min MHz:                     2200.0000
BogoMIPS:                        7585.77
Virtualization:                  AMD-V
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        6 MiB
L3 cache:                        64 MiB
NUMA node0 CPU(s):               0-23
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx
                                 mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni p
                                 clmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm exta
                                 pic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext
                                  perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt
                                 _a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm
                                 _mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushby
                                 asid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succo
                                 r smca sme sev sev_es

Analysis

Here's my analysis of this benchmark, along with a look at the implementation of each channel tested.

tokio::sync::mpsc

Tokio's MPSC channel stores messages using an atomic linked list of blocks. To improve performance, rather than storing each individual element in a separate linked list node, each linked list node instead contains an array that can store multiple messages. This amortizes the overhead of allocating new nodes, and reduces the link hopping necessary to index a message.
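
Very roughly, that layout looks something like the sketch below. The names and the block size are illustrative, and Tokio's real implementation manages the links and slots with atomics and unsafe code rather than with Box and Option.

```rust
const BLOCK_CAP: usize = 32; // illustrative; not necessarily Tokio's block size

struct Block<T> {
    // Each list node stores a batch of message slots...
    slots: [Option<T>; BLOCK_CAP],
    // ...and a link to the next batch, which is only allocated once this one
    // has been used.
    next: Option<Box<Block<T>>>,
}
```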

This approach means that not all of the capacity of a bounded MPSC channel is allocated up front when the channel is created. If we create a Tokio bounded MPSC of capacity 1024, but the channel only has 100 messages in the queue, the channel will not currently be occupying enough memory to store 1024 messages. This may reduce the average memory use of the application if the channel is rarely close to capacity. On the other hand, this has the downside that a heap allocation may occur during any given send operation, even when the channel is not at capacity, potentially increasing allocator churn.

Because the linked list is atomic, send operations are lock-free when the channel has capacity remaining.

Backpressure is implemented using Tokio's async semaphore to allow senders to wait for channel capacity.[^3] The async semaphore is implemented using an intrusive doubly-linked list. In this design, the linked list node is stored inside of the future that's waiting to send a message, rather than in a separate heap allocation. Critically, this means that once a channel is at capacity, additional senders waiting for capacity will never cause additional memory allocations; the channel is properly bounded.
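
The difference between an intrusive and a non-intrusive wait list is easiest to see in terms of where the list node lives. The following is a caricature with made-up names, not code from Tokio or thingbuf; a sound intrusive list also requires pinning and unsafe code.

```rust
use std::ptr::NonNull;
use std::task::Waker;

// Non-intrusive: the channel owns a separately allocated node per waiter, so
// N parked senders cost N extra heap allocations.
struct BoxedWaiter {
    waker: Waker,
    next: Option<Box<BoxedWaiter>>,
}

// Intrusive: the node is embedded in the sender's own future, so parking a
// sender links together memory that already exists instead of allocating.
struct WaitNode {
    waker: Option<Waker>,
    prev: Option<NonNull<WaitNode>>,
    next: Option<NonNull<WaitNode>>,
}

struct SendFuture<T> {
    message: Option<T>,
    node: WaitNode, // lives inside the future itself
}
```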

The semaphore protects its wait list using a Mutex, so waiting does require acquiring a lock, although the critical sections of this lock should be quite short.

Tokio's MPSC has some other interesting properties that differentiate it from some other implementations. It participates in Tokio's task budget system for automatic cooperative yielding, which may improve tail latencies in some workloads.[^budget] Also, it has a nice API where senders can reserve a Permit type representing the capacity to send a single message, which can be consumed later.
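
For reference, using the reservation API looks roughly like this (send_with_permit is just an illustrative helper, not part of Tokio's API):

```rust
use tokio::sync::mpsc;

async fn send_with_permit(tx: &mpsc::Sender<String>) {
    // Wait for a free slot *before* producing the message...
    let permit = tx.reserve().await.expect("channel closed");
    // ...then consume the reserved capacity. This cannot fail because the
    // channel is full, and it does not wait.
    permit.send(String::from("hello"));
}
```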

In the benchmark, however, we see that tokio::sync::mpsc offers the worst throughput at medium (50 senders) and high (100 senders) levels of contention, and the second-worst at low (10 senders) contention. This is likely because the implementation prioritizes low memory use over low latency.

In general, tokio::sync::mpsc is probably a decent choice for channels that are long-lived and may have a very large number of tasks waiting for send capacity, because the intrusive wait list means that senders waiting for capacity will not cause additional memory allocations. While its throughput is poor compared to other implementations, if the receiving task performs a lot of work for each message, this may not have a major impact.

However, I will also note that thingbuf's MPSC channels also use an intrusive wait list and don't require per-waiter allocations, just like Tokio's. If a channel with better throughput is needed, thingbuf's MPSC channel may be preferable.

Finally, the tokio crate does not support no_std, with or without alloc.

futures-channel

The futures crate re-exports a channel implementation from the futures-channel crate. This is one of the oldest MPSC channels in async Rust, and its API in many ways provides the template on which the other channels were based. futures-channel uses a linked list as a wait queue[^4]. This linked list is not intrusive: each entry has its own heap allocation. The same linked list is used to store the messages in the channel.

In my opinion, futures-channel should generally not be used. That may sound inflammatory, but here's why:

  1. Throughput at all levels of contention is relatively poor, as the line graph shows. In the low-contention case (10 senders), futures-channel was the slowest, and it was the second slowest in the medium- (50 senders) and high-contention (100 senders) cases.
  2. Because the linked list wait queue is not intrusive, every waiting task will require an additional memory allocation. This means that the maximum memory use of the channel is not actually bounded.
  3. The same linked list implementation is used for messages in the channel, rather than an array. This means sending a message will always cause a heap allocation, and receiving one will always cause a deallocation, resulting in increased allocator churn that may impact overall allocator performance.

The futures-channel crate has limited support for no_std (with alloc), but the MPSC channel is not supported without the standard library.

async-channel

The async-std library provides bounded and unbounded channels via a re-export of the async-channel crate. This is actually a multi-producer, multi-consumer (MPMC) channel, rather than an MPSC channel; I've included it in this comparison because async-std doesn't implement MPSC channels.

Messages sent in the channel are stored using a bounded lock-free queue from the concurrent-queue crate, while the wait queue is implemented using the event-listener crate.[^5]

The concurrent-queue crate's bounded queue implements Dmitry Vyukov's array-based lock-free MPMC queue. This is the same algorithm used by thingbuf. Because this queue uses an array to store messages, a single allocation is performed when the channel is constructed; after that, no send operation will allocate memory as long as the channel has capacity. When there is capacity in the channel, send operations are lock-free.
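
Reduced to its data layout, the heart of the Vyukov design looks roughly like the sketch below. The names are mine, and both concurrent-queue and thingbuf differ in many details, but the key points are the single pre-sized buffer and the per-slot sequence numbers.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::AtomicUsize;

struct Slot<T> {
    // Sequence number encoding whose "turn" this slot is (ready to be written
    // vs. ready to be read); it's what lets producers and consumers coordinate
    // without locks.
    sequence: AtomicUsize,
    value: UnsafeCell<Option<T>>,
}

struct ArrayQueue<T> {
    buffer: Box<[Slot<T>]>, // one allocation, sized to the capacity, made up front
    head: AtomicUsize,      // next slot to dequeue from
    tail: AtomicUsize,      // next slot to enqueue into
}
```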

event-listener implements a type of wait queue using a non-intrusive linked list. This means that every waiter requires an additional memory allocation. Therefore, the channel's maximum memory usage is not actually bounded. Also, allocating linked list nodes may cause allocator churn that could impact other parts of the system. The linked list is protected by a lock, so waiting requires a lock acquisition.

In the benchmark, async-channel was the fastest implementation tested in the medium (50 senders) and high (100 senders) contention tests, and performed very similarly to thingbuf in the low contention (10 senders) test. This means that it offers very good throughput, and presumably also low latency.

I would consider the use of async-channel in cases where extremely high throughput is vital, but it isn't actually necessary to bound maximum memory use (e.g. some other part of the system enforces a limit on the number of sending tasks). Outside of microbenchmarks, though, throughput is likely to be less of a priority than memory use.

async-channel does not support no_std, with or without alloc.

thingbuf::mpsc

TODO(eliza): write this part

Footnotes

[^1]: Unless you really, actually know what you're doing. This footnote is here because someone on hackernews will probably yell at me unless I include a disclaimer telling them that they're a very special boy who's allowed to use unbounded queues.

[^2]: For example, if the channel stores waiting senders in a Vec, and never calls Vec::shrink_to_fit on it, the amount of memory it occupies will never go back down as the number of waiters decreases.

[^3]: I happen to be quite familiar with the implementation details of the Tokio semaphore, because I wrote the initial implementation of it.

[^4]: Based on Dmitry Vyukov's non-intrusive lock-free MPSC queue, a rather neat lock-free queue algorithm.

[^5]: I honestly had kind of a hard time analyzing the implementation of async-std's channel because EVERYTHING IS SPLIT ACROSS SO MANY TINY CRATES.