This guide contains basic information about deterministic testing of distributed, message-based, event-driven, or actor systems.
The wild success of Jepsen at breaking nearly every distributed system tested against it shows us, over and over, that we are building distributed systems in a fundamentally bug-prone way.
As mentioned in the sled error handling post, most of our distributed data-intensive systems break because we don’t actually subject them to realistic conditions that cause all IO to fail during testing.
In distributed and message-based systems, we have to remember that our outbound communication will get mixed up with any other communication happening concurrently. Anything that goes over the network may be dropped. Unless particular care is taken on top of TCP or UDP, it is easy to mix up message ordering: an earlier message may be sent to a remote host and take a detour through the internet, while in the meantime a new connection is created and sends later messages that arrive before the earlier one.
How do we apply the same principle to systems where messages are sent between communicating components? How do we build Jepsen-proof systems? A single Jepsen test can take 5 minutes. Let’s run thousands per second using discrete event simulation!
Step 1: write your code in a way that can be deterministically tested on top of a simulator. This also ensures you’re properly applying the dependency inversion principle. The state machine of your correctness-critical logic can run on top of any network transport you want, just as easily as in the simple simulator that will cause tons of race conditions to jump out within milliseconds.
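As a minimal sketch of what that inversion point might look like in Rust (the `Core` trait, `NodeId`, and the associated `Msg` type are invented here for illustration, not sled’s actual API):

```rust
// Illustrative sketch of the inversion point: the correctness-critical
// core depends on nothing but plain data. The production network loop
// and the test simulator both depend on this trait; it depends on neither.
type NodeId = u64;

trait Core {
    type Msg;

    // Pure message handling: no sockets, no spawned threads, no reads of
    // the wall clock. All the core can do is return more messages to send.
    fn handle(&mut self, from: NodeId, msg: Self::Msg) -> Vec<(NodeId, Self::Msg)>;
}
```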
Step 2: build a simulator that will exercise realistic message passing behavior. Anyone who doesn’t do this is building a very buggy distributed system, as Jepsen repeatedly shows, with FoundationDB being a notable exception. Let’s learn from their success and simulate.
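Here is one sketch of what such a simulator’s core might look like (the `Simulator` struct, the xorshift PRNG, and the 10% drop rate / 100-tick delay window are all arbitrary choices for illustration): in-flight messages sit in a priority queue keyed by delivery time, every send gets a random delay or is dropped, and reorderings fall out for free.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

type NodeId = usize;
type Msg = Vec<u8>;

// In-flight messages are keyed by (delivery time, sequence number); the
// sequence number keeps heap ordering deterministic when times collide.
struct Simulator {
    now: u64, // simulated time, never the wall clock
    seq: u64,
    rng: u64, // xorshift64 state, seeded for reproducible runs
    in_flight: BinaryHeap<Reverse<(u64, u64, NodeId, NodeId, Msg)>>,
}

impl Simulator {
    fn new(seed: u64) -> Simulator {
        Simulator { now: 0, seq: 0, rng: seed.max(1), in_flight: BinaryHeap::new() }
    }

    // A tiny deterministic PRNG is all a simulator needs.
    fn rand(&mut self) -> u64 {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        self.rng
    }

    // Sending never touches the network: it schedules a future delivery
    // with a random delay, or silently drops the message.
    fn send(&mut self, from: NodeId, to: NodeId, msg: Msg) {
        if self.rand() % 10 == 0 {
            return; // roughly 10% of messages are lost
        }
        let delay = self.rand() % 100; // wide window => constant reordering
        self.seq += 1;
        self.in_flight.push(Reverse((self.now + delay, self.seq, to, from, msg)));
    }

    // Pop the next delivery and jump simulated time forward to it.
    fn next_delivery(&mut self) -> Option<(NodeId, NodeId, Msg)> {
        let Reverse((at, _seq, to, from, msg)) = self.in_flight.pop()?;
        self.now = at;
        Some((to, from, msg))
    }
}
```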
After you have the experience of building your first distributed system on top of a simulator that induces partitions, delays, etc… you will forever consider writing one without a simulator to be like driving drunk and blindfolded. So many bugs will pop out. Our laptops don’t behave like networked systems. If you built a plane in a wind tunnel with zero induced turbulence, would you then fly that plane? Because that’s how people are building the distributed systems you use today, and fixes happen only occasionally, when someone is lucky enough to realize there’s even anything wrong. People would save so much time debugging high-cost issues in production if they had tested the thing with a simulator.
But it’s not complex to build a simulator. It is much simpler than the subtleties of CRDT convergence with dynamic membership, or Raft with a single partition that causes dueling candidacies, etc… A single test like “is the sequence of observed client responses actually possible to observe when some combination of the client requests (possibly dropping ones whose responses were not observed) is executed serially?” gives you a straightforward linearizability check. You can also write simpler invariants that get executed after each message is delivered and processed, like “is there more than one leader that can convince a majority of the cluster to do as it says?” (split brain). If you can’t come up with any properties to assert, you don’t understand your system well enough to build it correctly yet.
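For example, a split-brain invariant can be tiny. The sketch below assumes a hypothetical `Node` shape that records a node’s term, whether it believes it is leader, and how many acknowledgements it has collected this term; only the shape of the check matters:

```rust
// Hypothetical invariant, run after every message delivery: no two
// leaders may hold acknowledgements from a majority in the same term.
struct Node {
    term: u64,
    is_leader: bool,
    acks_this_term: usize, // followers that accepted this node's leadership
}

fn assert_no_split_brain(nodes: &[Node]) {
    let majority = nodes.len() / 2 + 1;
    for term in nodes.iter().map(|n| n.term) {
        let leaders_with_quorum = nodes
            .iter()
            .filter(|n| n.is_leader && n.term == term && n.acks_this_term >= majority)
            .count();
        assert!(
            leaders_with_quorum <= 1,
            "split brain: {} leaders with a quorum in term {}",
            leaders_with_quorum,
            term
        );
    }
}
```

Because a check like this runs after every single delivered message, the moment two leaders gather overlapping quorums the simulator hands you the exact message history that caused it.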
Simulators give you implementations that Jepsen will not find bugs in, at least as far as the core state machine for your distributed algorithms is concerned. It takes 5 minutes to run one Jepsen test on a cluster, usually after spending a month implementing Jepsen for the system, or paying someone hundreds of thousands of dollars to do it for you. With a simulator you can run thousands of interesting workloads per second on a laptop, stretching the possible delivery horizons for messages far wider, and finding far more bugs.
You can choose your timing assumption model, but the simplest to implement, and also the one that rules out the most bugs in the wild, is the asynchronous model, where any message can be arbitrarily delayed (maybe forever, i.e. dropped) or reordered relative to others.
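A sketch of how the asynchronous model can be realized directly (the `AsyncNet` type and its `step` signature are hypothetical): rather than tracking delivery times at all, keep every undelivered message in a flat bag and let the test harness pick an arbitrary one to deliver or drop at each step, which admits every possible reordering and unbounded delay.

```rust
// Sketch of the asynchronous timing model: no delivery times at all.
// Any in-flight message may be delivered next, delayed indefinitely, or
// dropped on the floor.
struct AsyncNet<M> {
    in_flight: Vec<M>,
}

impl<M> AsyncNet<M> {
    fn new() -> AsyncNet<M> {
        AsyncNet { in_flight: Vec::new() }
    }

    fn send(&mut self, msg: M) {
        self.in_flight.push(msg);
    }

    // The test harness chooses which message goes next (e.g. with a
    // seeded PRNG); `deliver: false` drops it instead, modelling loss.
    fn step(&mut self, choice: usize, deliver: bool) -> Option<M> {
        if self.in_flight.is_empty() {
            return None;
        }
        let msg = self.in_flight.swap_remove(choice % self.in_flight.len());
        if deliver { Some(msg) } else { None }
    }
}
```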
This is one possible interface that has worked well for me:
// called when this state machine receives a message,
// responds with outgoing messages
fn receive(&mut self, msg: Msg, at: At) -> Vec<(Msg, Destination)>;

// periodically called for handling periodic functionality,
// like leader election, pending request timeouts, etc.
fn tick(&mut self, at: At) -> Vec<(Msg, Destination)>;
You can also fold `tick` into `receive` if your system supports the common actor pattern of a periodic message being sent to itself as a kind of timer.
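As a toy illustration of that interface (the `Echo` machine and its message types are invented for this example), here is roughly what an implementation might look like: a node that answers pings and uses `tick` as a retry timer.

```rust
// Illustrative only: a trivial state machine speaking the interface above.
// It answers pings, and uses tick to re-send its own ping if no pong has
// arrived yet.
type NodeId = u64;
type At = u64; // simulated time handed in by the simulator

#[derive(Debug, Clone)]
enum Msg {
    Ping { from: NodeId },
    Pong,
}

struct Echo {
    id: NodeId,
    peer: NodeId,
    pong_seen: bool,
    last_ping_at: At,
}

impl Echo {
    // called when this state machine receives a message,
    // responds with outgoing messages
    fn receive(&mut self, msg: Msg, _at: At) -> Vec<(Msg, NodeId)> {
        match msg {
            Msg::Ping { from } => vec![(Msg::Pong, from)],
            Msg::Pong => {
                self.pong_seen = true;
                vec![]
            }
        }
    }

    // periodically called; here it acts as a retry timer
    fn tick(&mut self, at: At) -> Vec<(Msg, NodeId)> {
        if !self.pong_seen && at.saturating_sub(self.last_ping_at) > 10 {
            self.last_ping_at = at;
            vec![(Msg::Ping { from: self.id }, self.peer)]
        } else {
            vec![]
        }
    }
}
```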
Recipe:
This general pattern is called “discrete event simulation”. If you’re coming into distributed systems and you learn this technique, you will have a massive advantage over anyone who claims to be a distributed systems expert but only tests in production, on their laptop, or with Jepsen.
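What makes the pattern so productive is determinism: if the only source of nondeterminism is a seeded PRNG, a failing run is just a seed, and you can replay it until the bug is understood. A sketch of that outer loop (all scaffolding here is hypothetical):

```rust
// Hypothetical outer loop: run huge numbers of randomized simulations,
// and when an invariant trips, report the seed so the exact failing
// history can be replayed and debugged deterministically.
fn run_one_simulation(seed: u64) -> Result<(), String> {
    // 1. build a cluster of state machines
    // 2. feed in randomized client requests
    // 3. deliver / drop / reorder messages using a PRNG seeded with `seed`
    // 4. check invariants (linearizability, no split brain) after each step
    let _ = seed;
    Ok(())
}

fn main() {
    for seed in 0..1_000_000u64 {
        if let Err(violation) = run_one_simulation(seed) {
            panic!("invariant violated with seed {}: {}", seed, violation);
        }
    }
}
```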
Who else sees success with this technique? It’s by no means novel; it’s just been slow to catch on for some reason.
etc…
Let’s stop gawking at the repeated success of Jepsen and start building systems that Jepsen does not find bugs in. Let’s use techniques that are thousands of times faster than Jepsen to catch bugs immediately instead of once per month before a big release, or never at all. Let’s build systems that you can test for race conditions to begin with. Let’s learn from our mistakes.
Thanks for reading!
If you found this article to be useful, please consider supporting my efforts to share knowledge and productionize cutting-edge database research with implementations in Rust :)