Commit
Through to 8.
ResidentMario committed Jun 23, 2018
1 parent 17483db commit a738d62
Showing 2 changed files with 144 additions and 1 deletion.
84 changes: 83 additions & 1 deletion Chapter 7.1 --- Jepsen.ipynb
@@ -58,7 +58,89 @@
"* So by default MongoDB is not consistent.\n",
"\n",
"\n",
"* As with Redis you can tell MongoDB to use majority acknowledgement. This makes it properly CP, but increases latency by a lot."
"* As with Redis you can tell MongoDB to use majority acknowledgement. This makes it properly CP, but increases latency by a lot.\n",
"\n",
"## More MongoDB\n",
"\n",
"* Jespen found the Redis consensus algorithm inscruitable, with a wide range of possible edge cases. Again, use it for speed, not for any gaurantees of anything.\n",
"* The MongoDB consensus algorithm is less topologically diverse.\n",
"* Jespen went very, very deep into the MongoDB algorithms...interesting case analysis follows.\n",
"\n",
"\n",
"* MongoDB now has two consensus protocols, v0 and v1.\n",
"* v1 is the new default.\n",
"\n",
"\n",
"* MongoDB is designed to reach consensus over a log of database operations called an **oplog**.\n",
"* If a network partition occurs, divergent data may be written both to the old leader and to a new quorom-elected leader. This causes divergent history, which necessitates a rollback on heal.\n",
"* Distributed logs typically maintain a **commit point**, a furthest index in the log. Ops before the commit point are safe, but ops after the commit point are not gauranteed to be retained (see e.g. GFS).\n",
"* MongoDB however applies and acks operations immediately.\n",
"* This causes dirty reads and lost updates.\n",
"* v0 cannot prevent dirty reads, but it can prevent lost updates using the majority write concern (the leader will not ack until it knows that at least half of the follower nodes have the update).\n",
"\n",
"\n",
"* In case of a network partition, non-majority write durability hinges on the election of the most up-to-date follower in the quorom. However how do you determine who's furthest ahead?\n",
"* In v0 each write to the log is assigned an **optime**. That optime is a combination of the sequence order of the operation, and the wall clock time on the node at the time of the log write.\n",
"* The majority will elect the node that has the highest optime.\n",
"* In theory this will always be the node which is furthest ahead. In practice, clocks can drift, so this is not always so.\n",
"* Thus if the old leader is on a fast clock, and the newly elected leader is on a slow clock, MongoDB may in practice choose to preserve the old leader's data, not the new leader's data!\n",
"* So lost updates are non-deterministic in the sense that it can occur on either side of the partition.\n",
"* Jespen makes a lot of hooplah about this.\n",
"\n",
"\n",
"* v1 aimed to address these problems. v1 is the new default replication protocol.\n",
"* It takes elements from the Raft consensus algorithm (TODO: look into Raft).\n",
"* In version 0 there are election IDs. These are used for determining which primary node is the most recently voted-upon primary.\n",
"* In order to prevent double-voting, nodes were equipped with a 30-second cooldown on voting, but this was very imperfect.\n",
"* In v1 election IDs are not election terms, assigned to every node in the cluster, which increment every time the node votes. Thus a node can vote only once per term, a much better system.\n",
"* Optimes are no longer ordered using sequential wall clock entries, but instead `(term, timestamp)` tuples.\n",
"* Thus even if there is wall clock drift, and during the healing process it is discovered that the older election generation is ahead of the newer election generation, logical order will still be preserved.\n",
"* The other big change is that there is now snapshot isolation.\n",
"* There is a read concern level that can be set that will hold commits away from reads until the commit is confirmed to be durable.\n",
"* This addresses the dirty reads problem that v0 majority reads suffered from.\n",
"* Interesting showing of how hard it is to design a performant data system.\n",
"* These improvements were two years in the making.\n",
"* Jespen is mollified, which gives me more confidence in using MongoDB in production, as long as you have the right flags set.\n",
"\n",
"\n",
"## Riak\n",
"* Riak is a close commercial implementation of the Dynamo system.\n",
"* In Riak nodes are arranged in a ring.\n",
"* Incoming write data is hashed and duplicated to multiple nodes in the ring. You can choose how many nodes must acknowledge a write before the service acks.\n",
"* Incoming reads are bounced through a configurable number of nodes, with the most recent value among those nodes being the return value. You can choose many nodes must participate before the service acks.\n",
"* If a partition occurs, the nodes split off into subrings which continue to attempt to service connection as best they can.\n",
"* Lower write duplication numbers increase write availability, as fewer nodes have to be present in the ring to accept your write. They are also faster to perform. But this comes at the cost of read availability (if a network partition occurs and no node carries the data, you're out of luck).\n",
"* The partioned nodes are in a **sloppy quorom**.\n",
"* When the partition heals the nodes perform a **hinted handoff**, moving temporarily localized data onto the nodes it was originally supposed to land on.\n",
"* Dynamo and thus Raik uses an algorithm called **vector clocks** to find, after a partition heals and data comparison operations are made, conflicting data writes from different sides of the partition.\n",
"* These conflicts may be resolved by the user application. Sometimes this is easy. For example, merging shopping carts (the original Dynamo use case). Sometimes this is hard. Worst-case, last-write-wins can be used.\n",
"* Of course last-write-wins is approximate, due to the clock drift problems explored in the previous section.\n",
"* Riak will introduce high latency onto requests that are made during a network partition. This is by design; Riak really doesn't want to drop write attempts, so it will hold onto them until the backup rings come online, which can take time.\n",
"* To use Riak effectively, you should be structuring your stored objects as **CRTDs**. CRTDs are a newfangled concept of a thing. They're immutable, commutative, and associative. Thus they can be merged without data loss!\n",
"* And don't use last-write-wins because \"clocks are evil\", unless it doesn't matter much or there's no other way. A merge is much better.\n",
"\n",
"## Summary thus far\n",
"* Practical lessons:\n",
" * Network errors means \"I don't know\", not failure.\n",
" * Node consensus does not mean data consistency.\n",
" * Maintaining a consistent data log in the face of primary failover is hard.\n",
" * Wall clocks are an approximation.\n",
" * Choose the right design for your problem space.\n",
"\n",
"## Zookeeper\n",
"* ZooKeeper is a distributed consistent data store.\n",
"* As explored in the previous notebook ZK is meant for small amounts of config data that have to be fully correct.\n",
"* It uses majority quoroms and offers full consistency (serialization).\n",
"* The trade-off is that writes must wait until the full quorom acknowledges a write, and the entire dataset must fit in memory.\n",
"* Jepsen says: Zookeeper is great!\n",
"\n",
"\n",
"## NouDB\n",
"* This database was posed as a refutation of CAP by some famous guy.\n",
"* It is not.\n",
"* It is basically a CP database that blocks forever when a partition occurs. \"Trivial availability\" according to one reading of the conjecture.\n",
"\n",
"## Kafka"
]
}
],
61 changes: 61 additions & 0 deletions Chapter 8 --- Distributed Failures.ipynb
@@ -0,0 +1,61 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Distributed Failures\n",
"\n",
"* Computers are designed to fail all at once and catastrophically.\n",
"* Distributed services are fundamentally different. They need to fail only partially, if possible, in order to allow as many services as possible to continue to operate.\n",
"* This unlocks a lot of additional failure modes, **partial failures**, which are both hard to understand and hard to reason about.\n",
"\n",
"\n",
"* Long discussion of network flakiness...\n",
"\n",
"\n",
"* Modern computers have both a time-of-day clock and a monotonic clock. The former is a user-facing clock that can move backwards or stop in time. The latter is a constantly forward-propagating clock, but only the difference between values is meaningful; its absolute magnitude is meaningless.\n",
"* Both clocks are ruled over by NTP synchronization, when the computer is online.\n",
"* These clocks are quartz-based, and subject to approximately 35 ms average errors.\n",
"* Furthermore, processes may pause for arbitrarily long periods of time due to things like garbage collection and context switches.\n",
"* Things that depend on timeouts, like creating a global log in a distriobuted system (databases) or expiring leader-given leases (GFS), must account for this possibility.\n",
"* You can omit these delays if you try hard enough. Systems like flight controls and airbags operate are **hard real-time systems** that do so. But these are really, really hard to build, and highly limiting.\n",
"\n",
"\n",
"* Algorithms can be proven to work correctly in a given an unreliable **system model**. But you have to choose your model carefully.\n",
"* A point that Jespen makes: these algorithms are hard to implement in a durable way. The theory is far ahead of the practice, but a lot of architects try to homegrow their conflict resolution, which is problematic.\n",
"\n",
"\n",
"* One simplifying assumption we make when designing fault tolerant systems is that nodes and services never purposefully lie.\n",
"* If a node sends an knowingly untrue message to another node, this is known as a **Byzantine fault**. \n",
"* Byzantine faults must be addressed in the design of public distributed systems, like the Internet or blockchain. But they can be ignored in the design of a service system, as obviously you will not purposefully lie to yourself.\n",
"\n",
"\n",
"* In designing a distributed system, one that is protected from Byzantine faults, you often need to design around causal dependency. You need to understand what a node knew when it made a decision, as that knowledge informs how you will heal any resulting divergence.\n",
"* **Vector clocks** are an algorithm for this (e.g. Dynamo).\n",
"* An alterantive is preserving a **total ordering** of operations (e.g. MongoDB)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
