{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Consistency and consensus\n", "\n", "* All durable data systems provide **eventual consistency**, but their implementations of the \"eventually\" part differ.\n", "* In the pre-previous chapter we examined isolation, which is consistency within (and between) concurrent transactions.\n", "* Two chapters before that we examined broad consistency gaurantees in the context of partitioned data.\n", "* Consistency is a difficult topic, and it's one wrapped up in all of the other concerns of distributed systems.\n", "\n", "\n", "## Linearizability\n", "* A distributed system is **linearizable** if:\n", " * All operations are atomic, acting as though they happened at instantaneous moments in time, resulting in an implicit order of operations within the system.\n", " * Concurrency or an illusion of concurrency may happen, but operations are sequential within the system.\n", " * All reads and compare-and-sets are up-to-date at return time with respect to that order (no stale operations).\n", "* Not part of this definition: op start times (e.g. when the request is made) do not determine operation order. The atomic operation may occur at any time before the ack. With concurrent ops, the service may determine the landing sequence.\n", "\n", "\n", "* Seriazation, the strongest isolation gaurantee, and linerazation, here, are two different concepts of service behavior.\n", "* The strongest possible gaurantees a system can provide overall is \"strict serializability\", which is both of these properties combined.\n", "\n", "\n", "* Even if your system does not provide linearizability, some system components may still require linearizability to function properly.\n", "* Lock services, leader elections, data uniqueness gaurantees, and two-channel or cross-channel operations (e.g. concurrently asynchronously write a file to a file system and a message to a message broker telling people to do something with that file) all require linearizability.\n", "\n", "\n", "* Asynchronous multi-leader and leaderless replication are not linearizable due to potential data loss in a failover. Synchronous multi-leader and leaderless replication are linearizable, but you rarely want to pay this performance cost. Single-leader replication can be made synchronous, but again, meh.\n", "* Sloppy quoroms (used in leaderless replication) are obviously not linearizable due to data divergence.\n", "* Strict quoroms (same) are not linearizable if the op is applied immediately on the replica. They are linearizable if the write is delayed until all replicas ack, but this is usually not the default behavior.\n", "\n", "\n", "* Brief tangent on the CAP theorem, how it's mainly historical nowadays (I still find it to be a useful heuristic but OK).\n", "\n", "\n", "## Ordering\n", "* Linearizability implies **causal consistency** (discussed earlier).\n", "* That is because linearizability affords a **total ordering**.\n", "* However, you can have causal consistency without linearizability (a **partial ordering**).\n", "* A causally consistent system preserves causally linked orders of operations.\n", "* In practice, however, determining which operations are causally linked and which ones are not is tedious.\n", "* Thus in practice in a causally consistent system we weaken the second criterion of linearizability:\n", " * All operations are atomic, acting as though they happened at instantaneous moments in time, resulting in an implicit order of operations within the system.\n", " * If two ops are in flight at the same time, the ops may be applied in any order.\n", "* The difference between causal consistency and linearizability is that in the latter case the timestamps of concurrent requests cannot be compared, because there *is* no globally consistent timestamp.\n", "\n", "\n", "* Causal consistency can be achived without linearization.\n", "* Causal consistency is the strongest consistency model that doesn't incur a heavy network cost.\n", "* Providing it requires defining a sequence of operations across nodes that the nodes agree upon.\n", "* First problem: how do you get a sequence of numbers that matches the operational sequence?\n", "* Using a naive solution, like clocks, won't work, because see chapter 8.\n", "\n", "\n", "* **Lamport timestamps** are a lightweight way of achieving this agreed-upon sequential order. In this algorithm nodes increment a `(node_unique_id, node_sequence_number, largest_sequence_number_seen)` counter.\n", "* If a node wants to add an op to the log, it increments its sequence number.\n", "* If a node sees a message with a larger sequence number, it immediately raises its own next sequence number to match.\n", "* The unique ID is used to break sequence number ties.\n", "\n", "\n", "* **Vector clocks** are an alternative more heavyweight algorithm.\n", "\n", "\n", "* Systems using these schemes are eventually consistent across all nodes.\n", "* However, this is insufficient if there are requests (like e.g. asking for a unique username) that need to be aware of every previous operation at runtime.\n", "* Supporting such operations requires knowing, eventually, the total order of operations. The systems that provide this are performing **total order broadcast** (also known as **atomic broadcast**).\n", "* Total order broadcast requires two operational capacities:\n", " * Reliable delivery (no message loss)\n", " * Totally ordered delivery (messages are delivered to all nodes in the same order).\n", "* Implementing consensus requires implementing total order broadcast. This feature is provided by e.g. etcd and Zookeeper.\n", "\n", "\n", "* Linearizable systems can be used to build total order broadcast systems, and vice versa.\n", "* While these two properties are different, having one allows you to have the other, if you desire.\n", "\n", "\n", "\n", "## Consensus\n", "* **Consensus algorithms** are a class of algorithms for addressing the node consensus problem.\n", "* No matter how you design a distributed system, there will always be situations where all nodes must agree on state. For example, transactions across data partitions on different nodes, or leader elections.\n", "* Total order broadcast is an identity transform of a consensus algorithm.\n", "\n", "\n", "* The classic algorithm for this is the **two-phase commit**.\n", "* Do not confuse this with two-phase locking!\n", "* Two-phase commit relies on a **coordinator**, usually a separate process, which performs a pre-flight check on all of the nodes involving, asking if they can perform the op. The nodes check and reply yes or no. If any nodes say no, the state change is aborted. If all nodes say yes, the coordinator sends a green light.\n", "* This algorithm obviously relies on atomic broadcast to work.\n", "* This algorithm has one glaring weakness: if the coordinator goes down after all nodes ack but before it can send a green or red light, it's non-obvious how to recover (restarting the coordinator, sure, but that takes time).\n", "* If nodes are blocked during this period (they probably are), this is really bad, as it stops the system dead.\n", "\n", "\n", "* Two-phase commit is a simple idea, but it turns out to make for a crappy consensus algorithm, due to this exact problem.\n", "* Good consensus algorithms avoid this problem. The field includes Paxos (the oldest), Zab, Raft, and VSR.\n", "* The consensus algorithms are all designed around epochs and strict quoroms.\n", "* Each time a leader loss occurs, a quorom of nodes is gathered to vote on a new leader. The new leader increments the epoch number.\n", "* Every time a leader wants to perform an action, it checks whether or not another leader with a higher epoch number exists.\n", "* To do this, it asks a quorom of nodes what the highest epoch number they have seen is. The insight: within a strict quorom, at least one of the nodes, at a minimum, was present in the most recent vote!\n", "* Thus if a leader does not learn from its quorom of another higher-epoch leader, it knows it can safely perform the op.\n", "\n", "\n", "* Consensus algorithms implement synchronous replication. Totally ordered atomic broadcast, we thus learn, requires we be synchronous.\n", "* Therefore it is very slow, particularly if the network is bad.\n", "* Additionally, certain network partitions can lead to very bad worst-case behavior, such as continuous elections.\n", "* Designing consensus algorithms that are more robust to network failures is an ongoing area of research.\n", "\n", "\n", "* The chapter ends with a discussion of the sorts of things ZooKeeper is used for." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }