ResidentMario committed Jun 29, 2018
1 parent c7e3451 commit 76fb813
Showing 1 changed file with 105 additions and 2 deletions.
107 changes: 105 additions & 2 deletions Chapter 11 --- Streams.ipynb
@@ -6,7 +6,110 @@
"source": [
"# Streams\n",
"\n",
"*"
"* Stream processing is a near-realtime, high-throughput alternative to batch processing.\n",
"* The smallest element of a stream is an **event**: a small, self-contained, immutable object representing some chunk of state.\n",
"* An event is generated by one **producer**, but processed by multiple **consumers**.\n",
"* Related events grouped together form a **topic** or **stream**.\n",
"* Streams follow a **publish/subscribe model**.\n",
"\n",
"\n",
"## Messaging systems\n",
"* At the core of a stream is a messaging system. The messaging system is responsible for receipt and delivery of events.\n",
"* Broadly speaking, stream processing systems can be differenciated by asking two questions about their top-level functionality:\n",
" * What happens when the producer publishing outstrips consumer processing? There are three approaches: queueing, dropping messages, and **backpressure** (the latter being blocking the producer until the current set of messages have been processed).\n",
" * What are the durability guarantees? Messaging systems that may lose data will be faster than ones that provide strong durability.\n",
"* You can use any of the tools discussed in the data flow section to implement a message system:\n",
" * Direct producer-consumer messaging.\n",
" * This is simple, and doesn't require standing up new services, but mixes concerns (and doesn't scale very well because of that) and complicates application logic.\n",
" * Message brokers.\n",
" * This makes the most sense at scale.\n",
"\n",
"\n",
"* When multiple consumers are subscribed to a topic, two patterns are used.\n",
" * **Load balancing**. Each message is delivered to one consumer.\n",
" * **Fan-out**. Each message is delivered to all consumers.\n",
"\n",
"\n",
"* The same consistency and isolation concerns that were present in databases emerge with message streams as well.\n",
"* Message brokers that guarantee delivery wait until an ack response from the delivery target before they dequeue a message.\n",
"* If no ack comes within a certain timeout, and a load balancing pattern is used, the message is re-delivered to a different node.\n",
"* This means that messages do not necessarily arrive in order!\n",
"* Furthermore, the message might have actually been processed, but the ack was dropped or delayed by a network problem.\n",
"* So duplication of data between consumers is also possible.\n",
"\n",
" \n",
"## Log-based message brokers\n",
"* Message brokers may keep everything in memory, and delete messages that have been sufficiently acked.\n",
"* Or they may choose to persist their contents to a local log. Message brokers that choose to do so are **log-based message brokers**.\n",
"* Log-based message brokers have the advantage of persistance, whilst still being able to reach extremely high throughputs.\n",
"* They are also naturally sequential.\n",
"* The throughput comes via partitioning. Multiple brokers may be deployed, each one responsible for a single topic.\n",
"* For consumers, reading a message is as simple as reading from the (append-only) log file.\n",
"* Fan-out is trivial to implement in this case.\n",
"* The load balancing pattern can be implemented by assigning nodes on a per-partition basis. However, this is admittedly much less flexible than is achievable with a non log-based message broker, as it requires having at least as many brokers (and topics) as you have consumers.\n",
"* In this mode, consumers maintain an offset in the log file. The broker may periodically check which consumer has the furthest-back offset, and \"clean out\" or perhaps archive the segment of the log that predates it.\n",
"* This greatly reduces network volume, as acks are no longer necessary. This helps make log-based message brokers more easily scalable.\n",
"\n",
"\n",
"## Syncing aross systems\n",
"\n",
"* Cross-system sychronization is an often necessary task, but it's one fraught with danger.\n",
"* The simplest solution would be to perform a full table dump, from time to time, and have the other system parse that dump.\n",
"* If this is too slow, or has too much latency, you can use **dual writes**: have the application explicitly write to both systems. This is a bundle of fun of race conditions and failure modes, however. You cannot do it fully safely without cross-system atomic commit, which is a big task to consider.\n",
"\n",
"\n",
"* An increasingly popular alternative is **change data capture**.\n",
"* Databases maintain a log of actions in either their log (log and SSLTable -based architectures) or their write-ahead log (B-tree architectures). Another application could consume that log.\n",
"* This is better than dual writes in many cases because, as with log-based message brokers, you are explicitly sychronized.\n",
"\n",
"\n",
"* Some operations require the full database state to work. For example, building a full-text search index.\n",
"* The replication log by default is a database-specific implementation detail. Particularly write-ahead logs will get cleaned up from time to time.\n",
"* To continue to have \"all the state\" without having to maintain an infinitely long log, you can do one of two things:\n",
" * Pair snapshots of the database at a certain timestamp with the replication log thereafter. This allows you to delete log entries accounted for in the most recent snapshot.\n",
" * Perform log compaction on the CDC side. This preserves only the most recent value set, which is what you need.\n",
"* Message brokers that support these features (Kafka does) effectively implement durable storage. In other words, you could use this mechanism to implement a Kafka -based database!\n",
"\n",
"\n",
"## Event sourcing\n",
"* **Event sourcing** is a service design pattern (one of Martin Fowler's three event-driven design patterns; see his GOTO talk).\n",
"* It is a way of organizing entire systems using events.\n",
"* A database allows destructive operations: overwrites and deletes.\n",
"* Immutability has big advantages and disadvantages.\n",
"* You can capture the current state of the database using the event log and/or a database snapshot.\n",
"* But now you have a full history of changes, including changes that have since been overwritten.\n",
"* This makes it easier to e.g. debug your system, as you can just point to the place in history where a failure occurred and step through it.\n",
"* On the other hand truly deleting data (as necessary for e.g. legal compliance) becomes a problem.\n",
"\n",
"\n",
"## Time\n",
"\n",
"* Time is a problem.\n",
"* When do you know that you have every event within a certain time window? You don't, there could be another event communication stuck in network traffic!\n",
"* You have to do something to deal with **straggler events**. You could either:\n",
" * Ignore them, past a certain maximum wait time.\n",
" * Issue a correction to downstream systems relying on event timing.\n",
"\n",
"\n",
"* Another problem is with wall clock times. Remember, clocks can be out-of-sync.\n",
"* You can't necessarily trust the clock time of any device owned by an end user, due to the possibility of subversion.\n",
"* So most streaming systems just lop a ton of different timestamps (server recieve time, server send time, end-user read time) together and deal with the combination of attributes these expose.\n",
"\n",
"\n",
"* Joins between streams are time-dependent, because streams in general do not gaurantee the order of their contents.\n",
"* The possibility of a stream join mutating do to event order changes is known as a **slowly changing dimension** in the data warehousing literature. The solution generally used is timestamping each version of the join, so joins are considered non-deterministic!\n",
"\n",
"\n",
"## Fault tolerance\n",
"* How do you tolerate faults in a stream-based system? The answer is not immediately obvious.\n",
"* We cannot simply retry the operation, as we had with batch processing, because a stream never ends.\n",
"* How do we still provide **exactly once execution** with streams?\n",
"* One method is to organize stream events in **microbatches**. A microbatch is complete when all necessary acks are recieved from the subscribers to a selected batch of a topic.\n",
"* For example, Spark Streaming is organized around microbatches that are a single second in length.\n",
"* Apache Flink dynamically chooses batch sizes and boundary conditions.\n",
"* Another approach is to implement an atomic commit protocol.\n",
"* Another is to rely on (or force) idempotent operations. E.g. only allow operations that, when applied multiple times, work the same as if they were applied once.\n",
"* If your stream processors are idempotent fault tolerance becomes *much* less of an issue!"
]
}
],
@@ -26,7 +26,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.5"
}
},
"nbformat": 4,