ResidentMario committed Jun 29, 2018
1 parent c7e3451 commit 76fb813
Showing 1 changed file with 105 additions and 2 deletions.
107 changes: 105 additions & 2 deletions Chapter 11 --- Streams.ipynb
@@ -6,7 +6,110 @@
"source": [
"# Streams\n",
"\n",
"*"
"* Stream processing is a near-realtime, high-throughput alternative to batch processing.\n",
"* The smallest element of a stream is an **event**: a small, self-contained, immutable object representing some chunk of state.\n",
"* An event is generated by one **producer**, but processed by multiple **consumers**.\n",
"* Related events grouped together form a **topic** or **stream**.\n",
"* Streams follow a **publish/subscribe model**.\n",
"\n",
"\n",
"## Messaging systems\n",
"* At the core of a stream is a messaging system. The messaging system is responsible for receipt and delivery of events.\n",
"* Broadly speaking, stream processing systems can be differenciated by asking two questions about their top-level functionality:\n",
" * What happens when the producer publishing outstrips consumer processing? There are three approaches: queueing, dropping messages, and **backpressure** (the latter being blocking the producer until the current set of messages have been processed).\n",
" * What are the durability guarantees? Messaging systems that may lose data will be faster than ones that provide strong durability.\n",
"* You can use any of the tools discussed in the data flow section to implement a message system:\n",
" * Direct producer-consumer messaging.\n",
" * This is simple, and doesn't require standing up new services, but mixes concerns (and doesn't scale very well because of that) and complicates application logic.\n",
" * Message brokers.\n",
" * This makes the most sense at scale.\n",
"\n",
"\n",
"* When multiple consumers are subscribed to a topic, two patterns are used.\n",
" * **Load balancing**. Each message is delivered to one consumer.\n",
" * **Fan-out**. Each message is delivered to all consumers.\n",
"\n",
"\n",
"* The same consistency and isolation concerns that were present in databases emerge with message streams as well.\n",
"* Message brokers that guarantee delivery wait until an ack response from the delivery target before they dequeue a message.\n",
"* If no ack comes within a certain timeout, and a load balancing pattern is used, the message is re-delivered to a different node.\n",
"* This means that messages do not necessarily arrive in order!\n",
"* Furthermore, the message might have actually been processed, but the ack was dropped or delayed by a network problem.\n",
"* So duplication of data between consumers is also possible.\n",
"\n",
" \n",
"## Log-based message brokers\n",
"* Message brokers may keep everything in memory, and delete messages that have been sufficiently acked.\n",
"* Or they may choose to persist their contents to a local log. Message brokers that choose to do so are **log-based message brokers**.\n",
"* Log-based message brokers have the advantage of persistance, whilst still being able to reach extremely high throughputs.\n",
"* They are also naturally sequential.\n",
"* The throughput comes via partitioning. Multiple brokers may be deployed, each one responsible for a single topic.\n",
"* For consumers, reading a message is as simple as reading from the (append-only) log file.\n",
"* Fan-out is trivial to implement in this case.\n",
"* The load balancing pattern can be implemented by assigning nodes on a per-partition basis. However, this is admittedly much less flexible than is achievable with a non log-based message broker, as it requires having at least as many brokers (and topics) as you have consumers.\n",
"* In this mode, consumers maintain an offset in the log file. The broker may periodically check which consumer has the furthest-back offset, and \"clean out\" or perhaps archive the segment of the log that predates it.\n",
"* This greatly reduces network volume, as acks are no longer necessary. This helps make log-based message brokers more easily scalable.\n",
"\n",
"\n",
"## Syncing aross systems\n",
"\n",
"* Cross-system sychronization is an often necessary task, but it's one fraught with danger.\n",
"* The simplest solution would be to perform a full table dump, from time to time, and have the other system parse that dump.\n",
"* If this is too slow, or has too much latency, you can use **dual writes**: have the application explicitly write to both systems. This is a bundle of fun of race conditions and failure modes, however. You cannot do it fully safely without cross-system atomic commit, which is a big task to consider.\n",
"\n",
"\n",
"* An increasingly popular alternative is **change data capture**.\n",
"* Databases maintain a log of actions in either their log (log and SSLTable -based architectures) or their write-ahead log (B-tree architectures). Another application could consume that log.\n",
"* This is better than dual writes in many cases because, as with log-based message brokers, you are explicitly sychronized.\n",
"\n",
"\n",
"* Some operations require the full database state to work. For example, building a full-text search index.\n",
"* The replication log by default is a database-specific implementation detail. Particularly write-ahead logs will get cleaned up from time to time.\n",
"* To continue to have \"all the state\" without having to maintain an infinitely long log, you can do one of two things:\n",
" * Pair snapshots of the database at a certain timestamp with the replication log thereafter. This allows you to delete log entries accounted for in the most recent snapshot.\n",
" * Perform log compaction on the CDC side. This preserves only the most recent value set, which is what you need.\n",
"* Message brokers that support these features (Kafka does) effectively implement durable storage. In other words, you could use this mechanism to implement a Kafka -based database!\n",
"\n",
"\n",
"## Event sourcing\n",
"* **Event sourcing** is a service design pattern (one of Martin Fowler's three event-driven design patterns; see his GOTO talk).\n",
"* It is a way of organizing entire systems using events.\n",
"* A database allows destructive operations: overwrites and deletes.\n",
"* Immutability has big advantages and disadvantages.\n",
"* You can capture the current state of the database using the event log and/or a database snapshot.\n",
"* But now you have a full history of changes, including changes that have since been overwritten.\n",
"* This makes it easier to e.g. debug your system, as you can just point to the place in history where a failure occurred and step through it.\n",
"* On the other hand truly deleting data (as necessary for e.g. legal compliance) becomes a problem.\n",
"\n",
"\n",
"## Time\n",
"\n",
"* Time is a problem.\n",
"* When do you know that you have every event within a certain time window? You don't, there could be another event communication stuck in network traffic!\n",
"* You have to do something to deal with **straggler events**. You could either:\n",
" * Ignore them, past a certain maximum wait time.\n",
" * Issue a correction to downstream systems relying on event timing.\n",
"\n",
"\n",
"* Another problem is with wall clock times. Remember, clocks can be out-of-sync.\n",
"* You can't necessarily trust the clock time of any device owned by an end user, due to the possibility of subversion.\n",
"* So most streaming systems just lop a ton of different timestamps (server recieve time, server send time, end-user read time) together and deal with the combination of attributes these expose.\n",
"\n",
"\n",
"* Joins between streams are time-dependent, because streams in general do not gaurantee the order of their contents.\n",
"* The possibility of a stream join mutating do to event order changes is known as a **slowly changing dimension** in the data warehousing literature. The solution generally used is timestamping each version of the join, so joins are considered non-deterministic!\n",
"\n",
"\n",
"## Fault tolerance\n",
"* How do you tolerate faults in a stream-based system? The answer is not immediately obvious.\n",
"* We cannot simply retry the operation, as we had with batch processing, because a stream never ends.\n",
"* How do we still provide **exactly once execution** with streams?\n",
"* One method is to organize stream events in **microbatches**. A microbatch is complete when all necessary acks are recieved from the subscribers to a selected batch of a topic.\n",
"* For example, Spark Streaming is organized around microbatches that are a single second in length.\n",
"* Apache Flink dynamically chooses batch sizes and boundary conditions.\n",
"* Another approach is to implement an atomic commit protocol.\n",
"* Another is to rely on (or force) idempotent operations. E.g. only allow operations that, when applied multiple times, work the same as if they were applied once.\n",
"* If your stream processors are idempotent fault tolerance becomes *much* less of an issue!"
]
}
],
@@ -26,7 +26,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.6.5"
}
},
"nbformat": 4,