ResidentMario · Jun 20, 2018 · 3288a59
1 parent 6e69512
commit 3288a59
Showing 1 changed file with 129 additions and 0 deletions.
diff --git a/Chapter 4 --- Encoding and Evolution.ipynb b/Chapter 4 --- Encoding and Evolution.ipynb
@@ -0,0 +1,129 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Encoding and Evolution\n",
+ "\n",
+ "\n",
+ "## Introduction to data serialization\n",
+ "* Applications evolve over time.\n",
+ "* When the application is server-side, you perform a **rolling upgrade**, taking down a few nodes in your deployment at a time for upgrades.\n",
+ "* When the application is client-side, updating is at the mercy of the user.\n",
+ "* To make this easier, on the data layer, you may have either or both of **backwards compatibility** (newer code can read data written by older code) and **forwards compatibility** (older code can read data written by newer code). The latter is far trickier.\n",
+ "\n",
+ "\n",
+ "* In memory we consider data in terms of the data structures they live in.\n",
+ "* On disk (or over the network) we work with data that has been **encoded** into a sequence of bytes somehow.\n",
+ "* The main exception is memory-mapped files, where the on-disk byte layout is used directly as the in-memory data structure.\n",
+ "\n",
+ "\n",
+ "* The process of translating between these two representations is known as serialization, encoding, or marshalling. A specific format is a **data serialization format**.\n",
+ "* Most languages include language-specific formats, like Python's `pickle`, which can be used to store language objects.\n",
+ "* These formats are easy to use when working with a specific language, but they tie the data to that language and are not easily compatible with other languages.\n",
+ "* Still, if your data always stays inside of your application boundary, these formats are fine.\n",
+ "\n",
+ "\n",
+ "## Human-readable data interchange formats\n",
+ "* JSON and XML (and CSV, and the other usual suspects) are common **data interchange formats**, meant to be moved between application boundaries.\n",
+ "* These formats are considered to be lowest common denominators, however.\n",
+ "* They have parsing problems. For example, it's often difficult or impossible to determine the type of a value (JSON, for instance, does not distinguish integers from floating-point numbers).\n",
+ "* Being human-readable, they are also inefficient in resource terms when performing network transfers.\n",
+ "* Still, for simple use cases these formats are usually sufficient.\n",
+ "\n",
+ "\n",
+ "## Binary data interchange formats\n",
+ "* Binary encodings are an improvement on the human-readable data interchange formats.\n",
+ "* Several competing binary encodings of the human-readable data interchange formats above exist. For example, MongoDB stores JSON data encoded in the BSON format.\n",
+ "* Your application can also invent its own binary file format, if it so desires.\n",
+ "* For general-purpose use and interchangeability, however, several binary data serialization formats exist.\n",
+ "* The two examples the book uses are Google Protobufs and Apache Thrift (both of which are still in good use in the ecosystem today).\n",
+ "* Binary data interchange formats provide a wealth of tooling, including programs that can be run to automatically machine-generate APIs for working with the data in your language of choice.\n",
+ "* These APIs are naturally verbose and do not match very well against common language patterns, because they are machine-written and not human-written, but they work well enough.\n",
+ "* These advanced binary data interchange formats are especially neat in that they provide forward and backwards compatibility built-in.\n",
+ "* In the context of data interchange formats this is known as **schema evolution**.\n",
+ "\n",
+ "\n",
+ "* The book talks about two camps of similar but differently designed binary interchange formats. Apache Thrift and Google Protocol Buffers are in one camp, and Apache Avro is in the other.\n",
+ "\n",
+ "\n",
+ "* Fields in Protobuf (and in Thrift) are identified by field IDs.\n",
+ "* A field that is not required can be left unset, in which case it is simply omitted from the encoded byte sequence.\n",
+ "* But once assigned, they cannot be changed, as doing so would change the meaning of past data encodings.\n",
+ "* This provides backwards compatibility. There is one catch, however: you cannot mark new fields as required, because old data does not contain them and reading it would then fail.\n",
+ "* How about forward compatibility? Every field ID is encoded alongside a type annotation. When an old reader reads new data and finds a field ID it doesn't know about, it skips that field, using the type annotation to determine how many bytes to skip. Easy!\n",
+ "* I have some experience with Google Protobufs, so I know how this whole thing works reasonably well.\n",
+ "\n",
+ "\n",
+ "* Another perspective, used by Apache Avro, is that these formats must be resilient to differences between the **reader schema** and the **writer schema**. The challenge is to have a reader that understands every possible version of the writer.\n",
+ "* Avro requires that you provide the writer's schema (or its version) on read, which the other two formats do not require. This is additional overhead, as you must either encode that information in the file or provide it through some other means (the former is better if the file is big, and the latter if the files are small).\n",
+ "* This allows Avro to omit field tags from the encoded data. This in turn makes Avro much easier to use with a dynamic schema, i.e. one that is changing all the time.\n",
+ "* This use case is what motivated Avro in the first place. And this is a tradeoff! Avro is more dynamic; Protocol Buffers and Thrift are more static but less work.\n",
+ "\n",
+ "\n",
+ "* All three are equipped with **interface description languages**. These allow you to perform **code generation** and get a machine-written API for your data.\n",
+ "* However, code generation is mainly useful for statically typed languages, which benefit from explicit type checking. Dynamically typed languages, like Python, do not get much benefit.\n",
+ "\n",
+ "\n",
+ "## Data flow\n",
+ "* The rest of the chapter discusses the general idea of **data flow**.\n",
+ "* Data flow is how data flows through your system. It involves thinking about data usage patterns, application boundaries, and similar such things.\n",
+ "* The book points to three types of data flows in particular.\n",
+ "\n",
+ "* Databases are the first data flow concept.\n",
+ "* Both backwards and forward compatibility matter in a database. Backwards compatibility is important because writing to a database is fundamentally sending a message to your future self. Forward compatibility matters because when you perform rolling updates, your database will be accessed concurrently by both newer and older versions of software.\n",
+ "* When you perform a database migration, the format of the underlying data is actually left unchanged, in those cases where no new information is stored (e.g. adding a new column full of `null` values). Only when you touch the newly created columns will the database figure out where to store the data so that it has space to include the additional information.\n",
+ "* This is in recognition that the data that gets stored typically outlives the code that stores it, and your database may have values in it that have not been touched in five years or more! Moving all of that at once is expensive, so databases migrate lazily when they can.\n",
+ "* An approach that deals with data principally through databases is using what is known as an **integration database**. Heavy-on-database data flow is an architectural pattern commonly associated with **monolithic architecture**.\n",
+ "\n",
+ "\n",
+ "* The second generalized type of data flow is service communication.\n",
+ "* In service communication you have services that talk to one another using some kind of agreed-upon interchange format. The web is a great example.\n",
+ "* Often these services are organized in terms of clients and servers, with clients talking to the servers on behalf of end users.\n",
+ "* API calls over a network of some kind have a long lineage.\n",
+ "* On the web, this is where REST and SOAP live.\n",
+ "* This is where **service-oriented architecture** and **microservices** matter (the latter is a more recently coined and more specific subset of the former).\n",
+ "* REST is a design philosophy that opines on how well-designed services built over `HTTP` and `HTTPS` should look. SOAP, by contrast, is an XML-based and technically HTTP-independent (but usually HTTP-using) protocol. The two compete for mindshare.\n",
+ "* REST is winning over SOAP, at least in part due to the decline of XML.\n",
+ "* The other philosophy is **remote procedure calls**, or RPC. RPC wants you to treat your inter-service calls the same way you would treat your intra-service calls (i.e. function calls *within* a program).\n",
+ "* RPC is a nice idea but it has problems. From a design perspective it's important to understand what they are:\n",
+ " * A local function call is predictable, in the sense that it succeeds or fails or causes the program to crash. A remote call may simply never result in a response (until you time it out yourself).\n",
+ " * Moreover, remote calls are basically always less reliable and have more variable latency, because they rely on a potentially flaky, bandwidth-limited network.\n",
+ " * If you retry an operation you may duplicate an action on the endpoint. This can be worked around but requires additional thought and design.\n",
+ " * Network calls have a much higher fixed cost than local function calls. You need to encode all of the necessary data and send it out. Additional serialization is necessary, and large objects are a problem.\n",
+ "* At the end of the day, passing data around in memory is much cheaper than passing it over the network. A good rule of thumb: local calls can be fine-grained, but network calls should be coarse-grained. This is a big reason why monoliths are hard to refactor into microservices!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# "
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}