Commit

ResidentMario committed Jun 29, 2018
1 parent 86ae2b8 commit c7e3451
Showing 3 changed files with 134 additions and 7 deletions.
94 changes: 94 additions & 0 deletions Chapter 10 --- Batch Processing.ipynb
@@ -0,0 +1,94 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## MapReduce\n",
"\n",
"* We can place the nearness data processing systems on a continuum, between online systems on one end and **batch processing systems** on the other end (with stream processing as an intermediate; another chapter).\n",
"* Batch processing systems process data on a scheduled or as-needed basis, instead of immediate basis of an online systme.\n",
"* Thus the concerns are very different. Latency doesn't matter. We design for total application success or total failure. **Throughput** is the most important measurement.\n",
"* Batch processing is really the original programming use case, harkening back to the US Census counting card machine days!\n",
"\n",
"\n",
"* Discussion of the UNIX philosophy omitted; I know it well.\n",
"\n",
"\n",
"* Discussion of the basic MapReduce architecture omitted; I read the paper.\n",
"\n",
"\n",
"* A significant issue in MapReduce is skew. If keys are partitioned amongst reducers naively, hot keys, as typical in a Zipf-distributed system (e.g. celebrities on Twitter), will result in very bad tail-reducer performance.\n",
"* Some on-top-of-Hadoop systems, like Apache Pig, provide a skew join facility for working with such keys.\n",
"* These keys get randomly sub-partitioned amongst the reducers to distribute the load.\n",
"\n",
"\n",
"* The default utilization is to perform joins on the reducer side. It's also possible to perform a mapper-side join.\n",
"\n",
"\n",
"* You do not want to push the final producer of a MapReduce job to a database via insert operations as this is slow. It's better to just build a new database in place. A number of databases designed with batch processing in mind provide this feature (see e.g. LevelDB).\n",
"\n",
"## Dataflow\n",
"\n",
"* MapReduce is becoming ever more historical.\n",
"* There are better processing models, that solve problems that were discovered with MapReduce.\n",
"* One big problem with MapReduce has to do with its **state materialization**. In a chained MapReduce, in between every map-reduce there is a write to disc.\n",
"* Another one, each further step in the chain must wait for *all* of the previous job to finish before it can start its work.\n",
"* Another problem is that mappers are often redundant, and could be omitted entirely.\n",
"\n",
"\n",
"* The new approach is known as **dataflow engines**. Spark etcetera.\n",
"* Dataflow engines build graphs over the entire data workflow, so they contain all of the state (instead of just parts of it, as in MapReduce).\n",
"* They generalize the map-reduce steps to generic **operators**. Map-reducers are a subset of operators.\n",
"* The fact that the dataflow engines are aware of all steps allows them to optimize movement between steps in ways that were difficult to impossible with MapReduce.\n",
"* For example, if the data is small they may avoid writing to disc at all.\n",
"* Sorting is now optional, not mandatory. Steps that don't need to sort can omit that operation entirely, saving a ton of time.\n",
"* Since operators are generic, you can often combine what used to be several map-reduce steps into one operation. This saves on all of the file writes, all of the sorts, and all of the overhead in between those moves.\n",
"* It also allows you to more easily express a broad range of computational ideas. E.g. to perform some of the developer experience optimizations that the API layers that were built on top of Hadoop performed.\n",
"\n",
"\n",
"* On the other hand, since there may not be intermediate materialized state to back up on, in order to retain fault tolerance dataflow introduces the requirement that computations be deterministic.\n",
"* In practice, there are a lot of sneaky ways in which non-determinism may sneak into your processing.\n",
"\n",
"\n",
"## Graph processing\n",
"\n",
"* What about graph processing, e.g. processing data using graph algorithms?\n",
"* Dataflow engines have implemented this feature using the **bulk sychronous parallel** model of computation. This was populared by Google Pregel (Google again...).\n",
"* The insight is that most of these algorithms can be implemented by processing one node at a time and \"walking\" the graph.\n",
"* This algorithm archetype is known as a **transitive closure**.\n",
"* In BSP nodes are processed in stages. At each stage you process the nodes that match some condition, and evaluate what nodes to step to next.\n",
"* When you run out of node to jump to you stop.\n",
"* It is possible to parallelize this algorithm across multiple partitions. Ideally you want to partition on the neighborhoods you are going to be within during the walks, but this is hard to do, so most schemes just partition the graph arbitrarily.\n",
"* This creates unavoidable message overhead, when nodes of interest are on different machines.\n",
"* Ongoing area of research.\n",
"\n",
"\n",
"## Declarative query languages\n",
"* There has been a move to a SQL-like declarative query languages with dataflow engines.\n",
"* Also, a whole bunch of useful algorithms are \"baked in\" the ecosystem. E.g. there are machine learning specific Spark libraries!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
34 changes: 34 additions & 0 deletions Chapter 11 --- Streams.ipynb
@@ -0,0 +1,34 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Streams\n",
"\n",
"*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
13 changes: 6 additions & 7 deletions Databases Talk Notes.ipynb
@@ -6,6 +6,7 @@
"source": [
"(Speaking notes for a talk on database architecture)\n",
"\n",
"# Part 1\n",
"* There are just three properties intrinsic to a database.\n",
" * Persistent insert --- You can push data into the database, and expect it to still be there later.\n",
" * Persistent read --- You can read data out of the database, and expect it to return the same value as later.\n",
@@ -14,12 +15,13 @@
" \n",
"* Not included in these properties:\n",
" * Transactions --- transactions are thought of as a common database feature. But they are not an intrinsic thing!\n",
" * SQL --- SQL is jsut a convenient, uniform standard.\n",
"\n",
"\n",
"# Part 2\n",
"* Simplest possible database: append to a file.\n",
"* Simple log-structured database.\n",
"* Add log merges.\n",
"* This is Bitcask!\n",
"* Why are log files good? Append operation is fast.\n",
"\n",
"\n",
@@ -37,7 +39,8 @@
"* B-trees involve random seek-writes, but do not necessitate a log merge process.\n",
"* So more predictable performance, but worse median write performance.\n",
" \n",
" \n",
"\n",
"# Part 3\n",
"* Example transactional databases\n",
" * SQL databases are not the same, but their user-facing design is the same.\n",
" * SQLite\n",
@@ -136,11 +139,7 @@
" * This is meant to make it convenient to query user activity or other time-series chained events.\n",
"\n",
"\n",
"* Point is that there are LOTS of different database models!\n",
"\n",
"\n",
"* One last thing worth understanding is the CAP theorem.\n",
"* Discuss this."
"* Point is that there are LOTS of different database models!"
]
}
],
