ResidentMario committed Jun 15, 2018
1 parent e69f287 commit fa1d13d
Showing 3 changed files with 132 additions and 0 deletions.
5 changes: 5 additions & 0 deletions Chapter 3 --- Storage and Retrieval.ipynb
@@ -166,6 +166,11 @@
"* Compression is a whole other topic worth exploring.\n",
"\n",
"\n",
"* You still insert data into column stores in a row-wise manner.\n",
"* It probably doesn't make sense to use B-trees to store the data, as overwriting data with new values that are too large to fit in the allotted memory space will break column contiguity.\n",
"* Log structured storage still works well, however. (need to think some more about why this is the case)\n",
"\n",
"\n",
"* When data is organized in column order, there is no intrinsic row sort order.\n",
"* Leaving the rows unsorted improves write performance, as it results in a simple file append.\n",
"* Sorting the rows improves read performance when you *do* need to query specific rows in a column-oriented database. You can multi-sort by as many rows as desired, obviously, but rows beyond the first will only help when performing grouped queries.\n",
70 changes: 70 additions & 0 deletions Chapter 3.3 --- Cassandra.ipynb
@@ -0,0 +1,70 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mission statement\n",
"\n",
"* Cassandra is a **wide-column database**.\n",
"* Wide-column databases are two-dimensional key-value stores, where each key corresponds with a set of columns. The in-memory equivalent would be a `dict` inside a `dict` in Python.\n",
"* They differ from a true relational database in that the sub-keys can be anything. You are not forced to use the specific columns in the table schema, and instead can use any columns and values that strike your fancy.\n",
"* Meanwhile, while true column stores provide locality on all of the columns, wide-column data stores provide locality on the individual records.\n",
"* They're effectively schemaless transactional databases.\n",
"* Cassandra itself is designed for partition tolerance and availability, but not consistency.\n",
"* It's designed to be sharded heavily and to deal with truly big \"big data\" distributions.\n",
"\n",
"## Data model\n",
"\n",
"* Cassandra arranges its clusters using a shared-nothing **ring architecture**. Each node is in communication with a node immediately to its left and to its right, but doesn't explicitly depend on the availability of its peers.\n",
"* There are no masters or slaves, only peers.\n",
"* Clients access specific entry nodes, but data is asynchronously replicated across the nodes. If the desired node goes down, the client can cycle to other nodes, which should still be up (achieving availability).\n",
"* This architecture is lifted directly from the Dynamo system, which came slightly beforehand.\n",
"\n",
"\n",
"* The Cassandra data model is a standard SSTable LSM-tree implementation, taken from BigTable. So: write to commit log, write to memtable, acknowledge to client, periodically flush the memtable into log files, periodicially merge logfiles into a new unified log.\n",
"* A hash is used to determine which node in the ring will accept the write for a chunk of data.\n",
"* After the insertion operation is finished, replication is done by sending the data to the left-right nodes the nodes is in communication.\n",
"* You can configure more replicating by specifying the **replication factor**.\n",
"\n",
"\n",
"* On a read, the client connects to any node they want (or any node that's available, really). The node services the request by internally routing to nodes in the cluster which have the data.\n",
"* Due to network partitions and the asynchronous nature of data sharing, different nodes may have data that is in different states of recency.\n",
"* Thus you can tune how far to look for a response, depending on the level of consistency that you desire.\n",
"* A minimum level of consistency, `ONE`, will result in the *first* node that has the data reporting that data to the client.\n",
"* On the flip side, the maximum level of consistency, `ALL`, will result in *every* node that has that data shared with it reporting that data. The data that has the most recent timestamp will be the data that is reported.\n",
"* Most commonly you want `QUORUM`. In this case, 51% of nodes report, and the most recently timestamped data point amongst these nodes is returned.\n",
"* Thus in Cassandra there is a trade-off between *consistency* and *speed*.\n",
"* If only one node needs to report data on read, then Cassandra is highly available, but not highly consistent (there can be laspes in data sameness). If full consensus is used, then Cassandra is highly consistent, but not highly available (what happens if a node goes down?).\n",
"* Thus Cassandra offers tunable consistency.\n",
"\n",
"\n",
"* This model has linear scaling performance. The Cassandra database architecture scales better than basically any other database architecture out there, making it a preferred solution for truly \"big data\" problems.\n",
"\n",
"\n",
"* Queries are via CQL."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
57 changes: 57 additions & 0 deletions Chapter 3.4 --- HBase.ipynb
@@ -0,0 +1,57 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mission statement\n",
"\n",
"* HBase is a wide-column database (for more on what these are look at the Cassandra notes).\n",
"* However, whilst Cassandra emphasizes availability, HBase emphasizes consistency.\n",
"* HBase is based on HDFS as the underlying data store. HDFS is designed to be highly durable.\n",
"* HBase provides good high-volume, large-data, random read-write performance.\n",
"* Its primary use case is \"reporting\".\n",
"\n",
"\n",
"## Zookeeper\n",
"* HBase makes use of another Apache project, Zookeeper.\n",
"* Zookeeper is a high availability, strongly consistent, totally ordered log-based key-value store.\n",
"* It stores the configuration details for HBase in a durable way, and is meant to be a reusable component that can be plugged into other database architectures.\n",
"* Zookeeper is designed to address a need common to every database implementation, which is providing configuration details in a durabile way.\n",
"\n",
"\n",
"## Data model and architecture\n",
"* The HBase server is sharded into a master service and slave daemons.\n",
"* All of the coordination services live on the master. All of the data lives on the slaves.\n",
"* There are many services in play; the architecture is relatively complex.\n",
"* HBase has the concept of a column family. A column family is a set of commonly group-accessed columns which are grouped together in memory. This improves read performance.\n",
"* HBase also supports having multiple versions of a dataset entry.\n",
"\n",
"\n",
"* Writes are slow (why?).\n",
"* Reads are fast! HBase is the data store of choice for large Hadoop jobs. Hadoop is the job framework specifically designed for running jobs against data at scale!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
