ResidentMario · Jun 18, 2018 · 0d8d5e0
1 parent 778b417
commit 0d8d5e0
Showing 1 changed file with 94 additions and 35 deletions.
diff --git a/Databases Talk Notes.ipynb b/Databases Talk Notes.ipynb
@@ -16,22 +16,27 @@
  " * Transactions --- transactions are thought of as a common database feature. But they are not an intrinsic thing!\n",
  "\n",
  "\n",
- "* Classical transactional databases.\n",
- " * ACID --- atomic, consistent, isolated, durable.\n",
- " * Originally invented for the online transaction processing, or OLTP, use case.\n",
- " * That explains a lot about why these properties, particularly the second one, are important.\n",
+ "* Simplest possible database: append to a file.\n",
+ "* Simple log-structured database.\n",
+ "* Add log merges.\n",
+ "* This is Bitcask!\n",
+ "* Why are log files good? Append operation is fast.\n",
  "\n",
  "\n",
- "* External data model\n",
- " * Data is modeled using tables.\n",
- " * SQL --- Structured Query Language is provided as an abstraction.\n",
- " * SQL queries reconstruct data of interest by performing joins between tables.\n",
- " * SQL is what is called a \"declarative query language\". They are designed to look kind of like English sentences.\n",
- " * Why do you need SQL at all? So that database writing skill is portable between databases.\n",
- " * Why does it need to be declarative? To hide the details of database internals (the internal data model) from you.\n",
- " * Why is this useful? Because queries that you write against databases are not immediately executed; instead they are pushed through a query optimizer, which finds ways to speed your query up by using its knowledge of the underlying hardware and software layers.\n",
- " * SQL says what you want, not how you want. The alternative is imperative - like writing a program.\n",
- " * Unfortunately this abstraction is a bit leaky, due to vendors implementing things as they like them.\n",
+ "* More complicated arch: SSTables (sorted string tables).\n",
+ "* Use an in-memory structure (a memtable) that accepts writes in any order but iterates in sorted order, so that the log files it writes out are key-sorted.\n",
+ "* Key-sorted log files are much faster to merge.\n",
+ "* If a sudden shutdown occurs, the in-memory data is lost. A small write-ahead log is used to recover the state of the memtable after a failure.\n",
+ "\n",
+ "\n",
+ "* Final architecture: B-trees.\n",
+ "* B-trees keep data sorted in pages; each page is reached by following a reference from its parent page.\n",
+ "* B-trees also use a write-ahead log.\n",
+ "* To add an item to a page that is already full, split the page in two.\n",
+ "* Deleting an item is more complicated.\n",
+ "* B-trees involve random seek-writes, but do not necessitate a log merge process.\n",
+ "* So more predictable performance, but worse median write performance.\n",
+ " \n",
  " \n",
  "* Example transactional databases\n",
  " * SQL databases are not the same, but their user-facing design is the same.\n",
@@ -47,42 +52,96 @@
  " * Transactions.\n",
  " * Broad SQL implementation.\n",
  " * Open-source.\n",
- " * OracleDB?\n",
+ " * OracleDB\n",
  " * Complicated.\n",
  " * Concurrent.\n",
  " * Transactions.\n",
  " * Broad SQL implementation.\n",
  " * Closed-source.\n",
  "\n",
  "\n",
+ "* Classical transactional databases.\n",
+ " * ACID --- atomic, consistent, isolated, durable.\n",
+ " * Originally invented for the online transaction processing, or OLTP, use case.\n",
+ " * That explains a lot about why these properties, particularly the second one, are important.\n",
  "\n",
  "\n",
- "* Simplest possible database: append to a file.\n",
- "* Simple log-structured database.\n",
- "* Add log merges.\n",
- "* This is Bitcask!\n",
- "* Why are log files good? Append operation is fast.\n",
+ "* External data model\n",
+ " * Data is modeled using tables.\n",
+ " * SQL --- Structured Query Language is provided as an abstraction.\n",
+ " * SQL queries reconstruct data of interest by performing joins between tables.\n",
+ " * SQL is what is called a \"declarative query language\". Such languages are designed to look kind of like English sentences.\n",
+ " * Why do you need SQL at all? So that database writing skill is portable between databases.\n",
+ " * Why does it need to be declarative? To hide the details of database internals (the internal data model) from you.\n",
+ " * Why is this useful? Because queries that you write against databases are not immediately executed; instead they are pushed through a query optimizer, which finds ways to speed your query up by using its knowledge of the underlying hardware and software layers.\n",
+ " * SQL says what you want, not how to get it. The alternative is imperative, like writing a program.\n",
+ " * Unfortunately this abstraction is a bit leaky, due to vendors implementing things as they like them.\n",
  "\n",
  "\n",
- "* More complicated arch: SSLTables.\n",
- "* Use a memcache which supports random order, sorted iteration to write log files that are key-sorted.\n",
- "* Log files are now much faster to merge.\n",
+ "* On-disk key-value stores\n",
+ " * A key-value store is just a hash map. Just like in your favorite programming language!\n",
+ " * Explain how hash maps work, briefly. \n",
+ " * Bitcask is an example. It uses a log-structured file for storage.\n",
+ " * This is technically NoSQL.\n",
+ " * But it's a bit _too_ simple to be useful.\n",
+ " * This arrangement is mainly used by databases that back e.g. persistent configuration services.\n",
+ " * For example, Apache ZooKeeper.\n",
  "\n",
  "\n",
- "* Final architecture: B-trees.\n",
- "* B-trees sort data into pages, which are accessed from previous pages.\n",
- "* B-trees involve random seek-writes, but do not necessitate a log merge process.\n",
- "* So more predictable performance, but worse median write performance.\n",
+ "* In-memory key-value stores\n",
+ " * If you do not need persistence, you can use an in-memory database.\n",
+ " * Why? Volatile memory (RAM) is much, much faster than disk.\n",
+ " * This is what Memcached and Redis do.\n",
+ " * The memory structures these databases use are also just hash maps.\n",
+ " * Explain how hash maps work, briefly.\n",
+ " * The disadvantage is obviously that if the computer shuts down suddenly, everything in the database will be gone.\n",
+ " * You can make a semi-persistent database by writing to disk occasionally, but the more often you do this the slower your database gets.\n",
+ "\n",
+ "\n",
+ "* Document store\n",
+ " * A document store is a kind of more complicated key-value store that treats everything in the database as a document.\n",
+ " * Documents are basically JSON blobs (though they're not necessarily stored that way on disk).\n",
+ " * Consider a transactional database. This database is schema-on-write, in that all of the data that you store in it has to be put in a certain format (in terms of rows and columns) when you store it.\n",
+ " * A document database meanwhile contains documents of any format. You can only know for sure what you have when you parse a document.\n",
+ " * On the one hand, the idea of a document is much more flexible than the idea of a table.\n",
+ " * On the other hand, it naturally leads to more complex code, as you need to be aware of and catch any document model differences.\n",
+ " * Use a document database when you often need to access documents \"all at once\". For this use case they have much better locality than relational databases, which have to look in many more places when they join data.\n",
+ " * However, do not use it for very complex applications and expect it to work well.\n",
+ " * MongoDB is an example document store.\n",
+ "\n",
+ "\n",
+ "* Wide-column stores\n",
+ " * A wide-column store is a two-dimensional key-value store.\n",
+ " * It's in between transactional databases and document stores in shape.\n",
+ " * True column stores provide locality within each column, whereas wide-column stores provide locality within each individual record.\n",
+ " * Example wide-column stores are Cassandra and HBase.\n",
+ "\n",
+ "\n",
+ "* Column-oriented stores\n",
+ " * Introduce OLAP versus OLTP.\n",
+ " * OLAP wants to work on a few columns across tons of rows at a time, not on whole rows one at a time. An architecture that provides locality on columns is better.\n",
+ " * Column-oriented stores basically invert row-based databases.\n",
+ " * Druid is one example of these.\n",
+ " \n",
+ " \n",
+ "* Graph database\n",
+ " * Good for dealing with highly interconnected data and graph searches.\n",
+ " * Examples: JanusGraph, Neo4j.\n",
  "\n",
- "* NoSQL databases use different data models."
+ "\n",
+ "* Stream-oriented database\n",
+ " * This is a category I'm inventing for TrailDB.\n",
+ " * TrailDB is an interesting data model.\n",
+ " * Explain the data model.\n",
+ " * This is meant to make it convenient to query user activity or other time-series chained events.\n",
+ "\n",
+ "\n",
+ "* Point is that there are LOTS of different database models!\n",
+ "\n",
+ "\n",
+ "* One last thing worth understanding is the CAP theorem.\n",
+ "* Discuss this."
  ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
  }
  ],
  "metadata": {