Commit
ResidentMario committed Jun 14, 2018
1 parent 547c270 commit e69f287
Showing 5 changed files with 209 additions and 10 deletions.
Binary file added .DS_Store
Binary file not shown.
2 changes: 1 addition & 1 deletion Chapter 2 --- Data Models.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
Expand Down
100 changes: 91 additions & 9 deletions Chapter 3 --- Storage and Retrieval.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -81,15 +81,97 @@
"* But SSTables have several advantages. The most important are much larger practical memory limits, due to the sparse in-memory index structure, and intrinsic support for range queries.\n",
"* There are many potential performance tuning measures that can be taken.\n",
"* For example, the SSTables architecture can be slow when looking up keys that do not exist in the database, or which have not been written to in a long time, because walking back the memtable and the list of SSTables takes time.\n",
"* You can use a **Bloom filter** (note: implemented in another notebook) to speed this process up. This algorithm is great for approximating set contents."
"* You can use a **Bloom filter** (note: implemented in another notebook) to speed this process up. This algorithm is great for approximating set contents.\n",
"\n",
"\n",
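The Bloom filter mentioned above can be sketched in a few lines. This is a minimal illustrative implementation (the class and parameter names are my own), showing the key property that makes it useful here: a negative answer is definitive, so the memtable/SSTable walk can be skipped entirely for absent keys.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: approximate set membership with
    possible false positives but no false negatives."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from independently salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent" -- no need to walk the SSTables.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True -- no false negatives
print(bf.might_contain("user:9999")) # very likely False (false positives are possible)
```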
"## B-tree\n",
"\n",
"* The most common database implementation, and the one used by most of the \"classic\" SQL engines, isn't a log storage design, it's a **B-tree** design.\n",
"* The B-tree is a data structure which provides balanced in-order key access. It's not dissimilar to a red-black tree, actually.\n",
"* Note: implemented elsewhere.\n",
"* B-trees are organized in terms of **pages**. Each page contains references to pages further down the tree, except for the leaf pages, which inline the data they reference (or pointers to it).\n",
"* Pages are traditionally 4 KB in size, a size chosen to line up with the way memory pages in the underlying hardware work.\n",
"* To look up a value, you start at the root page, whose keys describe a range of values indexed by sort order. You find the key-value pair associated with the index range your value is in. The value will be a pointer to another page, with another range of valid values. Keep burrowing until you get to a leaf page, which contains an exact index location key and a pointer to your desired value. Tada!\n",
"* To insert a value, find the legal spot for it amongst the leaf pages. If any page along the way is too full to accept new values, split it into two half-full pages, and update the parent's references accordingly.\n",
"* Deleting a value is a lot more involved, however.\n",
"* B-trees are balanced trees: for $n$ keys they have a height of $O(\\log{n})$.\n",
"* B-tree operations are not intrinsically crash-safe. To make them safe you need to add a **write-ahead log**, and append the necessary B-tree operations to that log before performing the actual page writes.\n",
"* In contrast to the file append-only operation of the log-structured database, B-tree databases need to do seek-writes to specific memory pages, which is slower. On the other hand, they do not need to perform the background write operations necessary in log-structured databases.\n",
"* Additionally, if you allow for concurrent access (and concurrent modification), you need to introduce locks (specifically, latches on the level of the tree being modified) in order to prevent ongoing writes from causing ongoing reads to return inconsistent data. SSTable designs are much less complex in this regard.\n",
"* Some B-tree optimizations:\n",
" * Instead of overwriting pages and maintaining a write-ahead log you can use a copy-on-write scheme: write a modified copy of the page elsewhere in memory, atomically swap the reference from the old page to the new one, and delete the old page. This improves concurrent read performance, at the cost of write performance. LMDB is an example of a database that implements this.\n",
" * In higher-level pages, instead of storing entire keys, store just the leading bytes of each key. Keys only need to provide enough information to mark the boundaries between child ranges, and smaller keys allow more densely packed pages, a higher branching factor, and ultimately shallower value access. This is sometimes known as the **B+ tree**.\n",
" * B-trees try to store their leaf memory pages in sequential order, obviating the need to seek (for certain access, and only when the hardware complies).\n",
" * Pointers may be added to leaf pages pointing to their immediate left and right companion leaf pages. This increases range query speed as it obviates the need to return to the parent page.\n",
" * Finally there are more complex B-tree variants, like **fractal trees**, which attempt to further reduce seek volume.\n",
" \n",
" \n",
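The lookup walk described above (burrow from the root page down to a leaf) can be sketched with pages modeled as sorted separator keys plus child pointers. This is a toy model, not a real on-disk B-tree; the class and field names are illustrative.

```python
import bisect

class Page:
    """A toy B-tree page: sorted separator keys plus child pointers
    (internal page), or key -> value pairs (leaf page)."""
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted separator keys
        self.children = children  # len(keys) + 1 child pages, or None on a leaf
        self.values = values      # dict of key -> value on a leaf page

def lookup(page, key):
    # Walk from the root page down to a leaf, as described above.
    while page.children is not None:
        # Find which key range the target falls into, then follow that pointer.
        idx = bisect.bisect_right(page.keys, key)
        page = page.children[idx]
    return page.values.get(key)

leaf1 = Page(keys=[], values={"apple": 1, "banana": 2})
leaf2 = Page(keys=[], values={"melon": 3, "pear": 4})
root = Page(keys=["melon"], children=[leaf1, leaf2])

print(lookup(root, "pear"))   # 4
print(lookup(root, "apple"))  # 1
```

The `bisect_right` call is the "find the key range" step; a real implementation would read each page from disk and handle splits on insert.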
"* Comparing B-trees and LSM-trees...\n",
"* B-trees involve seek-writes, so they're slower when write volume is high. This is the primary thing that makes LSM-trees more attractive.\n",
"* B-trees, on the other hand, do not involve periodic compaction. Compaction competes for slow disc I/O resources; it's possible to schedule it for when nothing else is hitting the disc, but hard to do so reliably. This means that, particularly at higher percentiles, B-trees have more reliable performance than LSM-trees do.\n",
"* In general it's hard to predict which of these two structures will perform better on a particular workload. You have to test empirically.\n",
"\n",
"## More index considerations\n",
"* Secondary indices are barely different from primary indices. The only implementation difference is that they allow duplicate keys, so you need to bake a slightly different key. An easy way around this is to append the row number to the key.\n",
"\n",
"\n",
"* The key in a primary index is always an identifier for the thing you are searching for.\n",
"* The value however could be one of two things. Either it's exactly the data in question, or a pointer to that data somewhere else in memory.\n",
"* The latter is the **heap file** approach.\n",
"* The heap file approach is more common. If multiple indices are defined on a particular chunk of data, it avoids duplicating it.\n",
"* An implementation detail is what to do when inserting new data.\n",
"* If the new data is smaller than or the same size as the previous data (as it would be when you have a fixed-width type!), you can overwrite in place.\n",
"* If the new data is bigger, you need to either write to a new location and move the reference, or write to a new location and populate a pass-through pointer.\n",
"* The jump to the heap location is a seek, so it has a time cost. Storing the data directly in the index instead is known as a **clustered index** approach, and it avoids this cost.\n",
"* Secondary indices in a table with a clustered index \"just\" populate references to the primary index.\n",
"* However, a clustered index represents duplication of data, since obviously that data has to exist in the log first. Thus you trade worse write performance for better read performance.\n",
"* Also, it hits size limits much faster, if your objects are large.\n",
"* A compromise between the two is a **covering index**, which stores just *some* of the columns directly in the index. Writes touching those columns are slower, but reads that only need the \"covered\" columns can be answered from the index alone, without following the heap pointer.\n",
"\n",
"\n",
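The heap file approach above can be sketched as an append-only byte stream plus an index of pointers into it. This is a minimal illustration under my own naming; a real database would add framing, durability, and free-space management.

```python
import io

# A toy heap file: rows live in one append-only byte stream; the index
# stores only (offset, length) pointers, not the rows themselves.
heap = io.BytesIO()
index = {}  # key -> (offset, length)

def insert(key, row_bytes):
    offset = heap.seek(0, io.SEEK_END)  # append at the end of the heap file
    heap.write(row_bytes)
    index[key] = (offset, len(row_bytes))

def get(key):
    offset, length = index[key]
    heap.seek(offset)  # this seek is the extra cost a clustered index avoids
    return heap.read(length)

insert("user:1", b"alice,30")
insert("user:2", b"bob,25")
print(get("user:1"))  # b'alice,30'
```

A secondary index could map a different key to the same `(offset, length)` pair, which is exactly how the heap file avoids duplicating the row data.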
"* Finally, there are multi-column indices. These speed up queries that pivot on several columns at once.\n",
"* The most common type of multi-column index is the **concatenated index**, which is keyed using a sequence of values. The semantics are slightly different, but the structure is mostly the same.\n",
"\n",
"\n",
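A concatenated index can be sketched as a sorted list of tuples. The column names and data here are made up; the point is that the fixed key order makes prefix queries on the leading column(s) efficient, while a query on a trailing column alone gets no help.

```python
# A toy concatenated index: the key is a tuple of column values in a fixed order.
rows = [
    ("Smith", "Alice", 30),
    ("Smith", "Bob", 25),
    ("Jones", "Carol", 41),
]

# Keyed on (last_name, first_name); sorting makes prefix scans possible.
concat_index = sorted((last, first) for last, first, _ in rows)

# A query on the leading column works like any single-column index scan...
smiths = [k for k in concat_index if k[0] == "Smith"]
print(smiths)  # [('Smith', 'Alice'), ('Smith', 'Bob')]

# ...but a query on first_name alone cannot use the sort order at all.
```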
"* There are more complex index structures, like R-trees and the structures that full-text search engines use. But they're slightly out of scope for now! Something to investigate further down the line.\n",
"\n",
"## OLAP systems\n",
"* Databases in production can generally be split into two access patterns. **OLTP** systems (short for online transaction processing) are the original use case. **OLAP** systems, short for online analytics processing, are the newer use case.\n",
"* The operational reasons for splitting OLTP from OLAP I know too well!\n",
"* OLAP systems are less creative than OLTP systems in terms of overall design. Almost all applications use some variant of the **star schema**.\n",
"* In the star schema there is a central **fact table** that acts as an access point, and which holds foreign keys into a bunch of surrounding tables.\n",
"* Each of the surrounding tables is known as a **dimension table**, and it provides details about one specific aspect of the events in the fact table: who, what, where, when, how, why?\n",
"* A more complex and normalized variant of this schema is the **snowflake schema**, in which the dimension tables branch further into sub-dimension tables.\n",
"* These arrangements are so named because they look like a star or a snowflake surrounding the central fact table.\n",
"\n",
"\n",
"* OLAP systems ingest data every once in a while. Meanwhile, the queries that are run against them are often very heavy, touching hundreds of thousands and potentially billions of rows.\n",
"* One way to increase query efficiency is to maintain a **materialized view** of oft-used summary statistics. This is known as a **data cube**.\n",
"* In a data cube you choose a certain number of dimensions $n$ (an $n$-fold data cube), and then precompute a summary statistic for every combination of values along those dimensions.\n",
"* If those statistics are used often, it may be worth paying that cost up-front once at ingestion time, instead of having to rebuild the aggregate every time a query using that summary comes in.\n",
"\n",
"\n",
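A tiny data cube can be sketched as a dictionary of precomputed sums, one entry per cell plus the marginal totals along each dimension. The fact rows and dimension names below are invented for illustration.

```python
from collections import defaultdict

# Toy fact rows: (date, product, units_sold)
facts = [
    ("2018-06-01", "widget", 3),
    ("2018-06-01", "gadget", 5),
    ("2018-06-02", "widget", 7),
]

# A 2-fold data cube: precompute the SUM aggregate for every (date, product)
# cell, plus the marginal totals along each dimension (None = "all values").
cube = defaultdict(int)
for date, product, units in facts:
    cube[(date, product)] += units  # the individual cell
    cube[(date, None)] += units     # total per date
    cube[(None, product)] += units  # total per product
    cube[(None, None)] += units     # grand total

print(cube[("2018-06-01", None)])  # 8  -- served from the cube, no scan of facts
print(cube[(None, "widget")])      # 10
```

At ingestion time the cube is updated incrementally; queries that hit a precomputed cell never touch the underlying fact rows.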
"## Column-oriented storage\n",
"\n",
"* While the OLTP access pattern is row oriented, the OLAP access pattern is column-oriented. For efficient OLAP architecture, you want to provide locality on the columns, *not* the rows.\n",
"* Thus the natural OLAP adaptation: **column-oriented stores**.\n",
"* Column-oriented databases fundamentally lay columns contiguously in memory, instead of rows.\n",
"\n",
"\n",
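The row-oriented versus column-oriented layouts can be contrasted with plain Python lists (the table and column names are made up). The point is locality: an OLAP aggregate touches one contiguous column rather than every row.

```python
# The same three rows, in the usual row-oriented layout...
rows = [
    {"id": 1, "color": "red",  "qty": 10},
    {"id": 2, "color": "blue", "qty": 20},
    {"id": 3, "color": "red",  "qty": 30},
]

# ...and in a column-oriented layout: one contiguous list per column.
columns = {
    "id":    [r["id"] for r in rows],
    "color": [r["color"] for r in rows],
    "qty":   [r["qty"] for r in rows],
}

# An OLAP-style aggregate reads only the single column it needs:
total_qty = sum(columns["qty"])
print(total_qty)  # 60
```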
"* Additionally, ordering data in terms of columns unlocks columnar compression.\n",
"* There are various schemes that you can use. One of the most common is **bitmap encoding**, which computes a dummy (one-bitmap-per-distinct-value) encoding of the column and stores that.\n",
"* When columnar cardinality is low, bitmaps are very memory-efficient. They are also essentially **vectorized** already, so operations on them (like selection with `OR` clauses) are extremely fast, making common queries much faster.\n",
"* You can compress further using something along the lines of **run-length encoding**.\n",
"* Compression is particularly important in OLAP systems because these may need to scan millions of entries at a time. Sequential disk I/O is the bottleneck, not entry look-ups, so reducing the memory footprint is crucial.\n",
"* Compression is a whole other topic worth exploring.\n",
"\n",
"\n",
"* When data is organized in column order, there is no intrinsic row sort order.\n",
"* Leaving the rows unsorted improves write performance, as it results in a simple file append.\n",
"* Sorting the rows improves read performance when you *do* need to query specific rows in a column-oriented database. You can multi-sort by as many columns as desired, obviously, but sort keys beyond the first will only help when performing grouped queries.\n",
"* Additionally, sorting keys can help greatly with compression. Especially on the first sort-order key, something like run-length encoding can result in incredible read performance. With low enough cardinality multiple gigabytes of data can get pushed down to mere kilobytes in size!\n",
"* A clever idea is to actually maintain multiple sort orders on disc, by replicating the data with several different sorts. Obviously this takes the trade of write performance for read performance as far as it will go. Vertica is one example of a database that offers this feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand All @@ -108,7 +190,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,
Expand Down
56 changes: 56 additions & 0 deletions Chapter 3.1 --- Memcached.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Memcached\n",
"\n",
"## Mission statement\n",
"* Memcached is a free and open-source in-memory key-value store.\n",
"* It was originally developed by and for the LiveJournal website, back in the day.\n",
"* It's almost a database, but not quite one. It differs in that it's not persistent. Memcached is allotted a certain amount of memory, and when that memory limit is reached, memcached starts deleting old values to make room for new ones.\n",
"* This deletion occurs using the simple \"Least Recently Used\", or **LRU**, **caching strategy** (in this context this is referred to as the **eviction mode**).\n",
"* When you insert data into memcached that you want to access later, you cannot assume that it will still be there.\n",
"* The tradeoff is that memcached is ridiculously fast, and has a simple, easy-to-understand architecture.\n",
"* Memcached is thus meant to be used as a **caching layer**. Put it in front of your production database to speed up your queries!\n",
"\n",
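The LRU eviction mode described above can be sketched with an ordered dictionary. This is a toy model of the behavior, not Memcached's actual slab-based implementation; the class name is illustrative.

```python
from collections import OrderedDict

class LRUStore:
    """Minimal sketch of Memcached-style eviction: bounded capacity,
    least recently used entry evicted first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key not in self.data:
            return None  # a cache miss: callers must handle absence
        self.data.move_to_end(key)  # mark as recently used
        return self.data[key]

cache = LRUStore(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touch "a", so "b" becomes the LRU entry
cache.set("c", 3)      # over capacity: evicts "b"
print(cache.get("b"))  # None -- you cannot assume your data is still there
print(cache.get("a"))  # 1
```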
"## Data model\n",
"* Memcached uses a simple hash map log structured storage architecture (this is the simplest practical database architecture; see Chapter 3 notes for more).\n",
"* Memcached supports clusters containing multiple nodes. A hash is computed on the data being inserted in order to determine which node the data gets sent to. Then the data is hashed again for storage.\n",
"* This is a **shared nothing architecture**. The client knows the locations of the nodes, obviously, but the nodes know nothing about each other, and do not share any resources.\n",
"* Memcached is an **in-memory data store**. The nodes are meant to be volatile memory resources.\n",
"* There is no type support. A **word** in memcached is a byte.\n",
"* As mentioned in the previous section, a least recently used caching strategy is used to purge old data when the service reaches its size limit.\n",
"\n",
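The client-side node selection described above can be sketched as hashing the key to pick a node. The node addresses are invented; note that real Memcached clients often use consistent hashing (e.g. ketama) rather than this simple modulo scheme, so that adding a node remaps fewer keys.

```python
import hashlib

# Shared-nothing routing sketch: the *client* picks the node by hashing the
# key; the nodes never talk to each other. Addresses are illustrative.
nodes = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def node_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Every client computes the same node for the same key:
print(node_for("session:42") == node_for("session:42"))  # True
```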
"## Security\n",
"* Memcached emphasizes brutal simplicity and efficiency. But in the case of security that simplicity apparently makes it eminently hackable.\n",
"* Memcached uses a flat security model, with privileges applying to lots of things all at once. For example, if you have write access, you have all of the write access; same with reads.\n",
"* When deployed on an unsecured network, it's very easy for external actors to get to, inspect, and even modify a memcached service.\n",
"* Memcached over UDP is a particular problem. This feature was eventually disabled by default, but for a while you could apparently access cache data for a lot of public websites!\n",
"* Since Memcached has such lousy security, you should implement your own security layer in front of it."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
61 changes: 61 additions & 0 deletions Chapter 3.2 --- Redis.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Redis\n",
"\n",
"## Mission statement\n",
"* Redis is a free and open source in-memory key-value store.\n",
"* It is like a more advanced version of memcached, the subject of the previous section.\n",
"* Redis stands for \"REmote DIctionary Server\".\n",
"* It was originated by a guy at VMware, and has since spun off twice to a dedicated maintainer in Redis Labs.\n",
"* Like memcache it is designed to be blazing fast, and most often used as a cache layer.\n",
"* Persistence to disc is configurable via either writing to a log or by dumping to disc (snapshotting) at regular intervals.\n",
"* Thus Redis is great for blazing-fast, *mostly* consistent storage.\n",
"* You can use it as a cache by disabling the persistence layer entirely (gotta go fast).\n",
"\n",
"## Data model\n",
"* The Redis data model is very similar to the memcached one.\n",
"* cluster config?\n",
"* A word in Redis is a string. Interesting choice. Redis supports a variety of structures involving strings: hashes, lists, sets, sorted sets, bitmaps, hyperloglogs, and geospatials (via geohashes).\n",
"\n",
"\n",
"* The cluster is managed as a sequence of masters and slaves.\n",
"* When the network is fully operational, masters asynchronously send key-store modification traffic to the slaves, which replicate the data locally themselves.\n",
"* If a network partition occurs (a timeout or something else), in order to heal, the slaves will ask the master first for a \"partial synchronization\", where all of the missed updates are batched into one send, and then, if that is not possible (due to a high volume of updates not received), request a \"full synchronization\", which requires a much slower backup-and-push on the part of the master.\n",
"* This synchronization configuration has high performance, but also potential replication lag, fitting the Redis philosophy perfectly.\n",
"* You can optionally request synchronous replication to a specific number of nodes. However, in the case of a failover, in some cases it's still possible to lose that update. Redis is not for persistence!\n",
"* A master can have multiple slaves.\n",
"* Slaves can be connected to one another. Replication traffic then flows between them in exactly the same manner as master-slave traffic.\n",
"* Replication is non-blocking on the master side, except in the case of a failure requiring full synchronization to recover from.\n",
"* It is lightly blocking on the slave side: by default slaves will serve using the old copy of the dataset, but there is a brief period when the slave must switch to the new dataset version (delete old data, insert new data) where it blocks.\n",
"* Why replicate at all? Reliability, for one thing, but additionally because slaves can be used to farm off long-running requests.\n",
"* Since slaves are full dataset replicas, this architecture is obviously insufficient for large data volumes, as it results in a lot of duplicate data!\n",
"* High-availability and automatic clustering are available via a few different feature staple-ons."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
