{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Google File System\n", "\n", "* Google File System (GFS) is a distributed file system abstractions that is used at Google.\n", "* It is another Jeffrey Dean invention, dating from around 2003 or so.\n", "* It was replaced in production by another distributed file system, Collusus, is 2010.\n", "* Still the paper on this subject is considered very influential.\n", "* Some argue that GFS is the true innovation in big-data processing in the early-mid 2000s, not Hadoop...\n", "\n", "\n", "* GFS was built with the assumption of constant failures in a large number of commodity parts.\n", "* It assumes huge (in 2003 this meant multi-gigabyte) files.\n", "* It assumes mostly append and sequential scan based workloads, not random writes, against those huge files. This emphasizes I/O speed (reliably sustained I/O moreso than latency) and deemphasizes caching.\n", "* It needs to have native support for concurrent reads (and writes?).\n", "\n", "\n", "* There is one master process and and a large number of worker processes. The latter are called chunkservers.\n", "* Data is split in chunks of a certain size, and each chunk is assigned a globally unique ID.\n", "* The master sends the chunks out to the chunkservers for storage. Usually three copies per chunk are stored, to account for possible hardware failures.\n", "* The master polls chunkservers for health checks.\n", "* The master maintains all top-level processes, like chunk migration, metadata storage, chunk mapping, garbage collection, and so on.\n", "\n", "\n", "* GFS does not provide a POSIX API. You are expected to interact with it via a client.\n", "* At read time the client asks the master for the locations of the chunks that constitute the file. It then goes and asks the chunkservers for the chunks, using the ID list the master created.\n", "* Since the chunks are on a network of machines you can get multiple concurrent I/O, and the file-reading operation is naively only as slow as the slowest chunk read!\n", "\n", "\n", "* GFS originally used a 64 MB chunk size, huge by OS standards.\n", "* Each chunk is stored as a Linux file.\n", "* This leads to file system fragmentation. To deal with it, GFS uses **lazy space allocation**. Basically, chunks smaller than 64 MB are sometimes kept in memory. Chunks are only written once they hit 64 MB in size. This minimizes fragmentation as much as possible.\n", "* Larger chunk sizes reduce load on the master node, which is an important design consideration.\n", "* However, this overall design can lead to hotspots, when there are chunkservers hosting many small files.\n", "\n", "\n", "* The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.\n", "* For fast access this metadata is kept in-memory. The first two are made persistent, in the event of the master crashing, by writing to a local log file (which is itself replicated elsewhere). E.g. what Apache Zookeeper and other durable metadata stores do.\n", "* The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins or leaves the cluster.\n", "* This last thing was done to make keeping up with cluster changes easier.\n", "\n", "\n", "* File creation is atomic, and is handled exclusively by the master.\n", "* File writes and appends are treated under a concurrent write model. Appends occur at an offset of GFS's choosing, which may be past the \"true last byte\" of the file. Duplicate records and padding in between concurrent write regions may be inserted by GFS.\n", "* In a multi-mutation environment (e.g. concurrent writes and appends) GFS guarantees that the last mutation goes through correctly (so if multiple writes change the same chunk, at worst the last chunk write will be reflected).\n", "* GFS achieves this by applying mutations to a chunk in the same order on all its replicas and by using chunk version numbers to detect stale replicas that have missed mutations due to being down, and garbage collecting those.\n", "\n", "\n", "* This is weak consistency, because it is not guaranteed that all writes succeed!\n", "* The trade-off is concurrent writes, and hence higher possible sustained I/O volume.\n", "* Dealing with weak consistency because the problem of application code.\n", "* What you can do in the face of such an unreliable filesystem:\n", " * Don't mutate, append.\n", " * If you are a writer process, buffer data in-memory (or perhaps on disc) and keep it buffered until absolutely sure the write succeeded.\n", " * If you are a reader process, establish checkpoints and treat data from the checkpoints instead of reading from the disc.\n", "\n", "\n", "* How is this achieved?\n", "* When a mutation on a chunk appears the master hands a lease to one of workers handling that chunk.\n", "* The lease lasts 60 seconds. For all ensuing in-flight mutations, that worker designates the order of operation for the writes and applies them.\n", "* The backup workers follow suit.\n", "* The lease may expire or, if mutations are still in flight, it may be renewed. Leases can be renewed indefinitely.\n", "* This communication design ensures that write operations follow a total order that is the same on every node, whilst handing off as much work as possible away from the master and to the chunk node.\n", "\n", "* Interestingly, control flow is decoupled from data flow.\n", "* In terms of data flow, every chunkserver that having the same chunk is arranged in a linear chain on the network, regardless of which one is the current lease owner (leader).\n", "* Mutation data is streamed down this pipe.\n", "* The client connects to the master, is given a leader, and then connects to the leader. The leader determines write order and sends that control signal to the follower nodes.\n", "* If the data signal arrives before the corresponding control signal, the data is just held in place until the control signal comes through. Similarly, if the control signal comes before the data signal, the control is readied for eventual arrival.\n", "\n", "\n", "* GFS defines two special non-POSIX operations, atomic append and snapshot.\n", "\n", "\n", "\n", "* Atomic append is an append that is guaranteed to succeed at least once. It occurs at a GFS-determined offset, however.\n", "* Atomic append checks if the write would cause the chunk to exceed the maximum size. If so, it pads the chunk to the maximum size (to avoid fragmentation on-disc) and tells the client to write to a new chunk. If not, the leader performs the append, signals to the followers to do the same, and returns a success to the client.\n", "* To minimize worst-case file padding behavior GFS will only allow atomic appends at 1/4 of the chunk size.\n", "* A failed record append on a replica will cause a retry. As a result, replicas of the same chunk may contain different and partially or fully duplicated records.\n", "* In other words GFS only guarantees at least one successful write, it does not guarantee that data is located at the same byte offset in all three files.\n", "* GFS handles these so-called inconsistent regions in a separate process.\n", "\n", "\n", "* Notes on snapshot go here.\n", "* The snapshot operations makes a fast copy of a file or directory tree. It is designed to have minimal impact on write operations.\n", "* Snapshots are essentially an implementation of **copy-on-write** in a distributed manner.\n", "* When it recieves a snapshot request for a chunk, the master first revokes the lease for it, or if it cannot communicate with the leader node it waits for the lease to naturally expire.\n", "* This prevents write operations during the snapshot occurring without the master's knowledge thereof, as any new writes will require a new lease.\n", "* The master proceeds as normal until a lease request for a snapshotted chunk comes in.\n", "* At that point the master designates a leader chunkserver and tells it to replicate that chunk locally (locally to avoid an unnecessary network move).\n", "* Once local replication has succeeded, the master gives the client application a reference to the *replicated* chunk, instead of the original chunk. It also updates the followers to point at the new copy. The original node is now a snapshot!\n", "\n", "\n", "* Some concerns around rackwise replica placement and the priorization of replication in the face of node loss, and how the system handles durability, follow.\n", "\n", "\n", "* What succeeded GFS: http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }