Design notes

Assumptions

  • Data is tabular in nature
  • Most datasets will be between 100 MB and 50 GB
  • Data will be served in JSON and CSV (at a minimum)
    • CORS and JSONP are supported (see the sketch after this list)
  • Each dataset has a specified set of fields that can be filtered on
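
As a rough sketch of the CORS side of that assumption (JSONP would instead wrap the JSON body in a caller-supplied callback), a servlet filter like the hypothetical one below could add the needed header. The class name and behavior are illustrative, not part of any existing codebase.

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Hypothetical filter: adds a permissive CORS header so browser clients on
// other origins can read the JSON responses directly.
public class CorsFilter implements Filter {

    public void init(FilterConfig config) {
        // No configuration needed for this sketch.
    }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        response.setHeader("Access-Control-Allow-Origin", "*");
        response.setHeader("Access-Control-Allow-Methods", "GET, OPTIONS");
        chain.doFilter(req, res);
    }

    public void destroy() {
        // Nothing to clean up.
    }
}
```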

High-level decisions

  • The API server should run on top of the Java Virtual Machine (JVM)
    • Java is a well-known enterprise platform
    • Java provides threading support and high performance
  • The API should be backed by the Mongo database (MongoDB)
    • Mongo's auto-sharding makes horizontal scaling easy
    • Mongo's aggregation framework makes complex data queries easy
  • The API should draw ideas from, but not adhere to, current standards
    • There is no one standard to use for serving data
    • Current standards are either overly complex, incomplete, or specific to certain applications
    • The two to draw from are Google's Dataset Publishing Language (DSPL) for loading datasets and Socrata's Open Data API (SODA) for querying datasets
  • The API should be discoverable
    • Use the Swagger method for describing the API to let others visualize and test the API
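
To make that concrete, here is a rough sketch of what a Swagger-described endpoint might look like, assuming JAX-RS plus the swagger-core annotations of that era; the resource name, parameters, and annotation attributes are illustrative and may differ by version.

```java
import com.wordnik.swagger.annotations.Api;
import com.wordnik.swagger.annotations.ApiOperation;
import com.wordnik.swagger.annotations.ApiParam;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;

// Hypothetical dataset resource: the annotations let swagger-core publish a
// machine-readable description that swagger-ui can render for browsing and
// test calls against the live API.
@Path("/datasets")
@Api(value = "/datasets")
@Produces("application/json")
public class DatasetResource {

    @GET
    @ApiOperation(value = "List rows in a dataset, optionally filtered")
    public String list(
            @ApiParam(value = "Dataset name") @QueryParam("dataset") String dataset,
            @ApiParam(value = "Maximum rows to return") @QueryParam("limit") int limit) {
        // A real implementation would translate the parameters into a
        // MongoDB query and serialize the results.
        return "[]";
    }
}
```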

Why the Java Virtual Machine?

The Java Virtual Machine is a very mature platform (almost 20 years old) that is widely used both inside the government and in industry. It is well supported, and deployment and operations strategies are well known.

For this particular project, the performance characteristics of the JVM are important. The JVM is very fast compared to the runtimes for dynamic languages such as Python; it is comparable to C code in speed. In addition, it has support for multi-processor threading, which most dynamic language runtimes do not.
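
As a small illustration of that threading point (standard library only; the request-handling body is a placeholder), a fixed-size thread pool can spread independent work across every available core:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: the JVM can run independent request handlers in
// parallel across all cores with a plain thread pool, with no single
// interpreter lock limiting concurrency.
public class ParallelHandlers {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < 8; i++) {
            final int request = i;
            pool.submit(new Runnable() {
                public void run() {
                    // Placeholder for parsing a query and hitting the database.
                    System.out.println("handled request " + request
                            + " on " + Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
    }
}
```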

The JVM was originally created to run the Java programming language, which could be used for this project. In addition, many other languages run on the JVM, allowing us to choose the right language for each part of the API project.

Why MongoDB?

We compared MongoDB to two other databases, Postgres and Cassandra. Postgres was chosen because it is the best of the open-source relational databases, while Cassandra represented another take on the NoSQL database type.

Data fit

Before discussing performance and architecture, let’s discuss why MongoDB and Postgres are good fits for the data. Both are typical table-structured databases (MongoDB is schema-less, but uses “collections” and “fields” in much the same way as tables and columns). Since the data we ingest is the same sort of data (CSV or other spreadsheet-like, probably denormalized formats), both are a natural fit. MongoDB allows each row to have differing fields, which is great for less-structured data; Postgres’s hstore extension would fill the same role.
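
For example, a quick sketch with the 2.x MongoDB Java driver of that era shows two records in the same collection carrying different fields; the collection and field names here are made up for illustration.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

// Sketch: two "rows" in the same collection can carry different fields,
// which suits loosely structured, denormalized CSV imports.
public class DataFitExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost");
        DBCollection rows = mongo.getDB("datasets").getCollection("complaints");

        rows.insert(new BasicDBObject("state", "VA").append("product", "Mortgage"));
        rows.insert(new BasicDBObject("state", "TX")
                .append("product", "Credit card")
                .append("sub_product", "Store card"));  // extra field, no schema change

        mongo.close();
    }
}
```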

Cassandra was thrown out at this point because it cannot sort data in the database. While its column-oriented nature might be great for some data, and its resiliency is much better than Mongo’s, this is a fatal flaw for data that will be manipulated and sorted via the API in ways we cannot always expect.

MongoDB’s query language is JSON-based, which allows for composability. This will make constructing queries through API parameters much easier than constructing SQL.
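
A sketch of that composability, again assuming the 2.x Java driver and hypothetical parameter names: each API filter appends one clause to a query document, with no string concatenation or SQL escaping involved.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

// Sketch of composing a MongoDB query document from API parameters.
public class QueryBuilderExample {
    public static DBObject build(String state, Integer minYear) {
        BasicDBObject query = new BasicDBObject();
        if (state != null) {
            query.append("state", state);
        }
        if (minYear != null) {
            query.append("year", new BasicDBObject("$gte", minYear));
        }
        return query;  // e.g. { "state" : "VA", "year" : { "$gte" : 2010 } }
    }

    public static void main(String[] args) {
        System.out.println(build("VA", 2010));
    }
}
```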

Performance

MongoDB and Postgres have very similar performance characteristics with medium amounts of data (< 1,000,000 rows). With proper indexing, queries respond very quickly.
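
For instance, an index matching a common filter-and-sort pattern can be created up front; a minimal sketch with the Java driver, with illustrative field names:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

// Sketch: index the fields the API filters and sorts on so medium-sized
// collections can answer queries without scanning every document.
public class IndexExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost");
        DBCollection rows = mongo.getDB("datasets").getCollection("complaints");

        // Compound index matching a filter on state plus a sort by year.
        rows.createIndex(new BasicDBObject("state", 1).append("year", -1));

        mongo.close();
    }
}
```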

When the dataset grows larger than the available memory, both MongoDB and Postgres lose performance quickly; for both, performance drops by an order of magnitude. Postgres does a little better in this category because it stores data much more compactly. MongoDB keeps the field names with each record, which eats up memory, so Postgres does not run up against the memory limit nearly as quickly as MongoDB.

Scaling architecture

This area is where MongoDB seals the deal for me. Before I go into MongoDB, let me talk about what an ideal situation would look like. Ideally, data would be spread across M nodes, with each datum stored in N nodes, where M >= N. This allows individual nodes to go down and come back up without sacrificing the uptime of the entire group.

MongoDB does things a little differently. You have two scaling notions: sharding and replication. Sharding is used when you have more data than can fit in memory. You choose a shard key in your data: one of the fields that every record has. This shard key has to be chosen very carefully. It needs high cardinality (that is, there need to be many different values in the data for this field) so that the data can be split, since it will be split into chunks based on the shard key. It is also good to have a high degree of randomness in the shard key if you are going to be writing a lot, since data is stored in sorted chunks. Lastly, queries are much faster if they include the shard key. This makes sharding work great for pulling individual records, and much less well for free-form data queries, which we will have.
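
For reference, a sketch of how a shard key would be declared through the admin commands from the Java driver, run against a mongos router; the host, database, collection, and key field are placeholders:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.MongoClient;

// Sketch: enable sharding on a database and pick the shard key for one
// collection. The key should be a field every record has and that most
// queries filter on.
public class ShardingExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("mongos-host");
        DB admin = mongo.getDB("admin");

        admin.command(new BasicDBObject("enableSharding", "datasets"));
        admin.command(new BasicDBObject("shardCollection", "datasets.complaints")
                .append("key", new BasicDBObject("complaint_id", 1)));

        mongo.close();
    }
}
```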

Unfortunately, Postgres doesn’t do much better for us here. Sharding is not automatic in Postgres, as it is in MongoDB, and it has the same issues. The key to scaling Postgres queries is to get as much memory as possible.

Both MongoDB and Postgres scale reads and deal with resiliency in the same manner: replication. You have multiple servers with the same data and allow reads from any of them.
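
On the MongoDB side, a sketch of spreading reads across a replica set with the Java driver; the connection URI and replica set name are placeholders:

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.ReadPreference;

// Sketch: with a replica set, the driver can route reads to secondaries,
// spreading read load across the copies of the data.
public class ReplicaReadExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient(new MongoClientURI(
                "mongodb://host1,host2,host3/?replicaSet=rs0"));
        mongo.setReadPreference(ReadPreference.secondaryPreferred());
        // Queries issued through this client may now hit any secondary.
        mongo.close();
    }
}
```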

MongoDB’s sharding has real resiliency problems by itself, as well as high resource requirements. If one shard goes down, you lose your ability to query and read. In my tests, the sharding doesn’t heal itself well, either: bringing a shard back up did not result in reconnection. Replica sets do work with sharding, though, and according to the documentation, automatic failover will happen. This is good enough, but will require a lot of monitoring.

Summary

MongoDB is a good choice, but will require more resources, both from a system and ops perspective. It might make less sense if we didn’t have the ability to work with 10gen, who can help us scale. Postgres would be less resource intensive and deal with larger datasets before losing performance, but only really scales vertically.

Resources

Data API Standards

  • Open Data Protocol (OData)

    OData is incredibly comprehensive and is a standard supported by Microsoft, IBM, and the Open Government Data Initiative (OGDI). However, it's incredibly complex, and the best tools for working with it are all .NET-based. It's not clear that supporting it is feasible or a good idea. Looking at the examples from Netflix will give you a good sense of how complex it can be.

  • Open Data Protocols/Simple Data Format

    This couldn't have a more confusing name if it tried. It is a much simpler, much easier-to-implement set of proposed standards. The problem: lots and lots of open questions, not a lot of answers, and the specifications are still being written. This is good, as it allows us flexibility; it's not so great in that it means we are pretty much making it up. The metadata format should be expanded to work more like DSPL.

  • Dataset Publishing Language (DSPL)

    Google's standard for creating datasets to import into the Google Public Data Explorer. I love this one, although it's pretty complex. The metadata that you give it will allow us to automatically generate API parameters. This doesn't include a standard for the API, just for the dataset.

  • Getting Started with SODA 2.0

    Socrata's Open Data API isn't a bad place to start for designing our API. In particular, the SODA2 SoQL query language looks easy to implement and yet is full-featured.

  • Swagger

    Swagger is a specification for programmatically describing an API, allowing it to be discoverable, as well as a set of tools for documenting and consuming Swagger-compatible APIs.