ROAPI automatically spins up read-only APIs for static datasets without requiring you to write a single line of code. It builds on top of Apache Arrow and Datafusion. The core of its design can be boiled down to the following:
- Query frontends to translate SQL, GraphQL and REST API queries into Datafusion plans.
- Datafusion for query plan execution.
- Data layer to load datasets from a variety of sources and formats with automatic schema inference.
- Response encoding layer to serialize intermediate Arrow record batch into various formats requested by client.
See below for a high level diagram:
pip install roapi-http
Check out Github release page for pre-built binaries for each platform. Pre-built docker images are also available at ghcr.io/roapi/roapi-http.
cargo install --git https://github.com/roapi/roapi --branch main --bin roapi-http
Spin up APIs for test_data/uk_cities_with_headers.csv
and
test_data/spacex-launches.json
:
roapi-http \
--table 'uk_cities:test_data/uk_cities_with_headers.csv' \
--table 'spacex_launches:test_data/spacex-launches.json'
Or using docker:
docker run -t --rm -p 8080:8080 ghcr.io/roapi/roapi-http:latest --addr 0.0.0.0:8080 \
--table 'uk_cities:test_data/uk_cities_with_headers.csv' \
--table 'spacex_launches:test_data/spacex-launches.json'
Query tables using SQL, GraphQL or REST:
curl -X POST -d "SELECT city, lat, lng FROM uk_cities LIMIT 2" localhost:8080/api/sql
curl -X POST -d "query { uk_cities(limit: 2) {city, lat, lng} }" localhost:8080/api/graphql
curl "localhost:8080/api/tables/uk_cities?columns=city,lat,lng&limit=2"
Get inferred schema for all tables:
curl 'localhost:8080/api/schema'
You can also configure multiple table sources using YAML config, which supports more advanced format specific table options:
addr: 0.0.0.0:8084
tables:
- name: "blogs"
uri: "test_data/blogs.parquet"
- name: "ubuntu_ami"
uri: "test_data/ubuntu-ami.json"
option:
format: "json"
pointer: "/aaData"
array_encoded: true
schema:
columns:
- name: "zone"
data_type: "Utf8"
- name: "name"
data_type: "Utf8"
- name: "version"
data_type: "Utf8"
- name: "arch"
data_type: "Utf8"
- name: "instance_type"
data_type: "Utf8"
- name: "release"
data_type: "Utf8"
- name: "ami_id"
data_type: "Utf8"
- name: "aki_id"
data_type: "Utf8"
- name: "spacex_launches"
uri: "https://api.spacexdata.com/v4/launches"
option:
format: "json"
- name: "github_jobs"
uri: "https://jobs.github.com/positions.json"
To run serve tables using config file:
roapi-http -c ./roapi.yml
See config documentation for more options including using Google spreadsheet as a table source.
By default, ROAPI encodes responses in JSON format, but you can request
different encodings by specifying the ACCEPT
header:
curl -X POST \
-H 'ACCEPT: application/vnd.apache.arrow.stream' \
-d "SELECT launch_library_id FROM spacex_launches WHERE launch_library_id IS NOT NULL" \
localhost:8080/api/sql
You can query tables through REST API by sending GET
requests to
/api/tables/{table_name}
. Query operators are specified as query params.
REST query frontend currently supports the following query operators:
- columns
- sort
- limit
- filter
To sort column col1
in ascending order and col2
in descending order, set
query param to: sort=col1,-col2
.
To find all rows with col1
equal to string 'foo'
, set query param to:
filter[col1]='foo'
. You can also do basic comparisons with filters, for
example predicate 0 <= col2 < 5
can be expressed as
filter[col2]gte=0&filter[col2]lt=5
.
To query tables using GraphQL, send the query through POST
request to
/api/graphql
endpoint.
GraphQL query frontend supports the same set of operators supported by REST query frontend. Here how is you can apply various operators in a query:
{
table_name(
filter: {
col1: false
col2: { gteq: 4, lt: 1000 }
}
sort: [
{ field: "col2", order: "desc" }
{ field: "col3" }
]
limit: 100
) {
col1
col2
col3
}
}
To query tables using a subset of standard SQL, send the query through POST
request to /api/sql
endpoint. This is the only query interface that supports
table joins.
Query layer:
- REST API GET
- GraphQL
- SQL
- join between tables
- support filter on nested struct fields
- index
- protocol
- gRPC
- MySQL
- Postgres
Response serialization:
- JSON
application/json
- Arrow
application/vnd.apache.arrow.stream
- msgpack
Data layer:
- filesystem
- HTTP/HTTPS
- S3
- GCS
- Google spreadsheet
- MySQL
- Postgres
- Airtable
- Data format
- CSV
- JSON
- NDJSON
- parquet
- xls, xlsx, xlsm, ods: https://github.com/tafia/calamine
- DeltaLake
Misc:
- auto gen OpenAPI doc for rest layer
- query input type conversion based on table schema
- stream arrow encoding response
- authentication layer
The core of ROAPI, including query frontends and data layer, lives in the self-contained columnq crate. It takes queries and outputs Arrow record batches. Data sources will also be loaded and stored in memory as Arrow record batches.
The roapi-http crate
wraps columnq
with a HTTP based API layer. It serializes Arrow record batches
produced by columnq
into different formats based on client request.
Building ROAPI with simd
optimization requires nightly rust toolchain.
docker build --rm -t ghcr.io/roapi/roapi-http:latest .
You can set RELEASE
variable to any git reference to build for a specific version.