Dione is an indexing library for data on HDFS and Spark.
Its main offering is a set of APIs for building an index over data on HDFS and querying that index in two modes:
- Multi-row load - using Spark as a distributed processing engine, load a subset of the data (0.1% to 100% of the key space) much faster than Spark/Hive joins.
- Single-row fetch - `get(key)` with latency on the order of seconds and low throughput.
This way, HDFS data that is primarily used for batch processing can be reused for more ad-hoc access use cases.
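To make the two access modes concrete, here is a minimal conceptual sketch in plain Python. It does not use Dione's API: the function names `build_index`, `get`, and `load`, the CSV layout, and the local temporary file are all assumptions made for illustration. The idea it shows is only that an index can store where each row lives (offset and size), so rows can be read back without scanning or copying the data.

```python
import os
import tempfile


def build_index(path):
    """Scan the data file once and record (offset, size) for each key.

    Hypothetical helper: the key is assumed to be the first CSV field.
    """
    index = {}
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            key = line.split(b",", 1)[0].decode()
            index[key] = (offset, len(line))
            offset += len(line)
    return index


def get(path, index, key):
    """Single-row fetch: seek straight to the row instead of scanning."""
    offset, size = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size).decode().rstrip("\n")


def load(path, index, keys):
    """Multi-row load: fetch a subset of keys, reading in offset order."""
    wanted = sorted((index[k], k) for k in keys if k in index)
    rows = {}
    with open(path, "rb") as f:
        for (offset, size), k in wanted:
            f.seek(offset)
            rows[k] = f.read(size).decode().rstrip("\n")
    return rows


# Demo data: a small local file stands in for data stored on HDFS.
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "data.csv")
with open(data_path, "wb") as f:
    f.write(b"k1,alice\nk2,bob\nk3,carol\n")

index = build_index(data_path)
print(get(data_path, index, "k2"))
print(load(data_path, index, ["k3", "k1"]))
```

In this sketch the index holds only keys and positions, so the original file stays the single copy of the data; the real library applies the same idea to HDFS files with Spark doing the distributed work.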
There are three main building blocks:
- `HdfsIndexer` - a library for indexing HDFS data and loading back the data given the index metadata.
- `AvroBtreeFile` - an Avro-based file format for storing rows in a file in B-Tree order for fast search.
- `IndexManager` - a high-level API for index management using Spark.
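As a rough illustration of why storing rows in B-Tree order makes search fast, the sketch below builds a two-level, in-memory structure and looks up a key by descending from a root block to a leaf block. This is not the `AvroBtreeFile` format or its API: `build_tree`, `search`, the `leaf_size` parameter, and the in-memory lists are assumptions chosen to keep the example small.

```python
from bisect import bisect_right


def build_tree(sorted_rows, leaf_size):
    """Split sorted (key, value) rows into leaf blocks plus a root of first keys."""
    leaves = [
        sorted_rows[i:i + leaf_size]
        for i in range(0, len(sorted_rows), leaf_size)
    ]
    root = [leaf[0][0] for leaf in leaves]
    return root, leaves


def search(root, leaves, key):
    """Find the leaf that may hold the key, then search only inside that leaf."""
    i = bisect_right(root, key) - 1
    if i < 0:
        return None  # key sorts before the first stored key
    leaf = leaves[i]
    keys = [k for k, _ in leaf]
    j = bisect_right(keys, key) - 1
    if j >= 0 and keys[j] == key:
        return leaf[j][1]
    return None


# Ten rows, four per leaf: the root holds three keys, one per leaf block.
rows = sorted((f"key{n:03d}", n * n) for n in range(10))
root, leaves = build_tree(rows, leaf_size=4)
print(search(root, leaves, "key007"))
```

Each lookup touches one root block and one leaf block rather than every row. On a file, that corresponds to a small number of seeks per key, which is what makes single-row fetches practical.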
For a deeper overview, please see our Dione documentation.
- Data and index are available for batch processing.
- Use the same technology stack for the index and for the data.
- No data duplication.
- Support multiple indices for the same data.
- No need to be the data owner.
Check out our Quick Start or Quick Start Python guides.
| Dione version | Spark version |
|---|---|
| 0.5.x | 2.3.x |
| 0.6.x | 2.4.x |
| 0.7.x | 3.1.x |
Please report problems by opening an issue on GitHub.
This project is licensed under the Apache 2 License.