Skip to content
This repository has been archived by the owner on Jul 24, 2022. It is now read-only.
/ etl-1 Public archive
forked from getoutreach/etl

Flexible ETL system written in Ruby

License

Notifications You must be signed in to change notification settings

Yardlynk/etl-1

 
 

Repository files navigation

Ruby ETL Gem

This ETL (Extract Transform Load) Ruby gem provides a back-end system for moving data between data sources in a reliable and scalable manner. It includes features for extracting from several different data sources, transforming the data, and loading to different databases with a user-specified load strategy.

The ETL system is designed from the ground up to be highly available and scalable through usage of queued ETL jobs, distributed workers, and optimized load paths.

Key Use Cases

  • Loading data in batches into a data warehouse (e.g. Redshift) for downstream analytic workloads

  • Creating aggregated/roll-up tables from raw data

Because the ETL system is configured through Ruby and JSON, the target users are back-end developers who are looking to integrate a lightweight solution into their system.

Key Features

Currently implemented features:

  • Connects to various data sources including relational, non-relational, and flat file

  • Flexible load strategies including insert, partition, and upsert

  • Job configuration and specification through Ruby for flexibility

  • Jobs are parameterized by JSON payloads that can be sent as queue messages for distributed processing

  • Automated management of common warehouse columns such as load date

  • Worker process that reads jobs from queue and runs them

Future features:

  • Parameterizable Ruby process for scheduling jobs

  • Posts metrics on job scheduling and execution performance

  • Additional data sources/destinations

  • Job scheduling and dependency representation

  • Support for streaming data sources

  • Code hooks for validation and auditing of data loads

Data Sources and Destinations

Currently the following sources are supported for both data input and output.

  • CSV

  • MySQL

  • PostgreSQL

  • InfluxDB

The following are on the short list for future support:

  • Redshift

  • CloudWatch

Status

Although there are many working parts of this system and it is being used in production at my current company, the interface is still evolving and not all features have been fully implemented. This should be considered pre-Alpha software. Please contact me if you’re interested in contributing or have questions.

To be completed before I’d consider this “alpha”:

  • Finish scheduling and metrics features listed above

  • Provide documentation and examples of the API

  • Publish to RubyGems

To run the existing tests

Run ‘docker build . -t outreach/etl-test` Run `docker run -it outreach/etl -v /MY_TEST_CONFIG_DIR:/etl_config bash` Run `service postgresql start` Run `bundle exec rspec` RUN `export ETL_CONFIG_DIR=/etl_config`

Copyright © 2015-2016 Charles Smith

This is an Open Source project licensed under the terms of the MIT license as described in the LICENSE file in the root directory of this repository.

About

Flexible ETL system written in Ruby

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Ruby 97.8%
  • Shell 1.5%
  • Other 0.7%