nl-open-data

A flexible Python ETL toolkit for data warehousing, based on Dask, Prefect and the pydata stack. It follows the original design principles of these libraries, combined with a functional programming approach to data engineering.

Google Cloud Platform (GCP) is used as the core infrastructure, with BigQuery (GBQ) and Cloud Storage (GCS) as the main storage engines. We follow Google's recommendations on using BigQuery for data warehouse applications, organized in four layers.

Motivation

To take advantage of open data, one must be able to mix various datasets together. Doing so currently requires substantial programming and data engineering knowledge. This library aims to make that task easier.


Installation

Using pip: pip install nl_open_data -> NOT IMPLEMENTED YET

Using Poetry: Being a Poetry-managed package, it can also be installed via Poetry. Assuming Poetry is already installed:

  1. Clone the repository
  2. From your local clone's root folder, run poetry install
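As a concrete example (assuming the dataverbinders/nl-open-data GitHub repository as the clone source), the two steps boil down to something like:

    git clone https://github.com/dataverbinders/nl-open-data.git
    cd nl-open-data
    poetry install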

Configuration

There are two elements that need to be configured prior to using the library.

1. GCP and Paths through config.toml

The GCP project id, bucket, and location are set by editing nl-open-data/nl_open_data/config.toml, allowing up to 3 choices at runtime: dev, test and prod. Note that the GCP project details must be nested correctly to be interpreted, as seen below. You must also have the proper IAM permissions on the GCP projects (more details below).

Correct nesting in the config file:

[gcp]
    [gcp.dev]
    project_id = "my_dev_project_id"
    bucket = "my_dev_bucket"
    location = "EU"

    [gcp.test]
    project_id = "my_test_project_id"
    bucket = "my_test_bucket"
    location = "EU"

    [gcp.prod]
    project_id = "my_prod_project_id"
    bucket = "my_prod_bucket"
    location = "EU"

Additionally, the local paths used by the library can be configured here. Under [paths], define the path to the library and other temporary folders.
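As an illustration only (this is not the library's own config loader), the settings above could be read at runtime with a sketch like the following; the function name and the relative file path are assumptions:

    # Illustrative sketch only; not the library's actual config loader.
    # Reads config.toml and selects one of the dev/test/prod GCP blocks.
    import tomllib  # Python 3.11+; older versions can use the third-party "tomli" package
    from pathlib import Path

    def load_gcp_config(env: str = "dev") -> dict:
        """Return the GCP settings (project_id, bucket, location) for one environment."""
        config_path = Path("nl_open_data/config.toml")  # assumed path, relative to the repo root
        with config_path.open("rb") as f:
            config = tomllib.load(f)
        return config["gcp"][env]

    # Example usage:
    # gcp = load_gcp_config("test")
    # print(gcp["project_id"], gcp["bucket"], gcp["location"])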

Credits
