nl-open-data

A flexible Python ETL toolkit for data warehousing, based on Dask, Prefect and the pydata stack. It follows the original design principles of these libraries, combined with a functional programming approach to data engineering.

Google Cloud Platform (GCP) is used as the core infrastructure, with BigQuery (GBQ) and Cloud Storage (GCS) as the main storage engines. We follow Google's recommendations on using BigQuery for data warehouse applications, organized in four layers.

Motivation

To take advantage of open data, one must be able to mix various datasets together. Doing so currently requires substantial programming and data engineering knowledge. This library aims to make that task easier.


Installation

Using pip: pip install nl_open_data -> NOT IMPLEMENTED YET

Using Poetry: Being a Poetry-managed package, it can also be installed via Poetry. Assuming Poetry is already installed:

  1. Clone the repository
  2. From your local clone's root folder, run poetry install
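As a concrete example (assuming the dataverbinders/nl-open-data GitHub repository as the clone source), the two steps boil down to something like:

    git clone https://github.com/dataverbinders/nl-open-data.git
    cd nl-open-data
    poetry install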

Configuration

There are two elements that need to be configured prior to using the library.

1. GCP and Paths through config.toml

The GCP project id, bucket, and location are set by editing nl-open-data/nl_open_data/config.toml, allowing up to 3 choices at runtime: dev, test and prod. Note that the GCP project details must be nested correctly to be interpreted, as seen below. You must also have the proper IAM permissions on the GCP projects (more details below).

Correct nesting in the config file:

[gcp]
    [gcp.dev]
    project_id = "my_dev_project_id"
    bucket = "my_dev_bucket"
    location = "EU"

    [gcp.test]
    project_id = "my_test_project_id"
    bucket = "my_test_bucket"
    location = "EU"

    [gcp.prod]
    project_id = "my_prod_project_id"
    bucket = "my_prod_bucket"
    location = "EU"

Additionally, the local paths used by the library can be configured here. Under [paths], define the path to the library and other temporary folders.
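As an illustration only (this is not the library's own config loader), the settings above could be read at runtime with a sketch like the following; the function name and the relative file path are assumptions:

    # Illustrative sketch only; not the library's actual config loader.
    # Reads config.toml and selects one of the dev/test/prod GCP blocks.
    import tomllib  # Python 3.11+; older versions can use the third-party "tomli" package
    from pathlib import Path

    def load_gcp_config(env: str = "dev") -> dict:
        """Return the GCP settings (project_id, bucket, location) for one environment."""
        config_path = Path("nl_open_data/config.toml")  # assumed path, relative to the repo root
        with config_path.open("rb") as f:
            config = tomllib.load(f)
        return config["gcp"][env]

    # Example usage:
    # gcp = load_gcp_config("test")
    # print(gcp["project_id"], gcp["bucket"], gcp["location"])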

Credits
