
Cost Effective Data Pipelines

Code for blog at: Building cost effective data pipelines with Python & DuckDB

Setup

Run on CodeSpaces

You can run the code in the repo with Codespaces.

  1. Fork this repo.
  2. Click on the Code button and select the Codespaces tab.
  3. Click on the Create codespace on main button.


When the codespace opens, wait for it to finish installing the required libraries from ./requirements.txt.


You can then open a Python REPL and run the scripts in the Generate data section below.

Run code on your machine

Alternatively, you can run this code on your machine. You'll need the following:

  1. Python 3.8 or above
  2. sqlite3
  3. Sufficient disk space (depending on whether you want to generate 1, 10, or 100 GB of data)

Clone the repo, create a virtual environment, and install the required libraries:

git clone https://github.com/josephmachado/cost_effective_data_pipelines.git
cd cost_effective_data_pipelines

python3 -m venv myenv
# Windows
# myenv\Scripts\activate
# Linux
source myenv/bin/activate
pip install -r requirements.txt
# After you are done, deactivate with the command `deactivate`

Generate data

For the examples in this repo we use the TPC-H dataset and the CoinCap API. Let's generate the TPC-H data by running the following commands in your terminal:

# NOTE: This is to clean up any data (if present) 
rm tpch-dbgen/*.tbl
# Generate data set of 1 GB size
cd tpch-dbgen
make
./dbgen -s 1 # Change this number to generate a dataset of the desired size
cd ..

# NOTE: Load the generated data into a tpch sqlite3 db
sqlite3 tpch.db < ./upstream_db/tpch_DDL_DML.sql > /dev/null 2>&1
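
If you prefer to drive the load from Python (for example, as part of a setup script), the shell command above can be wrapped with the standard library's subprocess module. This is only a sketch of the same step, not a script that ships with the repo:

# a sketch that wraps the sqlite3 CLI load shown above; not part of the repo
import subprocess

with open("./upstream_db/tpch_DDL_DML.sql", "rb") as sql_file:
    subprocess.run(
        ["sqlite3", "tpch.db"],        # same as: sqlite3 tpch.db < tpch_DDL_DML.sql
        stdin=sql_file,
        stdout=subprocess.DEVNULL,     # discard output, as > /dev/null does above
        stderr=subprocess.DEVNULL,
        check=True,                    # raise if the load fails
    )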

Let's open a sqlite3 shell and run a quick count check to ensure that the tables were loaded properly.

sqlite3 tpch.db
.read ./upstream_db/count_test.sql
/* 
Your output will be (if you generated a 1GB dataset)
150000
6001215
25
1500000
200000
800000
5
10000
*/
.exit  -- exit sqlite3
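
If you'd rather run the same check programmatically, here is a minimal sketch using Python's standard-library sqlite3 module. The table names are the standard TPC-H ones and are assumed to match the schema created by ./upstream_db/tpch_DDL_DML.sql:

# a sketch of the count check; assumes the standard TPC-H table names
import sqlite3

TABLES = ["customer", "lineitem", "nation", "orders", "part", "partsupp", "region", "supplier"]

conn = sqlite3.connect("tpch.db")
try:
    for table in TABLES:
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        print(f"{table}: {count}")
finally:
    conn.close()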

Data processing

You can run the ETL scripts using the make up command.

make up

Alternatively, you can run the scripts individually as shown below.

time python ./src/data_processor/exchange_data.py 2024-05-29
time python ./src/data_processor/dim_parts_supplier.py 2024-05-29
time python ./src/data_processor/one_big_table.py 2024-05-29
time python ./src/data_processor/wide_month_supplier_metrics.py 2024-05-29
# This last script will fail on the base Codespaces VM; see here for more powerful Codespaces VMs:
# https://docs.github.com/en/codespaces/customizing-your-codespace/changing-the-machine-type-for-your-codespace?tool=webui
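
The scripts above implement the Python & DuckDB approach described in the blog post. As a rough illustration of that general pattern (not the actual logic in ./src/data_processor), a DuckDB query over the upstream SQLite database might look like the sketch below; the use of DuckDB's sqlite extension, the columns, and the output name are all assumptions for illustration.

# a sketch of the DuckDB-over-SQLite pattern; not the actual script logic
import duckdb

con = duckdb.connect()                       # in-memory DuckDB
con.execute("INSTALL sqlite; LOAD sqlite;")  # enables sqlite_scan()

con.execute(
    """
    COPY (
        SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount)) AS revenue
        FROM sqlite_scan('tpch.db', 'lineitem')
        GROUP BY l_suppkey
    ) TO 'supplier_revenue.parquet' (FORMAT PARQUET)
    """
)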

Running the ETL on an 8-core, 32 GB RAM, 1 TB HDD 2017 ThinkPad performs as shown below. The wide_month_supplier_metrics.py script ran in 7m 20s, which included reading in about 10 GB of data, processing it inefficiently, and writing it out.

Resource utilization:

[htop screenshot]

NOTE: The code in this repo can be optimized by following the principles stated in this performance guide.
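
For example, one generic lever when a script strains the VM (not necessarily what the guide recommends) is to cap DuckDB's memory and thread usage; a minimal sketch:

# a sketch of capping DuckDB resource usage; the limits are arbitrary examples
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '4GB';")  # cap how much memory DuckDB may use
con.execute("SET threads TO 4;")          # cap CPU parallelism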