Skip to content

Data engineering practice, including building data pipelines (ELT) from a variety of sources.

Notifications You must be signed in to change notification settings

BethanyWeisberg/data-pipeline-practice

Repository files navigation

Data Pipeline Practice

Hello, welcome to my Data Pipeline Practice repo! Here I keep all the python scripts and relevant files I used for creating data pipelines via the extract - load - transform (ELT) method. I am using the Data Pipelines Pocket Reference (Densmore 2021) to guide my ETL-practice journey. Below, I keep track of the ETL Practice Steps and the Highlights of my learning journey.


Data Ingestion Pipeline (Densmore 2021)

1. ETL Practice Steps

Configuration and Local Environment:
  • Virtual Environment
  • AWS Account
  • MySQL database
  • Create table in MySQL
Data Extraction:
  • Python script (full extract table to s3 bucket)
  • Redshift Data Warehouse
  • Python script (incremental extract table to s3 bucket)
  • BinLog Replication of MySQL data Note - will practice CDC method at later point.
  • MongoDB data extraction method
  • REST API data extraction method
Data Loading:
  • Load CSV file to Redshift data warehouse via query editor
  • Load CSV file to Redshift data warehouse via python script
Data Transformation:
  • Deduplicating records in a data warehouse table via sql
  • Parsing URLs via python
  • Transform data from fact and dimension tables by creating a new data model via SQL
Workflow Orchestration Management:
  • Install Apache Airflow
  • Create Postgres database
  • Configure Airflow to use Postgres database
  • Build and Run a Simple Airflow DAG
  • Build an ELT Pipeline DAG
  • Configure DAG Status Alerts
  • Coordinate Multiple DAGs with Sensors
Pipeline Data Validation:
  • Create validation test script

2. Highlights

Data Extraction
MySQL Database (via RDS) Table Created in MySQL S3 Bucket for Extracted MySQL Table Redshift Data Warehouse MongoDB Database
Screen Shot 2023-01-16 at 3 15 25 PM_thumbnail Screen Shot 2023-01-16 at 3 11 24 PM_thumbnail Screen Shot 2023-01-16 at 3 09 47 PM_thumbnail Screen Shot 2023-01-16 at 3 13 57 PM_thumbnail Screen Shot 2023-01-16 at 3 08 03 PM_thumbnail
Data Loading
Load CSV Files to Redshift (Query Editor) Successfully Loaded CSV File to Redshift (Query Editor) Successfully Loaded CSV File to Redshift (Python Script) Successfully Loaded CSV File to Redshift (Python Script)
Screen Shot 2023-01-18 at 5 08 12 PM_thumbnail Screen Shot 2023-01-18 at 5 09 42 PM_thumbnail Screen Shot 2023-01-18 at 5 52 16 PM_thumbnail Screen Shot 2023-01-18 at 6 04 43 PM_thumbnail
Data Transformation
Create Table with Duplicate Count (SQL) Deduplicate Original Table (SQL) Create Transformed Data Model (SQL)
Screen Shot 2023-01-18 at 6 40 24 PM_thumbnail Screen Shot 2023-01-18 at 6 40 39 PM_thumbnail Screen Shot 2023-01-22 at 12 01 32 PM_thumbnail
Workflow Orchestration
Install Apache Airflow ELT Pipeline DAG Graph Create and Run ELT Pipeline Airflow DAG
Screen Shot 2023-01-24 at 9 24 33 PM_thumbnail Screen Shot 2023-02-04 at 6 59 54 PM_thumbnail Screen Shot 2023-02-04 at 7 00 24 PM_thumbnail

Reference

Densmore, James (2021). Data Pipelines Pocket Reference (O'Reilly). Copyright 2021 James Densmore, 978-1-492-08783-0.

Releases

No releases published

Packages

No packages published

Languages