Stars
Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
Step-by-step instructions to create a production-ready data pipeline
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
🐍 Quick reference guide to common patterns & functions in PySpark.
JupyterLab computational environment.
Simple ETL demonstrated with literate programming
The fastest way to create an HTML app
Repository for Data Engineering Interview Series
Primary repository for NYC DCP's Data Engineering team
A systematic approach to creating better documentation.
Code for my "Efficient Data Processing in SQL" book.
Code for the blog post on data quality with Great Expectations
Code for the free live workshop "Advanced data transformations in SQL"
This project demonstrates an end-to-end solution for processing and analyzing real-time conversations data from a JSON file using GCP services and infrastructure automation, showcasing data storage…
Python or SQL for data transformation
jless is a command-line JSON viewer designed for reading, exploring, and searching through JSON data.
Simple repo to demonstrate how to submit a Spark job to EMR from Airflow
A template repository to create a data project with IaC, CI/CD, data migrations, & testing
Example repo to create end-to-end tests for a data pipeline.
Repo for the blog post on CDC with Debezium
Sample repo for the startdataengineering DE 101 free course
Code for blog at https://www.startdataengineering.com/post/python-for-de/
Cost Efficient Data Pipelines with DuckDB
Project for the "Data pipeline design patterns" blog post.