Data pipeline implementing kafka, spark structured streaming, dbt, google cloud, bigquery and more

locdoan12121997/ticketsim

Ticketsim

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, GCP and more.

Description

Objective

This project streams data from a simulation that sells event tickets and builds a data pipeline that consumes it in real time. The stream is processed as it arrives and written to the data lake every two minutes. An hourly batch job then reads this data, applies transformations, and creates tables in the data warehouse for analytics and reporting. The analysis covers basic attributes of the data, such as total users and average waiting time.
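To make the two-minute write interval and the basic metrics concrete, here is a minimal sketch in plain Python. It is not the project's Spark or dbt code. The event tuple layout, the sample values, and the names `bucket_by_window` and `summarize` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

WINDOW_SECONDS = 120  # two-minute write interval from the description

# Hypothetical events: (timestamp in seconds, user_id, event_type, wait_seconds)
events = [
    (5, "u1", "wait", 30.0),
    (40, "u1", "buy", 0.0),
    (70, "u2", "wait", 50.0),
    (130, "u2", "buy", 0.0),
    (200, "u3", "wait", 10.0),
]

def bucket_by_window(rows, window=WINDOW_SECONDS):
    """Group events by the start time of the window that contains them."""
    buckets = defaultdict(list)
    for row in rows:
        window_start = (row[0] // window) * window
        buckets[window_start].append(row)
    return dict(buckets)

def summarize(rows):
    """Compute total users and average waiting time over a set of events."""
    users = {user_id for _, user_id, _, _ in rows}
    waits = [wait for _, _, kind, wait in rows if kind == "wait"]
    return {
        "total_users": len(users),
        "average_wait": mean(waits) if waits else None,
    }

batches = bucket_by_window(events)  # what each two-minute write would contain
report = summarize(events)          # what the hourly job would aggregate
print(sorted(batches))
print(report)
```

In the real pipeline the bucketing would be handled by the streaming job and the aggregation by the warehouse transformations. The sketch only shows the shape of the two steps.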

Data Simulation

Ticketsim is inspired by an article by Kevin Brown. Using simpy, the program generates wait and buy-ticket events.
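As a rough illustration of what "wait" and "buy" events from a ticket queue could look like, here is a small single-seller simulation. It deliberately does not use simpy, so that it runs with the standard library alone. A simpy version would replace the manual clock with an `Environment` and a `Resource`. The arrival rate, service times, event fields, and the function name `simulate_ticket_queue` are assumptions, not taken from the Ticketsim code.

```python
import random

def simulate_ticket_queue(num_customers=5, seed=42):
    """Return a time-ordered list of hypothetical wait/buy events for one seller."""
    rng = random.Random(seed)  # seeded so repeated runs give the same events
    events = []
    clock = 0.0           # arrival time of the current customer
    seller_free_at = 0.0  # time at which the seller finishes the previous sale
    for customer_id in range(num_customers):
        clock += rng.expovariate(1 / 10)    # mean 10 s between arrivals (assumed)
        start = max(clock, seller_free_at)  # wait if the seller is still busy
        service = rng.uniform(5, 15)        # a sale takes 5 to 15 s (assumed)
        seller_free_at = start + service
        events.append({"time": clock, "customer": customer_id, "event": "wait"})
        events.append({"time": seller_free_at, "customer": customer_id,
                       "event": "buy", "waited": start - clock})
    return sorted(events, key=lambda e: e["time"])

events = simulate_ticket_queue()
for event in events:
    print(event)
```

Each customer produces one wait event on arrival and one buy event when the sale completes. Records like these are what a producer would serialize and send to Kafka.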

Tools & Technologies

Architecture

ticketsim-architecture

Final Result

You can view the dashboard here.

Setup

For this project, I used the $300 free credit that comes with a new GCP account. The project consists of three VM instances: one Ubuntu instance running Ticketsim and the Kafka stack, one Dataproc instance running Spark jobs, and one Ubuntu instance running Airflow to orchestrate the periodic jobs on the data lake and data warehouse. The VM names are listed in the picture.

Pre-requisites

Action Parts

  • Set up GCP - Setup
  • Set up infrastructure using Terraform - Setup
  • Set up the Kafka compute instance and start sending messages from Ticketsim - Setup
  • Set up the Spark cluster for stream processing - Setup
  • Set up Airflow on a compute instance to trigger the hourly data pipeline - Setup

Special Mentions

I'd like to thank DataTalks.Club for offering this Data Engineering course completely free. The knowledge from the course helped me kickstart this project. I also want to thank Ankur for his streamify project. I borrowed a lot of his code and ideas to study data engineering concepts.
