
Pinterest Data Pipeline (Data and Cloud Engineering)


Overview

  • Pinterest crunches billions of data points every day to decide how to provide more value to its users.
  • The goal of this project is to simulate a similar pipeline with both batch and stream processing, built on a range of AWS services.

Learning Objectives

  • To implement a data pipeline using AWS.
  • To process Batch and Stream data.
  • To conduct cleaning and analysis on the processed data.

Technologies Used

  • Kafka: Used as a scalable and fault-tolerant messaging system for data streaming.
  • AWS Managed Streaming for Apache Kafka (MSK): Managed service for Apache Kafka, providing easy setup, monitoring, and scaling capabilities.
  • MSK Connect: Enables seamless data integration between Apache Kafka and other data sources and sinks.
  • AWS API Gateway: Used to create, publish, and manage APIs, allowing seamless interaction with the pipeline.
  • AWS S3: Scalable object storage used for data storage and retrieval.
  • Spark: Fast and distributed processing engine used for data transformations and analytics.
  • Spark Structured Streaming: Enables real-time processing and analysis of streaming data.
  • Databricks: Cloud-based platform for collaborative data engineering and analytics.
  • Airflow: Open-source platform for orchestrating and scheduling workflows.
  • AWS Managed Workflows for Apache Airflow (MWAA): Fully managed service for Apache Airflow, simplifying the deployment and management of workflows.
  • AWS Kinesis: Managed service for real-time data streaming and processing.

Project Highlights

  • Developed a robust data processing pipeline capable of handling Pinterest's experimental data requirements.
  • Implemented a Lambda architecture, combining batch processing and stream processing methods for efficient data analysis.
  • Leveraged Kafka and MSK to handle scalable and fault-tolerant data streaming.
  • Utilized Spark and Spark Structured Streaming for real-time data processing and analytics.
  • Integrated data sources and sinks seamlessly using MSK Connect.
  • Ensured data storage and retrieval using AWS S3.
  • Orchestrated and scheduled workflows using Airflow and MWAA for efficient pipeline management.

Project Structure

Milestone 1 - "Batch Processing: Configure the EC2 Kafka client."

  • Create a .pem key file and connect to the EC2 instance
  • Set up Kafka and create topics from the EC2 instance (see the sketch below)
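
The topic setup step can be sketched in Python with the kafka-python library. This is a minimal sketch rather than the project's actual commands: the bootstrap broker address, client id, and topic names are placeholders, and any MSK authentication settings are assumed to be handled separately.

```python
# Minimal sketch: create Kafka topics on the MSK cluster from the EC2 client.
# The broker address and topic names below are placeholders, not real values.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers="b-1.example-cluster.kafka.us-east-1.amazonaws.com:9092",
    client_id="ec2-kafka-client",
)

topics = [
    NewTopic(name="example.pin", num_partitions=1, replication_factor=3),
    NewTopic(name="example.geo", num_partitions=1, replication_factor=3),
    NewTopic(name="example.user", num_partitions=1, replication_factor=3),
]

admin.create_topics(new_topics=topics)
admin.close()
```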

Milestone 2 - "Batch Processing: Connect an MSK cluster to an S3 bucket."

  • Create a custom plugin in MSK Connect
  • Create a connector in MSK Connect (example configuration below)
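
For reference, the connector configuration handed to MSK Connect when the custom plugin is the Confluent S3 sink connector typically looks like the following. The bucket name, region, and topic pattern are placeholder values, shown here as a Python dict.

```python
# Sketch of typical Confluent S3 sink connector settings for MSK Connect.
# Bucket, region, and topic values are placeholders.
s3_sink_connector_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics.regex": "example.*",            # topics to sink to S3
    "s3.region": "us-east-1",
    "s3.bucket.name": "example-pinterest-bucket",
    "flush.size": "1",                      # records per S3 object
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
}
```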

Milestone 3 - "Batch Processing: Configuring an API in API Gateway."

  • Build a Kafka REST proxy integration method for the API
  • Set up the Kafka REST proxy on the EC2 instance
  • Send data to the API (see the sketch below)
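
Sending data through the API can be sketched with the requests library. The invoke URL, topic name, and record fields below are placeholders; the content type follows the Kafka REST proxy's JSON embedded format.

```python
# Minimal sketch: post one record to a Kafka topic via the API Gateway /
# Kafka REST proxy endpoint. URL, topic, and record values are placeholders.
import json
import requests

invoke_url = "https://example-id.execute-api.us-east-1.amazonaws.com/prod/topics/example.pin"

payload = json.dumps({
    "records": [
        {"value": {"index": 1, "title": "example pin", "category": "diy"}}
    ]
})

headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
response = requests.post(invoke_url, data=payload, headers=headers)
print(response.status_code)
```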

Milestone 4 - "Batch Processing: Databricks."

  • Set up Databricks
  • Mount an S3 bucket (see the sketch below)
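
Mounting the bucket in a Databricks notebook might look roughly like this. The access key, secret key, bucket name, and mount point are placeholders; in practice the keys would be read from a credentials file already uploaded to Databricks. Note that dbutils and display are only available inside the Databricks runtime.

```python
# Sketch of mounting an S3 bucket inside a Databricks notebook.
# Keys, bucket name, and mount point are placeholders.
from urllib.parse import quote

ACCESS_KEY = "<aws-access-key>"   # placeholder: loaded from credentials in practice
SECRET_KEY = "<aws-secret-key>"   # placeholder
encoded_secret_key = quote(SECRET_KEY, safe="")

aws_bucket_name = "example-pinterest-bucket"
mount_name = "/mnt/pinterest_data"

dbutils.fs.mount(
    source=f"s3n://{ACCESS_KEY}:{encoded_secret_key}@{aws_bucket_name}",
    mount_point=mount_name,
)
display(dbutils.fs.ls(mount_name))
```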

Milestone 5 - "Batch Processing: Spark on Databricks."

  • Clean the DataFrames
  • Query the DataFrames with SQL and PySpark (see the sketch below)
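
A cleaning and querying pass could look roughly like the following. The mounted path and the column names (follower_count, description, category) are assumptions for illustration, and spark refers to the session Databricks provides in every notebook.

```python
# Sketch of cleaning one of the batch DataFrames and querying it with SQL.
# Paths and column names are illustrative placeholders.
from pyspark.sql import functions as F

df_pin = spark.read.json("/mnt/pinterest_data/topics/example.pin/partition=0/")

df_pin_clean = (
    df_pin
    .dropDuplicates()
    # e.g. turn "25k" into "25000" before casting to an integer
    .withColumn("follower_count", F.regexp_replace("follower_count", "k", "000"))
    .withColumn("follower_count", F.col("follower_count").cast("int"))
    .replace({"No description available": None}, subset=["description"])
)

# The cleaned DataFrame can also be queried with SQL via a temporary view.
df_pin_clean.createOrReplaceTempView("pin")
spark.sql("SELECT category, COUNT(*) AS post_count FROM pin GROUP BY category").show()
```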

Milestone 6 - "Batch Processing: AWS MWAA."

  • Create and upload a DAG to MWAA (see the sketch below)
  • Trigger the DAG
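
The DAG uploaded to MWAA can be sketched with the Databricks Airflow provider. The DAG id, schedule, cluster id, notebook path, and connection id below are placeholders rather than the project's real values.

```python
# Sketch of an Airflow DAG that runs a Databricks notebook on a daily schedule.
# All ids and paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

default_args = {
    "owner": "example-owner",
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="example_pinterest_batch_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_databricks_notebook",
        databricks_conn_id="databricks_default",
        existing_cluster_id="example-cluster-id",
        notebook_task={"notebook_path": "/Workspace/Users/example/batch_cleaning"},
    )
```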

Milestone 7 - "Stream Processing: AWS Kinesis."

  • Create Data Stream with Kinesis
  • Configure an API with Kinesis Proxy Integration
  • Send data to Kinesis and read it into Databricks
  • Transform the Kinesis stream in Databricks and write it to Delta tables (see the sketch below)
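
The streaming leg can be sketched end to end: a record is written to the data stream with boto3, then read in Databricks with the Kinesis source and written to a Delta table. The stream name, region, record fields, and table name are placeholders, and spark is the session Databricks provides; the read/write part only runs inside Databricks.

```python
# Sketch: write one record to a Kinesis data stream, then (in Databricks)
# read the stream with Structured Streaming and persist it to a Delta table.
# Stream name, region, and table name are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.put_record(
    StreamName="example-pin-stream",
    Data=json.dumps({"index": 1, "title": "example pin", "category": "diy"}),
    PartitionKey="partition-1",
)

# In a Databricks notebook: read the stream with the Kinesis source.
df_stream = (
    spark.readStream
    .format("kinesis")
    .option("streamName", "example-pin-stream")
    .option("region", "us-east-1")
    .option("initialPosition", "earliest")
    .load()
)

# Write the (transformed) stream out to a Delta table.
(df_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/kinesis/_checkpoints/")
    .table("example_pin_table"))
```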
