- Pinterest crunches billions of data points every day to decide how to provide more value to its users.
- The goal of this project is to simulate the pipeline implemented by Pinterest, covering both batch and stream processing and drawing on a host of AWS services.
- To implement a data pipeline on AWS.
- To process both batch and streaming data.
- To clean and analyze the processed data.
- Kafka: Used as a scalable and fault-tolerant messaging system for data streaming.
- AWS Managed Streaming for Apache Kafka (MSK): Managed service for Apache Kafka, providing easy setup, monitoring, and scaling capabilities.
- MSK Connect: Enables seamless data integration between Apache Kafka and other data sources and sinks.
- AWS API Gateway: Used to create, publish, and manage APIs, allowing seamless interaction with the pipeline.
- AWS S3: Scalable object storage used for data storage and retrieval.
- Spark: Fast and distributed processing engine used for data transformations and analytics.
- Spark Structured Streaming: Enables real-time processing and analysis of streaming data.
- Databricks: Cloud-based platform for collaborative data engineering and analytics.
- Airflow: Open-source platform for orchestrating and scheduling workflows.
- AWS Managed Workflows for Apache Airflow (MWAA): Fully managed service for Apache Airflow, simplifying the deployment and management of workflows.
- AWS Kinesis: Managed service for real-time data streaming and processing.
- Developed a robust data processing pipeline capable of handling Pinterest's experimental data requirements.
- Implemented a Lambda architecture, combining batch processing and stream processing methods for efficient data analysis.
- Leveraged Kafka and MSK to handle scalable and fault-tolerant data streaming.
- Utilized Spark and Spark Structured Streaming for real-time data processing and analytics.
- Integrated various data sources and sinks seamlessly using MSK Connect.
- Ensured data storage and retrieval using AWS S3.
- Orchestrated and scheduled workflows using Airflow and MWAA for efficient pipeline management.
- Create a .pem key file and connect to the EC2 instance
- Set up Kafka and create topics from the EC2 instance (sketched below)
- Create a custom plugin in MSK Connect
- Create a connector in MSK Connect (configuration sketched below)
- Build a Kafka REST proxy integration method for the API
- Set up the Kafka REST proxy on the EC2 instance
- Send data to the API (sketched below)
- Set up Databricks
- Mount an S3 bucket (sketched below)
- Clean the DataFrames
- Query the DataFrames with SQL and PySpark (sketched below)
- Create and upload a DAG to MWAA (sketched below)
- Trigger the DAG
- Create a data stream with Kinesis
- Configure an API with Kinesis proxy integration
- Send data to Kinesis and read it into Databricks (sketched below)
- Transform the Kinesis stream in Databricks and write to Delta tables (sketched below)
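
The sketches below illustrate the main steps; every ID, name, and endpoint in them is a placeholder. First, topic creation from the EC2 client: a minimal sketch assuming the kafka-python package and a plaintext listener (an IAM-authenticated MSK cluster would also need the matching SASL settings).

```python
# Minimal sketch: create the pipeline's topics with kafka-python.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers="<msk-bootstrap-server>:9092",  # placeholder address
    client_id="pinterest-pipeline-admin",
)
admin.create_topics([
    NewTopic(name=name, num_partitions=3, replication_factor=2)
    for name in ("pinterest.pin", "pinterest.geo", "pinterest.user")  # hypothetical topic names
])
```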
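Next, the connector configuration handed to MSK Connect. This is a sketch of typical Confluent S3 sink connector properties; the topic pattern, region, and bucket name are assumptions.

```python
# Sketch of the Confluent S3 sink connector configuration for MSK Connect.
connector_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics.regex": "pinterest\\..*",        # hypothetical topic pattern
    "s3.region": "us-east-1",                # assumed region
    "s3.bucket.name": "<destination-bucket>",
    "flush.size": "1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
}
```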
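Sending a record through the API Gateway route backed by the Kafka REST proxy, assuming the Confluent REST proxy v2 JSON format; the invoke URL, topic, and payload are placeholders.

```python
# Sketch: POST one record to the REST-proxy-backed API Gateway endpoint.
import json
import requests

INVOKE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/<stage>"
record = {"index": 1, "category": "travel"}  # example payload

response = requests.post(
    f"{INVOKE_URL}/topics/pinterest.pin",    # hypothetical topic name
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    data=json.dumps({"records": [{"value": record}]}),
)
response.raise_for_status()
```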
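Mounting the bucket from a Databricks notebook: a sketch assuming the AWS keys have already been read into `ACCESS_KEY` and `SECRET_KEY` in an earlier cell (`dbutils` and `spark` are provided by the Databricks runtime).

```python
# Databricks notebook cell: mount the bucket and read the batch data.
from urllib.parse import quote

encoded_secret = quote(SECRET_KEY, safe="")  # keys loaded in an earlier cell
dbutils.fs.mount(
    source=f"s3n://{ACCESS_KEY}:{encoded_secret}@<bucket-name>",
    mount_point="/mnt/pinterest-data",       # hypothetical mount point
)
df_pin = spark.read.json("/mnt/pinterest-data/topics/pinterest.pin/partition=0/*.json")
```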
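A sketch of the cleaning and querying step; the column names are assumptions about the post data's schema.

```python
# Sketch: clean the posts DataFrame, then query it from SQL.
from pyspark.sql import functions as F

df_pin_clean = (
    df_pin
    .dropDuplicates()
    .withColumn(
        "follower_count",  # e.g. "25k" -> 25000; column name is an assumption
        F.regexp_replace("follower_count", "k$", "000").cast("int"),
    )
)

df_pin_clean.createOrReplaceTempView("pin")
spark.sql("""
    SELECT category, COUNT(*) AS post_count
    FROM pin
    GROUP BY category
    ORDER BY post_count DESC
""").show()
```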
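A sketch of the DAG uploaded to the MWAA environment's dags/ folder in S3; the DAG id, connection id, cluster id, and notebook path are placeholders.

```python
# Sketch of a daily DAG that runs the Databricks cleaning notebook.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="pinterest_batch_dag",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_cleaning_notebook = DatabricksSubmitRunOperator(
        task_id="run_cleaning_notebook",
        databricks_conn_id="databricks_default",
        existing_cluster_id="<cluster-id>",
        notebook_task={"notebook_path": "/Users/<user>/pinterest_batch_cleaning"},
    )
```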
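For the streaming half, the pipeline routes records through an API Gateway Kinesis proxy integration; the boto3 call below sketches the equivalent direct PutRecord request, with the stream name and region as assumptions.

```python
# Sketch: put one record onto the Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region
kinesis.put_record(
    StreamName="streaming-pinterest-pin",    # hypothetical stream name
    Data=json.dumps({"index": 1, "category": "travel"}),
    PartitionKey="pin",
)
```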
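Finally, reading, transforming, and persisting the stream from Databricks: a sketch assuming the Databricks Kinesis connector, cluster-level AWS credentials, and a two-field payload schema.

```python
# Databricks notebook cell: read the stream, parse the JSON payload,
# and write the result to a Delta table with checkpointing.
from pyspark.sql.functions import col, from_json

raw = (spark.readStream
       .format("kinesis")
       .option("streamName", "streaming-pinterest-pin")  # hypothetical stream name
       .option("region", "us-east-1")                    # assumed region
       .option("initialPosition", "earliest")
       .load())

parsed = (raw
          .select(from_json(col("data").cast("string"),
                            "index INT, category STRING").alias("record"))  # assumed schema
          .select("record.*"))

(parsed.writeStream
       .format("delta")
       .option("checkpointLocation", "/tmp/kinesis/_checkpoints")
       .toTable("pin_streaming"))                        # hypothetical table name
```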