This project documents my process of learning data engineering (DE). I found this valuable project while searching for DE material, and I want to work through all of it myself from scratch. I have also changed some minor parts, such as commands that no longer work after updates.
The original GitHub link is [https://github.com/san089/Udacity-Data-Engineering-Projects/tree/master]
In this project, we apply data modeling with Postgres and build an ETL pipeline using Python. A startup wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The data is currently collected in JSON format, and the analytics team is particularly interested in understanding which songs users are listening to.
Link: [https://github.com/ZhengyuOfficial/The-Road-To-Data-Engineering./tree/main/ETL]
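
The core of such an ETL step can be sketched as below. This is only a minimal sketch under assumptions: the field names (`song_id`, `artist_name`, and so on) and the two-table layout are made up for illustration, and the standard-library `sqlite3` module stands in for Postgres so the example runs without a database server.

```python
import json
import sqlite3

# Hypothetical song record; the field names are assumptions for illustration.
RAW_SONG = (
    '{"song_id": "S1", "title": "Example Song", "artist_id": "A1", '
    '"artist_name": "Example Artist", "year": 2001, "duration": 215.5}'
)


def split_song_record(raw: str):
    """Split one JSON song record into a songs row and an artists row."""
    rec = json.loads(raw)
    song_row = (rec["song_id"], rec["title"], rec["artist_id"], rec["year"], rec["duration"])
    artist_row = (rec["artist_id"], rec["artist_name"])
    return song_row, artist_row


def load(conn, raw_records):
    """Create the two tables if needed and insert every record, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS songs "
        "(song_id TEXT PRIMARY KEY, title TEXT, artist_id TEXT, year INTEGER, duration REAL)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS artists (artist_id TEXT PRIMARY KEY, name TEXT)")
    for raw in raw_records:
        song_row, artist_row = split_song_record(raw)
        # Parameterized inserts; rows whose primary key already exists are ignored.
        conn.execute("INSERT OR IGNORE INTO songs VALUES (?, ?, ?, ?, ?)", song_row)
        conn.execute("INSERT OR IGNORE INTO artists VALUES (?, ?)", artist_row)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [RAW_SONG, RAW_SONG])  # the second, duplicate record is ignored
print(conn.execute("SELECT title, year FROM songs").fetchall())
```

With `psycopg2` against a real Postgres database, the placeholders would be `%s` instead of `?`, and duplicates would be handled with `ON CONFLICT DO NOTHING` rather than `INSERT OR IGNORE`.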
In this project, we apply the data warehouse architectures we learned and build a data warehouse on the AWS cloud. We build an ETL pipeline that extracts data stored in JSON format in S3 buckets, transforms it, and loads it into a warehouse hosted on Amazon Redshift.
Link: [https://github.com/ZhengyuOfficial/Road2DE/tree/main/WareHouse]
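
Loading JSON files from S3 into Redshift is typically done with a `COPY` statement. The sketch below only builds that statement as a string. The table name, bucket path, IAM role ARN, and region are all placeholders invented for illustration, not values from this project.

```python
def build_copy_statement(table: str, s3_path: str, iam_role_arn: str, region: str) -> str:
    """Return a Redshift COPY statement that loads JSON files from S3 into `table`."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON 'auto'\n"
        f"REGION '{region}';"
    )


# Every value below is a placeholder (assumption), not a real resource.
sql = build_copy_statement(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    region="us-west-2",
)
print(sql)
```

Because the statement is assembled with string formatting, the inputs should come from trusted configuration only, never from user input.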
In this project, we build a data lake on the AWS cloud using Spark and an AWS EMR cluster. The data lake serves as the single source of truth for the analytics platform. We write Spark jobs that perform ELT operations: they pick up data from the landing zone on S3, transform it, and store the result in the S3 processed zone.
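
The landing-zone to processed-zone flow can be illustrated with a plain-Python stand-in. This is not Spark code: local directories play the role of the S3 zones, and the record fields (`song_id`, `year`) and the partition-by-year layout are assumptions made for the sketch.

```python
import json
import tempfile
from pathlib import Path


def process_landing_zone(landing: Path, processed: Path) -> int:
    """Read JSON-lines files from `landing`, keep records that have a song_id,
    and write them under `processed`, partitioned by year.
    Returns the number of records written."""
    written = 0
    for src in sorted(landing.glob("*.json")):
        for line in src.read_text().splitlines():
            if not line.strip():
                continue
            rec = json.loads(line)
            if not rec.get("song_id"):
                continue  # drop records without a usable key
            part_dir = processed / f"year={rec.get('year', 0)}"
            part_dir.mkdir(parents=True, exist_ok=True)
            with open(part_dir / src.name, "a") as out:
                out.write(json.dumps(rec) + "\n")
            written += 1
    return written


# Temporary local directories stand in for the S3 landing and processed zones.
root = Path(tempfile.mkdtemp())
landing, processed = root / "landing", root / "processed"
landing.mkdir()
(landing / "songs.json").write_text(
    '{"song_id": "S1", "year": 2001}\n'
    '{"song_id": null, "year": 1999}\n'
    '{"song_id": "S2", "year": 2001}\n'
)
count = process_landing_zone(landing, processed)
print(count, sorted(p.name for p in processed.iterdir()))
```

In a PySpark job, the equivalent steps would roughly be `spark.read.json(...)`, a `filter(...)` on the DataFrame, and `write.partitionBy("year")` to the processed location.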
In this project, we orchestrate our data pipeline workflow using Apache Airflow, an open-source Apache project. We schedule our ETL jobs in Airflow, create project-specific custom plugins and operators, and automate the pipeline execution.
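
What an orchestrator does at its core is run tasks in dependency order. The sketch below shows that idea with the standard-library `graphlib` module (Python 3.9+). It is a stand-in for illustration, not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "stage_events": set(),
    "stage_songs": set(),
    "load_fact_table": {"stage_events", "stage_songs"},
    "load_dimension_tables": {"load_fact_table"},
    "run_quality_checks": {"load_dimension_tables"},
}


def run_pipeline(dependencies, run_task):
    """Run every task exactly once, only after all of its upstream tasks have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        run_task(task)
    return order


executed = []
order = run_pipeline(DEPENDENCIES, executed.append)
print(order)
```

In Airflow itself, the same dependencies are declared between operators inside a DAG, for example with the `>>` operator, and the scheduler decides when each task runs.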
In this project, we build an ETL pipeline that fetches data from the Yelp API and inserts it into a Postgres database. This project is a very basic example of fetching real-time data from a public API.
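
The fetch-and-flatten part of this pipeline might look like the sketch below. It assumes the Yelp Fusion business search endpoint with a Bearer API key and a response containing a `businesses` list. These details and the chosen fields should be checked against the Yelp documentation. The sample payload is made up, and no network request is sent.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Yelp Fusion business search API.
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(api_key: str, term: str, location: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated search request."""
    query = urllib.parse.urlencode({"term": term, "location": location})
    return urllib.request.Request(
        f"{SEARCH_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def to_rows(payload: dict):
    """Flatten a search response into (id, name, rating, review_count) tuples."""
    return [
        (b["id"], b["name"], b.get("rating"), b.get("review_count"))
        for b in payload.get("businesses", [])
    ]


# Made-up response used only to exercise to_rows(); not real Yelp data.
SAMPLE = json.loads(
    '{"businesses": [{"id": "abc123", "name": "Example Cafe", '
    '"rating": 4.5, "review_count": 120}]}'
)

request = build_search_request("YOUR_API_KEY", "coffee", "San Francisco")
rows = to_rows(SAMPLE)
print(request.full_url)
print(rows)
```

To call the API for real, the request would be passed to `urllib.request.urlopen` and the body parsed with `json.load`. The resulting tuples can then be written to Postgres with a parameterized `INSERT`.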