DATA LAKE ETL ON AMAZON EMR USING APACHE SPARK

This project uses a simulated music streaming dataset from a hypothetical company, Sparkify. The data consist of activity logs and song data stored as partitioned JSON files on Amazon S3. The goal is to extract the data into Spark, transform it into forms useful for analytics, and load the results back into S3.
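
The extract, transform, and load flow described above could be sketched roughly as follows. This is a minimal sketch, not the project's actual etl.py: the bucket names, the song_data prefix, the glob depth, and the column names are all assumptions made for illustration, and only one output table is shown.

```python
"""Minimal ETL sketch: read JSON from S3, reshape it, write Parquet back to S3."""


def s3_path(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI."""
    cleaned = [p.strip("/") for p in parts]
    key = "/".join(p for p in cleaned if p)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}/"


def run_etl(spark, input_bucket: str, output_bucket: str) -> None:
    """Run one extract-transform-load pass using an existing SparkSession."""
    # Extract: the glob depth is an assumption about how the JSON is partitioned.
    songs = spark.read.json(s3_path(input_bucket, "song_data", "*/*/*/*.json"))

    # Transform: keep a few (hypothetical) columns and drop duplicate songs.
    songs_table = (
        songs.select("song_id", "title", "artist_id", "year", "duration")
        .dropDuplicates(["song_id"])
    )

    # Load: write Parquet back to S3, partitioned to speed up analytics queries.
    (
        songs_table.write.mode("overwrite")
        .partitionBy("year", "artist_id")
        .parquet(s3_path(output_bucket, "songs"))
    )


def main() -> None:
    # PySpark is imported here so the helpers above can be used without it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkify-etl-sketch").getOrCreate()
    # Both bucket names are placeholders, not the project's real buckets.
    run_etl(spark, "example-input-bucket", "example-output-bucket")
    spark.stop()
```

On the cluster, a script like this would end with a call to main() and be started with spark-submit. Writing Parquet partitioned by a column such as year is a common choice for analytics tables, though the real script may partition differently.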

Activities Carried Out

To achieve this, I:

Created an EMR cluster with one master node (m5.xlarge) and two slave nodes (m5.xlarge), using Spark 3.1.1 (for analysis and transformation), Hadoop 3.2.1 (for storage) and YARN (for cluster management);
Enabled SSH access to the EMR master node by adding my laptop's IP address to the inbound SSH rule;
Moved the configuration file (for S3 connection) and SSH key (for communication between the master and slave nodes) to the master node;
Created an ETL script, etl.py, on the master node;
Implemented the ETL process in the etl.py file using Spark;
Ran spark-submit etl.py to start the job;
Monitored the execution, which lasted over 2 hours;
Inspected the data saved in S3 and confirmed that it matched my expectations.
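
The cluster from the first step could also be described programmatically. The sketch below builds a request for boto3's run_job_flow call; it is an illustration under stated assumptions, not a record of how the cluster was actually created. The release label emr-6.3.0 is assumed to be the release that bundles Spark 3.1.1 and Hadoop 3.2.1 and should be checked against the EMR release notes. The cluster name, key pair name, region, and default IAM role names are likewise placeholders.

```python
def build_cluster_config(key_name: str, release_label: str = "emr-6.3.0") -> dict:
    """Build run_job_flow arguments for one master and two worker nodes."""

    def group(role: str, count: int) -> dict:
        # Every node uses m5.xlarge, as in the steps above.
        return {
            "Name": f"{role.lower()}-nodes",
            "InstanceRole": role,
            "InstanceType": "m5.xlarge",
            "InstanceCount": count,
            "Market": "ON_DEMAND",
        }

    return {
        "Name": "sparkify-etl-sketch",  # hypothetical cluster name
        "ReleaseLabel": release_label,  # assumption: verify the bundled versions
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [group("MASTER", 1), group("CORE", 2)],
            "Ec2KeyName": key_name,  # key pair used for SSH to the master node
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for SSH use
        },
        # Default EMR role names; an account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(config: dict, region: str = "us-west-2") -> str:
    """Submit the request and return the new cluster's ID (needs AWS credentials)."""
    import boto3  # imported here so the config can be built without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**config)["JobFlowId"]
```

Calling launch_cluster(build_cluster_config("my-key-pair")) would start a billable cluster, so the config is kept separate from the launch to allow inspecting it first.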
