SPARKIFY ETL (Amazon Redshift)

Table of Contents

  • Introduction
  • Tools
  • Usage
  • Project Files

Introduction

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app.

The project builds an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.

Tools

Python, psycopg2, Amazon Redshift, Amazon S3

Usage

  • First, create the Redshift cluster and the IAM role used to stage the data into it by following the instructions in Cluster_Create_Delete.ipynb.

  • Second, create the database tables needed by the ETL by executing create_tables.py.

  • create_tables.py creates 2 staging tables for the log and song files loaded from S3 into Amazon Redshift, plus 4 dimension tables and a fact table for the star-schema data warehouse.

  • Finally, run etl.py to load the song and log files into the staging tables and then transfer the data from the staging tables into the data warehouse dimension and fact tables.

Important: update the database and account credentials in dwh.cfg to your own before running the scripts. A sketch of a possible dwh.cfg layout follows.
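
The layout below is only a sketch of what dwh.cfg might contain; the section and key names are assumptions for illustration, so match them to whatever Cluster_Create_Delete.ipynb, create_tables.py, and etl.py actually read, and keep the filled-in values out of version control.

```
; Illustrative layout only; section and key names are assumptions.
[AWS]
KEY    = <IAM user access key>
SECRET = <IAM user secret key>

[CLUSTER]
HOST        = <Redshift cluster endpoint>
DB_NAME     = <database name>
DB_USER     = <database user>
DB_PASSWORD = <database password>
DB_PORT     = 5439

[IAM_ROLE]
ARN = <ARN of the role that lets Redshift read from S3>

[DWH]
NUM_NODES = <number of nodes to create>
```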

Project Files

  • Cluster_Create_Delete.ipynb

    • This Jupyter notebook is used to create the cluster and to delete it once the work is complete
    • It is a series of instructions that create the cluster and the roles associated with it (a rough boto3 sketch of these steps appears after this list)
    • The notebook reads its credentials from dwh.cfg, so fill in the required credentials first
  • dwh.cfg

    • Create your IAM user with admin access and put its credentials in this file
    • Put the cluster credentials and the number of nodes you want Cluster_Create_Delete.ipynb to create in this file
    • Important: Don't forget to remove your credentials before publishing the project
  • create_tables.py

    • Run this Python script after creating the cluster
    • It creates the 2 staging tables, staging_songs and staging_events, which hold the JSON log and song data staged from S3
    • It also creates the 4 dimension tables and the fact table that the staging data is transferred into
    • The queries executed by this script are defined in sql_queries.py and sketched after this list
  • etl.py

    • After the cluster and the tables are created, this script transfers the data from the S3 JSON files into the 2 staging tables
    • It then transfers the data from the 2 staging tables into the data warehouse's 4 dimension tables and fact table (a minimal driver sketch appears after this list)
  • sql_queries.py

    • This file contains the queries used to create and drop the tables, to copy the data into the 2 staging tables, and to transfer the data from the staging tables into the data warehouse (see the query-shape sketch after this list)
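
For reference, here is a rough sketch of the kind of boto3 calls Cluster_Create_Delete.ipynb performs when it creates the IAM role and the cluster. Every identifier, region, node type, and config key below is an assumption for illustration, not the notebook's actual code.

```
# Illustrative boto3 sketch; names, region, node type, and config keys are assumptions.
import configparser
import json
import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

iam = boto3.client(
    "iam",
    aws_access_key_id=config["AWS"]["KEY"],        # assumed config keys
    aws_secret_access_key=config["AWS"]["SECRET"],
    region_name="us-west-2",                       # assumed region
)
redshift = boto3.client(
    "redshift",
    aws_access_key_id=config["AWS"]["KEY"],
    aws_secret_access_key=config["AWS"]["SECRET"],
    region_name="us-west-2",
)

# Role that lets Redshift read the staged JSON files from S3.
role = iam.create_role(
    RoleName="sparkify_redshift_role",             # hypothetical role name
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
iam.attach_role_policy(
    RoleName="sparkify_redshift_role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# The cluster itself; size and credentials come from dwh.cfg in the notebook.
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",                          # assumed node type
    NumberOfNodes=int(config["DWH"]["NUM_NODES"]),
    ClusterIdentifier="sparkify-cluster",          # hypothetical identifier
    DBName=config["CLUSTER"]["DB_NAME"],
    MasterUsername=config["CLUSTER"]["DB_USER"],
    MasterUserPassword=config["CLUSTER"]["DB_PASSWORD"],
    IamRoles=[role["Role"]["Arn"]],
)
```

Deleting the cluster afterwards is the mirror image: redshift.delete_cluster(ClusterIdentifier=..., SkipFinalClusterSnapshot=True), then detaching the policy and calling iam.delete_role.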
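
The sketch below illustrates the shape of the queries sql_queries.py contains and create_tables.py and etl.py execute: a CREATE TABLE for one staging table, a COPY that stages JSON from S3, and an INSERT ... SELECT that fills one dimension. The column names, the users dimension, the S3 path, and the config keys are assumptions for illustration only; the real definitions live in sql_queries.py.

```
# Illustrative query shapes; table/column names, S3 paths, and config keys
# are assumptions, not the project's actual definitions.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

# DDL of the kind create_tables.py runs for the staging tables.
staging_events_create = """
    CREATE TABLE IF NOT EXISTS staging_events (
        userId    INT,
        firstName VARCHAR,
        lastName  VARCHAR,
        gender    VARCHAR,
        level     VARCHAR,
        song      VARCHAR,
        artist    VARCHAR,
        ts        BIGINT
    );
"""

# Stage the raw JSON event logs from S3 into Redshift.
staging_events_copy = f"""
    COPY staging_events
    FROM 's3://<bucket>/log_data'
    IAM_ROLE '{config['IAM_ROLE']['ARN']}'
    FORMAT AS JSON 'auto';
"""

# Move distinct users from the staging table into a users dimension.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
"""
```

The staging_songs copy and the remaining dimension and fact inserts follow the same pattern.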
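
And a minimal sketch of the driver loop etl.py might use, assuming sql_queries.py collects its queries into two lists (copy_table_queries and insert_table_queries are hypothetical names) and that dwh.cfg has a [CLUSTER] section with the connection values:

```
# Hypothetical driver in the style of etl.py: connect to Redshift with
# psycopg2, run the S3 COPY queries, then the INSERT ... SELECT queries.
import configparser
import psycopg2

# Hypothetical list names; the real queries live in sql_queries.py.
from sql_queries import copy_table_queries, insert_table_queries


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    # Assumed [CLUSTER] keys: HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT.
    conn = psycopg2.connect(
        host=config["CLUSTER"]["HOST"],
        dbname=config["CLUSTER"]["DB_NAME"],
        user=config["CLUSTER"]["DB_USER"],
        password=config["CLUSTER"]["DB_PASSWORD"],
        port=config["CLUSTER"]["DB_PORT"],
    )
    cur = conn.cursor()

    # S3 -> staging tables, then staging tables -> dimensions and fact.
    for query in copy_table_queries + insert_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()
```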