ETL Pipeline Project

Project Description

The goal of this project is to build an ETL pipeline. ETL (Extract, Transform, Load) is a data pipeline pattern that collects data from various sources, transforms the data according to business requirements, and loads the data into a destination data store.

This project contains the following files:

  • src/extract.py - a Python script that connects to the Amazon Redshift data warehouse and extracts the online transactions data, with some transformation performed in SQL
  • src/transform.py - a Python script that identifies and removes duplicated records
  • src/load_data_to_s3.py - a Python script that connects to Amazon S3 cloud object storage and writes the cleaned data as a CSV file into an S3 bucket
  • main.py - a Python script that executes the extract, transform, and load steps using the functions from extract.py, transform.py, and load_data_to_s3.py (see the sketch after this list)
  • .env.example - a template listing the environment variables expected in the .env file
  • requirements.txt - a text file listing all the libraries required to run the code
  • Dockerfile - a text file containing the instructions used to assemble the Docker image
  • .dockerignore - a text file listing files and directories to be excluded when the Docker CLI sends the build context to the Docker daemon. This avoids unnecessarily sending large or sensitive files and directories to the daemon and potentially adding them to images via ADD or COPY.
  • .gitignore - a text file specifying intentionally untracked files that Git should ignore
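
For orientation, the sketch below shows one way the three modules could fit together in main.py. It is a minimal illustration, assuming the pipeline uses pandas, psycopg2, and boto3; the query, bucket name, and environment variable names are placeholders, not identifiers taken from this repository.

  # etl_sketch.py - an illustrative sketch, not the repository's actual code.
  # Assumes pandas, psycopg2, and boto3; the table, bucket, and environment
  # variable names below are placeholders.
  import os

  import boto3
  import pandas as pd
  import psycopg2

  def extract(query):
      # Connect to the Redshift cluster using credentials from the .env file
      connection = psycopg2.connect(
          host=os.environ["DBHOST"],
          port=os.environ["DBPORT"],
          dbname=os.environ["DBNAME"],
          user=os.environ["DBUSER"],
          password=os.environ["DBPASSWORD"],
      )
      try:
          return pd.read_sql(query, connection)
      finally:
          connection.close()

  def transform(df):
      # Remove fully duplicated records, keeping the first occurrence
      return df.drop_duplicates()

  def load_data_to_s3(df, bucket, key):
      # Write the cleaned data to the S3 bucket as a CSV file
      s3 = boto3.client("s3")
      s3.put_object(Bucket=bucket, Key=key, Body=df.to_csv(index=False).encode("utf-8"))

  if __name__ == "__main__":
      data = extract("SELECT * FROM online_transactions")  # placeholder query
      cleaned = transform(data)
      load_data_to_s3(cleaned, "my-etl-bucket", "online_transactions.csv")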

How to Run the ETL Pipeline Project

A. To run the ETL pipeline from the command line:

Requirements

  • Python 3+
  • Python IDE or text editor and a command line interface

Instructions

  • Copy the .env.example file to .env and fill out the environment variables.
  • Install all the libraries needed to execute main.py.
  • Run the main.py script.

  • Windows:
  pip3 install -r requirements.txt
  python main.py
  • Mac:
  pip3 install -r requirements.txt
  python3 main.py
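
The authoritative variable names are defined in .env.example. As an illustration only, a filled-out .env for a Redshift-to-S3 pipeline typically contains entries along these lines (placeholder names and values, not the repository's actual ones):

  # Illustrative .env contents - check .env.example for the real variable names
  DBHOST=your-cluster.redshift.amazonaws.com
  DBPORT=5439
  DBNAME=dev
  DBUSER=awsuser
  DBPASSWORD=your-password
  AWS_ACCESS_KEY_ID=your-access-key
  AWS_SECRET_ACCESS_KEY=your-secret-key
  AWS_BUCKET=your-s3-bucket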

B. To run the ETL pipeline using Docker:

Requirements

  • Docker

Instructions

  • Build the docker image
  docker image build -t etl-pipeline .
  • Run the ETL job
  docker run --env-file .env etl-pipeline
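
For reference, a minimal Dockerfile consistent with the build and run commands above might look like the sketch below; the repository's actual Dockerfile may differ (base image and Python version are assumptions):

  # Minimal Dockerfile sketch - assumed, not copied from the repository
  FROM python:3.10-slim
  WORKDIR /app
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  COPY . .
  CMD ["python", "main.py"]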
