YTS MovieData ETL Pipeline

Overview

The MovieData ETL Pipeline project is a comprehensive solution for extracting, transforming, and loading movie data from web sources into a PostgreSQL database. This project is designed with scalability and efficiency in mind, leveraging Docker for containerization, Python for scripting, and PostgreSQL for data storage.

Project Workflow

Project Structure

MovieData-ETL
├── elt
│   ├── Dockerfile
│   ├── elt_pipeline.py
│   └── requirements.txt
├── source_db
│   ├── Dockerfile
│   ├── create_movies.sql
│   └── insert_movies.sql
├── worker.py
├── output.csv
├── output.xlsx
├── report.md
├── yts.txt
├── docker-compose.yaml
└── README.md

`elt` Directory

Dockerfile: Defines the environment for running the ETL script.
elt_pipeline.py: Contains the ETL logic to load data into PostgreSQL.
requirements.txt: Lists the Python dependencies.

`source_db` Directory

Dockerfile: Defines the environment for the PostgreSQL database.
create_movies.sql: SQL script to create the movies table.
insert_movies.sql: SQL script to insert movie data.

Root Directory

worker.py: Script to scrape movie data from the web and generate SQL insert statements.
output.csv: CSV file containing scraped movie data.
output.xlsx: Excel file containing scraped movie data.
report.md: Markdown file for logging the ETL process.
yts.txt: Raw HTML content of the scraped web page.
docker-compose.yaml: Docker Compose configuration to orchestrate the ETL pipeline.

Prerequisites

Ensure you have the following installed:

Docker: For containerization.
Docker Compose: To manage multi-container applications.
Python 3.8: For running the worker script.

Setup and Installation

Clone the Repository:

git clone https://github.com/TheODDYSEY/YTS-Pipeline-Postgres
cd MovieData-ETL

Build and Run the Docker Containers:
```
docker-compose up --build
```
This command builds the Docker images and starts the containers for both the PostgreSQL database and the ETL script.

Usage

Scrape Movie Data:

The worker.py script scrapes the latest movie data from YTS and generates SQL insert statements.
```
python worker.py
```
This will generate output.csv, output.xlsx, and insert_movies.sql files with the scraped movie data.
Run the ETL Pipeline:

The elt_pipeline.py script will wait for the PostgreSQL database to be ready, then load the movie data from the SQL script into the database.
```
docker-compose up elt_script
```

ETL Process Details

Extract

The extraction step involves scraping movie data from the YTS website using the worker.py script. BeautifulSoup is used to parse the HTML and extract relevant movie information.

Transform

The extracted data is transformed into a structured format using Pandas. The data is saved into CSV and Excel files for reference and further processing.

Load

The transformed data is loaded into the PostgreSQL database. The elt_pipeline.py script uses the psycopg2 library to connect to the database and load the data using SQL insert statements generated by the worker.py script.

Script Breakdown

`worker.py`

Step 1: Sends a GET request to the specified URL to fetch movie data.
Step 2: Parses the HTML content using BeautifulSoup.
Step 3: Extracts movie details and stores them in a list.
Step 4: Creates a Pandas DataFrame from the extracted data.
Step 5: Saves the DataFrame to CSV and Excel files.
Step 6: Generates SQL insert statements and saves them to a .sql file.

`elt_pipeline.py`

Database Connection: Connects to the PostgreSQL database using psycopg2.
Data Loading: Loads the movie data from the SQL script into the database.
Retry Logic: Implements retry logic to handle transient connection issues.

Configuration

The PostgreSQL database configuration and credentials are set in the docker-compose.yaml file and passed as environment variables to the Docker containers. Ensure the following values are correctly set:

services:
  movies_postgres:
    environment:
      POSTGRES_DB: movies_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret

Future Improvements

Automated Scheduling: Integrate a scheduler like cron or Airflow to automate the ETL process at regular intervals.
Data Validation: Implement data validation and cleaning steps to ensure data quality.
Enhanced Logging: Add more detailed logging to track the ETL process and handle errors more effectively.

Contributing

Contributions are welcome! Please read the contributing guidelines for more details on how to get started.

License

This project is licensed under the MIT License. See the LICENSE file for details.

For any inquiries or support, please open an issue on our GitHub repository.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

YTS MovieData ETL Pipeline

Overview

Project Workflow

Table of Contents

Project Structure

`elt` Directory

`source_db` Directory

Root Directory

Prerequisites

Setup and Installation

Usage

ETL Process Details

Extract

Transform

Load

Script Breakdown

`worker.py`

`elt_pipeline.py`

Configuration

Future Improvements

Contributing

License

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
elt		elt
source_db		source_db
CONTRIBUTING.md		CONTRIBUTING.md
README.md		README.md
docker-compose.yaml		docker-compose.yaml

TheODDYSEY/YTS-Pipeline-Postgres

Folders and files

Latest commit

History

Repository files navigation

YTS MovieData ETL Pipeline

Overview

Project Workflow

Table of Contents

Project Structure

elt Directory

source_db Directory

Root Directory

Prerequisites

Setup and Installation

Usage

ETL Process Details

Extract

Transform

Load

Script Breakdown

worker.py

elt_pipeline.py

Configuration

Future Improvements

Contributing

License

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`elt` Directory

`source_db` Directory

`worker.py`

`elt_pipeline.py`

Packages