This repository contains the source code and configurations for an ETL (Extract, Transform, Load)
data engineering project focused on Uber data. The project encompasses several components, including an Analytics layer, a data pipeline tree, the Mage UI, Mage pipeline blocks, and more.
data_exporter.py
This module is responsible for exporting data to a BigQuery warehouse. It utilizes Mage AI's BigQuery class and reads its configuration from io_config.yaml.
```python
# Example usage:
# export_data_to_big_query(df, table_name='example_table')
from mage_ai.settings.repo import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from pandas import DataFrame
from os import path
# ... (imports and setup)

@data_exporter
def export_data_to_big_query(df: DataFrame, **kwargs) -> None:
    # Implementation details...
```
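For reference, here is a minimal sketch of what the elided implementation might look like, based on Mage AI's standard BigQuery exporter template. The destination `table_id` and the `default` config profile are assumptions and should be replaced with your own values:

```python
from os import path

from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from mage_ai.settings.repo import get_repo_path
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(df: DataFrame, **kwargs) -> None:
    # Hypothetical destination table: replace with your own project, dataset, and table.
    table_id = 'your-gcp-project.Uber_dataset.example_table'

    # Read the BigQuery credentials from io_config.yaml (profile name assumed to be 'default').
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table if it already exists
    )
```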
data_loader.py
This module demonstrates loading data from an API using the requests library. The data is fetched from a specified URL and read into a Pandas DataFrame.
```python
# Example usage:
# df = load_data_from_api()
import io
import pandas as pd
import requests
# ... (imports)

@data_loader
def load_data_from_api(*args, **kwargs):
    # Implementation details...
```
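A minimal sketch of how the loader might be implemented, following Mage AI's standard API data-loader template. The raw-file URL below is an assumption derived from the repository path of uber_data.csv:

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data_from_api(*args, **kwargs) -> pd.DataFrame:
    # Assumed raw URL for the CSV stored in this repository's data/ folder.
    url = 'https://raw.githubusercontent.com/anmol1512/UberData_ETL_Data-Engineering_Project/main/data/uber_data.csv'

    # Fetch the CSV over HTTP and parse it into a DataFrame.
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))
```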
transformer.py
This module provides a template for a transformer block, showcasing data transformation operations using Pandas. The example includes handling duplicates, dropping rows with missing values, converting datetime columns, generating a new trip_id column, and creating dimension tables.
```python
# Example usage:
# output = transform(input_df)
import pandas as pd
# ... (imports)

@transformer
def transform(df, *args, **kwargs):
    # Transformation logic...
    return output_tables


@test
def test_output(output, *args) -> None:
    # Testing logic...
```
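A condensed sketch of what the transformation might look like. It illustrates the steps described above (deduplication, dropping missing values, datetime conversion, generating trip_id, and building dimension tables); the column names and the dimensions shown are assumptions based on the TLC schema, and only two dimensions are written out:

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def transform(df: pd.DataFrame, *args, **kwargs) -> dict:
    # Remove duplicate trips and rows with missing values, then derive a surrogate key.
    df = df.drop_duplicates().dropna().reset_index(drop=True)
    df['trip_id'] = df.index

    # Convert the pickup/dropoff timestamps (column names assumed from the TLC schema).
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

    # Example dimension tables; the remaining dimensions follow the same pattern.
    datetime_dim = df[['tpep_pickup_datetime', 'tpep_dropoff_datetime']].copy()
    datetime_dim['pick_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
    datetime_dim['datetime_id'] = datetime_dim.index

    passenger_count_dim = df[['passenger_count']].copy()
    passenger_count_dim['passenger_count_id'] = passenger_count_dim.index

    return {
        'datetime_dim': datetime_dim.to_dict(orient='dict'),
        'passenger_count_dim': passenger_count_dim.to_dict(orient='dict'),
    }


@test
def test_output(output, *args) -> None:
    # The block should return at least one dimension table.
    assert output is not None, 'The output is undefined'
```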
The Analytics Layer in this project contains essential components for data analysis and querying. Below are the details of the files included in this layer:
- File: Job_Execution_Graph.png
  This visual representation provides insights into the execution flow of jobs within the analytics layer.
- File: bigquery_analytics.sql
  The bigquery_analytics.sql file contains the SQL query used to create the bigquery_analytics_query table within the Uber_dataset in BigQuery. This table is the result of joining the various dimension and fact tables to form a comprehensive dataset for analytics purposes.
```sql
DROP TABLE IF EXISTS Uber_dataset.bigquery_analytics_query;
CREATE TABLE Uber_dataset.bigquery_analytics_query AS
(
  -- SQL Query --
);
```
The SQL query involves joining multiple tables such as trip_table, datetime_dim, passenger_count_dim, trip_distance_dim, ratecode_dim, pickup_location_dim, dropoff_location_dim, and payment_type_dim to create a denormalized view suitable for analytics.
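As a rough illustration only, such a statement could also be run from Python with the google-cloud-bigquery client, as sketched below. The join keys (datetime_id, passenger_count_id, ...) and the selected columns are assumptions inferred from the dimension names, not the exact query shipped in bigquery_analytics.sql; only two joins are spelled out and the remaining dimensions would follow the same pattern:

```python
from google.cloud import bigquery

# Assumes the google-cloud-bigquery package is installed and
# GOOGLE_APPLICATION_CREDENTIALS points at the service account key.
client = bigquery.Client()

# Hypothetical skeleton of the denormalizing query; join keys and columns are assumed.
sql = """
DROP TABLE IF EXISTS Uber_dataset.bigquery_analytics_query;
CREATE TABLE Uber_dataset.bigquery_analytics_query AS (
  SELECT
    t.trip_id,
    d.tpep_pickup_datetime,
    d.tpep_dropoff_datetime,
    p.passenger_count,
    t.fare_amount,
    t.total_amount
  FROM Uber_dataset.trip_table t
  JOIN Uber_dataset.datetime_dim d
    ON t.datetime_id = d.datetime_id
  JOIN Uber_dataset.passenger_count_dim p
    ON t.passenger_count_id = p.passenger_count_id
  -- ... remaining dimension joins (trip_distance_dim, ratecode_dim,
  -- pickup_location_dim, dropoff_location_dim, payment_type_dim) follow the same pattern
);
"""

client.query(sql).result()  # run the multi-statement script and wait for completion
```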
The dataset used in this project comprises trip records from both yellow and green taxi services. It includes essential information such as pick-up and drop-off dates/times, locations, trip distances, fare details, rate types, payment types, and driver-reported passenger counts.
The dataset is sourced from the TLC Trip Record Data, which provides comprehensive information about taxi trips in New York City.
The dataset used in the video can be found here: https://github.com/anmol1512/UberData_ETL_Data-Engineering_Project/blob/main/data/uber_data.csv
- Additional Information: For a detailed understanding of the dataset and its attributes, refer to the Data Dictionary: https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
The configuration files in this project play a crucial role in setting up and managing various parameters for the project. Below is the description of the configuration file included:
- type: Specifies the type of the service account.
- project_id: Your Google Cloud Project ID.
- private_key_id: Your private key ID.
- private_key: Your private key.
- client_email: Your client email.
- auth_uri: Your authentication URI.
- token_uri: Your token URI.
- auth_provider_x509_cert_url: Your authentication provider x509 certificate URL.
- client_x509_cert_url: Your client x509 certificate URL.
- GOOGLE_SERVICE_ACC_KEY_FILEPATH: Filepath to the Google Service Account Key file.
- GOOGLE_LOCATION: (Optional) Specifies the location, e.g., "US".
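As a rough sketch, the values above live in io_config.yaml under a profile (assumed here to be Mage's default profile named 'default') and can be inspected with a few lines of Python using PyYAML; the file path and profile name below are assumptions:

```python
import yaml  # PyYAML

# Load the Mage io_config.yaml from the repository root (path assumed).
with open('io_config.yaml') as f:
    config = yaml.safe_load(f)

# Profile name 'default' is an assumption; adjust if the project uses another profile.
profile = config['default']

print(profile.get('GOOGLE_SERVICE_ACC_KEY_FILEPATH'))  # path to the service account key
print(profile.get('GOOGLE_LOCATION', 'US'))            # optional location, e.g. "US"
```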
The project dependencies are specified in the requirements.txt file. Install the dependencies using:
```
pip install -r requirements.txt
```
Feel free to explore, contribute, or use the components provided in this repository for your data engineering projects!
Programming Language
- Python
Cloud Platform
- Google Cloud Platform (GCP)
Google Cloud Services
- Google Storage
- Compute Instance
- BigQuery
Data Pipeline Tool
- Mage AI
Data Visualization
- Looker Studio