This project implements an ELT (Extract, Load, Transform) pipeline using the FlightRadar24 API, which exposes real-time flight, airport, and airline data.
The pipeline is built in the Google Cloud Platform environment. It makes use of the following services:
- Cloud Storage to store data
- Dataproc to run PySpark jobs
- BigQuery to query data
- Cloud Composer to orchestrate the whole pipeline
The pipeline is materialized by a DAG within a Cloud Composer environment. The DAG is triggered every 2 hours and runs a PySpark job on a Dataproc cluster. The PySpark job requests data from the FlightRadar24 API (step 2), stores the raw data as a timestamped .json file in a Cloud Storage bucket, normalizes the raw data and stores it as partitioned .parquet files in the same bucket (step 3), and finally loads the data into a BigQuery dataset (step 4).
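For illustration, a minimal sketch of what such a PySpark job could look like is shown below. The bucket name, BigQuery table, and API endpoint are placeholders rather than the project's actual values, and the real job may use a dedicated FlightRadar24 client library instead of a raw HTTP call.

```python
import json
from datetime import datetime, timezone

import requests
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder names -- replace with your own bucket and BigQuery table.
BUCKET = "your-project-name-bucket"
BQ_TABLE = "your_project_name_dataset.flights"

spark = SparkSession.builder.appName("flight-radar-pyspark-job").getOrCreate()

# Step 2: request real-time flight data (illustrative endpoint; the actual job
# may rely on a FlightRadar24 client library instead).
payload = requests.get("https://example.com/flightradar24/flights", timeout=30).json()

# Store the raw response as a timestamped .json file in the Cloud Storage bucket.
ts = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
raw_df = spark.createDataFrame([(json.dumps(payload),)], ["raw"])
raw_df.write.mode("overwrite").text(f"gs://{BUCKET}/raw/flights_{ts}.json")

# Step 3: normalize the raw data and write it as partitioned .parquet files.
flights_df = (
    spark.read.json(spark.sparkContext.parallelize([json.dumps(payload)]))
    .withColumn("ingested_at", F.lit(ts))
)
flights_df.write.mode("append").partitionBy("ingested_at").parquet(
    f"gs://{BUCKET}/parquet/flights/"
)

# Step 4: load the normalized data into BigQuery via the Spark BigQuery connector.
(
    flights_df.write.format("bigquery")
    .option("table", BQ_TABLE)
    .option("temporaryGcsBucket", BUCKET)
    .mode("append")
    .save()
)
```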
The cloud infrastructure is mainly built with Terraform (step 1) and transformations are performed within dbt Cloud (step 5).
To build the pipeline, follow the steps below:
- Clone this repository
- Go to Google Cloud Platform Console and create a new project
- Under IAM & Admin > Service Accounts, click on Create service account and give it a name
- Click on Keys and Add key > Create new key > JSON
- Download the generated JSON file and place it at the root of this project
- Under IAM & Admin > IAM, click on the pencil icon to edit the principal and assign the following roles to the service account:
- Project IAM Admin
- Service Account Admin
- Service Account User
- Storage Admin
- Composer Administrator
- Dataproc Administrator
- Dataproc Worker
- BigQuery User
- In the main.tf file, replace "flight-radar-service-account-credentials.json" with the name of your service account credentials file
- In the locals.tf file, replace project_name, project_id and region with the project name, project ID and region of the project you just created
- Run `make build`. This step could take a while (about 30 minutes).
The pipeline should now be up and running!
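Once the first DAG run has completed, you can optionally sanity-check the loaded data with the BigQuery Python client. This is a minimal sketch; the project ID, dataset and table names below are assumptions to adapt to your setup.

```python
from google.cloud import bigquery

# Placeholder identifiers -- adjust to your own project and dataset.
client = bigquery.Client(project="your-project-id")

query = """
    SELECT COUNT(*) AS n_rows
    FROM `your-project-id.your_project_name_dataset.flights`
"""

for row in client.query(query).result():
    print(f"Rows loaded so far: {row.n_rows}")
```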
Search for Composer Environment in the GCP console top bar, then click on `your-project-name-environment`. Under DAGs, you should be able to see the deployed DAG `your_project_name_dag`. From there, you can select a run and its associated tasks. If a task has failed, you can click on it and analyze the logs.
Since the DAG is composed of a single task (a DataprocSubmitJobOperator task), you can also search for Dataproc in the top bar, select the cluster `your-project-name-cluster`, then, under Jobs, click on the job that failed and analyze the output.
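For reference, the single-task DAG described above could look roughly like the sketch below. The project ID, region, cluster name and PySpark file URI are placeholders; the actual DAG deployed by this project may differ.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Placeholder identifiers -- replace with your own project, region, cluster and bucket.
PROJECT_ID = "your-project-id"
REGION = "your-region"
CLUSTER_NAME = "your-project-name-cluster"
PYSPARK_URI = "gs://your-project-name-bucket/jobs/flight_radar_job.py"

with DAG(
    dag_id="your_project_name_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=2),  # trigger every 2 hours
    catchup=False,
) as dag:
    # Single task: submit the PySpark job to the Dataproc cluster.
    submit_pyspark_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": PYSPARK_URI},
        },
    )
```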