This guide provides step-by-step instructions to set up a serverless data pipeline using AWS Lambda for processing raw data into a Delta table without the need for Databricks or Spark. This approach is suitable for handling data sizes up to 400 MB.
By leveraging AWS Lambda, a serverless compute service, we can design and deploy a lightweight data pipeline that efficiently processes raw data and stores it in a Delta table on Amazon S3. This approach offers scalability, cost-effectiveness, and simplicity, making it ideal for scenarios where the data size is relatively small and the infrastructure requirements are minimal.
Before starting, ensure you have:
- An AWS account
- AWS CLI installed and configured
- Create an S3 bucket named `nyc_taxi_data`.
- Inside the bucket, create two prefixes (folders): `raw_data` and `delta_table`.
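If you prefer to script this step, a minimal boto3 sketch is shown below. The bucket and prefix names come from this guide; note that real S3 bucket names must be DNS-compliant (lowercase, no underscores), so you may need to substitute a compliant name such as `nyc-taxi-data`.

```python
import boto3

s3 = boto3.client("s3")

# Bucket name from this guide; adjust it if S3 rejects the underscores.
BUCKET = "nyc_taxi_data"

# Create the bucket (no LocationConstraint is needed in us-east-1).
s3.create_bucket(Bucket=BUCKET)

# S3 has no real folders; zero-byte objects ending in "/" act as folder markers.
for prefix in ("raw_data/", "delta_table/"):
    s3.put_object(Bucket=BUCKET, Key=prefix)
```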
- Deploy the AWS Lambda function as a container image:
- Authenticate Docker to your Amazon ECR registry (replace `111122223333` with your AWS account ID):
`aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com`
- Create a repository in Amazon ECR for the Lambda function image:
`aws ecr create-repository --repository-name data_processing --region us-east-1 --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE`
- Run the `deploy_lambda.sh` script to deploy the AWS Lambda container image.
- Configure the Lambda function to trigger when data is put into the `raw_data` prefix and to store the processed data in the `delta_table` prefix.
- Assign an IAM role with the appropriate permissions to the Lambda function and the S3 bucket.
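The processing logic inside the Lambda container is not reproduced in this guide; a minimal sketch of what the handler might look like, assuming the `deltalake` and `pyarrow` packages are bundled in the image and that each incoming Parquet file is simply appended to the Delta table, is:

```python
import io
import urllib.parse

import boto3
import pyarrow.parquet as pq
from deltalake import write_deltalake

s3 = boto3.client("s3")

# Assumed target location: the delta_table prefix in the guide's bucket.
DELTA_TABLE_URI = "s3://nyc_taxi_data/delta_table"


def handler(event, context):
    # An S3 put event can contain several records; process each uploaded object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw Parquet file into an Arrow table (no Spark involved).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        table = pq.read_table(io.BytesIO(body))

        # Append to the Delta table on S3; the exact storage_options you need
        # depend on your deltalake version and whether you use DynamoDB locking.
        write_deltalake(
            DELTA_TABLE_URI,
            table,
            mode="append",
            storage_options={"AWS_REGION": "us-east-1"},
        )

    return {"status": "ok", "records": len(event["Records"])}
```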
- Run the `download_parquet.sh` script to download the required Parquet files and upload them to the `raw_data` prefix in the S3 bucket.
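The contents of `download_parquet.sh` are not shown here; a rough Python equivalent, assuming you want one month of the public NYC TLC yellow-taxi trip data (the URL below is an example and may need updating), could look like this:

```python
import urllib.request

import boto3

# Example source file; the NYC TLC publishes monthly Parquet files at this host.
URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
LOCAL_FILE = "yellow_tripdata_2023-01.parquet"

BUCKET = "nyc_taxi_data"          # adjust to your actual bucket name
KEY = "raw_data/" + LOCAL_FILE    # uploading here fires the Lambda trigger

# Download the Parquet file, then upload it to the raw_data prefix.
urllib.request.urlretrieve(URL, LOCAL_FILE)
boto3.client("s3").upload_file(LOCAL_FILE, BUCKET, KEY)
```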
- Use the `nyc_taxi_status.py` script to perform analysis directly on the data stored in the S3 bucket.
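The exact queries in `nyc_taxi_status.py` are specific to this repository; as an illustration, reading the raw Parquet data straight from S3 with pyarrow and computing a simple summary might look like this (bucket and prefix names assumed from the guide):

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Read the raw Parquet data directly from S3.
s3 = fs.S3FileSystem(region="us-east-1")
dataset = ds.dataset("nyc_taxi_data/raw_data", filesystem=s3, format="parquet")

df = dataset.to_table().to_pandas()

# Example summary statistics; the real script may compute different ones.
print("rows:", len(df))
print(df.describe(include="all").head())
```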
- Use the `examine.py` script to examine the Delta table, including its metadata, schema, history, and current add actions.
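The `deltalake` package exposes all four of these pieces of information directly, so a minimal sketch of what such a script might do (table URI assumed from the layout above) is:

```python
from deltalake import DeltaTable

# Open the Delta table written by the Lambda function.
dt = DeltaTable(
    "s3://nyc_taxi_data/delta_table",
    storage_options={"AWS_REGION": "us-east-1"},
)

print("metadata:", dt.metadata())   # table id, name, partition columns, ...
print("schema:", dt.schema())       # column names and types
print("history:", dt.history())     # commit log: operations and timestamps
print("add actions:")
print(dt.get_add_actions(flatten=True).to_pandas())  # files currently in the table
```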
By following these steps, you can run a serverless data pipeline on AWS Lambda that processes and analyzes raw data efficiently. The result is scalable, cost-effective processing that suits a variety of workloads without relying on Databricks or Spark.