
Data Engineering Coding Challenge

Gabriele Degola, June 2022


Introduction

This project simulates concrete data engineering scenarios, covering the generation of business intelligence reports, the delivery of data insights, and applied machine learning.

Dependencies

Solutions are developed in Python on top of Apache Spark, leveraging the RDD API, the DataFrame API, and the MLlib library. To download and install Spark, refer to the official documentation.
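As an illustration only, a minimal sketch of the PySpark boilerplate this kind of solution builds on (the application name, file path, and read options below are placeholders, not taken from the repository):

```python
from pyspark.sql import SparkSession

# Create a local SparkSession; the application name is arbitrary.
spark = SparkSession.builder \
    .appName("spark-exercises") \
    .getOrCreate()

# Read an input file into a DataFrame (path and options are placeholders).
df = spark.read.csv("path/to/input/file.txt", header=True, inferSchema=True)

# The same data is also reachable through the lower-level RDD API.
rdd = df.rdd

spark.stop()
```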

Task 2.5 is solved through Apache Airflow.
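A rough sketch of how such a task could be orchestrated with Airflow is shown below; the DAG id, schedule, dates, and file paths are hypothetical and not taken from the repository:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG wrapping a spark-submit call; names and paths are placeholders.
with DAG(
    dag_id="task_2_5",
    start_date=datetime(2022, 6, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /path/to/src/task_2_5.py "
                     "/path/to/input/file.txt /path/to/output/file.txt",
    )
```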

Project organization

This git repo is organized as follows:

.
├── data/
├── src/
├── out/
└── README.md
  • data/ contains the datasets used in the different exercises.
  • src/ contains the source code files, named task_x_y.py (solution of task y of part x). Solutions are described in the associated README file.
  • out/ contains the output files, named following the same convention.

Data

Three datasets are used in total, one for each part of the challenge; they are stored in the data/ directory.

Run the solutions

All solutions are designed to be run through the spark-submit command on a local Spark cluster with a single worker thread.

spark-submit task_x_y.py path/to/input/file.txt path/to/output/file.txt

Specific instructions are documented in each Python script.
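For illustration, a sketch of the general structure such a script might follow; the argument handling, application name, and read/write calls are assumptions, not the repository's actual code:

```python
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Input and output paths are passed on the spark-submit command line.
    input_path, output_path = sys.argv[1], sys.argv[2]

    # local[1] runs the job on a local cluster with a single worker thread.
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("task_x_y") \
        .getOrCreate()

    df = spark.read.text(input_path)
    # ... task-specific transformations would go here ...
    df.write.text(output_path)

    spark.stop()
```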
