A tutorial on how to use PySpark on the Magarvey Lab computational server.
Apache Spark is a distributed computing platform written in Scala that runs on the Java Virtual Machine, and PySpark is the Python client library for accessing it. When your compute workload exceeds the memory capacity of the computational server, consider following this tutorial to learn how to distribute the workload across the cluster. In essence, Apache Spark decomposes your task into partitions, maps your algorithm onto each partition, and then collects the partitions to perform aggregate computations. Thus, make sure your workload conforms to this compute paradigm.
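As a concrete illustration of this partition-map-aggregate pattern, the minimal sketch below squares a range of numbers and sums the results. The input data, the squaring function, and the partition count are placeholders chosen for illustration only, not part of the lab workflow.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-map-collect-demo").getOrCreate()
sc = spark.sparkContext

# Decompose the task into partitions (8 here, chosen arbitrarily) ...
numbers = sc.parallelize(range(1_000_000), numSlices=8)

# ... map the algorithm onto each partition ...
squares = numbers.map(lambda x: x * x)

# ... then aggregate the partitioned results back on the driver.
total = squares.sum()
print(f"Sum of squares: {total}")

spark.stop()
```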
- Install Anaconda 3
- Configure Anaconda 3 to prioritize the `conda-forge` channel

  ```bash
  conda config --add channels conda-forge
  conda config --set channel_priority strict
  ```
- Install and use `mamba` to parallelize `conda` operations

  ```bash
  conda install mamba --yes
  ```
- Install dependencies (`conda-pack` and `pyspark`; see the usage sketch after this list)

  ```bash
  mamba install conda-pack pyspark --yes
  ```
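`conda-pack` is commonly used to bundle a conda environment so that Spark executors run with the same Python packages as the driver. The sketch below follows the packaging pattern from the PySpark documentation; the archive name, the `environment` unpack alias, and the `spark.archives` option (Spark 3.1+; YARN deployments use `spark.yarn.dist.archives` instead) are assumptions about the cluster setup rather than steps taken from this tutorial.

```python
import os

import conda_pack
from pyspark.sql import SparkSession

# Pack the active conda environment into an archive (file name is arbitrary).
conda_pack.pack(output="pyspark_conda_env.tar.gz", force=True)

# Point executors at the Python interpreter inside the unpacked archive;
# "environment" is the alias the archive is unpacked under on each worker.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    .appName("conda-pack-demo")
    # On YARN, use 'spark.yarn.dist.archives' instead of 'spark.archives'.
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)
```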