
datapains-spark-k8s-example

In this repo I show how to build a docker image running Spark and PySpark that is compatible with the official Spark Operator.

Please also see my other repository where I show how to deploy the operator.

Prerequisites

  • python3.9
  • poetry 1.1.7
  • docker-desktop
  • make

Docker

make build-container-image

For M1 (Apple silicon), build for linux/amd64:

make build-container-image DOCKER_BUILD="buildx build --platform linux/amd64"

I have defined a docker image which uses Spark 3.5.1, Python 3.9, Poetry and Delta Lake (you can see this in the shell output below).

NOTE! The tools/scripts/entrypoint.sh has been modified to set up Poetry so that PySpark uses the docker image's Poetry environment.

Test locally

Spark Shell

make local-pyspark-shell
  • You will see how the entrypoint works in action below.
  • The shell starts and you can play around with the Spark distribution without having to set it up on your local machine; it runs entirely inside the docker image.
+ cd /opt/spark/work-dir
++ poetry show -v
++ cut -d ' ' -f 3
++ head -n1
+ export PYSPARK_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
+ PYSPARK_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
++ poetry show -v
++ head -n1
++ cut -d ' ' -f 3
+ export PYSPARK_DRIVER_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
+ PYSPARK_DRIVER_PYTHON=/root/.cache/pypoetry/virtualenvs/datapains-spark-k8s-examples-2OPaUQvv-py3.9/bin/python
+ cd -
/
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z x ']'
+ export PYSPARK_PYTHON
+ '[' -z x ']'
+ export PYSPARK_DRIVER_PYTHON
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='opt/spark/conf::opt/spark/jars/*'
+ case "$1" in
+ echo 'Non-spark-on-k8s command provided, proceeding in pass-through mode...'
Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ exec /usr/bin/tini -s -- pyspark --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/10 08:28:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Python version 3.9.2 (default, Feb 28 2021 17:03:44)
Spark context Web UI available at http://f6c40b7da839:4040
Spark context available as 'sc' (master = local[*], app id = local-1720600130530).
SparkSession available as 'spark'.
>>>
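
From inside the shell you can sanity-check that the entrypoint really pointed PySpark at the Poetry virtualenv rather than the system Python. This is just an illustrative check, not something shipped in the repo:

>>> import sys, os
>>> sys.executable                      # driver interpreter, set via PYSPARK_DRIVER_PYTHON
>>> os.environ.get("PYSPARK_PYTHON")    # executor interpreter exported by the entrypoint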

Quick example:

>>> from delta.tables import DeltaTable
>>> data = [[1, ("Alice", "Smith", 29)], [2, ("Bob", "Brown", 40)], [3, ("Charlie", "Johnson", 35)]]
>>> columns = ["id", "data"]
>>> df = spark.createDataFrame(data, columns)
>>> 
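
Since the shell is started with the Delta Lake SQL extension and catalog (see the pyspark command in the entrypoint output above), you can also round-trip the DataFrame through a Delta table. A minimal sketch, assuming a writable path such as /tmp/delta-demo inside the container (the path is illustrative, not part of the repo):

>>> df.write.format("delta").mode("overwrite").save("/tmp/delta-demo")
>>> dt = DeltaTable.forPath(spark, "/tmp/delta-demo")
>>> dt.toDF().show()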

Deploy - Argo Workflow

Please go to my Argo Workflows repo to see how I deploy an example job with the Spark Operator, re-using this image as the base.
