This repo started as a wrapper around Spark REPLs for easier use with the Spark RAPIDS plugin. Lately I have been putting more effort into maintaining standalone Jupyter notebooks that can be started without the wrapper script, and that are particularly easy to open in VS Code with the Jupyter extension.
A utility to start a RAPIDS-enabled Spark Shell with access to unit test resources from https://github.com/NVIDIA/spark-rapids
Before running the examples, make sure to at least execute `mvn package` in your local spark-rapids repo if you are not using binaries.
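For example, a minimal sketch of that prerequisite build, assuming the spark-rapids checkout lives at ~/repos/spark-rapids (the path is an assumption, adjust to your layout):

```bash
# hedged sketch: build the plugin and test artifacts in the local spark-rapids checkout
# (~/repos/spark-rapids is an assumed location)
cd ~/repos/spark-rapids
mvn package        # optionally add -DskipTests to shorten the build
```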
See `rapids.sh --help` for up-to-date information.
Usage: rapids.sh [OPTION]
Options:
--debug
enable bash tracing
-h, --help
prints this message
-l4j=LOG4J_CONF_FILE, --log4j-file=LOG4J_CONF_FILE
LOG4J_CONF_FILE location of a custom log4j config for local mode
-nsys, --nsys-profile
run with Nsight Systems profiling
-m=MASTER, --master=MASTER
specify MASTER for the spark command; the default is local[*] or local-cluster, see --num-local-execs
-n, --dry-run
generates and prints the spark submit command without executing
-nle=N, --num-local-execs=N
specify the number of local executors to use, default is 2. If > 1 use pseudo-distributed
local-cluster, otherwise local[*]
-uecp, --use-extra-classpath
use extraClassPath instead of --jars to add RAPIDS jars to spark-submit (default)
-uj, --use-jars
use --jars instead of extraClassPath to add RAPIDS jars to spark-submit
--ucx-shim=spark<3xy>
Spark buildver to populate shim-dependent package name of RapidsShuffleManager.
Will be replaced by a Boolean option
-cmd=CMD, --spark-command=CMD
specify one of spark-submit (default), spark-shell, pyspark, jupyter, jupyter-lab
-dopts=EOPTS, --driver-opts=EOPTS
pass EOPTS as --driver-java-options
-eopts=EOPTS, --executor-opts=EOPTS
pass EOPTS as spark.executor.extraJavaOptions
--gpu-fraction=GPU_FRACTION
GPU share per executor JVM unless in local or local-cluster mode, see spark.rapids.memory.gpu.allocFraction
Environment variables:

- `SPARK_RAPIDS_HOME` - the path either to the local repo or to the location used for downloading the binaries
- `SPARK_HOME` - the path either to the local Spark repo or to the root of a binary distro
- `SPARK_CMD` - one of `spark-shell`, `spark-submit` (default), `pyspark`, `jupyter`, `jupyter-lab`
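As a hedged example, the options and environment variables above can be combined with --dry-run to preview the generated spark-submit command without launching anything (the Spark distro path is an assumption):

```bash
# hedged example: only print the spark-submit command that rapids.sh would run
# (the SPARK_HOME path is an assumption)
SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh --dry-run --num-local-execs=2 --spark-command=spark-shell
```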
Use Spark RAPIDS in a Jupyter notebook:
SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 SPARK_CMD=jupyter[-lab] rapids.sh
Run in pseudo-distributed local-cluster mode:
NUM_LOCAL_EXECS=2 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh
Allow attaching a Java debugger to the driver JVM:
JDBSTR=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh
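Once the driver JVM is up, any JDWP client can attach; as a hedged example using the JDK's jdb (port 5005 matches the JDBSTR above):

```bash
# attach jdb to the driver JVM over the JDWP socket opened by the JDBSTR options above
jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=5005
```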
Single test suite
scala> run(new com.nvidia.spark.rapids.InsertPartition311Suite)
InsertPartition311Suite:
...
Single test case
scala> run(new com.nvidia.spark.rapids.HashAggregatesSuite, "sum(floats) group by more_floats 2 partitions")
HashAggregatesSuite:
...
In pyspark-based drivers one can use data generators from spark-rapids/integration-tests or run whole pytests. Add `rapids.py` as an IPython startup file, e.g. on *NIX:
cp src/python/rapids.py ~/.ipython/profile_default/startup/
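Then start a pyspark-based driver through rapids.sh (a hedged example; the Spark distro path is an assumption). Inside that session the generators from rapids.py are in scope:

```bash
# hedged example: launch a pyspark REPL with the RAPIDS jars attached
# (the SPARK_HOME path is an assumption)
SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 SPARK_CMD=pyspark rapids.sh
```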
key_data_gen = StructGen([
('a', IntegerGen(min_val=0, max_val=4)),
('b', IntegerGen(min_val=5, max_val=9)),
], nullable=False)
val_data_gen = IntegerGen()
df = two_col_df(spark, key_data_gen, val_data_gen)
...
runpytest('test_struct_count_distinct')