TPC-DS is a standard benchmark for relational databases that measures performance with a power test. The official TPC-DS tools can be found at tpc.org. We have used v2.10.0 with some modifications from gregrahn/tpcds-kit.
For TPC-DS benchmarking, we have used Apache Spark's module for structured data processing, Spark SQL. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. You can find out more in the official documentation.
Before starting, you need to set up your machine. Please follow the guide below to set up Spark on a Mac machine.
Note: On Linux, you might need to manually install apache-spark and set up your Python environment.
We have created a setup script which will set up Homebrew, apache-spark and a conda environment. If you are on a Mac machine, you can run the following commands:
git clone https://github.com/mohammadzainabbas/data-warehouse-project.git
cd data-warehouse-project && sh scripts/setup.sh
After running the above commands, your file structure should look something like this (go to the parent directory first with cd ..):
├── data-warehouse-project
│   ├── schema
│   └── scripts
└── tpcds-kit
    ├── query_templates
    ├── query_variants
    ├── specification
    └── tools
For Mac users, that's all you need to do. You can now skip to the benchmarking part.
For Linux users, please follow the steps below for setup.
If you are on Linux, you need to install Apache Spark yourself. You can follow this helpful guide to install Apache Spark.
We recommend installing conda on your machine. You can set up conda from here.
After you have conda, create a new environment via:
conda create -n spark_env python=3.8 pandas
Note: We are using Python 3.8 because Spark does not support Python 3.9 and above (at the time of writing).
Activate your environment:
conda activate spark_env
Now, you need to install pyspark:
pip install pyspark
If you are using bash:
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc
And if you are using zsh:
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc
For Linux users, you also need to manually install tpcds-kit from gregrahn/tpcds-kit.
Make sure you end up with the same file structure as shown above.
Run the following command to benchmark at scale N:
sh scripts/benchmark.sh -scale N
Here, N is the scale factor against which you want to benchmark.
If you want to run multiple benchmark scales:
- Modify the benchmark_scales variable to your scale values
- Run the command below:
sh scripts/benchmark_all.sh
This will run for all the scales.
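For illustration, the benchmark_scales variable could be set like this (the exact syntax inside scripts/benchmark_all.sh may differ; check the script itself):

```shell
# Hypothetical example: run the benchmark for the 1, 2 and 5 GB scale factors
benchmark_scales="1 2 5"
```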
The benchmarking results can be found under the benchmark folder. For example, when you run sh scripts/benchmark.sh -scale 1 (i.e. benchmarking at the 1 GB scale), you will find two files in the benchmark directory:
- benchmark_timings_1gb.csv contains the average time (in seconds) for each query.
- benchmark_timings_1gb.pickle contains all the timings for each query (usually 5 per query).
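To inspect the results programmatically, you could load the CSV with pandas. The column names below are assumptions for illustration; check the actual file produced by the script:

```python
import io
import pandas as pd

# Hypothetical sample mirroring benchmark_timings_1gb.csv
# (the real column names may differ)
sample = io.StringIO(
    "query,avg_time_s\n"
    "query1,12.3\n"
    "query2,4.5\n"
    "query3,8.0\n"
)
df = pd.read_csv(sample)

# Find the slowest query by average time
slowest = df.sort_values("avg_time_s", ascending=False).iloc[0]["query"]
print(slowest)
```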