AdrianAntico/Benchmarks

Last Updated: 06/21/2024

Background

This repo contains files for a data frame benchmark. Currently, the data frame packages tested include R data.table, Python Polars, R DuckDB, Python Pandas, and R Collapse.

All of the packages are installed as recommended. I'm running on Windows 10. If anyone wants to run these benchmarks on macOS or Linux, please share your results and I will display them. Lastly, I'm running everything locally, not in the cloud.

The datasets replicate a real-world example of a beverage company's data, at 1M, 10M, 100M, and 1B records. Each dataset includes a date variable, four group variables, and four numeric variables. For each dataset, the benchmark starts with the Date variable, then adds additional group variables, and then repeats those runs with additional numeric variables.

Last but not least, I run the R scripts in RStudio and the Python scripts in VS Code.


Current Frameworks Tested

  • R data.table: v1.15.99
  • R Collapse: v2.0.15
  • R DuckDB: v1.0.0
  • Python Polars: v1.0.0
  • Python Pandas: v2.2.2

Current Operations

  • Aggregation
  • Melt
  • Cast
  • Windowing (lags)
  • Union
  • Left Join
  • Inner Join
  • Filter

Dataset Attributes

Common attributes across datasets:

  • Brand: 13 levels
  • Category: 6 levels
  • Beverage Flavor: 21 levels
  • Four numeric variables
  • One Date Variable

1M Rows Data

  • Customer: 99 levels

10M Rows Data

  • Customer: 1071 levels

100M Rows Data

  • Customer: 10793 levels

1Bn Rows Data

  • Customer: 108017 levels
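
FakeBevDataBuilds.R (see Replicate Benchmarks below) generates these datasets. For readers who prefer Python, here is a rough Polars sketch of a table with the same shape; the column names, integer level encodings, and numeric distributions are illustrative assumptions, not the repo's actual schema.

```python
# Illustrative sketch only: names, level counts per size, and distributions are assumptions.
import numpy as np
import polars as pl
from datetime import date, timedelta

def build_fake_bev_data(n_rows: int, n_customers: int, seed: int = 42) -> pl.DataFrame:
    rng = np.random.default_rng(seed)
    start = date(2021, 1, 1)
    return pl.DataFrame({
        "Date": [start + timedelta(days=int(d)) for d in rng.integers(0, 365 * 3, n_rows)],
        "Customer": rng.integers(0, n_customers, n_rows),  # e.g. 99 levels for the 1M-row dataset
        "Brand": rng.integers(0, 13, n_rows),               # 13 levels
        "Category": rng.integers(0, 6, n_rows),             # 6 levels
        "Flavor": rng.integers(0, 21, n_rows),               # 21 levels
        "Units": rng.gamma(2.0, 50.0, n_rows),
        "Revenue": rng.gamma(2.0, 100.0, n_rows),
        "Cost": rng.gamma(2.0, 60.0, n_rows),
        "Margin": rng.normal(0.30, 0.10, n_rows),
    })

df_1m = build_fake_bev_data(1_000_000, 99)
```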

Machine Specs

  • Windows 10 OS
  • Memory: 256GB
  • CPU: 32 cores / 64 threads
  • AMD Ryzen CPU

Replicate Benchmarks

Aggregation Sum

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run AggSum_datatable.R
  • Run AggSum_DuckDB.R
  • Run AggSum_Polars.py
  • Run AggSum_Pandas.py
  • Run AggSum_collapse.R
  • Run CombineResults_AggSum
  • Done!
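
The AggSum scripts handle the timing across all frameworks; below is a minimal Polars sketch of the kind of grouped sum being measured, with a simple timer. The column names ("Date", "Brand", "Units") are assumptions, not the repo's actual schema.

```python
import time
import polars as pl

df = pl.DataFrame({
    "Date":  ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "Brand": ["A", "B", "A", "B"],
    "Units": [10.0, 5.0, 7.5, 2.5],
})

start = time.perf_counter()
agg = df.group_by(["Date", "Brand"]).agg(pl.col("Units").sum().alias("Units_Sum"))
print(agg)
print(f"sum aggregation: {time.perf_counter() - start:.4f}s")
```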

Melt

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run Melt_datatable.R
  • Run Melt_DuckDB.R
  • Run Melt_Polars.py
  • Run Melt_Pandas.py
  • Run Melt_collapse.R
  • Run CombineResults_Melt
  • Done!
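
For orientation, here is a minimal Polars sketch of the melt (wide-to-long) reshape being benchmarked; the column names are assumptions, and Polars 1.0 exposes this operation as unpivot.

```python
import polars as pl

df = pl.DataFrame({
    "Date":    ["2024-01-01", "2024-01-02"],
    "Brand":   ["A", "B"],
    "Units":   [10.0, 7.5],
    "Revenue": [100.0, 80.0],
})

# Wide -> long: one row per (id columns, measure) pair
long_df = df.unpivot(
    on=["Units", "Revenue"],   # measure columns
    index=["Date", "Brand"],   # id columns carried through
    variable_name="Measure",
    value_name="Value",
)
print(long_df)
```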

Cast

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run Cast_datatable.R
  • Run Cast_DuckDB.R
  • Run Cast_Polars.py
  • Run Cast_Pandas.py
  • Run Cast_collapse.R
  • Run CombineResults_Cast
  • Done!
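
A minimal Polars sketch of the cast (long-to-wide pivot) step; the pivot keys and value column here are assumptions and not necessarily what the Cast scripts use.

```python
import polars as pl

long_df = pl.DataFrame({
    "Date":  ["2024-01-01", "2024-01-01", "2024-01-02"],
    "Brand": ["A", "B", "A"],
    "Units": [10.0, 5.0, 7.5],
})

# Long -> wide: one column per Brand level, summing any duplicates
wide_df = long_df.pivot(
    on="Brand",
    index="Date",
    values="Units",
    aggregate_function="sum",
)
print(wide_df)
```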

Windowing (lags)

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run Lags_datatable.R
  • Run Lags_DuckDB.R
  • Run Lags_Polars.py
  • Run Lags_Pandas.py
  • Run Lags_collapse.R
  • Run CombineResults_Lags
  • Done!
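
A minimal Polars sketch of the lag windowing step: shift a numeric column within each group, ordered by date. The group and column names are assumptions.

```python
import polars as pl

df = pl.DataFrame({
    "Customer": [1, 1, 1, 2, 2],
    "Date":     ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"],
    "Units":    [10.0, 7.5, 5.0, 3.0, 4.0],
})

# Lag Units by 1 and 2 periods within each Customer, ordered by Date
lagged = df.sort("Date").with_columns(
    pl.col("Units").shift(1).over("Customer").alias("Units_Lag1"),
    pl.col("Units").shift(2).over("Customer").alias("Units_Lag2"),
)
print(lagged)
```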

Union

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run Union_datatable.R
  • Run Union_DuckDB.R
  • Run Union_Polars.py
  • Run Union_Pandas.py
  • Run CombineResults_Union
  • Done!
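
A minimal Polars sketch of the union step, i.e. vertically stacking two frames with matching schemas; column names are assumptions.

```python
import polars as pl

df_a = pl.DataFrame({"Date": ["2024-01-01"], "Brand": ["A"], "Units": [10.0]})
df_b = pl.DataFrame({"Date": ["2024-01-02"], "Brand": ["B"], "Units": [5.0]})

# Stack the two frames row-wise; "vertical" is the default strategy
unioned = pl.concat([df_a, df_b], how="vertical")
print(unioned)
```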

Left Join

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run LeftJoin_datatable.R
  • Run LeftJoin_collapse.R
  • Run LeftJoin_DuckDB.R
  • Run LeftJoin_Polars.py
  • Run LeftJoin_Pandas.py
  • Run CombineResults_LeftJoin
  • Done!
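
A minimal Polars sketch of a left join on an assumed key column; the actual join keys in the LeftJoin scripts may differ.

```python
import polars as pl

sales = pl.DataFrame({
    "Brand": ["A", "B", "C"],
    "Units": [10.0, 5.0, 7.5],
})
brands = pl.DataFrame({
    "Brand":    ["A", "B"],
    "Category": ["Soda", "Juice"],
})

# Keep every row of `sales`; unmatched Brand levels get null Category
left = sales.join(brands, on="Brand", how="left")
print(left)
```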

Inner Join

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run InnerJoin_datatable.R
  • Run InnerJoin_collapse.R
  • Run InnerJoin_DuckDB.R
  • Run InnerJoin_Polars.py
  • Run InnerJoin_Pandas.py
  • Run CombineResults_InnerJoin
  • Done!
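
A minimal Polars sketch of the inner join, which keeps only keys present in both frames; the key and column names are assumptions.

```python
import polars as pl

sales = pl.DataFrame({"Brand": ["A", "B", "C"], "Units": [10.0, 5.0, 7.5]})
brands = pl.DataFrame({"Brand": ["A", "B"], "Category": ["Soda", "Juice"]})

# Keep only Brand levels that appear in both frames
inner = sales.join(brands, on="Brand", how="inner")
print(inner)
```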

Filter

Click here to see steps
  • Fork the repo and clone it to your local machine
  • Modify the Path variable at the top of each script to reflect your file location
  • Run FakeBevDataBuilds.R to generate the benchmarking datasets
  • Run Filter_datatable.R
  • Run Filter_collapse.R
  • Run Filter_DuckDB.R
  • Run Filter_Polars.py
  • Run Filter_Pandas.py
  • Run CombineResults_Filter
  • Done!
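
A minimal Polars sketch of a row filter combining a group-level condition and a date cutoff; the columns and thresholds are assumptions, not the repo's actual filter.

```python
import polars as pl

df = pl.DataFrame({
    "Date":     ["2024-01-01", "2024-02-15", "2024-03-01"],
    "Category": ["Soda", "Juice", "Soda"],
    "Units":    [10.0, 5.0, 7.5],
}).with_columns(pl.col("Date").str.to_date())

# Keep Soda rows on or after a cutoff date
filtered = df.filter(
    (pl.col("Category") == "Soda") & (pl.col("Date") >= pl.date(2024, 2, 1))
)
print(filtered)
```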

Benchmark Results

In the plots below, the x-axis "Experiments" labels combine four letters, each preceded by a number. The letters mean:

  • M: millions of rows
  • N: number of numeric variables
  • D: number of date variables
  • G: number of additional group variables

Sum Aggregation

Total Run Time

Click here to see detailed results





Melt

Total Run Time

Click here to see detailed results



With DuckDB


Without DuckDB


Cast

Total Run Time

Click here to see detailed results




Windowing (lags)

Total Run Time

Click here to see detailed results


With DuckDB


Without DuckDB


With DuckDB


Without DuckDB


Union

Total Run Time

Click here to see detailed results




Left Join

Total Run Time

Click here to see detailed results




Inner Join

Total Run Time

Click here to see detailed results




Filter

Total Run Time

Click here to see detailed results



