GitHub - AdrianAntico/Benchmarks at 02312dc9cc94a5dc3a0f8d80c45f8f0d3e58ed5e

Collapse
Data
Datatable
DuckDB
Images
Pandas
Polars
CombineResults_AggSum.R
CombineResults_Cast.R
CombineResults_Lags.R
CombineResults_Melt.R
LICENSE
README.md


Background

This repo contains files for a data frames benchmark. Currently, the data frame packages tested are R's data.table, DuckDB, and Collapse, and Python's Polars and Pandas. Others to come.

Every package is installed the way its maintainers recommend. For example, DuckDB recommends installing in R with install.packages("duckdb"), so that is what I use. No package gets any special installation setup. I want to see off-the-shelf, simple-installation benchmarks, because I believe that is what most people use when running these frameworks. I'm also using Windows, which I believe to be the most popular OS. If anyone wants to run these on macOS or Linux, please share your results and I will display them. Lastly, I'm running this locally, not in the cloud, as I believe that to be the more common usage. Regardless of how common the setup is, I think it's important to see results under these conditions rather than only in cloud and Linux environments.

The datasets replicate a real-world example of a beverage company's data at 1M, 10M, 100M, and 1B records. Each dataset includes a Date variable, 4 group variables, and 4 numeric variables. For each dataset, the benchmark starts with the Date variable, then adds group variables, and then repeats that sequence with additional numeric variables.
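To make the dataset shape and experiment progression concrete, here is a minimal Python sketch. It is not the repo's generator (that is FakeBevDataBuilds.R). The column names, value ranges, and the exact order in which variables are added are assumptions made for illustration.

```python
import random
from datetime import date, timedelta

# Hypothetical column names; the real ones come from FakeBevDataBuilds.R.
GROUP_VARS = ["Customer", "Brand", "Category", "Region"]
NUMERIC_VARS = ["Volume", "Revenue", "Cost", "Margin"]


def make_rows(n_rows, seed=0):
    """Return n_rows dicts with 1 Date, 4 group, and 4 numeric columns."""
    rng = random.Random(seed)
    start = date(2020, 1, 1)
    rows = []
    for _ in range(n_rows):
        row = {"Date": start + timedelta(days=rng.randrange(365))}
        for g in GROUP_VARS:
            row[g] = f"{g}_{rng.randrange(5)}"
        for v in NUMERIC_VARS:
            row[v] = round(rng.uniform(0, 100), 2)
        rows.append(row)
    return rows


def experiment_grid():
    """Enumerate (group-by columns, value columns) pairs.

    Assumption: each pass starts with Date only and adds one group
    variable at a time, and the pass repeats with one more numeric
    variable each time.
    """
    grid = []
    for n_num in range(1, len(NUMERIC_VARS) + 1):
        for n_grp in range(len(GROUP_VARS) + 1):
            grid.append((["Date"] + GROUP_VARS[:n_grp], NUMERIC_VARS[:n_num]))
    return grid
```

Under these assumptions the grid holds 20 combinations per dataset size, starting with a Date-only grouping on a single numeric column.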

Current Frameworks Tested

R data.table
Python Polars
R DuckDB
Python Pandas
R Collapse

Current Operations

Group-By with Sum Aggregation
Melt
Cast
Lags
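For readers unfamiliar with these four operations, the sketch below shows what each one computes, in plain Python on a tiny made-up table. It only illustrates the semantics. The benchmark scripts use each framework's own functions, and the column names here are hypothetical.

```python
from collections import defaultdict

# Tiny made-up table; column names are illustrative only.
rows = [
    {"Date": "2024-01-01", "Brand": "A", "Volume": 10, "Revenue": 100},
    {"Date": "2024-01-01", "Brand": "B", "Volume": 5, "Revenue": 40},
    {"Date": "2024-01-02", "Brand": "A", "Volume": 7, "Revenue": 70},
    {"Date": "2024-01-02", "Brand": "B", "Volume": 3, "Revenue": 30},
]


def agg_sum(rows, by, values):
    """Group-by with sum: one output row per distinct key in `by`."""
    totals = defaultdict(lambda: dict.fromkeys(values, 0))
    for r in rows:
        key = tuple(r[c] for c in by)
        for v in values:
            totals[key][v] += r[v]
    return [{**dict(zip(by, k)), **t} for k, t in totals.items()]


def melt(rows, id_vars, value_vars):
    """Wide to long: one output row per (input row, value column)."""
    return [
        {**{c: r[c] for c in id_vars}, "variable": v, "value": r[v]}
        for r in rows
        for v in value_vars
    ]


def cast(rows, index, column, value):
    """Long to wide: distinct values of `column` become columns (summed)."""
    wide = defaultdict(dict)
    for r in rows:
        key = tuple(r[c] for c in index)
        wide[key][r[column]] = wide[key].get(r[column], 0) + r[value]
    return [{**dict(zip(index, k)), **cols} for k, cols in wide.items()]


def add_lag(rows, by, order, value, n=1):
    """Add `value` shifted by n rows within each group, sorted by `order`."""
    groups = defaultdict(list)
    for r in sorted(rows, key=lambda r: r[order]):
        groups[tuple(r[c] for c in by)].append(r)
    out = []
    for grp in groups.values():
        for i, r in enumerate(grp):
            lagged = grp[i - n][value] if i >= n else None
            out.append({**r, f"{value}_lag{n}": lagged})
    return out
```

For example, `agg_sum(rows, ["Date"], ["Volume"])` sums Volume per date, and `add_lag(rows, ["Brand"], "Date", "Volume")` attaches each brand's previous-date Volume to every row.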

Replicate Benchmarks

Aggregation Sum


Fork the repo and clone it to your local machine
Modify the Path variable at the top of each script to reflect your file location
Run FakeBevDataBuilds.R to generate the benchmarking datasets
Run AggSum_datatable.R
Run AggSum_DuckDB.R
Run AggSum_Polars.py
Run AggSum_Pandas.py
Run AggSum_collapse.R
Run CombineResults_AggSum.R
Done!
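Each framework script in the steps above times its operation and saves the results for the combine step. The following is a rough Python sketch of that pattern, not the repo's code. The helper names, the best-of-N timing rule, and the CSV layout (Framework, Experiment, Seconds) are all assumptions.

```python
import csv
import time
from pathlib import Path


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls.

    Best-of-N is an assumption; the repo's scripts may time differently.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def write_results(path, framework, timings):
    """Write one CSV row per experiment: framework, label, seconds."""
    with Path(path).open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Framework", "Experiment", "Seconds"])  # hypothetical layout
        for label, seconds in timings.items():
            writer.writerow([framework, label, f"{seconds:.6f}"])
```

A script would call `time_call` once per experiment, collect the timings in a dict keyed by experiment label, and pass that dict to `write_results`.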

Melt Data


Fork the repo and clone it to your local machine
Modify the Path variable at the top of each script to reflect your file location
Run FakeBevDataBuilds.R to generate the benchmarking datasets
Run Melt_datatable.R
Run Melt_DuckDB.R
Run Melt_Polars.py
Run Melt_Pandas.py
Run Melt_collapse.R
Run CombineResults_Melt.R
Done!

Cast Data


Fork the repo and clone it to your local machine
Modify the Path variable at the top of each script to reflect your file location
Run FakeBevDataBuilds.R to generate the benchmarking datasets
Run Cast_datatable.R
Run Cast_DuckDB.R
Run Cast_Polars.py
Run Cast_Pandas.py
Run Cast_collapse.R
Run CombineResults_Cast.R
Done!

Lags


Fork the repo and clone it to your local machine
Modify the Path variable at the top of each script to reflect your file location
Run FakeBevDataBuilds.R to generate the benchmarking datasets
Run Lags_datatable.R
Run Lags_DuckDB.R
Run Lags_Polars.py
Run Lags_Pandas.py
Run Lags_collapse.R
Run CombineResults_Lags.R
Done!
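The final step of each procedure merges the per-framework results into one table for plotting. In the repo this is done by the CombineResults_*.R scripts. The Python sketch below only illustrates the idea, and it assumes a hypothetical result file layout with Framework, Experiment, and Seconds columns.

```python
import csv
from pathlib import Path


def combine_results(paths):
    """Concatenate per-framework result files into one list of rows."""
    combined = []
    for path in paths:
        with Path(path).open(newline="") as f:
            combined.extend(csv.DictReader(f))
    return combined


def fastest_per_experiment(rows):
    """Return {experiment: (framework, seconds)} for the lowest time."""
    best = {}
    for r in rows:
        seconds = float(r["Seconds"])
        if r["Experiment"] not in best or seconds < best[r["Experiment"]][1]:
            best[r["Experiment"]] = (r["Framework"], seconds)
    return best
```

With the rows combined, picking the fastest framework per experiment is a single pass over the table.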

Machine Specs

Windows OS
Memory: 234GB
CPU: 32 cores / 64 threads
AMD Ryzen CPU

Benchmark Results

In the plots below the x-axis "Experiments" shows four letters with numbers in front of them. This is what they mean:

Sum Aggregation

Top 3 Performers

Top 3 Performers

Melt

With DuckDB: Note - DuckDB timed out after a few successful runs

Without DuckDB

Cast

Lags

With DuckDB: Note - DuckDB timed out after a few successful runs

Without DuckDB
