CSV processing benchmarks for different open source technologies

datapythonista/bench_csv

Python benchmarks for processing a CSV file

To run the benchmarks, install Pixi, clone this repository, and run the following from inside the repository directory:

gzip -dk data.csv.gz
pixi install
pixi run bench

Results

The following table shows the time taken to process the CSV file, excluding the time to initialize the Python interpreter and load libraries:

| Description | File / Function | Time (seconds) |
| --- | --- | --- |
| Pure Python looping with csv module using int types | pure_python_int | 3.4547557830810547 |
| Pure Python looping with csv module using float types | pure_python_float | 3.8738009929656982 |
| pandas with C engine | pandas_c | 1.50089430809021 |
| pandas with Python engine | pandas_python | 8.328583478927612 |
| pandas with PyArrow engine and NumPy dtypes | pandas_pyarrow | 0.31276631355285645 |
| pandas with PyArrow engine and PyArrow dtypes | pandas_pyarrow_arrow | 0.29172492027282715 |
| Polars in lazy mode | polars_lazy | 0.10555672645568848 |
| Polars in streaming mode | polars_streaming | 0.11504125595092773 |
| Polars with SQL API | polars_sql | 0.09796714782714844 |
| DuckDB with SQL API | duckdb_sql | 0.8167853355407715 |
| DataFusion with SQL API | datafusion_sql | 0.20633697509765625 |
| NumPy with loadtxt function | numpy_loadtxt | 1.8354885578155518 |
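
To illustrate what "not counting the time to initialize the Python interpreter or load libraries" can mean in practice, here is a minimal sketch that times only the processing call, after all imports have completed. It is not the code used in this repository: the column-sum workload, the names `pure_python_sum` and `time_call`, and the choice of `time.perf_counter` are assumptions made for the example.

```python
import csv
import io
import time


def pure_python_sum(csv_file, column):
    """Hypothetical workload: sum one integer column with the csv module."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    total = 0
    for row in reader:
        total += int(row[column])
    return total


def time_call(func, *args):
    """Return (result, elapsed seconds), timing only the call itself.

    Imports happen at module load, before the timer starts, so
    interpreter start-up and library loading are excluded.
    """
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Small in-memory sample standing in for the real data file.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
result, elapsed = time_call(pure_python_sum, sample, 1)
print(f"sum={result} elapsed={elapsed:.6f}s")
```

To time a file on disk instead, open it with `newline=""` (as the `csv` module documentation recommends) and pass the file object in place of `sample`.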

The exact version of each library can be seen in the pixi.toml file. Note that DuckDB releases seem to reach conda-forge later than other package managers, so the benchmarks use DuckDB 0.9 even though 0.10 seems to be available elsewhere.
