[EPIC] A collection of items to improve speed of parquet metadata encoding #5853

alamb · 2024-06-07T12:22:58Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There have been several recent assertions that Parquet is not suitable for handling wide tables with 1000s of columns.

The rationale often goes something like this: wide tables have “large” metadata, which takes a “long time” to decode, often longer than reading the data itself.

This has led to several proposals for new file formats, such as BtrBlocks, Lance V2, and Nimble, and to recent discussions on the parquet mailing list.

However, there are several ways we can improve the performance of the existing thrift decoding in parquet-rs, and this ticket captures several ideas for how to do so.

alamb added the enhancement and parquet labels Jun 7, 2024

This was referenced Jun 7, 2024

Use custom thrift decoder to improve speed of parsing parquet metadata #5854

Open

Reduce Allocations When Reading Parquet Metadata #5775

Open

Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855

Open

kszlim mentioned this issue Jul 2, 2024

Loading wide parquet data with scan_parquet is orders of magnitude slower than long data pola-rs/polars#17259

Closed

