Limit Parquet Page Row Count By Default to reduce writer memory requirements with highly compressible columns #5797

Closed
tustvold opened this issue May 23, 2024 · 5 comments · Fixed by #5957
Assignees
Labels
enhancement Any new improvement worthy of an entry in the changelog parquet Changes to the parquet crate

Comments

@tustvold
Contributor

tustvold commented May 23, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There is a discussion on the mailing list about default settings for page sizing. One of the suggestions is that the page row count limit should be enabled by default; currently it is not. Row groups are limited to 1M rows by default, and the suggestion is that pages should be limited to 20,000 rows.

Creating this issue to track the proposal.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

@tustvold tustvold added the enhancement Any new improvement worthy of an entry in the changelog label May 23, 2024
@alamb
Contributor

alamb commented May 30, 2024

Maybe we can use some data from @XiangpengHao's analysis in #5770 as part of this decision.

@alamb
Contributor

alamb commented Jun 1, 2024

Specifically, I think @tustvold's proposal is to change the default value of this setting:

https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit

Currently it seems there is no row count limit (the default is usize::MAX):

data_page_row_count_limit: usize::MAX,

Note that @XiangpengHao hit this when trying to write large numbers of rows: #5828
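
To illustrate why an unlimited row count matters for highly compressible columns, here is a minimal sketch of a page flush rule. It is a hypothetical model, not the parquet crate's actual writer code: `PageLimits`, `page_row_counts`, and the `rows_per_encoded_byte` parameter are made up for illustration. The model closes a page as soon as either the encoded size or the row count reaches its limit.

```rust
/// Hypothetical model of per-page limits (not the parquet crate's types).
struct PageLimits {
    /// Maximum encoded bytes buffered before a page is flushed.
    data_page_size_limit: usize,
    /// Maximum rows buffered before a page is flushed.
    data_page_row_count_limit: usize,
}

/// Returns the number of rows that end up in each page.
///
/// `rows_per_encoded_byte` is an assumed, simplified compression model:
/// every encoded byte covers that many rows, so a highly compressible
/// column has a large value here.
fn page_row_counts(
    limits: &PageLimits,
    total_rows: usize,
    rows_per_encoded_byte: usize,
) -> Vec<usize> {
    assert!(rows_per_encoded_byte > 0, "rows_per_encoded_byte must be non-zero");
    let mut pages = Vec::new();
    let mut rows_in_page = 0usize;
    for _ in 0..total_rows {
        rows_in_page += 1;
        // Encoded size of the rows buffered so far under the simplified model.
        let encoded_bytes = rows_in_page / rows_per_encoded_byte;
        if encoded_bytes >= limits.data_page_size_limit
            || rows_in_page >= limits.data_page_row_count_limit
        {
            pages.push(rows_in_page);
            rows_in_page = 0;
        }
    }
    // Flush whatever is left as a final, partially filled page.
    if rows_in_page > 0 {
        pages.push(rows_in_page);
    }
    pages
}
```

Under these assumptions, 1,000,000 rows that encode to one byte per 1,000 rows never reach an (assumed) 1 MiB byte limit, so with a row limit of usize::MAX the model buffers everything into a single page, while a row limit of 20,000 yields 50 pages of 20,000 rows each. In the parquet crate itself, the limit can be overridden through the WriterProperties builder (set_data_page_row_count_limit).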

@alamb
Contributor

alamb commented Jun 3, 2024

I think setting the default page row count limit to 20k is a good idea. The current default of "no limit" can result in substantial memory usage in the writer, which people seem to mistake for a limitation of the parquet format itself.

Another potential default value per page would be 64K (the size used in BtrBlocks)
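
For a rough sense of what the two candidate limits mean per row group, here is a small arithmetic sketch. `pages_per_row_group` is a hypothetical helper, and the concrete numbers are assumptions: "1M rows" is read as 1,048,576 and "64K" as 65,536. The count is a lower bound on pages per column chunk, since a byte size limit could split pages further.

```rust
/// Minimum number of pages needed to hold `rows` rows when each page may
/// hold at most `row_limit` rows (hypothetical helper for illustration).
fn pages_per_row_group(rows: usize, row_limit: usize) -> usize {
    assert!(row_limit > 0, "row_limit must be non-zero");
    // Ceiling division written to avoid overflow when row_limit is usize::MAX.
    rows / row_limit + usize::from(rows % row_limit != 0)
}
```

With these assumed values, a full row group needs at least 53 pages per column at a 20,000-row limit, at least 16 pages at a 65,536-row limit, and can fit in a single page when the limit is usize::MAX.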

@alamb
Contributor

alamb commented Jun 24, 2024

We just found another example in InfluxDB where setting the data page row count limit to 64k dramatically lowered the amount of memory used by the parquet writer.

@alamb alamb changed the title from "Limit Page Row Count By Default" to "Limit Parquet Page Row Count By Default to reduce writer memory requirements with highly compressible columns" Jun 24, 2024
@alamb alamb self-assigned this Jun 24, 2024
@alamb alamb added the parquet Changes to the parquet crate label Jul 2, 2024
@alamb
Contributor

alamb commented Jul 2, 2024

label_issue.py automatically added labels {'parquet'} from #5957
