Limit Parquet Page Row Count By Default to reduce writer memory requirements with highly compressible columns #5797
Comments
Maybe we can use some data from @XiangpengHao's analysis in #5770 as part of this decision.
Specifically, I think @tustvold's proposal is to change the default value of this setting. Currently there seems to be no row count limit (the limit is usize::MAX): arrow-rs/parquet/src/file/properties.rs, line 356 in 5a24119
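For readers who want to opt in before any default changes, a minimal sketch of setting the limit explicitly might look like the following. It assumes the `parquet` crate is a dependency and that your version exposes `set_data_page_row_count_limit` on the `WriterProperties` builder; the value 20,000 is just the number suggested in this thread, not a recommendation from the maintainers.

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Cap each data page at 20,000 rows instead of relying on the default.
    // 20_000 is the value floated in this issue; tune it for your workload.
    let props = WriterProperties::builder()
        .set_data_page_row_count_limit(20_000)
        .build();

    // Read the effective limit back from the built properties.
    println!(
        "data page row count limit: {}",
        props.data_page_row_count_limit()
    );
}
```

The resulting `props` would then be passed to whichever writer you construct.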
Note that @XiangpengHao hit this when trying to write large numbers of rows: #5828
I think setting the default page row count limit to 20k is a good idea. The current default of "no limit" can result in substantial memory usage in the writer, which people seem to mistake for a limitation of the Parquet format itself. Another potential default value per page would be 64K (the size used in BtrBlocks).
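To make the memory argument concrete, here is a toy model, not the arrow-rs implementation. It assumes a page is flushed when either its estimated encoded size reaches a byte limit or its row count reaches a row limit, and that the writer holds a page's rows until that flush. The function name, the 1 MiB byte limit, and the 0.01 bytes-per-row figure are all illustrative assumptions.

```rust
/// Toy model of page flushing (hypothetical, not the arrow-rs writer).
/// A page is flushed when its estimated encoded size reaches `byte_limit`
/// or its row count reaches `row_limit`, whichever comes first.
/// Returns the row count of the largest page produced.
fn max_rows_per_page(
    total_rows: usize,
    encoded_bytes_per_row: f64,
    byte_limit: usize,
    row_limit: usize,
) -> usize {
    let mut rows_in_page = 0usize;
    let mut bytes_in_page = 0.0f64;
    let mut max_rows = 0usize;
    for _ in 0..total_rows {
        rows_in_page += 1;
        bytes_in_page += encoded_bytes_per_row;
        if bytes_in_page >= byte_limit as f64 || rows_in_page >= row_limit {
            // Flush: record the page size and start a new page.
            max_rows = max_rows.max(rows_in_page);
            rows_in_page = 0;
            bytes_in_page = 0.0;
        }
    }
    // Account for a final, partially filled page.
    max_rows.max(rows_in_page)
}

fn main() {
    let byte_limit = 1024 * 1024; // assumed 1 MiB page size limit
    let total_rows = 1_000_000; // one row group at the 1M-row default

    // Assumed: a highly compressible column costs ~0.01 encoded bytes per row.
    let unlimited = max_rows_per_page(total_rows, 0.01, byte_limit, usize::MAX);
    let limited = max_rows_per_page(total_rows, 0.01, byte_limit, 20_000);

    println!("no row limit:  largest page holds {unlimited} rows");
    println!("20k row limit: largest page holds {limited} rows");
}
```

In this model the byte limit alone never triggers for the compressible column, so the whole row group ends up in one page, while the row limit keeps every page bounded regardless of how well the data compresses.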
We just found another example in InfluxDB where setting the data page limit to 64k dramatically lowered the amount of memory used by the parquet writer.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There is a discussion on the mailing list about default settings for page sizing, and one of the suggestions is that the page row count limit should be enabled by default; currently it is not. Row groups are limited to 1M rows by default, and there is some suggestion that pages should be limited to 20,000 rows.
Creating this issue to track the discussion.
Describe the solution you'd like
Describe alternatives you've considered
Additional context