[Data] Use sampled fragments to estimate Parquet reader batch size #45749

bveeramani · 2024-06-05T09:56:07Z

Why are these changes needed?

ParquetDatasource uses fetched metadata to determine the Parquet reader batch size. If the metadata provider doesn't provide metadata, ParquetDatasource uses a default value of 10,000 rows. While this value might be reasonable for tabular data, the default value can lead to poor performance or even errors if each row is large. To avoid this issue, this PR updates the implementation to use the sampled fragments to determine the batch size.
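For illustration, here is a minimal sketch of the idea, not the code in this PR: derive a row batch size from the bytes per row observed in sampled fragments, and fall back to the 10,000-row default only when no sample has size information. The function name `estimate_batch_size_rows`, the 128 MiB byte budget, and the choice to cap the result at the default are all assumptions made for the example.

```python
from typing import List, Optional

# Default mentioned in the PR description.
DEFAULT_BATCH_SIZE_ROWS = 10_000
# Hypothetical byte budget per batch; not a value taken from Ray.
TARGET_BATCH_BYTES = 128 * 1024 * 1024


def estimate_batch_size_rows(
    sampled_bytes_per_row: List[Optional[float]],
    target_batch_bytes: int = TARGET_BATCH_BYTES,
    default_rows: int = DEFAULT_BATCH_SIZE_ROWS,
) -> int:
    """Pick a reader batch size (in rows) from sampled row sizes."""
    # Keep only samples that actually carry a usable size.
    sizes = [s for s in sampled_bytes_per_row if s is not None and s > 0]
    if not sizes:
        # No usable samples: fall back to the fixed default.
        return default_rows
    avg_bytes_per_row = sum(sizes) / len(sizes)
    rows_in_budget = int(target_batch_bytes // avg_bytes_per_row)
    # Never exceed the default, and never go below one row.
    return max(1, min(default_rows, rows_in_budget))
```

With 1 MiB rows the sketch returns 128 rows per batch rather than 10,000, which is the kind of reduction the description is after for large rows.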

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>

raulchen · 2024-06-05T22:12:19Z

python/ray/data/datasource/parquet_datasource.py

+ sample_info.actual_bytes_per_row is None
+ or sample_info.estimated_bytes_per_row is None
+ ):
+ return PARQUET_ENCODING_RATIO_ESTIMATE_LOWER_BOUND


If some samples are None while others are not, should we drop the None values? I.e., only fall back to this lower bound if all samples are None.

I matched the behavior of the existing implementation, but that makes sense to me.

@c21 wdyt (I think you wrote the original implementation)? I can address this in a follow-up PR

yes make sense to me.
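A rough sketch of the follow-up suggested above, under stated assumptions: drop samples whose values are None and return the lower bound only when nothing usable remains. `SampleInfo` here is a hypothetical stand-in that has only the two fields visible in the diff, the constant's value of 1.0 is a placeholder, and the ratio definition (actual over estimated bytes per row) and the function name `estimate_encoding_ratio` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Placeholder value; the real constant lives in parquet_datasource.py.
PARQUET_ENCODING_RATIO_ESTIMATE_LOWER_BOUND = 1.0


@dataclass
class SampleInfo:
    # Hypothetical stand-in with the two fields shown in the diff.
    actual_bytes_per_row: Optional[float]
    estimated_bytes_per_row: Optional[float]


def estimate_encoding_ratio(sample_infos: List[SampleInfo]) -> float:
    """Average the per-sample ratios, ignoring incomplete samples."""
    ratios = [
        s.actual_bytes_per_row / s.estimated_bytes_per_row
        for s in sample_infos
        if s.actual_bytes_per_row is not None
        and s.estimated_bytes_per_row is not None
        and s.estimated_bytes_per_row > 0  # avoid division by zero
    ]
    if not ratios:
        # Every sample was unusable: fall back to the lower bound.
        return PARQUET_ENCODING_RATIO_ESTIMATE_LOWER_BOUND
    mean_ratio = sum(ratios) / len(ratios)
    return max(mean_ratio, PARQUET_ENCODING_RATIO_ESTIMATE_LOWER_BOUND)
```

The difference from the behavior discussed in the thread is that one incomplete sample no longer forces the fallback for the whole estimate.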

c21

late LGTM

Initial commit

9fba119

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani requested review from ericl, scv119, c21, amogkam, scottjlee, raulchen, stephanie-wang and omatthew98 as code owners June 5, 2024 09:56

bveeramani assigned raulchen Jun 5, 2024

bveeramani enabled auto-merge (squash) June 5, 2024 10:06

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Jun 5, 2024

bveeramani assigned c21 Jun 5, 2024

Merge branch 'master' into parquet-estimation

1b0d74f

Signed-off-by: Balaji Veeramani <[email protected]>

github-actions bot disabled auto-merge June 5, 2024 18:45

bveeramani enabled auto-merge (squash) June 5, 2024 18:57

raulchen approved these changes Jun 5, 2024

bveeramani merged commit 6031b4a into ray-project:master Jun 5, 2024
7 checks passed

c21 reviewed Jun 5, 2024

bveeramani deleted the parquet-estimation branch June 5, 2024 22:17
