Expose remaining parquet config options into ConfigOptions (try 2) #4427

Merged
merged 8 commits into from
Dec 1, 2022

Conversation

@alamb alamb commented Nov 29, 2022

this is a reworked version of #3885

Which issue does this PR close?

Closes #3821

This also helps towards #3887 and #4349

Rationale for this change

  1. Making it easier for people to see which parquet config options are available makes it more likely they will be used
  2. The more mechanisms there are for supplying configuration, the more likely it is to confuse people

It turns out that options for reading parquet files could be set (and possibly overridden) by no fewer than three different structures! This is confusing, to say the least.

What changes are included in this PR?

  1. Move metadata_size_hint, enable_pruning, and merge_schema_metadata to new config options
  2. Make the precedence of the parquet options passed down to the ParquetExec clear

Are there any user-facing changes?

  1. Parquet reader settings are visible session-wide
  2. Overrides specified per table or per ParquetExec are handled consistently as overrides to the session-wide defaults

Previously, depending on which API was used to create, register, or run a parquet scan, the settings either tracked later changes to the session config or were a snapshot taken when the reader was registered.

@github-actions github-actions bot added the core Core DataFusion crate label Nov 29, 2022
@@ -396,7 +396,8 @@ async fn get_table(
 }
 "parquet" => {
 let path = format!("{}/{}", path, table);
-let format = ParquetFormat::default().with_enable_pruning(true);
+let format = ParquetFormat::new(ctx.config_options())

As the parquet format now reads defaults from ConfigOptions they need to be passed to the constructor

One read to fetch the 8-byte parquet footer and \
another to fetch the metadata length encoded in the footer.",
DataType::UInt64,
ScalarValue::UInt64(None),

As mentioned by @thinkharderdev on #3885 (comment), we should probably change this default to something reasonable (like 64K), but I would rather do that in a follow-on PR

Filed #4459 to track
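
For readers unfamiliar with why a size hint saves a read: an unencrypted parquet file ends with an 8-byte footer holding a 4-byte little-endian metadata length followed by the magic bytes `PAR1`. The sketch below is not the DataFusion reader; `decode_metadata_len` and `missing_metadata_bytes` are hypothetical helper names, and the policy of "read the last N bytes, then fetch whatever is missing" is an assumption used only to illustrate the idea.

```rust
/// Magic bytes that end every unencrypted parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";
/// Footer = 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;

/// Decode the metadata length stored in the 8-byte footer.
fn decode_metadata_len(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    if footer[4..] != PARQUET_MAGIC[..] {
        return Err("footer does not end with PAR1".to_string());
    }
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    Ok(len as usize)
}

/// Given the bytes already fetched from the end of the file, return how many
/// metadata bytes are still missing. Zero means one read was enough.
fn missing_metadata_bytes(tail: &[u8]) -> Result<usize, String> {
    if tail.len() < FOOTER_SIZE {
        return Err("tail is shorter than the parquet footer".to_string());
    }
    let mut footer = [0u8; FOOTER_SIZE];
    footer.copy_from_slice(&tail[tail.len() - FOOTER_SIZE..]);
    let needed = decode_metadata_len(&footer)? + FOOTER_SIZE;
    Ok(needed.saturating_sub(tail.len()))
}

fn main() {
    // Fake file: 100 bytes standing in for metadata, then the footer.
    let mut file = vec![0u8; 100];
    file.extend_from_slice(&100u32.to_le_bytes());
    file.extend_from_slice(PARQUET_MAGIC);

    // No hint (footer only), a hint that is too small, and a large enough hint.
    for tail_len in [8usize, 64, 108] {
        let tail = &file[file.len() - tail_len..];
        println!("tail of {} bytes: {:?}", tail_len, missing_metadata_bytes(tail));
    }
}
```

With no hint the first read covers only the footer, so a second read is always required; a hint at least as large as the metadata plus footer avoids it.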

// Session level configuration
config_options: Arc<RwLock<ConfigOptions>>,
// local overrides
enable_pruning: Option<bool>,

By changing these to Option I think it is now clearer that if they are left at default, the (documented) value from ConfigOptions is used
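
A minimal sketch of the pattern described here, under simplifying assumptions: `SessionDefaults` and `ParquetFormatSketch` are hypothetical stand-ins rather than the real `ConfigOptions` and `ParquetFormat` types, the builder signature is an assumption, and `std::sync::RwLock` is used only to keep the example dependency-free.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the session-level configuration.
#[derive(Debug)]
struct SessionDefaults {
    enable_pruning: bool,
}

/// Hypothetical stand-in for a format that holds shared session config
/// plus an optional local override.
struct ParquetFormatSketch {
    config_options: Arc<RwLock<SessionDefaults>>,
    enable_pruning: Option<bool>,
}

impl ParquetFormatSketch {
    fn new(config_options: Arc<RwLock<SessionDefaults>>) -> Self {
        Self { config_options, enable_pruning: None }
    }

    /// Set a local override; the shared config is not touched.
    fn with_enable_pruning(mut self, enable: bool) -> Self {
        self.enable_pruning = Some(enable);
        self
    }

    /// Local override if present, otherwise the current session value.
    fn enable_pruning(&self) -> bool {
        self.enable_pruning
            .unwrap_or_else(|| self.config_options.read().unwrap().enable_pruning)
    }
}

fn main() {
    let session = Arc::new(RwLock::new(SessionDefaults { enable_pruning: true }));
    let inherits = ParquetFormatSketch::new(Arc::clone(&session));
    let overrides = ParquetFormatSketch::new(Arc::clone(&session)).with_enable_pruning(false);

    println!("inherits: {}", inherits.enable_pruning());
    println!("overrides: {}", overrides.enable_pruning());

    // A later change to the session config is seen by the format without
    // an override, because the value is looked up at read time.
    session.write().unwrap().enable_pruning = false;
    println!("inherits after session change: {}", inherits.enable_pruning());
}
```

Because the fallback is resolved when the value is read rather than when the format is built, there is no snapshot to go stale.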

-let listing_options = options
-.parquet_pruning(parquet_pruning)
-.to_listing_options(target_partitions);
+let listing_options = options.to_listing_options(&self.state.read().config);

In some ways I think this is cleaner, as the options now make it all the way to the parquet reader, whereas before there were places like this that copied some (but not all) of the settings around

@@ -1183,7 +1179,6 @@ impl Default for SessionConfig {
 repartition_joins: true,
 repartition_aggregations: true,
 repartition_windows: true,
-parquet_pruning: true,

yet another copy of this setting!

/// metadata. Defaults to true.
// TODO move this into ConfigOptions
pub skip_metadata: bool,
/// Should the parquet reader use the predicate to prune row groups?

Again, this makes it clear that these settings are overrides of the defaults in the session configuration

@@ -72,6 +72,16 @@ use super::get_output_ordering;
/// Execution plan for scanning one or more Parquet partitions
#[derive(Debug, Clone)]
pub struct ParquetExec {
/// Override for `Self::with_pushdown_filters`. If None, uses

I also added this back in so that the overrides set directly on a ParquetExec do not affect the global configuration options

@@ -707,8 +707,11 @@ async fn show_all() {
"| datafusion.execution.coalesce_batches | true |",
"| datafusion.execution.coalesce_target_batch_size | 4096 |",
"| datafusion.execution.parquet.enable_page_index | false |",
"| datafusion.execution.parquet.metadata_size_hint | NULL |",

🎉 one can now see much more explicitly both 1) what the parquet options are and 2) what their default values are
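
To illustrate why a single registry makes options and defaults discoverable, here is a small sketch that renders rows in the same `| key | value |` shape as the test output above. `Value`, `ConfigEntry`, and `show_all_rows` are hypothetical names, not the DataFusion types; the two keys and their defaults are the ones shown in this diff.

```rust
/// Hypothetical, simplified value type; an unset optional renders as NULL.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Boolean(bool),
    UInt64(Option<u64>),
}

impl Value {
    fn render(&self) -> String {
        match self {
            Value::Boolean(b) => b.to_string(),
            Value::UInt64(Some(n)) => n.to_string(),
            Value::UInt64(None) => "NULL".to_string(),
        }
    }
}

/// Hypothetical registry entry: one key with its default value.
struct ConfigEntry {
    key: &'static str,
    default: Value,
}

/// Render every entry as a table row, sorted by key.
fn show_all_rows(entries: &[ConfigEntry]) -> Vec<String> {
    let mut rows: Vec<String> = entries
        .iter()
        .map(|e| format!("| {} | {} |", e.key, e.default.render()))
        .collect();
    rows.sort();
    rows
}

fn main() {
    let entries = [
        ConfigEntry {
            key: "datafusion.execution.parquet.metadata_size_hint",
            default: Value::UInt64(None),
        },
        ConfigEntry {
            key: "datafusion.execution.parquet.enable_page_index",
            default: Value::Boolean(false),
        },
    ];
    for row in show_all_rows(&entries) {
        println!("{}", row);
    }
}
```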

@alamb alamb added the api change Changes the API exposed to users of the crate label Nov 29, 2022

@andygrove andygrove left a comment

LGTM! Thanks @alamb


@liukun4515 liukun4515 left a comment

Thanks @alamb

@alamb alamb merged commit fb8eeb2 into apache:master Dec 1, 2022
ursabot commented Dec 1, 2022

Benchmark runs are scheduled for baseline = 09aea09 and contender = fb8eeb2. fb8eeb2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb alamb deleted the alamb/consolidate_parquet_take2 branch December 1, 2022 16:42
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow configuring parquet filter pushdown dynamically
5 participants