Don't share ConfigOptions (#3886) #4712

Merged
merged 6 commits into from
Dec 23, 2022

@tustvold tustvold commented Dec 22, 2022

Which issue does this PR close?

Closes #3886
Closes #3909
Relates to #4349
Relates to #4617

Rationale for this change

Having shared mutable state makes reasoning about mutation difficult (#4617), and the locking is verbose and potentially error prone (#3886).

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core datafusion crate label Dec 22, 2022
@@ -302,6 +287,8 @@ impl ExecutionPlan for ParquetExec {
})
})?;

let config_options = ctx.session_config().config_options();
Contributor Author

We need to fetch this at execution time so that datafusion-proto can still deserialize ParquetExec without a SessionState. Longer term, as we strip out the overrides, this will make more sense anyway, so 🤷

Contributor

I think it is reasonable to look at the session configuration while executing 🤷

It certainly seems better than the current state of master where the config options (attached to session state) are read via interior mutability
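
A minimal sketch of the execution-time lookup discussed in this thread. The types below are simplified, hypothetical stand-ins (not the real DataFusion `TaskContext`, `SessionConfig`, or `ParquetExec`), and the `pushdown_filters` field is an assumed example option. The point is only that the plan node stores no configuration, so it can be constructed or deserialized without a session, and reads the options when `execute` is called.

```rust
/// Hypothetical stand-in for the session-wide options.
#[derive(Debug, Clone, Default)]
pub struct ConfigOptions {
    pub pushdown_filters: bool,
}

/// Hypothetical stand-in for the session configuration; it owns its options.
#[derive(Debug, Clone, Default)]
pub struct SessionConfig {
    config_options: ConfigOptions,
}

impl SessionConfig {
    pub fn config_options(&self) -> &ConfigOptions {
        &self.config_options
    }

    pub fn with_pushdown_filters(mut self, enabled: bool) -> Self {
        self.config_options.pushdown_filters = enabled;
        self
    }
}

/// Hypothetical stand-in for the per-execution context.
pub struct TaskContext {
    session_config: SessionConfig,
}

impl TaskContext {
    pub fn new(session_config: SessionConfig) -> Self {
        Self { session_config }
    }

    pub fn session_config(&self) -> &SessionConfig {
        &self.session_config
    }
}

/// A plan node that holds no configuration of its own.
#[derive(Debug, Default)]
pub struct ScanExec {
    pub path: String,
}

impl ScanExec {
    /// Options are looked up here, at execution time, not at construction.
    pub fn execute(&self, ctx: &TaskContext) -> String {
        let config_options = ctx.session_config().config_options();
        format!(
            "scan {} pushdown_filters={}",
            self.path, config_options.pushdown_filters
        )
    }
}
```

With this shape the same `ScanExec` value can be executed under different sessions and will pick up each session's options.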

@@ -90,7 +90,8 @@ message CsvFormat {
}
Contributor Author

I recommend viewing this with whitespace disabled

@@ -353,12 +353,9 @@ impl AsLogicalPlan for LogicalPlanNode {
self
))
})? {
&FileFormatType::Parquet(protobuf::ParquetFormat {
Contributor Author

The plumbing for this override was actually incorrect: it would convert false -> None. The other overrides aren't present, and we plan to remove this override mechanism as part of #4349, so I just opted to remove it.

Contributor

I agree serializing the same config options multiple times (once in the main session context and then once again as part of the file format) is undesirable for many reasons
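
To illustrate the `false -> None` problem mentioned above, here is one way such a round trip can lose information. This is a hypothetical sketch, not the removed datafusion-proto code: the function names `encode` and `decode_lossy` and the choice of a plain `bool` wire field are assumptions made for illustration.

```rust
/// Hypothetical encoder: an optional override is flattened to a plain bool,
/// so "not set" and "explicitly false" both become `false` on the wire.
pub fn encode(override_value: Option<bool>) -> bool {
    override_value.unwrap_or(false)
}

/// Hypothetical decoder: `false` is treated as "no override".
/// An explicit `Some(false)` therefore comes back as `None`.
pub fn decode_lossy(wire: bool) -> Option<bool> {
    if wire {
        Some(true)
    } else {
        None
    }
}
```

Under these assumptions, an explicit "disable" override silently reverts to whatever the session default is after a serialize/deserialize round trip, which is one reason carrying the setting only in the session configuration is simpler.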

@tustvold tustvold force-pushed the no-shared-config-options branch 2 times, most recently from b650b86 to 3327d11 Compare December 22, 2022 12:31
@tustvold tustvold added the api change Changes the API exposed to users of the crate label Dec 22, 2022
impl ParquetScanOptions {
/// Returns a [`SessionConfig`] with the given options
pub fn config(&self) -> SessionConfig {
SessionConfig::new()
Contributor Author

I debated simply removing ParquetScanOptions in favour of SessionConfig, but figured this PR was large enough as it was.

Contributor

Thank you. I agree this PR is already large. I also think the ParquetScanOptions predated the config options.

I think removing the ParquetScanOptions as a follow on PR is a good idea 👍

@alamb alamb left a comment

👨‍🍳 👌

This looks really good @tustvold -- thank you for helping sort out the configuration situation

Pin<Box<dyn Stream<Item = Result<ActionType, Status>> + Send + Sync + 'static>>;
type DoExchangeStream =
Pin<Box<dyn Stream<Item = Result<FlightData, Status>> + Send + Sync + 'static>>;
type HandshakeStream = BoxStream<'static, Result<HandshakeResponse, Status>>;
Contributor

I forgot this was here -- I need to give this example some love after my work to make arrow-flight easier to use

@@ -85,13 +84,9 @@ impl ParquetFormat {
}

/// Return true if pruning is enabled
- pub fn enable_pruning(&self) -> bool {
+ pub fn enable_pruning(&self, config_options: &ConfigOptions) -> bool {
Contributor

👍
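
The hunk above only shows the new signature. A plausible reading, sketched here with hypothetical stand-in types, is that the format keeps an optional per-format override and falls back to the options the caller passes in. The fallback body, the `parquet_enable_pruning` field, and the `with_enable_pruning` builder are assumptions for illustration, not the merged implementation.

```rust
/// Hypothetical stand-in for the session-wide options.
pub struct ConfigOptions {
    pub parquet_enable_pruning: bool,
}

/// Hypothetical stand-in for the file format: it holds only an optional
/// override and no handle to any shared configuration.
#[derive(Default)]
pub struct ParquetFormat {
    enable_pruning: Option<bool>,
}

impl ParquetFormat {
    pub fn with_enable_pruning(mut self, enable: Option<bool>) -> Self {
        self.enable_pruning = enable;
        self
    }

    /// Return true if pruning is enabled: the override wins if set,
    /// otherwise the caller-supplied options decide.
    pub fn enable_pruning(&self, config_options: &ConfigOptions) -> bool {
        self.enable_pruning
            .unwrap_or(config_options.parquet_enable_pruning)
    }
}
```

Passing `&ConfigOptions` explicitly makes the dependency visible at every call site, at the cost of threading the options through callers.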

@@ -1173,7 +1166,7 @@ pub struct SessionConfig {
/// due to `resolve_table_ref` which passes back references)
default_schema: String,
/// Configuration options
- pub config_options: Arc<RwLock<ConfigOptions>>,
+ config_options: ConfigOptions,
Contributor

❤️
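
A small sketch of why the field change above matters, using simplified hypothetical types (`SharedConfig`, `OwnedConfig`, and the `batch_size` option are illustrative, not DataFusion's API). With `Arc<RwLock<...>>`, cloning the config shares one mutable value, so a write through any clone is visible everywhere. With an owned field, each clone is independent and mutation requires `&mut`.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the options struct.
#[derive(Debug, Clone, PartialEq)]
pub struct ConfigOptions {
    pub batch_size: usize,
}

/// Before: every clone points at the same lock-protected options.
#[derive(Clone)]
pub struct SharedConfig {
    pub config_options: Arc<RwLock<ConfigOptions>>,
}

/// After: each clone owns its own copy of the options.
#[derive(Clone)]
pub struct OwnedConfig {
    config_options: ConfigOptions,
}

impl OwnedConfig {
    pub fn new(batch_size: usize) -> Self {
        Self { config_options: ConfigOptions { batch_size } }
    }

    pub fn config_options(&self) -> &ConfigOptions {
        &self.config_options
    }

    pub fn config_options_mut(&mut self) -> &mut ConfigOptions {
        &mut self.config_options
    }
}

/// Returns (value seen through the original shared config,
///          value in the original owned config,
///          value in the mutated owned clone).
pub fn demo() -> (usize, usize, usize) {
    let shared = SharedConfig {
        config_options: Arc::new(RwLock::new(ConfigOptions { batch_size: 8192 })),
    };
    let shared_copy = shared.clone();
    // Writing through the clone needs a lock and changes the original too.
    shared_copy.config_options.write().unwrap().batch_size = 1;
    let seen_through_original = shared.config_options.read().unwrap().batch_size;

    let owned = OwnedConfig::new(8192);
    let mut owned_copy = owned.clone();
    // Mutating the clone needs `&mut` and leaves the original untouched.
    owned_copy.config_options_mut().batch_size = 1;

    (
        seen_through_original,
        owned.config_options().batch_size,
        owned_copy.config_options().batch_size,
    )
}
```

In the owned version the borrow checker, rather than a runtime lock, decides who may mutate the options.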

CurrentDate=70;
CurrentTime=71;
Uuid=72;
Abs = 0;
Contributor

whitespace!

@@ -1132,6 +1133,9 @@ message ScanLimit {
}

message FileScanExecConf {
// Was repeated ConfigOption options = 10;
reserved 10;
Contributor

👍

@tustvold tustvold merged commit 07f4980 into apache:master Dec 23, 2022
ursabot commented Dec 23, 2022

Benchmark runs are scheduled for baseline = afb1ae2 and contender = 07f4980. 07f4980 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: api change, core
Linked issue: Make ConfigOptions easier to work with
3 participants