Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify most of SessionConfig settings into ConfigOptions #4492

Merged
merged 4 commits into from
Dec 5, 2022

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 3, 2022

Which issue does this PR close?

Towards #3887

Also re #4349

Rationale for this change

see #3887

TLDR is

  1. A single way to do configuration options in ConfigOptions makes the code cleaner
  2. Using ConfigOptions makes them easier to find (as they are documented)
  3. Using ConfigOptions makes these settings dynamically configurable (via environment variable or SET <variable> syntax

What changes are included in this PR?

Move all SessionConfig settings other than the schema / catalog names into the ConfigOptions structure,

Are these changes tested?

Are there any user-facing changes?

Some fields that were pub not must be set via a with_ method or accessed via an accessor. The number of locations in the DataFusion codebase was quite low, which was a nice surprise to me

The mix of HashMap /
rust struct style / builder styles is somewhat of a bummer. I bet some fancy macro magic could unify the two but I don't know the right incantations

@github-actions github-actions bot added the core Core DataFusion crate label Dec 3, 2022
@alamb
Copy link
Contributor Author

alamb commented Dec 3, 2022

I need to deal with the fact that target_partitions is not a constant (it is a function of the number of cpu cores)

@alamb
Copy link
Contributor Author

alamb commented Dec 3, 2022

FYI @xudong963 and @mvanschellebeeck -- I am loving the sqllogictest framework

This PR has a good first test -- the output of SHOW needs to be normalized (specifically, target_partitions isn't a constant value). I have some ideas and plan to follow the PRAGMA syntax used by duckdb for passing along options
https://github.com/duckdb/duckdb/blob/85843041ec06fa2d1cafaace993076ef778cdbfe/test/sql/types/nested/list/test_nested_list.test#L5-L10C17

I am working on some code, but I don't expect it will be ready until tomorrow.

Actually, I found a better way: cb0f651

@alamb alamb mentioned this pull request Dec 3, 2022
28 tasks
@@ -410,7 +410,7 @@ async fn get_table(
let options = ListingOptions::new(format)
.with_file_extension(extension)
.with_target_partitions(target_partitions)
.with_collect_stat(ctx.config.collect_statistics);
.with_collect_stat(ctx.config.collect_statistics());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this illustrates the major API change for DataFusion users

default_schema: String,
/// Whether the default catalog and schema should be created automatically
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change in this PR is to remove these fields and move them to ConfigOptions

@@ -1328,27 +1381,27 @@ impl SessionConfig {
}
map.insert(
TARGET_PARTITIONS.to_owned(),
format!("{}", self.target_partitions),
format!("{}", self.target_partitions()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole function should likely be deprecated and instead we should use ConfigOptions directly. However, for this PR I opted to leave it alone (and thus backwards compatible for Ballista) cc @mingmwang

@xudong963
Copy link
Member

I am working on some code, but I don't expect it will be ready until tomorrow.

Looking forward to it!

# to a known value that is unlikely to be
# the real number of cores on a system
statement ok
SET datafusion.execution.target_partitions=7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clever 👍

@alamb alamb marked this pull request as ready for review December 4, 2022 15:58
@alamb
Copy link
Contributor Author

alamb commented Dec 4, 2022

FYI @mingmwang

@mingmwang
Copy link
Contributor

@alamb
Feel free to change to use ConfigOptions directly. If there are any changes need to be made in Ballista accordingly, I will take care.

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice!

Comment on lines +31 to 53
pub const OPT_TARGET_PARTITIONS: &str = "datafusion.execution.target_partitions";

/// Configuration option "datafusion.catalog.create_default_catalog_and_schema"
pub const OPT_CREATE_DEFAULT_CATALOG_AND_SCHEMA: &str =
"datafusion.catalog.create_default_catalog_and_schema";
/// Configuration option "datafusion.catalog.information_schema"
pub const OPT_INFORMATION_SCHEMA: &str = "datafusion.catalog.information_schema";

/// Configuration option "datafusion.optimizer.repartition_joins"
pub const OPT_REPARTITION_JOINS: &str = "datafusion.optimizer.repartition_joins";

/// Configuration option "datafusion.optimizer.repartition_aggregations"
pub const OPT_REPARTITION_AGGREGATIONS: &str =
"datafusion.optimizer.repartition_aggregations";

/// Configuration option "datafusion.optimizer.repartition_windows"
pub const OPT_REPARTITION_WINDOWS: &str = "datafusion.optimizer.repartition_windows";

/// Configuration option "datafusion.execuction_collect_statistics"
pub const OPT_COLLECT_STATISTICS: &str = "datafusion.execuction_collect_statistics";

/// Configuration option "datafusion.optimizer.filter_null_join_keys"
pub const OPT_FILTER_NULL_JOIN_KEYS: &str = "datafusion.optimizer.filter_null_join_keys";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we couldn't just make these an enum? Then we could make the set_x and get_x methods type safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 that is an excellent idea -- I don't think there is any reason we could not do so. I will file a follow on ticket to do so

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb merged commit 740a4fa into apache:master Dec 5, 2022
@alamb alamb deleted the alamb/unify_configuration branch December 5, 2022 19:56
@ursabot
Copy link

ursabot commented Dec 5, 2022

Benchmark runs are scheduled for baseline = 2a754f8 and contender = 740a4fa. 740a4fa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants