
Cost based optimization #226

Merged · 18 commits merged from feature/cost-based-optimization into main on Dec 21, 2021
Conversation

nils-braun
Collaborator

Related to #183.
This PR introduces real cost-based optimization into dask-sql using the Calcite volcano planner.
For this, I did three steps:

  • refactored the RelationalAlgebraGenerator class into smaller sub-classes to simplify the handling and to move away from the "default" framework program
  • added a volcano planner after the hep planner, which is allowed to choose rules based on their costs
  • gave the user the possibility to attach custom statistics to a table - currently this is only the row count, but we might want to add more later.

The create_table function now has an additional parameter, statistics, which takes a dask_sql.Statistics object. In my very first small tests, I could not find a relational algebra that is now optimized differently - but that does not mean there is none. Maybe we can find some benchmark use cases.
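A minimal sketch of how this could look from the user side (the keyword argument name of Statistics is an assumption; as described above, only the row count is supported so far):

```python
import dask.dataframe as dd
import pandas as pd

from dask_sql import Context, Statistics

c = Context()
df = dd.from_pandas(pd.DataFrame({"a": range(1000), "b": range(1000)}), npartitions=2)

# Attach a row-count estimate so the volcano planner can use it for costing
c.create_table("my_table", df, statistics=Statistics(row_count=1000))
```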

* Extract out utility classes for easier development
* Split the optimization into a non-cost-based pass (using the same rules as before) and a volcano planner.
* For this, implement physical nodes (with a DaskRel) which are so far
only a copy of the already present logical nodes (but with a different
convention).
* Only exception: split the sort+limit into a sort and a limit
@codecov-commenter

codecov-commenter commented Aug 26, 2021

Codecov Report

Merging #226 (8f94429) into main (f69f9bc) will decrease coverage by 0.11%.
The diff coverage is 94.85%.


@@            Coverage Diff             @@
##             main     #226      +/-   ##
==========================================
- Coverage   95.68%   95.56%   -0.12%     
==========================================
  Files          66       67       +1     
  Lines        2898     2933      +35     
  Branches      542      547       +5     
==========================================
+ Hits         2773     2803      +30     
- Misses         75       79       +4     
- Partials       50       51       +1     
Impacted Files Coverage Δ
dask_sql/java.py 100.00% <ø> (ø)
dask_sql/physical/rel/custom/create_experiment.py 96.25% <0.00%> (-1.25%) ⬇️
dask_sql/physical/rel/custom/create_model.py 91.52% <0.00%> (-1.70%) ⬇️
dask_sql/physical/rex/base.py 77.77% <33.33%> (-22.23%) ⬇️
dask_sql/physical/rel/logical/limit.py 93.02% <93.02%> (ø)
dask_sql/__init__.py 100.00% <100.00%> (ø)
dask_sql/context.py 100.00% <100.00%> (+0.84%) ⬆️
dask_sql/datacontainer.py 93.80% <100.00%> (+0.22%) ⬆️
dask_sql/mappings.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/base.py 92.10% <100.00%> (+0.21%) ⬆️
... and 14 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f69f9bc...8f94429. Read the comment docs.

@nils-braun
Collaborator Author

Hi @rajagurunath! I know this is quite a large PR (and I am sorry for that) and I do not expect you to read through all of it - but maybe you can find the time to have a quick look? I am more or less confident that I did not screw it up completely (because I only changed minor things in the tests and they still work), but I would feel a lot better if you also had a rough look :-) Thanks!

@rajagurunath
Collaborator

Hi @nils-braun,

Really amazing work, kudos to you!!! 👏

I was not able to understand everything, but I could follow roughly 60% of it. I tested it locally and, as you mentioned, it doesn't break anything and works like a charm. The Dask nodes and their respective Dask rules look clean and elegant on the Java side.

I compared the SQL plans with and without cost-based optimization and could see the effect of the row count in the plan; this will definitely be a great value add for dask-sql users.
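(For reference, a plan like the ones below can be printed from Python roughly like this; the query is only my reconstruction of one with the same shape as the plans, and it assumes Context.explain takes the same SQL string as Context.sql.)

```python
from dask_sql import Context

c = Context()
# ... register the "iris" table first, e.g. via c.create_table("iris", iris_df) ...

query = """
SELECT lhs.*
FROM iris AS lhs
JOIN (
    SELECT species, MAX(sepal_length) AS max_sepal_length
    FROM iris
    GROUP BY species
) AS rhs
ON lhs.species = rhs.species AND lhs.sepal_length = rhs.max_sepal_length
"""

# Prints the relational algebra, similar to the plans shown below
print(c.explain(query))
```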

With cost-based optimization plan:

DaskProject(sepal_length=[$0], sepal_width=[$1], petal_length=[$2], petal_width=[$3], species=[$4]): rowcount = 22.5, cumulative cost = {256.25 rows, 314.5 cpu, 0.0 io}, id = 273
  DaskJoin(condition=[AND(=($4, $5), =($0, $6))], joinType=[inner]): rowcount = 22.5, cumulative cost = {233.75 rows, 202.0 cpu, 0.0 io}, id = 272
    DaskTableScan(table=[[root, iris]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 249
    DaskAggregate(group=[{4}], sepal_length=[MAX($0)]): rowcount = 10.0, cumulative cost = {111.25 rows, 101.0 cpu, 0.0 io}, id = 271
      DaskTableScan(table=[[root, iris]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 249

And time to complete the above join :)
119 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Without cost-based optimization plan:

LogicalProject(sepal_length=[$0], sepal_width=[$1], petal_length=[$2], petal_width=[$3], species=[$4])
  LogicalJoin(condition=[AND(=($4, $5), =($0, $6))], joinType=[inner])
    LogicalTableScan(table=[[root, iris]])
    LogicalAggregate(group=[{4}], sepal_length=[MAX($0)])
      LogicalTableScan(table=[[root, iris]])

136 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I was not able to run a complete benchmark, but with the help of the dask-sql Feature-Overview.ipynb I can already see some improvement and I am super excited about that (even though the default row count was set to 100, there is some improvement in the above join query).

So, as you have mentioned, we will be adding more statistics (min, max, avg) to the plan, right? And how about an ANALYZE statement to add those statistics to each table? (Maybe we can create a new sub-issue to track this.)

this.name = name;
this.tableColumns = new ArrayList<Pair<String, SqlTypeName>>();
this.statistics = new DaskStatistics(rowCount);
Collaborator

Naive guess that setting this.statistics to Statistics.UNKNOWN here will revert behavior to pre-CBO? cc @jdye64

Collaborator

I don't believe it would revert it back to exactly how it was pre-CBO; however, without the statistics, the more fine-grained CBO improvements would not be available - especially optimizations around data types, sizes, or number of rows.

@@ -848,7 +848,7 @@ def _get_ral(self, sql):
             self.schema_name, case_sensitive
         )
         for schema in schemas:
-            generator_builder.addSchema(schema)
+            generator_builder = generator_builder.addSchema(schema)
Collaborator

This was the only non-Java change I made. Since this Java pattern uses a builder, previously, if multiple schemas were present, only the last schema instance would have been saved.
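(A minimal, hypothetical sketch in Python - not the actual Calcite builder - of why the return value needs to be reassigned when a builder returns a new instance instead of mutating itself:)

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class GeneratorBuilder:
    schemas: Tuple[str, ...] = ()

    def add_schema(self, schema: str) -> "GeneratorBuilder":
        # Returns a *new* builder; the receiver is left unchanged
        return replace(self, schemas=self.schemas + (schema,))


builder = GeneratorBuilder()
for schema in ["root", "other"]:
    builder.add_schema(schema)            # return value discarded: additions are lost
print(builder.schemas)                    # ()

builder = GeneratorBuilder()
for schema in ["root", "other"]:
    builder = builder.add_schema(schema)  # reassigning keeps every schema
print(builder.schemas)                    # ('root', 'other')
```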

@galipremsagar galipremsagar merged commit 98c38cf into main Dec 21, 2021
@galipremsagar galipremsagar deleted the feature/cost-based-optimization branch December 21, 2021 20:22
raydouglass added a commit to rapidsai/dask-sql that referenced this pull request Jan 5, 2022