
Cost based optimization #226

Merged · 18 commits merged from feature/cost-based-optimization into main on Dec 21, 2021
Conversation

nils-braun
Collaborator

Related to #183.
This PR introduces real cost-based optimization into dask-sql using the Calcite volcano planner.
For this, I did three steps:

  • refactored the RelationalAlgebraGenerator class into smaller sub-classes to simplify the handling and to move away from the "default" framework program
  • added a volcano planner after the hep planner, which is allowed to choose rules based on their costs
  • gave the user the possibility to attach custom statistics to a table - currently this is only the row count, but we might want to add more later.

The create_table function now has an additional parameter, statistics, which takes a dask_sql.Statistics object. In my very first small tests, I could not find a relational algebra that is now optimized differently - but that does not mean there is none. Maybe we can find some benchmark use cases.
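A minimal sketch of how this could look from the user side (the keyword argument name of Statistics is an assumption; as described above, only the row count is supported so far):

```python
import dask.dataframe as dd
import pandas as pd

from dask_sql import Context, Statistics

c = Context()
df = dd.from_pandas(pd.DataFrame({"a": range(1000), "b": range(1000)}), npartitions=2)

# Attach a row-count estimate so the volcano planner can use it for costing
c.create_table("my_table", df, statistics=Statistics(row_count=1000))
```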

* Extract out utility classes for easier development
* Split the optimization into a non-cost-based pass (using the same rules as before) and a volcano planner.
* For this, implement physical nodes (with a DaskRel) which are so far
only a copy of the already present logical nodes (but with a different
convention).
* Only exception: split the sort+limit into a sort and a limit
@codecov-commenter

codecov-commenter commented Aug 26, 2021

Codecov Report

Merging #226 (8f94429) into main (f69f9bc) will decrease coverage by 0.11%.
The diff coverage is 94.85%.


@@            Coverage Diff             @@
##             main     #226      +/-   ##
==========================================
- Coverage   95.68%   95.56%   -0.12%     
==========================================
  Files          66       67       +1     
  Lines        2898     2933      +35     
  Branches      542      547       +5     
==========================================
+ Hits         2773     2803      +30     
- Misses         75       79       +4     
- Partials       50       51       +1     
Impacted Files Coverage Δ
dask_sql/java.py 100.00% <ø> (ø)
dask_sql/physical/rel/custom/create_experiment.py 96.25% <0.00%> (-1.25%) ⬇️
dask_sql/physical/rel/custom/create_model.py 91.52% <0.00%> (-1.70%) ⬇️
dask_sql/physical/rex/base.py 77.77% <33.33%> (-22.23%) ⬇️
dask_sql/physical/rel/logical/limit.py 93.02% <93.02%> (ø)
dask_sql/__init__.py 100.00% <100.00%> (ø)
dask_sql/context.py 100.00% <100.00%> (+0.84%) ⬆️
dask_sql/datacontainer.py 93.80% <100.00%> (+0.22%) ⬆️
dask_sql/mappings.py 100.00% <100.00%> (ø)
dask_sql/physical/rel/base.py 92.10% <100.00%> (+0.21%) ⬆️
... and 14 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f69f9bc...8f94429. Read the comment docs.

@nils-braun
Collaborator Author

Hi @rajagurunath! I know this is quite a large PR (and I am sorry for that) and I do not expect you to read through all of it - but maybe you can find the time to have a quick look? I am more or less confident that I did not screw it up completely (because I only changed minor things in the tests and they still work), but I would feel a lot better if you also had a rough look :-) Thanks!

@rajagurunath
Collaborator

Hi @nils-braun,

Really amazing work, kudos to you!!! 👏

I was not able to understand everything, but I could follow roughly 60% of it. I tested it locally and, as you mentioned, it doesn't break anything and works like a charm. The Dask nodes and their respective Dask rules look clean and elegant on the Java side.

I compared the SQL plans with and without cost-based optimization and could see the effect of the row count in the plan; this will definitely be a great value add for dask-sql users.
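(For reference, a plan like the ones below can be printed from Python roughly like this; the query is only my reconstruction of one with the same shape as the plans, and it assumes Context.explain takes the same SQL string as Context.sql.)

```python
from dask_sql import Context

c = Context()
# ... register the "iris" table first, e.g. via c.create_table("iris", iris_df) ...

query = """
SELECT lhs.*
FROM iris AS lhs
JOIN (
    SELECT species, MAX(sepal_length) AS max_sepal_length
    FROM iris
    GROUP BY species
) AS rhs
ON lhs.species = rhs.species AND lhs.sepal_length = rhs.max_sepal_length
"""

# Prints the relational algebra, similar to the plans shown below
print(c.explain(query))
```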

With cost-based optimization plan:

DaskProject(sepal_length=[$0], sepal_width=[$1], petal_length=[$2], petal_width=[$3], species=[$4]): rowcount = 22.5, cumulative cost = {256.25 rows, 314.5 cpu, 0.0 io}, id = 273
  DaskJoin(condition=[AND(=($4, $5), =($0, $6))], joinType=[inner]): rowcount = 22.5, cumulative cost = {233.75 rows, 202.0 cpu, 0.0 io}, id = 272
    DaskTableScan(table=[[root, iris]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 249
    DaskAggregate(group=[{4}], sepal_length=[MAX($0)]): rowcount = 10.0, cumulative cost = {111.25 rows, 101.0 cpu, 0.0 io}, id = 271
      DaskTableScan(table=[[root, iris]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 249

And time to complete the above join :)
119 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Without cost-based optimization plan:

LogicalProject(sepal_length=[$0], sepal_width=[$1], petal_length=[$2], petal_width=[$3], species=[$4])
  LogicalJoin(condition=[AND(=($4, $5), =($0, $6))], joinType=[inner])
    LogicalTableScan(table=[[root, iris]])
    LogicalAggregate(group=[{4}], sepal_length=[MAX($0)])
      LogicalTableScan(table=[[root, iris]])

136 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I was not able to run a complete benchmark, but with the help of the dask-sql Feature-Overview.ipynb I can already see some improvement and I am super excited about that (even though the default row count was set to 100, there is some improvement in the above join query).

So, as you have mentioned, we will be adding more statistics (min, max, avg) to the plan, right? And how about an ANALYZE statement to add those statistics to each table? (Maybe we can create a new sub-issue to track this.)

this.name = name;
this.tableColumns = new ArrayList<Pair<String, SqlTypeName>>();
this.statistics = new DaskStatistics(rowCount);
Collaborator

Naive guess that setting this.statistics to Statistics.UNKNOWN here will revert behavior to pre-CBO? cc @jdye64

Collaborator

I don't believe it would revert it back to exactly how it was pre-CBO; however, without the statistics, the more fine-grained CBO improvements would not be available - especially optimizations around data types, sizes, or number of rows.

@@ -848,7 +848,7 @@ def _get_ral(self, sql):
             self.schema_name, case_sensitive
         )
         for schema in schemas:
-            generator_builder.addSchema(schema)
+            generator_builder = generator_builder.addSchema(schema)
Collaborator

This was the only non-Java change I made. Since this Java pattern uses a builder, previously, if multiple schemas were present, only the last schema instance would have been saved.
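(A minimal, hypothetical sketch in Python - not the actual Calcite builder - of why the return value needs to be reassigned when a builder returns a new instance instead of mutating itself:)

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class GeneratorBuilder:
    schemas: Tuple[str, ...] = ()

    def add_schema(self, schema: str) -> "GeneratorBuilder":
        # Returns a *new* builder; the receiver is left unchanged
        return replace(self, schemas=self.schemas + (schema,))


builder = GeneratorBuilder()
for schema in ["root", "other"]:
    builder.add_schema(schema)            # return value discarded: additions are lost
print(builder.schemas)                    # ()

builder = GeneratorBuilder()
for schema in ["root", "other"]:
    builder = builder.add_schema(schema)  # reassigning keeps every schema
print(builder.schemas)                    # ('root', 'other')
```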

@galipremsagar galipremsagar merged commit 98c38cf into main Dec 21, 2021
@galipremsagar galipremsagar deleted the feature/cost-based-optimization branch December 21, 2021 20:22
raydouglass added a commit to rapidsai/dask-sql that referenced this pull request Jan 5, 2022