Cost based optimization #226
Conversation
* Extract out utility classes for easier development.
* Split the optimization into a non-cost-based planner (using the same rules as before) and a volcano planner.
* For this, implement physical nodes (with a `DaskRel` convention), which are so far only a copy of the already present logical nodes (but with a different convention).
* Only exception: split the sort+limit into a separate sort and a limit.
…oing this, improve the logging output
…ot implemented. So keep the default
…re then used in the optimization
Codecov Report
```
@@            Coverage Diff             @@
##             main     #226      +/-   ##
==========================================
- Coverage   95.68%   95.56%   -0.12%
==========================================
  Files          66       67       +1
  Lines        2898     2933      +35
  Branches      542      547       +5
==========================================
+ Hits         2773     2803      +30
- Misses         75       79       +4
- Partials       50       51       +1
```
Continue to review full report at Codecov.
Force-pushed from d51c72b to 4e27798.
Hi @rajagurunath! I know this is quite a large PR (and I am sorry for that) and I do not expect you to read through all of it - but maybe you find the time to have a quick look? I am more or less confident that I did not screw it up completely (because I only changed minor things in the tests and they still work), but I would feel a lot better if you also have a rough look :-) Thanks!
Hi @nils-braun, really amazing work, kudos to you! 👏 I was not able to understand everything, but could follow roughly 60% of it. I tested it locally and, as you mentioned, it doesn't break anything and works like a charm. The Dask nodes and their respective Dask rules look clean and elegant on the Java side. I compared the plans of a SQL query with and without cost-based optimization and can see the effect of the row count in the plan; this will definitely be a great value add for dask-sql users.

With cost-based optimization plan:

And the time to complete the above join :) Without cost-based optimization plan:

I was not able to benchmark completely. As you mentioned, we will be adding more statistics (min, max, avg) to the plan, right? And how about the …
```java
this.name = name;
this.tableColumns = new ArrayList<Pair<String, SqlTypeName>>();
this.statistics = new DaskStatistics(rowCount);
```
Naive guess that setting `this.statistics` to `Statistics.UNKNOWN` here will revert behavior to pre-CBO? cc @jdye64
I don't believe it would revert it back to exactly like it was pre-CBO. However, without the statistics, the more fine-grained CBO improvements would not be available - especially optimizations around data types, sizes, or number of rows.
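To illustrate the point above, here is a toy sketch in pure Python (not Calcite's actual `Statistics` API - the class, default constant, and function names are invented for illustration) of why unknown statistics disable fine-grained cost decisions: without a real row count, the planner can only fall back to a fixed default, so every table looks the same.

```python
# Toy illustration (NOT the Calcite API): a planner falling back to a
# default row-count estimate when a table's statistics are unknown.

DEFAULT_ROW_COUNT = 100.0  # arbitrary fallback, stands in for a planner default


class TableStatistics:
    """Carries an optional row count; None plays the role of 'unknown'."""

    def __init__(self, row_count=None):
        self.row_count = row_count


UNKNOWN = TableStatistics()  # no information available


def estimated_rows(stats):
    # With unknown statistics every table gets the same fixed estimate,
    # so cost-based choices between plans degrade to defaults.
    if stats.row_count is None:
        return DEFAULT_ROW_COUNT
    return stats.row_count


print(estimated_rows(TableStatistics(1_000_000)))  # 1000000
print(estimated_rows(UNKNOWN))                     # 100.0
```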
```diff
@@ -848,7 +848,7 @@ def _get_ral(self, sql):
             self.schema_name, case_sensitive
         )
         for schema in schemas:
-            generator_builder.addSchema(schema)
+            generator_builder = generator_builder.addSchema(schema)
```
This was the only non-Java change I made. Since this Java pattern uses a builder, previously, if multiple schemas were present, only the last schema instance would have been saved.
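The bug class behind this one-line fix can be sketched in pure Python (a made-up `GeneratorBuilder`, not the actual dask-sql/Calcite class): when a builder method returns a new builder instead of mutating the receiver, calling it without reassigning the result silently drops the additions. In this toy version no schema survives at all; depending on the real builder's internals the original bug kept only the last schema, but the fix is the same - keep the returned builder.

```python
# Toy immutable builder (illustration only) showing why the reassignment
# in `generator_builder = generator_builder.addSchema(schema)` matters.


class GeneratorBuilder:
    def __init__(self, schemas=()):
        self._schemas = tuple(schemas)

    def add_schema(self, schema):
        # Returns a NEW builder instead of mutating self.
        return GeneratorBuilder(self._schemas + (schema,))

    def build(self):
        return list(self._schemas)


schemas = ["a", "b", "c"]

# Buggy pattern: the returned builder is discarded each iteration.
buggy = GeneratorBuilder()
for s in schemas:
    buggy.add_schema(s)  # result thrown away!
print(buggy.build())  # []

# Fixed pattern: keep the returned builder, as in the PR's one-line change.
fixed = GeneratorBuilder()
for s in schemas:
    fixed = fixed.add_schema(s)
print(fixed.build())  # ['a', 'b', 'c']
```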
Cost based optimization (dask-contrib#226)
Related to #183.
This PR introduces real cost-based optimization into dask-sql using the Calcite volcano planner.
For this, I did three steps:

* Extracted utility classes for easier development.
* Split the `RelationalAlgebraGenerator` class into smaller sub-classes to make the handling easier and go away from the "default" `Framework` program.
* Split the optimization into a non-cost-based planner (using the same rules as before) and a volcano planner, with physical `DaskRel` nodes.

The `create_table` function now has an additional parameter `statistics`, where you can pass a `dask_sql.Statistics` object. In my very first small tests, I could not find a relational algebra which is now optimized differently - but that does not mean there is none. Maybe we can find some benchmark use-cases.
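As a rough intuition for what the row-count statistics buy the volcano planner, here is a toy cost model in pure Python (invented for illustration - not dask-sql's or Calcite's actual cost functions): knowing which join input is small lets the planner pick the cheaper join order.

```python
# Toy cost model (illustration only) showing how row-count statistics
# can change a join-order decision in a cost-based planner.


def hash_join_cost(build_rows, probe_rows):
    # Simplified model: building the hash table is assumed to cost
    # more per row than probing it.
    return 3 * build_rows + probe_rows


def choose_build_side(left_rows, right_rows):
    # A cost-based planner evaluates both orders and picks the cheaper one.
    if hash_join_cost(left_rows, right_rows) <= hash_join_cost(right_rows, left_rows):
        return "left"
    return "right"


# With statistics, the planner learns the right table is tiny and
# builds the hash table on it instead of the million-row table.
print(choose_build_side(left_rows=1_000_000, right_rows=100))  # right
```

Without row counts (the pre-CBO situation), both orders look identical and the planner has to fall back to a fixed heuristic.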