Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rule to reimplement Eliminate cross join and remove it in planner #4185

Merged
merged 7 commits into from
Nov 22, 2022

Conversation

jackwener
Copy link
Member

@jackwener jackwener commented Nov 12, 2022

Which issue does this PR close?

Closes #4176.
Part of #4267

  • Current implementation of eliminate_cross_join is wrong. Because it can't rearrange join order. detail is inside Reimplement the eliminate_cross_join #4176..
  • remove the eliminate_cross_join from planner
  • reimplement eliminate_cross_join to avoid to use global-state. In new implementation, we just depends only on the shape of the subtree.

Rationale for this change

What changes are included in this PR?

Reimplement Eliminate cross join

Are these changes tested?

Test Reorder join for eliminating cross join.

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner labels Nov 12, 2022
\n TableScan: person\
\n SubqueryAlias: p2\
\n TableScan: person\
\n Inner Join: person.id = p.id\
\n CrossJoin:\
\n TableScan: person\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the queries now have more cross joins? Is this expected?

Copy link
Member Author

@jackwener jackwener Nov 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expected
It's just because it's just build plan without optimization.
quick_test () just invoke logical_plan_with_dialect

@github-actions github-actions bot added core Core datafusion crate physical-expr Physical Expressions labels Nov 12, 2022
@jackwener jackwener mentioned this pull request Nov 12, 2022
@github-actions github-actions bot removed core Core datafusion crate physical-expr Physical Expressions labels Nov 12, 2022
@jackwener jackwener force-pushed the eliminate_cross_join branch 3 times, most recently from a6e42cb to c5582ea Compare November 13, 2022 06:59

// look for expressions of the form `<column> = <column>`
let mut possible_join_keys = vec![];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see those logic is moved out from the planner.

@mingmwang
Copy link
Contributor

mingmwang commented Nov 13, 2022

There is an important point about Schema need to discuss.

We can change Schema into set instead of list. Because it would cause many projection just for order of field.

such as

a b -> proj(join(b a))
a b c ->  join(proj(join(b a)) c) -> project(join(c, project (join(b a))))

Lots of projection will make our rule become complex, we should consider many case, especially for join reorder.

A simple but efficient way is override ==

I think it is because you make the rule run a bottom-up process, then lots of projections are added. If we make it a top-down
process, only a final projection will be added.

@jackwener
Copy link
Member Author

Thanks @mingmwang 👍, I will fix them

@jackwener
Copy link
Member Author

Thanks @mingmwang review !❤️
cc @liukun4515 @Dandandan @andygrove @alamb

@mingmwang
Copy link
Contributor

@jackwener I will take a closer look Tomorrow.

@alamb
Copy link
Contributor

alamb commented Nov 15, 2022

I will try and review this in the next day or two -- but it may take me a while. Hopefully @mingmwang can help review as well

@mingmwang
Copy link
Contributor

There is specific logic in extract_possible_join_keys() to handle and pull up common Exprs in the Or branches.
But I think this logic should be handled in a common rule like ExprSimplifier.
@alamb
Once we reach the other rules, Exprs in Filters or Join conditions should be the already simplified form.

(A.a = B.b and B.x = xxx) or (A.a = B.b and B.y = yyy)

To

(A.a = B.b) and (B.x = xxx or B.y = yyy)

@jackwener
Copy link
Member Author

There is specific logic in extract_possible_join_keys() to handle and pull up common Exprs in the Or branches. But I think this logic should be handled in a common rule like ExprSimplifier. @alamb Once we reach the other rules, Exprs in Filters or Join conditions should be the already simplified form.

(A.a = B.b and B.x = xxx) or (A.a = B.b and B.y = yyy)

To

(A.a = B.b) and (B.x = xxx or B.y = yyy)

Nice idea👍! It's a great future ticket.

@alamb
Copy link
Contributor

alamb commented Nov 20, 2022

I plan to review this and other join related PRs tomorrow. I apologize for the delays. The join work is really neat, but it is not a high priority at the moment in IOx so I have had to prioritize other work higher and do join related

I appreciate the help that @jackwener @mingmwang are giving each other in the review process. 🙏

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this PR quite thoroughly -- thank you very much @jackwener

I'll plan to merge this tomorrow unless there are any other comments

Inner Join: nation.n_regionkey = region.r_regionkey
Inner Join: supplier.s_nationkey = nation.n_nationkey
Inner Join: partsupp.ps_suppkey = supplier.s_suppkey
Projection: part.p_partkey, part.p_mfgr, supplier.s_name, supplier.s_address, supplier.s_phone, supplier.s_acctbal, supplier.s_comment, nation.n_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the difference much easier to see with whitespace blind diff

https://github.com/apache/arrow-datafusion/pull/4185/files?w=1

I found github's rendering of this really hard to understand the change in plan -- I drew out the join graphs by hand to make sure they were the same.

Comment on lines -2959 to -3220
fn extract_possible_join_keys(
expr: &Expr,
accum: &mut Vec<(Column, Column)>,
) -> Result<()> {
match expr {
Expr::BinaryExpr(BinaryExpr { left, op, right }) => match op {
Operator::Eq => match (left.as_ref(), right.as_ref()) {
(Expr::Column(l), Expr::Column(r)) => {
accum.push((l.clone(), r.clone()));
Ok(())
}
_ => Ok(()),
},
Operator::And => {
extract_possible_join_keys(left, accum)?;
extract_possible_join_keys(right, accum)
}
_ => Ok(()),
},
_ => Ok(()),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LogicalPlan::CrossJoin(_) => {
flatten_join_inputs(child, possible_join_keys, all_inputs)?;
}
_ => all_inputs.push((*child).clone()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually it would be awesome to avoid so much cloneing -- maybe as a follow on PR

@@ -849,14 +955,14 @@ mod tests {

let expected = vec![
"Filter: (t4.c < UInt32(15) OR t4.c = UInt32(688)) AND (t4.c < UInt32(15) OR t3.c = UInt32(688) OR t3.b = t4.b) [a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32]",
" Inner Join: t1.a = t3.a [a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32, a:UInt32, b:UInt32, c:UInt32]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@alamb
Copy link
Contributor

alamb commented Nov 21, 2022

I took the liberty of merging master into the branch so that we avoid any possible logical conflicts.

Thanks again for all the comments @mingmwang and @Dandandan

@alamb
Copy link
Contributor

alamb commented Nov 21, 2022

There is specific logic in extract_possible_join_keys() to handle and pull up common Exprs in the Or branches. But I

Nice idea👍! It's a great future ticket.

I agree it would be a good idea for a future ticket. There is also https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/rewrite_disjunctive_predicate.rs which I think tries to do the same thing

@alamb
Copy link
Contributor

alamb commented Nov 21, 2022

cc @xudong963

@jackwener
Copy link
Member Author

I took the liberty of merging master into the branch so that we avoid any possible logical conflicts.

Pulled master into it. Thanks @alamb ❤️

@xudong963
Copy link
Member

There is specific logic in extract_possible_join_keys() to handle and pull up common Exprs in the Or branches. But I

Nice idea👍! It's a great future ticket.

I agree it would be a good idea for a future ticket. There is also https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/rewrite_disjunctive_predicate.rs which I think tries to do the same thing

Yes, thanks for your mention :)

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see the logical moved from planner to optimizer, nice clean.

Maybe @DhamoPS would like to take a look, I remember he has worked similar work!

@jackwener
Copy link
Member Author

fix new conflict.

@alamb
Copy link
Contributor

alamb commented Nov 22, 2022

Happy to see the logical moved from planner to optimizer, nice clean.

Maybe @DhamoPS would like to take a look, I remember he has worked similar work!

Good call @xudong963 -- I am going to merge this PR now as it has been outstanding for a long time. Thanks again @jackwener the recent contributions are awesome

@alamb alamb merged commit d355f69 into apache:master Nov 22, 2022
@ursabot
Copy link

ursabot commented Nov 22, 2022

Benchmark runs are scheduled for baseline = bfce076 and contender = d355f69. d355f69 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jackwener jackwener deleted the eliminate_cross_join branch November 23, 2022 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate optimizer Optimizer rules sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reimplement the eliminate_cross_join
6 participants