fix `Planner` don't generate `SubqueryAlias` and generate duplicated `SubqueryAlias` #4484

jackwener · 2022-12-02T09:51:04Z

Which issue does this PR close?

Closes #4483
Closes #4454
Closes #4481

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jackwener · 2022-12-02T09:53:04Z

benchmarks/expected-plans/q13.txt

-Sort: custdist DESC NULLS FIRST, c_orders.c_count DESC NULLS FIRST
- Projection: c_orders.c_count, COUNT(UInt8(1)) AS custdist
- Aggregate: groupBy=[[c_orders.c_count]], aggr=[[COUNT(UInt8(1))]]
- SubqueryAlias: c_orders


A `SubqueryAlias: c_orders` node already exists just below this one, so this one was a duplicate.

jackwener · 2022-12-02T09:57:07Z

datafusion/sql/src/planner.rs

- (Some(cte_plan), _) => match table_alias {
- Some(cte_alias) => {
- Ok(with_alias(cte_plan.clone(), cte_alias))
- }
- _ => Ok(cte_plan.clone()),
- },
+ (Some(cte_plan), _) => Ok(cte_plan.clone()),
 (_, Ok(provider)) => {
- let scan =
- LogicalPlanBuilder::scan(&table_name, provider, None);
- let scan = match table_alias.as_ref() {
- Some(ref name) => scan?.alias(name.to_owned().as_str()),
- _ => scan,
- };
- scan?.build()


We must remove this code because it adds a `SubqueryAlias`.

The code that follows adds the alias once again, which causes a duplicated `SubqueryAlias`.

jackwener · 2022-12-02T09:57:29Z

datafusion/sql/src/planner.rs

-
- let plan = match normalized_alias {
- Some(alias) => with_alias(logical_plan, alias),
- _ => logical_plan,
- };
- (plan, alias)


We must remove this code because it adds a `SubqueryAlias`.
The code that follows adds the alias once again, which causes a duplicated `SubqueryAlias`.

jackwener · 2022-12-02T10:01:32Z

datafusion/sql/src/planner.rs

- let columns_alias = alias.clone().columns;
- if columns_alias.is_empty() {
- // sqlparser-rs encodes AS t as an empty list of column alias


The original code has two bugs: it adds the alias twice, and when the column-alias list is empty it forgets to add the alias. Let me explain.
The process was as follows:

1. Add the table alias (the code removed above).
2. Add the table alias and the column aliases (the code here).
   - Bug: when the column-alias list is empty, this code ignores the table alias.

So:

- The first bug (adding twice) causes a duplicated `SubqueryAlias`.
- The second bug (forgetting the alias when the column-alias list is empty) causes a missing `SubqueryAlias`.

So we just unify them into a single step:

- Add the table alias and the column aliases.
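To illustrate the unified step, here is a minimal sketch on a toy plan type. It is not the DataFusion API: `Plan`, `TableAlias`, `apply_table_alias`, and `count_aliases` are hypothetical names, and placing the column renaming below the alias node is an assumption made only for this example.

```rust
// Toy model only; none of these types are DataFusion's.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { table: String },
    Projection { columns: Vec<String>, input: Box<Plan> },
    SubqueryAlias { alias: String, input: Box<Plan> },
}

/// A parsed `AS name (col1, col2, ...)` clause.
/// `columns` is empty for a plain `AS name`.
struct TableAlias {
    name: String,
    columns: Vec<String>,
}

/// The single place where a table alias is applied.
fn apply_table_alias(plan: Plan, alias: Option<&TableAlias>) -> Plan {
    let alias = match alias {
        Some(alias) => alias,
        None => return plan,
    };
    // Rename columns only when the alias lists any.
    let plan = if alias.columns.is_empty() {
        plan
    } else {
        Plan::Projection {
            columns: alias.columns.clone(),
            input: Box::new(plan),
        }
    };
    // Wrap exactly once, even when the column list is empty.
    Plan::SubqueryAlias {
        alias: alias.name.clone(),
        input: Box::new(plan),
    }
}

/// Counts the `SubqueryAlias` nodes in a plan.
fn count_aliases(plan: &Plan) -> usize {
    match plan {
        Plan::Scan { .. } => 0,
        Plan::Projection { input, .. } => count_aliases(input),
        Plan::SubqueryAlias { input, .. } => 1 + count_aliases(input),
    }
}
```

Because the alias is applied in one function only, a plain `AS t` and an `AS t (x, y)` both end up with exactly one alias node in this model.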

jackwener · 2022-12-02T10:12:32Z

datafusion/core/tests/sql/window.rs

- " EmptyRelation []",
- " Projection: Int64(7) AS a, Utf8(\"bb\") AS b [a:Int64, b:Utf8]",
- " EmptyRelation []",
+ " SubqueryAlias: _sample_data [a:Int64, b:Utf8]",


Missing SubqueryAlias

jackwener · 2022-12-05T09:32:38Z

PTAL @liukun4515 @mingmwang @alamb @andygrove

andygrove · 2022-12-05T19:01:05Z

datafusion/expr/src/logical_plan/builder.rs

+pub fn subquery_alias_owned(plan: LogicalPlan, alias: &str) -> Result<LogicalPlan> {
+ let schema: Schema = plan.schema().as_ref().clone().into();
+ let schema = DFSchemaRef::new(DFSchema::try_from_qualified_schema(alias, &schema)?);
+ Ok(LogicalPlan::SubqueryAlias(SubqueryAlias {
 input: Arc::new(plan),
- alias,
- schema: Arc::new(schema),
- })
+ alias: alias.to_string(),
+ schema,
+ }))
 }


Could we move this logic into SubqueryAlias::try_new instead?

I have added it.

mingmwang · 2022-12-06T03:38:57Z

I still think the SubqueryAlias should be removed at an early stage of the logical optimizer.
I just ran the same SQL on Spark SQL and found that the SubqueryAlias was removed entirely.

Spark SQL Output:

+----------------------------------------------------+
|                        plan                        |
+----------------------------------------------------+
| == Parsed Logical Plan ==
CTE [_sample_data, _data2]
:  :- SubqueryAlias _sample_data
:  :  +- Union
:  :     :- Union
:  :     :  :- Union
:  :     :  :  :- Project [1 AS a#5683777, aa AS b#5683778]
:  :     :  :  :  +- OneRowRelation
:  :     :  :  +- Project [3 AS a#5683779, aa AS b#5683780]
:  :     :  :     +- OneRowRelation
:  :     :  +- Project [5 AS a#5683781, bb AS b#5683782]
:  :     :     +- OneRowRelation
:  :     +- Project [7 AS a#5683783, bb AS b#5683784]
:  :        +- OneRowRelation
:  +- 'SubqueryAlias _data2
:     +- 'Project ['row_number() windowspecdefinition('s.b, 's.a ASC NULLS FIRST, unspecifiedframe$()) AS seq#5683785, 's.a, 's.b]
:        +- 'SubqueryAlias s
:           +- 'UnresolvedRelation [_sample_data]
+- 'Sort ['d.b ASC NULLS FIRST], true
   +- 'Aggregate ['d.b], ['d.b, 'MAX('d.a) AS max_a#5683776]
      +- 'SubqueryAlias d
         +- 'UnresolvedRelation [_data2]

== Analyzed Logical Plan ==
b: string, max_a: int
Project [b#5683778, max_a#5683776]
+- Sort [b#5683778 ASC NULLS FIRST], true
   +- Aggregate [b#5683778], [b#5683778, max(a#5683777) AS max_a#5683776]
      +- SubqueryAlias d
         +- SubqueryAlias _data2
            +- Project [seq#5683785, a#5683777, b#5683778]
               +- Project [a#5683777, b#5683778, seq#5683785, seq#5683785]
                  +- Window [row_number() windowspecdefinition(b#5683778, a#5683777 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS seq#5683785], [b#5683778], [a#5683777 ASC NULLS FIRST]
                     +- Project [a#5683777, b#5683778]
                        +- SubqueryAlias s
                           +- SubqueryAlias _sample_data
                              +- Union
                                 :- Union
                                 :  :- Union
                                 :  :  :- Project [1 AS a#5683777, aa AS b#5683778]
                                 :  :  :  +- OneRowRelation
                                 :  :  +- Project [3 AS a#5683779, aa AS b#5683780]
                                 :  :     +- OneRowRelation
                                 :  +- Project [5 AS a#5683781, bb AS b#5683782]
                                 :     +- OneRowRelation
                                 +- Project [7 AS a#5683783, bb AS b#5683784]
                                    +- OneRowRelation

== Optimized Logical Plan ==
Sort [b#5683778 ASC NULLS FIRST], true
+- Aggregate [b#5683778], [b#5683778, max(a#5683777) AS max_a#5683776]
   +- Union
      :- Project [1 AS a#5683777, aa AS b#5683778]
      :  +- OneRowRelation
      :- Project [3 AS a#5683779, aa AS b#5683780]
      :  +- OneRowRelation
      :- Project [5 AS a#5683781, bb AS b#5683782]
      :  +- OneRowRelation
      +- Project [7 AS a#5683783, bb AS b#5683784]
         +- OneRowRelation

mingmwang · 2022-12-06T03:39:38Z

Test SQL:

WITH _sample_data AS (
                SELECT 1 as a, 'aa' AS b
                UNION ALL
                SELECT 3 as a, 'aa' AS b
                UNION ALL
                SELECT 5 as a, 'bb' AS b
                UNION ALL
                SELECT 7 as a, 'bb' AS b
            ), _data2 AS (
                SELECT
                row_number() OVER (PARTITION BY s.b ORDER BY s.a) AS seq,
                s.a,
                s.b
                FROM _sample_data s
            )
            SELECT d.b, MAX(d.a) AS max_a
            FROM _data2 d
            GROUP BY d.b
            ORDER BY d.b;

jackwener · 2022-12-06T04:35:35Z

Currently this PR only fixes the bug and doesn't include any optimization.
I will do that optimization in #4412; your comment is helpful for it. Thanks @mingmwang! 👍

We can see that Spark's analyzed plan contains:

      +- SubqueryAlias d
         +- SubqueryAlias _data2

But in DataFusion, `SubqueryAlias _data2` is missing because of this bug.

mingmwang · 2022-12-06T05:04:05Z

To remove the SubqueryAlias from the logical plan tree, I think there are two approaches:

1. Add a rule that unnests the inner child of the SubqueryAlias and updates the qualifier name in the child's schema. This needs a new method on the logical operator types to update the schema's qualifier name.
2. Modify the logical Column expr so that, instead of depending on the qualifier/relation name, it depends on the index, just like the physical Column expr. A variant of this approach is to define two types of Column exprs: UnResolvedColumn and ResolvedColumn. UnResolvedColumn behaves like the current Column expr and depends on the qualifier/relation name; ResolvedColumn behaves like the current physical Column expr and depends on the index.
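A rough sketch of approach 1 on a toy plan type, to make the idea concrete. Everything here is hypothetical (`Field`, `Plan`, `with_qualifier`, `remove_subquery_alias`); it is not DataFusion code, and it ignores the expressions inside each node, which a real rule would also have to keep consistent with the rewritten qualifiers.

```rust
// Toy model only; none of these types are DataFusion's.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    qualifier: Option<String>,
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { schema: Vec<Field> },
    Filter { schema: Vec<Field>, input: Box<Plan> },
    SubqueryAlias { alias: String, input: Box<Plan> },
}

impl Plan {
    /// Hypothetical "update the schema's qualifier name" method:
    /// rewrites the qualifier of every field in this node's output schema.
    fn with_qualifier(self, qualifier: &str) -> Plan {
        let requalify = |schema: Vec<Field>| -> Vec<Field> {
            schema
                .into_iter()
                .map(|f| Field { qualifier: Some(qualifier.to_string()), name: f.name })
                .collect()
        };
        match self {
            Plan::Scan { schema } => Plan::Scan { schema: requalify(schema) },
            Plan::Filter { schema, input } => Plan::Filter { schema: requalify(schema), input },
            Plan::SubqueryAlias { input, .. } => {
                Plan::SubqueryAlias { alias: qualifier.to_string(), input }
            }
        }
    }
}

/// Removes every `SubqueryAlias` node and pushes its alias into the
/// child's schema as the qualifier. For nested aliases the outermost wins.
fn remove_subquery_alias(plan: Plan) -> Plan {
    match plan {
        Plan::SubqueryAlias { alias, input } => {
            remove_subquery_alias(*input).with_qualifier(&alias)
        }
        Plan::Filter { schema, input } => Plan::Filter {
            schema,
            input: Box::new(remove_subquery_alias(*input)),
        },
        scan @ Plan::Scan { .. } => scan,
    }
}
```

In this model, `SubqueryAlias d` over `SubqueryAlias _data2` over a scan collapses to the scan alone, with its fields qualified as `d`.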

jackwener · 2022-12-06T05:12:03Z

> Modify the logical Column expr so that, instead of depending on the qualifier/relation name, it depends on the index, just like the physical Column expr. A variant of this approach is to define two types of Column exprs: UnResolvedColumn and ResolvedColumn. UnResolvedColumn behaves like the current Column expr and depends on the qualifier/relation name; ResolvedColumn behaves like the current physical Column expr and depends on the index.

If we want to do this, I think we may need to add a binder/analyzer to DataFusion. cc @liukun4515

- The DuckDB binder converts ParsedExpression -> BoundExpression.
- The Spark analyzer converts Unresolved -> Resolved.

This job will be complicated.

I will try approach 1 in #4412.

mingmwang · 2022-12-06T06:15:14Z

> > Modify the logical Column expr so that, instead of depending on the qualifier/relation name, it depends on the index, just like the physical Column expr. A variant of this approach is to define two types of Column exprs: UnResolvedColumn and ResolvedColumn. UnResolvedColumn behaves like the current Column expr and depends on the qualifier/relation name; ResolvedColumn behaves like the current physical Column expr and depends on the index.
>
> If we want to do this, I think we may need to add a binder/analyzer to DataFusion. cc @liukun4515
>
> - The DuckDB binder converts ParsedExpression -> BoundExpression.
> - The Spark analyzer converts Unresolved -> Resolved.
>
> This job will be complicated.
>
> I will try approach 1 in #4412.

Just curious, does Apache Doris have an explicit binder/analyzer phase ?

jackwener · 2022-12-06T08:03:15Z

> Just curious, does Apache Doris have an explicit binder/analyzer phase?

Yes, Doris has an analyzer here.
It's an explicit phase that runs as a Job in the Cascade Optimizer, and there are analysis rules that do this work, such as bindRelation (code).
It binds UnboundXXX nodes against the catalog and converts them to BoundXXX.
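As a small illustration of what such a binding step does, here is a sketch that resolves a name-based column into an index-based one against a schema. The types and the `bind` function are hypothetical and only loosely follow the UnResolvedColumn/ResolvedColumn idea from this thread; this is not how Doris, Spark, DuckDB, or DataFusion implement it.

```rust
// Toy model only; all names are made up for this example.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    qualifier: Option<String>,
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
enum ColumnExpr {
    /// Refers to a column by (optional) qualifier and name.
    Unresolved { qualifier: Option<String>, name: String },
    /// Refers to a column by its position in the input schema.
    Resolved { index: usize },
}

/// Binds an unresolved column against a schema.
/// An unqualified reference matches any qualifier; a qualified one must match exactly.
fn bind(expr: &ColumnExpr, schema: &[Field]) -> Result<ColumnExpr, String> {
    match expr {
        ColumnExpr::Resolved { .. } => Ok(expr.clone()),
        ColumnExpr::Unresolved { qualifier, name } => {
            let mut matches = schema.iter().enumerate().filter(|(_, f)| {
                f.name == *name && (qualifier.is_none() || f.qualifier == *qualifier)
            });
            // Exactly one match is required; zero or several is an error.
            match (matches.next(), matches.next()) {
                (Some((index, _)), None) => Ok(ColumnExpr::Resolved { index }),
                (None, _) => Err(format!("column not found: {name}")),
                (Some(_), Some(_)) => Err(format!("ambiguous column: {name}")),
            }
        }
    }
}
```

Once a column is bound to an index, later phases in this model no longer need the qualifier, which is why an alias node would stop mattering after binding.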

alamb

I found https://github.com/apache/arrow-datafusion/pull/4484/files?w=1 easier to review

The changes to the plans look good to me -- thank you @jackwener

@mingmwang do you have any concerns about merging this PR?

alamb · 2022-12-06T21:07:10Z

Thanks again @jackwener -- this is great stuff

alamb · 2022-12-07T11:39:16Z

I plan to merge this later today (in about 9 hours) unless I hear otherwise

mingmwang · 2022-12-07T12:05:52Z

> I found https://github.com/apache/arrow-datafusion/pull/4484/files?w=1 easier to review
>
> The changes to the plans look good to me -- thank you @jackwener
>
> @mingmwang do you have any concerns about merging this PR?

No, this PR is good to merge I think.

alamb · 2022-12-07T21:59:36Z

Oh no -- this PR again has conflicts 😢 @jackwener can you possibly rebase it one more time ?

jackwener · 2022-12-08T02:48:43Z

I have resolved the conflict.

alamb · 2022-12-08T12:17:21Z

Thanks again @jackwener -- sorry this one took so long to get in

github-actions bot added the core (Core datafusion crate), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and sql (SQL Planner) labels Dec 2, 2022

jackwener changed the title ~~fix Planner don't generate SubqueryAlias and generate replicated SubqueryAlias~~ fix Planner don't generate SubqueryAlias and generate duplicated SubqueryAlias Dec 5, 2022

jackwener marked this pull request as ready for review December 5, 2022 09:31

andygrove reviewed Dec 5, 2022

jackwener force-pushed the fix_planner branch from b5160b1 to ebdf17b Compare December 6, 2022 00:53

github-actions bot removed the optimizer (Optimizer rules) label Dec 6, 2022

jackwener mentioned this pull request Dec 6, 2022

Avoid adding redundant SubqueryAlias. #4412

Closed

jackwener force-pushed the fix_planner branch from abc1c64 to 6ae1ae7 Compare December 6, 2022 16:49

alamb approved these changes Dec 6, 2022

fix planner generate replicated subquery_alias.

ca39a16

jackwener force-pushed the fix_planner branch from 6ae1ae7 to ca39a16 Compare December 8, 2022 02:46

alamb merged commit 1a55d64 into apache:master Dec 8, 2022

jackwener deleted the fix_planner branch December 8, 2022 12:23

jackwener mentioned this pull request Dec 8, 2022

refactor code about query -> plan for subqueries #4559

Merged
