Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix Planner don't generate SubqueryAlias and generate duplicated SubqueryAlias #4484

Merged
merged 1 commit into from
Dec 8, 2022

Conversation

jackwener
Copy link
Member

@jackwener jackwener commented Dec 2, 2022

Which issue does this PR close?

Closes #4483
Closes #4454
Closes #4481

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core datafusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner labels Dec 2, 2022
Sort: custdist DESC NULLS FIRST, c_orders.c_count DESC NULLS FIRST
Projection: c_orders.c_count, COUNT(UInt8(1)) AS custdist
Aggregate: groupBy=[[c_orders.c_count]], aggr=[[COUNT(UInt8(1))]]
SubqueryAlias: c_orders
Copy link
Member Author

@jackwener jackwener Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

below already existed SubqueryAlias: c_orders

Comment on lines 859 to 929
(Some(cte_plan), _) => match table_alias {
Some(cte_alias) => {
Ok(with_alias(cte_plan.clone(), cte_alias))
}
_ => Ok(cte_plan.clone()),
},
(Some(cte_plan), _) => Ok(cte_plan.clone()),
(_, Ok(provider)) => {
let scan =
LogicalPlanBuilder::scan(&table_name, provider, None);
let scan = match table_alias.as_ref() {
Some(ref name) => scan?.alias(name.to_owned().as_str()),
_ => scan,
};
scan?.build()
Copy link
Member Author

@jackwener jackwener Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must remove these code, because they will add SubqueryAlias.

But following code will be added once again, it will cause duplicated SubqueryAlias

Comment on lines 889 to 950

let plan = match normalized_alias {
Some(alias) => with_alias(logical_plan, alias),
_ => logical_plan,
};
(plan, alias)
Copy link
Member Author

@jackwener jackwener Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must remove these code, because they will add SubqueryAlias.
But following code will be added once again, it will cause duplicated SubqueryAlias

Comment on lines -924 to -986
let columns_alias = alias.clone().columns;
if columns_alias.is_empty() {
// sqlparser-rs encodes AS t as an empty list of column alias
Copy link
Member Author

@jackwener jackwener Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original code is bug (add twice) + bug (when columns-alias is empty, forget to add alias). let me have explain.
The process is as follows

  • Add table-alias (those code that are removed above)
  • Add table-alias and columns-alias (these code here)
    • bug: when columns-alias is empty, these code will ignore table-alias

So

  • bug (add twice) will cause duplicated subquery_alias
  • bug (when columns-alias, forget to add alias) will cause missing subquery_alias

so we just unify them into

  • Add table-alias and columns-alias

" EmptyRelation []",
" Projection: Int64(7) AS a, Utf8(\"bb\") AS b [a:Int64, b:Utf8]",
" EmptyRelation []",
" SubqueryAlias: _sample_data [a:Int64, b:Utf8]",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing SubqueryAlias

@jackwener jackwener changed the title fix Planner don't generate SubqueryAlias and generate replicated SubqueryAlias fix Planner don't generate SubqueryAlias and generate duplicated SubqueryAlias Dec 5, 2022
@jackwener jackwener marked this pull request as ready for review December 5, 2022 09:31
@jackwener
Copy link
Member Author

Comment on lines 972 to 973
pub fn subquery_alias_owned(plan: LogicalPlan, alias: &str) -> Result<LogicalPlan> {
let schema: Schema = plan.schema().as_ref().clone().into();
let schema = DFSchemaRef::new(DFSchema::try_from_qualified_schema(alias, &schema)?);
Ok(LogicalPlan::SubqueryAlias(SubqueryAlias {
input: Arc::new(plan),
alias,
schema: Arc::new(schema),
})
alias: alias.to_string(),
schema,
}))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this logic into SubqueryAlias::try_new instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has added it.

@mingmwang
Copy link
Contributor

I still think the SubqueryAlias should be removed at the early stage of the logical optimizer.
I just run the same SQL on SparkSQL and find the SubqueryAlias was totally removed.

Spark SQL Output:

+----------------------------------------------------+
|                        plan                        |
+----------------------------------------------------+
| == Parsed Logical Plan ==
CTE [_sample_data, _data2]
:  :- SubqueryAlias _sample_data
:  :  +- Union
:  :     :- Union
:  :     :  :- Union
:  :     :  :  :- Project [1 AS a#5683777, aa AS b#5683778]
:  :     :  :  :  +- OneRowRelation
:  :     :  :  +- Project [3 AS a#5683779, aa AS b#5683780]
:  :     :  :     +- OneRowRelation
:  :     :  +- Project [5 AS a#5683781, bb AS b#5683782]
:  :     :     +- OneRowRelation
:  :     +- Project [7 AS a#5683783, bb AS b#5683784]
:  :        +- OneRowRelation
:  +- 'SubqueryAlias _data2
:     +- 'Project ['row_number() windowspecdefinition('s.b, 's.a ASC NULLS FIRST, unspecifiedframe$()) AS seq#5683785, 's.a, 's.b]
:        +- 'SubqueryAlias s
:           +- 'UnresolvedRelation [_sample_data]
+- 'Sort ['d.b ASC NULLS FIRST], true
   +- 'Aggregate ['d.b], ['d.b, 'MAX('d.a) AS max_a#5683776]
      +- 'SubqueryAlias d
         +- 'UnresolvedRelation [_data2]

== Analyzed Logical Plan ==
b: string, max_a: int
Project [b#5683778, max_a#5683776]
+- Sort [b#5683778 ASC NULLS FIRST], true
   +- Aggregate [b#5683778], [b#5683778, max(a#5683777) AS max_a#5683776]
      +- SubqueryAlias d
         +- SubqueryAlias _data2
            +- Project [seq#5683785, a#5683777, b#5683778]
               +- Project [a#5683777, b#5683778, seq#5683785, seq#5683785]
                  +- Window [row_number() windowspecdefinition(b#5683778, a#5683777 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS seq#5683785], [b#5683778], [a#5683777 ASC NULLS FIRST]
                     +- Project [a#5683777, b#5683778]
                        +- SubqueryAlias s
                           +- SubqueryAlias _sample_data
                              +- Union
                                 :- Union
                                 :  :- Union
                                 :  :  :- Project [1 AS a#5683777, aa AS b#5683778]
                                 :  :  :  +- OneRowRelation
                                 :  :  +- Project [3 AS a#5683779, aa AS b#5683780]
                                 :  :     +- OneRowRelation
                                 :  +- Project [5 AS a#5683781, bb AS b#5683782]
                                 :     +- OneRowRelation
                                 +- Project [7 AS a#5683783, bb AS b#5683784]
                                    +- OneRowRelation

== Optimized Logical Plan ==
Sort [b#5683778 ASC NULLS FIRST], true
+- Aggregate [b#5683778], [b#5683778, max(a#5683777) AS max_a#5683776]
   +- Union
      :- Project [1 AS a#5683777, aa AS b#5683778]
      :  +- OneRowRelation
      :- Project [3 AS a#5683779, aa AS b#5683780]
      :  +- OneRowRelation
      :- Project [5 AS a#5683781, bb AS b#5683782]
      :  +- OneRowRelation
      +- Project [7 AS a#5683783, bb AS b#5683784]
         +- OneRowRelation


@mingmwang
Copy link
Contributor

Test SQL:

WITH _sample_data AS (
                SELECT 1 as a, 'aa' AS b
                UNION ALL
                SELECT 3 as a, 'aa' AS b
                UNION ALL
                SELECT 5 as a, 'bb' AS b
                UNION ALL
                SELECT 7 as a, 'bb' AS b
            ), _data2 AS (
                SELECT
                row_number() OVER (PARTITION BY s.b ORDER BY s.a) AS seq,
                s.a,
                s.b
                FROM _sample_data s
            )
            SELECT d.b, MAX(d.a) AS max_a
            FROM _data2 d
            GROUP BY d.b
            ORDER BY d.b;


@jackwener
Copy link
Member Author

jackwener commented Dec 6, 2022

Current this PR is just for fix bug and don't include any optimization.
I will do this optimization #4412, your comment is helpful for it, Thanks @mingmwang !👍

we can see Spark exist analyzed plan

      +- SubqueryAlias d
         +- SubqueryAlias _data2

But SubqueryAlias _data2 in datafusion will be missing because of this BUG

@mingmwang
Copy link
Contributor

To remove the SubqueryAlias from the logical plan tree, I think there are two approaches:

  1. Add a rule and unnest the inner child within SubqueryAlias, update the child schema's qualifier name. Need to add a method to logical operator types to update schema's qualifier name.

  2. Modify the logical Column expr, instead of depending on qualifier/relation name, have it depend on the index, just like the physical Column expr. Another approach is to define two types of Column exprs : UnResolvedColumn and ResolvedColumn, UnResolvedColumn behaves like the current Column expr and depends on the qualifier/relation name, ResolvedColumn behaves like the current physical Column expr and depends on index.

@jackwener
Copy link
Member Author

jackwener commented Dec 6, 2022

  1. Modify the logical Column expr, instead of depending on qualifier/relation name, have it depend on the index, just like the physical Column expr. Another approach is to define two types of Column exprs : UnResolvedColumn and ResolvedColumn, UnResolvedColumn behaves like the current Column expr and depends on the qualifier/relation name, ResolvedColumn behaves like the current physical Column expr and depends on index.

If we want to do this, I think we may need to add binder/analyzer for datafusion. cc @liukun4515

DuckDB binder do ParsedExpression -> BoundExpression
Spark analyzer do Unresolved -> Resolved

This job will be complicated.

I will try to use the solution of 1 in #4412

@mingmwang
Copy link
Contributor

  1. Modify the logical Column expr, instead of depending on qualifier/relation name, have it depend on the index, just like the physical Column expr. Another approach is to define two types of Column exprs : UnResolvedColumn and ResolvedColumn, UnResolvedColumn behaves like the current Column expr and depends on the qualifier/relation name, ResolvedColumn behaves like the current physical Column expr and depends on index.

If we want to do this, I think we may need to add binder/analyzer for datafusion. cc @liukun4515

DuckDB binder do ParsedExpression -> BoundExpression Spark analyzer do Unresolved -> Resolved

This job will be complicated.

I will try to use the solution of 1 in #4412

Just curious, does Apache Doris have an explicit binder/analyzer phase ?

@jackwener
Copy link
Member Author

jackwener commented Dec 6, 2022

Just curious, does Apache Doris have an explicit binder/analyzer phase ?

Yes, Doris has analyzer in here
It's a explicit phase and as a Job in Cascade Optimizer, and there are some analyze rule to do this job, like bindRelation, code.
It will bind UnboundXXX with catalog and convert them to BoundXXXX

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found https://github.com/apache/arrow-datafusion/pull/4484/files?w=1 easier to review

The changes to the plans look good to me -- thank you @jackwener

@mingmwang do you have any concerns about merging this PR?

@alamb
Copy link
Contributor

alamb commented Dec 6, 2022

Thanks again @jackwener -- this is great stuff

@alamb
Copy link
Contributor

alamb commented Dec 7, 2022

I plan to merge this later today (in about 9 hours) unless I hear otherwise

@mingmwang
Copy link
Contributor

I found https://github.com/apache/arrow-datafusion/pull/4484/files?w=1 easier to review

The changes to the plans look good to me -- thank you @jackwener

@mingmwang do you have any concerns about merging this PR?

No, this PR is good to merge I think.

@alamb
Copy link
Contributor

alamb commented Dec 7, 2022

Oh no -- this PR again has conflicts 😢 @jackwener can you possibly rebase it one more time ?

@jackwener
Copy link
Member Author

has resolved conflict.

@alamb alamb merged commit 1a55d64 into apache:master Dec 8, 2022
@alamb
Copy link
Contributor

alamb commented Dec 8, 2022

Thanks again @jackwener -- sorry this one took so long to get in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
4 participants