Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push join into TableScan through Project #21607

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

SemionPar
Copy link
Contributor

@SemionPar SemionPar commented Apr 18, 2024

Description

Add a ruleset to push join into TableScan through Project.

Draft todos:

  • decide whether to merge with PushJoinIntoTableScan - extend PushJoinIntoTableScan with new behavior
  • Result correctness - PostgreSQL reference methods: account for collation - UPPER/LOWER COLLATE C as well as other standard collations are not producing results consistent with Trino:

PostgreSQL:

SELECT UPPER('ślad');

C_Collation        PL_Collation
--------------------------------------
śLAD                  ŚLAD

Trino:

SELECT UPPER('ślad');

--------------------------------------
ŚLAD

Therefore the test use REVERSE as a reference function call which does not exhibit inconsistent behavior.

Additional context and related issues

Why push though projection is needed?
To enable pushdown of joins with function calls in join conditions (UPPER/LOWER/CASTS etc.)

Why projection is not consumed by connector?
It is currently unsupported for JDBC connectors. There was an attempt #19740, but it didn't make it due to reasons (edge cases, fact that it needs JDBC type to be derived from Trino type when mapping).

What if there is filter below projection?
This rule will not work.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Apr 18, 2024
@SemionPar SemionPar force-pushed the semionpar/push-join-through-project branch from 928fc95 to 465170a Compare April 24, 2024 14:16
@ssheikin
Copy link
Contributor

Draft todos:
PostgreSQL reference methods: account for collation

Another function can be selected for tests which does not imply such complications.

@ssheikin
Copy link
Contributor

@lukasz-stec @raunaqmorarka please review.

@SemionPar SemionPar force-pushed the semionpar/push-join-through-project branch from 465170a to 173acd3 Compare May 13, 2024 12:42
@SemionPar SemionPar marked this pull request as ready for review May 13, 2024 12:42
@SemionPar SemionPar changed the title [wip] Push join into TableScan through Project Push join into TableScan through Project May 13, 2024
@SemionPar SemionPar force-pushed the semionpar/push-join-through-project branch 2 times, most recently from 60fa72c to 5406688 Compare May 14, 2024 08:22
@@ -309,6 +309,8 @@ public PostgreSqlClient(
.add(new RewriteIn())
.withTypeClass("integer_type", ImmutableSet.of("tinyint", "smallint", "integer", "bigint"))
.withTypeClass("numeric_type", ImmutableSet.of("tinyint", "smallint", "integer", "bigint", "decimal", "real", "double"))
.withTypeClass("string_type", ImmutableSet.of("char", "varchar"))
.map("reverse(value: string_type)").to("REVERSE(value)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it work on columns which are mapped to varchar i.e money being mapped to varchar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not have sense to use it for money :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serious answer is that we cannot control it.
Also, in most cases it does not corrupt data - it has an immediate trino-runtime, remote-compile-time exception.
The best idea I came so far is to cover it with join.pushdown.disabled-functions=reverse,cast config/session property.
Also it's a wider issue - it does not relate to the reverse(value) expression itself, but to the external scope as well, when different type of return is expected. e.g. reverse(value) = value
#21939 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, in most cases it does not corrupt data - it has an immediate trino-runtime, remote-compile-time exception.

Mostly yes - I was thinking on re-writing RewriteVaraible rule which rewrites the column which has a native mapping and the column which supports pushdown operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some JDBC connector like PostgreSQL - we try to map unknown column type or special column type like money as varchar - so in this case we should somehow restrict the expression rewrite - I think it can be fixed by avoid rewriting the variables which are synthetically mapped.

It's tricky, as information about initial remote type is already lost, at the moment we try to rewrite for pushdown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RewriteVaraible

Thank you for pointing out, will take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will be happening in this case as in all other cases, take this predicate pushdown, for example:

TestTable table = new TestTable(
                onRemoteDatabase(),
                "test_money_functions",
                "(col money, col2 varchar)",
                List.of("8, 'test'", "9, 'test'");

"SELECT col FROM " + table.getName() + " WHERE col LIKE '%$8%';

Connector attempts to push down LIKE on money and fails:

Caused by: org.postgresql.util.PSQLException: ERROR: operator does not exist: money ~~ character varying
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.
Position: 74

@martint
Copy link
Member

martint commented May 16, 2024

Can you elaborate on this point?

It is currently unsupported for JDBC connectors. There was an attempt #19740, but it didn't make it due to reasons (edge cases, fact that it needs JDBC type to be derived from Trino type when mapping).

What are the edge cases? What's the issue with deriving a JDBC type?

@Praveen2112
Copy link
Member

Wrt to type derivation we have two issues

  • In some JDBC connector like PostgreSQL - we try to map unknown column type or special column type like money as varchar - so in this case we should somehow restrict the expression rewrite - I think it can be fixed by avoid rewriting the variables which are synthetically mapped.
  • In case of projection pushdown - we need to ensure that type returned by the native PG function matches with Trino's type to ensure the subsequent filter operation is accurate

@martint
Copy link
Member

martint commented May 16, 2024

Instead of a dedicated rule to pull a projection above join before pushing down the join into the connector, we should consider adding a generic rule for doing that optimization regardless of where it occurs in the plan.

@Praveen2112
Copy link
Member

But pushing join before projection as a generic rule might affect the overall plan right ? What if its a cross join or if its estimate result in more number of rows than probe or build side ?

@martint
Copy link
Member

martint commented May 16, 2024

We could make it a cost-aware decision. In practice, joins that explode data (in particular, cross joins that produce significant amount of data) are not that common, though.

@SemionPar
Copy link
Contributor Author

Few thoughts on idea proposed by @martint:

Emphasis: regardless of where it occurs in the plan

Instead of PushJoinIntoTableScan we could have PullProjectionAboveNode (not an established name, existing code use PushXXXIntoYYY convention).
What types of Node would qualify for this operation?
In our concrete Join case we have this incoming fragment:

Join
 \
  | Project
    \
     | TableScan
  | TableScan

Should we consider only pulling Project up when directly over TableScan:

TOP_XXX
 \
  | Project
    \
     | TableScan

Or in any position?

TOP_XXX
 \
  | Project
    \
     | BOTTOM_YYY
       \
        | TableScan

Emphasis: before pushing down the join into the connector

This generic rule would not attempt to call Metadata.applyJoin (pushdown attempt), but just pull Poject above Join, if possible?
But how can Project node be pulled up without pushing the Join? Consider this case (one of tested queries) - project node sole purpose here is to transform left side of join criteria, so it must be performed before joining data:

SELECT c.custkey, o.totalprice FROM customer c JOIN orders o ON REVERSE(c.phone) = o.clerk;
Output[columnNames = [custkey, totalprice]]
│   Layout: [custkey:bigint, totalprice:double]
└─ InnerJoin[criteria = (expr = clerk), distribution = PARTITIONED]
   │   Layout: [custkey:bigint, totalprice:double]
   │   Distribution: PARTITIONED
   ├─ ScanProject[table = postgresql:tpch.customer tpch.customer columns=[custkey:bigint:int8, phone:varchar(15):varchar]]
   │      Layout: [custkey:bigint, expr:varchar(15)]
   │      expr := reverse(phone)
   │      custkey := custkey:bigint:int8
   │      phone := phone:varchar(15):varchar
   └─ TableScan[table = postgresql:tpch.orders tpch.orders columns=[totalprice:double:float8, clerk:varchar(15):varchar]]
          Layout: [totalprice:double, clerk:varchar(15)]
          clerk := clerk:varchar(15):varchar
          totalprice := totalprice:double:float8

Is my reasoning correct here?

@SemionPar SemionPar force-pushed the semionpar/push-join-through-project branch from 5406688 to 9f57619 Compare June 14, 2024 17:10
@SemionPar SemionPar force-pushed the semionpar/push-join-through-project branch 2 times, most recently from cb68b27 to 1349edb Compare June 18, 2024 14:58
Extend rule to support join pushdown in cases where join conditions
contain Calls causing Project nodes to appear over scans.

Fix failing tests
@SemionPar SemionPar force-pushed the semionpar/push-join-through-project branch from 1349edb to 9904e6f Compare June 18, 2024 15:40
@martint
Copy link
Member

martint commented Jul 1, 2024

we could have PullProjectionAboveNode

Rather, it would be PushXXXBelowProject, and it should generally be done:

  • for any operation that reduces the amount of data that the projection would evaluate
  • for projections that don't prune the number of input columns

But how can Project node be pulled up without pushing the Join

It can't. But that's a case where pushing down the join into the connector first requires pushing down the projection. That should be handled normally by an invocation of applyProject() followed by applyJoin().

@SemionPar
Copy link
Contributor Author

The problem I was trying to solve in this PR (pushing Join through Project into TableScan to enable join pushdown with function calls in the join condition) is solved by #22203 (amazing work @Praveen2112!). With this new projection pushdown framework, all the test cases I added for PostgreSQL joins with REVERSE are fully pushed down in the intended way (PushProjectionIntoTableScan -> PushJoinIntoTableScan).

@SemionPar
Copy link
Contributor Author

Does it still makes sense to continue with PushXXXBelowProject approach, does it bring value otherwise?

Rather, it would be PushXXXBelowProject, and it should generally be done:

  • for any operation that reduces the amount of data that the projection would evaluate
  • for projections that don't prune the number of input columns.

These points seem relevant. I can repurpose this PR or create a new one, I would appreciate some help with crafting a new problem statement though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

4 participants