Disable JDBC join pushdown #22442

Closed
wants to merge 1 commit into from
Conversation


@Dith3r Dith3r commented Jun 19, 2024

Description

Disable JDBC join pushdown as it could degrade performance.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@@ -31,7 +31,5 @@ protected void setup(Binder binder)
{
configBinder(binder).bindConfig(JdbcJoinPushdownConfig.class);
bindSessionPropertiesProvider(binder, JdbcJoinPushdownSessionProperties.class);

configBinder(binder).bindConfigDefaults(JdbcMetadataConfig.class, config -> config.setJoinPushdownEnabled(true));
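
For context, a deployment that hits regressions can already turn this off per catalog without a code change, assuming the standard `join-pushdown.enabled` catalog property bound by `JdbcMetadataConfig` (there is also a matching `join_pushdown_enabled` session property); this is a sketch, not part of this PR:

```properties
# etc/catalog/postgresql.properties (example catalog)
join-pushdown.enabled=false
```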
Member

JdbcJoinPushdownSupportModule is used only by connectors which support cost-based join pushdown.

Member Author

In JdbcMetadataConfig, your comment says:

Join pushdown is disabled by default as this is the safer option.

And yet this module enables it by default.

Member

I'm assuming that meant that it was unsafe to enable in all cases but is safe if we do it based on the cost based estimate.
I think the cost based estimate doesn't account for reduction in probe output due to dynamic filtering. It might be possible to estimate this based on NDV of the columns being joined. @Dith3r maybe we can try building that into the cost estimation.
Otherwise, I don't think we should outright disable this, as it covers a wide variety of data sources and we can't prove that disabling pushdown is the best strategy overall.
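
The NDV-based estimate mentioned above is the textbook equi-join cardinality formula; a minimal sketch (class and method names are hypothetical, not Trino code):

```java
// Illustrative sketch of NDV-based join cardinality estimation.
// For an equi-join on a single key, the standard estimate is:
//   |probe JOIN build| ~= |probe| * |build| / max(NDV(probe.key), NDV(build.key))
public class JoinCardinality
{
    public static double estimateJoinOutputRows(
            double probeRows, double probeKeyNdv,
            double buildRows, double buildKeyNdv)
    {
        return probeRows * buildRows / Math.max(probeKeyNdv, buildKeyNdv);
    }

    public static void main(String[] args)
    {
        // Fact table (1M rows) joined to a dimension table (1K rows, unique key):
        // every fact row matches at most one dimension row, so the join does
        // not expand the probe side.
        System.out.println(estimateJoinOutputRows(1_000_000, 1_000, 1_000, 1_000)); // prints 1000000.0
    }
}
```

Comparing this estimate against the probe-side row count is one way to distinguish filtering joins from expanding ones before deciding to push down.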

Member
@sopel39 sopel39 Jun 19, 2024

I think the cost based estimate doesn't account for reduction in probe output due to dynamic filtering.

I think we need to consider different join types under assumption that we want to minimize data size fetched from connector

1). Filtering join (e.g. fact table joined with a dimension table), non-expanding

Filtering joins are the most common and should always be beneficial to push down. Even with perfect DF from Trino, the join output will always be smaller than the probe and build sides.

2). Expanding join (multiplies output rows for every probe input row), no filtering

Here DFs are irrelevant (no filtering). It might still be beneficial to push down depending on build side size. However, expanding joins are usually hard to estimate, so I would avoid pushing them.

3). Expanding join that also does some filtering

Maybe we could estimate the % of rows filtered and the expansion factor for the remaining rows? Generally, I would avoid pushdown of expanding joins because of the uncertainty of stats. Assuming DFs work, they should do a good job of filtering probe side data, so we should only get rows that get expanded. Pushing an expanding join to the connector might explode the data produced by the connector.

Member

I'm assuming that meant that it was unsafe to enable in all cases but is safe if we do it based on the cost based estimate.

yes

Member

What I'm essentially saying is that we probably could simplify the cost decision to expanding vs non-expanding join.

I believe that's the current logic as implemented.

// Normally we would put 1.0 as a default value here to only allow joins which do not expand data to be pushed down.
// We use 1.25 to adjust for the fact that NDV estimations sometimes are off and joins which should be pushed down are not.
private double joinPushdownAutomaticMaxJoinToTablesRatio = 1.25;
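
As a rough illustration of that gate (names are hypothetical, not the actual Trino implementation), the decision reduces to comparing the estimated join output against the combined input sizes:

```java
// Illustrative sketch of the join-to-tables ratio gate described above.
public class JoinPushdownHeuristic
{
    // Mirrors the 1.25 default quoted above: allow pushdown only when the
    // estimated join output is not much larger than the combined inputs.
    // Values above 1.0 tolerate some NDV estimation error.
    private static final double MAX_JOIN_TO_TABLES_RATIO = 1.25;

    public static boolean shouldPushDown(double joinOutputRows, double probeRows, double buildRows)
    {
        double ratio = joinOutputRows / (probeRows + buildRows);
        return ratio <= MAX_JOIN_TO_TABLES_RATIO;
    }
}
```

A filtering join (output smaller than the inputs) passes the gate; an expanding join (output a large multiple of the inputs) is rejected.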

However, by not-pushing join we might prevent further pushdowns

true

Hence it's not entirely the connector's decision whether to do the pushdown or not

that would require exploratory optimizer

Member

that would require exploratory optimizer

yes, but not a full exploratory optimizer; it would be sufficient to have a bunch of alternatives in the source stage cc. @assaf2

Member Author

For SQL databases, we should only consider pushdown if a covering index is present, and for Postgres only if none of the joined tables is partitioned, as that case underperforms.

Member

For SQL databases, we should only consider pushdown if a covering index is present, and for Postgres only if none of the joined tables is partitioned, as that case underperforms.

I think it's something we should put into the documentation (cc @mosabua): users should add indexes on the columns they plan to use in joins

Member

If this is indeed the case, then first and foremost the join pushdown code should check for necessary indexes being present.

@sopel39
Member

sopel39 commented Jun 19, 2024

I think we should investigate more why performance is degraded

@mosabua
Member

mosabua commented Jun 19, 2024

I think removing this because it could degrade performance is not correct. It could (and does) also improve performance in many cases, and as such it should stay on in the base connector. Individual connectors can then potentially override it: always disable it, change the default to disabled or enabled, or provide a user-configurable option.

Once we know when and why performance is degraded, we can make a more informed decision on how to proceed. Do you have any specifics about the problems you are experiencing that caused you to send this PR @Dith3r ?

@Dith3r
Member Author

Dith3r commented Jun 20, 2024

PostgreSQL sf10, partitioned, with analyzed tables suffers a 3x performance degradation (in some rare queries, the degradation is 8-10x). With other PostgreSQL benchmarks, I've not seen any degradation/improvement with this option enabled. I will try to investigate it further.

@Dith3r Dith3r closed this Jun 27, 2024
@hashhar
Member

hashhar commented Jul 1, 2024

@Dith3r Have you checked if the Postgres connector provides stats for partitioned tables at all? IIRC we haven't implemented it with partitioned tables in mind so either the stats are entirely missing or incorrect.

Do you see the same issue with unpartitioned tables?

@Dith3r
Member Author

Dith3r commented Jul 1, 2024

Unpartitioned tables do not suffer from join pushdown in PostgreSQL. For partitioned tables the connector provides estimates, and they trigger a pushdown which is not handled well.
