Implement dynamic row filtering #22411

Merged
raunaqmorarka merged 9 commits into trinodb:master from the drf branch on Jul 11, 2024

Conversation

raunaqmorarka
Member

@raunaqmorarka raunaqmorarka commented Jun 18, 2024

Description

Dynamic row filtering performs fine-grained filtering of rows in the scan operator,
which can greatly improve the performance of queries with selective joins.
So far, dynamic filters have been pushed into connectors, which have used them for
partition, bucket, split and row-group/stripe pruning. This change adds evaluation of
dynamic filters in the engine on worker nodes, after the usual static filter (if any) has been
evaluated in ScanFilterProject.
Non-selective dynamic filters are automatically detected and removed from execution,
so that the overhead of executing these filters is low when they are not useful.
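
To make the "detect and remove non-selective filters" idea concrete, here is a minimal sketch in Java. It is not the code in this PR: the class `AdaptiveRowFilter`, its fields, and the sampling scheme (observe a fixed number of rows, then decide once) are assumptions made for illustration, as are the threshold and sample-size values used with it.

```java
import java.util.Arrays;
import java.util.function.LongPredicate;

// Hypothetical sketch only: names, thresholds and the sampling scheme are
// illustrative assumptions, not Trino's actual implementation.
final class AdaptiveRowFilter
{
    private final LongPredicate dynamicFilter;
    private final double selectivityThreshold; // assumed: keep the filter only if at most this fraction of rows pass
    private final long sampleRows;             // assumed: rows to observe before deciding
    private long rowsSeen;
    private long rowsPassed;
    private boolean decided;
    private boolean enabled = true;

    AdaptiveRowFilter(LongPredicate dynamicFilter, double selectivityThreshold, long sampleRows)
    {
        if (sampleRows <= 0) {
            throw new IllegalArgumentException("sampleRows must be positive");
        }
        this.dynamicFilter = dynamicFilter;
        this.selectivityThreshold = selectivityThreshold;
        this.sampleRows = sampleRows;
    }

    // Returns the positions in the batch that survive the dynamic filter.
    int[] filter(long[] keys)
    {
        if (!enabled) {
            // Filter was judged non-selective: pass every position through
            // without evaluating the predicate.
            int[] all = new int[keys.length];
            for (int i = 0; i < keys.length; i++) {
                all[i] = i;
            }
            return all;
        }
        int[] selected = new int[keys.length];
        int count = 0;
        for (int i = 0; i < keys.length; i++) {
            if (dynamicFilter.test(keys[i])) {
                selected[count++] = i;
            }
        }
        rowsSeen += keys.length;
        rowsPassed += count;
        if (!decided && rowsSeen >= sampleRows) {
            // Decide once, after enough rows have been observed.
            decided = true;
            enabled = (double) rowsPassed / rowsSeen <= selectivityThreshold;
        }
        return Arrays.copyOf(selected, count);
    }

    boolean isEnabled()
    {
        return enabled;
    }
}
```

In this sketch a filter that lets most rows through is switched off after the sample, so later batches skip predicate evaluation entirely; a selective filter stays on and keeps returning only the matching positions.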

Additional context and related issues

Fixes #13305

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Improve performance of queries with selective joins by performing fine-grained filtering of rows using dynamic filters.
  This optimization is enabled by default and can be disabled using the `enable-dynamic-row-filtering` configuration property or the `enable_dynamic_row_filtering` session property. ({issue}`22411`)

Member

@dain dain left a comment

I'm concerned about adding yet more system properties and config for this. I'm fine having a kill switch if the feature has problems, but I'd make them hidden, and generally we should remove them when we think the feature is working.

@raunaqmorarka
Member Author

I'm concerned about adding yet more system properties and config for this. I'm fine having a kill switch if the feature has problems, but I'd make them hidden, and generally we should remove them when we think the feature is working.

Since this is a new implementation, we need the properties to allow us to easily root-cause any potential issues. They also make it easy for us to write tests for the feature. The selectivity threshold is something that a user might legitimately want to tune for their workload. Making the properties hidden just hinders their usage; I would like to keep them as normal properties for now, as there isn't anything harmful about them.

So far dynamic filters have been pushed into connectors which have used
them to filter data at the level of granularity supported by them
(e.g. partition, bucket, file, split, row-group etc.).
This change adds evaluation of dynamic filters in the engine on worker nodes
after the usual static filter (if any) has been evaluated in ScanFilterProject.
Non-selective dynamic filters are automatically detected and removed from execution
so that the overhead of executing these filters is low when they are not useful.
BenchmarkInCodeGenerator columnarEvaluationEnabled
(hitRate)  (inListCount)  (type)  Mode  Cnt    Before Score     After Score  Units
      0.1              2  bigint  avgt   12   9.638 ± 0.265   9.138 ± 0.709  us/op
      0.1              4  bigint  avgt   12  10.549 ± 0.682   8.410 ± 0.060  us/op
      0.1             25  bigint  avgt   12  30.833 ± 4.390   8.967 ± 0.346  us/op
      0.1            100  bigint  avgt   12  33.023 ± 5.527   8.691 ± 0.328  us/op
      0.1           1000  bigint  avgt   12  34.606 ± 6.841   8.438 ± 0.097  us/op
      0.1          10000  bigint  avgt   12  32.668 ± 4.724   8.450 ± 0.121  us/op
@raunaqmorarka raunaqmorarka merged commit 73a5581 into trinodb:master Jul 11, 2024
102 checks passed
@raunaqmorarka raunaqmorarka deleted the drf branch July 11, 2024 04:08
@github-actions github-actions bot added this to the 452 milestone Jul 11, 2024
@yx-keith

Dynamic row filtering parquet unpartitioned sf1k.pdf

Screenshot 2024-06-28 at 7 32 14 AM (image)

[Dynamic row filtering parquet partitioned sf1k.pdf](https://github.com/user-attachments/files/16023339/Dynamic.row.filtering.parquet.partitioned.sf1k.pdf)

Screenshot 2024-06-28 at 7 48 55 AM (image)

how much tpcds data?

@raunaqmorarka
Member Author

how much tpcds data?

It's scale factor 1000 (1 TB)

Labels
cla-signed hive Hive connector iceberg Iceberg connector performance
Development

Successfully merging this pull request may close these issues.

Translate dynamic filter to compiled filter
6 participants