
[GLUTEN-6189][CORE] Shouldn't pushdown undeterministic filter to scan node #6191

Closed
wants to merge 3 commits

Conversation

WangGuangxin
Contributor

What changes were proposed in this pull request?

We shouldn't push down non-deterministic filters to the scan node. For example, if a filter with rand() < 0.5 is pushed down to the scan, it gets evaluated again in the filter, so only around 1/4 of the rows are returned instead of 1/2.

(Fixes: #6189)
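The effect described above can be reproduced with a small standalone simulation (plain Scala, illustrative only; the row count is an arbitrary choice and this is not Gluten code): when a non-deterministic predicate is evaluated independently in both the scan and the filter, a row must pass two independent draws to survive.

```scala
import scala.util.Random

// Simulate a predicate like rand() < 0.5 being evaluated twice:
// once in the scan node, and again, independently, in the filter node.
val rows = 100000
val survivors = (1 to rows).count { _ =>
  Random.nextDouble() < 0.5 && // first evaluation (scan)
  Random.nextDouble() < 0.5    // second, independent evaluation (filter)
}
// The survival rate is roughly 0.5 * 0.5 = 0.25, not the expected 0.5.
val ratio = survivors.toDouble / rows
```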

How was this patch tested?

UT

Run Gluten Clickhouse CI

@WangGuangxin
Contributor Author

cc @PHILO-HE @rui-mo @ulysses-you

Member

@zhztheplayer left a comment


Is there a root cause in Velox's code for this issue? It wouldn't make sense unless rand() produced a constant vector or was involved in a constant-folding procedure.

|""".stripMargin)
.collect()
.length
assert(resultLength >= 25000)
Member


Could we give resultLength an upper bound to improve the stability of the test? Say rand() yielded a constant 0.1: the test could still pass as a false positive.

Contributor Author

@WangGuangxin Jun 24, 2024


It's not a bug in Velox's rand. It's because rand() < 0.5 was evaluated twice, first in the ScanNode and then in the FilterNode, so the filtered result is 0.5 * 0.5.

Member


it was because rand() < 0.5 was evaluated twice

Where does it get evaluated twice? By design it should already land in either the subfield filters or the remaining filters. Am I missing something? cc @rui-mo

Or do you mean it's placed in both the subfield filters and the remaining filters?

Contributor


The design is that all filters are pushed down to Spark's Scan operator on the Scala side. In Velox's TableScanNode there are subfield filters and a remaining filter, which handle the two kinds of filters according to whether their push-down is supported. I'm not clear whether the duplication occurs on the Scala side between the Scan and Filter operators, or on the native side between the subfield filters and the remaining filter. @WangGuangxin Could you clarify?

Contributor Author


let me double check

Contributor Author


Take the UT as an example: rand() < 0.5 exists in both

  1. getPushedFilter

    def getPushedFilter(dataFilters: Seq[Expression]): Seq[Expression] = {

    which is pushed to the ScanNode (this has nothing to do with the logic of remainingFilter and subFieldFilter)

  2. and also FilterExecTransformer, which still has rand() < 0.5 in getRemainingCondition

    FilterHandler.getRemainingFilters(scanFilters, splitConjunctivePredicates(condition))

Member


Take the UT as an example: rand() < 0.5 exists in both

  1. getPushedFilter

    def getPushedFilter(dataFilters: Seq[Expression]): Seq[Expression] = {

    which is pushed to the ScanNode (this has nothing to do with the logic of remainingFilter and subFieldFilter)

  2. and also FilterExecTransformer, which still has rand() < 0.5 in getRemainingCondition

    FilterHandler.getRemainingFilters(scanFilters, splitConjunctivePredicates(condition))

It looks like a bug. Looking into the code, I believe the author didn't want a filter expression to appear in the output of FilterHandler.getRemainingFilters once it had been pushed into the scan. So perhaps we need a bugfix for that.

Member


However, I won't be against this change if it doesn't bring performance issues, given that it's a more robust way to make sure non-deterministic filter expressions are evaluated only once from the JVM side.

But I think we still need to find out the reason for the bug.

Member

@zhztheplayer Jun 24, 2024


if it doesn't bring performance issues.

Though that's probably hardly true in the Velox backend if the backend does support pushing non-deterministic exprs down to the file reader to shorten read time. So my assumption could be too ideal; I am not sure.

Contributor

@rui-mo left a comment


rand() < 0.5 was evaluated twice, first in the ScanNode, then in the FilterNode

If a filter exists in the scan node, we do not add it to the filter node again; getRemainingFilters is for that purpose. Is there any bug in it? Thanks.

Run Gluten Clickhouse CI

@WangGuangxin
Contributor Author

rand() < 0.5 was evaluated twice, first in the ScanNode, then in the FilterNode

If a filter exists in the scan node, we do not add it to the filter node again; getRemainingFilters is for that purpose. Is there any bug in it? Thanks.

@rui-mo I think that assumption is broken, since rand() < 0.5 doesn't exist in FileSourceScanExec's dataFilters: Spark does not push non-deterministic filters down to the scan. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala#L291

@rui-mo
Contributor

rui-mo commented Jun 24, 2024

@WangGuangxin The purpose of postProcessPushDownFilter is to push all the filters down to Scan, even those that only the Filter operator contains. After that, Filter should only process those that do not exist in Scan. It looks like a bug if both Scan and Filter contain the same expression.
https://github.com/apache/incubator-gluten/blob/main/backends-velox/src/main/scala/org/apache/gluten/execution/FilterExecTransformer.scala#L38-L40

Run Gluten Clickhouse CI

.filterNot(_.references.exists {
attr => SparkShimLoader.getSparkShims.isRowIndexMetadataColumn(attr.name)
})
.filter(_.deterministic)
Contributor


Vanilla Spark does not push down non-deterministic filter expressions. But since Gluten has already supported it, can we add a new config to control whether to push down non-deterministic filter expressions?
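A config gate along these lines could look roughly like the sketch below. Everything here is hypothetical: the config key name is illustrative, not an existing Gluten option, and the `Expression` trait is a stand-in for Spark's catalyst `Expression`.

```scala
// Stand-in for org.apache.spark.sql.catalyst.expressions.Expression.
trait Expression { def deterministic: Boolean }

// Hypothetical config key; the name is illustrative only.
val pushdownNonDeterministicKey =
  "spark.gluten.sql.columnar.pushdownNonDeterministicFilters"

// Keep non-deterministic filters out of the push-down set unless the
// (hypothetical) flag opts in; the default matches vanilla Spark.
def pushableFilters(
    dataFilters: Seq[Expression],
    pushNonDeterministic: Boolean): Seq[Expression] =
  if (pushNonDeterministic) dataFilters
  else dataFilters.filter(_.deterministic)
```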

@zml1206
Contributor

zml1206 commented Jun 26, 2024

This is a bug in getRemainingFilters: ExpressionSet can only remove deterministic expressions. Changing

def getRemainingFilters(scanFilters: Seq[Expression], filters: Seq[Expression]): Seq[Expression] =
    (ExpressionSet(filters) -- ExpressionSet(scanFilters)).toSeq

to

def getRemainingFilters(scanFilters: Seq[Expression], filters: Seq[Expression]): Seq[Expression] =
    (filters.toSet -- scanFilters.toSet).toSeq

seems to work, provided the references of the pushed-down filters have not changed.
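The ExpressionSet behavior described here can be illustrated with a self-contained model. This is a sketch only: `MiniExpressionSet` mimics the relevant property of Spark's catalyst ExpressionSet, whose removal is a no-op for non-deterministic expressions; it is not the real class.

```scala
// Minimal stand-in for a catalyst expression.
case class Expr(sql: String, deterministic: Boolean)

final case class MiniExpressionSet(exprs: Seq[Expr]) {
  // Like catalyst's ExpressionSet: set difference never removes
  // non-deterministic expressions, only deterministic ones.
  def --(other: MiniExpressionSet): MiniExpressionSet =
    MiniExpressionSet(exprs.filter(e => !e.deterministic || !other.exprs.contains(e)))
}

val rand = Expr("rand() < 0.5", deterministic = false)
val det  = Expr("a > 1",        deterministic = true)

// Both filters were pushed to the scan, yet only the deterministic one
// is subtracted: rand() < 0.5 survives and would be evaluated again
// in the Filter operator.
val remaining = MiniExpressionSet(Seq(rand, det)) -- MiniExpressionSet(Seq(rand, det))
```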

@zml1206
Contributor

zml1206 commented Jul 1, 2024

I submitted a PR to resolve it. #6296

@FelixYBW
Contributor

FelixYBW commented Jul 1, 2024

Is this an error in Velox or in Gluten? Velox's table scan supports non-deterministic filter pushdown, so pushing down to Velox shouldn't itself be an issue.

@rui-mo I remember that initially we had to distinguish the subfield filter and the remaining filter in oap/velox. Has that code moved to Gluten?

@rui-mo
Contributor

rui-mo commented Jul 2, 2024

@FelixYBW This bug is in Gluten's Scala code; see #6296.

I remember initially we have to distinguish subfiled filter and remaining filter in oap/velox.

That code is in Gluten's cpp and takes effect during the conversion from Substrait to Velox. The bug mentioned in this PR is not related to it.
