Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VL][Core] Add InputFileReplaceFallback Rule #6139

Conversation

gaoyangxiaozhu
Copy link
Contributor

@gaoyangxiaozhu gaoyangxiaozhu commented Jun 19, 2024

What changes were proposed in this pull request?

The is a following up PR for input_file_name native support, it

  1. Add a inputfileNameReplaceFallbackRule to handle the case if scan fallback due to complex data type
  2. Add a feature flag to control if enable the rules in case any corner issues happen we can disable the rule

(Fixes: facebookincubator/velox#9957)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@gaoyangxiaozhu gaoyangxiaozhu changed the title [VL][Core] Add InputFile [VL][Core] Add InputFileReplaceFallback Rule to handle scan fallback cases Jun 19, 2024
Copy link

Run Gluten Clickhouse CI

@gaoyangxiaozhu gaoyangxiaozhu changed the title [VL][Core] Add InputFileReplaceFallback Rule to handle scan fallback cases [VL][Core] Add InputFileReplaceFallback Rule Jun 19, 2024
@zhztheplayer
Copy link
Member

zhztheplayer commented Jun 19, 2024

I think we'd follow a criteria when rewriting an expression / operator before entering actual offloading phase (e.g., using genExtendedColumnarValidationRules to do the replacement): to make it compatible with both native lib and vanilla Spark as long as possible and to avoid converting it back when the computation doesn't offload.

Can you take a look at some current rules that already follow the criteria, e.g., the ones to offload collet_list, collect_set, hll, bloom-filter, dpp subquery, to see if we can do the same on the expressions used by input_file_name ?

@zhztheplayer
Copy link
Member

zhztheplayer commented Jun 19, 2024

A basic question for #6021 : Why didn't we do the replacement during scan offloading?

If it's about the co-offloading for project and scan, what I can imagine is that we can add a individual offloading rule. E.g., OffloadInputFileAwareProjectAndScan or something to match against project + scan pattern then offload them together. Once not offload-able, the rule just does nothing. Thus we don't have to use another rule to do fallback.

Copy link

Run Gluten Clickhouse CI

@gaoyangxiaozhu
Copy link
Contributor Author

gaoyangxiaozhu commented Jun 19, 2024

I think we'd follow a criteria when rewriting an expression / operator before entering actual offloading phase (e.g., using genExtendedColumnarValidationRules to do the replacement): to make it compatible with both native lib and vanilla Spark as long as possible and to avoid converting it back when the computation doesn't offload.

Can you take a look at some current rules that already follow the criteria, e.g., the ones to offload collet_list, collect_set, hll, bloom-filter, dpp subquery, to see if we can do the same on the expressions used by input_file_name ?

@zhztheplayer , thank you! actually i have checked this part but didn't figure out a similar way for input_file_name support. The current implement for collect_list, collect_set etc is based on truth that velox already has a related function available only need some data type compatible change. But velox right now doesn't have a input_file_name available.

And you can check this discussion for why we pick current solution for input_file_name - facebookincubator/velox#9957

@gaoyangxiaozhu
Copy link
Contributor Author

gaoyangxiaozhu commented Jun 19, 2024

A basic question for #6021 : Why didn't we do the replacement during scan offloading?

If it's about the co-offloading for project and scan, what I can imagine is that we can add a individual offloading rule. E.g., OffloadInputFileAwareProjectAndScan or something to match against project + scan pattern then offload them together. Once not offload-able, the rule just does nothing. Thus we don't have to use another rule to do fallback.

A basic question for #6021 : Why didn't we do the replacement during scan offloading?

If it's about the co-offloading for project and scan, what I can imagine is that we can add a individual offloading rule. E.g., OffloadInputFileAwareProjectAndScan or something to match against project + scan pattern then offload them together. Once not offload-able, the rule just does nothing. Thus we don't have to use another rule to do fallback.

For this question - Why didn't we do the replacement during scan offloading?
hello, Hongze, because we only want to do the replacement action when needed as the related project node does has input_file_name expression otherwise do nothing.

And for this part - If it's about the co-offloading for project and scan, what I can imagine is that we can add a individual offloading rule
Actually, Hongze, I thought before if it is possible to leverage just one rule do the both thing replacement or nothing. But didn't figure out how to do that since one project may has multiple scan child nodes which some of them is fallback, some not, and honestly i don't make it clear about how to accurately obtain which projects + scans need replacement and which do not in just one rule.

So if you have a good idea to share, I'd like do a following PR after this PR to refactor this part to merge 2 rules together. :)

Copy link

Run Gluten Clickhouse CI

@zhztheplayer
Copy link
Member

I'd like do a following PR after this PR to refactor this part to merge 2 rules together. :)

Which works for me but let's turn off the option by default and mark it as experimental? Until we got a more robust design. Thanks.

@zhztheplayer
Copy link
Member

BTW I'll take a closer look at the code asap.

@gaoyangxiaozhu gaoyangxiaozhu force-pushed the gayangya/input_file_name_phase2 branch from 20c45d9 to 9e34524 Compare June 19, 2024 11:09
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

@gaoyangxiaozhu
Copy link
Contributor Author

I'd like do a following PR after this PR to refactor this part to merge 2 rules together. :)

Which works for me but let's turn off the option by default and mark it as experimental? Until we got a more robust design. Thanks.

thanks, done to mark it experimental and also turn off by default @FelixYBW FYI

@gaoyangxiaozhu
Copy link
Contributor Author

BTW I'll take a closer look at the code asap.

thanks, since without this change, user may encounter the failure issue if scan fallback with complex data type and project has input_file_name.

@gaoyangxiaozhu
Copy link
Contributor Author

gaoyangxiaozhu commented Jun 19, 2024

don't think the 3.4 fail related to my change and it is randomly, meanwhile the CH CI fail also looks a CI issue ? @zhouyuan / @PHILO-HE could you help double check ?
image

@PHILO-HE
Copy link
Contributor

Run Gluten Clickhouse CI

@PHILO-HE
Copy link
Contributor

@gaoyangxiaozhu, Spark 3.4 tests passed after re-triggered. Let's wait for CH CI.

@zhztheplayer
Copy link
Member

zhztheplayer commented Jun 20, 2024

@gaoyangxiaozhu I opened #6157 for further work to improve this. I am assign that issue to you. And I suggest to have a clear plan for the new implementation before merging this.

@gaoyangxiaozhu
Copy link
Contributor Author

gaoyangxiaozhu commented Jun 20, 2024

@gaoyangxiaozhu I opened #6157 for further work to improve this. I am assign that issue to you. And I suggest to have a clear plan for the new implementation before merging this.

@gaoyangxiaozhu I opened #6157 for further work to improve this. I am assign that issue to you. And I suggest to have a clear plan for the new implementation before merging this.

got! So I send seperate PR #6161 to turn off inputFileNameReplaceRule with feature flag, let make sure that PR merged in case any issue caused due to scan fallback.

@jinchengchenghh / @zhztheplayer / @FelixYBW please help review.

@lwz9103
Copy link
Contributor

lwz9103 commented Jun 20, 2024

Run Gluten Clickhouse CI

1 similar comment
@lwz9103
Copy link
Contributor

lwz9103 commented Jun 20, 2024

Run Gluten Clickhouse CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spark input_file_name design
4 participants