Spark input_file_name design #9957
Comments
@gaoyangxiaozhu Can we pass input_file_name as a literal to Velox for each split, since it is fixed per split?
@FelixYBW The path is already in the split. The problem is how to carry that information from the split into the function.
Oh, I see. Your option 2 is what I'm thinking of: add a project after the table scan to append a literal column to the scan result, and hide the input_file_name() implementation in Gluten completely. That way we can add similar function implementations in Gluten directly. @rui-mo Is there any potential issue with this?
@FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type
Makes sense. I will follow up with Rui and Yangyang on this. Thank you, @Yuhta.
@FelixYBW Looks like the second option is feasible in Gluten. Thanks.
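The idea discussed in the comments above, a path column that the scan fills in per split and a project then exposes as input_file_name, can be sketched with a toy model. This is only an illustration under stated assumptions: `Split`, `scan`, and `project_input_file_name` are hypothetical names, rows are plain dictionaries, and nothing here is Velox or Gluten API. The column name `$path` is taken from the issue text.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Name of the synthesized column; "$path" follows the issue text.
PATH_COLUMN = "$path"


@dataclass
class Split:
    """Hypothetical stand-in for a scan split: a file path plus its rows."""
    path: str
    rows: List[Dict[str, object]]


def scan(splits: List[Split]) -> Iterator[Dict[str, object]]:
    """Toy table scan that appends the current split's path to every row."""
    for split in splits:
        for row in split.rows:
            out = dict(row)
            # Constant within one split, but it changes from split to split.
            out[PATH_COLUMN] = split.path
            yield out


def project_input_file_name(rows: Iterator[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Toy project node: expose the path column under the function's name."""
    for row in rows:
        out = {k: v for k, v in row.items() if k != PATH_COLUMN}
        out["input_file_name"] = row[PATH_COLUMN]
        yield out


# One task reading two files: each row carries the path of its own split.
splits = [
    Split("/data/a.parquet", [{"id": 1}, {"id": 2}]),
    Split("/data/b.parquet", [{"id": 3}]),
]
result = list(project_input_file_name(scan(splits)))
```

Because the value travels with the data instead of living in per-thread state, it stays correct no matter which thread ends up evaluating the project. Any escaping of the path is left out of this sketch.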
One Spark task may read multiple files according to
@Yohahaha The
@rui-mo / @FelixYBW I'm just back from OOF. Do you agree with option 2 to use
Hey @Yuhta, quick question: do you have an example of how to
@gaoyangxiaozhu I think you need to do it in the planner, rewriting
I see, so it looks a little bit tricky, and we still need changes. I can imagine we may need to apply a similar planning strategy to other places with similar functions. @rui-mo / @FelixYBW, please double-check that this is an acceptable approach before I start on the code.
The Spark implementation of `input_file_name` uses a thread local to stash the file name and retrieve it from the function. The same method does not work in Velox because a driver can be taken off its thread and a different driver can be scheduled there by the time the function is called. There are 2 ways to do it in Velox:

1. Store the file name in `DriverCtx`. This imposes some challenge to hide the file-specific detail from the driver level, while we need to be able to set it in table scan and read it back in the function.
2. Add `$path` to the output type of table scan, then the function will just project that special field out and do the escaping. All the data types between table scan and filter project need to be changed in the Gluten plan.

CC: @gaoyangxiaozhu @mbasmanova
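The second option amounts to a plan rewrite, which a small sketch may make more concrete. Everything below is an assumption made for illustration: `FieldRef`, `Call`, `TableScan`, `Project`, `rewrite_expr`, and `rewrite_plan` are hypothetical toy types and helpers, not Gluten or Velox classes. The sketch replaces each zero-argument `input_file_name()` call with a reference to `$path` and widens the scan output to carry that column.

```python
from dataclasses import dataclass
from typing import Tuple, Union

PATH_COLUMN = "$path"  # synthesized column name, as written in the issue


@dataclass(frozen=True)
class FieldRef:
    """Reference to a column produced by the child node."""
    name: str


@dataclass(frozen=True)
class Call:
    """Function call with zero or more argument expressions."""
    function: str
    args: Tuple["Expr", ...] = ()


Expr = Union[FieldRef, Call]


@dataclass(frozen=True)
class TableScan:
    output: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    expressions: Tuple[Expr, ...]
    source: TableScan


def rewrite_expr(expr: Expr) -> Tuple[Expr, bool]:
    """Return the rewritten expression and whether it now reads the path column."""
    if isinstance(expr, Call):
        if expr.function == "input_file_name" and not expr.args:
            return FieldRef(PATH_COLUMN), True
        rewritten = [rewrite_expr(arg) for arg in expr.args]
        new_call = Call(expr.function, tuple(e for e, _ in rewritten))
        return new_call, any(used for _, used in rewritten)
    return expr, False


def rewrite_plan(plan: Project) -> Project:
    """Rewrite the project and widen the scan only when the function is used."""
    rewritten = [rewrite_expr(e) for e in plan.expressions]
    if not any(used for _, used in rewritten):
        return plan
    scan = plan.source
    if PATH_COLUMN not in scan.output:
        scan = TableScan(scan.output + (PATH_COLUMN,))
    return Project(tuple(e for e, _ in rewritten), scan)


# SELECT id, upper(input_file_name()) FROM t
plan = Project(
    (FieldRef("id"), Call("upper", (Call("input_file_name"),))),
    TableScan(("id",)),
)
new_plan = rewrite_plan(plan)
```

The toy plan places the project directly on top of the scan. In a real plan, every node between the two would also need its output type widened to pass `$path` through, which is the cost the issue text points out.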