Spark input_file_name design #9957

Open
Yuhta opened this issue May 28, 2024 · 15 comments · Fixed by apache/incubator-gluten#6021
Labels
enhancement New feature or request

Comments

@Yuhta
Contributor

Yuhta commented May 28, 2024

The Spark implementation of input_file_name uses a thread local to stash the file name and retrieve it from the function. The same method does not work in Velox, because a driver can be taken off its thread and a different driver can be scheduled on that thread by the time the function is called. There are two ways to do it in Velox:

  1. To mimic what Spark is doing, we need to store the information in DriverCtx. The challenge is hiding the file-specific details from the driver level while still being able to set the value in table scan and read it back in the function.
  2. To mimic what Presto is doing, Gluten can change the plan to add an extra field $path to the output type of table scan; the function then just projects that special field out and does the escaping. All the data types between table scan and filter project need to be changed in the Gluten plan.
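
A toy illustration of the scheduling problem described above, not Velox code. The `Driver` class, its `ctx` dictionary (a stand-in for per-driver state such as DriverCtx), and the file paths are all hypothetical. Two drivers take turns on the same thread: the thread-local stash ends up holding whichever file was scanned last, while per-driver state stays correct.

```python
import threading

# Thread-local stash, in the style of the Spark approach.
_tls = threading.local()


class Driver:
    """Hypothetical stand-in for a driver that scans one file."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.ctx = {}  # per-driver state, loosely analogous to DriverCtx

    def scan(self):
        # Record the current file in both places.
        _tls.file_name = self.file_name
        self.ctx["file_name"] = self.file_name

    def eval_from_thread_local(self):
        return _tls.file_name

    def eval_from_ctx(self):
        return self.ctx["file_name"]


a = Driver("file:/data/a.parquet")
b = Driver("file:/data/b.parquet")

a.scan()  # driver a runs its scan, then yields the thread
b.scan()  # driver b is scheduled on the same thread

# Driver a resumes and evaluates the function.
tls_result = a.eval_from_thread_local()  # sees b's file
ctx_result = a.eval_from_ctx()           # sees a's own file
```

This only models option 1 versus the thread-local approach; it says nothing about how DriverCtx would actually expose the value.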

CC: @gaoyangxiaozhu @mbasmanova

@mbasmanova
Contributor

CC: @FelixYBW @rui-mo

@FelixYBW
Contributor

@gaoyangxiaozhu Can we pass the input_file_name as a literal to Velox for each split, since it's fixed for each split?

@Yuhta
Contributor Author

Yuhta commented May 28, 2024

@FelixYBW The path is already in the split. The problem is how to carry the information from the split into the function.

@FelixYBW
Contributor

Oh, I see. Your option 2 is what I'm thinking: add a project after the table scan to append a literal column to the scan result, and hide the input_file_name() implementation in Gluten completely. That way we can add similar function implementations in Gluten directly. @rui-mo Is there any potential issue with this?

@Yuhta
Contributor Author

Yuhta commented May 28, 2024

@FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

@FelixYBW
Contributor

Makes sense. I will follow up with Rui and Yangyang on this. Thank you, @Yuhta.

@rui-mo
Collaborator

rui-mo commented May 29, 2024

@FelixYBW Looks like the second option is feasible in Gluten. Thanks.

@Yohahaha
Contributor

Yohahaha commented May 31, 2024

One Spark task may read multiple files, depending on spark.sql.files.maxPartitionBytes. In this design, which file will Gluten return for input_file_name?

@Yuhta
Contributor Author

Yuhta commented May 31, 2024

@Yohahaha The $path value is set from the information in the split, so it does not matter how many splits the task is reading.
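
A minimal sketch of that per-split behavior, assuming a hypothetical `scan` function and dictionary-based splits (neither is a Velox or Gluten API). One task reads two splits, and every row is tagged with the path of the split it came from.

```python
def scan(splits):
    """Yield rows from every split, tagging each row with its split's path."""
    for split in splits:
        for row in split["rows"]:
            # The synthesized $path column comes from the split, not the task.
            yield {**row, "$path": split["path"]}


# Hypothetical splits assigned to a single task.
splits = [
    {"path": "file:/warehouse/t/part-0.parquet", "rows": [{"c0": 1}, {"c0": 2}]},
    {"path": "file:/warehouse/t/part-1.parquet", "rows": [{"c0": 3}]},
]

rows = list(scan(splits))
paths = [row["$path"] for row in rows]
```

Rows from the first split carry `part-0` and the row from the second carries `part-1`, regardless of how the splits were grouped into the task.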

@gaoyangxiaozhu
Contributor

> @FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

@rui-mo / @FelixYBW I just got back from OOF. Do you agree with option 2, using the $path synthetic column? If so, I can follow up with the code implementation.

@gaoyangxiaozhu
Contributor

gaoyangxiaozhu commented Jun 4, 2024

> @FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

Hey @Yuhta, quick question: do you have an example I can reference for how to extract a specific field in the function?
Or @rui-mo, you may also know?

@Yuhta
Contributor Author

Yuhta commented Jun 4, 2024

@gaoyangxiaozhu I think you need to do it in the planner, rewriting input_file_name() to url_encode($path)
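
A rough sketch of such a planner rewrite, using a hypothetical expression tree (`Call`, `Field`, and `rewrite` are illustrative names, not Gluten or Velox classes). Every zero-argument `input_file_name()` call is replaced with `url_encode($path)`, and all other expressions are left unchanged.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Field:
    """Reference to a column by name (hypothetical plan node)."""
    name: str


@dataclass(frozen=True)
class Call:
    """Function call with positional arguments (hypothetical plan node)."""
    name: str
    args: Tuple = ()


def rewrite(expr):
    """Recursively replace input_file_name() with url_encode($path)."""
    if isinstance(expr, Call):
        if expr.name == "input_file_name" and not expr.args:
            return Call("url_encode", (Field("$path"),))
        return Call(expr.name, tuple(rewrite(arg) for arg in expr.args))
    return expr  # fields and anything else pass through untouched


original = Call("concat", (Call("input_file_name"), Field("c0")))
rewritten = rewrite(original)
```

In a real plan, the table scan's output type would also need the `$path` column added so that the rewritten expression can resolve it; that part is not shown here.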

@gaoyangxiaozhu
Contributor

> @gaoyangxiaozhu I think you need to do it in the planner, rewriting input_file_name() to url_encode($path)

I see. So it is a little tricky: we still need to change the planner for this special case, to leverage both the url_encode function and the $path kSynthesized column handle.

I can imagine we may need to apply a similar planning strategy to other, similar functions.

@rui-mo / @FelixYBW Could you double-check that this is an acceptable approach before I start on the code?

@FelixYBW
Contributor

FelixYBW commented Jun 5, 2024

> @rui-mo / @FelixYBW to double check if it is a acceptable way before i start the code part.

Go ahead and implement it. I just talked with Rui. Adding a new project would be too complex; let's add it in the future.

@gaoyangxiaozhu
Contributor

Got it! Thank you @FelixYBW / @rui-mo! Let me follow up.

6 participants