Spark input_file_name design #9957

Yuhta · 2024-05-28T14:41:35Z

The Spark implementation of input_file_name uses a thread local to stash the file name and retrieve it from the function. The same method does not work in Velox because the driver can be taken off the thread and a different driver can be scheduled on it by the time the function is called. There are two ways to do it in Velox:

1. To mimic what Spark is doing, we need to store the information in DriverCtx. The challenge is hiding the file-specific details from the driver level while still being able to set the value in table scan and read it back in the function.
2. To mimic what Presto is doing, Gluten can change the plan to add an extra field $path to the output type of table scan; the function then just projects that special field out and does the escaping. All the data types between the table scan and the filter project need to be changed in the Gluten plan.
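To make the thread-local problem concrete, here is a toy Python sketch (not Velox code). The `Driver` class, its `ctx` dictionary (a stand-in for a per-driver context such as DriverCtx), and the file paths are all hypothetical. Two drivers share one thread; the second driver's scan overwrites the thread-local before the first driver's function reads it, while the per-driver context still holds the right value.

```python
import threading

# Thread-local slot, as in the Spark-style approach.
_tls = threading.local()


class Driver:
    """Hypothetical stand-in for a driver that can share a thread with others."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.ctx = {}  # assumption: plays the role of a per-driver context

    def scan(self):
        # The table scan stashes the current file name in both places.
        _tls.file_name = self.file_path
        self.ctx["file_name"] = self.file_path

    def eval_with_thread_local(self):
        return _tls.file_name

    def eval_with_driver_ctx(self):
        return self.ctx["file_name"]


def run_interleaved():
    a = Driver("/data/a.parquet")
    b = Driver("/data/b.parquet")
    a.scan()  # driver a scans, then is taken off the thread
    b.scan()  # driver b is scheduled on the same thread
    # Driver a resumes and evaluates the function.
    return a.eval_with_thread_local(), a.eval_with_driver_ctx()


if __name__ == "__main__":
    print(run_interleaved())
```

In this sketch the thread-local read returns driver b's file for driver a, which is the failure mode described above; option 2 avoids the question entirely by carrying the path as data.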

CC: @gaoyangxiaozhu @mbasmanova


mbasmanova · 2024-05-28T14:44:29Z

CC: @FelixYBW @rui-mo

FelixYBW · 2024-05-28T17:56:04Z

@gaoyangxiaozhu Can we pass input_file_name as a literal to Velox for each split, since it's fixed for each split?

Yuhta · 2024-05-28T18:01:16Z

@FelixYBW The path is already in the split. The problem is how to carry the information from the split into the function.

FelixYBW · 2024-05-28T18:15:20Z

Oh, I see. Your option 2 is what I was thinking: add a project after table scan to append a literal column to the scan result, and hide the input_file_name() implementation in Gluten completely. This way we can add similar function implementations in Gluten directly. @rui-mo Is there any potential issue with this?

Yuhta · 2024-05-28T19:04:43Z

@FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

FelixYBW · 2024-05-28T19:10:18Z

Makes sense. I will follow up with Rui and Yangyang on this. Thank you, @Yuhta.

rui-mo · 2024-05-29T09:30:01Z

@FelixYBW Looks like the second option is feasible in Gluten. Thanks.

Yohahaha · 2024-05-31T05:52:08Z

One Spark task may read multiple files, depending on spark.sql.files.maxPartitionBytes. Which file will Gluten return for input_file_name in this design?

Yuhta · 2024-05-31T13:43:47Z

@Yohahaha The $path value is set from the information in each split, so it does not matter how many splits the task is reading.
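As a rough illustration of this point, the following Python sketch models a task that reads several splits. The split layout (a dict with `path` and `rows`), the column name `value`, and the function names are assumptions made for the example, not Velox or Gluten APIs. Each split contributes its own constant $path column, so every row keeps the path of the file it came from.

```python
def scan_split(split):
    """Produce one batch for a split, with a constant $path column."""
    rows = split["rows"]
    return {
        "value": list(rows),
        "$path": [split["path"]] * len(rows),  # synthesized from the split itself
    }


def scan_task(splits):
    """Concatenate the batches of all splits that one task reads."""
    out = {"value": [], "$path": []}
    for split in splits:
        batch = scan_split(split)
        out["value"].extend(batch["value"])
        out["$path"].extend(batch["$path"])
    return out


if __name__ == "__main__":
    splits = [
        {"path": "/data/a.parquet", "rows": [1, 2]},
        {"path": "/data/b.parquet", "rows": [3]},
    ]
    print(scan_task(splits))
```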

gaoyangxiaozhu · 2024-06-04T06:18:51Z

> @FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

@rui-mo / @FelixYBW I'm just back from OOF. Do you agree with option 2, using the $path synthetic column? If so, I can follow up with the implementation.

gaoyangxiaozhu · 2024-06-04T06:34:48Z

> @FelixYBW If option 2 sounds good to you, you can follow what Presto does by adding a Hive column handle with type kSynthesized and name $path, then extract that field in the function. We don't need to change anything in the driver or the table scan operator for this path.

Hey @Yuhta, quick question: do you have an example I can reference of how to extract a specific field in the function?
Or @rui-mo, you may also know?

Yuhta · 2024-06-04T14:37:40Z

@gaoyangxiaozhu I think you need to do it in the planner, rewriting input_file_name() to url_encode($path)
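A minimal sketch of such a rewrite, written in Python over a toy expression tree rather than Gluten's or Velox's real plan classes. `Column`, `Call`, and `rewrite_input_file_name` are hypothetical names introduced for illustration. The sketch only covers the expression rewrite; the table scan's output type would still need to include $path, as described in option 2.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    name: str
    args: Tuple["Expr", ...] = ()


Expr = Union[Column, Call]


def rewrite_input_file_name(expr):
    """Replace every input_file_name() call with url_encode($path)."""
    if isinstance(expr, Call):
        if expr.name == "input_file_name" and not expr.args:
            return Call("url_encode", (Column("$path"),))
        # Recurse into the arguments of any other call.
        return Call(expr.name, tuple(rewrite_input_file_name(a) for a in expr.args))
    return expr  # plain column references are left untouched


if __name__ == "__main__":
    before = Call("concat", (Call("input_file_name"), Column("c")))
    print(rewrite_input_file_name(before))
```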

gaoyangxiaozhu · 2024-06-05T04:46:46Z

> @gaoyangxiaozhu I think you need to do it in the planner, rewriting input_file_name() to url_encode($path)

I see. It looks a little tricky: we still need to change the planner for this special case to leverage both the url_encode function and the $path kSynthesized column handle.

I can imagine we may need to apply a similar planning strategy to other similar functions.

@rui-mo / @FelixYBW Just to double-check: is this an acceptable approach before I start on the code?

FelixYBW · 2024-06-05T05:29:34Z

> @rui-mo / @FelixYBW to double check if it is an acceptable way before I start the code part.

Go ahead and implement it. I just talked with Rui: a new project would be too complex, so let's add it in the future.

gaoyangxiaozhu · 2024-06-05T06:12:03Z

Got it! Thank you @FelixYBW / @rui-mo! Let me follow up.

Yuhta added the enhancement (New feature or request) label on May 28, 2024

Yuhta mentioned this issue on May 28, 2024: Fix input_file_name spark function to return full path of file not only name #9870 (Closed)

gaoyangxiaozhu mentioned this issue on Jun 5, 2024: [Core] Spark assert_true and raise_error function support apache/incubator-gluten#5991 (Open)

gaoyangxiaozhu mentioned this issue on Jun 7, 2024: [VL] [Core] Spark Input_file_name Support apache/incubator-gluten#6021 (Merged)

gaoyangxiaozhu mentioned this issue on Jun 19, 2024: [VL][Core] Add InputFileReplaceFallback Rule apache/incubator-gluten#6139 (Closed)
