parquet `Row` construct `recordbatch` #4844

smallzhongfeng · 2023-09-21T09:10:21Z

smallzhongfeng · 2023-09-21T09:11:33Z

Is there currently a way to build a `RecordBatch` from parquet `Row` values?

tustvold · 2023-09-21T09:27:04Z

I'm not sure I follow what you're requesting. There are already native readers for reading parquet files as arrow `RecordBatch`:

https://docs.rs/parquet/latest/parquet/arrow/index.html#example-of-reading-parquet-file-into-arrow-record-batch

smallzhongfeng · 2023-09-21T11:58:59Z

Thanks for your reply! What I mean is: the data in the parquet file is read through `RowIter`, processed, and then either converted into a `RecordBatch` or written back to a parquet file. Is this currently supported? @tustvold

tustvold · 2023-09-21T12:15:54Z

This is not currently supported, and I would be hesitant to add support for this.

What is the motivation for using `RowIter` rather than doing the processing on arrow data? If you can describe the processing, I can suggest how you might achieve it. Keeping the data in a columnar representation will yield orders of magnitude better performance.
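To illustrate the row versus columnar point, here is a minimal standard-library sketch. It does not use the parquet or arrow crates: `Row`, `Columns`, `rows_to_columns` and `sum_ids` are hypothetical names made up for this example. It shows the per-value copying a row-to-batch conversion implies, and how processing a single contiguous column then looks.

```rust
/// Hypothetical row type standing in for one decoded record.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: i64,
    pub name: String,
}

/// Hypothetical columnar batch: one contiguous vector per field.
#[derive(Debug, Default, PartialEq)]
pub struct Columns {
    pub id: Vec<i64>,
    pub name: Vec<String>,
}

/// Pivot rows into columns. Every value is moved individually,
/// which is the cost a row-based path pays before any processing.
pub fn rows_to_columns(rows: impl IntoIterator<Item = Row>) -> Columns {
    let mut cols = Columns::default();
    for row in rows {
        cols.id.push(row.id);
        cols.name.push(row.name);
    }
    cols
}

/// Columnar processing: a tight loop over one contiguous vector,
/// without touching the other fields at all.
pub fn sum_ids(cols: &Columns) -> i64 {
    cols.id.iter().sum()
}
```

Reading directly into columnar batches skips the pivot step entirely, which is one reason the arrow readers are the suggested path.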

smallzhongfeng · 2023-09-21T13:00:01Z

We have made changes to datafusion for our business needs. We now need to merge multiple parquet files. At the same time, we need to maintain order (through a certain field) after merging, so that ordered reads can speed up our data processing.

tustvold · 2023-09-21T14:16:07Z

> At the same time, we need to maintain order (through a certain field) after merging

This sounds like something SortPreservingMergeExec could handle for you?
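For readers unfamiliar with the idea, a sort-preserving merge takes several inputs that are each already sorted and interleaves them into one sorted output without re-sorting everything. The sketch below shows that idea with a min-heap from the standard library only. It is not DataFusion's implementation, and `merge_sorted` is a hypothetical name; plain `i64` keys stand in for the sort field.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge inputs that are each already sorted ascending into one
/// ascending output, using a min-heap keyed on the next value of each input.
pub fn merge_sorted(inputs: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = inputs.iter().map(|v| v.len()).sum();
    let mut iters: Vec<std::vec::IntoIter<i64>> =
        inputs.into_iter().map(|v| v.into_iter()).collect();

    // Seed the heap with the first element of every non-empty input.
    // `Reverse` turns the max-heap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(key) = it.next() {
            heap.push(Reverse((key, idx)));
        }
    }

    let mut out = Vec::with_capacity(total);
    // Pop the smallest key, then refill from the input it came from.
    while let Some(Reverse((key, idx))) = heap.pop() {
        out.push(key);
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}
```

With k inputs and n total values this costs O(n log k), compared with O(n log n) for concatenating and sorting again.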

smallzhongfeng · 2023-09-26T08:54:34Z

Another question:

Can SerializedFileReader read an HDFS path? It doesn't seem to work. @tustvold

smallzhongfeng · 2023-09-26T08:55:45Z

> > At the same time, we need to maintain order (through a certain field) after merging
>
> This sounds like something SortPreservingMergeExec could handle for you?

This really helps me, thanks!

tustvold · 2023-09-26T08:59:43Z

> Can SerializedFileReader read an HDFS path? It doesn't seem to work.

You will likely want to use https://github.com/datafusion-contrib/datafusion-objectstore-hdfs

smallzhongfeng · 2023-09-26T09:11:02Z

I don't think that fits my case. I want to use the reader as in this example: https://docs.rs/parquet/latest/parquet/file/index.html#example-of-reading-an-existing-file. It doesn't seem to support HDFS paths, right?

tustvold · 2023-09-26T09:29:34Z

Correct, it won't support anything that isn't a local filesystem path. If you wish to use a networked API such as HDFS, you will need to provide an IO implementation for it, or use one already implemented for you.
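As a rough illustration of what "providing an implementation of IO" means, the sketch below defines a hypothetical `RangeRead` trait for anything that can serve byte ranges, implements it for an in-memory buffer, and uses it to read the 8-byte Parquet trailer (a 4-byte little-endian metadata length followed by the magic bytes `PAR1`). None of these names come from the parquet crate. In the crate itself the analogous extension point is, as far as I know, the `ChunkReader` trait that `SerializedFileReader` is generic over, so check its documentation for the actual method signatures.

```rust
use std::convert::TryFrom;
use std::io;

/// Hypothetical abstraction over a source that can serve byte ranges.
/// An HDFS-backed type would implement this by issuing remote reads.
pub trait RangeRead {
    fn size(&self) -> io::Result<u64>;
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// In-memory stand-in used here so the sketch runs without any network.
pub struct InMemory(pub Vec<u8>);

impl RangeRead for InMemory {
    fn size(&self) -> io::Result<u64> {
        Ok(self.0.len() as u64)
    }

    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        let start = usize::try_from(start)
            .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "offset too large"))?;
        let end = start
            .checked_add(length)
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "range overflow"))?;
        self.0
            .get(start..end)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end"))
    }
}

/// Read the Parquet trailer through the abstraction and return the
/// footer metadata length. Works for any `RangeRead`, local or remote.
pub fn footer_metadata_len<R: RangeRead>(source: &R) -> io::Result<u32> {
    let size = source.size()?;
    if size < 8 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too short"));
    }
    let tail = source.read_range(size - 8, 8)?;
    if !tail.ends_with(b"PAR1") {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    Ok(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}
```

The design point is that the reader logic only depends on "give me these bytes", so swapping the local file for a networked store means supplying another implementation of the trait rather than changing the reader.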

tustvold · 2023-09-27T09:27:46Z

I'm going to close this as I believe the question has been answered. Feel free to open a new ticket or discussion, or get in touch on Discord, if you have any further questions.

smallzhongfeng added the `question` label Sep 21, 2023

smallzhongfeng changed the title ~~parquet Row construct recordbatch~~ parquet Row construct recordbatch Sep 21, 2023

tustvold closed this as not planned Sep 27, 2023

tustvold added the `development-process` label Sep 27, 2023
