parquet Row construct recordbatch #4844

Closed
smallzhongfeng opened this issue Sep 21, 2023 · 12 comments
Labels
development-process Related to development process of arrow-rs question Further information is requested

Comments

@smallzhongfeng
Contributor


@smallzhongfeng smallzhongfeng added the question Further information is requested label Sep 21, 2023
@smallzhongfeng smallzhongfeng changed the title parquet Row construct recordbatch parquet Row construct recordbatch Sep 21, 2023
@smallzhongfeng
Contributor Author

For now, is there a way to build a RecordBatch from parquet's Row?

@tustvold
Contributor

tustvold commented Sep 21, 2023

I'm not sure I follow what you're requesting. There are already native readers for reading parquet files as arrow RecordBatch:

https://docs.rs/parquet/latest/parquet/arrow/index.html#example-of-reading-parquet-file-into-arrow-record-batch

@smallzhongfeng
Contributor Author

smallzhongfeng commented Sep 21, 2023

Thanks for your reply! What I mean is that the data of the parquet file is read out through RowIter and then, after some processing, converted into a RecordBatch or written back to a parquet file. Is this currently possible? @tustvold

@tustvold
Contributor

This is not currently supported, and I would be hesitant to add support for this.

What is the motivation for using RowIter and not just doing the processing on arrow data? If you can describe the processing, I can suggest how you might achieve it. Keeping the data in a columnar representation will yield orders of magnitude better performance.

@smallzhongfeng
Contributor Author

For our business needs, we have made corresponding changes to datafusion. Now we need to merge multiple parquet files. At the same time, we need to maintain order (through a certain field) after merging, so that ordered reads are possible and data processing is faster.

@tustvold
Contributor

> At the same time, we need to maintain order (through a certain field) after merging

This sounds like something SortPreservingMergeExec could handle for you?
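
To illustrate the idea behind a sort-preserving merge, here is a minimal sketch using only the Rust standard library. It is not DataFusion's implementation, and `merge_sorted` is a hypothetical helper name: it assumes every input is already sorted ascending on the key and combines the inputs with a min-heap, so the output stays sorted without re-sorting all the data.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several inputs that are each already sorted ascending into one
/// sorted output. Hypothetical helper for illustration only.
fn merge_sorted(inputs: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = inputs.iter().map(|v| v.len()).sum();
    let mut iters: Vec<std::vec::IntoIter<i64>> =
        inputs.into_iter().map(|v| v.into_iter()).collect();

    // Min-heap of (next value, index of the input it came from).
    let mut heap = BinaryHeap::new();
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(v) = it.next() {
            heap.push(Reverse((v, idx)));
        }
    }

    let mut out = Vec::with_capacity(total);
    while let Some(Reverse((v, idx))) = heap.pop() {
        out.push(v);
        // Refill from the input that just yielded the smallest value.
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}

fn main() {
    let merged = merge_sorted(vec![vec![1, 4, 9], vec![2, 3, 10], vec![5]]);
    println!("{:?}", merged); // prints [1, 2, 3, 4, 5, 9, 10]
}
```

In DataFusion the same kind of merge is applied to streams of RecordBatch from sorted partitions rather than to plain vectors, so the data stays columnar throughout.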

@smallzhongfeng
Contributor Author

smallzhongfeng commented Sep 26, 2023

Another question:

Can SerializedFileReader read from an HDFS path? It doesn't seem to work. @tustvold

@smallzhongfeng
Contributor Author

> > At the same time, we need to maintain order (through a certain field) after merging
>
> This sounds like something SortPreservingMergeExec could handle for you?

This really helps me. Thanks!

@tustvold
Contributor

> Can SerializedFileReader read from an HDFS path? It doesn't seem to work.

You will likely want to use https://github.com/datafusion-contrib/datafusion-objectstore-hdfs

@smallzhongfeng
Contributor Author

I don't think so. I want to use it like this: https://docs.rs/parquet/latest/parquet/file/index.html#example-of-reading-an-existing-file. It doesn't seem to support HDFS paths, right?

@tustvold
Contributor

tustvold commented Sep 26, 2023

Correct, it won't support anything that isn't a local filesystem path. If you wish to use a networked API such as HDFS, you will need to provide an implementation of IO for it, or use one that has already been implemented for you.
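
As a rough sketch of what "an implementation of IO" has to offer: a Parquet reader needs the total file size and random-access reads of byte ranges, because the metadata lives in a footer at the end of the file. The `RangeSource` trait and the `InMemory` type below are hypothetical and are not part of the parquet crate; they only stand in for whatever a storage client (for example an HDFS one) would provide. The example locates the footer length, assuming the plain file layout where the last 8 bytes are a 4-byte little-endian metadata length followed by the `PAR1` magic.

```rust
use std::io;

/// Hypothetical trait (not the parquet crate's API): the minimum a
/// Parquet reader needs from a storage backend.
trait RangeSource {
    /// Total size of the file in bytes.
    fn size(&self) -> io::Result<u64>;
    /// Read `length` bytes starting at byte offset `start`.
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// In-memory stand-in for a remote file; a networked client would go here.
struct InMemory(Vec<u8>);

impl RangeSource for InMemory {
    fn size(&self) -> io::Result<u64> {
        Ok(self.0.len() as u64)
    }

    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        let start = start as usize;
        let end = start
            .checked_add(length)
            .filter(|e| *e <= self.0.len())
            .ok_or_else(|| {
                io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds")
            })?;
        Ok(self.0[start..end].to_vec())
    }
}

/// Read the metadata (footer) length from the last 8 bytes of the file.
fn footer_len<S: RangeSource>(src: &S) -> io::Result<u32> {
    let size = src.size()?;
    if size < 8 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too small"));
    }
    let tail = src.read_range(size - 8, 8)?;
    if tail[4..] != b"PAR1"[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    Ok(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

fn main() -> io::Result<()> {
    // Fake file: leading magic, 10 bytes of "metadata", length, trailing magic.
    let mut bytes = b"PAR1".to_vec();
    bytes.extend_from_slice(&[0u8; 10]);
    bytes.extend_from_slice(&10u32.to_le_bytes());
    bytes.extend_from_slice(b"PAR1");
    println!("footer length: {}", footer_len(&InMemory(bytes))?);
    Ok(())
}
```

With the real crate, the equivalent step is implementing the reader trait that `SerializedFileReader` is generic over; its exact methods should be taken from the crate documentation rather than from this sketch.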

@tustvold
Contributor

I'm going to close this as I believe the question has been answered. Feel free to open a new ticket or discussion, or get in touch on Discord, if you have any further questions.

@tustvold tustvold closed this as not planned Sep 27, 2023
@tustvold tustvold added the development-process Related to development process of arrow-rs label Sep 27, 2023