parquet Row construct recordbatch #4844
For now, can we let parquet's Row generate a RecordBatch?
I'm not sure I follow what you're requesting. There are native readers for reading parquet files as arrow RecordBatch?
Thanks for your reply! What I mean is that the data of the parquet file is read out through RowIter and then, after some processing, converted into a RecordBatch or written back to a parquet file. Is this currently supported? @tustvold
This is not currently supported, and I would be hesitant to add support for this. What is the motivation for using RowIter and not just doing the processing on arrow data? If you can describe the processing, I can suggest how you might achieve this. Keeping the data in a columnar representation will yield orders of magnitude better performance.
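To make the row-versus-columnar point concrete, here is a minimal, standard-library-only sketch of what a row-to-columnar conversion has to do: visit every field of every row and append it to a per-column buffer. The `Row` and `Columns` types and the `rows_to_columns` function are hypothetical names used for illustration only; they are not part of the parquet or arrow crates.

```rust
/// Hypothetical row type: one struct per record (row-oriented layout).
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i64,
    name: String,
}

/// Hypothetical columnar layout: one contiguous vector per field.
#[derive(Debug, Default, PartialEq)]
struct Columns {
    ids: Vec<i64>,
    names: Vec<String>,
}

/// Transpose rows into columns, touching every field of every row once.
fn rows_to_columns(rows: Vec<Row>) -> Columns {
    let mut cols = Columns {
        ids: Vec::with_capacity(rows.len()),
        names: Vec::with_capacity(rows.len()),
    };
    for row in rows {
        cols.ids.push(row.id);
        cols.names.push(row.name);
    }
    cols
}
```

Reading the file directly into a columnar form avoids this extra per-row pass, which is one reason staying columnar tends to be faster.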
For our business needs, we have made corresponding changes to datafusion. We now need to merge multiple parquet files while keeping the merged output ordered by a certain field, so that ordered reads can speed up our data processing.
This sounds like something SortPreservingMergeExec could handle for you?
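The general idea behind a sort-preserving merge can be sketched as a k-way merge over inputs that are each already sorted. The example below uses only the standard library and plain `i64` keys; `merge_sorted` is a hypothetical helper for illustration and is not how DataFusion implements SortPreservingMergeExec, which works on streams of record batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted inputs into one sorted output (k-way merge).
/// Each input stands in for the sorted key column of one file.
fn merge_sorted(inputs: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = inputs.iter().map(|v| v.len()).sum();
    let mut out = Vec::with_capacity(total);

    // Min-heap of (next value, input index, position within that input).
    let mut heap = BinaryHeap::new();
    for (idx, input) in inputs.iter().enumerate() {
        if let Some(&first) = input.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }

    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        out.push(value);
        // Refill the heap from the input we just consumed from.
        if let Some(&next) = inputs[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    out
}
```

Because only the current head of each input sits in the heap, the merge never needs to re-sort the combined data, which is what makes pre-sorted files cheap to combine.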
Another question: Could |
This really helps me. Thanks!
You will likely want to use https://github.com/datafusion-contrib/datafusion-objectstore-hdfs
I think not. I want to use it like this: https://docs.rs/parquet/latest/parquet/file/index.html#example-of-reading-an-existing-file. It seems this does not support HDFS paths, right?
It won't support anything that isn't a local filesystem path, correct. If you wish to use a networked API such as HDFS you will need to provide an implementation of IO for this, or use one already implemented for you.
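As a rough illustration of what "an implementation of IO" involves: a Parquet reader needs random access, since the footer sits at the end of the file, so a backend has to report the total size and serve arbitrary byte ranges. The sketch below is standard-library-only; the `RangeRead` trait, the `InMemory` backend, and `read_magic` are hypothetical names, not the parquet crate's API (as far as I know, the crate's own abstraction for this is its `ChunkReader` trait, whose exact signatures should be checked against the docs for your version).

```rust
use std::io;

/// Hypothetical abstraction: any storage that can report its size and
/// serve an arbitrary byte range (local file, HDFS client, object store).
trait RangeRead {
    fn size(&self) -> io::Result<u64>;
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// In-memory backend standing in for a remote store.
struct InMemory(Vec<u8>);

impl RangeRead for InMemory {
    fn size(&self) -> io::Result<u64> {
        Ok(self.0.len() as u64)
    }

    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        let total = self.0.len() as u64;
        // Reject ranges that overflow or run past the end of the data.
        let end = start
            .checked_add(length as u64)
            .filter(|&e| e <= total)
            .ok_or_else(|| {
                io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds")
            })?;
        Ok(self.0[start as usize..end as usize].to_vec())
    }
}

/// Read the trailing 4 bytes, where a Parquet file stores its "PAR1" magic.
fn read_magic(src: &dyn RangeRead) -> io::Result<Vec<u8>> {
    let size = src.size()?;
    if size < 4 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too small"));
    }
    src.read_range(size - 4, 4)
}
```

An HDFS-backed type would implement the same two operations by calling its client library, and the reading logic on top would not need to change.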
I'm going to close this as I believe the question has been answered. Feel free to open a new ticket/discussion or get in touch on Discord if you have any further questions.