Confused about bit order in parquet BIT_PACKED encoding #5338
Comments
Probably the quickest way to resolve this would be to obtain a parquet file written by parquet-mr that uses this encoding scheme and try to read it. One observation is that this encoding is only used for level information, so you would need a very complex schema with more than 256 nested levels to detect any endianness discrepancy.
Well, the good news is that configuring parquet-mr to use bitpacking encoding is extremely unsupported 😅 The only way to do that seems to be classpath shadowing. The bad news is that the resulting file shows different null values in the Java and Rust implementations. Java output:
Arrow-rs
The apache/arrow/go implementation seems to have the same issue with this file. Their code also reuses the same BitWriter for bitpacked decoding and inside the rle decoder.
Trying to read the file with polars (which forked the parquet2 code) results in the following error:
It seems it is trying to read a 4-byte length prefix, which should only be written for RLE encoding. I also asked a colleague with a working Python setup to try reading the file, and he reported the following outcome:
The latter seems similar to polars. Looking at the code, it is a bounds check that expects 4 additional bytes even though the number of bytes is calculated from the bit width. Considering all these issues, it would probably be better to remove support for bitpacked levels completely.
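The length handling described above can be sketched as follows. This is only an illustration, not code from arrow-rs, polars, or pyarrow: `LevelEncoding` and `level_bytes_len` are hypothetical names, and the sketch assumes a v1 data page in which RLE levels carry a 4-byte little-endian length prefix while BIT_PACKED levels carry none, so their size has to be derived from the value count and bit width.

```rust
/// Hypothetical enum for the two level encodings discussed in this thread.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LevelEncoding {
    Rle,
    BitPacked,
}

/// Returns how many bytes of `buf` belong to the level data, or `None`
/// if `buf` is too short for the size implied by the encoding.
fn level_bytes_len(
    encoding: LevelEncoding,
    buf: &[u8],
    num_values: usize,
    bit_width: usize,
) -> Option<usize> {
    let total = match encoding {
        LevelEncoding::Rle => {
            // Assumed layout: 4-byte little-endian length, then the payload.
            if buf.len() < 4 {
                return None;
            }
            let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
            len.checked_add(4)?
        }
        LevelEncoding::BitPacked => {
            // No prefix: ceil(num_values * bit_width / 8) bytes.
            num_values.checked_mul(bit_width)?.checked_add(7)? / 8
        }
    };
    if total <= buf.len() {
        Some(total)
    } else {
        None
    }
}
```

Under these assumptions, 10 levels at a bit width of 3 occupy exactly 4 bytes when bit-packed. A reader that also demands a 4-byte prefix in that case would reject a buffer that is in fact complete, which matches the bounds-check failure described in the comment.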
The documentation for the parquet BIT_PACKED encoding says:
This is followed by an example that clearly differs from the example for the RLE encoding. The documentation there also says:
However, in the arrow-rs/parquet code base, I see both encodings use the same BitReader::get_batch implementation. For bitpacked it is used directly, while for rle it is used indirectly via RleDecoder::get_batch. I think parquet2 reuses the bitpacking logic in a similar way.
As far as I know, both rust parquet implementations pass the integration test suite, so there are multiple options to describe this discrepancy:
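To make the two bit orders concrete, here is a minimal sketch that packs values both ways. The function names `pack_lsb_first` and `pack_msb_first` are hypothetical and are not part of arrow-rs or parquet2. The sketch assumes the reading of the Parquet format documentation in which the RLE/bit-packing hybrid fills each byte from its least significant bit, while the deprecated BIT_PACKED encoding fills each byte from its most significant bit.

```rust
/// Packs each value into `bit_width` bits, filling every output byte from
/// its least significant bit upward (the order assumed for the RLE hybrid).
fn pack_lsb_first(values: &[u32], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut bit_pos = 0;
    for &value in values {
        // Emit the value's bits starting with its lowest bit.
        for i in 0..bit_width {
            if (value >> i) & 1 == 1 {
                out[bit_pos / 8] |= 1 << (bit_pos % 8);
            }
            bit_pos += 1;
        }
    }
    out
}

/// Packs each value into `bit_width` bits, filling every output byte from
/// its most significant bit downward (the order assumed for BIT_PACKED).
fn pack_msb_first(values: &[u32], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut bit_pos = 0;
    for &value in values {
        // Emit the value's bits starting with its highest bit.
        for i in (0..bit_width).rev() {
            if (value >> i) & 1 == 1 {
                out[bit_pos / 8] |= 0x80 >> (bit_pos % 8);
            }
            bit_pos += 1;
        }
    }
    out
}
```

For the values 0 through 7 at a bit width of 3, `pack_lsb_first` yields the bytes `0x88 0xC6 0xFA` and `pack_msb_first` yields `0x05 0x39 0x77`. The two layouts already differ at a bit width of 1, so a decoder that applies the LSB-first logic to MSB-first data would read different levels.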