Confused about bit order in parquet BIT_PACKED encoding #5338

Closed
jhorstmann opened this issue Jan 28, 2024 · 4 comments
Labels: bug, help wanted, parquet (Changes to the parquet crate)

Comments

@jhorstmann (Contributor)

The documentation for the parquet BIT_PACKED encoding says:

For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, which is not the same as the RLE/bit-packing hybrid.

This is followed by an example that is clearly different from the example for the RLE encoding. The documentation for RLE also says:

The bit-packing here is done in a different order than the one in the deprecated bit-packing encoding

However, in the arrow-rs/parquet code base, I see that both encodings use the same BitReader::get_batch implementation: for BIT_PACKED it is used directly, while for RLE it is used indirectly via RleDecoder::get_batch. I think parquet2 reuses the bit-packing logic in the same way.
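To make the documented difference concrete, here is a minimal, dependency-free sketch of the two packing orders (my own illustration, not either crate's actual implementation), using the spec's example of the values 0 through 7 at bit width 3:

```rust
/// Pack values MSB-first within each byte, as the deprecated
/// BIT_PACKED encoding is documented to do.
fn pack_msb_first(values: &[u8], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut bit_pos = 0;
    for &v in values {
        // emit each value's bits from most to least significant
        for i in (0..bit_width).rev() {
            let bit = (v >> i) & 1;
            // fill each byte starting from its most significant bit
            out[bit_pos / 8] |= bit << (7 - bit_pos % 8);
            bit_pos += 1;
        }
    }
    out
}

/// Pack values LSB-first within each byte, as the RLE/bit-packing
/// hybrid is documented to do.
fn pack_lsb_first(values: &[u8], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut bit_pos = 0;
    for &v in values {
        // emit each value's bits from least to most significant
        for i in 0..bit_width {
            let bit = (v >> i) & 1;
            // fill each byte starting from its least significant bit
            out[bit_pos / 8] |= bit << (bit_pos % 8);
            bit_pos += 1;
        }
    }
    out
}

fn main() {
    let values: Vec<u8> = (0..8).collect();
    println!("{:02x?}", pack_msb_first(&values, 3)); // [05, 39, 77]
    println!("{:02x?}", pack_lsb_first(&values, 3)); // [88, c6, fa]
}
```

If the documentation were accurate, the two orders would have to produce different bytes, and they do: 0x05 0x39 0x77 for MSB-first versus 0x88 0xC6 0xFA for LSB-first, matching the two examples in the format spec.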

As far as I know, both Rust parquet implementations pass the integration test suite, so there are several possible explanations for this discrepancy:

  • The documentation is wrong, perhaps confusing bit order with little-/big-endian byte order
  • The Rust code is wrong, but the BIT_PACKED encoding is not used in practice, not even in the test suite
  • The difference only shows up on big-endian machines (I don't think this can be the case, since the examples are given in terms of bytes)
jhorstmann added the question (Further information is requested) label on Jan 28, 2024
@tustvold (Contributor)

Probably the quickest way to resolve this would be to obtain a parquet file written by parquet-mr that uses this encoding and try to read it.

One observation: this encoding is only used for level information, so you would need a very complex schema with more than 256 nesting levels to detect any endianness discrepancy.
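For context, the level bit width is derived from the maximum definition or repetition level of the column, so a single level value only spans more than one byte once the schema nests past 255 levels. A quick sketch of that calculation (a simplified stand-in, not the actual helper in either crate):

```rust
/// Number of bits needed to encode levels up to `max_level`
/// (simplified stand-in for the crates' own bit-width helpers).
fn num_required_bits(max_level: u64) -> u8 {
    (64 - max_level.leading_zeros()) as u8
}

fn main() {
    assert_eq!(num_required_bits(1), 1);   // a single optional field
    assert_eq!(num_required_bits(255), 8); // levels still fit in one byte
    assert_eq!(num_required_bits(256), 9); // deeper nesting: a value spans two bytes
}
```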

@jhorstmann (Contributor, Author)

Well, the good news is that configuring parquet-mr to use the BIT_PACKED encoding is extremely unsupported 😅

The only way to do it seems to be classpath-shadowing ParquetProperties: changing the relevant method there to use BitPackingValuesWriter enables BIT_PACKED encoding of definition levels. Otherwise, BIT_PACKED only seems to be used for a bit width of 0, where the bit order makes no difference because all values are zero.

The bad news is that reading the resulting file produces different null values in the Java and Rust implementations.

Java output:

required_string1
optional_string
1.0
1.0
12345678
12345678
true
false

arrow-rs parquet-read output:

{required_string: "required_string1", optional_string: null, required_double: 1.0, optional_double: null, required_timestamp: 1970-01-01 03:25:45 +00:00, optional_timestamp: null, required_bool: true, optional_bool: null}

bitpacking_levels.parquet.zip

Java code that wrote this file

tustvold added the bug and help wanted labels and removed the question label on Jan 28, 2024
@jhorstmann (Contributor, Author)

The apache/arrow Go implementation seems to have the same issue with this file. Its code also reuses the same BitWriter for BIT_PACKED decoding and inside the RLE decoder.

@jhorstmann (Contributor, Author)

Trying to read the file with polars (which forked the parquet2 code) results in the following error:

parquet: File out of specification: The number of bytes declared in v1 def levels is higher than the page size

It seems to be trying to read a 4-byte length prefix, which should only be written for the RLE encoding.
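That matches the v1 data page layout: the format spec prefixes RLE-encoded levels with a 4-byte little-endian length, while BIT_PACKED levels carry no prefix and their size must be computed from the value count and bit width. A sketch of the distinction (simplified; the Encoding enum and function here are stand-ins, not any crate's actual API):

```rust
use std::convert::TryInto;

// Stand-in for the level encodings relevant to a v1 data page.
enum Encoding { Rle, BitPacked }

/// Bytes occupied by the level data at the start of a v1 data page.
fn level_byte_len(encoding: Encoding, data: &[u8], num_values: usize, bit_width: usize) -> usize {
    match encoding {
        // RLE levels carry a 4-byte little-endian length prefix
        Encoding::Rle => 4 + u32::from_le_bytes(data[..4].try_into().unwrap()) as usize,
        // BIT_PACKED levels have no prefix; their size is derived instead
        Encoding::BitPacked => (num_values * bit_width + 7) / 8,
    }
}
```

A reader that unconditionally consumes the 4-byte prefix will misparse BIT_PACKED level data, which would explain the "number of bytes" errors below.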

I also asked a colleague with a working Python setup to try to read the file, and he reported the following outcomes:

  • fastparquet: endless loop
  • pyarrow: the error "Received invalid number of bytes (corrupt data page?)"

The latter seems similar to polars: looking at the code, it is a bounds check that expects 4 additional bytes even when the number of bytes is calculated from the bit width.

Considering all these issues, it would probably be better to remove support for bitpacked levels completely.
