Confused about bit order in parquet BIT_PACKED encoding #5338

Closed
jhorstmann opened this issue Jan 28, 2024 · 4 comments
Labels: bug, help wanted, parquet (Changes to the parquet crate)

Comments

@jhorstmann (Contributor)

The documentation for the parquet BIT_PACKED encoding says:

For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, which is not the same as the RLE/bit-packing hybrid.

This is followed by an example that is clearly different from the example for the RLE encoding. The documentation for RLE also says:

The bit-packing here is done in a different order than the one in the deprecated bit-packing encoding

However, in the arrow-rs/parquet code base, I see that both encodings use the same BitReader::get_batch implementation: for BIT_PACKED it is used directly, while for RLE it is used indirectly via RleDecoder::get_batch. I think parquet2 reuses the bit-packing logic in the same way.
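To make the documented difference concrete, here is a minimal, dependency-free sketch of the two packing orders (my own illustration, not either crate's actual implementation), using the spec's example of the values 0 through 7 at bit width 3:

```rust
/// Pack values MSB-first within each byte, as the deprecated
/// BIT_PACKED encoding is documented to do.
fn pack_msb_first(values: &[u8], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut bit_pos = 0;
    for &v in values {
        // emit each value's bits from most to least significant
        for i in (0..bit_width).rev() {
            let bit = (v >> i) & 1;
            // fill each byte starting from its most significant bit
            out[bit_pos / 8] |= bit << (7 - bit_pos % 8);
            bit_pos += 1;
        }
    }
    out
}

/// Pack values LSB-first within each byte, as the RLE/bit-packing
/// hybrid is documented to do.
fn pack_lsb_first(values: &[u8], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut bit_pos = 0;
    for &v in values {
        // emit each value's bits from least to most significant
        for i in 0..bit_width {
            let bit = (v >> i) & 1;
            // fill each byte starting from its least significant bit
            out[bit_pos / 8] |= bit << (bit_pos % 8);
            bit_pos += 1;
        }
    }
    out
}

fn main() {
    let values: Vec<u8> = (0..8).collect();
    println!("{:02x?}", pack_msb_first(&values, 3)); // [05, 39, 77]
    println!("{:02x?}", pack_lsb_first(&values, 3)); // [88, c6, fa]
}
```

If the documentation were accurate, the two orders would have to produce different bytes, and they do: 0x05 0x39 0x77 for MSB-first versus 0x88 0xC6 0xFA for LSB-first, matching the two examples in the format spec.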

As far as I know, both Rust parquet implementations pass the integration test suite, so there are several possible explanations for this discrepancy:

  • The documentation is wrong, perhaps confusing bit order with little-/big-endian byte order
  • The Rust code is wrong, but the BIT_PACKED encoding is not used in practice, not even in the test suite
  • The difference only shows up on big-endian machines (I don't think this can be the case, since the examples are given in terms of bytes)
jhorstmann added the question (Further information is requested) label on Jan 28, 2024
@tustvold (Contributor)

Probably the quickest way to resolve this would be to obtain a parquet file written by parquet-mr that uses this encoding and try to read it.

One observation: this encoding is only used for level information, so you would need a very complex schema with more than 256 nesting levels to detect any endianness discrepancy.
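For context, the level bit width is derived from the maximum definition or repetition level of the column, so a single level value only spans more than one byte once the schema nests past 255 levels. A quick sketch of that calculation (a simplified stand-in, not the actual helper in either crate):

```rust
/// Number of bits needed to encode levels up to `max_level`
/// (simplified stand-in for the crates' own bit-width helpers).
fn num_required_bits(max_level: u64) -> u8 {
    (64 - max_level.leading_zeros()) as u8
}

fn main() {
    assert_eq!(num_required_bits(1), 1);   // a single optional field
    assert_eq!(num_required_bits(255), 8); // levels still fit in one byte
    assert_eq!(num_required_bits(256), 9); // deeper nesting: a value spans two bytes
}
```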

@jhorstmann (Contributor, Author)

Well, the good news is that configuring parquet-mr to use the BIT_PACKED encoding is extremely unsupported 😅

The only way to do it seems to be classpath-shadowing ParquetProperties: changing the relevant method there to use BitPackingValuesWriter enables BIT_PACKED encoding of definition levels. Otherwise, BIT_PACKED only seems to be used for a bit width of 0, where the bit order makes no difference because all values are zero.

The bad news is that reading the resulting file produces different null values in the Java and Rust implementations.

Java output:

required_string1
optional_string
1.0
1.0
12345678
12345678
true
false

arrow-rs parquet-read output:

{required_string: "required_string1", optional_string: null, required_double: 1.0, optional_double: null, required_timestamp: 1970-01-01 03:25:45 +00:00, optional_timestamp: null, required_bool: true, optional_bool: null}

bitpacking_levels.parquet.zip

Java code that wrote this file

tustvold added the bug and help wanted labels and removed the question label on Jan 28, 2024
@jhorstmann (Contributor, Author)

The apache/arrow Go implementation seems to have the same issue with this file. Its code also reuses the same BitWriter for BIT_PACKED decoding and inside the RLE decoder.

@jhorstmann (Contributor, Author)

Trying to read the file with polars (which forked the parquet2 code) results in the following error:

parquet: File out of specification: The number of bytes declared in v1 def levels is higher than the page size

It seems to be trying to read a 4-byte length prefix, which should only be written for the RLE encoding.
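That matches the v1 data page layout: the format spec prefixes RLE-encoded levels with a 4-byte little-endian length, while BIT_PACKED levels carry no prefix and their size must be computed from the value count and bit width. A sketch of the distinction (simplified; the Encoding enum and function here are stand-ins, not any crate's actual API):

```rust
use std::convert::TryInto;

// Stand-in for the level encodings relevant to a v1 data page.
enum Encoding { Rle, BitPacked }

/// Bytes occupied by the level data at the start of a v1 data page.
fn level_byte_len(encoding: Encoding, data: &[u8], num_values: usize, bit_width: usize) -> usize {
    match encoding {
        // RLE levels carry a 4-byte little-endian length prefix
        Encoding::Rle => 4 + u32::from_le_bytes(data[..4].try_into().unwrap()) as usize,
        // BIT_PACKED levels have no prefix; their size is derived instead
        Encoding::BitPacked => (num_values * bit_width + 7) / 8,
    }
}
```

A reader that unconditionally consumes the 4-byte prefix will misparse BIT_PACKED level data, which would explain the "number of bytes" errors below.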

I also asked a colleague with a working Python setup to try to read the file, and he reported the following outcomes:

  • fastparquet: endless loop
  • pyarrow: the error "Received invalid number of bytes (corrupt data page?)"

The latter seems similar to polars: looking at the code, it is a bounds check that expects 4 additional bytes even when the number of bytes is calculated from the bit width.

Considering all these issues, it would probably be better to remove support for bitpacked levels completely.
