-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZSTD performance is not good on LEB128 encoded integers #1325
Comments
The LEB128 format breaks the regularity of input compared to non-encoded source, where all numbers are 64-bit. While the format "seems" to reduce the amount of data to compress, it actually makes the work of the compressor more difficult. This is unfortunately a common scenario. Also, note that level 1 of A recommended mitigation is to not use LEB128 encoding for tables of integer numbers. A better transform for this kind of data would be a delta encoding, followed by a transpose, followed by |
Thanks @Cyan4973 for replying. I tried to use different compression levels on both LEB128-encoded and non-encoded data. It seems that the compression ratio remains unchanged for LEB128-encoded integers from level 1 to 19. Those suggestions make sense. Thanks! |
Hi All,
We use zstd in production to compress various kinds of data and overall it performs very well. However, when it comes to LEB128 encoded integers, it could not handle the case very well. In case you don't know LEB128, here is a brief introduction: https://en.wikipedia.org/wiki/LEB128.
To reproduce the issue: suppose I have a file called '/tmp/number' which contains 512 unsigned 64-bit integers.
Use the data provided below, /tmp/encoded is 1601 bytes and /tmp/non_encoded is 4096 bytes.
zstd -1 /tmp/encoded -o /tmp/encoded.zstd
zstd -1 /tmp/non_encoded -o /tmp/non_encoded.zstd
We can find /tmp/encoded.zstd is 1615 bytes which is larger than its input (1601 bytes). But /tmp/non_encoded.zstd is only 1167 bytes and it is way smaller than the compressed file of LEB128 data. We cannot dismiss LEB128 because the file format we use is Apache ORC and all integers are forced to use LEB128 before compression.
Below are the 512 unsigned integers used in the above example. Other data can also reproduce similar results.
I have also tried zlib to compress these two kinds of data. Regardless of its performance compared to zstd, zlib can compress LEB128 data into smaller size than little endian uint64_t. Therefore I think there may be some room for zstd to improve on LEB128 data. Any comment is welcome. Thanks!
The text was updated successfully, but these errors were encountered: