
Compression ratio worse than zlib (deflate) #256

Closed
hansinator opened this issue Jul 21, 2016 · 14 comments

@hansinator

Hi,

I have an input file on which zstd gives a worse compression ratio than zlib does with deflate. The file is basically a list of 24-bit integers. If I store the same integers as 32-bit values with the MSB set to zero, the input file is of course larger, but it compresses with more than double the ratio.
I find this rather odd. Maybe you can have a look at it?

@hansinator
Author

hansinator commented Jul 21, 2016

using v0.7.4

$>zstd -22 24bit.bpt -o 24bit.zs
24bit.bpt : 82.64% (1377618 => 1138433 bytes, 24bit.zs)

$>zstd -22 32bit.bpt -o 32bit.zs
32bit.bpt : 26.03% (1836370 => 477944 bytes, 32bit.zs)

@Cyan4973
Contributor

Cyan4973 commented Jul 21, 2016

I agree it feels strange, but I believe it's in line with how the algorithm works.
I think it's related to the fact that the minimum match length in Zstandard is 3, so it will more easily find 3-byte matches in 32-bit structures than in 24-bit ones.
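
To make the effect concrete, here is a small sketch one could run to compare the two layouts (illustrative only: the synthetic sequence is arbitrary, the third-party zstandard package is an assumption, and the numbers will differ from the issue's files):

import struct, zlib
import zstandard  # third-party zstd binding, assumed installed

values = [(i * 3) % (1 << 24) for i in range(200_000)]     # synthetic 24-bit data
p24 = b"".join(struct.pack("<I", v)[:3] for v in values)   # packed 24-bit layout
p32 = b"".join(struct.pack("<I", v) for v in values)       # 32-bit layout, zero MSB

zc = zstandard.ZstdCompressor(level=22)
for name, buf in (("24-bit", p24), ("32-bit", p32)):
    print(name, "deflate:", len(zlib.compress(buf, 9)), "zstd:", len(zc.compress(buf)))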

For this kind of data, though, I would recommend having a look at @FrancescAlted's Blosc filter, which, combined with zstd, provides excellent performance. See this blog post for detailed information.

@FrancescAlted

Yes, the kind of behavior that @hansinator describes is quite typical in Blosc, but to be frank I was not expecting to see it in Zstd. It would indeed be interesting to see how Blosc+Zstd performs on this file. Is there a place where it can be accessed?

@hansinator
Author

Here are both files:
samples.tar.gz

Blosc sounds interesting, though I didn't fully understand it. Will it increase speed, compression ratio, or both? Will decompression be fast on an embedded RISC processor running at 50 MHz?

@FrancescAlted

FrancescAlted commented Jul 26, 2016

Generally, Blosc will increase speed (it is multithreaded), and it can increase compression ratios depending on whether its filters (shuffle and bitshuffle) can rearrange the data in a way that makes it easier for the codec to find duplicated patterns.
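
For reference, the byte shuffle filter is essentially a byte-level transpose. A minimal sketch in Python (using numpy for brevity; Blosc's real implementation is SIMD-accelerated C):

import numpy as np

def byte_shuffle(data: bytes, typesize: int) -> bytes:
    # View the buffer as rows of `typesize` bytes, then transpose so that
    # byte 0 of every element comes first, then byte 1, and so on.
    return np.frombuffer(data, dtype=np.uint8).reshape(-1, typesize).T.tobytes()

def byte_unshuffle(data: bytes, typesize: int) -> bytes:
    # Inverse: transpose back to the original interleaved layout.
    return np.frombuffer(data, dtype=np.uint8).reshape(typesize, -1).T.tobytes()

For a zero-padded 32-bit file, the transpose gathers every fourth (most significant) byte into one long run of zeros, which an LZ-based codec then compresses almost for free.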

Regarding using Blosc on a RISC processor: this is indeed feasible, but you won't get the extra speed that SIMD instructions provide on Intel (SSE2, AVX2) or ARM (NEON, only in Blosc2). Another constraint is that binary size usually matters on such processors, and the Blosc library will necessarily be fatter (although you can always disable codecs and keep only the ones that are important to you).

At any rate, I gave your files a go on my laptop and got interesting results. The original files were:

$ ll *.bpt
-rwx------ 1 faltet faltet 1377618 jul 26 14:15 24bit.bpt*
-rwx------ 1 faltet faltet 1836370 jul 20 19:07 32bit.bpt*

and I processed them with bloscpack (a high level interface to Blosc):

$ /usr/bin/time blpk -f c -c zstd -t 4 -l 9  32bit.bpt 32bit-shuffle.blp
0.56user 0.22system 0:00.61elapsed 128%CPU (0avgtext+0avgdata 38032maxresident)k
0inputs+24outputs (0major+10542minor)pagefaults 0swaps
$ /usr/bin/time blpk -f c -c zstd -t 4 -l 9  24bit.bpt 24bit-shuffle.blp
0.45user 0.23system 0:00.47elapsed 143%CPU (0avgtext+0avgdata 38676maxresident)k
0inputs+2288outputs (0major+11012minor)pagefaults 0swaps
$ /usr/bin/time blpk -f c -c zstd -t 4 -l 9 -s 32bit.bpt 32bit-noshuffle.blp
0.58user 0.28system 0:00.59elapsed 146%CPU (0avgtext+0avgdata 37292maxresident)k
0inputs+952outputs (0major+10323minor)pagefaults 0swaps
$ /usr/bin/time blpk -f c -c zstd -t 4 -l 9 -s 24bit.bpt 24bit-noshuffle.blp
0.46user 0.15system 0:00.46elapsed 131%CPU (0avgtext+0avgdata 38904maxresident)k
0inputs+2216outputs (0major+10635minor)pagefaults 0swaps

[note that compression level 9 in Blosc translates to 22 in Zstd]

And the sizes for the compressed files are:

$ ll *.blp
-rw-rw-r-- 1 faltet faltet 1132165 jul 26 17:22 24bit-noshuffle.blp
-rw-rw-r-- 1 faltet faltet 1170249 jul 26 17:21 24bit-shuffle.blp
-rw-rw-r-- 1 faltet faltet  485925 jul 26 17:22 32bit-noshuffle.blp
-rw-rw-r-- 1 faltet faltet    8249 jul 26 17:22 32bit-shuffle.blp

So, when not using the shuffle filter, I can almost reproduce the plain zstd figures (expected). However, when the shuffle filter is used, the 32-bit version is reduced to about 8 KB.

As I was quite surprised by this result, I double-checked that the latter file reproduces the original one:

$ /usr/bin/time blpk d 32bit-shuffle.blp 32bit-decompressed.bpt
0.42user 0.24system 0:00.40elapsed 167%CPU (0avgtext+0avgdata 32508maxresident)k
0inputs+3592outputs (0major+9681minor)pagefaults 0swaps
$ ll 32bit.bpt 32bit-decompressed.bpt
-rwx------ 1 faltet faltet 1836370 jul 20 19:07 32bit.bpt*
-rw-rw-r-- 1 faltet faltet 1836370 jul 26 17:42 32bit-decompressed.bpt
$ diff 32bit.bpt 32bit-decompressed.bpt
$ echo $?
0

so yup, it does.

@Cyan4973
Contributor

Cyan4973 commented Jul 26, 2016

I suspect:
blpk -f c -c zstd -t 3 -l 9 24bit.bpt 24bit-shuffle.blp
would probably result in something as well compressed as the 32-bit version
(if -t 3 is a supported value ...)

@hansinator
Author

hansinator commented Jul 26, 2016

Thank you a lot. This opens up new possibilities to look into. I need to transfer look-up tables for image processing to an FPGA via a 115 kbps serial line, and it has limited storage and memory resources. I have been trying different compression algorithms for a couple of weeks, with mixed results: algorithms like XZ or LZHAM take too long to decompress, while deflate and similar take too long to transfer. Now Blosc with Zstd seems to be the perfect solution. Until now the only codec that achieved a ~10 KB compressed size was RAR, which I can't use without WinRAR.

@Cyan4973
Wouldn't it be useful to include the shuffle filter in the zstd binary, like xz includes the BCJ filter?

@FrancescAlted
I'd like to play around with this and try a whole bunch of files. I've installed bloscpack, but it has no zstd support. I'm no Python guy, so maybe you can tell me how to get it to use the newest Blosc from its git repo?

@FrancescAlted

@Cyan4973 Unfortunately, -t 3 is not supported; the only typesize values supported for shuffle are 2, 4, 8, 16, and 16 + any of the previous ones.

@hansinator Zstd support was added to python-blosc (a dependency of bloscpack) last week, so you will need to install the latest version of it (1.4.0).
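
Both packages live on PyPI, so something along these lines should pull in a recent enough python-blosc (the exact version pin here is an assumption):

$ pip install --upgrade "blosc>=1.4.0" bloscpack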

@KrzysFR
Contributor

KrzysFR commented Jul 27, 2016

@FrancescAlted Stupid question, but would it be helpful, for the compression ratio, to have a "pre-filter" that expands 24-bit integers into 32 bits (MSB padded with 0) to improve compression using -t 4 and zstd? (And the same for 5-6-7 byte values expanded to 8 bytes?)

@FrancescAlted

Your suggestions made me think twice: -t 3 is actually supported in Blosc, but one has to help bloscpack choose a chunksize that is divisible by the typesize (3 in this case):

$ /usr/bin/time blpk -f c -z 1377618 -c zstd -t 3 -l 9 24bit.bpt 24bit-shuffle.blp
0.62user 0.29system 0:00.64elapsed 142%CPU (0avgtext+0avgdata 41424maxresident)k
0inputs+16outputs (0major+10012minor)pagefaults 0swaps
$ ll 24bit-shuffle.blp 
-rw-rw-r-- 1 faltet faltet 6805 jul 27 16:50 24bit-shuffle.blp

which is another 20% smaller than the 32-bit one.

But for a 3-byte typesize a regular C shuffle function is used, not the SIMD-accelerated codepath, so perhaps your idea of adding padding would be a good one. I've added a ticket for C-Blosc2.
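
A rough sketch of that pre-filter idea in Python (the helper names are hypothetical, not part of the Blosc API; it assumes little-endian 24-bit values and an input length divisible by 3):

import blosc

def pad24to32(data: bytes) -> bytes:
    # Widen each little-endian 24-bit value to 32 bits with a zero MSB.
    return b"".join(data[i:i+3] + b"\x00" for i in range(0, len(data), 3))

def unpad32to24(data: bytes) -> bytes:
    # Inverse: drop the padding byte of every 32-bit value.
    return b"".join(data[i:i+3] for i in range(0, len(data), 4))

raw = open("24bit.bpt", "rb").read()
comp = blosc.compress(pad24to32(raw), 4, cname="zstd", shuffle=1, clevel=9)
assert unpad32to24(blosc.decompress(comp)) == raw  # round-trips losslessly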

@FrancescAlted

FrancescAlted commented Jul 27, 2016

Just for completeness: apparently the improvement in compression for the 24-bit case does not come just from removing the zeroed MSBs. Taking the 32-bit file and choosing a chunksize close to the length of the file (for some reason, the size of the 32-bit file is not exactly divisible by 4), we get:

$ /usr/bin/time blpk -f c -z 1836368 -c zstd -t 4 -l 9 32bit.bpt 32bit-shuffle.blp
0.65user 0.21system 0:00.66elapsed 131%CPU (0avgtext+0avgdata 42428maxresident)k
0inputs+16outputs (0major+11214minor)pagefaults 0swaps
$ ll 32bit-shuffle.blp 
-rw-rw-r-- 1 faltet faltet 6738 jul 27 17:15 32bit-shuffle.blp

which is actually a bit smaller than the 24-bit result. In addition, using the SIMD-accelerated path for shuffle still provides a good performance advantage (Python code ahead):

In [1]: import blosc

In [4]: i24 = open("24bit.bpt", mode="rb").read()

In [5]: i32 = open("32bit.bpt", mode="rb").read()

In [6]: len(i24), len(i32)
Out[6]: (1377618, 1836370)

In [8]: i24c = blosc.compress(i24, 3, cname="zstd", shuffle=1, clevel=9)

In [9]: i32c = blosc.compress(i32, 4, cname="zstd", shuffle=True, clevel=9)

In [10]: len(i24c), len(i32c)
Out[10]: (6681, 6514)

In [11]: %timeit blosc.compress(i24, 3, cname="zstd", shuffle=1, clevel=9)
10 loops, best of 3: 178 ms per loop

In [12]: %timeit blosc.compress(i32, 4, cname="zstd", shuffle=True, clevel=9)
1 loop, best of 3: 212 ms per loop    # not an improvement for compression, but see later

In [13]: i24d = blosc.decompress(i24c)

In [14]: i32d = blosc.decompress(i32c)

In [15]: len(i24d), len(i32d)
Out[15]: (1377618, 1836370)

In [16]: %timeit blosc.decompress(i24c)
1000 loops, best of 3: 841 µs per loop

In [17]: %timeit blosc.decompress(i32c)
1000 loops, best of 3: 307 µs per loop

So, decompression happens 2.7x faster (at more than 5 GB/s) when using the padded version. Although, if this were an actual padding pre-filter, the additional copies of internal buffers might defeat the advantage. I suppose the only way to know is to experiment.

@Cyan4973
Contributor

Cyan4973 commented Aug 1, 2016

Is there anything else that can be done on this issue?

@hansinator
Author

As far as I understand, Zstd is intended to be a better replacement for deflate, so my intent with this issue was to point out a weird case where deflate performs better, in the hope of helping bring Zstd closer to that goal. It turns out to be a complex issue with no easy one-size-fits-all solution. I appreciate your replies and the effort you have put into experimenting. You've helped me get a better understanding of this and shown me how to get better performance in my concrete compression scenario. If there is nothing tangible we can learn here to improve Zstd, I'd say there's nothing more to be done on this issue.

@Cyan4973
Contributor

Cyan4973 commented Aug 1, 2016

Thanks @hansinator

We'll try to analyze what happens in this case: while I'm not surprised that 24bit.bpt compresses badly (hence inviting other methods), I am surprised that it compresses a bit better with gzip. So there might be something we can do better within zstd.
We'll use your sample for this investigation.

Regards

@Cyan4973 Cyan4973 closed this as completed Aug 1, 2016