Comparison with dpack #38

Open
aquasync opened this issue Sep 11, 2021 · 4 comments

@aquasync

Hi there,

Since you're also the author of dpack, I was wondering if you could provide a high-level comparison? I have been testing dpack out a bit and was impressed with the shared structure feature. I was actually looking at porting it to Python (maybe using Cython), but I've just come across this project, which seems to share at least some of the same design goals while being built on top of msgpack. As this one appears to be newer, should it be considered the successor? If you can share any perf/size comparisons you've made, that'd be very interesting.

Thanks!

@aquasync
Author

After taking a closer look, it seems that I'm getting much smaller serialization with dpack for my test cases. E.g.:

> d = new Date(); console.log(pack(obj).length); (new Date() - d)/1000
1530597
0.036
> d = new Date(); console.log(packr.pack(obj).length); (new Date() - d)/1000
470056
0.008
> d = new Date(); console.log(serialize(obj).length); (new Date() - d)/1000
98214
0.016

Here msgpackr with records is ~70% smaller than plain msgpack, which is great, but dpack comes out over 90% smaller (!). Even post-gzip, the dpack output is almost half the size.
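For reference, those percentages work out from the byte counts above as follows (taking the three calls as plain msgpack, msgpackr with records, and dpack's serialize, per the description here):

```js
// Byte counts from the session above.
const plainMsgpack = 1530597;
const msgpackrRecords = 470056;
const dpackSerialize = 98214;

console.log(1 - msgpackrRecords / plainMsgpack); // ~0.69 -> ~70% smaller
console.log(1 - dpackSerialize / plainMsgpack);  // ~0.94 -> over 90% smaller
console.log(dpackSerialize / msgpackrRecords);   // ~0.21 of the msgpackr size
```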

From what I can gather, the difference is that dpack shares strings for each structure's properties, whereas msgpackr copies the values for each occurrence. Is there some way to get the dpack-like behaviour?

@kriszyp
Owner

kriszyp commented Sep 11, 2021

Yes, I have primarily focused on this (msgpackr) implementation rather than dpack over the last year, and I can give some comparisons. I don't know if my shifted focus counts as a "successor", but it is the successor within the application that I work on (in medical research).

From a basic perf/size comparison, msgpackr is generally much faster (typically over twice as fast in my tests, but YMMV), while dpack can usually achieve smaller sizes, depending on the type of data. For a more detailed comparison:

First, of course, it is hard to gain traction with a new data format, and it is nice to be using a format that is widely adopted and interoperable, so working with MessagePack has an obvious advantage over trying to forge a new path with dpack. msgpackr does use a new extension, but that is a very small "innovation" compared to a whole new format.

Second, again the msgpackr implementation is quite a bit faster than the dpack implementation. I actually wrote a blog post describing the performance techniques used in msgpackr:
https://dev.doctorevidence.com/building-the-fastest-js-de-serializer-a413a2b4fb72
Some of these performance improvements are related to the implementation itself, and some to the format.

With msgpackr, the "shared" data is limited entirely to whole record structures, and to reuse a structure, it must match a previously used structure exactly. This is actually very valuable for decoding performance, since a known class or object structure can be instantiated and then filled with property values. In dpack, shared data structures can be partially used. For example, a structure with "name" and "age" properties could be used to describe an object with just a "name" property. This is beneficial for more efficient encoding of object structures that are similar but have some optional properties, but it is much less performant for decoding, since the entire structure of the object is not known prior to decoding property values. I would suspect that msgpackr's upfront referencing of full shared structures would be advantageous in other languages as well, but I don't know enough about Python to know how this would play out there.
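To make that concrete, here is a minimal sketch of the idea (illustrative only, not msgpackr's actual wire format): repeated property-name lists are replaced by a small structure ID, and because reuse requires an exact match, a decoder can instantiate the known shape up front and just fill in values.

```js
// Map of "structure signature" (joined property names) -> structure ID.
const structures = new Map();

function structureIdFor(obj) {
  const keys = Object.keys(obj);
  const signature = keys.join(',');
  let id = structures.get(signature);
  if (id === undefined) {
    id = structures.size;
    structures.set(signature, id); // first occurrence: register the structure
    return { id, keys };           // ...and transmit the key list once
  }
  return { id };                   // later occurrences: transmit only the ID
}

// { name, age } and { name } have different signatures, so unlike dpack's
// partial reuse, the second shape gets its own structure entry.
console.log(structureIdFor({ name: 'a', age: 1 })); // { id: 0, keys: [...] }
console.log(structureIdFor({ name: 'b', age: 2 })); // { id: 0 }
console.log(structureIdFor({ name: 'c' }));         // { id: 1, keys: [...] }
```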

Dpack also differs by being a "character"-based encoding, where the basic unit is an encoded character rather than a byte. This type of encoding can make decoding strings significantly easier, since the character decoding can happen in a single pass, strings are specified by character length (rather than byte length), and a decoder can simply "slice" strings from the original text source to get individual string values. msgpackr has quite a bit of sophistication to be able to efficiently decode strings originating from a byte/binary source, including a native C node add-on (perhaps you are considering something similar with Cython). At least within Node, msgpackr has achieved considerable success in high-speed decoding, despite the challenges of decoding from bytes. Within the browser, the dpack approach has an advantage: there is no opportunity for native C add-ons that can directly interact with the V8 APIs (WASM can't help at all here), and the native JS TextDecoder API is terribly slow.
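Roughly, the difference for string decoding looks like this (a sketch, not either library's real decoder): a character-based format works over an already-decoded JS string and can slice values out by character length, whereas a byte-based format has to convert each byte range, e.g. with TextDecoder.

```js
// Character-based (dpack-style): the whole message is already one JS string,
// so a string value of known character length is just a slice.
function readStringByChars(text, offset, charLength) {
  return text.slice(offset, offset + charLength);
}

// Byte-based (MessagePack-style): the message is a byte buffer, so each
// string must be decoded from its byte range.
const decoder = new TextDecoder();
function readStringByBytes(bytes, offset, byteLength) {
  return decoder.decode(bytes.subarray(offset, offset + byteLength));
}
```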

The character-based encoding approach of dpack also has an advantage when it comes to compression, since it is designed to actually match common byte frequencies and work well with Huffman encoding, whereas MessagePack (and CBOR) have byte frequencies that largely defeat Huffman encoding.

But the byte-based approach is often much better for encoding numbers. In both performance and size, particularly for floating-point numbers, MessagePack (with msgpackr's use of 32-bit floats when possible) is usually much faster (for both encoding and decoding) and more compact than dpack's number serialization, which is the same as JSON's decimal string encoding.
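A quick back-of-the-envelope illustration of the size difference for a single number (assuming the value fits in a 32-bit float):

```js
// MessagePack float32 is a 1-byte marker plus 4 bytes of IEEE-754 single;
// a JSON/dpack-style decimal string costs one byte per character.
const value = 0.0625;                       // exactly representable in float32

const asFloat32 = 1 + 4;                    // marker + 4-byte float
const asDecimalText = String(value).length; // "0.0625" -> 6 bytes
console.log(asFloat32, asDecimalText);      // 5 6

// Something like 1234.56789 costs 10 bytes as text and grows with precision,
// while the binary form stays at 5 bytes (or 9 bytes for a float64).
```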

And yes, as you noticed, dpack is also capable of reusing string values, whereas msgpackr only shares record structures (a sequence of property names). I have generally thought that there is limited benefit to more extensive value reuse, since the LZ77 matching in gzip/brotli can largely reduce duplication, but apparently in your tests you have noticed better post-gzip sizes (although again that might be more due to the Huffman encoding advantages than to the duplication removal). And reusing string values certainly has an encoding performance cost, since you have to compare each string with the cache of previous string values (this uses a Map, which is reasonably fast, but it is still definitely a cost).
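For reference, a minimal sketch of the kind of string-value cache this implies on the encoding side (a hypothetical helper, not dpack's actual implementation); the Map lookup per string is the cost mentioned above:

```js
const stringRefs = new Map(); // string value -> reference ID

function encodeString(value, output) {
  const existing = stringRefs.get(value);
  if (existing !== undefined) {
    output.push({ ref: existing });      // back-reference instead of the bytes
  } else {
    stringRefs.set(value, stringRefs.size);
    output.push({ str: value });         // first occurrence: the full string
  }
}

const out = [];
['active', 'active', 'inactive'].forEach((s) => encodeString(s, out));
console.log(out); // [ {str:'active'}, {ref:0}, {str:'inactive'} ]
```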

As you may have noticed, I have tried to jointly develop this library for MessagePack and CBOR. For CBOR, there is an (in-progress) spec for "packed" CBOR which does support reuse of string values. I have been meaning to explore an implementation of packed CBOR to compare its performance/size characteristics, although again I have tended to believe that traditional compression algorithms will tend to perform better for reducing duplication than JS-based serialization algorithms.

Dpack also has the ability to handle lazy parsing, where sub-blocks can be parsed on-demand. This could probably be done with an extension for MessagePack as well, but I don't think the complexity overhead has necessarily been worth it.

As far as how we are currently using msgpackr: we use it to store data in LMDB in Node.js, and larger structures are compressed with LZ4. This combination has been a great balance of performance and size for us. We are actually not using MessagePack for HTTP/browser communication. We have used dpack for that before (with some noticeable benefits for certain large documents), but with such extensive and optimized native support in the browser for decompression and JSON parsing, it is just generally too hard to beat the speed and ease of Brotli+JSON for HTTP responses (which is where Huffman encoding comes into play).

Anyway, I'd be glad to help or answer any questions if you do make a Python implementation.

@aquasync
Author

Thanks for the info! I've not come across serialization approaches trying to leverage the emergent structure before (akin to what V8 does for objects) - it's a neat approach. I'd been considering switching to protobuf/capnproto or similar to reduce my message sizes, but that loses the schema-free flexibility.

What you're saying all makes sense - a direct binary encoding (including binary for numbers) sounds preferable, as does building on the foundation of something like msgpack. The packed CBOR proposal sounds interesting (I take it that is this - https://www.ietf.org/id/draft-ietf-cbor-packed-03.html?). Value sharing there is also more general, as it can be across values for different properties, vs dpack's same-property sharing.

I agree it is unclear what should be left to compression vs the serialization re value sharing. For Python I think dpack's value sharing will be beneficial beyond just reduced size, as I can directly reuse the string object (which will be faster than creating copies with PyUnicode_FromStringAndSize and will have better memory usage). Perhaps I could try adding a msgpack extension for shared strings and combine with msgpackr's structure extension?

I've done a quick & (very) dirty port of the dpack parser to Python (see https://gist.github.com/aquasync/a7a78a694933797344831be80683f0aa) - not idiomatic at all, as I'm trying to understand the types etc. better. This'll be too slow for real use (no JIT), so I'd probably end up lowering the whole thing to Cython or C++. Note that I haven't tackled the deferred reads (I take it that is for the on-demand block parsing you mention?), nor pause/resume (is that for partial reads on network streams?).

@kriszyp
Owner

kriszyp commented Sep 12, 2021

The packed CBOR proposal sounds interesting (I take it that is this - https://www.ietf.org/id/draft-ietf-cbor-packed-03.html?).

Yes

Value sharing there is also more general, as it can be across values for different properties, vs dpack's same-property sharing.

Yes, the downside compared to dpack is that the shared values must be declared upfront, with extra bytes for referencing, so CBOR packed essentially requires two passes: first identify which string values are used multiple times and are worth sharing, then serialize (or, if using a progressive shared dictionary, shared values can only be applied to the next serialization).
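A sketch of that two-pass idea in JS (illustrative only, not the actual CBOR-packed wire format or API): the first pass counts string occurrences, and the second pass emits a shared table plus back-references for anything used more than once.

```js
// Pass 1: count how often each string value occurs.
function collectStrings(value, counts = new Map()) {
  if (typeof value === 'string') {
    counts.set(value, (counts.get(value) || 0) + 1);
  } else if (Array.isArray(value)) {
    value.forEach((v) => collectStrings(v, counts));
  } else if (value && typeof value === 'object') {
    Object.values(value).forEach((v) => collectStrings(v, counts));
  }
  return counts;
}

// Pass 2: build a shared table of repeated strings, then serialize with
// references into it (only string *values* are shared here, not keys).
function packWithSharedStrings(value) {
  const counts = collectStrings(value);
  const shared = [...counts.keys()].filter((s) => counts.get(s) > 1);
  const table = new Map(shared.map((s, i) => [s, i]));
  const encode = (v) => {
    if (typeof v === 'string') return table.has(v) ? { ref: table.get(v) } : v;
    if (Array.isArray(v)) return v.map(encode);
    if (v && typeof v === 'object') {
      return Object.fromEntries(Object.entries(v).map(([k, x]) => [k, encode(x)]));
    }
    return v;
  };
  return { shared, body: encode(value) };
}

console.log(packWithSharedStrings({ a: 'x', b: 'x', c: 'y' }));
// -> { shared: ['x'], body: { a: { ref: 0 }, b: { ref: 0 }, c: 'y' } }
```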

Perhaps I could try adding a msgpack extension for shared strings and combine with msgpackr's structure extension?

Sure, although if using extensions, I wonder whether it is better to just use CBOR's packed protocol rather than creating new ones.

I've done a quick & (very) dirty port of the dpack parser to Python (see https://gist.github.com/aquasync/a7a78a694933797344831be80683f0aa)

Very cool!

so I'd probably end up lowering the whole thing to Cython or C++.

FWIW, I have actually experimented with moving various parts of msgpackr to C++, and almost anything that has to interact with JS objects (through V8 APIs) is significantly slower in C++ than in JS (as surprising as that may be). The only thing that ended up being beneficial was bulk string conversions. Of course, that may be because V8 JS is so fast to begin with (and I think much faster than Python), so those comparisons may not translate to Python/Cython.

Note that I haven't tackled the deferred reads (I take it that is for the on-demand block parsing you mention?), nor pause/resume (is that for partial reads on network streams?).

Yes and yes.
