compress/zstd: add new package #62513

Open
dsnet opened this issue Sep 7, 2023 · 57 comments


@dsnet
Member

dsnet commented Sep 7, 2023

Zstandard (RFC 8878) is well-positioned to replace GZIP (RFC 1952) as the de facto compression format.

Zstandard:

Given the likely ubiquitous place Zstandard is going to have in the future, I propose first-class adoption of Zstandard in the stdlib.

Existing Go implementations:

Some goals for this package:

  • The API should be simple and similar to existing packages like compress/gzip.
  • It should not use unsafe or assembly, in keeping with the existing compress packages.
    • Performance is always nice, but Go community surveys consistently show security as more important than performance.
    • Users who just need a decently fast Zstandard implementation can use compress/zstd, while those who want the best performance and/or advanced features of Zstandard can use @klauspost's package. Any stdlib package (e.g., net/http) that makes use of compress/zstd should make it possible to swap in a different zstd implementation (see the sketch after this list).
    • Note that the same is already true of compress/gzip, which is generally good enough (although there is still room for performance optimizations), while power users can use github.com/klauspost/compress/gzip, which is much faster (as it makes use of assembly).
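
As a hypothetical illustration of the swap-over point (a sketch only; none of these names exist today), a consuming package can accept a decompressor constructor instead of hard-wiring one implementation:

// Hypothetical: callers supply whichever zstd implementation they prefer.
type NewDecompressor func(r io.Reader) (io.ReadCloser, error)

func process(src io.Reader, newDec NewDecompressor) error {
	zr, err := newDec(src)
	if err != nil {
		return err
	}
	defer zr.Close()
	_, err = io.Copy(io.Discard, zr) // consume the decompressed stream
	return err
}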

We could:

  • adopt @klauspost's pure-Go (i.e., no assembly) implementation with a cleaned-up API
  • extend @ianlancetaylor's work to support compressing
  • do a combination of both

Related issues:

@dsnet
Member Author

dsnet commented Sep 7, 2023

Some comments on other popular compression formats and why Zstandard makes the cut, but not others:

  • Brotli (RFC 7932) has many of the benefits of Zstandard, but has several detriments:
    • It uses a 120 KiB static dictionary, which must be linked into every Go binary that uses Brotli. This doesn't suit Go well, which always aims for static binaries and avoids shared libraries.
    • The overall design seems geared towards web data, while Zstandard is more general purpose.
    • There is no framing format for Brotli, which makes its use as a standalone file format somewhat limited.
  • XZ (or LZMA) has great compression ratios, but is really slow for compression and decompression. There is no specification.
  • bzip2 has fairly good compression ratios, but is slow for compression and decompression. There is no specification. While there exists a decompression implementation in the stdlib (i.e., compress/bzip2), I don't believe bzip2 should have made the cut for inclusion into the stdlib.
  • snappy is heavily geared towards very high speed compression and decompression, but sacrifices compression ratio. Somewhat of a niche use case.
  • lz4 aims for similar goals to snappy.

Most compression formats (other than bzip2) are a combination of LZ77 with entropy encoding (and maybe other tricks, like Markov chains for LZMA). The entropy encoding that GZIP uses is Huffman coding, which can't compress below 1 bit per byte. In terms of compression ratio, arithmetic coding (which LZMA uses), range coding, and ANS (which Zstandard uses) provide better ratios. ANS has the benefit that it overcomes the limitations of Huffman coding, is patent-unencumbered (to my knowledge), and is also very fast. It seems unlikely that a better entropy encoding than ANS will be discovered in the next decade.
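
To make the Huffman limitation concrete (my own illustration, not from the proposal): for a heavily skewed source, Huffman still spends a whole bit per symbol, while the entropy limit that ANS and arithmetic coders can approach is far lower.

package main

import (
	"fmt"
	"math"
)

func main() {
	// Two-symbol source: p(A)=0.95, p(B)=0.05.
	p := []float64{0.95, 0.05}
	h := 0.0
	for _, x := range p {
		h -= x * math.Log2(x) // Shannon entropy in bits/symbol
	}
	fmt.Printf("entropy: %.3f bits/symbol\n", h) // ~0.286
	// Huffman assigns whole-bit code lengths, so its best rate here
	// is 1.0 bit/symbol; ANS can get close to 0.286 bits/symbol.
}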

IMO, the lack of a formal specification should bar inclusion in stdlib as it makes "correctness" somewhat undefined.

@ianlancetaylor
Contributor

IMO, the lack of a formal specification should bar inclusion in stdlib as it makes "correctness" somewhat undefined.

Makes sense but I have to say that I'm not sure that anybody could write a correct zstd decompressor based solely on reading RFC 8878.

@dsnet
Member Author

dsnet commented Sep 7, 2023

I'm not sure that anybody could write a correct zstd decompressor based solely on reading RFC 8878.

Heh... you're the expert there, having done it. I had written a Brotli decompressor solely from the draft RFC without looking at the implementation. This effort led to a number of updates to the draft so that it would be unambiguous when finalized.

@Jorropo
Member

Jorropo commented Sep 7, 2023

To add to the list of good points: zstd has an extremely wide range of levels available. Depending on the compressor options, you can get lz4- or gzip-like ratios with very low resource usage.
Or, by turning up a few knobs and/or changing a few algorithms in the compressor, you can get ratios slightly below xz while remaining transparently compatible with the decompressor (and using fewer resources than xz).

@klauspost
Contributor

@dsnet Just to clarify some mistakes in your initial post.

Neither zstd compression (nor deflate, for that matter) uses any assembly or "unsafe". The assembly is used for decompression and can be fully disabled with a "noasm" tag, so it is trivial to exclude.

the API is somewhat complex

Could you elaborate on what you find particularly complex?

I expect the stdlib would have the multithreading pulled out and only implement a fully synchronous mode, as well as possibly omitting dictionary features. Without that the API isn't really more complex than flate, gzip and similar. Without concurrency you can have the options on the Encoder (or Writer, whatever it is called), since you don't have any sharing to care about.

security as more important

I could interpret this as an underhanded way of implying my library isn't safe to use. Probably not the intention, but just to be clear: there is continuous fuzzing running, and third-party security review is being done.

Considering it uses neither assembly nor unsafe, I think it is fair to say that battle-tested code is more secure than freshly written code. If you want to write it, I will of course not take that fun from you - but to wrap it in an argument of "security" seems disingenuous, or at the very least not well researched, to me.

I do think that features like concurrent (de)compression and possibly dictionaries should be left out of the stdlib. I think having EncodeAll([]byte) []byte on an encoder is a good way to control allocations - and without concurrency it will be easier to control anyway.
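
Roughly, the EncodeAll/DecodeAll pattern looks like this with the current github.com/klauspost/compress/zstd package (error handling elided):

// One encoder/decoder pair, reused for many buffer-to-buffer operations.
enc, _ := zstd.NewWriter(nil)         // writer is nil: only EncodeAll is used
compressed := enc.EncodeAll(src, nil) // appends to the second argument

dec, _ := zstd.NewReader(nil)
plain, _ := dec.DecodeAll(compressed, nil)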

@dsnet
Member Author

dsnet commented Sep 8, 2023

@klauspost, thanks for your thoughts.

I expect the stdlib would have the multithreading pulled out and only implement a fully synchronous mode, as well as possibly omitting dictionary features

Correct. There should be no usage of goroutines in the stdlib implementation. The example in klauspost/compress#479 should be perfectly thread-safe (as it is with the other stdlib packages).

One "dictionary" feature we probably want to keep is the ability to specify the maximum window size.

I could interpret this as an underhanded way of implying my library isn't safe to use. Probably not the intention,

Not at all the intention; I use your package quite extensively myself.

That said, the stdlib avoids the use of assembly except for a few select places. Assembly can be a challenge to review and maintain. Continuous fuzzing is great, but it takes a long time to build confidence in the implementation. Fuzzing also doesn't help when reviewing whether a change to the assembly implementation is correct.

I should also mention that the asynchronous nature of zstd.NewWriter has actually lent itself to race conditions that could then lead to security problems. To be clear, it's not a race in zstd, but the API design made it easy for users to hold it wrong. In general, library code should be synchronous (from the perspective of the caller) unless the naming or the functionality makes it obvious that it operates in an asynchronous manner.

Considering it uses neither assembly nor unsafe, I think it is fair to say that battle-tested code is more secure than freshly written code.

I'm confused. Just above, you mentioned that assembly is used in your package for decompression, but here you're saying it doesn't use assembly? Did you mean if you built it under noasm?

If you want to write it, I will of course not take that fun from you - but to wrap it in an argument of "security" seems disingenuous, or at the very least not well researched, to me.

I'm not itching to write a zstd implementation from scratch. My end goal is to see support for it in the stdlib. In my proposal, I deliberately didn't make a statement about how this would come to be fulfilled. Using the pure Go (i.e., no assembly) implementation of your package would be a reasonable possibility.

I think having EncodeAll([]byte) []byte on an encoder is a good way to control allocations - and without concurrency it will be easier to control anyway.

I personally find it rather strange that the Encoder is either a streaming encoder or essentially a holder for encoder settings and shared resources to use with EncodeAll.

Given the acceptance of #54078, I would rather see something like:

func AppendCompress(dst, src []byte, level int) ([]byte, error)
func AppendDecompress(dst, src []byte, maxSize int) ([]byte, error)
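
A hypothetical call site for these (sketch only; neither function exists yet):

// Compress at the default level, then round-trip with a size cap
// to guard against decompression bombs.
compressed, err := zstd.AppendCompress(nil, data, zstd.DefaultCompression)
if err != nil {
	// handle error
}
plain, err := zstd.AppendDecompress(nil, compressed, 10<<20) // 10 MiB cap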

@klauspost
Contributor

One "dictionary" feature we probably want to keep is the ability to specify the maximum window size.

Yes. That is an essential DoS mitigation. Though when I refer to dictionaries, I only mean predefined dictionaries that seed the first blocks with back-references and predefined entropy coders.
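
For context, the existing github.com/klauspost/compress/zstd package exposes this mitigation as a decoder option; a minimal sketch:

// Cap decoder memory so hostile input declaring a huge window
// cannot force large allocations.
dec, err := zstd.NewReader(input, zstd.WithDecoderMaxMemory(64<<20)) // 64 MiB
if err != nil {
	return err
}
defer dec.Close()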

I'm confused. Just above, you mentioned that assembly is used in your package for decompression, but here you saying it doesn't use assembly? Did you mean if you built it under noasm?

I assumed you were only looking for compression given there already is an internal decompressor, which I unfortunately haven't had the time to test. But since everything is marked by tags, taking it out is extremely easy.

I should also mention the asynchronous nature zstd.NewWriter has actually lended itself to race conditions that could then lead to security problems.

I am not sure how you come to that. If you use it for streaming, you can only use it for one stream - that isn't different than a sync Writer. If you want writes to only happen when you are calling Write, yes, you need to disable concurrency - but that is an assumption you are making that doesn't hold. The documentation should be pretty clear on what you can expect. Either way, it is a non-issue since the concurrency should go.

Both the compressor and the decompressor have an "async" and "sync" code path. Ripping out the async one is a minor task. Of course some cleanup can probably be done, but that is cosmetics.

func AppendCompress(dst, src []byte, level int) ([]byte, error)
func AppendDecompress(dst, src []byte, maxSize int) ([]byte, error)

So you want each of them to allocate, or try to hide the allocs with internal sync.Pools?

EncodeAll/DecodeAll are basically this (they also append), but they keep the allocs within the Reader/Writer, and buffers, etc. can be allocated with settings for the compression level needed.

Being able to use the same Reader/Writer for concurrent Encode/Decode operations would have to go, but it would at least give control over how each is reused. Side note: AppendCompress shouldn't need to return an error.

@dsnet
Member Author

dsnet commented Sep 8, 2023

So you want each of them to allocate, or try to hide the allocs with internal sync.Pools?

Stack allocate if possible, reusing the destination buffer if possible (e.g., we don't need to maintain the LZ77 window for decompression since it can be obtained from dst itself), and otherwise sync.Pool.
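
A rough sketch of how that could look inside a hypothetical AppendDecompress (decoderState and its decompress method are invented for illustration):

var decPool = sync.Pool{New: func() any { return new(decoderState) }}

func AppendDecompress(dst, src []byte, maxSize int) ([]byte, error) {
	d := decPool.Get().(*decoderState)
	defer decPool.Put(d)
	// The LZ77 window lives in dst itself: matches copy from the
	// already-produced output, so no separate window buffer is needed
	// for single-shot decompression.
	return d.decompress(dst, src, maxSize)
}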

@dsnet
Member Author

dsnet commented Sep 8, 2023

To comment more on EncodeAll being hung off as a method, where the pooling occurs within an Encoder: in a sufficiently large program, I'm observing multiple instances of global Encoder variables in various packages that each use more or less the same settings. It is somewhat silly that the resources are not shared across packages, since resource pooling occurs within each Encoder.

@silverwind

silverwind commented Sep 8, 2023

While zstd is a noble goal, I would like to highlight that it still has zero support in browsers as of today. So for anyone looking to use a more efficient Content-Encoding for HTTP than gzip, brotli is still the go-to:

@Jorropo
Member

Jorropo commented Sep 8, 2023

@silverwind I think this is more relevant for #62492. ZSTD has found many uses outside Content-Encoding already.

@cespare
Contributor

cespare commented Sep 9, 2023

To offer a bit of commentary on how it would be used: at my company Zstandard is becoming our default general-purpose compression algorithm because it's strictly better than gzip on all the dimensions we care about. There are cases where we would decide to use gzip (for compatibility) or snappy (for very high throughput) but in the absence of specific requirements, Zstandard is the go-to.

@silverwind

@silverwind I think this is more relevant for #62492. ZSTD has found many uses outside Content-Encoding already.

Yes, just wanted to mention it. I'm in favor of adding any of the "better than gzip" mechanisms to the standard library, both for HTTP transfer and for go:embed compression, which as of today does not even support gzip.

@klauspost
Contributor

So you want each of them to allocate, or try to hide the allocs with internal sync.Pools?

Stack allocate if possible, reusing the destination buffer if possible (e.g., we don't need to maintain the LZ77 window for decompression since it can be obtained from dst itself), and otherwise sync.Pool.

Correct (and it already uses dst instead of a dedicated history). I can see the single-use decoder working statelessly with hidden internal buffers (for things like literal decompression and entropy decoders, for example).

For encoding you have rather big hash tables to deal with, which will put quite some pressure on the stack. Level 1 has 256 KB, Level 2 has 1.2 MB, Level 3 has 4 MB. So these would also need some internal pooling, since keeping them on the stack and forcing zeroing on each call wouldn't be very optimal. Again, entropy/literal coders would need internal pools, since they are rather expensive to set up.

Doable, but I am not generally a fan of having a bunch of pools on a package level.

Streams will of course be stateful, so ignoring those for now.

That said, I don't really see any problem in separating single-shot state from the Reader/Writer state and having an internal pool for those.

To comment more on EncodeAll being hung off as a method, where the pooling occurs within an Encoder: in a sufficiently large program, I'm observing multiple instances of global Encoder variables in various packages that each use more or less the same settings. It is somewhat silly that the resources are not shared across packages, since resource pooling occurs within each Encoder.

Sure. I can see that. I try to avoid global package state. That is the main reason I haven't added it, since it would be rather annoying if one bad actor could cause problems for others.

A global "pool" is nice for resource control. It is pointless to run more than GOMAXPROCS EncodeAll/DecodeAll concurrently, since each will consume exactly one thread fully until it is done. That has a "play nice" feel for preemption, and limiting this below GOMAXPROCS will of course make sure that compression/decompression will never oversaturate a system.

Without concurrency, the main "annoyance" of the current API - that you have to Close to release resources - goes away. It is not that different from many other things, but it is too easy to forget and cause leaks.

So yeah, I can see that as a way of simplifying things. Let me know if I understood your ideas clearly.

@klauspost
Contributor

klauspost commented Sep 25, 2023

@dsnet I tested the internal/zstd - there are a bunch of problems in there, and some differences in which blocks are accepted compared to zstd. This is mostly the snappy test set: input files, zstd compressed.

BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst/unbuffered-32                    267           4490401 ns/op         328.38 MB/s         174 B/op          1 allocs/op
(Failed - zstd decompression error at 41415: offset past window)

BenchmarkDecoder_DecoderSmall/geo.protodata.zst/unbuffered-32               1107           1064937 ns/op         890.85 MB/s          59 B/op          1 allocs/op
BenchmarkDecoder_Internal/geo.protodata.zst-32                               669           1801050 ns/op         526.75 MB/s      554226 B/op         31 allocs/op

BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst/unbuffered-32                  81          14196832 ns/op         271.53 MB/s         217 B/op          1 allocs/op
(Failed - zstd decompression error at 106901: offset past window)

BenchmarkDecoder_DecoderSmall/lcet10.txt.zst/unbuffered-32                   100          10505848 ns/op         324.96 MB/s         173 B/op          1 allocs/op
(Failed - zstd decompression error at 89343: offset past window)

BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst/unbuffered-32                 327           3645659 ns/op         274.69 MB/s         150 B/op          1 allocs/op
BenchmarkDecoder_Internal/asyoulik.txt.zst-32                                188           6349734 ns/op         157.71 MB/s      615777 B/op         31 allocs/op

BenchmarkDecoder_DecoderSmall/alice29.txt.zst/unbuffered-32                  255           4717756 ns/op         257.90 MB/s          97 B/op          1 allocs/op
(Failed - zstd decompression error at 57404: offset past window)

BenchmarkDecoder_DecoderSmall/html_x_4.zst/unbuffered-32                     550           2159054 ns/op        1517.70 MB/s         109 B/op          1 allocs/op
(Failed - zstd decompression error at 14867: offset past window)

BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst/unbuffered-32              5551            211883 ns/op        3866.29 MB/s          48 B/op          1 allocs/op
BenchmarkDecoder_Internal/paper-100k.pdf.zst-32                             2862            436787 ns/op        1875.51 MB/s      672010 B/op         30 allocs/op

BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst/unbuffered-32             10000            101969 ns/op        9657.29 MB/s          48 B/op          1 allocs/op
BenchmarkDecoder_Internal/fireworks.jpeg.zst-32                             9979            149482 ns/op        6587.71 MB/s      131808 B/op          3 allocs/op

BenchmarkDecoder_DecoderSmall/urls.10K.zst/unbuffered-32                      94          12117244 ns/op         463.53 MB/s         405 B/op          1 allocs/op
(Failed - zstd decompression error at 74547: offset past window)

BenchmarkDecoder_DecoderSmall/html.zst/unbuffered-32                        1040           1153069 ns/op         710.45 MB/s          80 B/op          1 allocs/op
BenchmarkDecoder_Internal/html.zst-32                                        592           2033895 ns/op         402.77 MB/s      557693 B/op         33 allocs/op

BenchmarkDecoder_DecoderSmall/comp-data.bin.zst/unbuffered-32              14134             84502 ns/op         385.89 MB/s          48 B/op          1 allocs/op
BenchmarkDecoder_Internal/comp-data.bin.zst-32                              7734            158308 ns/op         205.98 MB/s       32780 B/op         19 allocs/op

Assembly disabled. Using streaming interface. Multithreading disabled. Bytes() []byte interface disabled. I could not use the Reset(...) function, since it doesn't provide correct output when used (see below).

So "not great, not terrible" applies it seems. With assembly the difference grows significantly. Also, I did find some bugs while testing casually.

Bugs found

I ran my fuzz set through the decoder and found some differences/bugs. I don't really know where to report them (should I open an issue?), so I collapsed them here, since they aren't super relevant for this issue.

Here are a bunch of samples: zstd-internal-problems.zip

  • test-diverge-N.zst is the input file.
  • test-diverge-N.txt contains the error message returned by the library.
  • test-diverge-N.got is the output from zstd commandline.

offset past window & invalid offset

There is a logic problem somewhere and this check fails on valid data (as can be seen above).
I suspect in some cases it ends up with wrong offsets (maybe repeats? - I can provide decoded sequence output if you want)

unexpected EOF

It seems there are a bunch of unexpected EOFs that shouldn't really be unexpected. Not sure if it can happen on actual output of zstd.

invalid checksum: got A want B

Not sure if this is because of the offset problem - seems likely.

dictionaries are not supported

Dictionary ID 0 is the same as "no dictionary"; the decoder should ignore it.

no sequences and no literals

This is allowed - even if it is a bit silly. Check against latest versions - this is something I had clarified while testing.

(*Reader).Reset is not working

I was unable to make (*Reader).Reset work as intended; it kept spitting out errors and wrong content. I suspect it doesn't reset all state.

@AlexanderYastrebov
Contributor

All failures for https://github.com/klauspost/compress/blob/master/zstd/testdata/benchdecoder.zip are fixed when the window size is configured according to the RFC instead of zero.

When Single_Segment_Flag is set, Window_Descriptor is not present. In this case, Window_Size is Frame_Content_Size

I've created #63224 to address this.

AlexanderYastrebov added a commit to AlexanderYastrebov/go that referenced this issue Sep 25, 2023
@gopherbot
Contributor

Change https://go.dev/cl/531075 mentions this issue: internal/zstd: configure window size for single segment frames

AlexanderYastrebov added a commit to AlexanderYastrebov/go that referenced this issue Sep 25, 2023
@ianlancetaylor
Contributor

@klauspost Thanks for the bug reports, @AlexanderYastrebov Thanks for the quick fix.

@klauspost Please do feel free to open bug reports against the code on this issue tracker. Or I'll take a look either way.
Thanks.

AlexanderYastrebov added a commit to AlexanderYastrebov/go that referenced this issue Sep 26, 2023
Fix copyFromWindow when match extends past initial buffer size

For golang#62513
@jonjohnsonjr
Contributor

I wonder if that would be more appropriate for one of the zstd implementations outside of the standard library.

Also totally reasonable! I mostly wanted to bring up the parallel between gzip's Multistream API and zstandard framing in case we wanted to match what compress/gzip is doing, but I'm also fine with keeping a very conservative API given that the average stdlib user just wants a reader/writer.

Are there more examples of what skippable frames get used for and how programs would like to process them?

Two other examples I've found:

  1. Storing a "seek table" for seekable zstandard streams (looks very similar to what Red Hat is doing):
    https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md

  2. A proposal to use them for extensions in zstd-compressed WARC files: https://iipc.github.io/warc-specifications/specifications/warc-zstd/

The second one doesn't feel quite "real", but the first seems like something that might be worth supporting eventually. I think as a first pass, we should probably just ignore skippable frames, but if we wanted to support seekable zstandard format, we could expose an io.ReaderAt version of this, similar to what I'm doing with (very experimental) gsip.
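
For what it's worth, ignoring skippable frames is cheap. Per RFC 8878 they begin with a magic number in the range 0x184D2A50..0x184D2A5F, followed by a 4-byte little-endian frame size and that many bytes of user data. A minimal sketch of a skipper:

// skipSkippableFrame reports whether r is positioned at a skippable
// frame and, if so, discards the whole frame.
func skipSkippableFrame(r *bufio.Reader) (bool, error) {
	hdr, err := r.Peek(8)
	if err != nil {
		return false, err
	}
	magic := binary.LittleEndian.Uint32(hdr[0:4])
	if magic&0xFFFFFFF0 != 0x184D2A50 {
		return false, nil // not a skippable frame
	}
	size := binary.LittleEndian.Uint32(hdr[4:8])
	if _, err := r.Discard(8 + int(size)); err != nil {
		return false, err
	}
	return true, nil
}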

@dsnet
Member Author

dsnet commented Jul 25, 2024

Overall the API looks fine, but I wonder if we should keep it minimal for the initial release:

package zstd

const (
	NoCompression = 0
	BestSpeed = 1
	BestCompression = 9
	DefaultCompression = -1
)

type Reader struct { /* ... unexported fields ... */ }
func NewReader(r io.Reader) (*Reader, error)
func (z *Reader) Reset(r io.Reader) error
func (z *Reader) Read(p []byte) (int, error)
func (z *Reader) Close() error

type Writer struct { /* ... unexported fields ... */ }
func NewWriter(w io.Writer) *Writer
func (z *Writer) Reset(w io.Writer)
func (z *Writer) SetLevel(int)
func (z *Writer) Write([]byte) (int, error)
func (z *Writer) Flush() error
func (z *Writer) Close() error

In particular, I removed:

  • Methods that implement io.ReaderFrom and io.WriterTo. These sound fine, but it seems that we should add them across all the "compress/..." packages rather than just "zstd" (perhaps as a separate proposal?).
  • Any sentinel errors. It's unclear to me at the moment what the right API is for proper error classification. For prior precedent on problems with "gzip"'s own sentinel errors, see proposal: compress/gzip: return different error for trailing garbage #61797.
  • Any API for dictionaries. This is a nice feature of "zstd", but I suspect it can be added later. The latest discussion also shows that we're not sure what the right API is. Given that dictionaries have tables and structure to them, perhaps they should be a separate Dictionary type that you can set on a reader and writer. The construction of Dictionary (e.g., via BuildDictionary) would validate the dictionary (and return an error if relevant) and also do any processing work up front (and can therefore be cached).

IMO, the biggest early decision that needs to be settled is whether we go with:

  1. Variadic options similar to @klauspost's own "zstd" package
  2. The setter pattern being proposed by @rsc, or
  3. An options struct

I have a slight preference for variadic options over the setter pattern, but don't feel strongly. I find options structs more cumbersome to work with, especially when zero options is the common case.
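
To make the three shapes concrete (all names hypothetical; the three NewWriter signatures are mutually exclusive alternatives):

// 1. Variadic options:
w1 := zstd.NewWriter(dst, zstd.WithLevel(zstd.BestCompression))

// 2. Setter pattern:
w2 := zstd.NewWriter(dst)
w2.SetLevel(zstd.BestCompression)

// 3. Options struct:
w3 := zstd.NewWriter(dst, &zstd.Options{Level: zstd.BestCompression})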

@klauspost
Contributor

@dsnet I think you nailed it 💯 - I have no relevant opinions on your questions.

@mateusz834
Member

mateusz834 commented Jul 26, 2024

@dsnet Maybe it was mentioned before, but why is the level a plain int rather than a defined type?

@dsnet
Member Author

dsnet commented Jul 26, 2024

I replicated @rsc's proposal in that regard, and I believe he's just copying what the "flate" and "gzip" packages do. IMO, the main utility of a named type depends on the options pattern we go with. If we go with variadic options, there is a benefit to having it be a named type that implements the options type, so that you can pass it in directly without going through an options constructor.

@rsc
Contributor

rsc commented Jul 31, 2024

Re API, I think we want compress/zstd to be a mostly drop-in replacement for compress/flate except for the bytes it generates. So I think having dictionaries is important. A "raw dictionary" is meant to handle what https://datatracker.ietf.org/doc/html/rfc8878#name-dictionary-format refers to as a "raw content" dictionary and most closely matches compress/flate. But then we also need the zstd-format dictionaries.

As for variadic vs setters, setters seem to work fine in the other compress packages. The variadic options can easily get overwhelming, and I think they're better avoided here.

The #62513 (comment) example for skippable frames makes clear that for starters we can have the decompressor just skip over them. We could also add an API that asks "am I now at a skippable frame?" that could be called after every read, or we could have a Reader mode that stops at a skippable frame with ErrAtSkippable or something like that. (Plus we'd need a way to get its data.) But neither of those would actually help the Facebook seekable streams, so we should probably just leave them out for now, until more examples arise.

I don't mind leaving out ReadFrom/WriteTo. For some reason I thought compress/flate had them, but it doesn't.

So that would make the API:

package zstd

// These match flate and gzip constants, intentionally,
// but the package does not import flate.
const (
	NoCompression = 0
	BestSpeed = 1
	BestCompression = 9
	DefaultCompression = -1
)

// Unexported for now; speak up if you think they should be exported.
var (
	errChecksum = errors.New("zstd: invalid checksum")
	errHeader = errors.New("zstd: invalid header")
	errCorrupt = errors.New("zstd: corrupt data")
)

type Reader struct {
	... unexported fields ...
}

func NewReader(r io.Reader) (*Reader, error)

func (z *Reader) Reset(r io.Reader) error

func (z *Reader) AddDict([]byte) error
func (z *Reader) SetRawDict([]byte)

func (z *Reader) Read(p []byte) (int, error)

func (z *Reader) Close() error

type Writer struct {
	... unexported fields ...
}

func NewWriter(w io.Writer) *Writer

func (z *Writer) Reset(w io.Writer)

func (z *Writer) SetLevel(int)
func (z *Writer) AddDict([]byte) error
func (z *Writer) SetRawDict([]byte)

func (z *Writer) Write([]byte) (int, error)

func (z *Writer) Flush() error
func (z *Writer) Close() error

Feedback still welcome.

@klauspost
Contributor

klauspost commented Aug 5, 2024

@rsc FWIW, I have never seen "raw dictionaries" used, nor had them requested. Only the "structured dictionaries" - and those at a low rate. So I am not sure they need to be part of the initial API.

func (z *Writer) AddDict([]byte) error

I assume this adds a structured dictionary. This will mainly be useful for small encodes - in the 0–64 KB range. As I briefly went into above, for this data range the dictionary processing will pose a significant overhead if it happens on every encode.

If the dictionary persisted across Reset calls, it would be feasible, but a bit clumsy to use (New -> AddDict -> use -> sync.Pool.Put -> sync.Pool.Get -> Reset) - forcing one encoder pool per dictionary.

For similar cases, I have gone for the type Dict struct {... unexported ...} approach, where the dictionary gets parsed once and can be used for encodes.

func ParseDict([]byte) (*Dict, error) and for the raw dictionaries it would be func MakeRawDict([]byte) (*Dict, error). The dictionary would then know which internal states should be modified on the Writer when used.

With these functions, a func (z *Writer) SetDict(*Dict) would be a pointer copy, and the dictionary can keep internal data structures, pre-indexed for different compression levels, with only one instance needed per dictionary. The nil/zero value would mean "no dictionary". The Reader would have an equivalent interface.

FWIW, I used a similar approach with S2, which has worked out fine. The dictionary also offers one-shot []byte -> []byte encode/decode, mostly for convenience.

@rsc
Contributor

rsc commented Aug 7, 2024

I think we need to keep RawDict because it makes it easier to transition from compress/flate (or zlib) to compress/zstd.

I don't mind ParseDict+AddDict(*Dict). As I understand it, you are allowed to have multiple dicts for a reader/writer to choose from, so AddDict not SetDict.

@Jorropo
Member

Jorropo commented Aug 7, 2024

I have never seen "raw dictionaries" used, nor had them requested

I use them via your lib. I didn't request them because they are already there and work, so 🎉.
I use them as one of the primitives to do delta updates between two versions of the same file.

@rsc
Contributor

rsc commented Aug 14, 2024

It sounds like the revised API would be as follows. Do I have this right?

package zstd

// These match flate and gzip constants, intentionally,
// but the package does not import flate.
const (
	NoCompression = 0
	BestSpeed = 1
	BestCompression = 9
	DefaultCompression = -1
)

// Unexported for now; speak up if you think they should be exported.
var (
	errChecksum = errors.New("zstd: invalid checksum")
	errHeader = errors.New("zstd: invalid header")
	errCorrupt = errors.New("zstd: corrupt data")
)

type Dict struct { ... }
func ParseDict(enc []byte) (*Dict, error)
func (*Dict) MarshalBinary() ([]byte, error)
func (*Dict) UnmarshalBinary([]byte) error

type Reader struct {
	... unexported fields ...
}

func NewReader(r io.Reader) (*Reader, error)

func (z *Reader) Reset(r io.Reader) error

func (z *Reader) AddDict(*Dict)
func (z *Reader) SetRawDict([]byte)

func (z *Reader) Read(p []byte) (int, error)

func (z *Reader) Close() error

type Writer struct {
	... unexported fields ...
}

func NewWriter(w io.Writer) *Writer

func (z *Writer) Reset(w io.Writer)

func (z *Writer) SetLevel(int) error
func (z *Writer) AddDict(*Dict)
func (z *Writer) SetRawDict([]byte)

func (z *Writer) Write([]byte) (int, error)

func (z *Writer) Flush() error
func (z *Writer) Close() error
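
For illustration, typical use of this surface would look something like the following (a sketch against the proposed, not-yet-implemented API; error handling elided):

// Compress with a parsed dictionary, then decompress with the same one.
dict, _ := zstd.ParseDict(dictBytes)

var buf bytes.Buffer
w := zstd.NewWriter(&buf)
w.SetLevel(zstd.BestCompression)
w.AddDict(dict)
w.Write(data)
w.Close()

r, _ := zstd.NewReader(&buf)
r.AddDict(dict)
out, _ := io.ReadAll(r)
r.Close()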

@dsnet
Member Author

dsnet commented Aug 15, 2024

Should Dict implement encoding.BinaryMarshaler and encoding.BinaryUnmarshaler?

@klauspost
Contributor

klauspost commented Aug 15, 2024

@dsnet That seems very reasonable and convenient, since it has a well-defined binary format.

The only complication that comes to mind is the dicts without tables. I don't like the if magic != 0xec30a437 { // assume raw }, since it potentially could collide.

@jharveyb

jharveyb commented Aug 15, 2024

It sounds like the revised API would be as follows. Do I have this right?

AFAICT the stdlib compressors that accept options do it with separate constructors and not a setter like the SetLevel(int) suggested?

https://pkg.go.dev/compress/[email protected]#NewWriterDict

https://pkg.go.dev/compress/[email protected]#NewWriterLevel

https://pkg.go.dev/compress/[email protected]#NewWriterLevel

https://pkg.go.dev/compress/[email protected]#NewWriterLevelDict

I'm not sure why that pattern should be changed if an invalid level can be passed to SetLevel().

@rsc
Contributor

rsc commented Aug 28, 2024

The only complication that comes to mind is the dicts without tables. I don't like the if magic != 0xec30a437 { // assume raw }, since it potentially could collide.

Perhaps this is a reason not to have ParseRawDict and ParseDict both return a *Dict.
Another reason is that you can only have one raw dict but you can have many dicts.
So let's keep SetRawDict([]byte) and AddDict(*Dict) and then *Dict is only about the non-raw Dicts.

@rsc
Contributor

rsc commented Aug 28, 2024

It's true that we changed to setters instead of NewWriterLevel, NewWriterLevelDict, but I think the setters are quite a bit clearer and avoid NewWriterLevelRawDict, NewWriterLevelRawDictWithOtherDicts, etc. The old approach did not scale.

I should also add an error result from SetLevel, although we should make it clear that if you are setting a constant level like BestCompression it will never fail.

@rsc
Contributor

rsc commented Aug 29, 2024

Have all remaining concerns about this proposal been addressed?

The proposal details are in #62513 (comment).

@rsc
Contributor

rsc commented Sep 4, 2024

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

The proposal details are in #62513 (comment).

@coxley

coxley commented Sep 4, 2024

Another data point on a potential dictionary API: ideally it should be done in a way that lets the user manage the lifecycle of their own dictionaries. klauspost/compress doesn't handle this well today: klauspost/compress#953

In multi-tenanted systems (or any system dealing with multiple data categories), you want to be able to hold many dictionaries in memory and switch between them with minimal duplication. We rely on cgo for this today, leveraging the native library functions for ZSTD_createDDict_byReference and ZSTD_createCDict_byReference.

Should dictionary support be added to the stdlib, I'd love this use-case to be considered.

ETA: A trimmed example of our use of an internal gozstd fork that supports *_byReference (cgo lifetime specifics omitted)

type Dict struct {
	Key  uint32
	Data []byte
}

type Codec struct {
	dict *Dict
	// Both of these share the underlying bytes inside dict.Data
	cdict *gozstd.CDict
	ddict *gozstd.DDict
}

func NewCodec(dict *Dict) (Codec, error) {
	if dict == nil {
		return Codec{}, errors.New("'dict' must be a non-nil pointer")
	}

	cdict, err := gozstd.NewCDictByRef(dict.Data)
	if err != nil {
		return Codec{}, fmt.Errorf("creating cdict: %w", err)
	}

	ddict, err := gozstd.NewDDictByRef(dict.Data)
	if err != nil {
		return Codec{}, fmt.Errorf("creating ddict: %w", err)
	}

	return Codec{
		dict:     dict,
		cdict:    cdict,
		ddict:    ddict,
	}, nil
}

func (c *Codec) Key() uint32 {
	return c.dict.Key
}

func (c *Codec) Compress(dst []byte, src []byte) ([]byte, error) {
	return gozstd.CompressDict(dst, src, c.cdict), nil
}

func (c *Codec) Decompress(dst []byte, src []byte) ([]byte, error) {
	return gozstd.DecompressDict(dst, src, c.ddict)
}

@rsc
Contributor

rsc commented Sep 11, 2024

@coxley I don't understand what you mean by managing the life cycle. It seems like the ParseDict approach gives you a read-only Dict that can be shared by many encoders/decoders. What more is needed?

Is the point for ParseDict not to copy the []byte it has been passed?

@rsc
Contributor

rsc commented Sep 11, 2024

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

The proposal details are in #62513 (comment).

@rsc rsc changed the title proposal: compress/zstd: add new package compress/zstd: add new package Sep 11, 2024
@rsc rsc modified the milestones: Proposal, Backlog Sep 11, 2024