Add dictionary builder #853

klauspost · 2023-08-17T10:10:15Z

Dictionary builder

This is an experimental dictionary builder for Zstandard, S2, LZ4 and more.

This diverges from the Zstandard dictionary builder, and may have some failure scenarios for very small or uniform inputs.

Returned dictionaries should all be valid, but the builder may fail to generate one if very little data is supplied.

With a large, diverse sample set, it will generate a dictionary that can compete with the Zstandard dictionary builder,
but for very similar data it will not be able to generate a dictionary that is as good.

Feedback is welcome.

Usage

First, collect a set of samples.

These samples should be representative of the input data and should not contain any complete duplicates.

Only the beginning of each sample matters, so the rest can be truncated.
Input beyond roughly 64KB no longer contributes.
The command line tool can do this truncation for you.
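When building samples for the library rather than the command line tool, the truncation and duplicate removal described above could be done up front. The following is a minimal sketch using only the standard library; prepareSamples is a hypothetical helper, not part of the package, and dropping samples that become identical after truncation is an assumption based on the note that only the beginning of each sample matters.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// prepareSamples truncates every sample to at most maxLen bytes and drops
// samples whose truncated content exactly duplicates an earlier one.
// Hypothetical helper for illustration only.
func prepareSamples(raw [][]byte, maxLen int) [][]byte {
	seen := make(map[[sha256.Size]byte]struct{})
	out := make([][]byte, 0, len(raw))
	for _, s := range raw {
		if len(s) > maxLen {
			s = s[:maxLen]
		}
		sum := sha256.Sum256(s)
		if _, dup := seen[sum]; dup {
			continue // complete duplicate of an earlier sample
		}
		seen[sum] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	raw := [][]byte{
		[]byte("GET /index.html HTTP/1.1"),
		[]byte("GET /index.html HTTP/1.1"), // exact duplicate, dropped
		[]byte("POST /api/v1/items HTTP/1.1"),
	}
	samples := prepareSamples(raw, 64<<10) // 64KB cut-off from the text above
	fmt.Printf("kept %d of %d samples\n", len(samples), len(raw))
}
```

Hashing the truncated bytes keeps memory use low when there are many large samples; comparing the byte slices directly would work just as well for small sets.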

Command line

To install the command line tool run:

$ go install github.com/klauspost/compress/dict/cmd/builddict@latest

Collect the samples in a directory, for example samples/.

Then run the command line tool. Basic usage is just to pass the directory with the samples:

$ builddict samples/

This will build a Zstandard dictionary and write it to dictionary.bin in the current folder.

The dictionary can be used with the Zstandard command line tool:

$ zstd -D dictionary.bin input

Options

The command line tool has a few options:

-format Output type. "zstd", "s2" or "raw". Default "zstd".

Output a dictionary in Zstandard format, S2 format or raw bytes.
The raw bytes can be used with Deflate, LZ4, etc.

-hash Minimum match length in bytes. Must be 4-8 (inclusive). Default 6.

The hash bytes define the shortest matches to look for.
Shorter matches can produce a more fragmented dictionary with less compression, but can be better for certain inputs.
Usually lengths around 6-8 are best.

-len Specify custom output size. Default 114688.
-max Max input length to index per input file. Default 32768. All inputs are truncated to this.
-o Output name. Default dictionary.bin.
-q Do not print progress.
-dictID zstd dictionary ID. 0 will be random. Default 0.
-zcompat Generate dictionary compatible with zstd 1.5.5 and older. Default false.
-zlevel Zstandard compression level.

The Zstandard compression level to use when compressing the samples.
The dictionary will be built using the specified encoder level,
which affects build speed and tailors the dictionary to that level.
The default is level 4 (best).

Valid values are 1-4, where 1 = fastest, 2 = default, 3 = better, 4 = best.
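To get a feel for the -hash trade-off, it can help to count how many byte sequences of a given length are shared between samples. The sketch below is a toy illustration only and does not reproduce the builder's matching; sharedSubstrings is a hypothetical helper.

```go
package main

import "fmt"

// sharedSubstrings counts the distinct n-byte sequences that occur in at
// least two different samples. Toy illustration, not the builder's algorithm.
func sharedSubstrings(samples [][]byte, n int) int {
	firstSeen := make(map[string]int) // sequence -> index of first sample containing it
	shared := make(map[string]struct{})
	for i, s := range samples {
		for j := 0; j+n <= len(s); j++ {
			k := string(s[j : j+n])
			if prev, ok := firstSeen[k]; !ok {
				firstSeen[k] = i
			} else if prev != i {
				shared[k] = struct{}{}
			}
		}
	}
	return len(shared)
}

func main() {
	samples := [][]byte{
		[]byte("hello world"),
		[]byte("hello there"),
	}
	for _, n := range []int{4, 6, 8} {
		fmt.Printf("length %d: %d shared sequences\n", n, sharedSubstrings(samples, n))
	}
}
```

Shorter lengths find more shared sequences, but each one is short, which is consistent with the more fragmented dictionaries described for small -hash values.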

Library

The github.com/klauspost/compress/dict package can be used to build dictionaries in code.
The caller must supply a collection of (pre-truncated) samples and the options to use.
The options largely correspond to the command line options.

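package main

import (
	"log"
	"os"

	"github.com/klauspost/compress/dict"
	"github.com/klauspost/compress/zstd"
)

func main() {
	var samples [][]byte

	// ... Fill samples with representative data.

	// The result is named dictBytes so it does not shadow the dict package.
	dictBytes, err := dict.BuildZstdDict(samples, dict.Options{
		HashLen:     6,
		MaxDictSize: 114688,
		ZstdDictID:  0, // Random
		ZstdCompat:  false,
		ZstdLevel:   zstd.SpeedBestCompression,
	})
	if err != nil {
		log.Fatal(err)
	}
	// Write the dictionary to the command line tool's default output name.
	if err := os.WriteFile("dictionary.bin", dictBytes, 0o644); err != nil {
		log.Fatal(err)
	}
}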

There are similar functions for S2 and raw dictionaries (BuildS2Dict and BuildRawDict).


WIP: Add dictionary builder

ff10202

Functional and ok, but has failure modes.

klauspost force-pushed the dict-builder branch from 431d9fe to ff10202 on August 17, 2023 10:12

klauspost added 3 commits August 17, 2023 14:28

Reduce minimum to 16 bytes. Keep linter happy.

dbdb94e

Check sizes compared to header sizes.

1ca482f

Try more huffman tables before bailing out.

523fb7a

Add "lazy" matching. Should probably be a skipping attempt instead.

klauspost mentioned this pull request Aug 22, 2023

Dicts: check maxSymbolValue < 255 failed, returning ERROR(dictionary_corrupted) facebook/zstd#3724

Closed

klauspost added 2 commits August 25, 2023 14:52

Rewrite zstd offsets

ed11654

Group output and use offset as secondary. Add step back for long hashes.

Add docs and zstd level and compat.

9663333

klauspost changed the title from "WIP: Add dictionary builder" to "Add dictionary builder" on Aug 29, 2023

Final tweaks

5b71795

klauspost merged commit 11e91a1 into master Aug 29, 2023
18 checks passed

klauspost deleted the dict-builder branch August 29, 2023 12:12
