Understanding is Compression

Ziguang Li^1,a,b,e, Chao Huang^1,a, Xuliang Wang^1,a,e, Haibo Hu^1,a, Cole Wyeth^c, Dongbo Bu^a,e, Quan Yu^b, Wen Gao^b,
Xingwu Liu^2,d,e, Ming Li^2,c,e
^aInstitute of Computing Technology, Chinese Academy of Science, Beijing, China
^bPeng Cheng Laboratory, Shenzhen, China
^cSchool of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
^dSchool of Mathematical Sciences, Dalian University of Technology, Dalian, China
^eZhongyuan Institute of Artificial Intelligence, Zhengzhou, China

Abstract

We have previously shown that all understanding or learning is compression, under reasonable assumptions. In principle, better understanding of data should improve data compression. Traditional compression methodologies focus on encoding frequencies or some other computable properties of data. Large language models approximate the uncomputable Solomonoff distribution, opening up a whole new avenue to justify our theory.

Under the new uncomputable paradigm, we present LMCompress based on the understanding of data using large models. LMCompress has significantly better lossless compression ratios than all other lossless data compression methods, doubling the compression ratios of JPEG-XL for images, FLAC for audios and H264 for videos, and tripling or quadrupling the compression ratio of bz2 for texts. The better a large model understands the data, the better LMCompress compresses.

keywords:

Lossless compression, large language models, Kolmogorov complexity, Solomonoff distribution

^1These authors have made equal contributions.
^2Corresponding authors.

1 Introduction

Before the reader starts to read this article, we invite you to reminisce about how you have compressed data: When you saw a tiger, didn’t you store it as a “large cat”? When you saw 3.141592… didn’t you just note it down as $\pi$? When you saw a bird, didn’t you only focus on its unique features like size and color?

Yes, you have learned, understood, and compressed. In Jiang et al. [2023], we mathematically proved that all human or animal learning or understanding is compression, under reasonable assumptions. However, we did not give an effective method for compressing data based on the fact that we thought we “understood” it.

Traditional compression methods depend on various frequency concerns, or Shannon entropy, or other computable properties. While being computable, such methods have reached their limits after many years of research. Our new theory depends on Kolmogorov complexity, information distance Bennett et al. [1998], and Solomonoff’s universal distribution Li and Vitanyi [2019]. Such metrics are not computable. GPT may be seen as approximating the uncomputable Solomonoff distribution with a lot of data. In this paper, we advocate a new paradigm of compression using such approximations by large models, hence compression by understanding the data. See Fig. 1.

Figure 1: The architecture of our method, LMCompress

A precursor of our work has been independently published by us Huang et al. [2023] and by a DeepMind team Delétang et al. [2023], with similar ideas appearing earlier in Bellard [2021]. These works demonstrated that arithmetic coding with a generative large model can improve on the best traditional text compressors, such as gzip, several-fold. Delétang et al. [2023] also achieved the state of the art of classical image compression, and moderately improved classical audio compression.

This paper intends to comprehensively justify the idea of better understanding implies better compression. We will focus on lossless compression for clean comparisons. In order to demonstrate that better understanding implies better compression, we use image GPT rather than a plain large language model (LLM) to compress images and videos, retrain an LLM with a small amount of audio data to compress audios, and employ domain-specific finetuned LLMs to compress domain texts. Lossless compression experiments show that we significantly improve compression ratios on all types of data: texts, images, videos, and audios. Our method is several folds better than traditional algorithms, and a large margin better than the plain LLMs otherwise.

2 Background

2.1 Solomonoff Prior

In the 1960s, Solomonoff [1964] proposed a theory of prediction. Consider an infinite sequence $\omega$ of events over a finite alphabet $\Sigma$. Given an increasingly longer initial segment $\omega_{1:n}\in\Sigma^{n}$ of $\omega\in\Sigma^{\infty}$, the task is to predict the next symbol $\omega_{n+1}\in\Sigma$ of $\omega$. Provided that the prior distribution of finite sequences is $\mu$, the optimum posterior distribution of $\omega_{n+1}$ given $\omega_{1:n}$ is

\mu(\omega_{n+1}|\omega_{1:n})=\frac{\mu(\omega_{1:(n+1)})}{\mu(\omega_{1:n})}.

When the prior distribution $\mu$ is unknown, Solomonoff proposed a semimeasure $\mathbf{m}$ on $\Sigma^{*}$, called the Solomonoff prior, based on which the predictive distribution

\mathbf{m}(\omega_{n+1}|\omega_{1:n})=\frac{\mathbf{m}(\omega_{1:(n+1)})}{\mathbf{m}(\omega_{1:n})}

is defined to approximate $\mu(\omega_{n+1}|\omega_{1:n})$ . Here, for any $x\in\Sigma^{*}$ ,

\mathbf{m}(x)=\sum_{P\in\mathcal{P}(x)}2^{-l(P)},

where $\mathcal{P}(x)$ is the set of binary programs that output a sequence with prefix $x$ , and $l(P)$ is the length of the program $P$ . Intuitively, the shorter or the more programs generate $x$ , the larger the prior probability of $x$ should be.

Interestingly, it has been proven that the universally-defined predictive probability converges to the true, even unknown, posterior distribution. Namely,

\lim_{n\rightarrow\infty}\left(\mathbf{m}(\omega_{n+1}|\omega_{1:n})-\mu(\omega_{n+1}|\omega_{1:n})\right)=0.

This seemingly paves a universal way to approximate the true posterior distribution well. However, computing $\mathbf{m}$ is closely related to Kolmogorov complexity [Li and Vitanyi, 2019], which turns out to be uncomputable.
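To make the ratio-of-prefix-probabilities idea concrete, here is a minimal sketch in which a small, computable Bayes mixture stands in for $\mathbf{m}$. The three Bernoulli models and their weights are hypothetical choices made for illustration; this is not the Solomonoff prior itself, which sums over all programs and cannot be computed.

```python
from fractions import Fraction

# Hypothetical toy model class: three Bernoulli "programs", keyed by the
# probability of emitting a 1. The weights mimic 2^{-l(P)} for program
# lengths 1, 2 and 2. This is only a finite, computable stand-in for m.
MODELS = {
    Fraction(1, 4): Fraction(1, 2),
    Fraction(1, 2): Fraction(1, 4),
    Fraction(3, 4): Fraction(1, 4),
}

def mixture(bits: str) -> Fraction:
    """Mixture probability that a sequence starts with `bits`."""
    ones = bits.count("1")
    zeros = len(bits) - ones
    return sum(w * theta**ones * (1 - theta)**zeros
               for theta, w in MODELS.items())

def predict_one(bits: str) -> Fraction:
    """Predictive probability of a 1: a ratio of two prefix probabilities."""
    return mixture(bits + "1") / mixture(bits)

# A sequence whose long-run frequency of ones is 3/4.
observed = "1110" * 100
print(float(predict_one("")), float(predict_one(observed)))
```

With no data the prediction is just the prior-weighted average of the three models; as the observed prefix grows, the prediction moves toward the model that explains the data best, which is the behaviour the convergence statement describes.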

Fortunately, it is known [Ortega et al., 2019, Grau-Moya et al., 2024] that any parametric meta-learning model $\pi_{\theta}$, such as a decoder-only large model, if fully trained with the log-loss function

-\sum_{t=1}^{n}\log\pi_{\theta}(x_{t}|x_{1:(t-1)}),

converges to the Bayes-optimal predictor, namely

\lim_{n\rightarrow\infty}\left(\pi_{\theta}(x_{n}|x_{1:(n-1)})-\mu(x_{n}|x_{1:(n-1)})\right)=0.

This makes it possible to bypass the uncomputable Solomonoff priors via large models. We will achieve efficient lossless compression of texts, voices, images, and videos under this paradigm.

Different ideas have been proposed to break the Shannon entropy bound. One such proposal is “semantic communication”. While it is unclear how such a proposal would be implemented, lossy or lossless, the central idea is clear: we communicate the meaning of the data, or we compute the meaning of the data then send their shorter representations. However, the idea of “semantic communication” again falls into the classical trap of trying to find computable functions to interpret the semantics of the data.

2.2 Multimedia Compression

Multimedia compression refers to the process of reducing the size of digital media data, such as text, images, audio, and video, without compromising the essential information they contain. This is achieved by identifying and eliminating redundancies within the data, resulting in a more compact representation that requires less storage space or bandwidth for transmission.

The three key techniques widely used in multimedia compression are prediction, transformation, and entropy coding.

Prediction-based compression leverages the inherent patterns and correlations within the data to predict future or missing elements. In audio compression, future audio frames can be predicted based on previous frames, as seen in techniques like linear predictive coding (LPC) [Auristin and Mali, 2016]. Video compression often employs prediction, both within frames (intra-frame prediction) and between frames (inter-frame prediction), as demonstrated in the H.264 video codec [Wiegand et al., 2003].

Transformation-based compression aims to convert the data representation from a higher-dimensional space to a lower-dimensional space, reducing the overall data size. In audio, image, and video compression, techniques like the Discrete Cosine Transform (DCT)[Ahmed et al., 1974] are widely used to transform the digital media into the frequency domain, where high-frequency components can be selectively discarded with minimal impact on visual quality. The transformed data can then be quantized and encoded using entropy coding methods.

Entropy coding is a lossless compression technique that eliminates statistical redundancies in the data by assigning shorter codes to more frequent symbols and longer codes to less frequent symbols. The two most commonly used entropy coding techniques are Huffman coding [Huffman, 1952] and arithmetic coding [Pasco, 1976, Rissanen, 1976]. Huffman coding is a variable-length coding method that constructs a prefix-free code to minimize the average code length, while arithmetic coding can achieve higher compression ratios by representing the entire input sequence as a single code.
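As an illustration of the Huffman construction mentioned above, the following is a minimal sketch using only the Python standard library. The function name `huffman_code` and the example string are our own choices for illustration; production codecs use canonical codes and bit-level I/O.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict:
    """Build a prefix-free code: frequent symbols get shorter codewords."""
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, {symbol: codeword_so_far}).
    # The integer tie_breaker guarantees the dicts are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codewords.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_code("aaaabbc")
encoded = "".join(codes[s] for s in "aaaabbc")
print(codes, len(encoded), "bits")
```

Each symbol costs a whole number of bits here, which is the limitation that arithmetic coding removes.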

Multimedia compression techniques are extensively used in various file formats and media delivery systems. The choice in practice depends on the specific requirements of the application, such as the desired compression ratio, computational complexity, and the need for lossless or lossy compression.

2.3 Arithmetic Coding

Our compression will be based on arithmetic coding, so we introduce it in more detail to make the paper self-contained.

Arithmetic coding maps any sequence of symbols into a number in the interval $[0,1)$ .

Let $\Sigma$ be a finite alphabet with total order $\preceq$ . For any integer $n\geq 0$ , there is a probability mass function $p_{n}$ on $\Sigma^{n}$ , which satisfies that $p_{n}(\omega)=\sum_{x\in\Sigma}p_{n+1}(\omega x)$ for any sequence $\omega\in\Sigma^{n}$ .

Arbitrarily fix a sequence $\omega=\omega_{1}\omega_{2}\cdots\omega_{n}\in\Sigma^{n}$ for some $n>0$ . The arithmetic encoder compresses the sequence symbol by symbol while iteratively narrowing down the interval $I_{0}=[0,1)$ .

Specifically, for any $1\leq k\leq n$ , let $I_{k-1}=[a_{k-1},b_{k-1})$ be the interval obtained after encoding $\omega_{k-1}$ . To encode $\omega_{k}$ , we define the interval $I_{k}=[a_{k},b_{k})$ as follows:

a_{k}=a_{k-1}+(b_{k-1}-a_{k-1})\sum_{x\in\Sigma,x\prec\omega_{k}}p(x|\omega_{1:(k-1)}),\tag{1}
b_{k}=a_{k-1}+(b_{k-1}-a_{k-1})\sum_{x\in\Sigma,x\preceq\omega_{k}}p(x|\omega_{1:(k-1)}),\tag{2}

where

p(x|\omega_{1:(k-1)})=\frac{p_{k}(\omega_{1:(k-1)}x)}{p_{k-1}(\omega_{1:(k-1)})}

Finally, we select a value $\lambda\in I_{n}$ that has the shortest binary representation. It serves as the arithmetic code for $\omega$ . Note that the length $l(\lambda)$ of the binary representation of $\lambda$ is

\begin{aligned}
l(\lambda)&\leq\lceil-\log_{2}(b_{n}-a_{n})\rceil+1\\
&=\left\lceil-\log_{2}\prod_{i=1}^{n}p(\omega_{i}|\omega_{1:(i-1)})\right\rceil+1\\
&=\lceil-\log_{2}p_{n}(\omega)\rceil+1.
\end{aligned}

On the other hand, given $\lambda$ , we can reconstruct the sequence $\omega$ in a similar manner. Starting with the interval $I_{0}=[0,1)$ , assume that we have recovered $\omega_{1:(k-1)}$ and obtained the interval $I_{k-1}=[a_{k-1},b_{k-1})$ that contains $\lambda$ . The $k$ th element of $\omega$ is the unique symbol $\omega_{k}\in\Sigma$ for which $I_{k}=[a_{k},b_{k})$ contains $\lambda$ . Here, $a_{k}$ and $b_{k}$ are defined as shown in Formulas (1) and (2).

Hence, arithmetic coding is a lossless compression method.
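The encoder and decoder described in this subsection can be sketched with exact rational arithmetic as follows. The fixed three-symbol `model` is a hypothetical stand-in for the conditional distribution $p(x|\omega_{1:(k-1)})$, the alphabet order is plain string order, and the decoder is assumed to know the sequence length $n$. Practical coders use finite-precision integers instead of fractions.

```python
from fractions import Fraction
from math import ceil, log2

def model(prefix: str) -> dict:
    """Hypothetical conditional model p(x | prefix); here it ignores the prefix."""
    return {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

def narrow(low, high, symbol, probs):
    """Formulas (1) and (2): shrink [low, high) to the slice owned by `symbol`."""
    width = high - low
    below = sum((p for s, p in probs.items() if s < symbol), Fraction(0))
    return low + width * below, low + width * (below + probs[symbol])

def encode(seq: str):
    low, high = Fraction(0), Fraction(1)
    for k, symbol in enumerate(seq):
        low, high = narrow(low, high, symbol, model(seq[:k]))
    # Search for the shortest (non-empty) binary fraction m / 2^bits in [low, high).
    bits = 1
    while True:
        m = ceil(low * 2**bits)
        if Fraction(m, 2**bits) < high:
            return format(m, "b").zfill(bits), low, high
        bits += 1

def decode(code: str, n: int) -> str:
    lam = Fraction(int(code, 2), 2**len(code))
    low, high = Fraction(0), Fraction(1)
    out = ""
    for _ in range(n):
        probs = model(out)
        for symbol in sorted(probs):
            lo, hi = narrow(low, high, symbol, probs)
            if lo <= lam < hi:  # the unique slice that contains lambda
                out, low, high = out + symbol, lo, hi
                break
    return out

code, low, high = encode("abcab")
print(code, decode(code, 5), ceil(-log2(high - low)) + 1)
```

The last printed value is the length bound $\lceil-\log_{2}(b_{n}-a_{n})\rceil+1$; the code found by the search is never longer than it.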

3 Methods

Traditional compression methods, whether lossy or lossless, depend on a computable function. We propose LMCompress, a new Kolmogorov paradigm of compression depending on the uncomputable Solomonoff distribution. The Solomonoff distribution is approximated by large models with never-ending input data. The compression ratio should go up with better approximation of the Solomonoff distribution and better understanding of data.

It turns out that we have already long passed the critical point. We demonstrate that the data is already sufficient to improve lossless compression of texts, images, videos and audios several-fold, unthinkable for traditional methods, far surpassing the Shannon entropy bound.

The basic process of LMCompress is as follows. First, we decompose the original data into a sequence of tokens. Then, we feed this token sequence into a large generative model, which outputs the predictive distribution for each token. Finally, we use arithmetic coding to losslessly compress the original data based on these predictive distributions. As shown below, the tokenization method and the large generative model may vary according to the type of the original data.
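The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: `AdaptiveByteModel` is a hypothetical, Laplace-smoothed byte counter standing in for the large generative model, tokens are single bytes, and instead of running an arithmetic coder the sketch reports the code-length bound $\lceil-\log_{2}p_{n}(\omega)\rceil+1$ from Section 2.3.

```python
from collections import Counter
from math import ceil, log2

def tokenize(data: bytes) -> list:
    """Step 1: decompose the data into tokens (here, one token per byte)."""
    return list(data)

class AdaptiveByteModel:
    """Hypothetical stand-in for the large model: Laplace-smoothed byte counts."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def prob(self, token: int) -> float:
        # Predictive probability of `token` given everything seen so far.
        return (self.counts[token] + 1) / (self.total + 256)

    def update(self, token: int) -> None:
        self.counts[token] += 1
        self.total += 1

def code_length_bound_bits(data: bytes) -> int:
    model = AdaptiveByteModel()
    neg_log2_p = 0.0
    for token in tokenize(data):
        neg_log2_p -= log2(model.prob(token))  # Step 2: predictive distribution
        model.update(token)
    return ceil(neg_log2_p) + 1  # Step 3: arithmetic-coding length bound

print(code_length_bound_bits(b"aaaaaaaa"), code_length_bound_bits(b"abcdefgh"))
```

A predictor that assigns higher probability to the tokens that actually occur yields a shorter code, which is why swapping the toy counter for a model that understands the data improves the compression ratio.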

3.1 Image Compression

We will use the image-GPT model (iGPT, Chen et al. [2020]) as the generative large model. Our choice of iGPT is driven by two key factors.

Firstly, iGPT is a large-scale vision model that has been trained on a vast corpus of images, equipping it with a robust understanding of visual data. This makes iGPT well-suited for analyzing and processing images.

Secondly, iGPT is an autoregressive large vision model, which means that when presented with a sequence of pixels, it can generate predictive probability for each pixel in the sequence. This capability is a prerequisite for employing arithmetic coding.

To compress an image using iGPT, we first concatenate the image’s rows from top to bottom, transforming the two-dimensional visual data into a one-dimensional sequence of pixels. This pixel sequence is then fed into iGPT for processing.

However, due to the limited context window of iGPT, the entire pixel sequence cannot be input to the model all at once. Instead, we divide the sequence into non-overlapping segments, each of which can fit within iGPT’s context window. These individual segments are then fed into iGPT sequentially and processed in a piece-by-piece manner.
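The flattening and segmentation just described can be sketched as below. The image is represented as a plain list of rows, and the window size is a hypothetical parameter; the actual context length depends on the iGPT variant used.

```python
def flatten_rows(image: list) -> list:
    """Concatenate the rows from top to bottom into one pixel sequence."""
    return [pixel for row in image for pixel in row]

def split_segments(sequence: list, window: int) -> list:
    """Cut the sequence into non-overlapping segments of at most `window` items."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]

# A tiny 2x3 "image" and a hypothetical context window of 4 pixels.
image = [[1, 2, 3],
         [4, 5, 6]]
segments = split_segments(flatten_rows(image), window=4)
print(segments)
```

Because the segments do not overlap and are processed independently, the model has no context at the start of each segment; that is the price paid for fitting within the context window.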

3.2 Video Compression

3.2.1 Lossless Video Compression

To the best of our knowledge, no existing open-source large video model naturally outputs probabilities. As a result, we have opted to circumvent this limitation by leveraging the image-based generative model iGPT instead. Since a video is fundamentally a sequence of frames, we propose to compress each individual frame using the iGPT model.

At this stage, we have chosen not to exploit the inter-frame information for compression. There are two reasons for this:

1. Many types of videos, such as action movies, exhibit drastic changes from one frame to the next. Attempting to leverage information from previous frames is unlikely to be effective for compressing the current frame in such cases.

2. Even for videos with relatively modest inter-frame variations, such as classroom lecture recordings, we have found that utilizing the inter-frame information does not actually improve the overall video compression performance. This may be because the iGPT model is already able to sufficiently understand and model each individual frame on its own.

By compressing each video frame independently using iGPT, we can sidestep the challenge posed by the lack of large autoregressive video models. This frame-by-frame compression approach allows us to harness the powerful image understanding capabilities of iGPT, without needing to address the complexities of modeling temporal dependencies between video frames.

3.2.2 Lossy Video Compression

In addition to lossless compression, the more common application of video compression is lossy compression. Therefore, we have also investigated how large models can improve lossy video compression. Existing works such as the DCVC series (DCVC Li et al. [2021]; DCVC-HEM Li et al. [2022]; DCVC-FM Li et al. [2024]) typically draw on a residual-coding-based framework. However, in recent years, Artificial Intelligence Generated Content (AIGC) has developed rapidly with the accumulation of generative large models. The concept of “generative compression” introduced in Santurkar et al. [2017] is receiving a lot of attention in the image compression area (Yang and Mandt [2024]; Relic et al. [2024]), and these methods show promising results.

Essentially, because they learn the data distribution from their training datasets, generative large models acquire more compact feature representations, more flexible motion estimation mechanisms, and superior signal reconstruction capabilities, all of which help compression performance.

To compare the DCVC series with methods using a generative model, we set the bit rate of both to be the same. Inspired by Xu et al. [2024], which gives a paradigm for linking transform-coding-based methods with generative large models, we use the DCVC results as the posterior and sample from a diffusion model with the manifold-constrained gradient, omitting the strict measurement consistency projection step, as Chung et al. [2022] do. Because the gradient of DCVC is required, we use a proxy loss function, replacing the quantization step with additive uniform noise [Ballé et al., 2017].
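The additive-uniform-noise proxy mentioned above can be illustrated in isolation. The sketch below only contrasts hard rounding, whose gradient is zero almost everywhere, with the noisy relaxation of Ballé et al. [2017]; the function names are ours, and nothing here reproduces the DCVC or diffusion pipeline.

```python
import random

def hard_quantize(values: list) -> list:
    """Quantization as used at test time: round each latent to an integer."""
    return [round(v) for v in values]

def noisy_proxy(values: list, rng: random.Random) -> list:
    """Differentiable stand-in: add noise drawn uniformly from [-0.5, 0.5]."""
    return [v + rng.uniform(-0.5, 0.5) for v in values]

latents = [0.2, 1.7, -3.4]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(hard_quantize(latents), noisy_proxy(latents, rng))
```

The noisy value stays within half a unit of the original latent, matching the width of a quantization bin, while remaining a smooth function of the latent so that gradients can flow through it.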

3.3 Audio Compression

Audio is a type of sequential data, so compressing it should exploit the sequential modeling ability of autoregressive models. To capture long-term patterns, state-of-the-art large audio models tend to discretize audio into tokens, which is inevitably a lossy transformation. To achieve lossless compression, we establish a model that handles audio at the level of the signal, rather than tokens.

Basically, the audio consists of a sequence of frames, each of which can be represented by a constant number of bytes. Note that a frame contains the amplitude information at a time point in the audio signal. We map each byte of a frame to an ASCII character, transforming the audio into a string of characters. What is needed is just an autoregressive model that understands this audio string. We implement such a model by adding a low-rank adaptation layer to an LLM and fine-tuning the LLM with audio strings. The fine-tuned model can estimate next-token probabilities for an audio string and thus compress the string with arithmetic coding.
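One possible way to turn audio frames into a character string is sketched below. The text does not specify how byte values above 127 are mapped to ASCII, so the scheme here is purely an assumption made for illustration: each byte contributes a 7-bit ASCII character, and its lowest bit is kept in a side list so that the mapping stays invertible. The 16-bit little-endian sample format is likewise assumed.

```python
import struct

def frames_to_bytes(samples: list) -> bytes:
    """Pack signed 16-bit amplitudes as little-endian bytes (assumed format)."""
    return struct.pack("<%dh" % len(samples), *samples)

def bytes_to_ascii(raw: bytes):
    """Assumed mapping: top 7 bits become an ASCII char, low bit is kept aside."""
    text = "".join(chr(b >> 1) for b in raw)
    low_bits = [b & 1 for b in raw]
    return text, low_bits

def ascii_to_bytes(text: str, low_bits: list) -> bytes:
    """Inverse mapping, showing that no information is lost."""
    return bytes((ord(c) << 1) | bit for c, bit in zip(text, low_bits))

raw = frames_to_bytes([0, -1, 32767])
text, low_bits = bytes_to_ascii(raw)
print(len(raw), ascii_to_bytes(text, low_bits) == raw)
```

Whatever mapping is used, it must be a bijection between byte strings and model inputs; otherwise the overall scheme would no longer be lossless.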

3.4 Text Compression

Large language models have demonstrated impressive capability in compressing general texts. Intriguingly, they have potential to achieve even better compression ratios, provided that the texts to be compressed are restricted to specific domains.

The key lies in adapting the LLM to better understand the target domain. This is implemented by incorporating an adaptation layer and fine-tuning the LLM on domain-specific texts, tailoring the model to the characteristics of the domain.

Then, to compress a text in the target domain, we feed it into the fine-tuned LLM. The LLM will estimate the next-token probabilities for the text, which can be leveraged by arithmetic coding to perform domain-aware compression.
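The adaptation layer used for fine-tuning can be pictured as a low-rank update to a frozen weight matrix, following the LoRA formulation of Hu et al. [2022]: $h=Wx+\frac{\alpha}{r}BAx$. The sketch below uses plain Python lists and tiny hypothetical matrices; the defaults `rank=8` and `alpha=32` are the values reported for the audio experiments in Section 4.3, and a real setup would rely on a deep-learning framework.

```python
def matvec(matrix: list, vector: list) -> list:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, rank=8, alpha=32):
    """h = W x + (alpha / rank) * B (A x); W is frozen, only A and B are trained."""
    scale = alpha / rank
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # low-rank path: d_in -> rank -> d_out
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical 2x2 layer with a rank-8 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0] for _ in range(8)]      # rank x d_in
B_zero = [[0.0] * 8, [0.0] * 8]         # d_out x rank, zero at initialization
print(lora_forward(W, A, B_zero, [2.0, 3.0]))
```

With `B` initialized to zero the adapted layer reproduces the pretrained layer exactly, so fine-tuning starts from the original model and only gradually specializes it to the domain.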

4 Results

The metric of compression performance on a dataset, the compression ratio, is the ratio of the size of the original data to that of the compressed data. In general, the bigger the compression ratio, the better the compression performance.
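For reference, the metric can be computed as below for two of the classical baselines, using Python's standard `zlib` and `bz2` modules. The sample text is an arbitrary placeholder, not data from our experiments, so the printed ratios say nothing about the numbers in the tables.

```python
import bz2
import zlib

def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Size of the original data divided by the size of the compressed data."""
    return len(original) / len(compressed)

# Placeholder input; highly repetitive, so both baselines compress it well.
data = b"understanding is compression " * 200

print("zlib :", round(compression_ratio(data, zlib.compress(data, 9)), 2))
print("bzip2:", round(compression_ratio(data, bz2.compress(data, 9)), 2))
```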

4.1 Image Compression

Dataset We validate the compression ratios of images on two benchmark datasets, ILSVRC2017 (Russakovsky et al. [2015]) and CLIC2019 professional. The ILSVRC2017 dataset is a large-scale dataset containing millions of labeled images across thousands of categories derived from ImageNet. The CLIC2019 dataset is designed for evaluating and benchmarking image compression algorithms. It contains high-quality images with various characteristics such as natural scenes, textures, patterns, and structures. These images are representative of real-world scenarios encountered in photography, multimedia, and visual content. Since the ILSVRC2017 dataset is extremely large, we extract 10000 windows from each dataset and perform arithmetic coding on them with the large vision model. When we test the impact of window size on compression performance, the number of windows changes to keep the total raw data size constant.

In our experiment, the raw data size is always 10 Megabytes, since we extract ten thousand windows from each dataset. Table 1 shows the compression ratios for all algorithms and datasets. It can be seen that LMCompress significantly outperforms traditional compression algorithms, achieving more than double the compression ratio of methods such as JPEG-2000 Skodras et al. [2001].

Method	CLIC	ILSVRC
PNG	2.205	1.67
JPEG-XL	2.93	1.90
WebP	2.75	2.04
JPEG-2000	2.73	1.53
Chinchilla 7B	$\backslash$	1.82
Chinchilla 70B	$\backslash$	2.08
LMCompress	6.55	4.594

Table 1: Compression ratios of different algorithms on different datasets. The compression ratios of PNG, JPEG-XL, WebP, and JPEG-2000 on the CLIC dataset are from Rhee et al. [2022]. The compression ratios of the Chinchilla models on the ILSVRC dataset are from Delétang et al. [2023]. Since the Chinchilla models are not publicly available, we cannot test their performance on the other dataset.

4.2 Video Compression

4.2.1 Lossless Video Compression

Datasets We use video data from Xiph.org, which hosts over 1000 videos. All video sequences are in the uncompressed YUV4MPEG format. Since the large model is slow to autoregress at the pixel level, we selected two typical videos to test the compression ratios of LMCompress. One of the videos (bowing.y4m) is a static scene, while the other (bus.y4m) is a dynamic scene. The bus.y4m video contains 150 frames, while the bowing.y4m video contains 300 frames. In the static scene, the background is stationary and most of the pixel changes occur in the foreground. In the dynamic scene, most of the video content changes as the target moves, causing the majority of the background pixels to change as well.

method	bowing.y4m	bus.y4m
FFV1	3.37	2.00
H264	3.88	2.35
LMCompress	8.04	3.52

Table 2: Compression ratios of two videos.

The results are shown in Table 2. LMCompress more than doubles the compression ratio of H264 on the static video and improves it by about 50% on the dynamic one. Meanwhile, we observe that the compression ratio is higher in static scenes than in dynamic scenes. This is because, in static scenes, most frames change little and the motion of the objects in the video is less variable. Therefore, the model can comprehend the entire video content through the image content of individual frames. This phenomenon reinforces our main theme that better understanding leads to more effective compression.

4.2.2 Lossy Video Compression

Datasets The experimentation and analysis of the methods are performed using videos from CIPR SIF Sequences at Xiph.org. Due to the resolution limitations of the diffusion model, we scale the video size to 256x256.

Metric We use Peak Signal-to-Noise Ratio (PSNR) to measure distortion, bits per pixel (bpp) to measure bitrate, and Fréchet Inception Distance (FID) to measure perceptual quality.
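The first two metrics have simple closed forms, sketched below under the assumption of 8-bit samples (peak value 255) and with pixels given as flat lists. FID is omitted because it requires a pretrained Inception network. The function names and the example numbers are ours.

```python
from math import inf, log10

def psnr(reference: list, reconstructed: list, max_value: int = 255) -> float:
    """PSNR in dB: 10 * log10(MAX^2 / MSE); higher means less distortion."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    return inf if mse == 0 else 10 * log10(max_value ** 2 / mse)

def bits_per_pixel(compressed_bytes: int, width: int, height: int, frames: int = 1) -> float:
    """Bitrate: total compressed bits divided by the number of pixels coded."""
    return 8 * compressed_bytes / (width * height * frames)

# Hypothetical example: one 256x256 frame compressed to 8192 bytes.
print(bits_per_pixel(8192, 256, 256))
print(psnr([10, 20, 30, 40], [12, 18, 30, 41]))
```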

Implementation details We follow the diffusion training setting and hyperparameters in Chung et al. [2022], which uses stochastic gradient descent to optimize the intermediate samples. To achieve this in the presence of quantization, we use a proxy loss function based on a continuous relaxation of the probability model, replacing the quantization step with additive uniform noise (Ballé et al. [2017]). The forward measurement operator is specified as in DCVC (Li et al. [2021]).

	compression ratio	bpp	PSNR $\uparrow$	FID $\downarrow$
DCVC	162	0.0945	29.0	153
LMCompress	582	0.0263	32.3	81

Table 3: Lossy video compression results. We choose ELIC (He et al. [2022]) as the I-frame compressor.

According to Table 3, in terms of lossy video compression, LMCompress achieves more than three times the compression ratio of the DCVC method while maintaining better reconstructed image quality. The results support our perspective that better understanding implies better compression performance.

4.3 Audio Compression

Dataset We use LibriSpeech ASR corpus (Panayotov et al. [2015]) and LJSpeech (Ito and Johnson [2017]) as datasets to test audio compression. Both datasets were collected from the LibriVox project which covers nearly 1000 hours of 16kHz English speech in audiobooks. Since the datasets are too large, we extract the first Gigabyte from the train-clean-100 split of the LibriSpeech corpus and the first 256 Megabytes from LJSpeech. The audio streams are transformed into strings of ASCII characters before being compressed. The strings are further divided into non-overlapping segments of 2048 bytes, so that every segment does not exceed the context window size of the Llama models. The segments are compressed independently.

Audio-understanding We build our model for audio compression by conducting supervised LoRA (Hu et al. [2022]) fine-tuning on the Llama3-8B model. Note that Llama3-8B was pretrained on normal texts. To tailor it for audio compression, we use the first 64 Megabytes in the dev-clean split of the LibriSpeech corpus as the training data for fine-tuning, with rank 8 and alpha 32. Note that no data from the LJSpeech dataset is involved in the fine-tuning process. The results are shown in Table 4.

Method	LibriSpeech	LJSpeech
FLAC	3.23	3.21
Llama3-8B*	4.45	4.02
Chinchilla-7B**	4.24	\
Chinchilla-70B**	4.76	\
LMCompress	6.07	6.22

Table 4: Compression ratios of different lossless audio compression algorithms.
*Delétang et al. [2023] experimented with compression using the Llama2-7B model. Here we implemented the same method on top of the Llama3-8B model, since our proposed LMCompress is trained from the Llama3-8B model.
** Results are from Delétang et al. [2023] since the Chinchilla models are not publicly available.

According to Table 4, even when fine-tuned with a small amount of audio data, the model generalizes well to much larger test datasets. LMCompress almost doubles the compression ratio of the classic FLAC method. Furthermore, it outperforms raw Llama3-8B by 36% on LibriSpeech and by 55% on LJSpeech. The results support our observation that better understanding leads to better compression.

4.4 Text Compression

Dataset Our benchmarks for domain-aware text compression are the MeDAL (Wen et al. [2020]) dataset and the Pile of Law (Henderson et al. [2022]) dataset. The domains of the datasets are medicine and law, respectively. Specifically, MeDAL is created from PubMed abstracts which are released in the 2019 annual baseline and primarily serves as a corpus for medical abbreviation understanding. Pile of Law is a dataset of legal and administrative texts compiled from 35 sources. In the experiments, we extract the first 1104 Megabytes from MeDAL and the eurlex split from the Pile of Law corpus. Again, we divide the texts into segments of 2048 bytes so that every segment fits the context window of the Llama models. The segments are compressed independently.

Domain-aware compression To help the LLM understand specific domains, supervised LoRA fine-tuning is applied. For each domain dataset, we use the first 64 Megabytes for training, the next 16 Megabytes for validation, and all of the remaining data for testing. The results are shown in Table 5.

Method	MeDAL	Pile of Law
zlib	2.96	3.14
bzip2	3.94	4.15
brotli(Alakuijala et al. [2019])	4.22	5.13
Llama3-8B**	9.66	12.15
LMCompress	10.48	16.81

Table 5: Compression ratios of different lossless text compression algorithms.
** We implemented the same general text compressor in Huang et al. [2023] and Delétang et al. [2023] on top of Llama3-8B model.

We observe that LMCompress outperforms all the baselines. Its compression ratio is about 2.5 times that of the best traditional method (brotli) on MeDAL and about 3.3 times on Pile of Law. Compared to raw Llama3-8B, LMCompress improves the compression ratio by 8.5% on MeDAL and by nearly 38.4% on Pile of Law. Again, we get evidence that better understanding leads to better compression.

5 Conclusion

Communication in the past was generally governed by the Shannon paradigm, with compression ratio upper bounded by Shannon entropy. While exploring other computable features can further improve compression, large models may be seen to approximate the uncomputable Solomonoff distribution, hence opening a new Kolmogorov paradigm of compression. As we have shown, this new way of lossless compression has achieved several-fold improvements on various kinds of data. This new Kolmogorov paradigm allows us to systematically understand the data we transmit, shattering the Shannon entropy upper bound on a large scale.

It is worth noting that the “arithmetic coding + LLM” paradigm has a universal upper bound on compression ratios, which is determined by the inherent probability distribution of the texts. Let us focus on $\mathcal{X}^{n}$ with probability distribution $p_{n}$. Suppose that $p_{n}$ is not known and we only have an approximate probability distribution $q_{n}$. This is the case for LLMs, where the next-token probabilities are estimated. Then we apply arithmetic coding with $q_{n}$, which in the worst case needs $\lceil-\log_{2}q_{n}(\omega)\rceil$ bits to represent a sequence $\omega\in\mathcal{X}^{n}$. Therefore, the expected number of bits needed to compress a length-$n$ sequence is $\mathbb{E}_{p_{n}}\left[\lceil-\log_{2}q_{n}(\omega)\rceil\right]\approx\mathbb{E}_{p_{n}}\left[-\log_{2}q_{n}(\omega)\right]=H(p_{n},q_{n})$. Here, $H(p_{n},q_{n})$ is the cross-entropy between the true distribution and the estimated distribution. According to the theory of cross-entropy, the expected code length is minimized if and only if $q_{n}=p_{n}$, in which case it equals the entropy of $\mathcal{X}^{n}$ under the probability distribution $p_{n}$.
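The gap between cross-entropy and entropy can be checked numerically. In the sketch below the two distributions are hypothetical: `p` plays the role of the true distribution and `q` that of an imperfect model estimate.

```python
from math import log2

def cross_entropy(p: dict, q: dict) -> float:
    """H(p, q): expected bits per symbol when data from p is coded with q."""
    return -sum(prob * log2(q[s]) for s, prob in p.items() if prob > 0)

def entropy(p: dict) -> float:
    """H(p) = H(p, p): the best achievable expected code length."""
    return cross_entropy(p, p)

p = {"a": 0.5, "b": 0.25, "c": 0.25}    # hypothetical true distribution
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # hypothetical model estimate

print(entropy(p), cross_entropy(p, q), cross_entropy(p, q) - entropy(p))
```

The difference printed last is the Kullback-Leibler divergence from `q` to `p`, that is, the extra bits per symbol paid for the model's imperfect estimate; it vanishes only when `q` equals `p`.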

6G communication, especially when bandwidth from satellites is limited, will benefit significantly from understanding the data, with large models at both ends of communication to encode and decode. As the large models are specialized as agents, assisted with RAG, AI will understand our data to be transmitted much better. Conceivably, the research presented here can be extended to the domain of lossy compression. When the data need to be encrypted, our compression needs to be done before encryption. One can even imagine that the sides with better models broadcast open compressed messages allowing only those with equal models to decipher as a first level of encryption, at no extra cost.

Of course, exploring techniques to effectively incorporate inter-frame information remains an important area for future research. But for now, our pragmatic decision to compress videos by individually encoding their constituent frames appears to be a viable and effective strategy.

Acknowledgements

We thank Nick Zhang and Paul Vitanyi for discussions on Solomonoff distribution. We thank Cynthia Huang, Yuqing Xie, Zhiying Jiang, Rui Wang, and Peijia Guo for their discussions and related work in Jiang et al. [2023] and Huang et al. [2023]. This research is partially supported by Canada’s NSERC grant OGP0046506, and Canada Research Chair Program.

References

Ahmed et al. [1974] Ahmed, N., Natarajan, T., Rao, K., 1974. Discrete cosine transform. IEEE Transactions on Computers C-23, 90–93. doi:10.1109/T-C.1974.223784.
Alakuijala et al. [2019] Alakuijala, J., Farruggia, A., Ferragina, P., Kliuchnikov, E., Obryk, R., Szabadka, Z., Vandevenne, L., 2019. Brotli: A general-purpose data compressor. ACM Transactions on Information Systems .
Auristin and Mali [2016] Auristin, F.N., Mali, S.D., 2016. Advanced audio compression for lossless audio coding using ieee 1857.2. International Journal Of Engineering And Computer Science .
Ballé et al. [2017] Ballé, J., Laparra, V., Simoncelli, E.P., 2017. End-to-end optimized image compression. arXiv:1611.01704.
Bellard [2021] Bellard, F., 2021. Lossless data compression with transformer.
Bennett et al. [1998] Bennett, C.H., Gács, P., Li, M., Vitányi, P.M., Zurek, W.H., 1998. Information distance. IEEE Transactions on Information Theory 44, 1407–1423.
Chen et al. [2020] Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I., 2020. Generative pretraining from pixels, in: International Conference on Machine Learning, PMLR. pp. 1691–1703.
Chung et al. [2022] Chung, H., Kim, J., Mccann, M.T., Klasky, M.L., Ye, J.C., 2022. Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687.
Delétang et al. [2023] Delétang, G., Ruoss, A., Duquenne, P.A., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L.K., Aitchison, M., Orseau, L., et al., 2023. Language modeling is compression. arXiv preprint arXiv:2309.10668.
Grau-Moya et al. [2024] Grau-Moya, J., Genewein, T., Hutter, M., Orseau, L., Delétang, G., Catt, E., Ruoss, A., Wenliang, L.K., Mattern, C., Aitchison, M., et al., 2024. Learning universal predictors. arXiv preprint arXiv:2401.14953.
He et al. [2022] He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y., 2022. ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. arXiv:2203.10886.
Henderson et al. [2022] Henderson, P., Krass, M., Zheng, L., Guha, N., Manning, C., Jurafsky, D., Ho, D.E., 2022. Pile of law: Learning responsible data filtering from the law and a 256GB open-source legal dataset.
Hu et al. [2022] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., 2022. LoRA: Low-rank adaptation of large language models, in: ICLR 2022.
Huang et al. [2023] Huang, C., Xie, Y., Jiang, Z., Lin, J., Li, M., 2023. Approximating human-like few-shot learning with GPT-based compression. arXiv:2308.06942.
Huffman [1952] Huffman, D.A., 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40, 1098–1101. doi:10.1109/JRPROC.1952.273898.
Ito and Johnson [2017] Ito, K., Johnson, L., 2017. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/.
Jiang et al. [2023] Jiang, Z., Wang, R., Bu, D., Li, M., 2023. A theory of human-like few-shot learning. arXiv:2301.01047.
Li et al. [2021] Li, J., Li, B., Lu, Y., 2021. Deep contextual video compression. arXiv:2109.15047.
Li et al. [2022] Li, J., Li, B., Lu, Y., 2022. Hybrid spatial-temporal entropy modelling for neural video compression, in: Proceedings of the 30th ACM International Conference on Multimedia, ACM. URL: https://dx.doi.org/10.1145/3503161.3547845, doi:10.1145/3503161.3547845.
Li et al. [2024] Li, J., Li, B., Lu, Y., 2024. Neural video compression with feature modulation. arXiv:2402.17414.
Li and Vitanyi [2019] Li, M., Vitanyi, P., 2019. An introduction to Kolmogorov complexity and its applications.
Ortega et al. [2019] Ortega, P.A., Wang, J.X., Rowland, M., Genewein, T., Kurth-Nelson, Z., Pascanu, R., Heess, N.M.O., Veness, J., Pritzel, A., Sprechmann, P., Jayakumar, S.M., McGrath, T., Miller, K.J., Azar, M.G., Osband, I., Rabinowitz, N.C., György, A., Chiappa, S., Osindero, S., Teh, Y.W., Hasselt, H.V., de Freitas, N., Botvinick, M.M., Legg, S., 2019. Meta-learning of sequential strategies. ArXiv abs/1905.03030. URL: https://api.semanticscholar.org/CorpusID:147703875.
Panayotov et al. [2015] Panayotov, V., Chen, G., Povey, D., Khudanpur, S., 2015. Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Pasco [1976] Pasco, R.C., 1976. Source coding algorithms for fast data compression. Ph.D. thesis. Stanford University CA.
Relic et al. [2024] Relic, L., Azevedo, R., Gross, M., Schroers, C., 2024. Lossy image compression with foundation diffusion models. arXiv:2404.08580.
Rhee et al. [2022] Rhee, H., Jang, Y.I., Kim, S., Cho, N.I., 2022. LC-FDNet: Learned lossless image compression with frequency decomposition network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6033–6042.
Rissanen [1976] Rissanen, J.J., 1976. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development 20, 198–203.
Russakovsky et al. [2015] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252.
Santurkar et al. [2017] Santurkar, S., Budden, D., Shavit, N., 2017. Generative compression. arXiv:1703.01467.
Skodras et al. [2001] Skodras, A., Christopoulos, C., Ebrahimi, T., 2001. The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine 18, 36–58. doi:10.1109/79.952804.
Solomonoff [1964] Solomonoff, R., 1964. A formal theory of inductive inference. Inform. control 7.
Wen et al. [2020] Wen, Z., Lu, X.H., Reddy, S., 2020. MeDAL: Medical abbreviation disambiguation dataset for natural language understanding pretraining. arXiv:2012.13978.
Wiegand et al. [2003] Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A., 2003. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology.
Xu et al. [2024] Xu, T., Zhu, Z., He, D., Li, Y., Guo, L., Wang, Y., Wang, Z., Qin, H., Wang, Y., Liu, J., et al., 2024. Idempotence and perceptual image compression. arXiv preprint arXiv:2401.08920.
Yang and Mandt [2024] Yang, R., Mandt, S., 2024. Lossy image compression with conditional diffusion models. arXiv:2209.06950.