## Tldr; Notebook proof-of-concept
This [notebook.ipynb](notebook.ipynb) demonstrates the proof of concept:
- Encoder neural network: Compresses data.txt -> compressed.txt using a simple LSTM neural network
- Decoder neural network: Decompresses compressed.txt directly from the file, without transmitting any of the learned neural network weights
- The idea works by having the encoder compress text while it is training, and the decoder mirror the process exactly by decompressing and training on the decompressed text. This way, both neural networks always share the same state over time, removing the need to store the weights externally (a minimal sketch of this follows below).
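The sketch below illustrates that lock-step idea with a simple adaptive character-frequency table standing in for the LSTM (the notebook trains a real network); the `encode`/`decode`/`ranking` names are illustrative, not the notebook's API. Because both sides apply identical model updates to symbols they both see, the model itself never needs to be transmitted:

```python
from collections import Counter
from string import ascii_lowercase

ALPHABET = list(ascii_lowercase + " ")

def ranking(counts):
    # Most frequent symbols first; ties broken alphabetically for determinism.
    return sorted(ALPHABET, key=lambda c: (-counts[c], c))

def encode(text):
    """Emit each symbol's rank under the current model, then learn from it."""
    counts, ranks = Counter(), []
    for ch in text:
        ranks.append(ranking(counts).index(ch))
        counts[ch] += 1              # "training" step: update the shared model
    return ranks

def decode(ranks):
    """Rebuild the identical model from the symbols decoded so far."""
    counts, out = Counter(), []
    for r in ranks:
        ch = ranking(counts)[r]
        out.append(ch)
        counts[ch] += 1              # same update => same model state as the encoder
    return "".join(out)

message = "an apple a day keeps the doctor away"
assert decode(encode(message)) == message
```

Swapping the frequency table for an LSTM that trains on the same stream gives the scheme described above.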

## Rationale

Neural network-based language models are ideally suited for compressing text, as they can efficiently predict the next word in a sentence.
Instead of storing the words themselves, we can find where each word falls in the neural network's ranked predictions and store only that index:

Character counts (excluding spaces):
- Original words: An apple a day keeps the doctor away (30 characters)
- Neural network predicted word indices: 40 9 6 3 1 1 1 1 (9 characters)

Even this naive implementation achieves a compression ratio of 9/30 = 0.30.

<img width="480" alt="image" src="https://github.com/Magnushhoie/weightless_NN_decompression/assets/39849954/4fe62e9c-bdc7-4904-86b3-4a75e371e646">

Likewise, given a list of indices to select the next words from the same neural network, we can decode the digits back to words.
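To make the rank-index round trip concrete, here is a small illustrative sketch (not the notebook's code): `predict_next_distribution` is a hypothetical stand-in for the LSTM's softmax over the vocabulary, encoding stores each word's rank in the sorted predictions, and decoding inverts this by querying the same model with the already-decoded prefix. With a well-trained model the true word usually sits near the top, so most stored indices are small.

```python
import numpy as np

VOCAB = sorted({"an", "apple", "a", "day", "keeps", "the", "doctor", "away",
                "orange", "night", "sun", "rain"})

def predict_next_distribution(context):
    """Hypothetical stand-in for the LSTM softmax: a deterministic (but here
    essentially random) probability distribution over VOCAB given the context.
    A real model would concentrate probability on the likely next word."""
    seed = sum(VOCAB.index(w) for w in context) + 7 * len(context)
    rng = np.random.default_rng(seed)
    probs = rng.random(len(VOCAB))
    return probs / probs.sum()

def to_indices(words):
    """Replace each word by its rank in the model's sorted predictions."""
    indices = []
    for i, word in enumerate(words):
        order = np.argsort(-predict_next_distribution(words[:i]))
        indices.append([VOCAB[j] for j in order].index(word))
    return indices

def to_words(indices):
    """Invert to_indices(): pick the word at the stored rank at each step."""
    words = []
    for idx in indices:
        order = np.argsort(-predict_next_distribution(words))
        words.append(VOCAB[order[idx]])
    return words

sentence = "an apple a day keeps the doctor away".split()
assert to_words(to_indices(sentence)) == sentence
```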
However, this only works if we already have the neural network weights available. If we had to include them, we'd likely throw away any compression gains. Unless there is a method to completely skip storing them ...

The below proof of concept details a way to avoid storing the weights, by learning them on the fly from the compressed data itself.

The idea comes from this [2019 NNCP paper](https://bellard.org/nncp/nncp.pdf), which currently holds the world record for the smallest compressed version of a ~1 GB Wikipedia file (compressed to roughly 100 MB). Under normal circumstances the compressed file would also have to contain the decoder's neural network weights, but this technique removes that requirement. You can read more in this [HackerNews post](https://news.ycombinator.com/item?id=27244810).

## Implementation details
We encode sequences of digits like "000000", "000001", etc., and store the compressed [data](data.txt) in [compressed.txt](compressed.txt). Instead of storing the index of each symbol in the model's ranked predictions, we'll be even more efficient and use an [Arithmetic Compressor](https://pypi.org/project/arithmetic-compressor/), which encodes each symbol according to the probability the model assigns to it.
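As a rough sketch of that step, the snippet below assumes the arithmetic-compressor package's quick-start interface (`AECompressor` plus a `StaticModel` of symbol probabilities); treat the exact class and method names as an assumption and see the notebook for the real, adaptive LSTM-driven usage.

```python
# A minimal sketch, assuming the arithmetic-compressor package exposes
# AECompressor and StaticModel as in its quick-start example; check the
# package docs / the notebook for the exact interface.
from arithmetic_compressor import AECompressor
from arithmetic_compressor.models import StaticModel

# Fixed probabilities over a toy digit alphabet; the notebook's LSTM would
# instead supply context-dependent probabilities at every position.
model = StaticModel({"0": 0.7, "1": 0.3})
coder = AECompressor(model)

data = "000000" + "000001" + "000010"      # example digit sequences
compressed = coder.compress(data)          # bit sequence
decoded = coder.decompress(compressed, len(data))
print(compressed)
print(decoded)
```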
