POLCA

Python implementation of the online grammar compression algorithm OLCA. POLCA was developed for studying OLCA, so its time and space consumption and the output compressed size are much worse than lcacomp which is a C++ implemenation of OLCA. However, POLCA has an advantage, lcacomp does not have, that it can treat each multi byte character as a single character.

Note: The main drawback of POLCA is that it stores each whole line of input in RAM because python does not provide an interface for reading by characters.

Requirements

Python 3.5 or higher is required. See requirements.txt for other required libraries. They will be installed by the following command.

$ pip install -r requirements.txt

Usages

$ python src/polcarun.py -h
src/polcarun.py: Python implementation of the online grammar compression algorithm OLCA.

Usage:
  src/polcarun.py (comp | decomp) [options]
  
Options:
  -i FILE --input FILE              input file  [default: stdin]
  -o FILE --output FILE             output file [default: stdout]
  -e ENCODING --encoding ENCODING   open input/output by ENCODING for comp/decomp [default: binary]
                                    if ENCODING="binary", treat a byte as a single character 
                                    otherwise, treat a multi-byte encoded by ENCODING as a single character

Examples

An example for a small text.

$ echo "hogehoge" > hoge.txt
$ python src/polcarun.py comp -i hoge.txt -o hoge.enc
$ python src/polcarun.py decomp -i hoge.enc -o hoge.decomp
$ cat hoge.decomp
hogehoge

An example for a highly-repetitive text. Note: The compressed size is very small, but POLCA store whole input (about 17M) in RAM. This is because the input huge data is written in a line.

$ python -c "print('a'*(2**24))" > fuga.txt # 16777217 bytes (about 17M)
$ python src/polcarun.py comp -i fuga.txt -o fuga.enc # 266 bytes
$ python src/polcarun.py comp -i fuga.enc -o fuga.decomp
$ diff fuga.txt fuga.decomp
# no output

An example for a multi-byte text.

$ echo "âêîôû" | python src/polcarun.py comp --encoding=utf8 | python src/polcarun.py decomp --encoding=utf8
âêîôû
# ofcourse, you can treat a byte as a character also for a multi-byte text!
$ echo "âêîôû" | python src/polcarun.py comp | python src/polcarun.py decomp
âêîôû

Tests

$ cd data
$ ./download.sh # download corpus
$ cd ../
$ python -m unittest -v src/test_polca.py

download.sh downloads files for test from The Canterbury Corpus.

Licenses

References

Shirou Maruyama, Hiroshi Sakamoto, Masayuki Takeda: An Online Algorithm for Lightweight Grammar-Based Compression. Algorithms 5(2): 214-235 (2012)
lcacomp: A C++ implementation of OLCA.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
src		src
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

POLCA

Requirements

Usages

Examples

Tests

Licenses

References

About

Releases

Packages

Languages

License

kg86/polca

Folders and files

Latest commit

History

Repository files navigation

POLCA

Requirements

Usages

Examples

Tests

Licenses

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages