Tokengrams allows you to efficiently compute n-gram counts and statistics over large pre-tokenized text corpora. Rather than pre-computing counts for a fixed n, it builds a suffix array index over the corpus, so the count of any n-gram can be computed on the fly.
Our code also allows you to turn your suffix array index into an efficient n-gram language model, which can autoregressively sample text.
The backend is written in Rust, and the Python bindings are generated using PyO3.
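For intuition, here is a minimal pure-Python sketch of the core idea (the actual index construction lives in the Rust backend and is far faster): because the corpus's suffixes are sorted, every occurrence of an n-gram sits in one contiguous block of the suffix array, so two binary searches are enough to count it.

def build_suffix_array(tokens):
    # Sort all suffix start positions lexicographically. This naive
    # construction is only suitable for toy inputs.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, query):
    # Two binary searches find the boundaries of the block of suffixes
    # that begin with `query`; the block's width is the n-gram's count.
    def first_failing(pred):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(tokens[sa[mid]:sa[mid] + len(query)]):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start = first_failing(lambda prefix: prefix < query)   # first suffix >= query
    end = first_failing(lambda prefix: prefix <= query)    # first suffix > query
    return end - start

tokens = [1, 2, 3, 1, 2, 1, 2, 3]
sa = build_suffix_array(tokens)
assert count_ngram(tokens, sa, [1, 2]) == 3
assert count_ngram(tokens, sa, [1, 2, 3]) == 2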
Install the latest release from PyPI:

pip install tokengrams

Alternatively, build and install from source with maturin:

pip install maturin
maturin develop
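MemmapIndex operates on a corpus stored on disk as a flat binary array of token IDs. If you need to produce such a file yourself, here is a hedged sketch; it assumes 16-bit unsigned tokens (i.e. a vocabulary of at most 2**16 entries) and uses a toy two-document corpus as a stand-in for real data.

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Stand-in corpus; in practice this would be your dataset.
docs = ["hello world", "hello universe"]
token_ids = [t for doc in docs for t in tokenizer.encode(doc)]

# Write the tokens as a flat array of unsigned 16-bit integers. Pythia's
# ~50k-token vocabulary fits comfortably in uint16.
np.array(token_ids, dtype=np.uint16).tofile("/data/document.bin")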
from tokengrams import MemmapIndex
# Create a new index from an on-disk corpus called `document.bin` and save it to
# `pile.idx`.
index = MemmapIndex.build(
    "/data/document.bin",
    "/pile.idx",
)
# Verify index correctness
print(index.is_sorted())
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Get the count of "hello world" in the corpus.
print(index.count(tokenizer.encode("hello world")))
# You can now load the index from disk later using __init__
index = MemmapIndex(
    "/data/document.bin",
    "/pile.idx"
)
# Count how often each token in the corpus succeeds "hello world".
print(index.count_next(tokenizer.encode("hello world")))
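# (Sketch, not part of the library API:) normalize these counts into an
# empirical next-token distribution. Assumes count_next returns one count
# per vocabulary ID and that the query occurs at least once.
import numpy as np
counts = np.array(index.count_next(tokenizer.encode("hello world")))
probs = counts / counts.sum()
best = int(counts.argmax())
print(tokenizer.decode([best]), probs[best])  # most frequent continuation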
# Parallelize over queries
print(index.batch_count_next(
[tokenizer.encode("hello world"), tokenizer.encode("hello universe")]
))
# Autoregressively sample 10 tokens using 5-gram language statistics. The
# initial gram statistics are derived from the query, backing off to
# lower-order statistics until the sequence contains at least 5 tokens.
print(index.sample(tokenizer.encode("hello world"), n=5, k=10))
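# (Sketch:) decode a sampled continuation back to text. Assumes sample
# returns the generated token IDs as a flat list.
print(tokenizer.decode(index.sample(tokenizer.encode("hello world"), n=5, k=10)))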
# Parallelize over sequence generations
print(index.batch_sample(tokenizer.encode("hello world"), n=5, k=10, num_samples=20))
# Query whether the corpus contains "hello world"
print(index.contains(tokenizer.encode("hello world")))
# Get the corpus positions of all occurrences of "hello world"
print(index.positions(tokenizer.encode("hello world")))
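As a worked example combining these queries, here is a hedged sketch that prints each match with a little trailing context. It assumes positions returns corpus token offsets and that the corpus is the flat uint16 file described above.

import numpy as np

corpus = np.memmap("/data/document.bin", dtype=np.uint16, mode="r")
query = tokenizer.encode("hello world")
for pos in index.positions(query)[:5]:
    # Decode the match plus up to eight tokens of trailing context.
    print(tokenizer.decode(corpus[pos : pos + len(query) + 8].tolist()))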
The best way to get support is to open an issue on this repo or post in #inductive-biases in the EleutherAI Discord server. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!