Tokengrams allows you to efficiently compute n-gram counts and statistics over large pre-tokenized text corpora. Rather than pre-computing counts for a fixed n, it builds a suffix array index over the corpus, so the count of any n-gram can be computed on the fly.
Our code also allows you to turn your suffix array index into an efficient n-gram language model, which can autoregressively sample text.
The backend is written in Rust, and the Python bindings are generated using PyO3.
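For intuition, here is a minimal pure-Python sketch of the core idea (the actual index construction lives in the Rust backend and is far faster): because the corpus's suffixes are sorted, every occurrence of an n-gram sits in one contiguous block of the suffix array, so two binary searches are enough to count it.

def build_suffix_array(tokens):
    # Sort all suffix start positions lexicographically. This naive
    # construction is only suitable for toy inputs.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, query):
    # Two binary searches find the boundaries of the block of suffixes
    # that begin with `query`; the block's width is the n-gram's count.
    def first_failing(pred):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(tokens[sa[mid]:sa[mid] + len(query)]):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start = first_failing(lambda prefix: prefix < query)   # first suffix >= query
    end = first_failing(lambda prefix: prefix <= query)    # first suffix > query
    return end - start

tokens = [1, 2, 3, 1, 2, 1, 2, 3]
sa = build_suffix_array(tokens)
assert count_ngram(tokens, sa, [1, 2]) == 3
assert count_ngram(tokens, sa, [1, 2, 3]) == 2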
Install the latest release from PyPI:

pip install tokengrams

Alternatively, build and install from source with maturin:

pip install maturin
maturin develop
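MemmapIndex operates on a corpus stored on disk as a flat binary array of token IDs. If you need to produce such a file yourself, here is a hedged sketch; it assumes 16-bit unsigned tokens (i.e. a vocabulary of at most 2**16 entries) and uses a toy two-document corpus as a stand-in for real data.

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Stand-in corpus; in practice this would be your dataset.
docs = ["hello world", "hello universe"]
token_ids = [t for doc in docs for t in tokenizer.encode(doc)]

# Write the tokens as a flat array of unsigned 16-bit integers. Pythia's
# ~50k-token vocabulary fits comfortably in uint16.
np.array(token_ids, dtype=np.uint16).tofile("/data/document.bin")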
from tokengrams import MemmapIndex
# Create a new index from an on-disk corpus called `document.bin` and save it to
# `pile.idx`.
index = MemmapIndex.build(
    "/data/document.bin",
    "/pile.idx",
)
# Verify index correctness
print(index.is_sorted())
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

# Get the count of "hello world" in the corpus.
print(index.count(tokenizer.encode("hello world")))
# You can now load the index from disk later using __init__
index = MemmapIndex(
    "/data/document.bin",
    "/pile.idx"
)
# Count how often each token in the corpus succeeds "hello world".
print(index.count_next(tokenizer.encode("hello world")))
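# (Sketch, not part of the library API:) normalize these counts into an
# empirical next-token distribution. Assumes count_next returns one count
# per vocabulary ID and that the query occurs at least once.
import numpy as np
counts = np.array(index.count_next(tokenizer.encode("hello world")))
probs = counts / counts.sum()
best = int(counts.argmax())
print(tokenizer.decode([best]), probs[best])  # most frequent continuation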
# Parallelize over queries
print(index.batch_count_next(
[tokenizer.encode("hello world"), tokenizer.encode("hello universe")]
))
# Autoregressively sample 10 tokens using 5-gram language statistics. The
# initial gram statistics are derived from the query, backing off to
# lower-order statistics until the sequence contains at least 5 tokens.
print(index.sample(tokenizer.encode("hello world"), n=5, k=10))
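# (Sketch:) decode a sampled continuation back to text. Assumes sample
# returns the generated token IDs as a flat list.
print(tokenizer.decode(index.sample(tokenizer.encode("hello world"), n=5, k=10)))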
# Parallelize over sequence generations
print(index.batch_sample(tokenizer.encode("hello world"), n=5, k=10, num_samples=20))
# Query whether the corpus contains "hello world"
print(index.contains(tokenizer.encode("hello world")))
# Get the corpus positions of all occurrences of "hello world"
print(index.positions(tokenizer.encode("hello world")))
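As a worked example combining these queries, here is a hedged sketch that prints each match with a little trailing context. It assumes positions returns corpus token offsets and that the corpus is the flat uint16 file described above.

import numpy as np

corpus = np.memmap("/data/document.bin", dtype=np.uint16, mode="r")
query = tokenizer.encode("hello world")
for pos in index.positions(query)[:5]:
    # Decode the match plus up to eight tokens of trailing context.
    print(tokenizer.decode(corpus[pos : pos + len(query) + 8].tolist()))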
The best way to get support is to open an issue on this repo or post in #inductive-biases in the EleutherAI Discord server. If you've used the library and have had a positive (or negative) experience, we'd love to hear from you!