⏳ tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

import tiktoken
enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")

The open source version of tiktoken can be installed from PyPI:

pip install tiktoken

The tokeniser API is documented in tiktoken/core.py.

Example code using tiktoken can be found in the OpenAI Cookbook.

Performance

tiktoken is between 3-6x faster than a comparable open source tokeniser:

Performance measured on 1GB of text using the GPT-2 tokeniser, using GPT2TokenizerFast from tokenizers==0.13.2, transformers==4.24.0 and tiktoken==0.2.0.

Getting help

Please post questions in the issue tracker.

If you work at OpenAI, make sure to check the internal documentation or feel free to contact @shantanu.

What is BPE anyway?

Language models don't see text like you and I, instead they see a sequence of numbers (known as tokens). Byte pair encoding (BPE) is a way of converting text into tokens. It has a couple desirable properties:

It's reversible and lossless, so you can convert tokens back into the original text
It works on arbitrary text, even text that is not in the tokeniser's training data
It compresses the text: the token sequence is shorter than the bytes corresponding to the original text. On average, in practice, each token corresponds to about 4 bytes.
It attempts to let the model see common subwords. For instance, "ing" is a common subword in English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing" (instead of e.g. "enc" and "oding"). Because the model will then see the "ing" token again and again in different contexts, it helps models generalise and better understand grammar.

tiktoken contains an educational submodule that is friendlier if you want to learn more about the details of BPE, including code that helps visualise the BPE procedure:

from tiktoken._educational import *

# Train a BPE tokeniser on a small amount of text
enc = train_simple_encoding()

# Visualise how the GPT-4 encoder encodes text
enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
enc.encode("hello world aaaaaaaaaaaa")

Extending tiktoken

You may wish to extend tiktoken to support new encodings. There are two ways to do this.

Create your Encoding object exactly the way you want and simply pass it around.

cl100k_base

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
.github/workflows		.github/workflows
scripts		scripts
src		src
tests		tests
tiktoken		tiktoken
tiktoken_ext		tiktoken_ext
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Cargo.toml		Cargo.toml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
perf.svg		perf.svg
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⏳ tiktoken

Performance

Getting help

What is BPE anyway?

Extending tiktoken

License

jrcribb/tiktoken

Folders and files

Latest commit

History

Repository files navigation

⏳ tiktoken

Performance

Getting help

What is BPE anyway?

Extending tiktoken