Minimal, clean, educational code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
This algorithm was popularized for LLMs by the GPT-2 paper and the associated GPT-2 code release from OpenAI. Sennrich et al. 2015 is cited as the original reference for the use of BPE in NLP applications. Today, most modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm, or a close variant of it, to train their tokenizers.
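To make the "byte-level" part concrete, here is a small sketch (not taken from the repository) showing that UTF-8 encoding turns any string into a sequence of integers in the range 0..255. These are the 256 base tokens that BPE merges are built on top of. The sample string is just an illustration.

```python
# A string with ASCII, Korean, and an emoji (illustrative input only).
text = "hi 안녕 😉"

raw = text.encode("utf-8")  # str -> bytes
ids = list(raw)             # each byte becomes an int in 0..255

# 7 Unicode code points become 14 bytes: ASCII takes 1 byte each,
# each Hangul syllable takes 3, and the emoji takes 4.
print(len(text), len(ids))

# Every id fits in the 256-entry base vocabulary, and the mapping is lossless.
assert all(0 <= i < 256 for i in ids)
assert bytes(ids).decode("utf-8") == text
```

Because every possible input reduces to these 256 byte values, a byte-level tokenizer never runs into an "unknown character".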
There are two primary Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The two tokenizers, along with a GPT-4 wrapper built on top of the second, live in the following files:
- bpe_basic.py: Implements the BasicTokenizer, the simplest implementation of the BPE algorithm that runs directly on text.
- bpe_regex.py: Implements the RegexTokenizer that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
- bpe_gpt4.py: Implements the GPT4Tokenizer. This class is a light wrapper around the RegexTokenizer (the second item above) that exactly reproduces the tokenization of GPT-4 in the tiktoken library. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate and likely historical 1-byte token permutations. Note that the parity is not fully complete yet because we do not handle special tokens.
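The regex pre-splitting idea described above can be illustrated with a rough sketch. Note that the pattern below is a deliberately simplified stand-in written for the standard `re` module; it is an assumption for illustration and is NOT the actual GPT-2 or GPT-4 split pattern. The names `SPLIT_PATTERN` and `split_chunks` are hypothetical as well.

```python
import re

# Simplified, hypothetical split pattern (NOT the real GPT-2/GPT-4 one):
#   optional space + letters | optional space + digits |
#   optional space + punctuation/underscore | runs of whitespace
SPLIT_PATTERN = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+")

def split_chunks(text):
    """Split text into category chunks; joining them gives the text back."""
    return SPLIT_PATTERN.findall(text)

chunks = split_chunks("hello world123!!!")
print(chunks)  # ['hello', ' world', '123', '!!!']

# Each chunk is converted to bytes and processed separately, so a pair of
# bytes that straddles two chunks (e.g. "d" + "1") is never counted or merged.
chunk_ids = [list(chunk.encode("utf-8")) for chunk in chunks]
print(chunk_ids)
```

The key design point is that pair statistics are collected inside each chunk only, which is what keeps letters, numbers, and punctuation from being merged into the same token.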
Finally, the script train.py trains the two major tokenizers on the input text taylorswift.txt (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
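For intuition about what training does, here is a minimal sketch of the core BPE loop: repeatedly count adjacent pairs, then replace the most frequent pair with a new token id. This is an illustrative sketch under simple assumptions (ties broken by first occurrence, no regex splitting), and the helper names `get_stats`, `merge`, `train`, and `decode` are chosen here for illustration; see the files themselves for the actual implementation.

```python
def get_stats(ids):
    """Count how often each adjacent pair of ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every (left-to-right) occurrence of pair in ids with idx."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn vocab_size - 256 merges; return the merges and the merged ids."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, in the order the merges were learned
    for idx in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break  # nothing left to merge
        pair = max(stats, key=stats.get)  # most frequent pair, first one on ties
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return merges, ids

def decode(ids, merges):
    """Expand token ids back into text using the learned merges."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():  # dict order matches merge order
        vocab[idx] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges, ids = train("aaabdaaabac", 256 + 3)
print(ids)                  # [258, 100, 258, 97, 99]
print(decode(ids, merges))  # aaabdaaabac
```

In this sketch the three merges learned are "aa" (256), then "aa" + "a" (257), then "aaa" + "b" (258).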
All of the files above are very short and thoroughly commented, and each contains a usage example at the bottom of the file. As a quick example, we can reproduce the walkthrough from the Wikipedia article on BPE as follows:
from bpe_basic import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
print(tokenizer.encode(text))
# [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))
# aaabdaaabac
This is exactly as expected; please see the bottom of bpe_basic.py for more details. To use the GPT4Tokenizer, here is a simple example and comparison to tiktoken:
text = "hello123!!!? (안녕하세요!) 😉"
# tiktoken
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
# ours
from bpe_gpt4 import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
print(tokenizer.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
(you'll have to pip install tiktoken to run).
Todos:
- handle special tokens (?)
- save and load Tokenizers to/from disk
- video coming soon ;)
License: MIT