This repository contains the code and extended results for the paper *Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models*.

The most interesting thing in this repository is probably the detailed reports, found in `results/reports`.
In these reports, `▁` (but not `_`) is a space, and `¿entry?` represents tokens with a vocabulary entry which was not encoded as expected.
This is a standard poetry project.

```shell
poetry shell    # make/activate your virtual environment
poetry install  # only the first time or on updates
```
See `run_verification.sh` for example commands for running new models. The script itself is mainly a reference for reproducibility; running it end to end is not recommended.
For models with tied embeddings, or for nicer visualizations and results, you will need to hard-code some unused token ids in `magikarp/unused_tokens.py`.
- If a related model already exists, copying the token ids is likely to work just fine.
- For non-tied embeddings, you can typically just let verification finish and update the unused tokens after you get the results.
- For the rare case of new model families with tied embeddings:
  - Take a guess, like `[0]`, or use the tokenizer vocabulary to pick some.
  - Run the `magikarp/fishing.py` script and kill it when it starts verifying.
  - You now have `results/verifications/yourmodel.jsonl`, which allows you to look at the vocabulary and pick suitable tokens.
  - Update your unused tokens, and run verification.
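The inspection step above can be sketched as follows. Note that the JSONL field names used here (`i` for the token id, `raw` for the decoded token string) are assumptions, not the confirmed schema; check one line of your verification file for the real field names.

```python
import json

def candidate_unused_ids(jsonl_lines, marker="unused"):
    """Return token ids whose decoded text contains `marker`.

    The field names ("i" for the token id, "raw" for the decoded token)
    are assumptions about the verification output, not confirmed schema.
    """
    ids = []
    for line in jsonl_lines:
        row = json.loads(line)
        if marker in row.get("raw", "").lower():
            ids.append(row["i"])
    return ids

# Example (hypothetical path):
# with open("results/verifications/yourmodel.jsonl") as f:
#     print(candidate_unused_ids(f))
```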
`generate_results.py` generates plots and markdown reports. Typically, after finishing verification you should run `python generate_results.py [your_model_id]` and then look in `results`.
If you want to contribute results for additional models, please include:
- The `UNUSED_TOKENS` entry
  - Ensure the tokenization tests (via `pytest`), which use this array as a model registry, pass for the new model.
- A line in `run_verification.sh`
- All files in `results` that are not `.gitignore`'d
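For reference, a sketch of what such an entry might look like. The dict-of-model-ids layout and the model id shown are hypothetical; mirror the existing entries in `magikarp/unused_tokens.py` rather than this sketch.

```python
# Hypothetical sketch only -- mirror the real entries in magikarp/unused_tokens.py.
UNUSED_TOKENS = {
    # model id -> token ids you verified to be unused for that model
    "your_org/your_model": [0, 1, 2],
}
```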