
Clarification of Pythia tokenizer(s) at different sizes, steps and data preprocessing? #115

Closed
RylanSchaeffer opened this issue Aug 2, 2023 · 1 comment

RylanSchaeffer commented Aug 2, 2023

I apologize if this has been asked before, but I couldn't find the answer on GitHub or HuggingFace! I also asked on Discord, and I will cross-post the answer to whichever venue responds more slowly.

For the Pythia models, what is the relationship between the tokenizers at different sizes, different training steps, and different data preprocessing (non-deduplicated vs. deduplicated)?

The demo shows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

This suggests to me that the Pythia tokenizers are a function of all three: size (70M), step (3000), data (deduplicated).

But this doesn't make sense to me. Rather, I would guess that the answer is either:

  1. There is one Pythia tokenizer, shared by all sizes, steps and data preprocessing

  2. There are two Pythia tokenizers, one for the deduplicated data and one for the non-deduplicated data

Could someone please clarify?

@RylanSchaeffer changed the title from "Clarification of Pythia tokenizers" to "Clarification of Pythia tokenizer(s) at different sizes, steps and data preprocessing?" Aug 2, 2023

RylanSchaeffer commented Aug 3, 2023

Answer from Stella on Discord:

There is one Pythia tokenizer, and it's the same tokenizer used by GPT-NeoX-20B, MPT, and a bunch of other models too.

It's generally considered best practice to write the code like that, because then you develop habits that are invariant to the tokenizer and you don't need to know which models use the GPT-2 tokenizer, which models use the GPT-NeoX tokenizer, etc.
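For what it's worth, here is a minimal sketch of how to check this yourself (the second model name, EleutherAI/pythia-410m, is just an illustrative choice of a different size and data preprocessing): load the tokenizer from two different Pythia checkpoints and confirm their vocabularies and encodings match.

from transformers import AutoTokenizer

# Two Pythia variants that differ in size and data preprocessing.
# No revision is passed: the revision selects model weights at a given
# training step, not a different tokenizer.
tok_a = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
tok_b = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")

# If there really is a single shared tokenizer, these all hold.
assert tok_a.get_vocab() == tok_b.get_vocab()
sample = "Pythia models share a single tokenizer."
assert tok_a.encode(sample) == tok_b.encode(sample)
print("shared vocab size:", tok_a.vocab_size)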
