
Clarification of Pythia tokenizer(s) at different sizes, steps and data preprocessing? #115

Closed
RylanSchaeffer opened this issue Aug 2, 2023 · 1 comment

RylanSchaeffer commented Aug 2, 2023

I apologize if this has been asked before, but I couldn't find the answer on GitHub or HuggingFace! I also asked on Discord, and I will cross-post the answer to whichever venue responds more slowly.

For the Pythia models, what is the relationship between the tokenizers at different sizes, different training steps, and different data preprocessing (non-deduplicated vs. deduplicated)?

The demo shows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-70m-deduped",
  revision="step3000",
  cache_dir="./pythia-70m-deduped/step3000",
)

This suggests to me that the Pythia tokenizers are a function of all three: size (70M), step (3000), data (deduplicated).

But this doesn't make sense to me. Rather, I would guess that the answer is either:

  1. There is one Pythia tokenizer, shared by all sizes, steps and data preprocessing

  2. There are two Pythia tokenizers, one for the deduplicated data and one for the non-deduplicated data

Could someone please clarify?

@RylanSchaeffer changed the title from "Clarification of Pythia tokenizers" to "Clarification of Pythia tokenizer(s) at different sizes, steps and data preprocessing?" Aug 2, 2023

RylanSchaeffer commented Aug 3, 2023

Answer from Stella on Discord:

There is one Pythia tokenizer, and it's the same tokenizer used by GPT-NeoX-20B, MPT, and a bunch of other models too.

It's generally considered best practice to write the code like that, because then you develop habits that are invariant to the tokenizer and you don't need to know which models use the GPT-2 tokenizer, which models use the GPT-NeoX tokenizer, etc.
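For what it's worth, here is a minimal sketch of how to check this yourself (the second model name, EleutherAI/pythia-410m, is just an illustrative choice of a different size and data preprocessing): load the tokenizer from two different Pythia checkpoints and confirm their vocabularies and encodings match.

from transformers import AutoTokenizer

# Two Pythia variants that differ in size and data preprocessing.
# No revision is passed: the revision selects model weights at a given
# training step, not a different tokenizer.
tok_a = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
tok_b = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")

# If there really is a single shared tokenizer, these all hold.
assert tok_a.get_vocab() == tok_b.get_vocab()
sample = "Pythia models share a single tokenizer."
assert tok_a.encode(sample) == tok_b.encode(sample)
print("shared vocab size:", tok_a.vocab_size)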
