Clarification of Pythia tokenizer(s) at different sizes, steps and data preprocessing? #115
Answer from Stella on Discord:
I apologize if this has been asked before, but I couldn't find the answer on GitHub or HuggingFace! I also asked on Discord, and I will cross-post the answer to whichever responds slower.
For the Pythia models, what is the relationship between the tokenizers at different sizes, different steps, and different data preprocessing (duplicated vs. deduplicated)?
The demo shows:
This suggests to me that the Pythia tokenizers are a function of all three: size (70M), step (3000), data (deduplicated).
But this doesn't make sense to me. Rather, I would guess that the answer is one of:

1. There is one Pythia tokenizer, shared by all sizes, steps, and data preprocessing variants.
2. There are two Pythia tokenizers: one for the deduplicated data, the other for the non-deduplicated data.
Could someone please clarify?
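One way to check this empirically would be to compare the vocabularies the checkpoints actually ship. The sketch below is an illustration only: the Hub repo/revision names in the comments (e.g. `EleutherAI/pythia-70m-deduped` with `revision="step3000"`) and the use of `transformers.AutoTokenizer` are my assumptions, not something confirmed in this thread, and the toy dicts stand in for real vocabularies so the comparison runs offline.

```python
# Sketch: testing whether two Pythia checkpoints share a tokenizer by
# comparing their token-to-id mappings.
#
# With network access and the `transformers` library, one would load the
# real vocabularies roughly like this (repo/revision names assumed):
#
#   from transformers import AutoTokenizer
#   tok_a = AutoTokenizer.from_pretrained(
#       "EleutherAI/pythia-70m-deduped", revision="step3000")
#   tok_b = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
#   print(tok_a.get_vocab() == tok_b.get_vocab())

def same_tokenizer(vocab_a: dict, vocab_b: dict) -> bool:
    """Two tokenizers are interchangeable for this purpose iff their
    token-to-id mappings are identical."""
    return vocab_a == vocab_b

# Toy stand-ins for the dicts returned by tok.get_vocab() (hypothetical values).
vocab_70m_dedup = {"<|endoftext|>": 0, "hello": 1, "world": 2}
vocab_160m      = {"<|endoftext|>": 0, "hello": 1, "world": 2}
vocab_other     = {"<|endoftext|>": 0, "hola": 1, "mundo": 2}

print(same_tokenizer(vocab_70m_dedup, vocab_160m))   # True
print(same_tokenizer(vocab_70m_dedup, vocab_other))  # False
```

If the first comparison came back `True` across sizes, steps, and dedup variants, that would support guess 1 (a single shared tokenizer).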