Inconsistency in vocab sizes #68
Thank you EleutherAI team for this valuable resource!

I noticed that the `vocab_size` changes between different Pythia models. Based on the `config.json` files for the models hosted on Hugging Face, the models report different vocab sizes depending on model size. Strangely, these sizes also don't match the vocab size of the tokenizer: based on `tokenizer.get_vocab()`, the tokenizer for every model size has a vocab size of 50277. Does anyone know the reason for this vocab size mismatch?
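For anyone who wants to reproduce the comparison, here is a minimal sketch using the `transformers` library; the two checkpoint names are just examples, and any `EleutherAI/pythia-*` model on the Hugging Face Hub can be substituted.

```python
from transformers import AutoConfig, AutoTokenizer

# Compare the tokenizer's vocabulary with the embedding size declared in
# each model's config.json. The checkpoints listed here are illustrative.
for name in ["EleutherAI/pythia-70m", "EleutherAI/pythia-6.9b"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    config = AutoConfig.from_pretrained(name)
    print(
        name,
        "tokenizer vocab:", len(tokenizer.get_vocab()),
        "config.vocab_size:", config.vocab_size,
    )
```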
Hi! Yes, these models all use the same tokenizer with a vocab size of 50277. The remaining embeddings in the embedding layer are random embeddings that do not correspond to any token; they were added as padding, and each model's embedding layer is rounded up in size to a fixed multiple. Hope this makes sense! Let me know if this doesn't resolve your question or if you have any further questions. :)
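The comment above does not spell out the padding multiple. GPT-NeoX-style training, which Pythia used, pads the vocabulary to a multiple of 128 by default, so the sketch below assumes 128 purely for illustration; the helper name is made up.

```python
import math

def padded_vocab_size(true_vocab: int, multiple: int = 128) -> int:
    # Round the true vocabulary size up to the nearest multiple, the way
    # the embedding matrix is padded for hardware efficiency. The value
    # 128 is an assumption for this sketch, not quoted from the maintainers.
    return math.ceil(true_vocab / multiple) * multiple

print(padded_vocab_size(50277))  # 50304 under the assumed multiple of 128
```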
Thank you!