
Inconsistency in vocab sizes #68

Closed
nkandpa2 opened this issue Feb 21, 2023 · 2 comments

Comments


nkandpa2 commented Feb 21, 2023

Thank you EleutherAI team for this valuable resource!

I noticed that vocab_size differs across the Pythia models. Based on the config.json files for the models hosted on HuggingFace, the models have the following vocab sizes:

pythia-70m: 50304
pythia-160m: 50304
pythia-410m: 50304
pythia-1b: 50304
pythia-1.4b: 50304
pythia-2.8b: 50304
pythia-6.9b: 50432
pythia-12b: 50688

Strangely, these sizes also don't match the vocab size of the tokenizers for each model. Based on tokenizer.get_vocab(), the tokenizer for each model size has a vocab size of 50277. Does anyone know the reason for this vocab size mismatch?
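For reference, here is a minimal sketch of how these numbers can be checked with the transformers library (assuming the checkpoints follow the EleutherAI/pythia-<size> naming on HuggingFace):

```python
from transformers import AutoConfig, AutoTokenizer

# Compare the embedding-layer vocab_size from each config.json with the
# number of entries the tokenizer actually defines.
for size in ["70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b"]:
    name = f"EleutherAI/pythia-{size}"
    config = AutoConfig.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, config.vocab_size, len(tokenizer.get_vocab()))
```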

haileyschoelkopf (Collaborator) commented

Hi! Yes, these models all use the same tokenizer with a vocab size of 50277.

The remaining rows of the embedding layer are random embeddings that do not correspond to any token; they are added purely as padding. Each model's embedding layer is rounded up in size to the nearest multiple of MODEL_PARALLEL_DEGREE * 128, which gives a performance boost: after the embedding layer is split evenly across model-parallel ranks, each shard's size is divisible by 128.
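A minimal sketch of that rounding arithmetic (the model-parallel degrees below are assumptions chosen for illustration, not values taken from the training configs):

```python
import math

def padded_vocab_size(vocab_size: int, model_parallel_degree: int) -> int:
    # Round the tokenizer vocab up to the nearest multiple of
    # MODEL_PARALLEL_DEGREE * 128.
    divisor = model_parallel_degree * 128
    return math.ceil(vocab_size / divisor) * divisor

# Hypothetical model-parallel degrees that reproduce the config.json values
# quoted above for a 50277-token vocabulary.
print(padded_vocab_size(50277, 1))  # 50304 (pythia-70m ... pythia-2.8b)
print(padded_vocab_size(50277, 2))  # 50432 (pythia-6.9b)
print(padded_vocab_size(50277, 4))  # 50688 (pythia-12b)
```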

Hope this makes sense! Let me know if this doesn't resolve your question or if you have any further questions. :)

nkandpa2 (Author) commented Mar 2, 2023

Thank you!

nkandpa2 closed this as completed Mar 2, 2023