
Inconsistency in vocab sizes #68

Closed
nkandpa2 opened this issue Feb 21, 2023 · 2 comments

Comments


nkandpa2 commented Feb 21, 2023

Thank you EleutherAI team for this valuable resource!

I noticed that vocab_size differs across the Pythia models. Based on the config.json files for the models hosted on HuggingFace, the models have the following vocab sizes:

pythia-70m: 50304
pythia-160m: 50304
pythia-410m: 50304
pythia-1b: 50304
pythia-1.4b: 50304
pythia-2.8b: 50304
pythia-6.9b: 50432
pythia-12b: 50688

Strangely, these sizes also don't match the vocab size of the tokenizers for each model. Based on tokenizer.get_vocab(), the tokenizer for each model size has a vocab size of 50277. Does anyone know the reason for this vocab size mismatch?
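For reference, here is a minimal sketch of how these numbers can be checked with the transformers library (assuming the checkpoints follow the EleutherAI/pythia-<size> naming on HuggingFace):

```python
from transformers import AutoConfig, AutoTokenizer

# Compare the embedding-layer vocab_size from each config.json with the
# number of entries the tokenizer actually defines.
for size in ["70m", "160m", "410m", "1b", "1.4b", "2.8b", "6.9b", "12b"]:
    name = f"EleutherAI/pythia-{size}"
    config = AutoConfig.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, config.vocab_size, len(tokenizer.get_vocab()))
```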

haileyschoelkopf (Collaborator) commented

Hi! Yes, these models all use the same tokenizer with a vocab size of 50277.

The remaining rows of the embedding layer are random embeddings that do not correspond to any token; they are added purely as padding. Each model's embedding layer is rounded up in size to the nearest multiple of MODEL_PARALLEL_DEGREE * 128, which gives a performance boost: after the embedding layer is split evenly across model-parallel ranks, each shard's size is divisible by 128.
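A minimal sketch of that rounding arithmetic (the model-parallel degrees below are assumptions chosen for illustration, not values taken from the training configs):

```python
import math

def padded_vocab_size(vocab_size: int, model_parallel_degree: int) -> int:
    # Round the tokenizer vocab up to the nearest multiple of
    # MODEL_PARALLEL_DEGREE * 128.
    divisor = model_parallel_degree * 128
    return math.ceil(vocab_size / divisor) * divisor

# Hypothetical model-parallel degrees that reproduce the config.json values
# quoted above for a 50277-token vocabulary.
print(padded_vocab_size(50277, 1))  # 50304 (pythia-70m ... pythia-2.8b)
print(padded_vocab_size(50277, 2))  # 50432 (pythia-6.9b)
print(padded_vocab_size(50277, 4))  # 50688 (pythia-12b)
```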

Hope this makes sense! Let me know if this doesn't resolve your question or if you have any further questions. :)

nkandpa2 (Author) commented Mar 2, 2023

Thank you!

nkandpa2 closed this as completed Mar 2, 2023