
Question about naming convention for models #54

Closed
ejmichaud opened this issue Jan 13, 2023 · 3 comments
@ejmichaud

Hi there! Thank you so much for releasing these models! They've already been really valuable for my research. One small question: how were the model names chosen, specifically the parameter-count part of the names? By my calculation, here are the parameter counts for each model, excluding the embed and unembed matrices:

 'pythia-19m': 18915328,
 'pythia-125m': 85056000,
 'pythia-350m': 302311424,
 'pythia-800m': 805736448,
 'pythia-1.3b': 1208602624,
 'pythia-2.7b': 2517652480,
 'pythia-6.7b': 6444163072,
 'pythia-13b': 11327027200

And here are the counts if I include one, but not both, of the embed/unembed matrices:

 'pythia-19m': 44670976,
 'pythia-125m': 123689472,
 'pythia-350m': 353822720,
 'pythia-800m': 908759040,
 'pythia-1.3b': 1311625216,
 'pythia-2.7b': 2646430720,
 'pythia-6.7b': 6650732544,
 'pythia-13b': 11586549760
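
(For context, here is a sketch of the kind of calculation involved, assuming a standard GPT-style decoder where each block contributes roughly 12·d_model² parameters for attention plus the MLP; biases and layernorms are ignored, so totals come out slightly low. The example config is the publicly known GPT-3 Small / OPT-125M shape, and the vocab size is the one implied by the numbers above, not taken from the Pythia config files.)

```python
# Sketch: approximate parameter counts for a GPT-style decoder.
# Per block: ~4*d^2 (attention) + ~8*d^2 (MLP with 4x expansion) = 12*d^2.
def approx_counts(n_layers, d_model, vocab_size):
    non_embedding = 12 * n_layers * d_model**2
    with_one_embedding = non_embedding + vocab_size * d_model
    return non_embedding, with_one_embedding

# e.g. the GPT-3 Small / OPT-125M shape (12 layers, width 768):
print(approx_counts(12, 768, 50304))  # ≈ (84.9M, 123.6M), close to pythia-125m above
```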

I guess for some of the models one embed matrix was included in the parameter count, while for others both the embed and unembed matrices were excluded? The 13B model's name seems like an overestimate whichever way you count it.

@StellaAthena
Member

@ejmichaud Thank you for raising this as an issue! I was able to track down the cause of each number, and we can discuss what makes sense to do going forward.

Firstly, some of our configs come from the GPT-3 and OPT papers. Specifically, all models other than 19M, 800M, and 13B use the same configs as models in those papers. We used the same nomenclature as those papers, which is to take the total number of parameters and round to a nice number. Based on personal communication with OpenAI employees, I know that this decision wasn't considered particularly carefully, since for the largest models it barely makes any difference (as your calculations show). The paper's focus was on large models; even for the 1.3B model the gap between the two counts is only ~10%, and it shrinks from there.
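
For concreteness, here is the back-of-the-envelope arithmetic behind that figure, using the two counts from the lists above (a rough check, not an official calculation):

```python
# Gap between the with-embedding and non-embedding counts for the 1.3B model,
# as a fraction of the non-embedding count:
total_with_embed = 1_311_625_216
non_embedding = 1_208_602_624
print((total_with_embed - non_embedding) / non_embedding)  # ≈ 0.085, i.e. roughly 10%
```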

The 19M and 800M models have custom config files that we created ourselves. We didn't think about the fact that GPT-3 and OPT used the naming convention above, and instead used the number of trainable parameters, as that seemed more natural. The 13B model is the real problem: it was supposed to be the same as the 13B model in OPT and GPT-3, but it appears a transcription error was made at some point and we used 36 layers instead of 40. If you redo your calculation using 40 layers, you do in fact get ~13B parameters.
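
To make the 36-vs-40-layer point concrete, here's a rough check, assuming the OPT-13B width (d_model = 5120) and the embedding size implied by the counts above; biases and layernorms are ignored, so the totals are approximate:

```python
per_layer = 12 * 5120**2       # ≈ 314.6M parameters per transformer block
one_embedding = 259_522_560    # one embed/unembed matrix, from the lists above

print(36 * per_layer + one_embedding)  # ≈ 11.6B: the config as actually trained
print(40 * per_layer + one_embedding)  # ≈ 12.8B: rounds to the intended ~13B
```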

The question is, what should we do going forward? We need to correct the 13B model for sure, but we should also choose a consistent naming convention. Since our models do in fact match many widely discussed models, it would be useful to have them carry the same names. The 19M and 125M models are the only ones where the distinction between embedding and non-embedding parameters is really that big a deal, I think, but I also don't like how the way this is currently presented in the literature regularly confuses people (including people who should know better, like us). We could use descriptive names instead of numeric ones, but I really hate that practice and the amount of work it takes to look up the sizes of models like T5 and BERT.

StellaAthena self-assigned this Jan 13, 2023
@ejmichaud
Author

Got it! This all makes sense. I see the discussion on the Discord about this, and that you've changed the model names on HuggingFace too. Seems like a reasonable choice for the long term!

@StellaAthena
Member

@ejmichaud One other thing I forgot to point out: the reason you got numbers matching OPT when using one, but not both, of the embed/unembed matrices is that Neel Nanda and some others told us that the standard practice of tying the weights of the two matrices is deleterious for interpretability research. The architecture and the number of learnable parameters (the typical number to plot on the x-axis in scaling-laws work) are the same as the corresponding OPT models, where those exist.
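
For anyone unfamiliar with weight tying, here is a minimal PyTorch sketch of the difference (the sizes are illustrative, not the Pythia configs):

```python
import torch.nn as nn

d_model, vocab_size = 512, 50304  # illustrative sizes

embed = nn.Embedding(vocab_size, d_model)
unembed = nn.Linear(d_model, vocab_size, bias=False)

# Tied (GPT-2 style): the unembed reuses the embedding matrix, so the pair
# costs only one vocab_size * d_model block of parameters.
unembed.weight = embed.weight

# Untied (the Pythia/OPT choice): skip the assignment above. The embed and
# unembed matrices are then learned independently, which is why exactly one
# extra embedding-sized matrix shows up in the counts earlier in this thread.
```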

We're also going to train a real 13B parameter model this week.

StellaAthena added the documentation label Jan 23, 2023