
Question about naming convention for models #54

Closed
ejmichaud opened this issue Jan 13, 2023 · 3 comments
@ejmichaud

Hi there! Thank you so much for releasing these models! They've already been really valuable for my research. One small question: how were the model names chosen, specifically the parameter-count part of the names? By my calculation, here are the parameter counts for each model, excluding the embed and unembed matrices:

 'pythia-19m': 18915328,
 'pythia-125m': 85056000,
 'pythia-350m': 302311424,
 'pythia-800m': 805736448,
 'pythia-1.3b': 1208602624,
 'pythia-2.7b': 2517652480,
 'pythia-6.7b': 6444163072,
 'pythia-13b': 11327027200

And here are the counts if I include one, but not both, of the embed/unembed matrices:

 'pythia-19m': 44670976,
 'pythia-125m': 123689472,
 'pythia-350m': 353822720,
 'pythia-800m': 908759040,
 'pythia-1.3b': 1311625216,
 'pythia-2.7b': 2646430720,
 'pythia-6.7b': 6650732544,
 'pythia-13b': 11586549760
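
(For context, here is a sketch of the kind of calculation involved, assuming a standard GPT-style decoder where each block contributes roughly 12·d_model² parameters for attention plus the MLP; biases and layernorms are ignored, so totals come out slightly low. The example config is the publicly known GPT-3 Small / OPT-125M shape, and the vocab size is the one implied by the numbers above, not taken from the Pythia config files.)

```python
# Sketch: approximate parameter counts for a GPT-style decoder.
# Per block: ~4*d^2 (attention) + ~8*d^2 (MLP with 4x expansion) = 12*d^2.
def approx_counts(n_layers, d_model, vocab_size):
    non_embedding = 12 * n_layers * d_model**2
    with_one_embedding = non_embedding + vocab_size * d_model
    return non_embedding, with_one_embedding

# e.g. the GPT-3 Small / OPT-125M shape (12 layers, width 768):
print(approx_counts(12, 768, 50304))  # ≈ (84.9M, 123.6M), close to pythia-125m above
```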

I guess for some of the models one embed matrix was included in the parameter count, while for others both the embed and unembed matrices were excluded? The 13B model's name seems like an overestimate whichever way you count it.

@StellaAthena
Member

@ejmichaud Thank you for raising this as an issue! I was able to track down the cause of each number, and we can discuss what makes sense to do going forward.

Firstly, some of our configs come from the GPT-3 and OPT papers. Specifically, all models other than 19M, 800M, and 13B use the same configs as models in those papers. We used the same nomenclature as those papers, which is to take the total number of parameters and round to a nice number. Based on personal communication with OpenAI employees, I know that this decision wasn't considered particularly carefully, since for the largest models it barely makes any difference (as your calculations show). The paper's focus was on large models; even for the 1.3B model the gap between the two counts is only ~10%, and it shrinks from there.
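
For concreteness, here is the back-of-the-envelope arithmetic behind that figure, using the two counts from the lists above (a rough check, not an official calculation):

```python
# Gap between the with-embedding and non-embedding counts for the 1.3B model,
# as a fraction of the non-embedding count:
total_with_embed = 1_311_625_216
non_embedding = 1_208_602_624
print((total_with_embed - non_embedding) / non_embedding)  # ≈ 0.085, i.e. roughly 10%
```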

The 19M and 800M models have custom config files that we created ourselves. We didn't think about the fact that GPT-3 and OPT used the naming convention above, and instead used the number of trainable parameters, as that seemed more natural. The 13B model is the real problem: it was supposed to be the same as the 13B model in OPT and GPT-3, but it appears a transcription error was made at some point and we used 36 layers instead of 40. If you redo your calculation using 40 layers, you do in fact get ~13B parameters.
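
To make the 36-vs-40-layer point concrete, here's a rough check, assuming the OPT-13B width (d_model = 5120) and the embedding size implied by the counts above; biases and layernorms are ignored, so the totals are approximate:

```python
per_layer = 12 * 5120**2       # ≈ 314.6M parameters per transformer block
one_embedding = 259_522_560    # one embed/unembed matrix, from the lists above

print(36 * per_layer + one_embedding)  # ≈ 11.6B: the config as actually trained
print(40 * per_layer + one_embedding)  # ≈ 12.8B: rounds to the intended ~13B
```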

The question is, what should we do going forward? We need to correct the 13B model for sure, but we should also choose a consistent naming convention. Since our models do in fact match many widely discussed models, it would be useful to have them carry the same names. The 19M and 125M models are the only ones where the distinction between embedding and non-embedding parameters is really that big a deal, I think, but I also don't like how the way this is currently presented in the literature regularly confuses people (including people who should know better, like us). We could use descriptive names instead of numeric ones, but I really hate that practice and the amount of work it takes to look up the sizes of models like T5 and BERT.

StellaAthena self-assigned this Jan 13, 2023
@ejmichaud
Author

Got it! This all makes sense. I see the discussion on the Discord about this, and that you've changed the model names on HuggingFace too. Seems like a reasonable choice for the long term!

@StellaAthena
Member

@ejmichaud One other thing I forgot to point out: the reason you got numbers matching OPT when using one, but not both, of the embed/unembed matrices is that Neel Nanda and some others told us that the standard practice of tying the weights of the two matrices is deleterious for interpretability research. The architecture and the number of learnable parameters (the typical number to plot on the x-axis in scaling-laws work) are the same as the corresponding OPT models, where those exist.
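
For anyone unfamiliar with weight tying, here is a minimal PyTorch sketch of the difference (the sizes are illustrative, not the Pythia configs):

```python
import torch.nn as nn

d_model, vocab_size = 512, 50304  # illustrative sizes

embed = nn.Embedding(vocab_size, d_model)
unembed = nn.Linear(d_model, vocab_size, bias=False)

# Tied (GPT-2 style): the unembed reuses the embedding matrix, so the pair
# costs only one vocab_size * d_model block of parameters.
unembed.weight = embed.weight

# Untied (the Pythia/OPT choice): skip the assignment above. The embed and
# unembed matrices are then learned independently, which is why exactly one
# extra embedding-sized matrix shows up in the counts earlier in this thread.
```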

We're also going to train a real 13B parameter model this week.

StellaAthena added the documentation label Jan 23, 2023