
Implement Maximal Update Parametrization (muP) #16157

Open

thegregyang opened this issue Mar 14, 2022 · 16 comments
Labels
WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress

Comments

@thegregyang

thegregyang commented Mar 14, 2022

🚀 Feature request

This request is to open up a discussion on 1) whether it makes sense to implement Maximal Update Parametrization (abbreviated muP) in Huggingface, 2) if so, how to do it.

Motivation

Hi,

I'm a maintainer of the mup package (paper). This repo allows one to implement in their models a special parametrization called maximal update parametrization, or muP, which has the property that narrow and wide networks share the same optimal hyperparameters (learning rate, initialization, etc.). This is demonstrated below on a Transformer trained with Adam, where on the left we have the PyTorch default parametrization and on the right we have muP.

[Figure: optimal hyperparameters shift with width under the PyTorch default parametrization (left) but stay stable under muP (right)]

Most strikingly, this property can be used to tune hyperparameters for extremely large neural networks like GPT-3, which are too expensive to train more than once, by tuning only a tiny version of the model. But even for "regular joe" users, muP can alleviate a lot of the pain of transitioning from exploration to scaling up only to find performance suffering for mysterious reasons. Transformers in particular are somewhat infamous for problems like training instability. So having muP integrated natively into Huggingface can benefit a lot of users at once.

muP can be implemented in a backward compatible way, as shown below, so users do not need to worry about it breaking existing codebases.

See this Twitter thread for a brief overview of how this works, and this blog post for a more detailed one.

Your contribution

Now let's return to the two questions at the beginning: 1) whether it makes sense to implement Maximal Update Parametrization (abbreviated muP) in Huggingface, 2) if so, how to do it.

For 1), the popularity (or not) of this issue should serve as an indicator of community interest, and the above makes the case for the utility of this integration.

For 2), we have examples of how to integrate muP with some common (PyTorch) Huggingface transformers in our mutransformers repo.

Current Example Implementation

In summary, to modify an existing Huggingface transformer to implement muP, one needs to do the following (see the code sketch right after this list):

  1. Switch any readout layer (dimensions: width -> number of labels) from nn.Linear to mup.MuReadout.
  2. Modify the _init_weights method to use mup.init.* methods instead of nn.init.* methods (or equivalent).
  3. Scale the attention logits like 1/d instead of 1/sqrt(d).
  4. Use mup.AdamW instead of the pytorch or Huggingface version.
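
For concreteness, here is a minimal sketch of what steps 1-3 look like inside a modeling file. This is not the actual mutransformers source; the class name, config attributes, and init constants are illustrative, and mup's init functions (as well as MuReadout's forward pass) assume set_base_shapes has already been called on the model, as shown further below.

import torch.nn as nn
from mup import MuReadout
from mup.init import normal_ as mup_normal_

class MyMuPHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        # 1. the readout layer (width -> number of labels) uses MuReadout instead of nn.Linear
        self.classifier = MuReadout(config.hidden_size, config.num_labels)

    def _init_weights(self, module):
        # 2. use mup.init.* in place of nn.init.* so the init scale is width-aware
        #    (these functions read the infshape metadata attached by set_base_shapes)
        if isinstance(module, nn.Linear):
            mup_normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

# 3. inside self-attention, scale the logits by 1/d_head instead of 1/sqrt(d_head), e.g.
#    attn_scores = torch.matmul(q, k.transpose(-1, -2)) / head_dim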

In addition, when using a mutransformer, one needs to provide a "base shape file" that lets the model know how to properly scale the learning rate and attention with width. This is designed so that if the model's parameter shapes match the "base shapes", the model is in the original parametrization, i.e., it is backward compatible.

from mutransformers import BertConfig, BertForMaskedLM
from mup import set_base_shapes

# instantiate model
model = BertForMaskedLM(config=BertConfig(...))
# set base shapes so mup knows how each dimension scales with width
set_base_shapes(model, path_to_base_shape_file)
# re-initialize with the muP-aware init
model.apply(model._init_weights)
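
The base shape file referenced above (path_to_base_shape_file) is generated once up front. Here is a sketch following the pattern in the mup and mutransformers READMEs, where the "base" and "delta" models differ only in width; the hidden sizes and filename are arbitrary placeholders.

from mutransformers import BertConfig, BertForMaskedLM
from mup import make_base_shapes

# two models that differ only in width
base_model = BertForMaskedLM(config=BertConfig(hidden_size=256, intermediate_size=256))
delta_model = BertForMaskedLM(config=BertConfig(hidden_size=200, intermediate_size=200))
# record how each parameter shape scales with width and save it to a file
make_base_shapes(base_model, delta_model, savefile="bert256.bsh")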

More Seamless Integration

Now, the mutransformers repo is primarily designed to serve as a set of examples of how to implement muP in existing transformers. All of the above can be streamlined if we really want seamless integration into Huggingface.

For example, the user interface for instantiating a model could stay the same as it is now, with just an additional flag mup=True in BertConfig that switches on muP. BertConfig itself may carry a default set of base shapes for use in this scenario, which the user can also modify if necessary.

# the model automatically sets base shapes based on defaults in BertConfig
# no need to re-initialize either
model = BertForMaskedLM(config=BertConfig(mup=True,...))
# use model immediately, e.g., train

In addition, mup.MuAdamW can be incorporated natively into Huggingface as well, so that there is no dependency on the mup package at all.

muP for All Transformers?

Since there is currently no automatic way of retrofitting existing transformers, it could be quite a task to add muP to all of the transformers in Huggingface. A good practical compromise is to implement muP only for the most commonly used models in Huggingface.

In the interim, research can be done on a method for such automatic retrofitting. This could even involve a pull request into PyTorch core.

Conclusion

Again, this issue is intended to start a discussion of whether and how to make muP available to Huggingface users natively. It could be that the best course forward is to have users implement muP transformers themselves as in mutransformers, or even to grow mutransformers into a full repo of muP transformers. And even if we do decide to integrate muP into Huggingface, there could be many ways to do it.

I hope the discussion here can elucidate the right course of action.

@LysandreJik
Member

Pinging maintainers for knowledge: @patrickvonplaten @sgugger @patil-suraj

@patrickvonplaten
Contributor

patrickvonplaten commented Mar 17, 2022

This is only really relevant for pretraining, I assume, no? I wonder whether it might make more sense to add this directly to accelerate? cc @sgugger

@edwardjhu

Hi Patrick,

I'm another maintainer of the mup repo.

It's true that the biggest payoff will probably come from applying our technique to large-scale pretraining, but mup can also help users who are working with novel architectures on a smaller scale. For example, someone might modify an existing model in transformers to test a new idea, and mup can improve stability and reduce the need to tune HPs when they gradually scale up. We aren't sure if this is a common use case for transformers, but mentioning it in case it is.

We are more than happy to look into integration with other tools such as accelerate. A quick scan tells me that it abstracts away the management of devices in PyTorch without having to know the architecture inside an nn.Module. mup, on the other hand, does need to know the architecture, specifically which weight dimensions go to infinity. We'd love to hear more from you guys regarding this!

@sgugger
Collaborator

sgugger commented Mar 28, 2022

From what I gather of the mup repository, it's not general enough (yet?) to be integrated into Accelerate as it seems to be very targeted toward Transformer models, whereas Accelerate handles any kind of models.

As for integrating into Transformers, I think everyone would be delighted to see it as easily accessible as possible. There is just the (big) catch of modifying every modeling file for this. It's not really an option for two reasons:

  1. users are getting upset with us that modeling files already contain too much stuff they do not need when they tweak them for their experiments, so this would add more code that's not strictly necessary just to run BERT, GPT-2, etc.
  2. there are currently 108 different architectures, so waaaay too many modeling files ;-)

As such it would be way more powerful if we could design a function that automatically converts a model to be used with muP.
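
To make that concrete, here is a rough, hypothetical sketch of the layer-swapping part of such a converter. The function name and readout_path argument are made up for illustration, and this deliberately ignores the harder pieces (the attention scaling, weight tying, and making _init_weights use mup.init).

from mup import MuReadout, set_base_shapes

def convert_readout_to_mup(model, readout_path, base_shapes_file):
    # locate the readout nn.Linear by attribute path, e.g. "cls.predictions.decoder"
    parent = model
    *parents, leaf = readout_path.split(".")
    for name in parents:
        parent = getattr(parent, name)
    old = getattr(parent, leaf)
    # swap it for a MuReadout with the same dimensions
    setattr(parent, leaf, MuReadout(old.in_features, old.out_features,
                                    bias=old.bias is not None))
    # attach width-scaling metadata, then re-initialize
    # (a stock _init_weights would still need to use mup.init for correct scaling)
    set_base_shapes(model, base_shapes_file)
    model.apply(model._init_weights)
    return model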

The first two points you mention are easy to do on an existing model (we can change the Linear layers on the fly and re-init the weights); the third one, the attention scaling, is a tiny bit more complex. I don't know if you have any ideas on this. As for making mup.AdamW accessible, we wouldn't necessarily take the code, but it's very easy to add support for a new optimizer in the Trainer with an optional dependency on mutransformers.
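
On the optimizer side, one way to wire this up today, with mup as an optional dependency, is to hand a pre-built muP optimizer to the Trainer via its optimizers argument. A sketch, assuming model already has its base shapes set; train_dataset and the hyperparameter values are placeholders.

from mup import MuAdamW
from transformers import Trainer, TrainingArguments

# MuAdamW builds width-aware parameter groups and hands them to torch's AdamW
optimizer = MuAdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,
    optimizers=(optimizer, None),  # None: let the Trainer create its default LR scheduler
)
trainer.train()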

If we don't manage to have such a function, we also have a feature where you can host any modeling code on the Hub and have it run with Transformers (using the AutoModel.from_pretrained API). See here for the documentation. It would allow us to easily integrate muP in our examples while not changing the code of popular models such as BERT and GPT-2.
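
For reference, loading such Hub-hosted custom modeling code looks like this (the repo name is a placeholder):

from transformers import AutoModel

# pulls both the weights and the custom (muP-aware) modeling code from the Hub repo
model = AutoModel.from_pretrained("some-org/mup-bert-example", trust_remote_code=True)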

Let me know your thoughts!

@thegregyang
Author

thegregyang commented Mar 28, 2022

Hi @sgugger, is there any particular reason you say that mup is very targeted toward Transformers? We definitely designed mup with general models in mind, even though Transformers are where a lot of the payoff would be. For example, we have a ResNet example in our repo.

@sgugger
Collaborator

sgugger commented Mar 28, 2022

You're right, I should have said that the adaptations you mention seem very targeted toward Transformers (in particular, point 3 above).

@edwardjhu

Hi @sgugger,

Like Greg said, only the third item is Transformer-specific (we should have noted that clearly). My concern w.r.t. accelerate is more that, in its current shape, it seems agnostic to the model architecture, whereas our technique requires knowing things like how many dimensions of a parameter tensor grow with width.

I like the idea of having a converter function so we keep the model files as clean as possible. I'd also like to point out that MuAdam is simply a wrapper on top of torch Adam that manipulates the parameter group dictionary to explicitly adjust learning rates according to muP. Perhaps this explicit conversion can be a part of the converter function instead, to remove the dependency on the mup package. @thegregyang, what do you think?
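
To illustrate the mechanism described above, here is a rough sketch (not the actual mup source) of an AdamW wrapper that adjusts per-group learning rates. It relies on the infshape metadata that set_base_shapes attaches to each parameter and only shows the Adam-style rule for matrix-like weights.

import torch

def mup_adamw_like(model, lr=1e-3, **adamw_kwargs):
    param_groups = []
    for p in model.parameters():
        if hasattr(p, "infshape") and p.infshape.ninf() == 2:
            # matrix-like weights (both dims grow with width): scale the lr down by the width multiplier
            param_groups.append({"params": [p], "lr": lr / p.infshape.width_mult()})
        else:
            # vector-like or fixed-size parameters keep the base lr
            param_groups.append({"params": [p], "lr": lr})
    return torch.optim.AdamW(param_groups, **adamw_kwargs)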

@huggingface huggingface deleted a comment from github-actions bot Apr 26, 2022
@thegregyang
Author

After discussion with Edward, we think perhaps hosting custom model code on the Hub would be the best way to go. We have some questions about this:

  1. Is there an API to initialize a model from scratch instead of from a checkpoint (in contrast to AutoModel.from_pretrained)?
  2. Are there IntelliSense or other user-facing tools in VS Code or other IDEs to facilitate the use of a model from the Hub? More generally, I'm just wondering what kind of user experience we are dealing with here.

@sgugger
Collaborator

sgugger commented May 3, 2022

You can create a randomly initialized model with AutoModel.from_config, with the config pulled with AutoConfig.from_pretrained:

from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained(checkpoint_name)
model = AutoModel.from_config(config)

As for the second point, not really.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 5, 2022
@thegregyang
Author

Sorry still working on this!

@LysandreJik LysandreJik reopened this Jun 9, 2022
@LysandreJik LysandreJik added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label Jun 9, 2022
@OhadRubin

Any update?

@thegregyang
Author

@sodabeta7 has been working on this. @sodabeta7 could you summarize your progress?

@fzyzcjy
Contributor

fzyzcjy commented Mar 21, 2024

Hi, are there any updates after a year? Thanks!

@nilsec

nilsec commented Apr 11, 2024

Curious too, any news?

@TeddLi

TeddLi commented May 7, 2024

@thegregyang, I trained a model with muP. Just wondering, how could I convert my muP model weights to SP (standard parametrization) so that I can load them with Huggingface?
