
Implement Maximal Update Parametrization (muP) #16157

Open

thegregyang opened this issue Mar 14, 2022 · 16 comments
Labels
WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress

Comments

@thegregyang

thegregyang commented Mar 14, 2022

🚀 Feature request

This request is to open up a discussion on 1) whether it makes sense to implement Maximal Update Parametrization (abbreviated muP) in Huggingface, 2) if so, how to do it.

Motivation

Hi,

I'm a maintainer of the mup package (paper). This repo allows one to implement in their models a special parametrization called maximal update parametrization, or muP, which has the property that narrow and wide networks share the same optimal hyperparameters (learning rate, initialization, etc.). This is demonstrated below on a Transformer trained with Adam, where on the left we have the PyTorch default parametrization and on the right we have muP.

[Figure: optimal hyperparameters shift with width under the PyTorch default parametrization (left) but stay stable under muP (right)]

Most strikingly, this property can be used to tune hyperparameters for extremely large neural networks like GPT-3, which are too expensive to train more than once, by tuning only a tiny version of the model. But even for "regular joe" users, muP can alleviate a lot of the pain of transitioning from exploration to scaling up only to find performance suffering for mysterious reasons. Transformers in particular are somewhat infamous for problems like training instability. So having muP integrated natively into Huggingface can benefit a lot of users at once.

muP can be implemented in a backward compatible way, as shown below, so users do not need to worry about it breaking existing codebases.

See this Twitter thread for a brief overview of how this works, and this blog post for a more detailed one.

Your contribution

Now let's return to the two questions at the beginning: 1) whether it makes sense to implement Maximal Update Parametrization (abbreviated muP) in Huggingface, 2) if so, how to do it.

For 1), the popularity (or not) of this issue should serve as an indicator of community interest, and the above makes the case for the utility of this integration.

For 2), we have examples of how to integrate muP with some common (PyTorch) Huggingface transformers in our mutransformers repo.

Current Example Implementation

In summary, to modify an existing Huggingface transformer to implement muP, one needs to do the following (see the code sketch right after this list):

  1. Switch any readout layer (dimensions: width -> number of labels) from nn.Linear to mup.MuReadout.
  2. Modify the _init_weights method to use mup.init.* methods instead of nn.init.* methods (or equivalent).
  3. Scale the attention logits like 1/d instead of 1/sqrt(d).
  4. Use mup.AdamW instead of the pytorch or Huggingface version.
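
For concreteness, here is a minimal sketch of what steps 1-3 look like inside a modeling file. This is not the actual mutransformers source; the class name, config attributes, and init constants are illustrative, and mup's init functions (as well as MuReadout's forward pass) assume set_base_shapes has already been called on the model, as shown further below.

import torch.nn as nn
from mup import MuReadout
from mup.init import normal_ as mup_normal_

class MyMuPHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        # 1. the readout layer (width -> number of labels) uses MuReadout instead of nn.Linear
        self.classifier = MuReadout(config.hidden_size, config.num_labels)

    def _init_weights(self, module):
        # 2. use mup.init.* in place of nn.init.* so the init scale is width-aware
        #    (these functions read the infshape metadata attached by set_base_shapes)
        if isinstance(module, nn.Linear):
            mup_normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

# 3. inside self-attention, scale the logits by 1/d_head instead of 1/sqrt(d_head), e.g.
#    attn_scores = torch.matmul(q, k.transpose(-1, -2)) / head_dim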

In addition, when using a mutransformer, one needs to provide a "base shape file" that lets the model know how to properly scale the learning rate and attention with width. This is designed so that if the model's parameter shapes match the "base shapes", the model is in the original parametrization, i.e., it is backward compatible.

from mutransformers import BertConfig, BertForMaskedLM
from mup import set_base_shapes

# instantiate model
model = BertForMaskedLM(config=BertConfig(...))
# set base shapes so mup knows how each dimension scales with width
set_base_shapes(model, path_to_base_shape_file)
# re-initialize with the muP-aware init
model.apply(model._init_weights)
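
The base shape file referenced above (path_to_base_shape_file) is generated once up front. Here is a sketch following the pattern in the mup and mutransformers READMEs, where the "base" and "delta" models differ only in width; the hidden sizes and filename are arbitrary placeholders.

from mutransformers import BertConfig, BertForMaskedLM
from mup import make_base_shapes

# two models that differ only in width
base_model = BertForMaskedLM(config=BertConfig(hidden_size=256, intermediate_size=256))
delta_model = BertForMaskedLM(config=BertConfig(hidden_size=200, intermediate_size=200))
# record how each parameter shape scales with width and save it to a file
make_base_shapes(base_model, delta_model, savefile="bert256.bsh")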

More Seamless Integration

Now, the mutransformers repo is primarily designed to serve as a set of examples of how to implement muP in existing transformers. All of the above can be streamlined if we really want seamless integration into Huggingface.

For example, the user interface for instantiating a model could stay the same as it is now, with just an additional flag mup=True in BertConfig that switches on muP. BertConfig itself may carry a default set of base shapes for use in this scenario, which the user can also modify if necessary.

# the model automatically sets base shapes based on defaults in BertConfig
# no need to re-initialize either
model = BertForMaskedLM(config=BertConfig(mup=True,...))
# use model immediately, e.g., train

In addition, mup.MuAdamW can be incorporated natively into Huggingface as well, so that there is no dependency on the mup package at all.

muP for All Transformers?

Since there is currently no automatic way of retrofitting existing transformers, it could be quite a task to add muP to all of the transformers in Huggingface. A good practical compromise is to implement muP only for the most commonly used models in Huggingface.

In the interim, research can be done on a method for such automatic retrofitting. This could even involve a pull request into PyTorch core.

Conclusion

Again, this issue is intended to start a discussion of whether and how to make muP available to Huggingface users natively. It could be that the best course forward is to have users implement muP transformers themselves as in mutransformers, or even to grow mutransformers into a full repo of muP transformers. And even if we do decide to integrate muP into Huggingface, there could be many ways to do it.

I hope the discussion here can elucidate the right course of action.

@LysandreJik
Member

Pinging maintainers for knowledge: @patrickvonplaten @sgugger @patil-suraj

@patrickvonplaten
Contributor

patrickvonplaten commented Mar 17, 2022

This is only really relevant for pretraining, I assume, no? I wonder whether it might make more sense to add this directly to accelerate? cc @sgugger

@edwardjhu

Hi Patrick,

I'm another maintainer of the mup repo.

It's true that the biggest payoff will probably come from applying our technique to large-scale pretraining, but mup can also help users who are working with novel architectures on a smaller scale. For example, someone might modify an existing model in transformers to test a new idea, and mup can improve stability and reduce the need to tune HPs when they gradually scale up. We aren't sure if this is a common use case for transformers, but mentioning it in case it is.

We are more than happy to look into integration with other tools such as accelerate. A quick scan tells me that it abstracts away the management of devices in PyTorch without having to know the architecture inside an nn.Module. mup, on the other hand, does need to know the architecture, specifically which weight dimensions go to infinity. We'd love to hear more from you guys regarding this!

@sgugger
Collaborator

sgugger commented Mar 28, 2022

From what I gather of the mup repository, it's not general enough (yet?) to be integrated into Accelerate as it seems to be very targeted toward Transformer models, whereas Accelerate handles any kind of models.

As for integrating into Transformers, I think everyone would be delighted to see it as easily accessible as possible. There is just the (big) catch of modifying every modeling file for this. It's not really an option for two reasons:

  1. users are getting upset with us that modeling files already contain too much stuff they do not need when they tweak them for their experiments, so this would add more code that's not strictly necessary just to run BERT, GPT-2, etc.
  2. there are currently 108 different architectures, so waaaay too many modeling files ;-)

As such it would be way more powerful if we could design a function that automatically converts a model to be used with muP.
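
To make that concrete, here is a rough, hypothetical sketch of the layer-swapping part of such a converter. The function name and readout_path argument are made up for illustration, and this deliberately ignores the harder pieces (the attention scaling, weight tying, and making _init_weights use mup.init).

from mup import MuReadout, set_base_shapes

def convert_readout_to_mup(model, readout_path, base_shapes_file):
    # locate the readout nn.Linear by attribute path, e.g. "cls.predictions.decoder"
    parent = model
    *parents, leaf = readout_path.split(".")
    for name in parents:
        parent = getattr(parent, name)
    old = getattr(parent, leaf)
    # swap it for a MuReadout with the same dimensions
    setattr(parent, leaf, MuReadout(old.in_features, old.out_features,
                                    bias=old.bias is not None))
    # attach width-scaling metadata, then re-initialize
    # (a stock _init_weights would still need to use mup.init for correct scaling)
    set_base_shapes(model, base_shapes_file)
    model.apply(model._init_weights)
    return model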

The first two points you mention are easy to do on an existing model (we can change the Linear layers on the fly and re-init the weights); the third one, the attention scaling, is a tiny bit more complex. I don't know if you have any ideas on this. As for making mup.AdamW accessible, we wouldn't necessarily take the code, but it's very easy to add support for a new optimizer in the Trainer with an optional dependency on mutransformers.
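
On the optimizer side, one way to wire this up today, with mup as an optional dependency, is to hand a pre-built muP optimizer to the Trainer via its optimizers argument. A sketch, assuming model already has its base shapes set; train_dataset and the hyperparameter values are placeholders.

from mup import MuAdamW
from transformers import Trainer, TrainingArguments

# MuAdamW builds width-aware parameter groups and hands them to torch's AdamW
optimizer = MuAdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,
    optimizers=(optimizer, None),  # None: let the Trainer create its default LR scheduler
)
trainer.train()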

If we don't manage to have such a function, we also have a feature where you can host any modeling code on the Hub and have it run with Transformers (using the AutoModel.from_pretrained API). See here for the documentation. It would allow us to easily integrate muP in our examples while not changing the code of popular models such as BERT and GPT-2.
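
For reference, loading such Hub-hosted custom modeling code looks like this (the repo name is a placeholder):

from transformers import AutoModel

# pulls both the weights and the custom (muP-aware) modeling code from the Hub repo
model = AutoModel.from_pretrained("some-org/mup-bert-example", trust_remote_code=True)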

Let me know your thoughts!

@thegregyang
Author

thegregyang commented Mar 28, 2022

Hi @sgugger, is there any particular reason you say that mup is very targeted toward Transformers? We definitely designed mup with general models in mind, even though Transformers are where a lot of the payoff would be. For example, we have a ResNet example in our repo.

@sgugger
Collaborator

sgugger commented Mar 28, 2022

You're right, I should have said that the adaptations you mention seem very targeted toward Transformers (in particular, point 3 above).

@edwardjhu

Hi @sgugger,

Like Greg said, only the third item is Transformer-specific (we should have noted that clearly). My concern w.r.t. accelerate is more that, in its current shape, it seems agnostic to the model architecture, whereas our technique requires knowing things like how many dimensions of a parameter tensor grow with width.

I like the idea of having a converter function so we keep the model files as clean as possible. I'd also like to point out that MuAdam is simply a wrapper on top of torch Adam that manipulates the parameter group dictionary to explicitly adjust learning rates according to muP. Perhaps this explicit conversion can be a part of the converter function instead, to remove the dependency on the mup package. @thegregyang, what do you think?
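
To illustrate the mechanism described above, here is a rough sketch (not the actual mup source) of an AdamW wrapper that adjusts per-group learning rates. It relies on the infshape metadata that set_base_shapes attaches to each parameter and only shows the Adam-style rule for matrix-like weights.

import torch

def mup_adamw_like(model, lr=1e-3, **adamw_kwargs):
    param_groups = []
    for p in model.parameters():
        if hasattr(p, "infshape") and p.infshape.ninf() == 2:
            # matrix-like weights (both dims grow with width): scale the lr down by the width multiplier
            param_groups.append({"params": [p], "lr": lr / p.infshape.width_mult()})
        else:
            # vector-like or fixed-size parameters keep the base lr
            param_groups.append({"params": [p], "lr": lr})
    return torch.optim.AdamW(param_groups, **adamw_kwargs)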

@huggingface huggingface deleted a comment from github-actions bot Apr 26, 2022
@thegregyang
Author

After discussion with Edward, we think perhaps hosting custom model code on the Hub would be the best way to go. We have some questions about this:

  1. Is there an API to initialize a model from scratch instead of from a checkpoint (in contrast to AutoModel.from_pretrained)?
  2. Are there IntelliSense or other user-facing tools in VS Code or other IDEs to facilitate the use of a model from the Hub? More generally, I'm just wondering what kind of user experience we are dealing with here.

@sgugger
Collaborator

sgugger commented May 3, 2022

You can create a randomly initialized model with AutoModel.from_config, with the config pulled with AutoConfig.from_pretrained:

from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained(checkpoint_name)
model = AutoModel.from_config(config)

As for the second point, not really.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 5, 2022
@thegregyang
Author

Sorry still working on this!

@LysandreJik LysandreJik reopened this Jun 9, 2022
@LysandreJik LysandreJik added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label Jun 9, 2022
@OhadRubin

Any update?

@thegregyang
Author

@sodabeta7 has been working on this. @sodabeta7 could you summarize your progress?

@fzyzcjy
Contributor

fzyzcjy commented Mar 21, 2024

Hi, are there any updates after a year? Thanks!

@nilsec

nilsec commented Apr 11, 2024

Curious too, any news?

@TeddLi

TeddLi commented May 7, 2024

@thegregyang, I trained a model with muP. Just wondering, how could I convert my muP model weights to SP (standard parametrization) so that I can load them with Huggingface?
