
Add a use_parallel_residual argument to control the residual computing way #18695

Merged (3 commits) on Sep 27, 2022

Conversation

NinedayWang (Contributor) commented Aug 19, 2022

What does this PR do?

Add a gpt_j_residual argument to control how the residual is computed. The default value is False, consistent with https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py#L592, which makes it easier to convert models trained with gpt-neox into Hugging Face format.

Who can review?

Anyone in the community is free to review the PR once the tests have passed.
@LysandreJik @patrickvonplaten

HuggingFaceDocBuilderDev commented Aug 19, 2022

The documentation is not available anymore as the PR was closed or merged.

@NinedayWang NinedayWang marked this pull request as draft August 25, 2022 13:26
@NinedayWang NinedayWang marked this pull request as ready for review August 25, 2022 13:30
urialon (Contributor) commented Sep 1, 2022

Thanks @NinedayWang!

If someone could review this, that would be great.

This PR will also allow loading our PolyCoder model in transformers (https://arxiv.org/pdf/2202.13169.pdf).

patrickvonplaten (Contributor) commented Sep 2, 2022

Thanks a lot for the PR @NinedayWang,

However, I'm not 100% sure we want this, as we generally try not to make Transformers models very configurable.
Are there already pretrained checkpoints with gpt_j_residual=True that would be useful for the community?

VHellendoorn commented:

Hi @patrickvonplaten! I appreciate your perspective, but I think in this case supporting the variation is warranted. The default of nearly all training configurations in the NeoX toolkit is to have this flag set to False. Only the 20B configuration uses that residual. So I expect that supporting this variation will make deploying new models trained with the NeoX toolkit easier for a lot of folks. Given how costly these models are to train, we are not planning to create a new variant using this residual.

patrickvonplaten (Contributor) commented:

Sorry, I don't fully follow here.

VHellendoorn commented:

Ah, yes let me clarify. GPT-NeoX is a toolkit that can be (and is actively) used to train GPT-style models. It supports a broad range of model sizes, and has a few other hyper-parameters to vary the architecture in other ways, like that gpt_j_residual flag.

Now Neox-20B is a specific, 20B parameter model trained with this toolkit. It largely uses the same configuration that other models trained with GPT-NeoX would, with the notable exception of the aforementioned residual flag: that flag is set to False in all configurations by default, but was turned to True by the authors of the 20B model. As such, other models trained with the GPT-NeoX toolkit are unlikely to have this flag enabled.

So for HuggingFace/transformers to support most other models trained with the NeoX toolkit, including PolyCoder, we could either add multiple other modeling_gpt_neox_MODELNAME.py* style architectures, or make the basic modeling_gpt_neox.py architecture a bit more flexible. The latter seems more reasonable to me, but if the HF community prefers the former, that could work for us too.

Hope this clarifies things!
-Vincent

patrickvonplaten (Contributor) commented:

Hey @VHellendoorn,

Thanks for clarifying! Putting @LysandreJik and @sgugger in cc here. Given the "single-file" policy of Transformers (see post here), I think we would indeed prefer to add a new file such as modeling_poly_coder.py if the architecture is too different from existing architectures such as gpt_j or gpt_neox_20b.
Also, one more question: if PolyCoder follows the same architecture as GPT-NeoX, couldn't we just load it with https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py ?

We're definitely more than happy to help get PolyCoder added to Transformers (cc @lvwerra as well).

VHellendoorn commented:

Hi @patrickvonplaten,

Thanks, yes that would work for us too. The reason we can't load PolyCoder with that architecture file is precisely because modeling_gpt_neox.py hard-codes the assumption that gpt_j_residual is set to True. Hence the change in this PR, which makes that a configurable boolean. If we add a special modeling_polycoder.py file, it will just be identical to the modeling_gpt_neox.py one except for using the "normal" residual branch that most other models trained with GPT-NeoX will tend to use. So a slightly weird consequence of splitting the architectures across two files would be that most new models trained with GPT-NeoX will have to be loaded with the polycoder architecture, instead of the neox one. This PR would avoid such duplication by making that a togglable boolean instead.

-Vincent

sgugger (Collaborator) commented Sep 9, 2022

As @patrickvonplaten mentioned, Transformers is not a modular toolkit, so it's not surprising that one toolkit class such as GPT-NeoX in EleutherAI is split into several different classes in Transformers (exactly like BART from fairseq is split into multiple classes here).

NinedayWang (Contributor, Author) commented Sep 13, 2022

Thanks for your reply @patrickvonplaten @sgugger.

Let me explain. GPT-NeoX supports two different ways of computing the residual, controlled by the gpt_j_residual configuration option:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py#L592
And the default value is gpt_j_residual=False:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/neox_args.py#L311

gpt_j_residual: bool = False
"""
If false, we use the conventional residual path:
  x = x + attn(ln1(x))
  x = x + mlp(ln2(x))
Otherwise, we use the residual path from GPT-J, which offers a slight speedup:
  x = ln(x)
  x = x + attn(x) + mlp(x)
"""

As @VHellendoorn said, gpt-neox-20b is a special case of specifying gpt_j_residual=True in the 20B.yml config file, most models trained with GPT-NeoX use the default value of False, such as small.yml, medium.yml, large.yml, 2-7B.yml and so on. However, modeling_gpt_neox.py is an implementation that assumes gpt_j_residual=True, so we cannot use modeling_gpt_neox.py to load PolyCoder, even though PolyCoder actually follows the same architecture as GPT-NeoX.

I've read about the "single-file" policy, but I think GPT-NeoX is a bit special. If we load gpt-neox-20b with model_type=gpt_neox but gpt-neox-2.7b or gpt-neox-0.4b with model_type=polycoder, it could be confusing, and people would need extra time to figure out which model_type to use.

patrickvonplaten (Contributor) commented:

Hey @NinedayWang,

Thanks for the explanation. Sorry, a couple more questions to clarify: why is it called gpt_j_residual? Could this be changed to another name? I don't fully understand the relation to GPT-J here.

If half of the gpt-neox checkpoints use one residual architecture and gpt-neox-20b uses another, I'm actually not against trying to fit it in one file.

NinedayWang (Contributor, Author) commented:

Thanks a lot! The name gpt_j_residual comes from the developers of GPT-NeoX; the unconventional residual architecture in GPT-NeoX is actually inherited from GPT-J. For clarity, and to stay consistent with the original GPT-NeoX, I think it is better to keep the name gpt_j_residual.

patrickvonplaten (Contributor) commented:

Is this essentially the "parallel" residual computation that allows the model to be tensor-parallelized better (especially on TPUs), i.e. the same architecture that was used in PaLM (https://arxiv.org/abs/2204.02311)?

sgugger (Collaborator) left a comment:

We can make an exception for the same family of checkpoints indeed. There is something similar in BLOOM.

However, the parameter should be better named (gpt_j_residual will not evoke anything to a user) and needs to be documented.

use_cache=use_cache,
output_attentions=output_attentions,
)
attn_output = attention_layer_outputs[0] # output_attn: a, present, (attentions)
sgugger (Collaborator) commented on the diff:

Maybe leave everything until this line outside of the if block? Duplicating this code doesn't serve any purpose here.

NinedayWang (Contributor, Author) replied:

Thanks for your review! I fixed it in 70aaec5

@@ -99,6 +99,7 @@ def __init__(
bos_token_id=0,
eos_token_id=2,
tie_word_embeddings=False,
gpt_j_residual=False,
sgugger (Collaborator) commented on the diff:

This name is not informative at all. Reading the code, it's more of an add_residual_at_end or something along those lines. The new parameter will also require documentation.

NinedayWang (Contributor, Author) replied Sep 23, 2022:

I renamed it to "use_parallel_residual" and set the default value to True (4c12b69) so that the released "gpt-neox-20b" doesn't need to change the config file.
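
For illustration, after the rename either style can be selected through the config; a minimal sketch with tiny, made-up sizes (only use_parallel_residual matters here):

```python
from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

# Tiny, made-up dimensions purely for illustration.
common = dict(vocab_size=1024, hidden_size=64, num_hidden_layers=2,
              num_attention_heads=4, intermediate_size=256)

# gpt-neox-20b style: parallel residual (the new default, so the released
# config needs no change).
neox_20b_style = GPTNeoXConfig(**common)

# PolyCoder / most other NeoX-trained models: conventional residual.
polycoder_style = GPTNeoXConfig(use_parallel_residual=False, **common)

model = GPTNeoXForCausalLM(polycoder_style)
print(model.config.use_parallel_residual)  # False
```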

NinedayWang (Contributor, Author) commented:

Yes, it's the same "parallel" architecture as PaLM, which provides faster training speed when training large-scale models.
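
To sketch where that speedup comes from: in the PaLM-style formulation, where a single normalized input feeds both branches, the attention QKV projection and the MLP input projection can be fused into one larger matmul, and neither branch depends on the other's output. A rough standalone illustration (hypothetical small sizes, not code from this PR):

```python
import torch
import torch.nn as nn

hidden, intermediate = 64, 256
x = torch.randn(2, 8, hidden)

ln = nn.LayerNorm(hidden)
qkv_proj = nn.Linear(hidden, 3 * hidden)   # attention input projection
mlp_in = nn.Linear(hidden, intermediate)   # MLP input projection

# Two separate matmuls over the same normalized input ...
h = ln(x)
qkv, mlp_hidden = qkv_proj(h), mlp_in(h)

# ... can be replaced by one fused matmul over concatenated weights.
fused = nn.Linear(hidden, 3 * hidden + intermediate)
with torch.no_grad():
    fused.weight.copy_(torch.cat([qkv_proj.weight, mlp_in.weight], dim=0))
    fused.bias.copy_(torch.cat([qkv_proj.bias, mlp_in.bias], dim=0))
qkv_fused, mlp_hidden_fused = fused(h).split([3 * hidden, intermediate], dim=-1)

assert torch.allclose(qkv, qkv_fused, atol=1e-5)
assert torch.allclose(mlp_hidden, mlp_hidden_fused, atol=1e-5)
```

With the two-LayerNorm variant used in gpt-neox-20b the exact fusion differs, but the attention and MLP branches are still independent of each other, so they can be computed concurrently.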

@@ -66,6 +66,9 @@ class GPTNeoXConfig(PretrainedConfig):
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
use_parallel_residual (`bool`, *optional*, defaults to `True`):
Contributor comment on the diff:

like the name!

hidden_states = mlp_output + attn_output + residual
if self.use_parallel_residual:
# pseudocode:
# x = x + attn(ln1(x)) + mlp(ln2(x))
Contributor comment on the diff:

very nice comments!

@patrickvonplaten changed the title from "Add a gpt_j_residual argument to control the residual computing way" to "Add a use_parallel_residual argument to control the residual computing way" on Sep 27, 2022
patrickvonplaten (Contributor) left a comment:

Looks good to me
