
Add a use_parallel_residual argument to control the residual computing way #18695

Merged (3 commits) on Sep 27, 2022

Conversation

NinedayWang (Contributor) commented Aug 19, 2022

What does this PR do?

Add a gpt_j_residual argument to control how the residual is computed. The default value is False, consistent with https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py#L592, which makes it easier to convert models trained with gpt-neox into Hugging Face format.

Who can review?

Anyone in the community is free to review the PR once the tests have passed.
@LysandreJik @patrickvonplaten

HuggingFaceDocBuilderDev commented Aug 19, 2022

The documentation is not available anymore as the PR was closed or merged.

@NinedayWang NinedayWang marked this pull request as draft August 25, 2022 13:26
@NinedayWang NinedayWang marked this pull request as ready for review August 25, 2022 13:30
urialon (Contributor) commented Sep 1, 2022

Thanks @NinedayWang!

If someone could review this, that would be great.

This PR will also allow loading our PolyCoder model in transformers (https://arxiv.org/pdf/2202.13169.pdf).

patrickvonplaten (Contributor) commented Sep 2, 2022

Thanks a lot for the PR @NinedayWang,

However, I'm not 100% sure we want this, as we generally try not to make Transformers models very configurable.
Are there already pretrained checkpoints with gpt_j_residual=True that would be useful for the community?

VHellendoorn commented:

Hi @patrickvonplaten! I appreciate your perspective, but I think in this case supporting the variation is warranted. The default of nearly all training configurations in the NeoX toolkit is to have this flag set to False. Only the 20B configuration uses that residual. So I expect that supporting this variation will make deploying new models trained with the NeoX toolkit easier for a lot of folks. Given how costly these models are to train, we are not planning to create a new variant using this residual.

patrickvonplaten (Contributor) commented:

Sorry, I don't fully follow here.

VHellendoorn commented:

Ah, yes let me clarify. GPT-NeoX is a toolkit that can be (and is actively) used to train GPT-style models. It supports a broad range of model sizes, and has a few other hyper-parameters to vary the architecture in other ways, like that gpt_j_residual flag.

Now Neox-20B is a specific, 20B parameter model trained with this toolkit. It largely uses the same configuration that other models trained with GPT-NeoX would, with the notable exception of the aforementioned residual flag: that flag is set to False in all configurations by default, but was turned to True by the authors of the 20B model. As such, other models trained with the GPT-NeoX toolkit are unlikely to have this flag enabled.

So for HuggingFace/transformers to support most other models trained with the NeoX toolkit, including PolyCoder, we could either add multiple other modeling_gpt_neox_MODELNAME.py* style architectures, or make the basic modeling_gpt_neox.py architecture a bit more flexible. The latter seems more reasonable to me, but if the HF community prefers the former, that could work for us too.

Hope this clarifies things!
-Vincent

patrickvonplaten (Contributor) commented:

Hey @VHellendoorn,

Thanks for clarifying! Putting @LysandreJik and @sgugger in cc here. Given the "single-file" policy of Transformers (see post here), I think we would indeed prefer to add a new file such as modeling_poly_coder.py if the architecture is too different from existing architectures such as gpt_j or gpt_neox_20b.
Also, one more question: if PolyCoder follows the same architecture as GPT-NeoX, couldn't we just load it with https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py ?

We're definitely more than happy to help get PolyCoder added to Transformers (cc @lvwerra as well).

VHellendoorn commented:

Hi @patrickvonplaten,

Thanks, yes that would work for us too. The reason we can't load PolyCoder with that architecture file is precisely because modeling_gpt_neox.py hard-codes the assumption that gpt_j_residual is set to True. Hence the change in this PR, which makes that a configurable boolean. If we add a special modeling_polycoder.py file, it will just be identical to the modeling_gpt_neox.py one except for using the "normal" residual branch that most other models trained with GPT-NeoX will tend to use. So a slightly weird consequence of splitting the architectures across two files would be that most new models trained with GPT-NeoX will have to be loaded with the polycoder architecture, instead of the neox one. This PR would avoid such duplication by making that a togglable boolean instead.

-Vincent

sgugger (Collaborator) commented Sep 9, 2022

As @patrickvonplaten mentioned, Transformers is not a modular toolkit, so it's not surprising that one toolkit class such as GPT-NeoX in EleutherAI is split into several different classes in Transformers (exactly like BART from fairseq is split into multiple classes here).

NinedayWang (Contributor, Author) commented Sep 13, 2022

Thanks for your reply @patrickvonplaten @sgugger.

Let me explain. GPT-NeoX supports two different ways of computing the residual, controlled by the gpt_j_residual configuration option:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py#L592
And the default value is gpt_j_residual=False:
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/neox_args.py#L311

gpt_j_residual: bool = False
"""
If false, we use the conventional residual path:
  x = x + attn(ln1(x))
  x = x + mlp(ln2(x))
Otherwise, we use the residual path from GPT-J, which offers a slight speedup:
  x = ln(x)
  x = x + attn(x) + mlp(x)
"""

As @VHellendoorn said, gpt-neox-20b is a special case of specifying gpt_j_residual=True in the 20B.yml config file, most models trained with GPT-NeoX use the default value of False, such as small.yml, medium.yml, large.yml, 2-7B.yml and so on. However, modeling_gpt_neox.py is an implementation that assumes gpt_j_residual=True, so we cannot use modeling_gpt_neox.py to load PolyCoder, even though PolyCoder actually follows the same architecture as GPT-NeoX.

I've read about the "single-file" policy, but I think GPT-NeoX is a bit special. If we load gpt-neox-20b with model_type=gpt_neox but gpt-neox-2.7b or gpt-neox-0.4b with model_type=polycoder, it could be confusing, and people would need extra time to figure out which model_type to use.

patrickvonplaten (Contributor) commented:

Hey @NinedayWang,

Thanks for the explanation. Sorry, a couple more questions to clarify: why is it called gpt_j_residual? Could this be changed to another name? I don't fully understand the relation to GPT-J here.

If half of the gpt-neox checkpoints use one residual architecture and gpt-neox-20b uses another, I'm actually not against trying to fit it in one file.

NinedayWang (Contributor, Author) commented:

Thanks a lot! The name gpt_j_residual comes from the developers of GPT-NeoX; the unconventional residual architecture in GPT-NeoX is actually inherited from GPT-J. For clarity, and to stay consistent with the original GPT-NeoX, I think it is better to keep the name gpt_j_residual.

patrickvonplaten (Contributor) commented:

Is this essentially the "parallel" residual computation that allows the model to be tensor-parallelized better (especially on TPUs), i.e. the same architecture that was used in PaLM (https://arxiv.org/abs/2204.02311)?

sgugger (Collaborator) left a comment:

We can make an exception for the same family of checkpoints indeed. There is something similar in BLOOM.

However, the parameter should be better named (gpt_j_residual will not evoke anything to a user) and needs to be documented.

use_cache=use_cache,
output_attentions=output_attentions,
)
attn_output = attention_layer_outputs[0] # output_attn: a, present, (attentions)
sgugger (Collaborator) commented on the diff:

Maybe leave everything until this line outside of the if block? Duplicating this code doesn't serve any purpose here.

NinedayWang (Contributor, Author) replied:

Thanks for your review! I fixed it in 70aaec5

@@ -99,6 +99,7 @@ def __init__(
bos_token_id=0,
eos_token_id=2,
tie_word_embeddings=False,
gpt_j_residual=False,
sgugger (Collaborator) commented on the diff:

This name is not informative at all. Reading the code, it's more of an add_residual_at_end or something along those lines. The new parameter will also require documentation.

NinedayWang (Contributor, Author) replied Sep 23, 2022:

I renamed it to "use_parallel_residual" and set the default value to True (4c12b69) so that the released "gpt-neox-20b" doesn't need to change the config file.
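
For illustration, after the rename either style can be selected through the config; a minimal sketch with tiny, made-up sizes (only use_parallel_residual matters here):

```python
from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

# Tiny, made-up dimensions purely for illustration.
common = dict(vocab_size=1024, hidden_size=64, num_hidden_layers=2,
              num_attention_heads=4, intermediate_size=256)

# gpt-neox-20b style: parallel residual (the new default, so the released
# config needs no change).
neox_20b_style = GPTNeoXConfig(**common)

# PolyCoder / most other NeoX-trained models: conventional residual.
polycoder_style = GPTNeoXConfig(use_parallel_residual=False, **common)

model = GPTNeoXForCausalLM(polycoder_style)
print(model.config.use_parallel_residual)  # False
```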

NinedayWang (Contributor, Author) commented:

Yes, it's the same "parallel" architecture as PaLM, which provides faster training speed when training large-scale models.
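
To sketch where that speedup comes from: in the PaLM-style formulation, where a single normalized input feeds both branches, the attention QKV projection and the MLP input projection can be fused into one larger matmul, and neither branch depends on the other's output. A rough standalone illustration (hypothetical small sizes, not code from this PR):

```python
import torch
import torch.nn as nn

hidden, intermediate = 64, 256
x = torch.randn(2, 8, hidden)

ln = nn.LayerNorm(hidden)
qkv_proj = nn.Linear(hidden, 3 * hidden)   # attention input projection
mlp_in = nn.Linear(hidden, intermediate)   # MLP input projection

# Two separate matmuls over the same normalized input ...
h = ln(x)
qkv, mlp_hidden = qkv_proj(h), mlp_in(h)

# ... can be replaced by one fused matmul over concatenated weights.
fused = nn.Linear(hidden, 3 * hidden + intermediate)
with torch.no_grad():
    fused.weight.copy_(torch.cat([qkv_proj.weight, mlp_in.weight], dim=0))
    fused.bias.copy_(torch.cat([qkv_proj.bias, mlp_in.bias], dim=0))
qkv_fused, mlp_hidden_fused = fused(h).split([3 * hidden, intermediate], dim=-1)

assert torch.allclose(qkv, qkv_fused, atol=1e-5)
assert torch.allclose(mlp_hidden, mlp_hidden_fused, atol=1e-5)
```

With the two-LayerNorm variant used in gpt-neox-20b the exact fusion differs, but the attention and MLP branches are still independent of each other, so they can be computed concurrently.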

@@ -66,6 +66,9 @@ class GPTNeoXConfig(PretrainedConfig):
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
use_parallel_residual (`bool`, *optional*, defaults to `True`):
Contributor comment on the diff:

like the name!

hidden_states = mlp_output + attn_output + residual
if self.use_parallel_residual:
# pseudocode:
# x = x + attn(ln1(x)) + mlp(ln2(x))
Contributor comment on the diff:

very nice comments!

@patrickvonplaten changed the title from "Add a gpt_j_residual argument to control the residual computing way" to "Add a use_parallel_residual argument to control the residual computing way" on Sep 27, 2022
patrickvonplaten (Contributor) left a comment:

Looks good to me
