Fix floating point precision issue for RoPE #23837
What does this PR do?
This PR fixes the floating point precision issue in `RotaryEmbedding`. The purpose of this PR is to fix an inconsistency between GPT-NeoX and HF Transformers, which causes model performance degradation.
Issue
In the current implementation of `RotaryEmbedding`, `inv_freq` is first initialized in float32. This value is then used to initialize `cos_cached` and `sin_cached`, also in float32. As a result, `cos_cached` and `sin_cached` remain float32 even if the model (including `inv_freq`) is converted to float16, because these two tensors are not targeted by the dtype conversion of the `half()` method. Note that there is also recomputation logic for these two tensors, but it is very unlikely to be triggered:
transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py, line 268 (at f67dac9)
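For illustration, the pattern looks roughly like this (a condensed sketch with simplified names, not the actual library code):

```python
import torch


class RotaryEmbeddingSketch(torch.nn.Module):
    """Condensed illustration of the caching pattern described above."""

    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        # The caches are built once here, in float32.
        t = torch.arange(max_position_embeddings, dtype=inv_freq.dtype)
        freqs = torch.einsum("i,j->ij", t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # Plain attributes, so Module.half() / Module.to() never converts them.
        self.cos_cached = emb.cos()[None, None, :, :]
        self.sin_cached = emb.sin()[None, None, :, :]


rope = RotaryEmbeddingSketch(dim=64).half()
print(rope.inv_freq.dtype)    # torch.float16 -- registered buffer, converted
print(rope.cos_cached.dtype)  # torch.float32 -- plain attribute, left untouched
```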
However, this implementation seems inconsistent with the one in the EleutherAI/gpt-neox library. In their implementation, `cos_cached` and `sin_cached` are recomputed in the forward method on almost every call. Thus, the dtype of `cos_cached` and `sin_cached` is always consistent with the dtype of `inv_freq`. This inconsistency between the two libraries (HF Transformers and GPT-NeoX) causes performance degradation for models converted from gpt-neox.
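The recompute-in-forward pattern amounts to roughly the following (my paraphrase of the behavior described above, not EleutherAI's actual code; their caching of the previous result is omitted, and float16 `cos`/`sin` assumes the tensors live on GPU):

```python
import torch

def rotary_tables(inv_freq, seq_len, device):
    # Rebuild the cos/sin tables from inv_freq on every call, so their dtype
    # always tracks the current dtype of inv_freq (e.g. float16 after half()).
    t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
    freqs = torch.einsum("i,j->ij", t, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos(), emb.sin()
```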
For example, the perplexity of the language model on the WikiText corpus is as follows:
(Sorry that the perplexity values are really bad; I am reporting the performance of a model trained on toy data for debugging purposes.)
Solution
I basically followed the previous PR #22888 and made a similar fix.
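Concretely, the change goes in the following direction (my own condensed sketch with simplified names, not the exact diff of this PR or of #22888): defer building the caches to the forward method, so they are derived from `inv_freq` after any `half()`/`to()` conversion has already happened.

```python
import torch


class RotaryEmbeddingFixedSketch(torch.nn.Module):
    """Condensed sketch of the proposed behavior, not the merged HF code."""

    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        # No cache construction here: __init__ runs on CPU, where torch.cos
        # is not implemented for float16 (the "cos_vml_cpu" error below).
        self.cos_cached = None
        self.sin_cached = None
        self.max_seq_len_cached = 0

    def forward(self, x, seq_len):
        if self.cos_cached is None or seq_len > self.max_seq_len_cached:
            # By now inv_freq has been converted by half()/to(), so the
            # caches inherit the model's dtype automatically.
            t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
            emb = torch.cat((freqs, freqs), dim=-1)
            self.cos_cached = emb.cos()[None, None, :, :]
            self.sin_cached = emb.sin()[None, None, :, :]
            self.max_seq_len_cached = seq_len
        return (self.cos_cached[..., :seq_len, :],
                self.sin_cached[..., :seq_len, :])
```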
Possible Side Effect
In the original code, `cos_cached` and `sin_cached` are initialized in the model constructor. However, I had to move the initialization code to the forward method; otherwise the library gave me the following error: `"cos_vml_cpu" not implemented for 'Half'`.
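For reference, the error comes from PyTorch itself: at the time of writing, `torch.cos` is not implemented for float16 tensors on CPU, which can be reproduced outside the model:

```python
import torch

t = torch.arange(8, dtype=torch.float16)  # float16 tensor on CPU
torch.cos(t)  # RuntimeError: "cos_vml_cpu" not implemented for 'Half'
```

Deferring the cache construction to forward avoids this, because by then the tensors are typically already on GPU, where float16 `cos`/`sin` are supported.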
As a result of this change, `torch.jit.trace` might no longer work. Since I am not sure how `jit.trace` is used with this model, I don't have a workaround for this.
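If it helps, a smoke test along the following lines could check whether tracing still works (a hypothetical check with a tiny random config, not something I have run as part of this PR):

```python
import torch
from transformers import GPTNeoXConfig, GPTNeoXModel

config = GPTNeoXConfig(hidden_size=64, num_attention_heads=4,
                       num_hidden_layers=2, intermediate_size=128,
                       vocab_size=1000, torchscript=True)
model = GPTNeoXModel(config).eval()
dummy_ids = torch.randint(0, config.vocab_size, (1, 16))

# Tracing records the operations executed for this particular input, so if
# the caches are now created inside forward(), the trace may bake in the
# branch/shapes seen here. If this call fails or warns, that is the concern.
traced = torch.jit.trace(model, dummy_ids)
traced(dummy_ids)
```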
Similar Issues
Who can review?
@ArthurZucker and @younesbelkada