
fix amp pooling overflow #9670

Merged
merged 9 commits into from
Jan 30, 2023

Conversation

zylo117
Contributor

@zylo117 zylo117 commented Jan 29, 2023

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Pooling, mostly average pooling and max pooling, may overflow in AMP (float16) mode: when all the elements in a channel are summed, the total can exceed the upper limit of float16, which is only 65504, so the sum overflows to infinity. This is very common when the weights have not yet converged.
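The overflow is easy to reproduce outside of any network. A minimal NumPy sketch (NumPy stands in here for illustration; the PR itself targets PyTorch pooling layers):

```python
import numpy as np

# Summing many moderate float16 values exceeds the float16 maximum (65504)
# and overflows to inf, which then propagates as NaN through the loss.
x = np.full(10000, 60.0, dtype=np.float16)
print(x.sum())                     # inf: accumulated in float16
print(x.astype(np.float32).sum())  # 600000.0: safe in float32
```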

I trained RTMDet on COCO from scratch in AMP mode and encountered a NaN loss issue. I suspected this was overflow, and when I turned AMP off, the NaN issue disappeared. That confirmed it was the pooling overflow issue; I recognized it because I had fixed the same problem in another repo before.

Modification

Detect whether the input tensor dtype in pooling is float16 or float32; if it is float16, cast it to float32, perform the pooling, then cast the result back to float16.

I tested by training RTMDet on COCO from scratch; there are no more NaNs. But the issue may still exist in other networks.

@CLAassistant

CLAassistant commented Jan 29, 2023

CLA assistant check
All committers have signed the CLA.

@ZwwWayne ZwwWayne added this to the 3.0.0rc6 milestone Jan 29, 2023
@ZwwWayne
Collaborator

Hi @zylo117 ,
The overall PR looks good to me now and can be merged if the modification has been verified to resolve the issue in your case. Furthermore, please rebase your code onto the most recent dev-3.x, which resolves the lint issue.

@zylo117
Contributor Author

zylo117 commented Jan 30, 2023

Hi @zylo117 , The overall PR looks good to me now and can be merged If the modification has been verified to resolve the issue in your case. Furthermore, please rebase your code to the most recent dev-3.x, which resolves the lint issue.

What is
"""
yapf.....................................................................Failed

  • hook id: yapf
  • files were modified by this hook

"""
I don't know what kind of formatter you are using. When I format the code with yapf using either the pep8 or google style, many original lines are changed, so I think the original file was formatted with a custom configuration. Can't the CI perform auto-formatting?

@RangiLyu
Member

Thanks for your contribution! Please run `pip install pre-commit` and then run `pre-commit run --all-files` to fix the lint.

@zylo117
Contributor Author

zylo117 commented Jan 30, 2023

Now the unit test failed. I think it has nothing to do with my commits?

@ZwwWayne
Collaborator

Now the unit test failed. I think it has nothing to do with my commits?

Yes, you are right. We will fix it in other PR.

@ZwwWayne ZwwWayne merged commit 4e83e86 into open-mmlab:dev-3.x Jan 30, 2023
@hhaAndroid
Collaborator

@zylo117 Both RTMDet and YOLOv5 are trained with AMP, and I have never encountered this problem. Is there any reference for this? Thank you!

@openmmlab-bot
Collaborator

Dear zylo117,
First of all, we want to express our gratitude for your significant PR in the MMDet project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project during your personal time. We believe that many developers will benefit from your PR.
We are looking forward to continuing our collaboration with you. OpenMMLab has established a special contributors' organization called MMSIG, which provides contributors with open-source certificates, a recognition system, and exclusive rewards. You can contact us by adding our WeChat(if you have WeChat): openmmlab_wx, or join in our discord: https://discord.gg/qH9fysxPDW. We sincerely hope you will join us!
Best regards! @zylo117

yumion pushed a commit to yumion/mmdetection that referenced this pull request Jan 31, 2024