fix amp pooling overflow #9670

zylo117 · 2023-01-29T03:02:04Z

Motivation

Pooling, mostly average pooling and max pooling, can overflow in AMP (float16) mode. When the elements in one channel are summed, the total may exceed the largest finite float16 value, which is only 65504; the result then becomes inf, which propagates as NaN. This is very common while the weights have not yet converged.
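The overflow described above can be reproduced without a GPU. The following is a pure-Python emulation, not PyTorch code: it rounds every partial sum to float16 using the standard library's half-precision `struct` format. The helper names are hypothetical, and it assumes the worst case where the accumulator itself is float16; real pooling kernels may accumulate in higher precision.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow becomes inf, as in IEEE 754."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the float16 range.
        return math.copysign(math.inf, x)


def mean_fp16_accumulator(values):
    """Average where every partial sum is rounded to float16 (assumed worst case)."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return to_fp16(total / len(values))


def mean_wide_accumulator(values):
    """Average accumulated in higher precision, rounded to float16 only at the end."""
    return to_fp16(sum(values) / len(values))


# 100 activations of 1000 each: every element fits in float16, their sum does not.
activations = [1000.0] * 100

fp16_max = to_fp16(65504.0)
overflowed = mean_fp16_accumulator(activations)
safe = mean_wide_accumulator(activations)

print(fp16_max, overflowed, safe)
```

The true mean (1000) is well inside the float16 range; only the intermediate sum overflows, which is why doing the reduction in a wider type avoids the problem.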

I trained RTMDet on COCO from scratch in AMP mode and ran into a NaN loss. I suspected an overflow, and when I turned AMP off, the NaN went away. That confirmed for me that this is a pooling overflow; I recognized it because I have fixed the same issue in another repo before.

Modification

Check whether the input tensor to the pooling layer is float16 or float32. If it is float16, cast it to float32, do the pooling, then cast the result back to float16.
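A minimal sketch of the cast, pool, cast-back approach described above, assuming PyTorch. The wrapper name `FP32Pool` is hypothetical and this is not the code merged in this PR; it only illustrates the idea on CPU.

```python
import torch
import torch.nn as nn


class FP32Pool(nn.Module):
    """Hypothetical wrapper: run a pooling layer in float32, return the input dtype."""

    def __init__(self, pool: nn.Module):
        super().__init__()
        self.pool = pool

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.dtype == torch.float16:
            # Pool in float32 so the reduction cannot exceed the float16 range,
            # then cast the (much smaller) result back to float16.
            return self.pool(x.float()).to(torch.float16)
        return self.pool(x)


pool = FP32Pool(nn.AdaptiveAvgPool2d(1))

# 64 * 64 = 4096 elements of 1000 per channel: the sum is far above 65504.
x = torch.full((1, 2, 64, 64), 1000.0, dtype=torch.float16)
y = pool(x)

print(y.dtype, y.flatten().tolist())
```

The output keeps the float16 dtype, so the layers that follow see the same tensor type as before; only the reduction itself runs in float32.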

I tested this by training RTMDet on COCO from scratch: no more NaN. The issue may still exist in other networks.

CLAassistant · 2023-01-29T03:02:09Z

All committers have signed the CLA.

ZwwWayne · 2023-01-29T10:32:54Z

Hi @zylo117,
The overall PR looks good to me now and can be merged if the modification has been verified to resolve the issue in your case. Also, please rebase your code onto the most recent dev-3.x, which resolves the lint issue.

zylo117 · 2023-01-30T01:51:24Z

> Hi @zylo117, the overall PR looks good to me now and can be merged if the modification has been verified to resolve the issue in your case. Also, please rebase your code onto the most recent dev-3.x, which resolves the lint issue.

What is this?
"""
yapf.....................................................................Failed

hook id: yapf
files were modified by this hook

"""
I don't know which formatter you are using. When I format the code with yapf, using either the pep8 or the google style, many of the original lines change, so I think the original file was formatted with a custom formatter. Can't the CI perform auto-formatting?

RangiLyu · 2023-01-30T02:06:54Z

Thanks for your contribution! Please run pip install pre-commit and then run pre-commit run --all-files to fix the lint.

zylo117 · 2023-01-30T03:55:53Z

Now the unit test has failed. I think it has nothing to do with my commits?

ZwwWayne · 2023-01-30T05:09:30Z

> Now the unit test has failed. I think it has nothing to do with my commits?

Yes, you are right. We will fix it in another PR.

hhaAndroid · 2023-02-10T05:01:43Z

@zylo117 Both rtmdet and yolov5 are trained with amp. I have never encountered this problem. Is there any reference for this? Thank you!

openmmlab-bot · 2023-03-28T09:49:08Z

Dear zylo117,
First of all, we want to express our gratitude for your significant PR in the MMDet project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project during your personal time. We believe that many developers will benefit from your PR.
We are looking forward to continuing our collaboration with you. OpenMMLab has established a special contributors' organization called MMSIG, which provides contributors with open-source certificates, a recognition system, and exclusive rewards. You can contact us by adding our WeChat (if you have WeChat): openmmlab_wx (please include "mmsig" plus your GitHub ID in the note), or join our Discord: https://discord.gg/qH9fysxPDW. We sincerely hope you will join us!
Best regards! @zylo117

fix amp pooling overflow

9bc6cfc

mm-assistant bot assigned ZwwWayne Jan 29, 2023

ZwwWayne requested a review from RangiLyu January 29, 2023 06:37

ZwwWayne added this to the 3.0.0rc6 milestone Jan 29, 2023

ZwwWayne reviewed Jan 29, 2023

zylo117 and others added 2 commits January 29, 2023 15:48

Update mmdet/models/backbones/csp_darknet.py

c55cb1e

Co-authored-by: Wenwei Zhang

simplified casting

74ba41e

zylo117 added 3 commits January 30, 2023 09:10

Merge branch 'open-mmlab:dev-3.x' into dev-3.x

a1ac7c2

reformat code

442e736

reformat code

1c17c12

zylo117 added 3 commits January 30, 2023 11:14

reformat code

cacf6b2

reformat code

4211fe7

fix autocast

08385d8

ZwwWayne merged commit 4e83e86 into open-mmlab:dev-3.x Jan 30, 2023

yumion pushed a commit to yumion/mmdetection that referenced this pull request Jan 31, 2024

[Enhance]: Avoid overflow issue in pooling layers of RTMDet when usin…

42a1755

…g AMP (open-mmlab#9670) Co-authored-by: Wenwei Zhang
