fix amp pooling overflow #9670
Co-authored-by: Wenwei Zhang <[email protected]>
Hi @zylo117,
What is
"""
Thanks for your contribution! Please run
The unit test is failing now. I think it has nothing to do with my commits?
Yes, you are right. We will fix it in another PR.
@zylo117 Both rtmdet and yolov5 are trained with amp. I have never encountered this problem. Is there any reference for this? Thank you!
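Hello zylo117! The PR you submitted to the MMDet project is very important. Thank you for spending your personal time to help improve an open-source project. We believe many developers will benefit from your PR.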
…g AMP (open-mmlab#9670) Co-authored-by: Wenwei Zhang <[email protected]>
Motivation
Pooling, mostly avg pooling and max pooling, in amp (float16) mode may overflow: when the elements of one channel are summed, the result can exceed the largest finite float16 value, which is only 65504. The value then becomes inf, and later operations on it produce nan. This is very common while the weights have not yet converged.
I trained rtmdet on coco from scratch in amp mode and ran into a nan loss. I suspected overflow, and when I turned amp off the nan no longer appeared. So I concluded this is a pooling overflow issue; I recognized it because I have fixed the same problem in another repo before.
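The float16 range limit itself is easy to reproduce outside of training. The snippet below is only an illustration with NumPy and made-up values (1024 elements of 100.0); it does not show what any particular pooling kernel does internally, since whether a kernel accumulates in float16 depends on the implementation.

```python
import numpy as np

# Largest finite float16 value.
fp16_max = float(np.finfo(np.float16).max)  # 65504.0

# Made-up data: 1024 activations of 100.0, so the true sum is 102400.
x = np.full(1024, 100.0, dtype=np.float16)

with np.errstate(over="ignore", invalid="ignore"):
    # The float16 result cannot represent 102400, so it becomes inf.
    fp16_sum = float(x.sum())
    # inf - inf is nan, which is how an overflow turns into a nan loss.
    nan_result = fp16_sum - fp16_sum

# The same reduction in float32 is exact.
fp32_sum = float(x.astype(np.float32).sum())  # 102400.0
```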
Modification
Check whether the input tensor to the pooling op is float16 or float32. If it is float16, cast it to float32, run the pooling, then cast the result back to float16.
I tested this by training rtmdet on coco from scratch and saw no more nan. In other networks, though, this issue may still exist.
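A minimal sketch of the cast-pool-cast idea described above, assuming PyTorch. The helper name `pool_in_fp32` is hypothetical and is not taken from this PR's diff; the actual change may be structured differently.

```python
import torch
import torch.nn.functional as F


def pool_in_fp32(pool_fn, x, *args, **kwargs):
    """Run `pool_fn` in float32 when `x` is float16 (hypothetical helper)."""
    if x.dtype == torch.float16:
        # Upcast, pool in float32, then return to the caller's dtype.
        return pool_fn(x.float(), *args, **kwargs).half()
    # Other dtypes are pooled unchanged.
    return pool_fn(x, *args, **kwargs)


# Usage with a 5x5 max pool that keeps the spatial size.
feat = torch.randn(2, 3, 16, 16).half()
out = pool_in_fp32(F.max_pool2d, feat, kernel_size=5, stride=1, padding=2)
```

The output keeps the input dtype, so the surrounding float16 layers are unaffected. Note that the cast back to float16 can still produce inf if the pooled values themselves exceed the float16 range.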