How to handle gradient overflow when training a deep model with mixed precision? #318
Hi @tfwu, Best, |
Occasionally seeing a message like "overflow detected, skipping step, reducing loss scale" is normal behavior with dynamic loss scaling, and it usually happens in the first few iterations because Amp begins by trying a high loss scale. Seeing NaN loss values (i.e., the loss scalar resulting from the forward pass is NaN or Inf) is NOT normal, and indicates something has gone wrong. As Piotr says, if this is the case, a minimal repro would be helpful. |
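For readers unfamiliar with the mechanism being described, here is a minimal sketch of dynamic loss scaling. The starting scale, growth interval, and function names are illustrative, not Amp's actual internals:

```python
import torch

# Illustrative defaults: start high, halve on overflow, try doubling again
# after a stretch of clean steps. Amp's real implementation lives inside apex.
scale, growth_interval, good_steps = 2.0 ** 16, 2000, 0

def amp_like_step(model, optimizer, loss):
    global scale, good_steps
    (loss * scale).backward()                      # scaled backward pass
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    overflow = any(not torch.isfinite(g).all() for g in grads)
    if overflow:
        scale /= 2.0                               # "skipping step, reducing loss scale"
        good_steps = 0
    else:
        for g in grads:
            g.div_(scale)                          # unscale before the real update
        optimizer.step()
        good_steps += 1
        if good_steps % growth_interval == 0:
            scale *= 2.0                           # periodic attempt to raise the scale
    optimizer.zero_grad()
```

This is why a handful of "reducing loss scale" messages while the initial high guess calibrates is harmless, whereas a NaN coming out of the forward pass is a real problem.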
Is it normal to see many (~30) "gradient overflow, skipping step, reducing loss scale" messages on the first epoch? My dataset is pretty large (22 classes, 9M images total) and it didn't make it past the first epoch, after many hours, i.e., overnight. Granted, I'm only testing on a single RTX2070, but I thought it might have at least gotten to epoch 2 by then. Is that unrealistic? On a tiny dataset, I get 2-3 gradient overflows on the 1st epoch, and then none (usually) on the subsequent epochs. I'm using OPT=O1. |
@gbrow004 Do these messages appear consecutively, or are they spread out? Amp occasionally tries to increase the loss scale, so for a really long epoch or run there will be proportionately more messages. Aside from a few (2-3) at the beginning of training, these should be isolated rather than consecutive; if you are seeing 30 in a row, that is likely a problem. Also, what is the range of loss scale values amp reports? |
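One way to answer the "consecutive or spread out" question programmatically: with the native torch.cuda.amp API (if switching away from apex is an option), a skipped step can be detected by comparing the scale before and after scaler.update(). This is a sketch that assumes your own model, criterion, optimizer, and loader already exist; apex itself does not expose the scale publicly:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
consecutive_skips = 0

for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs.cuda()), targets.cuda())
    scaler.scale(loss).backward()
    scale_before = scaler.get_scale()
    scaler.step(optimizer)      # skipped internally if inf/nan grads were found
    scaler.update()
    if scaler.get_scale() < scale_before:   # the scale only drops after a skip
        consecutive_skips += 1
        print(f"step {step}: overflow, {consecutive_skips} skips in a row")
    else:
        consecutive_skips = 0
```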
To be honest, I'm not sure if they are consecutive or not, but I don't think so. I'll let it run longer and see what happens. Thanks for the information. So far, here's the output:
|
Since the loss scale is being reduced to |
Makes sense. Thanks, ptrblck! I guess I'll send it to the main GPU cluster and see what happens. It was going painfully slow on my single GPU testbed! |
@mcarilli: You said that the gradient overflow message can happen in the first few epochs. What happens to my model if gradient overflow occurs in every epoch (maybe 2 or 5 times per epoch), even though I reduced the learning rate? How should I fix it? |
I am seeking the answer to the same question @John1231983 asked above; someone please help. I was trying this model in Google Colab: https://www.kaggle.com/taindow/pytorch-resnext-101-32x8d-benchmark. I reduced the size of the data so Colab doesn't crash and also reduced the batch size down to 32, but after almost 2 hours of training I get this error. I understand why this error is happening but can't figure out what to do to solve it. Is it because the batch size is 32 and not 64? Here is my model log:

Epoch 0/0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
ZeroDivisionError Traceback (most recent call last) |
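The ZeroDivisionError above usually means the dynamic loss scale has been halved so many times that it underflows to zero. If the model is otherwise healthy, one hedge is to put a floor under the scale via the min_loss_scale argument of apex's amp.initialize. A minimal sketch, assuming model and optimizer already exist; the floor value is a placeholder:

```python
from apex import amp

# Assumes model and optimizer are already constructed.
model, optimizer = amp.initialize(
    model, optimizer,
    opt_level="O1",
    min_loss_scale=128.0,  # floor for dynamic loss scaling; placeholder value
)
```

If the scale has to fall that far before steps stop overflowing, though, the real culprit is usually non-finite values produced in the forward pass rather than the scaler itself.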
Hi, I encountered similar problems during my training process. |
Hi, I have encountered the same problem.
Is there any solution to this problem? |
#318 (comment) The message may occur several times in succession at the beginning of training as the scale value calibrates. |
@mobassir94, has your problem been solved? I am stuck on the same problem. My log is as follows: @mcarilli
|
It seems like you have NaN or null values in your input, and that's why you are getting that error, @chengmengli06. |
@mobassir94 The forward pass overflows. If I reduce the learning rate from 1e-3 to 1e-4, it can run for 20 epochs, but then the forward pass overflows again. |
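To confirm or rule out the "NaN in the input" theory above, a quick finiteness check on each batch (and on the loss) before backward is cheap. This is a sketch; `batch` and `loss` stand in for whatever your pipeline actually produces:

```python
import torch

def assert_finite(name, tensor):
    """Raise early with a useful message instead of overflowing later."""
    if not torch.isfinite(tensor).all():
        bad = (~torch.isfinite(tensor)).sum().item()
        raise ValueError(f"{name} contains {bad} non-finite values")

assert_finite("inputs", batch)   # e.g. right after loading the batch
assert_finite("loss", loss)      # e.g. right after the forward pass
```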
I am facing a similar issue while training a GAN architecture with a pre-trained generator. The logs look like this: |
me too: |
@vishal16babu Does your model train in FP32 without apex? @devsentient |
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0 |
Hey, I got the same:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21

Then I got NaN:

step 2900/210000 (669 example/last step); acc: 0.00; ppl: nan; xent: nan; lr: 0.00000191; 0/26651 tok/s; 5468 sec
-- many iterations ----
[Logger(3)] [2020-03-20 03:55:42,898 WARNING] NaN or Inf found in input tensor.

Then, after many NaN iterations, it finally raises:

File "/opt/conda/envs/py36/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
File "/opt/conda/envs/py36/lib/python3.6/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale, # 1./scale,
ZeroDivisionError: float division by zero

Our usage:
for batch in accumulated_batch:
    # forward
    # loss = criterion(xxx)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

if self.grad_accum_count > 1:
    if self.n_gpu > 1:
        grads = [p.grad.data for p in self.model.parameters()
                 if p.requires_grad and p.grad is not None]
        # NOTE, P1: sync gradients across the cards
        distributed.all_reduce_and_rescale_tensors(grads, float(1))
    # NOTE, P2: do the step
    optimizer.step()
    optimizer.zero_grad()

I'm worried about the previous code (the P1 sync and the P2 step above).

Currently, fp32 training is OK (about 9 epochs); amp O1 fails at epoch 1. @mcarilli, could you please take some time to look at this? I'm waiting online, thanks.

Hey, has anyone been looking into this problem? I have tried:
Overall, I think the problem is with the previous code. Does anyone have a suggestion or conclusion?

——————

Oh, I finally figured it out. Let me summarize.

**What does Amp do when it comes across an overflow?** It just hacks the optimizer's `step` method: the next `step()` call is skipped (and the original `step` is restored afterwards), so only the rank that saw the overflow skips its update.

**How to deal with this situation?** It is easy: sync the Amp overflow state across all processes, so that if any rank overflowed, every rank skips the step together. My code is based on OpenNMT-py; syncing the state is easy, but we need to read each optimizer's `_amp_stash.already_patched` flag:

def _step():
    """Step function wrapper; totally safe for fp32."""

    def _multiprocess_sync_amp_is_overflow():
        """Sync the optimizer is-overflow state when multiprocessing."""
        if self.args.model_dtype != "fp16":
            return False
        # Get the current process's amp overflow state.
        local_overflow_cnt = 0
        for o in self.optims:
            if o.optimizer._amp_stash.already_patched:
                local_overflow_cnt = 1
                break
        # print(f"Device {self.gpu_rank} local overflow state: {local_overflow_cnt}")

        # Sync the global overflow count.
        global_overflow_cnt = local_overflow_cnt
        if self.n_gpu > 1:
            global_overflow_cnt = sum(
                distributed.all_gather_list(global_overflow_cnt))
        # print(f"Global overflow state: {global_overflow_cnt}")
        is_global_overflow = global_overflow_cnt > 0
        return is_global_overflow

    def patch_step(opt):
        """This function is copied from apex."""
        opt_step = opt.step

        def skip_step(closure=None):
            if closure is not None:
                raise RuntimeError("Currently, Amp does not support closure use with optimizers.")
            logger.info(f"Device[{self.gpu_rank}] Gradient overflow. Skipping step. "
                        "(This is from hack-for-optimizer-sync)")
            if hasattr(opt._amp_stash, "all_fp32_from_fp16_params"):
                # Clear the master grads that wouldn't be zeroed by model.zero_grad()
                for param in opt._amp_stash.all_fp32_from_fp16_params:
                    param.grad = None
            if hasattr(opt, "most_recent_scale"):
                opt.most_recent_scale = 1.0
                opt.scale_set_by_backward = False
            # Restore the real step so only this one update is skipped.
            opt.step = opt_step
            opt._amp_stash.already_patched = False

        return skip_step

    if self.n_gpu > 1:
        is_global_overflow = _multiprocess_sync_amp_is_overflow()
        if is_global_overflow:
            # Hack the optimizer on every rank that has not been patched yet,
            # so all ranks skip this step together.
            for o in self.optims:
                if o.optimizer._amp_stash.already_patched:
                    continue
                o.optimizer.step = patch_step(o.optimizer)
                o.optimizer._amp_stash.already_patched = True

The previous code solves my problem. |
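For reference, the same idea can be expressed more compactly with torch.distributed: reduce a single overflow flag and let every rank decide together. This sketch assumes the default process group is already initialized (as in the multi-GPU setup above) and relies on the same private `_amp_stash.already_patched` marker, so treat it as an illustration, not a supported apex API:

```python
import torch
import torch.distributed as dist

def any_rank_overflowed(optimizers, device="cuda"):
    """Return True if any process's Amp loss scaler flagged an overflow this step."""
    local = any(getattr(opt._amp_stash, "already_patched", False) for opt in optimizers)
    flag = torch.tensor([1.0 if local else 0.0], device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.SUM)  # number of ranks that overflowed
    return bool(flag.item())
```

If it returns True, every rank must then skip the update in the same way (for example by patching `step` as in the snippet above), so that the manual gradient all-reduce and the optimizer state stay in lockstep across processes.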
Does your model train normally without fp16? |
Oh, that is great! Congratulations! |
@ptrblck Is there a way to tell AMP not to mess with the loss scaling at all? I have the same problem with gradient overflow continuously appearing (there are a few steps where it doesn't appear, but the majority, 99%, are gradient overflows). I have already tried setting the amp.initialize loss_scale to 1, but that results in NaNs from the get-go. |
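If the goal is to stop Amp from adjusting the scale at all, apex accepts a fixed numeric loss_scale in amp.initialize. A sketch, assuming model and optimizer already exist; 128.0 is only an example value:

```python
from apex import amp

model, optimizer = amp.initialize(
    model, optimizer,
    opt_level="O1",
    loss_scale=128.0,   # fixed scale: no dynamic backoff, no "reducing loss scale" messages
)
```

The trade-off is that a fixed scale gives up the automatic backoff, so a value that is too large can silently produce Inf gradients; a modest power of two is the usual starting point. And if even a static scale of 1 produces NaNs immediately, the overflow is happening in the fp16 forward pass itself rather than in the scaler.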
I also hit this issue; can anyone suggest a good solution? |
Hi!
With opt_level 'O0' it runs fine, making me think it is not a data problem.
before returning a nan loss |
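When the failure mode is a NaN loss (rather than just overflow messages), a small guard inside the training loop makes the first bad batch visible and keeps it from poisoning the optimizer state. A sketch; `loss`, `step`, and `optimizer` are assumed to come from your surrounding loop:

```python
import torch

def loss_is_finite(loss, step):
    """Return False (and log) if the forward pass produced a NaN/Inf loss."""
    if torch.isfinite(loss):
        return True
    print(f"step {step}: non-finite loss {loss.item()}, skipping this batch")
    return False

# Usage inside the training loop, before backward():
#   if not loss_is_finite(loss, step):
#       optimizer.zero_grad()
#       continue
```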
So where did you put that piece of code? Which file in the apex folder? |
apex/amp/handle.py |
I have encountered the same problem in fairseq; a small amount of gradient overflow is acceptable, but eventually the loss scale quickly shrinks to the threshold value, leading to training interruption. My solution: |
Hi, it is ok to train the model with fp32, but we would like to take advantage of the speed of mixed precision. Unfortunately, we got gradient overflow and then nan loss at the very beginning with both opt-level O1 and O2. Is there a good way to handle gradient overflow / underflow using apex? Thank you
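Beyond tuning the scale, one mitigation that apex's own documentation describes is clipping the master gradients between backward and step. A sketch that assumes your existing model, optimizer, and loss; max_norm=1.0 is just a placeholder:

```python
import torch
from apex import amp

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# Gradients are unscaled once the context exits, so clip the fp32 master params here.
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

Clipping does not remove the occasional overflow-and-skip, but it can keep gradient magnitudes in a range where the dynamic scale settles instead of collapsing.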