RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #43259
Comments
@QiuSYang Could you share a complete script to reproduce this issue locally on a GPU machine? Also, it would be useful if you could share the complete definition of your model, including its forward function. This would help us debug the issue further.
```python
import numpy as np

word2vec_path = "../datasets/GoogleNews-vectors-negative300.bin"

class Solver(object):
    ...
```
This is my full script. The reported error location is in the base class forward call, in torch/nn/modules/module.py.
This may be an issue with unused parameter detection in DDP, although it is hard to debug without your model's definition and its forward() function. Could you share that as well?
I met the same issue.
Or if that is the case, use the `find_unused_parameters=True` option.
I found that find_unused_parameters=True started to hang indefinitely after 3 steps, depending on how complex the forward pass was. I fixed it by moving the creation of the pseudo-labels I use into a separate function at the beginning of the forward pass; somehow having it in a separate function (with intermediate values presumably going out of scope) stopped the find_unused_parameters=True option from hanging forever.
@rohan-varma I ran into the same error. My case is similar to @yinghuang's: I have a parallel dual-BN architecture, and each sample goes through a single BN path according to an input flag. However, I'm not sure whether unused-parameter detection covers my case, where each sample goes through one specific BN branch.
Can someone explain a little bit why this is a RuntimeError? It looks to me like it should be a warning instead of an error: even if some parameters/modules are not used in the forward pass for calculating the loss, nothing breaks; it just adds some overhead and redundancy. Thanks.
A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss, as sketched below. Note that this may bring a small overhead in backprop.
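A minimal self-contained sketch of this workaround (the model, criterion, and shapes here are placeholders, not from the original thread):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)              # stands in for the real model
criterion = nn.CrossEntropyLoss()

output = model(torch.randn(4, 8))
target = torch.randint(0, 2, (4,))

loss = criterion(output, target)
# Zero-weighted sum of every parameter: each parameter now participates in
# the autograd graph, but the 0.0 factor leaves all gradient values unchanged.
loss = loss + 0.0 * sum(p.sum() for p in model.parameters())
loss.backward()
```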
I get the same error, but when I train, everything is OK. The error only occurred during the test.
As far as I know, this happens in distributed training when multiple GPUs need to communicate with each other to calculate losses for the whole minibatch. If the number of loss terms differs across GPUs, the GPU with fewer loss terms may quit the communication earlier than the others. However, the other GPUs may still be waiting for that GPU to reply, which leads to a deadlock.
It has been two years since this issue was reported. I also ran into it recently and found a workaround. The error was raised by my code, and finally I found a fix that works, sketched below. Hope this helps.
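Judging from the follow-up replies, the fix comes down to two added lines in the training step. A hedged reconstruction (model, criterion, optimizer, and loader are placeholder names, not the original code):

```python
for batch, target in loader:
    output = model(batch)
    loss = criterion(output, target)
    # Line 1: sum every parameter so each one enters the autograd graph.
    dummy = sum(p.sum() for p in model.parameters())
    # Line 2: weight the dummy term by zero so it cannot affect the gradients.
    loss = loss + 0.0 * dummy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```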
This is a life saver! Though overhead is introduced, at least things run now under DDP. Thanks for the suggestion.
@maxwellzh Thanks a lot! The solution works for my code. Without the added lines, the error is raised. I am wondering about the reason; do you have any thoughts?
The error says that there are parameters not used to compute the loss. If you add all the parameters into the loss, even with a multiplicative factor of zero, all parameters are technically used for the loss computation.
@zerovl No... I was trying every way to fix the issue at that time and stumbled upon the workaround. Probably a bug in PyTorch, I guess. @zeakey Though torch says there are unused parameters, I think it's a false alarm. Please have a look at the code I pasted above: all parameters were indeed used in the loss computation, but somehow torch just thought they weren't.
@maxwellzh @zeakey Without Line 1 and Line 2 (the two added lines in the fix above), the error will be raised.
@maxwellzh Oh, my bad. It might be a false alarm from PyTorch. I occasionally ran into this error due to unused parameters during training.
Any plan from PyTorch to fix this?
Thanks a lot! It works for me.
This issue also occurs when you want to apply LayerDrop while using DDP. LayerDrop skips an entire layer in the forward pass, so no parameters of the skipped layer are used for the loss computation. Hence they are missing from the autograd graph, and the same error gets raised by DDP. The workaround of adding the sum of parameter values multiplied by zero to the loss also works here (see the sketch below), but it would be more efficient and much more elegant to be able to simply exclude the skipped layers from the gradient reduction.
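For concreteness, a minimal sketch of the LayerDrop situation and the workaround (this LayerDropEncoder class is illustrative, not the code from the original comment):

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    """Stack of layers where each layer may be skipped entirely (LayerDrop)."""

    def __init__(self, num_layers=6, dim=256, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            # A skipped layer contributes nothing to the autograd graph this
            # iteration, which is exactly what makes DDP raise the error.
            if self.training and torch.rand(()).item() < self.p_drop:
                continue
            x = torch.relu(layer(x))
        # Workaround: the zero-weighted parameter sum keeps every parameter
        # in the graph even when its layer was dropped this iteration.
        return x + 0.0 * sum(p.sum() for p in self.parameters())
```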
May I ask where this class is? In which dependency file? Thanks a lot!
@saulgoodman08 This is just an example code snippet; you should adapt it to your own code.
I use the following code:

```
--distributed-backend 'nccl' --ddp-backend "no_c10d" \
```
I still have a similar issue even though I use a single GPU. I solved it by setting "find_unused_parameters=True". Thanks a lot for your many hints.
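For reference, the flag goes on the DDP wrapper itself (a sketch; model and local_rank are assumed to be set up already):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# find_unused_parameters=True makes DDP traverse the autograd graph after
# each forward pass and mark parameters that received no gradient as ready,
# at the cost of some extra overhead per iteration.
ddp_model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
```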
I was having the same error. In the `__init__` of my class I changed

```python
self.dec_layer = nn.TransformerDecoderLayer(embedding_dim, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = nn.TransformerDecoder(self.dec_layer, num_layers=num_layers)
```

to the code below

```python
dec_layer = nn.TransformerDecoderLayer(embedding_dim, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)
```

and I no longer get the error. The forward method of my class only explicitly calls `self.decoder`. The parameters of the module `self.dec_layer` were registered as a submodule but never used in the forward pass (nn.TransformerDecoder deep-copies the layer it is given), so DDP flagged them as unused.
Thank you! This worked for me.
Btw, something that helped me was to run training with debug logging for unused parameters enabled, and I could then track down the particular parameters it reported.
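One way to get such a report (an assumption on my part; the commenter's exact snippet is not shown) is PyTorch's TORCH_DISTRIBUTED_DEBUG environment variable:

```python
import os

# Set before initializing the process group / constructing DDP, e.g.:
#   TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=2 train.py
# With INFO or DETAIL, DDP's unused-parameter errors include the fully
# qualified names of the parameters that received no gradient.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```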
I got the same error and followed this guide to solve it! Thanks a lot. Code below:
I ran into the same problem because I used `with torch.no_grad()` in front of a newly defined MLP function:
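A hedged illustration of that pitfall (the MLP and shapes here are made up): anything computed under torch.no_grad() records no autograd history, so DDP sees those parameters as unused.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

def mlp_features(x):
    # BUG under DDP: inside no_grad, mlp's parameters get no entries in the
    # autograd graph, so the reducer never receives their gradients.
    with torch.no_grad():
        return mlp(x)

# Fix: drop the no_grad block (or freeze the MLP and keep its parameters out
# of the DDP-wrapped module) so parameters either get gradients or are never
# expected to.
```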
If you are getting this because of accelerate, just make sure you ran
Yes! This is the case. I tried your method and finally fixed this haunting bug!
🐛 Bug
To Reproduce
```
Epoch: 1, iter 0: loss = 10.099
  0%|          | 1/144967 [00:02<116:54:31,  2.90s/it]
Traceback (most recent call last):
  File "train.py", line 99, in <module>
    solver.train()
  File "/home/yckj2453/nlp_space/jd_multimodal_dialogue/multi-modal-dialogue-transformer_bart/utils/time_track.py", line 18, in timed
    result = method(*args, **kwargs)
  File "/home/yckj2453/nlp_space/jd_multimodal_dialogue/multi-modal-dialogue-transformer_bart/solver.py", line 284, in train
    decoder_input_ids=decoder_input_ids)
  File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 473, in forward
    self.reducer.prepare_for_backward(list(_find_tensors(output)))
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Traceback (most recent call last):
  File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/jddc_mddr/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1.
```
Steps to reproduce the behavior:
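The full original script is not shown above; a minimal self-contained sketch that triggers the same error is below (the module and all names are hypothetical; launch with e.g. `torchrun --nproc_per_node=2 repro.py`, which sets LOCAL_RANK):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ModelWithUnusedParams(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(10, 10)
        self.unused = nn.Linear(10, 10)  # registered but never called

    def forward(self, x):
        return self.used(x)  # self.unused never joins the autograd graph

def main():
    dist.init_process_group(backend="nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    model = DDP(ModelWithUnusedParams().cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(2):  # the RuntimeError surfaces at the second forward pass
        optimizer.zero_grad()
        loss = model(torch.randn(4, 10, device=rank)).sum()
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    main()
```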
Expected behavior
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

How you installed PyTorch (conda, pip, source): pip
- transformers==2.11.0
- numpy==1.19.0
Additional context
here is my code:

```python
def train(self):
    epoch_loss_history = []
    best_eval_loss = float('inf')  # track the best evaluation loss
```
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski