RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #43259

QiuSYang opened this issue Aug 19, 2020 · 33 comments
Labels: module: data parallel, oncall: distributed, triaged


@QiuSYang commented Aug 19, 2020

🐛 Bug

To Reproduce

Epoch: 1, iter 0: loss = 10.099
0%| | 1/144967 [00:02<116:54:31, 2.90s/it]
Traceback (most recent call last):
File "train.py", line 99, in
solver.train()
File "/home/yckj2453/nlp_space/jd_multimodal_dialogue/multi-modal-dialogue-transformer_bart/utils/time_track.py", line 18, in timed
result = method(*args, **kwargs)
File "/home/yckj2453/nlp_space/jd_multimodal_dialogue/multi-modal-dialogue-transformer_bart/solver.py", line 284, in train
decoder_input_ids=decoder_input_ids)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 473, in forward
self.reducer.prepare_for_backward(list(_find_tensors(output)))
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Traceback (most recent call last):
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/jddc_mddr/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1.

Steps to reproduce the behavior:

Expected behavior

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version: 1.5.1
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version: 3.7.5
  • CUDA/cuDNN version: 10.1/7.6
  • GPU models and configuration:
  • Any other relevant information:
    pip transformers==2.11.0
    pip numpy==1.19.0

Additional context

Here is my code:

def train(self):
    epoch_loss_history = []
    best_eval_loss = float('inf')  # track the best evaluation loss

    # Set up parallel training
    if self.config.n_gpu > 1:
        print("use torch.nn.DataParallel for the parallel operations.")
        self.model = nn.DataParallel(self.model)
    if self.config.local_rank != -1:
        print("use torch.nn.parallel.DistributedDataParallel for the parallel operations.")
        self.model = nn.parallel.DistributedDataParallel(self.model,
                                                         device_ids=[self.config.local_rank],
                                                         output_device=self.config.local_rank,
                                                         find_unused_parameters=True)

    for epoch_i in range(self.epoch_i, self.config.n_epoch):
        # self.epoch_i = epoch_i
        batch_loss_history = []
        loss_history = []
        num_batch = 0
        self.model.train()
        n_total_words = 0

        # Clear the gradients before each batch
        # self.optimizer.zero_grad()
        self.model.zero_grad()  # a safer way to clear gradients

        # epoch_iterator = tqdm(self.train_data_loader, desc="Iteration",
        #                       disable=self.config.local_rank not in [-1, 0])
        for batch_i, (input_ids, label_ids,
                      images, img_char_positions) in enumerate(tqdm(self.train_data_loader, ncols=80)):
            # input_ids: [batch, sentence_length]
            num_batch = batch_i

            # flatten input and target conversations
            # length without the PAD_ID tokens; the extra -1 removes the leading start token
            label_origin_length = [len(single_label) - single_label.count(PAD_ID) - 1 for single_label in label_ids]
            if self.config.is_images_embedding:
                # flatten all images in the batch into a single list
                input_images = [image for sentence_images in images for image in sentence_images]
                # index of each image within its sentence
                input_image_indexes = [i for sentence_images_index in img_char_positions
                                       for i in sentence_images_index]
                # number of images contained in each sentence
                input_images_length = [len(sentence_images_index) for sentence_images_index in img_char_positions]

                # make sure the number of images matches the number of image indexes
                assert len(input_images) == sum(input_images_length)
                assert len(input_image_indexes) == sum(input_images_length)

            input_sentences = to_var(torch.LongTensor(input_ids))
            target_sentences = to_var(torch.LongTensor(label_ids))
            target_sentence_length = to_var(torch.LongTensor(label_origin_length))
            if self.config.is_images_embedding:
                input_images = to_var(torch.stack(input_images))
                input_images_length = to_var(torch.LongTensor(input_images_length))
                input_image_indexes = to_var(torch.LongTensor(input_image_indexes))
            else:
                input_images = None
                input_images_length = None
                input_image_indexes = None

            # if self.config.gradient_accumulation_step == 1:
            #     # reset gradient
            #     self.optimizer.zero_grad()
            #     self.model.zero_grad()

            attention_mask = input_sentences.ne(0).long()
            # decoder_input_ids = target_sentences[:, :-1]  # GPT decoder input: drop the trailing end token
            decoder_input_ids = self.shift_tokens_right_custom(target_sentences, PAD_ID)  # remove EOS_ID
            outputs = self.model(input_ids=input_sentences,
                                 input_images=input_images,
                                 input_images_length=input_images_length,
                                 input_image_indexes=input_image_indexes,
                                 attention_mask=attention_mask,  # input_sentences.eq(0)
                                 # lm_labels=target_sentences,
                                 decoder_input_ids=decoder_input_ids)

            # sentence_logits = self.model(
            #    input_sentences,
            #    input_sentence_length,
            #    input_conversation_length,
            #    target_sentences,
            #    input_images,
            #    input_images_length=input_images_length,
            #    input_image_indexes=input_image_indexes)

            decoder_target_label_ids = target_sentences[:, 1:]  # GPT decoder labels: drop the leading start token
            sentence_logits = outputs[0]  # Bart logits
            batch_loss, n_words = masked_cross_entropy(
                sentence_logits,
                decoder_target_label_ids,
                target_sentence_length)

            if self.config.n_gpu > 1:
                # mean() to average on multi-gpu parallel (not distributed) training
                batch_loss = batch_loss.mean()
                n_words = n_words.mean()
            if self.config.gradient_accumulation_step > 1:
                batch_loss = batch_loss / self.config.gradient_accumulation_step
                n_words = n_words / self.config.gradient_accumulation_step

            # assert not isnan(batch_loss.item())
            batch_loss_history.append(batch_loss.item())
            n_total_words += n_words.item()
            # loss_history.append(loss)

            if batch_i % self.config.print_every == 0:
                # tqdm.write(
                #    f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {loss.item():.3f}')
                tqdm.write(
                    f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {batch_loss.item()/ n_words.item():.3f}')

            # Back-propagation
            # loss.backward()
            batch_loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.clip)

            # gradient accumulation
            if (batch_i + 1) % self.config.gradient_accumulation_step == 0:
                # Run optimizer & scheduler
                self.optimizer.step()
                self.scheduler.step()
                # self.optimizer.zero_grad()  # clear gradients
                self.model.zero_grad()

        torch.cuda.empty_cache()
        gc.collect()

        # epoch_loss = np.sum(loss_history) / (num_batch + 1)
        epoch_loss = np.sum(batch_loss_history) / n_total_words
        epoch_loss_history.append(epoch_loss)
        self.epoch_loss = epoch_loss

        print_str = f'Epoch {epoch_i+1} loss average: {epoch_loss:.3f}'
        print(print_str)

        if epoch_i % self.config.save_every_epoch == 0:
            self.save_model(epoch_i + 1)

        # Only evaluate when single GPU otherwise metrics may not average well
        if self.config.local_rank == -1:
            # print('\n<BLEU score>...')
            # self.calculate_bleu()

            print('\n<Validation>...')
            self.validation_loss = self.evaluate()

            # save the best validation loss model
            if self.validation_loss < best_eval_loss:
               self.save_model(epoch_i, best='best_model')
               # update the best validation loss
               best_eval_loss = self.validation_loss
        #
        # if epoch_i % self.config.plot_every_epoch == 0:
        #     self.write_summary(epoch_i)

    self.save_model(self.config.n_epoch)

    return epoch_loss_history

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski

@pbelevich added the module: data parallel, oncall: distributed, and triaged labels on Aug 21, 2020
@pritamdamania87 (Contributor)

@QiuSYang Could you share a complete script to reproduce this issue locally on a GPU machine? Also, it would be useful if you could share the complete definition of your model with its forward function. This would help us in debugging this issue further.

@QiuSYang (Author)

Could you share a complete script to reproduce this issue locally on a GPU machine? Also, it would be useful if you could share the complete definition of your model with its forward function. This would help us in debugging this issue further.

import numpy as np
import torch
import torch.nn as nn
import transformer.models as models
from layers import masked_cross_entropy
from utils import to_var, time_desc_decorator, embedding_metric
import os
from tqdm import tqdm
from math import isnan
import re
import gc
from transformers import AdamW, get_linear_schedule_with_warmup
from transformer import BertConfig, BartConfig, BartForConditionalGeneration
import math
import pickle
from utils import SOS_ID, PAD_ID, EOS_ID
import time
import offline_dev_data_postprocess
from utils import BLEUEvaluator

word2vec_path = "../datasets/GoogleNews-vectors-negative300.bin"

class Solver(object):
def __init__(self, config, train_data_loader, eval_data_loader, vocab, is_train=True, model=None):
    self.config = config
    self.epoch_i = 0
    self.train_data_loader = train_data_loader
    self.eval_data_loader = eval_data_loader
    self.vocab = vocab
    self.is_train = is_train
    self.model = model

@time_desc_decorator('Build Graph')
def build(self, cuda=True):

    if self.model is None:
        config_bart = BartConfig.from_json_file(json_file=self.config.bart_config)
        if self.config.bert_config:
            config_bert = BertConfig.from_json_file(json_file=self.config.bert_config)
            # unify the vocabulary sizes
            if config_bert.vocab_size != self.config.vocab_size:
                config_bert.vocab_size = self.config.vocab_size
            config_bart.vocab_size = config_bert.vocab_size
            # set the special tokens
            config_bert.eos_token_id = EOS_ID
            config_bert.bos_token_id = SOS_ID
            config_bert.pad_token_id = PAD_ID
        else:
            config_bart.vocab_size = self.config.vocab_size
        config_bart.max_length = self.config.max_sentence_length  # maximum generated sentence length
        config_bart.is_images_embedding = self.config.is_images_embedding
        # Initializing a bart model from the custom style configurations
        if not self.is_train:
            # inference no load pre_training model
            self.config.bert_pre_training = None
        self.model = BartForConditionalGeneration(config=config_bart,
                                                  bert_config=config_bert,
                                                  bert_pretrained_model_name_or_path=self.config.bert_pre_training)
        self.model.resize_token_embeddings(config_bart.vocab_size)

        # # orthogonal initialiation for hidden weights
        # # input gate bias for GRUs
        # if self.config.mode == 'train' and self.config.checkpoint is None:
        #    print('Parameter initiailization')
        #   for name, param in self.model.named_parameters():
        #        if 'weight_hh' in name:
        #            print('\t' + name)
        #            nn.init.orthogonal_(param)
        #
        #         bias_hh is concatenation of reset, input, new gates
        #         only set the input gate bias to 2.0
        #        if 'bias_hh' in name:
        #            print('\t' + name)
        #            dim = int(param.size(0) / 3)
        #            param.data[dim:2 * dim].fill_(2.0)

    if torch.cuda.is_available() and cuda:
        # self.model.cuda()
        self.model.to(self.config.device)

    # Overview Parameters
    # print('Model Parameters')
    # for name, param in self.model.named_parameters():
    #     print('\t' + name + '\t', list(param.size()))

    if self.config.checkpoint:
        self.load_model(self.config.checkpoint)

    if self.is_train:
        no_decay = ['bias', 'layer_norm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
             'weight_decay': self.config.weight_decay},
            {'params': [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],
             'weight_decay': 0.0}
        ]
        self.optimizer = AdamW(optimizer_grouped_parameters,
                               lr=self.config.learning_rate,
                               eps=self.config.adam_epsilon)
        # self.optimizer = AdamW(self.model.parameters(),
        #                        lr=self.config.learning_rate,
        #                        eps=self.config.adam_epsilon)
        self.scheduler = get_linear_schedule_with_warmup(self.optimizer,
                                                         num_warmup_steps=self.config.num_warmup_steps,
                                                         num_training_steps=self.config.num_training_steps)
        if self.config.checkpoint:
            checkpoint = torch.load(self.config.checkpoint)
            self.scheduler.load_state_dict(checkpoint["lr_scheduler"])

        # # 并行计算
        # if torch.cuda.device_count() > 1:
        #     print("Training model with multi-gpus.")
        #     self.model = nn.DataParallel(self.model,
        #                                  device_ids=[idx for idx in range(torch.cuda.device_count())])
        #     # torch.backends.cudnn.benchmark = True  # 增加运行效率
        #     self.multi_gpus = True

    # if self.is_train:
    #    self.optimizer = self.config.optimizer(
    #        filter(lambda p: p.requires_grad, self.model.parameters()),
    #        lr=self.config.learning_rate)

def save_model(self, epoch, best=None):
    """Save parameters to checkpoint"""
    if best:
        ckpt_path = os.path.join(self.config.save_path, f'{best}.pkl')
    else:
        ckpt_path = os.path.join(self.config.save_path, f'{epoch}.pkl')
    model_state = {'model': self.model.state_dict(),
                   'optimizer': self.optimizer.state_dict(),
                   'lr_scheduler': self.scheduler.state_dict(),
                   'epoch': epoch}
    print(f'Save parameters to {ckpt_path}')
    torch.save(model_state, ckpt_path)

def load_model(self, checkpoint_path):
    """Load parameters from checkpoint"""
    print(f'Load parameters from {checkpoint_path}')
    # epoch = re.match(r"[0-9]*", os.path.basename(checkpoint_path)).group(0)
    # self.epoch_i = int(epoch)
    checkpoint = torch.load(checkpoint_path)
    self.epoch_i = checkpoint.get('epoch')
    self.model.load_state_dict(checkpoint['model'])
def shift_tokens_right(self, input_ids, sos_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>)."""
    prev_output_tokens = input_ids.clone()
    prev_output_tokens[:, 0] = sos_token_id
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

def shift_tokens_right_custom(self, input_ids, pad_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).
        Also remove the token right before the first pad_token_id."""
    prev_output_tokens = input_ids.clone()
    # temp_ = input_ids[0]
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1).view(-1)
    # 1. replace each sentence's EOS with pad
    for i, index in enumerate(index_of_eos):
        prev_output_tokens[i, index] = pad_token_id
    # temp = prev_output_tokens[0]
    # 2. then drop the last token
    return prev_output_tokens[:, :-1]

@time_desc_decorator('Training Start!')
def train(self):
    # print('Training Start!')
    epoch_loss_history = []
    best_eval_loss = float('inf')  # track the best evaluation loss

    # Set up parallel training
    if self.config.n_gpu > 1:
        print("use torch.nn.DataParallel for the parallel operations.")
        self.model = nn.DataParallel(self.model)
    if self.config.local_rank != -1:
        print("use torch.nn.parallel.DistributedDataParallel for the parallel operations.")
        self.model = nn.parallel.DistributedDataParallel(self.model,
                                                         device_ids=[self.config.local_rank],
                                                         output_device=self.config.local_rank,
                                                         find_unused_parameters=True)

    for epoch_i in range(self.epoch_i, self.config.n_epoch):
        # self.epoch_i = epoch_i
        batch_loss_history = []
        loss_history = []
        num_batch = 0
        self.model.train()
        n_total_words = 0

        # Clear the gradients before each batch
        # self.optimizer.zero_grad()
        self.model.zero_grad()  # a safer way to clear gradients

        # epoch_iterator = tqdm(self.train_data_loader, desc="Iteration",
        #                       disable=self.config.local_rank not in [-1, 0])
        for batch_i, (input_ids, label_ids,
                      images, img_char_positions) in enumerate(tqdm(self.train_data_loader, ncols=80)):
            # input_ids: [batch, sentence_length]
            num_batch = batch_i

            # flatten input and target conversations
            # length without the PAD_ID tokens; the extra -1 removes the leading start token
            label_origin_length = [len(single_label) - single_label.count(PAD_ID) - 1 for single_label in label_ids]
            if self.config.is_images_embedding:
                # flatten all images in the batch into a single list
                input_images = [image for sentence_images in images for image in sentence_images]
                # index of each image within its sentence
                input_image_indexes = [i for sentence_images_index in img_char_positions
                                       for i in sentence_images_index]
                # number of images contained in each sentence
                input_images_length = [len(sentence_images_index) for sentence_images_index in img_char_positions]

                # make sure the number of images matches the number of image indexes
                assert len(input_images) == sum(input_images_length)
                assert len(input_image_indexes) == sum(input_images_length)

            # input_sentences = to_var(torch.LongTensor(input_ids))
            input_sentences = torch.LongTensor(input_ids).to(self.config.device)
            # target_sentences = to_var(torch.LongTensor(label_ids))
            target_sentences = torch.LongTensor(label_ids).to(self.config.device)
            # target_sentence_length = to_var(torch.LongTensor(label_origin_length))
            target_sentence_length = torch.LongTensor(label_origin_length).to(self.config.device)
            if self.config.is_images_embedding:
                # input_images = to_var(torch.stack(input_images))
                # input_images_length = to_var(torch.LongTensor(input_images_length))
                # input_image_indexes = to_var(torch.LongTensor(input_image_indexes))
                input_images = torch.stack(input_images).to(self.config.device)
                input_images_length = torch.LongTensor(input_images_length).to(self.config.device)
                input_image_indexes = torch.LongTensor(input_image_indexes).to(self.config.device)
            else:
                input_images = None
                input_images_length = None
                input_image_indexes = None

            # if self.config.gradient_accumulation_step == 1:
            #     # reset gradient
            #     self.optimizer.zero_grad()
            #     self.model.zero_grad()

            attention_mask = input_sentences.ne(0).long()
            # decoder_input_ids = target_sentences[:, :-1]  # GPT decoder input: drop the trailing end token
            decoder_input_ids = self.shift_tokens_right_custom(target_sentences, PAD_ID)  # remove EOS_ID
            outputs = self.model(input_ids=input_sentences,
                                 input_images=input_images,
                                 input_images_length=input_images_length,
                                 input_image_indexes=input_image_indexes,
                                 attention_mask=attention_mask,  # input_sentences.eq(0)
                                 # lm_labels=target_sentences,
                                 decoder_input_ids=decoder_input_ids)

            # sentence_logits = self.model(
            #    input_sentences,
            #    input_sentence_length,
            #    input_conversation_length,
            #    target_sentences,
            #    input_images,
            #    input_images_length=input_images_length,
            #    input_image_indexes=input_image_indexes)

            decoder_target_label_ids = target_sentences[:, 1:]  # GPT decoder labels: drop the leading start token
            sentence_logits = outputs[0]  # Bart logits
            batch_loss, n_words = masked_cross_entropy(
                sentence_logits,
                decoder_target_label_ids,
                target_sentence_length)

            if self.config.n_gpu > 1:
                # mean() to average on multi-gpu parallel (not distributed) training
                batch_loss = batch_loss.mean()
                n_words = n_words.mean()
            if self.config.gradient_accumulation_step > 1:
                batch_loss = batch_loss / self.config.gradient_accumulation_step
                n_words = n_words / self.config.gradient_accumulation_step

            # assert not isnan(batch_loss.item())
            batch_loss_history.append(batch_loss.item())
            n_total_words += n_words.item()
            # loss_history.append(loss)

            if batch_i % self.config.print_every == 0:
                # tqdm.write(
                #    f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {loss.item():.3f}')
                # tqdm.write(
                #     f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {batch_loss.item()/ n_words.item():.3f}')
                print(
                    f'Epoch: {epoch_i + 1}, iter {batch_i}: loss = {batch_loss.item() / n_words.item():.3f}')

            # Back-propagation
            # loss.backward()
            batch_loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.clip)

            # gradient accumulation
            if (batch_i + 1) % self.config.gradient_accumulation_step == 0:
                # Run optimizer & scheduler
                self.optimizer.step()
                self.scheduler.step()
                # self.optimizer.zero_grad()  # clear gradients
                self.model.zero_grad()

        torch.cuda.empty_cache()
        gc.collect()

        # epoch_loss = np.sum(loss_history) / (num_batch + 1)
        epoch_loss = np.sum(batch_loss_history) / n_total_words
        epoch_loss_history.append(epoch_loss)
        self.epoch_loss = epoch_loss

        print_str = f'Epoch {epoch_i+1} loss average: {epoch_loss:.3f}'
        print(print_str)

        if epoch_i % self.config.save_every_epoch == 0:
            self.save_model(epoch_i + 1)

        # Only evaluate when single GPU otherwise metrics may not average well
        if self.config.local_rank == -1 and self.config.model_val:
            # print('\n<BLEU score>...')
            # self.calculate_bleu()

            print('\n<Validation>...')
            self.validation_loss = self.evaluate()

            # save the best validation loss model
            if self.validation_loss < best_eval_loss:
               self.save_model(epoch_i, best='best_model')
               # update the best validation loss
               best_eval_loss = self.validation_loss
        #
        # if epoch_i % self.config.plot_every_epoch == 0:
        #     self.write_summary(epoch_i)

    self.save_model(self.config.n_epoch)

    return epoch_loss_history

This is my full script. The error is raised from the base class forward call in torch/nn/modules/module.py.

@rohan-varma (Member)

This may be an issue with unused parameter detection in DDP, although it is hard to debug without your model's definition/its forward() function. Could you share that as well?

@yinghuang

I met the same issue, but I solved it.
The reason is that in my model class I define an FPN module with 5 levels of output feature maps in the __init__ function,
but in the forward function I only use 4 of them.
When I use all of them, the problem is solved.
My conclusion: you should use all outputs of each module in the forward function.
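
A minimal sketch of that failure mode (module and layer names are made up for illustration, not taken from this thread): the unused head's parameters never receive gradients, so DDP's reducer raises this error on the next iteration.

import torch
import torch.nn as nn

class TwoHeads(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.head_a = nn.Linear(16, 4)
        self.head_b = nn.Linear(16, 4)  # defined in __init__ but never used in forward

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return self.head_a(feat)  # head_b gets no gradient -> DDP complains

# Fix: also use head_b's output in the loss, delete the unused module,
# or pass find_unused_parameters=True to DistributedDataParallel.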

@manueldiaz96

Or if it is the case, use the find_unused_parameters=True option when wrapping the model in torch.nn.parallel.DistributedDataParallel.
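
For reference, the flag goes on the DDP constructor; a typical call looks like this (assuming local_rank holds the GPU index of the current process):

model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    find_unused_parameters=True)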

@relh commented Mar 13, 2021

Or if it is the case, use the find_unused_parameters=True option when wrapping the model in torch.nn.parallel.DistributedDataParallel.

I found that find_unused_parameters=True started to hang indefinitely after 3 steps, depending on how complex the forward pass was. I fixed it by moving the creation of the pseudo-labels I use into a separate function at the beginning of the forward pass; somehow having it in a separate function (with things that go out of scope presumably being freed) stopped the find_unused_parameters=True option from hanging forever.

@zeakey commented Mar 30, 2021

@rohan-varma I ran into the same error.

My case is similar to @yinghuang's: I have a parallel dual-BN architecture and each sample goes through a single BN path according to an input flag.

However, I'm not sure whether the unused-parameter detection only flags parameters that are used in forward but whose outputs are not used to compute the loss, or whether parameters are not allowed to even exist if their forward() is never called.

In my case each sample goes through a specific BN branch, and the forward() of the other BN branch is not called.
It seems that the uninvolved parameters are not allowed to exist even if they are not involved in producing any outputs?

@CanyonWind commented Sep 5, 2021

Can someone explain a little bit why this was made a RuntimeError? It looks to me like it should be a warning instead of an error: even if some parameters/modules are not used in the forward pass for calculating the loss, it doesn't break anything, it just brings some overhead and redundancy. Thanks

@fingertap commented Nov 9, 2021

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.
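
A minimal sketch of this workaround (net, loss and optimizer stand in for your own objects):

# zero-weighted sum of every parameter, so all of them enter the autograd graph
loss = loss + 0. * sum(p.sum() for p in net.parameters())
loss.backward()
optimizer.step()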

@csufangyu

I get the same error, but when I train, everything is OK. This error only occurred during testing.

@fingertap

Can someone explain a little bit why this was made a RuntimeError? It looks to me like it should be a warning instead of an error: even if some parameters/modules are not used in the forward pass for calculating the loss, it doesn't break anything, it just brings some overhead and redundancy. Thanks

As far as I know, this happens in distributed training when multiple GPUs need to communicate with each other to compute the loss for the whole minibatch. If the numbers of loss terms differ, the GPU with fewer loss terms may quit the communication earlier than the others. However, the other GPUs may still be waiting for this GPU to reply, which leads to a deadlock.

@maxwellzh (Contributor)

It has been two years since this issue was reported. I also ran into it recently, and found a workaround.

My code looks like this:

class Model(nn.Module):
    ...
    def forward(self, ...):
        # I disable the grad tracker for some operations
        with torch.no_grad():
            # do something with self.net
        output = self.net(...)  # the output is tracked by autograd, so self.net is indeed involved in the computation
        loss = criterion(output, ...)
        return loss
....

Then the error is raised, and finally I found that this makes a fix:

class Model(nn.Module):
    ...
    def forward(self, ...):
        # I disable the grad tracker for some operations
        with torch.no_grad():
            self.net.requires_grad_(False)  # <- Added line 1
            # do something with self.net
            self.net.requires_grad_(True)  # <- Added line 2
        output = self.net(...)  # the output is tracked by autograd, so self.net is indeed involved in the computation
        loss = criterion(output, ...)
        return loss
....

Hope this helps.

@jumxglhf

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.

This is a lifesaver! Though some overhead is introduced, at least things now run under DDP. Thanks for the suggestion.

@zerovl commented Jun 15, 2022

@maxwellzh Thanks a lot! The solution works for my code. Without the added lines, the error is raised. I am wondering why. Do you have any thoughts?

@zeakey commented Jun 15, 2022

@maxwellzh Thanks a lot! The solution works for my code. Without the added lines, the error is raised. I am wondering why. Do you have any thoughts?

The error says that there are parameters not used to compute the loss. If you add all the parameters into the loss, even with a multiplicative factor of zero, all parameters are technically used for the loss computation.

@maxwellzh (Contributor)

@maxwellzh Thanks a lot! The solution works for my code. Without the added lines, the error is raised. I am wondering why. Do you have any thoughts?

@zerovl No... I was trying every way I could to fix the issue at that time and happened to find the workaround. Probably a bug in PyTorch, I guess.

@zeakey Though torch says there are unused parameters, I think it's a false alarm. Please have a look at the code I pasted above. All parameters were indeed used in the loss computation, but somehow torch just thought they weren't.

@zerovl commented Jun 15, 2022

@maxwellzh @zeakey
Thanks for your replies. I agree with @maxwellzh that this error could be a false alarm, and the parameters of my network are all used in the loss computation.
I ran into this error because I use "with torch.no_grad()" in the __call__ method of a manually defined class. If I use "with torch.no_grad()" in a normal training loop, no error is raised.

class ABC:

    def __init__(self, ....):
        ....

    def __call__(self, model, criterion, image):
        # first time forward without grad 
        with torch.no_grad():
            # model.requires_grad_(False) # Line 1
            pred_no_grad = model(image)
            # model.requires_grad_(True) # Line 2
        
        # second time forward with grad
        pred = model(image)
        loss = criterion(pred, pred_no_grad, target)

        return loss

Without Line 1 and Line 2, the error will be raised.

@zeakey commented Jun 16, 2022

@maxwellzh Oh my bad. It might be a false alarm from PyTorch. I occasionally ran into this error due to unused parameters during training. In my case, the find_unused_parameters flag in torch.nn.parallel.DistributedDataParallel suppresses this error.

@jainraj commented Oct 3, 2022

Any plan from PyTorch to fix this?

@Mi-Peng commented Nov 23, 2022

It has been two years since this issue was reported. I also ran into it recently, and found a workaround: add self.net.requires_grad_(False) / self.net.requires_grad_(True) around the work done inside the torch.no_grad() block (full snippet in @maxwellzh's comment above).

Thanks a lot! It works for me. I guess that with torch.no_grad() tells the model not to calculate gradients, so the parameters should not require grad anymore, but I don't know why the parameters still require grad in DDP mode. Thanks again.

@JakobHavtorn commented Nov 30, 2022

This issue also occurs when you want to apply LayerDrop while using DDP.

LayerDrop skips an entire layer in the forward pass, so no parameters of the skipped layer are used for the loss computation. Hence they are missing from the autograd graph and the same error gets raised by DDP.

The workaround by adding the sum of parameter values multiplied by zero to the loss also works here, but it would be more efficient and much more elegant to be able to simply

  1. ignore such parameters if no worker has gradients for them, or
  2. average the gradients that do exist, ignoring the missing ones.

EDIT 22/02 2022:
A more elegant solution to my problem is setting find_unused_parameters=True in torch.nn.parallel.DistributedDataParallel. This introduces a small overhead to training due to an extra reduction. I'm not sure whether this is faster or slower than adding zero times all parameters to the loss.
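
For illustration, a minimal LayerDrop-style stack under DDP (the layer sizes, drop probability, and rank variable are assumptions, not code from this thread); with find_unused_parameters=True the reducer tolerates the randomly skipped layers:

import random
import torch.nn as nn

class LayerDropStack(nn.Module):
    def __init__(self, num_layers=6, p_drop=0.1):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(num_layers))
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            if self.training and random.random() < self.p_drop:
                continue  # the skipped layer contributes no gradient this step
            x = layer(x)
        return x

# model = nn.parallel.DistributedDataParallel(LayerDropStack().to(rank), device_ids=[rank],
#                                             find_unused_parameters=True)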

@saulgoodman08

It has been two years since this issue was reported. I also ran into it recently, and found a workaround: in the forward() of class Model, add self.net.requires_grad_(False) / self.net.requires_grad_(True) around the work done inside the torch.no_grad() block (full snippet in @maxwellzh's comment above).

May I ask where this class is? In which dependency file? Thx a lot!

@maxwellzh (Contributor)

May I ask where this class is? In which dependency file?

@saulgoodman08 This is just an example code snippet. You should make the corresponding changes in your own code.

@liangyukkk

--distributed-backend 'nccl' --ddp-backend "no_c10d" \

@QinHsiu commented Apr 11, 2023

I use the following code:
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
and it fixed the problem for me.

@wtliao commented Apr 30, 2023

find_unused_parameters

I still had a similar issue even though I use a single GPU. I solved it by setting "find_unused_parameters=True". Thanks a lot for the many hints.

@sanj909 commented Aug 2, 2023

I was having the same error. In the __init__() method of my custom class Decoder(torch.nn.Module), I changed the code below

self.dec_layer = nn.TransformerDecoderLayer(embedding_dim, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = nn.TransformerDecoder(self.dec_layer, num_layers=num_layers)

to the code below

dec_layer = nn.TransformerDecoderLayer(embedding_dim, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

and I no longer get the error. The forward method of my class only explicitly calls self.decoder, and does not explicitly call self.dec_layer (transformer_out = self.decoder(target, memory, tgt_mask=tgt_mask, tgt_key_padding_mask=tgt_key_padding_mask)).

The parameters of the module dec_layer were being used implicitly within the module self.decoder, which is why my model still trained properly when I passed find_unused_parameters=True when wrapping the model in torch.nn.parallel.DistributedDataParallel. However, the root of the problem was that making dec_layer a class attribute means that PyTorch 'counts' these parameters twice, once as part of self.dec_layer and again as part of self.decoder.

@parthkvv commented Aug 4, 2023

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.

Thank you! This worked for me.
Just in case someone is wondering how to implement it (a beginner like me), here is the major modification I made to my code (apart from other minor ones) in the training_step:

(The old and new code were attached as screenshots in the original comment.)

@vwxyzjn commented Sep 12, 2023

Btw something that helped me was to run

names, params = [], []
for name, param in model.named_parameters():
    names.append(name)
    params.append(param)
    print(name, param.shape, param.requires_grad)

and I could set the particular param that was not used during the forward pass to param.requires_grad = False to avoid the issue. See https://gist.github.com/vwxyzjn/45fc8706dfb3cf33695f0f57cc44a533?permalink_comment_id=4689479#gistcomment-4689479 as an example.
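
A tiny illustration of that (unused_head is a hypothetical submodule name; do this before wrapping the model in DistributedDataParallel):

# freeze a submodule that never takes part in the forward pass,
# so DDP does not expect gradients for it
for p in model.unused_head.parameters():
    p.requires_grad = False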

@thuy4tbn99 commented Oct 24, 2023

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.

I got the same error and followed this suggestion to solve it! Thanks a lot. Code below:
loss = loss + 0. * sum(p.sum() for p in net.parameters())

@chengamo commented Oct 25, 2023

I ran into the same problem because I used with torch.no_grad() around the call to my newly defined MLP:

with torch.no_grad():
    prefix = self.mlp(x)

@naraharibm

If you are getting this because of accelerate, just make sure you ran accelerate config properly. I use it with DeepSpeed and it works fine.

@JimmmmmL

I was having the same error. In the __init__() method of my custom class Decoder(torch.nn.Module), I changed self.dec_layer (a class attribute) into a local variable dec_layer before passing it to nn.TransformerDecoder, and I no longer get the error. The root of the problem was that making dec_layer a class attribute means PyTorch 'counts' these parameters twice, once as part of self.dec_layer and again as part of self.decoder.

Yes! This is the case. I tried your method and finally fixed this haunting bug!
