RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #43259

QiuSYang opened this issue Aug 19, 2020 · 33 comments
Labels: module: data parallel, oncall: distributed, triaged


@QiuSYang commented Aug 19, 2020

🐛 Bug

To Reproduce

Epoch: 1, iter 0: loss = 10.099
0%| | 1/144967 [00:02<116:54:31, 2.90s/it]
Traceback (most recent call last):
File "train.py", line 99, in
solver.train()
File "/home/yckj2453/nlp_space/jd_multimodal_dialogue/multi-modal-dialogue-transformer_bart/utils/time_track.py", line 18, in timed
result = method(*args, **kwargs)
File "/home/yckj2453/nlp_space/jd_multimodal_dialogue/multi-modal-dialogue-transformer_bart/solver.py", line 284, in train
decoder_input_ids=decoder_input_ids)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 473, in forward
self.reducer.prepare_for_backward(list(_find_tensors(output)))
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Traceback (most recent call last):
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/root/anaconda3/envs/jddc_mddr/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/jddc_mddr/bin/python', '-u', 'train.py', '--local_rank=0']' returned non-zero exit status 1.

Steps to reproduce the behavior:

Expected behavior

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version: 1.5.1
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version: 3.7.5
  • CUDA/cuDNN version: 10.1/7.6
  • GPU models and configuration:
  • Any other relevant information:
    pip transformers==2.11.0
    pip numpy==1.19.0

Additional context

Here is my code:

def train(self):
    epoch_loss_history = []
    best_eval_loss = float('inf')  # track the best evaluation loss

    # Set up parallel training
    if self.config.n_gpu > 1:
        print("use torch.nn.DataParallel for the parallel operations.")
        self.model = nn.DataParallel(self.model)
    if self.config.local_rank != -1:
        print("use torch.nn.parallel.DistributedDataParallel for the parallel operations.")
        self.model = nn.parallel.DistributedDataParallel(self.model,
                                                         device_ids=[self.config.local_rank],
                                                         output_device=self.config.local_rank,
                                                         find_unused_parameters=True)

    for epoch_i in range(self.epoch_i, self.config.n_epoch):
        # self.epoch_i = epoch_i
        batch_loss_history = []
        loss_history = []
        num_batch = 0
        self.model.train()
        n_total_words = 0

        # Clear the gradients before each batch
        # self.optimizer.zero_grad()
        self.model.zero_grad()  # a safer way to clear gradients

        # epoch_iterator = tqdm(self.train_data_loader, desc="Iteration",
        #                       disable=self.config.local_rank not in [-1, 0])
        for batch_i, (input_ids, label_ids,
                      images, img_char_positions) in enumerate(tqdm(self.train_data_loader, ncols=80)):
            # input_ids: [batch, sentence_length]
            num_batch = batch_i

            # flatten input and target conversations
            # length without the PAD_ID tokens; the extra -1 removes the leading start token
            label_origin_length = [len(single_label) - single_label.count(PAD_ID) - 1 for single_label in label_ids]
            if self.config.is_images_embedding:
                # flatten all images in the batch into a single list
                input_images = [image for sentence_images in images for image in sentence_images]
                # index of each image within its sentence
                input_image_indexes = [i for sentence_images_index in img_char_positions
                                       for i in sentence_images_index]
                # number of images contained in each sentence
                input_images_length = [len(sentence_images_index) for sentence_images_index in img_char_positions]

                # make sure the number of images matches the number of image indexes
                assert len(input_images) == sum(input_images_length)
                assert len(input_image_indexes) == sum(input_images_length)

            input_sentences = to_var(torch.LongTensor(input_ids))
            target_sentences = to_var(torch.LongTensor(label_ids))
            target_sentence_length = to_var(torch.LongTensor(label_origin_length))
            if self.config.is_images_embedding:
                input_images = to_var(torch.stack(input_images))
                input_images_length = to_var(torch.LongTensor(input_images_length))
                input_image_indexes = to_var(torch.LongTensor(input_image_indexes))
            else:
                input_images = None
                input_images_length = None
                input_image_indexes = None

            # if self.config.gradient_accumulation_step == 1:
            #     # reset gradient
            #     self.optimizer.zero_grad()
            #     self.model.zero_grad()

            attention_mask = input_sentences.ne(0).long()
            # decoder_input_ids = target_sentences[:, :-1]  # GPT decoder input: drop the trailing end token
            decoder_input_ids = self.shift_tokens_right_custom(target_sentences, PAD_ID)  # remove EOS_ID
            outputs = self.model(input_ids=input_sentences,
                                 input_images=input_images,
                                 input_images_length=input_images_length,
                                 input_image_indexes=input_image_indexes,
                                 attention_mask=attention_mask,  # input_sentences.eq(0)
                                 # lm_labels=target_sentences,
                                 decoder_input_ids=decoder_input_ids)

            # sentence_logits = self.model(
            #    input_sentences,
            #    input_sentence_length,
            #    input_conversation_length,
            #    target_sentences,
            #    input_images,
            #    input_images_length=input_images_length,
            #    input_image_indexes=input_image_indexes)

            decoder_target_label_ids = target_sentences[:, 1:]  # GPT decoder labels: drop the leading start token
            sentence_logits = outputs[0]  # Bart logits
            batch_loss, n_words = masked_cross_entropy(
                sentence_logits,
                decoder_target_label_ids,
                target_sentence_length)

            if self.config.n_gpu > 1:
                # mean() to average on multi-gpu parallel (not distributed) training
                batch_loss = batch_loss.mean()
                n_words = n_words.mean()
            if self.config.gradient_accumulation_step > 1:
                batch_loss = batch_loss / self.config.gradient_accumulation_step
                n_words = n_words / self.config.gradient_accumulation_step

            # assert not isnan(batch_loss.item())
            batch_loss_history.append(batch_loss.item())
            n_total_words += n_words.item()
            # loss_history.append(loss)

            if batch_i % self.config.print_every == 0:
                # tqdm.write(
                #    f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {loss.item():.3f}')
                tqdm.write(
                    f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {batch_loss.item()/ n_words.item():.3f}')

            # Back-propagation
            # loss.backward()
            batch_loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.clip)

            # gradient accumulation
            if (batch_i + 1) % self.config.gradient_accumulation_step == 0:
                # Run optimizer & scheduler
                self.optimizer.step()
                self.scheduler.step()
                # self.optimizer.zero_grad()  # clear gradients
                self.model.zero_grad()

        torch.cuda.empty_cache()
        gc.collect()

        # epoch_loss = np.sum(loss_history) / (num_batch + 1)
        epoch_loss = np.sum(batch_loss_history) / n_total_words
        epoch_loss_history.append(epoch_loss)
        self.epoch_loss = epoch_loss

        print_str = f'Epoch {epoch_i+1} loss average: {epoch_loss:.3f}'
        print(print_str)

        if epoch_i % self.config.save_every_epoch == 0:
            self.save_model(epoch_i + 1)

        # Only evaluate when single GPU otherwise metrics may not average well
        if self.config.local_rank == -1:
            # print('\n<BLEU score>...')
            # self.calculate_bleu()

            print('\n<Validation>...')
            self.validation_loss = self.evaluate()

            # save the best validation loss model
            if self.validation_loss < best_eval_loss:
               self.save_model(epoch_i, best='best_model')
               # update the best validation loss
               best_eval_loss = self.validation_loss
        #
        # if epoch_i % self.config.plot_every_epoch == 0:
        #     self.write_summary(epoch_i)

    self.save_model(self.config.n_epoch)

    return epoch_loss_history

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski

@pbelevich added the module: data parallel, oncall: distributed, and triaged labels on Aug 21, 2020
@pritamdamania87 (Contributor)

@QiuSYang Could you share a complete script to reproduce this issue locally on a GPU machine? Also, it would be useful if you could share the complete definition of your model with its forward function. This would help us in debugging this issue further.

@QiuSYang (Author)

Could you share a complete script to reproduce this issue locally on a GPU machine? Also, it would be useful if you could share the complete definition of your model with its forward function. This would help us in debugging this issue further.

import numpy as np
import torch
import torch.nn as nn
import transformer.models as models
from layers import masked_cross_entropy
from utils import to_var, time_desc_decorator, embedding_metric
import os
from tqdm import tqdm
from math import isnan
import re
import gc
from transformers import AdamW, get_linear_schedule_with_warmup
from transformer import BertConfig, BartConfig, BartForConditionalGeneration
import math
import pickle
from utils import SOS_ID, PAD_ID, EOS_ID
import time
import offline_dev_data_postprocess
from utils import BLEUEvaluator

word2vec_path = "../datasets/GoogleNews-vectors-negative300.bin"

class Solver(object):
def __init__(self, config, train_data_loader, eval_data_loader, vocab, is_train=True, model=None):
    self.config = config
    self.epoch_i = 0
    self.train_data_loader = train_data_loader
    self.eval_data_loader = eval_data_loader
    self.vocab = vocab
    self.is_train = is_train
    self.model = model

@time_desc_decorator('Build Graph')
def build(self, cuda=True):

    if self.model is None:
        config_bart = BartConfig.from_json_file(json_file=self.config.bart_config)
        if self.config.bert_config:
            config_bert = BertConfig.from_json_file(json_file=self.config.bert_config)
            # unify the vocabulary sizes
            if config_bert.vocab_size != self.config.vocab_size:
                config_bert.vocab_size = self.config.vocab_size
            config_bart.vocab_size = config_bert.vocab_size
            # set the special tokens
            config_bert.eos_token_id = EOS_ID
            config_bert.bos_token_id = SOS_ID
            config_bert.pad_token_id = PAD_ID
        else:
            config_bart.vocab_size = self.config.vocab_size
        config_bart.max_length = self.config.max_sentence_length  # maximum generated sentence length
        config_bart.is_images_embedding = self.config.is_images_embedding
        # Initializing a bart model from the custom style configurations
        if not self.is_train:
            # inference no load pre_training model
            self.config.bert_pre_training = None
        self.model = BartForConditionalGeneration(config=config_bart,
                                                  bert_config=config_bert,
                                                  bert_pretrained_model_name_or_path=self.config.bert_pre_training)
        self.model.resize_token_embeddings(config_bart.vocab_size)

        # # orthogonal initialiation for hidden weights
        # # input gate bias for GRUs
        # if self.config.mode == 'train' and self.config.checkpoint is None:
        #    print('Parameter initiailization')
        #   for name, param in self.model.named_parameters():
        #        if 'weight_hh' in name:
        #            print('\t' + name)
        #            nn.init.orthogonal_(param)
        #
        #         bias_hh is concatenation of reset, input, new gates
        #         only set the input gate bias to 2.0
        #        if 'bias_hh' in name:
        #            print('\t' + name)
        #            dim = int(param.size(0) / 3)
        #            param.data[dim:2 * dim].fill_(2.0)

    if torch.cuda.is_available() and cuda:
        # self.model.cuda()
        self.model.to(self.config.device)

    # Overview Parameters
    # print('Model Parameters')
    # for name, param in self.model.named_parameters():
    #     print('\t' + name + '\t', list(param.size()))

    if self.config.checkpoint:
        self.load_model(self.config.checkpoint)

    if self.is_train:
        no_decay = ['bias', 'layer_norm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
             'weight_decay': self.config.weight_decay},
            {'params': [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],
             'weight_decay': 0.0}
        ]
        self.optimizer = AdamW(optimizer_grouped_parameters,
                               lr=self.config.learning_rate,
                               eps=self.config.adam_epsilon)
        # self.optimizer = AdamW(self.model.parameters(),
        #                        lr=self.config.learning_rate,
        #                        eps=self.config.adam_epsilon)
        self.scheduler = get_linear_schedule_with_warmup(self.optimizer,
                                                         num_warmup_steps=self.config.num_warmup_steps,
                                                         num_training_steps=self.config.num_training_steps)
        if self.config.checkpoint:
            checkpoint = torch.load(self.config.checkpoint)
            self.scheduler.load_state_dict(checkpoint["lr_scheduler"])

        # # 并行计算
        # if torch.cuda.device_count() > 1:
        #     print("Training model with multi-gpus.")
        #     self.model = nn.DataParallel(self.model,
        #                                  device_ids=[idx for idx in range(torch.cuda.device_count())])
        #     # torch.backends.cudnn.benchmark = True  # 增加运行效率
        #     self.multi_gpus = True

    # if self.is_train:
    #    self.optimizer = self.config.optimizer(
    #        filter(lambda p: p.requires_grad, self.model.parameters()),
    #        lr=self.config.learning_rate)

def save_model(self, epoch, best=None):
    """Save parameters to checkpoint"""
    if best:
        ckpt_path = os.path.join(self.config.save_path, f'{best}.pkl')
    else:
        ckpt_path = os.path.join(self.config.save_path, f'{epoch}.pkl')
    model_state = {'model': self.model.state_dict(),
                   'optimizer': self.optimizer.state_dict(),
                   'lr_scheduler': self.scheduler.state_dict(),
                   'epoch': epoch}
    print(f'Save parameters to {ckpt_path}')
    torch.save(model_state, ckpt_path)

def load_model(self, checkpoint_path):
    """Load parameters from checkpoint"""
    print(f'Load parameters from {checkpoint_path}')
    # epoch = re.match(r"[0-9]*", os.path.basename(checkpoint_path)).group(0)
    # self.epoch_i = int(epoch)
    checkpoint = torch.load(checkpoint_path)
    self.epoch_i = checkpoint.get('epoch')
    self.model.load_state_dict(checkpoint['model'])
def shift_tokens_right(self, input_ids, sos_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>)."""
    prev_output_tokens = input_ids.clone()
    prev_output_tokens[:, 0] = sos_token_id
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

def shift_tokens_right_custom(self, input_ids, pad_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>).
        Also remove the token right before the first pad_token_id."""
    prev_output_tokens = input_ids.clone()
    # temp_ = input_ids[0]
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1).view(-1)
    # 1. replace each sentence's EOS with pad
    for i, index in enumerate(index_of_eos):
        prev_output_tokens[i, index] = pad_token_id
    # temp = prev_output_tokens[0]
    # 2. then drop the last token
    return prev_output_tokens[:, :-1]

@time_desc_decorator('Training Start!')
def train(self):
    # print('Training Start!')
    epoch_loss_history = []
    best_eval_loss = float('inf')  # track the best evaluation loss

    # Set up parallel training
    if self.config.n_gpu > 1:
        print("use torch.nn.DataParallel for the parallel operations.")
        self.model = nn.DataParallel(self.model)
    if self.config.local_rank != -1:
        print("use torch.nn.parallel.DistributedDataParallel for the parallel operations.")
        self.model = nn.parallel.DistributedDataParallel(self.model,
                                                         device_ids=[self.config.local_rank],
                                                         output_device=self.config.local_rank,
                                                         find_unused_parameters=True)

    for epoch_i in range(self.epoch_i, self.config.n_epoch):
        # self.epoch_i = epoch_i
        batch_loss_history = []
        loss_history = []
        num_batch = 0
        self.model.train()
        n_total_words = 0

        # Clear the gradients before each batch
        # self.optimizer.zero_grad()
        self.model.zero_grad()  # a safer way to clear gradients

        # epoch_iterator = tqdm(self.train_data_loader, desc="Iteration",
        #                       disable=self.config.local_rank not in [-1, 0])
        for batch_i, (input_ids, label_ids,
                      images, img_char_positions) in enumerate(tqdm(self.train_data_loader, ncols=80)):
            # input_ids: [batch, sentence_length]
            num_batch = batch_i

            # flatten input and target conversations
            # length without the PAD_ID tokens; the extra -1 removes the leading start token
            label_origin_length = [len(single_label) - single_label.count(PAD_ID) - 1 for single_label in label_ids]
            if self.config.is_images_embedding:
                # flatten all images in the batch into a single list
                input_images = [image for sentence_images in images for image in sentence_images]
                # index of each image within its sentence
                input_image_indexes = [i for sentence_images_index in img_char_positions
                                       for i in sentence_images_index]
                # number of images contained in each sentence
                input_images_length = [len(sentence_images_index) for sentence_images_index in img_char_positions]

                # make sure the number of images matches the number of image indexes
                assert len(input_images) == sum(input_images_length)
                assert len(input_image_indexes) == sum(input_images_length)

            # input_sentences = to_var(torch.LongTensor(input_ids))
            input_sentences = torch.LongTensor(input_ids).to(self.config.device)
            # target_sentences = to_var(torch.LongTensor(label_ids))
            target_sentences = torch.LongTensor(label_ids).to(self.config.device)
            # target_sentence_length = to_var(torch.LongTensor(label_origin_length))
            target_sentence_length = torch.LongTensor(label_origin_length).to(self.config.device)
            if self.config.is_images_embedding:
                # input_images = to_var(torch.stack(input_images))
                # input_images_length = to_var(torch.LongTensor(input_images_length))
                # input_image_indexes = to_var(torch.LongTensor(input_image_indexes))
                input_images = torch.stack(input_images).to(self.config.device)
                input_images_length = torch.LongTensor(input_images_length).to(self.config.device)
                input_image_indexes = torch.LongTensor(input_image_indexes).to(self.config.device)
            else:
                input_images = None
                input_images_length = None
                input_image_indexes = None

            # if self.config.gradient_accumulation_step == 1:
            #     # reset gradient
            #     self.optimizer.zero_grad()
            #     self.model.zero_grad()

            attention_mask = input_sentences.ne(0).long()
            # decoder_input_ids = target_sentences[:, :-1]  # GPT decoder input: drop the trailing end token
            decoder_input_ids = self.shift_tokens_right_custom(target_sentences, PAD_ID)  # remove EOS_ID
            outputs = self.model(input_ids=input_sentences,
                                 input_images=input_images,
                                 input_images_length=input_images_length,
                                 input_image_indexes=input_image_indexes,
                                 attention_mask=attention_mask,  # input_sentences.eq(0)
                                 # lm_labels=target_sentences,
                                 decoder_input_ids=decoder_input_ids)

            # sentence_logits = self.model(
            #    input_sentences,
            #    input_sentence_length,
            #    input_conversation_length,
            #    target_sentences,
            #    input_images,
            #    input_images_length=input_images_length,
            #    input_image_indexes=input_image_indexes)

            decoder_target_label_ids = target_sentences[:, 1:]  # GPT decoder labels: drop the leading start token
            sentence_logits = outputs[0]  # Bart logits
            batch_loss, n_words = masked_cross_entropy(
                sentence_logits,
                decoder_target_label_ids,
                target_sentence_length)

            if self.config.n_gpu > 1:
                # mean() to average on multi-gpu parallel (not distributed) training
                batch_loss = batch_loss.mean()
                n_words = n_words.mean()
            if self.config.gradient_accumulation_step > 1:
                batch_loss = batch_loss / self.config.gradient_accumulation_step
                n_words = n_words / self.config.gradient_accumulation_step

            # assert not isnan(batch_loss.item())
            batch_loss_history.append(batch_loss.item())
            n_total_words += n_words.item()
            # loss_history.append(loss)

            if batch_i % self.config.print_every == 0:
                # tqdm.write(
                #    f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {loss.item():.3f}')
                # tqdm.write(
                #     f'Epoch: {epoch_i+1}, iter {batch_i}: loss = {batch_loss.item()/ n_words.item():.3f}')
                print(
                    f'Epoch: {epoch_i + 1}, iter {batch_i}: loss = {batch_loss.item() / n_words.item():.3f}')

            # Back-propagation
            # loss.backward()
            batch_loss.backward()

            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.config.clip)

            # gradient accumulation
            if (batch_i + 1) % self.config.gradient_accumulation_step == 0:
                # Run optimizer & scheduler
                self.optimizer.step()
                self.scheduler.step()
                # self.optimizer.zero_grad()  # clear gradients
                self.model.zero_grad()

        torch.cuda.empty_cache()
        gc.collect()

        # epoch_loss = np.sum(loss_history) / (num_batch + 1)
        epoch_loss = np.sum(batch_loss_history) / n_total_words
        epoch_loss_history.append(epoch_loss)
        self.epoch_loss = epoch_loss

        print_str = f'Epoch {epoch_i+1} loss average: {epoch_loss:.3f}'
        print(print_str)

        if epoch_i % self.config.save_every_epoch == 0:
            self.save_model(epoch_i + 1)

        # Only evaluate when single GPU otherwise metrics may not average well
        if self.config.local_rank == -1 and self.config.model_val:
            # print('\n<BLEU score>...')
            # self.calculate_bleu()

            print('\n<Validation>...')
            self.validation_loss = self.evaluate()

            # save the best validation loss model
            if self.validation_loss < best_eval_loss:
               self.save_model(epoch_i, best='best_model')
               # update the best validation loss
               best_eval_loss = self.validation_loss
        #
        # if epoch_i % self.config.plot_every_epoch == 0:
        #     self.write_summary(epoch_i)

    self.save_model(self.config.n_epoch)

    return epoch_loss_history

This is my full script. The error is raised from the base class forward call in torch/nn/modules/module.py.

@rohan-varma (Member)

This may be an issue with unused parameter detection in DDP, although it is hard to debug without your model's definition/its forward() function. Could you share that as well?

@yinghuang

I met the same issue, but I solved it.
The reason is that in my model class I define an FPN module with 5 levels of output feature maps in the __init__ function,
but in the forward function I only use 4 of them.
When I use all of them, the problem is solved.
My conclusion: you should use all outputs of each module in the forward function.
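
A minimal sketch of that failure mode (module and layer names are made up for illustration, not taken from this thread): the unused head's parameters never receive gradients, so DDP's reducer raises this error on the next iteration.

import torch
import torch.nn as nn

class TwoHeads(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.head_a = nn.Linear(16, 4)
        self.head_b = nn.Linear(16, 4)  # defined in __init__ but never used in forward

    def forward(self, x):
        feat = torch.relu(self.backbone(x))
        return self.head_a(feat)  # head_b gets no gradient -> DDP complains

# Fix: also use head_b's output in the loss, delete the unused module,
# or pass find_unused_parameters=True to DistributedDataParallel.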

@manueldiaz96

Or if it is the case, use the find_unused_parameters=True option when wrapping the model in torch.nn.parallel.DistributedDataParallel.
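
For reference, the flag goes on the DDP constructor; a typical call looks like this (assuming local_rank holds the GPU index of the current process):

model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
    find_unused_parameters=True)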

@relh commented Mar 13, 2021

Or if it is the case, use the find_unused_parameters=True option when wrapping the model in torch.nn.parallel.DistributedDataParallel.

I found that find_unused_parameters=True started to hang indefinitely after 3 steps, depending on how complex the forward pass was. I fixed it by moving the creation of the pseudo-labels I use into a separate function at the beginning of the forward pass; somehow having it in a separate function (with things that go out of scope presumably being freed) stopped the find_unused_parameters=True option from hanging forever.

@zeakey commented Mar 30, 2021

@rohan-varma I ran into the same error.

My case is similar to @yinghuang's: I have a parallel dual-BN architecture and each sample goes through a single BN path according to an input flag.

However, I'm not sure whether the unused-parameter detection only flags parameters that are used in forward but whose outputs are not used to compute the loss, or whether parameters are not allowed to even exist if their forward() is never called.

In my case each sample goes through a specific BN branch, and the forward() of the other BN branch is not called.
It seems that the uninvolved parameters are not allowed to exist even if they are not involved in producing any outputs?

@CanyonWind commented Sep 5, 2021

Can someone explain a little bit why this was made a RuntimeError? It looks to me like it should be a warning instead of an error: even if some parameters/modules are not used in the forward pass for calculating the loss, it doesn't break anything, it just brings some overhead and redundancy. Thanks

@fingertap commented Nov 9, 2021

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.
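
A minimal sketch of this workaround (net, loss and optimizer stand in for your own objects):

# zero-weighted sum of every parameter, so all of them enter the autograd graph
loss = loss + 0. * sum(p.sum() for p in net.parameters())
loss.backward()
optimizer.step()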

@csufangyu

I get the same error, but when I train, everything is OK. This error only occurred during testing.

@fingertap

Can someone explain a little bit why this was made a RuntimeError? It looks to me like it should be a warning instead of an error: even if some parameters/modules are not used in the forward pass for calculating the loss, it doesn't break anything, it just brings some overhead and redundancy. Thanks

As far as I know, this happens in distributed training when multiple GPUs need to communicate with each other to compute the loss for the whole minibatch. If the numbers of loss terms differ, the GPU with fewer loss terms may quit the communication earlier than the others. However, the other GPUs may still be waiting for this GPU to reply, which leads to a deadlock.

@maxwellzh (Contributor)

It has been two years since this issue was reported. I also ran into it recently, and found a workaround.

My code looks like this:

class Model(nn.Module):
    ...
    def forward(self, ...):
        # I disable the grad tracker for some operations
        with torch.no_grad():
            # do something with self.net
        output = self.net(...)  # the output is tracked by autograd, so self.net is indeed involved in the computation
        loss = criterion(output, ...)
        return loss
....

Then the error is raised, and finally I found that this makes a fix:

class Model(nn.Module):
    ...
    def forward(self, ...):
        # I disable the grad tracker for some operations
        with torch.no_grad():
            self.net.requires_grad_(False)  # <- Added line 1
            # do something with self.net
            self.net.requires_grad_(True)  # <- Added line 2
        output = self.net(...)  # the output is tracked by autograd, so self.net is indeed involved in the computation
        loss = criterion(output, ...)
        return loss
....

Hope this helps.

@jumxglhf

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.

This is a lifesaver! Though some overhead is introduced, at least things now run under DDP. Thanks for the suggestion.

@zerovl commented Jun 15, 2022

@maxwellzh Thanks a lot! The solution works for my code. Without the added lines, the error is raised. I am wondering why. Do you have any thoughts?

@zeakey commented Jun 15, 2022

@maxwellzh Thanks a lot! The solution works for my code. Without the added lines, the error is raised. I am wondering why. Do you have any thoughts?

The error says that there are parameters not used to compute the loss. If you add all the parameters into the loss, even with a multiplicative factor of zero, all parameters are technically used for the loss computation.

@maxwellzh (Contributor)

@maxwellzh Thanks a lot! The solution works for my code. Without the added lines, the error is raised. I am wondering why. Do you have any thoughts?

@zerovl No... I was trying every way I could to fix the issue at that time and happened to find the workaround. Probably a bug in PyTorch, I guess.

@zeakey Though torch says there are unused parameters, I think it's a false alarm. Please have a look at the code I pasted above. All parameters were indeed used in the loss computation, but somehow torch just thought they weren't.

@zerovl commented Jun 15, 2022

@maxwellzh @zeakey
Thanks for your replies. I agree with @maxwellzh that this error could be a false alarm, and the parameters of my network are all used in the loss computation.
I ran into this error because I use "with torch.no_grad()" in the __call__ method of a manually defined class. If I use "with torch.no_grad()" in a normal training loop, no error is raised.

class ABC:

    def __init__(self, ....):
        ....

    def __call__(self, model, criterion, image):
        # first time forward without grad 
        with torch.no_grad():
            # model.requires_grad_(False) # Line 1
            pred_no_grad = model(image)
            # model.requires_grad_(True) # Line 2
        
        # second time forward with grad
        pred = model(image)
        loss = criterion(pred, pred_no_grad, target)

        return loss

Without Line 1 and Line 2, the error will be raised.

@zeakey commented Jun 16, 2022

@maxwellzh Oh my bad. It might be a false alarm from PyTorch. I occasionally ran into this error due to unused parameters during training. In my case, the find_unused_parameters flag in torch.nn.parallel.DistributedDataParallel suppresses this error.

@jainraj commented Oct 3, 2022

Any plan from PyTorch to fix this?

@Mi-Peng commented Nov 23, 2022

It has been two years since this issue was reported. I also ran into it recently, and found a workaround: add self.net.requires_grad_(False) / self.net.requires_grad_(True) around the work done inside the torch.no_grad() block (full snippet in @maxwellzh's comment above).

Thanks a lot! It works for me. I guess that with torch.no_grad() tells the model not to calculate gradients, so the parameters should not require grad anymore, but I don't know why the parameters still require grad in DDP mode. Thanks again.

@JakobHavtorn commented Nov 30, 2022

This issue also occurs when you want to apply LayerDrop while using DDP.

LayerDrop skips an entire layer in the forward pass, so no parameters of the skipped layer are used for the loss computation. Hence they are missing from the autograd graph and the same error gets raised by DDP.

The workaround by adding the sum of parameter values multiplied by zero to the loss also works here, but it would be more efficient and much more elegant to be able to simply

  1. ignore such parameters if no worker has gradients for them, or
  2. average the gradients that do exist, ignoring the missing ones.

EDIT 22/02 2022:
A more elegant solution to my problem is setting find_unused_parameters=True in torch.nn.parallel.DistributedDataParallel. This introduces a small overhead to training due to an extra reduction. I'm not sure whether this is faster or slower than adding zero times all parameters to the loss.
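
For illustration, a minimal LayerDrop-style stack under DDP (the layer sizes, drop probability, and rank variable are assumptions, not code from this thread); with find_unused_parameters=True the reducer tolerates the randomly skipped layers:

import random
import torch.nn as nn

class LayerDropStack(nn.Module):
    def __init__(self, num_layers=6, p_drop=0.1):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(num_layers))
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            if self.training and random.random() < self.p_drop:
                continue  # the skipped layer contributes no gradient this step
            x = layer(x)
        return x

# model = nn.parallel.DistributedDataParallel(LayerDropStack().to(rank), device_ids=[rank],
#                                             find_unused_parameters=True)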

@saulgoodman08

It has been two years since this issue was reported. I also ran into it recently, and found a workaround: in the forward() of class Model, add self.net.requires_grad_(False) / self.net.requires_grad_(True) around the work done inside the torch.no_grad() block (full snippet in @maxwellzh's comment above).

May I ask where this class is? In which dependency file? Thx a lot!

@maxwellzh (Contributor)

May I ask where this class is? In which dependency file?

@saulgoodman08 This is just an example code snippet. You should make the corresponding changes in your own code.

@liangyukkk

--distributed-backend 'nccl' --ddp-backend "no_c10d" \

@QinHsiu commented Apr 11, 2023

I use the following code:
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
and it fixed the problem for me.

@wtliao commented Apr 30, 2023

find_unused_parameters

I still had a similar issue even though I use a single GPU. I solved it by setting "find_unused_parameters=True". Thanks a lot for the many hints.

@sanj909 commented Aug 2, 2023

I was having the same error. In the __init__() method of my custom class Decoder(torch.nn.Module), I changed the code below

self.dec_layer = nn.TransformerDecoderLayer(embedding_dim, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = nn.TransformerDecoder(self.dec_layer, num_layers=num_layers)

to the code below

dec_layer = nn.TransformerDecoderLayer(embedding_dim, num_heads, hidden_size, dropout, batch_first=True)
self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

and I no longer get the error. The forward method of my class only explicitly calls self.decoder, and does not explicitly call self.dec_layer (transformer_out = self.decoder(target, memory, tgt_mask=tgt_mask, tgt_key_padding_mask=tgt_key_padding_mask)).

The parameters of the module dec_layer were being used implicitly within the module self.decoder, which is why my model still trained properly when I passed find_unused_parameters=True when wrapping the model in torch.nn.parallel.DistributedDataParallel. However, the root of the problem was that making dec_layer a class attribute means that PyTorch 'counts' these parameters twice, once as part of self.dec_layer and again as part of self.decoder.

@parthkvv commented Aug 4, 2023

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.

Thank you! This worked for me.
Just in case someone is wondering how to implement it (a beginner like me), here is the major modification I made to my code (apart from other minor ones) in the training_step:

(The old and new code were attached as screenshots in the original comment.)

@vwxyzjn commented Sep 12, 2023

Btw something that helped me was to run

names, params = [], []
for name, param in model.named_parameters():
    names.append(name)
    params.append(param)
    print(name, param.shape, param.requires_grad)

and I could set the particular param that was not used during the forward pass to param.requires_grad = False to avoid the issue. See https://gist.github.com/vwxyzjn/45fc8706dfb3cf33695f0f57cc44a533?permalink_comment_id=4689479#gistcomment-4689479 as an example.
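
A tiny illustration of that (unused_head is a hypothetical submodule name; do this before wrapping the model in DistributedDataParallel):

# freeze a submodule that never takes part in the forward pass,
# so DDP does not expect gradients for it
for p in model.unused_head.parameters():
    p.requires_grad = False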

@thuy4tbn99 commented Oct 24, 2023

A workaround for this problem is to multiply the sum of all parameters by zero and add it to the final loss. Note that this may bring a small overhead in backprop.

I got the same error and followed this suggestion to solve it! Thanks a lot. Code below:
loss = loss + 0. * sum(p.sum() for p in net.parameters())

@chengamo commented Oct 25, 2023

I ran into the same problem because I used with torch.no_grad() around the call to my newly defined MLP:

with torch.no_grad():
    prefix = self.mlp(x)

@naraharibm

If you are getting this because of accelerate, just make sure you ran accelerate config properly. I use it with DeepSpeed and it works fine.

@JimmmmmL

I was having the same error. In the __init__() method of my custom class Decoder(torch.nn.Module), I changed self.dec_layer (a class attribute) into a local variable dec_layer before passing it to nn.TransformerDecoder, and I no longer get the error. The root of the problem was that making dec_layer a class attribute means PyTorch 'counts' these parameters twice, once as part of self.dec_layer and again as part of self.decoder.

Yes! This is the case. I tried your method and finally fixed this haunting bug!
