Parameters out of sync over different ranks due to unused parameters #128949

Closed
yyk-wew opened this issue Jun 18, 2024 · 0 comments

yyk-wew commented Jun 18, 2024

🐛 Describe the bug

Hi. When training a model with DDP, I found that the RuntimeError about unused parameters was not raised as expected, even though find_unused_parameters was set to False.

Here’s a toy code snippet for reproduction.

import torch.nn as nn
import torch
import torch.distributed as dist
import torch.optim as optim

class Toy(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.a = nn.Linear(1, 1)
        self.b = nn.Linear(1, 1)

    def forward(self, x):
        return self.a(x)    # self.b is unused

dist.init_process_group(backend='nccl')
model = Toy().to(device=dist.get_rank())    # one process (and GPU) per rank
ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=False)
optimizer = optim.SGD(model.parameters(), lr=1)
optimizer.zero_grad()

x = torch.randn(2, 1).to(device=dist.get_rank())    # each rank draws a different random input
out = ddp_model(x)
out.mean().backward()

print(dist.get_rank(), ddp_model.module.a.weight)   # weight before the optimizer step
optimizer.step()
print(dist.get_rank(), ddp_model.module.a.weight)   # weight after the optimizer step
dist.destroy_process_group()

Running the code above with torchrun --nnodes=1 --nproc-per-node=2 --standalone demo.py, I got the following output:

0 Parameter containing:
tensor([[0.6218]], device='cuda:0', requires_grad=True)
0 Parameter containing:
tensor([[0.7547]], device='cuda:0', requires_grad=True)
1 Parameter containing:
tensor([[0.6218]], device='cuda:1', requires_grad=True)
1 Parameter containing:
tensor([[0.9173]], device='cuda:1', requires_grad=True)

After one optimization step, the weight of model.a is out of sync across the two ranks. I expected a RuntimeError to be raised here, reminding me to set find_unused_parameters=True, but none was.
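
Not part of the original snippet, but for completeness, here is a minimal sketch of how the desync can be checked programmatically (assert_params_in_sync is a hypothetical helper, not a PyTorch API): it all-gathers every parameter and compares each rank's replica against rank 0's copy.

import torch
import torch.distributed as dist

def assert_params_in_sync(module: torch.nn.Module) -> None:
    # All-gather each parameter and compare every rank's replica to rank 0's copy.
    for name, param in module.named_parameters():
        replicas = [torch.empty_like(param) for _ in range(dist.get_world_size())]
        dist.all_gather(replicas, param.detach())
        for rank, replica in enumerate(replicas):
            if not torch.equal(replicas[0], replica):
                raise RuntimeError(f"{name} differs between rank 0 and rank {rank}")

# In the repro above, assert_params_in_sync(ddp_model.module) would pass before
# optimizer.step() and raise for a.weight after it.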

As expected, commenting out the definition of self.b or setting find_unused_parameters=True produces the correct, synchronized result.
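
For reference, a sketch of that workaround; per the DDP documentation, find_unused_parameters=True makes DDP traverse the autograd graph after each forward pass to mark parameters that did not contribute to the loss, at some extra overhead per iteration:

# Workaround sketch: let DDP detect parameters that take no part in producing
# the loss, so their gradient buckets are still marked ready for allreduce.
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True,
)

Note (my understanding, not stated in the report): with find_unused_parameters=False, DDP only detects the missing reduction on the next training iteration, so extending the snippet above to a second forward/backward pass does raise the expected RuntimeError.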

Versions

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.4.143.bsk.7-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY

Nvidia driver version: 470.129.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

yyk-wew closed this as completed Jun 18, 2024