Parameters out of sync over different ranks due to unused parameters #128949

Closed
yyk-wew opened this issue Jun 18, 2024 · 0 comments

yyk-wew commented Jun 18, 2024

🐛 Describe the bug

Hi. When training a model with DDP, I found that the RuntimeError about unused parameters was not raised as expected, even though find_unused_parameters was set to False.

Here’s a toy code snippet for reproduction.

import torch.nn as nn
import torch
import torch.distributed as dist
import torch.optim as optim

class Toy(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.a = nn.Linear(1, 1)
        self.b = nn.Linear(1, 1)

    def forward(self, x):
        return self.a(x)    # self.b is unused

dist.init_process_group(backend='nccl')
model = Toy().to(device=dist.get_rank())    # one process (and GPU) per rank
ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=False)
optimizer = optim.SGD(model.parameters(), lr=1)
optimizer.zero_grad()

x = torch.randn(2, 1).to(device=dist.get_rank())    # each rank draws a different random input
out = ddp_model(x)
out.mean().backward()

print(dist.get_rank(), ddp_model.module.a.weight)   # weight before the optimizer step
optimizer.step()
print(dist.get_rank(), ddp_model.module.a.weight)   # weight after the optimizer step
dist.destroy_process_group()

Running the code above with torchrun --nnodes=1 --nproc-per-node=2 --standalone demo.py, I got the following output:

0 Parameter containing:
tensor([[0.6218]], device='cuda:0', requires_grad=True)
0 Parameter containing:
tensor([[0.7547]], device='cuda:0', requires_grad=True)
1 Parameter containing:
tensor([[0.6218]], device='cuda:1', requires_grad=True)
1 Parameter containing:
tensor([[0.9173]], device='cuda:1', requires_grad=True)

After one optimization step, the weight of model.a is out of sync across the two ranks. I expected a RuntimeError to be raised here, reminding me to set find_unused_parameters=True, but none was.
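
Not part of the original snippet, but for completeness, here is a minimal sketch of how the desync can be checked programmatically (assert_params_in_sync is a hypothetical helper, not a PyTorch API): it all-gathers every parameter and compares each rank's replica against rank 0's copy.

import torch
import torch.distributed as dist

def assert_params_in_sync(module: torch.nn.Module) -> None:
    # All-gather each parameter and compare every rank's replica to rank 0's copy.
    for name, param in module.named_parameters():
        replicas = [torch.empty_like(param) for _ in range(dist.get_world_size())]
        dist.all_gather(replicas, param.detach())
        for rank, replica in enumerate(replicas):
            if not torch.equal(replicas[0], replica):
                raise RuntimeError(f"{name} differs between rank 0 and rank {rank}")

# In the repro above, assert_params_in_sync(ddp_model.module) would pass before
# optimizer.step() and raise for a.weight after it.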

As expected, commenting out the definition of self.b or setting find_unused_parameters=True produces the correct, synchronized result.
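
For reference, a sketch of that workaround; per the DDP documentation, find_unused_parameters=True makes DDP traverse the autograd graph after each forward pass to mark parameters that did not contribute to the loss, at some extra overhead per iteration:

# Workaround sketch: let DDP detect parameters that take no part in producing
# the loss, so their gradient buckets are still marked ready for allreduce.
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    find_unused_parameters=True,
)

Note (my understanding, not stated in the report): with find_unused_parameters=False, DDP only detects the missing reduction on the next training iteration, so extending the snippet above to a second forward/backward pass does raise the expected RuntimeError.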

Versions

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.4.143.bsk.7-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY

Nvidia driver version: 470.129.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

yyk-wew closed this as completed Jun 18, 2024