
[ADD] support Distributed Data Parallel #137

Merged 32 commits into main on Mar 10, 2023

Conversation

@jinwonkim93 (Member) commented Feb 22, 2023

## Title

Colossal AI-based Distributed Data Parallel with oslo interface

## Description

The purpose of this implementation is to enable DDP in Oslo, with the reducer method identical to that of Colossal AI but adapted to fit Oslo's interface. To enhance the user experience, we replaced model.backward() with loss.backward() and temporarily added model.zero_grad() to the code. Any feedback is welcome :)

If you don't call model.zero_grad(), an unexpected error will occur.
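For context, the core job of any DDP reducer is to average each parameter's gradient across the data-parallel group after the backward pass. The snippet below is only a generic illustration of that idea, not OSLO's or Colossal AI's actual bucketed reducer:

```python
# Generic gradient averaging -- illustrative only, not the OSLO/Colossal AI reducer.
import torch
import torch.distributed as dist


def allreduce_gradients(model: torch.nn.Module, group=None) -> None:
    """All-reduce every parameter gradient and divide by the world size."""
    world_size = dist.get_world_size(group=group)
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=group)
            param.grad.div_(world_size)


# Typical call site in a plain training step:
#   loss.backward()
#   allreduce_gradients(model)
#   optimizer.step()
```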

test_data_parallel.py

```python
import os
import torch.multiprocessing as mp

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch import nn
from torch import optim
import torch.distributed as dist

from oslo.torch.distributed.parallel_context import ParallelContext


def setup(rank, world_size):
    # Environment variables that torch.distributed / OSLO read when the
    # process group is created.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["LOCAL_WORLD_SIZE"] = str(world_size)


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def train(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)
    parallel_context = ParallelContext.from_torch(data_parallel_size=world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.zeros(20, 10).to(rank))
    labels = torch.zeros(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    print(outputs)
    cleanup()


def main(world_size):
    # One process per rank.
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main(2)
```

test_oslo_data_parallel.py

```python
import os
import torch.multiprocessing as mp

import torch
from torch import nn
from torch import optim
import torch.distributed as dist

import oslo
from oslo.torch.distributed.parallel_context import ParallelContext
from oslo.torch.nn.parallel.data_parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    # Environment variables that torch.distributed / OSLO read when the
    # process group is created.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["LOCAL_WORLD_SIZE"] = str(world_size)


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def train(rank, world_size):
    print(f"Running oslo DDP example on rank {rank}.")
    setup(rank, world_size)
    parallel_context = ParallelContext.from_torch(data_parallel_size=world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, parallel_context)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    oslo.ready(ddp_model, parallel_context)
    optimizer.zero_grad()
    outputs = ddp_model(torch.zeros(20, 10).to(rank))
    labels = torch.zeros(20, 5).to(rank)
    loss = loss_fn(outputs, labels)
    # This example calls backward() on the wrapped model rather than loss.backward().
    ddp_model.backward(loss)
    optimizer.step()
    print(outputs)
    cleanup()


def main(world_size):
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main(2)
```

![image](https://user-images.githubusercontent.com/26476095/220687694-4236dcaa-ae66-4332-8159-3e206b04df49.png)

![image](https://user-images.githubusercontent.com/26476095/220687852-578584a5-db9a-4a90-ab3e-bdc779bb39a2.png)

PyTorch DDP
![ddp_before_backward](https://user-images.githubusercontent.com/26476095/221404650-2525413c-ce86-44e9-bd53-897ac4077b4a.png)
![ddp_after_backward](https://user-images.githubusercontent.com/26476095/221404654-ce1e2d45-9304-4d13-aa83-c5a5f8d06689.png)

Oslo DDP
![oslo_before_backward](https://user-images.githubusercontent.com/26476095/221404663-e85a0462-6fd2-4a6d-85a3-7fdcf9a5e9a7.png)
![oslo_after_backward](https://user-images.githubusercontent.com/26476095/221404668-8cdee44d-3d76-4d23-adc0-68983ea7b173.png)

Checking the model's parameters confirms that Oslo DDP is working as expected.

After cleaning:

![image](https://user-images.githubusercontent.com/26476095/222415778-3358b862-a8c4-416e-9bc1-338d915d5e79.png)

Oslo DDP: [oslo-ddp-time.log](https://github.com/EleutherAI/oslo/files/10887632/oslo-ddp-time.log)

Torch DDP: [torch-ddp-time.log](https://github.com/EleutherAI/oslo/files/10887634/torch-ddp-time.log)
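For reference, one simple way to collect per-step timings like those in the logs above (an assumed methodology, not necessarily how these logs were produced) is to synchronize CUDA around each step and record the wall-clock time:

```python
# Illustrative per-step timing helper; train_one_step() below is a hypothetical
# function standing in for whatever the benchmark actually ran.
import time

import torch


def timed_step(step_fn) -> float:
    """Run one training step and return its wall-clock duration in seconds."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # flush queued CUDA work before timing
    start = time.perf_counter()
    step_fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # wait for the step's CUDA work to finish
    return time.perf_counter() - start


# elapsed = timed_step(lambda: train_one_step(model, batch))
# with open(f"ddp-time-rank{rank}.log", "a") as f:
#     f.write(f"{elapsed:.6f}\n")
```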

## Linked Issues

- resolved #00

@jinwonkim93 added the Data Parallelism label Feb 22, 2023
@jinwonkim93 closed this Mar 1, 2023
@jinwonkim93 reopened this Mar 2, 2023
@hyunwoongko (Member) commented:

Could you run `pre-commit run --all-files`?

jinwonkim93 and others added 4 commits March 2, 2023 15:30
## Title
Deleted legacy code.

## Description
Only newly added code written by jinwonkim93 remains.

---------

Co-authored-by: KKIEEK <[email protected]>
@KKIEEK (Contributor) commented Mar 2, 2023

I think it would be better to merge the _DistributedDataParallelWrapper class into _DistributedDataParallel.

Related to #137
For now, our DDP implementation does not support long tensor inputs, so I fixed it.

---------

Co-authored-by: Hakjin Lee <[email protected]>
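The actual change lives in the linked commits; as a hedged sketch of the general idea behind long-tensor support, a DDP wrapper that pre-processes inputs should cast only floating-point tensors and let integer/long tensors (for example token ids) pass through with their dtype unchanged:

```python
# Hedged sketch of the general idea, not the actual patch: cast only
# floating-point tensors, keep long/int tensors (e.g. token ids) untouched.
import torch


def move_input(x, device, float_dtype=torch.float32):
    """Recursively move inputs to `device`, casting only floating-point tensors."""
    if torch.is_tensor(x):
        x = x.to(device)
        return x.to(float_dtype) if x.is_floating_point() else x
    if isinstance(x, (list, tuple)):
        return type(x)(move_input(v, device, float_dtype) for v in x)
    if isinstance(x, dict):
        return {k: move_input(v, device, float_dtype) for k, v in x.items()}
    return x


# token ids stay torch.long, float features are cast to float_dtype:
# batch = move_input({"input_ids": ids, "pixel_values": imgs}, torch.device("cuda:0"))
```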
@jinwonkim93 marked this pull request as ready for review March 3, 2023 16:45
@hyunwoongko (Member) commented:

Please resolve the conflicts! @jinwonkim93

@jinwonkim93 (Member, Author) commented Mar 8, 2023

> Please resolve the conflicts! @jinwonkim93

Completed. One question: is there a reason for `__ALL__` rather than `__all__`?

@hyunwoongko (Member) commented Mar 8, 2023

No. I prefer `__ALL__`, but we don't use either of them because we think imports are enough.
Is there any file that contains `__all__`?

@jinwonkim93 (Member, Author) commented:

> No. I prefer `__ALL__`, but we don't use either of them because we think imports are enough. Is there any file that contains `__all__`?

```python
__all__ = ["ZeroRedundancyOptimizer"]

__all__ = ["get_free_port", "set_seed"]
```

@hyunwoongko (Member) commented:

I didn't add them; all of that code was added by the new DP members, so it's okay to change them to uppercase.

@jinwonkim93 (Member, Author) commented:

> I didn't add them; all of that code was added by the new DP members, so it's okay to change them to uppercase.

Okay. I think it is ready to be merged. What do you think?

@hyunwoongko (Member) commented:

@jinwonkim93 looks good to me.

@hyunwoongko merged commit f129a90 into main Mar 10, 2023
dyanos pushed a commit that referenced this pull request Jun 8, 2023