We propose a new optimization algorithm, DeAR, which decouples the all-reduce primitive into two continuous operations to enable fine-grained scheduling without introducing extra communication overhead (a minimal sketch of this decomposition is given after the list below). This repository contains DeAR's source code, as well as a set of benchmarking scripts for evaluating the training performance of popular distributed deep learning methods with data parallelism. Currently, it covers:
- Wait-free backpropagation (WFBP), which is also known as the technique of pipelining the backward computations with gradient communications.
- ByteScheduler, which uses tensor partitioning and priority scheduling to overlap some communication tasks with feed-forward computation tasks.
- DeAR w/o TF, which disables the tensor fusion technique by setting THRESHOLD=None and NUM_NEARBY_LAYERS=1.
- Horovod.
- PyTorch-DDP.
- MG-WFBP, which determines fusion tensors by measuring the backward computation time and communication time.
- DeAR, which supports tuning tensor fusion with Bayesian optimization.
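To give intuition for the decoupling idea, the sketch below shows how an all-reduce on a flat gradient tensor can be expressed as a reduce-scatter followed by an all-gather using torch.distributed, which is what makes it possible to schedule the two halves independently. This is a minimal illustration assuming an already-initialized process group, not DeAR's actual implementation.

# Illustrative sketch only (not DeAR's code): decoupling all-reduce into
# reduce-scatter + all-gather, the two operations that can then be
# scheduled independently. Assumes torch.distributed is already
# initialized and the tensor length is divisible by the world size.
import torch
import torch.distributed as dist

def decoupled_allreduce(flat_grad):
    world_size = dist.get_world_size()
    shards = list(flat_grad.chunk(world_size))   # equal-sized input shards
    reduced = torch.empty_like(shards[0])

    # Operation 1: reduce-scatter, each worker receives the sum of one shard.
    dist.reduce_scatter(reduced, shards, op=dist.ReduceOp.SUM)
    reduced.div_(world_size)                     # average the gradients

    # Operation 2: all-gather, every worker reconstructs the full tensor.
    gathered = [torch.empty_like(reduced) for _ in range(world_size)]
    dist.all_gather(gathered, reduced)
    return torch.cat(gathered)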
The benchmark scripts cover the following DNN models:
- Convolutional neural networks (CNNs) on a fake ImageNet dataset (i.e., randomly generated input images of size 224x224x3; see the sketch after this list)
- Transformers: BERT-Base and BERT-Large pretraining models.
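The fake ImageNet input mentioned above can be generated on the fly with a few lines of PyTorch. The snippet below is a minimal sketch of this idea (the dataset size, class count, and batch size are illustrative), not the exact data loader used in the benchmark scripts.

# Minimal sketch of a synthetic ImageNet-like dataset: random 224x224x3
# images and random labels, so benchmarks measure training speed without
# real data I/O. Not the repository's exact data loader.
import torch
from torch.utils.data import Dataset, DataLoader

class FakeImageNet(Dataset):
    def __init__(self, num_samples=10000, num_classes=1000):
        self.num_samples = num_samples
        self.num_classes = num_classes

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)                      # random image
        label = torch.randint(0, self.num_classes, (1,)).item()  # random label
        return image, label

train_loader = DataLoader(FakeImageNet(), batch_size=64, shuffle=True)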
The following software is required:
- Python 3.6+
- CUDA-10.0+
- NCCL-2.4+
- PyTorch-1.8+
- OpenMPI-4.0+
- Horovod-0.19+
- ByteScheduler
To install DeAR and its dependencies:
$ git clone https://github.com/lzhangbv/dear_pytorch.git
$ cd dear_pytorch
$ pip install -r requirements.txt
$ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod==0.21.3
If the pip installation fails, please try upgrading pip via pip install --upgrade pip. If the Horovod installation with NCCL fails, please check the Horovod installation guide. To run ByteScheduler, please check its installation instructions; it was found to be compatible with PyTorch 1.4.
If you encounter other errors during installation, please check the install document (contributed by Haoxuan Yu). We recommend using the same software versions as reported in our paper (Section VI.A).
Before running the scripts, please carefully configure the configuration files in the configs directory:
- configs/cluster*: configure the host files for MPI
- configs/envs.conf: configure the cluster environments
Compile the communication package:
$ cd common/comm_core
$ bash compile.sh
Create a log folder in the dear_pytorch directory, e.g.,
$ mkdir -p logs/sc22-tf
- The batch mode:
$ python benchmarks.py
For different experimental settings, users can modify the DNN model, batch size, number of GPUs, and network configurations in the benchmarks.py script.
- The individual mode, e.g.,
$ cd dear
$ dnn=resnet50 bs=64 nworkers=64 ./horovod_mpi_cj.sh
Before running DeAR w/o tensor fusion, please set THRESHOLD=None and NUM_NEARBY_LAYERS=1 in DeAR's dopt_rsag.py script. For DeAR with tensor fusion, we use THRESHOLD=25MB by default. To support Bayesian optimization, please import dopt_rsag_bo and increase num-warmup-batches to at least 60 in DeAR's benchmark scripts so that the buffer size can be tuned.
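To make the role of THRESHOLD concrete, the sketch below illustrates threshold-based tensor fusion in general: gradient tensors are packed into buckets of roughly 25 MB so that one collective call handles many small tensors. It is an illustration of the technique under simplified assumptions, not the actual fusion logic in dopt_rsag.py.

# Illustration of threshold-based tensor fusion (simplified, not DeAR's
# dopt_rsag.py logic): pack gradients into ~25 MB buckets and flatten each
# bucket so it can be communicated with a single collective call.
import torch

THRESHOLD = 25 * 1024 * 1024  # 25 MB, matching the default described above

def fuse_by_threshold(grads, threshold=THRESHOLD):
    buckets, current, current_bytes = [], [], 0
    for g in grads:
        nbytes = g.numel() * g.element_size()
        if current and current_bytes + nbytes > threshold:
            buckets.append(current)          # close the current bucket
            current, current_bytes = [], 0
        current.append(g)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    # Flatten each bucket into one contiguous tensor for communication.
    return [torch.cat([g.reshape(-1) for g in b]) for b in buckets]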
The DeAR distributed optimizer can be used in much the same way as horovod.DistributedOptimizer():
import torch.optim as optim
import dear

# Initialize the DeAR library (sets up the communication backend).
dear.init()
...
# Wrap the local optimizer with DeAR's distributed optimizer.
optimizer = optim.SGD(model.parameters(), ...)
optimizer = dear.DistributedOptimizer(optimizer, ...)
...
# Standard PyTorch training loop; gradient communication is handled by DeAR.
for i, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
...
An example script for training on MNIST is provided:
$ bash mnist.sh
If you use this repository in your paper, please cite our work:
@article{zhang2023decoupling,
  title={Decoupling the All-Reduce Primitive for Accelerating Distributed Deep Learning},
  author={Zhang, Lin and Shi, Shaohuai and Chu, Xiaowen and Wang, Wei and Li, Bo and Liu, Chengjian},
  journal={arXiv preprint arXiv:2302.12445},
  year={2023}
}