We propose a new optimization algorithm, DeAR, which decouples the all-reduce primitive into two continuous operations to enable fine-grained scheduling without introducing extra communication overhead (a minimal sketch of this decomposition is given after the list below). This repository contains DeAR's source code, as well as a set of benchmarking scripts for evaluating the training performance of popular distributed deep learning methods with data parallelism. Currently, it covers:
- Wait-free backpropagation (WFBP), which is also known as the technique of pipelining the backward computations with gradient communications.
- ByteScheduler, which uses tensor partitioning and priority scheduling to overlap some communication tasks with feed-forward computation tasks.
- DeAR w/o TF, which disables the tensor fusion technique by setting THRESHOLD=None and NUM_NEARBY_LAYERS=1.
- Horovod.
- PyTorch-DDP.
- MG-WFBP, which determines fusion tensors by measuring the backward computation time and communication time.
- DeAR, which supports tuning tensor fusion with Bayesian optimization.
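To give intuition for the decoupling idea, the sketch below shows how an all-reduce on a flat gradient tensor can be expressed as a reduce-scatter followed by an all-gather using torch.distributed, which is what makes it possible to schedule the two halves independently. This is a minimal illustration assuming an already-initialized process group, not DeAR's actual implementation.

# Illustrative sketch only (not DeAR's code): decoupling all-reduce into
# reduce-scatter + all-gather, the two operations that can then be
# scheduled independently. Assumes torch.distributed is already
# initialized and the tensor length is divisible by the world size.
import torch
import torch.distributed as dist

def decoupled_allreduce(flat_grad):
    world_size = dist.get_world_size()
    shards = list(flat_grad.chunk(world_size))   # equal-sized input shards
    reduced = torch.empty_like(shards[0])

    # Operation 1: reduce-scatter, each worker receives the sum of one shard.
    dist.reduce_scatter(reduced, shards, op=dist.ReduceOp.SUM)
    reduced.div_(world_size)                     # average the gradients

    # Operation 2: all-gather, every worker reconstructs the full tensor.
    gathered = [torch.empty_like(reduced) for _ in range(world_size)]
    dist.all_gather(gathered, reduced)
    return torch.cat(gathered)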
The benchmark scripts cover the following DNN models:
- Convolutional neural networks (CNNs) on a fake ImageNet dataset (i.e., randomly generated input images of size 224x224x3; see the sketch after this list)
- Transformers: BERT-Base and BERT-Large pretraining models.
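The fake ImageNet input mentioned above can be generated on the fly with a few lines of PyTorch. The snippet below is a minimal sketch of this idea (the dataset size, class count, and batch size are illustrative), not the exact data loader used in the benchmark scripts.

# Minimal sketch of a synthetic ImageNet-like dataset: random 224x224x3
# images and random labels, so benchmarks measure training speed without
# real data I/O. Not the repository's exact data loader.
import torch
from torch.utils.data import Dataset, DataLoader

class FakeImageNet(Dataset):
    def __init__(self, num_samples=10000, num_classes=1000):
        self.num_samples = num_samples
        self.num_classes = num_classes

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.randn(3, 224, 224)                      # random image
        label = torch.randint(0, self.num_classes, (1,)).item()  # random label
        return image, label

train_loader = DataLoader(FakeImageNet(), batch_size=64, shuffle=True)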
The following software is required:
- Python 3.6+
- CUDA-10.0+
- NCCL-2.4+
- PyTorch-1.8+
- OpenMPI-4.0+
- Horovod-0.19+
- ByteScheduler
To install DeAR and its dependencies:
$ git clone https://github.com/lzhangbv/dear_pytorch.git
$ cd dear_pytorch
$ pip install -r requirements.txt
$ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod==0.21.3
If the pip installation fails, please try upgrading pip via pip install --upgrade pip. If the Horovod installation with NCCL fails, please check the Horovod installation guide. To run ByteScheduler, please check its installation instructions; it was found to be compatible with PyTorch 1.4.
If you encounter other errors during installation, please check the install document (contributed by Haoxuan Yu). We recommend using the same software versions as reported in our paper (Section VI.A).
Before running the scripts, please carefully configure the configuration files in the configs directory:
- configs/cluster*: configure the host files for MPI
- configs/envs.conf: configure the cluster environments
Compile the communication package:
$ cd common/comm_core
$ bash compile.sh
Create a log folder in the dear_pytorch directory, e.g.,
$ mkdir -p logs/sc22-tf
- The batch mode:
$ python benchmarks.py
For different experimental settings, users can modify the DNN model, batch size, number of GPUs, and network configurations in the benchmarks.py script.
- The individual mode, e.g.,
$ cd dear
$ dnn=resnet50 bs=64 nworkers=64 ./horovod_mpi_cj.sh
Before running DeAR w/o tensor fusion, please set THRESHOLD=None and NUM_NEARBY_LAYERS=1 in DeAR's dopt_rsag.py script. For DeAR with tensor fusion, we use THRESHOLD=25MB by default. To support Bayesian optimization, please import dopt_rsag_bo and increase num-warmup-batches to at least 60 in DeAR's benchmark scripts so that the buffer size can be tuned.
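To make the role of THRESHOLD concrete, the sketch below illustrates threshold-based tensor fusion in general: gradient tensors are packed into buckets of roughly 25 MB so that one collective call handles many small tensors. It is an illustration of the technique under simplified assumptions, not the actual fusion logic in dopt_rsag.py.

# Illustration of threshold-based tensor fusion (simplified, not DeAR's
# dopt_rsag.py logic): pack gradients into ~25 MB buckets and flatten each
# bucket so it can be communicated with a single collective call.
import torch

THRESHOLD = 25 * 1024 * 1024  # 25 MB, matching the default described above

def fuse_by_threshold(grads, threshold=THRESHOLD):
    buckets, current, current_bytes = [], [], 0
    for g in grads:
        nbytes = g.numel() * g.element_size()
        if current and current_bytes + nbytes > threshold:
            buckets.append(current)          # close the current bucket
            current, current_bytes = [], 0
        current.append(g)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    # Flatten each bucket into one contiguous tensor for communication.
    return [torch.cat([g.reshape(-1) for g in b]) for b in buckets]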
The DeAR distributed optimizer can be used in much the same way as horovod.DistributedOptimizer():
import torch.optim as optim
import dear

# Initialize the DeAR library (sets up the communication backend).
dear.init()
...
# Wrap the local optimizer with DeAR's distributed optimizer.
optimizer = optim.SGD(model.parameters(), ...)
optimizer = dear.DistributedOptimizer(optimizer, ...)
...
# Standard PyTorch training loop; gradient communication is handled by DeAR.
for i, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
...
An example script for training on MNIST is provided:
$ bash mnist.sh
If you use this repository in your paper, please cite our work:
@article{zhang2023decoupling,
  title={Decoupling the All-Reduce Primitive for Accelerating Distributed Deep Learning},
  author={Zhang, Lin and Shi, Shaohuai and Chu, Xiaowen and Wang, Wei and Li, Bo and Liu, Chengjian},
  journal={arXiv preprint arXiv:2302.12445},
  year={2023}
}