GitHub - allwefantasy/bagua: Bagua is a flexible and performant distributed training algorithm development framework.

Bagua is a distributed training utility developed by AI platform@Kuaishou Technology and DS3 Lab@ETH. Users can extend the training on a single GPU to multi-GPUs (may across multiple machines) by simply adding a few lines of code. One prominent feature of Bagua is to provide a flexible system abstraction that supports state-of-the-art system relaxation techniques of distributed training. Powered by the new system design, Bagua has a great ability to implement and extend various state-of-the-art distributed learning algorithms. This in turns enables better scalability and efficiency of the end-to-end training process. Researchers can also easily develop new distributed training algorithms within the Bagua framework, without worrying about low-level optimizations.

So far, Bagua has integrated communication primitives including

Centralized Synchronous Communication (AllReduce)
Decentralized Synchronous Communication
Low Precision Communication

Its effectiveness has been evaluated in various scenarios, including VGG and ResNet on ImageNet, BERT Large and many industrial applications at Kuaishou.

The underlying communication execution engine is in bagua-core, a library written in Rust.

Performance

The scalability of different systems on VGG16 with up to 128 GPUs.

Epoch time of BERT-Large Finetune under different network conditions for different systems.

For more comprehensive and up to date results, refer to Bagua benchmark page.

Installation

Develop version:

pip install git+https://github.com/BaguaSys/bagua.git

Release version:

pip install bagua

Build API documentation locally

pip install -r docs/doc-requirements.txt
make html

Cite Bagua

% System Overview
@misc{gan2021bagua,
  title={BAGUA: Scaling up Distributed Learning with System Relaxations}, 
  author={Shaoduo Gan and Xiangru Lian and Rui Wang and Jianbin Chang and Chengjun Liu and Hongmei Shi and Shengzhuo Zhang and Xianghong Li and Tengxu Sun and Jiawei Jiang and Binhang Yuan and Sen Yang and Ji Liu and Ce Zhang},
  year={2021},
  eprint={2107.01499},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

% Theory on System Relaxation Techniques
@book{liu2020distributed,
  title={Distributed Learning Systems with First-Order Methods: An Introduction},
  author={Liu, J. and Zhang, C.},
  isbn={9781680837018},
  series={Foundations and trends in databases},
  url={https://books.google.com/books?id=vzQmzgEACAAJ},
  year={2020},
  publisher={now publishers}
}

Limitations

When communication is not a bottleneck in the training task, using Bagua communication algorithms will not provide significant performance improvement (unless you use other optimizations in Bagua such as fused optimizer).
Currently only tested on Linux and NVIDIA GPUs.

Name		Name	Last commit message	Last commit date
Latest commit History 186 Commits
.buildkite		.buildkite
.github		.github
.test_to_migrate		.test_to_migrate
bagua		bagua
docker		docker
docs		docs
figures		figures
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
install.sh		install.sh
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Performance

Installation

Build API documentation locally

Cite Bagua

Limitations

Links

About

Releases

Packages

Languages

License

allwefantasy/bagua

Folders and files

Latest commit

History

Repository files navigation

Performance

Installation

Build API documentation locally

Cite Bagua

Limitations

Links

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages