![Editor](https://github.com/intelligent-machine-learning/dlrover/raw/ed49a8b7a342cbe658449b8a9db10e601b2d436e/atorch/docs/img/atorch.png)
ATorch: Make LLM training more efficient and reproducible for everyone.
Paper | Documentation | Examples | Blog
- TODO
- Why ATorch
- Features
- ATorch Applications
- Parallel Training Demo
- Single GPU Training Demo
- Installation
- Community
- Contributing
- Cite Us
ATorch is an extension library of PyTorch developed by Ant Group's AI Infrastructure team. By decoupling model definition from the training optimization strategy, ATorch provides an efficient and easy-to-use model training experience while minimally disrupting the native PyTorch programming style. Through its API, ATorch offers performance optimizations for I/O, preprocessing, computation, and communication, including automatic optimization. ATorch has supported large-scale pretraining of LLMs with over 100 billion parameters on thousands of A100/H100 GPUs. We open source it to make these capabilities reproducible for everyone, and we welcome contributions.
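The decoupling idea can be sketched in plain PyTorch. This is an illustrative stand-in only: the `accelerate` function and the strategy names below are hypothetical, not ATorch's actual API (ATorch's real entry point is `auto_accelerate`; see its documentation for the exact interface).

```python
import torch
import torch.nn as nn

def accelerate(model, strategy):
    """Hypothetical sketch: apply each optimization method in `strategy`
    without touching the model definition code itself."""
    for method in strategy:
        if method == "half_precision":
            model = model.half()          # cast parameters to fp16
        elif method == "compile":
            model = torch.compile(model)  # graph compilation (PyTorch 2.x)
    return model

# The model stays native PyTorch; the strategy is separate data.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model = accelerate(model, ["half_precision"])
```

The point of the design is that swapping or auto-searching the strategy list changes how training is optimized without any edits to the model code.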
- Usability
- Fast deployment of runtime environment (images and installation packages)
- Solutions for large-scale model training
- Automated optimization
- auto_accelerate for automatic optimization
- IO/Preprocessing
- Recommended storage for training data
- Accessing the Pangu cluster
- CPU/GPU cooperation to optimize data preprocessing
- Customized operator optimization
- High-performance MoE
- Flash Attention 2
- Transformer operator
- Mixed precision
- Communication optimization
- Cached sharding
- Hybrid parallelism
- Compilation optimization
- Elastic fault tolerance
- HangDetector (detecting and automatically restarting distributed training if it hangs)
- GPU elastic training
- Hardware error detection and migration
Improved training stability over thousands of GPUs through fault tolerance and elasticity.
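The mixed precision feature listed above can be illustrated with native PyTorch's autocast, which ATorch builds on. This is a minimal sketch using the standard `torch.autocast` API on CPU with bfloat16 (so it runs without a GPU); it is not ATorch-specific code.

```python
import torch

x = torch.randn(4, 4)

# Inside the autocast region, eligible ops (e.g. matmul) run in a lower
# precision automatically, while precision-sensitive ops stay in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ x

print(y.dtype)  # torch.bfloat16
```

On GPU the same pattern is typically used with `device_type="cuda"` and float16 or bfloat16, usually together with gradient scaling for float16.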
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
We leverage GitHub Actions to automate our development, release, and deployment workflows. Please check out this documentation to learn how the automated workflows operate.