![Editor](https://github.com/intelligent-machine-learning/dlrover/raw/ed49a8b7a342cbe658449b8a9db10e601b2d436e/atorch/docs/img/atorch.png)
ATorch: Make LLM training more efficient and reproducible for everyone.
Paper | Documentation | Examples | Blog
- TODO
- Why ATorch
- Features
- ATorch Applications
- Parallel Training Demo
- Single GPU Training Demo
- Installation
- Community
- Contributing
- Cite Us
ATorch is an extension library of PyTorch developed by Ant Group's AI Infrastructure team. By decoupling model definition from the training optimization strategy, ATorch provides an efficient and easy-to-use model training experience while minimally disrupting the native PyTorch programming style. Through its API, ATorch offers performance optimizations for I/O, preprocessing, computation, and communication, including automatic optimization. ATorch has supported large-scale pretraining of LLMs with over 100 billion parameters on thousands of A100/H100 GPUs. We open source it to make these capabilities reproducible for everyone, and we welcome contributions.
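The decoupling idea can be sketched in plain PyTorch. This is an illustrative stand-in only: the `accelerate` function and the strategy names below are hypothetical, not ATorch's actual API (ATorch's real entry point is `auto_accelerate`; see its documentation for the exact interface).

```python
import torch
import torch.nn as nn

def accelerate(model, strategy):
    """Hypothetical sketch: apply each optimization method in `strategy`
    without touching the model definition code itself."""
    for method in strategy:
        if method == "half_precision":
            model = model.half()          # cast parameters to fp16
        elif method == "compile":
            model = torch.compile(model)  # graph compilation (PyTorch 2.x)
    return model

# The model stays native PyTorch; the strategy is separate data.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model = accelerate(model, ["half_precision"])
```

The point of the design is that swapping or auto-searching the strategy list changes how training is optimized without any edits to the model code.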
- Usability
- Fast deployment of runtime environment (images and installation packages)
- Solutions for large-scale model training
- Automated optimization
- auto_accelerate for automatic optimization
- IO/Preprocessing
- Recommended storage for training data
- Accessing the Pangu cluster
- CPU/GPU cooperation to optimize data preprocessing
- Customized operator optimization
- High-performance MoE
- Flash Attention 2
- Transformer operator
- Mixed precision
- Communication optimization
- Cached sharding
- Hybrid parallelism
- Compilation optimization
- Elastic fault tolerance
- HangDetector (detecting and automatically restarting distributed training if it hangs)
- GPU elastic training
- Hardware error detection and migration
Improved training stability over thousands of GPUs through fault tolerance and elasticity.
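The mixed precision feature listed above can be illustrated with native PyTorch's autocast, which ATorch builds on. This is a minimal sketch using the standard `torch.autocast` API on CPU with bfloat16 (so it runs without a GPU); it is not ATorch-specific code.

```python
import torch

x = torch.randn(4, 4)

# Inside the autocast region, eligible ops (e.g. matmul) run in a lower
# precision automatically, while precision-sensitive ops stay in float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ x

print(y.dtype)  # torch.bfloat16
```

On GPU the same pattern is typically used with `device_type="cuda"` and float16 or bfloat16, usually together with gradient scaling for float16.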
TODO
TODO
TODO
TODO
TODO
TODO
TODO
TODO
We leverage GitHub Actions to automate our development, release, and deployment workflows. Please check out this documentation to learn how the automated workflows operate.