ATorch

ATorch: Make LLM training more efficient and reproducible for everyone.

| English | 中文 |

Latest News

  • TODO

Table of Contents

  • Latest News
  • Why ATorch
  • Features
  • ATorch Applications
  • Major Model Results
  • Installation
  • Contributing
  • CI/CD
  • Cite Us

Why ATorch

ATorch is an extension library of PyTorch developed by Ant Group's AI Infrastructure team. By decoupling model definition from training optimization strategy, ATorch provides an efficient and easy-to-use model training experience. Its design principle is to minimally disrupt the native PyTorch programming style. Through its API, ATorch offers performance optimizations for I/O, preprocessing, computation, and communication, including automatic optimization. ATorch has supported large-scale pretraining of LLMs with over 100 billion parameters on thousands of A100/H100 GPUs. By open sourcing it, we aim to make these capabilities reproducible for everyone, and we welcome contributions.
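
To illustrate the decoupling described above, the sketch below keeps the model as a plain PyTorch module and hands the optimization strategy over to `auto_accelerate` (see Features below). The import path, argument names, and return values shown here are assumptions for illustration and may differ from the released API.

```python
# Minimal sketch of the decoupling idea: the model stays plain PyTorch, while
# ATorch selects and applies the training optimization strategy.
# NOTE: the import path, signature, and return values of auto_accelerate below
# are assumptions for illustration, not the documented API.
import torch
from atorch.auto import auto_accelerate  # assumed import path

# Ordinary PyTorch model definition: nothing ATorch-specific here.
model = torch.nn.Linear(128, 2)

# Delegate optimization (parallelism, mixed precision, communication tuning, ...)
# to ATorch instead of hard-coding it into the training script.
status, result, strategy = auto_accelerate(  # assumed return values
    model,
    optim_func=torch.optim.AdamW,  # assumed argument name
)
```

Because the strategy is chosen outside the model code, the same model definition can be reused unchanged from single-GPU debugging to large-scale multi-node training.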

Features

  • Usability
    • Fast deployment of runtime environment (images and installation packages)
  • Solutions for large-scale model training
  • Automated optimization
    • auto_accelerate for automatic optimization
  • IO/Preprocessing
    • Recommended storage for training data
    • Accessing the Pangu cluster
    • CPU/GPU cooperation to optimize data preprocessing
  • Customized operator optimization
    • High-performance MoE
    • Flash Attention 2
    • Transformer operator
  • Mixed precision
  • Communication optimization
    • Cached sharding
  • Hybrid parallelism
  • Compilation optimization
  • Elastic fault tolerance
    • HangDetector (detects hangs in distributed training and automatically restarts it)
    • GPU elastic training
    • Hardware error detection and migration

ATorch Applications

Pretrain LLMs with ATorch on thousands of GPUs (HFU > 50%)

Finetune your LLMs with ATorch RLHF (3x faster than trlx)

TODO

Major Model Results

TODO

LLaMA2

TODO

GPT2

TODO

GLM

TODO

CLIP

TODO

Installation

TODO

Contributing

TODO

CI/CD

We use GitHub Actions to automate our development, release, and deployment workflows. Please check out this documentation to see how the automated workflows operate.

Cite Us