YaFSDP

 

Overview

YaFSDP is a Sharded Data Parallelism framework designed to work well with transformer-like neural network architectures.

You can find more info on YaFSDP internals in our blog posts on Medium and Habr.
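
For intuition only, the snippet below is a toy, single-process NumPy sketch of the general sharded data parallel pattern that frameworks in this family build on: each rank persistently stores only a shard of the parameters, the full parameters are all-gathered right before use, and gradients are reduce-scattered back to shards. It illustrates the concept, not YaFSDP's actual API or internals.

```python
# Toy single-process illustration of sharded data parallelism (NOT YaFSDP code).
import numpy as np

WORLD_SIZE = 4                      # simulated number of devices
FULL_DIM = 8                        # full parameter size (divisible by WORLD_SIZE)
SHARD_DIM = FULL_DIM // WORLD_SIZE

rng = np.random.default_rng(0)
full_weight = rng.standard_normal(FULL_DIM)

# Each rank persistently stores only its slice of the weight.
shards = [full_weight[r * SHARD_DIM:(r + 1) * SHARD_DIM].copy()
          for r in range(WORLD_SIZE)]

def all_gather(shards):
    """Reconstruct the full parameter from all shards (done just before use)."""
    return np.concatenate(shards)

def reduce_scatter(per_rank_grads):
    """Sum gradients across ranks, then hand each rank only its slice."""
    summed = np.sum(per_rank_grads, axis=0)
    return [summed[r * SHARD_DIM:(r + 1) * SHARD_DIM] for r in range(WORLD_SIZE)]

# Toy "layer": loss = w . x on each rank's own batch, so dloss/dw = x.
per_rank_inputs = [rng.standard_normal(FULL_DIM) for _ in range(WORLD_SIZE)]

gathered = all_gather(shards)                          # materialize full weight
losses = [float(gathered @ x) for x in per_rank_inputs]
per_rank_grads = per_rank_inputs                       # gradient of w . x w.r.t. w
grad_shards = reduce_scatter(per_rank_grads)           # each rank keeps its slice

# The optimizer step happens on the local shard only; the full weight is transient.
lr = 0.1
shards = [w - lr * g for w, g in zip(shards, grad_shards)]
print("rank 0 weight shard after the step:", shards[0])
```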

Advantages over FSDP

YaFSDP is up to 20% faster for LLM pre-training and performs better under high memory pressure. It is designed to reduce the overhead of communication and memory operations.

YaFSDP: (figure ya_fsdp in the repository)

FSDP: (figure fsdp in the repository)

Benchmarks

We've compared YaFSDP with FSDP on a variety of pre-training setups ranging from:

  • 7B to 70B parameters
  • 64 to 256 devices
  • 2048 to 8192 tokens per sequence

| model | gpu-count | seq-len | num-ckpt-layers | speedup | YaFSDP iteration time (s) | FSDP iteration time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B | 64 | 2048 | 0 | 9.92% | 0.81 | 0.90 |
| Llama 2 7B | 64 | 4096 | 0 | 3.43% | 1.16 | 1.21 |
| Llama 2 7B | 64 | 8192 | 0 | 2.68% | 2.23 | 2.29 |
| Llama 2 7B | 128 | 2048 | 0 | 9.57% | 0.87 | 0.97 |
| Llama 2 7B | 128 | 4096 | 0 | 2.42% | 1.19 | 1.22 |
| Llama 2 7B | 128 | 8192 | 0 | 2.32% | 2.25 | 2.31 |
| Llama 2 13B | 128 | 2048 | 0 | 12.10% | 1.55 | 1.76 |
| Llama 2 13B | 128 | 4096 | 0 | 3.49% | 2.06 | 2.14 |
| Llama 2 34B | 128 | 2048 | 0 | 20.70% | 3.39 | 4.27 |
| Llama 2 34B | 256 | 2048 | 0 | 21.99% | 3.51 | 4.50 |
| Llama 2 34B | 256 | 4096 | 5 | 8.35% | 5.33 | 5.81 |
| Llama 2 70B | 256 | 2048 | 10 | 21.48% | 6.97 | 8.87 |
| Llama 2 70B | 256 | 4096 | 50 | 7.17% | 11.07 | 11.93 |
| Llama 3 8B | 64 | 2048 | 0 | 11.91% | 0.97 | 1.10 |
| Llama 3 8B | 64 | 4096 | 0 | 7.86% | 1.36 | 1.48 |
| Llama 3 70B | 256 | 2048 | 20 | 26.60% | 7.17 | 9.76 |

Details:

  • In each run, the per-device batch size is set to 1.
  • speedup is the relative decrease in iteration time of the YaFSDP run compared to the FSDP run (see the sketch after this list).
  • num-ckpt-layers is the number of transformer layers to which activation checkpointing was applied.
  • Performance was measured on a cluster of hosts with NVIDIA A100 80 GB GPUs.
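
As a worked example of the speedup definition, here is a minimal sketch, assuming speedup is computed as the iteration-time decrease relative to the FSDP baseline; the published times are rounded to 0.01 s, so recomputed values differ slightly from the speedup column:

```python
def speedup(yafsdp_iter_s: float, fsdp_iter_s: float) -> float:
    """Relative iteration-time decrease of a YaFSDP run vs. the FSDP baseline."""
    return (fsdp_iter_s - yafsdp_iter_s) / fsdp_iter_s

# Llama 2 34B, 128 GPUs, seq-len 2048, taken from the table above:
print(f"{speedup(3.39, 4.27):.2%}")  # ~20.61%; the table reports 20.70%
# (the small gap comes from the published times being rounded to two decimals)
```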

Examples

You can find examples of LLM training using the 🤗 stack in the examples folder:

  1. clm.md for causal pre-training
  2. sft.md for supervised fine-tuning

Note that both examples require a Docker image, which can be built with the docker/build.sh script. The image is based on the NVIDIA PyTorch image with some patched 🤗 libraries; the patches can be found in the patches folder.

Issues and questions

If you encounter any bugs or have any questions, feel free to open a GitHub issue.

Citation

If you use this codebase, please cite it by using the following BibTeX entry:

@misc{YaFSDP2024,
  author =       {Mikhail Khrushchev and Anton Frolov and Ruslan Vasilev},
  title =        {YaFSDP: Yet another Fully Sharded Data Parallel},
  howpublished = {\url{https://github.com/yandex/YaFSDP}},
  year =         {2024}
}
