
Commit

Release
Co-authored-by: Mikhail Khrushchev <[email protected]>
Co-authored-by: Ruslan Vasilev <[email protected]>
3 people committed May 27, 2024
1 parent 85547f6 commit 0bd33e7
Showing 16 changed files with 1,900 additions and 0 deletions.
12 changes: 12 additions & 0 deletions CITATION.cff
@@ -0,0 +1,12 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Khrushchev"
  given-names: "Mikhail"
- family-names: "Frolov"
  given-names: "Anton"
- family-names: "Vasilev"
  given-names: "Ruslan"
title: "YaFSDP"
date-released: 2024-05-XX
url: "https://github.com/yandex/YaFSDP"
80 changes: 80 additions & 0 deletions README.md
@@ -0,0 +1,80 @@
# YaFSDP

- [Overview](#overview)
- [Advantages over FSDP](#advantages-over-fsdp)
- [Examples](#examples)
- [Issues and questions](#issues-and-questions)
- [Citation](#citation)

## Overview

YaFSDP is a Sharded Data Parallelism framework, designed to work well with transformer-like
neural network architectures.

You can find more details on YaFSDP internals in our [Medium blog post]().

## Advantages over FSDP

YaFSDP is up to 20% faster for pre-training LLMs and performs better under high
memory pressure. It is designed to reduce the overhead of communications and memory operations.

YaFSDP:

![ya_fsdp](assets/ya_fsdp.png)

FSDP:

![fsdp](assets/fsdp.png)

### Benchmarks

We've compared YaFSDP with FSDP on a variety of pre-training setups covering:

- 7B to 70B parameters
- 64 to 256 devices
- 2048 to 8192 tokens per sequence

In each run, the per-device batch size is set to 1. The `num-ckpt-layers` column below is the number of transformer layers with activation checkpointing enabled.

| model | gpu-count | seq-len | num-ckpt-layers | speedup |
| :---------- | --------: | ------: | --------------: | ------: |
| Llama-2-7b | 64 | 2048 | 0 | 9.92% |
| Llama-2-7b | 64 | 4096 | 0 | 3.43% |
| Llama-2-7b | 64 | 8192 | 0 | 2.68% |
| Llama-2-7b | 128 | 2048 | 0 | 9.57% |
| Llama-2-7b | 128 | 4096 | 0 | 2.42% |
| Llama-2-7b | 128 | 8192 | 0 | 2.32% |
| Llama-2-13b | 128 | 2048 | 0 | 12.10% |
| Llama-2-13b | 128 | 4096 | 0 | 3.49% |
| Llama-2-34b | 128 | 2048 | 0 | 20.70% |
| Llama-2-34b | 256 | 2048 | 0 | 21.99% |
| Llama-2-34b | 256 | 4096 | 5 | 8.35% |
| Llama-2-70b | 256 | 2048 | 10 | 21.48% |
| Llama-2-70b | 256 | 4096 | 50 | 7.17% |
| Llama-3-8B | 64 | 2048 | 0 | 10.15% |
| Llama-3-8B | 64 | 4096 | 0 | 7.98% |
| Llama-3-70B | 256 | 2048 | 20 | 26.60% |

## Examples

To try out YaFSDP:

1. Build the Docker image with `docker/build.sh`.
2. Launch one of the examples from the `examples` folder, as sketched below.
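
For example, from the repository root (a minimal sketch, assuming a CUDA-capable host with Docker and the NVIDIA Container Toolkit already set up):

```bash
# Build the ya-fsdp:latest image that the example commands expect.
sh docker/build.sh

# Each example is a markdown file with a ready-to-run `docker run` command;
# open examples/clm.md or examples/sft.md and run the command it contains.
cat examples/clm.md
```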

## Issues and questions

If you encounter any bugs or have any questions, feel free to open a GitHub issue.

## Citation

If you use this codebase, please cite it by using the following BibTeX entry:

```bibtex
@misc{YaFSDP2024,
  author = {Mikhail Khrushchev and Anton Frolov and Ruslan Vasilev},
  title = {YaFSDP: Yet another Fully Sharded Data Parallel},
  howpublished = {\url{https://github.com/yandex/YaFSDP}},
  year = {2024}
}
```
Binary file added assets/fsdp.png
Binary file added assets/ya_fsdp.png
24 changes: 24 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,24 @@
FROM nvcr.io/nvidia/pytorch:24.02-py3

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
WORKDIR /workspace

COPY ./ ya-fsdp/

RUN git clone -b v4.39-release --depth 1 https://github.com/huggingface/transformers.git \
&& git apply --directory transformers ya-fsdp/patches/transformers.diff

RUN git clone -b v0.27.0-release --depth 1 https://github.com/huggingface/accelerate.git \
&& git apply --directory accelerate ya-fsdp/patches/accelerate.diff

RUN git clone -b v0.8.3 --depth 1 https://github.com/huggingface/trl.git \
&& git apply --directory trl ya-fsdp/patches/trl.diff

RUN pip install --no-cache-dir \
./ya-fsdp \
./transformers \
./accelerate \
./trl

RUN pip install --no-cache-dir \
-r transformers/examples/pytorch/language-modeling/requirements.txt
7 changes: 7 additions & 0 deletions docker/build.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env sh

docker buildx build \
--load \
--network host \
-f docker/Dockerfile \
-t ya-fsdp:latest .
56 changes: 56 additions & 0 deletions examples/clm.md
@@ -0,0 +1,56 @@
# Causal LM pre-training example

This command launches a distributed pre-training run using the 🤗 Transformers and Accelerate libraries.

```bash
docker run \
-it \
--rm \
--net host \
--gpus '"device=0,1"' \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ya-fsdp:latest \
accelerate launch \
--config_file ya-fsdp/examples/fsdp_config.yaml \
--fsdp_ya_fsdp_enabled true \
transformers/examples/pytorch/language-modeling/run_clm.py \
--do_train \
--config_name meta-llama/Meta-Llama-3-8B \
--tokenizer_name meta-llama/Meta-Llama-3-8B \
--max_steps 5 \
--block_size 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--save_strategy no \
--logging_steps 1 \
--report_to tensorboard \
--output_dir clm
```

CLI options:

- `--gpus '"device=0,1"'` – limit the devices used on each host, or set it to
  `all` to use all available devices.
- `--fsdp_ya_fsdp_enabled true` – toggle between FSDP and YaFSDP (a baseline
  command with the toggle flipped is sketched after this list).
- `--(config_name|tokenizer_name|model_name_or_path) meta-llama/Meta-Llama-3-8B` –
  specify any model available on the 🤗 Hub or provide a path to your local
  model folder.
- `--max_steps 5` – specify the number of training steps.
- `--block_size 2048` – specify the input sequence length.
- `--per_device_(train|eval)_batch_size 1` – specify the per-device train/eval
  batch size.
- `--dataset_name wikitext` – specify any publicly available dataset from the
  🤗 Datasets library.
- `--save_strategy no` – specify the saving strategy (`no` or `steps`).
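
For instance, to collect a stock FSDP baseline for the same setup, only the YaFSDP toggle and the output directory change relative to the command above (a sketch; it assumes the toggle accepts `false`, and every other flag is left untouched):

```bash
docker run \
-it \
--rm \
--net host \
--gpus '"device=0,1"' \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ya-fsdp:latest \
accelerate launch \
--config_file ya-fsdp/examples/fsdp_config.yaml \
--fsdp_ya_fsdp_enabled false \
transformers/examples/pytorch/language-modeling/run_clm.py \
--do_train \
--config_name meta-llama/Meta-Llama-3-8B \
--tokenizer_name meta-llama/Meta-Llama-3-8B \
--max_steps 5 \
--block_size 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--save_strategy no \
--logging_steps 1 \
--report_to tensorboard \
--output_dir clm_fsdp
```

A separate `--output_dir` keeps the tensorboard logs of the two runs apart.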

`fsdp_config.yaml` options (an override sketch follows this list):

- `fsdp_state_dict_type` — choose between `FULL_STATE_DICT` and
  `LOCAL_STATE_DICT` to save a globally gathered state or locally sharded states.
- `fsdp_activation_checkpointing` — toggle activation checkpointing.
- `fsdp_num_layers_to_checkpoint` — specify the number of layers to checkpoint.
- `num_processes` — specify the total number of training processes (`number of
  hosts x number of devices on each host`).
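
The config is baked into the image at `/workspace/ya-fsdp/examples/fsdp_config.yaml` (per the Dockerfile in `docker/`), so one way to try different values without rebuilding is to edit a copy on the host and mount it over the baked-in file. A sketch with illustrative values:

```bash
# Copy the shipped config and turn on activation checkpointing for 10 layers.
cp examples/fsdp_config.yaml my_fsdp_config.yaml
sed -i \
  -e 's/fsdp_activation_checkpointing: false/fsdp_activation_checkpointing: true/' \
  -e 's/fsdp_num_layers_to_checkpoint: 0/fsdp_num_layers_to_checkpoint: 10/' \
  my_fsdp_config.yaml

# Add this flag to the `docker run` command so the container sees the edited file:
#   -v "$PWD/my_fsdp_config.yaml":/workspace/ya-fsdp/examples/fsdp_config.yaml
```
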
19 changes: 19 additions & 0 deletions examples/fsdp_config.yaml
@@ -0,0 +1,19 @@
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
  fsdp_activation_checkpointing: false
  fsdp_num_layers_to_checkpoint: 0
main_training_function: main
main_process_ip: localhost
mixed_precision: bf16
num_processes: 2
rdzv_backend: c10d
same_network: true
32 changes: 32 additions & 0 deletions examples/sft.md
@@ -0,0 +1,32 @@
# Supervised fine-tuning example

This command launches a distributed fine-tuning run using the 🤗 TRL, Transformers, and Accelerate libraries.

```bash
docker run \
-it \
--rm \
--net host \
--gpus '"device=0,1"' \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ya-fsdp:latest \
accelerate launch \
--config_file ya-fsdp/examples/fsdp_config.yaml \
--fsdp_ya_fsdp_enabled true \
trl/examples/scripts/sft.py \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B \
--max_steps 5 \
--block_size 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--dataset_name timdettmers/openassistant-guanaco \
--save_strategy no \
--logging_steps 1 \
--report_to tensorboard \
--output_dir sft
```

See `examples/clm.md` for tips on some of the options.
