
Commit

Release
Co-authored-by: Mikhail Khrushchev <[email protected]>
Co-authored-by: Ruslan Vasilev <[email protected]>
3 people committed May 27, 2024
1 parent 85547f6 commit 0bd33e7
Showing 16 changed files with 1,900 additions and 0 deletions.
12 changes: 12 additions & 0 deletions CITATION.cff
@@ -0,0 +1,12 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Khrushchev"
  given-names: "Mikhail"
- family-names: "Frolov"
  given-names: "Anton"
- family-names: "Vasilev"
  given-names: "Ruslan"
title: "YaFSDP"
date-released: 2024-05-XX
url: "https://github.com/yandex/YaFSDP"
80 changes: 80 additions & 0 deletions README.md
@@ -0,0 +1,80 @@
# YaFSDP

- [Overview](#overview)
- [Advantages over FSDP](#advantages-over-fsdp)
- [Examples](#examples)
- [Issues and questions](#issues-and-questions)
- [Citation](#citation)

## Overview

YaFSDP is a Sharded Data Parallelism framework, designed to work well with transformer-like
neural network architectures.

You can find more details on YaFSDP internals in our [Medium blog post]().

## Advantages over FSDP

YaFSDP is up to 20% faster for pre-training LLMs and performs better under high
memory pressure. It is designed to reduce the overhead of communications and memory operations.

YaFSDP:

![ya_fsdp](assets/ya_fsdp.png)

FSDP:

![fsdp](assets/fsdp.png)

### Benchmarks

We've compared YaFSDP with FSDP on a variety of pre-training setups covering:

- 7B to 70B parameters
- 64 to 256 devices
- 2048 to 8192 tokens per sequence

In each run, the per-device batch size is set to 1. The `num-ckpt-layers` column below is the number of transformer layers with activation checkpointing enabled.

| model | gpu-count | seq-len | num-ckpt-layers | speedup |
| :---------- | --------: | ------: | --------------: | ------: |
| Llama-2-7b | 64 | 2048 | 0 | 9.92% |
| Llama-2-7b | 64 | 4096 | 0 | 3.43% |
| Llama-2-7b | 64 | 8192 | 0 | 2.68% |
| Llama-2-7b | 128 | 2048 | 0 | 9.57% |
| Llama-2-7b | 128 | 4096 | 0 | 2.42% |
| Llama-2-7b | 128 | 8192 | 0 | 2.32% |
| Llama-2-13b | 128 | 2048 | 0 | 12.10% |
| Llama-2-13b | 128 | 4096 | 0 | 3.49% |
| Llama-2-34b | 128 | 2048 | 0 | 20.70% |
| Llama-2-34b | 256 | 2048 | 0 | 21.99% |
| Llama-2-34b | 256 | 4096 | 5 | 8.35% |
| Llama-2-70b | 256 | 2048 | 10 | 21.48% |
| Llama-2-70b | 256 | 4096 | 50 | 7.17% |
| Llama-3-8B | 64 | 2048 | 0 | 10.15% |
| Llama-3-8B | 64 | 4096 | 0 | 7.98% |
| Llama-3-70B | 256 | 2048 | 20 | 26.60% |

## Examples

To try out YaFSDP:

1. Build the Docker image with `docker/build.sh`.
2. Launch one of the examples from the `examples` folder, as sketched below.
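
For example, from the repository root (a minimal sketch, assuming a CUDA-capable host with Docker and the NVIDIA Container Toolkit already set up):

```bash
# Build the ya-fsdp:latest image that the example commands expect.
sh docker/build.sh

# Each example is a markdown file with a ready-to-run `docker run` command;
# open examples/clm.md or examples/sft.md and run the command it contains.
cat examples/clm.md
```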

## Issues and questions

If you encounter any bugs or have any questions, feel free to open a GitHub issue.

## Citation

If you use this codebase, please cite it by using the following BibTeX entry:

```bibtex
@misc{YaFSDP2024,
  author = {Mikhail Khrushchev and Anton Frolov and Ruslan Vasilev},
  title = {YaFSDP: Yet another Fully Sharded Data Parallel},
  howpublished = {\url{https://github.com/yandex/YaFSDP}},
  year = {2024}
}
```
Binary file added assets/fsdp.png
Binary file added assets/ya_fsdp.png
24 changes: 24 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,24 @@
FROM nvcr.io/nvidia/pytorch:24.02-py3

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
WORKDIR /workspace

COPY ./ ya-fsdp/

RUN git clone -b v4.39-release --depth 1 https://github.com/huggingface/transformers.git \
&& git apply --directory transformers ya-fsdp/patches/transformers.diff

RUN git clone -b v0.27.0-release --depth 1 https://github.com/huggingface/accelerate.git \
&& git apply --directory accelerate ya-fsdp/patches/accelerate.diff

RUN git clone -b v0.8.3 --depth 1 https://github.com/huggingface/trl.git \
&& git apply --directory trl ya-fsdp/patches/trl.diff

RUN pip install --no-cache-dir \
./ya-fsdp \
./transformers \
./accelerate \
./trl

RUN pip install --no-cache-dir \
-r transformers/examples/pytorch/language-modeling/requirements.txt
7 changes: 7 additions & 0 deletions docker/build.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env sh

docker buildx build \
--load \
--network host \
-f docker/Dockerfile \
-t ya-fsdp:latest .
56 changes: 56 additions & 0 deletions examples/clm.md
@@ -0,0 +1,56 @@
# Causal LM pre-training example

This command launches a distributed pre-training run using the 🤗 Transformers and Accelerate libraries.

```bash
docker run \
-it \
--rm \
--net host \
--gpus '"device=0,1"' \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ya-fsdp:latest \
accelerate launch \
--config_file ya-fsdp/examples/fsdp_config.yaml \
--fsdp_ya_fsdp_enabled true \
transformers/examples/pytorch/language-modeling/run_clm.py \
--do_train \
--config_name meta-llama/Meta-Llama-3-8B \
--tokenizer_name meta-llama/Meta-Llama-3-8B \
--max_steps 5 \
--block_size 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--save_strategy no \
--logging_steps 1 \
--report_to tensorboard \
--output_dir clm
```

CLI options:

- `--gpus '"device=0,1"'` – limit the devices used on each host, or set it to
  `all` to use all available devices.
- `--fsdp_ya_fsdp_enabled true` – toggle between FSDP and YaFSDP (a baseline
  command with the toggle flipped is sketched after this list).
- `--(config_name|tokenizer_name|model_name_or_path) meta-llama/Meta-Llama-3-8B` –
  specify any model available on the 🤗 Hub or provide a path to your local
  model folder.
- `--max_steps 5` – specify the number of training steps.
- `--block_size 2048` – specify the input sequence length.
- `--per_device_(train|eval)_batch_size 1` – specify the per-device train/eval
  batch size.
- `--dataset_name wikitext` – specify any publicly available dataset from the
  🤗 Datasets library.
- `--save_strategy no` – specify the saving strategy (`no` or `steps`).
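
For instance, to collect a stock FSDP baseline for the same setup, only the YaFSDP toggle and the output directory change relative to the command above (a sketch; it assumes the toggle accepts `false`, and every other flag is left untouched):

```bash
docker run \
-it \
--rm \
--net host \
--gpus '"device=0,1"' \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ya-fsdp:latest \
accelerate launch \
--config_file ya-fsdp/examples/fsdp_config.yaml \
--fsdp_ya_fsdp_enabled false \
transformers/examples/pytorch/language-modeling/run_clm.py \
--do_train \
--config_name meta-llama/Meta-Llama-3-8B \
--tokenizer_name meta-llama/Meta-Llama-3-8B \
--max_steps 5 \
--block_size 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--save_strategy no \
--logging_steps 1 \
--report_to tensorboard \
--output_dir clm_fsdp
```

A separate `--output_dir` keeps the tensorboard logs of the two runs apart.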

`fsdp_config.yaml` options (an override sketch follows this list):

- `fsdp_state_dict_type` — choose between `FULL_STATE_DICT` and
  `LOCAL_STATE_DICT` to save a globally gathered state or locally sharded states.
- `fsdp_activation_checkpointing` — toggle activation checkpointing.
- `fsdp_num_layers_to_checkpoint` — specify the number of layers to checkpoint.
- `num_processes` — specify the total number of training processes (`number of
  hosts x number of devices on each host`).
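
The config is baked into the image at `/workspace/ya-fsdp/examples/fsdp_config.yaml` (per the Dockerfile in `docker/`), so one way to try different values without rebuilding is to edit a copy on the host and mount it over the baked-in file. A sketch with illustrative values:

```bash
# Copy the shipped config and turn on activation checkpointing for 10 layers.
cp examples/fsdp_config.yaml my_fsdp_config.yaml
sed -i \
  -e 's/fsdp_activation_checkpointing: false/fsdp_activation_checkpointing: true/' \
  -e 's/fsdp_num_layers_to_checkpoint: 0/fsdp_num_layers_to_checkpoint: 10/' \
  my_fsdp_config.yaml

# Add this flag to the `docker run` command so the container sees the edited file:
#   -v "$PWD/my_fsdp_config.yaml":/workspace/ya-fsdp/examples/fsdp_config.yaml
```
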
19 changes: 19 additions & 0 deletions examples/fsdp_config.yaml
@@ -0,0 +1,19 @@
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
  fsdp_activation_checkpointing: false
  fsdp_num_layers_to_checkpoint: 0
main_training_function: main
main_process_ip: localhost
mixed_precision: bf16
num_processes: 2
rdzv_backend: c10d
same_network: true
32 changes: 32 additions & 0 deletions examples/sft.md
@@ -0,0 +1,32 @@
# Supervised fine-tuning example

This command launches a distributed fine-tuning run using the 🤗 TRL, Transformers, and Accelerate libraries.

```bash
docker run \
-it \
--rm \
--net host \
--gpus '"device=0,1"' \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ya-fsdp:latest \
accelerate launch \
--config_file ya-fsdp/examples/fsdp_config.yaml \
--fsdp_ya_fsdp_enabled true \
trl/examples/scripts/sft.py \
--do_train \
--model_name_or_path meta-llama/Meta-Llama-3-8B \
--max_steps 5 \
--block_size 2048 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--dataset_name timdettmers/openassistant-guanaco \
--save_strategy no \
--logging_steps 1 \
--report_to tensorboard \
--output_dir sft
```

See `examples/clm.md` for tips on some of the options.
