Commit 0bd33e7 (parent: 85547f6)

Co-authored-by: Mikhail Khrushchev <[email protected]>
Co-authored-by: Ruslan Vasilev <[email protected]>

Showing 16 changed files with 1,900 additions and 0 deletions.
**CITATION.cff**

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Khrushchev"
  given-names: "Mikhail"
- family-names: "Frolov"
  given-names: "Anton"
- family-names: "Vasilev"
  given-names: "Ruslan"
title: "YaFSDP"
date-released: 2024-05-XX
url: "https://github.com/yandex/YaFSDP"
```
**README.md**

# YaFSDP

- [Overview](#overview)
- [Advantages over FSDP](#advantages-over-fsdp)
- [Examples](#examples)
- [Issues and questions](#issues-and-questions)
- [Citation](#citation)

## Overview

YaFSDP is a Sharded Data Parallelism framework designed to work well with
transformer-like neural network architectures.

You can find more info on YaFSDP internals in our [Medium blog post]().

## Advantages over FSDP

YaFSDP is up to 20% faster for pre-training LLMs and performs better under high
memory pressure. It is designed to reduce the overhead of communications and
memory operations.

YaFSDP:

![ya_fsdp](assets/ya_fsdp.png)

FSDP:

![fsdp](assets/fsdp.png)

### Benchmarks

We've compared YaFSDP with FSDP on a variety of pre-training setups spanning:

- 7B to 70B parameters
- 64 to 256 devices
- 2048 to 8192 tokens per sequence

In each run, the per-device batch size is set to 1.

| model       | gpu-count | seq-len | num-ckpt-layers | speedup |
| :---------- | --------: | ------: | --------------: | ------: |
| Llama-2-7b  |        64 |    2048 |               0 |   9.92% |
| Llama-2-7b  |        64 |    4096 |               0 |   3.43% |
| Llama-2-7b  |        64 |    8192 |               0 |   2.68% |
| Llama-2-7b  |       128 |    2048 |               0 |   9.57% |
| Llama-2-7b  |       128 |    4096 |               0 |   2.42% |
| Llama-2-7b  |       128 |    8192 |               0 |   2.32% |
| Llama-2-13b |       128 |    2048 |               0 |  12.10% |
| Llama-2-13b |       128 |    4096 |               0 |   3.49% |
| Llama-2-34b |       128 |    2048 |               0 |  20.70% |
| Llama-2-34b |       256 |    2048 |               0 |  21.99% |
| Llama-2-34b |       256 |    4096 |               5 |   8.35% |
| Llama-2-70b |       256 |    2048 |              10 |  21.48% |
| Llama-2-70b |       256 |    4096 |              50 |   7.17% |
| Llama-3-8B  |        64 |    2048 |               0 |  10.15% |
| Llama-3-8B  |        64 |    4096 |               0 |   7.98% |
| Llama-3-70B |       256 |    2048 |              20 |  26.60% |
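A speedup percentage like the ones in the table can be computed from measured per-iteration times. This is a minimal sketch under the assumption that speedup means relative reduction in iteration time; the README does not spell out the exact metric:

```python
def speedup_pct(fsdp_iter_s: float, yafsdp_iter_s: float) -> float:
    """Percent reduction in per-iteration time when switching FSDP -> YaFSDP.

    Assumed definition; both arguments are seconds per training iteration.
    """
    return (fsdp_iter_s - yafsdp_iter_s) / fsdp_iter_s * 100.0

# Example: if an FSDP step takes 1.25 s and a YaFSDP step takes 1.00 s,
# the reported speedup would be 20%:
print(round(speedup_pct(1.25, 1.00), 2))  # -> 20.0
```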
## Examples

To try out YaFSDP you should:

1. Build the Docker container with `docker/build.sh`.
2. Launch one of the examples in the `examples` folder.

## Issues and questions

If you encounter any bugs or have any questions, feel free to open a GitHub issue.

## Citation

If you use this codebase, please cite it by using the following BibTeX entry:

```bibtex
@misc{YaFSDP2024,
  author = {Mikhail Khrushchev and Anton Frolov and Ruslan Vasilev},
  title = {YaFSDP: Yet another Fully Sharded Data Parallel},
  howpublished = {\url{https://github.com/yandex/YaFSDP}},
  year = {2024}
}
```
(Two binary files in this commit cannot be displayed.)

**docker/Dockerfile**
```dockerfile
FROM nvcr.io/nvidia/pytorch:24.02-py3

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
WORKDIR /workspace

COPY ./ ya-fsdp/

RUN git clone -b v4.39-release --depth 1 https://github.com/huggingface/transformers.git \
    && git apply --directory transformers ya-fsdp/patches/transformers.diff

RUN git clone -b v0.27.0-release --depth 1 https://github.com/huggingface/accelerate.git \
    && git apply --directory accelerate ya-fsdp/patches/accelerate.diff

RUN git clone -b v0.8.3 --depth 1 https://github.com/huggingface/trl.git \
    && git apply --directory trl ya-fsdp/patches/trl.diff

RUN pip install --no-cache-dir \
    ./ya-fsdp \
    ./transformers \
    ./accelerate \
    ./trl

RUN pip install --no-cache-dir \
    -r transformers/examples/pytorch/language-modeling/requirements.txt
```
**docker/build.sh**
```sh
#!/usr/bin/env sh

docker buildx build \
  --load \
  --network host \
  -f docker/Dockerfile \
  -t ya-fsdp:latest .
```
**examples/clm.md**
# Causal LM pre-training example

This command launches a distributed pre-training setup using the 🤗 transformers and accelerate libraries.

```bash
docker run \
  -it \
  --rm \
  --net host \
  --gpus '"device=0,1"' \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  ya-fsdp:latest \
  accelerate launch \
  --config_file ya-fsdp/examples/fsdp_config.yaml \
  --fsdp_ya_fsdp_enabled true \
  transformers/examples/pytorch/language-modeling/run_clm.py \
  --do_train \
  --config_name meta-llama/Meta-Llama-3-8B \
  --tokenizer_name meta-llama/Meta-Llama-3-8B \
  --max_steps 5 \
  --block_size 2048 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --save_strategy no \
  --logging_steps 1 \
  --report_to tensorboard \
  --output_dir clm
```

CLI options:

- `--gpus '"device=0,1"'` – limit the devices used on each host, or set to
  `all` to use all available devices.
- `--fsdp_ya_fsdp_enabled true` – toggle between FSDP and YaFSDP.
- `--(config_name|tokenizer_name|model_name) meta-llama/Meta-Llama-3-8B` –
  specify any model available on the 🤗 hub or provide a path to your local
  model folder.
- `--max_steps 5` – specify the number of training steps.
- `--block_size 2048` – specify the input sequence length.
- `--per_device_(train|eval)_batch_size 1` – specify the train/eval batch size.
- `--dataset_name wikitext` – specify any publicly available dataset from the
  🤗 datasets library.
- `--save_strategy no` – specify the saving strategy (`no` or `steps`).

`fsdp_config.yaml` options:

- `fsdp_state_dict_type` – choose between `FULL_STATE_DICT` and
  `LOCAL_STATE_DICT` to save a globally gathered state or local sharded states.
- `fsdp_activation_checkpointing` – toggle activation checkpointing.
- `fsdp_num_layers_to_checkpoint` – specify the number of layers to checkpoint.
- `num_processes` – specify the total number of training processes (number of
  hosts × number of devices on each host).
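As a concrete example, the two checkpointing options could be set like this in `fsdp_config.yaml` (the values here are illustrative, not recommendations):

```yaml
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_num_layers_to_checkpoint: 10
```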
**examples/fsdp_config.yaml**
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
  fsdp_activation_checkpointing: false
  fsdp_num_layers_to_checkpoint: 0
main_training_function: main
main_process_ip: localhost
mixed_precision: bf16
num_processes: 2
rdzv_backend: c10d
same_network: true
```
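Per `examples/clm.md`, `num_processes` is the total number of training processes, i.e. the number of hosts times the number of devices on each host. A trivial illustrative helper (not part of the repo):

```python
def num_processes(num_hosts: int, devices_per_host: int) -> int:
    """num_processes for fsdp_config.yaml: hosts x devices per host."""
    return num_hosts * devices_per_host

# The example config above runs on a single host with 2 GPUs (device=0,1):
print(num_processes(1, 2))  # -> 2
```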
**examples/sft.md**
# Supervised fine-tuning example

This command launches a distributed fine-tuning setup using the 🤗 trl, transformers and accelerate libraries.

```bash
docker run \
  -it \
  --rm \
  --net host \
  --gpus '"device=0,1"' \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  ya-fsdp:latest \
  accelerate launch \
  --config_file ya-fsdp/examples/fsdp_config.yaml \
  --fsdp_ya_fsdp_enabled true \
  trl/examples/scripts/sft.py \
  --do_train \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --max_steps 5 \
  --block_size 2048 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --dataset_name timdettmers/openassistant-guanaco \
  --save_strategy no \
  --logging_steps 1 \
  --report_to tensorboard \
  --output_dir sft
```

See `examples/clm.md` for tips on some of the options.