DeepSpeed Ulysses Release (#230)
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: chengming-zhang <[email protected]>
3 people committed Aug 23, 2023
1 parent 41116e0 commit a2a476e
Showing 17 changed files with 1,164 additions and 63 deletions.
36 changes: 36 additions & 0 deletions examples_deepspeed/sequence_parallel/README.md
@@ -0,0 +1,36 @@
# Sequence Parallelism

This folder contains examples that demonstrate how to use DeepSpeed's sequence parallelism.

## Setting Up the Environment for FlashAttention

DeepSpeed's sequence parallelism can be combined with the following types of attention.

- Classic attention
- FlashAttention (enabled by `--use-flash-attn`)
- FlashAttention + Triton (enabled by `--use-flash-attn-triton`)

For the best performance, we recommend using FlashAttention + Triton. Here are the installation steps and the versions we have tested. Note that FlashAttention is compatible only with Turing, Ampere, Ada, or Hopper GPUs.

```shell
# install triton
git clone -b legacy-backend https://github.com/openai/triton
cd triton/python/
pip install cmake
pip install .

# install FlashAttention
cd ${WORK_DIR}
git clone -b v1.0.4 https://github.com/HazyResearch/flash-attention
cd flash-attention
python setup.py install
```
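
As an optional sanity check, you can confirm that both packages import cleanly before launching training; the snippet below assumes the installed releases expose their version attributes.

```shell
# Optional check: both imports should succeed and print the installed versions
python -c "import triton; print(triton.__version__)"
python -c "import flash_attn; print(flash_attn.__version__)"
```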

## Enabling Sequence Parallelism

To enable sequence parallelism, set the degree of parallelism using the `--ds-sequence-parallel-size` argument. Ensure that the number of attention heads is divisible by this value.
Also ensure that your model configuration complies with FlashAttention's requirements; for instance, the head size should be divisible by 8 to achieve optimal performance. Refer to the [FlashAttention](https://github.com/Dao-AILab/flash-attention/tree/v1.0.4) documentation for more details.
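
As a minimal illustration of how these arguments fit together, the hypothetical excerpt below uses 32 attention heads with a sequence-parallel degree of 8 (32 is divisible by 8); it is not a complete command, and the full argument lists are in the example scripts referenced below.

```shell
# Illustrative excerpt only; see the example scripts in this folder for complete launch commands
deepspeed pretrain_gpt.py \
    --num-attention-heads 32 \
    --ds-sequence-parallel-size 8 \
    --use-flash-attn-triton
# ...remaining model, data, and DeepSpeed arguments omitted
```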

Working examples that enable sequence parallelism ([GPT1.3B](ds_pretrain_gpt_1.3B_seq_parallel_32k.sh), [GPT30B](ds_pretrain_gpt_30B_seq_parallel_32k.sh)) are available in this folder.

Please note that our sequence parallelism feature is currently incompatible with Megatron-LM's tensor or pipeline parallelism.
24 changes: 24 additions & 0 deletions examples_deepspeed/sequence_parallel/ds_config_gpt_TEMPLATE.json
@@ -0,0 +1,24 @@
{
"train_batch_size": GBSIZE,
"train_micro_batch_size_per_gpu": MBSIZE,
"steps_per_print": LOG_INTERVAL,

"zero_optimization": {
"stage": ZERO_STAGE,
"elastic_checkpoint": true
},

"gradient_clipping": 1.0,
"prescale_gradients": PRESCALE_GRAD,

"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 500,
"hysteresis": 2,
"min_loss_scale": 1,
"initial_scale_power": 11
},

"wall_clock_breakdown" : false
}
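
The uppercase tokens (`GBSIZE`, `MBSIZE`, `LOG_INTERVAL`, `ZERO_STAGE`, `PRESCALE_GRAD`) in this template are placeholders that the launch scripts fill in before training. A minimal sketch of that substitution, assuming `sed`-based replacement with purely illustrative values, is:

```shell
# Sketch only: replace the template placeholders with concrete (illustrative) values
template="ds_config_gpt_TEMPLATE.json"
config="ds_config_gpt.json"
sed -e "s/GBSIZE/256/" \
    -e "s/MBSIZE/2/" \
    -e "s/LOG_INTERVAL/10/" \
    -e "s/ZERO_STAGE/1/" \
    -e "s/PRESCALE_GRAD/true/" \
    "${template}" > "${config}"
```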