[User guides] Add user guides for DeepSpeed and Accelerate (ray-project#38513)

Signed-off-by: Yunxuan Xiao <[email protected]>
krfricke · Aug 19, 2023 · e78b0ef
1 parent c429c10
commit e78b0ef

Showing 18 changed files with 1,076 additions and 23 deletions.
diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
@@ -71,6 +71,8 @@ parts:
  sections:
  - file: train/huggingface-accelerate
  title: Hugging Face Accelerate Guide
+ - file: train/deepspeed
+ title: DeepSpeed Guide
  - file: train/distributed-tensorflow-keras
  title: TensorFlow and Keras Guide
  - file: train/distributed-xgboost-lightgbm

diff --git a/doc/source/images/accelerate_logo.png b/doc/source/images/accelerate_logo.png
diff --git a/doc/source/images/deepspeed_logo.svg b/doc/source/images/deepspeed_logo.svg
diff --git a/doc/source/ray-overview/examples.rst b/doc/source/ray-overview/examples.rst
@@ -1402,3 +1402,17 @@ Ray Examples
  :link-type: doc
 
  Fine-tune vicuna-13b-v1.3 with DeepSpeed and LightningTrainer
+
+ .. grid-item-card:: :bdg-secondary:`Code example`
+ :class-item: gallery-item training llm pytorch nlp
+ :link: deepspeed_example
+ :link-type: ref
+
+ Distributed Training with DeepSpeed ZeRO-3 and TorchTrainer
+
+ .. grid-item-card:: :bdg-secondary:`Code example`
+ :class-item: gallery-item training llm pytorch huggingface nlp
+ :link: accelerate_example
+ :link-type: ref
+
+ Distributed Training with Hugging Face Accelerate and TorchTrainer
diff --git a/doc/source/train/deepspeed.rst b/doc/source/train/deepspeed.rst
@@ -0,0 +1,94 @@
+.. _train-deepspeed:
+
+Training with DeepSpeed
+=======================
+
+The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed <https://www.deepspeed.ai/>`_ training across a distributed Ray cluster.
+
+All you need to do is run your existing training code with a TorchTrainer. You can expect the final code to look like this:
+
+.. code-block:: python
+
+    import deepspeed
+    from deepspeed.accelerator import get_accelerator
+
+    def train_func(config):
+        # Instantiate your model and dataset
+        model = ...
+        train_dataset = ...
+        eval_dataset = ...
+        deepspeed_config = {...} # Your DeepSpeed config
+
+        # Prepare everything for distributed training
+        model, optimizer, train_dataloader, lr_scheduler = deepspeed.initialize(
+            model=model,
+            model_parameters=model.parameters(),
+            training_data=train_dataset,
+            collate_fn=collate_fn,  # Optional: your batch collation function
+            config=deepspeed_config,
+        )
+
+        # Define the GPU device for the current worker
+        device = get_accelerator().device_name(model.local_rank)
+
+        # Start training
+        ...
+
+    from ray.train.torch import TorchTrainer
+    from ray.train import ScalingConfig
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(...),
+        # ... other TorchTrainer arguments
+    )
+    trainer.fit()
+
+
+Below is a simple example of ZeRO-3 training with DeepSpeed only. 
+
+.. tabs::
+
+ .. group-tab:: Example with Ray Data
+
+ .. dropdown:: Show Code
+
+ .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py
+ :language: python
+ :start-after: __deepspeed_torch_basic_example_start__
+ :end-before: __deepspeed_torch_basic_example_end__
+
+ .. group-tab:: Example with PyTorch DataLoader
+
+ .. dropdown:: Show Code
+
+ .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer_no_raydata.py
+ :language: python
+ :start-after: __deepspeed_torch_basic_example_no_raydata_start__
+ :end-before: __deepspeed_torch_basic_example_no_raydata_end__
+
+.. tip::
+
+ To run DeepSpeed with pure PyTorch, you **don't need to** provide any additional Ray Train utilities 
+ like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. Instead, 
+ keep using `deepspeed.initialize() <https://deepspeed.readthedocs.io/en/latest/initialize.html>`_ as usual to prepare everything 
+ for distributed training.
+
+Running DeepSpeed with other frameworks
+-------------------------------------------
+
+Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train.
+
+See the examples below for more details:
+
+.. list-table::
+ :header-rows: 1
+
+ * - Framework
+ - Example
+ * - Accelerate (:ref:`User Guide <train-hf-accelerate>`)
+ - `Fine-tune Llama-2 series models with DeepSpeed, Accelerate, and Ray Train. <https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed>`_
+ * - Transformers (:ref:`User Guide <train-pytorch-transformers>`)
+ - :ref:`Fine-tune GPT-J-6b with DeepSpeed and Hugging Face Transformers <gptj_deepspeed_finetune>`
+ * - Lightning (:ref:`User Guide <train-pytorch-lightning>`)
+ - :ref:`Fine-tune vicuna-13b with DeepSpeed and PyTorch Lightning <vicuna_lightning_deepspeed_finetuning>`
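Note on the `deepspeed_config = {...}` placeholder in the guide above: the commit does not show what that dict contains. As a rough illustration only, a minimal ZeRO stage 3 config might look like the sketch below. The helper name `build_zero3_config` and all numeric values are hypothetical and not part of this commit. The keys are standard DeepSpeed config fields, but the settings are placeholders rather than tuned recommendations.

```python
# Hypothetical helper (not part of this commit): builds a minimal
# DeepSpeed config dict that enables ZeRO stage 3.
# All numeric values are illustrative placeholders, not recommendations.
def build_zero3_config(
    micro_batch_size: int = 8,
    lr: float = 2e-5,
    use_fp16: bool = True,
) -> dict:
    return {
        # Per-GPU batch size for a single forward/backward pass
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": 1,
        # Let DeepSpeed construct the optimizer from this spec
        "optimizer": {"type": "AdamW", "params": {"lr": lr}},
        # Mixed precision; assumes GPUs with fp16 support
        "fp16": {"enabled": use_fp16},
        # ZeRO stage 3 partitions optimizer states, gradients, and parameters
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }


deepspeed_config = build_zero3_config()
```

A dict like this would be passed as `config=deepspeed_config` to `deepspeed.initialize()` inside `train_func`. Which fields are appropriate depends on the model and hardware.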
diff --git a/doc/source/train/doc_code/accelerate_trainer.py b/doc/source/train/doc_code/accelerate_trainer.py
@@ -52,7 +52,7 @@ def train_loop_per_worker():
  print(f"epoch: {epoch}, loss: {loss.item()}")
 
  train.report(
- {},
+ metrics={"epoch": epoch, "loss": loss.item()},
  checkpoint=Checkpoint.from_dict(
  dict(epoch=epoch, model=accelerator.unwrap_model(model).state_dict())
  ),

diff --git a/doc/source/train/examples/accelerate/accelerate_example.rst b/doc/source/train/examples/accelerate/accelerate_example.rst
@@ -0,0 +1,8 @@
+:orphan:
+
+.. _accelerate_example:
+
+Hugging Face Accelerate Distributed Training Example with Ray Train
+===================================================================
+
+.. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py
diff --git a/doc/source/train/examples/deepspeed/deepspeed_example.rst b/doc/source/train/examples/deepspeed/deepspeed_example.rst
@@ -0,0 +1,8 @@
+:orphan:
+
+.. _deepspeed_example:
+
+DeepSpeed ZeRO-3 Distributed Training Example with Ray Train
+============================================================
+
+.. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py