[User guides] Add user guides for DeepSpeed and Accelerate (ray-project#38513)


Signed-off-by: Yunxuan Xiao <[email protected]>
woshiyyya authored Aug 19, 2023
1 parent c429c10 commit e78b0ef
Showing 18 changed files with 1,076 additions and 23 deletions.
2 changes: 2 additions & 0 deletions doc/source/_toc.yml
@@ -71,6 +71,8 @@ parts:
sections:
- file: train/huggingface-accelerate
title: Hugging Face Accelerate Guide
- file: train/deepspeed
title: DeepSpeed Guide
- file: train/distributed-tensorflow-keras
title: TensorFlow and Keras Guide
- file: train/distributed-xgboost-lightgbm
Binary file added doc/source/images/accelerate_logo.png
24 changes: 24 additions & 0 deletions doc/source/images/deepspeed_logo.svg
14 changes: 14 additions & 0 deletions doc/source/ray-overview/examples.rst
@@ -1402,3 +1402,17 @@ Ray Examples
:link-type: doc

Fine-tune vicuna-13b-v1.3 with DeepSpeed and LightningTrainer

.. grid-item-card:: :bdg-secondary:`Code example`
:class-item: gallery-item training llm pytorch nlp
:link: deepspeed_example
:link-type: ref

Distributed Training with DeepSpeed ZeRO-3 and TorchTrainer

.. grid-item-card:: :bdg-secondary:`Code example`
:class-item: gallery-item training llm pytorch huggingface nlp
:link: accelerate_example
:link-type: ref

Distributed Training with Hugging Face Accelerate and TorchTrainer
94 changes: 94 additions & 0 deletions doc/source/train/deepspeed.rst
@@ -0,0 +1,94 @@
.. _train-deepspeed:

Training with DeepSpeed
=======================

The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed <https://www.deepspeed.ai/>`_ training across a distributed Ray cluster.

All you need to do is run your existing training code with a TorchTrainer. You can expect the final code to look like this:

.. code-block:: python

    import deepspeed
    from deepspeed.accelerator import get_accelerator

    def train_func(config):
        # Instantiate your model, datasets, and batch collation function
        model = ...
        train_dataset = ...
        eval_dataset = ...
        collate_fn = ...
        deepspeed_config = {...}  # Your DeepSpeed config

        # Prepare everything for distributed training
        model, optimizer, train_dataloader, lr_scheduler = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            training_data=train_dataset,
            collate_fn=collate_fn,
            config=deepspeed_config,
        )

        # Define the GPU device for the current worker
        device = get_accelerator().device_name(model.local_rank)

        # Start training
        ...

    from ray.train.torch import TorchTrainer
    from ray.train import ScalingConfig

    # Launch train_func on every worker in the Ray cluster
    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(...),
        ...
    )
    trainer.fit()

Below is a simple example of ZeRO-3 training with DeepSpeed only.

.. tabs::

    .. group-tab:: Example with Ray Data

        .. dropdown:: Show Code

            .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py
                :language: python
                :start-after: __deepspeed_torch_basic_example_start__
                :end-before: __deepspeed_torch_basic_example_end__

    .. group-tab:: Example with PyTorch DataLoader

        .. dropdown:: Show Code

            .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer_no_raydata.py
                :language: python
                :start-after: __deepspeed_torch_basic_example_no_raydata_start__
                :end-before: __deepspeed_torch_basic_example_no_raydata_end__

.. tip::

    To run DeepSpeed with pure PyTorch, you **don't need to** provide any additional Ray Train utilities
    like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. Instead,
    keep using `deepspeed.initialize() <https://deepspeed.readthedocs.io/en/latest/initialize.html>`_ as usual to prepare everything
    for distributed training.
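
As a minimal sketch of what the ``deepspeed_config`` dictionary passed to ``deepspeed.initialize()`` might contain for ZeRO-3, consider the helper below. The function name ``build_zero3_config`` and all numeric values are hypothetical and chosen only for illustration; the sketch also assumes one GPU per Ray Train worker, so that the number of workers equals the number of GPUs.

```python
def build_zero3_config(micro_batch_size, grad_accum_steps, num_workers, lr=2e-5):
    """Build an illustrative DeepSpeed config dict for ZeRO stage 3.

    Hypothetical helper: the values are placeholders, not tuned settings.
    Assumes one GPU per worker, so num_workers equals the GPU count.
    """
    return {
        # Stage 3 partitions parameters, gradients, and optimizer states.
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "fp16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": lr}},
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        # Global batch = micro batch * accumulation steps * number of GPUs.
        "train_batch_size": micro_batch_size * grad_accum_steps * num_workers,
    }


deepspeed_config = build_zero3_config(
    micro_batch_size=16, grad_accum_steps=2, num_workers=4
)
print(deepspeed_config["train_batch_size"])
```

Computing ``train_batch_size`` from the other two batch settings keeps the three values consistent when the number of workers changes, for example when the ``ScalingConfig`` is adjusted.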

Running DeepSpeed with other frameworks
-------------------------------------------

Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train.

See the examples below for more details:

.. list-table::
    :header-rows: 1

    * - Framework
      - Example
    * - Accelerate (:ref:`User Guide <train-hf-accelerate>`)
      - `Fine-tune Llama-2 series models with DeepSpeed, Accelerate, and Ray Train. <https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed>`_
    * - Transformers (:ref:`User Guide <train-pytorch-transformers>`)
      - :ref:`Fine-tune GPT-J-6b with DeepSpeed and Hugging Face Transformers <gptj_deepspeed_finetune>`
    * - Lightning (:ref:`User Guide <train-pytorch-lightning>`)
      - :ref:`Fine-tune vicuna-13b with DeepSpeed and PyTorch Lightning <vicuna_lightning_deepspeed_finetuning>`
2 changes: 1 addition & 1 deletion doc/source/train/doc_code/accelerate_trainer.py
@@ -52,7 +52,7 @@ def train_loop_per_worker():
      print(f"epoch: {epoch}, loss: {loss.item()}")

      train.report(
-         {},
+         metrics={"epoch": epoch, "loss": loss.item()},
          checkpoint=Checkpoint.from_dict(
              dict(epoch=epoch, model=accelerator.unwrap_model(model).state_dict())
          ),
8 changes: 8 additions & 0 deletions doc/source/train/examples/accelerate/accelerate_example.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
:orphan:

.. _accelerate_example:

Hugging Face Accelerate Distributed Training Example with Ray Train
===================================================================

.. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py
8 changes: 8 additions & 0 deletions doc/source/train/examples/deepspeed/deepspeed_example.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
:orphan:

.. _deepspeed_example:

DeepSpeed ZeRO-3 Distributed Training Example with Ray Train
============================================================

.. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py