polish examples: make titles more consistent, add links to guides
Signed-off-by: angelinalg <[email protected]>
angelinalg committed Sep 11, 2023
1 parent 6991189 commit 9d793ab
Showing 13 changed files with 182 additions and 118 deletions.
1 change: 1 addition & 0 deletions .github/styles/Vocab/Train/accept.txt
@@ -1,5 +1,6 @@
Horovod
Hugging Face
+hyperparameters?
Keras
LightGBM
PyTorch
24 changes: 12 additions & 12 deletions doc/source/train/examples.rst
@@ -3,7 +3,7 @@
Ray Train Examples
==================

-.. Example .rst files should be organized in the same manner as the
+.. Organize example .rst files in the same manner as the
.py files in ray/python/ray/train/examples.
Below are examples for using Ray Train with a variety of frameworks and use cases.
@@ -18,17 +18,17 @@ Beginner
* - Framework
- Example
* - PyTorch
- - :ref:`Training an Fashion MNIST Image Classifier with PyTorch <torch_fashion_mnist_ex>`
+ - :ref:`Train a Fashion MNIST Image Classifier with PyTorch <torch_fashion_mnist_ex>`
* - Lightning
- - :ref:`Training an MNIST Image Classifier with Lightning <lightning_mnist_example>`
+ - :ref:`Train an MNIST Image Classifier with Lightning <lightning_mnist_example>`
* - Transformers
- - :ref:`Fine-tuning a Text Classifier on Yelp Reviews Dataset with HF Transformers <transformers_torch_trainer_basic_example>`
+ - :ref:`Fine-tune a Text Classifier on the Yelp Reviews Dataset with HF Transformers <transformers_torch_trainer_basic_example>`
* - Accelerate
- :ref:`Distributed Data Parallel Training with HF Accelerate <accelerate_example>`
* - DeepSpeed
- - :ref:`Distributed Training with DeepSpeed ZeRO-3 <deepspeed_example>`
+ - :ref:`Train with DeepSpeed ZeRO-3 <deepspeed_example>`
* - TensorFlow
- - :ref:`TensorFlow MNIST Training Example <tensorflow_mnist_example>`
+ - :ref:`Train with TensorFlow MNIST <tensorflow_mnist_example>`
* - Horovod
- :ref:`End-to-end Horovod Training Example <horovod_example>`

@@ -42,11 +42,11 @@ Intermediate
* - Framework
- Example
* - PyTorch
- - `DreamBooth fine-tuning of Stable Diffusion with Ray Train <https://github.com/ray-project/ray/tree/master/doc/source/templates/05_dreambooth_finetuning>`_
+ - :ref:`Fine-tune Stable Diffusion with DreamBooth and Ray Train <torch_finetune_dreambooth_ex>`
* - Lightning
- :ref:`Model Training with PyTorch Lightning and Ray Data <lightning_advanced_example>`
* - Accelerate
- - :ref:`Fine-tuning a Text Classifier on GLUE Benchmark with HF Accelerate. <train_transformers_accelerate_example>`
+ - :ref:`Fine-tune a Text Classifier on the GLUE Benchmark with HF Accelerate <train_transformers_accelerate_example>`


Advanced
@@ -59,10 +59,10 @@ Advanced
* - Framework
- Example
* - Accelerate, DeepSpeed
- - `Fine-tuning Llama-2 series models with Deepspeed, Accelerate, and Ray Train TorchTrainer <https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed>`_
+ - `Fine-tune Llama-2 series models with DeepSpeed, Accelerate, and Ray Train TorchTrainer <https://github.com/ray-project/ray/tree/master/doc/source/templates/04_finetuning_llms_with_deepspeed>`_
* - Transformers, DeepSpeed
- - :ref:`Fine-tuning GPT-J-6B with Ray Train and DeepSpeed <gptj_deepspeed_finetune>`
+ - :ref:`Fine-tune GPT-J-6B with Ray Train and DeepSpeed <gptj_deepspeed_finetune>`
* - Lightning, DeepSpeed
- - :ref:`Fine-tuning vicuna-13b with PyTorch Lightning and DeepSpeed <vicuna_lightning_deepspeed_finetuning>`
+ - :ref:`Fine-tune vicuna-13b with PyTorch Lightning and DeepSpeed <vicuna_lightning_deepspeed_finetuning>`
* - Lightning
- - :ref:`Fine-tuning dolly-v2-7b with PyTorch Lightning and FSDP <dolly_lightning_fsdp_finetuning>`
+ - :ref:`Fine-tune dolly-v2-7b with PyTorch Lightning and FSDP <dolly_lightning_fsdp_finetuning>`
21 changes: 19 additions & 2 deletions doc/source/train/examples/accelerate/accelerate_example.rst
@@ -2,7 +2,24 @@

.. _accelerate_example:

-Hugging Face Accelerate Distributed Training Example with Ray Train
-===================================================================
+Distributed Training Example with Hugging Face Accelerate
+=========================================================

+This example runs distributed data parallel training
+with Hugging Face (HF) Accelerate, Ray Train, and Ray Data.
+It fine-tunes a BERT model and is adapted from
+https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py


+Code example
+------------

.. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py

+See also
+--------

+For a tutorial on using Ray Train and HF Accelerate,
+see :ref:`Training with Hugging Face Accelerate <train-hf-accelerate>`.

+For more Train examples, see :ref:`Ray Train Examples <train-examples>`.
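
For orientation, the pattern this new page documents is an ordinary Accelerate training loop handed to a Ray Train TorchTrainer. The following is a minimal sketch of that pattern under Ray 2.x APIs, not the accelerate_torch_trainer.py file the page pulls in; the toy model and synthetic data are placeholders for the example's BERT fine-tuning:

import torch
from accelerate import Accelerator

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    accelerator = Accelerator()  # picks up the devices Ray Train provisioned

    model = torch.nn.Linear(10, 2)  # placeholder; the real example fine-tunes BERT
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    dataloader = torch.utils.data.DataLoader(
        [(torch.randn(10), torch.tensor(0)) for _ in range(64)], batch_size=8
    )
    # Accelerate handles device placement and distributed (DDP) wrapping.
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for epoch in range(config["epochs"]):
        model.train()
        for features, labels in dataloader:
            loss = torch.nn.functional.cross_entropy(model(features), labels)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
        # Surface per-epoch metrics to Ray Train.
        ray.train.report({"epoch": epoch, "loss": loss.item()})


trainer = TorchTrainer(
    train_func,
    train_loop_config={"lr": 2e-5, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()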
33 changes: 22 additions & 11 deletions doc/source/train/examples/lightning/lightning_mnist_example.ipynb
@@ -51,7 +51,7 @@
"source": [
"## Prepare a dataset and module\n",
"\n",
"The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can keep using them without any changes with Ray Train. "
"The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can continue using them without any changes with Ray Train. "
]
},
{
@@ -75,7 +75,7 @@
" self.data_dir, train=True, download=True, transform=self.transform\n",
" )\n",
"\n",
" # split data into train and val sets\n",
" # Split data into train and val sets\n",
" self.mnist_train, self.mnist_val = random_split(mnist, [55000, 5000])\n",
"\n",
" def train_dataloader(self):\n",
@@ -175,26 +175,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You don't need to make any change to the definition of PyTorch Lightning model and datamodule."
"You don't need to modify the definition of the PyTorch Lightning model or datamodule."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the training loop\n",
"## Define a training function\n",
"\n",
"This code defines a training loop for each worker. Comparing the training loop with the original PyTorch Lightning code, there are 3 main differences:\n",
"This code defines a {ref}`training function <train-overview-training-function>` for each worker. Comparing the training fuction with the original PyTorch Lightning code, notice three main differences:\n",
"\n",
"- Distributed strategy: Use {class}`RayDDPStrategy <ray.train.lightning.RayDDPStrategy>`.\n",
"- Cluster environment: Use {class}`RayLightningEnvironment <ray.train.lightning.RayLightningEnvironment>`.\n",
"- Parallel devices: Always sets to `devices=\"auto\"` to use all available devices configured by ``TorchTrainer``.\n",
"- Parallel devices: Always set to `devices=\"auto\"` to use all available devices configured by ``TorchTrainer``.\n",
"\n",
"See {ref}`Getting Started with PyTorch Lightning <train-pytorch-lightning>` for more information.\n",
"\n",
"\n",
"For checkpoint reportining, Ray Train provides a minimal {class}`RayTrainReportCallback <ray.train.lightning.RayTrainReportCallback>` that reports metrics and checkpoint on each train epoch end. For more complex checkpoint logic, please implement custom callbacks as described in {ref}`Saving and Loading Checkpoint <train-checkpointing>` user guide."
"For checkpoint reporting, Ray Train provides a minimal {class}`RayTrainReportCallback <ray.train.lightning.RayTrainReportCallback>` class that reports metrics and checkpoints at the end of each train epoch. For more complex checkpoint logic, implement custom callbacks. See {ref}`Saving and Loading Checkpoint <train-checkpointing>`."
]
},
{
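
Put together, the three substitutions above plus the report callback look roughly like the following sketch. MNISTClassifier and MNISTDataModule are illustrative stand-ins for the model and datamodule classes this notebook defines earlier, and use_gpu/num_workers are the variables set in the next cell:

import pytorch_lightning as pl

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)


def train_func():
    # Placeholder names: the notebook defines its own LightningModule/DataModule.
    model = MNISTClassifier(lr=1e-3)
    datamodule = MNISTDataModule(batch_size=128)

    trainer = pl.Trainer(
        max_epochs=10,
        devices="auto",                        # use all devices TorchTrainer assigns
        accelerator="auto",
        strategy=RayDDPStrategy(),             # distributed strategy
        plugins=[RayLightningEnvironment()],   # cluster environment
        callbacks=[RayTrainReportCallback()],  # report metrics + checkpoint per epoch
        enable_progress_bar=False,
    )
    trainer = prepare_trainer(trainer)  # validate the Ray-specific configuration
    trainer.fit(model, datamodule=datamodule)


trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
)
result = trainer.fit()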
@@ -203,7 +203,7 @@
"metadata": {},
"outputs": [],
"source": [
"use_gpu = True # Set it to False if you want to run without GPUs\n",
"use_gpu = True # Set to False if you want to run without GPUs\n",
"num_workers = 4"
]
},
@@ -804,7 +804,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check the Training Results and Checkpoints"
"## Check training results and checkpoints"
]
},
{
@@ -857,9 +857,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, three checkpoints(`checkpoint_000007`, `checkpoint_000008`, `checkpoint_000009`) have been saved in the trial directory. To retrieve the latest checkpoint from the fit results and load it back into the model, follow these steps.\n",
"Ray Train saved three checkpoints(`checkpoint_000007`, `checkpoint_000008`, `checkpoint_000009`) in the trial directory. The following code retrieves the latest checkpoint from the fit results and loads it back into the model.\n",
"\n",
"If you lost the in-memory result object, you can also restore the model from the checkpoint file. Here the checkpoint path is: `/tmp/ray_results/ptl-mnist-example/TorchTrainer_eb925_00000_0_2023-08-07_23-15-06/checkpoint_000009/checkpoint.ckpt`."
"If you lost the in-memory result object, you can restore the model from the checkpoint file. The checkpoint path is: `/tmp/ray_results/ptl-mnist-example/TorchTrainer_eb925_00000_0_2023-08-07_23-15-06/checkpoint_000009/checkpoint.ckpt`."
]
},
{
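
Condensed, that retrieval pattern reads roughly as follows. Here `result` is the object returned by trainer.fit(), MNISTClassifier stands in for the notebook's LightningModule, and checkpoint.ckpt is the filename visible in the path above:

import os

# Latest checkpoint reported during training.
checkpoint = result.checkpoint

# Materialize the checkpoint locally and load the weights back into the model.
with checkpoint.as_directory() as checkpoint_dir:
    ckpt_path = os.path.join(checkpoint_dir, "checkpoint.ckpt")
    best_model = MNISTClassifier.load_from_checkpoint(ckpt_path)

# Without the in-memory result object, load from the persisted path instead:
# best_model = MNISTClassifier.load_from_checkpoint(
#     "/tmp/ray_results/ptl-mnist-example/TorchTrainer_eb925_00000_0_2023-08-07_23-15-06/checkpoint_000009/checkpoint.ckpt"
# )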
@@ -903,6 +903,17 @@
"\n",
"best_model"
]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## See also\n",
+"\n",
+"For a tutorial on using Ray Train and PyTorch Lightning, see {ref}`Getting Started with PyTorch Lightning <train-pytorch-lightning>`.\n",
+"\n",
+"For more Train examples, see {ref}`Ray Train Examples <train-examples>`."
+]
}
],
"metadata": {