[Train][Doc] Update PyTorch Data Ingestion User Guide (ray-project#45421)

Improves framework migration steps to Ray Data for data ingest user guide.

---------

Signed-off-by: woshiyyya <[email protected]>
Signed-off-by: yunxuanx <[email protected]>
woshiyyya committed Jun 25, 2024
1 parent f75ad5d commit 9ba6656
Showing 1 changed file with 28 additions and 14 deletions.
42 changes: 28 additions & 14 deletions doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -13,6 +13,10 @@ Key advantages include:

For more details about Ray Data, including comparisons to alternatives, see :ref:`Ray Data Overview <data_overview>`.

.. note::

In addition to Ray Data, you can continue to use framework-native data utilities with Ray Train, such as PyTorch Dataset, Hugging Face Dataset, and Lightning DataModule.

In this guide, we will cover how to incorporate Ray Data into your Ray Train script, and different ways to customize your data ingestion pipeline.

.. TODO: Replace this image with a better one.
@@ -258,8 +262,7 @@ Some frameworks provide their own dataset and data loading utilities. For example:
- **Hugging Face:** `Dataset <https://huggingface.co/docs/datasets/index>`_
- **PyTorch Lightning:** `LightningDataModule <https://lightning.ai/docs/pytorch/stable/data/datamodule.html>`_

You can still use these framework data utilities directly with Ray Train.

At a high level, you can compare these concepts as follows:

@@ -276,34 +279,45 @@ At a high level, you can compare these concepts as follows:
- n/a
- :meth:`ray.data.Dataset.iter_torch_batches`


For more details, see the following sections for each framework:

.. tab-set::

.. tab-item:: PyTorch DataLoader

**Option 1 (with Ray Data):**

1. Convert your PyTorch Dataset to a Ray Dataset.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, access the dataset shard via :meth:`ray.train.get_dataset_shard`.
4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.

For more details, see :ref:`Migrating from PyTorch Datasets and DataLoaders <migrate_pytorch>`.
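
For illustration, here is a minimal sketch of these four steps; the ``ExistingTorchDataset`` class, batch size, and worker count are hypothetical placeholders:

.. code-block:: python

    import ray
    import ray.train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Step 1: Convert an existing (hypothetical) PyTorch Dataset to a Ray Dataset.
    train_ds = ray.data.from_torch(ExistingTorchDataset())

    def train_loop_per_worker():
        # Step 3: Access this worker's shard of the dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: Iterate over Torch-formatted batches in place of a DataLoader.
        for batch in shard.iter_torch_batches(batch_size=32):
            ...  # Forward and backward passes go here.

    # Step 2: Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
    trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": train_ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()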

**Option 2 (without Ray Data):**

1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
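
A comparable sketch without Ray Data, again assuming a hypothetical ``ExistingTorchDataset``:

.. code-block:: python

    import ray.train
    import ray.train.torch
    from torch.utils.data import DataLoader

    def train_loop_per_worker():
        # Step 1: Create the Torch Dataset and DataLoader on each worker.
        data_loader = DataLoader(ExistingTorchDataset(), batch_size=32, shuffle=True)
        # Step 2: Add a DistributedSampler and automatic device placement.
        data_loader = ray.train.torch.prepare_data_loader(data_loader)
        for epoch in range(2):  # Epoch count is illustrative.
            # With a DistributedSampler, set the epoch so shuffling varies per epoch.
            if ray.train.get_context().get_world_size() > 1:
                data_loader.sampler.set_epoch(epoch)
            for batch in data_loader:
                ...  # Training step goes here.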

.. tab-item:: LightningDataModule

The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here.
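
As a rough sketch, assuming hypothetical ``MyLightningModule`` and ``MyDataModule`` classes built on Torch ``Dataset``\s and ``DataLoader``\s:

.. code-block:: python

    import lightning.pytorch as pl
    from ray.train.lightning import (
        RayDDPStrategy,
        RayLightningEnvironment,
        prepare_trainer,
    )

    def train_loop_per_worker():
        # The DataModule wraps ordinary Torch Datasets and DataLoaders.
        datamodule = MyDataModule()
        trainer = pl.Trainer(
            accelerator="auto",
            devices="auto",
            strategy=RayDDPStrategy(),
            plugins=[RayLightningEnvironment()],
        )
        trainer = prepare_trainer(trainer)
        trainer.fit(MyLightningModule(), datamodule=datamodule)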

.. tab-item:: Hugging Face Dataset

**Option 1 (with Ray Data):**

1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
5. Pass the iterable dataset to ``transformers.Trainer`` during initialization.
6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility.
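
A condensed sketch of these six steps; the dataset (assumed already tokenized), model, and training arguments are illustrative assumptions:

.. code-block:: python

    import ray
    import ray.train
    import transformers
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.train.huggingface.transformers import prepare_trainer

    # Step 1: Convert a (hypothetical, pre-tokenized) Hugging Face Dataset.
    ray_ds = ray.data.from_huggingface(hf_dataset)

    def train_loop_per_worker():
        # Step 3: Access this worker's shard of the dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: Create an iterable dataset of Torch-formatted batches.
        train_iterable = shard.iter_torch_batches(batch_size=16)
        # Step 5: Pass the iterable dataset when initializing transformers.Trainer.
        # max_steps is needed because the iterable has no known length.
        args = transformers.TrainingArguments(output_dir="/tmp/out", max_steps=100)
        hf_trainer = transformers.Trainer(
            model=model,  # A placeholder; substitute your own model.
            args=args,
            train_dataset=train_iterable,
        )
        # Step 6: Wrap the trainer to coordinate with Ray Train.
        hf_trainer = prepare_trainer(hf_trainer)
        hf_trainer.train()

    # Step 2: Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
    trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": ray_ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()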

**Option 2 (without Ray Data):**

1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization.
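
A minimal sketch of this option, with an illustrative public dataset and a placeholder model:

.. code-block:: python

    import datasets
    import transformers

    def train_loop_per_worker():
        # Step 1: Load the Hugging Face Dataset directly on each worker.
        hf_ds = datasets.load_dataset("yelp_review_full", split="train")
        # ... tokenize hf_ds as required by your model ...

        # Step 2: Pass the Hugging Face Dataset to the Trainer at initialization.
        hf_trainer = transformers.Trainer(
            model=model,  # A placeholder; substitute your own model.
            args=transformers.TrainingArguments(output_dir="/tmp/out"),
            train_dataset=hf_ds,
        )
        hf_trainer.train()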

.. tip::

