[Train][Doc] Update PyTorch Data Ingestion User Guide (ray-project#45421)

Improves framework migration steps to Ray Data for data ingest user guide.

---------

Signed-off-by: woshiyyya <[email protected]>
Signed-off-by: yunxuanx <[email protected]>
woshiyyya committed Jun 25, 2024
1 parent f75ad5d commit 9ba6656
Showing 1 changed file with 28 additions and 14 deletions.
42 changes: 28 additions & 14 deletions doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -13,6 +13,10 @@ Key advantages include:

For more details about Ray Data, including comparisons to alternatives, see :ref:`Ray Data Overview <data_overview>`.

.. note::

In addition to Ray Data, you can continue to use framework-native data utilities with Ray Train, such as PyTorch Dataset, Hugging Face Dataset, and Lightning DataModule.

In this guide, we will cover how to incorporate Ray Data into your Ray Train script, and different ways to customize your data ingestion pipeline.

.. TODO: Replace this image with a better one.
@@ -258,8 +262,7 @@ Some frameworks provide their own dataset and data loading utilities. For example:
- **Hugging Face:** `Dataset <https://huggingface.co/docs/datasets/index>`_
- **PyTorch Lightning:** `LightningDataModule <https://lightning.ai/docs/pytorch/stable/data/datamodule.html>`_

You can still use these framework data utilities directly with Ray Train.

At a high level, you can compare these concepts as follows:

@@ -276,34 +279,45 @@ At a high level, you can compare these concepts as follows:
- n/a
- :meth:`ray.data.Dataset.iter_torch_batches`


For more details, see the following sections for each framework:

.. tab-set::

.. tab-item:: PyTorch DataLoader

**Option 1 (with Ray Data):**

1. Convert your PyTorch Dataset to a Ray Dataset.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, access the dataset shard via :meth:`ray.train.get_dataset_shard`.
4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.

For more details, see :ref:`Migrating from PyTorch Datasets and DataLoaders <migrate_pytorch>`.
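
For illustration, here is a minimal sketch of these four steps; the ``ExistingTorchDataset`` class, batch size, and worker count are hypothetical placeholders:

.. code-block:: python

    import ray
    import ray.train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    # Step 1: Convert an existing (hypothetical) PyTorch Dataset to a Ray Dataset.
    train_ds = ray.data.from_torch(ExistingTorchDataset())

    def train_loop_per_worker():
        # Step 3: Access this worker's shard of the dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: Iterate over Torch-formatted batches in place of a DataLoader.
        for batch in shard.iter_torch_batches(batch_size=32):
            ...  # Forward and backward passes go here.

    # Step 2: Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
    trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": train_ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()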

**Option 2 (without Ray Data):**

1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
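
A comparable sketch without Ray Data, again assuming a hypothetical ``ExistingTorchDataset``:

.. code-block:: python

    import ray.train
    import ray.train.torch
    from torch.utils.data import DataLoader

    def train_loop_per_worker():
        # Step 1: Create the Torch Dataset and DataLoader on each worker.
        data_loader = DataLoader(ExistingTorchDataset(), batch_size=32, shuffle=True)
        # Step 2: Add a DistributedSampler and automatic device placement.
        data_loader = ray.train.torch.prepare_data_loader(data_loader)
        for epoch in range(2):  # Epoch count is illustrative.
            # With a DistributedSampler, set the epoch so shuffling varies per epoch.
            if ray.train.get_context().get_world_size() > 1:
                data_loader.sampler.set_epoch(epoch)
            for batch in data_loader:
                ...  # Training step goes here.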

.. tab-item:: LightningDataModule

The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here.
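
As a rough sketch, assuming hypothetical ``MyLightningModule`` and ``MyDataModule`` classes built on Torch ``Dataset``\s and ``DataLoader``\s:

.. code-block:: python

    import lightning.pytorch as pl
    from ray.train.lightning import (
        RayDDPStrategy,
        RayLightningEnvironment,
        prepare_trainer,
    )

    def train_loop_per_worker():
        # The DataModule wraps ordinary Torch Datasets and DataLoaders.
        datamodule = MyDataModule()
        trainer = pl.Trainer(
            accelerator="auto",
            devices="auto",
            strategy=RayDDPStrategy(),
            plugins=[RayLightningEnvironment()],
        )
        trainer = prepare_trainer(trainer)
        trainer.fit(MyLightningModule(), datamodule=datamodule)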

.. tab-item:: Hugging Face Dataset

**Option 1 (with Ray Data):**

1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
5. Pass the iterable dataset to ``transformers.Trainer`` during initialization.
6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility.
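
A condensed sketch of these six steps; the dataset (assumed already tokenized), model, and training arguments are illustrative assumptions:

.. code-block:: python

    import ray
    import ray.train
    import transformers
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.train.huggingface.transformers import prepare_trainer

    # Step 1: Convert a (hypothetical, pre-tokenized) Hugging Face Dataset.
    ray_ds = ray.data.from_huggingface(hf_dataset)

    def train_loop_per_worker():
        # Step 3: Access this worker's shard of the dataset.
        shard = ray.train.get_dataset_shard("train")
        # Step 4: Create an iterable dataset of Torch-formatted batches.
        train_iterable = shard.iter_torch_batches(batch_size=16)
        # Step 5: Pass the iterable dataset when initializing transformers.Trainer.
        # max_steps is needed because the iterable has no known length.
        args = transformers.TrainingArguments(output_dir="/tmp/out", max_steps=100)
        hf_trainer = transformers.Trainer(
            model=model,  # A placeholder; substitute your own model.
            args=args,
            train_dataset=train_iterable,
        )
        # Step 6: Wrap the trainer to coordinate with Ray Train.
        hf_trainer = prepare_trainer(hf_trainer)
        hf_trainer.train()

    # Step 2: Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
    trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": ray_ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()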

**Option 2 (without Ray Data):**

1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization.
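
A minimal sketch of this option, with an illustrative public dataset and a placeholder model:

.. code-block:: python

    import datasets
    import transformers

    def train_loop_per_worker():
        # Step 1: Load the Hugging Face Dataset directly on each worker.
        hf_ds = datasets.load_dataset("yelp_review_full", split="train")
        # ... tokenize hf_ds as required by your model ...

        # Step 2: Pass the Hugging Face Dataset to the Trainer at initialization.
        hf_trainer = transformers.Trainer(
            model=model,  # A placeholder; substitute your own model.
            args=transformers.TrainingArguments(output_dir="/tmp/out"),
            train_dataset=hf_ds,
        )
        hf_trainer.train()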

.. tip::

