diff --git a/.github/styles/Vocab/Train/accept.txt b/.github/styles/Vocab/Train/accept.txt index d832f7f80e7ce..38f7eed079981 100644 --- a/.github/styles/Vocab/Train/accept.txt +++ b/.github/styles/Vocab/Train/accept.txt @@ -1,5 +1,5 @@ Horovod -Hugging Face +hyperparameters? Keras LightGBM PyTorch diff --git a/doc/source/train/deepspeed.rst b/doc/source/train/deepspeed.rst index b9e8e396c5e9c..704c6a2b48ef2 100644 --- a/doc/source/train/deepspeed.rst +++ b/doc/source/train/deepspeed.rst @@ -1,11 +1,14 @@ .. _train-deepspeed: -Training with DeepSpeed -======================= +Get Started with DeepSpeed +========================== The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed `_ training across a distributed Ray cluster. -All you need to do is run your existing training code with a TorchTrainer. You can expect the final code to look like this: +Code example +------------ + +You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this: .. code-block:: python @@ -74,12 +77,12 @@ Below is a simple example of ZeRO-3 training with DeepSpeed only. keep using `deepspeed.initialize() `_ as usual to prepare everything for distributed training. -Running DeepSpeed with other frameworks -------------------------------------------- +Run DeepSpeed with other frameworks +----------------------------------- Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train. -Please check the below examples for more details: +Check the below examples for more details: .. list-table:: :header-rows: 1 diff --git a/doc/source/train/distributed-tensorflow-keras.rst b/doc/source/train/distributed-tensorflow-keras.rst index 6febebf8f1821..0326930ef3039 100644 --- a/doc/source/train/distributed-tensorflow-keras.rst +++ b/doc/source/train/distributed-tensorflow-keras.rst @@ -1,9 +1,10 @@ .. _train-tensorflow-overview: -Distributed Tensorflow & Keras -============================== +Get Started with TensorFlow and Keras +===================================== + Ray Train's `TensorFlow `__ integration enables you -to scale your TensorFlow and Keras training loops to many machines and GPUs. +to scale your TensorFlow and Keras training functions to many machines and GPUs. On a technical level, Ray Train schedules your training workers and configures ``TF_CONFIG`` for you, allowing you to run @@ -11,8 +12,8 @@ your ``MultiWorkerMirroredStrategy`` training script. See `Distributed training with TensorFlow `_ for more information. -Most of the examples in this guide use Tensorflow with Keras, but -Ray Train also works with vanilla Tensorflow. +Most of the examples in this guide use TensorFlow with Keras, but +Ray Train also works with vanilla TensorFlow. Quickstart @@ -23,29 +24,27 @@ Quickstart :end-before: __tf_train_end__ -Updating your training function -------------------------------- +Update your training function +----------------------------- -First, you'll want to update your training function to support distributed +First, update your :ref:`training function ` to support distributed training. .. note:: The current TensorFlow implementation supports ``MultiWorkerMirroredStrategy`` (and ``MirroredStrategy``). If there are - other strategies you wish to see supported by Ray Train, please let us know - by submitting a `feature request on GitHub `_. 
+ other strategies you wish to see supported by Ray Train, submit a `feature request on GitHub `_. These instructions closely follow TensorFlow's `Multi-worker training with Keras `_ -tutorial. One key difference is that Ray Train will handle the environment +tutorial. One key difference is that Ray Train handles the environment variable set up for you. **Step 1:** Wrap your model in ``MultiWorkerMirroredStrategy``. The `MultiWorkerMirroredStrategy `_ -enables synchronous distributed training. The ``Model`` *must* be built and -compiled within the scope of the strategy. +enables synchronous distributed training. You *must* build and compile the ``Model`` within the scope of the strategy. .. code-block:: python @@ -56,9 +55,8 @@ compiled within the scope of the strategy. **Step 2:** Update your ``Dataset`` batch size to the *global* batch size. -The `batch `_ -will be split evenly across worker processes, so ``batch_size`` should be -set appropriately. +Set ``batch_size`` appropriately because `batch `_ +splits evenly across worker processes. .. code-block:: diff @@ -67,20 +65,20 @@ set appropriately. .. warning:: - Ray will not automatically set any environment variables or configuration - related to local parallelism / threading + Ray doesn't automatically set any environment variables or configuration + related to local parallelism or threading :ref:`aside from "OMP_NUM_THREADS" `. - If you desire greater control over TensorFlow threading, use + If you want greater control over TensorFlow threading, use the ``tf.config.threading`` module (eg. ``tf.config.threading.set_inter_op_parallelism_threads(num_cpus)``) at the beginning of your ``train_loop_per_worker`` function. -Creating a :class:`~ray.train.tensorflow.TensorflowTrainer` ------------------------------------------------------------ +Create a TensorflowTrainer +-------------------------- -``Trainer``\s are the primary Ray Train classes that are used to manage state and +``Trainer``\s are the primary Ray Train classes for managing state and execute training. For distributed Tensorflow, -we use a :class:`~ray.train.tensorflow.TensorflowTrainer` +use a :class:`~ray.train.tensorflow.TensorflowTrainer` that you can setup like this: .. code-block:: python @@ -109,38 +107,35 @@ To customize the backend setup, you can pass a ) -For more configurability, please reference the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API. +For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API. -Running your training function ------------------------------- +Run a training function +----------------------- With a distributed training function and a Ray Train ``Trainer``, you are now -ready to start training! +ready to start training. .. code-block:: python trainer.fit() -Data loading and preprocessing ------------------------------- -Tensorflow per default uses its own internal dataset sharding policy, as described +Load and preprocess data +------------------------ + +TensorFlow by default uses its own internal dataset sharding policy, as described `in the guide `__. -If your tensorflow dataset is compatible with distributed loading, you don't need to +If your TensorFlow dataset is compatible with distributed loading, you don't need to change anything. If you require more advanced preprocessing, you may want to consider using Ray Data -for distributed data ingest. - -There is a guide for using :ref:`Ray Data with Ray Train ` -in our PyTorch guide. 
Since Ray Data is an independent library, most concepts can -be directly applied to TensorFlow. +for distributed data ingest. See :ref:`Ray Data with Ray Train `. The main difference is that you may want to convert your Ray Data dataset shard to a TensorFlow dataset in your training function so that you can use the Keras API for model training. -`Here's a full example you can refer to `__ +`See this example `__ for distributed data loading. The relevant parts are: .. code-block:: python @@ -184,8 +179,8 @@ for distributed data loading. The relevant parts are: -Reporting results ------------------ +Report results +-------------- During training, the training loop should report intermediate results and checkpoints to Ray Train. This reporting logs the results to the console output and appends them to local log files. The logging also triggers :ref:`checkpoint bookkeeping `. @@ -203,30 +198,29 @@ The easiest way to report your results with Keras is by using the model.fit(dataset, callbacks=[ReportCheckpointCallback()]) -This callback will automatically forward all results and checkpoints from the -Keras training loop to Ray Train. +This callback automatically forwards all results and checkpoints from the +Keras training function to Ray Train. -Aggregating results -~~~~~~~~~~~~~~~~~~~ +Aggregate results +~~~~~~~~~~~~~~~~~ TensorFlow Keras automatically aggregates metrics from all workers. If you wish to have more control over that, consider implementing a `custom training loop `__. -Saving and loading checkpoints ------------------------------- +Save and load checkpoints +------------------------- -:class:`Checkpoints ` can be saved by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the -training function. This will cause the checkpoint state from the distributed -workers to be saved on the ``Trainer`` (where your python script is executed). +You can save :class:`Checkpoints ` by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the +training function. This call saves the checkpoint state from the distributed +workers on the ``Trainer``, where you executed your python script. -The latest saved checkpoint can be accessed through the ``checkpoint`` attribute of -the :py:class:`~ray.train.Result`, and the best saved checkpoints can be accessed by the ``best_checkpoints`` +You can access the latest saved checkpoint through the ``checkpoint`` attribute of +the :py:class:`~ray.train.Result`, and access the best saved checkpoints with the ``best_checkpoints`` attribute. -Concrete examples are provided to demonstrate how checkpoints (model weights but not models) are saved -appropriately in distributed training. +These concrete examples demonstrate how Ray Train appropriately saves checkpoints, model weights but not models, in distributed training. .. code-block:: python @@ -275,11 +269,11 @@ appropriately in distributed training. result = trainer.fit() print(result.checkpoint) -By default, checkpoints will be persisted to local disk in the :ref:`log +By default, checkpoints persist to local disk in the :ref:`log directory ` of each run. -Loading checkpoints -~~~~~~~~~~~~~~~~~~~ +Load checkpoints +~~~~~~~~~~~~~~~~ .. code-block:: python diff --git a/doc/source/train/distributed-xgboost-lightgbm.rst b/doc/source/train/distributed-xgboost-lightgbm.rst index e87fc2c9757bf..e444c36bf1d4f 100644 --- a/doc/source/train/distributed-xgboost-lightgbm.rst +++ b/doc/source/train/distributed-xgboost-lightgbm.rst @@ -1,7 +1,7 @@ .. 
_train-gbdt-guide: -Distributed XGBoost and LightGBM -================================ +Get Started with XGBoost and LightGBM +===================================== Ray Train has built-in support for XGBoost and LightGBM. @@ -25,7 +25,7 @@ Quickstart :end-before: __lightgbm_end__ -Basic Training with Tree-Based Models in Train +Basic training with tree-based models in Train ---------------------------------------------- Just as in the original `xgboost.train() `__ and @@ -53,24 +53,24 @@ training parameters are passed as the ``params`` dictionary. :end-before: __lightgbm_end__ -Ray-specific params are passed in through the trainer constructors. +Pass Ray-specific parameters through the trainer constructors. .. _train-gbdt-checkpoints: -Saving and Loading XGBoost and LightGBM Checkpoints --------------------------------------------------- +Save and load XGBoost and LightGBM checkpoints +---------------------------------------------- -When a new tree is trained on every boosting round, -it's possible to save a checkpoint to snapshot the training progress so far. +When you train a new tree on every boosting round, +you can save a checkpoint to snapshot the training progress so far. :class:`~ray.train.xgboost.XGBoostTrainer` and :class:`~ray.train.lightgbm.LightGBMTrainer` both implement checkpointing out of the box. These checkpoints can be loaded into memory using static methods :meth:`XGBoostTrainer.get_model ` and :meth:`LightGBMTrainer.get_model `. The only required change is to configure :class:`~ray.train.CheckpointConfig` to set -the checkpointing frequency. For example, the following configuration will -save a checkpoint on every boosting round and will only keep the latest checkpoint: +the checkpointing frequency. For example, the following configuration +saves a checkpoint on every boosting round and only keeps the latest checkpoint: .. literalinclude:: doc_code/key_concepts.py :language: python @@ -79,7 +79,7 @@ save a checkpoint on every boosting round and will only keep the latest checkpoi .. tip:: - Once checkpointing is enabled, you can follow :ref:`this guide ` + Once you enable checkpointing, you can follow :ref:`this guide ` to enable fault tolerance. @@ -90,15 +90,15 @@ The benefit of using Ray Train is that you can seamlessly scale up your training adjusting the :class:`ScalingConfig `. .. note:: - Ray Train does not modify or otherwise alter the working - of the underlying XGBoost / LightGBM distributed training algorithms. + Ray Train doesn't modify or otherwise alter the working + of the underlying XGBoost or LightGBM distributed training algorithms. Ray only provides orchestration, data ingest and fault tolerance. For more information on GBDT distributed training, refer to `XGBoost documentation `__ and `LightGBM documentation `__. -Here are some examples for common use-cases: +Following are some examples of common use-cases: .. tab-set:: @@ -138,44 +138,44 @@ Here are some examples for common use-cases: :start-after: __scaling_gpumulti_start__ :end-before: __scaling_gpumulti_end__ - Note that you just have to adjust the number of workers - everything else - will be handled by Ray automatically. + Note that you just have to adjust the number of workers. Ray handles everything else + automatically. -How many remote actors should I use? ------------------------------------- +How many remote actors should you use? +-------------------------------------- This depends on your workload and your cluster setup. 
Generally there is no inherent benefit of running more than one remote actor per node for CPU-only training. This is because -XGBoost can already leverage multiple CPUs via threading. +XGBoost can already leverage multiple CPUs with threading. -However, there are some cases when you should consider starting +However, in some cases, you should consider starting more than one actor per node: * For **multi GPU training**, each GPU should have a separate remote actor. Thus, if your machine has 24 CPUs and 4 GPUs, - you will want to start 4 remote actors with 6 CPUs and 1 GPU + you want to start 4 remote actors with 6 CPUs and 1 GPU each * In a **heterogeneous cluster** , you might want to find the `greatest common divisor `_ for the number of CPUs. - E.g. for a cluster with three nodes of 4, 8, and 12 CPUs, respectively, + For example, for a cluster with three nodes of 4, 8, and 12 CPUs, respectively, you should set the number of actors to 6 and the CPUs per actor to 4. How to use GPUs for training? ----------------------------- -Ray Train enables multi GPU training for XGBoost and LightGBM. The core backends -will automatically leverage NCCL2 for cross-device communication. -All you have to do is to start one actor per GPU and set GPU-compatible parameters, -e.g. XGBoost's ``tree_method`` to ``gpu_hist`` (see XGBoost -documentation for more details.) +Ray Train enables multi-GPU training for XGBoost and LightGBM. The core backends +automatically leverage NCCL2 for cross-device communication. +All you have to do is to start one actor per GPU and set GPU-compatible parameters. +For example, set XGBoost's ``tree_method`` to ``gpu_hist``. See the XGBoost +documentation for more details. -For instance, if you have 2 machines with 4 GPUs each, you will want +For instance, if you have 2 machines with 4 GPUs each, you want to start 8 workers, and set ``use_gpu=True``. There is usually -no benefit in allocating less (e.g. 0.5) or more than one GPU per actor. +no benefit in allocating less (for example, 0.5) or more than one GPU per actor. You should divide the CPUs evenly across actors per machine, so if your machines have 16 CPUs in addition to the 4 GPUs, each actor should have @@ -209,13 +209,13 @@ How to optimize XGBoost memory usage? XGBoost uses a compute-optimized datastructure, the ``DMatrix``, to hold training data. When converting a dataset to a ``DMatrix``, XGBoost creates intermediate copies and ends up -holding a complete copy of the full data. The data will be converted -into the local dataformat (on a 64 bit system these are 64 bit floats.) +holding a complete copy of the full data. XGBoost converts the data +into the local data format. On a 64-bit system the format is 64-bit floats. Depending on the system and original dataset dtype, this matrix can thus occupy more memory than the original dataset. The **peak memory usage** for CPU-based training is at least -**3x** the dataset size (assuming dtype ``float32`` on a 64bit system) +**3x** the dataset size, assuming dtype ``float32`` on a 64-bit system, plus about **400,000 KiB** for other resources, like operating system requirements and storing of intermediate results. @@ -229,7 +229,7 @@ results. Total size: 5,000,000 KiB * XGBoost DMatrix size: ~10,000,000 KiB -This dataset will fit exactly on this node for training. +This dataset fits exactly on this node for training. Note that the DMatrix size might be lower on a 32 bit system. @@ -239,10 +239,10 @@ Generally, the same memory requirements exist for GPU-based training. 
Additionally, the GPU must have enough memory to hold the dataset. -In the example above, the GPU must have at least +In the preceding example, the GPU must have at least 10,000,000 KiB (about 9.6 GiB) memory. However, -empirically we found that using a ``DeviceQuantileDMatrix`` -seems to show more peak GPU memory usage, possibly +empirical data shows that using a ``DeviceQuantileDMatrix`` +seems to result in more peak GPU memory usage, possibly for intermediate storage when loading data (about 10%). **Best practices** @@ -251,9 +251,9 @@ In order to reduce peak memory usage, consider the following suggestions: -* Store data as ``float32`` or less. More precision is often - not needed, and keeping data in a smaller format will - help reduce peak memory usage for initial data loading. +* Store data as ``float32`` or less. You often don't need + more precision, and keeping data in a smaller format + helps reduce peak memory usage for initial data loading. * Pass the ``dtype`` when loading data from CSV. Otherwise, - floating point values will be loaded as ``np.float64`` + floating point values are loaded as ``np.float64`` per default, increasing peak memory usage by 33%. diff --git a/doc/source/train/examples.rst b/doc/source/train/examples.rst index 3d4c3791b6263..718a9fcaeb4ef 100644 --- a/doc/source/train/examples.rst +++ b/doc/source/train/examples.rst @@ -3,7 +3,7 @@ Ray Train Examples ================== -.. Example .rst files should be organized in the same manner as the +.. Organize example .rst files in the same manner as the .py files in ray/python/ray/train/examples. Below are examples for using Ray Train with a variety of frameworks and use cases. @@ -18,19 +18,19 @@ Beginner * - Framework - Example * - PyTorch - - :ref:`Training an Fashion MNIST Image Classifier with PyTorch ` + - :ref:`Train a Fashion MNIST Image Classifier with PyTorch ` * - Lightning - - :ref:`Training an MNIST Image Classifier with Lightning ` + - :ref:`Train an MNIST Image Classifier with Lightning ` * - Transformers - - :ref:`Fine-tuning a Text Classifier on Yelp Reviews Dataset with HF Transformers ` + - :ref:`Fine-tune a Text Classifier on the Yelp Reviews Dataset with Hugging Face Transformers ` * - Accelerate - - :ref:`Distributed Data Parallel Training with HF Accelerate ` + - :ref:`Distributed Data Parallel Training with Hugging Face Accelerate ` * - DeepSpeed - - :ref:`Distributed Training with DeepSpeed ZeRO-3 ` + - :ref:`Train with DeepSpeed ZeRO-3 ` * - TensorFlow - - :ref:`TensorFlow MNIST Training Example ` + - :ref:`Train an MNIST Image Classifier with TensorFlow ` * - Horovod - - :ref:`End-to-end Horovod Training Example ` + - :ref:`Train with Horovod and PyTorch ` Intermediate ------------ @@ -42,11 +42,11 @@ Intermediate * - Framework - Example * - PyTorch - - `DreamBooth fine-tuning of Stable Diffusion with Ray Train `_ + - :ref:`Fine-tune Stable Diffusion with DreamBooth and Ray Train ` * - Lightning - - :ref:`Model Training with PyTorch Lightning and Ray Data ` + - :ref:`Train with PyTorch Lightning and Ray Data ` * - Transformers - - :ref:`Fine-tuning a Text Classifier on GLUE Benchmark with HF Transformers. 
` + - :ref:`Fine-tune a Text Classifier on GLUE Benchmark with Hugging Face Transformers ` Advanced -------- @@ -59,10 +59,10 @@ Advanced * - Framework - Example * - Accelerate, DeepSpeed - - `Fine-tuning Llama-2 series models with Deepspeed, Accelerate, and Ray Train TorchTrainer `_ + - `Fine-tune Llama-2 series models with DeepSpeed, Accelerate, and Ray Train TorchTrainer `_ * - Transformers, DeepSpeed - - :ref:`Fine-tuning GPT-J-6B with Ray Train and DeepSpeed ` + - :ref:`Fine-tune GPT-J-6B with Ray Train and DeepSpeed ` * - Lightning, DeepSpeed - - :ref:`Fine-tuning vicuna-13b with PyTorch Lightning and DeepSpeed ` + - :ref:`Fine-tune vicuna-13b with PyTorch Lightning and DeepSpeed ` * - Lightning - - :ref:`Fine-tuning dolly-v2-7b with PyTorch Lightning and FSDP ` + - :ref:`Fine-tune dolly-v2-7b with PyTorch Lightning and FSDP ` diff --git a/doc/source/train/examples/accelerate/accelerate_example.rst b/doc/source/train/examples/accelerate/accelerate_example.rst index 6205add5ac48a..140312ce90bfc 100644 --- a/doc/source/train/examples/accelerate/accelerate_example.rst +++ b/doc/source/train/examples/accelerate/accelerate_example.rst @@ -2,7 +2,23 @@ .. _accelerate_example: -Hugging Face Accelerate Distributed Training Example with Ray Train -=================================================================== +Distributed Training with Hugging Face Accelerate +================================================= + +This example runs distributed data parallel training +with Hugging Face Accelerate, Ray Train, and Ray Data. +It fine-tunes a BERT model and is adapted from +https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py + + +Code example +------------ .. literalinclude:: /../../python/ray/train/examples/accelerate/accelerate_torch_trainer.py + +See also -------- + +* :ref:`Get Started with Hugging Face Accelerate ` for a tutorial on using Ray Train and Hugging Face Accelerate + +* :ref:`Ray Train Examples ` for more use cases diff --git a/doc/source/train/examples/deepspeed/deepspeed_example.rst b/doc/source/train/examples/deepspeed/deepspeed_example.rst index b35311546dec9..15cab93e30ba9 100644 --- a/doc/source/train/examples/deepspeed/deepspeed_example.rst +++ b/doc/source/train/examples/deepspeed/deepspeed_example.rst @@ -2,7 +2,23 @@ .. _deepspeed_example: -DeepSpeed ZeRO-3 Distributed Training Example with Ray Train -============================================================ +Train with DeepSpeed ZeRO-3 and Ray Train +========================================= + +This is an intermediate example that shows how to do distributed training with DeepSpeed ZeRO-3 and Ray Train. +It demonstrates how to use :ref:`Ray Data ` with DeepSpeed ZeRO-3 and Ray Train. +If you just want to quickly convert your existing DeepSpeed scripts into Ray Train, see :ref:`Train with DeepSpeed `. + + +Code example
------------ .. literalinclude:: /../../python/ray/train/examples/deepspeed/deepspeed_torch_trainer.py + + +See also -------- + +* :ref:`Ray Train Examples ` for more use cases. + +* :ref:`Get Started with DeepSpeed ` for a tutorial. diff --git a/doc/source/train/examples/horovod/horovod_example.rst b/doc/source/train/examples/horovod/horovod_example.rst index 0593a275be095..5a88fc22c0616 100644 --- a/doc/source/train/examples/horovod/horovod_example.rst +++ b/doc/source/train/examples/horovod/horovod_example.rst @@ -2,7 +2,19 @@ .. 
_horovod_example: -Horovod Distributed Training Example with PyTorch & Ray Train -============================================================= +Run Horovod Distributed Training with PyTorch and Ray Train +=========================================================== + +This basic example demonstrates how to run Horovod distributed training with PyTorch and Ray Train. + +Code example +------------ .. literalinclude:: /../../python/ray/train/examples/horovod/horovod_example.py + + +See also +-------- + +* :ref:`Get Started with Horovod ` for a tutorial on using Horovod with Ray Train +* :ref:`Ray Train Examples ` for more use cases diff --git a/doc/source/train/examples/lightning/lightning_cola_advanced.ipynb b/doc/source/train/examples/lightning/lightning_cola_advanced.ipynb index 13b2697343559..4bd8e7c427b7c 100644 --- a/doc/source/train/examples/lightning/lightning_cola_advanced.ipynb +++ b/doc/source/train/examples/lightning/lightning_cola_advanced.ipynb @@ -11,13 +11,13 @@ "\n", ":::{note}\n", "\n", - "This is an intermediate example demonstrates how to use {ref}`Ray Dataset ` with PyTorch Lightning in Ray Train.\n", + "This is an intermediate example demonstrates how to use [Ray Data](data) with PyTorch Lightning in Ray Train.\n", "\n", - "If you just want to quickly convert your existing PyTorch Lightning scripts into Ray Train, you can refer to the {ref}`Lightning Quick Start Guide `.\n", + "If you just want to quickly convert your existing PyTorch Lightning scripts into Ray Train, you can refer to the [Lightning Quick Start Guide](train-pytorch-lightning).\n", "\n", ":::\n", "\n", - "In this demo, we will introduce how to fine-tune a text classifier on the [CoLA(The Corpus of Linguistic Acceptability)](https://nyu-mll.github.io/CoLA/) dataset using a pre-trained BERT model. In particular, we will:\n", + "This demo introduces how to fine-tune a text classifier on the [CoLA(The Corpus of Linguistic Acceptability)](https://nyu-mll.github.io/CoLA/) dataset using a pre-trained BERT model. In particular, it follows three steps:\n", "- Preprocess the CoLA dataset with Ray Data.\n", "- Define a training function with PyTorch Lightning.\n", "- Launch distributed training with Ray Train's TorchTrainer." @@ -58,7 +58,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's start by importing the needed libraries:" + "Start by importing the needed libraries:" ] }, { @@ -90,6 +90,11 @@ "from datasets import load_dataset, load_metric" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, { "attachments": {}, "cell_type": "markdown", @@ -97,7 +102,7 @@ "source": [ "## Pre-process CoLA Dataset\n", "\n", - "CoLA is a dataset for binary sentence classification with 10.6K training examples. First, we download the dataset and metrics using the Hugging Face datasets API, and create a Ray Dataset for each split accordingly." + "CoLA is a dataset for binary sentence classification with 10.6K training examples. First, download the dataset and metrics using the Hugging Face datasets API, and create a Ray Dataset for each split accordingly." ] }, { @@ -117,9 +122,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next, tokenize the input sentences and pad the ID sequence to length 128 using the `bert-base-uncased` tokenizer. The {meth}`map_batches ` will apply this preprocessing function on all data samples." + "Next, tokenize the input sentences and pad the ID sequence to length 128 using the `bert-base-uncased` tokenizer. 
The {meth}`map_batches ` applies this preprocessing function on all data samples." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, { "cell_type": "code", "execution_count": 5, @@ -148,9 +158,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Define a PyTorch Lightning Model\n", + "## Define a PyTorch Lightning model\n", "\n", - "You don't have to make any change of your `LightningModule` definition. Just copy and paste your code here:" + "You don't have to make any changes to your `LightningModule` definition. Just copy and paste your code here:" ] }, { @@ -214,9 +224,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Define your Training Function\n", + "## Define a training function\n", "\n", - "Define a function that includes all your lightning training logics. This function will be launched by {class}`TorchTrainer ` on each worker in parallel. \n" + "Define a [training function](train-overview-training-function) that includes all of your lightning training logic. {class}`TorchTrainer ` launches this function on each worker in parallel. \n" ] }, { @@ -284,11 +294,11 @@ "- {class}`~ray.train.lightning.RayTrainReportCallback`\n", "\n", "\n", - "To ingest Ray Data with Lightning Trainer, we need to take the following 3 steps:\n", + "To ingest Ray Data with Lightning Trainer, follow these three steps:\n", "\n", "- Feed the full Ray dataset to Ray `TorchTrainer` (details in the next section).\n", "- Use {meth}`ray.train.get_dataset_shard ` to fetch the sharded dataset on each worker.\n", - "- Use {meth}`ds.iter_torch_batches ` to create a Ray data Loader for Lightning Trainer.\n", + "- Use {meth}`ds.iter_torch_batches ` to create a Ray data loader for Lightning Trainer.\n", "\n", ":::{seealso}\n", "\n", @@ -318,11 +328,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Distributed training with Ray TorchTrainer\n", + "## Distributed training with Ray TorchTrainer\n", "\n", "Next, define a {class}`TorchTrainer ` to launch your training function on 4 GPU workers. \n", "\n", - "Here, you can pass the full Ray dataset to the `datasets` argument of ``TorchTrainer``. TorchTrainer automatically shards the datasets among multiple workers." + "You can pass the full Ray dataset to the `datasets` argument of ``TorchTrainer``. TorchTrainer automatically shards the datasets among multiple workers." ] }, { @@ -1050,7 +1060,7 @@ "metadata": {}, "source": [ ":::{note}\n", - "Note that we are using Ray Data for data ingestion for faster preprocessing here, but you can also continue to use the native `PyTorch DataLoader` or `LightningDataModule`. See {ref}`this example `. \n", + "Note that this examples uses Ray Data for data ingestion for faster preprocessing, but you can also continue to use the native `PyTorch DataLoader` or `LightningDataModule`. See {ref}`Train a Pytorch Lightning Image Classifier `. 
\n", "\n", ":::" ] }, @@ -1087,6 +1097,17 @@ "source": [ "result" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## See also\n", + "\n", + "* [Ray Train Examples](train-examples) for more use cases\n", + "\n", + "* [Ray Train User Guides](train-user-guides) for how-to guides" + ] } ], "metadata": { diff --git a/doc/source/train/examples/lightning/lightning_mnist_example.ipynb b/doc/source/train/examples/lightning/lightning_mnist_example.ipynb index 738bc4d47c523..508ad2eeda457 100644 --- a/doc/source/train/examples/lightning/lightning_mnist_example.ipynb +++ b/doc/source/train/examples/lightning/lightning_mnist_example.ipynb @@ -9,7 +9,7 @@ "\n", "# Train a Pytorch Lightning Image Classifier\n", "\n", - "This example introduces how to train a Pytorch Lightning Module using Ray Train {class}`TorchTrainer `. We will demonstrate how to train a basic neural network on the MNIST dataset with distributed data parallelism.\n" + "This example introduces how to train a Pytorch Lightning Module using Ray Train {class}`TorchTrainer `. It demonstrates how to train a basic neural network on the MNIST dataset with distributed data parallelism.\n" ] }, { @@ -49,9 +49,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Prepare Dataset and Module\n", + "## Prepare a dataset and module\n", "\n", - "The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can keep using them without any changes with Ray Train. " + "The Pytorch Lightning Trainer takes either `torch.utils.data.DataLoader` or `pl.LightningDataModule` as data inputs. You can continue using them without any changes with Ray Train. " ] }, { @@ -75,7 +75,7 @@ " self.data_dir, train=True, download=True, transform=self.transform\n", " )\n", "\n", - " # split data into train and val sets\n", + " # Split data into train and val sets\n", " self.mnist_train, self.mnist_val = random_split(mnist, [55000, 5000])\n", "\n", " def train_dataloader(self):\n", @@ -175,7 +175,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You don't need to make any change to the definition of PyTorch Lightning model and datamodule." + "You don't need to modify the definition of the PyTorch Lightning model or datamodule." ] }, { @@ -183,18 +183,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Define the Training Loop\n", + "## Define a training function\n", "\n", - "Here we define a training loop for each worker. Compare with the original PyTorch Lightning code, there are 3 main differences:\n", + "This code defines a {ref}`training function ` for each worker. Comparing the training function with the original PyTorch Lightning code, notice three main differences:\n", "\n", "- Distributed strategy: Use {class}`RayDDPStrategy `.\n", "- Cluster environment: Use {class}`RayLightningEnvironment `.\n", - "- Parallel devices: Always sets to `devices=\"auto\"` to use all available devices configured by ``TorchTrainer``.\n", + "- Parallel devices: Always set to `devices=\"auto\"` to use all available devices configured by ``TorchTrainer``.\n", "\n", - "Please refer to {ref}`Getting Started with PyTorch Lightning `.\n", + "See {ref}`Getting Started with PyTorch Lightning ` for more information.\n", "\n", "\n", - "For checkpoint reportining, Ray Train provides a minimal {class}`RayTrainReportCallback ` that reports metrics and checkpoint on each train epoch end. 
For more complex checkpoint logic, please implement custom callbacks as described in {ref}`Saving and Loading Checkpoint ` user guide." + "For checkpoint reporting, Ray Train provides a minimal {class}`RayTrainReportCallback ` class that reports metrics and checkpoints at the end of each train epoch. For more complex checkpoint logic, implement custom callbacks. See {ref}`Saving and Loading Checkpoint `." ] }, { @@ -203,7 +203,7 @@ "metadata": {}, "outputs": [], "source": [ - "use_gpu = True # Set it to False if you want to run without GPUs\n", + "use_gpu = True # Set to False if you want to run without GPUs\n", "num_workers = 4" ] }, @@ -804,7 +804,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Check the Training Results and Checkpoints" + "## Check training results and checkpoints" ] }, { @@ -857,9 +857,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As we can see, three checkpoints(`checkpoint_000007`, `checkpoint_000008`, `checkpoint_000009`) have been saved in the trial directory. To retrieve the latest checkpoint from the fit results and load it back into the model, follow these steps.\n", + "Ray Train saved three checkpoints (`checkpoint_000007`, `checkpoint_000008`, `checkpoint_000009`) in the trial directory. The following code retrieves the latest checkpoint from the fit results and loads it back into the model.\n", "\n", - "If you lost the in-memory result object, you can also restore the model from the checkpoint file. Here the checkpoint path is: `/tmp/ray_results/ptl-mnist-example/TorchTrainer_eb925_00000_0_2023-08-07_23-15-06/checkpoint_000009/checkpoint.ckpt`." + "If you lost the in-memory result object, you can restore the model from the checkpoint file. The checkpoint path is: `/tmp/ray_results/ptl-mnist-example/TorchTrainer_eb925_00000_0_2023-08-07_23-15-06/checkpoint_000009/checkpoint.ckpt`." ] }, { @@ -903,6 +903,17 @@ "\n", "best_model" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## See also\n", + "\n", + "* {ref}`Getting Started with PyTorch Lightning ` for a tutorial on using Ray Train and PyTorch Lightning \n", + "\n", + "* {ref}`Ray Train Examples ` for more use cases" + ] } ], "metadata": { diff --git a/doc/source/train/examples/pytorch/dreambooth_finetuning.rst b/doc/source/train/examples/pytorch/dreambooth_finetuning.rst index 96ea9a5a1f9d8..7db3e96c82124 100644 --- a/doc/source/train/examples/pytorch/dreambooth_finetuning.rst +++ b/doc/source/train/examples/pytorch/dreambooth_finetuning.rst @@ -1,50 +1,55 @@ :orphan: -Fine-tuning DreamBooth with Ray Train -===================================== +.. _torch_finetune_dreambooth_ex: + +Fine-tune Stable Diffusion with DreamBooth and Ray Train +=========================================================== + +This is an intermediate example that shows how to do DreamBooth fine-tuning of a Stable Diffusion model using Ray Train. +It demonstrates how to use :ref:`Ray Data ` with PyTorch in Ray Train. + -This example shows how to do DreamBooth fine-tuning of a Stable Diffusion model using Ray Train. See the original `DreamBooth project homepage `_ for more details on what this fine-tuning method achieves. .. image:: https://dreambooth.github.io/DreamBooth_files/high_level.png :target: https://dreambooth.github.io :alt: DreamBooth fine-tuning overview -This example is built on top of `this HuggingFace 🤗 tutorial `_. -See the HuggingFace tutorial for useful explanations and suggestions on hyperparameters. 
+This example builds on `this Hugging Face 🤗 tutorial `_. +See the Hugging Face tutorial for useful explanations and suggestions on hyperparameters. **Adapting this example to Ray Train allows you to easily scale up the fine-tuning to an arbitrary number of distributed training workers.** **Compute requirements:** -* Because of the large model sizes, you'll need a machine with at least 1 A10G GPU. -* Each training worker uses 1 GPU. You can use multiple GPUs/workers to leverage data-parallel training to speed up training time. +* Because of the large model sizes, you need a machine with at least 1 A10G GPU. +* Each training worker uses 1 GPU. You can use multiple GPUs or workers to leverage data-parallel training to speed up training time. -This example fine-tunes both the ``text_encoder`` and ``unet`` models used in the Stable Diffusion process, with respect to a prior preserving loss. +This example fine-tunes both the ``text_encoder`` and ``unet`` models used in the stable diffusion process, with respect to a prior preserving loss. .. image:: /templates/05_dreambooth_finetuning/dreambooth/images/dreambooth_example.png :alt: DreamBooth overview -The full code repository can be found here: `https://github.com/ray-project/ray/tree/master/doc/source/templates/05_dreambooth_finetuning `_ +Find the full code repository at `https://github.com/ray-project/ray/tree/master/doc/source/templates/05_dreambooth_finetuning `_ How it works ------------ -This example leverages Ray Data for data loading and Ray Train for distributed training. +This example uses Ray Data for data loading and Ray Train for distributed training. Data loading ^^^^^^^^^^^^ .. note:: - You can find the latest version of the code here: `dataset.py `_ + Find the latest version of the code at `dataset.py `_ The latest version might differ slightly from the code presented here. -We use Ray Data for data loading. The code has three interesting parts. +Use Ray Data for data loading. The code has three interesting parts. -First, we load two datasets using :func:`ray.data.read_images`: +First, load two datasets using :func:`ray.data.read_images`: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python @@ -52,7 +57,7 @@ First, we load two datasets using :func:`ray.data.read_images`: :end-at: class_dataset = read :dedent: 4 -Then, we tokenize the prompt that generated these images: +Then, tokenize the prompt that generated these images: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python @@ -61,7 +66,7 @@ Then, we tokenize the prompt that generated these images: :dedent: 4 -And lastly, we apply a ``torchvision`` preprocessing pipeline to the images: +And lastly, apply a ``torchvision`` preprocessing pipeline to the images: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python @@ -69,8 +74,7 @@ And lastly, we apply a ``torchvision`` preprocessing pipeline to the images: :end-before: END: image preprocessing :dedent: 4 -We apply all of this in final step: - +Apply all three parts in a final step: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/dataset.py :language: python @@ -79,29 +83,28 @@ We apply all of this in final step: :dedent: 4 - Distributed training ^^^^^^^^^^^^^^^^^^^^ .. note:: - You can find the latest version of the code here: `train.py `_ + Find the latest version of the code at `train.py `_ The latest version might differ slightly from the code presented here. 
-The central part of the training code is the *training function*. This function accepts a configuration dict that contains the hyperparameters. It then defines a regular PyTorch training loop. +The central part of the training code is the :ref:`training function `. This function accepts a configuration dict that contains the hyperparameters. It then defines a regular PyTorch training loop. -There are only a few locations where we interact with the Ray Train API. We marked them with in-line comments in the snippet below. +You interact with the Ray Train API in only a few locations, marked with in-line comments in the snippet below. -Remember that we want to do data-parallel training for all our models. +Remember that you want to do data-parallel training for all the models. -#. We load the data shard for each worker with session.get_dataset_shard("train") -#. We iterate over the dataset with train_dataset.iter_torch_batches() -#. We report results to Ray Train with session.report(results) +#. Load the data shard for each worker with ``session.get_dataset_shard("train")`` +#. Iterate over the dataset with ``train_dataset.iter_torch_batches()`` +#. Report results to Ray Train with ``session.report(results)`` -The code was compacted for brevity. The `full code `_ is more thoroughly annotated. +The code is compacted for brevity. The `full code `_ is more thoroughly annotated. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth/train.py @@ -109,7 +112,7 @@ The code was compacted for brevity. The `full code ``. +To achieve this, choose a non-word as an identifier, such as ``unqtkn``. When fine-tuning the model with this subject, you teach the model that the prompt is ``A photo of a unqtkn ``. -After fine-tuning we can run inference with this specific prompt. -For instance: ``A photo of a unqtkn `` will create an image of our subject. -Similarly, ``A photo of a unqtkn at the beach`` will create an image of our subject at the beach. +After fine-tuning, you can run inference with this specific prompt. +For instance: ``A photo of a unqtkn `` creates an image of the subject. +Similarly, ``A photo of a unqtkn at the beach`` creates an image of the subject at the beach. Step 0: Preparation ^^^^^^^^^^^^^^^^^^^ @@ -216,7 +219,7 @@ Prepare some directories and environment variables. Step 1: Download the pre-trained model ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Download and cache a pre-trained Stable-Diffusion model locally. +Download and cache a pre-trained Stable Diffusion model locally. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash @@ -228,10 +231,10 @@ You can access the downloaded model checkpoint at the ``$ORIG_MODEL_PATH``. Step 2: Supply images of your subject ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Use one of the sample datasets (dog, lego car), or provide your own directory +Use one of the sample datasets, like `dog` or `lego car`, or provide your own directory of images, and specify the directory with the ``$INSTANCE_DIR`` environment variable. -Then, we copy these images to ``$IMAGES_OWN_DIR``. +Then, copy these images to ``$IMAGES_OWN_DIR``. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash @@ -247,7 +250,7 @@ Step 3: Create the regularization images ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Create a regularization image set for a class of subjects using the pre-trained -Stable Diffusion model. This is used to regularize the fine-tuning by ensuring that 
This regularization set ensures that the model still produces decent images for random images of the same class, rather than just optimize for producing good images of the subject. @@ -256,12 +259,12 @@ rather than just optimize for producing good images of the subject. :start-after: Step 3: START :end-before: Step 3: END -We use Ray Data to do batch inference with 4 workers, so more images can be generated in parallel. +Use Ray Data to do batch inference with 4 workers, to generate more images in parallel. Step 4: Fine-tune the model ^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Save a few (4 to 5) images of the subject being fine-tuned +Save a few, like 4 to 5, images of the subject being fine-tuned in a local directory. Then launch the training job with: .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh @@ -269,21 +272,28 @@ in a local directory. Then launch the training job with: :start-after: Step 4: START :end-before: Step 4: END -Step 5: Generate images of our subject +Step 5: Generate images of the subject ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Try your model with the same command line as Step 2, but point -to your own model this time! +to your own model this time. .. literalinclude:: /templates/05_dreambooth_finetuning/dreambooth_run.sh :language: bash :start-after: Step 5: START :end-before: Step 5: END -Next, try replacing the prompt with something more interesting! +Next, try replacing the prompt with something more interesting. For example, for the dog subject, you can try: - "photo of a unqtkn dog in a bucket" - "photo of a unqtkn dog sleeping" -- "photo of a unqtkn dog in a doghouse" \ No newline at end of file +- "photo of a unqtkn dog in a doghouse" + +See also +-------- + +* :ref:`Ray Train Examples ` for more use cases + +* :ref:`Ray Train User Guides ` for how-to guides \ No newline at end of file diff --git a/doc/source/train/examples/pytorch/torch_fashion_mnist_example.rst b/doc/source/train/examples/pytorch/torch_fashion_mnist_example.rst index 2955441efaf08..860fb745d864e 100644 --- a/doc/source/train/examples/pytorch/torch_fashion_mnist_example.rst +++ b/doc/source/train/examples/pytorch/torch_fashion_mnist_example.rst @@ -2,7 +2,19 @@ .. _torch_fashion_mnist_ex: -Running Distributed Training of a PyTorch Model on Fashion MNIST with Ray Train -=============================================================================== +Train a PyTorch Model on Fashion MNIST +====================================== + +This example runs distributed training of a PyTorch model on Fashion MNIST with Ray Train. + +Code example +------------ .. literalinclude:: /../../python/ray/train/examples/pytorch/torch_fashion_mnist_example.py + +See also +-------- + +* :ref:`Get Started with PyTorch ` for a tutorial on using Ray Train and PyTorch + +* :ref:`Ray Train Examples ` for more use cases diff --git a/doc/source/train/examples/tf/tensorflow_mnist_example.rst b/doc/source/train/examples/tf/tensorflow_mnist_example.rst index 0a03a9462d761..1c7a04a97d016 100644 --- a/doc/source/train/examples/tf/tensorflow_mnist_example.rst +++ b/doc/source/train/examples/tf/tensorflow_mnist_example.rst @@ -2,7 +2,20 @@ .. _tensorflow_mnist_example: -Running Distributed Training of a TensorFlow Model on MNIST with Ray Train -========================================================================== +Training with TensorFlow and Ray Train +====================================== + +This basic example runs distributed training of a TensorFlow model on MNIST with Ray Train. 
+ +Code example ------------ .. literalinclude:: /../../python/ray/train/examples/tf/tensorflow_mnist_example.py + + +See also -------- + +* :ref:`Ray Train Examples ` for more use cases. + +* :ref:`Get Started with TensorFlow and Keras ` for a tutorial. \ No newline at end of file diff --git a/doc/source/train/examples/transformers/huggingface_text_classification.ipynb b/doc/source/train/examples/transformers/huggingface_text_classification.ipynb index 718df6741890f..3243cab40722f 100644 --- a/doc/source/train/examples/transformers/huggingface_text_classification.ipynb +++ b/doc/source/train/examples/transformers/huggingface_text_classification.ipynb @@ -6,7 +6,7 @@ "source": [ "(train_transformers_glue_example)=\n", "\n", - "# Fine-tune a 🤗 Transformers model" + "# Fine-tune a Hugging Face Transformers Model" ] }, { @@ -15,9 +15,9 @@ "id": "VaFMt6AIhYbK" }, "source": [ - "This notebook is based on [an official 🤗 notebook - \"How to fine-tune a model on text classification\"](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb). The main aim of this notebook is to show the process of conversion from vanilla 🤗 to Ray Train without changing the training logic unless necessary.\n", + "This notebook is based on an official Hugging Face example, [How to fine-tune a model on text classification](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb). This notebook shows the process of conversion from vanilla Hugging Face Transformers to Ray Train without changing the training logic unless necessary.\n", "\n", - "In this notebook, we will:\n", + "This notebook consists of the following steps:\n", "1. [Set up Ray](#setup)\n", "2. [Load the dataset](#load)\n", "3. [Preprocess the dataset with Ray Data](#preprocess)\n", @@ -31,7 +31,7 @@ "id": "sQbdfyWQhYbO" }, "source": [ - "Uncomment and run the following line in order to install all the necessary dependencies (this notebook is being tested with `transformers==4.19.1`):" + "Uncomment and run the following line to install all the necessary dependencies. (This notebook was tested with `transformers==4.19.1`.):" ] }, { @@ -60,7 +60,7 @@ "id": "LRdL3kWBhYbQ" }, "source": [ - "We will use `ray.init()` to initialize a local cluster. By default, this cluster will be comprised of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster." + "Use `ray.init()` to initialize a local cluster. By default, this cluster contains only the machine you are running this notebook on. You can also run this notebook on an [Anyscale](https://www.anyscale.com/) cluster." ] }, { @@ -88,7 +88,7 @@ "id": "oJiSdWy2hYbR" }, "source": [ - "We can check the resources our cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on the said machine." + "Check the resources your cluster is composed of. If you are running this notebook on your local machine or Google Colab, you should see the number of CPU cores and GPUs available on your machine." ] }, { @@ -127,9 +127,9 @@ "id": "uS6oeJELhYbS" }, "source": [ - "In this notebook, we will see how to fine-tune a [🤗 Transformers](https://github.com/huggingface/transformers) model for one of the text classification task of the [GLUE Benchmark](https://gluebenchmark.com/). 
We will be running the training using Ray Train.\n", + "This notebook fine-tunes a [HF Transformers](https://github.com/huggingface/transformers) model for one of the text classification task of the [GLUE Benchmark](https://gluebenchmark.com/). It runs the training using Ray Train.\n", "\n", - "You can change those two variables to control whether the training (which we will get to later) uses CPUs or GPUs, and how many workers should be spawned. Each worker will claim one CPU or GPU. Make sure not to request more resources than the resources present! By default, we will run the training with one GPU worker." + "You can change these two variables to control whether the training, which happens later, uses CPUs or GPUs, and how many workers to spawn. Each worker claims one CPU or GPU. Make sure to not request more resources than the resources present. By default, the training runs with one GPU worker." ] }, { @@ -142,7 +142,7 @@ "outputs": [], "source": [ "use_gpu = True # set this to False to run on CPUs\n", - "num_workers = 1 # set this to number of GPUs/CPUs you want to use" + "num_workers = 1 # set this to number of GPUs or CPUs you want to use" ] }, { @@ -151,7 +151,7 @@ "id": "rEJBSTyZIrIb" }, "source": [ - "## Fine-tuning a model on a text classification task" + "## Fine-tune a model on a text classification task" ] }, { @@ -160,9 +160,9 @@ "id": "kTCFado4IrIc" }, "source": [ - "The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. If you would like to learn more, refer to the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb).\n", + "The GLUE Benchmark is a group of nine classification tasks on sentences or pairs of sentences. To learn more, see the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb).\n", "\n", - "Each task is named by its acronym, with `mnli-mm` standing for the mismatched version of MNLI (so same training set as `mnli` but different validation and test sets):" + "Each task has a name that is its acronym, with `mnli-mm` to indicate that it is a mismatched version of MNLI. Each one has the same training set as `mnli` but different validation and test sets." ] }, { @@ -194,7 +194,7 @@ "id": "4RRkXuteIrIh" }, "source": [ - "This notebook is built to run on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:" + "This notebook runs on any of the tasks in the list above, with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a classification head. Depending on the model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set these three parameters, and the rest of the notebook should run smoothly:" ] }, { @@ -226,11 +226,11 @@ "id": "W7QYTpxXIrIl" }, "source": [ - "We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). 
This can be easily done with the functions `load_dataset` and `load_metric`.\n", + "Use the [HF Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric to use for evaluation and to compare your model to the benchmark. You can do this comparison easily with the `load_dataset` and `load_metric` functions.\n", "\n", - "Apart from `mnli-mm` being a special code, we can directly pass our task name to those functions.\n", + "Apart from `mnli-mm` being special code, you can directly pass the task name to those functions.\n", "\n", - "We will run the normal 🤗 Datasets code to load the dataset from the Hub." + "Run the normal HF Datasets code to load the dataset from the Hub." ] }, { @@ -281,7 +281,7 @@ "id": "RzfPtOMoIrIu" }, "source": [ - "The `dataset` object itself is [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation, and test set (with more keys for the mismatched validation and test set in the special case of `mnli`)." + "The `dataset` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key for the training, validation, and test set, with more keys for the mismatched validation and test set in the special case of `mnli`." ] }, { @@ -299,12 +299,12 @@ "id": "YVx71GdAIrJH" }, "source": [ - "Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers' `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.\n", + "Before you can feed these texts to the model, you need to preprocess them. Preprocess them with a HF Transformers' `Tokenizer`, which tokenizes the inputs, including converting the tokens to their corresponding IDs in the pretrained vocabulary, and puts them in a format the model expects. It also generates the other inputs that the model requires.\n", "\n", - "To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure that:\n", + "To do all of this preprocessing, instantiate your tokenizer with the `AutoTokenizer.from_pretrained` method, which ensures that you:\n", "\n", - "- we get a tokenizer that corresponds to the model architecture we want to use,\n", - "- we download the vocabulary used when pretraining this specific checkpoint." + "- Get a tokenizer that corresponds to the model architecture you want to use.\n", + "- Download the vocabulary used when pretraining this specific checkpoint." ] }, { @@ -332,7 +332,7 @@ "id": "Vl6IidfdIrJK" }, "source": [ - "We pass along `use_fast=True` to the call above to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if you got an error with the previous call, remove that argument." + "Pass `use_fast=True` to the preceding call to use one of the fast tokenizers, backed by Rust, from the HF Tokenizers library. These fast tokenizers are available for almost all models, but if you get an error with the previous call, remove the argument." ] }, { @@ -341,7 +341,7 @@ "id": "qo_0B1M2IrJM" }, "source": [ - "To preprocess our dataset, we will thus need the names of the columns containing the sentence(s). 
The following dictionary keeps track of the correspondence task to column names:" + "To preprocess the dataset, you need the names of the columns containing the sentence(s). The following dictionary keeps track of the correspondence between tasks and column names:" ] }, { @@ -373,7 +373,7 @@ "id": "256fOuzjhYbY" }, "source": [ - "Instead of using 🤗 Dataset objects directly, we will convert them to [Ray Data](data). Both are backed by Arrow tables, so the conversion is straightforward. We will use the built-in {meth}`~ray.data.from_huggingface` function." + "Instead of using HF Dataset objects directly, convert them to [Ray Data](data). Arrow tables back both of them, so the conversion is straightforward. Use the built-in {meth}`~ray.data.from_huggingface` function." ] }, { @@ -425,7 +425,7 @@ "id": "2C0hcmp9IrJQ" }, "source": [ - "We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the argument `truncation=True`. This will ensure that an input longer than what the model selected can handle will be truncated and pad to the longest sequence in the batch." + "You can then write the function that preprocesses the samples. Feed them to the `tokenizer` with the argument `truncation=True`. This argument ensures that the `tokenizer` truncates any input longer than what the selected model can handle and pads to the longest sequence in the batch." ] }, { @@ -484,11 +484,11 @@ "id": "FBiW8UpKIrJW" }, "source": [ - "Now that our data is ready, we can download the pretrained model and fine-tune it.\n", + "Now that the data is ready, download the pretrained model and fine-tune it.\n", "\n", - "Since all of our tasks involve sentence classification, we will use the `AutoModelForSequenceClassification` class. We will not delve into the specifics of each individual training component. For more information, see the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb). The tokenizer used is the same one we used to encode the dataset previously.\n", + "Because all of the tasks involve sentence classification, use the `AutoModelForSequenceClassification` class. For more specifics about each individual training component, see the [original notebook](https://github.com/huggingface/notebooks/blob/6ca682955173cc9d36ffa431ddda505a048cbe80/examples/text_classification.ipynb). The tokenizer is the same one used to encode the dataset earlier in this notebook.\n", "\n", - "The main difference when using Ray Train is that we need to define our training logic as a function (`train_func`). This function will be passed to the {class}`~ray.train.torch.TorchTrainer` and will run on every Ray worker. The training will then proceed using PyTorch DDP.\n", + "The main difference when using Ray Train is that you need to define the training logic as a function (`train_func`). You pass this [training function](train-overview-training-function) to the {class}`~ray.train.torch.TorchTrainer` to run on every Ray worker. The training then proceeds using PyTorch DDP.\n", "\n", "\n", "```{note}\n", @@ -622,7 +622,7 @@ "id": "CdzABDVcIrJg" }, "source": [ - "With our `train_func` complete, we can now instantiate the {class}`~ray.train.torch.TorchTrainer`. Aside from the function, we set the `scaling_config`, controlling the amount of workers and resources used, and the `datasets` we will use for training and evaluation." 
+ "With your `train_func` complete, you can now instantiate the {class}`~ray.train.torch.TorchTrainer`. Aside from calling the function, set the `scaling_config`, which controls the amount of workers and resources used, and the `datasets` to use for training and evaluation." ] }, { @@ -660,7 +660,7 @@ "id": "XvS136zKhYba" }, "source": [ - "Finally, we call the `fit` method to start training with Ray Train. We will save the `Result` object to a variable so we can access metrics and checkpoints." + "Finally, call the `fit` method to start training with Ray Train. Save the `Result` object to a variable so you can access metrics and checkpoints." ] }, { @@ -1061,9 +1061,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If we would like to tune any hyperparameters of the model, we can do so by simply passing our `TorchTrainer` into a `Tuner` and defining the search space.\n", + "To tune any hyperparameters of the model, pass your `TorchTrainer` into a `Tuner` and define the search space.\n", "\n", - "We can also take advantage of the advanced search algorithms and schedulers provided by Ray Tune. In this example, we will use an `ASHAScheduler` to aggresively terminate underperforming trials." + "You can also take advantage of the advanced search algorithms and schedulers from Ray Tune. This example uses an `ASHAScheduler` to aggresively terminate underperforming trials." ] }, { @@ -1744,7 +1744,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can view the results of the tuning run as a dataframe, and obtain the best result." + "View the results of the tuning run as a dataframe, and find the best result." ] }, { @@ -1969,11 +1969,11 @@ "id": "mS8PId_NhYbb" }, "source": [ - "To be able to share your model with the community, there are a few more steps to follow.\n", + "To share the model with the community, a few more steps follow.\n", "\n", - "We have conducted the training on the Ray cluster, but share the model from the local enviroment - this will allow us to easily authenticate.\n", + "You conducted the training on the Ray cluster, but want share the model from the local enviroment. This configuration allows you to easily authenticate.\n", "\n", - "First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!) then execute the following cell and input your username and password:" + "First, store your authentication token from the Hugging Face website. Sign up [here](https://huggingface.co/join) if you haven't already. Then execute the following cell and input your username and password:" ] }, { @@ -2021,7 +2021,7 @@ "id": "5fr6E0e8hYbb" }, "source": [ - "Now, load the model and tokenizer locally, and recreate the 🤗 Transformers `Trainer`:" + "Now, load the model and tokenizer locally, and recreate the HF Transformers `Trainer`:" ] }, { @@ -2047,9 +2047,16 @@ "id": "tgV2xKfFhYbc" }, "source": [ - "You can now upload the result of the training to the Hub, just execute this instruction:" + "You can now upload the result of the training to the Hub. 
Execute this instruction:" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "code", "execution_count": null, @@ -2070,7 +2077,7 @@ "id": "UL-Boc4dhYbc" }, "source": [ - "You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `\"your-username/the-name-you-picked\"` so for instance:\n", + "You can now share this model. Others can load it with the identifier `\"your-username/the-name-you-picked\"`. For example:\n", "\n", "```python\n", "from transformers import AutoModelForSequenceClassification\n", @@ -2083,9 +2090,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Next steps\n", + "## See also\n", "\n", - "- {ref}`End-to-end: Offline Batch Inference `" + "* {ref}`Ray Train Examples ` for more use cases\n", + "* {ref}`Ray Train User Guides ` for how-to guides\n" ] } ], diff --git a/doc/source/train/examples/transformers/transformers_torch_trainer_basic.rst b/doc/source/train/examples/transformers/transformers_torch_trainer_basic.rst index d4bb78290cf5b..c7259be27f33e 100644 --- a/doc/source/train/examples/transformers/transformers_torch_trainer_basic.rst +++ b/doc/source/train/examples/transformers/transformers_torch_trainer_basic.rst @@ -2,7 +2,20 @@ .. _transformers_torch_trainer_basic_example : -Ray Train Basic Example for HuggingFace Transformers -==================================================== +Fine-tune a Text Classifier with Hugging Face Transformers +========================================================== + +This basic example of distributed training with Ray Train and Hugging Face (HF) Transformers +fine-tunes a text classifier on the Yelp review dataset using HF Transformers and Ray Train. + +Code example +------------ .. literalinclude:: /../../python/ray/train/examples/transformers/transformers_torch_trainer_basic.py + +See also +-------- + +* :ref:`Get Started with Hugging Face Transformers ` for a tutorial + +* :ref:`Ray Train Examples ` for more use cases diff --git a/doc/source/train/getting-started-pytorch-lightning.rst b/doc/source/train/getting-started-pytorch-lightning.rst index 00b8af39828e0..fa198c4d3c6bc 100644 --- a/doc/source/train/getting-started-pytorch-lightning.rst +++ b/doc/source/train/getting-started-pytorch-lightning.rst @@ -1,21 +1,21 @@ .. _train-pytorch-lightning: -Getting Started with PyTorch Lightning -====================================== +Get Started with PyTorch Lightning +================================== This tutorial walks through the process of converting an existing PyTorch Lightning script to use Ray Train. Learn how to: -1. Configure your Lightning Trainer so that it runs distributed with Ray and is placed on the correct CPU/GPU device. -2. Configure your training function to report metrics and save checkpoints. -3. Configure scale and CPU/GPU resource requirements for your training job. -4. Launch your distributed training job with a :class:`~ray.train.torch.TorchTrainer`. +1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU or GPU device. +2. Configure :ref:`training function ` to report metrics and save checkpoints. +3. Configure :ref:`scaling ` and CPU or GPU resource requirements for a training job. +4. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. Quickstart ---------- -For reference, the final code follows: +For reference, the final code is as follows: .. 
code-block:: python @@ -29,7 +29,7 @@ For reference, the final code follows: trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() -1. Your `train_func` is the Python code that is executed on each distributed training worker. +1. Your `train_func` is the Python code that each distributed training :ref:`worker ` executes. 2. Your `ScalingConfig` defines the number of distributed training workers and whether to use GPUs. 3. Your `TorchTrainer` launches the distributed training job. @@ -147,18 +147,18 @@ Compare a PyTorch Lightning training script with and without Ray Train. result = trainer.fit() -Setting up your training function ---------------------------------- +Set up a training function +-------------------------- First, update your training code to support distributed training. -Begin by wrapping your code in a function: +Begin by wrapping your code in a :ref:`training function `: .. code-block:: python def train_func(config): # Your PyTorch Lightning training code here. -This function is executed on each distributed training worker. +Each distributed training worker executes this function. Ray Train sets up your distributed process group on each worker. You only need to @@ -189,12 +189,12 @@ make a few changes to your Lightning Trainer definition. trainer.fit(model, datamodule=datamodule) -We now go over each change. +The following sections discuss each change. -Configuring distributed strategy -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Configure the distributed strategy +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Ray Train offers several subclassed distributed strategies for Lightning. +Ray Train offers several sub-classed distributed strategies for Lightning. These strategies retain the same argument list as their base strategy classes. Internally, they configure the root device and the distributed sampler arguments. @@ -220,11 +220,11 @@ sampler arguments. ) ... -Configuring Ray cluster environment plugin -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Configure the Ray cluster environment plugin +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Ray Train also provides :class:`~ray.train.lightning.RayLightningEnvironment` -as a specification for Ray Cluster. This utility class configures the worker's +Ray Train also provides a :class:`~ray.train.lightning.RayLightningEnvironment` class +as a specification for the Ray Cluster. This utility class configures the worker's local, global, and node rank and world size. @@ -245,8 +245,8 @@ local, global, and node rank and world size. ... -Configuring parallel devices -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Configure parallel devices +^^^^^^^^^^^^^^^^^^^^^^^^^^ In addition, Ray TorchTrainer has already configured the correct ``CUDA_VISIBLE_DEVICES`` for you. One should always use all available @@ -270,8 +270,8 @@ GPUs by setting ``devices="auto"`` and ``acelerator="auto"``. -Reporting checkpoints and metrics -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Report checkpoints and metrics +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To persist your checkpoints and monitor training progress, add a :class:`ray.train.lightning.RayTrainReportCallback` utility callback to your Trainer. @@ -293,10 +293,10 @@ To persist your checkpoints and monitor training progress, add a Reporting metrics and checkpoints to Ray Train enables you to support :ref:`fault-tolerant training ` and :ref:`hyperparameter optimization `. -Note that the :class:`ray.train.lightning.RayTrainReportCallback` only provides a simple implementation, and can be :ref:`further customized `. 
+Note that the :class:`ray.train.lightning.RayTrainReportCallback` class only provides a simple implementation, and can be :ref:`further customized `. -Preparing your Lightning Trainer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Prepare your Lightning Trainer +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Finally, pass your Lightning Trainer into :meth:`~ray.train.lightning.prepare_trainer` to validate @@ -315,8 +315,8 @@ your configurations. ... -Configuring scale and GPUs ---------------------------- +Configure scale and GPUs +------------------------ Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure: @@ -331,8 +331,8 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob For more details, see :ref:`train_scaling_config`. -Launching your training job ---------------------------- +Launch a training job +--------------------- Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. @@ -344,12 +344,12 @@ with a :class:`~ray.train.torch.TorchTrainer`. trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() -Please also refer to :ref:`train-run-config` for more configuration options for `TorchTrainer`. +See :ref:`train-run-config` for more configuration options for `TorchTrainer`. -Accessing training results --------------------------- +Access training results +----------------------- -After training completes, a :class:`~ray.train.Result` object will be returned which contains +After training completes, Ray Train returns a :class:`~ray.train.Result` object, which contains information about the training run, including the metrics and checkpoints reported during training. .. code-block:: python @@ -364,36 +364,35 @@ information about the training run, including the metrics and checkpoints report Next steps ---------- -After you have converted your PyTorch Lightningtraining script to use Ray Train: +After you have converted your PyTorch Lightning training script to use Ray Train: * See :ref:`User Guides ` to learn more about how to perform specific tasks. * Browse the :ref:`Examples ` for end-to-end examples of how to use Ray Train. -* Dive into the :ref:`API Reference ` for more details on the classes and methods used in this tutorial. +* Consult the :ref:`API Reference ` for more details on the classes and methods from this tutorial. Version Compatibility --------------------- Ray Train is tested with `pytorch_lightning` versions `1.6.5` and `2.0.4`. For full compatibility, use ``pytorch_lightning>=1.6.5`` . -Earlier versions are not prohibited but may result in unexpected issues. If you run into any compatibility issues, consider upgrading your PyTorch Lightning version or +Earlier versions aren't prohibited but may result in unexpected issues. If you run into any compatibility issues, consider upgrading your PyTorch Lightning version or `file an issue `_. .. _lightning-trainer-migration-guide: -``LightningTrainer`` Migration Guide ------------------------------------- +LightningTrainer Migration Guide +-------------------------------- -The `LightningTrainer` was added in Ray 2.4, and exposes a +Ray 2.4 introduced the `LightningTrainer`, and exposed a `LightningConfigBuilder` to define configurations for `pl.LightningModule` and `pl.Trainer`. It then instantiates the model and trainer objects and runs a pre-defined -training loop in a black box. - +training function in a black box. 
This version of the LightningTrainer API was constraining and limited -the users' ability to manage the training functionality. +your ability to manage the training functionality. -Ray 2.7 introduces the newly unified :class:`~ray.train.torch.TorchTrainer` API, which offers +Ray 2.7 introduced the newly unified :class:`~ray.train.torch.TorchTrainer` API, which offers enhanced transparency, flexibility, and simplicity. This API is more aligned with standard PyTorch Lightning scripts, ensuring users have better control over their native Lightning code. diff --git a/doc/source/train/getting-started-pytorch.rst b/doc/source/train/getting-started-pytorch.rst index b903ac7937c13..aa9d891bbeeb1 100644 --- a/doc/source/train/getting-started-pytorch.rst +++ b/doc/source/train/getting-started-pytorch.rst @@ -1,22 +1,22 @@ .. _train-pytorch: -Getting Started with PyTorch -============================ +Get Started with PyTorch +======================== This tutorial walks through the process of converting an existing PyTorch script to use Ray Train. Learn how to: -1. Configure your model so that it runs distributed and is placed on the correct CPU/GPU device. -2. Configure your dataloader so that it is sharded across the workers and place data on the correct CPU/GPU device. -3. Configure your training function to report metrics and save checkpoints. -4. Configure scale and CPU/GPU resource requirements for your training job. -5. Launch your distributed training job with a :class:`~ray.train.torch.TorchTrainer`. +1. Configure a model to run distributed and on the correct CPU/GPU device. +2. Configure a dataloader to shard data across the :ref:`workers ` and place data on the correct CPU or GPU device. +3. Configure a :ref:`training function ` to report metrics and save checkpoints. +4. Configure :ref:`scaling ` and CPU or GPU resource requirements for a training job. +5. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer` class. Quickstart ---------- -For reference, the final code follows: +For reference, the final code is as follows: .. code-block:: python @@ -30,9 +30,9 @@ For reference, the final code follows: trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() -1. Your `train_func` is the Python code that is executed on each distributed training worker. -2. Your `ScalingConfig` defines the number of distributed training workers and whether to use GPUs. -3. Your `TorchTrainer` launches the distributed training job. +1. `train_func` is the Python code that executes on each distributed training worker. +2. `ScalingConfig` defines the number of distributed training workers and whether to use GPUs. +3. `TorchTrainer` launches the distributed training job. Compare a PyTorch training script with and without Ray Train. @@ -131,25 +131,25 @@ Compare a PyTorch training script with and without Ray Train. trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() -Setting up your training function ---------------------------------- +Set up a training function +-------------------------- First, update your training code to support distributed training. -You can begin by wrapping your code in a function: +Begin by wrapping your code in a :ref:`training function `: .. code-block:: python def train_func(config): # Your PyTorch training code here. -This function is executed on each distributed training worker. +Each distributed training worker executes this function. 
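For orientation, a fully converted training function and its launcher often end up looking roughly like the following minimal sketch. The toy model, the synthetic dataset, and the hyperparameter values are illustrative assumptions rather than part of this tutorial's example; the sections later on this page walk through each piece individually.

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train
    from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model


    def train_func(config):
        # Toy model and synthetic data, purely for illustration.
        model = prepare_model(nn.Linear(10, 1))  # move to the right device and wrap in DDP
        dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
        dataloader = prepare_data_loader(DataLoader(dataset, batch_size=16, shuffle=True))

        loss_fn = nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

        for epoch in range(config["epochs"]):
            for features, labels in dataloader:
                loss = loss_fn(model(features), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Report intermediate metrics to Ray Train after each epoch.
            ray.train.report({"epoch": epoch, "loss": loss.item()})


    trainer = TorchTrainer(
        train_func,
        train_loop_config={"epochs": 2},
        scaling_config=ray.train.ScalingConfig(num_workers=2),
    )
    result = trainer.fit()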
-Setting up your model -^^^^^^^^^^^^^^^^^^^^^ +Set up a model +^^^^^^^^^^^^^^ -Use the :func:`ray.train.torch.prepare_model` utility function. This will: +Use the :func:`ray.train.torch.prepare_model` utility function to: -1. Move your model to the right device. +1. Move your model to the correct device. 2. Wrap it in ``DistributedDataParallel``. .. code-block:: diff @@ -172,8 +172,8 @@ Use the :func:`ray.train.torch.prepare_model` utility function. This will: ... -Setting up your dataset -^^^^^^^^^^^^^^^^^^^^^^^ +Set up a dataset +^^^^^^^^^^^^^^^^ .. TODO: Update this to use Ray Data. @@ -182,8 +182,8 @@ Use the :func:`ray.train.torch.prepare_data_loader` utility function, which: 1. Adds a ``DistributedSampler`` to your ``DataLoader``. 2. Moves the batches to the right device. -Note that this step is not necessary if you are passing in Ray Data to your Trainer -(see :ref:`data-ingest-torch`): +Note that this step isn't necessary if you're passing in Ray Data to your Trainer. +See :ref:`data-ingest-torch`. .. code-block:: diff @@ -216,8 +216,8 @@ Note that this step is not necessary if you are passing in Ray Data to your Trai global_batch_size = worker_batch_size * ray.train.get_context().get_world_size() -Reporting checkpoints and metrics -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Report checkpoints and metrics +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To monitor progress, you can report intermediate metrics and checkpoints using the :func:`ray.train.report` utility function. @@ -239,8 +239,8 @@ To monitor progress, you can report intermediate metrics and checkpoints using t For more details, see :ref:`train-monitoring-and-logging` and :ref:`train-checkpointing`. -Configuring scale and GPUs ---------------------------- +Configure scale and GPUs +------------------------ Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure: @@ -255,8 +255,8 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob For more details, see :ref:`train_scaling_config`. -Launching your training job ---------------------------- +Launch a training job +--------------------- Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. @@ -268,8 +268,8 @@ with a :class:`~ray.train.torch.TorchTrainer`. trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() -Accessing training results --------------------------- +Access training results +----------------------- After training completes, a :class:`~ray.train.Result` object is returned which contains information about the training run, including the metrics and checkpoints reported during training. diff --git a/doc/source/train/getting-started-transformers.rst b/doc/source/train/getting-started-transformers.rst index 9c9f84081b95f..95bb99fafd108 100644 --- a/doc/source/train/getting-started-transformers.rst +++ b/doc/source/train/getting-started-transformers.rst @@ -1,14 +1,14 @@ .. _train-pytorch-transformers: -Getting Started with Hugging Face Transformers -============================================== +Get Started with Hugging Face Transformers +========================================== This tutorial walks through the process of converting an existing Hugging Face Transformers script to use Ray Train. Learn how to: -1. Configure your training function to report metrics and save checkpoints. -2. Configure scale and CPU/GPU resource requirements for your training job. +1. 
Configure a :ref:`training function ` to report metrics and save checkpoints. +2. Configure :ref:`scaling ` and CPU or GPU resource requirements for your training job. 3. Launch your distributed training job with a :class:`~ray.train.torch.TorchTrainer`. Quickstart @@ -28,9 +28,9 @@ For reference, the final code follows: trainer = TorchTrainer(train_func, scaling_config=scaling_config) result = trainer.fit() -1. Your `train_func` is the Python code that is executed on each distributed training worker. -2. Your :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and computing resources (e.g. GPUs). -3. Your :class:`~ray.train.torch.TorchTrainer` launches the distributed training job. +1. `train_func` is the Python code that executes on each distributed training :ref:`worker `. +2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and computing resources (e.g. GPUs). +3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job. Compare a Hugging Face Transformers training script with and without Ray Train. @@ -171,32 +171,32 @@ Compare a Hugging Face Transformers training script with and without Ray Train. ray_trainer.fit() -Setting up your training function ---------------------------------- +Set up a training function +-------------------------- First, update your training code to support distributed training. -You can begin by wrapping your code in a function: +You can begin by wrapping your code in a :ref:`training function `: .. code-block:: python def train_func(config): # Your Transformers training code here. -This function is executed on each distributed training worker. Ray Train will set up the distributed +This function executes on each distributed training worker. Ray Train sets up the distributed process group on each worker before entering this function. -Please put all the logics into this function, including dataset construction and preprocessing, +Put all the logic into this function, including dataset construction and preprocessing, model initialization, transformers trainer definition and more. .. note:: If you are using Hugging Face Datasets or Evaluate, make sure to call ``datasets.load_dataset`` and ``evaluate.load`` - inside the training function. Do not pass the loaded datasets and metrics from outside of the training + inside the training function. Don't pass the loaded datasets and metrics from outside of the training function, because it might cause serialization errors while transferring the objects to the workers. -Reporting checkpoints and metrics -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Report checkpoints and metrics +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To persist your checkpoints and monitor training progress, add a :class:`ray.train.huggingface.transformers.RayTrainReportCallback` utility callback to your Trainer. @@ -215,11 +215,11 @@ To persist your checkpoints and monitor training progress, add a Reporting metrics and checkpoints to Ray Train ensures that you can use Ray Tune and :ref:`fault-tolerant training `. -Note that the :class:`ray.train.huggingface.transformers.RayTrainReportCallback` only provides a simple implementation, and can be :ref:`further customized `. +Note that the :class:`ray.train.huggingface.transformers.RayTrainReportCallback` only provides a simple implementation, and you can :ref:`further customize ` it. 
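As one hedged illustration of that kind of customization, the sketch below reports the Trainer's most recently logged metrics together with the latest saved checkpoint directory. The callback name, the chosen hooks, and the assumption that checkpoints land in the default ``checkpoint-<global_step>`` folder are illustrative choices, not the canonical implementation.

.. code-block:: python

    import os

    import ray.train
    from ray.train import Checkpoint
    from transformers import TrainerCallback


    class MyRayTrainReportCallback(TrainerCallback):
        """Illustrative callback that reports metrics and checkpoints to Ray Train."""

        def __init__(self):
            self._metrics = {}

        def on_log(self, args, state, control, logs=None, **kwargs):
            # Collect whatever the Transformers Trainer logs (loss, learning rate, and so on).
            self._metrics.update(logs or {})

        def on_save(self, args, state, control, **kwargs):
            # Transformers saves checkpoints under "<output_dir>/checkpoint-<global_step>".
            checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
            checkpoint = None
            if os.path.isdir(checkpoint_dir):
                checkpoint = Checkpoint.from_directory(checkpoint_dir)
            # Every worker must call report; only workers that saved a checkpoint attach one.
            ray.train.report(metrics=dict(self._metrics), checkpoint=checkpoint)

Inside the training function, register it the same way as the built-in callback, for example with ``trainer.add_callback(MyRayTrainReportCallback())``.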
-Preparing your Transformers Trainer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Prepare a Transformers Trainer +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Finally, pass your Transformers Trainer into :meth:`~ray.train.huggingface.transformers.prepare_trainer` to validate @@ -239,8 +239,8 @@ your configurations and enable Ray Data Integration. ... -Configuring scale and GPUs ---------------------------- +Configure scale and GPUs +------------------------ Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure: @@ -255,8 +255,8 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob For more details, see :ref:`train_scaling_config`. -Launching your training job ---------------------------- +Launch a training job +--------------------- Tying this all together, you can now launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`. @@ -270,8 +270,8 @@ with a :class:`~ray.train.torch.TorchTrainer`. Refer to :ref:`train-run-config` for more configuration options for `TorchTrainer`. -Accessing training results --------------------------- +Access training results +----------------------- After training completes, a :class:`~ray.train.Result` object is returned which contains information about the training run, including the metrics and checkpoints reported during training. @@ -297,15 +297,15 @@ After you have converted your Hugging Face Transformers training script to use R .. _transformers-trainer-migration-guide: -``TransformersTrainer`` Migration Guide ---------------------------------------- +TransformersTrainer Migration Guide +----------------------------------- -The `TransformersTrainer` was added in Ray 2.1. It exposes a `trainer_init_per_worker` interface -to define `transformers.Trainer`, then runs a pre-defined training loop in a black box. +Ray 2.1 introduced the `TransformersTrainer`, which exposes a `trainer_init_per_worker` interface +to define `transformers.Trainer`, then runs a pre-defined training function in a black box. -Ray 2.7 introduces the newly unified :class:`~ray.train.torch.TorchTrainer` API, -which offers enhanced transparency, flexibility, and simplicity. This API is more aligned -with standard Hugging Face Transformers scripts, ensuring users have better control over their +Ray 2.7 introduced the newly unified :class:`~ray.train.torch.TorchTrainer` API, +which offers enhanced transparency, flexibility, and simplicity. This API aligns more +with standard Hugging Face Transformers scripts, ensuring that you have better control over your native Transformers training code. diff --git a/doc/source/train/horovod.rst b/doc/source/train/horovod.rst index 1165eaccd5274..6632c8f9164a0 100644 --- a/doc/source/train/horovod.rst +++ b/doc/source/train/horovod.rst @@ -1,9 +1,12 @@ -Horovod -======= + +.. _train-horovod: + +Get Started with Horovod +======================== Ray Train configures the Horovod environment and Rendezvous server for you, allowing you to run your ``DistributedOptimizer`` training -script. See `Horovod documentation `_ +script. See the `Horovod documentation `_ for more information. Quickstart @@ -13,10 +16,10 @@ Quickstart -Updating your training function -------------------------------- +Update your training function +----------------------------- -First, update your training function to support distributed +First, update your :ref:`training function ` to support distributed training. 
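If your function isn't Horovod-ready yet, it usually ends up looking roughly like the following minimal sketch, which uses the standard Horovod PyTorch APIs. The toy model and the synthetic dataset are illustrative assumptions; adapt the same calls to your own training code.

.. code-block:: python

    import horovod.torch as hvd
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    import ray.train


    def train_func(config):
        # Standard Horovod setup: initialize, then pin each worker to its GPU if one is available.
        hvd.init()
        device = "cpu"
        if torch.cuda.is_available():
            torch.cuda.set_device(hvd.local_rank())
            device = "cuda"

        # Toy model and synthetic data, purely for illustration.
        model = nn.Linear(10, 1).to(device)
        dataset = TensorDataset(torch.randn(128, 10), torch.randn(128, 1))
        sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
        dataloader = DataLoader(dataset, batch_size=16, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
        # Average gradients across workers and start every worker from the same weights.
        optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)

        loss_fn = nn.MSELoss()
        for epoch in range(2):
            sampler.set_epoch(epoch)
            for features, labels in dataloader:
                features, labels = features.to(device), labels.to(device)
                loss = loss_fn(model(features), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            ray.train.report({"epoch": epoch, "loss": loss.item()})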
If you have a training function that already runs with the `Horovod Ray @@ -27,11 +30,11 @@ To onboard onto Horovod, visit the `Horovod guide `_. -Creating a :class:`~ray.train.horovod.HorovodTrainer` ------------------------------------------------------ +Create a HorovodTrainer +----------------------- -``Trainer``\s are the primary Ray Train classes that are used to manage state and -execute training. For Horovod, we use a :class:`~ray.train.horovod.HorovodTrainer` +``Trainer``\s are the primary Ray Train classes to use to manage state and +execute training. For Horovod, use a :class:`~ray.train.horovod.HorovodTrainer` that you can setup like this: .. code-block:: python @@ -45,7 +48,7 @@ that you can setup like this: scaling_config=ScalingConfig(use_gpu=use_gpu, num_workers=2) ) -When training with Horovod, we will always use a HorovodTrainer, +When training with Horovod, always use a HorovodTrainer, irrespective of the training framework, for example, PyTorch or TensorFlow. To customize the backend setup, you can pass a @@ -64,8 +67,8 @@ To customize the backend setup, you can pass a For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API. -Running your training function ------------------------------- +Run a training function +----------------------- With a distributed training function and a Ray Train ``Trainer``, you are now ready to start training. @@ -77,6 +80,7 @@ ready to start training. Further reading --------------- + Ray Train's :class:`~ray.train.horovod.HorovodTrainer` replaces the distributed communication backend of the native libraries with its own implementation. Thus, the remaining integration points remain the same. If you're using Horovod @@ -85,6 +89,8 @@ refer to the respective guides for further configuration and information. If you are implementing your own Horovod-based training routine without using any of -the training libraries, we still encourage you to read through the -:ref:`User Guides `, as many of the contents are applicable -to generic use cases and can be easily adapted. +the training libraries, read through the +:ref:`User Guides `, as you can apply much of the content +to generic use cases and adapt them easily. + + diff --git a/doc/source/train/huggingface-accelerate.rst b/doc/source/train/huggingface-accelerate.rst index dd4e86dc65090..480ae9b148a9b 100644 --- a/doc/source/train/huggingface-accelerate.rst +++ b/doc/source/train/huggingface-accelerate.rst @@ -1,11 +1,11 @@ .. _train-hf-accelerate: -Training with HuggingFace Accelerate -==================================== +Get Started with Hugging Face Accelerate +======================================== -The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `Accelelate `_ training across a distributed Ray cluster. +The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `Accelerate `_ training across a distributed Ray cluster. -All you need to do is run your existing training code with a TorchTrainer. You can expect the final code to look like this: +You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this: .. code-block:: python @@ -50,11 +50,11 @@ All you need to do is run your existing training code with a TorchTrainer. You c Model and data preparation for distributed training is completely handled by the `Accelerator `_ object and its `Accelerator.prepare() `_ method. 
- Unlike with native PyTorch, PyTorch Lightning, or HuggingFace Transformers, you do **not** call any additional Ray Train utilities + Unlike with native PyTorch, PyTorch Lightning, or Hugging Face Transformers, **don't** call any additional Ray Train utilities like :meth:`~ray.train.torch.prepare_model` or :meth:`~ray.train.torch.prepare_data_loader` in your training function. -Configuring Accelerate ------------------------ +Configure Accelerate +-------------------- In Ray Train, you can set configurations through the `accelerate.Accelerator `_ object in your training function. Below are starter examples for configuring Accelerate. @@ -161,11 +161,11 @@ object in your training function. Below are starter examples for configuring Acc trainer.fit() Note that Accelerate also provides a CLI tool, `"accelerate config"`, to generate a configuration and launch your training -job with `"accelerate launch"`. However, it is not necessary here because Ray's `TorchTrainer` already sets up the Torch +job with `"accelerate launch"`. However, it's not necessary here because Ray's `TorchTrainer` already sets up the Torch distributed environment and launches the training function on all workers. -Next, check these end-to-end examples below for more details: +Next, see these end-to-end examples below for more details: .. tabs:: @@ -201,8 +201,8 @@ You may also find these user guides helpful: - :ref:`How to use Ray Data with Ray Train ` -`AccelerateTrainer` Migration Guide ------------------------------------ +AccelerateTrainer Migration Guide +--------------------------------- Before Ray 2.7, Ray Train's :class:`AccelerateTrainer ` API was the recommended way to run Accelerate code. As a subclass of :class:`TorchTrainer `, @@ -210,7 +210,7 @@ the AccelerateTrainer takes in a configuration file generated by ``accelerate co Aside from that, the functionality of ``AccelerateTrainer`` is identical to ``TorchTrainer``. However, this caused confusion around whether this was the *only* way to run Accelerate code. -Because the full Accelerate functionality can be expressed with the ``Accelerator`` and ``TorchTrainer`` combination, the ``AccelerateTrainer`` will be deprecated in Ray 2.8, -and it is recommend to run your Accelerate code directly with ``TorchTrainer``. +Because you can express the full Accelerate functionality with the ``Accelerator`` and ``TorchTrainer`` combination, the plan is to deprecate the ``AccelerateTrainer`` in Ray 2.8, +and it's recommend to run your Accelerate code directly with ``TorchTrainer``. diff --git a/doc/source/train/more-frameworks.rst b/doc/source/train/more-frameworks.rst index 1f2dd89ff64ec..dce706c1d5368 100644 --- a/doc/source/train/more-frameworks.rst +++ b/doc/source/train/more-frameworks.rst @@ -29,7 +29,7 @@ More Frameworks .. button-ref:: distributed-tensorflow-keras - TensorFlow & Keras + TensorFlow and Keras .. grid-item-card:: :img-top: /images/xgboost_logo.png @@ -37,7 +37,7 @@ More Frameworks .. button-ref:: distributed-xgboost-lightgbm - XGBoost & LightGBM + XGBoost and LightGBM .. grid-item-card:: :img-top: /images/horovod.png diff --git a/doc/source/train/train.rst b/doc/source/train/train.rst index d779cf297dcf1..2ceb336d77354 100644 --- a/doc/source/train/train.rst +++ b/doc/source/train/train.rst @@ -97,7 +97,7 @@ Get started :outline: :expand: - Try Ray Train and Lightning + Try Ray Train with Lightning .. grid-item-card::