
[docs][train] Update Train landing and Overview pages (ray-project#38808)

Signed-off-by: angelinalg <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
angelinalg and matthewdeng committed Sep 1, 2023
1 parent 8a6b4f8 commit 6fc61d1
Showing 9 changed files with 260 additions and 134 deletions.
8 changes: 8 additions & 0 deletions .github/styles/Vocab/Train/accept.txt
@@ -0,0 +1,8 @@
Horovod
Hugging Face
Keras
LightGBM
PyTorch
PyTorch Lightning
TensorFlow
XGBoost
9 changes: 0 additions & 9 deletions doc/source/_includes/train/announcement.rst

This file was deleted.

3 changes: 0 additions & 3 deletions doc/source/_includes/train/announcement_bottom.rst

This file was deleted.

4 changes: 2 additions & 2 deletions doc/source/_toc.yml
@@ -58,8 +58,8 @@ parts:
- file: train/train
title: Ray Train
sections:
- file: train/key-concepts
title: Key Concepts
- file: train/overview
title: Overview
- file: train/getting-started-pytorch
title: PyTorch Guide
- file: train/getting-started-pytorch-lightning
12 changes: 6 additions & 6 deletions doc/source/train/horovod.rst
@@ -16,14 +16,14 @@ Quickstart
Updating your training function
-------------------------------

First, you'll want to update your training function to support distributed
First, update your training function to support distributed
training.

If you have a training function that already runs with the `Horovod Ray
Executor <https://horovod.readthedocs.io/en/stable/ray_include.html#horovod-ray-executor>`_,
you should not need to make any additional changes!
you shouldn't need to make any additional changes.

To onboard onto Horovod, please visit the `Horovod guide
To onboard onto Horovod, visit the `Horovod guide
<https://horovod.readthedocs.io/en/stable/index.html#get-started>`_.


@@ -46,7 +46,7 @@ that you can setup like this:
)
When training with Horovod, we will always use a HorovodTrainer,
irrespective of the training framework (e.g. PyTorch or Tensorflow).
irrespective of the training framework, for example, PyTorch or TensorFlow.

To customize the backend setup, you can pass a
:class:`~ray.train.horovod.HorovodConfig`:
@@ -62,13 +62,13 @@ To customize the backend setup, you can pass a
scaling_config=ScalingConfig(num_workers=2),
)
For more configurability, please reference the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
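
For example, a ``HorovodTrainer`` combining a custom ``HorovodConfig`` with the scaling configuration might look like the following sketch; the ``horovod_config`` keyword name is an assumption about the trainer's API.

.. code-block:: python

    from ray.train import ScalingConfig
    from ray.train.horovod import HorovodConfig, HorovodTrainer

    # Sketch: pass a HorovodConfig to customize the Horovod backend setup,
    # alongside the usual scaling configuration.
    trainer = HorovodTrainer(
        train_func,
        horovod_config=HorovodConfig(),
        scaling_config=ScalingConfig(num_workers=2),
    )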

Running your training function
------------------------------

With a distributed training function and a Ray Train ``Trainer``, you are now
ready to start training!
ready to start training.

.. code-block:: python
83 changes: 0 additions & 83 deletions doc/source/train/key-concepts.rst

This file was deleted.

91 changes: 91 additions & 0 deletions doc/source/train/overview.rst
@@ -0,0 +1,91 @@
.. _train-key-concepts:

.. _train-overview:

Ray Train Overview
==================


To use Ray Train effectively, you need to understand four main concepts:

#. :ref:`Training function <train-overview-training-function>`: A Python function that contains your model training logic.
#. :ref:`Worker <train-overview-worker>`: A process that runs the training function.
#. :ref:`Scaling configuration <train-overview-scaling-config>`: A configuration of the number of workers and compute resources (for example, CPUs or GPUs).
#. :ref:`Trainer <train-overview-trainers>`: A Python class that ties together the training function, workers, and scaling configuration to execute a distributed training job.

.. _train-overview-training-function:

Training function
-----------------

The training function is a user-defined Python function that contains the end-to-end model training loop logic.
When launching a distributed training job, each worker executes this training function.

Ray Train documentation uses the following conventions:

#. `train_func` is a user-defined function that contains the training code.
#. `train_func` is passed into the Trainer's `train_loop_per_worker` parameter.

.. code-block:: python

    def train_func():
        """User-defined training function that runs on each distributed worker process.

        This function typically contains logic for loading the model,
        loading the dataset, training the model, saving checkpoints,
        and logging metrics.
        """
        ...
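
For illustration, a minimal sketch of such a training function, assuming PyTorch with the ``ray.train.torch`` utilities and a placeholder model and dataset, might look like this:

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train
    import ray.train.torch


    def train_func():
        # Placeholder model and dataset; replace with your own.
        model = ray.train.torch.prepare_model(torch.nn.Linear(10, 1))
        dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
        loader = ray.train.torch.prepare_data_loader(DataLoader(dataset, batch_size=8))

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for epoch in range(2):
            for features, labels in loader:
                loss = loss_fn(model(features), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Report metrics from each worker back to Ray Train.
            ray.train.report({"epoch": epoch, "loss": loss.item()})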
.. _train-overview-worker:

Worker
------

Ray Train distributes model training compute to individual worker processes across the cluster.
Each worker is a process that executes the `train_func`.
The number of workers determines the parallelism of the training job and is configured in the `ScalingConfig`.
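
Inside the training function, each worker can inspect its place in the group. The following sketch assumes the ``ray.train.get_context()`` API for looking up the worker rank and world size:

.. code-block:: python

    import ray.train

    def train_func():
        context = ray.train.get_context()
        rank = context.get_world_rank()        # Index of this worker, starting at 0.
        world_size = context.get_world_size()  # Total number of workers.
        print(f"Running on worker {rank} of {world_size}.")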

.. _train-overview-scaling-config:

Scaling configuration
---------------------

The :class:`~ray.train.ScalingConfig` is the mechanism for defining the scale of the training job.
Specify two basic parameters for worker parallelism and compute resources:

* `num_workers`: The number of workers to launch for a distributed training job.
* `use_gpu`: Whether each worker should use a GPU or CPU.

.. code-block:: python

    from ray.train import ScalingConfig

    # Single worker with a CPU
    scaling_config = ScalingConfig(num_workers=1, use_gpu=False)

    # Single worker with a GPU
    scaling_config = ScalingConfig(num_workers=1, use_gpu=True)

    # Multiple workers, each with a GPU
    scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
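
For finer-grained control, ``ScalingConfig`` also accepts a ``resources_per_worker`` argument; the exact resource values below are illustrative.

.. code-block:: python

    # Illustrative: request 2 CPUs and 1 GPU for each of the 4 workers.
    scaling_config = ScalingConfig(
        num_workers=4,
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},
    )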
.. _train-overview-trainers:

Trainer
-------

The Trainer ties the previous three concepts together to launch distributed training jobs.
Ray Train provides :ref:`Trainer classes <train-api>` for different frameworks.
Calling the `fit()` method executes the training job by:

#. Launching workers as defined by the `scaling_config`.
#. Setting up the framework's distributed environment on all workers.
#. Running the `train_func` on all workers.

.. code-block:: python

    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(train_func, scaling_config=scaling_config)
    trainer.fit()
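
The ``fit()`` call blocks until training completes and returns a result object; assuming the standard ``Result`` interface, the reported metrics and latest checkpoint can be read back like this:

.. code-block:: python

    result = trainer.fit()
    print(result.metrics)     # Metrics last reported from the training function.
    print(result.checkpoint)  # Latest saved checkpoint, if any was reported.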
