[docs][train] Update Train landing and Overview pages (ray-project#38808)
Signed-off-by: angelinalg <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
1 parent 8a6b4f8, commit 6fc61d1
Showing 9 changed files with 260 additions and 134 deletions.
Horovod
Hugging Face
Keras
LightGBM
PyTorch
PyTorch Lightning
TensorFlow
XGBoost
.. _train-key-concepts:

.. _train-overview:

Ray Train Overview
==================

To use Ray Train effectively, you need to understand four main concepts:

#. :ref:`Training function <train-overview-training-function>`: A Python function that contains your model training logic.
#. :ref:`Worker <train-overview-worker>`: A process that runs the training function.
#. :ref:`Scaling configuration <train-overview-scaling-config>`: A configuration of the number of workers and compute resources (for example, CPUs or GPUs).
#. :ref:`Trainer <train-overview-trainers>`: A Python class that ties together the training function, workers, and scaling configuration to execute a distributed training job.
.. _train-overview-training-function:

Training function
-----------------

The training function is a user-defined Python function that contains the end-to-end model training loop logic.
When launching a distributed training job, each worker executes this training function.

Ray Train documentation uses the following conventions:

#. `train_func` is a user-defined function that contains the training code.
#. `train_func` is passed into the Trainer's `train_loop_per_worker` parameter.

.. code-block:: python

    def train_func():
        """User-defined training function that runs on each distributed worker process.

        This function typically contains logic for loading the model,
        loading the dataset, training the model, saving checkpoints,
        and logging metrics.
        """
        ...
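For illustration, here is a framework-free sketch of the typical shape of a training function: "load" data, initialize a model, run a training loop, and produce a result. The toy gradient-descent task and all names here are hypothetical; a real `train_func` would use your ML framework and Ray Train's checkpointing and metric-reporting APIs instead of a return value.

.. code-block:: python

    # Hypothetical, framework-free sketch of a training function's structure.
    def toy_train_func():
        # "Load" a toy dataset for the target y = 2x
        data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

        # "Initialize" a one-parameter model and a learning rate
        w, lr = 0.0, 0.05

        # Training loop: gradient descent on mean squared error
        for _ in range(200):
            grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
            w -= lr * grad

        # A real training function would save checkpoints and report
        # metrics during the loop rather than returning a value.
        return w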
.. _train-overview-worker:

Worker
------

Ray Train distributes model training compute to individual worker processes across the cluster.
Each worker is a process that executes the `train_func`.
The number of workers determines the parallelism of the training job and is configured in the `ScalingConfig`.
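As a loose analogy only (Ray Train launches separate worker processes across a cluster, not threads), you can picture the workers as a pool of runners that all execute the same function in parallel:

.. code-block:: python

    # Analogy only: threads stand in for Ray worker processes to
    # illustrate "same function, N parallel runners".
    from concurrent.futures import ThreadPoolExecutor

    def train_func_stub(rank: int) -> str:
        # Stand-in for the user-defined training function
        return f"worker {rank}: done"

    def launch_workers(num_workers: int) -> list:
        # Every "worker" runs the same function
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            return list(pool.map(train_func_stub, range(num_workers)))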
.. _train-overview-scaling-config:

Scaling configuration
---------------------

The :class:`~ray.train.ScalingConfig` is the mechanism for defining the scale of the training job.
Specify two basic parameters for worker parallelism and compute resources:

* `num_workers`: The number of workers to launch for a distributed training job.
* `use_gpu`: Whether each worker should use a GPU or CPU.

.. code-block:: python

    from ray.train import ScalingConfig

    # Single worker with a CPU
    scaling_config = ScalingConfig(num_workers=1, use_gpu=False)

    # Single worker with a GPU
    scaling_config = ScalingConfig(num_workers=1, use_gpu=True)

    # Multiple workers, each with a GPU
    scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
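As a rough mental model (not Ray's actual scheduling logic), the two parameters combine into a total resource request of `num_workers` CPU or GPU slots. The class below is a hypothetical stand-in, not the real `ScalingConfig`:

.. code-block:: python

    # Hypothetical sketch mirroring the num_workers/use_gpu pairing;
    # this is NOT the real ray.train.ScalingConfig.
    from dataclasses import dataclass

    @dataclass
    class ScalingSketch:
        num_workers: int
        use_gpu: bool = False

        def total_resources(self) -> dict:
            # One CPU or one GPU per worker
            key = "GPU" if self.use_gpu else "CPU"
            return {key: self.num_workers}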
.. _train-overview-trainers:

Trainer
-------

The Trainer ties the previous three concepts together to launch distributed training jobs.
Ray Train provides :ref:`Trainer classes <train-api>` for different frameworks.
Calling the `fit()` method executes the training job by:

#. Launching workers as defined by the `scaling_config`.
#. Setting up the framework's distributed environment on all workers.
#. Running the `train_func` on all workers.

.. code-block:: python

    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(train_func, scaling_config=scaling_config)
    trainer.fit()
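The steps that `fit()` performs can be sketched in plain Python. All names here are hypothetical, and threads stand in for Ray worker processes; a real Trainer also wires up the framework's distributed process group between launching workers and running the training function:

.. code-block:: python

    # Hypothetical sketch of fit()'s three steps, using threads
    # in place of Ray worker processes.
    from concurrent.futures import ThreadPoolExecutor

    def sketch_fit(train_func, num_workers: int) -> list:
        # 1. Launch workers per the scaling configuration
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            # 2. (A real Trainer sets up the distributed environment here.)
            # 3. Run the training function on every worker
            futures = [pool.submit(train_func) for _ in range(num_workers)]
            return [f.result() for f in futures]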