[docs][train] Update Train landing and Overview pages (ray-project#38808)
Signed-off-by: angelinalg <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
1 parent 8a6b4f8, commit 6fc61d1
Showing 9 changed files with 260 additions and 134 deletions.
Horovod
Hugging Face
Keras
LightGBM
PyTorch
PyTorch Lightning
TensorFlow
XGBoost
.. _train-key-concepts:

.. _train-overview:

Ray Train Overview
==================

To use Ray Train effectively, you need to understand four main concepts:

#. :ref:`Training function <train-overview-training-function>`: A Python function that contains your model training logic.
#. :ref:`Worker <train-overview-worker>`: A process that runs the training function.
#. :ref:`Scaling configuration <train-overview-scaling-config>`: A configuration of the number of workers and compute resources (for example, CPUs or GPUs).
#. :ref:`Trainer <train-overview-trainers>`: A Python class that ties together the training function, workers, and scaling configuration to execute a distributed training job.
.. _train-overview-training-function:

Training function
-----------------

The training function is a user-defined Python function that contains the end-to-end model training loop logic.
When launching a distributed training job, each worker executes this training function.

Ray Train documentation uses the following conventions:

#. `train_func` is a user-defined function that contains the training code.
#. `train_func` is passed into the Trainer's `train_loop_per_worker` parameter.

.. code-block:: python

    def train_func():
        """User-defined training function that runs on each distributed worker process.

        This function typically contains logic for loading the model,
        loading the dataset, training the model, saving checkpoints,
        and logging metrics.
        """
        ...
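For illustration, here is a framework-free sketch of the typical shape of a training function: "load" data, initialize a model, run a training loop, and produce a result. The toy gradient-descent task and all names here are hypothetical; a real `train_func` would use your ML framework and Ray Train's checkpointing and metric-reporting APIs instead of a return value.

.. code-block:: python

    # Hypothetical, framework-free sketch of a training function's structure.
    def toy_train_func():
        # "Load" a toy dataset for the target y = 2x
        data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

        # "Initialize" a one-parameter model and a learning rate
        w, lr = 0.0, 0.05

        # Training loop: gradient descent on mean squared error
        for _ in range(200):
            grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
            w -= lr * grad

        # A real training function would save checkpoints and report
        # metrics during the loop rather than returning a value.
        return w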
.. _train-overview-worker:

Worker
------

Ray Train distributes model training compute to individual worker processes across the cluster.
Each worker is a process that executes the `train_func`.
The number of workers determines the parallelism of the training job and is configured in the `ScalingConfig`.
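As a loose analogy only (Ray Train launches separate worker processes across a cluster, not threads), you can picture the workers as a pool of runners that all execute the same function in parallel:

.. code-block:: python

    # Analogy only: threads stand in for Ray worker processes to
    # illustrate "same function, N parallel runners".
    from concurrent.futures import ThreadPoolExecutor

    def train_func_stub(rank: int) -> str:
        # Stand-in for the user-defined training function
        return f"worker {rank}: done"

    def launch_workers(num_workers: int) -> list:
        # Every "worker" runs the same function
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            return list(pool.map(train_func_stub, range(num_workers)))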
.. _train-overview-scaling-config:

Scaling configuration
---------------------

The :class:`~ray.train.ScalingConfig` is the mechanism for defining the scale of the training job.
Specify two basic parameters for worker parallelism and compute resources:

* `num_workers`: The number of workers to launch for a distributed training job.
* `use_gpu`: Whether each worker should use a GPU or CPU.

.. code-block:: python

    from ray.train import ScalingConfig

    # Single worker with a CPU
    scaling_config = ScalingConfig(num_workers=1, use_gpu=False)

    # Single worker with a GPU
    scaling_config = ScalingConfig(num_workers=1, use_gpu=True)

    # Multiple workers, each with a GPU
    scaling_config = ScalingConfig(num_workers=4, use_gpu=True)
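As a rough mental model (not Ray's actual scheduling logic), the two parameters combine into a total resource request of `num_workers` CPU or GPU slots. The class below is a hypothetical stand-in, not the real `ScalingConfig`:

.. code-block:: python

    # Hypothetical sketch mirroring the num_workers/use_gpu pairing;
    # this is NOT the real ray.train.ScalingConfig.
    from dataclasses import dataclass

    @dataclass
    class ScalingSketch:
        num_workers: int
        use_gpu: bool = False

        def total_resources(self) -> dict:
            # One CPU or one GPU per worker
            key = "GPU" if self.use_gpu else "CPU"
            return {key: self.num_workers}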
.. _train-overview-trainers:

Trainer
-------

The Trainer ties the previous three concepts together to launch distributed training jobs.
Ray Train provides :ref:`Trainer classes <train-api>` for different frameworks.
Calling the `fit()` method executes the training job by:

#. Launching workers as defined by the `scaling_config`.
#. Setting up the framework's distributed environment on all workers.
#. Running the `train_func` on all workers.

.. code-block:: python

    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(train_func, scaling_config=scaling_config)
    trainer.fit()
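The steps that `fit()` performs can be sketched in plain Python. All names here are hypothetical, and threads stand in for Ray worker processes; a real Trainer also wires up the framework's distributed process group between launching workers and running the training function:

.. code-block:: python

    # Hypothetical sketch of fit()'s three steps, using threads
    # in place of Ray worker processes.
    from concurrent.futures import ThreadPoolExecutor

    def sketch_fit(train_func, num_workers: int) -> list:
        # 1. Launch workers per the scaling configuration
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            # 2. (A real Trainer sets up the distributed environment here.)
            # 3. Run the training function on every worker
            futures = [pool.submit(train_func) for _ in range(num_workers)]
            return [f.result() for f in futures]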