[docs][train]Make Train example titles, heading more consistent #39606

Merged
merged 18 commits on Sep 14, 2023
Changes from 1 commit
adding some links from guides to the overview pages; fix typos
Signed-off-by: angelinalg <[email protected]>
angelinalg committed Sep 13, 2023
commit 87097bab94ac5a257aa9878dda243186e1d0235a
3 changes: 3 additions & 0 deletions doc/source/train/deepspeed.rst
@@ -5,6 +5,9 @@ Get Started with DeepSpeed

The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed <https://www.deepspeed.ai/>`_ training across a distributed Ray cluster.

Code example
------------

You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this:

.. code-block:: python
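The example that follows the ``code-block`` directive above is collapsed in this view. A minimal, non-authoritative sketch of the pattern the paragraph describes, assuming your existing DeepSpeed code is wrapped in a ``train_func``, might look like this:

.. code-block:: python

    # Sketch only: train_func is assumed to contain your existing DeepSpeed code.
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_func():
        # Existing DeepSpeed setup and training loop, e.g.
        # model_engine, optimizer, _, _ = deepspeed.initialize(...)
        # followed by the usual forward/backward/step loop.
        ...


    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    result = trainer.fit()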
4 changes: 2 additions & 2 deletions doc/source/train/distributed-tensorflow-keras.rst
@@ -110,8 +110,8 @@ To customize the backend setup, you can pass a
For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.


Run your training function
--------------------------
Run a training function
-----------------------

With a distributed training function and a Ray Train ``Trainer``, you are now
ready to start training.
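A minimal sketch of that launch step, assuming ``train_func`` is a Keras training function already written for ``tf.distribute.MultiWorkerMirroredStrategy``, might look like this:

.. code-block:: python

    # Sketch only: train_func is assumed to contain your distributed Keras code.
    from ray.train import ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer


    def train_func():
        # Build and fit the Keras model inside a
        # tf.distribute.MultiWorkerMirroredStrategy scope here.
        ...


    trainer = TensorflowTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2),
    )
    result = trainer.fit()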
12 changes: 6 additions & 6 deletions doc/source/train/distributed-xgboost-lightgbm.rst
@@ -25,7 +25,7 @@ Quickstart
:end-before: __lightgbm_end__


Basic Training with Tree-Based Models in Train
Basic training with tree-based models in Train
----------------------------------------------

Just as in the original `xgboost.train() <https://xgboost.readthedocs.io/en/stable/parameter.html>`__ and
@@ -53,12 +53,12 @@ training parameters are passed as the ``params`` dictionary.
:end-before: __lightgbm_end__


Ray-specific params are passed in through the trainer constructors.
Trainer constructors pass Ray-specific parameters.
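To illustrate that split, a hedged sketch (with a small synthetic dataset standing in for real training data) might separate the two kinds of arguments like this:

.. code-block:: python

    # Sketch only: the dataset below is synthetic and stands in for real data.
    import ray
    from ray.train import ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    train_ds = ray.data.from_items(
        [{"x": float(i), "target": i % 2} for i in range(100)]
    )

    trainer = XGBoostTrainer(
        # Arguments handled by Ray Train go to the trainer constructor.
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": train_ds},
        label_column="target",
        num_boost_round=10,
        # Native XGBoost training parameters go in the params dictionary.
        params={"objective": "binary:logistic", "eval_metric": ["logloss"]},
    )
    result = trainer.fit()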


.. _train-gbdt-checkpoints:

Save and Load XGBoost and LightGBM Checkpoints
Save and load XGBoost and LightGBM checkpoints
----------------------------------------------

When you train a new tree on every boosting round,
@@ -209,13 +209,13 @@ How to optimize XGBoost memory usage?
XGBoost uses a compute-optimized datastructure, the ``DMatrix``,
to hold training data. When converting a dataset to a ``DMatrix``,
XGBoost creates intermediate copies and ends up
holding a complete copy of the full data. The data will be converted
into the local dataformat (on a 64 bit system these are 64 bit floats.)
holding a complete copy of the full data. XGBoost converts the data
into the local data format. On a 64-bit system the format is 64-bit floats.
Depending on the system and original dataset dtype, this matrix can
thus occupy more memory than the original dataset.

The **peak memory usage** for CPU-based training is at least
**3x** the dataset size (assuming dtype ``float32`` on a 64bit system)
**3x** the dataset size, assuming dtype ``float32`` on a 64-bit system,
plus about **400,000 KiB** for other resources,
like operating system requirements and storing of intermediate
results.
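To make the rule of thumb concrete, a back-of-the-envelope estimate for a hypothetical 10 GiB ``float32`` dataset looks like this:

.. code-block:: python

    # Sketch only: rough estimate using the rule above; the 10 GiB figure is made up.
    dataset_gib = 10
    overhead_kib = 400_000  # OS requirements, intermediate results, etc.
    peak_gib = 3 * dataset_gib + overhead_kib / (1024 * 1024)
    print(f"Estimated peak CPU memory: ~{peak_gib:.1f} GiB")  # ~30.4 GiB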
6 changes: 3 additions & 3 deletions doc/source/train/getting-started-pytorch-lightning.rst
@@ -8,8 +8,8 @@ This tutorial walks through the process of converting an existing PyTorch Lightn
Learn how to:

1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU or GPU device.
2. Configure the training function to report metrics and save checkpoints.
3. Configure scale and CPU or GPU resource requirements for a training job.
2. Configure a :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
3. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
4. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`.

Quickstart
@@ -29,7 +29,7 @@ For reference, the final code is as follows:
trainer = TorchTrainer(train_func, scaling_config=scaling_config)
result = trainer.fit()

1. Your `train_func` is the Python code that each distributed training worker executes.
1. Your `train_func` is the Python code that each distributed training :ref:`worker <train-overview-worker>` executes.
2. Your `ScalingConfig` defines the number of distributed training workers and whether to use GPUs.
3. Your `TorchTrainer` launches the distributed training job.

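For step 1 in the list above, a hedged sketch of the Lightning ``Trainer`` configuration inside ``train_func``, assuming a user-defined ``MyLightningModule`` and ``train_loader`` (both hypothetical placeholders), might look like this:

.. code-block:: python

    # Sketch only: MyLightningModule and train_loader are hypothetical placeholders
    # for your own LightningModule and dataloader.
    import lightning.pytorch as pl
    from ray.train.lightning import (
        RayDDPStrategy,
        RayLightningEnvironment,
        RayTrainReportCallback,
        prepare_trainer,
    )


    def train_func():
        model = MyLightningModule()
        trainer = pl.Trainer(
            devices="auto",
            accelerator="auto",
            strategy=RayDDPStrategy(),
            plugins=[RayLightningEnvironment()],
            callbacks=[RayTrainReportCallback()],
            enable_checkpointing=False,
        )
        trainer = prepare_trainer(trainer)
        trainer.fit(model, train_dataloaders=train_loader)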
6 changes: 3 additions & 3 deletions doc/source/train/getting-started-pytorch.rst
@@ -8,9 +8,9 @@ This tutorial walks through the process of converting an existing PyTorch script
Learn how to:

1. Configure a model to run distributed and on the correct CPU/GPU device.
2. Configure a dataloader to shard data across the workers and place data on the correct CPU or GPU device.
3. Configure a training function to report metrics and save checkpoints.
4. Configure scale and CPU or GPU resource requirements for a training job.
2. Configure a dataloader to shard data across the :ref:`workers <train-overview-worker>` and place data on the correct CPU or GPU device.
3. Configure a :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
4. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
5. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer` class.

Quickstart
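A hedged, end-to-end sketch of steps 1 through 5 above, using a toy model and random data, might look like the following:

.. code-block:: python

    # Sketch only: a toy model and random data illustrate the steps listed above.
    import torch
    import ray.train.torch
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_func():
        # Step 1: prepare the model for distributed training and device placement.
        model = ray.train.torch.prepare_model(torch.nn.Linear(4, 1))

        # Step 2: shard the dataloader across workers and move batches to the device.
        dataset = torch.utils.data.TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
        loader = ray.train.torch.prepare_data_loader(
            torch.utils.data.DataLoader(dataset, batch_size=8)
        )

        loss_fn = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        for X, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            optimizer.step()

        # Step 3: report metrics (and optionally a checkpoint) back to Ray Train.
        ray.train.report({"loss": loss.item()})


    # Steps 4-5: configure scaling and launch the distributed job.
    trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2))
    result = trainer.fit()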
8 changes: 4 additions & 4 deletions doc/source/train/getting-started-transformers.rst
@@ -7,8 +7,8 @@ This tutorial walks through the process of converting an existing Hugging Face T

Learn how to:

1. Configure your training function to report metrics and save checkpoints.
2. Configure scale and CPU/GPU resource requirements for your training job.
1. Configure a :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
2. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for your training job.
3. Launch your distributed training job with a :class:`~ray.train.torch.TorchTrainer`.

Quickstart
@@ -28,7 +28,7 @@ For reference, the final code follows:
trainer = TorchTrainer(train_func, scaling_config=scaling_config)
result = trainer.fit()

1. `train_func` is the Python code that executes on each distributed training worker.
1. `train_func` is the Python code that executes on each distributed training :ref:`worker <train-overview-worker>`.
2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and computing resources (e.g. GPUs).
3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job.

@@ -175,7 +175,7 @@ Set up a training function
--------------------------

First, update your training code to support distributed training.
You can begin by wrapping your code in a function:
You can begin by wrapping your code in a :ref:`training function <train-overview-training-function>`:

.. code-block:: python

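The collapsed block above wraps existing Transformers code in a training function. For the metrics-and-checkpoints part of step 1 in the earlier list, one possible, non-authoritative reporting pattern, assuming a PyTorch ``model`` object from your training loop, is:

.. code-block:: python

    # Sketch only: model, epoch, and loss are assumed to come from your training loop.
    import os
    import tempfile

    import torch
    import ray.train
    from ray.train import Checkpoint


    def report_metrics_and_checkpoint(model, epoch, loss):
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            ray.train.report(
                {"epoch": epoch, "loss": loss},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )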
2 changes: 1 addition & 1 deletion doc/source/train/huggingface-accelerate.rst
@@ -3,7 +3,7 @@
Get Started with Hugging Face Accelerate
========================================

The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `Accelelate <https://huggingface.co/docs/accelerate>`_ training across a distributed Ray cluster.
The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `Accelerate <https://huggingface.co/docs/accelerate>`_ training across a distributed Ray cluster.

You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this:

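As with DeepSpeed above, a minimal sketch of the pattern, assuming an existing Accelerate training loop inside ``train_func``, might look like this:

.. code-block:: python

    # Sketch only: the Accelerate-specific code is assumed to already exist.
    from accelerate import Accelerator
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_func():
        accelerator = Accelerator()
        # Existing Accelerate code: wrap the model, optimizer, and dataloaders with
        # accelerator.prepare(...), then run the usual training loop with
        # accelerator.backward(loss) in place of loss.backward().
        ...


    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    )
    result = trainer.fit()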