
[docs][train] Make Train example titles, headings more consistent #39606

Merged Sep 14, 2023 (18 commits)

Changes from 1 commit
make remaining examples consistent; add see also's
Signed-off-by: angelinalg <[email protected]>
angelinalg committed Sep 12, 2023
commit b7f2a0eef18dc75a87796439f7180b1f85dff54a
1 change: 0 additions & 1 deletion .github/styles/Vocab/Train/accept.txt
@@ -1,5 +1,4 @@
Horovod
Hugging Face
hyperparameters?
Member

do we need to have the question mark here?

Contributor Author

I believe this is a regex that matches "hyperparameter" and "hyperparameters"

Contributor Author

@bveeramani - would you be able to confirm that this is the correct syntax?

Member

The question mark means the "s" is optional.

You might also want to add a "$" to the end (hyperparameters?$) to avoid matching words like "hyperparametere"
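A quick check of that behavior with plain Python regular expressions (illustrative only, not part of this PR; Vale's own matching rules may differ slightly):

```python
import re

# "?" makes the preceding "s" optional; "$" anchors the match at the end of the string.
pattern = re.compile(r"hyperparameters?$")

print(bool(pattern.search("hyperparameter")))   # True
print(bool(pattern.search("hyperparameters")))  # True
print(bool(pattern.search("hyperparametere")))  # False: the trailing "e" fails the "$" anchor
```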

Keras
LightGBM
12 changes: 6 additions & 6 deletions doc/source/train/deepspeed.rst
@@ -1,11 +1,11 @@
.. _train-deepspeed:

Training with DeepSpeed
=======================
Get Started with DeepSpeed
==========================

The :class:`~ray.train.torch.TorchTrainer` can help you easily launch your `DeepSpeed <https://www.deepspeed.ai/>`_ training across a distributed Ray cluster.

All you need to do is run your existing training code with a TorchTrainer. You can expect the final code to look like this:
You only need to run your existing training code with a TorchTrainer. You can expect the final code to look like this:

.. code-block:: python

@@ -74,12 +74,12 @@ Below is a simple example of ZeRO-3 training with DeepSpeed only.
keep using `deepspeed.initialize() <https://deepspeed.readthedocs.io/en/latest/initialize.html>`_ as usual to prepare everything
for distributed training.
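For orientation, here is a minimal sketch of that pattern (not the collapsed example from this diff; the toy model and ZeRO-3 config values are illustrative):

.. code-block:: python

    import deepspeed
    import torch
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func():
        # Build a toy model; in practice this is your own model.
        model = torch.nn.Linear(4, 1)
        ds_config = {
            "train_micro_batch_size_per_gpu": 8,
            "zero_optimization": {"stage": 3},
            "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
        }
        # Prepare the model for ZeRO-3 training exactly as in plain DeepSpeed.
        model_engine, optimizer, _, _ = deepspeed.initialize(
            model=model,
            model_parameters=model.parameters(),
            config=ds_config,
        )
        # ... run your usual DeepSpeed training loop with model_engine ...

    trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    result = trainer.fit()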

Running DeepSpeed with other frameworks
-------------------------------------------
Run DeepSpeed with other frameworks
-----------------------------------

Many deep learning frameworks have integrated with DeepSpeed, including Lightning, Transformers, Accelerate, and more. You can run all these combinations in Ray Train.

Please check the below examples for more details:
Check the below examples for more details:

.. list-table::
:header-rows: 1
105 changes: 50 additions & 55 deletions doc/source/train/distributed-tensorflow-keras.rst
@@ -1,18 +1,19 @@
.. _train-tensorflow-overview:

Distributed Tensorflow & Keras
==============================
Get Started with TensorFlow and Keras
=====================================

Ray Train's `TensorFlow <https://www.tensorflow.org/>`__ integration enables you
to scale your TensorFlow and Keras training loops to many machines and GPUs.
to scale your TensorFlow and Keras training functions to many machines and GPUs.

On a technical level, Ray Train schedules your training workers
and configures ``TF_CONFIG`` for you, allowing you to run
your ``MultiWorkerMirroredStrategy`` training script. See `Distributed
training with TensorFlow <https://www.tensorflow.org/guide/distributed_training>`_
for more information.

Most of the examples in this guide use Tensorflow with Keras, but
Ray Train also works with vanilla Tensorflow.
Most of the examples in this guide use TensorFlow with Keras, but
Ray Train also works with vanilla TensorFlow.


Quickstart
@@ -23,29 +24,27 @@ Quickstart
:end-before: __tf_train_end__


Updating your training function
-------------------------------
Update your training function
-----------------------------

First, you'll want to update your training function to support distributed
First, update your :ref:`training function <train-overview-training-function>` to support distributed
training.


.. note::
The current TensorFlow implementation supports
``MultiWorkerMirroredStrategy`` (and ``MirroredStrategy``). If there are
other strategies you wish to see supported by Ray Train, please let us know
by submitting a `feature request on GitHub <https://github.com/ray-project/ray/issues>`_.
other strategies you wish to see supported by Ray Train, submit a `feature request on GitHub <https://github.com/ray-project/ray/issues>`_.

These instructions closely follow TensorFlow's `Multi-worker training
with Keras <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras>`_
tutorial. One key difference is that Ray Train will handle the environment
tutorial. One key difference is that Ray Train handles the environment
variable set up for you.

**Step 1:** Wrap your model in ``MultiWorkerMirroredStrategy``.

The `MultiWorkerMirroredStrategy <https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy>`_
enables synchronous distributed training. The ``Model`` *must* be built and
compiled within the scope of the strategy.
enables synchronous distributed training. You *must* build and compile the ``Model`` within the scope of the strategy.

.. code-block:: python
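
    # Illustrative sketch only; the example body is collapsed in this diff view.
    # Build and compile the Keras model inside the strategy's scope.
    import tensorflow as tf

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer="adam", loss="mse")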

@@ -56,9 +55,8 @@ compiled within the scope of the strategy.
**Step 2:** Update your ``Dataset`` batch size to the *global* batch
size.

The `batch <https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch>`_
will be split evenly across worker processes, so ``batch_size`` should be
set appropriately.
Set ``batch_size`` appropriately because `batch <https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch>`_
splits evenly across worker processes.

.. code-block:: diff

@@ -67,20 +65,20 @@ set appropriately.


.. warning::
Ray will not automatically set any environment variables or configuration
related to local parallelism / threading
Ray doesn't automatically set any environment variables or configuration
related to local parallelism or threading
:ref:`aside from "OMP_NUM_THREADS" <omp-num-thread-note>`.
If you desire greater control over TensorFlow threading, use
If you want greater control over TensorFlow threading, use
the ``tf.config.threading`` module (e.g.,
``tf.config.threading.set_inter_op_parallelism_threads(num_cpus)``)
at the beginning of your ``train_loop_per_worker`` function.
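For instance, a minimal sketch (the thread counts here are arbitrary placeholders):

.. code-block:: python

    import tensorflow as tf

    def train_loop_per_worker():
        # Pin TensorFlow's thread pools explicitly before building the model.
        tf.config.threading.set_inter_op_parallelism_threads(2)
        tf.config.threading.set_intra_op_parallelism_threads(4)
        # ... rest of the training function ...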

Creating a :class:`~ray.train.tensorflow.TensorflowTrainer`
-----------------------------------------------------------
Create a TensorflowTrainer
--------------------------

``Trainer``\s are the primary Ray Train classes that are used to manage state and
``Trainer``\s are the primary Ray Train classes for managing state and
executing training. For distributed TensorFlow,
we use a :class:`~ray.train.tensorflow.TensorflowTrainer`
use a :class:`~ray.train.tensorflow.TensorflowTrainer`
that you can set up like this:

.. code-block:: python
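
    # Illustrative sketch only; the example body is collapsed in this diff view.
    # `train_func` is assumed to be the distributed training function defined above.
    from ray.train import ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer

    trainer = TensorflowTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    )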
@@ -109,38 +107,36 @@ To customize the backend setup, you can pass a
)


For more configurability, please reference the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.
For more configurability, see the :py:class:`~ray.train.data_parallel_trainer.DataParallelTrainer` API.


Running your training function
------------------------------
Run your training function
--------------------------

With a distributed training function and a Ray Train ``Trainer``, you are now
ready to start training!
ready to start training.

.. code-block:: python

trainer.fit()

Data loading and preprocessing
------------------------------
Tensorflow per default uses its own internal dataset sharding policy, as described
Load and preprocess data
------------------------

TensorFlow by default uses its own internal dataset sharding policy, as described
`in the guide <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#dataset_sharding>`__.
If your tensorflow dataset is compatible with distributed loading, you don't need to
If your TensorFlow dataset is compatible with distributed loading, you don't need to
change anything.

If you require more advanced preprocessing, you may want to consider using Ray Data
for distributed data ingest.

There is a guide for using :ref:`Ray Data with Ray Train <data-ingest-torch>`
in our PyTorch guide. Since Ray Data is an independent library, most concepts can
be directly applied to TensorFlow.
for distributed data ingest. See :ref:`Ray Data with Ray Train <data-ingest-torch>`.
Because Ray Data is an independent library, you can directly apply most concepts to TensorFlow.

The main difference is that you may want to convert your Ray Data dataset shard to
a TensorFlow dataset in your training function so that you can use the Keras
API for model training.

`Here's a full example you can refer to <https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tf/tune_tensorflow_autoencoder_example.py>`__
`See this example <https://github.com/ray-project/ray/blob/master/python/ray/train/examples/tf/tune_tensorflow_autoencoder_example.py>`__
for distributed data loading. The relevant parts are:

.. code-block:: python
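
    # Illustrative sketch only; the linked example's code is collapsed in this
    # diff view, and the column names here are placeholders.
    import ray.train

    def train_func():
        # Fetch this worker's shard of the "train" dataset passed to the Trainer.
        shard = ray.train.get_dataset_shard("train")
        # Convert the shard to a tf.data.Dataset so Keras can consume it.
        tf_dataset = shard.to_tf(
            feature_columns="features",
            label_columns="label",
            batch_size=32,
        )
        # model.fit(tf_dataset, ...)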
@@ -184,8 +180,8 @@ for distributed data loading. The relevant parts are:



Reporting results
-----------------
Report results
--------------
During training, the training loop should report intermediate results and checkpoints
to Ray Train. This reporting logs the results to the console output and appends them to
local log files. The logging also triggers :ref:`checkpoint bookkeeping <train-dl-configure-checkpoints>`.
@@ -203,30 +199,29 @@ The easiest way to report your results with Keras is by using the
model.fit(dataset, callbacks=[ReportCheckpointCallback()])


This callback will automatically forward all results and checkpoints from the
Keras training loop to Ray Train.
This callback automatically forwards all results and checkpoints from the
Keras training function to Ray Train.
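A minimal end-to-end sketch of that wiring (the toy model and random data are illustrative, not part of this diff):

.. code-block:: python

    import numpy as np
    import tensorflow as tf
    from ray.train.tensorflow.keras import ReportCheckpointCallback

    def train_func():
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
            model.compile(optimizer="adam", loss="mse")

        x = np.random.rand(64, 4).astype("float32")
        y = np.random.rand(64, 1).astype("float32")
        dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

        # Each epoch's metrics and a checkpoint are forwarded to Ray Train.
        model.fit(dataset, epochs=2, callbacks=[ReportCheckpointCallback()])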


Aggregating results
~~~~~~~~~~~~~~~~~~~
Aggregate results
~~~~~~~~~~~~~~~~~

TensorFlow Keras automatically aggregates metrics from all workers. If you wish to have more
control over that, consider implementing a `custom training loop <https://www.tensorflow.org/tutorials/distribute/custom_training>`__.


Saving and loading checkpoints
------------------------------
Save and load checkpoints
-------------------------

:class:`Checkpoints <ray.train.Checkpoint>` can be saved by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the
training function. This will cause the checkpoint state from the distributed
workers to be saved on the ``Trainer`` (where your python script is executed).
You can save :class:`Checkpoints <ray.train.Checkpoint>` by calling ``train.report(metrics, checkpoint=Checkpoint(...))`` in the
training function. This call saves the checkpoint state from the distributed
workers on the ``Trainer``, where you executed your Python script.

The latest saved checkpoint can be accessed through the ``checkpoint`` attribute of
the :py:class:`~ray.train.Result`, and the best saved checkpoints can be accessed by the ``best_checkpoints``
You can access the latest saved checkpoint through the ``checkpoint`` attribute of
the :py:class:`~ray.train.Result`, and access the best saved checkpoints with the ``best_checkpoints``
attribute.
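As a hedged sketch of that reporting pattern (the ``report_epoch`` helper, metric names, and file names here are illustrative):

.. code-block:: python

    import os
    import tempfile

    import ray.train
    from ray.train import Checkpoint

    def report_epoch(model, loss):
        # Save the Keras weights to a temporary directory and hand them to
        # Ray Train as a checkpoint alongside the epoch's metrics.
        with tempfile.TemporaryDirectory() as tmpdir:
            model.save_weights(os.path.join(tmpdir, "model.weights.h5"))
            ray.train.report(
                {"loss": loss},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )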

Concrete examples are provided to demonstrate how checkpoints (model weights but not models) are saved
appropriately in distributed training.
These concrete examples demonstrate how Ray Train appropriately saves checkpoints, model weights but not models, in distributed training.


.. code-block:: python
@@ -275,11 +270,11 @@ appropriately in distributed training.
result = trainer.fit()
print(result.checkpoint)

By default, checkpoints will be persisted to local disk in the :ref:`log
By default, checkpoints persist to local disk in the :ref:`log
directory <train-log-dir>` of each run.

Loading checkpoints
~~~~~~~~~~~~~~~~~~~
Load checkpoints
~~~~~~~~~~~~~~~~

.. code-block:: python
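
    # Illustrative sketch only; the example body is collapsed in this diff view.
    # `model` is assumed to be the compiled Keras model built in the training function.
    import os

    import ray.train

    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            model.load_weights(os.path.join(ckpt_dir, "model.weights.h5"))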
