[air] update documentation to use session.report (ray-project#26051)
Update documentation to use `session.report`.

Next steps:
1. Update our internal caller to use `session.report`. Most importantly, CheckpointManager and DataParallelTrainer.
2. Update `get_trial_resources` to use PGF notions to incorporate the requirement of ResourceChangingScheduler. @Yard1 
3. After 2 is done, change all `tune.get_trial_resources` to `session.get_trial_resources`
4. [internal implementation] remove special checkpoint handling logic from huggingface trainer. Optimize the flow for checkpoint conversion with `session.report`.

Co-authored-by: Antoni Baum <[email protected]>
xwjiang2010 and Yard1 committed Jun 30, 2022
1 parent 20c6c07 commit ac831fd
Showing 79 changed files with 624 additions and 387 deletions.
4 changes: 2 additions & 2 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,7 @@ This example runs a parallel grid search to optimize an example objective functi

.. code-block:: python
from ray import tune
from ray.air import session
def objective(step, alpha, beta):
Expand All @@ -132,7 +132,7 @@ This example runs a parallel grid search to optimize an example objective functi
# Iterative training function - can be any arbitrary training procedure.
intermediate_score = objective(step, alpha, beta)
# Feed the score back to Tune.
tune.report(mean_loss=intermediate_score)
session.report({"mean_loss": intermediate_score})
analysis = tune.run(
Expand Down
1 change: 1 addition & 0 deletions doc/source/ray-air/images/session.svg
23 changes: 23 additions & 0 deletions doc/source/ray-air/key-concepts.rst
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,29 @@ Trainer objects will produce a :ref:`Results <air-results-ref>` object after cal
:end-before: __air_trainer_output_end__


Session
-------

Ray AIR exposes a functional API for users to define training behavior, or for developers to create their own ``Trainer``\s.
In both cases, there is a need for the following interactions:

1. To pass information downstream, including ``trial_name``, ``trial_id``, ``trial_resources``, and rank information.
2. To report information upstream, including metrics and checkpoints.

To facilitate such interactions, we introduce the :ref:`Session <air-session-ref>` concept.

The session concept exists on two levels: the execution layer (called `Tune Session`) and the Data Parallel training layer
(called `Train Session`).
The following figure shows what these two sessions look like in a Data Parallel training scenario.

.. image:: images/session.svg
:width: 650px
:align: center

..
https://docs.google.com/drawings/d/1g0pv8gqgG29aPEPTcd4BC0LaRNbW1sAkv3H6W1TCp0c/edit
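
The two interactions listed above can be sketched as follows. This is a minimal illustration, not the library's implementation: ``train_loop`` and the stub object are hypothetical, and the loss is a placeholder. Only ``session.get_trial_name`` and ``session.report`` are taken from ``ray.air.session``. The session is passed in as an optional argument (an assumption made purely so the loop can be exercised without a running Ray cluster); under Tune the real module is imported instead.

```python
from types import SimpleNamespace


def train_loop(config, session=None):
    """Read trial info (downstream) and report metrics (upstream)."""
    if session is None:
        # Deferred import: only needed when actually running under Ray.
        from ray.air import session

    trial_name = session.get_trial_name()  # downstream information
    for step in range(config["steps"]):
        loss = 1.0 / (step + 1)  # placeholder metric for illustration
        session.report({"loss": loss, "step": step})  # upstream report
    return trial_name


# Exercise the loop with a hypothetical stub instead of a real session.
reports = []
stub = SimpleNamespace(get_trial_name=lambda: "trial_0", report=reports.append)
name = train_loop({"steps": 3}, session=stub)
```

Note that ``session.report`` takes a single dictionary of metrics rather than keyword arguments.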

Tuner
-----

Expand Down
10 changes: 9 additions & 1 deletion doc/source/ray-air/package-ref.rst
Original file line number Diff line number Diff line change
Expand Up @@ -124,4 +124,12 @@ Configs
.. automodule:: ray.air.config
:members:

.. autoclass:: ray.air.config.CheckpointConfig
.. autoclass:: ray.air.config.CheckpointConfig

.. _air-session-ref:

Session
~~~~~~~

.. automodule:: ray.air.session
:members:
4 changes: 2 additions & 2 deletions doc/source/ray-contribute/docs.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -189,7 +189,7 @@
"outputs": [],
"source": [
"# __function_api_start__\n",
"from ray import tune\n",
"from ray.air import session\n",
"\n",
"\n",
"def objective(x, a, b): # Define an objective function.\n",
Expand All @@ -201,7 +201,7 @@
" for x in range(20): # \"Train\" for 20 iterations and compute intermediate scores.\n",
" score = objective(x, config[\"a\"], config[\"b\"])\n",
"\n",
" tune.report(score=score) # Send the score to Tune.\n",
" session.report({\"score\": score}) # Send the score to Tune.\n",
"\n",
"\n",
"# __function_api_end__"
Expand Down
3 changes: 2 additions & 1 deletion doc/source/ray-overview/doc_test/ray_tune.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
from ray import tune
from ray.air import session


def objective(step, alpha, beta):
Expand All @@ -12,7 +13,7 @@ def training_function(config):
# Iterative training function - can be any arbitrary training procedure.
intermediate_score = objective(step, alpha, beta)
# Feed the score back to Tune.
tune.report(mean_loss=intermediate_score)
session.report({"mean_loss": intermediate_score})


analysis = tune.run(
Expand Down
2 changes: 1 addition & 1 deletion doc/source/tune/api_docs/env.rst
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ These are the environment variables Ray Tune currently considers:
directories when the name is not specified explicitly or the trainable isn't passed
as a string. Setting this environment variable to ``1`` disables adding these date strings.
* **TUNE_DISABLE_STRICT_METRIC_CHECKING**: When you report metrics to Tune via
``tune.report()`` and pass a ``metric`` parameter to ``tune.run()``, a scheduler,
``session.report()`` and pass a ``metric`` parameter to ``tune.run()``, a scheduler,
or a search algorithm, Tune will error
if the metric was not reported in the result. Setting this environment variable
to ``1`` will disable this check.
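
As a small sketch of how this variable might be set from Python, assuming it is set in the driver process before ``tune.run()`` is called (setting it in the shell before launching the script works equally well):

```python
import os

# Disable Tune's strict metric check. Environment variable values are strings.
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"

strict_check_disabled = os.environ.get("TUNE_DISABLE_STRICT_METRIC_CHECKING") == "1"
```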
Expand Down
90 changes: 31 additions & 59 deletions doc/source/tune/api_docs/trainable.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,10 @@
API does not really have a signature to just describe.
.. TODO: Reusing actors and advanced resources allocation seem ill-placed.
Training (tune.Trainable, tune.report)
======================================
Training (tune.Trainable, session.report)
==========================================

Training can be done with either a **Class API** (``tune.Trainable``) or **function API** (``tune.report``).
Training can be done with either a **Class API** (``tune.Trainable``) or **function API** (``session.report``).

For the sake of example, let's maximize this objective function:

Expand All @@ -21,7 +21,7 @@ For the sake of example, let's maximize this objective function:
Function API
------------

With the Function API, you can report intermediate metrics by simply calling ``tune.report`` within the provided function.
With the Function API, you can report intermediate metrics by simply calling ``session.report`` within the provided function.

.. code-block:: python
Expand All @@ -31,7 +31,7 @@ With the Function API, you can report intermediate metrics by simply calling ``t
for x in range(20):
intermediate_score = objective(x, config["a"], config["b"])
tune.report(score=intermediate_score) # This sends the score to Tune.
session.report({"score": intermediate_score}) # This sends the score to Tune.
analysis = tune.run(
trainable,
Expand All @@ -40,43 +40,13 @@ With the Function API, you can report intermediate metrics by simply calling ``t
print("best config: ", analysis.get_best_config(metric="score", mode="max"))
.. tip:: Do not use ``tune.report`` within a ``Trainable`` class.
.. tip:: Do not use ``session.report`` within a ``Trainable`` class.

Tune will run this function on a separate thread in a Ray actor process.

You'll notice that Ray Tune will output extra values in addition to the user reported metrics,
such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation/glossary of these values.
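
When migrating an existing function trainable, the change at each call site is that keyword arguments become a single dictionary. The sketch below shows that mapping with a hypothetical helper, ``report_kwargs``, which is not part of Ray. The ``report`` argument stands in for ``session.report`` so that the snippet runs without a Ray cluster.

```python
from typing import Any, Callable, Dict


def report_kwargs(report: Callable[[Dict[str, Any]], None], **metrics: Any) -> None:
    """Forward keyword-style metrics to a dict-style reporter."""
    report(dict(metrics))


# Collect reports in a list instead of sending them to Tune.
received = []
report_kwargs(received.append, score=0.5, step=3)
print(received)
```

Inside a real trainable, ``report_kwargs(session.report, score=0.5)`` would be equivalent to ``session.report({"score": 0.5})``.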

Function API return and yield values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of using ``tune.report()``, you can also use Python's ``yield``
statement to report metrics to Ray Tune:


.. code-block:: python
def trainable(config):
# config (dict): A dict of hyperparameters.
for x in range(20):
intermediate_score = objective(x, config["a"], config["b"])
yield {"score": intermediate_score} # This sends the score to Tune.
analysis = tune.run(
trainable,
config={"a": 2, "b": 4}
)
print("best config: ", analysis.get_best_config(metric="score", mode="max"))
If you yield a dictionary object, this will work just as ``tune.report()``.
If you yield a number, it will be reported to Ray Tune with the key ``_metric``, i.e.
as if you had called ``tune.report(_metric=value)``.

Ray Tune supports the same functionality for return values if you only
report metrics at the end of each run:

.. code-block:: python
def trainable(config):
Expand All @@ -102,30 +72,27 @@ Function API Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~~

Many Tune features rely on checkpointing, including the usage of certain Trial Schedulers and fault tolerance.
To use Tune's checkpointing features, you must expose a ``checkpoint_dir`` argument in the function signature,
and call ``tune.checkpoint_dir`` :
You can save and load checkpoints in Ray Tune as follows:

.. code-block:: python
import time
from ray import tune
from ray.air import session
from ray.air.checkpoint import Checkpoint
def train_func(config, checkpoint_dir=None):
start = 0
if checkpoint_dir:
with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
state = json.loads(f.read())
start = state["step"] + 1
def train_func(config):
step = 0
loaded_checkpoint = session.get_checkpoint()
if loaded_checkpoint:
last_step = loaded_checkpoint.to_dict()["step"]
step = last_step + 1
for iter in range(start, 100):
for iter in range(step, 100):
time.sleep(1)
with tune.checkpoint_dir(step=step) as checkpoint_dir:
path = os.path.join(checkpoint_dir, "checkpoint")
with open(path, "w") as f:
f.write(json.dumps({"step": start}))
tune.report(hello="world", ray="tune")
checkpoint = Checkpoint.from_dict({"step": iter})
session.report({"message": "Hello world Ray Tune!"}, checkpoint=checkpoint)
tune.run(train_func)
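
The save-and-resume pattern can be exercised without Ray by modelling the checkpoint as a plain dictionary. In this sketch ``FakeSession`` is a hypothetical in-memory stand-in, and ``get_checkpoint`` and ``report`` are passed in explicitly. With the real API these would be ``session.get_checkpoint()`` (whose result is read with ``to_dict()``) and ``session.report(..., checkpoint=Checkpoint.from_dict(...))``. The point to note is that the checkpoint stores the step that just finished, so a restored run continues at the next step.

```python
def train_func(config, get_checkpoint, report):
    """Resume after the last saved step, then checkpoint every step."""
    loaded = get_checkpoint()  # None on a fresh start
    start = loaded["step"] + 1 if loaded else 0
    for step in range(start, config["num_steps"]):
        # Save the current step, not the step the run started from.
        report({"step": step}, checkpoint={"step": step})


class FakeSession:
    """Hypothetical stand-in that remembers only the latest checkpoint."""

    def __init__(self):
        self.latest = None
        self.reported = []

    def get_checkpoint(self):
        return self.latest

    def report(self, metrics, checkpoint=None):
        self.reported.append(metrics["step"])
        if checkpoint is not None:
            self.latest = checkpoint


fake = FakeSession()
train_func({"num_steps": 3}, fake.get_checkpoint, fake.report)  # runs steps 0-2
train_func({"num_steps": 5}, fake.get_checkpoint, fake.report)  # resumes at step 3
print(fake.reported)
```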
Expand Down Expand Up @@ -153,7 +120,7 @@ it is important not to depend on absolute paths in the implementation of ``save`
Trainable Class API
-------------------

.. caution:: Do not use ``tune.report`` within a ``Trainable`` class.
.. caution:: Do not use ``session.report`` within a ``Trainable`` class.

The Trainable **class API** will require users to subclass ``ray.tune.Trainable``. Here's a naive example of this API:

Expand Down Expand Up @@ -343,18 +310,23 @@ It is also possible to specify memory (``"memory"``, in bytes) and custom resour

.. _tune-function-docstring:

tune.report / tune.checkpoint (Function API)
--------------------------------------------
session (Function API)
----------------------

.. autofunction:: ray.tune.report
.. autofunction:: ray.air.session.report
:noindex:

.. autofunction:: ray.tune.checkpoint_dir
.. autofunction:: ray.air.session.get_checkpoint
:noindex:

.. autofunction:: ray.tune.get_trial_dir
.. autofunction:: ray.air.session.get_trial_name
:noindex:

.. autofunction:: ray.tune.get_trial_name
.. autofunction:: ray.air.session.get_trial_id
:noindex:

.. autofunction:: ray.tune.get_trial_id
.. autofunction:: ray.air.session.get_trial_resources
:noindex:

tune.Trainable (Class API)
--------------------------
Expand Down
7 changes: 4 additions & 3 deletions doc/source/tune/doc_code/faq.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
# __reproducible_start__
import numpy as np
from ray import tune
from ray.air import session


def train(config):
Expand All @@ -12,7 +13,7 @@ def train(config):
# is the same.
np.random.seed(config["seed"])
random_result = np.random.uniform(0, 100, size=1).item()
tune.report(result=random_result)
session.report({"result": random_result})


# Set seed for Ray Tune's random search.
Expand Down Expand Up @@ -54,7 +55,7 @@ def _iter():

def train(config):
random_result = np.random.uniform(0, 100, size=1).item()
tune.report(result=random_result)
session.report({"result": random_result})


train_fn = train
Expand Down Expand Up @@ -90,7 +91,7 @@ def train(config):
def train_fn(config, checkpoint_dir=None):
# some Modin operations here
# import modin.pandas as pd
tune.report(metric=metric)
session.report({"metric": metric})

tune.run(
train_fn,
Expand Down
3 changes: 2 additions & 1 deletion doc/source/tune/doc_code/key_concepts.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# flake8: noqa

# __function_api_start__
from ray.air import session


def objective(x, a, b): # Define an objective function.
Expand All @@ -12,7 +13,7 @@ def trainable(config): # Pass a "config" dictionary into your trainable.
for x in range(20): # "Train" for 20 iterations and compute intermediate scores.
score = objective(x, config["a"], config["b"])

tune.report(score=score) # Send the score to Tune.
session.report({"score": score}) # Send the score to Tune.


# __function_api_end__
Expand Down
3 changes: 2 additions & 1 deletion doc/source/tune/doc_code/pytorch_optuna.py
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ def forward(self, x):
# 1. Wrap your PyTorch model in an objective function.
import torch
from ray import tune
from ray.air import session
from ray.tune.search.optuna import OptunaSearch


Expand All @@ -95,7 +96,7 @@ def objective(config):
while True:
train(model, optimizer, train_loader) # Train the model
acc = test(model, test_loader) # Compute test accuracy
tune.report(mean_accuracy=acc) # Report to Tune
session.report({"mean_accuracy": acc}) # Report to Tune


# 2. Define a search space and initialize the search algorithm.
Expand Down
9 changes: 5 additions & 4 deletions doc/source/tune/examples/ax_example.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@
"\n",
"import ray\n",
"from ray import tune\n",
"from ray.air import session\n",
"from ray.tune.search.ax import AxSearch"
]
},
Expand Down Expand Up @@ -112,7 +113,7 @@
"metadata": {},
"source": [
"Next, our `objective` function takes a Tune `config`, evaluates the `landscape` of our experiment in a training loop,\n",
"and uses `tune.report` to report the `landscape` back to Tune."
"and uses `session.report` to report the `landscape` back to Tune."
]
},
{
Expand All @@ -125,8 +126,8 @@
"def objective(config):\n",
" for i in range(config[\"iterations\"]):\n",
" x = np.array([config.get(\"x{}\".format(i + 1)) for i in range(6)])\n",
" tune.report(\n",
" timesteps_total=i, landscape=landscape(x), l2norm=np.sqrt((x ** 2).sum())\n",
" session.report(\n",
" {\"timesteps_total\": i, \"landscape\": landscape(x), \"l2norm\": np.sqrt((x ** 2).sum())}\n",
" )\n",
" time.sleep(0.02)"
]
Expand Down Expand Up @@ -250,7 +251,7 @@
"id": "91076c5a",
"metadata": {},
"source": [
"Finally, we run the experiment to find the global minimum of the provided landscape (which contains 5 false minima). The argument to metric, `\"landscape\"`, is provided via the `objective` function's `tune.report`. The experiment `\"min\"`imizes the \"mean_loss\" of the `landscape` by searching within `search_space` via `algo`, `num_samples` times or when `\"timesteps_total\": stop_timesteps`. The previous sentence fully characterizes the search problem we aim to solve. With this in mind, notice how efficient it is to execute `tune.run()`."
"Finally, we run the experiment to find the global minimum of the provided landscape (which contains 5 false minima). The argument to metric, `\"landscape\"`, is provided via the `objective` function's `session.report`. The experiment `\"min\"`imizes the \"mean_loss\" of the `landscape` by searching within `search_space` via `algo`, `num_samples` times or when `\"timesteps_total\": stop_timesteps`. The previous sentence fully characterizes the search problem we aim to solve. With this in mind, notice how efficient it is to execute `tune.run()`."
]
},
{
Expand Down
5 changes: 3 additions & 2 deletions doc/source/tune/examples/bayesopt_example.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@
"\n",
"import ray\n",
"from ray import tune\n",
"from ray.air import session\n",
"from ray.tune.search import ConcurrencyLimiter\n",
"from ray.tune.search.bayesopt import BayesOptSearch"
]
Expand Down Expand Up @@ -85,7 +86,7 @@
"metadata": {},
"source": [
"Next, our ``objective`` function takes a Tune ``config``, evaluates the `score` of your experiment in a training loop,\n",
"and uses `tune.report` to report the `score` back to Tune."
"and uses `session.report` to report the `score` back to Tune."
]
},
{
Expand All @@ -98,7 +99,7 @@
"def objective(config):\n",
" for step in range(config[\"steps\"]):\n",
" score = evaluate(step, config[\"width\"], config[\"height\"])\n",
" tune.report(iterations=step, mean_loss=score)"
" session.report({\"iterations\": step, \"mean_loss\": score})"
]
},
{
Expand Down