[air] update documentation to use session.report (ray-project#26051)

Update documentation to use `session.report`.

Next steps:
1. Update our internal caller to use `session.report`. Most importantly, CheckpointManager and DataParallelTrainer.
2. Update `get_trial_resources` to use PGF notions to incorporate the requirement of ResourceChangingScheduler. @Yard1
3. After 2 is done, change all `tune.get_trial_resources` to `session.get_trial_resources`
4. [internal implementation] remove special checkpoint handling logic from huggingface trainer. Optimize the flow for checkpoint conversion with `session.report`.

Co-authored-by: Antoni Baum <[email protected]>
rickyyx · Jun 30, 2022 · ac831fd
1 parent 20c6c07
Showing 79 changed files with 624 additions and 387 deletions.
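
The call-shape change that recurs throughout this diff, from keyword arguments to a single metrics dict, can be sketched without Ray installed. `report_kwargs` and its `sink` parameter are hypothetical names used only for illustration and are not part of Ray; in real code the sink would be `ray.air.session.report`.

```python
from typing import Any, Callable, Dict, List


def report_kwargs(sink: Callable[[Dict[str, Any]], None], **metrics: Any) -> None:
    """Hypothetical shim: accept old keyword-style metrics, forward one dict.

    Old style: tune.report(mean_loss=0.5)
    New style: session.report({"mean_loss": 0.5})
    """
    sink(dict(metrics))


# Stand-in for session.report so the sketch runs without Ray.
collected: List[Dict[str, Any]] = []
report_kwargs(collected.append, mean_loss=0.5, step=3)
print(collected)
```

A shim like this is only a migration aid; the diff below rewrites each call site directly instead.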
diff --git a/README.rst b/README.rst
@@ -118,7 +118,7 @@ This example runs a parallel grid search to optimize an example objective functi
 
 .. code-block:: python
 
- from ray import tune
+ from ray.air import session
 
 
  def objective(step, alpha, beta):
@@ -132,7 +132,7 @@ This example runs a parallel grid search to optimize an example objective functi
  # Iterative training function - can be any arbitrary training procedure.
  intermediate_score = objective(step, alpha, beta)
  # Feed the score back to Tune.
- tune.report(mean_loss=intermediate_score)
+ session.report({"mean_loss": intermediate_score})
 
 
  analysis = tune.run(

diff --git a/doc/source/ray-air/images/session.svg b/doc/source/ray-air/images/session.svg
diff --git a/doc/source/ray-air/key-concepts.rst b/doc/source/ray-air/key-concepts.rst
@@ -47,6 +47,29 @@ Trainer objects will produce a :ref:`Results <air-results-ref>` object after cal
  :end-before: __air_trainer_output_end__
 
 
+Session
+-------
+
+Ray AIR exposes a functional API for users to define training behavior, or for developers to create their own ``Trainer``\s.
+In both cases, there is a need for the following interactions:
+
+1. To disseminate information downstream, including ``trial_name``, ``trial_id``, ``trial_resources``, rank information, etc.
+2. To report information upstream, including metrics and checkpoints.
+
+To facilitate such interactions, we introduce the :ref:`Session <air-session-ref>` concept.
+
+The session concept exists on several levels: the execution layer (called `Tune Session`) and the Data Parallel training layer
+(called `Train Session`).
+The following figure shows what these two sessions look like in a Data Parallel training scenario.
+
+.. image:: images/session.svg
+ :width: 650px
+ :align: center
+
+..
+ https://docs.google.com/drawings/d/1g0pv8gqgG29aPEPTcd4BC0LaRNbW1sAkv3H6W1TCp0c/edit
+
+
 Tuner
 -----
 

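The two interactions listed in the Session section above (context flowing downstream, metrics and checkpoints flowing upstream) can be modeled with a toy object. `ToySession` is a hypothetical stand-in written for this sketch, not Ray's implementation; only the method names `report` and `get_trial_name` mirror functions that this diff documents under `ray.air.session`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToySession:
    """Hypothetical stand-in for a session; not Ray's implementation."""

    trial_name: str
    trial_id: str
    trial_resources: Dict[str, float]
    reports: List[Dict[str, Any]] = field(default_factory=list)

    # Downstream: the training function reads context from the session.
    def get_trial_name(self) -> str:
        return self.trial_name

    # Upstream: the training function sends metrics and an optional checkpoint.
    def report(
        self, metrics: Dict[str, Any], checkpoint: Optional[Dict[str, Any]] = None
    ) -> None:
        self.reports.append({"metrics": metrics, "checkpoint": checkpoint})


def train_fn(session: ToySession) -> None:
    for step in range(3):
        metrics = {"step": step, "trial": session.get_trial_name()}
        session.report(metrics, checkpoint={"step": step})


toy = ToySession(trial_name="trial_0", trial_id="0001", trial_resources={"CPU": 1.0})
train_fn(toy)
```

In Ray itself the session is reached through module-level functions rather than an object passed into the training function; the explicit argument here just keeps the sketch self-contained.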
diff --git a/doc/source/ray-air/package-ref.rst b/doc/source/ray-air/package-ref.rst
@@ -124,4 +124,12 @@ Configs
 .. automodule:: ray.air.config
  :members:
 
-.. autoclass:: ray.air.config.CheckpointConfig
+.. autoclass:: ray.air.config.CheckpointConfig
+
+.. _air-session-ref:
+
+Session
+~~~~~~~
+
+.. automodule:: ray.air.session
+ :members:
diff --git a/doc/source/ray-contribute/docs.ipynb b/doc/source/ray-contribute/docs.ipynb
@@ -189,7 +189,7 @@
  "outputs": [],
  "source": [
  "# __function_api_start__\n",
- "from ray import tune\n",
+ "from ray.air import session\n",
  "\n",
  "\n",
  "def objective(x, a, b): # Define an objective function.\n",
@@ -201,7 +201,7 @@
  " for x in range(20): # \"Train\" for 20 iterations and compute intermediate scores.\n",
  " score = objective(x, config[\"a\"], config[\"b\"])\n",
  "\n",
- " tune.report(score=score) # Send the score to Tune.\n",
+ " session.report({\"score\": score}) # Send the score to Tune.\n",
  "\n",
  "\n",
  "# __function_api_end__"

diff --git a/doc/source/ray-overview/doc_test/ray_tune.py b/doc/source/ray-overview/doc_test/ray_tune.py
@@ -1,4 +1,5 @@
 from ray import tune
+from ray.air import session
 
 
 def objective(step, alpha, beta):
@@ -12,7 +13,7 @@ def training_function(config):
  # Iterative training function - can be any arbitrary training procedure.
  intermediate_score = objective(step, alpha, beta)
  # Feed the score back to Tune.
- tune.report(mean_loss=intermediate_score)
+ session.report({"mean_loss": intermediate_score})
 
 
 analysis = tune.run(

diff --git a/doc/source/tune/api_docs/env.rst b/doc/source/tune/api_docs/env.rst
@@ -24,7 +24,7 @@ These are the environment variables Ray Tune currently considers:
  directories when the name is not specified explicitly or the trainable isn't passed
  as a string. Setting this environment variable to ``1`` disables adding these date strings.
 * **TUNE_DISABLE_STRICT_METRIC_CHECKING**: When you report metrics to Tune via
- ``tune.report()`` and passed a ``metric`` parameter to ``tune.run()``, a scheduler,
+ ``session.report()`` and pass a ``metric`` parameter to ``tune.run()``, a scheduler,
  or a search algorithm, Tune will error
  if the metric was not reported in the result. Setting this environment variable
  to ``1`` will disable this check.
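
One way to apply the **TUNE_DISABLE_STRICT_METRIC_CHECKING** setting described above is to set it from Python before the experiment starts. This is a minimal sketch; that the variable must be set before any `tune.run()` call is an assumption, not something this diff states.

```python
import os

# Disable Tune's strict metric checking for this process.
# Assumption: set this before starting the experiment so Tune sees it.
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"
```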

diff --git a/doc/source/tune/api_docs/trainable.rst b/doc/source/tune/api_docs/trainable.rst
@@ -4,10 +4,10 @@
  API does not really have a signature to just describe.
 .. TODO: Reusing actors and advanced resources allocation seem ill-placed.
 
-Training (tune.Trainable, tune.report)
-======================================
+Training (tune.Trainable, session.report)
+==========================================
 
-Training can be done with either a **Class API** (``tune.Trainable``) or **function API** (``tune.report``).
+Training can be done with either a **Class API** (``tune.Trainable``) or **function API** (``session.report``).
 
 For the sake of example, let's maximize this objective function:
 
@@ -21,7 +21,7 @@ For the sake of example, let's maximize this objective function:
 Function API
 ------------
 
-With the Function API, you can report intermediate metrics by simply calling ``tune.report`` within the provided function.
+With the Function API, you can report intermediate metrics by simply calling ``session.report`` within the provided function.
 
 .. code-block:: python
 
@@ -31,7 +31,7 @@ With the Function API, you can report intermediate metrics by simply calling ``t
  for x in range(20):
  intermediate_score = objective(x, config["a"], config["b"])
 
- tune.report(score=intermediate_score) # This sends the score to Tune.
+ session.report({"score": intermediate_score}) # This sends the score to Tune.
 
  analysis = tune.run(
  trainable,
@@ -40,43 +40,13 @@ With the Function API, you can report intermediate metrics by simply calling ``t
 
  print("best config: ", analysis.get_best_config(metric="score", mode="max"))
 
-.. tip:: Do not use ``tune.report`` within a ``Trainable`` class.
+.. tip:: Do not use ``session.report`` within a ``Trainable`` class.
 
 Tune will run this function on a separate thread in a Ray actor process.
 
 You'll notice that Ray Tune will output extra values in addition to the user reported metrics,
 such as ``iterations_since_restore``. See :ref:`tune-autofilled-metrics` for an explanation/glossary of these values.
 
-Function API return and yield values
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Instead of using ``tune.report()``, you can also use Python's ``yield``
-statement to report metrics to Ray Tune:
-
-
-.. code-block:: python
-
- def trainable(config):
- # config (dict): A dict of hyperparameters.
-
- for x in range(20):
- intermediate_score = objective(x, config["a"], config["b"])
-
- yield {"score": intermediate_score} # This sends the score to Tune.
-
- analysis = tune.run(
- trainable,
- config={"a": 2, "b": 4}
- )
-
- print("best config: ", analysis.get_best_config(metric="score", mode="max"))
-
-If you yield a dictionary object, this will work just as ``tune.report()``.
-If you yield a number, if will be reported to Ray Tune with the key ``_metric``, i.e.
-as if you had called ``tune.report(_metric=value)``.
-
-Ray Tune supports the same functionality for return values if you only
-report metrics at the end of each run:
-
 .. code-block:: python
 
  def trainable(config):
@@ -102,30 +72,27 @@ Function API Checkpointing
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Many Tune features rely on checkpointing, including the usage of certain Trial Schedulers and fault tolerance.
-To use Tune's checkpointing features, you must expose a ``checkpoint_dir`` argument in the function signature,
-and call ``tune.checkpoint_dir`` :
+You can save and load checkpoints in Ray Tune in the following manner:
 
 .. code-block:: python
 
  import time
  from ray import tune
+ from ray.air import session
+ from ray.air.checkpoint import Checkpoint
 
- def train_func(config, checkpoint_dir=None):
- start = 0
- if checkpoint_dir:
-  with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
-  state = json.loads(f.read())
-  start = state["step"] + 1
+ def train_func(config):
+ step = 0
+ loaded_checkpoint = session.get_checkpoint()
+ if loaded_checkpoint:
+ last_step = loaded_checkpoint.to_dict()["step"]
+ step = last_step + 1
 
- for iter in range(start, 100):
+ for iter in range(step, 100):
  time.sleep(1)
 
- with tune.checkpoint_dir(step=step) as checkpoint_dir:
- path = os.path.join(checkpoint_dir, "checkpoint")
- with open(path, "w") as f:
- f.write(json.dumps({"step": start}))
-
- tune.report(hello="world", ray="tune")
+ checkpoint = Checkpoint.from_dict({"step": iter})
+ session.report({"message": "Hello world Ray Tune!"}, checkpoint=checkpoint)
 
  tune.run(train_func)
 
@@ -153,7 +120,7 @@ it is important not to depend on absolute paths in the implementation of ``save`
 Trainable Class API
 -------------------
 
-.. caution:: Do not use ``tune.report`` within a ``Trainable`` class.
+.. caution:: Do not use ``session.report`` within a ``Trainable`` class.
 
 The Trainable **class API** will require users to subclass ``ray.tune.Trainable``. Here's a naive example of this API:
 
@@ -343,18 +310,23 @@ It is also possible to specify memory (``"memory"``, in bytes) and custom resour
 
 .. _tune-function-docstring:
 
-tune.report / tune.checkpoint (Function API)
---------------------------------------------
+session (Function API)
+----------------------
 
-.. autofunction:: ray.tune.report
+.. autofunction:: ray.air.session.report
+ :noindex:
 
-.. autofunction:: ray.tune.checkpoint_dir
+.. autofunction:: ray.air.session.get_checkpoint
+ :noindex:
 
-.. autofunction:: ray.tune.get_trial_dir
+.. autofunction:: ray.air.session.get_trial_name
+ :noindex:
 
-.. autofunction:: ray.tune.get_trial_name
+.. autofunction:: ray.air.session.get_trial_id
+ :noindex:
 
-.. autofunction:: ray.tune.get_trial_id
+.. autofunction:: ray.air.session.get_trial_resources
+ :noindex:
 
 tune.Trainable (Class API)
 --------------------------

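The resume logic in the new Function API checkpointing example above can be exercised without a cluster. In this sketch plain dicts stand in for `Checkpoint` objects, and `fake_report` is a hypothetical replacement for `session.report`; with Ray, `loaded` would come from `session.get_checkpoint().to_dict()`.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Report = Callable[[Dict[str, Any], Dict[str, Any]], None]


def train_func(
    loaded: Optional[Dict[str, Any]], report: Report, total_steps: int = 5
) -> None:
    """Resume after the checkpointed step if one exists, then checkpoint every step."""
    start = loaded["step"] + 1 if loaded else 0
    for step in range(start, total_steps):
        # Save the step just completed so a restart continues after it.
        report({"step": step}, {"step": step})


saved: List[Tuple[Dict[str, Any], Dict[str, Any]]] = []


def fake_report(metrics: Dict[str, Any], checkpoint: Dict[str, Any]) -> None:
    # Hypothetical stand-in for session.report(metrics, checkpoint=...).
    saved.append((metrics, checkpoint))


# First run is cut short after 2 of 5 steps (simulated with total_steps=2).
train_func(None, fake_report, total_steps=2)
last_checkpoint = saved[-1][1]

# Second run resumes from the last checkpoint and finishes the remaining steps.
train_func(last_checkpoint, fake_report, total_steps=5)
steps_seen = [metrics["step"] for metrics, _ in saved]
```

The checkpoint records the loop variable rather than the starting value, so each step runs exactly once across both runs.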
diff --git a/doc/source/tune/doc_code/faq.py b/doc/source/tune/doc_code/faq.py
@@ -3,6 +3,7 @@
 # __reproducible_start__
 import numpy as np
 from ray import tune
+from ray.air import session
 
 
 def train(config):
@@ -12,7 +13,7 @@ def train(config):
  # is the same.
  np.random.seed(config["seed"])
  random_result = np.random.uniform(0, 100, size=1).item()
- tune.report(result=random_result)
+ session.report({"result": random_result})
 
 
 # Set seed for Ray Tune's random search.
@@ -54,7 +55,7 @@ def _iter():
 
 def train(config):
  random_result = np.random.uniform(0, 100, size=1).item()
- tune.report(result=random_result)
+ session.report({"result": random_result})
 
 
 train_fn = train
@@ -90,7 +91,7 @@ def train(config):
  def train_fn(config, checkpoint_dir=None):
  # some Modin operations here
  # import modin.pandas as pd
- tune.report(metric=metric)
+ session.report({"metric": metric})
 
  tune.run(
  train_fn,

diff --git a/doc/source/tune/doc_code/key_concepts.py b/doc/source/tune/doc_code/key_concepts.py
@@ -1,6 +1,7 @@
 # flake8: noqa
 
 # __function_api_start__
+from ray.air import session
 
 
 def objective(x, a, b): # Define an objective function.
@@ -12,7 +13,7 @@ def trainable(config): # Pass a "config" dictionary into your trainable.
  for x in range(20): # "Train" for 20 iterations and compute intermediate scores.
  score = objective(x, config["a"], config["b"])
 
- tune.report(score=score) # Send the score to Tune.
+ session.report({"score": score}) # Send the score to Tune.
 
 
 # __function_api_end__

diff --git a/doc/source/tune/doc_code/pytorch_optuna.py b/doc/source/tune/doc_code/pytorch_optuna.py
@@ -81,6 +81,7 @@ def forward(self, x):
 # 1. Wrap your PyTorch model in an objective function.
 import torch
 from ray import tune
+from ray.air import session
 from ray.tune.search.optuna import OptunaSearch
 
 
@@ -95,7 +96,7 @@ def objective(config):
  while True:
  train(model, optimizer, train_loader) # Train the model
  acc = test(model, test_loader) # Compute test accuracy
- tune.report(mean_accuracy=acc) # Report to Tune
+ session.report({"mean_accuracy": acc}) # Report to Tune
 
 
 # 2. Define a search space and initialize the search algorithm.

diff --git a/doc/source/tune/examples/ax_example.ipynb b/doc/source/tune/examples/ax_example.ipynb
@@ -54,6 +54,7 @@
  "\n",
  "import ray\n",
  "from ray import tune\n",
+ "from ray.air import session\n",
  "from ray.tune.search.ax import AxSearch"
  ]
  },
@@ -112,7 +113,7 @@
  "metadata": {},
  "source": [
  "Next, our `objective` function takes a Tune `config`, evaluates the `landscape` of our experiment in a training loop,\n",
- "and uses `tune.report` to report the `landscape` back to Tune."
+ "and uses `session.report` to report the `landscape` back to Tune."
  ]
  },
  {
@@ -125,8 +126,8 @@
  "def objective(config):\n",
  " for i in range(config[\"iterations\"]):\n",
  " x = np.array([config.get(\"x{}\".format(i + 1)) for i in range(6)])\n",
- " tune.report(\n",
- " timesteps_total=i, landscape=landscape(x), l2norm=np.sqrt((x ** 2).sum())\n",
+ " session.report(\n",
+ " {\"timesteps_total\": i, \"landscape\": landscape(x), \"l2norm\": np.sqrt((x ** 2).sum())}\n",
  " )\n",
  " time.sleep(0.02)"
  ]
@@ -250,7 +251,7 @@
  "id": "91076c5a",
  "metadata": {},
  "source": [
- "Finally, we run the experiment to find the global minimum of the provided landscape (which contains 5 false minima). The argument to metric, `\"landscape\"`, is provided via the `objective` function's `tune.report`. The experiment `\"min\"`imizes the \"mean_loss\" of the `landscape` by searching within `search_space` via `algo`, `num_samples` times or when `\"timesteps_total\": stop_timesteps`. This previous sentence is fully characterizes the search problem we aim to solve. With this in mind, notice how efficient it is to execute `tune.run()`."
+ "Finally, we run the experiment to find the global minimum of the provided landscape (which contains 5 false minima). The argument to metric, `\"landscape\"`, is provided via the `objective` function's `session.report`. The experiment `\"min\"`imizes the \"mean_loss\" of the `landscape` by searching within `search_space` via `algo`, `num_samples` times or when `\"timesteps_total\": stop_timesteps`. The previous sentence fully characterizes the search problem we aim to solve. With this in mind, notice how efficient it is to execute `tune.run()`."
  ]
  },
  {

diff --git a/doc/source/tune/examples/bayesopt_example.ipynb b/doc/source/tune/examples/bayesopt_example.ipynb
@@ -52,6 +52,7 @@
  "\n",
  "import ray\n",
  "from ray import tune\n",
+ "from ray.air import session\n",
  "from ray.tune.search import ConcurrencyLimiter\n",
  "from ray.tune.search.bayesopt import BayesOptSearch"
  ]
@@ -85,7 +86,7 @@
  "metadata": {},
  "source": [
  "Next, our ``objective`` function takes a Tune ``config``, evaluates the `score` of your experiment in a training loop,\n",
- "and uses `tune.report` to report the `score` back to Tune."
+ "and uses `session.report` to report the `score` back to Tune."
  ]
  },
  {
@@ -98,7 +99,7 @@
  "def objective(config):\n",
  " for step in range(config[\"steps\"]):\n",
  " score = evaluate(step, config[\"width\"], config[\"height\"])\n",
- " tune.report(iterations=step, mean_loss=score)"
+ " session.report({\"iterations\": step, \"mean_loss\": score})"
  ]
  },
  {