[train+tune][doc] Remove docs sections recommending `RAY_AIR_LOCAL_CACHE_DIR` (#44284)

Removes docs that recommend using the `RAY_AIR_LOCAL_CACHE_DIR` env variable.

---------

Signed-off-by: Justin Yu <[email protected]>
justinvyu committed Apr 1, 2024
1 parent 834eb29 commit 731b53d
Showing 4 changed files with 61 additions and 58 deletions.
72 changes: 52 additions & 20 deletions doc/source/train/user-guides/persistent-storage.rst
@@ -337,10 +337,11 @@ In the example above, we saved some artifacts within the training loop to the wo
If you were training a stable diffusion model, you could save
some sample generated images every so often as a training artifact.

By default, the worker's current working directory is set to the local version of the "trial directory."
For example, this would be ``~/ray_results/experiment_name/TorchTrainer_46367_00000_0_...`` in the example above.
See :ref:`below <train-working-directory>` for how to disable this change in the working directory,
if you want your training workers to keep their original working directories.
By default, Ray Train changes the current working directory of each worker to be inside the run's
:ref:`local staging directory <train-local-staging-dir>`.
This way, all distributed training workers share the same absolute path as the working directory.
See :ref:`below <train-working-directory>` for how to disable this default behavior,
which is useful if you want your training workers to keep their original working directories.

If :class:`RunConfig(SyncConfig(sync_artifacts=True)) <ray.train.SyncConfig>` is set, then
all artifacts saved in this directory will be persisted to storage.
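
A minimal sketch of what this configuration could look like (the storage path here is illustrative):

.. code-block:: python

    from ray.train import RunConfig, SyncConfig

    # Sketch: persist worker artifacts (files written to the working directory)
    # to the run's storage path along with checkpoints.
    run_config = RunConfig(
        storage_path="s3://my_bucket/train_results",  # illustrative path
        sync_config=SyncConfig(sync_artifacts=True),
    )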
@@ -378,25 +379,57 @@ Note that this behavior is off by default.
...


.. _train-storage-advanced:

Advanced configuration
----------------------

Setting the intermediate local directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _train-local-staging-dir:

When a ``storage_path`` is specified, training outputs are saved to an
*intermediate local directory*, then persisted (copied/uploaded) to the ``storage_path``.
By default, this intermediate local directory is a sub-directory of ``~/ray_results``.
Setting the local staging directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Customize this intermediate local directory with the ``RAY_AIR_LOCAL_CACHE_DIR`` environment variable:
.. warning::

.. testcode::
:skipif: True
Prior to 2.10, the ``RAY_AIR_LOCAL_CACHE_DIR`` environment variable and ``RunConfig(local_dir)``
were ways to configure the local staging directory to be outside of the home directory (``~/ray_results``).

import os
os.environ["RAY_AIR_LOCAL_CACHE_DIR"] = "/tmp/custom/"
**These configurations are no longer used to configure the local staging directory.
Please instead use** ``RunConfig(storage_path)`` **to configure where your
run's outputs go.**
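
For example, a minimal sketch of the replacement configuration (the storage path and experiment name are illustrative):

.. code-block:: python

    from ray.train import RunConfig

    # Sketch: configure where run outputs (checkpoints, results) are persisted,
    # instead of the removed `RAY_AIR_LOCAL_CACHE_DIR` environment variable.
    run_config = RunConfig(
        storage_path="s3://my_bucket/train_results",  # illustrative path
        name="experiment_name",
    )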


Apart from files such as checkpoints written directly to the ``storage_path``,
Ray Train also writes some logfiles and metadata files to an intermediate
*local staging directory* before they get persisted (copied/uploaded) to the ``storage_path``.
The current working directory of each worker is set within this local staging directory.

By default, the local staging directory is a sub-directory of the Ray session
directory (e.g., ``/tmp/ray/session_latest``), which is also where other temporary Ray files are dumped.

Customize the location of the staging directory by :ref:`setting the location of the
temporary Ray session directory <temp-dir-log-files>`.
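
As a hedged sketch, assuming a single-node setup where you call ``ray.init()`` yourself (the path is illustrative, and ``_temp_dir`` is the parameter described in the linked section):

.. code-block:: python

    import ray

    # Sketch: relocate the Ray session directory, which also moves the
    # local staging directory that lives underneath it.
    ray.init(_temp_dir="/mnt/large_disk/ray_tmp")  # must be an absolute path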

Here's an example of what the local staging directory looks like:

.. code-block:: text

    /tmp/ray/session_latest/artifacts/<ray-train-job-timestamp>/
    └── experiment_name
        ├── driver_artifacts    <- These are all uploaded to storage periodically
        │   ├── Experiment state snapshot files needed for resuming training
        │   └── Metrics logfiles
        └── working_dirs        <- These are uploaded to storage if `SyncConfig(sync_artifacts=True)`
            └── Current working directory of training workers, which contains worker artifacts
.. warning::

You should not need to look into the local staging directory.
The ``storage_path`` should be the only path that you need to interact with.

The structure of the local staging directory is subject to change
in future versions of Ray Train -- do not rely on these local staging files in your application.

...

.. _train-working-directory:

@@ -439,9 +472,8 @@ directory you launched the training script from.
# NOTE: The working directory is copied to each worker and is read only.
assert os.path.exists("./data.txt"), os.getcwd()

# If `SyncConfig(sync_artifacts=True)`, write artifacts that you want to
# persist in the trial directory.
# Artifacts written in the current working directory will NOT be persisted.
# To use artifact syncing with `SyncConfig(sync_artifacts=True)`,
# write artifacts here, instead of the current working directory:
ray.train.get_context().get_trial_dir()

trainer = TorchTrainer(
@@ -466,5 +498,5 @@ environment variable.
For instance, if you set ``RAY_STORAGE="s3://my_bucket/train_results"``, your
results will automatically be persisted there.

If you manually set a :attr:`RunConfig.storage_path <ray.train.RunConfig.storage_path>`, it
will take precedence over this environment variable.
If you manually set a :attr:`RunConfig.storage_path <ray.train.RunConfig.storage_path>`,
it will take precedence over this environment variable.
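
A brief sketch of setting this from Python (set it before constructing the trainer so it is picked up; the bucket path comes from the example above):

.. code-block:: python

    import os

    # Sketch: default storage path for runs launched from this process.
    # An explicit `RunConfig(storage_path=...)` still takes precedence.
    os.environ["RAY_STORAGE"] = "s3://my_bucket/train_results"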
11 changes: 0 additions & 11 deletions doc/source/tune/doc_code/faq.py
@@ -224,17 +224,6 @@ def f(config, data=None):
tuner.fit()
# __large_data_end__

MyTrainableClass = None

if not MOCK:
    # __log_1_start__
    tuner = tune.Tuner(
        MyTrainableClass,
        run_config=train.RunConfig(storage_path="s3://my-log-dir"),
    )
    tuner.fit()
    # __log_1_end__


import ray

12 changes: 1 addition & 11 deletions doc/source/tune/faq.rst
@@ -570,17 +570,7 @@ be automatically fetched and passed to your trainable as a parameter.
How can I upload my Tune results to cloud storage?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If an upload directory is provided, Tune will automatically sync results from the ``RAY_AIR_LOCAL_CACHE_DIR`` to the given directory,
natively supporting standard URIs for systems like S3, gsutil or HDFS. You can add more filesystems by installing
`fs-spec <https://filesystem-spec.readthedocs.io/en/latest/>`_-compatible filesystems e.g. using pip.

Here is an example of uploading to S3, using a bucket called ``my-log-dir``:

.. literalinclude:: doc_code/faq.py
:dedent:
:language: python
:start-after: __log_1_start__
:end-before: __log_1_end__
See :ref:`tune-cloud-checkpointing`.

Make sure that worker nodes have write access to the cloud storage.
Failing to do so would cause error messages like ``Error message (1): fatal error: Unable to locate credentials``.
24 changes: 8 additions & 16 deletions doc/source/tune/tutorials/tune-storage.rst
@@ -169,13 +169,6 @@ that implements saving and loading checkpoints.
from ray import train, tune
from your_module import my_trainable
# Look for the existing cluster and connect to it
ray.init()
# Set the local caching directory. Results will be stored here
# before they are synced to remote storage. This env variable is ignored
# if `storage_path` below is set to a local directory.
os.environ["RAY_AIR_LOCAL_CACHE_DIR"] = "/tmp/mypath"
tuner = tune.Tuner(
my_trainable,
@@ -198,14 +191,7 @@ that implements saving and loading checkpoints.
# This starts the run!
results = tuner.fit()
In this example, here's how trial checkpoints will be saved:

- On head node where we are running from:
- ``/tmp/mypath/my-tune-exp/<trial_name>/checkpoint_<step>`` (but only for trials running on this node)
- On worker nodes:
- ``/tmp/mypath/my-tune-exp/<trial_name>/checkpoint_<step>`` (but only for trials running on this node)
- S3:
- ``s3://my-checkpoints-bucket/path/my-tune-exp/<trial_name>/checkpoint_<step>`` (all trials)
In this example, trial checkpoints will be saved to: ``s3://my-checkpoints-bucket/path/my-tune-exp/<trial_name>/checkpoint_<step>``

.. _tune-syncing-restore-from-uri:

@@ -218,7 +204,7 @@ you can resume it any time starting from the experiment state saved in the cloud
tuner = tune.Tuner.restore(
"s3:https://my-checkpoints-bucket/path/my-tune-exp",
trainable=my_trainable,
resume_errored=True
resume_errored=True,
)
tuner.fit()
@@ -228,3 +214,9 @@ There are a few options for restoring an experiment:
Please see the documentation of
:meth:`Tuner.restore() <ray.tune.tuner.Tuner.restore>` for more details.
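
For instance, a hedged sketch that restores only when a previous experiment exists at the path, otherwise starting fresh (assuming ``my_trainable`` and the bucket path from the snippet above, and that ``Tuner.can_restore`` is available in your Ray version):

.. code-block:: python

    from ray import tune

    experiment_path = "s3://my-checkpoints-bucket/path/my-tune-exp"

    # Sketch: resume the experiment if its state is found at the path,
    # otherwise start a new run.
    if tune.Tuner.can_restore(experiment_path):
        tuner = tune.Tuner.restore(
            experiment_path,
            trainable=my_trainable,
            resume_errored=True,
        )
    else:
        tuner = tune.Tuner(my_trainable)
    tuner.fit()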


Advanced configuration
----------------------

See :ref:`Ray Train's section on advanced storage configuration <train-storage-advanced>`.
All of these configurations also apply to Ray Tune.
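
As a brief sketch (mirroring the Train examples above; ``my_trainable`` and the storage path are illustrative):

.. code-block:: python

    from ray import train, tune

    # Sketch: the same `storage_path` configuration works for Tune runs.
    tuner = tune.Tuner(
        my_trainable,
        run_config=train.RunConfig(storage_path="s3://my-checkpoints-bucket/path"),
    )
    results = tuner.fit()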
