
[Doc] [Job] Add notes about where Ray Job entrypoint runs and how to specify it #41319

Merged (6 commits) on Nov 22, 2023
13 changes: 12 additions & 1 deletion doc/source/cluster/faq.rst
@@ -91,4 +91,15 @@ reported:
starting ray to verify that the allocations are as expected. For more
detailed information see :ref:`ray-slurm-deploy`.

.. _`known OpenBLAS limitation`: https://github.com/xianyi/OpenBLAS/wiki/faq#how-can-i-use-openblas-in-multi-threaded-applications

Where does my Ray Job entrypoint script run? On the head node or worker nodes?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, jobs submitted using the :ref:`Ray Job API <jobs-quickstart>` run
their `entrypoint` script on the head node. You can change this by specifying
any of the options `--entrypoint-num-cpus`, `--entrypoint-num-gpus`,
`--entrypoint-resources`, or `--entrypoint-memory` to `ray job submit`, or the
corresponding arguments if using the Python SDK. If these are specified, the
job entrypoint will be scheduled on a node that has the requested resources
available.
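
For example, a minimal sketch using the Python SDK, with a placeholder dashboard
address and entrypoint command:

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Placeholder dashboard address; point this at your cluster's dashboard.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    client.submit_job(
        entrypoint="python script.py",  # placeholder entrypoint command
        # Reserving CPUs for the entrypoint means it is scheduled on a node with
        # at least this many CPUs available, not necessarily the head node.
        entrypoint_num_cpus=1,
    )
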
@@ -111,12 +111,19 @@ Make sure to specify the path to the working directory in the ``--working-dir``
# Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
# ------------------------------------------

This command will run the script on the Ray Cluster and wait until the job has finished. Note that it also streams the stdout of the job back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the `stdout` and `stderr` of the entrypoint script back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
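
A rough Python SDK equivalent of this submission might look like the following
sketch (the dashboard address and script name are placeholders; unlike the CLI,
``submit_job`` returns immediately rather than waiting for the job to finish):

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder address

    job_id = client.submit_job(
        entrypoint="python script.py",      # placeholder entrypoint command
        runtime_env={"working_dir": "./"},  # made available to the job on the cluster
    )

    # Fetch whatever stdout/stderr the entrypoint has produced so far.
    print(client.get_job_logs(job_id))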

.. note::

The double dash (`--`) separates the arguments for the entrypoint command (e.g. `python script.py --arg1=val1`) from the arguments to `ray job submit`.

.. note::

By default, the entrypoint script is run on the head node. To override this, specify any of the arguments
Member:
"entrypoint script is run on the head node" => Do you mean the driver process would be running on the head node by default?

Contributor (author):
Yes, we say "entrypoint script" here to convey that it runs whatever the user specifies as the entrypoint. Typically this is a script that starts a Ray driver process (`ray.init()`), but it could also be any command at all, like `echo hello && pip install something`. It technically doesn't have to involve Ray.

Contributor (author):
Short answer: yes, the driver runs on the head node by default.

Member:
Thanks!

`--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources`, or
`--entrypoint-memory` to the `ray job submit` command.
See :ref:`Specifying CPU and GPU resources <ray-job-cpu-gpu-resources>` for more details.

Interacting with Long-running Jobs
----------------------------------

@@ -183,15 +183,19 @@ Using the Python SDK, the syntax looks something like this:
For full details, see the :ref:`API Reference <ray-job-submission-sdk-ref>`.


.. _ray-job-cpu-gpu-resources:

Specifying CPU and GPU resources
--------------------------------

We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script.
By default, the job entrypoint script runs on the head node. We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script.
No extra configuration is needed to do this.
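
As an illustration, a minimal entrypoint script in this spirit keeps the heavy
work inside a Ray task rather than in the script's top level (the task body here
is just a stand-in):

.. code-block:: python

    import ray

    @ray.remote
    def heavy_work(x: int) -> int:
        # Stand-in for real computation; this runs in a Ray worker process,
        # potentially on any node in the cluster, not in the entrypoint itself.
        return x * x

    ray.init()
    results = ray.get([heavy_work.remote(i) for i in range(100)])
    print(sum(results))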

However, if you need to do computation directly in the entrypoint script and would like to reserve CPU and GPU resources for the entrypoint script, you may specify the ``entrypoint_num_cpus``, ``entrypoint_num_gpus``, ``entrypoint_memory``, and ``entrypoint_resources`` arguments to ``submit_job``. These arguments function
identically to the ``num_cpus``, ``num_gpus``, ``resources``, and ``memory`` arguments to the ``@ray.remote()`` decorator for tasks and actors as described in :ref:`resource-requirements`.

If any of these arguments are specified, the entrypoint script will be scheduled on a node with at least the specified resources, instead of the head node, which is the default. For example, the following code will schedule the entrypoint script on a node with at least 1 GPU:

.. code-block:: python

job_id = client.submit_job(
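
The remainder of this code block is collapsed in the diff view. A minimal sketch
of how the full call might look, assuming an existing ``JobSubmissionClient``
named ``client`` and a placeholder entrypoint command:

.. code-block:: python

    job_id = client.submit_job(
        entrypoint="python script.py",  # placeholder entrypoint command
        # Requesting a GPU means the entrypoint is scheduled on a node with at
        # least one GPU available, rather than on the head node.
        entrypoint_num_gpus=1,
    )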