[Doc] [Job] Add notes about where Ray Job entrypoint runs and how to specify it (#41319)

There is recurring user confusion about where the job entrypoint script runs and how to make it run on a worker node.

This PR adds the missing information in the relevant places in the tutorials, and also adds it to the FAQ.

---------

Signed-off-by: Archit Kulkarni <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
architkulkarni and kevin85421 committed Nov 22, 2023
1 parent 53da874 commit 80a1770
Showing 3 changed files with 25 additions and 3 deletions.
13 changes: 12 additions & 1 deletion doc/source/cluster/faq.rst
@@ -91,4 +91,15 @@ reported:
starting ray to verify that the allocations are as expected. For more
detailed information see :ref:`ray-slurm-deploy`.

.. _`known OpenBLAS limitation`: https://github.com/xianyi/OpenBLAS/wiki/faq#how-can-i-use-openblas-in-multi-threaded-applications

Where does my Ray Job entrypoint script run? On the head node or worker nodes?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, jobs submitted using the :ref:`Ray Job API <jobs-quickstart>` run
their `entrypoint` script on the head node. You can change this by specifying
any of the options `--entrypoint-num-cpus`, `--entrypoint-num-gpus`,
`--entrypoint-resources`, or `--entrypoint-memory` to `ray job submit`, or the
corresponding arguments if using the Python SDK. If any of these options are
specified, the job entrypoint is scheduled on a node that has the requested
resources available.
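
For example, a minimal sketch using the Python SDK (the dashboard address and
script name below are placeholders for your own values):

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Placeholder address; point this at your cluster's dashboard.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    # Because entrypoint resources are requested, the entrypoint command is
    # scheduled on a node with at least 1 CPU and 1 GPU available, rather
    # than defaulting to the head node.
    job_id = client.submit_job(
        entrypoint="python script.py",
        entrypoint_num_cpus=1,
        entrypoint_num_gpus=1,
    )
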
@@ -111,12 +111,19 @@ Make sure to specify the path to the working directory in the ``--working-dir``
# Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
# ------------------------------------------
This command will run the script on the Ray Cluster and wait until the job has finished. Note that it also streams the stdout of the job back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the `stdout` and `stderr` of the entrypoint script back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
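
For reference, a rough Python SDK equivalent of this submission (assuming the
dashboard is reachable at the default address) passes the working directory
through ``runtime_env`` instead of the ``--working-dir`` flag:

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Placeholder address; point this at your cluster's dashboard.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    # The working directory is uploaded and made available to the job on the
    # cluster, mirroring the --working-dir CLI flag.
    job_id = client.submit_job(
        entrypoint="python script.py",
        runtime_env={"working_dir": "./"},
    )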

.. note::

The double dash (`--`) separates the arguments for the entrypoint command (e.g. `python script.py --arg1=val1`) from the arguments to `ray job submit`.
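
With the Python SDK there is no separator to worry about; as a rough sketch,
the entire entrypoint command, including its own arguments, is passed as one
string:

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder address

    # The whole command, arguments included, is a single string, so no `--`
    # separator is needed here.
    job_id = client.submit_job(entrypoint="python script.py --arg1=val1")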

.. note::

By default, the entrypoint script runs on the head node. To override this, specify any of the arguments
`--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources`, or
`--entrypoint-memory` to the `ray job submit` command.
See :ref:`Specifying CPU and GPU resources <ray-job-cpu-gpu-resources>` for more details.

Interacting with Long-running Jobs
----------------------------------

@@ -183,15 +183,19 @@ Using the Python SDK, the syntax looks something like this:
For full details, see the :ref:`API Reference <ray-job-submission-sdk-ref>`.


.. _ray-job-cpu-gpu-resources:

Specifying CPU and GPU resources
--------------------------------

We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script.
By default, the job entrypoint script runs on the head node. We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script.
No extra configuration is needed to do this.

However, if you need to do computation directly in the entrypoint script and would like to reserve CPU and GPU resources for the entrypoint script, you may specify the ``entrypoint_num_cpus``, ``entrypoint_num_gpus``, ``entrypoint_memory``, and ``entrypoint_resources`` arguments to ``submit_job``. These arguments function
identically to the ``num_cpus``, ``num_gpus``, ``memory``, and ``resources`` arguments to the ``@ray.remote()`` decorator for tasks and actors, as described in :ref:`resource-requirements`.

If any of these arguments are specified, the entrypoint script will be scheduled on a node with at least the specified resources, instead of the head node, which is the default. For example, the following code will schedule the entrypoint script on a node with at least 1 GPU:

.. code-block:: python

    # A sketch of the call; the entrypoint value below is illustrative.
    job_id = client.submit_job(
        entrypoint="python script.py",
        entrypoint_num_gpus=1,
    )
