[Doc] [Job] Add notes about where Ray Job entrypoint runs and how to specify it (#41319)

There is recurring user confusion about where the job entrypoint script runs and how to make it run on a worker node.

This PR adds the missing information in the relevant places in the tutorials, and also adds it to the FAQ.

---------

Signed-off-by: Archit Kulkarni <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
architkulkarni and kevin85421 committed Nov 22, 2023
1 parent 53da874 commit 80a1770
Showing 3 changed files with 25 additions and 3 deletions.
13 changes: 12 additions & 1 deletion doc/source/cluster/faq.rst
@@ -91,4 +91,15 @@ reported:
starting ray to verify that the allocations are as expected. For more
detailed information see :ref:`ray-slurm-deploy`.

.. _`known OpenBLAS limitation`: https://github.com/xianyi/OpenBLAS/wiki/faq#how-can-i-use-openblas-in-multi-threaded-applications

Where does my Ray Job entrypoint script run? On the head node or worker nodes?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, jobs submitted using the :ref:`Ray Job API <jobs-quickstart>` run
their `entrypoint` script on the head node. You can change this by specifying
any of the options `--entrypoint-num-cpus`, `--entrypoint-num-gpus`,
`--entrypoint-resources`, or `--entrypoint-memory` to `ray job submit`, or the
corresponding arguments if using the Python SDK. If any of these options are
specified, the job entrypoint is scheduled on a node that has the requested
resources available.
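
For example, a minimal sketch using the Python SDK (the dashboard address and
script name below are placeholders for your own values):

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Placeholder address; point this at your cluster's dashboard.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    # Because entrypoint resources are requested, the entrypoint command is
    # scheduled on a node with at least 1 CPU and 1 GPU available, rather
    # than defaulting to the head node.
    job_id = client.submit_job(
        entrypoint="python script.py",
        entrypoint_num_cpus=1,
        entrypoint_num_gpus=1,
    )
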
@@ -111,12 +111,19 @@ Make sure to specify the path to the working directory in the ``--working-dir``
# Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
# ------------------------------------------
This command will run the script on the Ray Cluster and wait until the job has finished. Note that it also streams the stdout of the job back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the `stdout` and `stderr` of the entrypoint script back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
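
For reference, a rough Python SDK equivalent of this submission (assuming the
dashboard is reachable at the default address) passes the working directory
through ``runtime_env`` instead of the ``--working-dir`` flag:

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Placeholder address; point this at your cluster's dashboard.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    # The working directory is uploaded and made available to the job on the
    # cluster, mirroring the --working-dir CLI flag.
    job_id = client.submit_job(
        entrypoint="python script.py",
        runtime_env={"working_dir": "./"},
    )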

.. note::

The double dash (`--`) separates the arguments for the entrypoint command (e.g. `python script.py --arg1=val1`) from the arguments to `ray job submit`.
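
With the Python SDK there is no separator to worry about; as a rough sketch,
the entire entrypoint command, including its own arguments, is passed as one
string:

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder address

    # The whole command, arguments included, is a single string, so no `--`
    # separator is needed here.
    job_id = client.submit_job(entrypoint="python script.py --arg1=val1")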

.. note::

By default, the entrypoint script runs on the head node. To override this, specify any of the arguments
`--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources`, or
`--entrypoint-memory` to the `ray job submit` command.
See :ref:`Specifying CPU and GPU resources <ray-job-cpu-gpu-resources>` for more details.

Interacting with Long-running Jobs
----------------------------------

@@ -183,15 +183,19 @@ Using the Python SDK, the syntax looks something like this:
For full details, see the :ref:`API Reference <ray-job-submission-sdk-ref>`.


.. _ray-job-cpu-gpu-resources:

Specifying CPU and GPU resources
--------------------------------

We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script.
By default, the job entrypoint script runs on the head node. We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script.
No extra configuration is needed to do this.

However, if you need to do computation directly in the entrypoint script and would like to reserve CPU and GPU resources for the entrypoint script, you may specify the ``entrypoint_num_cpus``, ``entrypoint_num_gpus``, ``entrypoint_memory``, and ``entrypoint_resources`` arguments to ``submit_job``. These arguments function
identically to the ``num_cpus``, ``num_gpus``, ``memory``, and ``resources`` arguments to the ``@ray.remote()`` decorator for tasks and actors, as described in :ref:`resource-requirements`.

If any of these arguments are specified, the entrypoint script will be scheduled on a node with at least the specified resources, instead of the head node, which is the default. For example, the following code will schedule the entrypoint script on a node with at least 1 GPU:

.. code-block:: python

    # A sketch of the call; the entrypoint value below is illustrative.
    job_id = client.submit_job(
        entrypoint="python script.py",
        entrypoint_num_gpus=1,
    )
