From e3d6d793d5a093810bf3753af3d8b178f789095f Mon Sep 17 00:00:00 2001
From: Archit Kulkarni
Date: Tue, 21 Nov 2023 14:44:12 -0800
Subject: [PATCH 1/5] Add notes about where Ray Job entrypoint runs and how to
 specify it

Signed-off-by: Archit Kulkarni
---
 doc/source/cluster/faq.rst                      | 13 ++++++++++++-
 .../job-submission/quickstart.rst               |  9 ++++++++-
 .../running-applications/job-submission/sdk.rst |  6 +++++-
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/doc/source/cluster/faq.rst b/doc/source/cluster/faq.rst
index a44cc38177f17..f476995292ba2 100644
--- a/doc/source/cluster/faq.rst
+++ b/doc/source/cluster/faq.rst
@@ -91,4 +91,15 @@ reported:
    starting ray to verify that the allocations are as expected.
 For more detailed information see :ref:`ray-slurm-deploy`.
 
-.. _`known OpenBLAS limitation`: https://github.com/xianyi/OpenBLAS/wiki/faq#how-can-i-use-openblas-in-multi-threaded-applications
+.. _`known OpenBLAS limitation`: https://github.com/xianyi/OpenBLAS/wiki/faq#how-can-i-use-openblas-in-multi-threaded-applications
+
+Where does my Ray Job entrypoint script run? On the head node or worker nodes?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, jobs submitted using the :ref:`Ray Job API ` run
+their `entrypoint` script on the head node. You can change this by specifying
+any of the options `--entrypoint-num-cpus`, `--entrypoint-num-gpus`,
+`--entrypoint-resources`, or `--entrypoint-memory` to `ray job submit`, or the
+corresponding arguments if using the Python SDK. If any of these are specified,
+the job entrypoint is scheduled on a node that has the requested resources
+available.
\ No newline at end of file
diff --git a/doc/source/cluster/running-applications/job-submission/quickstart.rst b/doc/source/cluster/running-applications/job-submission/quickstart.rst
index 53001cbc08838..b6d4b9142b865 100644
--- a/doc/source/cluster/running-applications/job-submission/quickstart.rst
+++ b/doc/source/cluster/running-applications/job-submission/quickstart.rst
@@ -111,12 +111,19 @@ Make sure to specify the path to the working directory in the ``--working-dir``
 # Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
 # ------------------------------------------
 
-This command will run the script on the Ray Cluster and wait until the job has finished. Note that it also streams the stdout of the job back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
+This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the stdout of the job back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
 
 .. note::
 
     The double dash (`--`) separates the arguments for the entrypoint command (e.g. `python script.py --arg1=val1`) from the arguments to `ray job submit`.
 
+.. note::
+
+    By default, the entrypoint script runs on the head node. To override this, specify any of the arguments
+    `--entrypoint-num-cpus`, `--entrypoint-num-gpus`, `--entrypoint-resources`, or
+    `--entrypoint-memory` to the `ray job submit` command.
+    See :ref:`Specifying CPU and GPU resources <ray-job-cpu-gpu-resources>` for more details.
+
 Interacting with Long-running Jobs
 ----------------------------------
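An aside for readers of this patch: the paragraph revised above says that `ray job submit` waits for the job and streams its output back to the client. A minimal sketch of the programmatic equivalent using the Python SDK's `JobSubmissionClient` follows; the dashboard address, entrypoint command, and working directory are illustrative assumptions, not values taken from this patch.

.. code-block:: python

    import asyncio

    from ray.job_submission import JobSubmissionClient

    # Assumed dashboard address for a cluster started with `ray start --head`.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    # Submit a quickstart-style job; the working directory is downloaded to
    # all nodes of the cluster, as the paragraph above describes.
    job_id = client.submit_job(
        entrypoint="python script.py",  # hypothetical entrypoint command
        runtime_env={"working_dir": "./"},
    )

    async def follow_logs() -> None:
        # tail_job_logs yields chunks of the job's output as they arrive,
        # mirroring the streaming behavior of `ray job submit`.
        async for chunk in client.tail_job_logs(job_id):
            print(chunk, end="")

    asyncio.run(follow_logs())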
diff --git a/doc/source/cluster/running-applications/job-submission/sdk.rst b/doc/source/cluster/running-applications/job-submission/sdk.rst
index 2517fafa56baa..b3d71af46c830 100644
--- a/doc/source/cluster/running-applications/job-submission/sdk.rst
+++ b/doc/source/cluster/running-applications/job-submission/sdk.rst
@@ -183,15 +183,19 @@ Using the Python SDK, the syntax looks something like this:
 
 For full details, see the :ref:`API Reference `.
 
+.. _ray-job-cpu-gpu-resources:
+
 Specifying CPU and GPU resources
 --------------------------------
 
-We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script.
+By default, the job entrypoint script runs on the head node. We recommend doing heavy computation within Ray tasks, actors, or Ray libraries, not directly in the top level of your entrypoint script. No extra configuration is needed to do this.
 
 However, if you need to do computation directly in the entrypoint script and would like to reserve CPU and GPU resources for the entrypoint script, you may specify the ``entrypoint_num_cpus``, ``entrypoint_num_gpus``, ``entrypoint_memory`` and ``entrypoint_resources`` arguments to ``submit_job``. These arguments function identically to the ``num_cpus``, ``num_gpus``, ``resources``, and ``_memory`` arguments to ``@ray.remote()`` decorator for tasks and actors as described in :ref:`resource-requirements`.
 
+If any of these arguments are specified, the entrypoint script will be scheduled on a node with at least the specified resources, instead of the head node, which is the default. For example, the following code will schedule the entrypoint script on a node with at least 1 GPU:
+
 .. code-block:: python
 
     job_id = client.submit_job(
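Note that the hunk above ends at the opening line of a pre-existing code example, so the body of that example falls outside the diff context. As a hedged sketch of what a complete call of the kind the added sentence describes might look like (the entrypoint command and dashboard address are assumptions):

.. code-block:: python

    from ray.job_submission import JobSubmissionClient

    # Assumed address of the Ray dashboard on the head node.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    job_id = client.submit_job(
        entrypoint="python gpu_script.py",  # hypothetical GPU workload
        # Reserving 1 GPU for the entrypoint process itself means Ray schedules
        # it on a node with at least 1 GPU rather than on the head node.
        entrypoint_num_gpus=1,
    )

The CLI counterpart is the `--entrypoint-num-gpus` option of `ray job submit`, covered by the FAQ entry added earlier in this patch.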
From c1282ba8ee293b2f9d5ec20d5a4508c9b263698a Mon Sep 17 00:00:00 2001
From: Archit Kulkarni
Date: Tue, 21 Nov 2023 15:58:58 -0800
Subject: [PATCH 2/5] Update
 doc/source/cluster/running-applications/job-submission/quickstart.rst

Co-authored-by: Kai-Hsun Chen
Signed-off-by: Archit Kulkarni
---
 .../cluster/running-applications/job-submission/quickstart.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/cluster/running-applications/job-submission/quickstart.rst b/doc/source/cluster/running-applications/job-submission/quickstart.rst
index b6d4b9142b865..ff53bf9a82886 100644
--- a/doc/source/cluster/running-applications/job-submission/quickstart.rst
+++ b/doc/source/cluster/running-applications/job-submission/quickstart.rst
@@ -111,7 +111,7 @@ Make sure to specify the path to the working directory in the ``--working-dir``
 # Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
 # ------------------------------------------
 
-This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the stdout of the job back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
+This command will run the entrypoint script on the Ray cluster's head node and wait until the job has finished. Note that it also streams the stdout of the job back to the client (`hello world` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
 
 .. note::

From 8e194d09efdb67006a5cf673c1b84fa657b313b2 Mon Sep 17 00:00:00 2001
From: Archit Kulkarni
Date: Tue, 21 Nov 2023 16:12:13 -0800
Subject: [PATCH 3/5] Change stdout to "output"

Signed-off-by: Archit Kulkarni
---
 .../cluster/running-applications/job-submission/quickstart.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/cluster/running-applications/job-submission/quickstart.rst b/doc/source/cluster/running-applications/job-submission/quickstart.rst
index b6d4b9142b865..a5a82884e08e9 100644
--- a/doc/source/cluster/running-applications/job-submission/quickstart.rst
+++ b/doc/source/cluster/running-applications/job-submission/quickstart.rst
@@ -111,7 +111,7 @@ Make sure to specify the path to the working directory in the ``--working-dir``
 # Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
 # ------------------------------------------
 
-This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the stdout of the job back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
+This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the output of the entrypoint script back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
 
 .. note::

From 61d65ff7b4153410c8218dde51e37d7f21dea824 Mon Sep 17 00:00:00 2001
From: Archit Kulkarni
Date: Tue, 21 Nov 2023 16:13:24 -0800
Subject: [PATCH 4/5] Change the text to "streams the output"

Signed-off-by: Archit Kulkarni
---
 .../cluster/running-applications/job-submission/quickstart.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/cluster/running-applications/job-submission/quickstart.rst b/doc/source/cluster/running-applications/job-submission/quickstart.rst
index f1a81f9cc6ed7..a5a82884e08e9 100644
--- a/doc/source/cluster/running-applications/job-submission/quickstart.rst
+++ b/doc/source/cluster/running-applications/job-submission/quickstart.rst
@@ -111,7 +111,7 @@ Make sure to specify the path to the working directory in the ``--working-dir``
 # Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
 # ------------------------------------------
 
-This command will run the entrypoint script on the Ray cluster's head node and wait until the job has finished. Note that it also streams the stdout of the job back to the client (`hello world` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
+This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the output of the entrypoint script back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
 
 .. note::
From ab7950c35c9211145d4dd9329153d27a8532bc6d Mon Sep 17 00:00:00 2001
From: Archit Kulkarni
Date: Tue, 21 Nov 2023 16:16:59 -0800
Subject: [PATCH 5/5] Use stdout and stderr

Signed-off-by: Archit Kulkarni
---
 .../cluster/running-applications/job-submission/quickstart.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/source/cluster/running-applications/job-submission/quickstart.rst b/doc/source/cluster/running-applications/job-submission/quickstart.rst
index a5a82884e08e9..78ee0bca61b7d 100644
--- a/doc/source/cluster/running-applications/job-submission/quickstart.rst
+++ b/doc/source/cluster/running-applications/job-submission/quickstart.rst
@@ -111,7 +111,7 @@ Make sure to specify the path to the working directory in the ``--working-dir``
 # Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded
 # ------------------------------------------
 
-This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the output of the entrypoint script back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
+This command will run the entrypoint script on the Ray Cluster's head node and wait until the job has finished. Note that it also streams the `stdout` and `stderr` of the entrypoint script back to the client (``hello world`` in this case). Ray will also make the contents of the directory passed as `--working-dir` available to the Ray job by downloading the directory to all nodes in your cluster.
 
 .. note::