[docs][observability] O11y refactor 2 #35279

Merged
merged 33 commits into master from o11y-refactor-2 on May 22, 2023
Changes from 1 commit
Commits
33 commits
aab97f1
moving log persistence content to clusters
angelinalg May 12, 2023
3e11e1e
move redirect stdout and stderr; add and rename some files
angelinalg May 12, 2023
414a314
fix notes and code blocks from rst to md syntax, fix heading hierarch…
angelinalg May 12, 2023
50f9291
rename overview content to key concepts
angelinalg May 12, 2023
59a0ed8
new page for getting started on observability with programmatic inter…
angelinalg May 16, 2023
c035142
Merge branch 'master' into o11y-refactor-2
angelinalg May 16, 2023
8ea2d44
move content to new page for config and manage dashboard
angelinalg May 16, 2023
529187f
populating configure dashboard doc and moving to clusters section
angelinalg May 16, 2023
f18b2b8
forgot to save changes
angelinalg May 16, 2023
cacd619
change title name on side nav
angelinalg May 16, 2023
794130f
change title name on side nav
angelinalg May 16, 2023
ce42aba
changing headers and titles to gerunds for consistency
angelinalg May 16, 2023
32618ab
remove overview, rename dashboard page to getting started
angelinalg May 16, 2023
1448b4d
fixed file location to move up a level
angelinalg May 16, 2023
f19deaf
fixed titles
angelinalg May 16, 2023
408bc0f
another iteration of titles to fit side nav better
angelinalg May 17, 2023
0af82b9
Merge branch 'master' into o11y-refactor-2
angelinalg May 17, 2023
e78de8d
create reference subdirectory; moved content for cli GS and metrics
angelinalg May 17, 2023
6827e25
add youtube links for overview and jobs videos
angelinalg May 17, 2023
0819a37
add five youtube links; fix anchors for api and cli refs
angelinalg May 17, 2023
d5f5558
add two youtube links
angelinalg May 17, 2023
399e9b8
moving some misplaced logging content; fixed extra blank spaces after…
angelinalg May 17, 2023
4d64a1e
removed unnecessary file hierarchy
angelinalg May 17, 2023
c2cc47d
list guides in index of user guides, fixed anchors
angelinalg May 17, 2023
c29d4c2
move gotchas content to troubleshooting apps
angelinalg May 18, 2023
7929f38
revert inadvertent deletion of section
angelinalg May 18, 2023
4ae150f
change title
angelinalg May 18, 2023
fc16269
fixed indentation
angelinalg May 18, 2023
bf9c611
Merge branch 'master' into o11y-refactor-2
angelinalg May 18, 2023
3122f7e
fixing links
angelinalg May 22, 2023
b32a6b8
fixing links
angelinalg May 22, 2023
181b0cf
fixing merge conflicts
angelinalg May 22, 2023
68019c1
fixing broken test and broken note
angelinalg May 22, 2023
move content to new page for config and manage dashboard
Signed-off-by: angelinalg <[email protected]>
angelinalg committed May 16, 2023
commit 8ea2d4475e8a9b14e39f8830dccfa993b83483c8
@@ -13,44 +13,6 @@ The rest of this page will focus on how to access these services when running a

.. _monitor-cluster-via-dashboard:

Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:ref:`The dashboard <ray-dashboard>` provides detailed information about the state of the cluster,
including the running jobs, actors, workers, nodes, etc.
By default, the :ref:`cluster launcher <vm-cluster-quick-start>` and :ref:`KubeRay operator <kuberay-quickstart>` will launch the dashboard, but will
not publicly expose the port.

.. tab-set::

.. tab-item:: If using the VM cluster launcher

You can securely port-forward local traffic to the dashboard via the ``ray
dashboard`` command.

.. code-block:: shell

$ ray dashboard [-p <port, 8265 by default>] <cluster config file>

The dashboard will now be visible at ``http://localhost:8265``.

.. tab-item:: If using Kubernetes

The KubeRay operator makes the dashboard available via a Service targeting
the Ray head pod, named ``<RayCluster name>-head-svc``. You can access the
dashboard from within the Kubernetes cluster at ``http://<RayCluster name>-head-svc:8265``.

You can also view the dashboard from outside the Kubernetes cluster by
using port-forwarding:

.. code-block:: shell

$ kubectl port-forward service/raycluster-autoscaler-head-svc 8265:8265

For more information about configuring network access to a Ray cluster on
Kubernetes, see the :ref:`networking notes <kuberay-networking>`.


Using Ray Cluster CLI tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^

119 changes: 0 additions & 119 deletions doc/source/ray-core/ray-dashboard.rst
@@ -446,125 +446,6 @@ To understand the log file structure of Ray, see the :ref:`Logging directory str

The logs view provides search functionality to help you find specific log messages.

Advanced Usage
--------------

Changing Dashboard Ports
~~~~~~~~~~~~~~~~~~~~~~~~

.. tab-set::

.. tab-item:: Single-node local cluster

**CLI**

To customize the port on which the dashboard runs, you can pass
the ``--dashboard-port`` argument with ``ray start`` in the command line.

**ray.init**

If you need to customize the port on which the dashboard will run, you can pass the
keyword argument ``dashboard_port`` in your call to ``ray.init()``.
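For example, a minimal sketch (the port number is illustrative):

.. code-block:: python

import ray

# Serve the dashboard on port 8266 instead of the default 8265.
ray.init(dashboard_port=8266)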

.. tab-item:: VM Cluster Launcher

To customize the dashboard port while using the VM cluster launcher, add the ``--dashboard-port`` argument with the desired port number
to the ``ray start --head`` command in the ``head_start_ray_commands`` section of the `cluster launcher's YAML file <https://github.com/ray-project/ray/blob/0574620d454952556fa1befc7694353d68c72049/python/ray/autoscaler/aws/example-full.yaml#L172>`_.

.. tab-item:: KubeRay

See the `Specifying non-default ports <https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#specifying-non-default-ports>`_ page.

Viewing Built-in Dashboard API Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The dashboard is powered by a server that serves both the UI code and the data about the cluster via API endpoints.
There are basic Prometheus metrics that are emitted for each of these API endpoints:

``ray_dashboard_api_requests_count_requests_total``: Collects the total count of requests. This is tagged by endpoint, method, and http_status.

``ray_dashboard_api_requests_duration_seconds_bucket``: Collects the duration of requests. This is tagged by endpoint and method.

For example, you can view the p95 duration of all requests with this query:

.. code-block:: text

histogram_quantile(0.95, sum(rate(ray_dashboard_api_requests_duration_seconds_bucket[5m])) by (le))

You can query these metrics via the Prometheus or Grafana UI. For instructions on setting up these tools, see :ref:`Ray Metrics <ray-metrics>`.


Running Behind a Reverse Proxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The dashboard should work out-of-the-box when accessed via a reverse proxy. API requests don't need to be proxied individually.

Always access the dashboard with a trailing ``/`` at the end of the URL.
For example, if your proxy is set up to handle requests to ``/ray/dashboard``, view the dashboard at ``www.my-website.com/ray/dashboard/``.

The dashboard now sends HTTP requests with relative URL paths. Browsers will handle these requests as expected when the ``window.location.href`` ends in a trailing ``/``.

This is a peculiarity of how many browsers handle requests with relative URLs, despite what `MDN <https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL#examples_of_relative_urls>`_
defines as the expected behavior.

Make your dashboard visible without a trailing ``/`` by including a rule in your reverse proxy that
redirects the user's browser to the trailing-slash path, e.g., ``/ray/dashboard`` --> ``/ray/dashboard/``.

Below is an example with a `traefik <https://doc.traefik.io/traefik/getting-started/quick-start/>`_ TOML file that accomplishes this:

.. code-block:: toml

[http]
[http.routers]
[http.routers.to-dashboard]
rule = "PathPrefix(`/ray/dashboard`)"
middlewares = ["test-redirectregex", "strip"]
service = "dashboard"
[http.middlewares]
[http.middlewares.test-redirectregex.redirectRegex]
regex = "^(.*)/ray/dashboard$"
replacement = "${1}/ray/dashboard/"
[http.middlewares.strip.stripPrefix]
prefixes = ["/ray/dashboard"]
[http.services]
[http.services.dashboard.loadBalancer]
[[http.services.dashboard.loadBalancer.servers]]
url = "http:https://localhost:8265"

Disabling the Dashboard
~~~~~~~~~~~~~~~~~~~~~~~
The dashboard is included in the ``ray[default]`` installation by default and is started automatically.

To disable the dashboard, use the ``--include-dashboard`` argument as shown below.

.. tab-set::

.. tab-item:: Single-node local cluster

**CLI**

.. code-block:: bash

ray start --include-dashboard=False

**ray.init**

.. testcode::
:hide:

import ray
ray.shutdown()

.. testcode::

ray.init(include_dashboard=False)

.. tab-item:: VM Cluster Launcher

To disable the dashboard while using the VM cluster launcher, include the ``ray start --head --include-dashboard=False`` argument
in the ``head_start_ray_commands`` section of the `cluster launcher's YAML file <https://github.com/ray-project/ray/blob/0574620d454952556fa1befc7694353d68c72049/python/ray/autoscaler/aws/example-full.yaml#L172>`_.

.. tab-item:: KubeRay

TODO

.. _dash-reference:

Page References
201 changes: 201 additions & 0 deletions doc/source/ray-observability/config-manage-dashboard.rst
@@ -0,0 +1,201 @@
.. _observability-key-concepts:

Key Concepts
============

This section covers key concepts for Ray's monitoring and debugging tools and features.

Dashboard (Web UI)
------------------
Ray provides a web-based dashboard to help users monitor the cluster. When a new cluster starts, the dashboard is available
at the default address ``localhost:8265`` (the port is automatically incremented if 8265 is already occupied).

See :ref:`Ray Dashboard <ray-dashboard>` for more details.
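For example, a minimal sketch of reading the dashboard address programmatically (assuming ``ray.init()`` returns a context object exposing ``dashboard_url``):

.. code-block:: python

import ray

# Start a local cluster; the dashboard serves on localhost:8265 by default.
context = ray.init()
print(f"Dashboard available at: {context.dashboard_url}")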

Application Logging
-------------------
By default, all stdout and stderr of tasks and actors are streamed to the Ray driver (the entrypoint script that calls ``ray.init``).

.. literalinclude:: doc_code/app_logging.py
:language: python

All stdout emitted from ``print`` calls is shown on the driver with a prefix of the form ``(task or actor repr, process ID, IP address)``.

.. code-block:: bash

(pid=45601) task
(Actor pid=480956) actor

See :ref:`Logging <ray-logging>` for more details.

Driver logs
~~~~~~~~~~~
The entry point of a Ray application, the script that calls ``ray.init()``, is called the driver.
Driver logs are handled in the same way as logs from normal Python programs.
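For example, a minimal sketch (standard Python logging on the driver; nothing Ray-specific is required):

.. code-block:: python

import logging

import ray

logging.basicConfig(level=logging.INFO)
ray.init()

# Driver logs behave like any ordinary Python program's logs.
logging.info("This message appears in the driver's output.")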

Job logs
~~~~~~~~
Logs for jobs submitted via the :ref:`Ray Jobs API <jobs-overview>` can be retrieved with the ``ray job logs`` :ref:`CLI command <ray-job-logs-doc>`, or with ``JobSubmissionClient.get_job_logs()`` or ``JobSubmissionClient.tail_job_logs()`` via the :ref:`Python SDK <ray-job-submission-sdk-ref>`.
The log file consists of the stdout of the job's entrypoint command. For the location of the log file on disk, see :ref:`Logging directory structure <logging-directory-structure>`.
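For example, a minimal sketch using the Python SDK (the address and job ID are illustrative):

.. code-block:: python

from ray.job_submission import JobSubmissionClient

# Connect to the cluster's dashboard address (illustrative).
client = JobSubmissionClient("http://127.0.0.1:8265")

# Fetch the full stdout of a previously submitted job (job ID is hypothetical).
print(client.get_job_logs("raysubmit_tUAuCKubPAEXh6CW"))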

.. _ray-worker-logs:

Worker logs
~~~~~~~~~~~
Ray's tasks or actors are executed remotely within Ray's worker processes. Ray has special support to improve the visibility of logs produced by workers.

- By default, stdout and stderr from all tasks and actors are redirected to the worker log files. Check out :ref:`Logging directory structure <logging-directory-structure>` to learn how Ray's logging directory is structured.
- By default, stdout and stderr redirected to the worker log files are also published to the driver. The driver displays logs generated by its tasks and actors in its own stdout and stderr.

Let's look at a code example to see how this works.

.. code-block:: python

import ray

# Initiate a driver.
ray.init()

@ray.remote
def task():
    print("task")

ray.get(task.remote())

You should see the string ``task`` in the driver's stdout.

When logs are printed, the process ID (pid) and the IP address of the node that executes the task or actor are printed with them. Check out the output below.

.. code-block:: bash

(pid=45601) task

Actor log messages look like the following by default.

.. code-block:: bash

(MyActor pid=480956) actor log message

Accessing Ray States
--------------------
Starting from Ray 2.0, Ray supports CLI and Python APIs for querying the state of resources (e.g., actors, tasks, objects).

For example, the following command summarizes the task states of the cluster.

.. code-block:: bash

ray summary tasks

.. code-block:: text

======== Tasks Summary: 2022-07-22 08:54:38.332537 ========
Stats:
------------------------------------
total_actor_scheduled: 2
total_actor_tasks: 0
total_tasks: 2


Table (group by func_name):
------------------------------------
FUNC_OR_CLASS_NAME STATE_COUNTS TYPE
0 task_running_300_seconds RUNNING: 2 NORMAL_TASK
1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK

The following command lists all actors in the cluster.

.. code-block:: bash

ray list actors

.. code-block:: text

======== List: 2022-07-23 21:29:39.323925 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
ACTOR_ID CLASS_NAME NAME PID STATE
0 31405554844820381c2f0f8501000000 Actor 96956 ALIVE
1 f36758a9f8871a9ca993b1d201000000 Actor 96955 ALIVE

See :ref:`Ray State API <state-api-overview-ref>` for more details.
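A minimal sketch of the equivalent Python calls (assuming Ray 2.4+, where the Python state API lives under ``ray.util.state``):

.. code-block:: python

import ray
from ray.util.state import list_actors, summarize_tasks

ray.init()

# Programmatic equivalents of `ray summary tasks` and `ray list actors`.
print(summarize_tasks())
print(list_actors())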

Metrics
-------
Ray collects and exposes physical stats (e.g., CPU, memory, GPU memory (GRAM), disk, and network usage of each node),
internal stats (e.g., the number of actors in the cluster, the number of worker failures in the cluster),
and custom metrics (e.g., metrics defined by users). All stats can be exported as time series data (to Prometheus by default) and used
to monitor the cluster over time.

See :ref:`Ray Metrics <ray-metrics>` for more details.
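For example, a minimal sketch of a custom application-level metric (the metric name and tag are illustrative):

.. code-block:: python

import ray
from ray.util.metrics import Counter

ray.init()

@ray.remote
class RequestHandler:
    def __init__(self):
        # A user-defined counter, exported alongside Ray's built-in metrics.
        self.request_count = Counter(
            "request_count",
            description="Total number of requests handled.",
            tag_keys=("handler",),
        )
        self.request_count.set_default_tags({"handler": "default"})

    def handle(self):
        self.request_count.inc()

handler = RequestHandler.remote()
ray.get(handler.handle.remote())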

Exceptions
----------
Creating a new task or submitting an actor task generates an object reference. When ``ray.get`` is called on the object reference,
the API raises an exception if anything goes wrong with a related task, actor or object. For example,

- :class:`RayTaskError <ray.exceptions.RayTaskError>` is raised when there's an error from user code that throws an exception.
- :class:`RayActorError <ray.exceptions.RayActorError>` is raised when an actor is dead, due to either a system failure (e.g., node failure) or a user-level failure (e.g., an exception raised from the ``__init__`` method).
- :class:`RuntimeEnvSetupError <ray.exceptions.RuntimeEnvSetupError>` is raised when the actor or task couldn't be started because :ref:`a runtime environment <runtime-environments>` failed to be created.

See :ref:`Exceptions Reference <ray-core-exceptions>` for more details.
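For example, a minimal sketch of catching a task failure (the failing function is illustrative):

.. code-block:: python

import ray

ray.init()

@ray.remote
def faulty_task():
    raise ValueError("something went wrong")

try:
    ray.get(faulty_task.remote())
except ray.exceptions.RayTaskError as e:
    # The user code's ValueError arrives wrapped in a RayTaskError.
    print(f"Task failed: {e}")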

Debugger
--------
Ray has a built-in debugger for debugging your distributed applications.
It allows you to set breakpoints in your Ray tasks and actors, and when a breakpoint is hit, you can
drop into a PDB session that you can then use to:

- Inspect variables in that context
- Step within that task or actor
- Move up or down the stack

See :ref:`Ray Debugger <ray-debugger>` for more details.
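For example, a minimal sketch (run ``ray debug`` in a separate terminal to attach to the paused task):

.. code-block:: python

import ray

ray.init()

@ray.remote
def buggy_task(x):
    # Execution pauses here; attach with `ray debug` from another terminal.
    breakpoint()
    return x * 2

ray.get(buggy_task.remote(21))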

Monitoring Cluster State and Resource Demands
---------------------------------------------
You can monitor cluster usage and autoscaling status by running the CLI command ``ray status`` on the head node. It displays:

- **Cluster State**: Nodes that are up and running, addresses of running nodes, and information about pending and failed nodes.
- **Autoscaling Status**: The number of nodes being scaled up and down.
- **Cluster Usage**: The resource usage of the cluster, e.g., CPUs requested by all Ray tasks and actors and the number of GPUs in use.

Here's an example output.

.. code-block:: shell

$ ray status
======== Autoscaler status: 2021-10-12 13:10:21.035674 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.head.default
2 ray.worker.cpu
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
---------------------------------------------------------------
Usage:
0.0/10.0 CPU
0.00/70.437 GiB memory
0.00/10.306 GiB object_store_memory

Demands:
(no resource demands)

Profiling
---------
Ray is compatible with Python profiling tools such as ``cProfile``. It also provides built-in profiling tools such as :ref:`ray timeline <ray-timeline-doc>`.

See :ref:`Profiling <ray-core-profiling>` for more details.
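For example, a minimal sketch of profiling a driver-side workload with ``cProfile`` (the workload is illustrative):

.. code-block:: python

import cProfile

import ray

ray.init()

@ray.remote
def work():
    return sum(i * i for i in range(10_000))

def run_workload():
    # Submit ten tasks and block until all results are ready.
    ray.get([work.remote() for _ in range(10)])

cProfile.run("run_workload()")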

Tracing
-------
To help debug and monitor Ray applications, Ray supports distributed tracing (integration with OpenTelemetry) across tasks and actors.

See :ref:`Ray Tracing <ray-tracing>` for more details.
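For example, a sketch of starting a local cluster with the bundled tracing startup hook, which exports spans to a temporary directory (hook path as documented for Ray's tracing integration):

.. code-block:: bash

# Start a head node with tracing enabled via the built-in local hook.
ray start --head --tracing-startup-hook=ray.util.tracing.setup_local_tmp_tracing:setup_tracing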