[docs][observability] O11y refactor 2 #35279

Merged
merged 33 commits into master from o11y-refactor-2 on May 22, 2023
Changes from 1 commit
Commits
33 commits
aab97f1
moving log persistence content to clusters
angelinalg May 12, 2023
3e11e1e
move redirect stdout and stderr; add and rename some files
angelinalg May 12, 2023
414a314
fix notes and code blocks from rst to md syntax, fix heading hierarch…
angelinalg May 12, 2023
50f9291
rename overview content to key concepts
angelinalg May 12, 2023
59a0ed8
new page for getting started on observability with programmatic inter…
angelinalg May 16, 2023
c035142
Merge branch 'master' into o11y-refactor-2
angelinalg May 16, 2023
8ea2d44
move content to new page for config and manage dashboard
angelinalg May 16, 2023
529187f
populating configure dashboard doc and moving to clusters section
angelinalg May 16, 2023
f18b2b8
forgot to save changes
angelinalg May 16, 2023
cacd619
change title name on side nav
angelinalg May 16, 2023
794130f
change title name on side nav
angelinalg May 16, 2023
ce42aba
changing headers and titles to gerunds for consistency
angelinalg May 16, 2023
32618ab
remove overview, rename dashboard page to getting started
angelinalg May 16, 2023
1448b4d
fixed file location to move up a level
angelinalg May 16, 2023
f19deaf
fixed titles
angelinalg May 16, 2023
408bc0f
another iteration of titles to fit side nav better
angelinalg May 17, 2023
0af82b9
Merge branch 'master' into o11y-refactor-2
angelinalg May 17, 2023
e78de8d
create reference subdirectory; moved content for cli GS and metrics
angelinalg May 17, 2023
6827e25
add youtube links for overview and jobs videos
angelinalg May 17, 2023
0819a37
add five youtube links; fix anchors for api and cli refs
angelinalg May 17, 2023
d5f5558
add two youtube links
angelinalg May 17, 2023
399e9b8
moving some misplaced logging content; fixed extra blank spaces after…
angelinalg May 17, 2023
4d64a1e
removed unnecessary file hierarchy
angelinalg May 17, 2023
c2cc47d
list guides in index of user guides, fixed anchors
angelinalg May 17, 2023
c29d4c2
move gotchas content to troubleshooting apps
angelinalg May 18, 2023
7929f38
revert inadvertent deletion of section
angelinalg May 18, 2023
4ae150f
change title
angelinalg May 18, 2023
fc16269
fixed indentation
angelinalg May 18, 2023
bf9c611
Merge branch 'master' into o11y-refactor-2
angelinalg May 18, 2023
3122f7e
fixing links
angelinalg May 22, 2023
b32a6b8
fixing links
angelinalg May 22, 2023
181b0cf
fixing merge conflicts
angelinalg May 22, 2023
68019c1
fixing broken test and broken note
angelinalg May 22, 2023
move content to new page for config and manage dashboard
Signed-off-by: angelinalg <[email protected]>
angelinalg committed May 16, 2023
commit 8ea2d4475e8a9b14e39f8830dccfa993b83483c8
@@ -13,44 +13,6 @@ The rest of this page will focus on how to access these services when running a

.. _monitor-cluster-via-dashboard:

Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:ref:`The dashboard <ray-dashboard>` provides detailed information about the state of the cluster,
including the running jobs, actors, workers, nodes, etc.
By default, the :ref:`cluster launcher <vm-cluster-quick-start>` and :ref:`KubeRay operator <kuberay-quickstart>` will launch the dashboard, but will
not publicly expose the port.

.. tab-set::

.. tab-item:: If using the VM cluster launcher

You can securely port-forward local traffic to the dashboard via the ``ray
dashboard`` command.

.. code-block:: shell

$ ray dashboard [-p <port, 8265 by default>] <cluster config file>

The dashboard will now be visible at ``http://localhost:8265``.

.. tab-item:: If using Kubernetes

The KubeRay operator makes the dashboard available via a Service targeting
the Ray head pod, named ``<RayCluster name>-head-svc``. You can access the
dashboard from within the Kubernetes cluster at ``http://<RayCluster name>-head-svc:8265``.

You can also view the dashboard from outside the Kubernetes cluster by
using port-forwarding:

.. code-block:: shell

$ kubectl port-forward service/raycluster-autoscaler-head-svc 8265:8265

For more information about configuring network access to a Ray cluster on
Kubernetes, see the :ref:`networking notes <kuberay-networking>`.


Using Ray Cluster CLI tools
^^^^^^^^^^^^^^^^^^^^^^^^^^^

119 changes: 0 additions & 119 deletions doc/source/ray-core/ray-dashboard.rst
@@ -446,125 +446,6 @@ To understand the log file structure of Ray, see the :ref:`Logging directory str

The logs view provides search functionality to help you find specific log messages.

Advanced Usage
--------------

Changing Dashboard Ports
~~~~~~~~~~~~~~~~~~~~~~~~

.. tab-set::

.. tab-item:: Single-node local cluster

**CLI**

To customize the port on which the dashboard runs, you can pass
the ``--dashboard-port`` argument with ``ray start`` in the command line.

**ray.init**

If you need to customize the port on which the dashboard will run, you can pass the
keyword argument ``dashboard_port`` in your call to ``ray.init()``.
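For example, a minimal sketch (the port number is illustrative):

.. code-block:: python

import ray

# Serve the dashboard on port 8266 instead of the default 8265.
ray.init(dashboard_port=8266)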

.. tab-item:: VM Cluster Launcher

To customize the dashboard port while using the VM cluster launcher, add the ``--dashboard-port`` argument with the desired port number
to the ``ray start --head`` command in the ``head_start_ray_commands`` section of the `cluster launcher's YAML file <https://github.com/ray-project/ray/blob/0574620d454952556fa1befc7694353d68c72049/python/ray/autoscaler/aws/example-full.yaml#L172>`_.

.. tab-item:: KubeRay

See the `Specifying non-default ports <https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#specifying-non-default-ports>`_ page.

Viewing Built-in Dashboard API Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The dashboard is powered by a server that serves both the UI code and the data about the cluster via API endpoints.
There are basic Prometheus metrics that are emitted for each of these API endpoints:

``ray_dashboard_api_requests_count_requests_total``: Collects the total count of requests. This is tagged by endpoint, method, and http_status.

``ray_dashboard_api_requests_duration_seconds_bucket``: Collects the duration of requests. This is tagged by endpoint and method.

For example, you can view the p95 duration of all requests with this query:

.. code-block:: text

histogram_quantile(0.95, sum(rate(ray_dashboard_api_requests_duration_seconds_bucket[5m])) by (le))

You can query these metrics via the Prometheus or Grafana UI. For instructions on setting up these tools, see :ref:`Ray Metrics <ray-metrics>`.


Running Behind a Reverse Proxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The dashboard should work out-of-the-box when accessed via a reverse proxy. API requests don't need to be proxied individually.

Always access the dashboard with a trailing ``/`` at the end of the URL.
For example, if your proxy is set up to handle requests to ``/ray/dashboard``, view the dashboard at ``www.my-website.com/ray/dashboard/``.

The dashboard now sends HTTP requests with relative URL paths. Browsers will handle these requests as expected when the ``window.location.href`` ends in a trailing ``/``.

This is a peculiarity of how many browsers handle requests with relative URLs, despite what `MDN <https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL#examples_of_relative_urls>`_
defines as the expected behavior.

Make your dashboard visible without a trailing ``/`` by including a rule in your reverse proxy that
redirects the user's browser to the trailing-slash path, e.g., ``/ray/dashboard`` --> ``/ray/dashboard/``.

Below is an example with a `traefik <https://doc.traefik.io/traefik/getting-started/quick-start/>`_ TOML file that accomplishes this:

.. code-block:: toml

[http]
[http.routers]
[http.routers.to-dashboard]
rule = "PathPrefix(`/ray/dashboard`)"
middlewares = ["test-redirectregex", "strip"]
service = "dashboard"
[http.middlewares]
[http.middlewares.test-redirectregex.redirectRegex]
regex = "^(.*)/ray/dashboard$"
replacement = "${1}/ray/dashboard/"
[http.middlewares.strip.stripPrefix]
prefixes = ["/ray/dashboard"]
[http.services]
[http.services.dashboard.loadBalancer]
[[http.services.dashboard.loadBalancer.servers]]
url = "http:https://localhost:8265"

Disabling the Dashboard
~~~~~~~~~~~~~~~~~~~~~~~
The dashboard is included in the ``ray[default]`` installation by default and is started automatically.

To disable the dashboard, use the ``--include-dashboard`` argument as shown below.

.. tab-set::

.. tab-item:: Single-node local cluster

**CLI**

.. code-block:: bash

ray start --include-dashboard=False

**ray.init**

.. testcode::
:hide:

import ray
ray.shutdown()

.. testcode::

ray.init(include_dashboard=False)

.. tab-item:: VM Cluster Launcher

To disable the dashboard while using the VM cluster launcher, include the ``ray start --head --include-dashboard=False`` argument
in the ``head_start_ray_commands`` section of the `cluster launcher's YAML file <https://github.com/ray-project/ray/blob/0574620d454952556fa1befc7694353d68c72049/python/ray/autoscaler/aws/example-full.yaml#L172>`_.

.. tab-item:: KubeRay

TODO

.. _dash-reference:

Page References
201 changes: 201 additions & 0 deletions doc/source/ray-observability/config-manage-dashboard.rst
@@ -0,0 +1,201 @@
.. _observability-key-concepts:

Key Concepts
============

This section covers key concepts for Ray's monitoring and debugging tools and features.

Dashboard (Web UI)
------------------
Ray provides a web-based dashboard to help users monitor the cluster. When a new cluster starts, the dashboard is available
at the default address ``localhost:8265`` (the port is automatically incremented if 8265 is already occupied).

See :ref:`Ray Dashboard <ray-dashboard>` for more details.
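For example, a minimal sketch of reading the dashboard address programmatically (assuming ``ray.init()`` returns a context object exposing ``dashboard_url``):

.. code-block:: python

import ray

# Start a local cluster; the dashboard serves on localhost:8265 by default.
context = ray.init()
print(f"Dashboard available at: {context.dashboard_url}")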

Application Logging
-------------------
By default, all stdout and stderr of tasks and actors are streamed to the Ray driver (the entrypoint script that calls ``ray.init``).

.. literalinclude:: doc_code/app_logging.py
:language: python

All stdout emitted from ``print`` calls is shown on the driver with a prefix of the form ``(task or actor repr, process ID, IP address)``.

.. code-block:: bash

(pid=45601) task
(Actor pid=480956) actor

See :ref:`Logging <ray-logging>` for more details.

Driver logs
~~~~~~~~~~~
The entry point of a Ray application, the script that calls ``ray.init()``, is called the driver.
Driver logs are handled in the same way as logs from normal Python programs.
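For example, a minimal sketch (standard Python logging on the driver; nothing Ray-specific is required):

.. code-block:: python

import logging

import ray

logging.basicConfig(level=logging.INFO)
ray.init()

# Driver logs behave like any ordinary Python program's logs.
logging.info("This message appears in the driver's output.")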

Job logs
~~~~~~~~
Logs for jobs submitted via the :ref:`Ray Jobs API <jobs-overview>` can be retrieved with the ``ray job logs`` :ref:`CLI command <ray-job-logs-doc>`, or with ``JobSubmissionClient.get_job_logs()`` or ``JobSubmissionClient.tail_job_logs()`` via the :ref:`Python SDK <ray-job-submission-sdk-ref>`.
The log file consists of the stdout of the job's entrypoint command. For the location of the log file on disk, see :ref:`Logging directory structure <logging-directory-structure>`.
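For example, a minimal sketch using the Python SDK (the address and job ID are illustrative):

.. code-block:: python

from ray.job_submission import JobSubmissionClient

# Connect to the cluster's dashboard address (illustrative).
client = JobSubmissionClient("http://127.0.0.1:8265")

# Fetch the full stdout of a previously submitted job (job ID is hypothetical).
print(client.get_job_logs("raysubmit_tUAuCKubPAEXh6CW"))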

.. _ray-worker-logs:

Worker logs
~~~~~~~~~~~
Ray's tasks or actors are executed remotely within Ray's worker processes. Ray has special support to improve the visibility of logs produced by workers.

- By default, stdout and stderr from all tasks and actors are redirected to the worker log files. Check out :ref:`Logging directory structure <logging-directory-structure>` to learn how Ray's logging directory is structured.
- By default, stdout and stderr redirected to the worker log files are also published to the driver. The driver displays logs generated by its tasks and actors in its own stdout and stderr.

Let's look at a code example to see how this works.

.. code-block:: python

import ray

# Initiate a driver.
ray.init()

@ray.remote
def task():
    print("task")

ray.get(task.remote())

You should see the string ``task`` in the driver's stdout.

When logs are printed, the process ID (pid) and the IP address of the node that executes the task or actor are printed with them. Check out the output below.

.. code-block:: bash

(pid=45601) task

Actor log messages look like the following by default.

.. code-block:: bash

(MyActor pid=480956) actor log message

Accessing Ray States
--------------------
Starting from Ray 2.0, Ray supports CLI and Python APIs for querying the state of resources (e.g., actors, tasks, objects).

For example, the following command summarizes the task states of the cluster.

.. code-block:: bash

ray summary tasks

.. code-block:: text

======== Tasks Summary: 2022-07-22 08:54:38.332537 ========
Stats:
------------------------------------
total_actor_scheduled: 2
total_actor_tasks: 0
total_tasks: 2


Table (group by func_name):
------------------------------------
FUNC_OR_CLASS_NAME STATE_COUNTS TYPE
0 task_running_300_seconds RUNNING: 2 NORMAL_TASK
1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK

The following command lists all actors in the cluster.

.. code-block:: bash

ray list actors

.. code-block:: text

======== List: 2022-07-23 21:29:39.323925 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
ACTOR_ID CLASS_NAME NAME PID STATE
0 31405554844820381c2f0f8501000000 Actor 96956 ALIVE
1 f36758a9f8871a9ca993b1d201000000 Actor 96955 ALIVE

See :ref:`Ray State API <state-api-overview-ref>` for more details.
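A minimal sketch of the equivalent Python calls (assuming Ray 2.4+, where the Python state API lives under ``ray.util.state``):

.. code-block:: python

import ray
from ray.util.state import list_actors, summarize_tasks

ray.init()

# Programmatic equivalents of `ray summary tasks` and `ray list actors`.
print(summarize_tasks())
print(list_actors())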

Metrics
-------
Ray collects and exposes physical stats (e.g., CPU, memory, GPU memory (GRAM), disk, and network usage of each node),
internal stats (e.g., the number of actors in the cluster, the number of worker failures in the cluster),
and custom metrics (e.g., metrics defined by users). All stats can be exported as time series data (to Prometheus by default) and used
to monitor the cluster over time.

See :ref:`Ray Metrics <ray-metrics>` for more details.
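For example, a minimal sketch of a custom application-level metric (the metric name and tag are illustrative):

.. code-block:: python

import ray
from ray.util.metrics import Counter

ray.init()

@ray.remote
class RequestHandler:
    def __init__(self):
        # A user-defined counter, exported alongside Ray's built-in metrics.
        self.request_count = Counter(
            "request_count",
            description="Total number of requests handled.",
            tag_keys=("handler",),
        )
        self.request_count.set_default_tags({"handler": "default"})

    def handle(self):
        self.request_count.inc()

handler = RequestHandler.remote()
ray.get(handler.handle.remote())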

Exceptions
----------
Creating a new task or submitting an actor task generates an object reference. When ``ray.get`` is called on the object reference,
the API raises an exception if anything goes wrong with a related task, actor or object. For example,

- :class:`RayTaskError <ray.exceptions.RayTaskError>` is raised when there's an error from user code that throws an exception.
- :class:`RayActorError <ray.exceptions.RayActorError>` is raised when an actor is dead, due to either a system failure (e.g., node failure) or a user-level failure (e.g., an exception raised from the ``__init__`` method).
- :class:`RuntimeEnvSetupError <ray.exceptions.RuntimeEnvSetupError>` is raised when the actor or task couldn't be started because :ref:`a runtime environment <runtime-environments>` failed to be created.

See :ref:`Exceptions Reference <ray-core-exceptions>` for more details.
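For example, a minimal sketch of catching a task failure (the failing function is illustrative):

.. code-block:: python

import ray

ray.init()

@ray.remote
def faulty_task():
    raise ValueError("something went wrong")

try:
    ray.get(faulty_task.remote())
except ray.exceptions.RayTaskError as e:
    # The user code's ValueError arrives wrapped in a RayTaskError.
    print(f"Task failed: {e}")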

Debugger
--------
Ray has a built-in debugger for debugging your distributed applications.
It allows you to set breakpoints in your Ray tasks and actors, and when a breakpoint is hit, you can
drop into a PDB session that you can then use to:

- Inspect variables in that context
- Step within that task or actor
- Move up or down the stack

See :ref:`Ray Debugger <ray-debugger>` for more details.
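For example, a minimal sketch (run ``ray debug`` in a separate terminal to attach to the paused task):

.. code-block:: python

import ray

ray.init()

@ray.remote
def buggy_task(x):
    # Execution pauses here; attach with `ray debug` from another terminal.
    breakpoint()
    return x * 2

ray.get(buggy_task.remote(21))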

Monitoring Cluster State and Resource Demands
---------------------------------------------
You can monitor cluster usage and autoscaling status by running the CLI command ``ray status`` on the head node. It displays:

- **Cluster State**: Nodes that are up and running, addresses of running nodes, and information about pending and failed nodes.
- **Autoscaling Status**: The number of nodes being scaled up and down.
- **Cluster Usage**: The resource usage of the cluster, e.g., CPUs requested by all Ray tasks and actors and the number of GPUs in use.

Here's an example output.

.. code-block:: shell

$ ray status
======== Autoscaler status: 2021-10-12 13:10:21.035674 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray.head.default
2 ray.worker.cpu
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
---------------------------------------------------------------
Usage:
0.0/10.0 CPU
0.00/70.437 GiB memory
0.00/10.306 GiB object_store_memory

Demands:
(no resource demands)

Profiling
---------
Ray is compatible with Python profiling tools such as ``cProfile``. It also provides built-in profiling tools such as :ref:`ray timeline <ray-timeline-doc>`.

See :ref:`Profiling <ray-core-profiling>` for more details.
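For example, a minimal sketch of profiling a driver-side workload with ``cProfile`` (the workload is illustrative):

.. code-block:: python

import cProfile

import ray

ray.init()

@ray.remote
def work():
    return sum(i * i for i in range(10_000))

def run_workload():
    # Submit ten tasks and block until all results are ready.
    ray.get([work.remote() for _ in range(10)])

cProfile.run("run_workload()")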

Tracing
-------
To help debug and monitor Ray applications, Ray supports distributed tracing (integration with OpenTelemetry) across tasks and actors.

See :ref:`Ray Tracing <ray-tracing>` for more details.
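For example, a sketch of starting a local cluster with the bundled tracing startup hook, which exports spans to a temporary directory (hook path as documented for Ray's tracing integration):

.. code-block:: bash

# Start a head node with tracing enabled via the built-in local hook.
ray start --head --tracing-startup-hook=ray.util.tracing.setup_local_tmp_tracing:setup_tracing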