[docs][observability] O11y refactor 2 #35279

Merged
merged 33 commits into from
May 22, 2023
aab97f1
moving log persistence content to clusters
angelinalg May 12, 2023
3e11e1e
move redirect stdout and stderr; add and rename some files
angelinalg May 12, 2023
414a314
fix notes and code blocks from rst to md syntax, fix heading hierarch…
angelinalg May 12, 2023
50f9291
rename overview content to key concepts
angelinalg May 12, 2023
59a0ed8
new page for getting started on observability with programmatic inter…
angelinalg May 16, 2023
c035142
Merge branch 'master' into o11y-refactor-2
angelinalg May 16, 2023
8ea2d44
move content to new page for config and manage dashboard
angelinalg May 16, 2023
529187f
populating configure dashboard doc and moving to clusters section
angelinalg May 16, 2023
f18b2b8
forgot to save changes
angelinalg May 16, 2023
cacd619
change title name on side nav
angelinalg May 16, 2023
794130f
change title name on side nav
angelinalg May 16, 2023
ce42aba
changing headers and titles to gerunds for consistency
angelinalg May 16, 2023
32618ab
remove overview, rename dashboard page to getting started
angelinalg May 16, 2023
1448b4d
fixed file location to move up a level
angelinalg May 16, 2023
f19deaf
fixed titles
angelinalg May 16, 2023
408bc0f
another iteration of titles to fit side nav better
angelinalg May 17, 2023
0af82b9
Merge branch 'master' into o11y-refactor-2
angelinalg May 17, 2023
e78de8d
create reference subdirectory; moved content for cli GS and metrics
angelinalg May 17, 2023
6827e25
add youtube links for overview and jobs videos
angelinalg May 17, 2023
0819a37
add five youtube links; fix anchors for api and cli refs
angelinalg May 17, 2023
d5f5558
add two youtube links
angelinalg May 17, 2023
399e9b8
moving some misplaced logging content; fixed extra blank spaces after…
angelinalg May 17, 2023
4d64a1e
removed unnecessary file hierarchy
angelinalg May 17, 2023
c2cc47d
list guides in index of user guides, fixed anchors
angelinalg May 17, 2023
c29d4c2
move gotchas content to troubleshooting apps
angelinalg May 18, 2023
7929f38
revert inadvertent deletion of section
angelinalg May 18, 2023
4ae150f
change title
angelinalg May 18, 2023
fc16269
fixed indentation
angelinalg May 18, 2023
bf9c611
Merge branch 'master' into o11y-refactor-2
angelinalg May 18, 2023
3122f7e
fixing links
angelinalg May 22, 2023
b32a6b8
fixing links
angelinalg May 22, 2023
181b0cf
fixing merge conflicts
angelinalg May 22, 2023
68019c1
fixing broken test and broken note
angelinalg May 22, 2023
fixing links
Signed-off-by: angelinalg <[email protected]>
angelinalg committed May 22, 2023
commit 3122f7ed47d4f28f5b88966faaca5f039266a936
17 changes: 9 additions & 8 deletions doc/source/_toc.yml
@@ -394,15 +394,16 @@ parts:
- file: ray-observability/user-guides/index
title: User Guides
sections:
- - file: ray-observability/user-guides/troubleshoot-apps/index
-   title: Troubleshooting Applications
+ - file: ray-observability/user-guides/debug-apps/index
+   title: Debugging Applications
  sections:
- - file: ray-observability/user-guides/troubleshoot-apps/general-troubleshoot
- - file: ray-observability/user-guides/troubleshoot-apps/troubleshoot-failures
- - file: ray-observability/user-guides/troubleshoot-apps/troubleshoot-hangs
- - file: ray-observability/user-guides/troubleshoot-apps/optimize-performance
- - file: ray-observability/user-guides/troubleshoot-apps/ray-debugging
- - file: ray-observability/user-guides/troubleshoot-apps/ray-core-profiling
+ - file: ray-observability/user-guides/debug-apps/general-troubleshoot
+ - file: ray-observability/user-guides/debug-apps/debug-memory
+ - file: ray-observability/user-guides/debug-apps/debug-hangs
+ - file: ray-observability/user-guides/debug-apps/debug-failures
+ - file: ray-observability/user-guides/debug-apps/optimize-performance
+ - file: ray-observability/user-guides/debug-apps/ray-debugging
+ - file: ray-observability/user-guides/debug-apps/ray-core-profiling
- file: ray-observability/user-guides/cli-sdk
- file: ray-observability/user-guides/configure-logging
- file: ray-observability/user-guides/add-app-metrics
scottsun94 marked this conversation as resolved.
6 changes: 3 additions & 3 deletions doc/source/cluster/configure-manage-dashboard.rst
@@ -8,7 +8,7 @@ Setting up the dashboard may require some configuration depending on your use mo
Port forwarding
---------------

- :ref:`The dashboard <ray-dashboard>` provides detailed information about the state of the cluster,
+ :ref:`The dashboard <observability-getting-started>` provides detailed information about the state of the cluster,
including the running jobs, actors, workers, nodes, etc.
By default, the :ref:`cluster launcher <vm-cluster-quick-start>` and :ref:`KubeRay operator <kuberay-quickstart>` will launch the dashboard, but will
not publicly expose the port.
@@ -248,7 +248,7 @@ Then go to the location of the binary and run grafana using the built in conf
./bin/grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web

Now, you can access grafana using the default grafana url, `http://localhost:3000`.
- You can then see the default dashboard by going to dashboards -> manage -> Ray -> Default Dashboard. The same :ref:`metric graphs <system-metrics>` are also accessible via :ref:`Ray Dashboard <ray-dashboard>`.
+ You can then see the default dashboard by going to dashboards -> manage -> Ray -> Default Dashboard. The same :ref:`metric graphs <system-metrics>` are also accessible via :ref:`Ray Dashboard <observability-getting-started>`.

.. tip::

@@ -288,7 +288,7 @@ For example, if Prometheus is hosted at port 9000 on a node with ip 55.66.77.88,
Alternate Grafana host location
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can choose to run Grafana on a non-default port or on a different machine. If you choose to do this, the
- :ref:`Dashboard <ray-dashboard>` needs to be configured with a public address to that service so the web page
+ :ref:`Dashboard <observability-getting-started>` needs to be configured with a public address to that service so the web page
can load the graphs. This can be done with the `RAY_GRAFANA_HOST` env var when launching ray. The env var takes
in the address to access Grafana. More info can be found :ref:`here <multi-node-metrics-grafana>`. Instructions
to use an existing Grafana instance can be found :ref:`here <multi-node-metrics-grafana-existing>`.
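
As a rough illustration of the `RAY_GRAFANA_HOST` setting described in this hunk, the sketch below builds an environment for a Ray head process with the variable set. The helper name `env_with_grafana_host` and the example address are hypothetical and not part of this PR; the launch call is left commented out so the snippet runs without Ray installed.

```python
import os


def env_with_grafana_host(grafana_url, base_env=None):
    """Return a copy of the environment with RAY_GRAFANA_HOST set."""
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_GRAFANA_HOST"] = grafana_url
    return env


# Hypothetical address: Grafana on a different machine and a non-default port.
env = env_with_grafana_host("http://55.66.77.88:3001", base_env={"PATH": "/usr/bin"})
print(env["RAY_GRAFANA_HOST"])

# To launch Ray with this environment (assumes Ray is installed):
# import subprocess
# subprocess.run(["ray", "start", "--head"], env=env, check=True)
```

Copying the environment rather than mutating `os.environ` keeps the setting scoped to the process being launched.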
@@ -3,7 +3,7 @@ Scraping and Persisting Metrics

Ray ships with the following observability features:

- 1. :ref:`The dashboard <ray-dashboard>`, for viewing cluster state.
+ 1. :ref:`The dashboard <observability-getting-started>`, for viewing cluster state.
2. CLI tools such as the :ref:`Ray state APIs <state-api-overview-ref>` and :ref:`ray status <monitor-cluster>`, for checking application and cluster status.
3. :ref:`Prometheus metrics <multi-node-metrics>` for internal and custom user-defined metrics.

Expand All @@ -16,7 +16,7 @@ The rest of this page will focus on how to access these services when running a
Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- :ref:`The dashboard <ray-dashboard>` provides detailed information about the state of the cluster,
+ :ref:`The dashboard <observability-getting-started>` provides detailed information about the state of the cluster,
including the running jobs, actors, workers, nodes, etc.
By default, the :ref:`cluster launcher <vm-cluster-quick-start>` and :ref:`KubeRay operator <kuberay-quickstart>` will launch the dashboard, but will
not publicly expose the port.
@@ -96,14 +96,14 @@ below.
Prometheus
^^^^^^^^^^
Ray supports Prometheus for emitting and recording time-series metrics.
- See :ref:`metrics <ray-metrics>` for more details of the metrics emitted.
+ See :ref:`metrics <dash-metrics-view>` for more details of the metrics emitted.
To use Prometheus in a Ray cluster, decide where to host it, then configure
it so that it can scrape the metrics from Ray.

Scraping metrics
################

- Ray runs a metrics agent per node to export :ref:`metrics <ray-metrics>` about Ray core as well as
+ Ray runs a metrics agent per node to export :ref:`metrics <dash-metrics-view>` about Ray core as well as
custom user-defined metrics. Each metrics agent collects metrics from the local
node and exposes these in a Prometheus format. You can then scrape each
endpoint to access Ray's metrics.
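
To make "exposes these in a Prometheus format" concrete, here is a minimal sketch of reading samples from the Prometheus text exposition format. It parses an inline string instead of a live endpoint; the metric name `example_tasks_total` is hypothetical (not a real Ray metric), and the parser assumes sample lines carry no trailing timestamp.

```python
def parse_prometheus_text(text):
    """Map each 'name{labels}' sample to its float value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no sample
        name_and_labels, value = line.rsplit(None, 1)
        samples[name_and_labels] = float(value)
    return samples


# Hypothetical payload, shaped like what a metrics agent endpoint might return.
SAMPLE = """\
# HELP example_tasks_total Hypothetical counter, not a real Ray metric name.
# TYPE example_tasks_total counter
example_tasks_total{node="a"} 3
example_tasks_total{node="b"} 5
"""

samples = parse_prometheus_text(SAMPLE)
print(sum(samples.values()))
```

In a real deployment Prometheus does this scraping itself; the sketch is only meant to show what the per-node endpoint output looks like to a consumer.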
218 changes: 0 additions & 218 deletions doc/source/cluster/vms/user-guides/logging.md

This file was deleted.

2 changes: 1 addition & 1 deletion doc/source/data/performance-tips.rst
@@ -7,7 +7,7 @@ Monitoring your application
~~~~~~~~~~~~~~~~~~~~~~~~~~~

View the Ray dashboard to monitor your application and troubleshoot issues. To learn
- more about the Ray dashboard, read :ref:`Ray Dashboard <ray-dashboard>`.
+ more about the Ray dashboard, read :ref:`Ray Dashboard <observability-getting-started>`.

Debugging Statistics
~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion doc/source/ray-air/getting-started.rst
@@ -216,4 +216,4 @@ Next Steps
- :ref:`air-examples-ref`
- :ref:`API reference <air-api-ref>`
- :ref:`Technical whitepaper <whitepaper>`
- - To check how your application is doing, you can use the :ref:`Ray dashboard<ray-dashboard>`.
+ - To check how your application is doing, you can use the :ref:`Ray dashboard<observability-getting-started>`.
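
Since this commit only retargets `:ref:` labels, a small check that every referenced label is a known one can catch mistyped targets before the docs build. The sketch below is an assumption-laden illustration, not tooling from this PR: the function names are hypothetical, and the set of known labels is hard-coded from labels visible in this diff rather than collected from the docs tree.

```python
import re

# Matches :ref:`label` and :ref:`link text <label>`.
REF_PATTERN = re.compile(r":ref:`(?:[^`<]*<([^`>]+)>|([^`<>]+))`")


def ref_targets(line):
    """Return every :ref: label referenced on one line of RST."""
    return [(m.group(1) or m.group(2)).strip() for m in REF_PATTERN.finditer(line)]


def unknown_targets(line, known_labels):
    """Return the labels on the line that are not in known_labels."""
    return [t for t in ref_targets(line) if t not in known_labels]


# Labels taken from this diff; in practice they would be collected from the docs tree.
KNOWN = {"observability-getting-started", "air-api-ref", "whitepaper"}

good = "use the :ref:`Ray dashboard <observability-getting-started>`."
bad = "use the :ref:`Ray dashboard<observability-geting-started>`."  # hypothetical typo

print(unknown_targets(good, KNOWN))
print(unknown_targets(bad, KNOWN))
```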
2 changes: 1 addition & 1 deletion doc/source/ray-core/scheduling/ray-oom-prevention.rst
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Out-Of-Memory Prevention
========================

- If application tasks or actors consume a large amount of heap space, it can cause the node to run out of memory (OOM). When that happens, the operating system will start killing worker or raylet processes, disrupting the application. OOM may also stall metrics and if this happens on the head node, it may stall the :ref:`dashboard <ray-dashboard>` or other control processes and cause the cluster to become unusable.
+ If application tasks or actors consume a large amount of heap space, it can cause the node to run out of memory (OOM). When that happens, the operating system will start killing worker or raylet processes, disrupting the application. OOM may also stall metrics and if this happens on the head node, it may stall the :ref:`dashboard <observability-getting-started>` or other control processes and cause the cluster to become unusable.

In this section we will go over:

2 changes: 1 addition & 1 deletion doc/source/ray-core/walkthrough.rst
@@ -60,7 +60,7 @@ As seen above, Ray stores task and actor call results in its :ref:`distributed o
Next Steps
----------

- .. tip:: To check how your application is doing, you can use the :ref:`Ray dashboard <ray-dashboard>`.
+ .. tip:: To check how your application is doing, you can use the :ref:`Ray dashboard <observability-getting-started>`.

Ray's key primitives are simple, but can be composed together to express almost any kind of distributed computation.
Learn more about Ray's :ref:`key concepts <core-key-concepts>` with the following user guides: