diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml index 680706bad10a5..006a5ade39d90 100644 --- a/doc/source/_toc.yml +++ b/doc/source/_toc.yml @@ -370,6 +370,7 @@ parts: - file: cluster/vms/user-guides/launching-clusters/index - file: cluster/vms/user-guides/large-cluster-best-practices - file: cluster/vms/user-guides/configuring-autoscaling + - file: cluster/vms/user-guides/logging - file: cluster/vms/user-guides/community/index title: Community-supported Cluster Managers sections: @@ -381,26 +382,40 @@ parts: sections: - file: cluster/vms/examples/ml-example - file: cluster/vms/references/index + - file: cluster/configure-manage-dashboard - file: cluster/running-applications/index title: Applications Guide - file: cluster/faq - file: cluster/package-overview - - file: ray-observability/monitoring-debugging/monitoring-debugging + - file: ray-observability/index title: "Monitoring and Debugging" sections: + - file: ray-observability/getting-started + - file: ray-observability/key-concepts - file: ray-observability/user-guides/index title: User Guides sections: - - file: ray-observability/user-guides/troubleshoot-apps/index - title: Troubleshooting Applications + - file: ray-observability/user-guides/debug-apps/index + title: Debugging Applications sections: - - file: ray-observability/user-guides/troubleshoot-apps/troubleshoot-failures - - file: ray-observability/user-guides/troubleshoot-apps/troubleshoot-hangs - - file: ray-observability/user-guides/troubleshoot-apps/optimize-performance - - file: ray-observability/user-guides/troubleshoot-apps/ray-debugging - - file: ray-observability/user-guides/troubleshoot-apps/ray-core-profiling + - file: ray-observability/user-guides/debug-apps/general-troubleshoot + - file: ray-observability/user-guides/debug-apps/debug-memory + - file: ray-observability/user-guides/debug-apps/debug-hangs + - file: ray-observability/user-guides/debug-apps/debug-failures + - file: ray-observability/user-guides/debug-apps/optimize-performance + - file: ray-observability/user-guides/debug-apps/ray-debugging + - file: ray-observability/user-guides/debug-apps/ray-core-profiling + - file: ray-observability/user-guides/cli-sdk + - file: ray-observability/user-guides/configure-logging + - file: ray-observability/user-guides/add-app-metrics - file: ray-observability/user-guides/ray-tracing + - file: ray-observability/reference/index + title: Reference + sections: + - file: ray-observability/reference/api + - file: ray-observability/reference/cli + - file: ray-observability/reference/system-metrics - file: ray-references/api title: References diff --git a/doc/source/cluster/configure-manage-dashboard.rst b/doc/source/cluster/configure-manage-dashboard.rst new file mode 100644 index 0000000000000..100134271634e --- /dev/null +++ b/doc/source/cluster/configure-manage-dashboard.rst @@ -0,0 +1,346 @@ +.. _observability-configure-manage-dashboard: + +Configuring and Managing the Dashboard +====================================== + +Setting up the dashboard may require some configuration depending on your use model and cluster environment. Integrations with Prometheus and Grafana are optional for extending visualization capabilities. + +Port forwarding +--------------- + +:ref:`The dashboard ` provides detailed information about the state of the cluster, +including the running jobs, actors, workers, nodes, etc. +By default, the :ref:`cluster launcher ` and :ref:`KubeRay operator ` will launch the dashboard, but will +not publicly expose the port. + +.. tab-set:: + + .. 
tab-item:: VM
+
+        You can securely port-forward local traffic to the dashboard via the ``ray
+        dashboard`` command.
+
+        .. code-block:: shell
+
+            $ ray dashboard [-p <port>] <cluster config file>
+
+        The dashboard is now visible at ``http://localhost:8265``.
+
+    .. tab-item:: Kubernetes
+
+        The KubeRay operator makes the dashboard available via a Service targeting
+        the Ray head pod, named ``<RayCluster name>-head-svc``. You can access the
+        dashboard from within the Kubernetes cluster at ``http://<RayCluster name>-head-svc:8265``.
+
+        You can also view the dashboard from outside the Kubernetes cluster by
+        using port-forwarding:
+
+        .. code-block:: shell
+
+            $ kubectl port-forward service/raycluster-autoscaler-head-svc 8265:8265
+
+        For more information about configuring network access to a Ray cluster on
+        Kubernetes, see the :ref:`networking notes `.
+
+Changing Dashboard Ports
+------------------------
+
+.. tab-set::
+
+    .. tab-item:: Single-node local cluster
+
+        **CLI**
+
+        To customize the port on which the dashboard runs, you can pass
+        the ``--dashboard-port`` argument with ``ray start`` in the command line.
+
+        **ray.init**
+
+        If you need to customize the port on which the dashboard runs, you can pass the
+        keyword argument ``dashboard_port`` in your call to ``ray.init()``.
+
+    .. tab-item:: VM Cluster Launcher
+
+        To customize the dashboard port while using the VM cluster launcher, include the ``--dashboard-port`` argument
+        with the desired port number in the ``ray start --head`` command in the "head_start_ray_commands" section of the `cluster launcher's YAML file `_.
+
+    .. tab-item:: KubeRay
+
+        See the `Specifying non-default ports `_ page.
+
+
+Running Behind a Reverse Proxy
+------------------------------
+
+The dashboard should work out-of-the-box when accessed via a reverse proxy. API requests don't need to be proxied individually.
+
+Always access the dashboard with a trailing ``/`` at the end of the URL.
+For example, if your proxy is set up to handle requests to ``/ray/dashboard``, view the dashboard at ``www.my-website.com/ray/dashboard/``.
+
+The dashboard now sends HTTP requests with relative URL paths. Browsers will handle these requests as expected when the ``window.location.href`` ends in a trailing ``/``.
+
+This is a peculiarity of how many browsers handle requests with relative URLs, despite what `MDN `_
+defines as the expected behavior.
+
+Make your dashboard visible without a trailing ``/`` by including a rule in your reverse proxy that
+redirects the user's browser to ``/``, i.e. ``/ray/dashboard`` --> ``/ray/dashboard/``.
+
+Below is an example with a `traefik `_ TOML file that accomplishes this:
+
+.. code-block:: toml
+
+    [http]
+      [http.routers]
+        [http.routers.to-dashboard]
+          rule = "PathPrefix(`/ray/dashboard`)"
+          middlewares = ["test-redirectregex", "strip"]
+          service = "dashboard"
+      [http.middlewares]
+        [http.middlewares.test-redirectregex.redirectRegex]
+          regex = "^(.*)/ray/dashboard$"
+          replacement = "${1}/ray/dashboard/"
+        [http.middlewares.strip.stripPrefix]
+          prefixes = ["/ray/dashboard"]
+      [http.services]
+        [http.services.dashboard.loadBalancer]
+          [[http.services.dashboard.loadBalancer.servers]]
+            url = "http://localhost:8265"
+
+Viewing Built-in Dashboard API Metrics
+--------------------------------------
+
+The dashboard is powered by a server that serves both the UI code and the data about the cluster via API endpoints.
+There are basic Prometheus metrics that are emitted for each of these API endpoints:
+
+`ray_dashboard_api_requests_count_requests_total`: Collects the total count of requests. This is tagged by endpoint, method, and http_status.
+
+`ray_dashboard_api_requests_duration_seconds_bucket`: Collects the duration of requests. This is tagged by endpoint and method.
+
+For example, you can view the p95 duration of all requests with this query:
+
+.. code-block:: text
+
+    histogram_quantile(0.95, sum(rate(ray_dashboard_api_requests_duration_seconds_bucket[5m])) by (le))
+
+These metrics can be queried via Prometheus or Grafana UI. Instructions on how to set these tools up can be found :ref:`here `.
+
+Disabling the Dashboard
+-----------------------
+
+The dashboard is included in the `ray[default]` installation by default and started automatically.
+
+To disable the dashboard, use the `--include-dashboard` argument as shown below.
+
+.. tab-set::
+
+    .. tab-item:: Single-node local cluster
+
+        **CLI**
+
+        .. code-block:: bash
+
+            ray start --include-dashboard=False
+
+        **ray.init**
+
+        .. testcode::
+            :hide:
+
+            import ray
+            ray.shutdown()
+
+        .. testcode::
+
+            import ray
+            ray.init(include_dashboard=False)
+
+    .. tab-item:: VM Cluster Launcher
+
+        To disable the dashboard while using the "VM cluster launcher", include the "ray start --head --include-dashboard=False" argument
+        in the "head_start_ray_commands" section of the `cluster launcher's YAML file `_.
+
+    .. tab-item:: KubeRay
+
+        TODO
+
+.. _observability-visualization-setup:
+
+Integrating with Prometheus and Grafana
+---------------------------------------
+
+Setting up Prometheus
+~~~~~~~~~~~~~~~~~~~~~
+
+.. tip::
+
+    The instructions below set up Prometheus to enable a basic workflow of running and accessing the dashboard on your local machine.
+    For more information about how to run Prometheus on a remote cluster, see :ref:`here `.
+
+Ray exposes its metrics in Prometheus format. This allows us to easily scrape them using Prometheus.
+
+First, `download Prometheus `_. Make sure to download the correct binary for your operating system (for example, darwin for macOS).
+
+Then, unzip the archive into a local directory using the following command.
+
+.. code-block:: bash
+
+    tar xvfz prometheus-*.tar.gz
+    cd prometheus-*
+
+Ray exports metrics only when ``ray[default]`` is installed.
+
+.. code-block:: bash
+
+    pip install "ray[default]"
+
+Ray provides a Prometheus config that works out of the box. After running Ray, you can find it at `/tmp/ray/session_latest/metrics/prometheus/prometheus.yml`.
+
+.. code-block:: yaml
+
+    global:
+      scrape_interval: 15s
+      evaluation_interval: 15s
+
+    scrape_configs:
+    # Scrape from each ray node as defined in the service_discovery.json provided by ray.
+    - job_name: 'ray'
+      file_sd_configs:
+      - files:
+        - '/tmp/ray/prom_metrics_service_discovery.json'
+
+
+Next, let's start Prometheus.
+
+.. code-block:: shell
+
+    ./prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml
+
+.. note::
+    If you are using macOS, you may receive an error at this point about trying to launch an application where the developer has not been verified. See :ref:`this link ` to fix the issue.
+
+Now, you can access Ray metrics from the default Prometheus URL, `http://localhost:9090`.
+
+See :ref:`here ` for more information on how to set up Prometheus on a Ray Cluster.
+
+.. _grafana:
+
+Setting up Grafana
+~~~~~~~~~~~~~~~~~~
+
+.. tip::
+
+    The instructions below set up Grafana to enable a basic workflow of running and accessing the dashboard on your local machine.
+    For more information about how to run Grafana on a remote cluster, see :ref:`here `.
+
+Grafana is a tool that supports more advanced visualizations of Prometheus metrics and
+allows you to create custom dashboards with your favorite metrics. Ray exports some default
+configurations, which include a default dashboard showing some of the most valuable metrics
+for debugging Ray applications.
+
+
+Deploying Grafana
+*****************
+
+First, `download Grafana `_. Follow the instructions on the download page to download the right binary for your operating system.
+
+Then go to the location of the binary and run Grafana using the built-in configuration found in the `/tmp/ray/session_latest/metrics/grafana` folder.
+
+.. code-block:: shell
+
+    ./bin/grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web
+
+Now, you can access Grafana using the default Grafana URL, `http://localhost:3000`.
+You can then see the default dashboard by going to dashboards -> manage -> Ray -> Default Dashboard. The same :ref:`metric graphs ` are also accessible via :ref:`Ray Dashboard `.
+
+.. tip::
+
+    If this is your first time using Grafana, you can log in with the username `admin` and password `admin`.
+
+.. image:: images/graphs.png
+    :align: center
+
+
+See :ref:`here ` for more information on how to set up Grafana on a Ray Cluster.
+
+Customizing the Prometheus export port
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, Ray provides a service discovery file, but you can also scrape metrics directly from the Prometheus ports.
+To do that, you may want to set the port that metrics are exposed on to a pre-defined port.
+
+.. code-block:: bash
+
+    ray start --head --metrics-export-port=8080 # Assign metrics export port on a head node.
+
+Now, you can scrape Ray's metrics using Prometheus via ``<node IP>:8080``.
+
+Alternate Prometheus host location
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can choose to run Prometheus on a non-default port or on a different machine. When doing so, you should
+make sure that Prometheus can scrape the metrics from your Ray nodes by following the instructions :ref:`here `.
+
+In addition, both Ray and Grafana need to know how to access this Prometheus instance. This can be configured
+by setting the `RAY_PROMETHEUS_HOST` env var when launching Ray. The env var takes in the address to access Prometheus. More
+info can be found :ref:`here `. By default, we assume Prometheus is hosted at `localhost:9090`.
+
+For example, if Prometheus is hosted at port 9000 on a node with IP 55.66.77.88, set the value to
+`RAY_PROMETHEUS_HOST=http://55.66.77.88:9000`.
+
+
+Alternate Grafana host location
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+You can choose to run Grafana on a non-default port or on a different machine. If you choose to do this, the
+:ref:`Dashboard ` needs to be configured with a public address to that service so the web page
+can load the graphs. This can be done with the `RAY_GRAFANA_HOST` env var when launching Ray. The env var takes
+in the address to access Grafana. More info can be found :ref:`here `. Instructions
+to use an existing Grafana instance can be found :ref:`here `.
+
+For the Grafana charts to work on the Ray dashboard, the user of the dashboard's browser must be able to reach
+the Grafana service. If this browser cannot reach Grafana the same way the Ray head node can, you can use a separate
+env var `RAY_GRAFANA_IFRAME_HOST` to customize the host the browser uses to attempt to reach Grafana. If this is not set,
+we use the value of `RAY_GRAFANA_HOST` by default.
+
+For example, if Grafana is hosted at 55.66.77.88 on port 3000, set the value to
+`RAY_GRAFANA_HOST=http://55.66.77.88:3000`.
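+
+As a minimal sketch of putting these settings together (the addresses are the illustrative ones above, and the iframe host is a hypothetical example), you could export the variables before launching the head node:
+
+.. code-block:: bash
+
+    # Illustrative endpoints; replace with your own Prometheus and Grafana addresses.
+    export RAY_PROMETHEUS_HOST=http://55.66.77.88:9000
+    export RAY_GRAFANA_HOST=http://55.66.77.88:3000
+    # Only needed if the browser reaches Grafana at a different address than the head node does.
+    # export RAY_GRAFANA_IFRAME_HOST=http://grafana.example.com:3000
+    ray start --head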
+
+
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+Getting Prometheus and Grafana to use the Ray configurations when installed via Homebrew on macOS
+***************************************************************************************************
+
+With Homebrew, Prometheus and Grafana are installed as services that are automatically launched for you.
+Therefore, to configure these services, you cannot simply pass in the config files as command line arguments.
+
+Instead, follow these instructions:
+
+1. Change the --config-file line in `/usr/local/etc/prometheus.args` to read `--config.file /tmp/ray/session_latest/metrics/prometheus/prometheus.yml`.
+2. Update the `/usr/local/etc/grafana/grafana.ini` file so that it matches the contents of `/tmp/ray/session_latest/metrics/grafana/grafana.ini`.
+
+You can then start or restart the services with `brew services start grafana` and `brew services start prometheus`.
+
+.. _unverified-developer:
+
+macOS does not trust the developer to install Prometheus or Grafana
+*******************************************************************
+
+You may have received an error that looks like this:
+
+.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/troubleshooting/prometheus-trusted-developer.png
+    :align: center
+
+When downloading binaries from the internet, macOS requires that the binary be signed by a trusted developer ID.
+Unfortunately, many developers today are not trusted by macOS, so this requirement must be overridden by the user manually.
+
+See `these instructions `_ on how to override the restriction and install or run the application.
+
+Grafana dashboards are not embedded in the Ray dashboard
+********************************************************
+If you're getting an error that says `RAY_GRAFANA_HOST` is not set up despite having set it up, check that:
+
+* You've included the protocol in the URL (e.g., `http://your-grafana-url.com` instead of `your-grafana-url.com`).
+* The URL doesn't have a trailing slash (e.g., `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`).
+
+Certificate Authority (CA error)
+********************************
+You may see a CA error if your Grafana instance is hosted behind HTTPS. Contact the Grafana service owner to properly enable HTTPS traffic.
+
diff --git a/doc/source/cluster/images/graphs.png b/doc/source/cluster/images/graphs.png
new file mode 100644
index 0000000000000..2cd41f5b9b278
Binary files /dev/null and b/doc/source/cluster/images/graphs.png differ
diff --git a/doc/source/cluster/kubernetes/user-guides/logging.md b/doc/source/cluster/kubernetes/user-guides/logging.md
index a02dbdc42dbe2..7a349e1ee2693 100644
--- a/doc/source/cluster/kubernetes/user-guides/logging.md
+++ b/doc/source/cluster/kubernetes/user-guides/logging.md
@@ -1,6 +1,6 @@
 (kuberay-logging)=
 
-# Logging
+# Log Persistence
 
 This page provides tips on how to collect logs from Ray clusters running on Kubernetes.
 
@@ -42,14 +42,14 @@ nodes.
 With this strategy, it is key to mount
 the Ray container's `/tmp/ray` directory to the relevant `hostPath`.
 
 (kuberay-fluentbit)=
-# Setting up logging sidecars with Fluent Bit.
+## Setting up logging sidecars with Fluent Bit
 
 In this section, we give an example of how to set up log-emitting
 [Fluent Bit][FluentBit] sidecars for Ray pods.
 
 See the full config for a single-pod RayCluster with a logging sidecar [here][ConfigLink].
 We now discuss this configuration and show how to deploy it.
-## Configure log processing +### Configuring log processing The first step is to create a ConfigMap with configuration for Fluent Bit. @@ -73,9 +73,9 @@ A few notes on the above config: in the Fluent Bit container's stdout sooner. -## Add logging sidecars to your RayCluster CR. +### Adding logging sidecars to your RayCluster CR -### Add log and config volumes. +#### Adding log and config volumes For each pod template in our RayCluster CR, we need to add two volumes: One volume for Ray's logs and another volume to store Fluent Bit configuration from the ConfigMap @@ -85,7 +85,7 @@ applied above. :start-after: Log and config volumes ``` -### Mount the Ray log directory +#### Mounting the Ray log directory Add the following volume mount to the Ray container's configuration. ```{literalinclude} ../configs/ray-cluster.log.yaml :language: yaml @@ -93,7 +93,7 @@ Add the following volume mount to the Ray container's configuration. :end-before: Fluent Bit sidecar ``` -### Add the Fluent Bit sidecar +#### Adding the Fluent Bit sidecar Finally, add the Fluent Bit sidecar container to each Ray pod config in your RayCluster CR. ```{literalinclude} ../configs/ray-cluster.log.yaml @@ -104,14 +104,14 @@ in your RayCluster CR. Mounting the `ray-logs` volume gives the sidecar container access to Ray's logs. The `fluentbit-config` volume gives the sidecar access to logging configuration. -### Putting everything together +#### Putting everything together Putting all of the above elements together, we have the following yaml configuration for a single-pod RayCluster will a log-processing sidecar. ```{literalinclude} ../configs/ray-cluster.log.yaml :language: yaml ``` -## Deploying a RayCluster with logging CR. +### Deploying a RayCluster with logging CR (kuberay-logging-tldr)= Now, we will see how to deploy the configuration described above. @@ -143,3 +143,124 @@ kubectl logs raycluster-complete-logs-head-xxxxx -c fluentbit [Promtail]: https://grafana.com/docs/loki/latest/clients/promtail/ [KubDoc]: https://kubernetes.io/docs/concepts/cluster-administration/logging/ [ConfigLink]: https://raw.githubusercontent.com/ray-project/ray/releases/2.4.0/doc/source/cluster/kubernetes/configs/ray-cluster.log.yaml + +## Customizing Worker Loggers + +When using Ray, all tasks and actors are executed remotely in Ray's worker processes. + +:::{note} +To stream logs to a driver, they should be flushed to stdout and stderr. +::: + +```python +import ray +import logging +# Initiate a driver. +ray.init() + +@ray.remote +class Actor: + def __init__(self): + # Basic config automatically configures logs to + # be streamed to stdout and stderr. + # Set the severity to INFO so that info logs are printed to stdout. + logging.basicConfig(level=logging.INFO) + + def log(self, msg): + logger = logging.getLogger(__name__) + logger.info(msg) + +actor = Actor.remote() +ray.get(actor.log.remote("A log message for an actor.")) + +@ray.remote +def f(msg): + logging.basicConfig(level=logging.INFO) + logger = logging.getLogger(__name__) + logger.info(msg) + +ray.get(f.remote("A log message for a task.")) +``` + +```bash +(Actor pid=179641) INFO:__main__:A log message for an actor. +(f pid=177572) INFO:__main__:A log message for a task. +``` +## Using structured logging + +The metadata of tasks or actors may be obtained by Ray's :ref:`runtime_context APIs `. +Runtime context APIs help you to add metadata to your logging messages, making your logs more structured. + +```python +import ray +# Initiate a driver. 
+ray.init()
+
+@ray.remote
+def task():
+    print(f"task_id: {ray.get_runtime_context().task_id}")
+
+ray.get(task.remote())
+```
+
+```bash
+(pid=47411) task_id: TaskID(a67dc375e60ddd1affffffffffffffffffffffff01000000)
+```
+## Redirecting Ray logs to stderr
+
+By default, Ray logs are written to files under the ``/tmp/ray/session_*/logs`` directory. If you wish to redirect all internal Ray logging and your own logging within tasks/actors to stderr of the host nodes, you can do so by ensuring that the ``RAY_LOG_TO_STDERR=1`` environment variable is set on the driver and on all Ray nodes. This practice is not recommended but may be useful if you are using a log aggregator that needs log records to be written to stderr in order for them to be captured.
+
+Redirecting logging to stderr will also cause a ``({component})`` prefix, e.g. ``(raylet)``, to be added to each of the log record messages.
+
+```bash
+[2022-01-24 19:42:02,978 I 1829336 1829336] (gcs_server) grpc_server.cc:103: GcsServer server started, listening on port 50009.
+[2022-01-24 19:42:06,696 I 1829415 1829415] (raylet) grpc_server.cc:103: ObjectManager server started, listening on port 40545.
+2022-01-24 19:42:05,087 INFO (dashboard) dashboard.py:95 -- Setup static dir for dashboard: /mnt/data/workspace/ray/python/ray/dashboard/client/build
+2022-01-24 19:42:07,500 INFO (dashboard_agent) agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:49228
+```
+
+This should make it easier to filter the stderr stream of logs down to the component of interest. Note that multi-line log records will **not** have this component marker at the beginning of each line.
+
+When running a local Ray cluster, this environment variable should be set before starting the local cluster:
+
+```python
+import os
+import ray
+
+os.environ["RAY_LOG_TO_STDERR"] = "1"
+ray.init()
+```
+
+When starting a local cluster via the CLI or when starting nodes in a multi-node Ray cluster, this environment variable should be set before starting up each node:
+
+```bash
+env RAY_LOG_TO_STDERR=1 ray start
+```
+
+If using the Ray cluster launcher, you would specify this environment variable in the Ray start commands:
+
+```yaml
+head_start_ray_commands:
+    - ray stop
+    - env RAY_LOG_TO_STDERR=1 ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
+
+worker_start_ray_commands:
+    - ray stop
+    - env RAY_LOG_TO_STDERR=1 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
+```
+
+When connecting to the cluster, be sure to set the environment variable before connecting:
+
+```python
+import os
+import ray
+
+os.environ["RAY_LOG_TO_STDERR"] = "1"
+ray.init(address="auto")
+```
+
+## Rotating logs
+
+Ray supports log rotation of log files. Note that not all components currently support log rotation (raylet and Python/Java worker logs are not rotated).
+
+By default, logs are rotated when they reach 512MB (maxBytes), and there can be up to 5 backup files (backupCount). Indexes are appended to all backup files (e.g., `raylet.out.1`).
+If you'd like to change the log rotation configuration, you can do so by specifying environment variables. For example,
+
+```bash
+RAY_ROTATION_MAX_BYTES=1024 ray start --head # Start a ray instance with maxBytes 1KB.
+RAY_ROTATION_BACKUP_COUNT=1 ray start --head # Start a ray instance with backupCount 1.
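+# The two settings can also be combined in a single command; the values below are only illustrative.
+RAY_ROTATION_MAX_BYTES=1073741824 RAY_ROTATION_BACKUP_COUNT=3 ray start --head # Rotate at 1GB and keep 3 backups.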
+``` \ No newline at end of file diff --git a/doc/source/cluster/running-applications/monitoring-and-observability.rst b/doc/source/cluster/running-applications/monitoring-and-observability.rst index c9dd3e39ee9a9..80a556d5586a4 100644 --- a/doc/source/cluster/running-applications/monitoring-and-observability.rst +++ b/doc/source/cluster/running-applications/monitoring-and-observability.rst @@ -1,9 +1,9 @@ -Cluster Monitoring ------------------- +Scraping and Persisting Metrics +=============================== Ray ships with the following observability features: -1. :ref:`The dashboard `, for viewing cluster state. +1. :ref:`The dashboard `, for viewing cluster state. 2. CLI tools such as the :ref:`Ray state APIs ` and :ref:`ray status `, for checking application and cluster status. 3. :ref:`Prometheus metrics ` for internal and custom user-defined metrics. @@ -16,7 +16,7 @@ The rest of this page will focus on how to access these services when running a Monitoring the cluster via the dashboard ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -:ref:`The dashboard ` provides detailed information about the state of the cluster, +:ref:`The dashboard ` provides detailed information about the state of the cluster, including the running jobs, actors, workers, nodes, etc. By default, the :ref:`cluster launcher ` and :ref:`KubeRay operator ` will launch the dashboard, but will not publicly expose the port. @@ -96,14 +96,14 @@ below. Prometheus ^^^^^^^^^^ Ray supports Prometheus for emitting and recording time-series metrics. -See :ref:`metrics ` for more details of the metrics emitted. +See :ref:`metrics ` for more details of the metrics emitted. To use Prometheus in a Ray cluster, decide where to host it, then configure it so that it can scrape the metrics from Ray. Scraping metrics ################ -Ray runs a metrics agent per node to export :ref:`metrics ` about Ray core as well as +Ray runs a metrics agent per node to export :ref:`metrics ` about Ray core as well as custom user-defined metrics. Each metrics agent collects metrics from the local node and exposes these in a Prometheus format. You can then scrape each endpoint to access Ray's metrics. diff --git a/doc/source/data/performance-tips.rst b/doc/source/data/performance-tips.rst index 6d9ec81d9bb5e..2ff9ef94ee1fe 100644 --- a/doc/source/data/performance-tips.rst +++ b/doc/source/data/performance-tips.rst @@ -7,7 +7,7 @@ Monitoring your application ~~~~~~~~~~~~~~~~~~~~~~~~~~~ View the Ray dashboard to monitor your application and troubleshoot issues. To learn -more about the Ray dashboard, read :ref:`Ray Dashboard `. +more about the Ray dashboard, read :ref:`Ray Dashboard `. Debugging Statistics ~~~~~~~~~~~~~~~~~~~~ diff --git a/doc/source/ray-air/getting-started.rst b/doc/source/ray-air/getting-started.rst index e35f1e7c11921..9bcdd66f18eae 100644 --- a/doc/source/ray-air/getting-started.rst +++ b/doc/source/ray-air/getting-started.rst @@ -216,4 +216,4 @@ Next Steps - :ref:`air-examples-ref` - :ref:`API reference ` - :ref:`Technical whitepaper ` -- To check how your application is doing, you can use the :ref:`Ray dashboard`. +- To check how your application is doing, you can use the :ref:`Ray dashboard`. 
diff --git a/doc/source/ray-core/api/index.rst b/doc/source/ray-core/api/index.rst index eb5cdd9d0ef5a..2845ebe892ef6 100644 --- a/doc/source/ray-core/api/index.rst +++ b/doc/source/ray-core/api/index.rst @@ -10,5 +10,5 @@ Ray Core API utility.rst exceptions.rst cli.rst - ../../ray-observability/api/state/cli.rst - ../../ray-observability/api/state/api.rst + ../../ray-observability/reference/cli.rst + ../../ray-observability/reference/api.rst diff --git a/doc/source/ray-core/ray-dashboard.rst b/doc/source/ray-core/ray-dashboard.rst deleted file mode 100644 index ef9e0553a17b6..0000000000000 --- a/doc/source/ray-core/ray-dashboard.rst +++ /dev/null @@ -1,685 +0,0 @@ -.. _ray-dashboard: - -Ray Dashboard -============= -Ray provides a web-based dashboard for monitoring and debugging Ray applications. -The dashboard provides a visual representation of the system state, allowing users to track the performance -of their applications and troubleshoot issues. - -.. raw:: html - -
- -
- -Common Workflows ----------------- - -Here are common workflows when using the Ray dashboard. - -- :ref:`View the metrics graphs `. -- :ref:`View the progress of your job `. -- :ref:`Find the application logs or error messages of failed tasks or actors `. -- :ref:`Profile, trace dump, and visualize the timeline of the Ray jobs, tasks, or actors `. -- :ref:`Analyze the CPU and memory usage of the cluster, tasks and actors `. -- :ref:`View the individual state of task, actor, placement group `, and :ref:`nodes (machines from a cluster) ` which is equivalent to :ref:`Ray state APIs `. -- :ref:`View the hardware utilization (CPU, GPU, memory) `. - -Getting Started ---------------- - -To use the dashboard, you should use the `ray[default]` installation: - -.. code-block:: bash - - pip install -U "ray[default]" - -You can access the dashboard through a URL printed when Ray is initialized (the default URL is **http://localhost:8265**) or via the context object returned from `ray.init`. - -.. testcode:: - :hide: - - import ray - ray.shutdown() - -.. testcode:: - - import ray - - context = ray.init() - print(context.dashboard_url) - -.. testoutput:: - - 127.0.0.1:8265 - -.. code-block:: text - - INFO worker.py:1487 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265. - -Ray cluster comes with the dashboard. See :ref:`Cluster Monitoring ` for more details. - -.. note:: - - When using the Ray dashboard, it is highly recommended to also set up Prometheus and Grafana. - They are necessary for critical features such as :ref:`Metrics View `. - See :ref:`Ray Metrics ` to learn how to set up Prometheus and Grafana. - -How to Guides -------------- - -.. _dash-workflow-logs: - -View the application logs and errors -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -**Driver Logs** - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/log_button_at_job.png - :align: center - -If the Ray job is submitted by :ref:`Ray job API `, the job logs are available from the dashboard. The log file follows the following format; ``job-driver-.log``. - -.. note:: - - If the driver is executed directly on the head node of the Ray cluster (without the job API) or run via :ref:`Ray client `, the driver logs are not accessible from the dashboard. In this case, see the terminal output to view the driver logs. - -**Task and Actor Logs** - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/actor_log.png - :align: center - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/task_log.png - :align: center - -Task and actor logs are accessible from the :ref:`task and actor table view `. Click the log button. -You can see the worker logs (``worker-[worker_id]-[job_id]-[pid].[out|err]``) that execute the task and actor. ``.out`` (stdout) and ``.err`` (stderr) logs contain the logs emitted from the tasks and actors. -The core worker logs (``python-core-worker-[worker_id]_[pid].log``) contain the system-level logs for the corresponding worker. - -**Task and Actor Errors** - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/failed_task_progress-bar.png - :align: center - -You can easily identify failed tasks or actors by looking at the job progress bar, which links to the table. - -.. 
image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/task_error_button.png - :align: center - -The table displays the name of the failed tasks or actors and provides access to their corresponding log or error messages. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/task_error_box.png - :align: center - -.. _dash-workflow-cpu-memory-analysis: - -Analyze the CPU and memory usage of tasks and actors -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The :ref:`Metrics View ` in the Ray dashboard provides a "per-component CPU/memory usage graph" that displays CPU and memory usage over time for each task and actor in the application (as well as system components). -This allows users to identify tasks and actors that may be consuming more resources than expected and optimize the performance of the application. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/node_cpu_by_comp.png - :align: center - - -Per component CPU graph. 0.379 cores mean that it uses 40% of a single CPU core. Ray process names start with ``ray::``. ``raylet``, ``agent``, ``dashboard``, or ``gcs`` are system components. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/node_memory_by_comp.png - :align: center - -Per component memory graph. Ray process names start with ``ray::``. ``raylet``, ``agent``, ``dashboard``, or ``gcs`` are system components. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/cluster_page.png - :align: center - -Additionally, users can see a snapshot of hardware utilization from the :ref:`cluster page `, which provides an overview of resource usage across the entire Ray cluster. - -.. _dash-workflow-resource-utilization: - -View the Resource Utilization -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Ray requires users to specify the number of :ref:`resources ` their tasks and actors will use through arguments such as ``num_cpus``, ``num_gpus``, ``memory``, and ``resource``. -These values are used for scheduling, but may not always match the actual resource utilization (physical resource utilization). - -- You can see the logical and physical resource utilization over time from the :ref:`Metrics View `. -- The snapshot of physical resource utilization (CPU, GPU, memory, disk, network) is also available from the :ref:`Cluster View `. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/logical_resource.png - :align: center - -The :ref:`logical resources ` usage. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/physical_resource.png - :align: center - -The physical resources (hardware) usage. Ray provides CPU, GPU, Memory, GRAM, disk, and network usage for each machine in a cluster. - -.. _dash-overview: - -Overview --------- - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/overview-page.png - :align: center - -The overview page provides a high-level status of the Ray cluster. - -**Overview Metrics** - -The Overview Metrics page provides the cluster-level hardware utilization and autoscaling status (number of pending, active, and failed nodes). - -**Recent Jobs** - -The Recent Jobs card provides a list of recently submitted Ray jobs. - -.. 
_dash-event: - -**Event View** - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/event-page.png - :align: center - -The Event View displays a list of events associated with a specific type (e.g., autoscaler or job) in chronological order. The same information is accessible with the ``ray list cluster-events`` :ref:`(Ray state APIs)` CLI commands . - -Two types of events are available. - -- Job: Events related to :ref:`Ray job submission APIs `. -- Autoscaler: Events related to the :ref:`Ray autoscaler `. - -.. _dash-jobs-view: - -Jobs View ---------- - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/jobs.png - :align: center - -The Jobs View lets you monitor the different jobs that ran on your Ray cluster. - -A job is a ray workload that uses Ray APIs (e.g., ``ray.init``). It can be submitted directly (e.g., by executing a Python script within a head node) or via :ref:`Ray job API `. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/job_list.png - :align: center - -The job page displays a list of active, finished, and failed jobs, and clicking on an ID allows users to view detailed information about that job. -For more information on Ray jobs, see the Ray Job Overview section. - -Job Profiling -~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/profile-job.png - :align: center - -You can profile Ray jobs by clicking on the “Stack Trace” or “CPU Flame Graph” actions. See the :ref:`Dashboard Profiling ` for more details. - -.. _dash-workflow-job-progress: - -Advanced Task and Actor Breakdown -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/advanced-progress.png - :align: left - -The job page allows you to see tasks and actors broken down by their states. -Tasks and actors are grouped and nested by default. You can see the nested entries by clicking the expand button. - -Tasks and actors are grouped and nested by the following criteria. - -- All tasks and actors are grouped together, and you can view individual entries by expanding the corresponding row. -- Tasks are grouped by their ``name`` attribute (e.g., ``task.options(name="").remote()``). -- Child tasks (nested tasks) are nested under their parent task's row. -- Actors are grouped by their class name. -- Child actors (actors created within an actor) are nested under their parent actor's row. -- Actor tasks (remote methods within an actor) are nested under the actor for the corresponding actor method. - -.. note:: - - Ray dashboard can only display or retrieve up to 10K tasks at a time. If there are more than 10K tasks from your job, - they are unaccounted. The number of unaccounted tasks is available from the task breakdown. - -Task Timeline -~~~~~~~~~~~~~ - -The :ref:`timeline API ` is available from the dashboard. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/profile-button.png - :align: center - -First, you can download the chrome tracing file by clicking the download button. - -.. 
image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/profile_drag.png - :align: center - -Second, you can use tools like ``chrome://tracing`` or the `Perfetto UI `_ and drop the downloaded chrome tracing file. We will use the Perfetto as it is the recommendation way to visualize chrome tracing files. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/timeline.png - :align: center - -Now, you can see the timeline visualization of Ray tasks and actors. There are Node rows (hardware) and Worker rows (processes). -Each worker rows display a list of events (e.g., task scheduled, task running, input/output deserialization, etc.) happening from that worker over time. - -Ray Status -~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/ray-status.png - :align: center - -The job page displays the output of the CLI tool ``ray status``, which shows the autoscaler status of the Ray cluster. - -The left page shows the autoscaling status, including pending, active, and failed nodes. -The right page displays the cluster's demands, which are resources that cannot be scheduled to the cluster at the moment. This page is useful for debugging resource deadlocks or slow scheduling. - -.. note:: - - The output shows the aggregated information across the cluster (not by job). If you run more than one job, some of the demands may come from other jobs. - -.. _dash-workflow-state-apis: - -Task Table, Actor Table, Placement Group Table -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/tables.png - :align: center - -The dashboard shows a table with the status of the job's tasks, actors, and placement groups. -You get the same information from the :ref:`Ray state APIs `. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/task-table.png - :align: center - -You can expand the table to see a list of each task, actor, and placement group. - -.. _dash-serve-view: - -Serve View ----------- - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/serve.png - :align: center - -The Serve view lets you monitor the status of your :ref:`Ray Serve ` applications. - -The initial page showcases your general Serve configurations, a list of the Serve applications, and, if you have :ref:`Grafana and Prometheus ` configured, some high-level -metrics of all your Serve applications. Click the name of a Serve application to go to the Serve Application Detail Page. - -Serve Application Detail Page -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/serve-application.png - :align: center - -This page shows the Serve application's configurations and metadata. It also lists the :ref:`Serve deployments and replicas `. -Click the expand button of a deployment to see all the replicas in that deployment. - -For each deployment, there are two available actions. You can view the Deployment config and, if you configured :ref:`Grafana and Prometheus `, you can open -a Grafana dashboard with detailed metrics about that deployment. - -For each replica, there are two available actions. 
You can see the logs of that replica and, if you configured :ref:`Grafana and Prometheus `, you can open -a Grafana dashboard with detailed metrics about that replica. Click on the replica name to go to the Serve Replica Detail Page. - - -Serve Replica Detail Page -~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/serve-replica.png - :align: center - -This page shows metadata about the Serve replica, high-level metrics about the replica if you configured :ref:`Grafana and Prometheus `, and -a history of completed :ref:`tasks ` of that replica. - - -Serve Metrics -~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/serve-metrics.png - :align: center - -Ray serve exports various time-series metrics to understand the status of your Serve application over time. More details of these metrics can be found :ref:`here `. -In order to store and visualize these metrics, you must set up Prometheus and Grafana by following the instructions :ref:`here `. - -These metrics are available in the Ray dashboard in the Serve page and the Serve Replica Detail page. They are also accessible as Grafana dashboards. -Within the Grafana dashboard, use the dropdown filters on the top to filter metrics by route, deployment, or replica. Exact descriptions -of each graph are available by hovering over the "info" icon on the top left of each graph. - -.. _dash-node-view: - -Cluster View ------------- - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard/nodes-view-expand.png - :align: center - -The cluster view visualizes hierarchical relationship of -machines (nodes) and workers (processes). Each host consists of many workers, and -you can see them by clicking the + button. This also shows the assignment of GPU resources to specific actors or tasks. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard/node-detail.png - :align: center - -You can also click the node id to go into a node detail page where you can see more information. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/machine-view-log.png - :align: center - - -In addition, the machine view lets you see **logs** for a node or a worker. - -.. _dash-actors-view: - -Actors View ------------ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/actor-page.png - :align: center - -The Actors view lets you see information about the actors that have existed on the ray cluster. - -You can view the logs for an actor and you can see which job created the actor. -The information of up to 1000 dead actors will be stored. -This value can be overridden by using the `RAY_DASHBOARD_MAX_ACTORS_TO_CACHE` environment variable -when starting Ray. - -Actor Profiling -~~~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/actor-profiling.png - :align: center - -You can also run the profiler on a running actor. See :ref:`Dashboard Profiling ` for more details. - -Actor Detail Page -~~~~~~~~~~~~~~~~~ - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/actor-list-id.png - :align: center - -By clicking the ID, you can also see the detail view of the actor. - -.. 
image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/actor-detail.png - :align: center - -From the actor detail page, you can see the metadata, state, and the all tasks that have run from this actor. - -.. _dash-metrics-view: - -Metrics View ------------- - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard/metrics.png - :align: center - -Ray exports default metrics which are available from the :ref:`Metrics View `. Here are some available example metrics. - -- The tasks, actors, and placement groups broken down by states. -- The :ref:`logical resource usage ` across nodes. -- The hardware resource usage across nodes. -- The autoscaler status. - -See :ref:`System Metrics Page ` for available metrics. - -.. note:: - - The metrics view required the Prometheus and Grafana setup. See :ref:`Ray Metrics ` to learn how to set up Prometheus and Grafana. - -The metrics view lets you view visualizations of the time series metrics emitted by Ray. - -You can select the time range of the metrics in the top right corner. The graphs refresh automatically every 15 seconds. - -There is also a convenient button to open the grafana UI from the dashboard. The Grafana UI provides additional customizability of the charts. - -.. _dash-logs-view: - -Logs View ---------- - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard/logs.png - :align: center - -The logs view lets you view all the ray logs that are in your cluster. It is organized by node and log file name. Many log links in the other pages link to this view and filter the list so the relevant logs appear. - -To understand the log file structure of Ray, see the :ref:`Logging directory structure page `. - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard/logs-content.png - :align: center - -The logs view provides search functionality to help you find specific log messages. - -Advanced Usage --------------- - -Changing Dashboard Ports -~~~~~~~~~~~~~~~~~~~~~~~~ - -.. tab-set:: - - .. tab-item:: Single-node local cluster - - **CLI** - - To customize the port on which the dashboard runs, you can pass - the ``--dashboard-port`` argument with ``ray start`` in the command line. - - **ray.init** - - If you need to customize the port on which the dashboard will run, you can pass the - keyword argument ``dashboard_port`` in your call to ``ray.init()``. - - .. tab-item:: VM Cluster Launcher - - To disable the dashboard while using the "VM cluster launcher", include the "ray start --head --include-dashboard=False" argument - and specify the desired port number in the "head_start_ray_commands" section of the `cluster launcher's YAML file `_. - - .. tab-item:: Kuberay - - See the `Specifying non-default ports `_ page. - -Viewing Built-in Dashboard API Metrics -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The dashboard is powered by a server that serves both the UI code and the data about the cluster via API endpoints. -There are basic Prometheus metrics that are emitted for each of these API endpoints: - -`ray_dashboard_api_requests_count_requests_total`: Collects the total count of requests. This is tagged by endpoint, method, and http_status. - -`ray_dashboard_api_requests_duration_seconds_bucket`: Collects the duration of requests. This is tagged by endpoint and method. - -For example, you can view the p95 duration of all requests with this query: - -.. 
code-block:: text - - histogram_quantile(0.95, sum(rate(ray_dashboard_api_requests_duration_seconds_bucket[5m])) by (le)) - -These metrics can be queried via Prometheus or Grafana UI. Instructions on how to set these tools up can be found :ref:`here `. - - -Running Behind a Reverse Proxy -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The dashboard should work out-of-the-box when accessed via a reverse proxy. API requests don't need to be proxied individually. - -Always access the dashboard with a trailing ``/`` at the end of the URL. -For example, if your proxy is set up to handle requests to ``/ray/dashboard``, view the dashboard at ``www.my-website.com/ray/dashboard/``. - -The dashboard now sends HTTP requests with relative URL paths. Browsers will handle these requests as expected when the ``window.location.href`` ends in a trailing ``/``. - -This is a peculiarity of how many browsers handle requests with relative URLs, despite what `MDN `_ -defines as the expected behavior. - -Make your dashboard visible without a trailing ``/`` by including a rule in your reverse proxy that -redirects the user's browser to ``/``, i.e. ``/ray/dashboard`` --> ``/ray/dashboard/``. - -Below is an example with a `traefik `_ TOML file that accomplishes this: - -.. code-block:: yaml - - [http] - [http.routers] - [http.routers.to-dashboard] - rule = "PathPrefix(`/ray/dashboard`)" - middlewares = ["test-redirectregex", "strip"] - service = "dashboard" - [http.middlewares] - [http.middlewares.test-redirectregex.redirectRegex] - regex = "^(.*)/ray/dashboard$" - replacement = "${1}/ray/dashboard/" - [http.middlewares.strip.stripPrefix] - prefixes = ["/ray/dashboard"] - [http.services] - [http.services.dashboard.loadBalancer] - [[http.services.dashboard.loadBalancer.servers]] - url = "http://localhost:8265" - -Disabling the Dashboard -~~~~~~~~~~~~~~~~~~~~~~~ -Dashboard is included in the `ray[default]` installation by default and automatically started. - -To disable the dashboard, use the following arguments `--include-dashboard`. - -.. tab-set:: - - .. tab-item:: Single-node local cluster - - **CLI** - - .. code-block:: bash - - ray start --include-dashboard=False - - **ray.init** - - .. testcode:: - :hide: - - ray.shutdown() - - .. testcode:: - - ray.init(include_dashboard=False) - - .. tab-item:: VM Cluster Launcher - - To disable the dashboard while using the "VM cluster launcher", include the "ray start --head --include-dashboard=False" argument - in the "head_start_ray_commands" section of the `cluster launcher's YAML file `_. - - .. tab-item:: Kuberay - - TODO - -.. _dash-reference: - -Page References ---------------- - -Cluster View -~~~~~~~~~~~~ - -.. list-table:: Cluster View Node Table Reference - :widths: 25 75 - :header-rows: 1 - - * - Term - - Description - * - **State** - - Whether the node or worker is alive or dead. - * - **ID** - - The ID of the node or the workerId for the worker. - * - **Host / Cmd line** - - If it is a node, it shows host information. If it is a worker, it shows the name of the task that is being run. - * - **IP / PID** - - If it is a node, it shows the IP address of the node. If it's a worker, it shows the PID of the worker process. - * - **CPU Usage** - - CPU usage of each node and worker. - * - **Memory** - - RAM usage of each node and worker. - * - **GPU** - - GPU usage of the node. - * - **GRAM** - - GPU memory usage of the node. - * - **Object Store Memory** - - Amount of memory used by the object store for this node. - * - **Disk** - - Disk usage of the node. 
- * - **Sent** - - Network bytes sent for each node and worker. - * - **Received** - - Network bytes received for each node and worker. - * - **Log** - - Logs messages at each node and worker. You can see log files relevant to a node or worker by clicking this link. - * - **Stack Trace** - - Get the Python stack trace for the specified worker. Refer to :ref:`dashboard-profiling` for more information. - * - **CPU Flame Graph** - - Get a CPU flame graph for the specified worker. Refer to :ref:`dashboard-profiling` for more information. - - -Jobs View -~~~~~~~~~ - -.. list-table:: Jobs View Reference - :widths: 25 75 - :header-rows: 1 - - * - Term - - Description - * - **Job ID** - - The ID of the job. This is the primary id that associates tasks and actors to this job. - * - **Submission ID** - - An alternate ID that can be provided by a user or generated for all ray job submissions. - It's useful if you would like to associate your job with an ID that is provided by some external system. - * - **Status** - - Describes the state of a job. One of: - * PENDING: The job has not started yet, likely waiting for the runtime_env to be set up. - * RUNNING: The job is currently running. - * STOPPED: The job was intentionally stopped by the user. - * SUCCEEDED: The job finished successfully. - * FAILED: The job failed. - * - **Logs** - - A link to the logs for this job. - * - **StartTime** - - The time the job was started. - * - **EndTime** - - The time the job finished. - * - **DriverPid** - - The PID for the driver process that is started the job. - -Actors -~~~~~~ - -.. list-table:: Actor View Reference - :widths: 25 75 - :header-rows: 1 - - * - Term - - Description - * - **Actor ID** - - The ID of the actor. - * - **Restart Times** - - Number of times this actor has been restarted. - * - **Name** - - The name of an actor. This can be user defined. - * - **Class** - - The class of the actor. - * - **Function** - - The current function the actor is running. - * - **Job ID** - - The job in which this actor was created. - * - **Pid** - - ID of the worker process on which the actor is running. - * - **IP** - - Node IP Address where the actor is located. - * - **Port** - - The Port for the actor. - * - **State** - - Either one of "ALIVE" or "DEAD". - * - **Log** - - A link to the logs that are relevant to this actor. - * - **Stack Trace** - - Get the Python stack trace for the specified actor. Refer to :ref:`dashboard-profiling` for more information. - * - **CPU Flame Graph** - - Get a CPU flame graph for the specified actor. Refer to :ref:`dashboard-profiling` for more information. - -Resources ---------- -- `Ray Summit observability talk `_ -- `Ray metrics blog `_ -- `Ray dashboard roadmap `_ -- `Observability Training Module `_ \ No newline at end of file diff --git a/doc/source/ray-core/scheduling/ray-oom-prevention.rst b/doc/source/ray-core/scheduling/ray-oom-prevention.rst index eaa0076465d72..99aa1bc215ced 100644 --- a/doc/source/ray-core/scheduling/ray-oom-prevention.rst +++ b/doc/source/ray-core/scheduling/ray-oom-prevention.rst @@ -1,7 +1,7 @@ Out-Of-Memory Prevention ======================== -If application tasks or actors consume a large amount of heap space, it can cause the node to run out of memory (OOM). When that happens, the operating system will start killing worker or raylet processes, disrupting the application. OOM may also stall metrics and if this happens on the head node, it may stall the :ref:`dashboard ` or other control processes and cause the cluster to become unusable. 
+If application tasks or actors consume a large amount of heap space, it can cause the node to run out of memory (OOM). When that happens, the operating system will start killing worker or raylet processes, disrupting the application. OOM may also stall metrics and if this happens on the head node, it may stall the :ref:`dashboard ` or other control processes and cause the cluster to become unusable. In this section we will go over: diff --git a/doc/source/ray-core/walkthrough.rst b/doc/source/ray-core/walkthrough.rst index e4f721a7d59ed..d1522fbed31c2 100644 --- a/doc/source/ray-core/walkthrough.rst +++ b/doc/source/ray-core/walkthrough.rst @@ -60,7 +60,7 @@ As seen above, Ray stores task and actor call results in its :ref:`distributed o Next Steps ---------- -.. tip:: To check how your application is doing, you can use the :ref:`Ray dashboard `. +.. tip:: To check how your application is doing, you can use the :ref:`Ray dashboard `. Ray's key primitives are simple, but can be composed together to express almost any kind of distributed computation. Learn more about Ray's :ref:`key concepts ` with the following user guides: diff --git a/doc/source/ray-observability/getting-started.rst b/doc/source/ray-observability/getting-started.rst new file mode 100644 index 0000000000000..858794dc384a3 --- /dev/null +++ b/doc/source/ray-observability/getting-started.rst @@ -0,0 +1,367 @@ +.. _observability-getting-started: + +Getting Started +=============== + +Ray provides a web-based dashboard for monitoring and debugging Ray applications. +The dashboard provides a visual representation of the system state, allowing users to track the performance +of their applications and troubleshoot issues. + +.. raw:: html + +
+ +
+ + +To use the dashboard, you should use the `ray[default]` installation: + +.. code-block:: bash + + pip install -U "ray[default]" + +You can access the dashboard through a URL printed when Ray is initialized (the default URL is **http://localhost:8265**) or via the context object returned from `ray.init`. + +.. testcode:: + :hide: + + import ray + ray.shutdown() + +.. testcode:: + + import ray + + context = ray.init() + print(context.dashboard_url) + +.. testoutput:: + + 127.0.0.1:8265 + +.. code-block:: text + + INFO worker.py:1487 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265. + +Ray cluster comes with the dashboard. See :ref:`Cluster Monitoring ` for more details. + +.. note:: + + When using the Ray dashboard, it is highly recommended to also set up Prometheus and Grafana. + They are necessary for critical features such as :ref:`Metrics View `. + See :ref:`Configuring and Managing the Dashboard ` to learn how to set up Prometheus and Grafana. + + .. _dash-workflow-cpu-memory-analysis: + +.. _dash-jobs-view: + +Jobs View +--------- + +.. raw:: html + +
+ +
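+
+If you are following along on a local cluster and want something to show up in this view, one option is to submit a script through the Ray Jobs API. This is a minimal sketch, not part of the dashboard setup itself; the entrypoint script name is a placeholder and the address assumes the default dashboard port:
+
+.. code-block:: python
+
+    from ray.job_submission import JobSubmissionClient
+
+    client = JobSubmissionClient("http://127.0.0.1:8265")
+    # The submitted job then appears in the Jobs View.
+    client.submit_job(entrypoint="python my_script.py")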
+
+The Jobs View lets you monitor the different jobs that ran on your Ray cluster.
+
+A job is a Ray workload that uses Ray APIs (e.g., ``ray.init``). It can be submitted directly (e.g., by executing a Python script on a head node) or via the :ref:`Ray job API `.
+
+The job page displays a list of active, finished, and failed jobs, and clicking on an ID allows users to view detailed information about that job.
+For more information on Ray jobs, see the Ray Job Overview section.
+
+Job Profiling
+~~~~~~~~~~~~~
+
+You can profile Ray jobs by clicking on the “Stack Trace” or “CPU Flame Graph” actions. See :ref:`Dashboard Profiling ` for more details.
+
+.. _dash-workflow-job-progress:
+
+Advanced Task and Actor Breakdown
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The job page allows you to see tasks and actors broken down by their states.
+Tasks and actors are grouped and nested by default. You can see the nested entries by clicking the expand button.
+
+Tasks and actors are grouped and nested by the following criteria:
+
+- All tasks and actors are grouped together, and you can view individual entries by expanding the corresponding row.
+- Tasks are grouped by their ``name`` attribute (e.g., ``task.options(name="").remote()``).
+- Child tasks (nested tasks) are nested under their parent task's row.
+- Actors are grouped by their class name.
+- Child actors (actors created within an actor) are nested under their parent actor's row.
+- Actor tasks (remote methods within an actor) are nested under the actor for the corresponding actor method.
+
+.. note::
+
+    The Ray dashboard can only display or retrieve up to 10K tasks at a time. If your job has more than 10K tasks,
+    the excess tasks are not displayed; the number of unaccounted tasks is available from the task breakdown.
+
+Task Timeline
+~~~~~~~~~~~~~
+
+The :ref:`timeline API ` is available from the dashboard.
+
+First, download the Chrome tracing file by clicking the download button.
+
+Second, use a tool like ``chrome://tracing`` or the `Perfetto UI `_ and drop in the downloaded Chrome tracing file. Perfetto is the recommended way to visualize Chrome tracing files.
+
+Now, you can see the timeline visualization of Ray tasks and actors. There are Node rows (hardware) and Worker rows (processes).
+Each worker row displays a list of events (e.g., task scheduled, task running, input/output deserialization, etc.) happening on that worker over time.
+
+Ray Status
+~~~~~~~~~~
+
+The job page displays the output of the CLI tool ``ray status``, which shows the autoscaler status of the Ray cluster.
+
+The left panel shows the autoscaling status, including pending, active, and failed nodes.
+The right panel displays the cluster's demands, which are resources that cannot be scheduled to the cluster at the moment. This panel is useful for debugging resource deadlocks or slow scheduling.
+
+.. note::
+
+    The output shows the aggregated information across the cluster (not by job). If you run more than one job, some of the demands may come from other jobs.
+
+.. _dash-workflow-state-apis:
+
+Task Table, Actor Table, Placement Group Table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The dashboard shows a table with the status of the job's tasks, actors, and placement groups.
+You get the same information from the :ref:`Ray state APIs `.
+
+You can expand the table to see a list of each task, actor, and placement group.
+
+.. 
_dash-serve-view: + +Serve View +---------- + +The Serve view lets you monitor the status of your :ref:`Ray Serve ` applications. + +.. raw:: html + +
+ +
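+
+To have an application to inspect here, you can run a small Serve deployment first. The sketch below is illustrative (the deployment class and response text are arbitrary), not a complete Serve tutorial:
+
+.. code-block:: python
+
+    from ray import serve
+
+    @serve.deployment
+    class Hello:
+        def __call__(self, request):
+            return "Hello from Serve!"
+
+    # Deploys the application so it shows up in the Serve view.
+    serve.run(Hello.bind())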
+ +The initial page showcases your general Serve configurations, a list of the Serve applications, and, if you have :ref:`Grafana and Prometheus ` configured, some high-level +metrics of all your Serve applications. Click the name of a Serve application to go to the Serve Application Detail Page. + +Serve Application Detail Page +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This page shows the Serve application's configurations and metadata. It also lists the :ref:`Serve deployments and replicas `. +Click the expand button of a deployment to see all the replicas in that deployment. + +For each deployment, there are two available actions. You can view the Deployment config and, if you configured :ref:`Grafana and Prometheus `, you can open +a Grafana dashboard with detailed metrics about that deployment. + +For each replica, there are two available actions. You can see the logs of that replica and, if you configured :ref:`Grafana and Prometheus `, you can open +a Grafana dashboard with detailed metrics about that replica. Click on the replica name to go to the Serve Replica Detail Page. + + +Serve Replica Detail Page +~~~~~~~~~~~~~~~~~~~~~~~~~ + +This page shows metadata about the Serve replica, high-level metrics about the replica if you configured :ref:`Grafana and Prometheus `, and +a history of completed :ref:`tasks ` of that replica. + + +Serve Metrics +~~~~~~~~~~~~~ + +Ray serve exports various time-series metrics to understand the status of your Serve application over time. More details of these metrics can be found :ref:`here `. +In order to store and visualize these metrics, you must set up Prometheus and Grafana by following the instructions :ref:`here `. + +These metrics are available in the Ray dashboard in the Serve page and the Serve Replica Detail page. They are also accessible as Grafana dashboards. +Within the Grafana dashboard, use the dropdown filters on the top to filter metrics by route, deployment, or replica. Exact descriptions +of each graph are available by hovering over the "info" icon on the top left of each graph. + + +.. _dash-node-view: + +Cluster View +------------ + +.. raw:: html + +
+ +
+
+The cluster view visualizes the hierarchical relationship of
+machines (nodes) and workers (processes). Each host consists of many workers, and
+you can see them by clicking the + button. This view also shows the assignment of GPU resources to specific actors or tasks.
+
+You can also click a node ID to go to the node detail page, where you can see more information.
+
+In addition, the machine view lets you see **logs** for a node or a worker.
+
+.. _dash-actors-view:
+
+Actors View
+-----------
+
+.. raw:: html
+
+ +
+
+The Actors view lets you see information about the actors that have existed on the Ray cluster.
+
+You can view the logs for an actor and you can see which job created the actor.
+Information for up to 1000 dead actors is stored.
+This value can be overridden by using the `RAY_DASHBOARD_MAX_ACTORS_TO_CACHE` environment variable
+when starting Ray.
+
+Actor Profiling
+~~~~~~~~~~~~~~~
+
+You can also run the profiler on a running actor. See :ref:`Dashboard Profiling ` for more details.
+
+Actor Detail Page
+~~~~~~~~~~~~~~~~~
+
+By clicking the actor ID, you can see the detail view of the actor.
+
+From the actor detail page, you can see the metadata, the state, and all the tasks that have run on this actor.
+
+.. _dash-metrics-view:
+
+Metrics View
+------------
+
+.. raw:: html
+
+ +
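+
+In addition to the default system metrics described below, custom application metrics defined with ``ray.util.metrics`` are exported to Prometheus in the same way and can be graphed alongside them. A minimal sketch (the metric name and tag are arbitrary):
+
+.. code-block:: python
+
+    import ray
+    from ray.util.metrics import Counter
+
+    ray.init()
+
+    # Counts processed items; exported through Ray's Prometheus endpoint.
+    num_items = Counter(
+        "num_items_processed",
+        description="Total number of items processed.",
+        tag_keys=("source",),
+    )
+    num_items.inc(1.0, tags={"source": "example"})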
+
+
+Ray exports default metrics, which are available from the :ref:`Metrics View `. Here are some examples of available metrics:
+
+- The tasks, actors, and placement groups broken down by state.
+- The :ref:`logical resource usage ` across nodes.
+- The hardware resource usage across nodes.
+- The autoscaler status.
+
+See :ref:`System Metrics Page ` for the full list of available metrics.
+
+.. note::
+
+    The Metrics View requires the Prometheus and Grafana setup. See :ref:`Configuring and Managing the Dashboard ` to learn how to set up Prometheus and Grafana.
+
+The Metrics View lets you view visualizations of the time-series metrics emitted by Ray.
+
+You can select the time range of the metrics in the top right corner. The graphs refresh automatically every 15 seconds.
+
+There is also a convenient button to open the Grafana UI from the dashboard. The Grafana UI provides additional customization of the charts.
+
+Analyze the CPU and memory usage of tasks and actors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The :ref:`Metrics View ` in the Ray dashboard provides a "per-component CPU/memory usage graph" that displays CPU and memory usage over time for each task and actor in the application (as well as system components).
+This allows users to identify tasks and actors that may be consuming more resources than expected and optimize the performance of the application.
+
+Per-component CPU graph. 0.379 cores means that the component uses roughly 40% of a single CPU core. Ray process names start with ``ray::``. ``raylet``, ``agent``, ``dashboard``, or ``gcs`` are system components.
+
+Per-component memory graph. Ray process names start with ``ray::``. ``raylet``, ``agent``, ``dashboard``, or ``gcs`` are system components.
+
+Additionally, users can see a snapshot of hardware utilization from the :ref:`cluster page `, which provides an overview of resource usage across the entire Ray cluster.
+
+.. _dash-workflow-resource-utilization:
+
+View the Resource Utilization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ray requires users to specify the number of :ref:`resources ` their tasks and actors will use through arguments such as ``num_cpus``, ``num_gpus``, ``memory``, and ``resources``.
+These values are used for scheduling, but may not always match the actual resource utilization (physical resource utilization).
+
+- You can see the logical and physical resource utilization over time from the :ref:`Metrics View `.
+- The snapshot of physical resource utilization (CPU, GPU, memory, disk, network) is also available from the :ref:`Cluster View `.
+
+The :ref:`logical resources ` usage.
+
+The physical resources (hardware) usage. Ray provides CPU, GPU, memory, GRAM, disk, and network usage for each machine in a cluster.
+
+
+
+.. _dash-logs-view:
+
+Logs View
+---------
+
+.. raw:: html
+
+ +
+ +The logs view lets you view all the Ray logs in your cluster. It is organized by node and log file name. Many log links in the other pages link to this view and filter the list so the relevant logs appear. + +To understand the log file structure of Ray, see the :ref:`Logging directory structure page `. + + +The logs view provides search functionality to help you find specific log messages. + + +**Driver Logs** + +If the Ray job is submitted by :ref:`Ray job API `, the job logs are available from the dashboard. The log file follows the following format; ``job-driver-.log``. + +.. note:: + + If the driver is executed directly on the head node of the Ray cluster (without the job API) or run via :ref:`Ray client `, the driver logs are not accessible from the dashboard. In this case, see the terminal output to view the driver logs. + +**Task and Actor Logs** + +Task and actor logs are accessible from the :ref:`task and actor table view `. Click the log button. +You can see the worker logs (``worker-[worker_id]-[job_id]-[pid].[out|err]``) that execute the task and actor. ``.out`` (stdout) and ``.err`` (stderr) logs contain the logs emitted from the tasks and actors. +The core worker logs (``python-core-worker-[worker_id]_[pid].log``) contain the system-level logs for the corresponding worker. + +**Task and Actor Errors** + +You can easily identify failed tasks or actors by looking at the job progress bar, which links to the table. + +The table displays the name of the failed tasks or actors and provides access to their corresponding log or error messages. + +.. _dash-overview: + +Overview +-------- + +.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/overview-page.png + :align: center + +The overview page provides a high-level status of the Ray cluster. + +**Overview Metrics** + +The Overview Metrics page provides the cluster-level hardware utilization and autoscaling status (number of pending, active, and failed nodes). + +**Recent Jobs** + +The Recent Jobs card provides a list of recently submitted Ray jobs. + +.. _dash-event: + +**Event View** + +.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/new-dashboard-v2/dashboard-pics/event-page.png + :align: center + +The Event View displays a list of events associated with a specific type (e.g., autoscaler or job) in chronological order. The same information is accessible with the ``ray list cluster-events`` :ref:`(Ray state APIs)` CLI commands . + +Two types of events are available. + +- Job: Events related to :ref:`Ray job submission APIs `. +- Autoscaler: Events related to the :ref:`Ray autoscaler `. + +Resources +--------- +- `Ray Summit observability talk `_ +- `Ray metrics blog `_ +- `Ray dashboard roadmap `_ +- `Observability Training Module `_ \ No newline at end of file diff --git a/doc/source/ray-observability/index.rst b/doc/source/ray-observability/index.rst new file mode 100644 index 0000000000000..1f18fc4885f6f --- /dev/null +++ b/doc/source/ray-observability/index.rst @@ -0,0 +1,8 @@ +.. _observability: + +Monitoring and Debugging +======================== + +This section covers how to **monitor and debug Ray applications and clusters**. + + diff --git a/doc/source/ray-observability/key-concepts.rst b/doc/source/ray-observability/key-concepts.rst new file mode 100644 index 0000000000000..1b48c12c9501a --- /dev/null +++ b/doc/source/ray-observability/key-concepts.rst @@ -0,0 +1,201 @@ +.. 
_observability-key-concepts: + +Key Concepts +============ + +This section covers a list of key concepts for monitoring and debugging tools and features in Ray. + +Dashboard (Web UI) +------------------ +Ray supports the web-based dashboard to help users monitor the cluster. When a new cluster is started, the dashboard is available +through the default address `localhost:8265` (port can be automatically incremented if port 8265 is already occupied). + +See :ref:`Getting Started ` for more details about the dashboard. + +Accessing Ray States +-------------------- +Ray 2.0 and later versions support CLI and Python APIs for querying the state of resources (e.g., actor, task, object, etc.) + +For example, the following command summarizes the task state of the cluster: + +.. code-block:: bash + + ray summary tasks + +.. code-block:: text + + ======== Tasks Summary: 2022-07-22 08:54:38.332537 ======== + Stats: + ------------------------------------ + total_actor_scheduled: 2 + total_actor_tasks: 0 + total_tasks: 2 + + + Table (group by func_name): + ------------------------------------ + FUNC_OR_CLASS_NAME STATE_COUNTS TYPE + 0 task_running_300_seconds RUNNING: 2 NORMAL_TASK + 1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK + +The following command lists all the actors from the cluster: + +.. code-block:: bash + + ray list actors + +.. code-block:: text + + ======== List: 2022-07-23 21:29:39.323925 ======== + Stats: + ------------------------------ + Total: 2 + + Table: + ------------------------------ + ACTOR_ID CLASS_NAME NAME PID STATE + 0 31405554844820381c2f0f8501000000 Actor 96956 ALIVE + 1 f36758a9f8871a9ca993b1d201000000 Actor 96955 ALIVE + +See :ref:`Ray State API ` for more details. + +Metrics +------- +Ray collects and exposes the physical stats (e.g., CPU, memory, GRAM, disk, and network usage of each node), +internal stats (e.g., number of actors in the cluster, number of worker failures of the cluster), +and custom metrics (e.g., metrics defined by users). All stats can be exported as time series data (to Prometheus by default) and used +to monitor the cluster over time. + +See :ref:`Ray Metrics ` for more details. + +Exceptions +---------- +Creating a new task or submitting an actor task generates an object reference. When ``ray.get`` is called on the object reference, +the API raises an exception if anything goes wrong with a related task, actor or object. For example, + +- :class:`RayTaskError ` is raised when there's an error from user code that throws an exception. +- :class:`RayActorError ` is raised when an actor is dead (by a system failure such as node failure or user-level failure such as an exception from ``__init__`` method). +- :class:`RuntimeEnvSetupError ` is raised when the actor or task couldn't be started because :ref:`a runtime environment ` failed to be created. + +See :ref:`Exceptions Reference ` for more details. + +Debugger +-------- +Ray has a built-in debugger that allows you to debug your distributed applications. +It allows you to set breakpoints in your Ray tasks and actors, and when hitting the breakpoint, you can +drop into a PDB session that you can then use to: + +- Inspect variables in that context +- Step within that task or actor +- Move up or down the stack + +See :ref:`Ray Debugger ` for more details. + +Profiling +--------- +Ray is compatible with Python profiling tools such as ``CProfile``. It also supports its built-in profiling tool such as :ref:```ray timeline`` `. + +See :ref:`Profiling ` for more details. 
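+
+As a quick illustration of the built-in tool, you can dump a Chrome tracing file from the driver and open it in ``chrome://tracing`` or the Perfetto UI. This is a minimal sketch; the output path is arbitrary:
+
+.. code-block:: python
+
+    import ray
+
+    ray.init()
+
+    @ray.remote
+    def busy():
+        # Burn a little CPU so the tasks show up on the timeline.
+        return sum(i * i for i in range(1_000_000))
+
+    ray.get([busy.remote() for _ in range(4)])
+
+    # Write a Chrome tracing file describing the tasks that ran.
+    ray.timeline(filename="/tmp/timeline.json")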
+ +Tracing +------- +To help debug and monitor Ray applications, Ray supports distributed tracing (integration with OpenTelemetry) across tasks and actors. + +See :ref:`Ray Tracing ` for more details. + +Application Logging +------------------- +By default, all stdout and stderr of tasks and actors are streamed to the Ray driver (the entrypoint script that calls ``ray.init``). + +.. literalinclude:: doc_code/app_logging.py + :language: python + +All stdout emitted from the ``print`` method is printed to the driver with a ``(the task or actor repr, the process ID, IP address)`` prefix. + +.. code-block:: bash + + (pid=45601) task + (Actor pid=480956) actor + +See :ref:`Logging ` for more details. + +Driver logs +~~~~~~~~~~~ +An entry point of Ray applications that calls ``ray.init()`` is called a driver. +All the driver logs are handled in the same way as normal Python programs. + +Job logs +~~~~~~~~ +Logs for jobs submitted via the :ref:`Ray Jobs API ` can be retrieved using the ``ray job logs`` :ref:`CLI command ` or using ``JobSubmissionClient.get_logs()`` or ``JobSubmissionClient.tail_job_logs()`` via the :ref:`Python SDK `. +The log file consists of the stdout of the entrypoint command of the job. For the location of the log file on disk, see :ref:`Logging directory structure `. + +.. _ray-worker-logs: + +Worker stdout and stderr +~~~~~~~~~~~~~~~~~~~~~~~~ +Ray's tasks or actors are executed remotely within Ray's worker processes. Ray has special support to improve the visibility of stdout and stderr produced by workers. + +- By default, stdout and stderr from all tasks and actors are redirected to the worker log files, including any log messages generated by the worker. See :ref:`Logging directory structure ` to understand the structure of the Ray logging directory. +- By default, the driver reads the worker log files to which the stdout and stderr for all tasks and actors are redirected. Drivers display all stdout and stderr generated from their tasks or actors to their own stdout and stderr. + +Let's look at a code example to see how this works. + +.. code-block:: python + + import ray + # Initiate a driver. + ray.init() + + @ray.remote + def task(): + print("task") + + ray.get(task.remote()) + +You should be able to see the string `task` from your driver stdout. + +When logs are printed, the process id (pid) and an IP address of the node that executes tasks/actors are printed together. Check out the output below. + +.. code-block:: bash + + (pid=45601) task + +Actor log messages look like the following by default. + +.. code-block:: bash + + (MyActor pid=480956) actor log message + +Logging directory structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +By default, Ray logs are stored in a ``/tmp/ray/session_*/logs`` directory. + +..{note}: +The default temp directory is ``/tmp/ray`` (for Linux and MacOS). To change the temp directory, specify it when you call ``ray start`` or ``ray.init()``. + +A new Ray instance creates a new session ID to the temp directory. The latest session ID is symlinked to ``/tmp/ray/session_latest``. + +Here's a Ray log directory structure. Note that ``.out`` is logs from stdout/stderr and ``.err`` is logs from stderr. The backward compatibility of log directories is not maintained. + +- ``dashboard.[log|err]``: A log file of a Ray dashboard. ``log.`` file contains logs generated from the dashboard's logger. ``.err`` file contains stdout and stderr printed from the dashboard. They are usually empty except when the dashboard crashes unexpectedly. 
+- ``dashboard_agent.log``: Every Ray node has one dashboard agent. This is a log file of the agent. +- ``gcs_server.[out|err]``: The GCS server is a stateless server that manages Ray cluster metadata. It exists only in the head node. +- ``io-worker-[worker_id]-[pid].[out|err]``: Ray creates IO workers to spill/restore objects to external storage by default from Ray 1.3+. This is a log file of IO workers. +- ``job-driver-[submission_id].log``: The stdout of a job submitted via the :ref:`Ray Jobs API `. +- ``log_monitor.[log|err]``: The log monitor is in charge of streaming logs to the driver. ``log.`` file contains logs generated from the log monitor's logger. ``.err`` file contains the stdout and stderr printed from the log monitor. They are usually empty except when the log monitor crashes unexpectedly. +- ``monitor.[out|err]``: Stdout and stderr of a cluster launcher. +- ``monitor.log``: Ray's cluster launcher is operated with a monitor process. It also manages the autoscaler. +- ``plasma_store.[out|err]``: Deprecated. +- ``python-core-driver-[worker_id]_[pid].log``: Ray drivers consist of CPP core and Python/Java frontend. This is a log file generated from CPP code. +- ``python-core-worker-[worker_id]_[pid].log``: Ray workers consist of CPP core and Python/Java frontend. This is a log file generated from CPP code. +- ``raylet.[out|err]``: A log file of raylets. +- ``redis-shard_[shard_index].[out|err]``: Redis shard log files. +- ``redis.[out|err]``: Redis log files. +- ``runtime_env_agent.log``: Every Ray node has one agent that manages :ref:`runtime environment ` creation, deletion and caching. + This is the log file of the agent containing logs of create/delete requests and cache hits and misses. + For the logs of the actual installations (including e.g. ``pip install`` logs), see the ``runtime_env_setup-[job_id].log`` file (see below). +- ``runtime_env_setup-[job_id].log``: Logs from installing :ref:`runtime environments ` for a task, actor or job. This file will only be present if a runtime environment is installed. +- ``runtime_env_setup-ray_client_server_[port].log``: Logs from installing :ref:`runtime environments ` for a job when connecting via :ref:`Ray Client `. +- ``worker-[worker_id]-[job_id]-[pid].[out|err]``: Python or Java part of Ray drivers and workers. All of stdout and stderr from tasks or actors are streamed here. Note that job_id is an id of the driver.- + diff --git a/doc/source/ray-observability/monitoring-debugging/getting-help.rst b/doc/source/ray-observability/monitoring-debugging/getting-help.rst deleted file mode 100644 index 8118de7f7ca8d..0000000000000 --- a/doc/source/ray-observability/monitoring-debugging/getting-help.rst +++ /dev/null @@ -1,60 +0,0 @@ -.. _ray-troubleshoot-getting-help: - -Getting Help -============ - -Ray Community -------------- - -If you stuck on a problem, there are several ways to ask the Ray community. - -.. _`Discourse Forum`: https://discuss.ray.io/ -.. _`GitHub Issues`: https://github.com/ray-project/ray/issues -.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray -.. _`Slack`: https://forms.gle/9TSdDYUgxYs8SA9e8 - -.. list-table:: - :widths: 25 50 25 25 - :header-rows: 1 - - * - Platform - - Purpose - - Estimated Response Time - - Support Level - * - `Discourse Forum`_ - - For discussions about development and questions about usage. - - < 1 day - - Community - * - `GitHub Issues`_ - - For reporting bugs and filing feature requests. 
- - < 2 days - - Ray OSS Team - * - `Slack`_ - - For collaborating with other Ray users. - - < 2 days - - Community - * - `StackOverflow`_ - - For asking questions about how to use Ray. - - 3-5 days - - Community - -Discourse Forum -~~~~~~~~~~~~~~~ -`Discourse Forum` is the primary place to ask questions, where the Ray committers, contributors, and other Ray users answer questions. -Someone from the community may have already answered your question, so before you ask a new question, please make sure to search them. -The Ray contributors monitor the forum daily and expect to respond within a day. - -Bugs or Feature Requests -~~~~~~~~~~~~~~~~~~~~~~~~ -Sometimes, the question or problem you have turns out to be a real bug or requires an enhancement request. In this case, -file a new issue to the `GitHub Issues`_ page. Ray contributors will triage and -address them accordingly. - -StackOverflow -~~~~~~~~~~~~~ -You can also ask questions to `StackOverflow`_ with a Ray tag. On StackOverflow, we strive to respond to questions within 3~5 days. - -Slack -~~~~~ -Many Ray users hang out on Ray `Slack`_ (the invitation is open to everyone). You can join the slack and directly communicate to other Ray users or contributors. -For asking questions, we recommend using the discourse forum or StackOverflow for future searchability. diff --git a/doc/source/ray-observability/monitoring-debugging/monitoring-debugging.rst b/doc/source/ray-observability/monitoring-debugging/monitoring-debugging.rst deleted file mode 100644 index 6d882c34fe598..0000000000000 --- a/doc/source/ray-observability/monitoring-debugging/monitoring-debugging.rst +++ /dev/null @@ -1,20 +0,0 @@ -.. _observability: - -Monitoring and Debugging -======================== - -This section covers how to **monitor and debug Ray applications and clusters**. - -See :ref:`Getting Help ` if your problem is not solved by the troubleshooting guide. - -.. toctree:: - :maxdepth: 0 - - ../overview - ../../ray-core/ray-dashboard - ../state/state-api - ../ray-logging - ../ray-metrics - profiling - gotchas - getting-help diff --git a/doc/source/ray-observability/overview.rst b/doc/source/ray-observability/overview.rst deleted file mode 100644 index 8919d3f29b5a5..0000000000000 --- a/doc/source/ray-observability/overview.rst +++ /dev/null @@ -1,156 +0,0 @@ -.. _observability-overview: - -Overview -======== - -This section covers a list of available monitoring and debugging tools and features in Ray. - -This documentation only covers the high-level description of available tools and features. For more details, see :ref:`Ray Observability `. - -Dashboard (Web UI) ------------------- -Ray supports the web-based dashboard to help users monitor the cluster. When a new cluster is started, the dashboard is available -through the default address `localhost:8265` (port can be automatically incremented if port 8265 is already occupied). - -See :ref:`Ray Dashboard ` for more details. - -Application Logging -------------------- -By default, all stdout and stderr of tasks and actors are streamed to the Ray driver (the entrypoint script that calls ``ray.init``). - -.. literalinclude:: doc_code/app_logging.py - :language: python - -All stdout emitted from the ``print`` method is printed to the driver with a ``(the task or actor repr, the process ID, IP address)`` prefix. - -.. code-block:: bash - - (pid=45601) task - (Actor pid=480956) actor - -See :ref:`Logging ` for more details. 
- -Exceptions ----------- -Creating a new task or submitting an actor task generates an object reference. When ``ray.get`` is called on the object reference, -the API raises an exception if anything goes wrong with a related task, actor or object. For example, - -- :class:`RayTaskError ` is raised when there's an error from user code that throws an exception. -- :class:`RayActorError ` is raised when an actor is dead (by a system failure such as node failure or user-level failure such as an exception from ``__init__`` method). -- :class:`RuntimeEnvSetupError ` is raised when the actor or task couldn't be started because :ref:`a runtime environment ` failed to be created. - -See :ref:`Exceptions Reference ` for more details. - -Accessing Ray States --------------------- -Starting from Ray 2.0, it supports CLI / Python APIs to query the state of resources (e.g., actor, task, object, etc.). - -For example, the following command will summarize the task state of the cluster. - -.. code-block:: bash - - ray summary tasks - -.. code-block:: text - - ======== Tasks Summary: 2022-07-22 08:54:38.332537 ======== - Stats: - ------------------------------------ - total_actor_scheduled: 2 - total_actor_tasks: 0 - total_tasks: 2 - - - Table (group by func_name): - ------------------------------------ - FUNC_OR_CLASS_NAME STATE_COUNTS TYPE - 0 task_running_300_seconds RUNNING: 2 NORMAL_TASK - 1 Actor.__init__ FINISHED: 2 ACTOR_CREATION_TASK - -The following command will list all the actors from the cluster. - -.. code-block:: bash - - ray list actors - -.. code-block:: text - - ======== List: 2022-07-23 21:29:39.323925 ======== - Stats: - ------------------------------ - Total: 2 - - Table: - ------------------------------ - ACTOR_ID CLASS_NAME NAME PID STATE - 0 31405554844820381c2f0f8501000000 Actor 96956 ALIVE - 1 f36758a9f8871a9ca993b1d201000000 Actor 96955 ALIVE - -See :ref:`Ray State API ` for more details. - -Debugger --------- -Ray has a built-in debugger that allows you to debug your distributed applications. -It allows you to set breakpoints in your Ray tasks and actors, and when hitting the breakpoint, you can -drop into a PDB session that you can then use to: - -- Inspect variables in that context -- Step within that task or actor -- Move up or down the stack - -See :ref:`Ray Debugger ` for more details. - -Monitoring Cluster State and Resource Demands ---------------------------------------------- -You can monitor cluster usage and auto-scaling status by running (on the head node) a CLI command ``ray status``. It displays - -- **Cluster State**: Nodes that are up and running. Addresses of running nodes. Information about pending nodes and failed nodes. -- **Autoscaling Status**: The number of nodes that are autoscaling up and down. -- **Cluster Usage**: The resource usage of the cluster. E.g., requested CPUs from all Ray tasks and actors. Number of GPUs that are used. - -Here's an example output. - -.. 
code-block:: shell - - $ ray status - ======== Autoscaler status: 2021-10-12 13:10:21.035674 ======== - Node status - --------------------------------------------------------------- - Healthy: - 1 ray.head.default - 2 ray.worker.cpu - Pending: - (no pending nodes) - Recent failures: - (no failures) - - Resources - --------------------------------------------------------------- - Usage: - 0.0/10.0 CPU - 0.00/70.437 GiB memory - 0.00/10.306 GiB object_store_memory - - Demands: - (no resource demands) - -Metrics -------- -Ray collects and exposes the physical stats (e.g., CPU, memory, GRAM, disk, and network usage of each node), -internal stats (e.g., number of actors in the cluster, number of worker failures of the cluster), -and custom metrics (e.g., metrics defined by users). All stats can be exported as time series data (to Prometheus by default) and used -to monitor the cluster over time. - -See :ref:`Ray Metrics ` for more details. - -Profiling ---------- -Ray is compatible with Python profiling tools such as ``CProfile``. It also supports its built-in profiling tool such as :ref:```ray timeline`` `. - -See :ref:`Profiling ` for more details. - -Tracing -------- -To help debug and monitor Ray applications, Ray supports distributed tracing (integration with OpenTelemetry) across tasks and actors. - -See :ref:`Ray Tracing ` for more details. \ No newline at end of file diff --git a/doc/source/ray-observability/ray-logging.rst b/doc/source/ray-observability/ray-logging.rst deleted file mode 100644 index b0b931db00373..0000000000000 --- a/doc/source/ray-observability/ray-logging.rst +++ /dev/null @@ -1,337 +0,0 @@ -.. _ray-logging: - -Logging -======= -This document explains Ray's logging system and related best practices. - -Internal Ray Logging Configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -When ``import ray`` is executed, Ray's logger is initialized, generating a sensible configuration given in ``python/ray/_private/log.py``. The default logging level is ``logging.INFO``. - -All ray loggers are automatically configured in ``ray._private.ray_logging``. To change the Ray library logging configuration: - -.. code-block:: python - - import logging - - logger = logging.getLogger("ray") - logger # Modify the ray logging config - -Similarly, to modify the logging configuration for any Ray subcomponent, specify the appropriate logger name: - -.. code-block:: python - - import logging - - # First, get the handle for the logger you want to modify - ray_data_logger = logging.getLogger("ray.data") - ray_tune_logger = logging.getLogger("ray.tune") - ray_rllib_logger = logging.getLogger("ray.rllib") - ray_air_logger = logging.getLogger("ray.air") - ray_train_logger = logging.getLogger("ray.train") - ray_workflow_logger = logging.getLogger("ray.workflow") - - # Modify the ray.data logging level - ray_data_logger.setLevel(logging.WARNING) - - # Other loggers can be modified similarly. - # Here's how to add an aditional file handler for ray tune: - ray_tune_logger.addHandler(logging.FileHandler("extra_ray_tune_log.log")) - -For more information about logging in workers, see :ref:`Customizing worker loggers`. - -Driver logs -~~~~~~~~~~~ -An entry point of Ray applications that calls ``ray.init()`` is called a driver. -All the driver logs are handled in the same way as normal Python programs. 
- -Job logs -~~~~~~~~ -Logs for jobs submitted via the :ref:`Ray Jobs API ` can be retrieved using the ``ray job logs`` :ref:`CLI command ` or using ``JobSubmissionClient.get_logs()`` or ``JobSubmissionClient.tail_job_logs()`` via the :ref:`Python SDK `. -The log file consists of the stdout of the entrypoint command of the job. For the location of the log file on disk, see :ref:`Logging directory structure `. - -.. _ray-worker-logs: - -Worker stdout and stderr -~~~~~~~~~~~ -Ray's tasks or actors are executed remotely within Ray's worker processes. Ray has special support to improve the visibility of stdout and stderr produced by workers. - -- By default, stdout and stderr from all tasks and actors are redirected to the worker log files, including any log messages generated by the worker. See :ref:`Logging directory structure ` to understand the structure of the Ray logging directory. -- By default, the driver reads the worker log files to which the stdout and stderr for all tasks and actors are redirected. Drivers display all stdout and stderr generated from their tasks or actors to their own stdout and stderr. - -Let's look at a code example to see how this works. - -.. code-block:: python - - import ray - # Initiate a driver. - ray.init() - - @ray.remote - def task(): - print("task") - - ray.get(task.remote()) - -You should be able to see the string `task` from your driver stdout. - -When logs are printed, the process id (pid) and an IP address of the node that executes tasks/actors are printed together. Check out the output below. - -.. code-block:: bash - - (pid=45601) task - -Actor log messages look like the following by default. - -.. code-block:: bash - - (MyActor pid=480956) actor log message - -Log deduplication -~~~~~~~~~~~~~~~~~ - -By default, Ray will deduplicate logs that appear redundantly across multiple processes. The first instance of each log message will always be immediately printed. However, subsequent log messages of the same pattern (ignoring words with numeric components) will be buffered for up to five seconds and printed in batch. For example, for the following code snippet: - -.. code-block:: python - - import ray - import random - - @ray.remote - def task(): - print("Hello there, I am a task", random.random()) - - ray.get([task.remote() for _ in range(100)]) - -The output will be as follows: - -.. code-block:: bash - - 2023-03-27 15:08:34,195 INFO worker.py:1603 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 - (task pid=534172) Hello there, I am a task 0.20583517821231412 - (task pid=534174) Hello there, I am a task 0.17536720316370757 [repeated 99x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication) - -This feature is especially useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when imported. You can configure this feature as follows: - -1. Set ``RAY_DEDUP_LOGS=0`` to disable this feature entirely. -2. Set ``RAY_DEDUP_LOGS_AGG_WINDOW_S=`` to change the agggregation window. -3. Set ``RAY_DEDUP_LOGS_ALLOW_REGEX=`` to specify log messages to never deduplicate. -4. Set ``RAY_DEDUP_LOGS_SKIP_REGEX=`` to specify log messages to skip printing. - - -Disabling logging to the driver -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In large scale runs, it may be undesirable to route all worker logs to the driver. You can disable this feature by setting ``log_to_driver=False`` in Ray init: - -.. 
code-block:: python - - import ray - - # Task and actor logs will not be copied to the driver stdout. - ray.init(log_to_driver=False) - -Customizing Actor logs prefixes -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -It is often useful to distinguish between log messages from different actors. For example, suppose you have a large number of worker actors. In this case, you may want to be able to easily see the index of the actor that logged a particular message. This can be achieved by defining the `__repr__ `__ method for an actor class. When defined, the actor repr will be used in place of the actor name. For example: - -.. literalinclude:: /ray-core/doc_code/actor-repr.py - -This produces the following output: - -.. code-block:: bash - - (MyActor(index=2) pid=482120) hello there - (MyActor(index=1) pid=482119) hello there - -Coloring Actor log prefixes -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -By default Ray prints Actor logs prefixes in light blue: -Users may instead activate multi-color prefixes by setting the environment variable ``RAY_COLOR_PREFIX=1``. -This will index into an array of colors modulo the PID of each process. - -.. image:: ./images/coloring-actor-log-prefixes.png - :align: center - -Distributed progress bars (tqdm) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When using `tqdm `__ in Ray remote tasks or actors, you may notice that the progress bar output is corrupted. To avoid this problem, you can use the Ray distributed tqdm implementation at ``ray.experimental.tqdm_ray``: - -.. literalinclude:: /ray-core/doc_code/tqdm.py - -This tqdm implementation works as follows: - -1. The ``tqdm_ray`` module translates TQDM calls into special json log messages written to worker stdout. -2. The Ray log monitor, instead of copying these log messages directly to the driver stdout, routes these messages to a tqdm singleton. -3. The tqdm singleton determines the positions of progress bars from various Ray tasks / actors, ensuring they don't collide or conflict with each other. - -Limitations: - -- Only a subset of tqdm functionality is supported. Refer to the ray_tqdm `implementation `__ for more details. -- Performance may be poor if there are more than a couple thousand updates per second (updates are not batched). - -By default, the builtin print will also be patched to use `ray.experimental.tqdm_ray.safe_print` when `tqdm_ray` is used. -This avoids progress bar corruption on driver print statements. To disable this, set `RAY_TQDM_PATCH_PRINT=0`. - -Customizing Worker Loggers -~~~~~~~~~~~~~~~~~~~~~ -When using Ray, all tasks and actors are executed remotely in Ray's worker processes. - -.. note:: - - To stream logs to a driver, they should be flushed to stdout and stderr. - -.. code-block:: python - - import ray - import logging - # Initiate a driver. - ray.init() - - @ray.remote - class Actor: - def __init__(self): - # Basic config automatically configures logs to - # be streamed to stdout and stderr. - # Set the severity to INFO so that info logs are printed to stdout. - logging.basicConfig(level=logging.INFO) - - def log(self, msg): - logger = logging.getLogger(__name__) - logger.info(msg) - - actor = Actor.remote() - ray.get(actor.log.remote("A log message for an actor.")) - - @ray.remote - def f(msg): - logging.basicConfig(level=logging.INFO) - logger = logging.getLogger(__name__) - logger.info(msg) - - ray.get(f.remote("A log message for a task.")) - -.. code-block:: bash - - (Actor pid=179641) INFO:__main__:A log message for an actor. - (f pid=177572) INFO:__main__:A log message for a task. 
- -How to use structured logging -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The metadata of tasks or actors may be obtained by Ray's :ref:`runtime_context APIs `. -Runtime context APIs help you to add metadata to your logging messages, making your logs more structured. - -.. code-block:: python - - import ray - # Initiate a driver. - ray.init() - - @ray.remote - def task(): - print(f"task_id: {ray.get_runtime_context().task_id}") - - ray.get(task.remote()) - -.. code-block:: bash - - (pid=47411) task_id: TaskID(a67dc375e60ddd1affffffffffffffffffffffff01000000) - -Logging directory structure ---------------------------- -.. _logging-directory-structure: - -By default, Ray logs are stored in a ``/tmp/ray/session_*/logs`` directory. - -.. note:: - - The default temp directory is ``/tmp/ray`` (for Linux and MacOS). To change the temp directory, specify it when you call ``ray start`` or ``ray.init()``. - -A new Ray instance creates a new session ID to the temp directory. The latest session ID is symlinked to ``/tmp/ray/session_latest``. - -Here's a Ray log directory structure. Note that ``.out`` is logs from stdout/stderr and ``.err`` is logs from stderr. The backward compatibility of log directories is not maintained. - -- ``dashboard.[log|err]``: A log file of a Ray dashboard. ``log.`` file contains logs generated from the dashboard's logger. ``.err`` file contains stdout and stderr printed from the dashboard. They are usually empty except when the dashboard crashes unexpectedly. -- ``dashboard_agent.log``: Every Ray node has one dashboard agent. This is a log file of the agent. -- ``gcs_server.[out|err]``: The GCS server is a stateless server that manages Ray cluster metadata. It exists only in the head node. -- ``io-worker-[worker_id]-[pid].[out|err]``: Ray creates IO workers to spill/restore objects to external storage by default from Ray 1.3+. This is a log file of IO workers. -- ``job-driver-[submission_id].log``: The stdout of a job submitted via the :ref:`Ray Jobs API `. -- ``log_monitor.[log|err]``: The log monitor is in charge of streaming logs to the driver. ``log.`` file contains logs generated from the log monitor's logger. ``.err`` file contains the stdout and stderr printed from the log monitor. They are usually empty except when the log monitor crashes unexpectedly. -- ``monitor.[out|err]``: Stdout and stderr of a cluster launcher. -- ``monitor.log``: Ray's cluster launcher is operated with a monitor process. It also manages the autoscaler. -- ``plasma_store.[out|err]``: Deprecated. -- ``python-core-driver-[worker_id]_[pid].log``: Ray drivers consist of CPP core and Python/Java frontend. This is a log file generated from CPP code. -- ``python-core-worker-[worker_id]_[pid].log``: Ray workers consist of CPP core and Python/Java frontend. This is a log file generated from CPP code. -- ``raylet.[out|err]``: A log file of raylets. -- ``redis-shard_[shard_index].[out|err]``: Redis shard log files. -- ``redis.[out|err]``: Redis log files. -- ``runtime_env_agent.log``: Every Ray node has one agent that manages :ref:`runtime environment ` creation, deletion and caching. - This is the log file of the agent containing logs of create/delete requests and cache hits and misses. - For the logs of the actual installations (including e.g. ``pip install`` logs), see the ``runtime_env_setup-[job_id].log`` file (see below). -- ``runtime_env_setup-[job_id].log``: Logs from installing :ref:`runtime environments ` for a task, actor or job. This file will only be present if a runtime environment is installed. 
-- ``runtime_env_setup-ray_client_server_[port].log``: Logs from installing :ref:`runtime environments ` for a job when connecting via :ref:`Ray Client `. -- ``worker-[worker_id]-[job_id]-[pid].[out|err]``: Python or Java part of Ray drivers and workers. All of stdout and stderr from tasks or actors are streamed here. Note that job_id is an id of the driver.- - -.. _ray-log-rotation: - -Log rotation ------------- - -Ray supports log rotation of log files. Note that not all components are currently supporting log rotation. (Raylet and Python/Java worker logs are not rotating). - -By default, logs are rotating when it reaches to 512MB (maxBytes), and there could be up to 5 backup files (backupCount). Indexes are appended to all backup files (e.g., `raylet.out.1`) -If you'd like to change the log rotation configuration, you can do it by specifying environment variables. For example, - -.. code-block:: bash - - RAY_ROTATION_MAX_BYTES=1024; ray start --head # Start a ray instance with maxBytes 1KB. - RAY_ROTATION_BACKUP_COUNT=1; ray start --head # Start a ray instance with backupCount 1. - -Redirecting Ray logs to stderr -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -By default, Ray logs are written to files under the ``/tmp/ray/session_*/logs`` directory. If you wish to redirect all internal Ray logging and your own logging within tasks/actors to stderr of the host nodes, you can do so by ensuring that the ``RAY_LOG_TO_STDERR=1`` environment variable is set on the driver and on all Ray nodes. This is very useful if you are using a log aggregator that needs log records to be written to stderr in order for them to be captured. - -Redirecting logging to stderr will also cause a ``({component})`` prefix, e.g. ``(raylet)``, to be added to each of the log record messages. - -.. code-block:: bash - - [2022-01-24 19:42:02,978 I 1829336 1829336] (gcs_server) grpc_server.cc:103: GcsServer server started, listening on port 50009. - [2022-01-24 19:42:06,696 I 1829415 1829415] (raylet) grpc_server.cc:103: ObjectManager server started, listening on port 40545. - 2022-01-24 19:42:05,087 INFO (dashboard) dashboard.py:95 -- Setup static dir for dashboard: /mnt/data/workspace/ray/python/ray/dashboard/client/build - 2022-01-24 19:42:07,500 INFO (dashboard_agent) agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:49228 - -This should make it easier to filter the stderr stream of logs down to the component of interest. Note that multi-line log records will **not** have this component marker at the beginning of each line. - -When running a local Ray cluster, this environment variable should be set before starting the local cluster: - -.. code-block:: python - - os.environ["RAY_LOG_TO_STDERR"] = "1" - ray.init() - -When starting a local cluster via the CLI or when starting nodes in a multi-node Ray cluster, this environment variable should be set before starting up each node: - -.. code-block:: bash - - env RAY_LOG_TO_STDERR=1 ray start - -If using the Ray cluster launcher, you would specify this environment variable in the Ray start commands: - -.. code-block:: bash - - head_start_ray_commands: - - ray stop - - env RAY_LOG_TO_STDERR=1 ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml - - worker_start_ray_commands: - - ray stop - - env RAY_LOG_TO_STDERR=1 ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 - -When connecting to the cluster, be sure to set the environment variable before connecting: - -.. 
code-block:: python - - os.environ["RAY_LOG_TO_STDERR"] = "1" - ray.init(address="auto") diff --git a/doc/source/ray-observability/ray-metrics.rst b/doc/source/ray-observability/ray-metrics.rst deleted file mode 100644 index bcedba3a3f778..0000000000000 --- a/doc/source/ray-observability/ray-metrics.rst +++ /dev/null @@ -1,316 +0,0 @@ -.. _ray-metrics: - -Metrics -======= - -To help monitor Ray applications, Ray - -- Collects system-level metrics. -- Provides a default configuration for prometheus. -- Provides a default Grafana dashboard. -- Exposes metrics in a Prometheus format. We'll call the endpoint to access these metrics a Prometheus endpoint. -- Supports custom metrics APIs that resemble Prometheus `metric types `_. - -Getting Started ---------------- - -.. tip:: - - The below instructions for Prometheus to enable a basic workflow of running and accessing the dashboard on your local machine. - For more information about how to run Prometheus on a remote cluster, see :ref:`here `. - -Ray exposes its metrics in Prometheus format. This allows us to easily scrape them using Prometheus. - -First, `download Prometheus `_. Make sure to download the correct binary for your operating system. (Ex: darwin for mac osx) - -Then, unzip the archive into a local directory using the following command. - -.. code-block:: bash - - tar xvfz prometheus-*.tar.gz - cd prometheus-* - -Ray exports metrics only when ``ray[default]`` is installed. - -.. code-block:: bash - - pip install "ray[default]" - -Ray provides a prometheus config that works out of the box. After running ray, it can be found at `/tmp/ray/session_latest/metrics/prometheus/prometheus.yml`. - -.. code-block:: yaml - - global: - scrape_interval: 15s - evaluation_interval: 15s - - scrape_configs: - # Scrape from each ray node as defined in the service_discovery.json provided by ray. - - job_name: 'ray' - file_sd_configs: - - files: - - '/tmp/ray/prom_metrics_service_discovery.json' - - -Next, let's start Prometheus. - -.. code-block:: shell - - ./prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml - -.. note:: - If you are using mac, you may receive an error at this point about trying to launch an application where the developer has not been verified. See :ref:`this link ` to fix the issue. - -Now, you can access Ray metrics from the default Prometheus url, `http://localhost:9090`. - -See :ref:`here ` for more information on how to set up Prometheus on a Ray Cluster. - -.. _grafana: - -Grafana -------- - -.. tip:: - - The below instructions for Grafana setup to enable a basic workflow of running and accessing the dashboard on your local machine. - For more information about how to run Grafana on a remote cluster, see :ref:`here `. - -Grafana is a tool that supports more advanced visualizations of prometheus metrics and -allows you to create custom dashboards with your favorite metrics. Ray exports some default -configurations which includes a default dashboard showing some of the most valuable metrics -for debugging ray applications. - - -Deploying Grafana -~~~~~~~~~~~~~~~~~ - -First, `download Grafana `_. Follow the instructions on the download page to download the right binary for your operating system. - -Then go to to the location of the binary and run grafana using the built in configuration found in `/tmp/ray/session_latest/metrics/grafana` folder. - -.. 
code-block:: shell - - ./bin/grafana-server --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web - -Now, you can access grafana using the default grafana url, `http://localhost:3000`. -You can then see the default dashboard by going to dashboards -> manage -> Ray -> Default Dashboard. The same :ref:`metric graphs ` are also accessible via :ref:`Ray Dashboard `. - -.. tip:: - - If this is your first time using Grafana, you can login with the username: `admin` and password `admin`. - -.. image:: images/graphs.png - :align: center - - -See :ref:`here ` for more information on how to set up Grafana on a Ray Cluster. - -.. _system-metrics: - -System Metrics --------------- -Ray exports a number of system metrics, which provide introspection into the state of Ray workloads, as well as hardware utilization statistics. The following table describes the officially supported metrics: - -.. note:: - - Certain labels are common across all metrics, such as `SessionName` (uniquely identifies a Ray cluster instance), `instance` (per-node label applied by Prometheus, and `JobId` (Ray job id, as applicable). - -.. list-table:: Ray System Metrics - :header-rows: 1 - - * - Prometheus Metric - - Labels - - Description - * - `ray_tasks` - - `Name`, `State`, `IsRetry` - - Current number of tasks (both remote functions and actor calls) by state. The State label (e.g., RUNNING, FINISHED, FAILED) describes the state of the task. See `rpc::TaskState `_ for more information. The function/method name is available as the Name label. If the task was retried due to failure or reconstruction, the IsRetry label will be set to "1", otherwise "0". - * - `ray_actors` - - `Name`, `State` - - Current number of actors in a particular state. The State label is described by `rpc::ActorTableData `_ proto in gcs.proto. The actor class name is available in the Name label. - * - `ray_resources` - - `Name`, `State`, `InstanceId` - - Logical resource usage for each node of the cluster. Each resource has some quantity that is `in either `_ USED state vs AVAILABLE state. The Name label defines the resource name (e.g., CPU, GPU). - * - `ray_object_store_memory` - - `Location`, `ObjectState`, `InstanceId` - - Object store memory usage in bytes, `broken down `_ by logical Location (SPILLED, IN_MEMORY, etc.), and ObjectState (UNSEALED, SEALED). - * - `ray_placement_groups` - - `State` - - Current number of placement groups by state. The State label (e.g., PENDING, CREATED, REMOVED) describes the state of the placement group. See `rpc::PlacementGroupTable `_ for more information. - * - `ray_memory_manager_worker_eviction_total` - - `Type`, `Name` - - The number of tasks and actors killed by the Ray Out of Memory killer (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html) broken down by types (whether it is tasks or actors) and names (name of tasks and actors). - * - `ray_node_cpu_utilization` - - `InstanceId` - - The CPU utilization per node as a percentage quantity (0..100). This should be scaled by the number of cores per node to convert the units into cores. - * - `ray_node_cpu_count` - - `InstanceId` - - The number of CPU cores per node. - * - `ray_node_gpus_utilization` - - `InstanceId`, `GpuDeviceName`, `GpuIndex` - - The GPU utilization per GPU as a percentage quantity (0..NGPU*100). `GpuDeviceName` is a name of a GPU device (e.g., Nvidia A10G) and `GpuIndex` is the index of the GPU. - * - `ray_node_disk_usage` - - `InstanceId` - - The amount of disk space used per node, in bytes. 
- * - `ray_node_disk_free` - - `InstanceId` - - The amount of disk space available per node, in bytes. - * - `ray_node_disk_io_write_speed` - - `InstanceId` - - The disk write throughput per node, in bytes per second. - * - `ray_node_disk_io_read_speed` - - `InstanceId` - - The disk read throughput per node, in bytes per second. - * - `ray_node_mem_used` - - `InstanceId` - - The amount of physical memory used per node, in bytes. - * - `ray_node_mem_total` - - `InstanceId` - - The amount of physical memory available per node, in bytes. - * - `ray_component_uss_mb` - - `Component`, `InstanceId` - - The measured unique set size in megabytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors. - * - `ray_component_cpu_percentage` - - `Component`, `InstanceId` - - The measured CPU percentage, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors. - * - `ray_node_gram_used` - - `InstanceId`, `GpuDeviceName`, `GpuIndex` - - The amount of GPU memory used per GPU, in bytes. - * - `ray_node_network_receive_speed` - - `InstanceId` - - The network receive throughput per node, in bytes per second. - * - `ray_node_network_send_speed` - - `InstanceId` - - The network send throughput per node, in bytes per second. - * - `ray_cluster_active_nodes` - - `node_type` - - The number of healthy nodes in the cluster, broken down by autoscaler node type. - * - `ray_cluster_failed_nodes` - - `node_type` - - The number of failed nodes reported by the autoscaler, broken down by node type. - * - `ray_cluster_pending_nodes` - - `node_type` - - The number of pending nodes reported by the autoscaler, broken down by node type. - -Metrics Semantics and Consistency -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Ray guarantees all its internal state metrics are *eventually* consistent even in the presence of failures--- should any worker fail, eventually the right state will be reflected in the Prometheus time-series output. However, any particular metrics query is not guaranteed to reflect an exact snapshot of the cluster state. - -For the `ray_tasks` and `ray_actors` metrics, you should use sum queries to plot their outputs (e.g., ``sum(ray_tasks) by (Name, State)``). The reason for this is that Ray's task metrics are emitted from multiple distributed components. Hence, there are multiple metric points, including negative metric points, emitted from different processes that must be summed to produce the correct logical view of the distributed system. For example, for a single task submitted and executed, Ray may emit ``(submitter) SUBMITTED_TO_WORKER: 1, (executor) SUBMITTED_TO_WORKER: -1, (executor) RUNNING: 1``, which reduces to ``SUBMITTED_TO_WORKER: 0, RUNNING: 1`` after summation. - -.. _application-level-metrics: - -Application-level Metrics -------------------------- -Ray provides a convenient API in :ref:`ray.util.metrics ` for defining and exporting custom metrics for visibility into your applications. -There are currently three metrics supported: Counter, Gauge, and Histogram. -These metrics correspond to the same `Prometheus metric types `_. -Below is a simple example of an actor that exports metrics using these APIs: - -.. 
literalinclude:: doc_code/metrics_example.py - :language: python - -While the script is running, the metrics will be exported to ``localhost:8080`` (this is the endpoint that Prometheus would be configured to scrape). -If you open this in the browser, you should see the following output: - -.. code-block:: none - - # HELP ray_request_latency Latencies of requests in ms. - # TYPE ray_request_latency histogram - ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="0.1"} 2.0 - ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="1.0"} 2.0 - ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="+Inf"} 2.0 - ray_request_latency_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0 - ray_request_latency_sum{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 0.11992454528808594 - # HELP ray_curr_count Current count held by the actor. Goes up and down. - # TYPE ray_curr_count gauge - ray_curr_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} -15.0 - # HELP ray_num_requests_total Number of requests processed by the actor. - # TYPE ray_num_requests_total counter - ray_num_requests_total{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0 - -Please see :ref:`ray.util.metrics ` for more details. - -Configurations --------------- - -Customize prometheus export port -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Ray by default provides the service discovery file, but you can directly scrape metrics from prometheus ports. -To do that, you may want to customize the port that metrics gets exposed to a pre-defined port. - -.. code-block:: bash - - ray start --head --metrics-export-port=8080 # Assign metrics export port on a head node. - -Now, you can scrape Ray's metrics using Prometheus via ``:8080``. - -Alternate Prometheus host location -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -You can choose to run Prometheus on a non-default port or on a different machine. When doing so, you should -make sure that prometheus can scrape the metrics from your ray nodes following instructions :ref:`here `. - -In addition, both Ray and Grafana needs to know how to access this prometheus instance. This can be configured -by setting the `RAY_PROMETHEUS_HOST` env var when launching ray. The env var takes in the address to access Prometheus. More -info can be found :ref:`here `. By default, we assume Prometheus is hosted at `localhost:9090`. - -For example, if Prometheus is hosted at port 9000 on a node with ip 55.66.77.88, One should set the value to -`RAY_PROMETHEUS_HOST=http://55.66.77.88:9000`. - - -Alternate Grafana host location -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -You can choose to run Grafana on a non-default port or on a different machine. If you choose to do this, the -:ref:`Dashboard ` needs to be configured with a public address to that service so the web page -can load the graphs. This can be done with the `RAY_GRAFANA_HOST` env var when launching ray. The env var takes -in the address to access Grafana. More info can be found :ref:`here `. Instructions -to use an existing Grafana instance can be found :ref:`here `. - -For the Grafana charts to work on the Ray dashboard, the user of the dashboard's browser must be able to reach -the Grafana service. 
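As a minimal sketch (the address is a placeholder taken from the example below, and any other ``ray start`` flags you normally pass still apply), the head node can be launched with the Grafana address that the dashboard and the browser should use:

.. code-block:: shell

    # Point the Ray dashboard at an existing Grafana instance when starting the head node.
    RAY_GRAFANA_HOST=http://55.66.77.88:3000 ray start --head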
If this browser cannot reach Grafana the same way the Ray head node can, you can use a separate -env var `RAY_GRAFANA_IFRAME_HOST` to customize the host the browser users to attempt to reach Grafana. If this is not set, -we use the value of `RAY_GRAFANA_HOST` by default. - -For example, if Grafana is hosted at is 55.66.77.88 on port 3000. One should set the value -to `RAY_GRAFANA_HOST=http://55.66.77.88:3000`. - -Troubleshooting ---------------- - -Getting Prometheus and Grafana to use the Ray configurations when installed via homebrew on macOS X -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -With homebrew, Prometheus and Grafana are installed as services that are automatically launched for you. -Therefore, to configure these services, you cannot simply pass in the config files as command line arguments. - -Instead, follow these instructions: -1. Change the --config-file line in `/usr/local/etc/prometheus.args` to read `--config.file /tmp/ray/session_latest/metrics/prometheus/prometheus.yml`. -2. Update `/usr/local/etc/grafana/grafana.ini` file so that it matches the contents of `/tmp/ray/session_latest/metrics/grafana/grafana.ini`. - -You can then start or restart the services with `brew services start grafana` and `brew services start prometheus`. - -.. _unverified-developer: - -MacOS does not trust the developer to install Prometheus or Grafana -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You may have received an error that looks like this: - -.. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/troubleshooting/prometheus-trusted-developer.png - :align: center - -When downloading binaries from the internet, Mac requires that the binary be signed by a trusted developer ID. -Unfortunately, many developers today are not trusted by Mac and so this requirement must be overridden by the user manaully. - -See `these instructions `_ on how to override the restriction and install or run the application. - -Grafana dashboards are not embedded in the Ray dashboard -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -If you're getting an error that says `RAY_GRAFANA_HOST` is not setup despite having set it up, check that: -You've included the protocol in the URL (e.g., `http://your-grafana-url.com` instead of `your-grafana-url.com`). -The URL doesn't have a trailing slash (e.g., `http://your-grafana-url.com` instead of `http://your-grafana-url.com/`). - -Certificate Authority (CA error) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -You may see a CA error if your Grafana instance is hosted behind HTTPS. Contact the Grafana service owner to properly enable HTTPS traffic. diff --git a/doc/source/ray-observability/api/state/api.rst b/doc/source/ray-observability/reference/api.rst similarity index 100% rename from doc/source/ray-observability/api/state/api.rst rename to doc/source/ray-observability/reference/api.rst index 5056422e5bd7d..bf875b5098000 100644 --- a/doc/source/ray-observability/api/state/api.rst +++ b/doc/source/ray-observability/reference/api.rst @@ -1,8 +1,8 @@ +.. _state-api-ref: + State API ========= -.. _state-api-ref: - .. note:: APIs are :ref:`alpha `. This feature requires a full installation of Ray using ``pip install "ray[default]"``. 
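As an illustrative sketch (not part of the generated reference), a full installation lets you exercise these APIs from a terminal; the exact output depends on your cluster and workload:

.. code-block:: shell

    # Install Ray with the dashboard and state APIs included.
    pip install "ray[default]"

    # Inspect the live state of the cluster.
    ray list actors
    ray list tasks
    ray summary tasks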
diff --git a/doc/source/ray-observability/api/state/cli.rst b/doc/source/ray-observability/reference/cli.rst similarity index 98% rename from doc/source/ray-observability/api/state/cli.rst rename to doc/source/ray-observability/reference/cli.rst index e12dfc45fdb5d..5c7ea0d57eb99 100644 --- a/doc/source/ray-observability/api/state/cli.rst +++ b/doc/source/ray-observability/reference/cli.rst @@ -1,8 +1,8 @@ -Ray State CLI -============= - .. _state-api-cli-ref: +State CLI +========= + State ----- This section contains commands to access the :ref:`live state of Ray resources (actor, task, object, etc.) `. diff --git a/doc/source/ray-observability/reference/index.md b/doc/source/ray-observability/reference/index.md new file mode 100644 index 0000000000000..06ef3bfc34498 --- /dev/null +++ b/doc/source/ray-observability/reference/index.md @@ -0,0 +1,10 @@ +(observability-reference)= + +# Reference + +Monitor and debug your Ray applications and clusters using the API and CLI documented in these references. + +The guides include: +* {ref}`state-api-ref` +* {ref}`state-api-cli-ref` +* {ref}`system-metrics` \ No newline at end of file diff --git a/doc/source/ray-observability/reference/system-metrics.rst b/doc/source/ray-observability/reference/system-metrics.rst new file mode 100644 index 0000000000000..fe0a4e458b8e6 --- /dev/null +++ b/doc/source/ray-observability/reference/system-metrics.rst @@ -0,0 +1,92 @@ +.. _system-metrics: + +System Metrics +-------------- +Ray exports a number of system metrics, which provide introspection into the state of Ray workloads, as well as hardware utilization statistics. The following table describes the officially supported metrics: + +.. note:: + + Certain labels are common across all metrics, such as `SessionName` (uniquely identifies a Ray cluster instance), `instance` (per-node label applied by Prometheus), and `JobId` (Ray job id, as applicable). + +.. list-table:: Ray System Metrics + :header-rows: 1 + + * - Prometheus Metric + - Labels + - Description + * - `ray_tasks` + - `Name`, `State`, `IsRetry` + - Current number of tasks (both remote functions and actor calls) by state. The State label (e.g., RUNNING, FINISHED, FAILED) describes the state of the task. See `rpc::TaskState `_ for more information. The function/method name is available as the Name label. If the task was retried due to failure or reconstruction, the IsRetry label will be set to "1", otherwise "0". + * - `ray_actors` + - `Name`, `State` + - Current number of actors in a particular state. The State label is described by `rpc::ActorTableData `_ proto in gcs.proto. The actor class name is available in the Name label. + * - `ray_resources` + - `Name`, `State`, `InstanceId` + - Logical resource usage for each node of the cluster. Each resource has some quantity that is `in either `_ USED state vs AVAILABLE state. The Name label defines the resource name (e.g., CPU, GPU). + * - `ray_object_store_memory` + - `Location`, `ObjectState`, `InstanceId` + - Object store memory usage in bytes, `broken down `_ by logical Location (SPILLED, IN_MEMORY, etc.), and ObjectState (UNSEALED, SEALED). + * - `ray_placement_groups` + - `State` + - Current number of placement groups by state. The State label (e.g., PENDING, CREATED, REMOVED) describes the state of the placement group. See `rpc::PlacementGroupTable `_ for more information.
+ * - `ray_memory_manager_worker_eviction_total` + - `Type`, `Name` + - The number of tasks and actors killed by the Ray Out of Memory killer (https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html) broken down by types (whether it is tasks or actors) and names (name of tasks and actors). + * - `ray_node_cpu_utilization` + - `InstanceId` + - The CPU utilization per node as a percentage quantity (0..100). This should be scaled by the number of cores per node to convert the units into cores. + * - `ray_node_cpu_count` + - `InstanceId` + - The number of CPU cores per node. + * - `ray_node_gpus_utilization` + - `InstanceId`, `GpuDeviceName`, `GpuIndex` + - The GPU utilization per GPU as a percentage quantity (0..NGPU*100). `GpuDeviceName` is a name of a GPU device (e.g., Nvidia A10G) and `GpuIndex` is the index of the GPU. + * - `ray_node_disk_usage` + - `InstanceId` + - The amount of disk space used per node, in bytes. + * - `ray_node_disk_free` + - `InstanceId` + - The amount of disk space available per node, in bytes. + * - `ray_node_disk_io_write_speed` + - `InstanceId` + - The disk write throughput per node, in bytes per second. + * - `ray_node_disk_io_read_speed` + - `InstanceId` + - The disk read throughput per node, in bytes per second. + * - `ray_node_mem_used` + - `InstanceId` + - The amount of physical memory used per node, in bytes. + * - `ray_node_mem_total` + - `InstanceId` + - The amount of physical memory available per node, in bytes. + * - `ray_component_uss_mb` + - `Component`, `InstanceId` + - The measured unique set size in megabytes, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors. + * - `ray_component_cpu_percentage` + - `Component`, `InstanceId` + - The measured CPU percentage, broken down by logical Ray component. Ray components consist of system components (e.g., raylet, gcs, dashboard, or agent) and the method names of running tasks/actors. + * - `ray_node_gram_used` + - `InstanceId`, `GpuDeviceName`, `GpuIndex` + - The amount of GPU memory used per GPU, in bytes. + * - `ray_node_network_receive_speed` + - `InstanceId` + - The network receive throughput per node, in bytes per second. + * - `ray_node_network_send_speed` + - `InstanceId` + - The network send throughput per node, in bytes per second. + * - `ray_cluster_active_nodes` + - `node_type` + - The number of healthy nodes in the cluster, broken down by autoscaler node type. + * - `ray_cluster_failed_nodes` + - `node_type` + - The number of failed nodes reported by the autoscaler, broken down by node type. + * - `ray_cluster_pending_nodes` + - `node_type` + - The number of pending nodes reported by the autoscaler, broken down by node type. + +Metrics Semantics and Consistency +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Ray guarantees all its internal state metrics are *eventually* consistent even in the presence of failures--- should any worker fail, eventually the right state will be reflected in the Prometheus time-series output. However, any particular metrics query is not guaranteed to reflect an exact snapshot of the cluster state. + +For the `ray_tasks` and `ray_actors` metrics, you should use sum queries to plot their outputs (e.g., ``sum(ray_tasks) by (Name, State)``). The reason for this is that Ray's task metrics are emitted from multiple distributed components. 
Hence, there are multiple metric points, including negative metric points, emitted from different processes that must be summed to produce the correct logical view of the distributed system. For example, for a single task submitted and executed, Ray may emit ``(submitter) SUBMITTED_TO_WORKER: 1, (executor) SUBMITTED_TO_WORKER: -1, (executor) RUNNING: 1``, which reduces to ``SUBMITTED_TO_WORKER: 0, RUNNING: 1`` after summation. diff --git a/doc/source/ray-observability/user-guides/add-app-metrics.rst b/doc/source/ray-observability/user-guides/add-app-metrics.rst new file mode 100644 index 0000000000000..b82946ed1544c --- /dev/null +++ b/doc/source/ray-observability/user-guides/add-app-metrics.rst @@ -0,0 +1,33 @@ +.. _application-level-metrics: + +Adding Application-Level Metrics +-------------------------------- + +Ray provides a convenient API in :ref:`ray.util.metrics ` for defining and exporting custom metrics for visibility into your applications. +There are currently three metrics supported: Counter, Gauge, and Histogram. +These metrics correspond to the same `Prometheus metric types `_. +Below is a simple example of an actor that exports metrics using these APIs: + +.. literalinclude:: doc_code/metrics_example.py + :language: python + +While the script is running, the metrics are exported to ``localhost:8080`` (this is the endpoint that Prometheus would be configured to scrape). +If you open this in the browser, you should see the following output: + +.. code-block:: none + + # HELP ray_request_latency Latencies of requests in ms. + # TYPE ray_request_latency histogram + ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="0.1"} 2.0 + ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="1.0"} 2.0 + ray_request_latency_bucket{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor",le="+Inf"} 2.0 + ray_request_latency_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0 + ray_request_latency_sum{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 0.11992454528808594 + # HELP ray_curr_count Current count held by the actor. Goes up and down. + # TYPE ray_curr_count gauge + ray_curr_count{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} -15.0 + # HELP ray_num_requests_total Number of requests processed by the actor. + # TYPE ray_num_requests_total counter + ray_num_requests_total{Component="core_worker",Version="3.0.0.dev0",actor_name="my_actor"} 2.0 + +Please see :ref:`ray.util.metrics ` for more details. diff --git a/doc/source/ray-observability/state/state-api.rst b/doc/source/ray-observability/user-guides/cli-sdk.rst similarity index 93% rename from doc/source/ray-observability/state/state-api.rst rename to doc/source/ray-observability/user-guides/cli-sdk.rst index 33f5bfa4f2d92..aa52cb14874df 100644 --- a/doc/source/ray-observability/state/state-api.rst +++ b/doc/source/ray-observability/user-guides/cli-sdk.rst @@ -1,3 +1,45 @@ +.. _observability-programmatic: + +Monitoring with the CLI or SDK +============================== + +Monitoring and debugging capabilities in Ray are available through a CLI or SDK. + + +Monitoring Cluster State and Resource Demands +--------------------------------------------- +You can monitor cluster usage and auto-scaling status by running (on the head node) a CLI command ``ray status``. It displays + +- **Cluster State**: Nodes that are up and running. Addresses of running nodes. 
Information about pending nodes and failed nodes. +- **Autoscaling Status**: The number of nodes that are autoscaling up and down. +- **Cluster Usage**: The resource usage of the cluster, e.g., the CPUs requested by all Ray tasks and actors and the number of GPUs in use. + +Here's an example output: + +.. code-block:: shell + + $ ray status + ======== Autoscaler status: 2021-10-12 13:10:21.035674 ======== + Node status + --------------------------------------------------------------- + Healthy: + 1 ray.head.default + 2 ray.worker.cpu + Pending: + (no pending nodes) + Recent failures: + (no failures) + + Resources + --------------------------------------------------------------- + Usage: + 0.0/10.0 CPU + 0.00/70.437 GiB memory + 0.00/10.306 GiB object_store_memory + + Demands: + (no resource demands) + .. _state-api-overview-ref: Monitoring Ray States @@ -15,9 +57,6 @@ Ray state APIs allow users to conveniently access the current state (snapshot) o State API CLI commands are :ref:`stable `, while python SDKs are :ref:`DeveloperAPI `. CLI usage is recommended over Python SDKs. -Getting Started ---------------- - Run any workload. In this example, you will use the following script that runs 2 tasks and creates 2 actors. .. code-block:: python diff --git a/doc/source/ray-observability/user-guides/configure-logging.rst b/doc/source/ray-observability/user-guides/configure-logging.rst new file mode 100644 index 0000000000000..a7f3ae7ef0607 --- /dev/null +++ b/doc/source/ray-observability/user-guides/configure-logging.rst @@ -0,0 +1,131 @@ +.. _configure-logging: + +Configuring Logging +=================== + +This guide helps you modify the default configuration of Ray's logging system. + + +Internal Ray Logging Configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +When ``import ray`` is executed, Ray's logger is initialized, generating a sensible configuration given in ``python/ray/_private/log.py``. The default logging level is ``logging.INFO``. + +All Ray loggers are automatically configured in ``ray._private.ray_logging``. To change the Ray library logging configuration: + +.. code-block:: python + + import logging + + logger = logging.getLogger("ray") + logger.setLevel(logging.WARNING)  # Modify the Ray logging config, for example by raising the level + +Similarly, to modify the logging configuration for any Ray subcomponent, specify the appropriate logger name: + +.. code-block:: python + + import logging + + # First, get the handle for the logger you want to modify + ray_data_logger = logging.getLogger("ray.data") + ray_tune_logger = logging.getLogger("ray.tune") + ray_rllib_logger = logging.getLogger("ray.rllib") + ray_air_logger = logging.getLogger("ray.air") + ray_train_logger = logging.getLogger("ray.train") + ray_workflow_logger = logging.getLogger("ray.workflow") + + # Modify the ray.data logging level + ray_data_logger.setLevel(logging.WARNING) + + # Other loggers can be modified similarly. + # Here's how to add an additional file handler for Ray Tune: + ray_tune_logger.addHandler(logging.FileHandler("extra_ray_tune_log.log")) + +For more information about logging in workers, see :ref:`Customizing worker loggers`. + +Disabling logging to the driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In large-scale runs, it may be undesirable to route all worker logs to the driver. You can disable this feature by setting ``log_to_driver=False`` in Ray init: + +.. code-block:: python + + import ray + + # Task and actor logs will not be copied to the driver stdout.
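+    # Logs are still written to /tmp/ray/session_latest/logs on each node;
+    # this flag only stops worker output from being streamed to the driver's stdout.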
+ ray.init(log_to_driver=False) + +Log deduplication +~~~~~~~~~~~~~~~~~ + +By default, Ray deduplicates logs that appear redundantly across multiple processes. The first instance of each log message is always immediately printed. However, subsequent log messages of the same pattern (ignoring words with numeric components) are buffered for up to five seconds and printed in batch. For example, for the following code snippet: + +.. code-block:: python + + import ray + import random + + @ray.remote + def task(): + print("Hello there, I am a task", random.random()) + + ray.get([task.remote() for _ in range(100)]) + +The output is as follows: + +.. code-block:: bash + + 2023-03-27 15:08:34,195 INFO worker.py:1603 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 + (task pid=534172) Hello there, I am a task 0.20583517821231412 + (task pid=534174) Hello there, I am a task 0.17536720316370757 [repeated 99x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication) + +This feature is especially useful when importing libraries such as `tensorflow` or `numpy`, which may emit many verbose warning messages when imported. You can configure this feature as follows: + +1. Set ``RAY_DEDUP_LOGS=0`` to disable this feature entirely. +2. Set ``RAY_DEDUP_LOGS_AGG_WINDOW_S=`` to change the aggregation window. +3. Set ``RAY_DEDUP_LOGS_ALLOW_REGEX=`` to specify log messages to never deduplicate. +4. Set ``RAY_DEDUP_LOGS_SKIP_REGEX=`` to specify log messages to skip printing. + + +Customizing Actor log prefixes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is often useful to distinguish between log messages from different actors. For example, suppose you have a large number of worker actors. In this case, you may want to be able to easily see the index of the actor that logged a particular message. This can be achieved by defining the `__repr__ `__ method for an actor class. When defined, the actor repr will be used in place of the actor name. For example: + +.. literalinclude:: /ray-core/doc_code/actor-repr.py + +This produces the following output: + +.. code-block:: bash + + (MyActor(index=2) pid=482120) hello there + (MyActor(index=1) pid=482119) hello there + +Coloring Actor log prefixes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +By default, Ray prints Actor log prefixes in light blue. +Users may instead activate multi-color prefixes by setting the environment variable ``RAY_COLOR_PREFIX=1``. +This indexes into an array of colors using the PID of each process (modulo the number of colors). + +.. image:: ./images/coloring-actor-log-prefixes.png + :align: center + +Distributed progress bars (tqdm) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When using `tqdm `__ in Ray remote tasks or actors, you may notice that the progress bar output is corrupted. To avoid this problem, you can use the Ray distributed tqdm implementation at ``ray.experimental.tqdm_ray``: + +.. literalinclude:: /ray-core/doc_code/tqdm.py + +This tqdm implementation works as follows: + +1. The ``tqdm_ray`` module translates TQDM calls into special JSON log messages written to worker stdout. +2. The Ray log monitor, instead of copying these log messages directly to the driver stdout, routes these messages to a tqdm singleton. +3. The tqdm singleton determines the positions of progress bars from various Ray tasks and actors, ensuring they don't collide or conflict with each other. + +Limitations: + +- Only a subset of tqdm functionality is supported. Refer to the ray_tqdm `implementation `__ for more details.
+- Performance may be poor if there are more than a couple thousand updates per second (updates are not batched). + +By default, the built-in ``print`` will also be patched to use `ray.experimental.tqdm_ray.safe_print` when `tqdm_ray` is used. +This avoids progress bar corruption on driver print statements. To disable this, set `RAY_TQDM_PATCH_PRINT=0`. diff --git a/doc/source/ray-observability/user-guides/troubleshoot-apps/troubleshoot-failures.rst b/doc/source/ray-observability/user-guides/debug-apps/debug-failures.rst similarity index 99% rename from doc/source/ray-observability/user-guides/troubleshoot-apps/troubleshoot-failures.rst rename to doc/source/ray-observability/user-guides/debug-apps/debug-failures.rst index 046c3e1bb2d20..206adb88853e6 100644 --- a/doc/source/ray-observability/user-guides/troubleshoot-apps/troubleshoot-failures.rst +++ b/doc/source/ray-observability/user-guides/debug-apps/debug-failures.rst @@ -1,7 +1,7 @@ -.. _observability-troubleshoot-failures: +.. _observability-debug-failures: -Troubleshooting Failures -======================== +Debugging Failures +================== What Kind of Failures Exist in Ray? ----------------------------------- diff --git a/doc/source/ray-observability/user-guides/troubleshoot-apps/troubleshoot-hangs.rst b/doc/source/ray-observability/user-guides/debug-apps/debug-hangs.rst similarity index 85% rename from doc/source/ray-observability/user-guides/troubleshoot-apps/troubleshoot-hangs.rst rename to doc/source/ray-observability/user-guides/debug-apps/debug-hangs.rst index 0725e8863bb16..74fc4d34f5733 100644 --- a/doc/source/ray-observability/user-guides/troubleshoot-apps/troubleshoot-hangs.rst +++ b/doc/source/ray-observability/user-guides/debug-apps/debug-hangs.rst @@ -1,7 +1,7 @@ -.. _observability-troubleshoot-hangs: +.. _observability-debug-hangs: -Troubleshooting Hangs -===================== +Debugging Hangs +=============== Observing Ray Work ------------------ diff --git a/doc/source/ray-observability/user-guides/debug-apps/debug-memory.rst b/doc/source/ray-observability/user-guides/debug-apps/debug-memory.rst new file mode 100644 index 0000000000000..10d10620cb6bf --- /dev/null +++ b/doc/source/ray-observability/user-guides/debug-apps/debug-memory.rst @@ -0,0 +1,56 @@ +.. _ray-core-profiling: + +.. _ray-core-mem-profiling: + +Debugging Memory Issues +----------------------- + +To memory profile Ray tasks or actors, use `memray `_. +Note that you can also use other memory profiling tools if they support a similar API. + +First, install ``memray``. + +.. code-block:: bash + + pip install memray + +``memray`` supports a Python context manager to enable memory profiling. You can write the ``memray`` profiling files wherever you want. +In this example, we write them to `/tmp/ray/session_latest/logs` because the Ray dashboard allows you to download files inside the log folder. +This allows you to download profiling files from other nodes. + +.. tab-set:: + + .. tab-item:: Actors + + .. literalinclude:: ../doc_code/memray_profiling.py + :language: python + :start-after: __memray_profiling_start__ + :end-before: __memray_profiling_end__ + + .. tab-item:: Tasks + + Note that tasks have a shorter lifetime, so there could be lots of memory profiling files. + + .. literalinclude:: ../doc_code/memray_profiling.py + :language: python + :start-after: __memray_profiling_task_start__ + :end-before: __memray_profiling_task_end__ + +Once the task or actor runs, go to the :ref:`Logs View ` of the dashboard. Find and click the log file name.
+ +.. image:: ../images/memory-profiling-files.png + :align: center + +Click the download button. + +.. image:: ../images/download-memory-profiling-files.png + :align: center + +Now that you have the memory profiling file, run ``memray flamegraph`` on it: + +.. code-block:: bash + + memray flamegraph + +The resulting flamegraph shows the result of the memory profiling. + diff --git a/doc/source/ray-observability/monitoring-debugging/gotchas.rst b/doc/source/ray-observability/user-guides/debug-apps/general-troubleshoot.rst similarity index 98% rename from doc/source/ray-observability/monitoring-debugging/gotchas.rst rename to doc/source/ray-observability/user-guides/debug-apps/general-troubleshoot.rst index b1d4fb36f19ca..96ffbab7cc021 100644 --- a/doc/source/ray-observability/monitoring-debugging/gotchas.rst +++ b/doc/source/ray-observability/user-guides/debug-apps/general-troubleshoot.rst @@ -1,7 +1,7 @@ -.. _gotchas: +.. _observability-general-troubleshoot: -Ray Gotchas -=========== +General Troubleshooting +======================= Ray sometimes has some aspects of its behavior that might catch users off guard. There may be sound arguments for these design choices. diff --git a/doc/source/ray-observability/user-guides/debug-apps/index.md b/doc/source/ray-observability/user-guides/debug-apps/index.md new file mode 100644 index 0000000000000..53e9ee3e962e0 --- /dev/null +++ b/doc/source/ray-observability/user-guides/debug-apps/index.md @@ -0,0 +1,11 @@ +(observability-user-guides)= + +# Troubleshooting Applications + +These guides help you perform common debugging or optimization tasks for your distributed application on Ray: +* {ref}`observability-general-troubleshoot` +* {ref}`ray-core-mem-profiling` +* {ref}`observability-debug-hangs` +* {ref}`observability-debug-failures` +* {ref}`observability-optimize-performance` +* {ref}`ray-debugger` \ No newline at end of file diff --git a/doc/source/ray-observability/user-guides/troubleshoot-apps/profiling.rst b/doc/source/ray-observability/user-guides/debug-apps/optimize-performance.rst similarity index 75% rename from doc/source/ray-observability/user-guides/troubleshoot-apps/profiling.rst rename to doc/source/ray-observability/user-guides/debug-apps/optimize-performance.rst index 756005eaec21f..cad5e6f312ac1 100644 --- a/doc/source/ray-observability/user-guides/troubleshoot-apps/profiling.rst +++ b/doc/source/ray-observability/user-guides/debug-apps/optimize-performance.rst @@ -1,62 +1,62 @@ -.. _ray-core-profiling: - -Profiling -========= - -.. _ray-core-mem-profiling: - -Memory profile Ray Actors and Tasks ------------------------------------ - -To memory profile Ray tasks or actors, use `memray `_. -Note that you can also use other memory profiling tools if it supports a similar API. - -First, install ``memray``. - -.. code-block:: bash - - pip install memray - -``memray`` supports a Python context manager to enable memory profiling. You can write the ``memray`` profiling file wherever you want. -But in this example, we will write them to `/tmp/ray/session_latest/logs` because Ray dashboard allows you to download files inside the log folder. -This will allow you to download profiling files from other nodes. - -.. tab-set:: - - .. tab-item:: Actors - - .. literalinclude:: ../doc_code/memray_profiling.py - :language: python - :start-after: __memray_profiling_start__ - :end-before: __memray_profiling_end__ - - .. tab-item:: Tasks - - Note that tasks have a shorter lifetime, so there could be lots of memory profiling files. - - ..
literalinclude:: ../doc_code/memray_profiling.py - :language: python - :start-after: __memray_profiling_task_start__ - :end-before: __memray_profiling_task_end__ - -Once the task or actor runs, go to the :ref:`Logs View ` of the dashboard. Find and click the log file name. - -.. image:: ../images/memory-profiling-files.png - :align: center - -Click the download button. - -.. image:: ../images/download-memory-profiling-files.png - :align: center - -Now, you have the memory profiling file. Running - -.. code-block:: bash - - memray flamegraph - -And you can see the result of the memory profiling! - +.. _observability-optimize-performance: + +Optimizing Performance +====================== + +No Speedup +---------- + +You just ran an application using Ray, but it wasn't as fast as you expected it +to be. Or worse, perhaps it was slower than the serial version of the +application! The most common reasons are the following. + +- **Number of cores:** How many cores is Ray using? When you start Ray, it will + determine the number of CPUs on each machine with ``psutil.cpu_count()``. Ray + usually will not schedule more tasks in parallel than the number of CPUs. So + if the number of CPUs is 4, the most you should expect is a 4x speedup. + +- **Physical versus logical CPUs:** Do the machines you're running on have fewer + **physical** cores than **logical** cores? You can check the number of logical + cores with ``psutil.cpu_count()`` and the number of physical cores with + ``psutil.cpu_count(logical=False)``. This is common on a lot of machines and + especially on EC2. For many workloads (especially numerical workloads), you + often cannot expect a greater speedup than the number of physical CPUs. + +- **Small tasks:** Are your tasks very small? Ray introduces some overhead for + each task (the amount of overhead depends on the arguments that are passed + in). You will be unlikely to see speedups if your tasks take less than ten + milliseconds. For many workloads, you can easily increase the sizes of your + tasks by batching them together. + +- **Variable durations:** Do your tasks have variable duration? If you run 10 + tasks with variable duration in parallel, you shouldn't expect an N-fold + speedup (because you'll end up waiting for the slowest task). In this case, + consider using ``ray.wait`` to begin processing tasks that finish first. + +- **Multi-threaded libraries:** Are all of your tasks attempting to use all of + the cores on the machine? If so, they are likely to experience contention and + prevent your application from achieving a speedup. + This is common with some versions of ``numpy``. To avoid contention, set an + environment variable like ``MKL_NUM_THREADS`` (or the equivalent depending on + your installation) to ``1``. + + For many - but not all - libraries, you can diagnose this by opening ``top`` + while your application is running. If one process is using most of the CPUs, + and the others are using a small amount, this may be the problem. The most + common exception is PyTorch, which will appear to be using all the cores + despite needing ``torch.set_num_threads(1)`` to be called to avoid contention. + +If you are still experiencing a slowdown, but none of the above problems apply, +we'd really like to know! Please create a `GitHub issue`_ and consider +submitting a minimal code example that demonstrates the problem. + +.. 
_`Github issue`: https://github.com/ray-project/ray/issues + +This document discusses some common problems that people run into when using Ray +as well as some known problems. If you encounter other problems, please +`let us know`_. + +.. _`let us know`: https://github.com/ray-project/ray/issues .. _ray-core-timeline: Visualizing Tasks in the Ray Timeline @@ -320,6 +320,6 @@ Our example in total now takes only 1.5 seconds to run: 20 0.001 0.000 0.001 0.000 worker.py:514(submit_task) ... -Profiling (Internal) --------------------- -If you are developing Ray core or debugging some system level failures, profiling the Ray core could help. In this case, see :ref:`Profiling (Internal) `. +Profiling for Developers +------------------------ +If you are developing Ray Core or debugging some system level failures, profiling the Ray Core could help. In this case, see :ref:`Profiling (Internal) `. diff --git a/doc/source/ray-observability/user-guides/troubleshoot-apps/ray-debugging.rst b/doc/source/ray-observability/user-guides/debug-apps/ray-debugging.rst similarity index 100% rename from doc/source/ray-observability/user-guides/troubleshoot-apps/ray-debugging.rst rename to doc/source/ray-observability/user-guides/debug-apps/ray-debugging.rst diff --git a/doc/source/ray-observability/user-guides/index.md b/doc/source/ray-observability/user-guides/index.md index a8772bfe58a14..9b81d03a55915 100644 --- a/doc/source/ray-observability/user-guides/index.md +++ b/doc/source/ray-observability/user-guides/index.md @@ -5,5 +5,9 @@ These guides help you monitor and debug your Ray applications and clusters. The guides include: -* {ref}`observability-troubleshoot-user-guides` +* {ref}`observability-general-troubleshoot` +* {ref}`observability-user-guides` +* {ref}`observability-programmatic` +* {ref}`configure-logging` +* {ref}`application-level-metrics` * {ref}`ray-tracing` \ No newline at end of file diff --git a/doc/source/ray-observability/user-guides/troubleshoot-apps/index.md b/doc/source/ray-observability/user-guides/troubleshoot-apps/index.md deleted file mode 100644 index cd6562375a40d..0000000000000 --- a/doc/source/ray-observability/user-guides/troubleshoot-apps/index.md +++ /dev/null @@ -1,10 +0,0 @@ -(observability-troubleshoot-user-guides)= - -# Troubleshooting Applications - -These guides help you perform common debugging or optimization tasks for your distributed application on Ray: -* {ref}`observability-troubleshoot-failures` -* {ref}`observability-troubleshoot-hangs` -* {ref}`observability-optimize-performance` -* {ref}`ray-debugger` -* {ref}`ray-core-profiling` \ No newline at end of file diff --git a/doc/source/ray-observability/user-guides/troubleshoot-apps/optimize-performance.rst b/doc/source/ray-observability/user-guides/troubleshoot-apps/optimize-performance.rst deleted file mode 100644 index 465f7b6b5c524..0000000000000 --- a/doc/source/ray-observability/user-guides/troubleshoot-apps/optimize-performance.rst +++ /dev/null @@ -1,59 +0,0 @@ -.. _observability-optimize-performance: - -Optimizing Performance -====================== - -No Speedup ----------- - -You just ran an application using Ray, but it wasn't as fast as you expected it -to be. Or worse, perhaps it was slower than the serial version of the -application! The most common reasons are the following. - -- **Number of cores:** How many cores is Ray using? When you start Ray, it will - determine the number of CPUs on each machine with ``psutil.cpu_count()``. 
Ray - usually will not schedule more tasks in parallel than the number of CPUs. So - if the number of CPUs is 4, the most you should expect is a 4x speedup. - -- **Physical versus logical CPUs:** Do the machines you're running on have fewer - **physical** cores than **logical** cores? You can check the number of logical - cores with ``psutil.cpu_count()`` and the number of physical cores with - ``psutil.cpu_count(logical=False)``. This is common on a lot of machines and - especially on EC2. For many workloads (especially numerical workloads), you - often cannot expect a greater speedup than the number of physical CPUs. - -- **Small tasks:** Are your tasks very small? Ray introduces some overhead for - each task (the amount of overhead depends on the arguments that are passed - in). You will be unlikely to see speedups if your tasks take less than ten - milliseconds. For many workloads, you can easily increase the sizes of your - tasks by batching them together. - -- **Variable durations:** Do your tasks have variable duration? If you run 10 - tasks with variable duration in parallel, you shouldn't expect an N-fold - speedup (because you'll end up waiting for the slowest task). In this case, - consider using ``ray.wait`` to begin processing tasks that finish first. - -- **Multi-threaded libraries:** Are all of your tasks attempting to use all of - the cores on the machine? If so, they are likely to experience contention and - prevent your application from achieving a speedup. - This is common with some versions of ``numpy``. To avoid contention, set an - environment variable like ``MKL_NUM_THREADS`` (or the equivalent depending on - your installation) to ``1``. - - For many - but not all - libraries, you can diagnose this by opening ``top`` - while your application is running. If one process is using most of the CPUs, - and the others are using a small amount, this may be the problem. The most - common exception is PyTorch, which will appear to be using all the cores - despite needing ``torch.set_num_threads(1)`` to be called to avoid contention. - -If you are still experiencing a slowdown, but none of the above problems apply, -we'd really like to know! Please create a `GitHub issue`_ and consider -submitting a minimal code example that demonstrates the problem. - -.. _`Github issue`: https://github.com/ray-project/ray/issues - -This document discusses some common problems that people run into when using Ray -as well as some known problems. If you encounter other problems, please -`let us know`_. - -.. _`let us know`: https://github.com/ray-project/ray/issues diff --git a/doc/source/rllib/rllib-training.rst b/doc/source/rllib/rllib-training.rst index 7b0529e54320f..bcbd9a82ff188 100644 --- a/doc/source/rllib/rllib-training.rst +++ b/doc/source/rllib/rllib-training.rst @@ -608,4 +608,4 @@ hangs or performance issues. Next Steps ---------- -- To check how your application is doing, you can use the :ref:`Ray dashboard`. \ No newline at end of file +- To check how your application is doing, you can use the :ref:`Ray dashboard `. \ No newline at end of file diff --git a/doc/source/train/getting-started.rst b/doc/source/train/getting-started.rst index a7105abed9db8..5b357df754323 100644 --- a/doc/source/train/getting-started.rst +++ b/doc/source/train/getting-started.rst @@ -190,4 +190,4 @@ Here are examples for some of the commonly used trainers: Next Steps ---------- -* To check how your application is doing, you can use the :ref:`Ray dashboard`. 
+* To check how your application is doing, you can use the :ref:`Ray dashboard `. diff --git a/doc/source/tune/getting-started.rst b/doc/source/tune/getting-started.rst index ccf805d833269..b3a75a09b2db2 100644 --- a/doc/source/tune/getting-started.rst +++ b/doc/source/tune/getting-started.rst @@ -164,4 +164,4 @@ Next Steps * Check out the :ref:`Tune tutorials ` for guides on using Tune with your preferred machine learning library. * Browse our :ref:`gallery of examples ` to see how to use Tune with PyTorch, XGBoost, Tensorflow, etc. * `Let us know `__ if you ran into issues or have any questions by opening an issue on our Github. -* To check how your application is doing, you can use the :ref:`Ray dashboard`. +* To check how your application is doing, you can use the :ref:`Ray dashboard `.