[docs][docs infra] [clusters][cherry-pick] Docs cherry-picks for 2.7.0 (#39510)

* Update metrics.md (#38512)

1. There are 3 dashboards in the folder now; refer to the folder instead of only 1 dashboard.
2. Include "Copy" since people need to copy this from the head node to the Grafana server.

Signed-off-by: Huaiwei Sun <[email protected]>

* polish observability (o11y) docs (#39069)

Signed-off-by: Huaiwei Sun <[email protected]>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: matthewdeng <[email protected]>

* [Doc] Unbold "Use Cases" in sidebar (#39295)

Signed-off-by: pdmurray <[email protected]>

* [docs] Cleanup for other AIR concepts (#39400)

* [doc][clusters] add doc for setting up Ray and K8s (#39408)

---------

Signed-off-by: Huaiwei Sun <[email protected]>
Signed-off-by: pdmurray <[email protected]>
Co-authored-by: Huaiwei Sun <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Peyton Murray <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
5 people committed Sep 9, 2023
1 parent 15dc965 commit 7555e44
Showing 39 changed files with 335 additions and 119 deletions.
126 changes: 75 additions & 51 deletions doc/source/_static/js/custom.js
@@ -28,62 +28,86 @@ window.addEventListener("scroll", loadVisibleTermynals);
createTermynals();
loadVisibleTermynals();


// Reintroduce dropdown icons on the sidebar. This is a hack, as we can't
// programmatically figure out which nav items have children anymore.
document.addEventListener("DOMContentLoaded", function() {
let navItems = document.querySelectorAll(".bd-sidenav li");
for (let i = 0; i < navItems.length; i++) {
let navItem = navItems[i];
const stringList = [
"User Guides", "Examples",
// Ray Core
"Ray Core", "Ray Core API",
// Ray Cluster
"Ray Clusters", "Deploying on Kubernetes", "Deploying on VMs",
"Applications Guide", "Ray Cluster Management API",
"Getting Started with KubeRay", "KubeRay Ecosystem", "KubeRay Benchmarks", "KubeRay Troubleshooting",
// Ray AIR
"Ray AIR API",
// Ray Data
"Ray Data", "Ray Data API", "Integrations",
// Ray Train
"Ray Train", "More Frameworks",
"Advanced Topics", "Internals",
"Ray Train API",
// Ray Tune
"Ray Tune", "Ray Tune Examples", "Ray Tune API",
// Ray Serve
"Ray Serve", "Ray Serve API",
"Production Guide", "Advanced Guides",
"Deploy Many Models",
// Ray RLlib
"Ray RLlib", "Ray RLlib API",
// More libraries
"More Libraries", "Ray Workflows (Alpha)",
// Monitoring/debugging
"Monitoring and Debugging",
// References
"References", "Use Cases",
// Developer guides
"Developer Guides", "Getting Involved / Contributing",
];

const containsString = stringList.some(str => navItem.innerText === str);

if (containsString && ! navItem.classList.contains('current')) {
if (navItem.classList.contains('toctree-l1')) {
navItem.style.fontWeight = "bold";
}
const href = navItem.querySelector("a").getAttribute("href");
navItem.innerHTML +=
'<a href="'+ href +'" style="display: none">'
+ '<input checked="" class="toctree-checkbox" id="toctree-checkbox-'
+ i + '" name="toctree-checkbox-' + i + '" type="button"></a>'
+ '<label for="toctree-checkbox-' + i + '">' +
'<i class="fas fa-chevron-down"></i></label>'
}

const defaultStyle = {"fontWeight": "bold"}

const stringList = [
{"text": "User Guides"},
{"text": "Examples"},
// Ray Core
{"text": "Ray Core"},
{"text": "Ray Core API"},
// Ray Cluster
{"text": "Ray Clusters"},
{"text": "Deploying on Kubernetes"},
{"text": "Deploying on VMs"},
{"text": "Applications Guide"},
{"text": "Ray Cluster Management API"},
{"text": "Getting Started with KubeRay"},
{"text": "KubeRay Ecosystem"},
{"text": "KubeRay Benchmarks"},
{"text": "KubeRay Troubleshooting"},
// Ray AIR
{"text": "Ray AIR API"},
// Ray Data
{"text": "Ray Data"},
{"text": "Ray Data API"},
{"text": "Integrations"},
// Ray Train
{"text": "Ray Train"},
{"text": "More Frameworks"},
{"text": "Advanced Topics"},
{"text": "Internals"},
{"text": "Ray Train API"},
// Ray Tune
{"text": "Ray Tune"},
{"text": "Ray Tune Examples"},
{"text": "Ray Tune API"},
// Ray Serve
{"text": "Ray Serve"},
{"text": "Ray Serve API"},
{"text": "Production Guide"},
{"text": "Advanced Guides"},
{"text": "Deploy Many Models"},
// Ray RLlib
{"text": "Ray RLlib"},
{"text": "Ray RLlib API"},
// More libraries
{"text": "More Libraries"},
{"text": "Ray Workflows (Alpha)"},
// Monitoring/debugging
{"text": "Monitoring and Debugging"},
// References
{"text": "References"},
{"text": "Use Cases", "style": {}}, // Don't use default style: https://github.com/ray-project/ray/issues/39172
// Developer guides
{"text": "Developer Guides"},
{"text": "Getting Involved / Contributing"},
];

Array.from(navItems).filter(
item => stringList.some(({text}) => item.innerText === text) && ! item.classList.contains('current')
).forEach((item, i) => {
if (item.classList.contains('toctree-l1')) {
const { style } = stringList.find(({text}) => item.innerText == text)

// Set the style on the menu items
Object.entries(style ?? defaultStyle).forEach(([key, value]) => {
item.style[key] = value
})

}
item.innerHTML +=
`<a href="${item.querySelector("a").getAttribute("href")}" style="display: none">`
+ '<input checked="" class="toctree-checkbox" id="toctree-checkbox-'
+ i + '" name="toctree-checkbox-' + i + '" type="button"></a>'
+ '<label for="toctree-checkbox-' + i + '">' +
'<i class="fas fa-chevron-down"></i></label>'
})
});

// Dynamically adjust the height of all panel elements in a gallery to be the same as
2 changes: 2 additions & 0 deletions doc/source/_toc.yml
@@ -290,6 +290,7 @@ parts:
sections:
- file: cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md
- file: cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
- file: cluster/kubernetes/user-guides/storage.md
- file: cluster/kubernetes/user-guides/config.md
- file: cluster/kubernetes/user-guides/configuring-autoscaling.md
- file: cluster/kubernetes/user-guides/kuberay-gcs-ft.md
@@ -372,6 +373,7 @@ parts:
- file: ray-observability/user-guides/debug-apps/ray-debugging
- file: ray-observability/user-guides/cli-sdk
- file: ray-observability/user-guides/configure-logging
- file: ray-observability/user-guides/profiling
- file: ray-observability/user-guides/add-app-metrics
- file: ray-observability/user-guides/ray-tracing
- file: ray-observability/reference/index
18 changes: 14 additions & 4 deletions doc/source/cluster/configure-manage-dashboard.md
@@ -133,7 +133,7 @@ The Ray Dashboard provides read **and write** access to the Ray Cluster. The rev

## Disabling the Dashboard

Dashboard is included if you use `ray[default]`, `ray[air]`, or {ref}`other installation commands <installation>` and automatically started.
The Dashboard is included if you use `ray[default]` or {ref}`other installation commands <installation>`, and it starts automatically.

To disable the Dashboard, use the `--include-dashboard` argument.

@@ -209,6 +209,9 @@ Configure these settings using the `RAY_GRAFANA_HOST`, `RAY_PROMETHEUS_HOST`, `R
* Set `RAY_GRAFANA_IFRAME_HOST` to an address that the user's browsers can use to access Grafana and embed visualizations. If `RAY_GRAFANA_IFRAME_HOST` is not set, Ray Dashboard uses the value of `RAY_GRAFANA_HOST`.

For example, if the IP of the head node is 55.66.77.88 and Grafana is hosted on port 3000, set the value to `RAY_GRAFANA_HOST=http:https://55.66.77.88:3000`.
* If you start a single-node Ray Cluster manually, make sure these environment variables are set and accessible before you start the cluster, or pass them as a prefix to the `ray start ...` command, e.g., `RAY_GRAFANA_HOST=http:https://55.66.77.88:3000 ray start ...` (see the sketch after this list).
* If you start a Ray Cluster with {ref}`VM Cluster Launcher <cloud-vm-index>`, the environment variables should be set under `head_start_ray_commands` as a prefix to the `ray start ...` command.
* If you start a Ray Cluster with {ref}`KubeRay <kuberay-index>`, refer to this {ref}`tutorial <kuberay-prometheus-grafana>`.
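
For the manual single-node case, here is a minimal sketch in Python of setting these variables before starting Ray. It assumes that environment variables set in the driver process are inherited by the processes that `ray.init()` launches; the addresses and ports below are placeholders for your own Grafana and Prometheus hosts.

```python
import os

import ray

# Set these before starting Ray so the Dashboard picks them up.
# Placeholder addresses: replace with your Grafana and Prometheus hosts.
os.environ["RAY_GRAFANA_HOST"] = "http:https://55.66.77.88:3000"
os.environ["RAY_PROMETHEUS_HOST"] = "http:https://55.66.77.88:9090"

ray.init()  # starts a single-node Ray cluster locally
```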

If all the environment variables are set properly, you should see time-series metrics in {ref}`Ray Dashboard <observability-getting-started>`.

@@ -237,7 +240,7 @@ When both Grafana and the Ray Cluster are on the same Kubernetes cluster, set `R


#### User authentication for Grafana
When the Grafana instance requires user authentication, the following settings have to be in its `configuration file <https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/>`_ to correctly embed in Ray Dashboard:
When the Grafana instance requires user authentication, the following settings have to be in its [configuration file](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/) to correctly embed in Ray Dashboard:

```ini
[security]
@@ -248,8 +251,15 @@ When the Grafana instance requires user authentication, the following settings h

#### Troubleshooting

##### Grafana dashboards are not embedded in the Ray dashboard
If you're getting an error that says `RAY_GRAFANA_HOST` is not setup despite having set it up, check that:
##### Dashboard message: either Prometheus or Grafana server is not detected
If you have followed the instructions above to set everything up, run the following connection checks in your browser (or script them, as sketched after this list):
* Check the head node's connection to the Prometheus server: add `api/prometheus_health` to the end of the Ray Dashboard URL (for example: http:https://127.0.0.1:8265/api/prometheus_health) and visit it.
* Check the head node's connection to the Grafana server: add `api/grafana_health` to the end of the Ray Dashboard URL (for example: http:https://127.0.0.1:8265/api/grafana_health) and visit it.
* Check the browser's connection to the Grafana server: visit the URL used in `RAY_GRAFANA_IFRAME_HOST`.
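
You can also script these checks; the following is a minimal sketch using the `requests` library, with the Dashboard URL as a placeholder.

```python
import requests

dashboard_url = "http:https://127.0.0.1:8265"  # placeholder: your Ray Dashboard URL

# Probe the health endpoints the Dashboard exposes for Prometheus and Grafana.
for endpoint in ("api/prometheus_health", "api/grafana_health"):
    response = requests.get(f"{dashboard_url}/{endpoint}", timeout=5)
    print(endpoint, response.status_code, response.text)
```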


##### Getting an error that says `RAY_GRAFANA_HOST` is not setup
If you have set up Grafana, check that:
* You've included the protocol in the URL (e.g., `http:https://your-grafana-url.com` instead of `your-grafana-url.com`).
* The URL doesn't have a trailing slash (e.g., `http:https://your-grafana-url.com` instead of `http:https://your-grafana-url.com/`).

(Two binary files in this commit cannot be displayed in the diff view.)
1 change: 1 addition & 0 deletions doc/source/cluster/kubernetes/user-guides.md
@@ -10,6 +10,7 @@ at the {ref}`introductory guide <kuberay-quickstart>` first.
* {ref}`kuberay-rayservice`
* {ref}`kuberay-observability`
* {ref}`kuberay-k8s-setup`
* {ref}`kuberay-storage`
* {ref}`kuberay-config`
* {ref}`kuberay-autoscaling`
* {ref}`kuberay-gpu`
82 changes: 82 additions & 0 deletions doc/source/cluster/kubernetes/user-guides/storage.md
@@ -0,0 +1,82 @@
(kuberay-storage)=

# Best Practices for Storage and Dependencies

This document contains recommendations for setting up storage and handling application dependencies for your Ray deployment on Kubernetes.

When you set up Ray on Kubernetes, the [KubeRay documentation](kuberay-quickstart) provides an overview of how to configure the operator to execute and manage the Ray cluster lifecycle.
However, as an administrator, you may still have questions about actual user workflows. For example:

* How do you ship or run code on the Ray cluster?
* What type of storage system should you set up for artifacts?
* How do you handle package dependencies for your application?

The answers to these questions vary between development and production. This table summarizes the recommended setup for each situation:

| | Interactive Development | Production |
|---|---|---|
| Cluster Configuration | KubeRay YAML | KubeRay YAML |
| Code | Run driver or Jupyter notebook on head node | Bake code into Docker image |
| Artifact Storage | Set up an EFS <br /> or <br /> Cloud Storage (S3, GS) | Set up an EFS <br /> or <br /> Cloud Storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into Docker image |

Table 1: Recommended setup for development and production.

## Interactive development

To provide an interactive development environment for data scientists and ML practitioners, we recommend setting up the code, storage, and dependencies in a way that reduces context switches for developers and shortens iteration times.

```{eval-rst}
.. image:: ../images/interactive-dev.png
:align: center
..
Find the source document here (https://whimsical.com/clusters-P5Y6R23riCuNb6xwXVXN72)
```

### Storage

Use one of these two standard solutions for artifact and log storage during the development process, depending on your use case:

* POSIX-compliant network file storage, like Network File System (NFS) and Elastic File System (EFS): This approach is useful when you want artifacts or dependencies to be accessible across different nodes with low latency. For example, experiment logs of different models trained on different Ray tasks.
* Cloud storage, like AWS Simple Storage Service (S3) or Google Cloud Storage (GS): This approach is useful for large artifacts or datasets that you need to access with high throughput.

Ray's AI libraries, such as Ray Data, Ray Train, and Ray Tune, come with out-of-the-box capabilities to read from and write to cloud storage and local or networked storage.
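
For example, the following is a minimal sketch of reading from and writing to cloud storage with Ray Data; the bucket and paths are hypothetical placeholders.

```python
import ray

# Read a Parquet dataset directly from cloud storage (placeholder S3 path).
ds = ray.data.read_parquet("s3://my-bucket/datasets/input/")

# Apply a (no-op) batch transformation to show where processing would go.
ds = ds.map_batches(lambda batch: batch)

# Write the results back to cloud storage, or to a mounted NFS/EFS path.
ds.write_parquet("s3://my-bucket/datasets/output/")
```
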
### Driver script

Run the main, or driver, script on the head node of the cluster. Ray Core and library programs often assume that the driver is on the head node and take advantage of the local storage. For example, Ray Tune generates log files on the head node by default.

A typical workflow can look like this:

* Start a Jupyter server on the head node
* SSH onto the head node and run the driver script or application there (a minimal example follows this list)
* Use the Ray Job Submission client to submit code from a local machine onto a cluster
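
As a minimal illustration of the second item, a driver script like the following could be run directly on the head node; the task body is only a placeholder.

```python
# driver.py -- a minimal driver script to run on the head node.
import ray

# Connect to the cluster that is already running on this node.
ray.init(address="auto")


@ray.remote
def square(x: int) -> int:
    return x * x


# Launch tasks across the cluster and collect the results on the head node.
results = ray.get([square.remote(i) for i in range(10)])
print(results)
```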

### Dependencies

For local dependencies (for example, if you're working in a mono-repo) or external dependencies (like a pip package), use one of the following options:

* Put the code and install the packages onto your NFS. The benefit is that you can quickly interact with the rest of the codebase and dependencies without shipping them across the cluster every time.
* Use the [runtime env](runtime-environments) with the [Ray Job Submission Client](ray.job_submission.JobSubmissionClient), which can pull down code from S3 or ship code from your local working directory onto the remote cluster, as sketched after this list.
* Bake remote and local dependencies into a published Docker image for all nodes to use. See [Custom Docker Images](serve-custom-docker-images). This approach is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes), but it's also the highest friction option.
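
For the runtime environment option, the following is a minimal sketch of shipping your local working directory and pip dependencies with the Job Submission client; the address, entrypoint, and package names are placeholders.

```python
from ray.job_submission import JobSubmissionClient

# Point the client at the Ray Dashboard address of the remote cluster (placeholder).
client = JobSubmissionClient("http:https://127.0.0.1:8265")

job_id = client.submit_job(
    # Command that runs on the cluster; script.py is shipped via working_dir.
    entrypoint="python script.py",
    runtime_env={
        "working_dir": "./",           # ship the local working directory
        "pip": ["requests", "emoji"],  # install external pip packages
    },
)
print(job_id)
```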

## Production

The recommendations for production align with standard Kubernetes best practices. See the configuration in the following image:

```{eval-rst}
.. image:: ../images/production.png
:align: center
..
Find the source document here (https://whimsical.com/clusters-P5Y6R23riCuNb6xwXVXN72)
```


### Storage

The choice of storage system remains the same across development and production.

### Code and dependencies

Bake your code, remote, and local dependencies into a published Docker image for all nodes in the cluster. This approach is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes). See [Custom Docker Images](serve-custom-docker-images).

Using cloud storage and the [runtime env](runtime-environments) is a less preferred method as it may not be as reproducible as the container path, but it's still viable. In this case, use the runtime environment option to download zip files containing code and other private modules from cloud storage, in addition to specifying the pip packages needed to run your application.
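
If you do take this path, the following is a hedged sketch of what that runtime environment might look like; the cluster address, archive location, and package names are placeholders, and the cluster is assumed to have the credentials and libraries needed to read from your cloud storage.

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http:https://127.0.0.1:8265")  # placeholder Dashboard address

job_id = client.submit_job(
    entrypoint="python main.py",
    runtime_env={
        # Download a zip archive of your code and private modules from cloud storage.
        "working_dir": "s3://my-bucket/releases/my_app_v1.zip",
        # Pip packages needed to run the application.
        "pip": ["torch", "pandas"],
    },
)
```
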
4 changes: 2 additions & 2 deletions doc/source/cluster/metrics.md
@@ -6,7 +6,7 @@ Ray records and emits time-series metrics using the [Prometheus format](https://


## System and application metrics
Ray exports metrics if you use `ray[default]`, `ray[air]`, or {ref}`other installation commands <installation>` that include Dashboard component. Dashboard agent process is responsible for aggregating and reporting metrics to the endpoints for Prometheus to scrape.
Ray exports metrics if you use `ray[default]` or {ref}`other installation commands <installation>` that include the Dashboard component. The Dashboard agent process is responsible for aggregating and reporting metrics to the endpoints for Prometheus to scrape.

**System metrics**: Ray exports a number of system metrics. View {ref}`system metrics <system-metrics>` for more details about the emitted metrics.

@@ -237,7 +237,7 @@ To fix this issue, employ an automated shell script for seamlessly transferring

:::{tab-item} Using an existing Grafana server

After your Grafana server is running, find the Ray-provided default Grafana dashboard JSON at `/tmp/ray/session_latest/metrics/grafana/dashboards/default_grafana_dashboard.json`. [Import this dashboard](https://grafana.com/docs/grafana/latest/dashboards/manage-dashboards/#import-a-dashboard) to your Grafana.
After your Grafana server is running, start a Ray Cluster and find the Ray-provided default Grafana dashboard JSONs at `/tmp/ray/session_latest/metrics/grafana/dashboards`. [Copy the JSONs over and import the Grafana dashboards](https://grafana.com/docs/grafana/latest/dashboards/manage-dashboards/#import-a-dashboard) to your Grafana.

If Grafana reports that the datasource is not found, [add a datasource variable](https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/?pg=graf&plcmt=data-sources-prometheus-btn-1#add-a-data-source-variable). The datasource's name must be the same as value in the `RAY_PROMETHEUS_NAME` environment. By default, `RAY_PROMETHEUS_NAME` equals `Prometheus`.
:::
6 changes: 3 additions & 3 deletions doc/source/data/examples/gptj_batch_prediction.ipynb
@@ -5,9 +5,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# GPT-J-6B Batch Prediction with Ray AIR\n",
"# GPT-J-6B Batch Prediction with Ray Data\n",
"\n",
"This example showcases how to use the Ray AIR for **GPT-J batch inference**. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This model has 6 billion parameters. For more information on GPT-J, click [here](https://huggingface.co/docs/transformers/model_doc/gptj).\n",
"This example showcases how to use the Ray Data for **GPT-J batch inference**. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This model has 6 billion parameters. For more information on GPT-J, click [here](https://huggingface.co/docs/transformers/model_doc/gptj).\n",
"\n",
"We use Ray Data and a pretrained model from Hugging Face hub. Note that you can easily adapt this example to use other similar models.\n",
"\n",
@@ -224,7 +224,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. This is because Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.train.Checkpoint>`, which we don't for this example. See {class}`ray.train.predictor.Predictor` for more information and usage examples."
"You may notice that we are not using a {class}`Predictor <ray.train.predictor.Predictor>` here. This is because Predictors are mainly intended to be used with Train {class}`Checkpoints <ray.train.Checkpoint>`, which we don't for this example. See {class}`ray.train.predictor.Predictor` for more information and usage examples."
]
}
],