
[doc][clusters] add doc for setting up Ray and K8s #39408

Merged: 15 commits, Sep 9, 2023

Commit: add doc for setting up Ray and K8s
Signed-off-by: angelinalg <[email protected]>
angelinalg committed Sep 7, 2023
commit 1b59b504d5aa20fee218ea705654a3da78820b55
1 change: 1 addition & 0 deletions in doc/source/_toc.yml

```diff
@@ -290,6 +290,7 @@ parts:
       sections:
       - file: cluster/kubernetes/user-guides/aws-eks-gpu-cluster.md
       - file: cluster/kubernetes/user-guides/gcp-gke-gpu-cluster.md
+      - file: cluster/kubernetes/user-guides/ray-k8s-setup.md
       - file: cluster/kubernetes/user-guides/config.md
       - file: cluster/kubernetes/user-guides/configuring-autoscaling.md
       - file: cluster/kubernetes/user-guides/gke-gcs-bucket.md
```
Binary file added doc/source/cluster/kubernetes/images/prod.png
81 changes: 81 additions & 0 deletions in doc/source/cluster/kubernetes/user-guides/ray-k8s-setup.md

@@ -0,0 +1,81 @@
(ray-kubernetes-setup)=

# Set up a Ray + Kubernetes cluster

This document contains recommendations for setting up a Ray + Kubernetes cluster for your organization.
Member: The Ray and Kubernetes ecosystem encompasses various aspects. Could you specify which setup instructions are covered by this document?

Member: It seems to be covered by:

> This guide covers best practices for these deployment considerations:
>
> * Where to ship or run your code on the Ray cluster
> * Choosing a storage system for artifacts
> * Package dependencies for your application

Contributor: Should be addressed.


When you set up Ray on Kubernetes, the [KubeRay documentation](https://ray-project.github.io/kuberay/) provides an overview of how to configure the operator to run and manage the Ray cluster lifecycle. This guide complements the KubeRay documentation with best practices for using Ray deployments effectively in your organization.
Contributor: Add a bit more clarity as to why this doc matters.

Contributor: Please link to the KubeRay doc.

Author: Done! Good point.


This guide covers best practices for these deployment considerations:

* Where to ship or run your code on the Ray cluster
* Choosing a storage system for artifacts
* Package dependencies for your application

Deployment considerations are different for development and production. This table summarizes the recommended setup for both interactive development and production:

| | Interactive Development | Production |
|---|---|---|
| Cluster Configuration | KubeRay YAML | KubeRay YAML |
| Code | Run driver or Jupyter notebook on head node | S3 + runtime envs <br /> OR <br /> Bake code into Docker image |
| Artifact Storage | Set up an EFS | Cloud storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into Docker image |

Table 1: Recommended setup for development and production.

Contributor: Remove this link thing. Do we need to say more about the Docker image setup, or is that common knowledge?

Author: I removed the word "link".

Member: Building a Ray image from scratch is not easy, and our image-building CI pipelines are pretty complex. It would be helpful to have a doc in the future.

Member: https://docs.ray.io/en/master/serve/production-guide/docker.html is not enough. For example, some users are sensitive to security and want to build the image with different Linux distributions.

Contributor: Maybe spell out EFS, NFS, S3, and GS the first time you use them, and/or add links for them.

Contributor: Suggested change: capitalize "Docker" in "Bake into Docker image".

Author: Done. Thanks for catching!

## Interactive development

To provide an interactive development environment for data scientists, you should set up the code, storage, and dependencies in a way that reduces context switches for developers and shortens iteration times.

```{eval-rst}
.. image:: ../images/interact-dev.png
   :align: center
..
   Find the source document here (https://whimsical.com/clusters-P5Y6R23riCuNb6xwXVXN72)
```

### Storage

Use one of these two standard solutions for artifact and log storage during the development process:
Member: It is inconsistent with the table above. We only mention NFS/EFS in the table under the 'Interactive Development' column. However, here we reference both NFS/EFS and S3/GS.

Contributor: Updated.


* POSIX-compliant network file storage (like NFS or AWS EFS): This approach is useful when you have artifacts or dependencies that you need to access interactively.
* Cloud storage (like AWS S3 or Google Cloud Storage): This approach is useful for large artifacts or datasets that you need to access with high throughput.
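The split between the two options above can be made concrete with a small, hypothetical helper that resolves artifact locations; the mount point and bucket name are placeholders for your own setup:

```python
import os

# Placeholders: a POSIX network mount available on every node (e.g. EFS),
# and a hypothetical cloud bucket for high-throughput access.
NFS_MOUNT = "/mnt/cluster_storage"
CLOUD_PREFIX = "s3://my-dev-bucket/artifacts"


def artifact_uri(name: str, use_cloud: bool = False) -> str:
    """Return where to read or write an artifact for the chosen backend."""
    if use_cloud:
        return f"{CLOUD_PREFIX}/{name}"
    return os.path.join(NFS_MOUNT, name)
```

Keeping this choice behind one function makes it easy to switch a workload from interactive NFS paths to cloud URIs later.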

### Driver script

Run the main (driver) script on the head node of the cluster. Ray Core and library programs often assume that the driver is located on the head node and take advantage of local storage. There are several ways to do this:

* Start a Jupyter server on the head node and run the driver there
* SSH onto the head node and run the driver script or application there
* Use the Ray Job Submission client to submit code from a local machine to the cluster
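The job-submission option can be sketched with the Job Submission client. This is a hedged sketch: the dashboard address, script name, and working directory are placeholders, and it assumes `ray[default]` is installed locally.

```python
def submit_driver(address: str, script: str, working_dir: str = ".") -> str:
    """Submit a driver script from a local machine to a remote Ray cluster.

    `address` points at the cluster's dashboard (port 8265 by default).
    Returns the submitted job's ID.
    """
    # Imported inside the function so the sketch is importable without Ray.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(
        entrypoint=f"python {script}",
        runtime_env={"working_dir": working_dir},  # ship local code with the job
    )


# Hypothetical usage:
#   job_id = submit_driver("http://127.0.0.1:8265", "train.py")
```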
Contributor: It's not clear what these are examples of. I thought of "Here are some examples of ways to run a driver script on the head node," but that doesn't seem to fit well with the first bullet about Jupyter.

Author: Should be addressed.


### Dependencies

For local dependencies (for example, if you're working in a monorepo) or external dependencies (like a pip package), use one of the following options:

* Put the code and install the packages onto your NFS. The benefit is that you can quickly interact with the rest of the codebase and dependencies without shipping them across the cluster every time.
* Bake remote and local dependencies into a published Docker image for the workers. This is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes).
* Use a `runtime_env` with the [Ray Job Submission Client](ray.job_submission.JobSubmissionClient), which can pull down code from S3 or ship code from your local working directory onto the remote cluster.
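For the runtime-environment option, a minimal sketch of a per-job environment for interactive development; the package versions and `excludes` patterns are illustrative:

```python
# Illustrative runtime environment for interactive development: ship the local
# working directory with each job and install per-job pip packages.
runtime_env = {
    "working_dir": ".",                    # upload local code to the cluster
    "excludes": ["data/", "*.ckpt"],       # keep large files out of the upload
    "pip": ["pandas==2.1.0", "requests"],  # per-job dependencies
}

# Pass it at submission time, e.g.:
#   JobSubmissionClient(address).submit_job(entrypoint="python app.py",
#                                           runtime_env=runtime_env)
```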

## Production

For production, we suggest the following configuration.
Contributor: Add a motivating comment here for the recommendations.

Author: Should be addressed.



```{eval-rst}
.. image:: ../images/prod.png
   :align: center
..
   Find the source document here (https://whimsical.com/clusters-P5Y6R23riCuNb6xwXVXN72)
```

Member: This image is inconsistent with the table above. We only mention S3/GS in the table under the 'Production' column. However, here we only reference NFS/EFS.

Contributor: Updated.


### Storage

Reading and writing data and artifacts to cloud storage is the most reliable and observable option for production Ray deployments.
Member: Why?

Contributor: Updated.


### Code and Dependencies

Bake your code and its remote and local dependencies into a published Docker image for the workers. This is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes).
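A minimal sketch of such an image, assuming the public `rayproject/ray` base image; the tag, paths, and requirements file are placeholders for your own build:

```dockerfile
# Hypothetical build: start from a pinned Ray base image so the Ray version in
# the image matches the cluster, then bake in application code and dependencies.
FROM rayproject/ray:2.7.0
WORKDIR /home/ray/app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
```

Pinning the base image tag keeps the Ray version in the image consistent with the cluster configuration.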

Using cloud storage and a `runtime_env` is a less preferred method. In this case, use the runtime environment option to download zip files containing code and other private modules from cloud storage, in addition to specifying the pip packages your application needs.
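As a hedged sketch, such a runtime environment might point `working_dir` at a versioned code archive in cloud storage; the bucket, archive name, and package pins are placeholders:

```python
# Illustrative production runtime environment: pull a versioned code archive
# from cloud storage and pin the pip dependencies the application needs.
runtime_env = {
    "working_dir": "s3://my-bucket/releases/app-1.2.3.zip",  # hypothetical archive
    "pip": ["pandas==2.1.0", "pydantic==1.10.12"],           # pinned per-job deps
}
```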
Member: Add a sentence to explain why runtime_env is a less preferred method for production.