[doc][clusters] add doc for setting up Ray and K8s #39408
Conversation
Signed-off-by: angelinalg <[email protected]>
# Set up a Ray + Kubernetes cluster
This document contains recommendations for setting up a Ray + Kubernetes cluster for your organization.
The Ray and Kubernetes ecosystem encompasses various aspects. Could you specify which setup instructions are covered by this document?
It seems to be covered by:
This guide covers best practices for these deployment considerations:
* Where to ship or run your code on the Ray cluster
* Choosing a storage system for artifacts
* Package dependencies for your application
should be addressed
### Storage
Use one of these two standard solutions for artifact and log storage during the development process:
It is inconsistent with the table above. We only mention NFS/EFS in the table under the 'interactive development' column. However, here we reference both NFS/EFS and S3/GS.
updated
```{eval-rst}
.. image:: ../images/prod.png
```
This image is inconsistent with the table above. We only mention S3/GS in the table under the 'production' column. However, here we only reference NFS/EFS.
updated
### Storage
Reading and writing data and artifacts to cloud storage is the most reliable and observable option for production Ray deployments.
Why?
updated
Bake your code, remote, and local dependencies into a published Docker image for the workers. This is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes).
Using cloud storage and the `runtime_env` is a less preferred method. In this case, use the runtime environment option to download zip files containing code and other private modules from cloud storage, in addition to specifying the pip packages needed to run your application.
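As a sketch of the runtime environment option described above (the bucket path and package names are illustrative placeholders, not from this PR):

```python
# Hypothetical runtime_env that downloads zipped application code from
# cloud storage and installs pip packages on every node. The S3 path
# and package list are placeholders.
runtime_env = {
    # Ray downloads and unpacks this archive as the job's working directory.
    "working_dir": "s3://example-bucket/my-app.zip",
    # Installed into a per-job environment on each node.
    "pip": ["requests", "pandas"],
}

# Applying it to a whole job (requires a running Ray cluster):
#   import ray
#   ray.init(address="auto", runtime_env=runtime_env)
```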
Add a sentence to explain why `runtime_env` is a less preferred method for production.
Looks good to me, just minor comments
This document contains recommendations for setting up a Ray + Kubernetes cluster for your organization.
When you set up Ray on Kubernetes, the KubeRay documentation provides an overview of how to configure the operator to execute and manage the Ray cluster lifecycle. This guide complements the KubeRay documentation by providing best practices for effectively using Ray deployments in your organization.
Please link to KubeRay doc
Done! Good point.
| Artifact Storage | Set up an EFS | Cloud storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into docker image |
Maybe spell out EFS, NFS, S3, GS the first time you use them, and/or add links for them
| Artifact Storage | Set up an EFS | Cloud storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into Docker image |
Done. Thanks for catching!
* Start a Jupyter server on the head node
* SSH onto the head node and run the driver script or application there
* Use the Ray Job Submission client to submit code from a local machine onto a cluster
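A minimal sketch of the Job Submission option in the last bullet, assuming the cluster's dashboard is reachable locally (the helper name, address, and entrypoint are placeholders):

```python
def submit_ray_job(dashboard_url: str, entrypoint: str, working_dir: str) -> str:
    """Submit an entrypoint command to a Ray cluster via the Job Submission API.

    The import is deferred so this sketch reads standalone; actually
    running it requires `pip install "ray[default]"` and a live cluster.
    """
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    # runtime_env ships the local working directory to the cluster
    # alongside the job.
    return client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": working_dir},
    )

# Usage, e.g. after port-forwarding the head service's dashboard port 8265:
#   job_id = submit_ray_job("http://127.0.0.1:8265", "python script.py", "./")
```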
It's not clear what these are examples of. I thought of "Here are some examples of ways to run a driver script on the head node", but that doesn't seem to fit well with the first bullet about Jupyter.
should be addressed
## Production
For production, we suggest the following configuration.
Add a motivating comment here for recommendations
should be addressed
This document contains recommendations for setting up a Ray + Kubernetes cluster for your organization.
When you set up Ray on Kubernetes, the KubeRay documentation provides an overview of how to configure the operator to execute and manage the Ray cluster lifecycle. This guide complements the KubeRay documentation by providing best practices for effectively using Ray deployments in your organization.
Add a bit more clarity as to why this doc matters
| | Interactive Development | Production |
|---|---|---|
| Cluster Configuration | KubeRay YAML | KubeRay YAML |
| Code | Run driver or Jupyter notebook on head node | S3 + runtime envs <br /> OR <br /> Bake code into Docker image (link) |
Remove this "(link)" placeholder. Also, do we need to say more about the Docker image setup, or is that common knowledge?
I removed the word "link".
Building a Ray image from scratch is not easy, and our image-building CI pipelines are pretty complex. It will be helpful to have a doc in the future.
https://docs.ray.io/en/master/serve/production-guide/docker.html => This is not enough. For example, some users are sensitive to security and want to build the image with different Linux distributions.
### Code and Dependencies
Bake your code, remote, and local dependencies into a published Docker image for the workers. This is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes).
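One way to sketch this baking step (the base-image tag, requirements file, and app directory are placeholders, not from this PR):

```dockerfile
# Hypothetical Dockerfile: start from an official Ray base image and
# bake in the application code and its dependencies.
FROM rayproject/ray:2.7.0

# Install the application's pip dependencies.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code itself into the image.
COPY my_app/ ./my_app/
```

Workers started from this image then have the code and dependencies available without any runtime download.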
Do you also want to add a link to how to build it into the docker image? -> https://docs.ray.io/en/master/serve/production-guide/docker.html
done
Could you help me refresh this once more?
done
could you help me refresh this one as well?
done
I am not familiar with NFS/EFS. Could you explain why NFS is inside the "Ray cluster" in the interactive-dev.png but outside the "Ray cluster" in the production.png?
(#39510)
* Update metrics.md (#38512): (1) there are 3 dashboards in the folder now, so refer to the folder instead of only 1 dashboard; (2) include "Copy" since people need to copy this from the head node to the Grafana server.
* Polish observability (o11y) docs (#39069)
* [Doc] Unbold "Use Cases" in sidebar (#39295)
* [docs] Cleanup for other AIR concepts (#39400)
* [doc][clusters] add doc for setting up Ray and K8s (#39408)

Co-authored-by: Huaiwei Sun, matthewdeng, Peyton Murray, Richard Liaw
Fill the content gap by providing best practices for two flavors of deployments:
cc: @richardliaw
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.