
[doc][clusters] add doc for setting up Ray and K8s #39408

Merged
merged 15 commits
Sep 9, 2023
address feedback and copy edit
Signed-off-by: angelinalg <[email protected]>
angelinalg committed Sep 8, 2023
commit c858fa663d797cce0ffca66be32f6df33f09a1f5
37 changes: 18 additions & 19 deletions doc/source/cluster/kubernetes/user-guides/storage.md
@@ -1,24 +1,24 @@
(kuberay-storage)=

# Best Practices for Storage and Dependencies

This document contains recommendations for setting up storage and handling application dependencies for your Ray deployment on Kubernetes.

When you set up Ray on Kubernetes, the [KubeRay documentation](kuberay-quickstart) provides an overview of how to configure the operator to execute and manage the Ray cluster lifecycle.
However, as an administrator, you may still have questions about actual user workflows. For example:

* How do you ship or run code on the Ray cluster?
* What type of storage system should you set up for artifacts?
* How do you handle package dependencies for your application?

The answers to these questions vary between development and production. This table summarizes the recommended setup for each situation:

| | Interactive Development | Production |
|---|---|---|
| Cluster Configuration | KubeRay YAML | KubeRay YAML |
| Code | Run driver or Jupyter notebook on head node | Bake code into Docker image |
| Artifact Storage | Set up an EFS <br /> or <br /> Cloud Storage (S3, GS) | Set up an EFS <br /> or <br /> Cloud Storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into Docker image |

Table 1: Comparison of the recommended setups for development and production.

@@ -37,13 +37,13 @@ To provide an interactive development environment for data scientists and ML practitioners

Use one of these two standard solutions for artifact and log storage during the development process, depending on your use case:

* POSIX-compliant network file storage, like Network File System (NFS) and Elastic File System (EFS): This approach is useful when you want to have artifacts or dependencies accessible across different nodes with low latency. For example, experiment logs of different models trained on different Ray tasks.
* Cloud storage, like AWS Simple Storage Service (S3) or Google Cloud Storage (GS): This approach is useful for large artifacts or datasets that you need to access with high throughput.

Ray's AI libraries such as Ray Data, Ray Train, and Ray Tune come with out-of-the-box capabilities to read and write from cloud storage and local or networked storage.
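
For example, the following is a minimal sketch of reading from and writing back to cloud storage with Ray Data. The bucket paths are placeholders; point them at storage your cluster can actually access.

```python
# A minimal sketch: read a dataset from cloud storage and write results back.
# The bucket paths below are illustrative placeholders.
import ray

ds = ray.data.read_parquet("s3://example-bucket/input/")
ds = ds.map_batches(lambda batch: batch)  # apply your transformation here
ds.write_parquet("s3://example-bucket/output/")
```
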
### Driver script

Run the main, or driver, script on the head node of the cluster. Ray Core and library programs often assume that the driver is on the head node and take advantage of the local storage. For example, Ray Tune generates log files on the head node by default.
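
As a rough sketch, the snippet below runs a small Ray Tune job from the head node and redirects results to a shared location so that other nodes, or your laptop, can read them. It assumes Ray 2.x and a shared filesystem mounted at a hypothetical path.

```python
# A minimal sketch, assuming Ray 2.x and shared storage mounted at /mnt/cluster_storage.
# Without storage_path, Ray Tune writes results under ~/ray_results on the head node.
from ray import train, tune

def objective(config):
    # Report a toy metric for each trial.
    train.report({"score": config["x"] ** 2})

tuner = tune.Tuner(
    objective,
    param_space={"x": tune.grid_search([1, 2, 3])},
    run_config=train.RunConfig(storage_path="/mnt/cluster_storage/ray_results"),
)
tuner.fit()
```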

A typical workflow can look like this:

@@ -53,16 +53,15 @@ A typical workflow can look like this:

### Dependencies

For local dependencies, such as code in a mono-repo, or external dependencies, like a pip package, use one of the following options:

* Put the code and install the packages onto your NFS. The benefit is that you can quickly interact with the rest of the codebase and dependencies without shipping them across the cluster every time.
* Use the [runtime env](runtime-environments) with the [Ray Job Submission Client](ray.job_submission.JobSubmissionClient), which can pull down code from S3 or ship code from your local working directory onto the remote cluster, as in the sketch after this list.
* Bake remote and local dependencies into a published Docker image for all nodes to use. See [Custom Docker Images](serve-custom-docker-images). This approach is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes), but it's also the highest friction option.
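
The following is a minimal sketch of the job submission option. The dashboard address, entrypoint script, working directory, and package list are placeholders; adjust them for your cluster.

```python
# A minimal sketch of submitting a job with a runtime environment.
# The address, entrypoint, and dependency list below are illustrative placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # Ray dashboard address
job_id = client.submit_job(
    entrypoint="python train.py",
    runtime_env={
        "working_dir": "./my_project",      # local code shipped to the cluster
        "pip": ["pandas", "scikit-learn"],  # installed on the cluster at runtime
    },
)
print(f"Submitted job: {job_id}")
```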

## Production

The recommendations for production align with standard Kubernetes best practices. See the configuration in the following image:

```{eval-rst}
.. image:: ../images/production.png
@@ -76,8 +75,8 @@ Our recommendations regarding production are more aligned with standard Kubernetes

The choice of storage system remains the same across development and production.

### Code and dependencies

Bake your code, remote, and local dependencies into a published Docker image for all nodes in the cluster. This approach is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes). See [Custom Docker Images](serve-custom-docker-images).

Using cloud storage and the [runtime env](runtime-environments) is a less preferred method as it may not be as reproducible as the container path, but it's still viable. In this case, use the runtime environment option to download zip files containing code and other private modules from cloud storage, in addition to specifying the pip packages needed to run your application.
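
As a rough sketch of that option, the snippet below attaches a runtime environment that points at a zipped code archive in cloud storage; the bucket path and package list are assumptions for illustration.

```python
# A minimal sketch, assuming your packaged code was uploaded to S3 ahead of time
# and that the cluster has credentials and libraries needed to read the bucket.
import ray

ray.init(
    runtime_env={
        "working_dir": "s3://example-bucket/releases/my_app.zip",  # zipped code and private modules
        "pip": ["torch", "transformers"],  # packages your application needs
    }
)
```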