
[k8s] Refactored k8s operator to use kopf for controller logic #15787

Merged: 44 commits merged into ray-project:master on Jun 1, 2021

Conversation

@tgaddair (Contributor) commented May 13, 2021

Why are these changes needed?

The current implementation of the k8s Ray Operator relies on watching the event stream to handle lifecycle events like cluster creation, update, and deletion. However, the event stream is unreliable because (1) events do not provide any durability or exactly-once processing guarantees, and (2) the Ray Operator does not persist any state internally, so it cannot recover gracefully in the event of a Ray Operator pod restart.

kopf is a Python framework that abstracts away the complexity of writing a Kubernetes controller: it manages event handling for CRDs and provides fault tolerance and state management out of the box. By integrating with kopf, the Ray Operator should be more reliable in the face of restarts and other intermittent errors.
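
For readers unfamiliar with kopf, here is a minimal handler sketch; the CRD group/version/plural and the handler body are assumptions for illustration, not the operator code in this PR:

```python
import kopf

# kopf records per-handler progress in annotations/status on the custom resource
# itself, so a restarted operator pod can resume where it left off instead of
# losing events. The CRD group/version/plural below are assumptions.
@kopf.on.create("cluster.ray.io", "v1", "rayclusters")
def create_fn(spec, name, namespace, logger, **kwargs):
    logger.info(f"Creating Ray cluster {namespace}/{name}")
    # ... launch the head/worker pods described by spec ...
```

A standalone script like this would typically be launched with `kopf run operator.py`; kopf then watches the CRD and drives the registered handlers.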

Related issue number

Fixes #15744.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

cc @DmitriGekhtman @richardliaw

@DmitriGekhtman (Contributor)

If I understand right, a watch on a k8s API object just responds to updates to the object's resourceVersion.
I imagine @kopf.on.update probably does more or less the same thing under the hood? Maybe with some sort of improvement to robustness? Curious to take a look and see...

I don't think the issues observed with Ray 1.3.0 are with the watch itself -- it's mostly that events are handled plain wrong.
That said, it looks like using kopf could definitely make things simpler and less error-prone.

@tgaddair (Contributor, Author)

> If I understand right, a watch on a k8s API object just responds to updates to the object's resourceVersion.
> I imagine @kopf.on.update probably does more or less the same thing under the hood? Maybe with some sort of improvement to robustness? Curious to take a look and see...
>
> I don't think the issues observed with Ray 1.3.0 are with the watch itself -- it's mostly that events are handled plain wrong.
> That said, it looks like using kopf could definitely make things simpler and less error-prone.

The biggest difference between using the watch and on.update is around robustness, retries, etc. See this section for a more detailed overview of some of the features kopf provides in this area: https://kopf.readthedocs.io/en/stable/continuity/

This is not something that watching the event stream alone would provide.
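
To illustrate the kind of robustness the continuity docs describe, here is a sketch of a handler with a declared retry policy; the resource names and the readiness check are assumptions, not this PR's code:

```python
import kopf

def _cluster_reachable(name: str) -> bool:
    # Hypothetical readiness probe; a real operator would check the head pod here.
    return True

@kopf.on.update("cluster.ray.io", "v1", "rayclusters", retries=5, backoff=10)
def update_fn(spec, name, logger, **kwargs):
    if not _cluster_reachable(name):
        # TemporaryError asks kopf to re-run this handler later; the retry state
        # is stored on the resource, so it survives an operator pod restart.
        raise kopf.TemporaryError("Cluster not reachable yet, retrying.", delay=30)
    logger.info(f"Reconciling updated spec for {name}")
```

Because the retry and progress state lives on the resource itself rather than in operator memory, a restarted operator pod resumes pending retries instead of dropping them.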

(Two review threads on python/ray/ray_operator/operator.py, now outdated and resolved.)
@DmitriGekhtman (Contributor) commented May 20, 2021

Just a heads up on the merge conflicts:
Un-templated versions of the cluster_scoped/namespaced operators have moved to deploy/components.
Helm versions are in deploy/charts/ray/templates (will merge those into one file at some point for easier maintenance).

To add Kopf to the Python dependencies for Ray images, add it to docker/ray-deps/Dockerfile

@richardliaw (Contributor)

@DmitriGekhtman may I suggest you help resolve the merge conflicts to land this PR in 1.4?

@DmitriGekhtman (Contributor)

> @DmitriGekhtman may I suggest you help resolve the merge conflicts to land this PR in 1.4?

Yeah -- let me take care of both of the logistical things from my last comment.

@DmitriGekhtman (Contributor)

Took care of the above logistics and fixed the lint complaints from the local pre-push hook.

@tgaddair (Contributor, Author)

I think everything should be aligned with the review comments at this point; please feel free to take another look.

@DmitriGekhtman (Contributor)

Looks good! Last thing is to align on the queue detail.

@tgaddair (Contributor, Author)

Cool! Sounds good to me.

@DmitriGekhtman (Contributor)

okidoki -- docs update and e2e tests are looking good! Will ping reviewer for review.

@DmitriGekhtman (Contributor) commented May 29, 2021

Final changes:

  • Moved operator to main process
  • Improved teardown logic

The sequence of events when the RayCluster API resource is deleted is now:

  1. The monitor loop exits normally, and the monitor subprocess is joined.
  2. ray down is called for an attempt at orderly cluster shutdown.
  3. kopf removes the finalizer on the RayCluster resource, and the resource is deleted.

I hope step 2 will mitigate some weird pod-deletion-blocking RPC errors that have been recently spotted.
Consistently stopping the monitor before the head pod should prevent the errors @tgaddair has been seeing lately.
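
For illustration, a rough sketch of that teardown order as a kopf deletion handler follows; the config path and the monitor handling are assumptions, and the merged code differs in detail:

```python
import subprocess
import kopf

@kopf.on.delete("cluster.ray.io", "v1", "rayclusters")
def delete_fn(name, namespace, logger, **kwargs):
    # 1. Stop the autoscaling monitor first so it cannot race with head-pod deletion
    #    (in the PR this is a subprocess that exits and is joined; elided here).

    # 2. Best-effort orderly shutdown of the cluster's pods via the autoscaler CLI.
    config_path = f"/tmp/ray/{namespace}-{name}.yaml"  # hypothetical config location
    subprocess.run(["ray", "down", "-y", config_path], check=False)

    # 3. When this handler returns, kopf removes its finalizer and the API server
    #    deletes the RayCluster resource.
    logger.info(f"Finished teardown for {namespace}/{name}")
```

kopf only removes its finalizer once the deletion handler completes, which is what keeps the RayCluster resource around until the best-effort ray down has run.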

@AmeerHajAli (Contributor)

@richardliaw, is this a release blocker? If so, can you please provide some information on why we should treat this new feature as a release blocker?

@DmitriGekhtman (Contributor)

@richardliaw or @edoakes could you merge this into master?

@DmitriGekhtman (Contributor) commented Jun 1, 2021

> @richardliaw, is this a release blocker? If so, can you please provide some information on why we should treat this new feature as a release blocker?

Note, this is a stability fix, not a new feature.
We'll defer to @richardliaw on picking it into 1.4.0.

@@ -103,6 +103,8 @@
     "opentelemetry-exporter-otlp==1.1.0"
 ]
 }
+if sys.version_info >= (3, 7, 0):
+    extras["k8s"].append("kopf")
A contributor commented on this diff:

this may not work, but let's see :)

@DmitriGekhtman (Contributor) commented Jun 1, 2021:

I think it's actually not important here -- just installing in the docker image is enough.

@ddelange (Contributor) commented Jul 26, 2022:

It seems this PR broke support for py36, ref #26886.

It seems that kopf never supported py36: is there a fallback, or is this a reason to open a 'drop py36 support' tracker issue?

A contributor replied:

The issue is with the documentation, which is missing this fact: py36 is not supported in the operator image, but it is perfectly fine to use a py37 operator image with py36 Ray images as long as the Ray versions are the same.

@richardliaw merged commit 050a076 into ray-project:master on Jun 1, 2021
@richardliaw (Contributor)

Nice work here! Let's not cherry-pick this to 1.4.


Successfully merging this pull request may close these issues:

  • [kubernetes] Race condition with K8s Operator and CRD