-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[k8s] Refactored k8s operator to use kopf for controller logic #15787
Conversation
If I understand right, a watch on an k8s api object just responds to updates to the object's I don't think the issues observed with the Ray 1.3.0 are with the watch itself -- it's mostly that events are handled plain wrong. |
The biggest difference between using the watch and This is not something that watching the event stream alone would provide. |
python/ray/autoscaler/kubernetes/operator_configs/operator_cluster_scoped.yaml
Outdated
Show resolved
Hide resolved
Just heads up on the merge conflicts: To add Kopf to the Python dependencies for Ray images, add it to docker/ray-deps/Dockerfile |
@DmitriGekhtman may I suggest you help resolve the merge conflicts to land this PR in 1.4? |
Yeah -- let me take care of both of the logistical things from my last comment. |
Took care of above logistics, fixed lint complaints from local pre-push hook. |
I think everything should be aligned with the review comments at this point, please feel free to take another look. |
Looks good! Last thing is to align on the queue detail. |
Cool! Sounds good to me. |
okidoki -- docs update and e2e tests are looking good! Will ping reviewer for review. |
Final changes:
The sequence of events when the RayCluster API resource is deleted is now
I hope step 2 will mitigate some weird pod-deletion-blocking RPC errors that have been recently spotted. |
@richardliaw , is this a release blocker. If so, can you please provide some information for why we should treat this new feature as a release blocker? |
@richardliaw or @edoakes could you merge this into master? |
Note, this a stability fix, not a new feature. |
@@ -103,6 +103,8 @@ | |||
"opentelemetry-exporter-otlp==1.1.0" | |||
] | |||
} | |||
if sys.version_info >= (3, 7, 0): | |||
extras["k8s"].append("kopf") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may not work, but let's see :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's actually not important here -- just installing in the docker image is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is with documentation, which is missing this fact: py36 is not supported in the operator image, but it is perfectly fine to use a py37 operator image with py36 Ray images as long as the Ray versions are the same.
Nice work here! Let's not cherry-pick this to 1.4. |
Why are these changes needed?
The current implementation of the k8s Ray Operator relies on watching the event stream to handle lifecycle events like cluster creation, update, and deletion. However, the event stream is unreliable because (1) events do not provide any durability or exactly-once processing guarantees, and (2) the Ray Operator does not persist any state internally and so cannot recover gracefully in the event of Ray Operator pod restart.
kopf is a Python framework that abstracts the complexity of writing a Kubernetes Controller to manage event handling for CRDs with fault tolerance and state management provided out of the box. By integrating with kopf, the Ray Operator should be more reliable in the face of restarts and other intermittent errors.
Related issue number
Fixes #15744.
Checks
scripts/format.sh
to lint the changes in this PR.cc @DmitriGekhtman @richardliaw