
[k8s] Refactored k8s operator to use kopf for controller logic #15787

Merged: 44 commits merged into ray-project:master on Jun 1, 2021

Conversation

@tgaddair (Contributor) commented May 13, 2021

Why are these changes needed?

The current implementation of the k8s Ray Operator relies on watching the event stream to handle lifecycle events like cluster creation, update, and deletion. However, the event stream is unreliable because (1) events do not provide any durability or exactly-once processing guarantees, and (2) the Ray Operator does not persist any state internally, so it cannot recover gracefully in the event of a Ray Operator pod restart.

kopf is a Python framework that abstracts away the complexity of writing a Kubernetes controller: it manages event handling for CRDs and provides fault tolerance and state management out of the box. By integrating with kopf, the Ray Operator should be more reliable in the face of restarts and other intermittent errors.
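
For readers unfamiliar with kopf, here is a minimal handler sketch; the CRD group/version/plural and the handler body are assumptions for illustration, not the operator code in this PR:

```python
import kopf

# kopf records per-handler progress in annotations/status on the custom resource
# itself, so a restarted operator pod can resume where it left off instead of
# losing events. The CRD group/version/plural below are assumptions.
@kopf.on.create("cluster.ray.io", "v1", "rayclusters")
def create_fn(spec, name, namespace, logger, **kwargs):
    logger.info(f"Creating Ray cluster {namespace}/{name}")
    # ... launch the head/worker pods described by spec ...
```

A standalone script like this would typically be launched with `kopf run operator.py`; kopf then watches the CRD and drives the registered handlers.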

Related issue number

Fixes #15744.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

cc @DmitriGekhtman @richardliaw

@DmitriGekhtman (Contributor)

If I understand right, a watch on a k8s API object just responds to updates to the object's resourceVersion.
I imagine @kopf.on.update probably does more or less the same thing under the hood? Maybe with some sort of improvement to robustness? Curious to take a look and see...

I don't think the issues observed with Ray 1.3.0 are with the watch itself -- it's mostly that events are handled plain wrong.
That said, it looks like using kopf could definitely make things simpler and less error-prone.

@tgaddair (Contributor, Author)

> If I understand right, a watch on a k8s API object just responds to updates to the object's resourceVersion.
> I imagine @kopf.on.update probably does more or less the same thing under the hood? Maybe with some sort of improvement to robustness? Curious to take a look and see...
>
> I don't think the issues observed with Ray 1.3.0 are with the watch itself -- it's mostly that events are handled plain wrong.
> That said, it looks like using kopf could definitely make things simpler and less error-prone.

The biggest difference between using the watch and on.update is around robustness, retries, etc. See this section for a more detailed overview of some of the features kopf provides in this area: https://kopf.readthedocs.io/en/stable/continuity/

This is not something that watching the event stream alone would provide.
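
To illustrate the kind of robustness the continuity docs describe, here is a sketch of a handler with a declared retry policy; the resource names and the readiness check are assumptions, not this PR's code:

```python
import kopf

def _cluster_reachable(name: str) -> bool:
    # Hypothetical readiness probe; a real operator would check the head pod here.
    return True

@kopf.on.update("cluster.ray.io", "v1", "rayclusters", retries=5, backoff=10)
def update_fn(spec, name, logger, **kwargs):
    if not _cluster_reachable(name):
        # TemporaryError asks kopf to re-run this handler later; the retry state
        # is stored on the resource, so it survives an operator pod restart.
        raise kopf.TemporaryError("Cluster not reachable yet, retrying.", delay=30)
    logger.info(f"Reconciling updated spec for {name}")
```

Because the retry and progress state lives on the resource itself rather than in operator memory, a restarted operator pod resumes pending retries instead of dropping them.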

(Two review threads on python/ray/ray_operator/operator.py, now outdated and resolved.)
@DmitriGekhtman (Contributor) commented May 20, 2021

Just a heads up on the merge conflicts:
Un-templated versions of the cluster_scoped/namespaced operators have moved to deploy/components.
Helm versions are in deploy/charts/ray/templates (will merge those into one file at some point for easier maintenance).

To add Kopf to the Python dependencies for Ray images, add it to docker/ray-deps/Dockerfile

@richardliaw (Contributor)

@DmitriGekhtman may I suggest you help resolve the merge conflicts to land this PR in 1.4?

@DmitriGekhtman (Contributor)

> @DmitriGekhtman may I suggest you help resolve the merge conflicts to land this PR in 1.4?

Yeah -- let me take care of both of the logistical things from my last comment.

@DmitriGekhtman (Contributor)

Took care of the above logistics and fixed the lint complaints from the local pre-push hook.

@tgaddair (Contributor, Author)

I think everything should be aligned with the review comments at this point; please feel free to take another look.

@DmitriGekhtman (Contributor)

Looks good! Last thing is to align on the queue detail.

@tgaddair (Contributor, Author)

Cool! Sounds good to me.

@DmitriGekhtman (Contributor)

okidoki -- docs update and e2e tests are looking good! Will ping reviewer for review.

@DmitriGekhtman (Contributor) commented May 29, 2021

Final changes:

  • Moved operator to main process
  • Improved teardown logic

The sequence of events when the RayCluster API resource is deleted is now:

  1. The monitor loop exits normally, and the monitor subprocess is joined.
  2. ray down is called for an attempt at orderly cluster shutdown.
  3. kopf removes the finalizer on the RayCluster resource, and the resource is deleted.

I hope step 2 will mitigate some weird pod-deletion-blocking RPC errors that have been recently spotted.
Consistently stopping the monitor before the head pod should prevent the errors @tgaddair has been seeing lately.
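
For illustration, a rough sketch of that teardown order as a kopf deletion handler follows; the config path and the monitor handling are assumptions, and the merged code differs in detail:

```python
import subprocess
import kopf

@kopf.on.delete("cluster.ray.io", "v1", "rayclusters")
def delete_fn(name, namespace, logger, **kwargs):
    # 1. Stop the autoscaling monitor first so it cannot race with head-pod deletion
    #    (in the PR this is a subprocess that exits and is joined; elided here).

    # 2. Best-effort orderly shutdown of the cluster's pods via the autoscaler CLI.
    config_path = f"/tmp/ray/{namespace}-{name}.yaml"  # hypothetical config location
    subprocess.run(["ray", "down", "-y", config_path], check=False)

    # 3. When this handler returns, kopf removes its finalizer and the API server
    #    deletes the RayCluster resource.
    logger.info(f"Finished teardown for {namespace}/{name}")
```

kopf only removes its finalizer once the deletion handler completes, which is what keeps the RayCluster resource around until the best-effort ray down has run.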

@AmeerHajAli (Contributor)

@richardliaw, is this a release blocker? If so, can you please provide some information on why we should treat this new feature as a release blocker?

@DmitriGekhtman (Contributor)

@richardliaw or @edoakes could you merge this into master?

@DmitriGekhtman (Contributor) commented Jun 1, 2021

> @richardliaw, is this a release blocker? If so, can you please provide some information on why we should treat this new feature as a release blocker?

Note, this is a stability fix, not a new feature.
We'll defer to @richardliaw on picking it into 1.4.0.

@@ -103,6 +103,8 @@
     "opentelemetry-exporter-otlp==1.1.0"
 ]
 }
+if sys.version_info >= (3, 7, 0):
+    extras["k8s"].append("kopf")
A contributor commented on this diff:

this may not work, but let's see :)

@DmitriGekhtman (Contributor) commented Jun 1, 2021:

I think it's actually not important here -- just installing in the docker image is enough.

@ddelange (Contributor) commented Jul 26, 2022:

It seems this PR broke support for py36, ref #26886.

It seems that kopf never supported py36: is there a fallback, or is this a reason to open a 'drop py36 support' tracker issue?

A contributor replied:

The issue is with the documentation, which is missing this fact: py36 is not supported in the operator image, but it is perfectly fine to use a py37 operator image with py36 Ray images as long as the Ray versions are the same.

@richardliaw merged commit 050a076 into ray-project:master on Jun 1, 2021
@richardliaw (Contributor)

Nice work here! Let's not cherry-pick this to 1.4.


Successfully merging this pull request may close these issues:

  • [kubernetes] Race condition with K8s Operator and CRD