🐛 Clean up shadow CRDs after API bindings are deleted #2298

kylape · 2022-11-03T12:24:17Z

Summary

When all API bindings that use a particular APIResourceSchema are deleted, the CRDs that back those bindings need to be deleted.

Fixes #2300

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

ncdc · 2022-11-03T12:38:53Z

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

+
+ for _, binding := range bindingList {
+ if len(binding.Status.BoundResources) == 0 {
+ logger.V(1).Info("APIBinding has no bound resources. Need to wait until all bindings have bound resources")


I worry a bit that an "invalid" binding (e.g. referenced APIExport doesn't exist, whatever...) could permanently interrupt the cleanup process

how can that happen?

I don't see how this controller can determine if any CRD can be deleted if it cannot see the set of schemas that every API binding points to. Perhaps there is a way to identify an "invalid" binding such that this controller can safely ignore it?

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

pkg/indexers/apibinding.go

sttts · 2022-11-03T12:50:38Z

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

+ for _, binding := range bindingList {
+ if len(binding.Status.BoundResources) == 0 {
+ logger.V(1).Info("APIBinding has no bound resources. Need to wait until all bindings have bound resources")
+ c.queue.AddAfter(key, time.Second)


why requeue? This happens automatically. You are watching bindings.

Turns out we do in fact need this explicit requeue. If the APIBinding that was discovered to have no bound resources is not actually related to this CRD (we have no reasonable way to know), then we lost perhaps our last opportunity to delete this particular CRD (in the event that it was the deletion event for the last APIBinding for this CRD).

Don't this is true. There is some case when BoundResources gets empty (could be in theory). The solution with requeue is not correct. It causes a potential DoS.

What we could do:

depend on resync to eventually relist the CRD

do something that normally a big nogo: enqueue both old and new object on update.

The APIBinding controller will create the bound CRD, wait for it to become established, and then it will add the BoundResource to the API binding's status. I've seen the potential race condition cause a problem when running e2e tests with -count 20, but not -count 5.

depend on resync to eventually relist the CRD

This seems reasonable for now at least, but how can we run the e2e test if we rely on the resync to catch edge cases? Do we need to modify or remove our new check in the e2e test? Can we change the resync period for the test?

It causes a potential DoS

Could we call queue.AddRateLimited(key) instead if we're only concerned about DoS?

The APIBinding controller will create the bound CRD, wait for it to become established, and then it will add the BoundResource to the API binding's status. I've seen the potential race condition cause a problem when running e2e tests with -count 20, but not -count 5.

Got it. Yes, this is true. Very good insight!

I see how your requeue helps. I am worried that it endlessly requeues. Thinking. Not sure how AddRateLimited really works internally. It's won't keep a giant per-key map with requeues, will it?

Just thinking out loud: could we check for APIResourceSchema existence, and everything else before the actual CRD creation, and only do the latter AFTER setting the BoundResources ?

Wondering whether we run into any other race that way.

Also the creationTime idea would help. Let a CRD live up to a minute or so and only delete then.

It's won't keep a giant per-key map with requeues, will it?

Yes, I'm afraid if some APIBinding somehow never made it to the Bound phase yet also had a valid APIExport, then eventually every CRD key would end up in the requeue map. This should be considered a bug, and we could even alert if the requeue count increased too quickly for the CRD cleanup controller.

could we check for APIResourceSchema existence, and everything else before the actual CRD creation, and only do the latter AFTER setting the BoundResources ?

Possibly, but this seems unintuitive to me. The resource isn't actually bound until the CRD is created!

creationTime idea would help.

If an APIBinding (that has a valid APIExport) never makes it to the Bound phase, then the creationTime idea would only delay the inevitable addition of the CRD to the requeue map.

Barring that edge case (which I believe doesn't exist today and if it did would be considered a bug), I can't think of any other scenarios that aren't covered by the code as-is, but I wouldn't mind adding a creationTime check as a "belts and suspenders" kind of thing.

then the creationTime idea would only delay the inevitable addition of the CRD to the requeue map.

No, I mean to delete the CRD after a minute, so give a minute of time to get over the race.

kylape · 2022-11-04T14:01:00Z

kcp-dev/kubernetes#110 is required for this PR in order to stop KCP from panicking upon CRD deletion.

pkg/reconciler/apis/apibindingdeletion/apibinding_deletion_controller.go

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

stevekuznetsov · 2022-11-05T18:19:30Z

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

+
+ logger := logging.WithObject(logging.WithReconciler(klog.Background(), ControllerName), binding)
+
+ for _, boundResource := range binding.Status.BoundResources {


Normally, we'd have an index on the CRD informer to track CRDs by their binding, and queue all matching CRDs. I think using the list here in status is still correct, but it would be good to ACK that we've thought through the edge cases here to ensure that we can't miss a CRD if it's still related to the binding but is not listed in the bound resources array.

Would the index get updated before this function is called? If yes, then using an index would be no better than listing.

The only edge case I can think of is if an APIResourceSchema was removed from an APIExport. Is that allowed?

Yes, the index will be updated first.

Yes, removing APIRS from an APIExport is possible.

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

stevekuznetsov · 2022-11-05T18:22:41Z

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

+ return nil
+ }
+
+ // The index can't solely be relied on to be certain this CRD can be deleted.


I don't think this is doing what you think this is doing. Right now the c.listAPIBindings() call is using the lister - because of the way the machinery works, whether you access the list of bindings from the lister & filter manually using Go code, or if you are accessing the filtered list via the indexer, you are fundamentally operating against the exact same set of objects in the in-memory cache, and you will never (under normal circumstances) get a different result. I think you might want to use the client to do a live call here, instead? It may be useful then also to add labels or something to the bound CRDs for efficient server-side selection, as CRDs are large objects and having to do a full LIST here will be a heavy call.

My comment here is misleading. I'm not listing here due to a concern about resources in in the index being out of date due to an unprocessed event, etc. I'm instead seeing if any APIBindings are in an unready state to avoid deleting the CRD immediately after it was created. The APIBinding controller will wait for the bound CRD to become established before it adds the corresponding BoundResource to its status, so it's expected that the case where an APIBinding has no BoundResources will be common. This thread is discussing this as well.

Note that if any APIBinding is not ready, we cannot confidently delete any (bound) CRD.

If that all sounds reasonable to you, then I'll update the comment in the code to be less confusing.

I see. Let's add another index on APIBindings (by phase) and simply check the number of them that are pending - and update the comment :)

So I started writing the index, and then I realized that there was another edge case: the APIExport reference in an APIBinding is invalid. This makes the index much less clean than "by phase", some thing like "APIExportIsValidAndPhaseIsNotBound". So I kept the list in place but added a big comment trying to explain why we are doing the list after the index lookup.

I'm not sure it's clear why this new edge case is related to the index.

Note that if any APIBinding is not ready, we cannot confidently delete any (bound) CRD.

The edge case makes this statement untrue since APIBindings with an invalid APIExport will never become ready, and we must exclude them from consideration to prevent every bound CRD from being requeued forever.

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

pkg/reconciler/apis/apibinding/apibinding_reconcile.go

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

sttts · 2022-11-09T14:10:36Z

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

+
+ for _, boundResource := range newBinding.Status.BoundResources {
+ uidSet[boundResource.Schema.UID] = false
+ }


This is strange logic. Why loop over both? Why not just remove those UID that still show up in the new binding?

From the code comment: It looks at old and new versions in case a schema gets removed from an APIExport. In that case, the last APIBinding to have the schema removed will trigger the CRD delete, but only the old version will have the reference to the schema.

What Kyle wrote makes sense to me

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go

pkg/reconciler/apis/crdcleanup/crdcleanup_controller_test.go

ncdc · 2022-11-10T19:58:58Z

LGTM - please squash!
/approve

openshift-ci · 2022-11-10T19:59:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ncdc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ncdc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

When all API bindings that use a particular APIResourceSchema are deleted, the CRDs that back those bindings need to be deleted. Set identity annotation on CRDs during refresh to prevent a panic

csams · 2022-11-11T22:12:06Z

/lgtm

kylape · 2022-11-15T15:50:17Z

/cherry-pick release-0.9

openshift-cherrypick-robot · 2022-11-15T15:50:56Z

@kylape: #2298 failed to apply on top of branch "release-0.9":

Applying: Clean up shadow CRDs after API bindings are deleted
Using index info to reconstruct a base tree...
M	pkg/indexers/apibinding.go
M	pkg/reconciler/apis/apibindingdeletion/apibinding_deletion_controller.go
M	pkg/server/apiextensions.go
M	pkg/server/controllers.go
M	pkg/server/server.go
M	test/e2e/apibinding/apibinding_deletion_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/apibinding/apibinding_deletion_test.go
CONFLICT (content): Merge conflict in test/e2e/apibinding/apibinding_deletion_test.go
Auto-merging pkg/server/server.go
Auto-merging pkg/server/controllers.go
Auto-merging pkg/server/apiextensions.go
Auto-merging pkg/reconciler/apis/apibindingdeletion/apibinding_deletion_controller.go
CONFLICT (content): Merge conflict in pkg/reconciler/apis/apibindingdeletion/apibinding_deletion_controller.go
Auto-merging pkg/indexers/apibinding.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Clean up shadow CRDs after API bindings are deleted
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-0.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kylape · 2022-11-16T02:44:14Z

Fixes #2304

kylape requested a review from sttts November 3, 2022 12:24

openshift-ci bot requested review from davidfestal and shawn-hurley November 3, 2022 12:24

ncdc reviewed Nov 3, 2022

View reviewed changes

sttts reviewed Nov 3, 2022

View reviewed changes

pkg/indexers/apibinding.go Outdated Show resolved Hide resolved

sttts reviewed Nov 3, 2022

View reviewed changes

stevekuznetsov reviewed Nov 5, 2022

View reviewed changes

stevekuznetsov reviewed Nov 7, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

kylape force-pushed the crd-cleanup branch from 0137cb5 to a8ed16b Compare November 9, 2022 04:52

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/apibinding/apibinding_reconcile.go Outdated Show resolved Hide resolved

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Show resolved Hide resolved

sttts reviewed Nov 9, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

ncdc reviewed Nov 10, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

kylape force-pushed the crd-cleanup branch from 98dd011 to 752d592 Compare November 10, 2022 16:51

ncdc reviewed Nov 10, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

ncdc reviewed Nov 10, 2022

View reviewed changes

pkg/reconciler/apis/crdcleanup/crdcleanup_controller.go Outdated Show resolved Hide resolved

pkg/reconciler/apis/crdcleanup/crdcleanup_controller_test.go Outdated Show resolved Hide resolved

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 10, 2022

Clean up shadow CRDs after API bindings are deleted

5f4849b

When all API bindings that use a particular APIResourceSchema are deleted, the CRDs that back those bindings need to be deleted. Set identity annotation on CRDs during refresh to prevent a panic

kylape force-pushed the crd-cleanup branch from 40a83da to 5f4849b Compare November 10, 2022 20:06

kylape requested review from sttts and stevekuznetsov November 11, 2022 14:42

openshift-ci bot assigned csams Nov 11, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 11, 2022

openshift-merge-robot merged commit 4882d9e into kcp-dev:main Nov 11, 2022

kylape deleted the crd-cleanup branch November 12, 2022 02:04

This was referenced Nov 15, 2022

🐛 Clean up shadow CRDs after API bindings are deleted #2366

Merged

Skip instance deletion upon finalization for bound CRDs kcp-dev/kubernetes#110

Closed

kylape mentioned this pull request Nov 16, 2022

bug: kcp panics when trying delete bound CRD and a corresponding apibinding exists #2304

Closed

kylape mentioned this pull request Dec 16, 2022

flake: bound CRD deleted by CRD cleanup controller too aggressively #2507

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Clean up shadow CRDs after API bindings are deleted #2298

🐛 Clean up shadow CRDs after API bindings are deleted #2298

kylape commented Nov 3, 2022 •

edited

Loading

ncdc Nov 3, 2022

sttts Nov 3, 2022

kylape Nov 3, 2022

sttts Nov 3, 2022 •

edited

Loading

kylape Nov 3, 2022

kylape Nov 3, 2022

sttts Nov 4, 2022

kylape Nov 4, 2022 •

edited

Loading

sttts Nov 5, 2022

sttts Nov 5, 2022

sttts Nov 5, 2022

kylape Nov 7, 2022 •

edited

Loading

sttts Nov 9, 2022

kylape commented Nov 4, 2022

stevekuznetsov Nov 5, 2022

kylape Nov 7, 2022

stevekuznetsov Nov 7, 2022

stevekuznetsov Nov 5, 2022

kylape Nov 7, 2022 •

edited

Loading

stevekuznetsov Nov 7, 2022

kylape Nov 7, 2022

kylape Nov 7, 2022 •

edited

Loading

sttts Nov 9, 2022

kylape Nov 9, 2022 •

edited

Loading

ncdc Nov 10, 2022

ncdc commented Nov 10, 2022

openshift-ci bot commented Nov 10, 2022

csams commented Nov 11, 2022

kylape commented Nov 15, 2022

openshift-cherrypick-robot commented Nov 15, 2022

kylape commented Nov 16, 2022


		logger := logging.WithObject(logging.WithReconciler(klog.Background(), ControllerName), binding)

		for _, boundResource := range binding.Status.BoundResources {

🐛 Clean up shadow CRDs after API bindings are deleted #2298

🐛 Clean up shadow CRDs after API bindings are deleted #2298

Conversation

kylape commented Nov 3, 2022 • edited Loading

Summary

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sttts Nov 3, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylape Nov 4, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylape Nov 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylape commented Nov 4, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylape Nov 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylape Nov 7, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kylape Nov 9, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ncdc commented Nov 10, 2022

openshift-ci bot commented Nov 10, 2022

csams commented Nov 11, 2022

kylape commented Nov 15, 2022

openshift-cherrypick-robot commented Nov 15, 2022

kylape commented Nov 16, 2022

kylape commented Nov 3, 2022 •

edited

Loading

sttts Nov 3, 2022 •

edited

Loading

kylape Nov 4, 2022 •

edited

Loading

kylape Nov 7, 2022 •

edited

Loading

kylape Nov 7, 2022 •

edited

Loading

kylape Nov 7, 2022 •

edited

Loading

kylape Nov 9, 2022 •

edited

Loading