Fix: Raise the reconciliation timeout from 10 to 30s. #13323
Conversation
Codecov Report
Base: 86.50% // Head: 86.45% // Decreases project coverage by 0.05%.
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main   #13323      +/-   ##
==========================================
- Coverage   86.50%   86.45%   -0.05%
==========================================
  Files         196      196
  Lines       14544    14548       +4
==========================================
- Hits        12581    12578       -3
- Misses       1664     1670       +6
- Partials      299      300       +1
```
☔ View full report at Codecov. |
Changing one hard-coded value for another, with no way to override it, does not seem ideal. Since some resources may require different values, my order of preference would be:
But I could be missing something obvs here :) |
@vaikas I think my preference would be to make this a common global variable which folks can then choose to expose as a flag. Unlike before, this value wasn't arbitrary: it is the maximum webhook `timeoutSeconds`. That said, some reconcilers reconcile multiple resources, which is why I'm definitely receptive to something configurable, but I'm too tired from debugging this to have strong opinions on where it goes or what it's called 🤣 |
This change imports `./hack/patches/13323-serving.patch` to carry a patch for the above PR, which is creating unnecessary friction when used in conjunction with sigstore policy-controller, complex policies, multiple containers, and slow registries.
Yeah, I think a global is fine, and as long as it's configurable I'm fine with it. Would it be bad to put it in `knative.dev/pkg/apis/contexts.go`, which has a bunch of contexts already? |
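For what it's worth, a minimal sketch of what that shared variable could look like (the package and variable name here are illustrative, not the actual knative.dev/pkg API):

```go
// Package reconciler is used here only as an illustrative home for the
// shared timeout; the package and name knative.dev/pkg settles on may differ.
package reconciler

import "time"

// DefaultTimeout bounds a single reconciliation. 30s matches the maximum
// admission webhook timeoutSeconds, so Knative's outbound API calls never
// artificially shorten a webhook's configured timeout. Downstream
// controllers can override it, or expose it as a flag.
var DefaultTimeout = 30 * time.Second
```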
Serving currently uses a fixed constant for the context timeout on reconciliation, which is causing some bad interactions with certain policy webhooks. Based on feedback on knative/serving#13323, I am hoisting this into a shared variable so that folks can turn it into a flag and make it configurable.
I opened knative/pkg#2597 for this. Since this is intended to bound reconciliation, I went with |
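As a hedged sketch of how a generated reconciler might apply that bound (simplified; the real `genreconciler` wiring differs, and `DefaultTimeout` is the hypothetical name from above):

```go
package reconciler

import (
	"context"
	"time"
)

// DefaultTimeout mirrors the shared variable sketched earlier (hypothetical name).
var DefaultTimeout = 30 * time.Second

// processItem bounds one reconciliation; the deadline propagates into
// every outbound API request made with this ctx.
func processItem(ctx context.Context, key string, reconcile func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(ctx, DefaultTimeout)
	defer cancel()
	return reconcile(ctx, key)
}
```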
I wonder if we should have a constant with a comment here?
/lgtm |
/hold I don't want this going in before it uses the global |
This was actually something I had asked for a fairly long time ago, which I had wanted to move into `genreconciler` to encourage folks to move "long" work off of the main reconciliation thread. `10s` was sort of pulled out of thin air, and seemed like a good idea at the time! However, this arbitrary choice has come back to bite me!

Investigating a rash of timeouts in the sigstore policy-controller webhook, I noticed that a great deal of requests were timing out at `10s` despite the webhook having a `25s` `timeoutSeconds`. Breaking this down by resource kind, I found that the only resources timing out at `9-10s` were ALL Deployments; ReplicaSets and Pods had some timeouts, but almost all were at the `25s` mark.

So basically, when Knative reconciles Deployments, it sets a `10s` deadline on its outbound API requests, including Deployment updates, which pass this deadline on to downstream requests such as webhooks.

My reasoning for advocating `30s` is not arbitrary: it is the maximum value that a webhook can have for its `timeoutSeconds` field (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#timeouts). This means that Knative wouldn't be artificially lowering the webhook timeout of any webhooks configured on the resources it manages directly.
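To make the failure mode concrete, here is a hedged sketch using standard client-go calls (namespace and name are placeholders; `10s` is the old constant):

```go
package reconcilersketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func updateDeployment(ctx context.Context, client kubernetes.Interface) error {
	// Old behavior: the reconcile context carried a 10s deadline.
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	d, err := client.AppsV1().Deployments("default").Get(ctx, "my-app", metav1.GetOptions{})
	if err != nil {
		return err
	}
	// When the context deadline expires, client-go cancels the in-flight
	// request, and the API server aborts processing it, including any
	// admission webhook call, even one whose timeoutSeconds allows 25s.
	_, err = client.AppsV1().Deployments("default").Update(ctx, d, metav1.UpdateOptions{})
	return err
}
```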
Force-pushed from ce51179 to 0f6ee82.
/unhold Ok, this is rebased on the latest deps update and using the global now. Do we want to expose a flag in the main controller so that this value is configurable, or wait until the new raised limit is insufficient? |
I'd lean towards exposing it now. |
Ok, there is now a flag for this. |
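For readers following along, flag wiring along these lines would do it (the flag name and variable are illustrative; the PR's actual flag may differ):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// DefaultTimeout stands in for the shared variable in knative.dev/pkg.
var DefaultTimeout = 30 * time.Second

func main() {
	// Let operators raise the bound, e.g. when one reconciler fans out
	// across many resources and 30s proves too tight.
	flag.DurationVar(&DefaultTimeout, "reconciliation-timeout",
		DefaultTimeout, "deadline applied to each reconciliation")
	flag.Parse()
	fmt.Println("reconciliation timeout:", DefaultTimeout)
}
```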
/lgtm |
Thanks!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mattmoor, psschwei, vaikas |
cc @dprotaso