
Fix: Raise the reconciliation timeout from 10 to 30s. #13323

Merged
3 commits into knative:main
Sep 25, 2022

Conversation

mattmoor
Member

This timeout was actually something I asked for a fairly long time ago and wanted moved into genreconciler to encourage folks to move "long" work off of the main reconciliation thread. 10s was sort of pulled out of thin air, and it seemed like a good idea at the time! However, this arbitrary choice has come back to bite me!

Investigating a rash of timeouts in the sigstore policy-controller webhook, I noticed that a great deal of requests were timing out at 10s despite the webhook having a 25s timeoutSeconds.

Breaking this down by resource kind, I found that the requests timing out at 9-10s were ALL for Deployments, whereas ReplicaSets and Pods had some timeouts too, but almost all of those were at the 25s mark.

So basically when Knative reconciles Deployments, it is setting a 10s deadline on its outbound API requests, including Deployment updates, which pass this deadline on to downstream requests such as webhooks.

My reasoning for advocating 30s is not arbitrary: it is the maximum value that a webhook can have for its timeoutSeconds field (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#timeouts).

This means that Knative wouldn't be artificially lowering the webhook timeout on any webhooks configured on the resources it is managing directly.

Increase the outbound context deadline in reconcilers to 30s (from 10s) to match the maximum K8s webhook timeout.
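
To make the mechanism concrete, here is a minimal sketch of the pattern described above (the function and variable names are invented for illustration; this is not the actual genreconciler code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reconcileTimeout bounds a single reconciliation. Because the context flows
// into every outbound API request, the apiserver (and any admission webhook
// it calls on our behalf) effectively inherits whatever time is left.
const reconcileTimeout = 30 * time.Second // previously 10 * time.Second

// reconcile is a stand-in for a genreconciler-style wrapper around Reconcile.
func reconcile(ctx context.Context, doWork func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, reconcileTimeout)
	defer cancel()
	return doWork(ctx)
}

func main() {
	err := reconcile(context.Background(), func(ctx context.Context) error {
		// A real reconciler would make client calls here, e.g. a Deployment
		// update, and those requests would carry the deadline set above.
		deadline, _ := ctx.Deadline()
		fmt.Println("outbound requests must finish by:", deadline)
		return nil
	})
	if err != nil {
		fmt.Println("reconcile failed:", err)
	}
}
```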

cc @dprotaso

@knative-prow knative-prow bot added the area/API, area/autoscale, area/networking, and size/S labels on Sep 20, 2022
@mattmoor
Member Author

cc @evankanderson @vaikas

@codecov

codecov bot commented Sep 20, 2022

Codecov Report

Base: 86.50% // Head: 86.45% // Decreases project coverage by -0.04% ⚠️

Coverage data is based on head (83dfafa) compared to base (2332731).
Patch coverage: 73.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13323      +/-   ##
==========================================
- Coverage   86.50%   86.45%   -0.05%     
==========================================
  Files         196      196              
  Lines       14544    14548       +4     
==========================================
- Hits        12581    12578       -3     
- Misses       1664     1670       +6     
- Partials      299      300       +1     
| Impacted Files | Coverage Δ |
| --- | --- |
| cmd/controller/main.go | 0.00% <0.00%> (ø) |
| pkg/reconciler/autoscaling/hpa/hpa.go | 91.66% <100.00%> (ø) |
| pkg/reconciler/autoscaling/kpa/kpa.go | 95.26% <100.00%> (ø) |
| pkg/reconciler/configuration/configuration.go | 82.93% <100.00%> (-1.43%) ⬇️ |
| pkg/reconciler/domainmapping/reconciler.go | 93.68% <100.00%> (ø) |
| pkg/reconciler/gc/reconciler.go | 100.00% <100.00%> (ø) |
| pkg/reconciler/labeler/labeler.go | 100.00% <100.00%> (ø) |
| pkg/reconciler/nscert/nscert.go | 72.03% <100.00%> (ø) |
| pkg/reconciler/revision/revision.go | 92.13% <100.00%> (ø) |
| pkg/reconciler/route/route.go | 80.06% <100.00%> (ø) |
| ... and 3 more | |


@vaikas
Contributor

vaikas commented Sep 20, 2022

Changing one hard-coded value to another hard-coded value, with no way to override it, does not seem ideal. Since it seems like some resources may require different values, I think my order of preference would be:

  • make this configurable across the board as a config option
  • make this configurable by resource type

But I could be missing something obvs here :)

@mattmoor
Member Author

@vaikas I think my preference would be to make this a common global variable, which folks can then choose to expose as a flag via flag.DurationVar, but I don't feel strongly about it. Where would you want it?

re: 30s

Unlike before, this wasn't arbitrary: it is the maximum webhook timeoutSeconds. That said, some reconcilers reconcile multiple resources, which is why I'm definitely receptive to something configurable, but I'm too tired from debugging this to have strong opinions on where it goes or what it's called 🤣
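
Roughly the shape of that idea, as a self-contained sketch (the variable and flag names below are placeholders, not identifiers that actually exist in knative.dev/pkg or serving):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// defaultTimeout stands in for the proposed shared variable: a package-level
// time.Duration that reconcilers read when setting their outbound context
// deadline. It is a var (not a const) precisely so a binary can override it.
var defaultTimeout = 30 * time.Second

func main() {
	// A binary that wants this configurable just points a flag at the variable.
	flag.DurationVar(&defaultTimeout, "reconciliation-timeout", defaultTimeout,
		"deadline applied to each reconciliation and its outbound API requests")
	flag.Parse()

	fmt.Println("reconcilers will use a timeout of", defaultTimeout)
}
```

Running with e.g. `-reconciliation-timeout=45s` would then raise the deadline without a recompile.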

mattmoor added a commit to mattmoor/hakn that referenced this pull request Sep 20, 2022
This change imports `./hack/patches/13323-serving.patch` to carry a patch for the above PR, which is creating unnecessary friction when used in conjunction with sigstore policy-controller, complex policies, multiple containers, and slow registries.
mattmoor added a commit to chainguard-dev/hakn that referenced this pull request Sep 20, 2022
This change imports `./hack/patches/13323-serving.patch` to carry a patch for the above PR, which is creating unnecessary friction when used in conjunction with sigstore policy-controller, complex policies, multiple containers, and slow registries.
@vaikas
Contributor

vaikas commented Sep 20, 2022

Yeah, I think a global flag is fine; as long as it's configurable, I'm fine with it. Would it be bad to put it in knative.dev/pkg/apis/contexts.go, which already has a bunch of contexts?
Or in knative.dev/pkg/reconciler/, since it already contains the common reconciler shape?
I'm only thinking that, this way, other things built on Knative could get bennies from it too.

mattmoor added a commit to mattmoor/pkg that referenced this pull request Sep 21, 2022
Serving currently uses a fixed constant for the context timeout on reconciliation, which is causing some bad interactions with certain policy webhooks.

Based on feedback on knative/serving#13323 I am hoisting this into a shared variable, so that folks can turn it into a flag, so it is configurable.
@mattmoor
Member Author

I opened knative/pkg#2597 for this. Since this is intended to bound reconciliation, I went with reconciler.DefaultTimeout. Once that goes in and the nightly dep update pulls it into serving, I'll rebase this.

@evankanderson
Member

I wonder if we should have a constant with a comment here?

@mattmoor
Member Author

@evankanderson see knative/pkg#2597

knative-prow bot pushed a commit to knative/pkg that referenced this pull request Sep 21, 2022
Serving currently uses a fixed constant for the context timeout on reconciliation, which is causing some bad interactions with certain policy webhooks.

Based on feedback on knative/serving#13323 I am hoisting this into a shared variable, so that folks can turn it into a flag, so it is configurable.
@evankanderson
Member

/lgtm

@knative-prow knative-prow bot added the lgtm label on Sep 21, 2022
@mattmoor
Member Author

/hold

I don't want this going in before it uses the global

@knative-prow knative-prow bot added the do-not-merge/hold label on Sep 21, 2022
@knative-prow knative-prow bot added the size/M label and removed the lgtm and size/S labels on Sep 21, 2022
@mattmoor
Member Author

/unhold

Ok, this is rebased on the latest deps update and using the global now.

Do we want to expose a flag in the main controller so that this value is configurable, or wait until the newly raised limit proves insufficient?

@knative-prow knative-prow bot removed the do-not-merge/hold label on Sep 21, 2022
@psschwei
Contributor

Do we want to expose a flag in the main controller so that this value is configurable, or wait until the newly raised limit proves insufficient?

I'd lean towards exposing it now.

@mattmoor
Member Author

Ok, there is now a flag for this.
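
For readers following along, the wiring looks roughly like this sketch. It assumes reconciler.DefaultTimeout is a plain time.Duration variable, as described in knative/pkg#2597, and the flag name shown is illustrative rather than necessarily the one this PR added:

```go
package main

import (
	"flag"
	"log"

	"knative.dev/pkg/reconciler"
)

func main() {
	// Point a flag at the shared reconciler.DefaultTimeout variable so that
	// operators can raise or lower the per-reconcile deadline (and therefore
	// the deadline propagated to outbound API requests) without recompiling.
	flag.DurationVar(&reconciler.DefaultTimeout, "reconciliation-timeout",
		reconciler.DefaultTimeout,
		"deadline applied to each reconciliation and the API requests it makes")
	flag.Parse()

	log.Printf("using reconciliation timeout %v", reconciler.DefaultTimeout)
	// ...the rest of the controller's main() would follow here.
}
```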

@vaikas
Contributor

vaikas commented Sep 25, 2022

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm label on Sep 25, 2022
@psschwei
Contributor

Thanks!
/approve

@knative-prow

knative-prow bot commented Sep 25, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, psschwei, vaikas


@knative-prow knative-prow bot added the approved label on Sep 25, 2022
@knative-prow knative-prow bot merged commit ed3515c into knative:main Sep 25, 2022