
Fix: Raise the reconciliation timeout from 10 to 30s. #13323

Merged
3 commits into knative:main
Sep 25, 2022

Conversation

mattmoor
Member

This timeout was actually something I asked for a fairly long time ago and wanted moved into genreconciler to encourage folks to move "long" work off of the main reconciliation thread. 10s was sort of pulled out of thin air, and it seemed like a good idea at the time! However, this arbitrary choice has come back to bite me!

Investigating a rash of timeouts in the sigstore policy-controller webhook, I noticed that a great deal of requests were timing out at 10s despite the webhook having a 25s timeoutSeconds.

Breaking this down by resource kind, I found that the requests timing out at 9-10s were ALL for Deployments, whereas ReplicaSets and Pods had some timeouts too, but almost all of those were at the 25s mark.

So basically when Knative reconciles Deployments, it is setting a 10s deadline on its outbound API requests, including Deployment updates, which pass this deadline on to downstream requests such as webhooks.

My reasoning for advocating 30s is not arbitrary: it is the maximum value that a webhook can have for its timeoutSeconds field (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#timeouts).

This means that Knative wouldn't be artificially lowering the webhook timeout on any webhooks configured on the resources it is managing directly.

Increase the outbound context deadline in reconcilers to 30s (from 10s) to match the maximum K8s webhook timeout.
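
To make the mechanism concrete, here is a minimal sketch of the pattern described above (the function and variable names are invented for illustration; this is not the actual genreconciler code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reconcileTimeout bounds a single reconciliation. Because the context flows
// into every outbound API request, the apiserver (and any admission webhook
// it calls on our behalf) effectively inherits whatever time is left.
const reconcileTimeout = 30 * time.Second // previously 10 * time.Second

// reconcile is a stand-in for a genreconciler-style wrapper around Reconcile.
func reconcile(ctx context.Context, doWork func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, reconcileTimeout)
	defer cancel()
	return doWork(ctx)
}

func main() {
	err := reconcile(context.Background(), func(ctx context.Context) error {
		// A real reconciler would make client calls here, e.g. a Deployment
		// update, and those requests would carry the deadline set above.
		deadline, _ := ctx.Deadline()
		fmt.Println("outbound requests must finish by:", deadline)
		return nil
	})
	if err != nil {
		fmt.Println("reconcile failed:", err)
	}
}
```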

cc @dprotaso

@knative-prow knative-prow bot added the area/API, area/autoscale, area/networking, and size/S labels on Sep 20, 2022
@mattmoor
Member Author

cc @evankanderson @vaikas

@codecov

codecov bot commented Sep 20, 2022

Codecov Report

Base: 86.50% // Head: 86.45% // Decreases project coverage by -0.04% ⚠️

Coverage data is based on head (83dfafa) compared to base (2332731).
Patch coverage: 73.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13323      +/-   ##
==========================================
- Coverage   86.50%   86.45%   -0.05%     
==========================================
  Files         196      196              
  Lines       14544    14548       +4     
==========================================
- Hits        12581    12578       -3     
- Misses       1664     1670       +6     
- Partials      299      300       +1     
| Impacted Files | Coverage Δ |
| --- | --- |
| cmd/controller/main.go | 0.00% <0.00%> (ø) |
| pkg/reconciler/autoscaling/hpa/hpa.go | 91.66% <100.00%> (ø) |
| pkg/reconciler/autoscaling/kpa/kpa.go | 95.26% <100.00%> (ø) |
| pkg/reconciler/configuration/configuration.go | 82.93% <100.00%> (-1.43%) ⬇️ |
| pkg/reconciler/domainmapping/reconciler.go | 93.68% <100.00%> (ø) |
| pkg/reconciler/gc/reconciler.go | 100.00% <100.00%> (ø) |
| pkg/reconciler/labeler/labeler.go | 100.00% <100.00%> (ø) |
| pkg/reconciler/nscert/nscert.go | 72.03% <100.00%> (ø) |
| pkg/reconciler/revision/revision.go | 92.13% <100.00%> (ø) |
| pkg/reconciler/route/route.go | 80.06% <100.00%> (ø) |
| ... and 3 more | |


@vaikas
Contributor

vaikas commented Sep 20, 2022

Changing one hard-coded value to another hard-coded value, with no way to override it, does not seem ideal. Since it seems like some resources may require different values, I think my order of preference would be:

  • make this configurable across the board as a config option
  • make this configurable by resource type

But I could be missing something obvs here :)

@mattmoor
Member Author

@vaikas I think my preference would be to make this a common global variable, which folks can then choose to expose as a flag via flag.DurationVar, but I don't feel strongly about it. Where would you want it?

re: 30s

Unlike before, this wasn't arbitrary: it is the maximum webhook timeoutSeconds. That said, some reconcilers reconcile multiple resources, which is why I'm definitely receptive to something configurable, but I'm too tired from debugging this to have strong opinions on where it goes or what it's called 🤣
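
Roughly the shape of that idea, as a self-contained sketch (the variable and flag names below are placeholders, not identifiers that actually exist in knative.dev/pkg or serving):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// defaultTimeout stands in for the proposed shared variable: a package-level
// time.Duration that reconcilers read when setting their outbound context
// deadline. It is a var (not a const) precisely so a binary can override it.
var defaultTimeout = 30 * time.Second

func main() {
	// A binary that wants this configurable just points a flag at the variable.
	flag.DurationVar(&defaultTimeout, "reconciliation-timeout", defaultTimeout,
		"deadline applied to each reconciliation and its outbound API requests")
	flag.Parse()

	fmt.Println("reconcilers will use a timeout of", defaultTimeout)
}
```

Running with e.g. `-reconciliation-timeout=45s` would then raise the deadline without a recompile.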

mattmoor added a commit to mattmoor/hakn that referenced this pull request Sep 20, 2022
This change imports `./hack/patches/13323-serving.patch` to carry a patch for the above PR, which is creating unnecessary friction when used in conjunction with sigstore policy-controller, complex policies, multiple containers, and slow registries.
mattmoor added a commit to chainguard-dev/hakn that referenced this pull request Sep 20, 2022
This change imports `./hack/patches/13323-serving.patch` to carry a patch for the above PR, which is creating unnecessary friction when used in conjunction with sigstore policy-controller, complex policies, multiple containers, and slow registries.
@vaikas
Contributor

vaikas commented Sep 20, 2022

Yeah, I think a global flag is fine; as long as it's configurable, I'm fine with it. Would it be bad to put it in knative.dev/pkg/apis/contexts.go, which already has a bunch of contexts?
Or in knative.dev/pkg/reconciler/, since it already contains the common reconciler shape?
I'm only thinking that, this way, other things built on Knative could get bennies from it too.

mattmoor added a commit to mattmoor/pkg that referenced this pull request Sep 21, 2022
Serving currently uses a fixed constant for the context timeout on reconciliation, which is causing some bad interactions with certain policy webhooks.

Based on feedback on knative/serving#13323 I am hoisting this into a shared variable, so that folks can turn it into a flag, so it is configurable.
@mattmoor
Member Author

I opened knative/pkg#2597 for this. Since this is intended to bound reconciliation, I went with reconciler.DefaultTimeout. Once that goes in and the nightly dep update pulls it into serving, I'll rebase this.

@evankanderson
Member

I wonder if we should have a constant with a comment here?

@mattmoor
Member Author

@evankanderson see knative/pkg#2597

knative-prow bot pushed a commit to knative/pkg that referenced this pull request Sep 21, 2022
Serving currently uses a fixed constant for the context timeout on reconciliation, which is causing some bad interactions with certain policy webhooks.

Based on feedback on knative/serving#13323 I am hoisting this into a shared variable, so that folks can turn it into a flag, so it is configurable.
@evankanderson
Member

/lgtm

@knative-prow knative-prow bot added the lgtm label on Sep 21, 2022
@mattmoor
Member Author

/hold

I don't want this going in before it uses the global

@knative-prow knative-prow bot added the do-not-merge/hold label on Sep 21, 2022
@knative-prow knative-prow bot added the size/M label and removed the lgtm and size/S labels on Sep 21, 2022
@mattmoor
Member Author

/unhold

Ok, this is rebased on the latest deps update and using the global now.

Do we want to expose a flag in the main controller so that this value is configurable, or wait until the newly raised limit proves insufficient?

@knative-prow knative-prow bot removed the do-not-merge/hold label on Sep 21, 2022
@psschwei
Contributor

Do we want to expose a flag in the main controller so that this value is configurable, or wait until the newly raised limit proves insufficient?

I'd lean towards exposing it now.

@mattmoor
Member Author

Ok, there is now a flag for this.
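
For readers following along, the wiring looks roughly like this sketch. It assumes reconciler.DefaultTimeout is a plain time.Duration variable, as described in knative/pkg#2597, and the flag name shown is illustrative rather than necessarily the one this PR added:

```go
package main

import (
	"flag"
	"log"

	"knative.dev/pkg/reconciler"
)

func main() {
	// Point a flag at the shared reconciler.DefaultTimeout variable so that
	// operators can raise or lower the per-reconcile deadline (and therefore
	// the deadline propagated to outbound API requests) without recompiling.
	flag.DurationVar(&reconciler.DefaultTimeout, "reconciliation-timeout",
		reconciler.DefaultTimeout,
		"deadline applied to each reconciliation and the API requests it makes")
	flag.Parse()

	log.Printf("using reconciliation timeout %v", reconciler.DefaultTimeout)
	// ...the rest of the controller's main() would follow here.
}
```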

@vaikas
Contributor

vaikas commented Sep 25, 2022

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm label on Sep 25, 2022
@psschwei
Contributor

Thanks!
/approve

@knative-prow

knative-prow bot commented Sep 25, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, psschwei, vaikas


@knative-prow knative-prow bot added the approved label on Sep 25, 2022
@knative-prow knative-prow bot merged commit ed3515c into knative:main Sep 25, 2022