ResourceQuota error isn't reflected in kservice status #4416

juliaguo · 2019-06-18T22:53:00Z

In what area(s)?

/area API

What version of Knative?

HEAD

Expected Behavior

Default namespace contains a LimitRange that limits defaultRequest CPU to 100m. Created a ResourceQuota in the same namespace with CPU quota set to 50m. Tried to serve requests to an app deployed in the same namespace. Expected to see an error message when running kubectl get kservice or kubectl get pods saying that there was a failure since the resourcequota was exceeded.

Actual Behavior

Cannot hit the service (loading is stuck). kubectl get kservice shows the app as Ready, with no mention of the quota error in the status. No mention of pod creation failure. Only digging further down and looking at the yaml of the deployment shows the error.

Status of kservice:

status:
  address:
    hostname: testapp.default.svc.cluster.local
    url: http:https://testapp.default.svc.cluster.local
  conditions:
  - lastTransitionTime: 2019-05-09T23:14:18Z
    status: "True"
    type: ConfigurationsReady
  - lastTransitionTime: 2019-06-17T17:12:57Z
    message: build cannot be migrated forward.
    reason: build
    severity: Warning
    status: "False"
    type: Convertible
  - lastTransitionTime: 2019-05-09T23:14:19Z
    status: "True"
    type: Ready
  - lastTransitionTime: 2019-05-09T23:14:19Z
    status: "True"
    type: RoutesReady
  domain: testapp.default.example.com
  domainInternal: testapp.default.svc.cluster.local
  latestCreatedRevisionName: testapp-ncngm
  latestReadyRevisionName: testapp-ncngm
  observedGeneration: 1
  traffic:
  - latestRevision: true
    percent: 100
    revisionName: testapp-ncngm
  url: http:https://testapp.default.example.com

Status of deployment:

status:
  conditions:
  - lastTransitionTime: 2019-05-09T23:14:08Z
    lastUpdateTime: 2019-06-17T17:13:02Z
    message: ReplicaSet "testapp-ncngm-deployment-6cbf59d7b9" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-06-18T22:06:36Z
    lastUpdateTime: 2019-06-18T22:06:36Z
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: 2019-06-18T22:06:36Z
    lastUpdateTime: 2019-06-18T22:06:36Z
    message: 'pods "testapp-ncngm-deployment-6cbf59d7b9-cjrjl" is forbidden: exceeded
      quota: new-cpu-quota, requested: cpu=225m, used: cpu=200m, limited: cpu=50m'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 6
  unavailableReplicas: 1

Steps to Reproduce the Problem

Create a limit range in a namespace, setting the default CPU (or any resource) to a value.
Create ResourceQuota in the same namespace, setting the quota for the resource to a smaller value than the default.
Try to serve requests from an app in the same namespace.

I believe the issue might be related to how the deployment is being reconciled. It looks like there is an "Error getting pods" message that is getting logged but the status of the revision/kservice do not get updated. Also the logic is checking that deployment.Status.AvailableReplicas == 0, which might not match all cases where pod creation has failed (for example, if 2 replicas have already been created, and the 3rd replica exceeds the ResourceQuota limit). Would it be possible to use the UnavailableReplicas value in the deployment instead?

Code for reference: https://github.com/knative/serving/blob/master//pkg/reconciler/revision/reconcile_resources.go#L36:22

The text was updated successfully, but these errors were encountered:

dgerd · 2019-06-20T18:10:34Z

Thanks for the bug report. This looks like something that we should bubble up into the Service status.

Added API label and moved into Serving 0.8.

dprotaso · 2020-07-20T14:19:31Z

/reopen due to this comment https://knative.slack.com/archives/CA4DNJ9A4/p1595251772209500

knative-prow-robot · 2020-07-20T14:19:39Z

@dprotaso: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jonjohnsonjr · 2020-07-21T17:43:39Z

/unassign

evankanderson · 2021-03-22T07:32:26Z

/good-first-issue

It seems like this should be fairly easy to write a test for:

Enable ResourceQuota on a namespace
Deploy a Knative Service without resource requests
Check the status on the created Revision, which should fail (possibly after 2 minutes)

/triage accepted

knative-prow-robot · 2021-03-22T07:32:27Z

@evankanderson:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

It seems like this should be fairly easy to write a test for:

Enable ResourceQuota on a namespace

Deploy a Knative Service without resource requests

Check the status on the created Revision, which should fail (possibly after 2 minutes)

/triage accepted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

psschwei · 2022-06-20T17:47:26Z

One thing to note is that the failure won't show until the pod's progress deadline is exceeded. The default value is 10 minutes, so it'll take some time for it to fail (though the progress deadline can be configured lower either globally or on a per-revision basis).

That said, the error message still doesn't reference the resource quota issue. Instead, it will be something like Revision "hello-00001" failed with message: Initial scale was never achieved.

yuzisun · 2022-06-20T20:08:53Z

One thing to note is that the failure won't show until the pod's progress deadline is exceeded. The default value is 10 minutes, so it'll take some time for it to fail (though the progress deadline can be configured lower either globally or on a per-revision basis).

That said, the error message still doesn't reference the resource quota issue. Instead, it will be something like Revision "hello-00001" failed with message: Initial scale was never achieved.

@psschwei any advice how to fix this? We’d like to contribute the fix if possible.

yuzisun · 2022-06-20T23:03:56Z

seems the deployment status is only propagated if the revision is active?

https://github.com/knative/serving/blob/main/pkg/reconciler/revision/reconcile_resources.go#L75

psschwei · 2022-06-21T01:53:34Z

@psschwei any advice how to fix this? We’d like to contribute the fix if possible.

To get that info into the error message, we'd need to somehow propagate the deployment info into the initial scale error message (which is created here). Off the top of my head, not sure what the best way to do that would be...

rhuss · 2022-08-03T14:51:21Z

Would it be an good first step to mirror that Deployment status in the associated Revision? We then still can think about how to propagate that back to a Service (which btw can be associated to multiple revisions via traffic split, so I don't necessary think that deployment-related errors should bubble up until the service, except when we would collect them in a list)

psschwei · 2022-08-04T14:11:13Z

Would it be an good first step to mirror that Deployment status in the associated Revision?

I went back and looked at this on v1.6, and it looks like the quota errors are showing on both the revision and the service.

Revision:

$ k get revision -n rq-test hello-00001 -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-08-04T13:46:07Z",
    "message": "The target is not receiving traffic.",
    "reason": "NoTraffic",
    "severity": "Info",
    "status": "False",
    "type": "Active"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "status": "Unknown",
    "type": "ContainerHealthy"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "pods \"hello-00001-deployment-54bf4b6774-g8l79\" is forbidden: exceeded quota: rq-e2e-test, requested: cpu=525m, used: cpu=0, limited: cpu=50m",
    "reason": "FailedCreate",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "pods \"hello-00001-deployment-54bf4b6774-g8l79\" is forbidden: exceeded quota: rq-e2e-test, requested: cpu=525m, used: cpu=0, limited: cpu=50m",
    "reason": "FailedCreate",
    "status": "False",
    "type": "ResourcesAvailable"
  }
]

Service:

$ k get ksvc -n rq-test hello -o json | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "Revision \"hello-00001\" failed with message: pods \"hello-00001-deployment-54bf4b6774-g8l79\" is forbidden: exceeded quota: rq-e2e-test, requested: cpu=525m, used: cpu=0, limited: cpu=50m.",
    "reason": "RevisionFailed",
    "status": "False",
    "type": "ConfigurationsReady"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "Configuration \"hello\" does not have any ready Revision.",
    "reason": "RevisionMissing",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2022-08-04T13:45:27Z",
    "message": "Configuration \"hello\" does not have any ready Revision.",
    "reason": "RevisionMissing",
    "status": "False",
    "type": "RoutesReady"
  }
]

Off the top of my head, not sure what exactly changed between 1.4 and 1.6 to get these in there, but in there they are 😄

rhuss · 2022-08-05T07:33:22Z

ah, great. So I guess we could close this issue then ? Would be great to find out though when the fix went in ;-)

psschwei · 2022-08-05T14:26:47Z

Checking on what the exact fix was... not showing in v1.5, so it was something in the last release

houshengbo · 2023-08-22T19:20:38Z

@dprotaso Is this issue still valid? Is there anyone working on it?

dprotaso · 2023-08-23T14:28:40Z

/unassign @dprotaso

I'm currently not working on this - it is up for grabs

xiangpingjiang · 2023-08-23T14:46:26Z

/assign

houshengbo · 2023-08-23T23:29:11Z

/assign @xiangpingjiang Are you going to take over this issue?

xiangpingjiang · 2023-08-24T00:28:41Z

@houshengbo yes, I want have a try

xiangpingjiang · 2023-08-27T11:41:05Z

/assign

husnialhamdani · 2023-09-29T17:11:13Z

/assign

gabo1208 · 2023-10-16T10:10:57Z

Probably fixed here: #14453
I'm testing to see if it's true

It does bubbles up the quota limit errors, too. I'm adding this as fixed issue to the PR

dprotaso · 2023-12-05T19:12:02Z

@gabo1208 were you able to follow up if your changes fixes this issue?

gabo1208 · 2023-12-06T13:23:56Z

Let me test between today and Friday this exact case, but should be fixed, I'll update the issue with the results @dprotaso

gabo1208 · 2023-12-06T16:05:05Z

Using a limit range:

with this service:

kn service create $NAME --image gcr.io/knative-samples/helloworld-go --request cpu=1m --limit cpu=1

yields this:

Creating service 'test-23735' in namespace 'default':

  3.291s The Route is still working to reflect the latest desired specification.
  3.298s Configuration "test-23735" is waiting for a Revision to become ready.
  3.870s Revision "test-23735-00001" failed with message: pods "test-23735-00001-deployment-c885bfff-7vf77" is forbidden: minimum cpu usage per Container is 10m, but request is 1m.
  3.891s Configuration "test-23735" does not have any ready Revision.
Error: RevisionFailed: Revision "test-23735-00001" failed with message: pods "test-23735-00001-deployment-c885bfff-7vf77" is forbidden: minimum cpu usage per Container is 10m, but request is 1m.

Revision status:
 - lastTransitionTime: "2023-12-06T16:26:00Z"
    message: 'pods "test-23735-00001-deployment-c885bfff-7vf77" is forbidden: minimum
      cpu usage per Container is 10m, but request is 1m'
    reason: FailedCreate
    status: "False"
    type: ResourcesAvailable

So yep I think we can close this one cc @dprotaso :)

juliaguo added the kind/bug Categorizes issue or PR as related to a bug. label Jun 18, 2019

dgerd added this to the Serving 0.8 milestone Jun 20, 2019

dgerd added the area/API API objects and controllers label Jun 20, 2019

mattmoor assigned jonjohnsonjr Jun 25, 2019

jonjohnsonjr mentioned this issue Jun 27, 2019

Pod failures #4557

Open

mattmoor modified the milestones: Serving 0.8, Serving "v1" (ready for production) Jul 30, 2019

jonjohnsonjr mentioned this issue Aug 6, 2019

Refactor deployment status propagation #5077

Merged

knative-prow-robot closed this as completed in #5077 Aug 8, 2019

jonjohnsonjr mentioned this issue Aug 14, 2019

Surface Revision quota problems #496

Closed

knative-prow-robot reopened this Jul 20, 2020

dprotaso modified the milestones: Serving "v1" (ready for production), 0.17.x Jul 20, 2020

knative-prow-robot unassigned jonjohnsonjr Jul 21, 2020

mattmoor modified the milestones: 0.17.x, 0.18.x Aug 31, 2020

dprotaso modified the milestones: 0.18.x, 0.19.x Sep 28, 2020

dprotaso modified the milestones: 0.19.x, 0.20.x Nov 16, 2020

dprotaso modified the milestones: 0.20.x, 0.21.x Jan 18, 2021

dprotaso modified the milestones: 0.21.x, 0.22.x Mar 3, 2021

dprotaso reopened this Jun 20, 2022

dprotaso modified the milestones: v1.4.0, Icebox Jun 20, 2022

naveenrajm7 mentioned this issue Mar 3, 2023

Knative revision not reporting quota failure #13708

Closed

dprotaso removed their assignment Aug 23, 2023

knative-prow bot assigned xiangpingjiang Aug 23, 2023

xiangpingjiang removed their assignment Aug 23, 2023

knative-prow bot assigned xiangpingjiang Aug 27, 2023

xiangpingjiang removed their assignment Sep 29, 2023

knative-prow bot assigned husnialhamdani Sep 29, 2023

gabo1208 mentioned this issue Oct 17, 2023

Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation #14453

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ResourceQuota error isn't reflected in kservice status #4416

ResourceQuota error isn't reflected in kservice status #4416

juliaguo commented Jun 18, 2019

dgerd commented Jun 20, 2019

dprotaso commented Jul 20, 2020 •

edited

Loading

knative-prow-robot commented Jul 20, 2020

jonjohnsonjr commented Jul 21, 2020

evankanderson commented Mar 22, 2021

knative-prow-robot commented Mar 22, 2021

psschwei commented Jun 20, 2022

yuzisun commented Jun 20, 2022

yuzisun commented Jun 20, 2022

psschwei commented Jun 21, 2022

rhuss commented Aug 3, 2022

psschwei commented Aug 4, 2022

rhuss commented Aug 5, 2022

psschwei commented Aug 5, 2022

houshengbo commented Aug 22, 2023

dprotaso commented Aug 23, 2023 •

edited

Loading

xiangpingjiang commented Aug 23, 2023

houshengbo commented Aug 23, 2023

xiangpingjiang commented Aug 24, 2023

xiangpingjiang commented Aug 27, 2023

husnialhamdani commented Sep 29, 2023

gabo1208 commented Oct 16, 2023 •

edited

Loading

dprotaso commented Dec 5, 2023

gabo1208 commented Dec 6, 2023

gabo1208 commented Dec 6, 2023 •

edited

Loading

ResourceQuota error isn't reflected in kservice status #4416

ResourceQuota error isn't reflected in kservice status #4416

Comments

juliaguo commented Jun 18, 2019

In what area(s)?

What version of Knative?

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

dgerd commented Jun 20, 2019

dprotaso commented Jul 20, 2020 • edited Loading

knative-prow-robot commented Jul 20, 2020

jonjohnsonjr commented Jul 21, 2020

evankanderson commented Mar 22, 2021

knative-prow-robot commented Mar 22, 2021

psschwei commented Jun 20, 2022

yuzisun commented Jun 20, 2022

yuzisun commented Jun 20, 2022

psschwei commented Jun 21, 2022

rhuss commented Aug 3, 2022

psschwei commented Aug 4, 2022

rhuss commented Aug 5, 2022

psschwei commented Aug 5, 2022

houshengbo commented Aug 22, 2023

dprotaso commented Aug 23, 2023 • edited Loading

xiangpingjiang commented Aug 23, 2023

houshengbo commented Aug 23, 2023

xiangpingjiang commented Aug 24, 2023

xiangpingjiang commented Aug 27, 2023

husnialhamdani commented Sep 29, 2023

gabo1208 commented Oct 16, 2023 • edited Loading

dprotaso commented Dec 5, 2023

gabo1208 commented Dec 6, 2023

gabo1208 commented Dec 6, 2023 • edited Loading

dprotaso commented Jul 20, 2020 •

edited

Loading

dprotaso commented Aug 23, 2023 •

edited

Loading

gabo1208 commented Oct 16, 2023 •

edited

Loading

gabo1208 commented Dec 6, 2023 •

edited

Loading