Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation #14453

Merged

Conversation

gabo1208
Copy link
Member

@gabo1208 gabo1208 commented Sep 27, 2023

changed activationrequired condition while reconciling the revision so the deployment status is propagated when there is something wrong (like low resources request and limits), since it was just beign propagated when the revision status was nil and the deployment existed

Fixes #9857 #4416

Proposed Changes

  • 🧹 Now the deployment and deployment's replicaset errors surface to the ksvc status correctly

Release Note

🧹 Now the deployment and deployment's replicaset errors surface to the ksvc status correctly

@knative-prow knative-prow bot added area/API API objects and controllers size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 27, 2023
@gabo1208 gabo1208 changed the title Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation [DRAFT] Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation Sep 28, 2023
@gabo1208
Copy link
Member Author

/assign @dprotaso

@gabo1208
Copy link
Member Author

gabo1208 commented Sep 28, 2023

@dprotaso I found out that sometimes when the code gets to this point: https://github.com/knative/serving/blob/main/pkg/reconciler/revision/reconcile_resources.go#L77

The: c := revisionCondSet.Manage(rs).GetCondition(RevisionConditionActive) can be nil, that's the only way we are surfacing deployments errors.

I simply modified IsActivationRequired() function as follows:
func (rs *RevisionStatus) IsActivationRequired(logger *zap.SugaredLogger) bool { c := revisionCondSet.Manage(rs).GetCondition(RevisionConditionActive) logger.Warn(c) return c != nil && c.Status != corev1.ConditionTrue }

so the logger returned the following logs whenever the error was successfully surfaced:
{"severity":"WARNING","timestamp":"2023-09-28T14:11:22.411410983Z","logger":"controller","caller":"v1/revision_helpers.go:145","message":"<nil>","commit":"72261bc-dirty","knative.dev/pod":"controller-76579c9dcb-wt26d","knative.dev/controller":"knative.dev.serving.pkg.reconciler.revision.Reconciler","knative.dev/kind":"serving.knative.dev.Revision","knative.dev/traceid":"a3cc46b6-3766-437a-9dbb-78926be5ad92","knative.dev/key":"default/test-9-00001","knative.dev/deployment":"test-9-00001-deployment"}

-> logger message: message: <nil>

So we rethink how we are surfacing the deployment status here. Removing the ! here: https://github.com/knative/serving/blob/main/pkg/reconciler/revision/reconcile_resources.go#L77 fixed this and also lets the revision behaves correctly with the thing that the status may change back and forth while the PA and Deployment reconciles.

WDYT?

@gabo1208 gabo1208 changed the title [DRAFT] Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation [WIP] Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation Sep 28, 2023
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 28, 2023
@dprotaso
Copy link
Member

That's weird - prior to reconciliation we preprocess the resource and initialize the conditions

see
https://github.com/knative/pkg/blob/6cf4b051de4f9859b0f1b80f790a83e84a75bc5b/reconciler/reconcile_common.go#L52

Is there something that's deleting the condition erroneously?

@gabo1208
Copy link
Member Author

gabo1208 commented Sep 28, 2023

That's weird - prior to reconciliation we preprocess the resource and initialize the conditions

see
https://github.com/knative/pkg/blob/6cf4b051de4f9859b0f1b80f790a83e84a75bc5b/reconciler/reconcile_common.go#L52

Is there something that's deleting the condition erroneously?

I think is because RevisionActive is not part of the Revision's api.LivingConditionSet

One thing about it, when I add it then the cpu limit error don't surface anymore

@dprotaso
Copy link
Member

dprotaso commented Sep 28, 2023

I think is because RevisionActive is not part of the Revision's api.LivingConditionSet

Ah yeah that's it. Looks like we never mark it unknown hmmm..

Maybe that's something we should do

…there is something wrong (like low resources request and limits), since it was just beign propagated when the revision status was nil and the deployment existed

trying to surface replicaset creation errors
@gabo1208 gabo1208 force-pushed the kailed-sync-replicafailure-ksvc-9857 branch from dfe2bd7 to 933ff82 Compare October 2, 2023 14:29
@codecov
Copy link

codecov bot commented Oct 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (28effd9) 86.11% compared to head (7d2d707) 86.02%.
Report is 123 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14453      +/-   ##
==========================================
- Coverage   86.11%   86.02%   -0.10%     
==========================================
  Files         196      197       +1     
  Lines       14790    14922     +132     
==========================================
+ Hits        12736    12836     +100     
- Misses       1747     1776      +29     
- Partials      307      310       +3     
Files Coverage Δ
pkg/apis/serving/v1/revision_helpers.go 100.00% <100.00%> (ø)
pkg/apis/serving/v1/revision_lifecycle.go 93.10% <ø> (-0.06%) ⬇️
pkg/reconciler/revision/reconcile_resources.go 68.18% <100.00%> (+0.84%) ⬆️

... and 16 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gabo1208
Copy link
Member Author

gabo1208 commented Oct 3, 2023

I think is because RevisionActive is not part of the Revision's api.LivingConditionSet

Ah yeah that's it. Looks like we never mark it unknown hmmm..

Maybe that's something we should do

Since we have the scale to zero edge case I think it makes sense to leave it outside the Revision conditionset, I think the solution to propagate the status is good as it is right now, finishing the tests to remove the WIP

…can have Ready Inactive Revisions (scale to zero) * added docs and tests for this and the replicaset failure propagation
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 3, 2023
@gabo1208 gabo1208 changed the title [WIP] Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation Oct 3, 2023
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2023
@gabo1208
Copy link
Member Author

gabo1208 commented Oct 3, 2023

/retest istio-latest-no-mesh_serving_main

@knative-prow
Copy link

knative-prow bot commented Oct 3, 2023

@gabo1208: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test build-tests
  • /test contour-latest
  • /test contour-tls
  • /test gateway-api-latest
  • /test istio-latest-no-mesh
  • /test istio-latest-no-mesh-tls
  • /test kourier-stable
  • /test kourier-stable-tls
  • /test unit-tests
  • /test upgrade-tests

The following commands are available to trigger optional jobs:

  • /test https
  • /test istio-latest-mesh
  • /test istio-latest-mesh-short
  • /test istio-latest-mesh-tls
  • /test performance-tests

Use /test all to run the following jobs that were automatically triggered:

  • build-tests_serving_main
  • istio-latest-no-mesh-tls_serving_main
  • istio-latest-no-mesh_serving_main
  • unit-tests_serving_main
  • upgrade-tests_serving_main

In response to this:

/retest istio-latest-no-mesh_serving_main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabo1208
Copy link
Member Author

gabo1208 commented Oct 3, 2023

/test istio-latest-no-mesh

@gabo1208 gabo1208 changed the title Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation [WIP] Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation Oct 4, 2023
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 4, 2023
@gabo1208
Copy link
Member Author

gabo1208 commented Oct 4, 2023

Found a flaky test so changin this to WIP again

@gabo1208
Copy link
Member Author

gabo1208 commented Nov 2, 2023

Hmm true! Checking that failed test case

@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2023
@gabo1208
Copy link
Member Author

gabo1208 commented Nov 7, 2023

/retest istio-latest-no-mesh_serving_main

Copy link

knative-prow bot commented Nov 7, 2023

@gabo1208: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test build-tests
  • /test contour-latest
  • /test contour-tls
  • /test gateway-api-latest
  • /test istio-latest-no-mesh
  • /test istio-latest-no-mesh-tls
  • /test kourier-stable
  • /test kourier-stable-tls
  • /test unit-tests
  • /test upgrade-tests

The following commands are available to trigger optional jobs:

  • /test https
  • /test istio-latest-mesh
  • /test istio-latest-mesh-short
  • /test istio-latest-mesh-tls
  • /test performance-tests

Use /test all to run the following jobs that were automatically triggered:

  • build-tests_serving_main
  • istio-latest-no-mesh-tls_serving_main
  • istio-latest-no-mesh_serving_main
  • unit-tests_serving_main
  • upgrade-tests_serving_main

In response to this:

/retest istio-latest-no-mesh_serving_main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabo1208
Copy link
Member Author

gabo1208 commented Nov 7, 2023

/test istio-latest-no-mesh

@gabo1208
Copy link
Member Author

gabo1208 commented Nov 8, 2023

/test istio-latest-no-mesh

1 similar comment
@gabo1208
Copy link
Member Author

gabo1208 commented Nov 8, 2023

/test istio-latest-no-mesh

@gabo1208
Copy link
Member Author

gabo1208 commented Nov 8, 2023

@kvmware @dprotaso this is good to go, now is flaky but as it was before the addition

@gabo1208
Copy link
Member Author

gabo1208 commented Nov 8, 2023

/test istio-latest-no-mesh

@gabo1208
Copy link
Member Author

gabo1208 commented Nov 8, 2023

/hold cancel

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2023
@krsna-m
Copy link
Contributor

krsna-m commented Nov 13, 2023

/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2023
@gabo1208
Copy link
Member Author

/test istio-latest-no-mesh

@knative-prow knative-prow bot merged commit d3127e9 into knative:main Nov 13, 2023
79 checks passed
@knative-prow-robot
Copy link
Contributor

@gabo1208: new pull request created: #14618

In response to this:

/cherry-pick release-1.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso
Copy link
Member

I think the flake surfaces that the ReplicaSetFailure and ProgressDeadline can both happen - if we set ReplicaSetFailure we shouldn't be propagating ProgressDeadline as the failure reason

@gabo1208
Copy link
Member Author

I think the flake surfaces that the ReplicaSetFailure and ProgressDeadline can both happen - if we set ReplicaSetFailure we shouldn't be propagating ProgressDeadline as the failure reason
Created an issue to follow this, because it is only happening with istio no mesh: #14621

@gabo1208
Copy link
Member Author

/cherry-pick release-1.12

@knative-prow-robot
Copy link
Contributor

@gabo1208: #14453 failed to apply on top of branch "release-1.12":

Applying: reconciling the revision so the deployment status is propagated when there is something wrong (like low resources request and limits), since it was just beign propagated when the revision status was nil and the deployment existed
Using index info to reconstruct a base tree...
M	pkg/apis/serving/v1/revision_helpers.go
M	pkg/reconciler/revision/reconcile_resources.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/reconciler/revision/reconcile_resources.go
Auto-merging pkg/apis/serving/v1/revision_helpers.go
CONFLICT (content): Merge conflict in pkg/apis/serving/v1/revision_helpers.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 reconciling the revision so the deployment status is propagated when there is something wrong (like low resources request and limits), since it was just beign propagated when the revision status was nil and the deployment existed
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andrew-delph added a commit to andrew-delph/serving that referenced this pull request Nov 24, 2023
modify rest of ut

Update net-kourier nightly (knative#14605)

bumping knative.dev/net-kourier 6e4d79d...fb8da93:
  > fb8da93 upgrade to latest dependencies (# 1151)

Signed-off-by: Knative Automation <[email protected]>

Update net-gateway-api nightly (knative#14602)

bumping knative.dev/net-gateway-api a8d56a3...0aa321a:
  > 0aa321a upgrade to latest dependencies (# 579)

Signed-off-by: Knative Automation <[email protected]>

upgrade to latest dependencies (knative#14608)

bumping knative.dev/pkg 5c9b7a8...35011d4:
  > 35011d4 upgrade to latest dependencies (# 2892)
bumping knative.dev/networking 18529fd...e0bee34:
  > e0bee34 upgrade to latest dependencies (# 889)
bumping knative.dev/hack 8834794...5deadde:
  > 5deadde 🐛 Set latest release only when publishing to Github (# 346)
bumping knative.dev/caching c642577...b3781bc:
  > b3781bc upgrade to latest dependencies (# 806)

Signed-off-by: Knative Automation <[email protected]>

Update net-contour nightly (knative#14612)

bumping knative.dev/net-contour 467a573...d2054f2:
  > d2054f2 upgrade to latest dependencies (# 999)

Signed-off-by: Knative Automation <[email protected]>

Update net-certmanager nightly (knative#14611)

bumping knative.dev/net-certmanager 11e6219...8b2a470:
  > 8b2a470 upgrade to latest dependencies (# 625)
  > 2248405 upgrade to latest dependencies (# 624)

Signed-off-by: Knative Automation <[email protected]>

Update net-kourier nightly (knative#14613)

bumping knative.dev/net-kourier fb8da93...ad58d90:
  > ad58d90 upgrade to latest dependencies (# 1155)

Signed-off-by: Knative Automation <[email protected]>

Update net-istio nightly (knative#14614)

bumping knative.dev/net-istio 7f77e97...e3db912:
  > e3db912 upgrade to latest dependencies (# 1209)
  > 1e021c8 upgrade to latest dependencies (# 1208)

Signed-off-by: Knative Automation <[email protected]>

Update net-gateway-api nightly (knative#14616)

bumping knative.dev/net-gateway-api 0aa321a...29bf0b9:
  > 29bf0b9 upgrade to latest dependencies (# 581)
  > cd26216 upgrade to latest dependencies (# 580)

Signed-off-by: Knative Automation <[email protected]>

Surface cpu and mem requests forbidden errors (and other ones too) in KSVC creation (knative#14453)

* reconciling the revision so the deployment status is propagated when there is something wrong (like low resources request and limits), since it was just beign propagated when the revision status was nil and the deployment existed

trying to surface replicaset creation errors

* added revision condition active to revision live condition set so is not nil after the revision is created

* removing RevisionConditionActive from Revision ConditionSet since we can have Ready Inactive Revisions (scale to zero) * added docs and tests for this and the replicaset failure propagation

* fixing lint

* adjusted e2e and unit tests for the replica failure erros propagation, improved propagation logic + left todos regarding revision conditionset

* removed todo from revision lifecycle since the discussion has settled

* added test case for revision fast failures when it comes to replicaset failing to create

* fixed resource quota error, now it never waits for progress deadline and it fails fast, so removing the bits where it can go one way or another in the e2e resource_quota_error_test

* finishing the replicaset deployment status failure bubbling up to the revision table test

* removed unused test methods from revision testing package

* adding condition to wait for container creation in the resource quota service creation test

* with some istio cases this could fail with progress deadline exceeded error so adding that case too

* Update resource_quota_error_test.go

* formated the test file

fix deploy-replicaset-failure UT

rename tests

remove coment

remove comment

wip

want go correct

need fix ScaleTargetInitialized

only extra patch

patch scale to 0

more test helper changes

new tc

ut

WithPubService

activation failure is unreachable

copy inactive condition

comments

comment

comment

refactor computeActiveCondition

allow scale to 0 with no metrics if unreachable

remoce comment

add UT: initial scale zero: with ready pods

change to 1 ready pod

replicas 1

change logic order

readd activeThreshold

change Failed message

scaler_test handle no metrics case

mend

silly typo

fix UT

shortten test cases

scale to 0 if unreachable comment + logic

remove controlls podinformer as its not needed

revert computeActiveCondition refactor

use reachability variables

init scale 0 stays no traffic

undo imports reorder

add comments

pa change Inactive reason to Unreachable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/test-and-release It flags unit/e2e/conformance/perf test issues for product features lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to sync with ReplicaFailure in ksvc creation sometimes
5 participants