Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaking test] unit test TestStoreListResourceVersion #125406

Closed
mimowo opened this issue Jun 10, 2024 · 12 comments · Fixed by #125450
Closed

[Flaking test] unit test TestStoreListResourceVersion #125406

mimowo opened this issue Jun 10, 2024 · 12 comments · Fixed by #125450
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd.

Comments

@mimowo
Copy link
Contributor

mimowo commented Jun 10, 2024

Which jobs are flaking?

pull-kubernetes-unit

Which tests are flaking?

TestStoreListResourceVersion in k8s.io/apiserver/pkg/registry/generic: registry

Since when has it been flaking?

seems like July 7th 2024:
image

Testgrid link

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-unit

Reason for failure (if possible)

Don't know, here is the log:

{Failed;  === RUN   TestStoreListResourceVersion
W0610 08:12:32.485776   69887 logging.go:59] [core] [Channel #205 SubChannel #208] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:45055", ServerName: "localhost:45055", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:45055: connect: connection refused"
    store_test.go:287: storage is (re)initializing
    store_test.go:296: expected waiting, but get <nil>
W0610 08:12:32.576001   69887 logging.go:59] [core] [Channel #325 SubChannel #327] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:34737", ServerName: "localhost:34737", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34737: connect: connection refused"
--- FAIL: TestStoreListResourceVersion (0.33s)
;}

Anything else we need to know?

No response

Relevant SIG(s)

/sig

@mimowo mimowo added the kind/flake Categorizes issue or PR as related to a flaky test. label Jun 10, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 10, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@neolit123
Copy link
Member

func TestStoreListResourceVersion(t *testing.T) {

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 10, 2024
@logicalhan
Copy link
Member

/sig etcd

@k8s-ci-robot k8s-ci-robot added the sig/etcd Categorizes an issue or PR as relevant to SIG Etcd. label Jun 11, 2024
@fedebongio
Copy link
Contributor

/assign @serathius
/cc @wenjiaswe

@serathius
Copy link
Contributor

Bisected this to a8ef6e9
cc @wojtek-t

@mauri870
Copy link
Member

I'm not sure if this is the proper way to fix it, but setting a limit in the ListOptions deflakes the test.

The flakiness comes from the fact that the cache is sometimes not initialized in time which causes it to return a "too many requests".

Perhaps in tests we could wait till the cache is ready in order to proceed?

@chaochn47
Copy link

I'm not sure if this is the proper way to fix it, but setting a limit in the ListOptions deflakes the test.

Right, setting a limit when a resource version is set would delegate the list to etcd when cache is not initialized based on the KEP description.

It is due to this type of request is not that costly to etcd.

Perhaps in tests we could wait till the cache is ready in order to proceed?

Looks like it is exactly how the wait cache ready is added in https://github.com/kubernetes/kubernetes/pull/124642/files#diff-d82df65cc0625a3f621333245e76fbc801b579eedd3d266fe36eeb99366b5680.

See below

if utilfeature.DefaultFeatureGate.Enabled(features.ResilientWatchCacheInitialization) {
// The tests assume that Get/GetList/Watch calls shouldn't fail.
// However, 429 error can now be returned if watchcache is under initialization.
// To avoid rewriting all tests, we wait for watcache to initialize.
if err := cacher.ready.wait(ctx); err != nil {
t.Fatal(err)
}
}

So I think we should implement this same thing in newTestGenericStoreRegistry when initializing the cacher.

@wojtek-t
Copy link
Member

Agree - waiting for cacher to be initialized when creating it here is a better fix for it:
https://github.com/kubernetes/kubernetes/blob/22512e7a3c6db15ac7a9d46edfe4563fe561418f/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store_test.go#L2448C4-L2449C1

@mauri870
Copy link
Member

I thought about this as well, but waiting on cacher.ready.wait() is not possible outside of the cacher module, since it is a private field.

I'm thinking to either add an option WaitUntilReady to cacherstorage.Config and make cacherstorage.NewCacherFromConfig wait or add a new public method Wait(context.Context) error to Cacher.

Do you guys have a better idea? Thanks.

@wojtek-t
Copy link
Member

Exposing Wait(context) method sounds fine

@mauri870
Copy link
Member

I have pushed the changes we discussed. Thank you all for your guidance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/etcd Categorizes an issue or PR as relevant to SIG Etcd.
Projects
Development

Successfully merging a pull request may close this issue.

10 participants