[Flaking Test] ci-kubernetes-unit (Unexpected event resourceVersion 2 less than or equal to bookmark 2) #125244

Closed
Atharva-Shinde opened this issue May 31, 2024 · 10 comments · Fixed by #125288
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@Atharva-Shinde

Which jobs are flaking?

sig-release-master-blocking

  • ci-kubernetes-unit

Which tests are flaking?

k8s.io/apiserver/pkg/storage/cacher.cacher

Since when has it been flaking?

05/30/2024 19:25 IST
Prow Logs: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-unit/1796177771727163392
Triage: https://storage.googleapis.com/k8s-triage/index.html?test=k8s.io%2Fapiserver%2Fpkg%2Fstorage%2Fcacher.cacher

Testgrid link

https://testgrid.k8s.io/sig-release-master-blocking#ci-kubernetes-unit

Reason for failure (if possible)

{Failed;  === RUN   TestWatchBookmarksWithCorrectResourceVersion
W0530 14:27:24.502997   70007 logging.go:59] [core] [Channel #387 SubChannel #390] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:38087", ServerName: "localhost:38087", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:38087: connect: connection refused"
W0530 14:27:24.503472   70007 logging.go:59] [core] [Channel #435 SubChannel #438] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:37907", ServerName: "localhost:37907", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:37907: connect: connection refused"
W0530 14:27:24.817666   70007 logging.go:59] [core] [Channel #460 SubChannel #462] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:32897", ServerName: "localhost:32897", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:32897: connect: connection refused"
W0530 14:27:25.178669   70007 reflector.go:556] storage/cacher.go:/pods: failed to list *example.Pod: injected error
E0530 14:27:25.178720   70007 cacher.go:476] cacher (pods): unexpected ListAndWatch error: failed to list *example.Pod: injected error; reinitializing...
W0530 14:27:25.368480   70007 logging.go:59] [core] [Channel #472 SubChannel #473] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:41447", ServerName: "localhost:41447", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41447: connect: connection refused"
W0530 14:27:25.885017   70007 logging.go:59] [core] [Channel #292 SubChannel #293] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:43397", ServerName: "localhost:43397", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:43397: connect: connection refused"
W0530 14:27:25.940337   70007 logging.go:59] [core] [Channel #400 SubChannel #402] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:32947", ServerName: "localhost:32947", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:32947: connect: connection refused"
    watcher_tests.go:1205: Unexpected event resourceVersion 2 less than or equal to bookmark 2)
W0530 14:27:26.194730   70007 logging.go:59] [core] [Channel #483 SubChannel #485] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:34213", ServerName: "localhost:34213", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:34213: connect: connection refused"
--- FAIL: TestWatchBookmarksWithCorrectResourceVersion (1.83s)
;}

Anything else we need to know?

No response

Relevant SIG(s)

/sig

@Atharva-Shinde Atharva-Shinde added the kind/flake Categorizes issue or PR as related to a flaky test. label May 31, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 31, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Atharva-Shinde
Author

/sig storage
@kubernetes/release-team-release-signal

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 31, 2024
@Atharva-Shinde
Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 31, 2024
@wojtek-t wojtek-t changed the title from "[Flaking Test] ci-kubernetes-unit (grpc: addrConn.createTransport failed to connect)" to "[Flaking Test] ci-kubernetes-unit (Unexpected event resourceVersion 2 less than or equal to bookmark 2)" on Jun 3, 2024
@wojtek-t
Member

wojtek-t commented Jun 3, 2024

/cc @serathius @p0lyn0mial

This started last week, but it seems to be fairly rare, so it might be a consequence of something a bit earlier.

The value of "2" suggests that it's happening on the first event, and my brain goes towards: https://github.com/kubernetes/kubernetes/pull/124754/files

@p0lyn0mial - will you be able to take a look at it?

@p0lyn0mial
Contributor

> @p0lyn0mial - will you be able to take a look at it?

sure, I will have a look.

/assign @p0lyn0mial

@wojtek-t
Member

wojtek-t commented Jun 3, 2024

We chatted with @p0lyn0mial about that offline and it seems like a real correctness issue.

Regression indeed introduced by https://github.com/kubernetes/kubernetes/pull/124754/files

We have a race between setting the RV in the watchcache and the delivery of incoming events.

The current explanation is:

  1. watchcache is initialized with RV=N
  2. an event comes with RV=N+1 [but wasn't yet dispatched]
  3. the polling introduced in that PR catches RV=N+1
  4. only then the event (with RV=N+1) is received
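
The four steps above can be replayed deterministically as a minimal sketch. This is not the cacher code: the `event` type and the `replayRace` and `checkOrdering` helpers are hypothetical names made up for illustration, and the sketch assumes the bookmark is emitted from whatever RV the watchcache currently holds. It only shows why a bookmark at RV=N+1 followed by an event at RV=N+1 trips the assertion seen in the test log.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a watch event; only the
// resource version and the bookmark flag matter for this sketch.
type event struct {
	rv       uint64
	bookmark bool
}

// replayRace replays the ordering described in the steps above and
// returns what a watcher would observe. All names are illustrative.
func replayRace(n uint64) []event {
	// Step 1: the watchcache is initialized with RV=N.
	cacheRV := n

	// Step 2: an event with RV=N+1 exists in storage but has not
	// been dispatched to the watchcache yet.
	pending := event{rv: n + 1}
	storageRV := pending.rv

	// Step 3: the polling path observes RV=N+1 and advances the
	// watchcache RV before the event itself arrives.
	if storageRV > cacheRV {
		cacheRV = storageRV
	}

	// Assumption: a bookmark is emitted from the current cache RV.
	observed := []event{{rv: cacheRV, bookmark: true}}

	// Step 4: only now is the event with RV=N+1 delivered.
	observed = append(observed, pending)
	return observed
}

// checkOrdering enforces the invariant the test asserts: every event
// that follows a bookmark must carry a strictly greater RV.
func checkOrdering(events []event) error {
	var lastBookmark uint64
	seenBookmark := false
	for _, e := range events {
		if e.bookmark {
			lastBookmark = e.rv
			seenBookmark = true
			continue
		}
		if seenBookmark && e.rv <= lastBookmark {
			return fmt.Errorf("unexpected event resourceVersion %d less than or equal to bookmark %d", e.rv, lastBookmark)
		}
	}
	return nil
}

func main() {
	// N=1 corresponds to the "first event" case from the failure log.
	if err := checkOrdering(replayRace(1)); err != nil {
		fmt.Println(err)
	}
}
```

With N=1 the bookmark and the event both carry RV=2, which matches the values in the failure message and is consistent with the observation that the problem shows up on the first event.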

@Vyom-Yadav
Member

@wojtek-t @p0lyn0mial So it is a release blocker, then? (Upcoming alpha 1 release cut scheduled for tomorrow)

@p0lyn0mial
Contributor

> @wojtek-t @p0lyn0mial So it is a release blocker, then? (Upcoming alpha 1 release cut scheduled for tomorrow)

I don't know what the release policy is but we might have a potential fix for this issue.

@wojtek-t
Member

wojtek-t commented Jun 3, 2024

> @wojtek-t @p0lyn0mial So it is a release blocker, then? (Upcoming alpha 1 release cut scheduled for tomorrow)

This is a real correctness bug introduced 2 weeks ago. I'm not sure what our policy is for alpha releases, but it sounds like a potential blocker to me.

@wojtek-t
Member

wojtek-t commented Jun 3, 2024

The fix for it is already approved.

5 participants