Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump GRPC to 1.64 #10250

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Bump GRPC to 1.64 #10250

wants to merge 3 commits into from

Conversation

mxpv
Copy link
Member

@mxpv mxpv commented May 21, 2024

This is follow up to #10247 with linter fixes.

GRPC 1.64 deprecates lots of functions that we currently use to create a new GRPC connection:

  • grpc.Dial / grpc.DialContext
  • WithBlock, WithTimeout, WithReturnConnectionError, and FailOnNonTempDialError.

Notably, Anti-Patterns of Client creation explicitly mentions that we should not create the client the way we do.

Especially bad: using deprecated DialOptions

FailOnNonTempDialError, WithBlock, and WithReturnConnectionError are three
DialOptions that are only supported by Dial because they only affect the
behavior of Dial itself. WithBlock causes Dial to wait until the ClientConn
reports its State as connectivity.Connected. The other two deal with returning
connection errors before the timeout (WithTimeout or on the context when using
DialContext).

The reason these options can be a problem is that connections with a ClientConn
are dynamic -- they may come and go over time. If your client successfully connects,
the server could go down 1 second later, and your RPCs will fail. "Knowing you are
connected" does not tell you much in this regard.

This PR removes deprecated dial options, AFAIU there are no alternatives to provide and the old approach is just not recommended by the GRPC team.

@k8s-ci-robot
Copy link

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mxpv
Copy link
Member Author

mxpv commented May 21, 2024

@dcantah I think you've played with grpc stuff recently, pls PTAL.

@dcantah
Copy link
Member

dcantah commented May 21, 2024

@dmcgowan also as we were chatting on the history of a lot of these options on slack a bit ago. We were wondering if we would be able to just forego WithBlock today like we do here

@dmcgowan
Copy link
Member

I believe the original issue like 8 years was related to clients connecting and when the daemon was not online the client would end up hanging. Now in that case it seems to be handled better with this PR.

In this PR...

$ time bin/ctr images ls
WARN[0001] Failed to check deprecations                  error="connection error: desc = \"transport: Error while dialing: dial unix:https:///run/containerd/containerd.sock: timeout\": unavailable"
ctr: failed to list images: connection error: desc = "transport: Error while dialing: dial unix:https:///run/containerd/containerd.sock: timeout": unavailable
bin/ctr images ls  0.02s user 0.03s system 4% cpu 1.019 total

it comes back in 1 second. Although would be good to avoid the warning log.

In main...

time bin/ctr images ls
ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded: connection error: desc = "transport: error while dialing: dial unix:https:///run/containerd/containerd.sock: timeout"
bin/ctr images ls  0.07s user 0.08s system 1% cpu 10.019 total

it takes 10 seconds to timeout.

@mxpv
Copy link
Member Author

mxpv commented May 22, 2024

Windows needed workaround 20aebe7. Everything else looks green.

Although would be good to avoid the warning log.

@dmcgowan The warning is coming from an extra call to introspection service when creating a client in ctr, which we always do. It's weird that you don't see it the second case (unless it was suppressed or succeeded?).

@ZhangShuaiyi
Copy link
Contributor

@mxpv @dcantah Is the net.DialTimeout that was introduced by #9447 still needed?

conn, err := net.DialTimeout("unix", address, time.Second*10)

@dcantah
Copy link
Member

dcantah commented May 24, 2024

@ZhangShuaiyi Good question, I'm not sure. The reason for it was so that on restart we can tell what shims to ignore if they're not responding when going to reload state (usually because they crashed for some reason so the socket isn't listening). We already had that info with grpc.WithBlock, it would just take the entire timeout to figure out if we should ignore the shim so we added that to be able to find out much faster if the socket even exists anymore. Now with this scheme I'm not sure if we know whether we can ignore a shim or not when reloading, as we don't do any network ops when making a client anymore, so maybe that dial still has value?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Needs Reviewers
Development

Successfully merging this pull request may close these issues.

None yet

5 participants