
Closing connection takes up to 15 minutes. #7314

Open

fpetkovski opened this issue Jun 9, 2024 · 5 comments · May be fixed by #7371

fpetkovski commented Jun 9, 2024

What version of gRPC are you using?

1.64.0

What version of Go are you using (go version)?

go1.22.3

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

After upgrading to gRPC 1.64, closing connections started to take a very long time, specifically around 15 minutes. This happens when the server side abruptly goes away and the TCP connection breaks on one end.

I captured a goroutine profile while the application was trying to close one such connection. What stands out are the following two stacks:

#	0xbd151a	google.golang.org/grpc/internal/transport.(*http2Client).Close+0x2ba	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:1012
#	0xc21225	google.golang.org/grpc.(*addrConn).tearDown+0x325			/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1590
#	0xc1e627	google.golang.org/grpc.(*ClientConn).Close+0x207			/go/pkg/mod/google.golang.org/[email protected]/clientconn.go:1154

and

1 @ 0x43e48e 0x436f17 0x46bb65 0x4e53e7 0x4e87b6 0x4e87a3 0x589565 0x59b525 0xbe2876 0xbe275c 0xbc2570 0xbcbd5b 0x4716c1
#	0x46bb64	internal/poll.runtime_pollWait+0x84						/usr/local/go/src/runtime/netpoll.go:343
#	0x4e53e6	internal/poll.(*pollDesc).wait+0x26						/usr/local/go/src/internal/poll/fd_poll_runtime.go:84
#	0x4e87b5	internal/poll.(*pollDesc).waitWrite+0x2d5					/usr/local/go/src/internal/poll/fd_poll_runtime.go:93
#	0x4e87a2	internal/poll.(*FD).Write+0x2c2							/usr/local/go/src/internal/poll/fd_unix.go:388
#	0x589564	net.(*netFD).Write+0x24								/usr/local/go/src/net/fd_posix.go:96
#	0x59b524	net.(*conn).Write+0x44								/usr/local/go/src/net/net.go:191
#	0xbe2875	google.golang.org/grpc/internal/transport.(*bufWriter).flushKeepBuffer+0x55	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http_util.go:362
#	0xbe275b	google.golang.org/grpc/internal/transport.(*bufWriter).Flush+0x1b		/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http_util.go:345
#	0xbc256f	google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x6f		/go/pkg/mod/google.golang.org/[email protected]/internal/transport/controlbuf.go:592
#	0xbcbd5a	google.golang.org/grpc/internal/transport.newHTTP2Client.func6+0xba		/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:467

I believe the (potential) bug was introduced on these lines, where the client tries to send a GOAWAY frame to the server before closing the connection. If the connection is half-closed, the call hangs for roughly 15 minutes, which is how long the blocked net.(*conn).Write takes to fail with default settings.
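For illustration, here is a minimal sketch (not the reporter's exact setup) that times ClientConn.Close after the peer has been silently dropped. The address, the health-check RPC, and the manual network drop are assumptions made for the example, and the hang only reproduces if the final write actually blocks (e.g. the kernel send buffer is full):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// "server:50051" is a placeholder; point it at any reachable gRPC server.
	cc, err := grpc.NewClient("server:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}

	// Issue one RPC so a transport is actually established.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	_, _ = healthpb.NewHealthClient(cc).Check(ctx, &healthpb.HealthCheckRequest{})
	cancel()

	// Silently drop the connection now (e.g. an iptables DROP rule on the
	// server's address), then let Close run and measure how long it blocks.
	time.Sleep(30 * time.Second)

	start := time.Now()
	if err := cc.Close(); err != nil {
		log.Printf("close error: %v", err)
	}
	log.Printf("cc.Close() returned after %s", time.Since(start))
}
```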

What did you expect to see?

I would expect a more reasonable timeout for closing a connection, or perhaps a way for the client to control the timeout.

What did you see instead?

Closing connections takes a very long time (around 15 minutes).

atollena (Collaborator) commented:

Could you tell us which version you used before, where you were seeing a different behavior? Are you actually calling Close() on the gRPC client/channel and it's blocking for 15 minutes? I wouldn't expect the change that you pointed out to change the behavior when the connection is not gracefully closed, but I could be missing something.

15 minutes is how long it takes, with typical Linux default settings, to detect a silent connection drop, unless gRPC keepalives are enabled. So the behavior you are seeing is expected if there are outstanding RPCs (more details e.g. in https://www.evanjones.ca/tcp-connection-timeouts.html). It's been like this forever, I think, so I suspect you're talking about something different?
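For reference, a minimal sketch of the client-side keepalive configuration mentioned above. The address and the specific durations are placeholder assumptions, and the server's keepalive EnforcementPolicy must permit these values:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// With keepalives, a silently dropped connection is detected after
	// roughly Time+Timeout instead of the kernel's TCP retransmission
	// timeout (~15 minutes with Linux defaults).
	cc, err := grpc.NewClient("server:50051", // placeholder address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after 30s of inactivity
			Timeout:             10 * time.Second, // fail the transport if no ack within 10s
			PermitWithoutStream: true,             // ping even with no active RPCs
		}),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer cc.Close()
}
```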

atollena (Collaborator) commented:

> Are you actually calling Close() on the gRPC client/channel and it's blocking for 15 minutes?

(OK, I see in your goroutine capture that you are closing the channel.)

fpetkovski (Author) commented:

Thanks for your reply. We used to be on 1.62.1, and I don't think the Close method was blocking on dead connections there. It started to do so because of the client-side GOAWAY. As can be seen in the first goroutine stack, Close hangs on line 1012 in http2_client.go, which is exactly where the client waits for the GOAWAY frame to be sent. This behavior was introduced in #7015 and was not present before.

atollena (Collaborator) commented Jun 13, 2024

OK, thank you for the detailed report. The problem can occur when a call to Write blocks, which I think happens when the OS socket buffer is full. This is more likely when the connection is silently dropped.

When we attempt to flush the control buffer in the call to clientconn.Close, we used to immediately call conn.Close(), which would make any blocked read or write fail immediately (including writes of previously queued data that may be related to RPCs, e.g. in the case of a client stream). Now we want to give the transport time to flush the GOAWAY frame, and the obvious way to do that was to wait for all writes to finish, but I'm not sure we thought about the case of loopy being blocked on writing to the socket.

One option to give loopy some time to return, while not waiting until the operating system fails the socket, would be to set a short deadline on the transport connection via t.conn.SetWriteDeadline, perhaps in func (t *http2Client) Close(err error). The next question is how we decide on an appropriate value for the write deadline...
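As an illustration of that idea (a sketch only, not the grpc-go internals; flushWithDeadline and its parameters are made-up names), a write deadline bounds how long a flush blocked on a full kernel send buffer can hang:

```go
package transportsketch

import (
	"net"
	"time"
)

// flushWithDeadline writes one frame with an upper bound on blocking time.
// A Write stuck because the peer is gone and the kernel send buffer is full
// returns a timeout error after d, instead of waiting for TCP to give up
// (~15 minutes with Linux defaults).
func flushWithDeadline(conn net.Conn, frame []byte, d time.Duration) error {
	if err := conn.SetWriteDeadline(time.Now().Add(d)); err != nil {
		return err
	}
	_, err := conn.Write(frame)
	return err
}
```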

easwars (Contributor) commented Jun 25, 2024

Quoting from https://www.rfc-editor.org/rfc/rfc9113.html#name-connection-error-handling:

> It is possible that the GOAWAY will not be reliably received by the receiving endpoint. In the event of a connection error, GOAWAY only provides a best-effort attempt to communicate with the peer about why the connection is being terminated.
>
> An endpoint can end a connection at any time. In particular, an endpoint MAY choose to treat a stream error as a connection error. Endpoints SHOULD send a GOAWAY frame when ending a connection, providing that circumstances permit it.

Maybe our GOAWAY sending on close should be a best effort, but I'm not sure if there is a way to do that without setting a write deadline on the underlying conn.
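One possible shape for such a best-effort close (a sketch under assumed names; closeBestEffort, writerDone, and goAwayTimeout are hypothetical, not grpc-go fields) is to wait for the writer for a bounded time and then close the raw connection regardless, which also unblocks any stuck Write:

```go
package transportsketch

import (
	"net"
	"time"
)

// closeBestEffort gives the writer goroutine a bounded window to flush the
// GOAWAY frame, then closes the raw connection regardless; closing the
// connection unblocks a Write that is stuck in the kernel.
func closeBestEffort(conn net.Conn, writerDone <-chan struct{}, goAwayTimeout time.Duration) error {
	select {
	case <-writerDone: // writer flushed everything and exited
	case <-time.After(goAwayTimeout): // give up waiting: GOAWAY is best effort
	}
	return conn.Close()
}
```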
