Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

packets: Check connection liveness before writing query #934

Merged
merged 1 commit into from
Mar 29, 2019

Conversation

vmg
Copy link
Contributor

@vmg vmg commented Mar 12, 2019

Description

Heyoo! Here's a potential fix for the long feared issue of stale MySQL connections when the server prunes them, i.e. #882 and #657

I'm aware that the ideal fix for this issue is setting your connection lifetime shorter than the server's wait_timeout, but this approach does not handle the case where the server is actively pruning connections. This PR adds a quite lightweight check on connections we know to be non-active, so data corruption is not possible.

We've been testing this patch internally at GitHub for a few days and the results are flawless as far as I can observe: it has fixed all our connection errors, even in cases where the client connection lifetime is greater than the server's.

To note: I have not been able to test this patch in Windows, so I'm not sure whether it works there or whether the conncheck should be disabled for that platform.

The full details on how we've implemented this connection check are in the commit message for 9ec0a8f, which I'm inlining here for completeness:

This commit contains a potential fix to the issue reported in #657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.

Checklist

  • Code compiles correctly
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary
  • Added myself / the copyright holder to the AUTHORS file

conncheck.go Outdated Show resolved Hide resolved
}
rerr := rc.Read(func(fd uintptr) bool {
n, err = syscall.Read(int(fd), buff[:])
return true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested it on Windows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not! As pointed out in the PR body:

To note: I have not been able to test this patch in Windows, so I'm not sure whether it works there or whether the conncheck should be disabled for that platform.

Do you have any ideas on how would be able to run the CI test on Windows?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to add it in short term.
Instead, you can test your code in small repository, like this.
https://github.com/methane/peek/runs/25901610

In my case, I stopped while thinking about nice API works on both of Windows and Unix.

@vmg vmg force-pushed the vmg/conncheck branch 4 times, most recently from 41277ce to f327058 Compare March 12, 2019 14:54
@vmg
Copy link
Contributor Author

vmg commented Mar 12, 2019

@methane: I've updated the commit with your suggestions and removed the 1.8 file, but now the Travis CI build is failing because you're still testing 1.8.x in Travis. I've tried removing the 1.8 build from .travis.yml but somehow it's still being tested. 🤔

@vmg
Copy link
Contributor Author

vmg commented Mar 20, 2019

@methane: Hi! Build is green now. I went ahead and disabled the conncheck feature on Windows builds, at least until we can get Windows CI working.

What else can I do to make merging this PR easier?

packets.go Outdated Show resolved Hide resolved
connection.go Outdated Show resolved Hide resolved
connection.go Outdated Show resolved Hide resolved
This commit contains a potential fix to the issue reported in go-sql-driver#657.

As a summary: when a MySQL server kills a connection on the server-side
(either because it is actively pruning connections, or because the
connection has hit the server-side timeout), the Go MySQL client does
not immediately become aware of the connection being dead.

Because of the way TCP works, the client cannot know that the connection
has received a RST packet from the server (i.e. the server-side has
closed) until it actually reads from it. This causes an unfortunate bug
wherein a MySQL idle connection is pulled from the connection pool, a
query packet is written to it without error, and then the query fails
with an "unexpected EOF" error when trying to read the response packet.

Since the initial write to the socket does not fail with an error, it is
generally not safe to return `driver.ErrBadConn` when the read fails,
because in theory the write could have arrived to the server and could
have been committed. Returning `ErrBadConn` could lead to duplicate
inserts on the database and data corruption because of the way the Go
SQL package performs retries.

In order to significantly reduce the circumstances where this
"unexpected EOF" error is returned for stale connections, this commit
performs a liveness check before writing a new query.

When do we check?
-----------------

This check is not performed for all writes. Go 1.10 introduced a new
`sql/driver` interface called `driver.SessionResetter`, which calls the
`ResetSession` method on any connections _when they are returned to the
connection pool_. Since performing the liveness check during
`ResetSession` is not particularly useful (the connection can spend a
long time in the pool before it's checked out again, and become stale),
we simply mark the connection with a `reset` flag instead.

This `reset` flag is then checked from `mysqlConn.writePacket` to
perform the liveness checks. This ensures that the liveness check will
only be performed for the first query on a connection that has been
checked out of the connection pool. These are pretty much the semantics
we want: a fresh connection from the pool is more likely to be stale,
and it has not performed any previous writes that could cause data
corruption. If a connection is being consistently used by the client
(i.e. through an open transaction), we do NOT perform liveness checks.
If MySQL Server kills such active connection, we want to bubble up the
error to the user because any silent retrying can and will lead to data
corruption.

Since the `ResetSession` interface is only available in Go 1.10+, the
liveness checks will only be performed starting with that Go version.

How do we check?
----------------

To perform the actual liveness test on the connection, we use the new
`syscall.Conn` interface which is available for all `net.Conn`s since Go
1.9. The `SyscallConn` method returns a `RawConn` that lets us read
directly from the connection's file descriptor using syscalls, and
skipping the default read pipeline of the Go runtime.

When reading directly from the file descriptor using `syscall.Read`, we
pass in a 1-length buffer, as passing a 0-length buffer will always
result in a 0-length read, and the 1-length buffer will never be filled
because we're not expecting any reads from MySQL before we have written
any request packets in a fresh connection.

All sockets created in the Go runtime are set to non-blocking
(O_NONBLOCK). Consequently, we can detect a socket that has been closed
on the server-side because the `read` syscall will return a 0-length read
_and_ no error.

We assume that any other errors returned from the `read` also mean the
connection is in a bad state, except for `EAGAIN`/`EWOULDBLOCK`, which is
the expected return for a healthy non-blocking socket in this circumstance.

Because of the dependency on `syscall.Conn`, liveness checks can only be
performed in Go 1.9+. This restriction however overlaps with the fact
that we only mark connections as having been reset in Go 1.10+, as
explained in the previous section.
@vmg
Copy link
Contributor Author

vmg commented Mar 20, 2019

I fixed the gofmt issues and we're green again! 😄 🍏

Copy link
Member

@methane methane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But wait one more maintainer's review.

@vmg
Copy link
Contributor Author

vmg commented Mar 26, 2019

Hiii @methane! It seems like this PR got stalled again. I'm not sure who are the other maintainers of the project. Is there anything we could do to speed up this process? I would appreciate landing this PR so we can switch back to using go-sql-driver/mysql for GitHub's internal services. 😄

Copy link
Contributor

@shogo82148 shogo82148 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice idea!
LGTM

@vmg
Copy link
Contributor Author

vmg commented Mar 29, 2019

Fantastic! Thank you @shogo82148! Does that mean this PR is ready to be merged?

@methane methane merged commit bc5e6ea into go-sql-driver:master Mar 29, 2019
@methane
Copy link
Member

methane commented Mar 29, 2019

Congrats.

fdurand added a commit to inverse-inc/packetfence that referenced this pull request Apr 1, 2019
fdurand added a commit to inverse-inc/packetfence that referenced this pull request Apr 1, 2019
fdurand added a commit to inverse-inc/packetfence that referenced this pull request Apr 1, 2019
fdurand added a commit to inverse-inc/packetfence that referenced this pull request Apr 3, 2019
@jingyugao
Copy link

Can we cherry-pick this commit into some tag?

@julienschmidt julienschmidt added this to the v1.5.0 milestone Jan 3, 2020
saydamir added a commit to saydamir/clickhouse-go that referenced this pull request Oct 8, 2021
This commit fixes for ClickHouse#213.

If the connection closes from the server side for some reason, the database driver returns driver.ErrBadConn to database/sql.

Usually, database/sql retries a request, assuming that the error occurs in a function that could be called first after retrieving a connection from the pool.
But beginTx in clickhouse-go doesn't perform any network interaction and driver.ErrBadConn is returned later in the transaction. database/sql doesn't retry it because assumes that connection is alive - beginTx didn't return the error.

 This commit adds a method to check the connection liveness and performs that check in beginTx function.

 The check is taken from go-sql-driver/mysql#934.

There is no way to check the liveness of a secure connection, as long as there is no access to raw TCP net.Conn in clickhouse-go.
saydamir added a commit to saydamir/clickhouse-go that referenced this pull request Oct 8, 2021
This commit fixes for ClickHouse#213.

If the connection closes from the server side for some reason, the database driver returns driver.ErrBadConn to database/sql.

Usually, database/sql retries a request, assuming that the error occurs in a function that could be called first after retrieving a connection from the pool.
But beginTx in clickhouse-go doesn't perform any network interaction and driver.ErrBadConn is returned later in the transaction. database/sql doesn't retry it because assumes that connection is alive - beginTx didn't return the error.

 This commit adds a method to check the connection liveness and performs that check in beginTx function.

 The check is taken from go-sql-driver/mysql#934.

There is no way to check the liveness of a secure connection, as long as there is no access to raw TCP net.Conn in clickhouse-go.
abraithwaite pushed a commit to segmentio/clickhouse-go that referenced this pull request Oct 23, 2021
This commit fixes for ClickHouse#213.

If the connection closes from the server side for some reason, the database driver returns driver.ErrBadConn to database/sql.

Usually, database/sql retries a request, assuming that the error occurs in a function that could be called first after retrieving a connection from the pool.
But beginTx in clickhouse-go doesn't perform any network interaction and driver.ErrBadConn is returned later in the transaction. database/sql doesn't retry it because assumes that connection is alive - beginTx didn't return the error.

 This commit adds a method to check the connection liveness and performs that check in beginTx function.

 The check is taken from go-sql-driver/mysql#934.

There is no way to check the liveness of a secure connection, as long as there is no access to raw TCP net.Conn in clickhouse-go.
farbodsalimi pushed a commit to segmentio/clickhouse-go that referenced this pull request May 18, 2022
This commit fixes for ClickHouse#213.

If the connection closes from the server side for some reason, the database driver returns driver.ErrBadConn to database/sql.

Usually, database/sql retries a request, assuming that the error occurs in a function that could be called first after retrieving a connection from the pool.
But beginTx in clickhouse-go doesn't perform any network interaction and driver.ErrBadConn is returned later in the transaction. database/sql doesn't retry it because assumes that connection is alive - beginTx didn't return the error.

 This commit adds a method to check the connection liveness and performs that check in beginTx function.

 The check is taken from go-sql-driver/mysql#934.

There is no way to check the liveness of a secure connection, as long as there is no access to raw TCP net.Conn in clickhouse-go.
kahalorah9f added a commit to kahalorah9f/clickhouse-go that referenced this pull request Aug 27, 2024
This commit fixes for ClickHouse/clickhouse-go#213.

If the connection closes from the server side for some reason, the database driver returns driver.ErrBadConn to database/sql.

Usually, database/sql retries a request, assuming that the error occurs in a function that could be called first after retrieving a connection from the pool.
But beginTx in clickhouse-go doesn't perform any network interaction and driver.ErrBadConn is returned later in the transaction. database/sql doesn't retry it because assumes that connection is alive - beginTx didn't return the error.

 This commit adds a method to check the connection liveness and performs that check in beginTx function.

 The check is taken from go-sql-driver/mysql#934.

There is no way to check the liveness of a secure connection, as long as there is no access to raw TCP net.Conn in clickhouse-go.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants