Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

occasional QUIC connection failures #7526

Closed
marten-seemann opened this issue Jul 8, 2020 · 0 comments · Fixed by #7535
Closed

occasional QUIC connection failures #7526

marten-seemann opened this issue Jul 8, 2020 · 0 comments · Fixed by #7535
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization

Comments

@marten-seemann
Copy link
Member

TLDR: I recently discovered a bug in quic-go, that leads to a stall of QUIC connections if a certain packet shortly after completion of the QUIC handshake is declared lost. Analysis of logs from one of our DHT nodes reveals that this affects around 0.37% of connections.

Connection IDs in QUIC

QUIC uses Connection IDs to match packets with connections. Each endpoint announces the connection ID its peer uses on outgoing packets. This means that a client will use connection ID X when sending a packet to the server, while the server will use connection ID Y to send a packet back to the client. Each connection can be associated with multiple connection IDs. Using a fresh connection ID is needed to guarantee unlinkability during connection migration. Endpoints are also free to occasionally use new connection IDs when not migrating.

Endpoints announce new connection IDs to their peer (using NEW_CONNECTION_ID frames). The peer can then switch to using a new connection ID, and retire the old connection id (using a RETIRE_CONNECTION_ID frame).

Analysis of the Bug

The bug occurs when an endpoint receives a duplicate NEW_CONNECTION_ID frame annoucing the connection ID it is currently already using (this can happen if the packet is spuriously declared lost by the peer). Due to an off-by-one bug, it will then send a RETIRE_CONNECTION_ID frame for that connection ID (which makes no sense at all), but still continue using this connection ID. When the peer receives the RETIRE_CONNECTION_ID frame, it will drop the association between that connection ID and the connection after 5 seconds (this is intended to allow for reordered packets to arrive). Once the connection ID is dropped, the connection is basically dead: it times out (on the on side), and results in a stateless reset for the other side.

Measurement of the Impact

I analyzed roughly 1.55 million QUIC connections handled by our DHT booster nodes. 5678 of them (0.37%) experienced the bug described above occurred. All but 34 of those connections either timed out or were closed with a stateless reset (these 34 connection were closed regularly within the 5 second window and therefore didn't suffer any consequences).

Proposed Solution

There are two fixes that need to be made in quic-go:

  1. Fix the bug: Endpoints need to stop misbehaving when receiving a duplicate NEW_CONNECTION_ID frame: don't retire the conn ID that's in use when receiving a retransmission quic-go/quic-go#2652
  2. Peers need to enforce that a connection ID that's still in use is not retired. In fact, the QUIC specification defines this as a connection error, which we didn't check for so far: enforce that a connection ID is not retired in a packet that uses that connection ID quic-go/quic-go#2651. Arguably, a connection error is preferable to a connection timeout, since it allows for a faster recovery.

At this point, I propose to do the following:

  • Merge the fix (1.)

Add a flag to control the following:

  • Don't issue any new connection IDs. Endpoints will use the same connection for the entire connection. quic-go doesn't support connection migration, so this won't break anything.
  • Skip enforcement of (2.)

With this flag switched on, nodes that install the update will not issue any new connection IDs to their peers, thereby preventing the bug from occurring in nodes that haven't updated yet.

Once a large enough fraction of the network has upgraded, we can toggle the flag, and go back to using multiple connection IDs per connection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant