occasional QUIC connection failures #7526

marten-seemann · 2020-07-08T01:58:36Z

TLDR: I recently discovered a bug in quic-go, that leads to a stall of QUIC connections if a certain packet shortly after completion of the QUIC handshake is declared lost. Analysis of logs from one of our DHT nodes reveals that this affects around 0.37% of connections.

Connection IDs in QUIC

QUIC uses Connection IDs to match packets with connections. Each endpoint announces the connection ID its peer uses on outgoing packets. This means that a client will use connection ID X when sending a packet to the server, while the server will use connection ID Y to send a packet back to the client. Each connection can be associated with multiple connection IDs. Using a fresh connection ID is needed to guarantee unlinkability during connection migration. Endpoints are also free to occasionally use new connection IDs when not migrating.

Endpoints announce new connection IDs to their peer (using NEW_CONNECTION_ID frames). The peer can then switch to using a new connection ID, and retire the old connection id (using a RETIRE_CONNECTION_ID frame).

Analysis of the Bug

The bug occurs when an endpoint receives a duplicate NEW_CONNECTION_ID frame annoucing the connection ID it is currently already using (this can happen if the packet is spuriously declared lost by the peer). Due to an off-by-one bug, it will then send a RETIRE_CONNECTION_ID frame for that connection ID (which makes no sense at all), but still continue using this connection ID. When the peer receives the RETIRE_CONNECTION_ID frame, it will drop the association between that connection ID and the connection after 5 seconds (this is intended to allow for reordered packets to arrive). Once the connection ID is dropped, the connection is basically dead: it times out (on the on side), and results in a stateless reset for the other side.

Measurement of the Impact

I analyzed roughly 1.55 million QUIC connections handled by our DHT booster nodes. 5678 of them (0.37%) experienced the bug described above occurred. All but 34 of those connections either timed out or were closed with a stateless reset (these 34 connection were closed regularly within the 5 second window and therefore didn't suffer any consequences).

Proposed Solution

There are two fixes that need to be made in quic-go:

Fix the bug: Endpoints need to stop misbehaving when receiving a duplicate NEW_CONNECTION_ID frame: don't retire the conn ID that's in use when receiving a retransmission quic-go/quic-go#2652
Peers need to enforce that a connection ID that's still in use is not retired. In fact, the QUIC specification defines this as a connection error, which we didn't check for so far: enforce that a connection ID is not retired in a packet that uses that connection ID quic-go/quic-go#2651. Arguably, a connection error is preferable to a connection timeout, since it allows for a faster recovery.

At this point, I propose to do the following:

Merge the fix (1.)

Add a flag to control the following:

Don't issue any new connection IDs. Endpoints will use the same connection for the entire connection. quic-go doesn't support connection migration, so this won't break anything.
Skip enforcement of (2.)

With this flag switched on, nodes that install the update will not issue any new connection IDs to their peers, thereby preventing the bug from occurring in nodes that haven't updated yet.

Once a large enough fraction of the network has upgraded, we can toggle the flag, and go back to using multiple connection IDs per connection.

marten-seemann added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jul 8, 2020

This was referenced Jul 8, 2020

add a flag to disable conn ID generation and the check for retired conn IDs quic-go/quic-go#2660

Merged

update quic-go to v0.17.3 libp2p/go-libp2p-quic-transport#168

Merged

update QUIC, enable the RetireBugBackwardsCompatibilityMode #7535

Merged

aschmahmann closed this as completed in #7535 Jul 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

occasional QUIC connection failures #7526

occasional QUIC connection failures #7526

marten-seemann commented Jul 8, 2020

occasional QUIC connection failures #7526

occasional QUIC connection failures #7526

Comments

marten-seemann commented Jul 8, 2020

Connection IDs in QUIC

Analysis of the Bug

Measurement of the Impact

Proposed Solution