Race at client connection can lead to invalid client vs cluster state #259

thiagoss · 2016-12-23T22:04:42Z

When a client tries to reconnect to the cluster after a disconnection or timeout, a race between its connection attempts can leave the client and the cluster in an inconsistent state:

1) Client tries to connect to Server A and times out
2) Client tries to connect to Server B and succeeds
3) It is published that Client is connected to Server B
4) The first attempt at 1 finishes its process and it is published that Client is connected to Server A

In the end, the Client is connected to Server B but the cluster thinks it is connected to Server A. As a result, no events or replies to queries will ever be sent to the Client, because Server B, which is responsible for it, has no connection. I have logs in case they are needed to confirm the issue.
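The sequence above can be modeled with a small sketch. This is not Atomix/Copycat code: `ClusterViewSketch`, `publishConnected`, and `serverFor` are hypothetical names, and the only assumption is that the cluster records each "client is connected to server" publish as last-writer-wins.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not Atomix/Copycat code: the cluster's view of which
// server each client is connected to, updated last-writer-wins.
final class ClusterViewSketch {
    private final Map<String, String> serverByClient = new HashMap<>();

    // Record "client is connected to server". The latest call always wins,
    // even when it belongs to an older connection attempt.
    void publishConnected(String client, String server) {
        serverByClient.put(client, server);
    }

    // The server the cluster currently believes the client is connected to.
    String serverFor(String client) {
        return serverByClient.get(client);
    }

    public static void main(String[] args) {
        ClusterViewSketch cluster = new ClusterViewSketch();

        // Step 1: the attempt on Server A times out on the client side,
        // but it is still in flight inside the cluster.
        // Step 2: the client connects to Server B.
        String actualServer = "B";
        // Step 3: the connection to Server B is published.
        cluster.publishConnected("client", "B");
        // Step 4: the stale attempt from step 1 finishes and is published.
        cluster.publishConnected("client", "A");

        System.out.println("client is on " + actualServer
            + ", cluster believes " + cluster.serverFor("client"));
    }
}
```

In this model the stale publish from step 4 overwrites the correct one from step 3, which is the mismatch described in the report.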

This patch makes clients that are connected to the wrong server eventually time out and try a new server, instead of continuing to use a server they aren't supposed to be using.

The problem can happen when a client races connections to multiple servers after a sequence of timeouts (frequent in high-load scenarios) and the ConnectRequest/AcceptRequests reach the Leader in the wrong order. Example:

1) Client A tries to connect to Server X and times out
2) Client A tries to connect to Server Y and succeeds
3) Leader receives AcceptRequest from Server Y and processes/publishes it
4) Leader receives AcceptRequest from Server X and processes/publishes it

After this, Client A is connected to Server Y but the cluster believes it is connected to Server X. This triggers inconsistency checks that eventually leave Client A useless, with all of its operations timing out. Its internal session state won't receive publishes because Server Y won't send them (Server Y closed its connection once it received the message that Client A is now connected to Server X).

The solution in this patch is to reject messages from clients that are not connected to the server that receives them, so that the client eventually times out and starts a new connection. A new field was added to Session requests to distinguish a request that was forwarded (and must be handled) from one that came directly from a client (and needs a check that the client is connected to the receiver).

Fixes atomix#259
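A minimal sketch of the check described above, for illustration only. It is not the code from the patch: `ServerGuardSketch`, `SessionRequest`, the `forwarded` field name, and the `accept` method are all hypothetical, and the sketch assumes each server tracks the set of clients that the published cluster state assigns to it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the Atomix/Copycat implementation.
final class ServerGuardSketch {

    // Hypothetical request shape: only the two fields the check needs.
    static final class SessionRequest {
        final String clientId;
        final boolean forwarded; // true when another server relayed the request

        SessionRequest(String clientId, boolean forwarded) {
            this.clientId = clientId;
            this.forwarded = forwarded;
        }
    }

    // Clients that the published cluster state assigns to this server (assumed).
    private final Set<String> clientsAssignedHere = new HashSet<>();

    void onClientAssigned(String clientId) {
        clientsAssignedHere.add(clientId);
    }

    void onClientUnassigned(String clientId) {
        clientsAssignedHere.remove(clientId);
    }

    // Returns true when the request should be handled, false when it is rejected.
    boolean accept(SessionRequest request) {
        if (request.forwarded) {
            // Forwarded requests must be handled regardless of local connections.
            return true;
        }
        // Direct requests are only valid from clients assigned to this server.
        return clientsAssignedHere.contains(request.clientId);
    }
}
```

In the example scenario, Server Y would stop accepting direct requests from Client A once it learns that Client A is assigned to Server X, so Client A times out and opens a new connection.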

kuujo · 2017-02-26T10:58:49Z

Fixed by #276

thiagoss mentioned this issue Dec 23, 2016

Reject messages from clients that are connected to the wrong server #260

Closed

kuujo added the bug label Jan 4, 2017

kuujo closed this as completed Feb 26, 2017
