This repository has been archived by the owner on Dec 19, 2017. It is now read-only.
Race at client connection can lead to invalid client vs cluster state #259
thiagoss pushed a commit to waltznetworks/copycat that referenced this issue on Dec 23, 2016
This makes clients eventually time out and try a new server instead of continuing to use a server that they aren't supposed to be using.

This can happen when a client races to connect to multiple servers because of a sequence of timeouts (frequent in high-load scenarios) and the ConnectRequest/AcceptRequests end up on the Leader in the wrong order.

Example:
1) Client A tries to connect to Server X and times out
2) Client A tries to connect to Server Y and succeeds
3) Leader receives AcceptRequest from Server Y and processes/publishes it
4) Leader receives AcceptRequest from Server X and processes/publishes it

After this, Client A is connected to Server Y but the cluster believes it should be connected to Server X. This triggers some inconsistency checks that eventually lead to Client A becoming useless, with all of its operations timing out. Its internal session state won't receive publishes because Server Y won't send them (its connection was closed once it received the message that Client A is now connected to Server X).

The solution in this patch is to reject messages from clients that are not connected to the server receiving them, so that the client eventually times out and starts a new connection. A new field was added to Session requests to distinguish a request that was forwarded (and must be handled) from one that came directly from a client (and needs a check that the client is connected to the receiver).

Fixes atomix#259
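The race and the check described in the commit message can be sketched as a small model. This is only an illustration under stated assumptions, not Copycat's actual code: the `Leader` and `Server` classes, the `accept` and `handle` methods, and the `forwarded` flag name are all hypothetical, and the replicated cluster state is simplified to a single dictionary that every server can read.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Leader:
    """Hypothetical model of the cluster's view: client -> server name."""
    client_server: Dict[str, str] = field(default_factory=dict)

    def accept(self, client: str, server: str) -> None:
        # The last AcceptRequest processed wins, regardless of which
        # connection attempt actually succeeded on the client side.
        self.client_server[client] = server


@dataclass
class Server:
    """Hypothetical server that applies the check described in the patch."""
    name: str
    leader: Leader

    def handle(self, client: str, forwarded: bool) -> str:
        # Forwarded requests come from another server and must be handled.
        if forwarded:
            return "handled"
        # Direct requests are handled only if the cluster believes this
        # client is connected to this server. Otherwise they are rejected,
        # so the client eventually times out and opens a new connection.
        if self.leader.client_server.get(client) != self.name:
            return "rejected"
        return "handled"


leader = Leader()
server_x = Server("X", leader)
server_y = Server("Y", leader)

# Steps 1-2: Client A times out on Server X, then connects to Server Y.
actual_connection = "Y"

# Steps 3-4: the AcceptRequests reach the Leader in the wrong order.
leader.accept("A", "Y")
leader.accept("A", "X")

cluster_view = leader.client_server["A"]
direct_result = server_y.handle("A", forwarded=False)
forwarded_result = server_y.handle("A", forwarded=True)

print(f"client is on {actual_connection}, cluster believes {cluster_view}")
print(f"direct request to Y: {direct_result}")
print(f"forwarded request to Y: {forwarded_result}")
```

In this model the direct request to Server Y is rejected because the cluster's view points at Server X, while a forwarded request is still handled. The rejection is what pushes the client toward a timeout and a fresh connection, after which the cluster's view and the real connection can agree again.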
thiagoss pushed a commit with the same message to waltznetworks/copycat that referenced this issue again on Jan 5, 2017 and twice on Jan 10, 2017.
Fixed by #276
A client can try to reconnect to the cluster after a disconnection/timeout and it might lead to an inconsistent state:
In the end, the Client will be connected to Server B but the cluster thinks it is connected to Server A. As a result, no events or replies to queries will ever be sent to the Client, because Server B, which is responsible for it, has no connection. I have logs in case they are needed to confirm the issue.