This repository has been archived by the owner on Dec 19, 2017. It is now read-only.

Race at client connection can lead to invalid client vs cluster state #259

Closed
thiagoss opened this issue Dec 23, 2016 · 1 comment

@thiagoss

A client that tries to reconnect to the cluster after a disconnection or timeout can leave the cluster in an inconsistent state:

  1. Client tries to connect to Server A and times out
  2. Client tries to connect to Server B and succeeds
  3. It is published that Client is connected to Server B
  4. The first attempt (step 1) finishes late, and it is published that Client is connected to Server A

In the end, the Client is connected to Server B but the cluster thinks it is connected to Server A. As a result, no events or query replies will ever be sent to the Client, because Server B, which is responsible for it, has no connection. I have logs if they are needed to confirm the issue.
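
The ordering problem in the four steps above can be reproduced with a toy model. This is only a sketch: `ConnectionRegistry`, `publish`, and `serverFor` are hypothetical names, not Copycat classes, and the sketch assumes the cluster simply keeps whichever registration it processed last.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the cluster's view of client connections.
// None of these names come from Copycat; they only illustrate the race.
class ConnectionRegistry {
    // client id -> server the cluster believes the client is connected to
    private final Map<String, String> published = new HashMap<>();

    // Assumption: last write wins, so the registration processed last
    // is what the cluster believes, regardless of which attempt is live.
    void publish(String client, String server) {
        published.put(client, server);
    }

    String serverFor(String client) {
        return published.get(client);
    }

    public static void main(String[] args) {
        ConnectionRegistry cluster = new ConnectionRegistry();
        String actualServer = "B"; // the client's only live connection (step 2)

        cluster.publish("client", "B"); // step 3: the successful attempt is published
        cluster.publish("client", "A"); // step 4: the stale attempt from step 1 finishes late

        System.out.println("cluster believes: " + cluster.serverFor("client"));
        System.out.println("actually connected to: " + actualServer);
    }
}
```

With this ordering the registry ends up pointing at Server A even though the only live connection is to Server B, which is the mismatch described above.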

thiagoss pushed a commit to waltznetworks/copycat that referenced this issue Dec 23, 2016
This will make such clients eventually time out and try a new server
instead of trying to use a server that they aren't supposed to be using.

This can happen when a client races to connect to multiple servers
after a sequence of timeouts (frequent in high-load scenarios) and
the ConnectRequests/AcceptRequests end up at the Leader in the
wrong order.

Example:

1) Client A tries to connect to Server X and times out
2) Client A tries to connect to Server Y and succeeds
3) Leader receives AcceptRequest from Server Y and processes/publishes it
4) Leader receives AcceptRequest from Server X and processes/publishes it

After this, Client A is connected to Server Y but the cluster believes
it should be connected to Server X. This trips inconsistency checks
that eventually leave Client A useless, with all of its operations
timing out. Its internal session state won't receive publishes because
Server Y won't send them (its connection was closed once it received
the message that Client A is now connected to Server X).

The solution in this patch is to reject messages from clients that are
not connected to the receiving server, so that the client eventually
times out and starts a new connection. A new field was added to Session
requests to distinguish a request that was forwarded (must be handled)
from one that came directly from a client (the receiver must check that
the client is connected to it).

Fixes atomix#259
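
For illustration, the check described in the commit message might look roughly like the sketch below. It is not the code from the patch: `ServerSketch`, `SessionRequest`, the `forwarded` flag, and the helper methods are hypothetical names chosen for this example, and the real field and request types in Copycat may differ.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the rejection rule; these names are not Copycat's API.
class ServerSketch {
    static final class SessionRequest {
        final String clientId;
        final boolean forwarded; // assumed flag: true when another server relayed the request

        SessionRequest(String clientId, boolean forwarded) {
            this.clientId = clientId;
            this.forwarded = forwarded;
        }
    }

    // Clients that the cluster has published as connected to this server.
    private final Set<String> connectedClients = new HashSet<>();

    void clientPublishedHere(String clientId) {
        connectedClients.add(clientId);
    }

    void clientPublishedElsewhere(String clientId) {
        connectedClients.remove(clientId);
    }

    // Returns true if the request is handled, false if it is rejected.
    boolean accept(SessionRequest request) {
        if (request.forwarded) {
            return true; // forwarded requests must always be handled
        }
        // Direct requests are handled only if the client is registered here.
        return connectedClients.contains(request.clientId);
    }

    public static void main(String[] args) {
        ServerSketch serverY = new ServerSketch();
        serverY.clientPublishedHere("A");      // AcceptRequest from Server Y is published
        serverY.clientPublishedElsewhere("A"); // stale AcceptRequest from Server X is published

        System.out.println(serverY.accept(new SessionRequest("A", false))); // direct request
        System.out.println(serverY.accept(new SessionRequest("A", true)));  // forwarded request
    }
}
```

In this sketch the direct request is rejected once the cluster has published Client A as connected elsewhere, so the client would eventually time out and reconnect, while forwarded requests are still handled.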
@kuujo kuujo added the bug label Jan 4, 2017
thiagoss pushed a commit to waltznetworks/copycat that referenced this issue Jan 5, 2017
thiagoss pushed a commit to waltznetworks/copycat that referenced this issue Jan 10, 2017
thiagoss pushed a commit to waltznetworks/copycat that referenced this issue Jan 10, 2017
@kuujo
Member

kuujo commented Feb 26, 2017

Fixed by #276

@kuujo kuujo closed this as completed Feb 26, 2017