Remove tracking of connection addresses in server-side connection management #276

kuujo · 2017-02-01T21:28:26Z

This is yet another attempt to resolve the issue described in #260.

#275 removed the connection protocol altogether. However, I believe that could actually have detrimental effects: clients connected to followers could be prevented from progressing for a brief time (less than the keep-alive interval) after connecting to a new server. Part of this protocol is still useful. So, this PR preserves the ConnectRequest portion of the protocol, which associates a client's ID with a connection, but still removes AcceptRequest and ConnectEntry and still relies only on the Connection to indicate when the client disconnects from the server. I believe this should still resolve the problems described in #260 while reducing the time it takes for the client to continue to progress after switching servers.
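To make the intent concrete, here is a minimal sketch of the idea, not the code in this PR. `ConnectionRegistrySketch`, `FakeConnection`, and `handleConnect` are hypothetical names invented for illustration and are not Copycat classes. It assumes a connect request binds a client ID to a connection, and that the connection closing is the only signal used to drop that binding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; none of these types are Copycat's real classes. */
public class ConnectionRegistrySketch {

  /** Stand-in for a transport connection that notifies listeners when it closes. */
  public static class FakeConnection {
    private final List<Runnable> closeListeners = new ArrayList<>();

    public void onClose(Runnable listener) {
      closeListeners.add(listener);
    }

    public void close() {
      closeListeners.forEach(Runnable::run);
    }
  }

  // Client ID -> the connection the client most recently registered on this server.
  // No client addresses are tracked.
  private final Map<Long, FakeConnection> connections = new ConcurrentHashMap<>();

  /** Handles a ConnectRequest-like message by binding the client ID to the connection. */
  public void handleConnect(long clientId, FakeConnection connection) {
    connections.put(clientId, connection);
    // Remove the binding on close, but only if it still points at this connection;
    // the client may already have reconnected over a newer one.
    connection.onClose(() -> connections.remove(clientId, connection));
  }

  /** Returns the connection currently bound to the client, or null if there is none. */
  public FakeConnection connectionFor(long clientId) {
    return connections.get(clientId);
  }
}
```

The conditional `remove(key, value)` is the design choice worth noting: a stale connection closing late should not evict a newer binding for the same client.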


coveralls · 2017-02-01T21:42:45Z

Changes Unknown when pulling 385acda on remove-connection-addresses into master.

kuujo · 2017-02-01T22:36:44Z

This implementation does seem more stable for me thus far.

jhall11 · 2017-02-01T22:37:27Z

Me too. I'm still testing, but so far so good.

kuujo · 2017-02-03T01:19:56Z

@jhall11 what do you think?

jhall11 · 2017-02-06T19:20:00Z

After running a test that sets up a cluster, then continually kills node 1, waits a bit, restarts that container, and reconnects node 1 to the cluster, we have seen UnknownSessionException occur when the node tries to reconnect. This occurred on the 153rd iteration of the test, so it is not easy to reproduce.

We are running ONOS on top of this pull request with the latest atomix snapshot.

I have yet to sort through the logs, but here they are if you want to take a look: https://drive.google.com/file/d/0BwORWZ1M_qo_TkpUVWV1dTNPeGM/view?usp=sharing

kuujo · 2017-02-06T23:07:13Z

Thanks! I'll take a look at them...

kuujo · 2017-02-06T23:19:38Z

@jhall11 can you tell me a bit more about the setup?

What you're describing is a three node cluster, right? You kill one node and eventually restart it? But during that period, clients are still presumably connected to the two other nodes? However, you periodically see UnknownSessionException in the logs (which I see mostly on 172.17.0.2). And it seems like this is happening in OnosCopycatClient's retry mechanism, which I'm assuming Madan implemented because he knew Copycat's ordering guarantees prevent it from retrying queries itself and leave that to the user. Just want to make sure that's all right.

I can see two possibilities here. First is that the CopycatClient is for some reason not attempting to register a new session. Second is that there's some issue with the session registration logic in this PR that causes it to not be properly persisted in the cluster even after the client registers a new session. What we should expect to see is just one UnknownSessionException if any and see the client/cluster work out that problem themselves. But for some reason either the client isn't attempting to create a new session or it does attempt to create a new session and the cluster is losing it.

kuujo · 2017-02-06T23:26:06Z

From a brief glance at the logs, it looks to me like it's the former: the client isn't attempting to register a new session when it learns its session was expired by the cluster.

kuujo · 2017-02-06T23:32:11Z

I think I understand the issue, but it's going to take some time to fix and probably belongs in a separate PR.

kuujo · 2017-02-06T23:38:02Z

So, here's the problem. The DefaultCopycatClient just doesn't seem to be proactively expiring/recreating sessions in response to command/query failures. It only seems to do so when it learns its session was lost via KeepAliveResponse. This allows OnosCopycatClient queries to be retried 5 times and failed without the client attempting to recreate its session. Essentially, queries will fail until the client tries a keep-alive again, which is certainly not the behavior we want. But this will have to be addressed in another PR (I'll fork this branch).
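As a rough illustration of the behavior we want, here is a minimal sketch under stated assumptions, not DefaultCopycatClient's code and not the eventual fix. `SessionRecoverySketch`, `Cluster`, and `UnknownSessionError` are hypothetical stand-ins invented for the example. The retry loop plays the role of a caller-side retry such as the one in OnosCopycatClient, and the point is only that an unknown-session failure triggers re-registration immediately instead of waiting for the next keep-alive.

```java
/** Hypothetical sketch; none of these types are Copycat's real classes. */
public class SessionRecoverySketch {

  /** Stand-in for the failure this thread refers to as UnknownSessionException. */
  public static class UnknownSessionError extends RuntimeException {}

  /** Hypothetical stand-in for the cluster: registers sessions and runs queries. */
  public interface Cluster {
    long registerSession();

    String query(long sessionId, String query);
  }

  private final Cluster cluster;
  private long sessionId;

  public SessionRecoverySketch(Cluster cluster) {
    this.cluster = cluster;
    this.sessionId = cluster.registerSession();
  }

  public long sessionId() {
    return sessionId;
  }

  /** Runs a query, registering a new session whenever the current one is unknown. */
  public String query(String query, int maxAttempts) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    UnknownSessionError last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return cluster.query(sessionId, query);
      } catch (UnknownSessionError e) {
        // Treat the failure as proof the session is gone and recreate it now,
        // rather than retrying against the same dead session.
        last = e;
        sessionId = cluster.registerSession();
      }
    }
    throw last;
  }
}
```

With this shape, a client holding an expired session should see at most one unknown-session failure before its retries start succeeding, assuming the cluster accepts the new registration.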

kuujo added 2 commits February 1, 2017 13:15

0b3088b: Remove tracking of connection addresses in server-side connection management.

385acda: Remove the AcceptRequest protocol and ConnectEntry, leaving only Connection provided information to track connections.

kuujo added the bug label Feb 1, 2017

kuujo mentioned this pull request Feb 1, 2017

Remove ConnectRequest/AcceptRequest/ConnectEntry #275

Closed

kuujo mentioned this pull request Feb 7, 2017

Expire client sessions when command/query fail with unknown session #277

Merged

kuujo merged commit 2291417 into master Feb 7, 2017

kuujo deleted the remove-connection-addresses branch February 7, 2017 02:32

kuujo mentioned this pull request Feb 26, 2017

Race at client connection can lead to invalid client vs cluster state #259

Closed
