Skip to content
This repository has been archived by the owner on Dec 19, 2017. It is now read-only.

Remove tracking of connection addresses in server-side connection management #276

Merged
merged 2 commits into from
Feb 7, 2017

Conversation

kuujo
Copy link
Member

@kuujo kuujo commented Feb 1, 2017

This is yet another attempt to resolve the issue described in #260.

#275 removed the connection protocol altogether. However, I believe that could actually have detrimental affects in preventing clients that are connected to followers from progressing for a brief time (less than the keep alive interval) after connecting to a new server. Part of this protocol is still useful. So, this PR preserves the ConnectRequest portion of the protocol - which associates a client's ID with a connection - but still removes AcceptRequest and ConnectEntry and still relies only on the Connection to indicate when the client disconnects from the server. I believe this should still resolve the problems described in #260 but should hopefully reduce the time it takes for the client to continue to progress after switching servers.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 385acda on remove-connection-addresses into ** on master**.

@kuujo kuujo added the bug label Feb 1, 2017
@kuujo
Copy link
Member Author

kuujo commented Feb 1, 2017

This implementation does seem more stable for me thus far

@jhall11
Copy link
Collaborator

jhall11 commented Feb 1, 2017

me too, I'm still testing, but so far so good

@kuujo
Copy link
Member Author

kuujo commented Feb 3, 2017

@jhall11 what do you think?

@jhall11
Copy link
Collaborator

jhall11 commented Feb 6, 2017

After running a test that sets up a cluster, then continually kills node1, waits a bit, restarts that container, and reconnects node 1 to the cluster, We have seen UnknownSessionException occur when the node tries to reconnect. This occurred on the 153rd iteration of this test, so it is not very easy to replicate.

We are running ONOS on top of this pull request with the latest atomix snapshot.

I have yet to sort through the logs, but here they are if you want to take a look: https://drive.google.com/file/d/0BwORWZ1M_qo_TkpUVWV1dTNPeGM/view?usp=sharing

@kuujo
Copy link
Member Author

kuujo commented Feb 6, 2017

Thanks! I'll take a look at them...

@kuujo
Copy link
Member Author

kuujo commented Feb 6, 2017

@jhall11 can you tell me a bit more about the setup?

What you're describing is a three node cluster, right? You kill one node and eventually restart it? But during that period, clients are still presumably connected to the two other nodes? However, you periodically see UnknownSessionException in the logs (which I see mostly on 172.17.0.2). And it seems like this is happening in OnosCopycatClient's retry mechanism, which I'm assuming Madan implemented because he knew Copycat's ordering guarantees prevent it from retrying queries itself and leave that to the user. Just want to make sure that's all right.

I can see two possibilities here. First is that the CopycatClient is for some reason not attempting to register a new session. Second is that there's some issue with the session registration logic in this PR that causes it to not be properly persisted in the cluster even after the client registers a new session. What we should expect to see is just one UnknownSessionException if any and see the client/cluster work out that problem themselves. But for some reason either the client isn't attempting to create a new session or it does attempt to create a new session and the cluster is losing it.

@kuujo
Copy link
Member Author

kuujo commented Feb 6, 2017

From a brief glance at the logs looks to me like it's the former - the client isn't attempting to register a new session when it learns its session was expired by the cluster.

@kuujo
Copy link
Member Author

kuujo commented Feb 6, 2017

I think I understand the issue, but it's going to take some time to fix and probably belongs in a separate PR

@kuujo
Copy link
Member Author

kuujo commented Feb 6, 2017

So, here's the problem. The DefaultCopycatClient just doesn't seem to be proactively expiring/recreating sessions in response to command/query failures. It only seems to do so when it learns its session was lost via KeepAliveResponse. This allows OnosCopycatClient queries to be retried 5 times and failed without the client attempting to recreate its session. Essentially, queries will fail until the client tries a keep-alive again, which is certainly not the behavior we want. But this will have to be addressed in another PR (I'll fork this branch).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants