This repository has been archived by the owner on Dec 19, 2017. It is now read-only.

State Divergence with long Snapshot times #242

Open
vyzo opened this issue Jul 18, 2016 · 1 comment

vyzo commented Jul 18, 2016

We have run into an issue that eventually leads to state divergence in a cluster, as snapshot times get longer and longer.

We are using a snapshottable state machine implementation.

The state machine maintains a growing map that is included in the snapshot. We have noticed that as the snapshot time increases, the cluster eventually hits a sequence of events that ends with the divergence of a cluster member; all long runs in our tests eventually reach this state.
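For reference, the general shape of such a state machine is roughly the following (a minimal sketch against Copycat's Snapshottable interface; the class and command names here are illustrative, not our actual io.mediachain.copycat code):

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

import io.atomix.copycat.Command;
import io.atomix.copycat.server.Commit;
import io.atomix.copycat.server.Snapshottable;
import io.atomix.copycat.server.StateMachine;
import io.atomix.copycat.server.storage.snapshot.SnapshotReader;
import io.atomix.copycat.server.storage.snapshot.SnapshotWriter;

// Illustrative stand-in for our journal state machine: the map grows with
// every committed entry, so snapshots get slower over the cluster's lifetime.
public class GrowingMapStateMachine extends StateMachine implements Snapshottable {
  private Map<String, String> entries = new HashMap<>();

  // Hypothetical command; our real commands carry journal entries.
  public static class PutEntry implements Command<String>, Serializable {
    public final String key;
    public final String value;
    public PutEntry(String key, String value) { this.key = key; this.value = value; }
  }

  public String put(Commit<PutEntry> commit) {
    try {
      entries.put(commit.operation().key, commit.operation().value);
      return commit.operation().key;
    } finally {
      commit.close();
    }
  }

  @Override
  public void snapshot(SnapshotWriter writer) {
    // Serializes the entire map: this is the step whose duration eventually
    // exceeds the election timeout as the map grows.
    writer.writeObject(entries);
  }

  @Override
  public void install(SnapshotReader reader) {
    entries = reader.readObject();
  }
}
```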

The sequence of events extrapolated from our logs is as follows (in a cluster with 3 copycats):

  • the leader starts a snapshot which takes longer than the election timeout
  • a follower opens an election and becomes the new leader and then both it and the other follower take a new snapshot
  • the former leader finishes its snapshot and becomes a follower; state divergence eventually follows from this point.

Our distilled logs from runs causing divergence have the following form around the time of divergence:

# server-1
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.149:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.FollowerState - /10.10.1.149:10000 - Polling members [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.71:10000, clientAddress=/10.10.1.71:10000], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.72:10000, clientAddress=/10.10.1.72:10000]]
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Transitioning to CANDIDATE
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.CandidateState - /10.10.1.149:10000 - Starting election
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.CandidateState - /10.10.1.149:10000 - Requesting votes from [ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.71:10000, clientAddress=/10.10.1.71:10000], ServerMember[type=ACTIVE, status=AVAILABLE, serverAddress=/10.10.1.72:10000, clientAddress=/10.10.1.72:10000]]
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Transitioning to LEADER
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.149:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.149:10000 - Taking snapshot 2677034
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - Failed to configure /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.72:10000 failed. Reason: request timed out
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [catalyst-event-loop-2] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.72:10000
[14/07/2016:15:55:56 +0000] [catalyst-event-loop-3] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.72:10000
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:57 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.71:10000
[14/07/2016:15:55:57 +0000] [catalyst-event-loop-1] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat] WARN io.atomix.copycat.server.state.LeaderAppender - /10.10.1.149:10000 - AppendRequest to /10.10.1.71:10000 failed. Reason: request timed out
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat] INFO io.atomix.catalyst.transport.netty.NettyClient - Connecting to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [catalyst-event-loop-2] INFO io.atomix.catalyst.transport.netty.NettyClient - Connected to /10.10.1.71:10000
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.149:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmVh1nQnR9DaufoDF5dVD7fHCNPUSAGievHxLA5uHiv4Ts)

# server-2
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.71:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:51 +0000] [copycat-server-/10.10.1.71:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.71:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.71:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.71:10000 - Taking snapshot 2677034
[14/07/2016:15:56:02 +0000] [copycat-server-/10.10.1.71:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmVh1nQnR9DaufoDF5dVD7fHCNPUSAGievHxLA5uHiv4Ts)

# server-3 [divergent]
[14/07/2016:15:55:45 +0000] [copycat-server-/10.10.1.72:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1784832 -> MultihashReference(MultiHash[SHA-256]: QmcoCKJZ5Cti96R5M9svqEQFL9zMbt79JcBZe4aTnKymep)
[14/07/2016:15:55:47 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerStateMachine - /10.10.1.72:10000 - Taking snapshot 2677034
[14/07/2016:15:55:52 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.72:10000 - Transitioning to FOLLOWER
[14/07/2016:15:55:52 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.state.ServerContext - /10.10.1.72:10000 - Found leader /10.10.1.149:10000
[14/07/2016:15:55:56 +0000] [copycat-server-/10.10.1.72:10000-copycat] INFO io.atomix.copycat.server.storage.compaction.Compactor - Compacting log with compaction: MINOR
[14/07/2016:15:55:58 +0000] [copycat-server-/10.10.1.72:10000-copycat-state] INFO io.mediachain.copycat.StateMachine$JournalStateMachine - Generated block 1785344 -> MultihashReference(MultiHash[SHA-256]: QmQc3YAUgYHk5LacfTypUWvFhMEqC8ZpBjZ2AsydLFfvi7)

The only mitigation we have for this problem is to increase the election timeout so that it is longer than the snapshot time, and to try to reduce the snapshot time itself with custom serialization.
Indeed, we have observed that when snapshot times don't exceed the election timeout, the state remains identical across all members of the cluster.
This is not a long-term solution, however, because the snapshot time will eventually exceed the election timeout again as our system grows.
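Concretely, the workaround amounts to raising the timeout when constructing the server, along these lines (a sketch against the Copycat 1.x builder API; the address, storage path, and durations are placeholders):

```java
import java.time.Duration;

import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.netty.NettyTransport;
import io.atomix.copycat.server.CopycatServer;
import io.atomix.copycat.server.storage.Storage;

CopycatServer server = CopycatServer.builder(new Address("10.10.1.149", 10000))
    .withTransport(new NettyTransport())
    .withStateMachine(GrowingMapStateMachine::new)
    .withStorage(new Storage("/var/lib/copycat"))
    // Must stay above the worst-case snapshot time, which is exactly
    // what we cannot guarantee as the state keeps growing.
    .withElectionTimeout(Duration.ofSeconds(30))
    .withHeartbeatInterval(Duration.ofSeconds(5))
    .build();
```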

madjam (Collaborator) commented Aug 24, 2016

While I'm not arguing against the fact that system invariants should be maintained irrespective of how long snapshots take, I'm wondering how sustainable it is to have a system where snapshot times can grow indefinitely. Even if the underlying issue is fixed, you might still have a system that is unstable. As you are probably aware, larger election timeouts have other negative side effects, such as the system taking too long to recover from legitimate leader failures.

AFAIK, snapshotting is more appropriate when the underlying state does not grow indefinitely. For example: a counter.
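A minimal sketch of that kind of bounded state machine (illustrative code, not from the Copycat tree): the snapshot is a single long, so snapshot time stays constant no matter how many commands have been applied.

```java
import java.io.Serializable;

import io.atomix.copycat.Command;
import io.atomix.copycat.server.Commit;
import io.atomix.copycat.server.Snapshottable;
import io.atomix.copycat.server.StateMachine;
import io.atomix.copycat.server.storage.snapshot.SnapshotReader;
import io.atomix.copycat.server.storage.snapshot.SnapshotWriter;

// Bounded-state example: snapshot size is constant, so snapshot time can
// never outgrow the election timeout.
public class CounterStateMachine extends StateMachine implements Snapshottable {
  private long value;

  public static class Increment implements Command<Long>, Serializable {}

  public long increment(Commit<Increment> commit) {
    try {
      return ++value;
    } finally {
      commit.close();
    }
  }

  @Override
  public void snapshot(SnapshotWriter writer) {
    writer.writeLong(value);  // O(1) regardless of history length
  }

  @Override
  public void install(SnapshotReader reader) {
    value = reader.readLong();
  }
}
```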
