Split brain issue #218

Open
ertong opened this issue Feb 4, 2019 · 6 comments
Labels: 1.1 (Issue related to Tinc 1.1), needs_investigation (Unexpected behaviours with uncertain causes - needs more investigation)

Comments

@ertong

ertong commented Feb 4, 2019

Once, while the tinc network was running and no active intervention had been made (no restarts, configuration changes, etc.), a lot of nodes suddenly went offline.

When I tried to find out what was happening, I saw the following graph from "dump graph":
image
In fact, I found two working subgraphs (when I dump the graph from a node in the other subgraph, red becomes green and vice versa).
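
For reference, a graph like the one above can be generated from the 1.1 CLI and rendered with graphviz (a sketch; it assumes the network name tit from the GraphDumpFile path in the configuration below, and that graphviz is installed):

# dump the current view of the mesh in graphviz format and render it
tinc -n tit dump graph > graph.dot
dot -Tpng graph.dot -o graph.png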

The network is a mix of 1.1pre17 and 1.1pre16 nodes (both subgraphs contain both versions).

I tried to "reload" different nodes several times and to restart tinc on different nodes several times. Every time, the node reconnects to the same subgraph.

A typical configuration is the following:

Name = min_mars
Device = /dev/net/tun
AddressFamily = ipv4
GraphDumpFile=/etc/tinc/tit/graph.txt
LocalDiscovery=yes
AutoConnect=yes

The "solution" was to stop tinc daemons on all nodes of one subgraph and start them one by one. After this, every started node connects to another subgraph and joins the full network.

Unfortunately, I do not know how to reproduce this, but I currently suspect something in the AutoConnect feature.

@gsliepen
Owner

I think the AutoConnect feature was immediately cancelling any attempts to repair split meshes once every node had three working connections. It might be fixed in commit de7d5a0.
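
To check whether a local source tree already contains that change, a plain git query works (a sketch, assuming a checkout of the tinc repository):

git merge-base --is-ancestor de7d5a0 HEAD && echo "contains fix" || echo "fix missing"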

@nh2
Contributor

nh2 commented Aug 3, 2019

Hi,

I believe this problem took down my production infrastructure (all nodes are on 1.1pre17) for a couple of hours today because it created a netsplit / network partition.

Randomly restarting some nodes worked.

Before restart

before-restart

After restart

after-restart

Explanation

In this network, node_3 and node_4 are decommissioned machines and were thus offline (for many weeks already).

However, tinc still seemed to count them towards the target of 3 working connections.

Questions

@gsliepen Can you confirm whether

  1. this understanding of mine is correct and this is the expected behaviour so far
  2. your commit de7d5a0 should fix the problem in this scenario as well (because it is different from the scenario in the issue description, where no connections exist between the two partitions, while I have connections via dead nodes)
  3. you'll make a release that includes that fix soon
  4. tinc is supposed to handle situations where large numbers of nodes are decommissioned in one go, as happened for me
  5. setting ConnectTo to all machines in my network would have avoided this issue?

Thanks!

@nh2
Contributor

nh2 commented Aug 4, 2019

setting ConnectTo to all machines in my network would have avoided this issue?

Hmm, this doesn't seem to help; even after I specified explicit each-to-each ConnectTo entries in every node's config file, most nodes still have no more than 3 meta connections.
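
For completeness, by each-to-each I mean every node listing every other node, e.g. (illustrative host names, one tinc.conf per node):

# /etc/tinc/<netname>/tinc.conf on node1
Name = node1
AutoConnect = yes
ConnectTo = node2
ConnectTo = node3
ConnectTo = worker1
ConnectTo = worker2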

@nh2
Contributor

nh2 commented Aug 11, 2019

I've now also experienced this netsplit even with no decommissioned nodes. The nodes hold two different views of the network:

View 1

image

View 2

image

@nh2
Contributor

nh2 commented Jan 17, 2020

@gsliepen I have now encountered another split-brain problem, even with commit de7d5a0 cherry-picked.

In my network of 8 machines, 4 believe in one world view and the other 4 in another one:

View 1 (4 machines think this)

cdn1


Other nodes with same view:

cdn2


node2


cdn3

View 2 (4 other machines think this)

node1


Other nodes with same view:

node3


worker2


worker1

Restarting is a workaround

After restarting tinc on node-1, I get this correct graph on all nodes:

image

@gsliepen Any other ideas to prevent this from happening?

@nh2
Contributor

nh2 commented Mar 30, 2020

Happened again to me today.

I strongly suspect that the KeyExpire setting, defaulting to 3600 seconds, is the reason that we see this so often.

I noticed this by observing the following hourly spike patterns in smokeping over the VPN connection:

image

There is no such pattern of failures in the non-VPN pings:

image


This does not provide an explanation or fix for the underlying issue (tinc getting netsplit and not recovering), but it does provide a way to work around it: setting KeyExpire to 68 years (higher values overflow the 32-bit int keylifetime).
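
Concretely, the workaround is a single line in tinc.conf (the value is illustrative; anything below 2^31 seconds, roughly 68 years, avoids the overflow):

# keys effectively never expire (~63 years)
KeyExpire = 2000000000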

However, given that incorrect keys seem to be what confuses tinc here, the question remains whether externally sent, incorrect keys could also trigger the same problem.

@fangfufu fangfufu added needs_investigation Unexpected behaviours with uncertain causes - needs more investigation 1.1 Issue related to Tinc 1.1 labels Jun 23, 2021