Split brain issue #218

Open
ertong opened this issue Feb 4, 2019 · 6 comments
Labels: 1.1 (Issue related to Tinc 1.1), needs_investigation (Unexpected behaviours with uncertain causes - needs more investigation)

Comments

@ertong

ertong commented Feb 4, 2019

Once, while the tinc network was running and no active intervention had been made (no restarts, configuration changes, etc.), a lot of nodes suddenly went offline.

When I tried to find out what was happening, I saw the following graph from "dump graph":
image
In fact, I found two working subgraphs (when I dump the graph from a node in the other subgraph, red becomes green and vice versa).
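
For reference, a graph like the one above can be generated from the 1.1 CLI and rendered with graphviz (a sketch; it assumes the network name tit from the GraphDumpFile path in the configuration below, and that graphviz is installed):

# dump the current view of the mesh in graphviz format and render it
tinc -n tit dump graph > graph.dot
dot -Tpng graph.dot -o graph.png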

The network is a mix of 1.1pre17 and 1.1pre16 nodes (both subgraphs contain both versions).

I tried to "reload" different nodes several times and to restart tinc on different nodes several times. Every time, the node reconnects to the same subgraph.

A typical configuration is the following:

Name = min_mars
Device = /dev/net/tun
AddressFamily = ipv4
GraphDumpFile=/etc/tinc/tit/graph.txt
LocalDiscovery=yes
AutoConnect=yes

The "solution" was to stop tinc daemons on all nodes of one subgraph and start them one by one. After this, every started node connects to another subgraph and joins the full network.

Unfortunately, I do not know how to reproduce this, but I currently suspect something in the AutoConnect feature.

@gsliepen
Owner

I think the AutoConnect feature was immediately cancelling any attempts to repair split meshes once every node had three working connections. It might be fixed in commit de7d5a0.
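
To check whether a local source tree already contains that change, a plain git query works (a sketch, assuming a checkout of the tinc repository):

git merge-base --is-ancestor de7d5a0 HEAD && echo "contains fix" || echo "fix missing"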

@nh2
Contributor

nh2 commented Aug 3, 2019

Hi,

I believe this problem took down my production infrastructure (all nodes are on 1.1pre17) for a couple of hours today because it created a netsplit / network partition.

Randomly restarting some nodes worked.

Before restart

before-restart

After restart

after-restart

Explanation

In this network, node_3 and node_4 are decommissioned machines and were thus offline (for many weeks already).

However, tinc still seemed to count them towards the target of 3 working connections.

Questions

@gsliepen Can you confirm whether

  1. this understanding of mine is correct and this is the expected behaviour so far
  2. your commit de7d5a0 should fix the problem in this scenario as well (because it is different from the scenario in the issue description, where no connections exist between the two partitions, while I have connections via dead nodes)
  3. you'll make a release that includes that fix soon
  4. tinc is supposed to handle situations where large numbers of nodes are decommissioned in one go, as happened for me
  5. setting ConnectTo to all machines in my network would have avoided this issue?

Thanks!

@nh2
Contributor

nh2 commented Aug 4, 2019

setting ConnectTo to all machines in my network would have avoided this issue?

Hmm, this doesn't seem to help; even after I specified explicit each-to-each ConnectTo entries in every node's config file, most nodes still have no more than 3 meta connections.
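
For completeness, by each-to-each I mean every node listing every other node, e.g. (illustrative host names, one tinc.conf per node):

# /etc/tinc/<netname>/tinc.conf on node1
Name = node1
AutoConnect = yes
ConnectTo = node2
ConnectTo = node3
ConnectTo = worker1
ConnectTo = worker2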

@nh2
Contributor

nh2 commented Aug 11, 2019

I've now also experienced this netsplit even with no decommissioned nodes. The nodes hold two different views of the network:

View 1

image

View 2

image

@nh2
Contributor

nh2 commented Jan 17, 2020

@gsliepen I have now encountered another split-brain problem, even with commit de7d5a0 cherry-picked.

In my network of 8 machines, 4 believe in one world view and the other 4 in another one:

View 1 (4 machines think this)

cdn1


Other nodes with same view:

cdn2


node2


cdn3

View 2 (4 other machines think this)

node1


Other nodes with same view:

node3


worker2


worker1

Restarting is a workaround

After restarting tinc on node-1, I get this correct graph on all nodes:

image

@gsliepen Any other ideas to prevent this from happening?

@nh2
Contributor

nh2 commented Mar 30, 2020

Happened again to me today.

I strongly suspect that the KeyExpire setting, defaulting to 3600 seconds, is the reason that we see this so often.

I noticed this by observing the following hourly spike patterns in smokeping over the VPN connection:

image

There is no such pattern of failures in the non-VPN pings:

image


This does not provide an explanation or fix for the underlying issue (tinc getting netsplit and not recovering), but it does provide a way to work around it: setting KeyExpire to 68 years (higher values overflow the 32-bit int keylifetime).
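
Concretely, the workaround is a single line in tinc.conf (the value is illustrative; anything below 2^31 seconds, roughly 68 years, avoids the overflow):

# keys effectively never expire (~63 years)
KeyExpire = 2000000000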

However, given that incorrect keys seem to be what confuses tinc here, the question remains whether externally sent, incorrect keys could also trigger the same problem.

@fangfufu fangfufu added needs_investigation Unexpected behaviours with uncertain causes - needs more investigation 1.1 Issue related to Tinc 1.1 labels Jun 23, 2021