SegFault when using TunnelServer=yes #247

Open

lukavia opened this issue Jun 27, 2020 · 3 comments
Labels

1.0: Issue related to Tinc 1.0
1.1: Issue related to Tinc 1.1
bug: Issues in which the users gave a clear indication of the causes of the unexpected behaviour
potentially_fixed: This might have been fixed, awaiting confirmation.

Comments

lukavia commented Jun 27, 2020

I have a network of about 800 nodes, a mix of tinc 1.0 and 1.1. It has been growing gradually for several years now.

The problem is that at some point the daemon apparently cannot keep up with processing new connections and the accompanying edge updates.

There are 3 major nodes in the system, and every other node initially connects to one of them.

After a lot of debugging I've restricted all nodes to connect to only one of the major nodes, and I use iptables to admit new connections gradually; the last limit I tried was 5 per minute (roughly along the lines sketched below).
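The exact rules aren't part of this report; the following is only a hypothetical sketch of such a rate limit, assuming tinc's default meta-connection port 655/tcp:

```sh
# Hypothetical sketch, not the actual rules used: admit at most 5 new
# tinc meta-connections per minute on the default tinc port (655/tcp);
# anything over the limit is dropped and the peer retries later.
iptables -A INPUT -p tcp --dport 655 -m conntrack --ctstate NEW \
         -m limit --limit 5/minute --limit-burst 5 -j ACCEPT
iptables -A INPUT -p tcp --dport 655 -m conntrack --ctstate NEW -j DROP
```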

I've started monitoring how the edges grow on the main node, and I see that, although I've limited the connections on the other 2 major nodes, at some point there are rapid spikes in the edge count when a new connection is established.
My guess is that the other nodes hold a stale view of the edges and push it when they reconnect, and that this is what overwhelms the main nodes.

So I've decided to set TunnelServer=yes on the major nodes so they don't propagate edges between the nodes connected to them.
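For reference, TunnelServer is a per-daemon boolean in tinc.conf; on a major node the relevant lines would look something like this (hypothetical names, not the actual config from this network):

```
# /etc/tinc/<netname>/tinc.conf on a major node (hypothetical example)
Name = Backbone
# Only exchange information with directly connected, locally known nodes;
# don't forward other daemons' routing information between them.
TunnelServer = yes
```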

However, I get a segfault soon after startup on each node where I enable that option.

I've built from the latest code, and here is a trace of such a run (this is not from a "major" node, but the effect is the same):

```
Got ANS_KEY from Backbone (xxx.xxx.xxx.xxx port 655): 16 Office Lukav_Beast 52201D7CFDC2C7E1FD7871A36E651B7AC24A52B4ED892CD953397F6BA859AB22D5D4CB235B9CF85910B6BDE91A34C85E 427 672 4 0 yyy.yyy.yyy.yyy 13935
Using reflexive UDP address from Office: yyy.yyy.yyy.yyy port 13935
UDP address of Office set to yyy.yyy.yyy.yyy port 13935
Got REQ_KEY from Backbone (xxx.xxx.xxx.xxx port 655): 15 Office Lukav_Beast

Program received signal SIGSEGV, Segmentation fault.
0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
382        return send_request(to->nexthop->connection, "%d %s %s %s %d %d %d %d", ANS_KEY,
(gdb) bt
#0  0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
#1  0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
#2  0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
#3  0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
#4  0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
#5  0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
#6  0x000055555555c60a in event_loop () at event.c:359
#7  0x00005555555607f2 in main_loop () at net.c:510
#8  0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
(gdb) bt full
#0  0x000055555556de41 in send_ans_key (to=to@entry=0x555555851060) at protocol_key.c:382
        keylen = <optimized out>
        key = "527E64B1DB47F2F527ADF7F609498FFCB4807AEC3CD49697D3D8D870619BC537E1B7C403875D81FC608A8F6E00D06063\000\306\377\377\377\177\000\000\331\334VUUU", '\000' <repeats 11 times>, "*ֲ\322\316\000\305\000\000\000\000\000\000\000\000\340\033\205UUU\000\000\001\000\000\000\000\000\000\000P\316\377\377\377\177\000\000\267K\205UUU\000\000`\020\205UUU\000\000@\306\377\377\377\177\000\000i\341VUUU\000\000\000\000\000\000\377\177\000\000\000\000\000\000\000\000\000\000"...
#1  0x000055555556e169 in req_key_h (c=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol_key.c:304
        from_name = "Office\000\061\071.130", '\000' <repeats 1003 times>...
        to_name = "Lukav_Beast", '\000' <repeats 366 times>...
        from = 0x555555851060
        to = <optimized out>
        reqno = 0
#2  0x000055555556a083 in receive_request (c=c@entry=0x555555851be0, request=0x555555854bb7 "15 Office Lukav_Beast") at protocol.c:146
        reqno = <optimized out>
#3  0x000055555555e993 in receive_meta (c=c@entry=0x555555851be0) at meta.c:333
        result = <optimized out>
        request = <optimized out>
        inlen = 0
        inbuf = "a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=Α\254\257:\245\322ү\vƦ\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)ͺa\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<Ή?b(\267\321\356\360\242p$\314`\325ʆ\001|\036\204'\\\205i\314W\356#N4\000q\320\300\344\071\060\236w\016\306[\323X]\237\321\347\177\313KU\367ޚ\b}\307\374\367\032c\036\332:\307\367\265o\307Ƒ\212J\006NJ3!\305q\367\255\263\246\200i\035\327͌\001"...
        bufp = 0x7fffffffd6f0 "a\354\357\063J\363{\346d\177\271\371;+\212\371zFDt\271\061\370\ao\373\326\035\255=Α\254\257:\245\322ү\vƦ\205\035\336?1\234\372\001\004\063\323\t\004-\b8\367\f\201\342\304g\332\361jL76C\340-\t\006\210\214\314,C\352)ͺa\314\fAe\260\226\313\337\360|\256\236\263\344\205\061\207\303\t<\016\351\360\222\343[\317o\377\065<Ή?b(\267\321\356\360\242p$\314`\325ʆ\001|\036\204'\\\205i\314W\356#N4"
        endp = <optimized out>
#4  0x00005555555603f9 in handle_meta_connection_data (c=c@entry=0x555555851be0) at net.c:304
No locals.
#5  0x00005555555678c2 in handle_meta_io (data=0x555555851be0, flags=<optimized out>) at net_socket.c:520
        c = 0x555555851be0
        socket_error = <optimized out>
        len = <optimized out>
#6  0x000055555555c60a in event_loop () at event.c:359
        node = 0x555555797dd8 <signalio+24>
        next = 0x555555797dd8 <signalio+24>
        io = 0x555555851d90
        tv = <optimized out>
        fds = <optimized out>
        curgen = 7
        diff = {tv_sec = 0, tv_usec = 512516}
        n = <optimized out>
        readable = {fds_bits = {256, 0 <repeats 15 times>}}
        writable = {fds_bits = {0 <repeats 16 times>}}
#7  0x00005555555607f2 in main_loop () at net.c:510
        sighup = {signum = 1, cb = 0x555555560480 <sighup_handler>, data = 0x7fffffffe1a0, node = {next = 0x7fffffffe2a8, prev = 0x0, parent = 0x7fffffffe2a8, left = 0x0, right = 0x0, data = 0x7fffffffe1a0}}
        sigterm = {signum = 15, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe1f0, node = {next = 0x0, prev = 0x7fffffffe2f8, parent = 0x7fffffffe2f8, left = 0x0, right = 0x0, data = 0x7fffffffe1f0}}
        sigquit = {signum = 3, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe240, node = {next = 0x7fffffffe2f8, prev = 0x7fffffffe2a8, parent = 0x7fffffffe2f8, left = 0x7fffffffe2a8, right = 0x0, data = 0x7fffffffe240}}
        sigint = {signum = 2, cb = 0x55555555f900 <sigterm_handler>, data = 0x7fffffffe290, node = {next = 0x7fffffffe258, prev = 0x7fffffffe1b8, parent = 0x7fffffffe258, left = 0x7fffffffe1b8, right = 0x0, data = 0x7fffffffe290}}
        sigalrm = {signum = 14, cb = 0x5555555605b0 <sigalrm_handler>, data = 0x7fffffffe2e0, node = {next = 0x7fffffffe208, prev = 0x7fffffffe258, parent = 0x0, left = 0x7fffffffe258, right = 0x7fffffffe208, data = 0x7fffffffe2e0}}
#8  0x0000555555559208 in main (argc=6, argv=<optimized out>) at tincd.c:558
        umbstr = <optimized out>
        priority = 0x0
```

Any help is much appreciated, since my network is unusable at the moment.

fangfufu added the bug label (and added and then removed the needs_investigation label) on Jun 23, 2021
fangfufu (Collaborator) commented

@lukavia, which nodes are crashing? 1.1 nodes or 1.0 nodes? Have you got the exact version number of the nodes that are crashing?

@gsliepen, are 1.1 nodes meant to be compatible with 1.0 nodes?

fangfufu added the 1.0 and 1.1 labels on Jun 25, 2021
lukavia (Author) commented Jun 28, 2021

Version: 1.1pre17-1.2bpo10+1

Some time has passed, but I think the problem also exists with only 1.1 nodes.

However, I've since migrated to another VPN solution and cannot assist with tests anymore. Sorry.

Best of luck

gsliepen added a commit that referenced this issue Jul 20, 2021
We could have a REQ_KEY coming from a node that is not reachable; either
because DEL_EDGEs have overtaken the REQ_KEY, or perhaps if TunnelServer
is used and some nodes have a different view of reachability.

This might fix GitHub issue #247.
gsliepen (Owner) commented

This might be fixed by commit ed070d7.
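For readers following along: the crash site (protocol_key.c:382 in the trace above) dereferences to->nexthop->connection unconditionally, and the commit message suggests the REQ_KEY can name a node that is no longer reachable, in which case nexthop can be NULL. Below is a self-contained sketch of that failure mode and the kind of guard the commit message describes; the types are stripped-down stand-ins rather than tinc's real definitions, and this is not the actual diff from ed070d7.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stripped-down stand-ins for tinc's types, just to illustrate the
 * crash pattern -- not tinc's real definitions. */
typedef struct connection_t { const char *name; } connection_t;

typedef struct node_t {
    const char *name;
    struct node_t *nexthop;   /* NULL once the node is unreachable */
    connection_t *connection; /* meta-connection used to reach it */
} node_t;

/* protocol_key.c:382 calls send_request(to->nexthop->connection, ...) with
 * no check; if DEL_EDGEs (or TunnelServer's restricted view of the graph)
 * already made `to` unreachable, nexthop is NULL and tincd segfaults.
 * Guarding the request first, roughly as the commit message describes: */
static bool send_ans_key_guarded(node_t *to) {
    if (!to->nexthop || !to->nexthop->connection) {
        fprintf(stderr, "Got REQ_KEY for unreachable node %s, ignoring\n",
                to->name);
        return false;
    }
    /* ... the real code would send the ANS_KEY request over
       to->nexthop->connection here ... */
    return true;
}

int main(void) {
    node_t office = { .name = "Office", .nexthop = NULL, .connection = NULL };
    send_ans_key_guarded(&office);  /* logs and returns instead of crashing */
    return 0;
}
```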

gsliepen added the potentially_fixed label on Jul 20, 2021