
Load-Peaks and still not multidomain-usable #57

Open

tackin opened this issue Mar 20, 2020 · 14 comments

tackin commented Mar 20, 2020

[graph: load-peaks]
The gateways Erai and Rustig are using our fork (https://github.com/freifunktrier/mesh-announce) of this repo.
I have had 3 problems:

  1. Load peaks
  2. Node data in YANIC permanently changing to wrong values
  3. Warnings like the following (see the sketch below for what this check means):
    Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.179+01:00" level="warn" msg="override nodeID from 2661965025dc to 266196502501 on MAC address 26:61:96:60:25:05" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
    Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.208+01:00" level="warn" msg="override nodeID from 266196502504 to 266196502505 on MAC address 26:61:96:60:25:04" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
    Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.209+01:00" level="warn" msg="override nodeID from 2661965025dc to 266196502501 on MAC address 26:61:96:60:25:05" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
    Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.211+01:00" level="warn" msg="override nodeID from 266196501003 to 2661965010dc on MAC address 26:61:96:60:10:dc" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
    Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.216+01:00" level="warn" msg="override nodeID from 266196502504 to 266196502505 on MAC address 26:61:96:60:25:04" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
    Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.217+01:00" level="warn" msg="override nodeID from 2661965025dc to 266196502501 on MAC address 26:61:96:60:25:05" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"
    Mar 20 17:13:49 pegol yanic[9430]: time="2020-03-20T17:13:49.218+01:00" level="warn" msg="override nodeID from 266196501003 to 2661965010dc on MAC address 26:61:96:60:10:dc" caller="nodes.go:207 github.com/FreifunkBremen/yanic/runtime.(*Nodes).readIfaces"

I switched to my older mesh-announce fork from ffda (multicast on ff02:....) and my problems are gone.
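
For context: the warning apparently means that the same MAC address is announced under two different node IDs, so yanic keeps flipping its MAC-to-nodeID mapping back and forth. A minimal Python sketch of such a consistency check (purely illustrative; yanic's actual readIfaces is Go code, and the nodeIDs/MAC below are taken from the log above):

    mac_to_node = {}

    def read_ifaces(node_id, macs):
        """Register the interface MACs announced under a given nodeID."""
        for mac in macs:
            known = mac_to_node.get(mac)
            if known is not None and known != node_id:
                print(f"warn: override nodeID from {known} to {node_id} "
                      f"on MAC address {mac}")
            mac_to_node[mac] = node_id

    # Two announcements claiming the same MAC under different nodeIDs,
    # as in the log lines above, trigger the override warning:
    read_ifaces("2661965025dc", ["26:61:96:60:25:05"])
    read_ifaces("266196502501", ["26:61:96:60:25:05"])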

@AiyionPrime
Contributor

Your numbers two and three should be resolved by the merge of #58.
Can you confirm that, @tackin?
About the load peaks I cannot say anything yet.


tackin commented Apr 4, 2020

I need to install/test it again for 2 and 3, once 1 is fixed.
No. 3 is a YANIC thing (may be solved).
For no. 2, it is not clear to me whether it is a YANIC or a mesh-announce bug.


tackin commented Apr 4, 2020

@AiyionPrime
Tested:
No. 3 seems to be solved.
No. 2 is not solved.


tackin commented Apr 4, 2020

@AiyionPrime
Your PR #58 solves problem no. 2 for me.

@AiyionPrime
Contributor

The load peaks appear in Hannover as well, but they seem to correlate with fastd's CPU usage (likely the context switches) rather than with mesh-announce. How did you reproduce the finding that mesh-announce is the culprit?


tackin commented Apr 5, 2020

By simply disabling the service and seeing what happened.

See the picture above. The load peaks stopped when I stopped the service on Rustig and Erai.
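
In case anyone wants to reproduce this on their own gateways, here is a rough sketch of the load sampler I would use (plain /proc/loadavg polling; the 10-second interval is arbitrary): log the 1-minute load with timestamps, stop and start the mesh-announce service in between, and compare the curves.

    import time

    # Sample the 1-minute load average with a timestamp every few seconds.
    # Stop/start the mesh-announce service in between and compare the curves.
    def sample_load(interval=10, path="/proc/loadavg"):
        while True:
            with open(path) as f:
                load1 = f.read().split()[0]
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} load1={load1}", flush=True)
            time.sleep(interval)

    if __name__ == "__main__":
        sample_load()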


AiyionPrime commented Apr 5, 2020

Thanks, I will try to reproduce it tonight.

@AiyionPrime
Contributor

First things first: Hannover has the same issue on all four supernodes.
The peaks are always about one hour and 45 minutes apart from each other (averaged over the last day).

One thing to note: they don't peak or start to spike at the same time.
We watched the load throughout the day and could not find anything but fastd and occasionally mesh-announce in the top 10 of htop.

At 20:30 we stopped the mesh-announce service; the resulting graph is this one.

As you can see, this does drastically reduce the load, but doesn't prevent the spike altogether.
It appears that mesh-announce is responsible for part of the load, but not for the triggering event itself.
Therefore I can confirm the bug; a workaround that reduces the load might be to use the multi-domain feature in a single instance.

Trier was likely hit harder by mesh-announce, as they had more instances running.
I'll try that tomorrow; for now our supernodes are being tested closely in order to rule out causes like our monitoring, our Zabbix, or whatever else.

Looking over to Trier, the load appears to peak at the same frequency:
https://draco.freifunk-trier.starletp9.de:3000/d/Gb1_MoJik/freifunk-trier-uberblick?orgId=1

It's quite possible that I can't see the forest for the trees, but I can't figure out what triggers every 105 minutes, independent of when a system booted.
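
To cross-check the ~105-minute spacing without staring at graphs, something like this could be run over a log of (timestamp, load) samples, e.g. the output of the sampler sketched earlier in this thread; the 2.0 threshold is an assumption and needs adapting to the actual baseline load:

    from datetime import datetime

    def parse_log(path):
        """Parse lines like '2020-04-05 20:30:00 load1=2.31'."""
        samples = []
        with open(path) as f:
            for line in f:
                date, clock, load = line.split()
                samples.append((datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S"),
                                float(load.split("=")[1])))
        return samples

    def peak_intervals(samples, threshold=2.0):
        """Group contiguous samples above the threshold into peaks and return
        the gaps in minutes between the starts of consecutive peaks."""
        starts, in_peak = [], False
        for t, v in samples:
            if v >= threshold and not in_peak:
                starts.append(t)
            in_peak = v >= threshold
        return [(b - a).total_seconds() / 60.0 for a, b in zip(starts, starts[1:])]

    # "load.log" = wherever the sampler's output was redirected to.
    print(peak_intervals(parse_log("load.log")))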


AiyionPrime commented Apr 5, 2020

@moridius just stopped fastd on a supernode; that drastically reduces the spike as well, but not completely, if mesh-announce is left running.
@tackin have you already taken dumps of the traffic for two or three period lengths?


tackin commented Apr 6, 2020

@tackin have you already taken dumps of the traffic for two or three period lengths?

No, sorry, I have no idea where/what to look for in a dump.
For us, stopping fastd would also drop all tunnels and traffic. That would not make much sense for testing, I guess.

@AiyionPrime
Contributor

Well, then.
Yesterday at 20:30 I shut down the first mesh-announce instance, on supernode 09, which reduced its load in the peak window drastically, as seen in the last graph.
This did not change in the last ~16 hours.

Today at 13:00 I shut down the other mesh-announce instances as well.
They all showed the same result: a drastic reduction of their load in the peak window.

The second shutdown did not affect the load peak on sn09 at all.
My conclusion stands: mesh-announce is responsible for (part of) the load peak, but not for the event triggering it.

Here is the current graph; sn[01,08,09,10] are currently all of our supernodes running mesh-announce. The red dot marks 13:05, when my shutdown of the remaining three instances took effect.

We'll start tcpdumps later this afternoon. I'm now firing up mesh-announce again.
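
What we'll look for first is whether the rate of incoming respondd queries lines up with the load peaks. Besides raw tcpdump, a rough per-minute counter could look like this; the multicast group ff02::2:1001, UDP port 1001 and the interface name bat0 are assumptions (the usual gluon respondd defaults) and need adjusting to the actual setup:

    import socket
    import struct
    import time

    GROUP = "ff02::2:1001"   # assumption: usual gluon respondd multicast group
    PORT = 1001              # assumption: usual respondd UDP port
    IFACE = "bat0"           # assumption: mesh interface name

    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    # Sharing the port with a running responder only works if that responder
    # also set SO_REUSEADDR; otherwise run this while mesh-announce is stopped,
    # or fall back to tcpdump on udp port 1001.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    ifindex = socket.if_nametoindex(IFACE)
    mreq = socket.inet_pton(socket.AF_INET6, GROUP) + struct.pack("@I", ifindex)
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_JOIN_GROUP, mreq)

    count, start = 0, time.time()
    while True:
        sock.recvfrom(65535)
        count += 1
        if time.time() - start >= 60:
            print(f"{time.strftime('%H:%M:%S')} respondd packets/min: {count}", flush=True)
            count, start = 0, time.time()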

@AiyionPrime
Contributor

I had my non-findings about the triggering event and the resulting load peer-reviewed yesterday.
It is unlikely that tcpdumps will help at this point. I will determine whether Darmstadt's fork has the issue as well. If not, I'll go back to our fork, confirm it wasn't an issue back then either, and finally bisect to find where things went south. Will do this after lunch.

@TobleMiner
Member

Does this issue still exist? There have been major changes in mesh-announce and thus additional confirmation on this issue is required. This issue will be closed in a month if there is no further activity.


tackin commented Mar 22, 2021

@TobleMiner Sorry, I didn't find the time to test it yet. It's not a big issue/problem for us at the moment, so I feel no pressure. ;-) I'll come back to it a.s.a.p.
