Reduce number of processes spawned by Kilo #113

SerialVelocity opened this issue Feb 15, 2021 · 14 comments · Fixed by #116

@SerialVelocity
Contributor

Kilo currently spawns about 50 processes every 30s on my machine (this is a small cluster). That seems to take considerable CPU resources (it spawns more processes than anything else on my machine), especially since it needs to take the xtables lock, which causes large slowdowns due to conflicts with kube-router.

It seems like a few things should be done to speed this up and remove the overhead:

  • Allow configuring the resync period (specifically the iptables part). It seems like the reconfiguration of the iptables rules is a failsafe and could be either removed entirely or run less frequently.
  • Use ipsets instead of iptables rules for the list of IP ranges
  • Load the entire KILO-NAT chain/ipset and compare against that instead of checking each rule individually (see the sketch after this list)
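
A minimal sketch of what 3 might look like, assuming the rules are managed with github.com/coreos/go-iptables: read the whole chain with a single exec and only rewrite it when something has drifted. The concrete rules, the flush-and-rebuild strategy, and the plain string comparison are illustrative, not Kilo's actual code.

```go
package main

import (
	"log"
	"reflect"
	"strings"

	"github.com/coreos/go-iptables/iptables"
)

// desired is the full KILO-NAT chain in the same "-S" form that List returns.
// These concrete rules are placeholders, not Kilo's real rule set.
var desired = []string{
	"-N KILO-NAT",
	"-A KILO-NAT -d 10.4.0.0/16 -j RETURN",
	"-A KILO-NAT -j MASQUERADE",
}

// reconcileNAT reads the whole chain with one iptables exec and only
// rewrites it when the live rules have drifted from the desired rules.
func reconcileNAT(ipt *iptables.IPTables) error {
	current, err := ipt.List("nat", "KILO-NAT") // one exec instead of one per rule
	if err != nil {
		return err
	}
	if reflect.DeepEqual(current, desired) {
		return nil // chain already correct; no further iptables invocations
	}
	// Something drifted: flush and rebuild the chain.
	if err := ipt.ClearChain("nat", "KILO-NAT"); err != nil {
		return err
	}
	for _, r := range desired {
		if !strings.HasPrefix(r, "-A ") {
			continue
		}
		spec := strings.Fields(r)[2:] // drop the leading "-A KILO-NAT"
		if err := ipt.Append("nat", "KILO-NAT", spec...); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatal(err)
	}
	if err := reconcileNAT(ipt); err != nil {
		log.Fatal(err)
	}
}
```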
@squat
Owner

squat commented Feb 15, 2021

These are all great ideas for optimizations @SerialVelocity.
1 is certainly trivial to implement; however, it will continue to produce very spiky behavior, albeit less often. 3 would be the next easiest to implement, I suspect. Ideally, combining 3 with ipsets would be the best for performance. Maybe future work could even include migrating entirely to nftables and thus avoiding having to exec iptables at all, thanks to support for netlink updates.
xref kubernetes/kubernetes#45385 for related conversation

cc @leonnicolas what do you think about hacking together on 2 or 3?
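
For 2, the rough shape would be a single set holding all of the CIDRs and one iptables rule matching that set, so changing which ranges are excluded from NAT never needs to touch iptables (or the xtables lock) again. A sketch that shells out to the `ipset` binary; the set name, CIDRs, and rule below are placeholders, not a concrete design:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"

	"github.com/coreos/go-iptables/iptables"
)

// CIDRs that should bypass NAT; placeholders for illustration only.
var allowed = []string{"10.4.0.0/16", "10.5.0.0/16", "192.168.0.1/32"}

func run(args ...string) error {
	if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
		return fmt.Errorf("%v: %s", err, out)
	}
	return nil
}

func main() {
	// One set holds every CIDR; -exist makes the calls idempotent.
	if err := run("ipset", "create", "kilo-nat", "hash:net", "-exist"); err != nil {
		log.Fatal(err)
	}
	for _, c := range allowed {
		if err := run("ipset", "add", "kilo-nat", c, "-exist"); err != nil {
			log.Fatal(err)
		}
	}
	ipt, err := iptables.New()
	if err != nil {
		log.Fatal(err)
	}
	// Ensure the chain exists (ClearChain creates it if missing; it also
	// flushes it, which is fine for this sketch).
	if err := ipt.ClearChain("nat", "KILO-NAT"); err != nil {
		log.Fatal(err)
	}
	// A single iptables rule references the whole set, so updating the set of
	// CIDRs never requires another iptables invocation.
	if err := ipt.AppendUnique("nat", "KILO-NAT",
		"-m", "set", "--match-set", "kilo-nat", "dst", "-j", "RETURN"); err != nil {
		log.Fatal(err)
	}
}
```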

@SerialVelocity
Contributor Author

@squat There is another option, which is to just have blanket rules for the node subnet and the cluster subnet. That way you only have two rules?

@squat
Owner

squat commented Feb 17, 2021

There is another option, which is to just have blanket rules for the node subnet and the cluster subnet. That way you only have two rules?

This is tricky because we would also need to have rules for the private IPs of each node, which would not necessarily fall into a single CIDR and thus couldn't be captured by a single rule. It also wouldn't address the issue for the IPs of any additional VPN peers.

I prototyped 3 and found I got a ~75% reduction in CPU usage for a small 3-node cluster. I'll try to clean up the work and post a PR ASAP. @leonnicolas and I can hopefully investigate the ipset solution soon :)

@squat
Owner

squat commented Feb 17, 2021

btw, if you'd like to try out the test build I made, you can try running this image: squat/kilo:13a25891de81a4ae8a0079285799fce492c06ef4

@SerialVelocity
Contributor Author

This is tricky because we would also need to have rules for the private IPs of each node, which would not necessarily fall into a single CIDR and thus couldn't be captured by a single rule. It also wouldn't address the issue for the IPs of any additional VPN peers.

Oh, right. I have all of my private IPs forced to random IPs to work around #15 (as I'm not sure the PR fixes my issue).

btw, if you'd like to try out the test build I made, you can try running this image: squat/kilo:13a25891de81a4ae8a0079285799fce492c06ef4

Is there a branch for this? I can rebase my custom code on top to test it (since I use my automatic peer code from https://github.com/SerialVelocity/kilo/tree/kg-peer)

@squat
Owner

squat commented Feb 18, 2021

I'll try to post tonight 👍
Btw, is there any chance you could open a PR to merge your branch into upstream and let the kg agent set up non-node peers? I think it could be pretty cool to have it upstream 🎉

@SerialVelocity
Contributor Author

SerialVelocity commented Feb 18, 2021

I tried your build on my non-node peers, but the number of debug lines printed caused it to take up CPU as I have a slow HDD.

Unfortunately, I don't have the time to clean it up and submit it. It is really hacky right now.

@SerialVelocity
Contributor Author

Looks like 1 is partially done (30s is probably high enough?) and 3 is done.

Do we still want this open for 2?

@squat
Owner

squat commented Feb 27, 2021

I would like to avoid adding extra flags for every possible re-sync period. Maybe we can have one global flag that sets the re-sync periods for all of Kilo's control loops? And you're right, 2 should still get done.
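
For example, something like a single flag wired into every loop; the flag name and default below are only illustrative, not a committed interface:

```go
package main

import (
	"flag"
	"log"
	"time"
)

// One hypothetical flag governing the forced resync of every controller;
// the flag name and default here are only an example, not Kilo's actual CLI.
var resyncPeriod = flag.Duration("resync-period", 30*time.Second,
	"How often to force a full reconciliation in all of Kilo's control loops.")

func main() {
	flag.Parse()
	for _, name := range []string{"iptables", "routes", "kilo"} {
		name := name
		go func() {
			for range time.Tick(*resyncPeriod) {
				log.Printf("%s: forced resync", name) // reconcile() would go here
			}
		}()
	}
	select {} // block forever, like a daemon would
}
```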

@squat squat reopened this Feb 27, 2021
@SerialVelocity
Contributor Author

What are all the control loops that exist?

@squat
Owner

squat commented Feb 28, 2021

There are three separate controllers:

  • iptables
  • ip routes + rules
  • main Kilo controller, which in turn sets:
    • wireguard config
    • ip address

The next controller that is coming will be an ipset controller. This and the iproute controller don't really need to check periodically since they can use netlink to watch for updates.

Of all of these controllers, only the iptables and main Kilo controllers have an extra timer that forces a periodic reconciliation check in addition to reconciling whenever we make a change ourselves.
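
Concretely, each of those two controllers runs something like the loop below: reconcile immediately when an event comes in and additionally on a timer as a failsafe. The event channel and reconcile function are placeholders, not Kilo's actual types.

```go
package main

import (
	"log"
	"time"
)

// runController sketches the pattern described above: reconcile right away
// when an event arrives (e.g. a netlink update or a change we made ourselves)
// and also on a periodic timer in case an event was missed.
func runController(events <-chan struct{}, resync time.Duration, reconcile func() error) {
	t := time.NewTicker(resync)
	defer t.Stop()
	for {
		select {
		case <-events: // something changed; reconcile immediately
		case <-t.C: // periodic failsafe
		}
		if err := reconcile(); err != nil {
			log.Printf("reconciliation failed: %v", err)
		}
	}
}

func main() {
	events := make(chan struct{})
	go runController(events, 30*time.Second, func() error {
		log.Println("reconciling")
		return nil
	})
	events <- struct{}{} // simulate a change
	time.Sleep(time.Second)
}
```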

@SerialVelocity
Contributor Author

It might be worth adding a "--disable-reconciliation" flag instead that disables the loops and the netlink watches. When Kilo is used with something like kube-router, reconciliation causes thrashing, as kube-router already seems to recreate the ipsets/iptables constantly (I switched to Cilium a couple of weeks ago for that reason).

@squat
Owner

squat commented Mar 1, 2021

hmm interesting, I don't understand why we have thrashing. Kilo only recreates resources when its own rules change, not when any other process's resources change. Do you have any more insight into this? It would be really helpful to fix this correctly while still maintaining reconciliation.

@SerialVelocity
Contributor Author

It would be really helpful to fix this correctly while still maintaining reconciliation.

If everything is set up correctly, reconciliation does nothing, right? So having a way to disable the extra logic is probably a nice-to-have feature.

I don't understand why we have thrashing. Kilo only recreates resources when it's own rules change, not when any other process's resources change

Sorry, by thrashing I mean Kilo is doing extra unnecessary checks whenever there are changes to the resources, not that Kilo will be updating lots of things. The way kube-router seems to work is that it creates random suffixes for each iptables chain and ipset name. This means that whenever it updates its rules, there are a lot of changes to the ipsets/iptables. I'm not sure whether this is a bug or just how it was written, though. Since the iptables controller only runs every X seconds, it should be fine. I'm not sure how the netlink watches work, though. Can you watch just the Kilo ipsets, or do you watch everything and then reconcile? Same for IP routes: can you watch only the Kilo routes?
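
For context, my mental model of a route watch is something like the sketch below, assuming github.com/vishvananda/netlink: the kernel delivers every route change, so narrowing down to Kilo's own routes would have to happen in user space (the routing table ID here is a made-up placeholder).

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

// kiloTable is a hypothetical routing table ID owned by Kilo; the real value
// (and whether Kilo filters by table at all) is an assumption in this sketch.
const kiloTable = 40044

func main() {
	updates := make(chan netlink.RouteUpdate)
	done := make(chan struct{}) // close to stop the subscription

	// The kernel only lets us subscribe to all route changes; filtering to
	// the routes we care about has to happen here.
	if err := netlink.RouteSubscribe(updates, done); err != nil {
		log.Fatal(err)
	}
	for u := range updates {
		if u.Table != kiloTable {
			continue // not one of our routes; ignore
		}
		log.Printf("route in Kilo's table changed: %v; reconciling", u.Dst)
		// reconcileRoutes() would go here.
	}
}
```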

@squat squat added this to the 0.2.0 milestone Mar 5, 2021
@squat squat mentioned this issue Mar 12, 2021