Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plugins/meta/bandwidth: traffic shaping plugin #96

Merged
merged 8 commits into from
Mar 14, 2018
Merged

plugins/meta/bandwidth: traffic shaping plugin #96

merged 8 commits into from
Mar 14, 2018

Conversation

DennisDenuto
Copy link
Contributor

As mentioned in #94

An attempt to port github.com/cloudfoundry/cf-networking-release traffic control tbf (ingress/egress) logic into a chained plugin

build.sh Outdated
@@ -19,7 +19,7 @@ export GOPATH=${PWD}/gopath
mkdir -p "${PWD}/bin"

echo "Building plugins"
PLUGINS="plugins/meta/* plugins/main/* plugins/ipam/* plugins/sample"
PLUGINS="plugins/meta/* plugins/main/* plugins/ipam/* plugins/sample plugins/tc/*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. you'll need to add it to plugins/linux_only.txt also to fix the AppVeyor (Windows CI) build.

Copy link
Contributor

@rosenhouse rosenhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I left a few small suggestions.

Also, I was hoping it would be possible to pull in some of the integration test coverage that we developed for the Silk CNI traffic-shaping, i.e. the tests that actually measure and assert on bandwidth for ingress and egress. Perhaps by chaining this plugin after ptp or bridge and asserting on the resulting state?

// limitations under the License.

// This is a sample chained plugin that supports multiple CNI versions. It
// parses prevResult according to the cniVersion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either remove this comment or make it more descriptive


// PluginConf is whatever you expect your configuration json to be. This is whatever
// is passed in on stdin. Your plugin may wish to expose its functionality via
// runtime args, see CONVENTIONS.md in the CNI spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look like they were copied from the sample. Not sure if they are are helpful here.

}
err = nonContainerNetNs.Do(func(netNS ns.NetNS) error {
nonContainerIfaceName := getRoutableHostIF(cip.Address.IP)
if nonContainerIfaceName != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you can assume that this plugin is being called from the "host" namespace. In other words, the current NS when the plugin starts is the place to put host devices. That should remove the need for these nested loop and conditionals.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right - plugins are always called from the host.

}

if conf.Rate == 0 && conf.Burst == 0 {
return types.PrintResult(conf.PrevResult, conf.CNIVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment here, e.g. no traffic shaping was requested, so just no-op and quit

Copy link
Contributor

@rosenhouse rosenhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so a second pass showed some more serious issues that will require changes. let me know if you have questions about this.

err = nonContainerNetNs.Do(func(netNS ns.NetNS) error {
nonContainerIfaceName := getRoutableHostIF(cip.Address.IP)
if nonContainerIfaceName != "" {
ifbDeviceName := os.Getenv("CNI_IFNAME")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, also, this is not correct. According to the CNI spec, CNI_IFNAME is the desired name for the primary network interface inside the container, i.e. eth0. It is guaranteed unique only inside the container network namespace, not the host. You'll need to generate a unique name for the host-side interface -- perhaps by hashing the (network name, container id) pair (see here for some considerations), using a random name (see here), or by using an auto-incrementing integer on the system. You may need to store this name on the local filesystem in order to properly clean it up when DEL is called.

Copy link
Contributor Author

@DennisDenuto DennisDenuto Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of generating a deterministic unique name for the host side interface. I have pushed a change to compute the sha1 of the (network name, container id) and only using the first 15 characters of the hex digest.
Thus on cmdDel i am able to determine the device name to delete.

On the off chance that the sha1 conflicts with an interface that already exists, this plugin will fail... If we think this scenario is likely enough to cause issues for people, I can change to the random name approach + stored state. (i.e. if the random name conflicts, generate a new random name until no conflict)

}
_ = conf

for _, prevIface := range conf.PrevResult.Interfaces {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'm afraid PrevResult will not be populated on a DEL call. The runtime will execute the plugins in reverse order. (also, to top it off, most plugins don't return any data on a successful DEL).

So to know the interface name to teardown, you may need some other tricks. One way is that ADD may store some state (i.e. the interface name) on the filesystem, and DEL reads it back out. See the other comment about the IFB device name.

Copy link
Contributor Author

@DennisDenuto DennisDenuto Nov 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dealt with this by generating a deterministic unique name (sha1 of network name and container id).

@DennisDenuto
Copy link
Contributor Author

I like the idea of adding a measure test and asserting ingress/egress shaping. I can chain it with some plugins too. I can work on that next.

Copy link
Contributor

@rosenhouse rosenhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress, left a few more comments but I think it's really coming together.

Two other high-level things:

  • consider renaming: maybe something like plugins/meta/bandwidth?
  • please a README! explain how this works, link to relevant docs.

return "", err
}

return fmt.Sprintf("%x", hash.Sum(nil))[:15], nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok, but I'd like other @containernetworking/cni-maintainers to weigh in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer a short prefix (<= 4 characters). That should still leave enough entropy (>=2^44) even just using the basic hex encoding.

}

if len(containerIPConfigs) == 0 {
return fmt.Errorf("got no container IPs")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why bother with this check? I don't see containerIPConfigs used later in this function.


for _, prevIface := range conf.PrevResult.Interfaces {
if prevIface.Sandbox != "" {
continue
Copy link
Contributor

@rosenhouse rosenhouse Dec 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. I wonder about readability of this code though.
One option is to extract a private function with a signature like

func getHostInterface(ifaces []*current.Interface)` (*current.Interface, error) { ... }

You could keep the Sandbox emptiness check in there, and do stuff like have it error if no matching interfaces are found.

That would make it more clear what this code is doing. It would also have the side benefit of removing the for-loop indentation from the main code-path here, so your top-level function would read

hostInterfaceface, err := getHostInterface(conf.PrevResult.Interfaces)
if err != nil {
   return err
}
err = CreateIngressQdisc(conf.Rate, conf.Burst, hostInterface.Name)
...

if err != nil {
return err
}
_ = conf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy-pasta from the plugin template example :-(

})
})

Expect(runtimeWithLimit).To(BeNumerically(">", runtimeWithoutLimit+1000*time.Millisecond))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. I love this.

],
"routes": []
}
}`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I see you've got a more integration-type test with the PTP plugin below. Is there a reason you want to test with a literal result JSON as well? (Not opposed, just curious.)

})).To(Succeed())
}

func startTcpServerInNS(containerNS ns.NetNS, laddr string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a glance, this sounds a lot like the echosvr we use in the portmap integration tests.

Perhaps there's an opportunity for code-reuse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for pointing this out. I like the idea of code-reuse.

I am finding using echosvr a bit tricky to use for this particular test. Mainly because the echosvr tries to read a fixed amount of bytes and then continues to write back what it read and close the connection. However in the measure case I want to ensure that the entire message is sent and read by the server. I did this previously by sending a message with a trailing new-line character as a signal to the server that the entire message had been sent, and that it could then write back a response so the client could know that the server received the entire message.

I have tried to get this to work by tweaking the num of packets sent to the 'echo server' and tbf's burst/limit sent to fit into the buffer size of 512. However this results in the time taken with/without tbf to be negligible (~100 millisecond difference)

I have tried to increase the echosvr buffer size to allow sending/reading a larger payload, but conn.Read seems like it reads as much as the read syscall is set to (which when I run is consistently 61232 bytes). If i send < 61232 to the echosvr it results in a broken pipe error. Haven't looked too much into it but I suspect that since ingress and egress traffic are both shaped, that when the echo svr tries to write back a large payload it gets shaped, and it closes the connection prematurely.

Your thoughts are much appreciated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy with replacing the echoserver with this code. It really should be in another process, if only because Go namespace crappiness has caused us a lot of grief and flaky tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks for the reply. will do

@DennisDenuto
Copy link
Contributor Author

DennisDenuto commented Dec 12, 2017

Thanks for the comments.
fyi:

  • renamed to plugins/meta/bandwidth
  • use echosvr
  • ifb device name is 4 characters long
  • slight refactorings
  • added a readme.md

"burst": 456
}
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to have a more fleshed out example of chaining, e.g. with bridge first and then this.


This plugin configures a token bucket filter (tbf) queing discipline (qdisc) on both ingress and egress traffic. Resulting in traffic being shaped when reading / writing.

Due to limitations on tc shaping rules for ingress, this plugin creates an Intermediate Functional Block device (ifb) to redirect packets from the target interface. tc tbf is then applied to the ifb device. The packets that were redirected to the ifb devices, are written OUT (and shaped) to the target interface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say "target" interface but it may be more clear to distinguish between host-side and container-side interfaces here.

Copy link
Contributor

@rosenhouse rosenhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with a couple tiny README nits.

Also please update the commit message with new location and a brief title, e.g.

plugins/meta/bandwidth: traffic shaping plugin

<more detail here>

@DennisDenuto
Copy link
Contributor Author

fyi

  • Added a chaining example in the readme
  • renamed 'target' -> 'host interface'
  • Updated the commit message

@rosenhouse rosenhouse changed the title plugins/tc/tbf: Add chained plugin to add a tbf qdisc to shape ingres… plugins/meta/bandwidth: traffic shaping plugin Dec 16, 2017
Copy link
Member

@matthewdupre matthewdupre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo


This plugin provides a way to use and configure Linux's Traffic control (tc) subystem. tc encompasses the sets of mechanisms and operations by which packets are queued for transmission/reception on a network interface.

This plugin configures a token bucket filter (tbf) queing discipline (qdisc) on both ingress and egress traffic. Resulting in traffic being shaped when reading / writing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queing -> queuing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. fixed typo.

Copy link
Member

@squeed squeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good. Just a few small things.

"name": "mynet",
"plugins": [
{
"name": "my-net",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "name" field here (and below) isn't needed.

```


1. The following is the json payload the container runtime would need to perform during the ADD action:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't need to explain plugin chaining in this README.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed plugin chaining examples. pushed.

PrevResult *current.Result `json:"-"`

// Add plugin-specifc flags here
Rate int `json:"rate"` //Bandwidth rate in Kbps for traffic through container. 0 for no limit. If rate is set, burst must also be set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add separate ingress and egress rates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. pushed.

switch {
case conf.Burst < 0 || conf.Rate < 0:
return nil, fmt.Errorf("rate and burst must be a positive integer")
case conf.Burst == 0 && conf.Rate != 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather infer when burst == 0 that burst = rate. But this is subjective.


conf.PrevResult.Interfaces = append(conf.PrevResult.Interfaces, &current.Interface{
Name: ifbDeviceName,
Sandbox: hostInterface.Sandbox,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sandbox should not be set, since the IFB device isn't in the sandbox / container netns.

@@ -0,0 +1,165 @@
package bandwidth

// Copyright 2015 CNI authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018 :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy new year :-)

@DennisDenuto
Copy link
Contributor Author

fyi @matthewdupre @squeed the requested changes have been pushed.

// removed (though you may wish to error if a non-chainable plugin is
// chained.
// If you need to modify the result before returning it, you will need
// to actually convert it to a concrete versioned struct.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should be removed - it is advice to the plugin author.
It could be replaced with a comment about how you resolved the decisions.

RawPrevResult *map[string]interface{} `json:"prevResult"`
PrevResult *current.Result `json:"-"`

// Add plugin-specifc flags here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should be removed - it is advice to the plugin author.

continue
}

return prevIface, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in the maintainers meeting someone pointed out that this will break for the bridge plugin, since bridge returns two non-sandbox interfaces: the bridge itself, and the container veth, in that order.

We're considering how best to resolve the issue: perhaps by adding a peerIndex field to the result structure so that the prior plugin can specify which interface is the container's peer veth.

Alternately, you may be able to detect whether a particular interface is a peer of the sandbox one, cc @dcbw and @bboreham for possible tricks here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rosenhouse implementing the "get peer index" library code for #75 brought up the following question: what do we consider a "peer"? Of course we know what that should be for veth devices. But what other device types does this plugin support?

eg, does it support ipvlan or macvlan? In that case, there is only a "parent" device that may have multiple "child" ipvlan/macvlan devices attached. The child devices all have the IFLA_LINK attribute (which is what veth sets to indicate its peer) set to the parent device that they all share.

We can obviously define a peerIndex field however we like in the CNI spec, but it would be a shame if that was veth-only or relied on some implied behavior of the LInux kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, lets do the hack to detect the peer on the host side, and then update the readme to call out that this only works with plugins that put a peer veth on the host.

. "github.com/onsi/gomega"
"github.com/onsi/gomega/gexec"
"github.com/vishvananda/netlink"
"net"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:

move "net" and "time" right below "fmt"?

@m1093782566
Copy link

m1093782566 commented Feb 12, 2018

Hi all,

Any chance to push it forward? I have been thinking about supporting traffic shaping for kubernetes cni and this work relies on it.

@rosenhouse
Copy link
Contributor

Yeah, since #75 is still not merged, I would suggest we copy over the peer-discovery function from there into here. I can probably get to that this week if @DennisDenuto doesn't beat me too it.

@DennisDenuto
Copy link
Contributor Author

apologies for the delay, i have been a bit preoccupied. @rosenhouse If you have the peer-discovery function, please feel free to add it. I won't be able to look into it this week. I would probably have time on the weekend.

@squeed squeed added this to the v0.8.0 milestone Feb 14, 2018
@rosenhouse
Copy link
Contributor

I've pushed a couple commits here to address the nits, and opened #120 so we can use that library function from this PR.

@dcbw
Copy link
Member

dcbw commented Feb 21, 2018

#120 has been merged so this can get rebased and updated to use GetVethPeerIfindex().

@m1093782566
Copy link

Friendly ping @DennisDenuto ~

@rosenhouse
Copy link
Contributor

rosenhouse commented Feb 24, 2018

Dennis, feel free to pick this up if you want, but if not, I can pair with @chinangela and team on it next week.

@Lion-Wei
Copy link

Looking forward for this to be merged. 👍

@DennisDenuto
Copy link
Contributor Author

@rosenhouse yeah that would be great if you could pair on it with the c2c team. Thanks

Add chained plugin to add a tbf qdisc to shape ingress/egress traffic
@caseydavenport
Copy link

Coming to this a bit late so apologies if it's already been discussed, but have folks considered using the capabilities mechanism + convention to allow runtime configuration of the plugin, similar to how the portmap plugin works today?

@squeed
Copy link
Member

squeed commented Mar 7, 2018

@DennisDenuto / @rosenhouse is this ready to go?

@m1093782566
Copy link

I can be the volunteer to implement what @caseydavenport want in followup when this PR gets in, if you are happy with that :)

@DennisDenuto
Copy link
Contributor Author

@m1093782566 Cool. I personally think submitting that as a separate PR makes sense.

@m1093782566
Copy link

@DennisDenuto

Agree! Sorry for keeping making noise, but please keep in mind that kubelet cni traffic shaping relies on this feature :)

Thanks!

When chained with a plugin that returns multiple devices, the bandwidth
plugin chooses the host veth device.

Signed-off-by: Tyler Schultz <[email protected]>
Copy link
Contributor

@rosenhouse rosenhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good besides that tiny thing with err being nil.

}

return nil, errors.New("no host interface found")
return nil, errors.New("no host interface found: " + err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the loop never runs, then err won't be set, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, that would have nil pointed. Fixed.


It("supports add and del with bridge + bandwidth", func() {
basicAssertion("chained-bridge-bandwidth", "10.11.2.")
By(fmt.Sprintf("adding %s to %s\n\n", "basic-bridge", contNS2.ShortName()), func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I slightly prefer the By(...) syntax where there isn't another nested func(), e.g. https://onsi.github.io/ginkgo/#documenting-complex-its-by

This makes it so you don't have to do weird things like lift variables outside.

But no big deal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We removed the nesting and localized some vars.

Format plugin code

Signed-off-by: Aidan Obley <[email protected]>
@containernetworking containernetworking deleted a comment from cni-bot Mar 12, 2018
@rosenhouse
Copy link
Contributor

@squeed I think this is ready to go. WDYT?
any way to validate that this meets the K8s-community's needs for kubelet?

@m1093782566
Copy link

m1093782566 commented Mar 13, 2018

It looks pretty good now and can meet the basic needs of kubelet.

}
},
{
"name": "slowdown",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: this name isn't needed.

Copy link
Member

@dcbw dcbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for Casey's nit, I think we should merge and fix with follow-ups.

@rosenhouse rosenhouse merged commit 1fb94a4 into containernetworking:master Mar 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants