Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iptables-based portmapping plugin #1

Merged
merged 4 commits into from
Jun 1, 2017

Conversation

squeed
Copy link
Member

@squeed squeed commented Mar 13, 2017

This is a fairly standard iptables-based portmapping plugin. It supports ipv4 and ipv6.

This replaces the proposed plugin in the cni repository.

Copy link

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we do in kubernetes that this is not doing: We open and bind() the actual port on the host. This means that anyone who tries to use that port (via hostNetwork, a simple process, or whatever) will get EBUSY instead of actually binding but not working.

To do this, we'd need a long-running process to collect the ports or else a small long-running process per port-forward. It's a trivial program, though the overhead of Go might be bad for this. Easy to do in C...

Or we can decide this is not worth doing. It's sort of a hassle...

8080 and 8043 on the host forwarded to ports 80 and 443 in the container, the
rules look like this:

`PREROUTING`, `OUTPUT` chains:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we discovered in Kubernetes - this expression is looser than we really want. This will match any address in the local routing table, whereas I think we only want addresses that are actually bound to interfaces. The delta here is things like load-balancer VIPs on GCE.

The solutions that is pending is to loop over all interfaces and expand them into a set of exact-IP matches. It is sort of awful because, at least in theory, this can change, so it needs to be refreshed periodically. I don't know how else to express this condition, but I am open to ideas.

An alternative would be to document this as "must be the last rule in the chain" but that feels very brittle in the face of collaboration users of iptables.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I'd always thought the local routing table was essentially read-only, but a quick experiment shows it's not (neat!).

Question: are you just concerned about performance, or is it expected that traffic from these interfaces are not forwarded? In a (non-k8s?) case, is it desirable that a GCE vip would be forwarded to a container? A quick glance at the documentation makes this at least seem plausible.

A few thoughts on possible solutions:

  • Have a daemon that listens to netlink messages and updates a linux ipset with the list of local ifs
  • Add a configuration option to the plugin to add conditions to the second-level entry rule. Then the administrator could exclude either specific CIDRs, devgroups, or ipsets
  • Write the port reservation daemon, then match on socket (this is medium sketchy)

Either way, I'm thinking about adding a portset (via ipset(8)) to the top-level entry rule for improved performance.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of progressing this PR, why don't we start by implementing this the way it's written today (any addresses in the local routing table) and then follow up with a fix using ipsets?

It wouldn't be a regression in behavior since that's how k8s behaves today, and so long as we have a plan for how to fix in a second release then I think I'm OK with it.

An alternative would be to document this as "must be the last rule in the chain" but that feels very brittle in the face of collaboration users of iptables

Yeah it does, but maybe it's fine as a first step?

chain.

`POSTROUTING`:
- `-s 127.0.0.1 ! -d 127.0.0.1 -j CNI-HOSTPORT-SNAT`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be 127.0.0.0/8 for both?

- `-j CNI-SN-xxxxx`

`CNI-SN-xxxxx`:
- `-p tcp -s 127.0.0.1 -d 172.16.30.2 --dport 80 -j MASQUERADE`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the -s at all? It was a pre-condition to end up here in the first place. Do we even need the destination and the port? Is there any case where the initial condition (-s localhost -d not-localhost) doesn't need masquerade?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely don't need the -s. I added the -d and --dport so that the minimal possible set of traffic from 127.0.0.1 is allowed out. Otherwise this is a nasty potential firewall bypass owing to the fact that route_localnet is enabled.

@squeed
Copy link
Member Author

squeed commented Mar 15, 2017

We open and bind() the actual port on the host.

I thought about doing this, with a helper daemon à la DHCP, but decided against the trouble. If you think it's a worthy idea, it could still be done. It can also be added later.

@thockin
Copy link

thockin commented Mar 15, 2017 via email

@thockin
Copy link

thockin commented Mar 17, 2017 via email

@thockin
Copy link

thockin commented Mar 17, 2017 via email

@squeed
Copy link
Member Author

squeed commented Mar 17, 2017

I'm not 100% certain about the firewall bypass, but my though is that, since we've enabled route_localnet, if all traffic from 127.0.0.1 to the 'net gets MASQ'd, this happens after the OUTPUT chain on the filter table. Any policies that depend on source ip or if could be skipped.
Yeah, it's a bit contrived, but not too weird, as far as these things go.

In any case, I'll be on vacation, offline for the coming week. This PR is editable by the general team - all are free to make changes.

@squeed
Copy link
Member Author

squeed commented Apr 12, 2017

Yeah, I was partially right about the firewall bypass - it's slightly contrived, but IMHO worth preventing. If you have a FILTER rule that specifies the source IP, then it is bypassed.

e.g.

iptables -t filter -A OUTPUT -p tcp -s 192.168.121.184 --dport 5050 -j DROP

Then if you have

iptables -t nat -A POSTROUTING -s 127.0.0.1 ! -d 127.0.0.1 -j MASQUERADE

This command works, though it shouldn't:

nc -s 127.0.0.1 10.7.18.117 5050

Admittedly it's a bit silly to have source IPs in the filter.OUTPUT chain, but it's plausible. Given that we also intend to use this plugin in rkt, which is a bit more general-purpose than k8s, I'm going to keep the more specific MASQUERADE entry conditions.

},
{
"type": "portmap",
"capabilities": {"portMappings": true}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent?

@feiskyer
Copy link
Contributor

feiskyer commented May 2, 2017

Any updates on this?

@squeed
Copy link
Member Author

squeed commented May 2, 2017

Sorry, this fell off of my plate. I should be done with this in the next few days.

@caseydavenport
Copy link

We open and bind() the actual port on the host.

It does sound like a helper daemon might be desired for this, and the dst type LOCAL concerns as well.

This feels like a minor enough window to me that I'd be happy to pass on it for now and add later if it becomes a problem, or once we add a helper daemon for some other reason.

You should use this plugin as part of a network configuration list. It accepts
the following configuration options:

* `noSnat` - boolean, defaul false. If true, do not set up the SNAT chains

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: defaul

8080 and 8043 on the host forwarded to ports 80 and 443 in the container, the
rules look like this:

`PREROUTING`, `OUTPUT` chains:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of progressing this PR, why don't we start by implementing this the way it's written today (any addresses in the local routing table) and then follow up with a fix using ipsets?

It wouldn't be a regression in behavior since that's how k8s behaves today, and so long as we have a plan for how to fix in a second release then I think I'm OK with it.

An alternative would be to document this as "must be the last rule in the chain" but that feels very brittle in the face of collaboration users of iptables

Yeah it does, but maybe it's fine as a first step?

@thockin
Copy link

thockin commented May 2, 2017 via email

@squeed
Copy link
Member Author

squeed commented May 3, 2017

Ok, I pushed an update to the plugin. It now takes a conditions argument (actually, one per family) so that the administrator can arbitrarily add conditions to the forwarding entry rules. While this doesn't directly solve the GKE / LOCAL problem, it does mean that fixing it is not a code change.

At this point in time, I feel it's ready for a release. Adding a port-reservation daemon would be a welcome addition, but that can be an additional feature.

@thockin
Copy link

thockin commented May 5, 2017

I'm not sure I see how the conditions fixes the local problem, but I am OK to get this in and iterate..

@feiskyer
Copy link
Contributor

feiskyer commented May 5, 2017

+1 for get this in first.

@squeed
Copy link
Member Author

squeed commented May 5, 2017 via email

@thockin
Copy link

thockin commented May 5, 2017 via email

@squeed
Copy link
Member Author

squeed commented May 5, 2017

I'm sure both of us don't quite have the whole picture :-).

With a single iptables rule you can exclude multiple ranges. You just can't include multiple ranges directly. In other words, iptables matches are always AND, never OR. However, you can match on multiple ranges using ipsets, which adds a bit more complexity but isn't too bad.

In any case, any administrator-provided conditions would be in addition to the --addrtype LOCAL.

Could you provide a more concrete example of how the GCE vips appear on the host? It's a bit tricky to reason about otherwise. I was assuming the machine had an eth0 of, say, 10.42.1.2 and an additional local routing table entry of, say, 8.8.8.8.

Does it make sense? My goal is to produce something simple and broadly useful, and this is a very good motivating example. I'm happy to try and get it right.

@thockin
Copy link

thockin commented May 5, 2017 via email

@squeed
Copy link
Member Author

squeed commented May 5, 2017

Argh, you're totally right about multiple -d. I don't know what the heck I was thinking.

}

func main() {
skel.PluginMain(cmdAdd, cmdDel, version.PluginSupports("", "0.1.0", "0.2.0", version.Current()))
Copy link
Contributor

@gunjan5 gunjan5 May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why the supported versions are specified this way? This will skip 0.3.0 since version.Current() returns 0.3.1, and your sample config in the README.md uses 0.3.0 https://github.com/containernetworking/plugins/pull/1/files#diff-f6ee9b32ca4cfb6c2075e1ee55dadd07R22

 CNI_COMMAND=VERSION ./portmap < conf
{"cniVersion":"0.3.1","supportedVersions":["","0.1.0","0.2.0","0.3.1"]}

I see other CNI plugins just use version.All .

The plugin expects to receive the actual list of port mappings via the
`portMappings` [capability argument](https://github.com/containernetworking/cni/blob/master/CONVENTIONS.md)

So a sample config might look like:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about adding a note or a line here mentioning that the config should have a .conflist extension?

@squeed
Copy link
Member Author

squeed commented May 29, 2017

@caseydavenport added some ip(6)tables error-ignoring logic, PTAL.

Copy link

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@squeed new ipv6 logic seems sound to me. I'll try it out when I'm in the office tomorrow.

ip4t := maybeGetIptables(false)
ip6t := maybeGetIptables(true)
if ip4t == nil && ip6t == nil {
return fmt.Errorf("neither iptables not ip6tables usable")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: not

@squeed squeed force-pushed the portmap-plugin branch 2 times, most recently from b7d3c0e to 9afd469 Compare May 30, 2017 13:09
return nil
}

_, err = ipt.List("filter", "OUTPUT")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it turns out this check isn't great because it checks the filter table, whereas the portmap plugin uses the nat table, and it's possible to have the one but not the other :D

# ip6tables -L -t nat
ip6tables v1.6.0: can't initialize ip6tables table `nat': Table does not exist (do you need to insmod?)
Perhaps ip6tables or your kernel needs to be upgraded
# ip6tables -L -t filter
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all      anywhere             anywhere
ACCEPT     tcp      anywhere             anywhere             tcp dpt:22

Chain FORWARD (policy DROP)
target     prot opt source               destination

Chain OUTPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all      anywhere             anywhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. Looks like v6 nat support was added in kernel 3.7.

Fixed.

@bboreham
Copy link
Contributor

Should we consider adding iptables ACCEPT rules for the case where the machine has default deny? kubernetes/kubernetes#35069

@squeed
Copy link
Member Author

squeed commented May 31, 2017

one last TODO: need to make the teardown code not use indexes, or else we risk stepping on our own toes.

@squeed
Copy link
Member Author

squeed commented May 31, 2017

And, updated - delete now deletes by rule directly.

@caseydavenport
Copy link

@squeed LGTM!

@bboreham
Copy link
Contributor

bboreham commented Jun 1, 2017

LGTM

@bboreham bboreham merged commit 0997c53 into containernetworking:master Jun 1, 2017
orangedeng pushed a commit to orangedeng/plugins that referenced this pull request Mar 5, 2018
Support delegating to Windows CNI - use a separate WindowsDelegate field as the format of Windows properties is different from the schema supported by Linux under the delegate field. This will users to configure both Linux and Windows together and also to use Windows specific configuration for Windows clusters.
Lion-Wei pushed a commit to Lion-Wei/plugins that referenced this pull request Apr 10, 2018
lsm5 pushed a commit to lsm5/containernetworking-plugins that referenced this pull request Oct 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants