-
Notifications
You must be signed in to change notification settings - Fork 773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
iptables-based portmapping plugin #1
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we do in kubernetes that this is not doing: We open and bind() the actual port on the host. This means that anyone who tries to use that port (via hostNetwork, a simple process, or whatever) will get EBUSY instead of actually binding but not working.
To do this, we'd need a long-running process to collect the ports or else a small long-running process per port-forward. It's a trivial program, though the overhead of Go might be bad for this. Easy to do in C...
Or we can decide this is not worth doing. It's sort of a hassle...
plugins/portmap/README.md
Outdated
8080 and 8043 on the host forwarded to ports 80 and 443 in the container, the | ||
rules look like this: | ||
|
||
`PREROUTING`, `OUTPUT` chains: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something we discovered in Kubernetes - this expression is looser than we really want. This will match any address in the local
routing table, whereas I think we only want addresses that are actually bound to interfaces. The delta here is things like load-balancer VIPs on GCE.
The solutions that is pending is to loop over all interfaces and expand them into a set of exact-IP matches. It is sort of awful because, at least in theory, this can change, so it needs to be refreshed periodically. I don't know how else to express this condition, but I am open to ideas.
An alternative would be to document this as "must be the last rule in the chain" but that feels very brittle in the face of collaboration users of iptables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I'd always thought the local
routing table was essentially read-only, but a quick experiment shows it's not (neat!).
Question: are you just concerned about performance, or is it expected that traffic from these interfaces are not forwarded? In a (non-k8s?) case, is it desirable that a GCE vip would be forwarded to a container? A quick glance at the documentation makes this at least seem plausible.
A few thoughts on possible solutions:
- Have a daemon that listens to netlink messages and updates a linux
ipset
with the list of local ifs - Add a configuration option to the plugin to add conditions to the second-level entry rule. Then the administrator could exclude either specific CIDRs, devgroups, or ipsets
- Write the port reservation daemon, then match on
socket
(this is medium sketchy)
Either way, I'm thinking about adding a portset (via ipset(8)
) to the top-level entry rule for improved performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of progressing this PR, why don't we start by implementing this the way it's written today (any addresses in the local routing table) and then follow up with a fix using ipsets?
It wouldn't be a regression in behavior since that's how k8s behaves today, and so long as we have a plan for how to fix in a second release then I think I'm OK with it.
An alternative would be to document this as "must be the last rule in the chain" but that feels very brittle in the face of collaboration users of iptables
Yeah it does, but maybe it's fine as a first step?
plugins/portmap/README.md
Outdated
chain. | ||
|
||
`POSTROUTING`: | ||
- `-s 127.0.0.1 ! -d 127.0.0.1 -j CNI-HOSTPORT-SNAT` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be 127.0.0.0/8 for both?
plugins/portmap/README.md
Outdated
- `-j CNI-SN-xxxxx` | ||
|
||
`CNI-SN-xxxxx`: | ||
- `-p tcp -s 127.0.0.1 -d 172.16.30.2 --dport 80 -j MASQUERADE` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the -s at all? It was a pre-condition to end up here in the first place. Do we even need the destination and the port? Is there any case where the initial condition (-s localhost -d not-localhost) doesn't need masquerade?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely don't need the -s
. I added the -d
and --dport
so that the minimal possible set of traffic from 127.0.0.1 is allowed out. Otherwise this is a nasty potential firewall bypass owing to the fact that route_localnet
is enabled.
I thought about doing this, with a helper daemon à la DHCP, but decided against the trouble. If you think it's a worthy idea, it could still be done. It can also be added later. |
I think it is closing a real gap where users can get confused, but it is a
relatively slim gap. Obviously, if it is easy, we should do it. How easy
is easy enough?
Also, yes, it can be additive
…On Wed, Mar 15, 2017 at 9:35 AM, Casey Callendrello < ***@***.***> wrote:
We open and bind() the actual port on the host.
I thought about doing this, with a helper daemon à la DHCP, but decided
against the trouble. If you think it's a worthy idea, it could still be
done. It can also be added later.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFVgVF8Gwszv20kgT-MicMCgZs8Qjpsiks5rmBNRgaJpZM4MbqAA>
.
|
On Wed, Mar 15, 2017 at 10:48 AM, Casey Callendrello ***@***.***> wrote:
@squeed commented on this pull request.
________________________________
In plugins/portmap/README.md:
> +## Rule structure
+The plugin sets up two sequences of chains and rules - one "primary" DNAT
+sequence to rewrite the destination, and one additional SNAT sequence that
+rewrites the source address for packets from localhost. The sequence is somewhat
+complex to minimize the number of rules non-forwarded packets must traverse.
+
+
+### DNAT
+The DNAT rule rewrites the destination port and address of new connections.
+There is a top-level chain, `CNI-HOSTPORT-DNAT` which is always created and
+never deleted. Each plugin execution creates an additional chain for ease
+of cleanup. So, if a single container exists on IP 172.16.30.2 with ports
+8080 and 8043 on the host forwarded to ports 80 and 443 in the container, the
+rules look like this:
+
+`PREROUTING`, `OUTPUT` chains:
Interesting. I'd always thought the local routing table was essentially read-only, but a quick experiment shows it's not (neat!).
Yeah, I learned this through GCE's tricks.
Question: are you just concerned about performance, or is it expected that traffic from these interfaces are not forwarded? In a (non-k8s?) case, is it desirable that a GCE vip would be forwarded to a container? A quick glance at the documentation makes this at least seem plausible.
GCE by default assigns VIPs as local addresses. Kubernetes on GCE
instead traps those VIPs and redirects to containers. Overall this
works really well, but now we have 2 different entities - CNI plugin
and kube-proxy - writing iptables rules that potentially overlap. We
can either try to ensure ordering of rules from most-specific to least
specific, or we can refine our definition of what we trap so that
ordering is not a factor. I obviously prefer the latter, but it's not
perfect, either.
A few thoughts on possible solutions:
Have a daemon that listens to netlink messages and updates a linux ipset with the list of local ifs
Add a configuration option to the plugin to add conditions to the second-level entry rule. Then the administrator could exclude either specific CIDRs, devgroups, or ipsets
Write the port reservation daemon, then match on socket (this is medium sketchy)
Either way, I'm thinking about adding a portset (via ipset(8)) to the top-level entry rule for improved performance.
I would not focus on performance yet. Let's figure out what we think
is the correct semantic.
|
I don't understand the firewall bypass? If this was (implicitly, before
jumping to this rule) `-s 127.0.0.0/8` and explicitly `! -d 127.0.0.0/8 -J
MASQUERADE`, what am I missing? One rule to catch it all..
…On Wed, Mar 15, 2017 at 10:56 AM, Casey Callendrello < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In plugins/portmap/README.md
<#1 (comment)>
:
> +### SNAT
+The SNAT rule rewrites (masquerades) the source address for connections from
+localhost. If this rule did not exist, a connection to `localhost:80` would
+still have a source IP of 127.0.0.1 when received by the container, so no
+packets would respond. Again, it is a sequence of 3 chains. Because SNAT has to
+occur in the `POSTROUTING` chain, the packet has already been through the DNAT
+chain.
+
+`POSTROUTING`:
+- `-s 127.0.0.1 ! -d 127.0.0.1 -j CNI-HOSTPORT-SNAT`
+
+`CNI-HOSTPORT-SNAT`:
+- `-j CNI-SN-xxxxx`
+
+`CNI-SN-xxxxx`:
+- `-p tcp -s 127.0.0.1 -d 172.16.30.2 --dport 80 -j MASQUERADE`
We definitely don't need the -s. I added the -d and --dport so that the
minimal possible set of traffic from 127.0.0.1 is allowed out. Otherwise
this is a nasty potential firewall bypass owing to the fact that
route_localnet is enabled.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFVgVIilQemzhSD98T156FMJuO0wF4flks5rmCZFgaJpZM4MbqAA>
.
|
I'm not 100% certain about the firewall bypass, but my though is that, since we've enabled route_localnet, if all traffic from 127.0.0.1 to the 'net gets MASQ'd, this happens after the In any case, I'll be on vacation, offline for the coming week. This PR is editable by the general team - all are free to make changes. |
Yeah, I was partially right about the firewall bypass - it's slightly contrived, but IMHO worth preventing. If you have a FILTER rule that specifies the source IP, then it is bypassed. e.g.
Then if you have
This command works, though it shouldn't:
Admittedly it's a bit silly to have source IPs in the |
plugins/portmap/README.md
Outdated
}, | ||
{ | ||
"type": "portmap", | ||
"capabilities": {"portMappings": true} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indent?
Any updates on this? |
Sorry, this fell off of my plate. I should be done with this in the next few days. |
It does sound like a helper daemon might be desired for this, and the dst type LOCAL concerns as well. This feels like a minor enough window to me that I'd be happy to pass on it for now and add later if it becomes a problem, or once we add a helper daemon for some other reason. |
plugins/portmap/README.md
Outdated
You should use this plugin as part of a network configuration list. It accepts | ||
the following configuration options: | ||
|
||
* `noSnat` - boolean, defaul false. If true, do not set up the SNAT chains |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: defaul
plugins/portmap/README.md
Outdated
8080 and 8043 on the host forwarded to ports 80 and 443 in the container, the | ||
rules look like this: | ||
|
||
`PREROUTING`, `OUTPUT` chains: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interest of progressing this PR, why don't we start by implementing this the way it's written today (any addresses in the local routing table) and then follow up with a fix using ipsets?
It wouldn't be a regression in behavior since that's how k8s behaves today, and so long as we have a plan for how to fix in a second release then I think I'm OK with it.
An alternative would be to document this as "must be the last rule in the chain" but that feels very brittle in the face of collaboration users of iptables
Yeah it does, but maybe it's fine as a first step?
And to be fair, I sort of hate the "specific IPs" model, too - still hoping
for a better answer.
…On Tue, May 2, 2017 at 11:17 AM, Casey Davenport ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In plugins/portmap/README.md
<#1 (comment)>
:
> @@ -0,0 +1,103 @@
+## Port-mapping plugin
+
+This plugin will forward traffic from one or more ports on the host to the
+container. It expects to be run as a chained plugin.
+
+## Usage
+You should use this plugin as part of a network configuration list. It accepts
+the following configuration options:
+
+* `noSnat` - boolean, defaul false. If true, do not set up the SNAT chains
typo: defaul
------------------------------
In plugins/portmap/README.md
<#1 (comment)>
:
> +## Rule structure
+The plugin sets up two sequences of chains and rules - one "primary" DNAT
+sequence to rewrite the destination, and one additional SNAT sequence that
+rewrites the source address for packets from localhost. The sequence is somewhat
+complex to minimize the number of rules non-forwarded packets must traverse.
+
+
+### DNAT
+The DNAT rule rewrites the destination port and address of new connections.
+There is a top-level chain, `CNI-HOSTPORT-DNAT` which is always created and
+never deleted. Each plugin execution creates an additional chain for ease
+of cleanup. So, if a single container exists on IP 172.16.30.2 with ports
+8080 and 8043 on the host forwarded to ports 80 and 443 in the container, the
+rules look like this:
+
+`PREROUTING`, `OUTPUT` chains:
In the interest of progressing this PR, why don't we start by implementing
this the way it's written today (any addresses in the local routing table)
and then follow up with a fix using ipsets?
It wouldn't be a regression in behavior since that's how k8s behaves
today, and so long as we have a plan for how to fix in a second release
then I think I'm OK with it.
An alternative would be to document this as "must be the last rule in the
chain" but that feels very brittle in the face of collaboration users of
iptables
Yeah it does, but maybe it's fine as a first step?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFVgVBQukRnyflKQjNpQBJVPpXJyqULVks5r13MwgaJpZM4MbqAA>
.
|
Ok, I pushed an update to the plugin. It now takes a At this point in time, I feel it's ready for a release. Adding a port-reservation daemon would be a welcome addition, but that can be an additional feature. |
I'm not sure I see how the conditions fixes the local problem, but I am OK to get this in and iterate.. |
+1 for get this in first. |
The idea is so the admin can add additional conditions to patch up the
local problem. One could exclude or require specific subnets, if those are
predictable and distinct, for example.
…On Fri, May 5, 2017 at 8:38 AM, Tim Hockin ***@***.***> wrote:
I'm not sure I see how the conditions fixes the local problem, but I am OK
to get this in and iterate..
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAhp0ghnrAdrht_nU07dwEZeXi5Zh7uvks5r2sPYgaJpZM4MbqAA>
.
|
excluding subnets isn't going to help the problem I see. besides that, a
single "conditions" block doesn't allow me to exclude 2 different ranges.
To solve the problem I see (or rather, the way I know how - please teach me
if I am wrong :) is to change 1 "addrtype LOCAL" rule into N "-d" rules.
On Fri, May 5, 2017 at 2:57 AM, Casey Callendrello <[email protected]
… wrote:
The idea is so the admin can add additional conditions to patch up the
local problem. One could exclude or require specific subnets, if those are
predictable and distinct, for example.
On Fri, May 5, 2017 at 8:38 AM, Tim Hockin ***@***.***>
wrote:
> I'm not sure I see how the conditions fixes the local problem, but I am
OK
> to get this in and iterate..
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/containernetworking/plugins/
pull/1#issuecomment-299393536>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AAhp0ghnrAdrht_
nU07dwEZeXi5Zh7uvks5r2sPYgaJpZM4MbqAA>
> .
>
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFVgVKpqdng9ahFoyEgimoxk71ZXVvMHks5r2vKggaJpZM4MbqAA>
.
|
I'm sure both of us don't quite have the whole picture :-). With a single iptables rule you can exclude multiple ranges. You just can't include multiple ranges directly. In other words, iptables matches are always AND, never OR. However, you can match on multiple ranges using ipsets, which adds a bit more complexity but isn't too bad. In any case, any administrator-provided conditions would be in addition to the Could you provide a more concrete example of how the GCE vips appear on the host? It's a bit tricky to reason about otherwise. I was assuming the machine had an eth0 of, say, 10.42.1.2 and an additional local routing table entry of, say, 8.8.8.8. Does it make sense? My goal is to produce something simple and broadly useful, and this is a very good motivating example. I'm happy to try and get it right. |
How do you exclude multiple ranges?
```
# iptables -t nat -d 1.2.3.0/24 ! -d 1.2.3.0/25 -j LOG
iptables v1.4.21: multiple -d flags not allowed
Try `iptables -h' or 'iptables --help' for more information.
```
When a packet to a VIP arrives at the node, srcip is the original client,
and dstip is the LB VIP. To make the kernel receive it in normal VM-mode,
the LB IP is added to the `local` route table. Because we mux containers
to nodes N:1, we demux LB packets in iptables. We recognize the dstip and
dstport, and DNAT based on that.
The problem is that those packets match addrtype LOCAL *and* the
VIP-specific rule.
So we could demand ordering, and say that hostport checks should always be
at the end. That seems delicate. We could just code the list of allowed
IPs for hostports. That's tedious, but seems semantically closer. Very
open to better ideas.
…On Fri, May 5, 2017 at 10:08 AM, Casey Callendrello < ***@***.***> wrote:
I'm sure both of us don't quite have the whole picture :-).
With a single iptables rule you can *exclude* multiple ranges. You just
can't *include* multiple ranges directly. In other words, iptables
matches are always AND, never OR. However, you can match on multiple ranges
using ipsets, which adds a bit more complexity but isn't too bad.
In any case, any administrator-provided conditions would be in addition to
the --addrtype LOCAL.
Could you provide a more concrete example of how the GCE vips appear on
the host? It's a bit tricky to reason about otherwise. I was assuming the
machine had an eth0 of, say, 10.42.1.2 and an additional local routing
table entry of, say, 8.8.8.8.
Does it make sense? My goal is to produce something simple and broadly
useful, and this is a very good motivating example. I'm happy to try and
get it right.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFVgVNA2W24rkSNNnDzrQ5fXMmqB9P9zks5r21ecgaJpZM4MbqAA>
.
|
Argh, you're totally right about multiple |
plugins/portmap/main.go
Outdated
} | ||
|
||
func main() { | ||
skel.PluginMain(cmdAdd, cmdDel, version.PluginSupports("", "0.1.0", "0.2.0", version.Current())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why the supported versions are specified this way? This will skip 0.3.0
since version.Current()
returns 0.3.1
, and your sample config in the README.md uses 0.3.0
https://github.com/containernetworking/plugins/pull/1/files#diff-f6ee9b32ca4cfb6c2075e1ee55dadd07R22
CNI_COMMAND=VERSION ./portmap < conf
{"cniVersion":"0.3.1","supportedVersions":["","0.1.0","0.2.0","0.3.1"]}
I see other CNI plugins just use version.All
.
plugins/meta/portmap/README.md
Outdated
The plugin expects to receive the actual list of port mappings via the | ||
`portMappings` [capability argument](https://github.com/containernetworking/cni/blob/master/CONVENTIONS.md) | ||
|
||
So a sample config might look like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think about adding a note or a line here mentioning that the config should have a .conflist
extension?
@caseydavenport added some ip(6)tables error-ignoring logic, PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@squeed new ipv6 logic seems sound to me. I'll try it out when I'm in the office tomorrow.
plugins/meta/portmap/portmap.go
Outdated
ip4t := maybeGetIptables(false) | ||
ip6t := maybeGetIptables(true) | ||
if ip4t == nil && ip6t == nil { | ||
return fmt.Errorf("neither iptables not ip6tables usable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: not
b7d3c0e
to
9afd469
Compare
plugins/meta/portmap/portmap.go
Outdated
return nil | ||
} | ||
|
||
_, err = ipt.List("filter", "OUTPUT") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it turns out this check isn't great because it checks the filter
table, whereas the portmap plugin uses the nat
table, and it's possible to have the one but not the other :D
# ip6tables -L -t nat
ip6tables v1.6.0: can't initialize ip6tables table `nat': Table does not exist (do you need to insmod?)
Perhaps ip6tables or your kernel needs to be upgraded
# ip6tables -L -t filter
Chain INPUT (policy DROP)
target prot opt source destination
ACCEPT all anywhere anywhere
ACCEPT tcp anywhere anywhere tcp dpt:22
Chain FORWARD (policy DROP)
target prot opt source destination
Chain OUTPUT (policy DROP)
target prot opt source destination
ACCEPT all anywhere anywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. Looks like v6 nat support was added in kernel 3.7.
Fixed.
Should we consider adding iptables ACCEPT rules for the case where the machine has default deny? kubernetes/kubernetes#35069 |
one last TODO: need to make the teardown code not use indexes, or else we risk stepping on our own toes. |
And, updated - delete now deletes by rule directly. |
@squeed LGTM! |
LGTM |
Support delegating to Windows CNI - use a separate WindowsDelegate field as the format of Windows properties is different from the schema supported by Linux under the delegate field. This will users to configure both Linux and Windows together and also to use Windows specific configuration for Windows clusters.
update integration test configlist
Noop - A noop PR
This is a fairly standard iptables-based portmapping plugin. It supports ipv4 and ipv6.
This replaces the proposed plugin in the cni repository.