
RFE: Avoid host networking for Ironic #21

Open
dtantsur opened this issue Mar 27, 2024 · 13 comments

@dtantsur
Member

Exposing Ironic on host networking is far from ideal. For instance, if we do so, we're going to expose JSON RPC. There may be other internal traffic that we don't want everyone to see. In theory, only dnsmasq actually needs to be on host networking.

So, why does Metal3 use host networking at all?

  • dnsmasq serves as the DHCP and TFTP server. Both protocols are UDP-based and hard to route, and DHCP additionally involves broadcasts.
  • When booting over iPXE, hosts need to download iPXE scripts and the kernel/initramfs from an HTTP server. This server is local to the Ironic instance that handles the host. Since the host is not yet part of the cluster network, it cannot use the cluster service DNS name or IP.
  • IPA needs to reach back to the Ironic API (any of the running instances, optimally the one handling the host). There is still no cluster networking at this point.

One complication is supporting the bootstrap scenario. While most Metal3 consumers bootstrap their production clusters by establishing a temporary cluster with Metal3 and then pivoting, OpenShift has an important limitation: the bootstrap cluster only provisions control plane hosts. Thus, it cannot rely on any components that won't come up without workers, including e.g. Ingress.

@metal3-io-bot added the needs-triage label Mar 27, 2024
@dtantsur
Member Author

/triage accepted
/lifecycle frozen

@metal3-io-bot added the lifecycle/frozen and triage/accepted labels and removed the needs-triage label Mar 27, 2024
@dtantsur
Member Author

Currently discussed solution: using Ingress with a fall-back to a simple httpd-based proxy (probably derived from OpenShift's ironic-proxy) for edge cases like OpenShift.

The Ironic API part is relatively simple and could even be fixed with a load balancer like MetalLB. Ironic-standalone-operator could even create an IPAddressPool for the provisioning network if it's present (otherwise, just expect the human operator to create one). dnsmasq will refer the booting hosts to the load-balanced IP address, which will reach any Ironic instance.
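For illustration only, a minimal sketch of what that could look like with MetalLB. The namespace, Service name, pod label and address range below are assumptions, not something prescribed in this issue, and the address-pool annotation name varies between MetalLB versions.

```yaml
# Hypothetical address pool carved out of the provisioning network.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: provisioning-pool
  namespace: metallb-system
spec:
  addresses:
    - 172.22.0.100-172.22.0.110   # assumed provisioning network range
---
# Announce the pool on L2 so hosts on the provisioning network can reach the IP.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: provisioning-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - provisioning-pool
---
# LoadBalancer Service in front of the Ironic API; dnsmasq would hand out this IP.
apiVersion: v1
kind: Service
metadata:
  name: ironic-api
  namespace: metal3
  annotations:
    metallb.universe.tf/address-pool: provisioning-pool
spec:
  type: LoadBalancer
  selector:
    app: ironic           # assumed pod label
  ports:
    - name: api
      port: 6385          # default Ironic API port
      targetPort: 6385
```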

iPXE configuration is harder. The boot configuration API will be required to make sure Ironic serves the iPXE scripts correctly regardless of whether it handles this host. But we still need to serve kernel/initramfs/ISO images, and these should not be proxied through Ironic.

The issue with images can be handled by using Ingress. Since each Ironic is aware of its host name (and thus the cluster DNS name), it can compose an image URL with a sub-path that refers to the right Ironic. So, the Ironic instance with the name ironic-1 will serve images from http(s)://<ingress IP>/images/ironic-1/..., which will be redirected to http(s)://ironic-1.<service>.<namespace>:6183/....
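A rough sketch of that routing with a plain Ingress resource, assuming hypothetical per-instance Services named ironic-1 and ironic-2 in a metal3 namespace. The rewrite annotation is specific to ingress-nginx and is only one way to strip the sub-path.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ironic-images
  namespace: metal3
  annotations:
    # ingress-nginx specific: strip the /images/ironic-N prefix before proxying
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /images/ironic-1(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: ironic-1        # assumed per-instance Service
                port:
                  number: 6183        # image httpd port from the comment above
          - path: /images/ironic-2(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: ironic-2
                port:
                  number: 6183
```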

Open questions:

  • Can we have Ingress routes without HTTPS? Plain HTTP is required by iPXE in the general case and by virtual media in some rare cases.
  • Can we have Ingress IPs on control plane nodes? We do not want normal workloads to cross paths in any way with either the provisioning network or the exposed Ironic API.

@lentzi90
Member

> Can we have Ingress routes without HTTPS? Plain HTTP is required by iPXE in the general case and by virtual media in some rare cases.

Ingress can handle both HTTP and HTTPS traffic. In general, though, TLS termination is expected to happen in the ingress controller, so the traffic reaching the "backend" would be plain HTTP. There are solutions to work around this when the traffic needs to be encrypted all the way to the backend, but in that case it may be better to consider LoadBalancers that deal with TCP instead.
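As one example of such a workaround, ingress-nginx can pass the TLS connection through to the backend untouched. This is only a hedged sketch: it assumes the controller was started with --enable-ssl-passthrough, and the hostname, Service name and namespace are made up.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ironic-api-passthrough
  namespace: metal3
  annotations:
    # ingress-nginx forwards the raw TLS stream; the Ironic pod terminates TLS itself
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: ironic.example.com        # passthrough routing is SNI-based, so a host is required
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ironic-api      # assumed Service whose pods terminate TLS
                port:
                  number: 6385
```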

> Can we have Ingress IPs on control plane nodes? We do not want normal workloads to cross paths in any way with either the provisioning network or the exposed Ironic API.

I'm not sure I understand the question here, but I will try to clarify what I do know. Ingress controllers are usually exposed through LoadBalancers. The exact implementation differs between clusters, but it is quite common to exclude control-plane nodes, since you would not normally run the ingress controller there. Traffic can still be forwarded to any node in the cluster. That said, it is definitely possible to configure things so that the ingress controller runs on control-plane nodes and the LoadBalancer targets them.
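A minimal sketch of that last point, pinning a controller Deployment to control-plane nodes. The label and taint key are the upstream Kubernetes defaults; the namespace, names and image tag are illustrative assumptions.

```yaml
# Fragment of an ingress controller Deployment pinned to control-plane nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-controller-provisioning
  namespace: ingress-system                    # assumed namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingress-controller-provisioning
  template:
    metadata:
      labels:
        app: ingress-controller-provisioning
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.10.0   # example image/tag
```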

@hardys
Member

hardys commented Mar 28, 2024

> Can we have Ingress IPs on control plane nodes? We do not want normal workloads to cross paths in any way with either the provisioning network or the exposed Ironic API.

As @lentzi90 says, in situations where dedicated compute hosts exist the application Ingress endpoint would normally be configured so it cannot connect to the control-plane hosts.

But IIUC the question here is actually: can we run an additional ingress endpoint with a special configuration that targets the provisioning network? I think that probably is possible by running an additional Ingress Controller and something like an IngressClass. We'd also need to consider how to restrict access to that Ingress endpoint so regular users can't connect to the provisioning network.
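For what it's worth, a dedicated IngressClass could look roughly like this. The class name, controller value, Service and namespace are all made up for illustration; the controller value has to match whatever the additional ingress controller is configured to watch.

```yaml
# A second IngressClass so that provisioning-related Ingresses are only picked up
# by the dedicated controller, not by the default one serving user workloads.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: provisioning
spec:
  controller: k8s.io/ingress-nginx     # must match the extra controller's configuration
---
# Ironic-related Ingresses would then opt in explicitly.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ironic-provisioning
  namespace: metal3
spec:
  ingressClassName: provisioning
  defaultBackend:
    service:
      name: ironic-api                 # assumed Service name
      port:
        number: 6385
```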

@dtantsur
Member Author

This sounds like a lot of complexity to me. I'm starting to see writing our own simple load balancer based on httpd as a viable solution.

@dtantsur
Member Author

"Fun" addition: I've recently learned that some BMCs severely restrict the URL length for virtual media. So if we start using longer URLs, we may see more issues.

@Rozzii
Member

Rozzii commented Jun 28, 2024

@zaneb
Member

zaneb commented Jul 4, 2024

I think you missed a key reason why we can't just use a NodePort Service to expose the pod network (as in @mboukhalfa's PoC): node ports are constrained to a particular range (30000-32767) and available to Services on a first-come, first-served basis. That means users in any namespace can squat on a port and steal traffic intended for Ironic, which is a significant security vulnerability. (For existing deployments it also means requiring all users to change the settings of any external firewall they have, to account for the Ironic port changing.)
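To illustrate the squatting concern: nothing stops a workload in an unrelated namespace from claiming a specific port in that range. All names below are hypothetical.

```yaml
# A Service in an arbitrary namespace pinning a specific node port. Whichever
# Service claims the port first wins, and kube-proxy answers on that port on
# every node, so traffic meant for an Ironic NodePort could be captured here.
apiVersion: v1
kind: Service
metadata:
  name: port-squatter
  namespace: user-sandbox
spec:
  type: NodePort
  selector:
    app: not-ironic
  ports:
    - port: 6385
      targetPort: 6385
      nodePort: 30685      # explicitly requested; must fall within 30000-32767
```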

This could perhaps be mitigated by only running Ironic on the control plane nodes and never allowing user workloads on those nodes. But OpenShift at least has topologies that allow running user workloads on the control plane, so this would be a non-starter for us. Although, actually, I think kube-proxy will forward traffic arriving on that port on any node to the Service, so even separating the workloads doesn't help.

If this actually worked it would have made many things sooo much easier. So it is not for want of motivation that we haven't tried it.

I don't believe there is a viable alternative to host networking.

> I'm starting to see writing our own simple load balancer based on httpd as a viable solution.

The ironic-proxy that you implemented in OpenShift is exactly that, isn't it?

@dtantsur
Member Author

dtantsur commented Jul 4, 2024

> The ironic-proxy that you implemented in OpenShift is exactly that, isn't it?

Yes. Some community members are not fond of using an alternative to an existing solution, but I actually believe you're right.

@mboukhalfa
Member

@zaneb, good point. We are trying to encourage people to raise concerns, in whatever form, in the discussion https://github.com/orgs/metal3-io/discussions/1739; that's the reason behind having these PoCs. The current showcase is very limited, and it doesn't even consider the dnsmasq case. We foresee that the final design and implementation for Ironic and Metal3 networks will not be easy or quick. It is a long-term process.

Our plan is to start with the following ideas:

I would like to get your feedback on the LoadBalancer and Multus use cases in this discussion: https://github.com/orgs/metal3-io/discussions/1739. I am not an expert in network security within Kubernetes, so that's something we should document along the way.

@Rozzii
Member

Rozzii commented Jul 11, 2024

This is not really frozen since @mboukhalfa is working on investigating this very topic.
/remove lifecycle frozen

@Rozzii
Member

Rozzii commented Jul 11, 2024

/remove lifecycle-frozen

@Rozzii
Member

Rozzii commented Jul 11, 2024

/remove lifecycle frozen

@Rozzii removed the lifecycle/frozen label Jul 11, 2024
Labels: triage/accepted
Projects: Ironic-image (status: WIP)
7 participants