Running Crossplane on hostNetwork #5520

Open
Argannor opened this issue Mar 26, 2024 · 10 comments · May be fixed by #5540
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), provider, validation, webhooks

@Argannor
Contributor

What problem are you facing?

Running Crossplane on an AWS EKS cluster with the Calico CNI leads to multiple problems:

Why is this necessary? When running Calico on an EKS cluster, the Kubernetes control plane will still run on the AWS CNI (the control plane is completely managed by AWS and cannot be modified). Thus everything that needs to be reached from the control plane needs to run on the host network itself.

How could Crossplane help solve your problem?

A potential solution could be to make the ports configurable, but this would also need to be implemented on the function/provider side. To my knowledge the hostPort and containerPort values have to be the same if running on the host network (since it's the same network namespace as the host).

Maybe services of type NodePort could be explored as well.

Any ideas and feedback are greatly appreciated!

@Argannor Argannor added the enhancement New feature or request label Mar 26, 2024
@negz
Member

negz commented Mar 26, 2024

When running Calico on an EKS cluster, the Kubernetes control plane will still run on the AWS CNI (the control plane is completely managed by AWS and cannot be modified). Thus everything that needs to be reached from the control plane needs to run on the host network itself.

Making sure I follow correctly - the control plane (i.e. API server) needs to reach Crossplane components to hit its webhooks?

How do other projects address this?

@ravilr
Contributor

ravilr commented Mar 27, 2024

Yes, this is an issue in EKS setups where the pod network uses a custom CNI in overlay mode (Calico overlay, Cilium overlay, etc.), which gives pods non-VPC-routable IPs. Pod-to-pod communication works over the overlay network provided by the respective CNI, so Crossplane pod -> Function pod gRPC endpoints work seamlessly (with an appropriate NetworkPolicy, if one is being enforced). But the EKS API server doesn't have direct connectivity (a route) to the pods on the overlay network.

And AWS doesn't yet allow configuring this route through configuration options in the EKS control plane: aws/containers-roadmap#2227 (comment)

To work around this for any K8s API server -> webhook pod communication, the easiest option is to run every webhook pod that requires ingress from the EKS API server in hostNetwork mode, which gives such pods VPC-routable IPs.

Most CNCF projects therefore allow configuring the ports for their webhook pods, for example:
https://cert-manager.io/docs/installation/compatibility/#aws-eks
So does any k8s/controller-runtime based app, through Manager Options, Options

With most of the upjet-based official family providers enabling conversion webhooks in the provider MR CRDs for API versioning support, this problem is getting worse in such EKS setups. The provider pkg reconciler in Crossplane core seems to hardcode the webhook and metrics ports to 9443 and 8080 respectively. Running provider family pods in hostNetwork mode, configured through runtimeDeploymentConfig, therefore leads to port conflicts: the pods fail to be scheduled unless more worker (EC2) nodes are added to the cluster, one node per provider family deployment pod, which is not a tenable solution.

And setting or overriding the ports in runtimeDeploymentConfig per provider doesn't seem to work as expected.

@ravilr
Contributor

ravilr commented Mar 27, 2024

Also, just to add for any future onlookers,
#5521 isn't the same as this issue.

In #5521, the error manifests as a timeout: failed to call webhook: Post "https://crossplane-webhooks.crossplane-system.svc:9443/validate-apiextensions-crossplane-io-v1-composition?timeout=10s": context deadline exceeded, which indicates a network ACL issue (NetworkPolicy or AWS EC2 security group rules).

In this issue, where the pods run on a non-VPC-routable overlay network, the error manifests as failed to call webhook: Post "https://crossplane-webhooks.crossplane-system.svc:9443/validate-apiextensions-crossplane-io-v1-composition?timeout=10s": Address is not allowed. "Address is not allowed" indicates that the EKS control plane network (kube-apiserver) cannot recognize the pod IP in the overlay network, so such webhook pods end up running with hostNetwork: true to make them reachable.
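
For anyone triaging these two failure modes, the distinction can be written down as a small check. This is only a sketch: it relies on nothing but the two error substrings quoted in this thread, and the type and function names are made up for illustration, not part of Crossplane.

```go
package main

import (
	"fmt"
	"strings"
)

// Cause is a coarse guess at why the API server could not call a webhook.
// The type and its values are hypothetical, invented for this example.
type Cause string

const (
	CauseBlocked    Cause = "traffic blocked (NetworkPolicy or security group, as in #5521)"
	CauseUnroutable Cause = "pod IP not routable from the control plane (this issue)"
	CauseUnknown    Cause = "unknown"
)

// ClassifyWebhookError maps a webhook error message to a likely cause,
// using only the two substrings quoted in the thread.
func ClassifyWebhookError(msg string) Cause {
	switch {
	case strings.Contains(msg, "Address is not allowed"):
		return CauseUnroutable
	case strings.Contains(msg, "context deadline exceeded"):
		return CauseBlocked
	default:
		return CauseUnknown
	}
}

func main() {
	msg := `failed to call webhook: Post "https://crossplane-webhooks.crossplane-system.svc:9443/validate-apiextensions-crossplane-io-v1-composition?timeout=10s": Address is not allowed`
	fmt.Println(ClassifyWebhookError(msg))
}
```

A substring match is a heuristic only; other root causes may well produce the same messages.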

@negz
Member

negz commented Mar 27, 2024

Adding a flag to providers (and core) to make the webhook port configurable sounds reasonable to me.
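
A rough sketch of what such flags could look like, assuming hypothetical flag names (`--webhook-port`, `--metrics-port`) and using Go's standard `flag` package as a stand-in for whatever CLI library a given provider actually uses. The defaults mirror the ports this thread reports as currently hardcoded (9443 and 8080).

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Ports holds the listen ports discussed in this thread.
type Ports struct {
	Webhook int
	Metrics int
}

// ParsePorts parses the hypothetical --webhook-port and --metrics-port flags
// and rejects values that are out of range or that collide with each other.
func ParsePorts(args []string) (Ports, error) {
	fs := flag.NewFlagSet("provider", flag.ContinueOnError)
	var p Ports
	fs.IntVar(&p.Webhook, "webhook-port", 9443, "port the webhook server listens on")
	fs.IntVar(&p.Metrics, "metrics-port", 8080, "port the metrics server listens on")
	if err := fs.Parse(args); err != nil {
		return Ports{}, err
	}
	for name, port := range map[string]int{"webhook": p.Webhook, "metrics": p.Metrics} {
		if port < 1 || port > 65535 {
			return Ports{}, fmt.Errorf("%s port %d is out of range", name, port)
		}
	}
	if p.Webhook == p.Metrics {
		return Ports{}, fmt.Errorf("webhook and metrics ports must differ, both are %d", p.Webhook)
	}
	return p, nil
}

func main() {
	p, err := ParsePorts(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("webhook=%d metrics=%d\n", p.Webhook, p.Metrics)
}
```

On the host network the parsed value would also have to be the port declared on the container, since hostPort and containerPort must match there, as noted in the issue description.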

@negz negz added validation provider help wanted Extra attention is needed labels Mar 27, 2024
@Argannor
Contributor Author

I think I could start working on this next week, if you want me to. We would also need to introduce flags for the other ports, though (gRPC and metrics are the ones I'm currently aware of).

@negz
Member

negz commented Mar 29, 2024

Please do! Thank you.

@negz negz added the webhooks label Mar 29, 2024
@Argannor Argannor linked a pull request Apr 3, 2024 that will close this issue
6 tasks
@jbw976 jbw976 added this to the v1.16 milestone Apr 4, 2024
@jaredhancock31

We noticed that even if we disable metrics via Helm values, the manager container will still try to bind to the metrics port:

hostNetwork: true
metrics:
  # -- Enable Prometheus path, port and scrape annotations and expose port 8080 for both the Crossplane and RBAC Manager pods.
  enabled: false

manager init: https://github.com/crossplane/crossplane/blob/master/internal/controller/pkg/revision/runtime.go#L221

error: crossplane: error: core.startCommand.Run(): Cannot create manager: error listening on :8080: listen tcp :8080: bind: address already in use

I think this also causes some problems with Cluster Autoscaler, since it doesn't know the pod wants the hostPort and therefore will not try to bring up a new node to alleviate the conflict.

Haven't had time to go deep on this, so we still need to confirm. But in general we saw that when metrics was disabled, the pod just crashlooped and Cluster Autoscaler didn't react at all, where typically it would call out that host ports are not available and scale up.
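
The bind failure itself is easy to reproduce outside Kubernetes. The sketch below uses only the Go standard library, with a made-up helper name, and opens two listeners on the same address in one network namespace, which is the situation two hostNetwork pods on one node are in. It uses an OS-assigned loopback port rather than 8080 so it can run anywhere.

```go
package main

import (
	"fmt"
	"net"
)

// BindTwice listens on a free loopback TCP port, then tries to listen on the
// exact same address again while the first listener is still open.
// It returns the error from the second attempt (expected to be non-nil) and,
// separately, any unexpected error from opening the first listener.
func BindTwice() (conflict, setup error) {
	first, err := net.Listen("tcp", "127.0.0.1:0") // port 0: let the OS pick a free port
	if err != nil {
		return nil, err
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		second.Close() // not expected; clean up if it somehow succeeded
	}
	return err, nil
}

func main() {
	conflict, setup := BindTwice()
	if setup != nil {
		fmt.Println("could not open the first listener:", setup)
		return
	}
	fmt.Println("second listener:", conflict)
}
```

Because the process only fails once it is already running, nothing at scheduling time signals the conflict unless the port is also declared on the pod, which may be why the autoscaler stays quiet in the case described above.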

@jbw976 jbw976 modified the milestones: v1.16, v1.17 May 15, 2024
@marianobilli

Please note that it's not only a matter of setting hostNetwork: true. The webhook port also needs to be changeable: if more than one pod lands on the same node, they cannot all use the same port, and the second pod to arrive will fail to start.

Another thing that should be configurable is the dnsPolicy of the pod. It should be
dnsPolicy: ClusterFirstWithHostNet

The port should be able to be configured per provider.
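
One way to picture per-provider ports is a small allocator that hands each provider a distinct webhook port. This is purely illustrative: the function name, the base port, and the provider names are assumptions for the example, and Crossplane does not necessarily do anything like this.

```go
package main

import (
	"fmt"
	"sort"
)

// AssignWebhookPorts gives every provider its own port, counting up from base.
// Names are sorted first so the result does not depend on input order.
func AssignWebhookPorts(base int, providers []string) (map[string]int, error) {
	names := append([]string(nil), providers...) // copy, so the caller's slice is not reordered
	sort.Strings(names)

	ports := make(map[string]int, len(names))
	next := base
	for _, name := range names {
		if _, seen := ports[name]; seen {
			continue // duplicate name: keep the port it already has
		}
		if next < 1 || next > 65535 {
			return nil, fmt.Errorf("no valid port left for %q (next would be %d)", name, next)
		}
		ports[name] = next
		next++
	}
	return ports, nil
}

func main() {
	// Example provider names; any unique identifiers would do.
	ports, err := AssignWebhookPorts(9443, []string{"provider-aws-s3", "provider-aws-ec2"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(ports["provider-aws-ec2"], ports["provider-aws-s3"])
}
```

The same idea would have to be applied to the metrics port as well, since it collides in exactly the same way on the host network.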

As I am using Flux CD for GitOps, I am patching the deployments using its Kustomization functionality, but it would be good if a proper solution were provided here.

Otherwise people on EKS with other CNI network plugins cannot have the webhook functionality at all.

Thanks.

@jbw976 jbw976 linked a pull request Jun 18, 2024 that will close this issue
6 tasks
@jbw976
Member

jbw976 commented Jun 18, 2024

@marianobilli if you haven't gotten a chance to look at #5540, it may be useful to have your opinion there too 🙇‍♂️

@mark-loeser

As I am using Flux CD for GitOps, I am patching the deployments using its Kustomization functionality, but it would be good if a proper solution were provided here.

Otherwise people on EKS with other CNI network plugins cannot have the webhook functionality at all.

@marianobilli I was curious if you could share more about how you ended up addressing this issue in your setup? We are finding that the Crossplane controller instantly reverts any changes we attempt to patch in (for the deployment or service).

Projects
Status: In Design

7 participants