A better Kubernetes from the ground up (dave.tf)
261 points by mr-karan on Nov 29, 2020 | 152 comments



The opening is weak (mutable pods are considered an inherent anti-pattern in k8s), but I think he’s got a lot of good points about networking.

Every time I look at k8s networking seriously, it gives me great pause on whether I should continue to run such a complex system. IPv6+etcd would solve this matter really well.


I came here to post pretty much the same thing.

Networking is the most poorly designed aspect of Kubernetes. So much so, I'm honestly surprised it didn't kill off k8s early on. You still can't get proper service-layer load balancing out of the box w/ iptables.

I'm not sure of the history - it's the simplest thing possible, I guess, but it doesn't perform very well at the most basic of tasks. To try and solve this, we have the current mess of service meshes, CNI, etc. It's going back to the middleware days of yore - which I think anyone with operational experience knows we really need to avoid. Having multiple layers of network proxies and masquerading between service calls, on an internal network, is just ridiculous and difficult to debug or operate at scale.


K8s networking is just typical networking. Networking always has to be made as complex and clunky as possible. Obvious solutions like unified IPv6 address space must be avoided as they would simplify the system and avoid the need for NAT.

I’m being snarky, but it is a long-standing observation of mine. When I see a network config I am more often than not shocked by its needless complexity.


Folks who say stuff like this aren’t taking into account the number of real world constraints on networking at a large scale. There are bandwidth, reliability, security, management, cost, and performance considerations behind the complexity, just as there are at other layers of the stack. Folks higher up the stack want a simpler abstraction, but you can’t short-circuit physical limitations with a unified IP space.


To take the first property you mention, as an example: How do you feel bandwidth is helped by the proxying / NAT approach compared to traditional IP routing and Internet addressing?


Caching proxies and traffic controlling NAT.


Why do you need NAT to have caching or traffic control?


"K8s networking is just typical networking."

K8s networking is far more complex than typical networking, and with less visibility.


To your point, which “k8s networking”? My beef with k8s is that it punts on networking altogether and you have to cobble your own together with some CNI or another, and depending on which you choose it will significantly impact upstack choices such as load balancing and ingress, as well as downstack choices, such as whether you can use Tailscale to connect multiple nodes together (which the CNI and load balancer must be compatible with). At least this has been my experience as someone who isn’t a network specialist.


That is not fair; Kubernetes uses the same existing technology in a well-integrated fashion, including firewalling. That is what the NetworkPolicy and Service objects give you.

Communication between pods running across different computers gets encapsulated in another IP packet. This is pretty standard across the industry.
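
For example, the firewalling half of that is just a NetworkPolicy object (the names and labels below are placeholders, and enforcement is up to whichever CNI plugin you run):

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-from-frontend        # placeholder name
  spec:
    podSelector:
      matchLabels:
        app: backend                 # pods this policy applies to
    ingress:
      - from:
          - podSelector:
              matchLabels:
                app: frontend        # only these pods may connect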


What rescued kubernetes was adoption by AWS and Google. If you want to host Kubernetes yourself it is still a waste of time broadly speaking.


Kubernetes didn't need to be rescued in the first place. If you know how to manage any Linux box, setting up kubernetes is as easy as "# apt install kubelet kubectl".

Broadly speaking kubernetes saves time by providing an api for existing Linux technology. For example, iptables, ipvs, storage, self-healing, virtual ips, etc via yaml.

Kubernetes was created to give anyone "Google like, production grade" infrastructure without the vendor lock-in of the cloud. And kubernetes on physical machines is where it shines most.

EDIT: it is very easy to complain that kubernetes is complex, but try to do virtual ips with keepalived, corosync, pacemaker to understand how much time kubernetes saves by providing the same capability in a generic way which is easy to use. I have the feeling that people take kubernetes for granted without knowing how much manual work one has to go through in order to build a similar experience using the standard building blocks available.

EDIT 2: straight out of kubernetes home page: "Kubernetes builds upon 15 years of experience running production workloads at Google" that is a lot of saved time just by using kubernetes.


Kubernetes "builds upon" Google's "experience," but it isn't what Google runs and doesn't totally reflect what Google actually does:

https://twitter.com/copyconstruct/status/1253968037579030532

https://twitter.com/copyconstruct/status/1254273432847446016

If you happen to have the problems Kubernetes solves, it's great. But even Google doesn't have exactly those problems.

(And there is a huge difference between setting up Kubernetes and hosting Kubernetes. If you know how to sit in a chair, you can be in a plane's pilot seat just fine, but piloting is an entirely different skill.)


I don't think "lessons learned from Borg" is necessarily a recipe for success. Omega was also supposed to be lessons learned from Borg, but it actually sucked and never launched, with only a few bits of Omega getting cherry-picked back into Borg.

Also, I would disregard the "moving toward" in that post. Borg has always had mutable container limits, for 10+ years (Large-scale cluster management at Google with Borg, §5.5).


Correct, kubernetes is a distillation of those 15 years of experience acquired running something else that also has network, storage and compute concepts. I didn't say Google uses kubernetes, though :)

Still, kubernetes does a wonderful job of abstracting away all the hard work of using the underlying tools.

Look how simple it is to create an IPVS round-robin by just defining a "kind: Service" object; try to do the same using the bare tools in a way that is synchronized across N computers.
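
For reference, a minimal sketch of such a Service (names and ports are placeholders); with kube-proxy in IPVS mode this round-robins across whatever pods match the selector:

  apiVersion: v1
  kind: Service
  metadata:
    name: my-app              # placeholder name
  spec:
    selector:
      app: my-app             # matches the label on the backing pods
    ports:
      - port: 80              # port exposed on the Service's virtual IP
        targetPort: 8080      # port the containers actually listen on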


I would argue it doesn’t go far enough, especially in networking/discovery. Kubernetes is incredibly useful but I personally want more out of the box than just abstractions over existing linux technologies.


Networking just works.

Discovery via DNS is pretty standard too.
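
For example (service and namespace names below are placeholders, assuming the default cluster.local domain), from inside any pod:

  # resolves to the Service's cluster IP
  nslookup my-app.my-namespace.svc.cluster.local
  # a headless Service (clusterIP: None) returns the individual pod IPs instead
  nslookup my-headless-app.my-namespace.svc.cluster.local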

Mind elaborating on what you would like to see in those areas?


I don’t know enough about kubernetes networking to really give it a thought. That’s my issue, it’s kinda a really grey area. I just don’t know enough about it to be able to judge what I’m missing. Does that make sense? That’s why I want to see more.

Discovery over DNS suffers from the TTL issue, right? You’re basically DoS'ing yourself with DNS updates?


> setting up kubernetes is as easy as "# apt install kubelet kubectl".

That's not even remotely accurate unless you run Kubernetes on AWS, Azure, GCP.

> straight out of kubernetes home page: "Kubernetes builds upon 15 years of experience running production workloads at Google".

Sounds like you need to read that sentence carefully. They never say they use Kubernetes.

It seems to me you're a Kubernetes fanboy.


You don't need to use the complex stateful overlay networks. At Stripe we have a network overlay using IPv6 and the Linux kernel's built-in stateless tunnel device, so there's effectively unlimited addresses with no coordination between worker machines and no iptables port remapping.


> At Stripe we have a network overlay using IPv6 and the Linux kernel's built-in stateless tunnel device...

Can you please expand on this and if possible point to some references that'd help in case I want to go down this route myself? Thx.


I posted briefly elsewhere in the thread, but am now back at a computer with a keyboard so can give a more detailed answer.

First, background reading:

https://en.wikipedia.org/wiki/6to4

https://en.wikipedia.org/wiki/IPv6_rapid_deployment

The basic idea is you create a SIT tunnel device and assign it an IPv6 /64 composed of two parts:

1. A network prefix between 32 and 56 bits long. This prefix is the same for all machines in the network.

2. A subnet derived from the machine's IPv4 address, minus the netmask.

For example, if your IPv4 addresses are allocated from 192.168.1.0/24 and the machine has 192.168.1.155, then the network prefix should be 56 bits long (64 - (32 - 24)) and the machine's prefix is `xxxx:xxxx:xxxx:xx9B::/64`.

The Linux kernel knows how to wrap the IPv6 with IPv4 so it can route within your local network to any other machine with a similarly configured tunnel device. If you want to send packets to 192.168.1.200 then they get addressed to `xxxx:xxxx:xxxx:xxC8::1` or whatever, they'll transit the IPv4 network like normal, and on arrival the receiving machine's kernel will strip off the IPv4 wrapper and route the IPv6 locally.

How's this useful? Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 address within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route traffic to it, no proxy or iptables needed.
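
A rough sketch of this kind of setup with iproute2 (the ULA prefix and addresses are made up for illustration; real deployments will differ):

  # node has IPv4 192.168.1.155 in 192.168.1.0/24; overlay prefix is fd00:aa:bb:cc00::/56
  ip tunnel add sit1 mode sit local 192.168.1.155 ttl 64
  ip tunnel 6rd dev sit1 6rd-prefix fd00:aa:bb:cc00::/56 6rd-relay_prefix 192.168.1.0/24
  ip link set sit1 up
  # this node's delegated /64 embeds the IPv4 host byte (155 = 0x9b)
  ip -6 addr add fd00:aa:bb:cc9b::1/64 dev sit1
  # everything else in the /56 goes via the tunnel; the kernel derives the
  # destination node's IPv4 address from the bits embedded in the IPv6 address
  ip -6 route add fd00:aa:bb:cc00::/56 dev sit1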


> Well, if each machine has a /64 prefix then each pod can be allocated an IPv6 address within that prefix without coordinating with other machines. Let's say the pod gets `xxxx:xxxx:xxxx:xxC8::aaaa:bbbb:cccc`. Anything with a correctly configured tunnel and that pod IP can route traffic to it, no proxy or iptables needed.

I’ve been meaning for a while now to experiment with this same idea in Erlang. I.e., hack up the Erlang runtime to use an IPv6 address as its PID type, such that each Erlang node running on a machine gets its own /64 subnet to hand out; and each Erlang actor-process on that node gets an IP allocated from its node’s /64 range.

This could just be a way of letting Erlang nodes talk to each-other through tunnels. Or it could be a way of having Erlang “VMs” exposed directly to the Internet as their own little machines.


This deserves a blog post in itself. Please be kind and share it with the world! I bet you will hire a few engineers as a result of this blog post :)


I think you might mean a SIT (Simple Internet Transition) interface and not SIP? In case anyone is interested, this is a quick read on setting this up:

https://kogitae.fr/debianipv6-debian-wiki.htm


Yes, sorry, SIT -- it's been a while since I set it up and I forgot some of the details.


We've fixed that typo in the GP comment now.


Outstanding post. Thank you for taking the time to share this gem.


I haven't used it, but doesn't this suit the average use case?

https://www.cni.dev/plugins/main/macvlan/

Basically just do normal ipv4 via your dhcp server rather than an overlay.

-- Edit

For arguments sake, I just set this up:

  root@nas:/opt/cni/bin# ./dhcp daemon

  cat /etc/cni/net.d/01-macvlan.conf
  {
    "name": "mynet",
    "type": "macvlan",
    "master": "eno1",
    "ipam": {
      "type": "dhcp",
      "routes": [{ "dst": "192.168.1.0/24" }]
    }
  }

  PODIP = 192.168.1.181:8096

Works in my browser; so routes correctly.

Got its dhcp from my pihole.


This is super useful for home networks, I do this for my k8s cluster hosted by a bunch of pi's.

But in production, I'd rather my ability to launch a new pod not be dependent on a DHCP server being reachable and functional. In that case, this particular trick is rather neat, since assignment of IP addresses is fully static/local (without having to agree upfront what range of IPs each node can use for bringing pods online), while retaining the benefit of everything being directly routable. You can now also run a ridiculous amount of pods on a single node.


Yeah, I don't do this in production.

Though to counter your point, you don't actually need to use an external DHCP server in my example either; you can just define the block you're giving the server via the macvlan/ipvlan plugin, and I presume, again, it works with both IPv4 and IPv6.

So I guess my wider point is, k8s probably doesn't need to be replaced to have the networking work how you like.


Indeed and if you bring up a Kubernetes cluster today on AWS using their EKS there is no overlay network. There's a CNI but no overlay.


In some deployments we remove the VPC CNI and use an overlay anyway because of the integrations we need. You do lose Security Groups but that's not a big deal if you aren't using them in your deployment anyway.


Oh sure, there are lots of good reasons for not using their CNI, not least of which is needing to worry about IP space utilization. I was just trying to point out overlays aren't a requirement.


I would need to double check, but isn’t it still running an overlay? Just one which is transparent to you?


Well an overlay network by definition is always going to be transparent to you. But no the AWS VPC CNI does not create an overlay. It's just layer 3. It works by adding ENIs to your worker nodes with secondary IP addresses on them. And those secondary IPs are from your VPC's address space. See:

https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/c...


Yep, another reason it does that is to spread the risk and limit the need for tons of unchecked (no src/dst check) addresses in your VPC.

In the end there is no ideal scenario; boils down to what works best for the use case (or what is the 'least worst' solution). Sometimes it gets you down, but those imperfections can turn a churn job into an interesting one.


Could you elaborate on that a little bit? This sounds interesting and I'm not sure where to start researching about this.


On my mobile so it's hard to go into detail, but take a look at 6rd and the Linux 6to4 tunnel driver. You can assign a prefix to the overlay, then each machine's IPv4 becomes a subnet, and packets can be routed by the kernel knowing only the destination pod's IPv6 address.


I'm curious about your practices regarding CPU limits at Stripe. Have you noticed severe CPU throttling? What are your guidelines on this?


Product teams deploying code to our Kubernetes clusters are strongly recommended to use resource limits, and we're going to make that a hard requirement at some point.

We haven't noticed unusual CPU throttling, though we do have some workloads that turned out to be burstier than expected and had to adjust their CPU limits to match.

Note that when it comes to subtle Linux thread scheduling behavior, your experience will depend on which runtime you use, and if using runc then which version of the Linux kernel your workers run. We weren't affected by the CFS bug introduced in Linux v4.18 because we never ran Kubernetes workloads on a machine with the affected kernel, and if a similar bug occurs in the future it might not affect workloads running within gVisor or Firecracker.

Additionally, Stripe has historically cared more about security than efficiency. This led to an architecture where services run on dedicated VMs, which naturally strands capacity and reduces the impact of bugs that appear at high utilization and/or high core count.


I believe you did run into the CFS bug in non-Kubernetes workloads, though, specifically with Hadoop tasks. An engineer told me about a workaround he devised using cpuset.


That's likely a different bug. From what I understand the CFS bug being discussed was introduced in Linux v4.18 and fixed in v5.3, and we have not used a kernel within that range in our Hadoop clusters.


Not from stripe, but I've seen pretty bad CPU throttling.

We often see quotas get exhausted through short bursts that don't show up in metrics, which then causes CFS throttling to occur even though it looks like the pod is nowhere near its limit. We've also struggled with application startup requiring far more CPU than normal runtime, leading to ridiculously slow startup times if you had a low limit.

So far our solution has been to just remove CPU limits, but hoping things will get better.

Removing the limits really improved our latency tail, and so far it hasn't resulted in CPU saturation at the node level, but your mileage may vary.
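
For anyone wanting to try the same, this is roughly what it looks like in a container spec - keep a CPU request (so scheduling and fair-share weighting still work) but omit the CPU limit, so there's no CFS quota to throttle against:

  resources:
    requests:
      cpu: "500m"        # still used for scheduling and cpu.shares weighting
      memory: "512Mi"
    limits:
      memory: "512Mi"    # keep a memory limit; omitting cpu here means no CFS quota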


Did you know that the Linux kernel has a bug that makes CPU limits for containers extra costly?

https://github.com/kubernetes/kubernetes/issues/67577

https://github.com/torvalds/linux/commit/512ac999d2755d2b710...

If I recall correctly you need 4.18+ to get the fix.


We’ve seen just regular throttling too, especially with Erlang VM and Go workloads.


Yeah the value of being able to eliminate pod mutation as a source of trouble is pretty hard to overcome by any feature I can imagine mutability offering. Certainly not the example given. Actually the example must be over my head, because I do the equivalent of SIGTERM restarts with new configs on pods with a single command all day long when building services.


> IPv6+etcd would solve this matter really well.

Like others on this thread I completely agree that k8s networking is over-the-top complex. Still, it is hard to understand how this addresses relatively common use cases like providing intelligent load balancing for clients outside the cluster or secure multitenancy within it.

Solving this requires a step back to think through simplifying networking itself. I thought amazon did the world a favor by removing L2 networking from their model. We need a better set of primitives that abstracts out as much of the low level implementation as possible. The current situation feels like data management before the advent of the relational model.


Yeah, I’ve been dying to have a “simple” k8s solution that either uses just a VLAN or IPv6. It’s been one of the few components of k8s whose rationale I simply do not understand; it seems like so much engineering and complexity for what, exactly?


There's Bridget[1], a non-overlay network for k8s that can use/create a VLAN over a network adapter (or just use the network adapter directly).

[1] https://github.com/kvaps/bridget


I've used calico to set up a flat IPv6 cluster without overlays. It was pretty pleasant.

https://github.com/arianvp/packet-ipv6-kubernetes


Seems GKE uses calico, though not with ipv6 (as GCP doesn't support it internally), so at least this networking solution has some buy-in. (broadly speaking)


Cilium can be configured to use VXLAN and etcd to coordinate machine subnets (it does routing using eBPF).


He’s overestimating how much money is made by selling Kubernetes network addons.


He addresses that pretty directly in his post, unless i’m misunderstanding you somehow?


I think you must have misunderstood one or the other.


Hear, hear. The unquestioning attitude towards the complex system of application level proxies is disquieting.


Interesting article. Thanks for sharing!

My dream platform:

1. Single binary install on nodes, and easy to join them into a cluster.

2. Resources defined as JSON with comments in a simple format with JSON Schema URLs denoting resource types - I should be able to run 1 container with 3 lines of resource definition.

3. Everything as a CRD... No resources or functionality pre-installed, and instead available via publicly hosted HTTPS schema URLs.

4. Pluggable / auto-installed runtimes based on the schema URL or a "runtime" field: containers, vms, firecracker, wasm, maybe even bare processes, etc.

5. A standard web dashboard with a marketplace which can install multiple containers, vms, wasm or via a copy-pasted HTTPS URL, or a standard electron app which lets me connect to clusters or manage local deployments on my devbox.

6. Apps can provide a JSON schema with config options, which map to env vars and volume mounts and can be displayed in a user friendly way in the web dashboard.

I feel something like this could standardize even more than kube. I'd love if one system allowed me to manage AWS EC2 / DigitalOcean / Proxmox instances, as well as manage services on my devbox for daily use (like daemons, etc.)

With that said, I do like a standard format across service providers. While I do find kube complex at times, I like that something "won" this battle, and I also like the push towards containers vs vms.

I'd love to see kube start paring down and innovating toward making things easier for smaller clusters. Anyone know if there are discussions about that going on, or if there are resources for managing kube in smaller teams? Anyone interested in this?


Working on something that will probably end up covering a lot of those requirements in the long run. https://micro.mu


Hashicorp's Nomad comes close to meeting these requirements: https://www.nomadproject.io/

If you want to stick with k8s, the k3s project is an attempt to simplify it: https://k3s.io/


K8s forced industry to have a consistent open cluster API that will drive innovation and competition.

We can have multiple implementations of the same API, but what I am seeing currently from the "commercial" vendors is the base K8s with UI changes. I hope we will have multiple implementations of the spec. The specs also have to evolve with time.

Last decade, systemd had its fair share of criticism. But it provided a consistent API to run "Compute Units" locally. K8s can benefit from the same principles to manage "Compute Pods" across the cluster. The concepts of promise theory [1] provide interesting control loop co-ordination.

[1] https://en.wikipedia.org/wiki/Promise_theory


If you like systemd AND kubernetes API, this might be of interest to you: https://github.com/miekg/vks


I'm quite confident that one day kubernetes will make use of systemd-nspawn. Let's see... :)


> That means the computers in my home server rack, the DigitalOcean VM I have, and the couple other servers dotted around the internet. These should all be part of one cluster, and behave as such.

YES! This is exactly what we should have been able to do in K8s swiftly, but I would like to suggest one additional goal: to get a highly distributed, fault-tolerant compute cluster across different networks without a hassle (as a means to spread risk), while also having as low a TCO as possible through the use of heterogeneous architectures.

So this means I could use AWS for their Graviton instances for high-performance, near-bare-metal microservices, and also run typical x86 workloads on various cheap VPS providers such as Vultr, DO and any other cheap hosts out in the wild to handle normal stuff that typically wouldn't run well on ARM, such as GitLab, Prometheus and Keycloak - in the sense that x86 becomes the "heavy stuff" runner and ARM the "lightweight stuff" guy. This is not really possible in today's k8s ecosystem, given that the majority of the images on Docker Hub are predominantly x86/amd64 - my wild guess is around 90%. By the way, I use ServerHunter[1] to scavenge such cheap x86 servers.

Also, I'm a k3s[2] user and I have managed to do this pretty well. Given kernel support, you can even strap WireGuard on natively (albeit my VPS host's internet bill tells me this can hit quite hard).

While I can definitely control k8s using kubeadm, I just liked k3s's philosophy of coming with batteries included: it will install Traefik (although my recent need to run nginx-ingress ultimately led me to turn it off), CoreDNS, Flannel, simple local storage and extra goodies out of the box in your cluster, and you can opt out of them whenever you think they are insufficient for your kind of workload (while they are adequate for 90% of things, really). To be honest, this is how simple k8s should have been from the very beginning, to say the least.

[1]: https://www.serverhunter.com/ [2]: https://k3s.io


Counter-argument: having your entire world-wide deployment operate under a single control plane is a recipe for global outages. There should be no single command that one can fat-finger that will bring down your system globally.

One-cluster-per-region (with some tie-in into one region being its own failure domain, both at the underlying infrastructure and application level) is the way to go for reliability.


The model put forth in TFA seems to address this in that pods (or “sub-clusters”) can run for weeks at a time without communication with the top-level cluster. It’s pretty hand-wavy and probably can’t solve for all possible outage scenarios, but it seems like it would help dramatically.


> YES! This is exactly what we should have been able to do in K8s swiftly, but I would like to suggest one additional goal: to get a highly distributed, fault-tolerant compute cluster across different networks without a hassle (as a means to spread risk)

In practice this is going to be tricky unless your services are completely stateless. For 80% of people that's going to be true, but if you have customers with large datasets, (I'm thinking mostly media) you do not want to be schlepping those things between clouds or between cloud and on prem or even between on prems.


One potential gotcha in your model is egress costs. For the big three, it's anywhere from $0.03/GB to $0.10/GB. The million-dollar question suggested by your model is how do you store and back up your persistent state in a manner which is not just conceptually robust but cost-optimal?


I do this today with Nomad/Consul. Hashicorp's Raft+Serf implementation allows for shocking amounts of latency between servers/clients. I have several centralized server clusters (the "servers" _should_ be geographically close, the "agents" can be far and wide), and agents more than 200ms+ away, and across multiple clouds/on-prem. Everything works just fine.

I've legitimately considered running some kind of simple SaaS and behind the scenes running a Nomad cluster and having the remote SaaS agent just be a Nomad client.


Both of these taught me that Kubernetes is extremely complex, and that most people who are trying to use it are not prepared for the sheer amount of work that lies between the marketing brochure and the system those brochures promise.

...

GKE SRE taught me that even the foremost Kubernetes experts cannot safely operate Kubernetes at scale.

This is rough to hear coming from a former engineer at a top cloud company which has often led the way on “marketing brochures”.


It caught my eye too, but if this perception is true they do a pretty good job of faking it. I don't know if our clusters are "at scale" - the largest of them has about 120 nodes - but we have had very few issues in over three years of running production workloads on GKE.


GKE SRE feels all the pain of kubernetes and none of the pleasure. You don't get paged to be told that the fleet is happy.


Absolutely true!


I got involved very early and had to roll my own implementation using terraform and shell scripts.

It has been in production for a long time without issue. The only downside is that we have to do blue-green upgrades for major versions instead of in-place upgrades.

I keep pushing for our customer to adopt eks or gke, anything but my custom code base.


Out of curiosity, is it rough hearing that the author felt lied to by the marketing, or that you worked for one of the companies doing it, or something else?


as a customer of the mentioned cloud managed service


Author here. If I had to run Kubernetes for money-making purposes, I would still seriously consider GKE. I still think they're the best at running k8s. It's just that k8s limits what any operator can hope for.


> GKE SRE taught me that even the foremost Kubernetes experts cannot safely operate Kubernetes at scale.

This is a pretty damning indictment, honestly.

But has k8s gotten so influential that it'll be impossible to dislodge? I guess k8s has no problem breaking backwards-compatibility with new versions, so maybe somebody can propose something better even if it breaks compatibility.


In the Java and .NET world, with projects like Tye and Quarkus, you get to automate the interactions with Kubernetes from the language tooling side, so it becomes less painful to deal with all of Kubernetes' idiosyncrasies.


You could always make use of the Kubernetes APIs from your application through fabric8, if that's what you mean.


Nope, I mean not having to deal with kubernetes directly.

As for fabric8, the project is dead.


I think you meant “indictment” rather than “indigent.”


Edited, thanks. Serves me right for uncritically accepting automated corrections.


>"So, for starters, let’s rip out all k8s networking. Overlay networks, gone. Services, gone. CNI, gone. kube-proxy, gone. Network addons, gone."

Then they go on to suggest:

>"If you have more elaborate connectivity needs, you bolt those on as additional network interfaces and boring, predictable IPv6 routes. Need to secure node-to-node comms? Bring up wireguard tunnels, add routes to push node IPs through the wireguard tunnel, and you’re done.

and

>'We could also have some fun with NAT64 and CLAT: make the entire network IPv6-only, but use CLAT to trick pods into thinking they have v4 connectivity. Within the pod, do 4-to-6 translation and send the traffic onwards to a NAT64 gateway."

So swapping one type of NAT translation for another type of NAT translation and throwing in some additional tunneling? How is that any simpler, more elegant or even more manageable than the current state of K8s networking? There is no requirement that you have an overlay network at all. If you bring up an EKS cluster today in AWS, the default is the aws-vpc CNI, which is a flat address space the same as your VPC; there is no overlay.

Then further:

>"Sticking at the pod layer for a bit longer: now that they’re mutable, the next obvious thing I want is rollbacks. For that, let’s keep old versions of pod definitions around, and make it trivial to “go back to version N”.

>"Now, a pod update looks like: write an updated definition of the pod, and it updates to match. Update broken? Write back version N-1, and you’re done."

This is exactly what using a GitOps operator does. Then they go on in the next sentence to call GitOps "nonsense"?

Not much of this is convincing or even well thought out. This is definitely not "from the ground up." It's more like "throwing some shit against the wall and seeing if something sticks."


I must be boring because I manage plain old (virtual) machines with a load balancer at the edge.

It is easy as pie to manage and is easy to understand and secure.

I build the clustering into relevant services as needed, because what it means to be in a cluster together is highly service specific, so you are deluding yourself by thinking that a generic external clustering framework is anywhere near the answer.

If all you've ever seen is the Google-prescribed order of the world (Google SRE), then of course you would contemplate rewriting K8s instead of throwing it out.


Maybe there can be another form of Kubernetes that is even simpler and opinionated.

A project that picks one of [firecracker, gvisor, runc] and aggressively supports it, and combines it with a barebones network overlay that assumes nodes are running on a single subnet. Perhaps it also assumes that the control plane runs on dependable hardware and doesn't distribute everything. Maybe it exists already, and I just don't know of it...

Fewer moving parts the better for people like me who are not looking at huge scale but still scale enough to span several servers in a few racks.


Doesn’t Nomad already support FC?


There is a community firecracker task driver plugin for Nomad: https://www.nomadproject.io/docs/drivers/external/firecracke...


Cloud Foundry would be good for you


This was a good read. My question is, though: what’s to stop this Kubenext from becoming the next Kubernetes? To use the analogy, what’s to keep this sleek Go orchestrator from becoming the C++ everyone wants to migrate away from.

The networking bit reminds me a lot of Mesos, which was utterly hammered out of existence by corporations’ blind bandwagon-riding of Kubernetes. Mesos networking ran along with Docker’s and required you to do Borg-style service port mapping (albeit in an atomic number way). What I don’t like about 90% of this (and why ECS is my jam) is that I want an orchestrator of containers; I don’t want zookeeper, etcd, dns-proxy and the plethora of other services needed to make my orchestrator - orchestrate.

I spent a good few years running cloud architecture and infrastructure and managed a few SREs. The things that made our lives easier were guaranteed red/blue deployments (mentioned a bit in the article with PinnedDeployments), auto-sizing clusters to resource totals (and out again for deployments), Terraform/CloudFormation or really anything IaC, and a Slackbot for deployments with rapid feedback of status through the CI chain. Akin to “bot deploy <repo> to <env>” and letting the bot figure out the config from yaml in the repos.

I had the most joy using DC/OS. I had the most commercial success with ECS. I’ve had the most requests for kubernetes. I’ve had the worst headaches with abstractions of kubernetes (packaged installers, canonical...)


A bit OT. I’ve been putting off learning how to set up and run k8s, and am unfortunately in a situation where I don’t have anyone at work to learn from.

For context I’m no stranger to what the shape of production ready systems should be and can fill in the gaps given enough time to research and educate myself, but I don’t do operational work day-to-day.

I’m bringing a project to life right now and can’t help but feel that, while tough to learn and manage, k8s would be a good investment to make. I don’t have anyone using it yet, so I just did the bare minimum to get my docker-compose stack up and running on a Linode box. It works great for making sure what I have now works in a remote environment too, and I had to do a decent amount of configuration rework to get it ready, which should be transferable.

Now I’m wondering, how will things like rolling deployments work? I want to decouple the monitoring stack from the application stack, how will I handle adding another physical machine to my setup? I’m sure more questions will come up like this as I run into them, but would be curious to hear initial thoughts from anyone here to help me make a decision :)


I think the easiest way to get started is to use DigitalOcean's managed Kubernetes offering. If you already know Docker then basically you'll just be learning how to setup a cluster on Kubernetes, installing your app to the cluster and other apps using kubectl and helm, and then setting up an ingress into the cluster (likely nginx-ingress although there are other options).

In regards to your specific questions... adding another physical machine to your cluster would be really simple if you are using Terraform, you would just increment your node count number and then re-apply the Terraform config. Here is an example Terraform config(main.tf) from a simple project I have:

  variable "do_token" {}

  provider "digitalocean" {
    token = var.do_token
  }

  resource "digitalocean_kubernetes_cluster" "your-cluster" {
    name    = "your-cluster"
    region  = "sfo2"
    # Grab the latest version slug from `doctl kubernetes options versions`
    version = "1.16.6-do.2"

    node_pool {
      name       = "your-pool-1"
      size       = "s-1vcpu-2gb"
      node_count = 3
    }
  }
For deployments I would honestly just keep it simple and increment the tag version number on your Docker image each time you do a deploy. Then when you deploy your new image (e.g. kubectl apply -f deployment.yaml), the new image will be pulled down to the cluster nodes and the application pods will be replaced one by one.

If you are running this on another test/development cluster (e.g. minikube) prior to deployment, then you should have great confidence that this will succeed. In the event that you did run into an issue, just roll back the docker image version number and reapply the yaml using kubectl again.

I've been following this method for a while and have never had any downtime with deployments. Eventually if you get sophisticated you'll want to add these steps into an automated CI/CD pipeline, but kubectl apply can carry you pretty far in solo operations.
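
A sketch of that workflow (the deployment name is a placeholder):

  # bump the image tag in deployment.yaml, then:
  kubectl apply -f deployment.yaml
  kubectl rollout status deployment/my-app
  # if the new version misbehaves, re-apply the previous yaml or simply:
  kubectl rollout undo deployment/my-app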


Thank you for taking the time to craft such a thorough response! I had moving all infrastructure to terraform on my backlog already so that’s validating. I might try the k8s platform offered by Linode since I am there already, but I am constantly reading about how people use DO so maybe it’s more feature rich or something, I have just been a Linode fan for many years is all hah, and the pricing isn’t bad


My colleagues and I came up with an abstraction called conductors for the orchestration problem. The short version is that when you have some dependence on multiple resources being in a particular state before creating or updating yet another resource, use a conductor. They observe multiple resource kinds and operate an FSM internally. State transitions happen upon receiving an event from an observed resource. The logs of the conductor make it easy to debug dependency problems, because you can easily see that your conductor’s FSM is in a particular state, waiting to see some number of a particular resource.

The other principles we developed are “controllers” only control one resource and all resource updates for a resource kind must be serialized through a coordinator. Our paper has much more detail: A Cloud Native Platform for Stateful Streaming, https://arxiv.org/abs/2006.00064


Conductors sort of sound like the OTP pattern with supervisors: https://erlang.org/doc/design_principles/sup_princ.html


Whenever people use Go in an analogy and depict it as simple, I'm out. Easy, yes; simple, no.


How is it not simple? Or, how is it not simpler than other languages/ecosystems that could be used for implementing a system like Kubernetes?


I have the same feeling, but in this case I think the author 1) thinks too highly of Go and is unaware of the downstream problems its programming philosophy causes, and 2) probably isn't aware that there are much better analogies (C++ : X) to be made in the world of programming languages.


This is beautiful. I hope this person goes on and actually implements what they're talking about.

I really liked the part about simplifying networking; I really feel there should be a more general push towards both IPv6 and SRV records.


Networking is the most befuddling thing in k8s. I really don’t see how they got from borg to that. I think they thought that people just wouldn’t accept the free-for-all borg model. But I’d rather have some slight complexity in my name service (DNS sucks anyway) than configure networking in k8s.


Making pods mutable would break the core benefit of what Kubernetes' declarative system does for you.

This GitOps 'nonsense' gives me a well-defined and automatically backed-up infrastructure setup with auditing built in. It doesn't allow someone to snowflake around, which is brilliant, and it forces you and your colleagues to manifest stuff rather than forgetting it and degrading your system over time (Nix and NixOS are also great examples of such systems).

This reminds me of the time I learned HTML and wanted to set new lines all over a text instead of using proper paragraphs and letting HTML take care of the formatting.

I would like to see a better/stronger StatefulSet though: as long as a pod is alive, make sure its state is not interrupted - for example, allow a pod to be migrated to another node.

Nonetheless, I'm in the middle of setting up Kubernetes with kubeadm and the Cilium network. It's already really easy to do so, it will only get easier and more stable over time, and it's already great.
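
The gist of that setup, assuming a recent kubeadm and the Cilium Helm chart (chart and repo names as of late 2020):

  # on the first control-plane node
  kubeadm init
  # install a CNI; for Cilium, via its Helm chart
  helm repo add cilium https://helm.cilium.io/
  helm install cilium cilium/cilium --namespace kube-system
  # then run the "kubeadm join ..." command printed above on each worker node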

When you look at the storage example: yes, it's more difficult than just using a hard drive. But that ignores the issues with one hard drive: backup, checksums / bit rot, and recovery. With a storage layer, you can actually increase the replica count, and you can back up ALL storage volumes automatically.

The same with networking: with Cilium you can now have a lightweight firewall with DNS support.

It is much more critical for the whole industry to start rebuilding software to be more cloud/container native. This will reduce the pain points we have right now and will make it more resilient to operate. For example Jenkins: instead of one big master, have an HA setup for your work queue, a pod for the dashboard, and schedule workers on demand.

My personal conclusion: Don't use it, if you don't need it. If you need it, embrace the advantages.


> Making pods mutable would break the core benefit of what Kubernetes' declarative system does for you.

Which core benefit is that? I’m not following.

> This GitOps 'nonsense' gives me a well-defined and automatically backed-up infrastructure setup with auditing built in. It doesn't allow someone to snowflake around, which is brilliant, and it forces you and your colleagues to manifest stuff rather than forgetting it and degrading your system over time (Nix and NixOS are also great examples of such systems).

TFA says you can still use the GitOps “nonsense” if you want under his proposal.


Can't you use node affinity to stop k8s from moving a pod to a different node whilst alive?


That's not the problem this would solve. As long as the node runs, the pod itself runs, and there is no issue with a higher-priority pod, k8s will not throw it off that node.

But imagine a database as a pod with 60 gigs of RAM and an HA setup. Now you need to update your node - what does k8s do? It will throw the pod out and create a new one, which needs to recover or read all the logs to fill up 60 gigs of RAM again from nothing. Instead, it could migrate this pod to another node and keep the downtime to a minimum.

Or a Jenkins master: it has to shut down on node 1 and be recreated on node 2, which takes time, and then your agents need to be able to recover from it.

You have to be able to roll through your whole k8s infrastructure to update every node on a regular basis, for security reasons alone.


Sooner rather than later, Kubernetes will support live migration of workloads via checkpoint/restore of processes, like Xen and plenty of other software already does.

https://en.m.wikipedia.org/wiki/CRIU

EDIT: https://github.com/kubernetes/kubernetes/issues/3949


Having worked in the Erlang VM for a few years now, this is something that I have wished for so hard (Erlang + systemd gets you very far, but not quite there). But, like the author, I have been happy doing orchestration on metal, and I don't have the heart to try to make this (and try to get mindshare for it) myself.


The article lost me at making pods mutable.


In theory it would be nice - restart performance would increase, various glitchiness related to pods moving around would get reduced, could have distributed disconnected nodes that only occasionally connect to control plane, could have some basic form of persisted storage...


Author complains the system is too complicated, then laments that it needs a bunch more features that would make it even more complex (particularly mutable pods).

> A modest expansion of the previous section: make each field of an object owned explicitly by a particular control loop. That loop is the only one allowed to write to that field. If no owner is defined, the field is writable by the cluster operator, and nothing else

This is already a thing starting with 1.17, I think, with server-side apply https://kubernetes.io/docs/reference/using-api/server-side-a... (except it’s opt-in)
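
A sketch of what that opt-in looks like (file and manager names are placeholders): each field records a manager, and a second manager touching the same field gets a conflict unless it forces ownership.

  # apply as a named field manager; conflicting writers are rejected unless they pass --force-conflicts
  kubectl apply --server-side --field-manager=my-controller -f service.yaml
  # field ownership is recorded under metadata.managedFields on the object
  kubectl get service my-app -o yaml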


He’s proposing changing fundamental design decisions, and then eliminating big chunks of the existing implementation.

Kubernetes is obviously overly-complicated. I’m just using it in a CI/CD environment and, in a month, have hit most (all?) of the issues mentioned in the article.

I don’t think it’s fixable. It’s interesting to see an expert in this area come to the same conclusion for the same reasons.


There are many parts that are hand-waved in and in practice would be difficult, such as mutable pods and direct-to-pod load balancing. It's pretty typical in distributed compute for something that sounds simple to turn into madness upon closer look. He may be an expert, but he sure as hell didn’t think through all the implications.


I very much didn't think through it, and opened with exactly that disclaimer :). You're right that the lofty ideas probably won't survive contact with reality. In my defense, I wrote this in a couple of hours to get it out of my head, and then people inexplicably started reading it.


Heh, I actually agree that some of the called-out use cases such as PinnedDeployment are needed, but IMO the extensibility of the Kubernetes API makes it a non-issue, since you can just build your own implementations.


My counter to that is that people mostly don't roll their own, so the defaults matter. Adding more implementations just increases the total amount of complexity going on.

That probably argues for "Deployment shouldn't have been a core object type", and I think k8s folks generally agree on that now, in hindsight. But the idea of generalizing CRDs to the extreme is relatively recent.


People looking for simplicity should really give Nomad (https://www.nomadproject.io/) a chance.

I'm running this in almost IPv6-only setup (https://blog.42.be/2020/11/using-nomad-to-deploymanage-conta...). I'm so glad I don't have to mess with overlays and NAT.


Same here. Nomad is great, especially that you can run non-Docker workloads


> Versioning - Update broken? Write back version N-1, and you’re done.

Doesn't kubectl rollout undo already do this for deployments?

> Pinned deployments

Service meshes like Istio let you run multiple versions of services that you can selectively route traffic to. You can sign up for it if you need it. What value do pinned deployments add over that?

I kind of get the problem parts of k8s networking. But other than that, this seems like it makes already-complicated Kubernetes even more complicated, for not-so-convincing reasons.


> What value do pinned deployments add over that?

Not tacking on a service mesh middleware that I now have to manage and debug.


What would you do with multiple deployments without the ability to control traffic though?


Conceptually, I want the cluster orchestrator to populate a set of systemd units on each machine, and then switch to a very passive role in the node’s life.

I tried to write a cluster manager from 2010-2015, and essentially what it did was write a big shell script for starting a service. The node could then reboot on its own, and the init system would invoke all the shell scripts.

The "host" part of the command line was on the left, and specified the container, and the user parts go on the right, after --

    my-container --fs /images/mine --bind ... -- my-server --port $PORT
There could also be a line before that to sync the container / layers, etc.

I still think that is a good paradigm ... The whole thing could just be a "shell script compiler", where the shell script uses a very small number of tools.

The problem with cluster managers is that you get into the "inner platform" problem. Once you have a lot of software to run the cluster on each node, now you have to figure out how to update and monitor THAT software. But you don't necessarily have the update and monitoring mechanisms it implements!


If k8s were being designed from the ground up, I would want a move away from YAML for configuration to a strongly typed DSL, something like what Gradle is doing with the move from the weak Groovy DSL to the Kotlin DSL.

This would mean better integration/autocomplete in editors and no time wasted on silly things like number of indents being wrong.


You may be interested in https://github.com/stripe/skycfg, which we use to generate Kubernetes resources using its native Protobuf data model.


> I currently believe that the very best container orchestration system is no container orchestration system, and that effort would be well spent avoiding that rabbit hole at all costs.

Me too, hence why I look forward to projects like Project Tye and Quarkus to take us out of the current trend.


I keep wanting to start a fund to try to pay Dave to talk through some of his grand plans for rebuilding MetalLB.

He had some ideas for a sizable re-architecting, but was rapidly burning out, oh I dunno, like 1 year ago, 18 months ago. Maintaining such a vital yet support-needy project seems like it was taking quite a toll.

I love this "A better Kubernetes" theorcrafting but the post I really really want to see from Dave is "A better MetalLB". The indie operators everywhere would love to have some sights on what a better MetalLB might look like.


Pretty OT, but: I know of many large providers using MetalLB as a core part of their k8s infrastructure, who gave nothing back. If walking away punishes them for that, then I'm very pleased :)

MetalLB is in a healthier place now than with me at the helm. It's being maintained and developed by people who still believe in and use k8s, which is how it should be.


It's such a pity we live in such a Dark Forest, so many silent quiet predators out there, and so few willing to bring their light, willing to contribute. Being the only prisoner in the dilemma doing the right thing seems like it sucks a lot. Thank you for soldiering on and getting us all as far as you did.

I'm glad there's good people at the helm. I don't know how to avoid uncomfortable personalization, but this really was a very notable incident for me, where on Twitter you talked a bit about how you didn't intend to blog or talk or converse with people about what sounded like some fairly large changes that you had in mind.

This is such an absolutely critical quasi-mystical level of the stack, in an area that's totally essential for people to get themselves online in a meaningful capacity, yet one which so few folk have much experience with. I'm glad we have new maintainers, but I still feel the loss of some greater brighter alternate reality where more people are able to better build their own stable online presences via the alternate-history version enhanced/remade version of MetalLB that, it feels like, no one will ever know. I can't imagine how stupid & terrible trying to support MetalLB must have been, I can't imagine what it feels like to have giant powers that be take & take without investing back, but I also just think of that brighter future, that unknown better model. Surely we'll make more goes in the future, more tries, to make better ways of bringing traffic in, & it seems sure MetalLB will continue to be one of those chief ways, but those paths will not be touched with the intimate grace of experience you, the creator's long trekking would have brought.

I've thought about trying to beg access, beg for some time to try to document the learnings, lessons, would do-overs over MetalLB. But I don't think I can bring enough critical review & questioning to bear, I don't think I'd ask the right questions at the right times. I'd hope time might soften some of the hard feeling, but I also knew I'd be trampling on a year+ old declaration, a declaration essentially that the world did not deserve your insights.

I'm sorry for punishing your good faith good reply with this post! I write it sort of as a confession, but the crime is the confession, not the thought. This has been a big event for me; apologies for being un-easy with it. Mostly, thank you thank you thank you, for having carried us all so so so very far.


It's too bad that Docker Swarm is kind of dead because that would be the simpler Kubernetes for many use cases. Why is there no community effort to reanimate or adopt the project?


Kubernetes provided a more robust API for cloud providers (the part of the API the end user of k8s doesn’t see, the part that allows k8s to tell your cloud provider to create resources like load balancers, volumes, etc.). Once k8s was ubiquitous, Docker Swarm had no chance.


> most people who are trying to use it are not prepared for the sheer amount of work that lies between the marketing brochure and the system those brochures promise.

So, what should most of us be using instead? Suppose I have a 12-factor app that I want to deploy on AWS (directly, not via a third-party PaaS like Heroku). Does anyone know if Amazon ECS is significantly simpler than Kubernetes? Is there a better option than either of those?


Is this a serious question?

I mean, we all know that EC2 is still there, right? An EC2 instance is like a computer. If you have a program, you can run it there. You don't even need a container!

I just don't know a non-sarcastic way to say this. Containers, and especially container runtimes, just aren't necessary. Containers can be _useful_, certainly. But for decades before they were invented, we ran programs on computers and it was pretty good. We can still do that.


ECS can be simpler if you stick to the defaults and use something like Copilot - it effectively turns your compose file into a cluster with all of the networking, etc.


We used CloudFormation and ECS/Fargate to great effect at my last company. It simplified a lot, especially logging, networking, and IAM integration compared to Kubernetes.

The biggest downside was that it took quite a while to spin up new tasks (but that might be Fargate specifically) which hurt us when we were trying to do async things (and we wouldn’t be able to use lambda for many async things because the run time would be too long or the enormous bundle size for a small-ish Python task would exceed Lambda’s limits). It would also make deployments take longer than we wanted (we were really pushing the envelope for frequent, tiny deployments) especially in rollback scenarios.


Oh noooo! I was so excited, then they lost me at Go.

Dream bigger: why is Kubernetes so complex? Re-inventing what we already have. Why is it so complex? Lack of standards, lack of flexibility, too much focus on features.

A Linux system has a lot of complexity, but it very rarely gets in the way of the other bits, and it can [mostly] all be replaced as needed. Why? ABI compatibility, kernel-userland split, a big collection of independent composeable tools, a framework that provides everything you need but doesn't force you to use it in the most difficult way, and definitely doesn't force "patterns" on you (like "deployment").

On top of Linux, we've built the world's most advanced and wide-ranging systems, particularly because it's not opinionated. It gives you just enough rope and bamboo to build a hovel or a skyscraper. It is not modern, it isn't the best design, it doesn't force you into the minutia of understanding the system.

A new Kubernetes should either be stupidly simple, or incorporate itself into the OS, since the OS already has most if not all of the components of Kubernetes. They're just not being used properly.


I spent six years on Borg SRE, and two years so far working with Kubernetes, and this post reads like a strange combination of utopianism and adoration of obsolete ideas.

Picking some parts to comment on at semi-random:

  > For that, let’s keep old versions of pod definitions
  > around, and make it trivial to “go back to version N”.
  >
  > [...]
  >
  > Bonus things you get from this: a diffable history of
  > what happened to your cluster, without needing GitOps
  > nonsense. By all means keep the GitOps nonsense if you
  > want, it has benefits, but you can answer a basic “what
  > changed?” question using only data in the cluster.
This assumes either tiny clusters that will never run more than a hundred machines at a time, or an audit horizon measured in weeks. The clusters I help run today are much smaller than what some companies operate, but already we're hitting scaling issues with the sheer amount of data that sticks around during normal operation. If we had to store multiple copies of _pods_ for longer than a couple minutes then there wouldn't be an EBS volume with enough iops to handle background rescheduling.

If you want an audit log, the obvious place is Git. I don't know why the OP derisively calls this "GitOps nonsense", because Google does the same thing and you should too. Figuring out what changed two months ago is much easier when each change has a commit message and a reviewer.

  > The latter is the bane of MetalLB’s existence, wherein it
  > gets into fights with other load-balancer implementations.
  > That should never happen. The orchestrator should have
  > rejected MetalLB’s addition to the cluster, because
  > LB-related fields would have two owners.
To me the problem seems less an issue of field ownership, and more a problem of the network tier mutating parts of the workload scheduling tier. Why is MetalLB (or any other load balancer) changing Kubernetes state at all? Something has gone wrong here. The load balancer should watch the Kubernetes API to discover which endpoints exist and what their IPs are, and if it tries to _change_ state then that change should be blocked by the configured authorization policy.
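
Concretely (names here are placeholders), the authorization piece could just be a read-only RBAC role bound to the load balancer's service account, so any attempted write is rejected by the API server:

  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: lb-readonly                      # placeholder name
  rules:
    - apiGroups: [""]
      resources: ["services", "endpoints"]
      verbs: ["get", "list", "watch"]      # no write verbs, so mutations are denied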

  > So, for starters, let’s rip out all k8s networking.
  > Overlay networks, gone. Services, gone. CNI, gone.
  > kube-proxy, gone. Network addons, gone.
If the author tries to start designing their new network stack they'll quickly have to put overlay networks and CNI and so on back in, because it turns out the real world gets a say in how we run our infrastructure and users need to be able to customize the boundary between Kubernetes and everything else.

Kubernetes already suffers from insufficient customization in some areas, and networking is one of the few bright spots where it gives in and lets the operator do whatever they want. IPv4? IPv6? Dual-stack heterogeneous routing? CNI lets you mix-n-match anything you can put into a binary as long as it can output JSON, and if it were even slightly more opinionated then it wouldn't be fit for purpose.

  > Let’s give every pod an IPv6 address. Yes, only an IPv6
  > address for now. Where do they come from? Your LAN has a
  > /64 already (if it doesn’t, get with the program, I’m
  > not designing for the past here), so pluck IPs from there.
  >
  > [... lots of description of a thing that could be a
  >  CNI plugin ...]
  >
  > That leaves bare metal clusters out in the cold, sort-of.
  > I argue this is a good thing, because there is no
  > one-size-fits-all load balancing.
And this is why you absolutely don't want your workload scheduler to have opinions about networking. All of the stuff in there -- the hard requirement on an IPv6-aware fabric, the mission-impossible idea of routing traffic to pod IPs allocated at random from the full local subnet, the NAT64 (!!) -- can be done with a relatively small driver binary in any language that can call netlink, which means it's all possible in Kubernetes _today_ without foreclosing on the idea of running outside a carefully curated cloud environment.
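
For illustration, the core of such a driver really is tiny. A rough sketch (the interface name and prefix are made up; a real CNI plugin would read them from its config, handle network namespaces, and needs CAP_NET_ADMIN to run):

  package main

  import (
    "log"

    "github.com/vishvananda/netlink"
  )

  func main() {
    // The veth end that was moved into the pod's network namespace
    // (namespace switching omitted for brevity).
    link, err := netlink.LinkByName("eth0")
    if err != nil {
      log.Fatal(err)
    }

    // An address plucked from the LAN's /64 -- here a documentation prefix.
    addr, err := netlink.ParseAddr("2001:db8:0:1::42/64")
    if err != nil {
      log.Fatal(err)
    }
    if err := netlink.AddrAdd(link, addr); err != nil {
      log.Fatal(err)
    }
    if err := netlink.LinkSetUp(link); err != nil {
      log.Fatal(err)
    }
  }

Everything beyond that -- NAT64, routing policy, load balancing -- can live in whatever binary you like, outside the scheduler.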

  > We’re going to focus on doing one thing really well: if
  > you send me a packet for a pod, I’ll get the packet to
  > the pod. You can take that and build LBs based on LVS,
  > nginx, maglev-style things, cloud LBs, F5 boxes, the
  > world’s your oyster. And maybe I’ll even provide a couple
  > “default” implementations, as a treat. I do have lots of
  > opinions about load-balancers, so maybe I can make you a
  > good one. But the key is that the orchestration system
  > knows nothing about any of this, all it does is deliver
  > packets to pods.
The workload scheduler (Kubernetes is a workload scheduler) shouldn't be in the business of delivering packets. That's up to the kernel and the network fabric. If your services' packets have to transit a userspace proxy on the way to their destination then you're already in trouble, and if that proxy is implemented by a second-order Wireguard overlay then all hope is lost.

  > I think this mostly translates to syncing more data down
  > to nodes in persistent storage, so that nodes have
  > everything they need to come back up into the programmed
  > state, even from a cold boot. Conceptually, I want the
  > cluster orchestrator to populate a set of systemd units
  > on each machine, and then switch to a very passive role
  > in the node’s life.
  >
  > [...]
  >
  > One way to view this is that in my “distributed” cluster
  > my pods are more likely to be unreplicated pets.
It sounds like the author doesn't need Kubernetes at all. They want Puppet, or something Puppet-shaped, but with all the extra complexity that comes from having a distributed control plane.

I don't know why they would want that, since if they're using Kubernetes at all then presumably they've got at least a few hundred machines being managed by a couple different product teams.

There is a place for a central service that writes stuff to `/etc/` and assumes that individual machines are meaningful, but that service's target audience is completely separate from that of Kubernetes. There's no point in trying to design a replacement for Kubernetes to fit that market, any more than trying to design a 16-wheeler that competes with minivans.


> if they're using Kubernetes at all then presumably they've got at least a few hundred machines being managed by a couple different product teams.

I think this, unfortunately, does not hold up in practice. I am seeing way too many small orgs/teams use Kubernetes to orchestrate dozens and even sub-dozens of machines. From the standpoint of solving for everything from sub-dozens to thousands of machines, I think there is some merit to TFA's perspective, even if it doesn't match the Kubernetes use-case you describe.


I honestly think the solution is to encourage smaller shops to stop using Kubernetes. You can get pretty far with VMs running systemd/init.d services and a mechanical way to apply filesystem changes. This is how "typical" services were operated for decades, and it has distinct advantages at small scales.

If I were advising a startup on technical architecture, I would recommend they write their software as if it runs in Kubernetes (avoid hardcoded configuration, bundle dependencies into the build artifact, use mTLS instead of firewalls) but otherwise behave like someone running a LAMP stack in 2003.
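
As a sketch of what "write it as if it runs in Kubernetes" means in practice (variable names are made up): keep configuration in the environment, so the same binary runs unchanged under systemd on a VM today and inside a pod later.

  package main

  import (
    "log"
    "net/http"
    "os"
  )

  // getenv returns the value of key, or fallback when the variable is unset.
  func getenv(key, fallback string) string {
    if v := os.Getenv(key); v != "" {
      return v
    }
    return fallback
  }

  func main() {
    addr := getenv("LISTEN_ADDR", ":8080") // hypothetical variable names
    dbURL := getenv("DATABASE_URL", "")    // injected by a systemd unit file or a pod spec
    log.Printf("listening on %s (db configured: %v)", addr, dbURL != "")

    // A health endpoint works the same for a 2003-style load balancer
    // today and for a Kubernetes liveness probe later.
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
      w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(addr, nil))
  }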


Sound advice. Sadly, the industry is by and large ignoring it, and the median k8s cluster size is in the single digits. There are a few elephants who run thousands of nodes, and at that scale the benefits outweigh the complexity tax. But a huge chunk of the industry is tiny and has been convinced it needs k8s, while another chunk is trying to figure out how to scale k8s down, on the faulty hypothesis that you can then grow effortlessly.


On one hand ... yeah, just get a VM, use config management (Chef, Ansible), maybe try terraform to manage "your cloud" (so the state is managed), get docker on it, and run the app from the image you built on your CI.

... but using k3s is a lot simpler than fighting Ansible, and it gives you a standard. Yes, it takes a week to get used to, but then it works, and it gives you a lot of benefits.


I think Kubernetes is still lacking a load balancer that can run on bare metal. Want to get a couple of VPS servers to run a cluster? Forget about it, or install Nginx and manually reverse-proxy to the appropriate web app. I was hoping that MetalLB would solve it, but these tools are written for service providers with their own routing etc. There is nothing for the small guy who just wants to deploy his blog and doesn't want to pay Amazon or Google for setting up a cluster.


Load balancers have never been part of Kubernetes. Service and Ingress are just routing rules.

Service with ClusterIP creates an internal IP that redirects traffic to that service's pods. Service with NodePort creates a ClusterIP and opens a port on every node that lets external traffic reach those pods as well.

Service with LoadBalancer creates a ClusterIP, a NodePort, and whatever load balancer implementation exists in that cloud to route external traffic to your K8S nodes and NodePort port numbers. This is the part that's missing in your bare metal setup, but you can create a NodePort service using port 80/443 and have DNS pointing to your nodes' public addresses.

That's all it takes to get cluster access and you can point this Service to another downstream Service/Ingress if you want the routing rules to stay in YAML too.
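
For concreteness, here's a minimal sketch of such a NodePort Service, built with the Go API types and printed as JSON (names and ports are illustrative; a plain YAML manifest works just as well). Note that node ports must fall inside the apiserver's --service-node-port-range, which defaults to 30000-32767, so exposing 80/443 directly means widening that range:

  package main

  import (
    "encoding/json"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
  )

  func main() {
    svc := corev1.Service{
      TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Service"},
      ObjectMeta: metav1.ObjectMeta{Name: "blog"},
      Spec: corev1.ServiceSpec{
        Type:     corev1.ServiceTypeNodePort,
        Selector: map[string]string{"app": "blog"}, // hypothetical pod label
        Ports: []corev1.ServicePort{{
          Port:       80,                   // ClusterIP port inside the cluster
          TargetPort: intstr.FromInt(8080), // container port
          NodePort:   30080,                // reachable on every node's address
        }},
      },
    }
    out, _ := json.MarshalIndent(svc, "", "  ")
    fmt.Println(string(out))
  }

Point your DNS records at the nodes, and kube-proxy forwards traffic arriving on that port to the backing pods, whichever node it lands on.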


That's why it is kind of pointless in a self-hosted environment, as you can't create an IP and a hostname that can be accessed from the outside world. You have to manually set up a reverse proxy to whatever IP has been assigned on the private network. If you use K8s on AWS or GCP then your services are automatically accessible from the outside world. There were attempts to make Nginx or similar LBs work with virtual hosts, but I've never seen that working in practice. It feels like the developers keep it enterprisey so that people keep using the cloud and stay locked into it.


What does self-hosted mean? Do any of your nodes have a public IP address? If so then I explained exactly how they can be accessed. If all of your nodes are private and behind some kind of firewall or virtual network then yes you'll have to make a public bridge, otherwise just point the DNS to your nodes. You can use a Deployment with the standard ports 80/443 to avoid any port translation too.

That's all the clouds do anyway, they just run a bunch of load balancers that move external traffic to the nodes of your K8S cluster with the port specified in the Ingress/Service and K8S takes it from there. There's no shortcut, the traffic has to be routed somehow so if you don't use a managed service then you have to do it yourself. It's got nothing to do with "developers keep it enterprisey".

Here's an answer I wrote about skipping GKE's cloud LB to accomplish the same thing: https://stackoverflow.com/a/54297777/173322

Also try using Ambassador as a reverse proxy/ingress/LB. It uses Envoy and is much faster and more configurable. You can set it to use the host network on your nodes and skip the K8S cluster mapping: https://www.getambassador.io/


None of these examples work for a self-hosted scenario where you have one public IP (or a few statically assigned to your server). If you want to host a service the traditional way, you create a virtual host and a reverse proxy to your container (which is on a private network) or to a service bound to a local IP. So far I couldn't find a way to automate that. Let's say that could be done by a LoadBalancer that actually controls virtual host and reverse proxy entries for the pool of available IP addresses. For hostnames, there would be a pool of domains configured to resolve any host to that one IP.


Yes they do, I've run it that way along with many others. Virtual hosts are easily supported. You should read up on the K8S documentation because it seems you're unfamiliar with the K8S constructs.

What you're looking for is called an Ingress, which is like a Service with more advanced routing, specifically designed to support multiple hosts and backends from one "ingress" point for the cluster. The Ingress (again, a set of routing rules) can be implemented by many different proxies like Nginx, HAProxy, Traefik, Caddy, etc. Or you can use Ambassador like I mentioned, which bypasses the K8S Ingress and uses its own streamlined config with better performance, observability and automated HTTPS too.
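
To make the virtual host part concrete, here's a sketch of an Ingress that routes two hostnames to two backend Services behind a single IP, built with the Go API types and printed as JSON (hostnames and Service names are made up; the equivalent YAML is what you'd normally write):

  package main

  import (
    "encoding/json"
    "fmt"

    networkingv1 "k8s.io/api/networking/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  func main() {
    pathType := networkingv1.PathTypePrefix

    // rule routes one hostname to one backend Service on port 80.
    rule := func(host, svc string) networkingv1.IngressRule {
      return networkingv1.IngressRule{
        Host: host,
        IngressRuleValue: networkingv1.IngressRuleValue{
          HTTP: &networkingv1.HTTPIngressRuleValue{
            Paths: []networkingv1.HTTPIngressPath{{
              Path:     "/",
              PathType: &pathType,
              Backend: networkingv1.IngressBackend{
                Service: &networkingv1.IngressServiceBackend{
                  Name: svc,
                  Port: networkingv1.ServiceBackendPort{Number: 80},
                },
              },
            }},
          },
        },
      }
    }

    ing := networkingv1.Ingress{
      TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "Ingress"},
      ObjectMeta: metav1.ObjectMeta{Name: "vhosts"},
      Spec: networkingv1.IngressSpec{
        Rules: []networkingv1.IngressRule{
          rule("blog.example.com", "blog"), // hypothetical hostnames and Services
          rule("wiki.example.com", "wiki"),
        },
      },
    }
    out, _ := json.MarshalIndent(ing, "", "  ")
    fmt.Println(string(out))
  }

Whichever ingress controller you run (nginx, HAProxy, Traefik, Ambassador/Envoy) watches objects like this and does the name-based reverse proxying for you.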


nginx-ingress-controller running on a NodePort service should work well enough here - just set DNS records to round-robin onto your machines. This will work on any network topology, and already works on k8s.

If you want any better load balancing and/or failover, you'll have to either have something provided by your cloud provider (an LBaaS, and then tie that into Kubernetes via some controller), or have a network topology that can be exploited for this (shared L2 with a VIP or L3 with BGP, so that you can use something like MetalLB).

This is not a Kubernetes problem, but a networking problem.


This requires a fair bit of fiddling, and I never got it working properly. It kind of defeats the purpose of automated infrastructure.


Also the team decided, seemingly at random, that NodePorts[1]:

1. can only be run in a specific range, and

2. you can expand that range, but if you do, any port in that range might be randomly assigned, so some service might start up on port 80.

Changing that alone would make using Kubernetes 10x easier for small operators. But one bomb-throwing curmudgeon drops by with entirely content-free, unsubstantiated anger-posting:

> Unless I'm misunderstanding the proposal, it involves unpredictable and difficult to diagnose failure of services, which seems like a complete non-starter to me.

And the small operators who just want some ability to open HTTP or DNS ports get nothing. Maddening beyond belief.

[1] https://github.com/kubernetes/kubernetes/issues/9995


It eventually made it to the SIG Networking agenda, but it needs people to join the meeting and discuss it. That's the official process for changes now.


I would like something that would reconfigure an Nginx reverse proxy to whatever private IP and port a service is given, so that it would work like it does with the big cloud providers, without needing BGP-level access to the network.



This won't work, as it doesn't support virtual hosts. For example, you cannot provision two services using the same port and IP address. If you have a bare metal server with one IP address, you would still have to set up a reverse proxy manually, which defeats the purpose of having k8s.


I read this as: everything about the k8s design is wrong and we should do it all from scratch. In fact, who even likes containers? I don't buy it.

On a different note, I wish we could use PostgreSQL instead of etcd. That change alone would allow for amazing things with k8s.



