Hello friends! It’s been a while. I’ve been finding it very hard to write while holding down a full-time job, and I’ve also been dealing with some very frustrating joint/ergo struggles that make using a computer kinda painful. I think I’m making progress on them, and I’m getting better at figuring out how to manage my time and energy while working, so hopefully I won’t go on quite as long a hiatus before the next post 🙂
Also! My team published our first paper, which I’m really excited about. It’s pretty in-the-weeds stuff so I don’t expect many people outside of ML to read it, but I do think it’s some really great (if really early!) progress towards understanding what the hell is going on inside GPT-3 and friends. Alongside that, I was able to publish a writeup on Garçon, one of my very first projects at Anthropic, which is the infrastructure tooling that powers most of our interpretability work.
With that out of the way, onward to the idle thoughts I wanted to share with y’all.
Anthropic runs most of our systems inside of Kubernetes, and so I’ve been gaining a lot more experience and familiarity with that tool. And while it’s been, on net, really great, I’ve definitely also experienced the (near universal, I think) feeling of “holy smokes why is this thing so complicated?” as well as “Why is it so bloody hard to debug anything?”
While some of those feelings are fairly universal when learning any new system, Kubernetes really does feel a lot bigger, scarier, and more intractable than some other systems I’ve worked with. As I’ve learned it and worked with it, I’ve tried to understand which design decisions and tradeoffs led to it looking the way it does. I don’t claim to have the full answer, but this post is an attempt to commit to paper two specific lenses I reach for as I try to understand why working with Kubernetes sometimes feels so hairy.
It’s easy to think of Kubernetes as a system for deploying containerized applications, or some similar functional description. While this can be a useful perspective, I think it’s much more productive to think of Kubernetes as a general-purpose cluster operating system kernel. What do I mean by that, and what’s the difference?
The job of a traditional operating system is to take a single computer, and all of its attendant hardware, and to expose an interface that programs can use to access this hardware. While the exact details vary, in general this interface balances some combination of goals like safely sharing the hardware between multiple programs, extracting high performance from it, and providing security and stability.
While “ease of programming” is often an additional goal, in practice it often loses out to the above concerns. Operating system kernels often get designed around the above goals, and then userspace libraries are written to wrap the low-level, general-purpose, high-performance interfaces into easier-to-use abstractions. OS developers tend to be far more concerned with “How fast is it possible to make nginx run on my OS” than they are with “How many lines of code shorter is the nginx port to my OS?”
I’ve come to think of Kubernetes as operating in a very similar design space; instead of abstracting over a single computer, however, it aims to abstract an entire datacenter or cloud, or a large slice thereof.
The reason I find this view helpful is that this problem is much harder and more general than, say, “making it possible to deploy HTTP applications in containers,” and it points at specific reasons Kubernetes is so flexible. Kubernetes aspires to be general and powerful enough to deploy any kind of application on any kind of hardware (or VM instances), without necessitating that you “go around” or “go outside” the Kubernetes interface. I won’t try to opine here on whether or not it achieves that goal (or, perhaps, when it does and doesn’t achieve it in practice); but merely viewing that as its problem statement causes a lot of the design decisions I encounter to “make sense” to me, and seems like a useful lens.
I think that perhaps the biggest design choice this perspective explains is how pluggable and configurable Kubernetes is. It is, in general, impossible to make choices that are everything to everyone, especially if you aspire to do so without extravagant performance cost. This is especially true in the modern cloud environment, where the types of applications and hardware deployed vary vastly and are fast-moving targets. Thus, if you want to be everything to everyone, you end up needing to be enormously configurable, which creates a powerful system, but one which can be hard to understand, and which makes even “simple” tasks complex.
**Another perspective**

While discussing this post with my partner Kate, I came up with another lens on this distinction that I like:
I get the sense that many users perceive Kubernetes as (or, perhaps, want it to be) essentially “a Heroku” in the sense of being a platform for deployed applications that abstracts over most of the traditional underlying OS and distributed systems details.
My contention is that Kubernetes sees itself as solving a problem statement closer to “CloudFormation” – in the sense that it wants to be sufficient to define your entire infrastructure — except that it also attempts to do so in a way that is generic over the underlying cloud provider or hardware.
One could imagine a very imperative “cluster operating system,” like the above, which exposed primitives like “allocate 5 CPUs worth of compute” or “create a new virtual network,” which in turn backed onto configuration changes either in the system’s internal abstractions or into calls into the EC2 API (or other underlying cloud provider).
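To make the contrast concrete, here is a minimal sketch of what such an imperative “cluster operating system” API might look like. Everything here is invented for illustration; a real system would back these calls onto internal abstractions or cloud-provider APIs like EC2:

```python
# Hypothetical sketch of an *imperative* cluster API, for contrast with
# Kubernetes' declarative model. All names are invented for illustration.

class ImperativeCluster:
    def __init__(self, total_cpus):
        self.total_cpus = total_cpus
        self.allocations = {}  # name -> cpus allocated
        self.networks = set()

    def allocate_cpus(self, name, cpus):
        # Fails *immediately* if the request can't be satisfied --
        # the caller sees the error in their own context.
        used = sum(self.allocations.values())
        if used + cpus > self.total_cpus:
            raise RuntimeError("cluster at capacity")
        self.allocations[name] = cpus

    def create_network(self, name):
        # Would back onto e.g. a VPC/EC2 API call in a real system.
        self.networks.add(name)

cluster = ImperativeCluster(total_cpus=8)
cluster.allocate_cpus("web", 5)
cluster.create_network("internal")
```

Note that in this style, every call either succeeds or fails right away; that property is exactly what the declarative model described next gives up.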
Kubernetes, as a core design decision, does not work like that. Instead, all configuration is declarative, and all configuration is implemented by way of “operators” which function as control loops: they continually compare the desired configuration with the state of reality, and then take actions to bring reality in line with the desired state.
This is a very deliberate design choice, and one made with good reasons. In general, any system which is not designed as a control loop will inevitably drift out of the desired configuration, and so, at scale, someone needs to be writing control loops. By internalizing them, Kubernetes hopes to allow most of the core control loops to be written only once, and by domain experts, and thus make it much easier to build reliable systems on top of them. It’s also a natural choice for a system that is, by its nature, distributed and designed for building distributed systems. The defining nature of distributed systems is the possibility of partial failure, which necessitates that systems past some scale be self-healing and converge on the correct state regardless of local failures.
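The pattern above can be sketched in a few lines. This is a toy model, not Kubernetes code; the state dictionaries and action names are invented, but the shape — observe, diff, act, repeat — is the essence of the control-loop design:

```python
# A minimal sketch of the control-loop ("operator") pattern: compare desired
# state with observed state, compute actions, apply them, and repeat until
# the two converge. All names here are invented for illustration.

def reconcile(desired, actual):
    """Return the actions needed to bring `actual` in line with `desired`."""
    actions = []
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actions.append(("create_or_update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

def control_loop(desired, actual, max_iterations=10):
    # Real operators run forever and re-observe reality on each pass;
    # partial failures are tolerable because the next pass simply retries.
    for _ in range(max_iterations):
        actions = reconcile(desired, actual)
        if not actions:
            return actual  # converged on the desired state
        for op, name, *rest in actions:
            if op == "create_or_update":
                actual[name] = rest[0]
            else:
                del actual[name]
    return actual

state = control_loop(
    desired={"web": {"replicas": 3}},
    actual={"web": {"replicas": 1}, "old": {}},
)
```

The key property: the loop never needs to know *how* the state drifted; it just keeps pushing reality toward the declaration, which is what makes the pattern self-healing.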
However, this design choice also brings with it an enormous amount of complexity and opportunity for confusion¹. To pick two concrete examples:
**Errors are delayed**

Creating an object (e.g. a pod) in Kubernetes, in general, just creates an object in the configuration store asserting the desired existence of that object. If it turns out to be impossible to actually fulfill that request, either because of resource limitations (the cluster is at capacity), or because the object is internally inconsistent in some way (the container image you reference does not exist), you will not, in general, see that error at creation time. The configuration creation will go through, and only later, when the relevant operator wakes up and attempts to implement the change, will an error be reported.
This indirectness makes everything harder to debug and reason about, since you can’t use “the creation succeeded” as a good shorthand for “the resulting object exists.” It also means that log messages or debug output related to a failure do not appear in the context of the process that created an object. A well-written controller will emit Kubernetes events explaining what’s happening, or otherwise annotate the troublesome object; but for a less well-tested controller or a rarer failure, you might just get logspam in the controller’s own logs. And some changes may involve multiple controllers, acting independently or even in conjunction, making it that much harder to track down just which damn piece of code is actually failing.
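A toy model makes the failure mode vivid. In the sketch below (all names invented; the error string merely imitates the flavor of a real image-pull failure), `create` succeeds unconditionally, and the problem only surfaces later, in the controller’s context rather than the creator’s:

```python
# Sketch of delayed errors in a declarative system: "create" merely records
# desired state, and problems only surface when a controller later tries to
# realize it. All names are invented for illustration.

class ConfigStore:
    def __init__(self):
        self.objects = {}  # desired state, as written by clients
        self.events = []   # diagnostics emitted later, by controllers

    def create(self, name, spec):
        # Always succeeds -- even if the spec can never be fulfilled.
        self.objects[name] = spec
        return "created"

class Controller:
    KNOWN_IMAGES = {"nginx:1.25"}  # stand-in for "images that actually exist"

    def sync(self, store):
        # Runs asynchronously, outside the creating process's context.
        for name, spec in store.objects.items():
            if spec["image"] not in self.KNOWN_IMAGES:
                store.events.append((name, "image not found"))

store = ConfigStore()
ok = store.create("web", {"image": "nginx:1.25"})
bad = store.create("typo", {"image": "ngnix:latest"})  # succeeds anyway!
Controller().sync(store)
```

Both `create` calls return success; only after `sync` runs does an event appear, attached to the object rather than returned to the caller — which is exactly why “the creation succeeded” tells you so little.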
**Operators may be buggy**

The declarative control-loop pattern provides the implicit promise that you, the user, don’t need to worry about how to get from state A to state B; you need merely write state B into the configuration database, and wait. And when it works well, this is in fact a tremendous simplification.
However, sometimes it’s not possible to get from state A to state B, even if state B would be achievable on its own. Or perhaps it is possible, but would require downtime. Or perhaps it’s possible, but it’s a rare use case, and so the author of the controller forgot to implement it. For the core built-in primitives in Kubernetes, you have a decent guarantee that they are well-tested and well-used, and hopefully work pretty well. But when you start adding third-party resources, to manage TLS certificates or cloud load balancers or hosted databases or external DNS names (and the design of Kubernetes tends to push you in this direction, because it’s happier when it can be the source-of-truth for your entire stack), you wander off the beaten path, and it becomes much less clear how well-tested all the paths are. And, in line with the previous point about delayed errors, the failure modes are subtle and happen at a distance; and it can be difficult to tell the difference between “the change hasn’t gotten picked up yet” and “the change will never be picked up.”
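That last ambiguity is structural: from the outside, all a client can really do is poll with some timeout, and a timeout cannot distinguish a slow controller from a stuck or buggy one. A tiny sketch (names invented for illustration):

```python
# Sketch of the "hasn't happened yet" vs. "will never happen" ambiguity:
# polling with a bounded number of attempts cannot tell a slow controller
# apart from one that will never act. Names are invented for illustration.

def wait_for(condition, attempts=5):
    """Poll `condition()`; return True if it held, False if we gave up."""
    for _ in range(attempts):
        if condition():
            return True
        # A real client would sleep between polls here.
    return False  # gave up -- slow, or never going to happen? Can't tell.

progress = {"polls": 0}
def slow_condition():
    # Converges, but only on the third observation.
    progress["polls"] += 1
    return progress["polls"] >= 3

converged = wait_for(slow_condition)   # a slow-but-healthy controller
gave_up = wait_for(lambda: False)      # a stuck controller looks the same, right up until the timeout
```

Until the deadline expires, the two cases are observationally identical — which is why debugging a misbehaving operator so often starts with “is it just slow?”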
I’ve tried to avoid making value judgments on whether I think these design decisions were good choices or not in this post. I think there is plenty of scope for debate about when and for what kinds of systems Kubernetes makes sense and adds value, versus when something simpler might suffice. However, in order to make those kinds of decisions, I find it tremendously valuable to come to them with a decent understanding of Kubernetes on its own terms, and a good understanding of where its complexity comes from, and what goals it is serving.
I try to apply this sort of analysis to any system I work with. Even if a system is designed in ways which seem — and may even be — suboptimal in its current context, it got that way for some reason. And insofar as this is a system you will have to interact with and reason about and make decisions about, you will have a better time if you can understand those reasons and the motivations and the internal logic that brought the system to that point, instead of dismissing it out of hand. I’m hoping this post may be helpful to other folks who are also new to using Kubernetes in production, or who are just considering adopting it, by providing some useful frames for reasoning about why (I believe) it looks the way it does, and what expectations are reasonable to have for it.
¹ If we want to be more nuanced, we might say instead that it front-loads complexity instead of, or in addition to, adding it. This design makes you deal up-front with practicalities you might otherwise have ignored for a long time. Whether or not that is a desirable choice depends on your goals, your scale, your time horizon, and related factors.