Kubernetes StatefulSets are Broken
Don't get me wrong; we are strong supporters of Kubernetes. It is a critical piece of our architecture and provides massive value when wielded correctly. But Kubernetes was originally intended to act as a container orchestration platform for stateless workloads, not stateful applications.
Over the past few years, the Kubernetes community has done a great job evolving the project to support stateful workloads by creating StatefulSets, which are Kubernetes' answer to storage-centric workloads.
What is a Kubernetes StatefulSet?
Workloads that run on Kubernetes StatefulSets run the gamut from databases, queues, and object stores to janky old web applications that need to modify a local filesystem for whatever reason. StatefulSets provide developers with a set of pretty powerful guarantees:
- Consistent network identity for each pod: Every pod keeps a stable DNS name, so you can easily point your application at a specific pod. It works great for database connection strings or configuring complicated Kafka clients. We also use it at times to set up Erlang's mesh network.
- Persistent volume automation: Whenever a pod is restarted, even if it is rescheduled onto a different node, its persistent volume is reattached to the node the pod lands on. This is somewhat limited by the capabilities of the CSI (Container Storage Interface) driver you're using. For instance, on AWS this only works within the same availability zone (AZ), since EBS volumes are tied to a single AZ.
- Sequential rolling updates: StatefulSet updates are designed to be rolling and consistent. Pods are always updated in the same order, which helps preserve systems that have delicate coordination protocols.
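To make the network identity guarantee concrete, here is a minimal sketch of how those stable DNS names are formed. It assumes the StatefulSet is fronted by a headless Service and that the cluster uses the default `cluster.local` domain; the function name `sts_pod_dns` and the example names are hypothetical.

```shell
# Print the stable DNS name of every pod in a StatefulSet.
# Pods are named <statefulset>-<ordinal>, with ordinals starting at 0.
# Assumes a headless Service and the default cluster domain "cluster.local".
# Usage: sts_pod_dns <statefulset> <service> <namespace> <replicas>
sts_pod_dns() (
  sts="$1"; svc="$2"; ns="$3"; replicas="$4"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${sts}-${i}.${svc}.${ns}.svc.cluster.local"
    i=$((i + 1))
  done
)
```

For example, `sts_pod_dns kafka kafka-headless data 3` lists the three broker addresses you could paste into a client config, and those names stay the same across pod restarts.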
These guarantees cover a ton of the operations needed to run a stateful workload. In particular, they almost completely handle the availability portion. Given that EBS uptime and redundancy guarantees are extremely strong, the StatefulSet's rescheduling automation almost trivially gives you a highly available service. Some caveats do apply, however: you need spare capacity in your cluster, and you can't botch the AZ setup.
Kubernetes has a ton of promise in this area, and in theory, could certainly evolve into a platform to easily run stateful workloads alongside the stateless ones most developers use it for.
What’s Missing From the Kubernetes StatefulSet?
So why do we think StatefulSets are broken? Well, if you run through the operational needs of a stateful workload in your head, there’s one key component that you might notice is missing:
What do you do when you need to resize the underlying disk?
The dataset behind a typical database grows at a pretty constant positive rate. Unless you support horizontal scaling and partitioning, you'll need to add headroom to the disk as that dataset grows. This is where Kubernetes falls flat on its face.
Currently, the StatefulSet controller has no built-in support for volume resizing. This is despite the fact that almost all CSI implementations have native support for volume resizing that the controller could hook into. There is a workaround, but it's almost ludicrously roundabout:
- Delete the StatefulSet while orphaning pods to avoid downtime with: kubectl delete sts <name> --cascade=orphan
- Manually edit the persistent volume claim (PVC) for each pod to request the new storage size
- Manually edit the StatefulSet's volume claim template with the new storage size and add a dummy pod annotation to force a rolling update
- Recreate the StatefulSet with that new spec. The controller then reclaims control of the orphaned pods and begins the rolling update, which triggers the CSI driver to apply the volume resize
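The steps above can be sketched as a small shell function. Treat this as an illustration under stated assumptions rather than a hardened script: it assumes the StorageClass permits expansion (`allowVolumeExpansion: true`), that the PVCs follow the standard `<template>-<statefulset>-<ordinal>` naming, and that the manifest file already contains the new storage size and the dummy pod annotation. The function name `resize_sts_volumes` and the `KUBECTL` override are hypothetical conveniences.

```shell
# Command used to talk to the cluster. Override with KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: resize_sts_volumes <statefulset> <namespace> <claim-template> <replicas> <size> <manifest>
resize_sts_volumes() (
  set -eu
  sts="$1"; ns="$2"; template="$3"; replicas="$4"; size="$5"; manifest="$6"

  # Step 1: delete the StatefulSet object but leave its pods running.
  $KUBECTL delete sts "$sts" -n "$ns" --cascade=orphan

  # Step 2: raise the storage request on each pod's PVC.
  # PVCs created from a volume claim template are named <template>-<sts>-<ordinal>.
  i=0
  while [ "$i" -lt "$replicas" ]; do
    $KUBECTL patch pvc "${template}-${sts}-${i}" -n "$ns" \
      -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${size}\"}}}}"
    i=$((i + 1))
  done

  # Steps 3 and 4: recreate the StatefulSet from the already-edited manifest
  # (new storage size plus a dummy pod annotation) so it adopts the orphaned
  # pods and starts a rolling update.
  $KUBECTL apply -n "$ns" -f "$manifest"
)
```

Running it with `KUBECTL=echo` first prints the commands without touching the cluster, which is a cheap way to sanity-check the PVC names before doing it for real.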
Okay, so there’s a pretty noteworthy flaw in Kubernetes StatefulSets, but there is a workaround even if it’s somewhat janky.
That shouldn’t be too bad, right?
But it gets worse!
The situation gets downright painful when you consider how this limitation affects the many Kubernetes operators that have been built to manage stateful workloads.
Kubernetes StatefulSet Example
A pretty good example is the Prometheus operator, which is a great project for both provisioning Prometheus databases and allowing a CRD-based workflow for configuring metrics, scrapers, and alerts.
The problem arises because the operator's built-in controller has no logic to manage a StatefulSet resize, but it does have logic to recreate its underlying StatefulSet whenever it sees that the StatefulSet was deleted. This means that you effectively have no way to use the above workaround: the moment you delete the StatefulSet with --cascade=orphan, the operator recreates it against the old spec and prevents a proper resize. The only solution is to delete the entire custom resource or find a tweak that can fool the operator into not reconciling the object (sometimes scaling to zero will do this).
Regardless, as a result of this flaw, there is effectively no way to resize a Prometheus instance with the operator without either significant downtime or data loss. Considering how robust the automation in StatefulSets is in all other cases, it’s pretty shocking that this is still a potential failure mode.
Our Head of Community, Abhi, hit this same interplay between operators and StatefulSet volume resizes while implementing volume resizing in the open-source Vitess operator.
“Considering the natural complexity of a Vitess deployment, you can infer that disk resizing is proportionally complicated. Vitess is a database sharding system that sits on top of MySQL, meaning that volume resizing had to be both partitioning-aware and shard-aware. We had to manually write our own shard-safe rolling restarts, create a cascade condition that worked with the parent-child structure of Vitess custom resources, and address every conceivable failure condition to prevent downtime. Shoutout to notable Kubernetes contributor enisoc for designing this feature.”
Other widely used and notable database operators, like Zalando's Postgres operator, effectively reimplement the same procedure we implemented in the Plural operator in their own codebase. This causes a ton of wasted developer cycles on a problem that should only have to be fixed once.
The Potential of Kubernetes
In general, we are extremely bullish on the potential for Kubernetes to make the operations of virtually any workload almost trivial, and a huge part of our mission at Plural is to make that a possibility.
That said, we also need to be clear-eyed about gaps that still remain in the Kubernetes ecosystem, so we can either work around them or close them upstream. I think it’s pretty clear this is a significant gap, and if prioritized, this could be fixed pretty easily in a future release of Kubernetes.
If you thought this was interesting, check out what we’re building on Kubernetes here.