Scale testing Plural: Managing 1100 Kubernetes clusters with just over 3cpu

At Plural, we are building the most scalable Kubernetes management platform out there, both in terms of resources used and human time invested. In reality, though, we had never really had the opportunity to push our technology to its theoretical limit, since users typically have cluster counts in the mid-double or low triple digits. As an exercise for ourselves, and to have demonstrable proof of the value of our technology, we decided to properly put it through its paces. But first, a bit of context about Plural to explain the constraints at play.

What is Plural?

At its core, Plural provides a GitOps-style controller for both Kubernetes YAML manifests and IaC, primarily Terraform, since it is the dominant toolchain in that space. On top of those core workflows we build more powerful, higher-level constructs like infrastructure self-service, AI-driven troubleshooting, cost management, and more. Providing the full provisioning power of a GitOps flow usually requires a few guarantees:

  • API-driven provisioning against declarative state referenced in a Git repository; the controller is then responsible for communicating those manifests to the end cluster.
  • Continuous resource reconciliation, to guarantee no drift from the desired state as it was originally defined.
  • Some mechanism for recursion. In ArgoCD this is known as the app-of-apps pattern, powered by the fact that Argo applications are themselves CRDs syncable by Argo. Plural works the same way, except we call it services-of-services: we have a CRD set which can itself be synced by Plural (see the sketch below).
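To make that recursion concrete, here is a minimal, purely illustrative sketch in Go of the shape involved; the type and field names are hypothetical and simplified, not Plural's actual CRD schema:

```go
package example

// GitRef points a service at a folder of manifests in a Git repository.
// Names and fields here are hypothetical, purely to illustrate the idea.
type GitRef struct {
	Repository string // Git URL the management plane has registered
	Ref        string // branch or tag to sync from
	Folder     string // folder whose manifests get applied to the target cluster
}

// ServiceSpec describes one synced service: a Git source and a target cluster.
type ServiceSpec struct {
	ClusterRef string // which workload cluster receives the manifests
	Git        GitRef
}

// The recursion: if the folder above contains YAML for more service custom
// resources, the controller syncs those like any other manifest, and they in
// turn define further services - the "services-of-services" analogue of
// Argo's app-of-apps pattern.
```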

The main implication of these core features follows from continuous resource reconciliation: there is usually at least a linear relationship between the number of entities managed by such a controller and the compute/memory required to implement it. In particular, this means poll loops running per cluster, per “service” defined within each cluster, and so forth.

This inherently creates the threat of a thundering herd, and the usual approach to combatting it is caching at each level of the system, alongside jittering any poll loops or cron jobs. For the most part we’ve been quite careful in implementing these techniques, but you never know for sure until you test it.

This can also be mitigated with hackier techniques, like loosening poll intervals to conserve resources, but we wanted this test to preserve the core user experience customers would expect of a K8s GitOps controller, which includes fairly frequent (~2m in our case) drift remediation as table stakes.
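To illustrate the jittering idea, here is a generic sketch in Go (not our actual Elixir implementation; the 20% slack factor is an arbitrary choice for the example). Each poll loop sleeps for its base interval plus a random offset, so loops that all start at the same moment drift apart instead of hitting the management plane in lockstep.

```go
package poller

import (
	"math/rand"
	"time"
)

// jittered returns the base interval plus up to 20% of random slack, so that
// poll loops which start at the same time spread out instead of forming a
// thundering herd against the management plane.
func jittered(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base/5)))
}

// reconcileLoop runs a drift-remediation callback on a jittered ~2 minute
// cadence, roughly the table-stakes interval described above.
func reconcileLoop(reconcile func() error) {
	const base = 2 * time.Minute
	for {
		// A failed pass is fine: the next tick simply re-converges toward the
		// desired state.
		_ = reconcile()
		time.Sleep(jittered(base))
	}
}
```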

You can read about the Plural architecture here, but at its core we slice the problem into two tiers:

  1. A management plane, which maintains the state of everything declaratively defined throughout your fleet
  2. An agent (a Kubernetes operator), which performs the actual cluster-level actions communicated to it by the management plane, polling for the current desired state and converging toward it to provide an eventually consistent guarantee (sketched below).
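Here is a rough sketch of that pull-based agent loop in Go. Plural's real agent is a full Kubernetes operator; the interface and types below are hypothetical stand-ins for its API calls.

```go
package agent

import (
	"context"
	"log"
	"time"
)

// DesiredState stands in for whatever the management plane returns for one
// cluster: a set of services and the manifests each should be running.
// (Hypothetical type, purely for illustration.)
type DesiredState struct {
	Services map[string][]string // service name -> rendered manifests
}

// ManagementPlane abstracts the management plane's API; the real agent talks
// to Plural's API through the KAS tunnel rather than this interface.
type ManagementPlane interface {
	DesiredState(ctx context.Context, clusterID string) (DesiredState, error)
}

// Run is the shape of a pull-based agent: poll for desired state, apply it
// locally, and repeat. Because every iteration re-fetches and re-applies, the
// cluster converges to the management plane's view even if individual passes
// fail - the eventually consistent guarantee described above.
func Run(ctx context.Context, mp ManagementPlane, clusterID string, apply func(DesiredState) error) {
	ticker := time.NewTicker(2 * time.Minute)
	defer ticker.Stop()
	for {
		if state, err := mp.DesiredState(ctx, clusterID); err != nil {
			log.Printf("fetch failed, will retry next tick: %v", err)
		} else if err := apply(state); err != nil {
			log.Printf("apply failed, will retry next tick: %v", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```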

In theory this should let us shard the work of GitOps reconciliation horizontally by cluster and support extreme scale, but there can still be bottlenecks in the implementation of the management plane, and this test was intended to uncover them. But first, how do you even create 1100 k8s clusters (and without breaking the bank)?

Hurdles in Creating a 1k+ k8s fleet

Our general strategy was to use the smallest possible AWS t-series instance that can host k3s, a “lightweight” - but actually perfectly scalable - Kubernetes distribution that can be backed by SQLite. The minimum usable instance turned out to be a 2cpu t3.medium.

Our approach was straightforward: create a simple script that would automatically:

  1. Install and run k3s.

  2. Register each new cluster with our central Plural Console.

We added this script to the EC2 instance "user data" - which is Amazon's way of running initialization code when a virtual machine starts up.

The entire process was automated using Terraform; the repo itself is viewable here. Seems simple enough. Unfortunately, though, not quite. Here is the list of issues you’ll hit if you try this at home:

  • Docker pull rate limits - easily the most annoying. K3s uses Docker Hub to host some of its core images, most notably CoreDNS. Docker Hub is notorious for its rate limits, which are usually avoidable by simply adding a pull credential. Unfortunately, there is also a poorly documented, and completely unavoidable, abuse rate limit. If you create a few hundred k3s clusters at once, you will hit it, and there’s no way around it. The ultimate fix was to set up k3s in airgapped mode, downloading the images before boot from the GitHub releases page; this commit fixed it. Once the images are there, k3s automagically slurps them into containerd to start the necessary pods. It also ended up saving us a lot of network bandwidth, so in a way we should thank Docker Hub for keeping some of our cash away from AWS.
  • AWS usage limits - AWS has reasonable limits on the number of EC2 machines you can spawn in one account. We had initially hoped to do 10k clusters, but in reality approvals are required to increase these limits beyond normal thresholds. As a result, we were left with only a 1k-cluster test, and needed to use multiple regions to reach even that number.
  • Terraform parallelism/refresh - By default Terraform refreshes state on each run and runs with 10-way parallelism. When you’re deliberately creating a state file with 1k+ elements, that is incredibly slow and must be tuned. We used the flags -parallelism=250 -refresh=false to avoid this.

The Plural Console Setup

The setup for the Console used in this test is actually very similar to what you get from running plural up, which is the simplest way to self-service a Plural Console setup (so now you know what that setup can support). Basically, the management cluster was an EKS cluster with 4 m7a.xlarge (4cpu/16Gi) nodes (we made these larger just in case; they ended up being very underutilized) and a db.t4g.large RDS Postgres database with 2cpu and 8Gi of memory. The intention was to scale that up as well, but it turned out to be completely unnecessary.

We did some minor reconfiguration of the Plural Console itself, via tweaked Helm values:

  1. We gave the console pods a bit more resources.
  2. We put the Kubernetes Agent Server (KAS), the Kubernetes reverse-tunneling proxy implementation we use, onto a separate node so it could be observed in isolation.
  3. We added three global services to the setup so there would be a somewhat realistic set of services on the clusters, representing RBAC config and two example microservices. This meant there would be ~4000 total services in the environment.

An initial stab

We did a first test, scaling from 100 to 500 clusters without any tweaks to our codebase, and got results like this from Datadog:

Not particularly bad, but the major utilization spikes were concerning and indicated bottlenecks. As we got to 500 clusters, these became pretty apparent, with heavy network usage in the management plane and aggressive CPU utilization. In particular, the causes were:

  • To render CPU/memory metrics in our table view, each node would stream every cluster row from the db and query KAS for a CRD we define that summarizes the state of the k8s metrics server. This maintained a roughly accurate in-memory cache to avoid extremely expensive n+1 GraphQL resolutions. It ends up being quite resource-intensive at scale and had been implemented naively for simplicity; it clearly needed optimization and rethinking.
  • Secondly, the rendering of tarballs of manifests from Git, while heavily cached, could still be cached more aggressively inside the management cluster. Basically, there’s a single Erlang actor which owns each Git repo, and when a request for a tarball comes into one of our API server nodes, it identifies the adjacent node hosting that actor, asks it for the tarball, and streams it from disk (oftentimes over the network). These are cached in our agent after the initial download, but that initial load is still a ton of internal cluster bandwidth. (Further, there are some interesting nuances with Elixir/Erlang file handling when done inter-process vs in raw mode that caused inefficiency here, since it was impossible to use a raw file handle with that implementation.)

These were ultimately easy fixes. In order:

  • For the utilization metrics, we already had a distributed scheduler built into our clustered server. We simply refactored the approach to be triggered by it, so the stream happens only once, on one node, which then distributes the results to every node using Erlang’s pg module. We also tweaked the code to only query known-healthy clusters and to apply a fixed limit, since in reality users will rarely look at more than 100 or so clusters anyway. This turned an operation whose cost scaled with # servers * # managed clusters into effectively a constant.
  • For tarball caching, the solution was to simply add another cache layer. We already compute a unique digest for each tarball we generate, which is a natural cache key, and since it’s a fully read-only code path, it’s trivial to cache. Adding a file cache to each node, which downloads and stores on disk any tarball streamed from our Git and Helm agents, ensures future requests are served straight from disk rather than over the network in the steady state (see the sketch below).
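To give a flavor of that second fix, here is a simplified sketch of a digest-keyed, node-local file cache. The Console's real implementation is in Elixir; this Go version, with hypothetical names, just shows the pattern: check local disk by digest first, and only fall back to the expensive fetch from the Git/Helm agent's node on a miss.

```go
package tarcache

import (
	"errors"
	"io"
	"os"
	"path/filepath"
)

// Cache stores tarballs on local disk keyed by their content digest. Because
// a digest uniquely identifies a rendered tarball and the path is read-only,
// a hit can be served straight from disk with no cross-node traffic.
type Cache struct {
	Dir string // e.g. a per-node scratch directory (hypothetical location)
}

func (c *Cache) path(digest string) string {
	return filepath.Join(c.Dir, digest+".tgz")
}

// Fetch returns a reader for the tarball with the given digest, invoking fetch
// (the expensive cross-node/remote path) only on a cache miss.
func (c *Cache) Fetch(digest string, fetch func() (io.ReadCloser, error)) (io.ReadCloser, error) {
	if f, err := os.Open(c.path(digest)); err == nil {
		return f, nil // steady state: served from local disk
	} else if !errors.Is(err, os.ErrNotExist) {
		return nil, err
	}

	src, err := fetch()
	if err != nil {
		return nil, err
	}
	defer src.Close()

	// Write to a temp file first, then rename, so concurrent readers never see
	// a partially written tarball.
	tmp, err := os.CreateTemp(c.Dir, "tarball-*")
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(tmp, src); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return nil, err
	}
	tmp.Close()
	if err := os.Rename(tmp.Name(), c.path(digest)); err != nil {
		os.Remove(tmp.Name())
		return nil, err
	}
	return os.Open(c.path(digest))
}
```

Since the digest fully identifies the content, there is no cache invalidation to reason about on this path.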

These changes were ultimately very quick and easy once identified, and were largely done in this PR: https://github.com/pluralsh/console/pull/1924/files. Note specifically this change for the metrics cache broadcasting and this change setting up the local file cache that’s used elsewhere in the PR.
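And for the first fix, the shape of the metrics change looks roughly like this (again a simplified Go sketch with hypothetical names; in the Console this is Elixir, with the distributed scheduler electing a single node and the fan-out happening over Erlang's pg process groups rather than a direct function call):

```go
package metricscache

import "sync"

// ClusterMetrics is a stand-in for the per-cluster utilization summary
// collected from each agent's metrics CRD (field names are hypothetical).
type ClusterMetrics struct {
	ClusterID string
	CPU, Mem  float64
}

// Cache is the node-local, read-mostly copy that GraphQL resolvers consult,
// so rendering the cluster table never triggers an n+1 fan-out to KAS.
type Cache struct {
	mu   sync.RWMutex
	data map[string]ClusterMetrics
}

func New() *Cache { return &Cache{data: map[string]ClusterMetrics{}} }

// Apply is invoked on every node when the single node that ran the scheduled
// fetch broadcasts its batch of results; only that one node pays the cost of
// streaming cluster rows and querying KAS.
func (c *Cache) Apply(batch []ClusterMetrics) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, m := range batch {
		c.data[m.ClusterID] = m
	}
}

// Get serves resolver reads straight from memory.
func (c *Cache) Get(clusterID string) (ClusterMetrics, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	m, ok := c.data[clusterID]
	return m, ok
}
```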

The Final Results

Once these changes were made, the results were immediately apparent. CPU utilization was much lower and, more importantly, more stable, KAS was much happier, and the system was easily able to converge no matter how many clusters were thrown at it. Here are the results at 100, 500, and 1k+ clusters:

At 100 clusters, the management cluster is barely utilizing 1cpu, with nicely regular memory and CPU utilization (there’s always going to be some fluctuation due to intensive cron jobs, and periodically refetching git repos to pull in new changes):

At 500 clusters, we see about 1.5cpu usage in the management plane, again very stable overall, and ~0.2cpu utilization for the database (equating to 10% of its 2cpu):

Then at 1k clusters, the same behavior remains, with the management plane stabilizing around 2.5cpu and the database around 0.4cpu and 43 write IOPS/s - plenty of headroom to scale further, even on the given limited hardware:

In addition, there’s no real degradation of our UI at this cluster scale either. That’s a consistent issue within the Kubernetes ecosystem, with tools like ArgoCD and many of the Kubernetes dashboards choking after rendering a few hundred objects (due to over-reliance on the k8s API, which doesn’t support pagination properly). Here’s a quick video of me going through the UI to demonstrate:

(Video: 0:36 walkthrough of the UI)

Learnings

I think this is a good example of why strategies like stress testing and chaos testing are important. A lot of these bottlenecks were actually known issues, but there was no forcing function to fix them until they blocked a clear objective. Further, the intra-cluster networking issues caused by the lack of local caching only really became apparent at very large scale, and it was great to discover them by properly exercising the system. Plus, it gave new members of our team time to get familiar with toolchains like k3s and the low-level EC2 configuration that’s oftentimes ignored with modern tooling.

Overall we’re very happy with the results. We’ve ultimately been able to prove that Plural can manage 1100 Kubernetes clusters, with a baseline amount of software provisioned on them, using a resource footprint small enough to fit on a MacBook Pro with plenty of room to spare. Hopefully AWS will give us some slack to go all the way to 10k relatively soon; it would be awesome to see the results at that scale as well!