Scaling a Custom GitOps Engine at Plural

Introduction

For those unfamiliar, GitOps at its core is using Git as the ultimate source of truth for the state of your infrastructure and continuously reconciling the actual state of your infrastructure to that desired state.  In the case of Kubernetes, a GitOps engine constantly queries Git and reconciles the Kubernetes API against the YAML it gathers from Git in a continuous loop.  Within Plural’s broader Fleet Management solution, this solves for two main concerns:

  • Scalable configuration of large sets of clusters: since Git can effectively publish-subscribe configuration to any subset of your fleet via a few CRDs we’ve crafted, it provides a perfect developer experience for managing fleets, as long as the system is optimized for large cluster counts.
  • An efficient bulk resource application framework for more complicated infrastructure provisioning and automation, such as creating a tenanted fleet with a single PR, as in this demo.  Just define a large set of declarative resources in a single folder, create a Plural service against the folder, and let the system do everything from Terraform execution to syncing Kubernetes manifests and setting up RBAC access to various resources.

In brief (since this isn’t a blog post about Plural itself), the architecture involves a central management hub and an on-cluster agent that receives tarballed resource manifests to apply in a constant reconciliation loop.  This architecture does a great job of distributing work across clusters, since resource application is cluster-local and thus shared across the fleet.  We noticed, however, that the agent became a bottleneck on very large clusters and knew we needed to optimize.

That said, we constantly maintain an internal design constraint that the agent stay as lightweight as possible: with a 1-1 relationship between agent and cluster, the agent should be functionally a background concern, so you don’t trade the problem of cluster management for the problem of agent management (and, for what it’s worth, this is why we chose a custom implementation: we wanted full control of the source code to optimize it as much as possible).

Problem

The core issue derives from the reconciliation loop approach to resource application.  While the reconciliation loop is a powerful mechanism for ensuring that the desired state of a system matches the actual state, it does have some drawbacks and challenges:

  • Resource Consumption: Constantly applying changes can lead to increased load on the Kubernetes API server, especially in large and dynamic environments.  If policy management tools like OPA Gatekeeper, which heavily leverage admission controllers, are present, this can propagate down to node resource overutilization as well.
  • Latency: There can be a delay between detecting a state change and its reconciliation, leading to temporary inconsistencies.  This is aggravated by the fact that you might need to Server-Side Apply each object, which even with parallelism consumes a lot of time, and then watch for status updates, consuming even more wall-clock time.
  • Race Conditions: No matter what, this is going to be heavily parallelized.  Each object is applied in a fixed goroutine pool, and naively implemented caching approaches will be tormented with races and other concurrency bugs as a result.

Legacy implementation

We originally adopted cli-utils as the library of choice for bulk resource application.  It is a relatively lightweight project supported by a Kubernetes SIG, and it gave us core features like client-side resource schema validation, parallelization of resource application across multiple goroutines, easy configuration of basic apply settings, filtering logic, and status-watching logic.  GitLab’s GitOps agent leverages the library as well, as does ours, and we believe Anthos Config Sync also uses it under the covers.  Internally, it leverages Kubernetes Server-Side Apply, which really deserves its own discussion.

Server-Side Apply (SSA) in Kubernetes is a powerful feature that essentially updates only the parts of a resource that have changed by comparing an object with the live state on the Kubernetes API server.  This can make SSAs relatively lightweight API calls; however, they still pass through all Kubernetes admission webhooks, and there is inherent latency in making any network call to the API server.  Multiply that across hundreds of resources in a cluster, reconciled continuously, and it can still put meaningful pressure on the cluster and drag down the responsiveness of our agent.
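For a concrete sense of what the agent is doing under the hood, a server-side apply through the Go dynamic client boils down to a single PATCH with the apply patch type.  The sketch below is a minimal, simplified example; the function shape and field-manager name are ours for illustration, not the actual deployment-operator code:

package apply

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// serverSideApply issues a single PATCH with the apply patch type; the API
// server merges the supplied fields into the live object and records field
// ownership for the given field manager.
func serverSideApply(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured) error {
	data, err := obj.MarshalJSON()
	if err != nil {
		return err
	}

	force := true
	_, err = client.Resource(gvr).Namespace(obj.GetNamespace()).Patch(
		ctx,
		obj.GetName(),
		types.ApplyPatchType,
		data,
		metav1.PatchOptions{FieldManager: "example-agent", Force: &force},
	)
	return err
}

Even though the call itself is cheap, every such PATCH still runs the full admission chain on the API server, which is exactly where the per-reconcile cost adds up.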

The cli-utils implementation had several substantial drawbacks, though.  First, there was no way to track state between applies (it was conceptually built around making it easier to write one-off CLIs similar to kubectl), which we would need if we wanted to deduplicate unnecessary applies between reconciliations.  Second, its status-watching logic is naive and only suitable for a CLI use case.  In particular, it works like this:

  • for every group, kind, and namespace applied in bulk, issue a watch for that tuple on demand
  • wait for either a timeout to expire or the resource to converge on a ready state, then close the watch.

This is fine in a CLI use case, since the process is short-lived and you’d need to watch on demand anyway, but in a persistent, constantly reconciling agent, it causes needless network call overhead.  A better strategy is to record the GVKs that have been historically applied, keep a fixed set of watch streams open for each of them, and then use those streams to gather status.
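A minimal sketch of that idea might look something like the following, keeping one long-lived watch per resource type and feeding its events into a shared handler (the names and structure here are illustrative, not the actual agent code):

package watches

import (
	"context"
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

// WatchPool keeps a single long-lived watch open per resource type instead of
// opening an ad-hoc watch on every apply.
type WatchPool struct {
	mu      sync.Mutex
	client  dynamic.Interface
	watches map[schema.GroupVersionResource]watch.Interface
}

// Ensure starts a watch for the given resource type if one isn't already
// running; subsequent applies of the same type reuse the existing stream.
func (p *WatchPool) Ensure(ctx context.Context, gvr schema.GroupVersionResource, handle func(watch.Event)) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.watches[gvr]; ok {
		return nil
	}
	w, err := p.client.Resource(gvr).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	p.watches[gvr] = w
	go func() {
		for ev := range w.ResultChan() {
			handle(ev) // feed events into the status/resource cache
		}
	}()
	return nil
}

The restart and resume behavior this needs in practice is covered later in the post.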

The second key insight was that if we’re capturing the watch streams, we could devise a simple, resource-efficient caching strategy to deduplicate unneeded applies based on the events we see incoming.

Solution

We've built our streamlined cache based on SHA hashes. For any unstructured Kubernetes object received in a watch stream, we compute a few SHAs based on:

  1. object metadata (only name, namespace, labels, annotations, deletion timestamp)
  2. all other top-level fields except status, which is a reserved field in Kubernetes unrelated to applies (see the sketch below)
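Conceptually, the pruning and hashing look roughly like the snippet below.  This is a hedged sketch, assuming SHA-256 for concreteness; the helper names are ours for illustration rather than the exact deployment-operator implementation:

package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// prune keeps only the identity-relevant metadata plus every non-status,
// non-metadata top-level field, so server-managed noise such as
// managedFields and resourceVersion never churns the SHA.
func prune(obj *unstructured.Unstructured) map[string]interface{} {
	pruned := map[string]interface{}{}
	for key, value := range obj.Object {
		if key == "status" || key == "metadata" {
			continue
		}
		pruned[key] = value
	}
	pruned["metadata"] = map[string]interface{}{
		"name":              obj.GetName(),
		"namespace":         obj.GetNamespace(),
		"labels":            obj.GetLabels(),
		"annotations":       obj.GetAnnotations(),
		"deletionTimestamp": obj.GetDeletionTimestamp(),
	}
	return pruned
}

// hashResource serializes the pruned object (encoding/json sorts map keys,
// so the output is deterministic) and returns its SHA-256 hex digest.
func hashResource(obj *unstructured.Unstructured) (string, error) {
	data, err := json.Marshal(prune(obj))
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}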

In full, the cache entry looks like:

// ResourceCacheEntry contains the latest SHAs for a single resource from multiple stages
// as well as the last-seen status of the resource.
type ResourceCacheEntry struct {
	// manifestSHA is the SHA of the last seen resource manifest from the Git repository.
	manifestSHA *string

	// applySHA is the SHA of the resource post-server-side apply,
	// taking only metadata (name, namespace, annotations, labels) and the non-status, non-metadata fields.
	applySHA *string

	// serverSHA is the SHA from a watch of the resource, using the same pruning function as applySHA.
	serverSHA *string

	// status is a simplified Console structure containing the last seen status of the cached resource.
	status *console.ComponentAttributes
}

Condition for Applying a Resource

A resource can be applied if either of the following conditions is met:

  • applySHA ≠ serverSHA
  • The SHA of a Git manifest being reconciled ≠ last calculated manifestSHA

Reasoning

  • Git Manifest SHA Divergence - if the SHA of a Git manifest diverges, it indicates that the Git upstream has changed. In such cases, the resource must be modified regardless of other factors.
  • Resource Drift - if there is no divergence in the Git manifest SHA, any drift in the resource would be due to reconfiguration by another client. This drift would trigger a watch event and cause a change in serverSHA. The change should then be reconciled back in the next loop of the agent.

By following this logic, we ensure that resources are only modified when necessary, maintaining consistency and alignment with the upstream Git repository and handling any external reconfigurations efficiently.
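Put another way, the whole decision reduces to a small predicate over the cache entry shown above.  A hedged sketch, with a method name of our own choosing rather than the actual implementation:

// RequiresApply reports whether the resource should be re-applied, given the
// SHA of the manifest currently rendered from Git.
func (e *ResourceCacheEntry) RequiresApply(manifestSHA string) bool {
	// Drift: the live object (serverSHA) no longer matches what we last
	// applied (applySHA), so some other client reconfigured the resource.
	drift := e.applySHA == nil || e.serverSHA == nil || *e.applySHA != *e.serverSHA

	// Upstream change: the manifest rendered from Git differs from the one we
	// recorded on the previous reconcile, so Git is ahead of the cluster.
	gitChanged := e.manifestSHA == nil || *e.manifestSHA != manifestSHA

	return drift || gitChanged
}

Everything else in the cache exists to keep those three SHAs trustworthy between reconciles.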

There were two main things we got out of this:

  1. Great memory efficiency - this struct is highly compressed, versus just caching the entire object from the Kubernetes API, allowing us to handle very large cluster sizes with memory footprints under 200Mi.
  2. Consistency - Kubernetes watches have strong consistency guarantees deriving from etcd’s Raft implementation. If the watch is managed properly, we can ensure any update to a given resource ultimately populates the cache, and thus drive our hit rates well over 90% at a steady state (with the remaining misses due to multiple services contending for a common resource).

Benchmark

The images below show the differences between our deployment-operator with the cache enabled and disabled.  The first one shows cache-related metrics as well as general resource consumption. The second one shows the measured execution time of the service reconcile loop, broken down by the different stages of the reconciliation process. In general, it performs these steps:

  • Initialize - get Service object either from cache or API.
  • Process Manifests - fetch the manifests tarball from the API, untar, and template them.
  • Apply - run the cli-utils apply.

You can see both the 50th percentile reconcile times of all services across the cluster and the same times measured for the console service. We specifically selected the console service because it is one of our biggest services and has to process a large number of CRDs during the apply. It has 56 resources that would normally need to be reapplied on every reconcile.

Cache disabled

Cache enabled

Conclusion

As the images above show, enabling the resource cache in the deployment operator provides notable improvements across various performance metrics:

Performance Improvement

  • Reconcile Time - the average reconcile time is halved, dropping from 2.49 seconds to 1.24 seconds, significantly speeding up the reconciliation process.
  • Apply Time - the average apply time is reduced by more than half, from 1.9 seconds to 637 milliseconds, enhancing the speed of applying changes.  A lot of the residual latency is just setting up the cli-utils applier, validating that the namespace is in place, and setting up the Go client’s machinery like the schema mapper.  We can likely optimize that away too; we just still need to work on it.

Console Service Optimization

  • Reconcile Time - for the console service, the average reconcile time decreases from 6.14 seconds to 2.24 seconds, a substantial improvement.
  • Apply Time - the average apply time for the console service drops from 4.28 seconds to 375 milliseconds, ensuring faster application of configurations.

Resource Cache Effectiveness

  • Hit Ratio - a high resource cache hit ratio of 94.4% indicates that the majority of resource requests are being efficiently served from the cache, reducing the need for repeated data fetching and processing.

Interesting Learnings within the Guts of Kubernetes

One of the more interesting things we discovered on this journey is how the sprawling nature of the Kubernetes ecosystem heavily amplifies API server load. Many open-source controllers, like OPA Gatekeeper and Istio, heavily leverage admission controllers, which get called on every apply.  If you have automation that applies frequently, like a CD tool, or really any controller, it can implicitly propagate load onto the cluster in unexpected, unintuitive ways due to those transitive dependencies.

The second is the CRD lifecycle.  There is a known issue where the Kubernetes Go client’s API discovery cache heavily throttles clients when CRDs change.  Any continuous deployment implementation, or really any tooling that touches CRDs, can trigger this issue if it’s not careful.  We had originally solved this directly with a cli-utils apply filter, but the caching approach we introduced provides a more general framework that also sidesteps the client-throttling bugs that could otherwise have occurred.

The last is the complexity of managing the Kubernetes Watch API. We needed to build a custom StatusWatcher that we could use as a replacement for the cli-utils default watcher and as a base for our global resource cache. We made several optimizations there, but the main one was replacing shared informers with a thin wrapper around the dynamic client.  The wrapper does an initial list call, feeds the results into the event channel, starts a watch from the extracted resourceVersion, and, on any error during the watch, restarts it while keeping the last seen resourceVersion.  This ensures that no events are missed between the list and the watch, or during restarts, e.g. if the API server drops a watch after some time.  Another benefit is improved memory usage: the dynamic client doesn’t provide a built-in caching mechanism, so we took advantage of that and built our own cache.
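A compressed sketch of that list-then-watch loop is below.  It is illustrative only, with backoff and the handling of expired resourceVersions trimmed away:

package watches

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

// watchResource lists once, replays the items as events, then watches from the
// returned resourceVersion, resuming from the last seen one whenever the watch
// is dropped. Error backoff and re-listing on expiry are omitted for brevity.
func watchResource(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, events chan<- watch.Event) error {
	list, err := client.Resource(gvr).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range list.Items {
		events <- watch.Event{Type: watch.Added, Object: &list.Items[i]}
	}

	resourceVersion := list.GetResourceVersion()
	for ctx.Err() == nil {
		w, err := client.Resource(gvr).Watch(ctx, metav1.ListOptions{ResourceVersion: resourceVersion})
		if err != nil {
			continue // in practice: back off, and re-list if the version has expired
		}
		for ev := range w.ResultChan() {
			if obj, ok := ev.Object.(*unstructured.Unstructured); ok {
				resourceVersion = obj.GetResourceVersion()
			}
			events <- ev
		}
		// the watch was dropped by the API server; loop and resume from
		// the last seen resourceVersion
	}
	return ctx.Err()
}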

Thanks to all this, we can ensure that there is only a single watch open for every unique resource type found in the cli-utils inventory, and we can also reuse the same StatusWatcher when configuring the cli-utils applier.

Credits to Łukasz Zajączkowski, Michael Guarino, and Marcin Maciaszczyk for the joint effort in the design, implementation and creation of this blog post.