Building Internal Kubernetes Platforms

Architecting and Building Internal Kubernetes Platforms

We are running a #KubernetesPlatforms series where we share our experiences and thoughts on how to build internal Kubernetes platforms. This is the introductory post of the series.

Most people think of Kubernetes as a container orchestration technology. While it certainly is one, and has become the de facto choice for container orchestration, the power of Kubernetes extends well beyond handling a container’s lifecycle.

Today, the primary users of Kubernetes are DevOps and SRE engineers. In many organizations, each team has its own DevOps engineers managing Kubernetes infrastructure according to that team’s needs. That’s how Kubernetes adoption typically begins.

Such siloed DevOps efforts don’t scale across a large organization. You start seeing duplicated effort and a lack of consistency across teams, and implementing a common set of security, compliance, and governance policies becomes a mammoth undertaking.

To adopt Kubernetes consistently across many teams in an organization, the need for an internal Kubernetes platform arises.

And the best part?

Kubernetes is fundamentally designed to support building such abstractions (in the form of platforms) so that end users (who deploy apps to Kubernetes) do not have to deal with its complexities.

Kubernetes provides building blocks that can be used to build various systems for others to use.

An organization can build a system (a platform) that caters to how that organization would like Kubernetes to be consumed internally.

In this post, we will look at Kubernetes’ extensibility, how that very extensibility introduces additional complexity, and how the steps organizations and teams take to remove that complexity eventually lead to building internal Kubernetes platforms.

Kubernetes’ built-in abstractions aren’t enough

Kubernetes already offers abstractions. The first component anyone interacts with, the Pod, is itself an abstraction: it wraps one or more containers so that we can treat them as a single app.

Similarly, a Persistent Volume is a nice abstraction for provisioning storage for your app without getting into the nitty-gritty of how the underlying storage is provisioned and managed. There are many more such examples: Clusters (abstracting servers), Ingress, and so on.

However, these abstractions still don’t take away the complexity that a developer or DevOps engineer has to deal with when operating a modern application on Kubernetes.

Let’s take the use case of deploying a simple microservice to production on Kubernetes. Here’s the minimal set of Kubernetes objects one typically deals with.

(Figure: minimum set of Kubernetes objects required to deploy a microservice to production)
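As a rough sketch (the names, image, and values below are illustrative, not from the original diagram), even a bare-bones deployment involves several objects:

```yaml
# Illustrative minimum: a Deployment and a Service.
# A real production setup typically adds an Ingress, ConfigMap/Secret,
# HorizontalPodAutoscaler, PodDisruptionBudget, NetworkPolicy, RBAC, and more.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders                    # hypothetical microservice name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
  ports:
    - port: 80
      targetPort: 8080
```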

That’s just the minimum required for a production microservice. Throw in security, operations, CI/CD, and cost management, and we are easily dealing with many more Kubernetes objects and APIs.

This is still way too complex for many teams to deal with on a daily basis. Or, as AWS would put it, a lot of “undifferentiated heavy lifting”.

Kubernetes is extensible by design

One of the core design principles that differentiates Kubernetes from others is its extensibility. Kubernetes implements only a few components as part of its core and makes everything else extensible.

Let’s understand this with a simple example.

If you want to expose your Kubernetes service outside the cluster, you would typically create a Kubernetes object called an Ingress. However, merely creating this Ingress object doesn’t result in any action by itself.

You need to pick an Ingress Controller, which does the actual work of interpreting your Ingress rules and deploying the necessary configuration.

For example, the Nginx Ingress Controller is a popular one used by many teams. When you deploy it, it watches Ingress objects in the cluster and creates the necessary Nginx configuration to expose your service.

The Nginx Ingress Controller is just one option; there are many different Ingress Controllers available, and you pick the one that fits your needs.
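For illustration, here is a minimal Ingress (the host and service names are placeholders); the ingressClassName field is what ties the object to a particular controller:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders
spec:
  ingressClassName: nginx          # selects which Ingress Controller acts on this object
  rules:
    - host: orders.example.com     # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders
                port:
                  number: 80
```

If no controller in the cluster claims the nginx class, this object simply sits there, which is exactly the behavior described above.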

The same model applies across a Kubernetes cluster: the container runtime, storage, and networking are all completely extensible. If you create an Amazon EKS cluster, it uses a networking plugin called amazon-vpc-cni, which exposes Pod IPs as first-class VPC IPs. An Azure AKS cluster or a Google GKE cluster would use a different networking plugin. This is possible because Kubernetes allows such extensions.

Kubernetes’ extensibility adds more complexity

Kubernetes’ extensibility has made its adoption widespread: for any given requirement, there are now multiple choices available. However, that same extensibility brings complexity.

When teams across an organization start adopting Kubernetes, they often end up using different technologies for the same purpose. One team could be using the Nginx Ingress Controller while another uses Traefik; the way secrets are passed to applications could vary from team to team.

This is a real challenge for organizations that expect standards and compliance to be adhered to across teams.

Managed Kubernetes doesn’t cut it either

The big three cloud providers offer managed Kubernetes services in the form of Amazon EKS, Azure AKS, and Google GKE, and so do others like Red Hat OpenShift.

All of these services certainly remove some of the heavy lifting: you no longer manage the control plane, node provisioning is simplified, cluster upgrades are automated (though you still have to validate them), and many cluster management activities are simplified or automated.

However, when it comes to workload lifecycle management, you still have to deal with the whole lot of Kubernetes objects we saw earlier. You are expected to write a bunch of YAML definitions to deploy your workloads, and to maintain them as your workloads evolve.

Automation to the rescue

Nobody likes dealing with such complexity, or with so many moving parts where any human error can break things. So we naturally start automating as much as possible.

In the Kubernetes world, this exercise typically starts with tools such as Helm. Helm charts allow operators to automate repeatable deployment and management tasks.

You could write a Helm chart that describes all the Kubernetes resources your workload requires, then reuse that chart every time you need to create the Kubernetes infrastructure for that workload. As the workload’s deployment configuration changes, you simply keep evolving the chart, as the sketch below shows.
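As a minimal sketch (the chart layout, names, and values are illustrative), the chart replaces hard-coded YAML with templated resources plus a small set of per-environment knobs:

```yaml
# values.yaml -- per-environment knobs; names are illustrative
replicaCount: 3
image:
  repository: registry.example.com/orders
  tag: "1.4.2"
service:
  port: 8080
```

```yaml
# templates/deployment.yaml -- a minimal, illustrative Helm template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-{{ .Chart.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Chart.Name }}
  template:
    metadata:
      labels:
        app: {{ .Chart.Name }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: {{ .Values.service.port }}
```

Changing a replica count or rolling out a new image tag now means editing values.yaml (or passing an override) rather than hand-editing raw manifests.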

For third-party dependencies, you can find readily available community Helm charts on Artifact Hub. For example, if you are looking to install Redis on your Kubernetes cluster, you can pick up the Redis chart provided by Bitnami.
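Consuming such a chart usually comes down to overriding a handful of values. The keys below follow the Bitnami Redis chart’s conventions, but they vary between chart versions, so treat this as a sketch and verify against the chart’s documentation:

```yaml
# redis-values.yaml -- illustrative overrides for the Bitnami Redis chart
architecture: replication             # standalone or replication
auth:
  enabled: true
  existingSecret: redis-credentials   # hypothetical pre-created Secret
replica:
  replicaCount: 2
```

You would then pass this file to helm install or helm upgrade with the -f flag.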

Alternatively, if you are already using Terraform as your infrastructure-as-code tool, there is a Kubernetes provider available for it.

While such automation efforts certainly bring repeatability, they have the following limitations.

  1. Limited to no self-service experience: Most of these automations run behind a CI/CD pipeline and thus do not provide a direct self-service experience to developers. Getting access to Kubernetes environments still requires lots of manual intervention
  2. Lack of standardization: If you have many teams adopting Kubernetes, such automations tend to happen siloed within those teams. Soon you will find different teams adopting Kubernetes in different ways
  3. No centralized control: You will also find that there is no centralized control over your Kubernetes infrastructure across the organization. This becomes a particular challenge when you have to apply patches or upgrades across the board, or run cost-optimization exercises

The Birth of Internal Kubernetes Platforms

All of the above challenges lead to building internal Kubernetes platforms. With an internal Kubernetes platform,

  • Developers can work autonomously through a self-service experience, without dealing with Kubernetes complexities
  • Admins and DevOps engineers can focus on standardizing Kubernetes workflows and streamlining operations
  • Security and compliance controls can be implemented centrally; if a patch or an upgrade has to be performed, it can be rolled out in a predictable way
  • Production environments can be centrally controlled to ensure availability and operational reliability
  • Developers get enough tooling when they do want to interact with Kubernetes, for example getting a shell into their container or viewing logs from the browser
  • Costs can be managed centrally: you get organization-wide cost visibility and can apply cost-optimization strategies (such as automatically turning off dev clusters or using Spot Instances)

The benefits of such a platform are endless.

But the key goal is to strike a balance between the autonomy that developers seek and the control that DevOps engineers and admins want.

And that’s why companies like Eventbrite and Spotify built one for themselves. Puppet’s 2020 State of DevOps Report called out the use of self-service platforms as a characteristic of high DevOps evolution.

The platform is a product in itself

One common pitfall of building such an internal platform is that it stops evolving. Many platform initiatives are conceived as one-time exercises; however, for a platform to be successful, it needs to evolve continuously.

  • Developer workflows keep changing. The team behind the platform needs to work closely with developers (the platform’s customers), gather feedback, and keep addressing their pain points
  • Different workloads have different needs. Kubernetes workflows built for a microservice workload may not work well for a data or MLOps workload
  • The Kubernetes ecosystem is rapidly evolving. The platform needs to adapt to the changing landscape continuously, and do so transparently without impacting developers

Closing Thoughts

Successful Kubernetes adoption in an organization results from changing our thinking from,

“How do we move to Kubernetes?”

to

“How do we get our developers to embrace Kubernetes with the least friction? What are their workflows and pain points?”

Building an internal Kubernetes platform with the right set of components helps answer these questions, drive successful Kubernetes adoption across teams, and realize the benefits Kubernetes provides.

From the next post onwards, we will go deep into the ingredients and components of such internal Kubernetes platforms.

The first post in the series is about the foremost capability of a Kubernetes platform – providing a self-service experience.
