Kubernetes Observability Challenges and Landscape


Organizations have been monitoring applications for almost as long as they have been building them. However, with the adoption of microservices and distributed systems architectures, this space has become much more complex.

In this post, we will look at some of the challenges we face when monitoring modern applications. We will discuss these challenges in the context of containers and Kubernetes, survey the current state of the ecosystem, and recommend some of the popular projects in this space.

The Three Pillars of Observability

If you are just getting started with Kubernetes, one of the terms that you will often encounter is “Observability”.

Simply put, Observability refers to our ability to observe a system through its external outputs, detect anomalies, and fix them so that the system continues to operate well.

The external outputs that are generated by systems typically fall under three main pillars:

  • Logs
  • Metrics
  • Traces

These three pillars, when combined, provide a holistic picture of how our systems are behaving.
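
To make that concrete, here is a minimal Python sketch (the handler name, metric name, and port are hypothetical) of a single request emitting all three pillars: a structured log line, a Prometheus counter, and a trace span via the OpenTracing API (a no-op unless a real tracer such as Jaeger is configured).

```python
# A minimal sketch: one request handler emitting all three pillars.
# Assumes the `prometheus_client` and `opentracing` packages are installed;
# the handler, metric, and port names are hypothetical.
import json
import logging
import sys
import time

import opentracing                          # global tracer is a no-op by default
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("checkout_requests_total", "Checkout requests handled")

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def handle_checkout(order_id: str) -> None:
    # Trace: one span per request (reported only if a real tracer is configured).
    with opentracing.tracer.start_active_span("handle_checkout"):
        start = time.time()
        # ... business logic would go here ...
        REQUESTS.inc()                       # Metric: count every request handled
        log.info(json.dumps({                # Log: one structured event on stdout
            "event": "checkout_handled",
            "order_id": order_id,
            "duration_ms": round((time.time() - start) * 1000, 2),
        }))

if __name__ == "__main__":
    start_http_server(8000)                  # expose /metrics for Prometheus to scrape
    handle_checkout("demo-123")
```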

Observability challenges in the world of Containers and Kubernetes

As developers and DevOps engineers, we have been collecting this Observability data from our systems, integrating it, and using it to identify issues. However, when we deploy our applications as containers, a new set of challenges emerges.

Multiple Components

Kubernetes, by design, is highly decoupled in nature. The system is made up of many different components that work together to support various aspects of a container’s lifecycle.

For example, your containers run as Pods, which in turn run on Nodes, each with its own host operating system. In front of those Pods you might have Ingress Controllers serving traffic from outside the cluster. And if your application is composed of many microservices, you might throw in a service mesh like Istio as well.

What this all means is that just one set of logs or metrics doesn’t cut it when you deploy your applications to Kubernetes. Each of these components emits its own Observability data. You need to collect all of that data and, more importantly, correlate it to get enough context.

Ephemeral Nature

Containers in Kubernetes are highly ephemeral and dynamic in nature. Containers can be spun up and torn down in response to changing demand. Pods can get rescheduled from one node to another due to node availability issues or scheduling priorities. The storage volume mapped to a container may change depending on storage needs.

In a Kubernetes infrastructure, “things” tend to move around. Hence, traditional approaches to collecting logs, metrics, and traces do not hold up in such a dynamic environment. For example, writing logs to a local file doesn’t work well when the filesystem doesn’t outlast the container that produced them.
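
A quick sketch of the alternative, using Python’s standard logging module: write structured lines to stdout instead of a file, so the container runtime captures them and a node-level collector can ship them no matter where the Pod is scheduled (the JSON-in-a-format-string approach here is just for illustration; real setups usually use a proper JSON formatter).

```python
import logging
import sys

# Instead of a FileHandler tied to the container's throwaway filesystem ...
# logging.basicConfig(filename="/var/log/app.log")

# ... write to stdout. The container runtime captures it, and a node-level agent
# (for example, Fluent Bit running as a DaemonSet) ships it to a central backend.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format='{"time": "%(asctime)s", "level": "%(levelname)s", "msg": "%(message)s"}',
)

logging.getLogger("app").info("order processed")
```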

Bringing context to Observability

Most teams that adopt Kubernetes practice Continuous Delivery. New versions of applications are deployed continuously, often multiple times a day. You may also be running multiple versions of an application in the same cluster for A/B testing. And those different versions may consume different amounts of resources (for example, different numbers of Pods).

So, when you are looking at your Observability data, you need additional context automatically injected (e.g., version numbers) to observe the system and make decisions (say, whether to roll a new version out further).
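
One common pattern, sketched below, is to inject that context into the Pod as environment variables (the DEPLOY_VERSION, POD_NAME, and POD_NAMESPACE names here are assumptions; they would have to be wired up in the Deployment manifest, for example via the Downward API) and attach it to every metric and log record.

```python
import json
import logging
import os
import sys

from prometheus_client import Counter

# Assumed to be injected into the Pod spec via `env:` entries or the Downward API;
# the variable names are hypothetical.
CONTEXT = {
    "version":   os.getenv("DEPLOY_VERSION", "unknown"),
    "pod":       os.getenv("POD_NAME", "unknown"),
    "namespace": os.getenv("POD_NAMESPACE", "unknown"),
}

# The version becomes a metric label, so dashboards can compare v1 vs. v2 directly.
REQUESTS = Counter("app_requests_total", "Requests served", ["version"])

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

def handle_request() -> None:
    REQUESTS.labels(version=CONTEXT["version"]).inc()
    # Every log line carries the same context, making logs and metrics easy to correlate.
    log.info(json.dumps({"event": "request_served", **CONTEXT}))
```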

Current Kubernetes Observability Ecosystem

For such a highly distributed and dynamic environment, where containers are treated as “ephemeral” and “disposable”, traditional approaches towards collecting logs, metrics, and traces do not work well.

Hence, a whole suite of new tools has emerged, designed from the ground up for such an environment. Existing tools are also adding support to adapt to it.

Here’s a snapshot of the current CNCF Observability landscape and the various projects in this space.

Out of this vast ecosystem, here are the projects most widely used today:

  • Logging
    • Fluentd & Fluent Bit
    • Elastic Logstash
  • Monitoring
    • Prometheus
    • Cortex
    • Thanos
  • Tracing
    • Jaeger
    • OpenTracing

These projects typically run as sidecar containers or DaemonSets in Kubernetes clusters, collect Observability data, and send it to a centralized backend.
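
As an illustration of that pattern, here is a sketch using the jaeger-client Python package: the application sends spans to a Jaeger agent assumed to be running next to it, either as a sidecar in the same Pod or as a DaemonSet on the node, which then forwards them to the central backend (the service and span names are hypothetical).

```python
from jaeger_client import Config

# The Jaeger agent is assumed to run alongside the application, either as a
# sidecar in the same Pod (reachable on localhost) or as a DaemonSet on the node.
config = Config(
    config={
        "sampler": {"type": "const", "param": 1},    # sample every trace (demo only)
        "local_agent": {
            "reporting_host": "localhost",           # a DaemonSet setup would use the node IP
            "reporting_port": 6831,                  # Jaeger agent's default UDP port
        },
        "logging": True,
    },
    service_name="checkout",
    validate=True,
)
tracer = config.initialize_tracer()

with tracer.start_active_span("checkout-request"):
    pass  # application work happens here; the span is reported to the agent on exit

tracer.close()  # flush any buffered spans before shutdown
```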

Observability Infrastructure

Once we have chosen our Observability stack, we also need to focus on the infrastructure to support our stack. Similar to the applications from which we collect Observability data, the infrastructure that underpins our Observability stack needs to be Scalable, Highly Available, and Performant.

Remember, when our applications are not performing well, it is the Observability data that helps us to identify issues. This means that the underlying infrastructure for the Observability stack is as important as our application’s infrastructure.

We also need to think about “Long Term Storage” for our Observability data that is durable and cost-effective.

Here’s a list of popular backends for each of the Observability components.

  • Logs
    • ElasticSearch
  • Metrics
    • Cortex
    • Thanos
    • InfluxDB
  • Tracing
    • ElasticSearch

Depending on scale, we can run this infrastructure as:

  • A dedicated Kubernetes cluster, or
  • A dedicated node group within the cluster where your applications are hosted.

Most of these projects have integrations with object stores like Amazon S3 for long term storage.

Cloud Providers also offer managed services that can collect and store Observability data. For example, CloudWatch Container Insights can collect and aggregate metrics from your EKS clusters. Similarly, CloudWatch Logs and AWS X-Ray can be leveraged for logs and traces.

AWS also provides standalone managed services, such as Amazon Managed Service for Prometheus (which runs Cortex internally), that can be used instead of hosting these tools yourself.

Conclusion

As our applications and systems become more and more distributed, the tools that are required to monitor them also need to evolve. Containerized applications and dynamic environments such as Kubernetes require a new approach towards Observability.

The current landscape of Observability tools for Kubernetes is still evolving, with many competing projects. However, there are early signs of a few projects maturing and becoming the de facto choices.

Amidst all of these projects, there is one player who is promising to integrate all three pillars and provide a single pane of glass. We will dive deep into that in our next post.

Stay tuned!
