A primer on effective monitoring practices
Scale • Enzo Nocera • 30/05/23 • 19 min read
As Platform Engineers specialized in monitoring and observability solutions, my coworkers and I had the chance to experiment a lot with monitoring practices and tools. Now we would like to share our views on implementing monitoring in a way that allows your product to become more resilient and responsive.
Monitoring has a significant impact on your product and its quality. In our experience, naive attempts to implement a monitoring system always fail when they are not accompanied by the organizational and cultural shifts that let teams offset the high cost of monitoring and keep the system aligned with their goals.
A primer on monitoring
This post is aimed at engineers, engineering managers, and product managers. We will define monitoring and how it differs from observability and go over some general principles that should be kept in mind during the whole lifecycle of implementing and maintaining your monitoring system. We’ll also look at some common types of monitoring and, finally, how to continuously improve your product by taking advantage of the insights provided by monitoring.
Monitoring is all about collecting telemetries (metrics, logs, traces, and so on) from systems and storing them in a time series database (TSDB). A background system watches that data and issues alerts when certain behaviors are observed, and other systems can take those alerts or time series as input (e.g., Kubernetes auto-scaling on custom metrics).
Common telemetries: metrics, logs, and traces
Alerts can be routed to other systems to trigger automated procedures, be displayed on a dashboard for further investigation, trigger a page to summon the on-call engineer (which is expensive), and much more.
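The flow described above, a rule evaluated against stored samples and the resulting alert routed to a destination, can be sketched in a few lines. This is only an illustration: the `Alert` class, the `evaluate_rule` and `route` helpers, the severity names, and the 5% threshold are all assumptions made for the example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Alert:
    name: str
    severity: str  # hypothetical levels: "info", "automation", "page"
    value: float


def evaluate_rule(name: str, samples: List[float], threshold: float,
                  severity: str) -> Optional[Alert]:
    """Return an Alert when the most recent sample exceeds the threshold."""
    if not samples:
        return None
    latest = samples[-1]
    if latest > threshold:
        return Alert(name, severity, latest)
    return None


def route(alert: Alert, handlers: Dict[str, Callable[[Alert], str]]) -> str:
    """Send the alert to the handler for its severity (dashboard by default)."""
    handler = handlers.get(alert.severity, handlers["info"])
    return handler(alert)


# Hypothetical destinations mirroring the options listed in the text.
handlers = {
    "info": lambda a: f"dashboard: {a.name}={a.value}",
    "automation": lambda a: f"automation: run procedure for {a.name}",
    "page": lambda a: f"page on-call: {a.name}={a.value}",
}

# Error-rate samples as they might be read back from a TSDB (made-up values).
error_rate = [0.01, 0.02, 0.09]
alert = evaluate_rule("high_error_rate", error_rate, threshold=0.05, severity="page")
if alert is not None:
    print(route(alert, handlers))
```

Keeping rule evaluation separate from routing means the same rule can be pointed at a cheaper destination (a dashboard or an automated procedure) before anyone decides it deserves a page.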
Metrics are the most widely used of all telemetries: they are typically fresh and easy to aggregate, so they scale well as your production grows.
Logs describe unique events, so they’re not directly used for monitoring because they’re too verbose and harder for a machine to manipulate than plain numbers like metrics. But they can be counted and turned into metrics (e.g., by incrementing a counter for every log describing an HTTP 500). Their main use case is still in the name: to log, with a certain verbosity, what’s happening on a system for traceability and post hoc investigation.
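As a rough sketch of turning logs into a metric, the snippet below counts responses per HTTP status code from access-log lines. The log format, the sample lines, and the `count_statuses` helper are assumptions chosen for illustration; a real pipeline would usually rely on a log shipper or exporter rather than hand-written parsing.

```python
import re
from collections import Counter

# Assumed access-log shape: ... "METHOD /path PROTOCOL" STATUS SIZE
LOG_PATTERN = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def count_statuses(lines):
    """Increment one counter per HTTP status code found in the log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts


# Made-up sample lines.
logs = [
    '10.0.0.1 - - [30/May/2023:10:00:00 +0000] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.2 - - [30/May/2023:10:00:01 +0000] "POST /pay HTTP/1.1" 500 87',
    '10.0.0.3 - - [30/May/2023:10:00:02 +0000] "GET /cart HTTP/1.1" 500 91',
]

http_status_total = count_statuses(logs)
print(http_status_total["500"])  # 2
```

The verbose lines are reduced to a handful of numbers, which is what makes them cheap to store, aggregate, and alert on.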
Traces are mostly used to track a user request that may be distributed among different services. They are basically a collection of the logs triggered by one user request traversing several of your service components and are not really used in monitoring as such but more in observability (more on this later).
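Following the description above of a trace as the logs produced by one request across several components, here is a simplified sketch that groups log records by a shared request identifier. The `trace_id` field, the record layout, and the service names are hypothetical; real tracing systems typically model spans with parent/child relationships rather than flat log records.

```python
from collections import defaultdict


def group_by_trace(records):
    """Group log records by trace_id and order each group by timestamp."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    for events in traces.values():
        events.sort(key=lambda r: r["timestamp"])
    return dict(traces)


# Made-up records emitted by three hypothetical services.
records = [
    {"trace_id": "req-1", "service": "payment", "timestamp": 3, "message": "charge accepted"},
    {"trace_id": "req-2", "service": "gateway", "timestamp": 2, "message": "request received"},
    {"trace_id": "req-1", "service": "gateway", "timestamp": 1, "message": "request received"},
    {"trace_id": "req-1", "service": "cart", "timestamp": 2, "message": "cart loaded"},
]

traces = group_by_trace(records)
print([r["service"] for r in traces["req-1"]])  # ['gateway', 'cart', 'payment']
```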
Why do you need monitoring?
If you’ve ever looked into the literature on monitoring, you’ve probably come across Part III of the Google SRE Book, which presents a Hierarchy of Needs for service reliability. Google puts monitoring at the “base” of the pyramid, saying it’s impossible to run a reliable service if you are unaware of its state.
Meanwhile, if we look at Service Quality literature, one of the most widely accepted models, which has been in use since the early 90s, is the SERVQUAL model, which evaluates the quality of services in five dimensions. Among them, reliability (“The ability to perform the promised service dependably and accurately”) and responsiveness (“The willingness to help customers and to provide prompt service”) are two criteria that monitoring systems can improve.
Without monitoring, we are unable to detect whether our system is drifting away from its nominal state. Our users will therefore perceive our services as unreliable, since they can’t trust them to perform consistently. Worse, because we are unaware of those drifts, we cannot quickly put the system back on track. This negatively impacts our Mean Time To Recover (MTTR), resulting in a drop in responsiveness and in our customers’ trust in us.
With repeat occurrences, the perceived quality of our service will drop to a point where the user is no longer willing to pay for it (the decision to use a service is mainly motivated by the quality-cost ratio), which unavoidably leads to churn.
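To make MTTR concrete, here is a minimal sketch that averages the time between the start of an incident and its recovery. The incident timestamps are invented for the example, and the `mean_time_to_recover` helper is an illustrative assumption, not a standard API.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average duration between incident start and recovery."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((recovered - started for started, recovered in incidents), timedelta())
    return total / len(incidents)


# Made-up (start, recovery) pairs: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 10, 30)),
    (datetime(2023, 5, 8, 14, 0), datetime(2023, 5, 8, 15, 30)),
]

print(mean_time_to_recover(incidents))  # 1:00:00
```

Because the clock starts when the system drifts rather than when someone notices, every minute of undetected drift is added directly to this average.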
The difference between monitoring and observability
There have been some shifts in recent years, and we’ve seen the emergence of solutions labeled “observability”. Some have argued that observability is just a buzzy tech word and yet another synonym for monitoring. But we disagree.
The need for human intervention keeps shrinking as autonomous systems develop, but the need for a qualified workforce to maintain those automations is rising. This effectively means that we don’t have to scale our operational team proportionally to our production scale. And that’s where we created the need for observability next to monitoring.
Monitoring has allowed us to create automations and/or autonomous systems by collecting telemetries and launching automated procedures based on a set of rules evaluated against those telemetries. Observability, on the other hand, has been designed for humans. It allows us to observe our increasingly complex systems effectively.
The more we automate our systems, the more they look like a black box for us. Observability was born because we made our systems more and more autonomous to handle more and more complex problems, but we realized it made the cognitive load needed to reason about them humanly unbearable.
At the end of the day, all observability solutions advertise themselves as easy to use with a good developer experience for this exact reason: they allow humans to observe things they couldn’t with the naked eye.
That’s why observability and monitoring are complementary, not interchangeable.
The principles of monitoring
Now that we’ve established that there is a need for monitoring, let’s look at the important principles you should keep in mind when implementing a solution.
Monitoring is not just technical
Don’t ignore the product and business implications of monitoring. You should have regular meetings (e.g., every quarter) with those parties to speak about your objectives and your Quality of Service (QoS).
This will allow you to assert the relevance of your current monitoring solution, which should be aligned with your service objectives. And if the alert volume is already high, and you struggle to meet the objectives, product management…