
Observability in Practice: Noise, Signals, and Alerts in Production

What Prometheus, Grafana, and Graylog Actually Helped With Once the Platform Was Live


1. Observability Was Not the Same Thing as Instrumentation

By the time observability became a serious topic, the platform already had most of the building blocks you would expect. Prometheus was scraping metrics. Grafana was full of dashboards. Graylog was collecting logs. Alerts existed. Teams channels and email routes existed. On paper, that sounds like observability.

It was not, at least not automatically.

The thing that production taught me very quickly was that collecting data and understanding a system are not the same activity. A platform can be full of telemetry and still be hard to operate. In fact, a lot of noisy environments have exactly that problem: they produce more information than the humans responding to incidents can use.

That is why I stopped thinking about observability mainly as a tooling topic. The stack mattered, of course. Prometheus, Grafana, and Graylog each solved real problems. But the more important question was operational rather than technical. When something starts going wrong in production, does the observability model help the team understand the issue quickly enough to reduce impact? Or does it bury the team in signals that are technically correct and operationally unhelpful?

That distinction mattered more than the tooling itself. The stack was there to reduce ambiguity. If it created more of it, then it was not doing the job as well as it looked from a diagram.

2. The Problem Was Never Lack of Data

The early instinct in most teams is easy to recognize. Something breaks, or an incident is harder to debug than it should have been, so the response is to add more metrics, more alerts, more dashboards, and more logs.

That instinct sounds responsible. In practice, it often makes the environment harder to operate.

The reason is simple. Most production systems do not suffer because they have too little telemetry. They suffer because the telemetry is not organized around decision-making. Engineers under pressure do not need infinite detail. They need a reliable path through the detail. They need to know what is user-visible, what changed recently, what is likely causal versus merely correlated, and what action is safest right now.

Without that structure, observability degrades into accumulation. A dashboard exists because it might be useful one day. A metric is scraped because Prometheus can scrape it. A log stream is retained because someone might need it later. An alert fires because the threshold exists, not because waking a human is warranted. Eventually the stack becomes rich in data and poor in guidance.

That was the point where I started treating observability as an operating interface for production rather than as a reporting layer. The question was not whether the platform knew a lot about itself. The question was whether the people responsible for it could make better decisions because of that knowledge.

3. Prometheus, Grafana, and Graylog Had Different Jobs

One of the more useful shifts was getting more disciplined about what each tool was actually for.

Prometheus was the signal source. It was where the most useful production symptoms first became visible. Error rate, latency, saturation, resource pressure, restart patterns, and workload health all showed up there before anyone had a full explanation. Prometheus was good at telling the team that the system's behavior had changed and that something worth attention might be happening.

Grafana was the investigation surface. Once Prometheus or an alert indicated that something was wrong, Grafana helped answer the next layer of questions. Is this isolated to one service or broader? Did latency climb before or after the rollout? Is memory use growing steadily or spiking sharply? Is one namespace unhealthy, or is the whole cluster under pressure? In other words, Grafana helped shape the problem.

Graylog was the explanation layer. Metrics showed that behavior had shifted. Dashboards narrowed the scope. Logs were often where the raw narrative became visible. Exceptions after a rollout, dependency timeouts, authentication failures, bad configuration values, recurring connection errors, or repeated application-level faults became much easier to interpret once the time window and affected scope were already known.

This separation sounds obvious when written down, but it made a real operational difference. Without it, teams tend to expect every observability tool to answer every kind of question. Then they become disappointed when metrics do not explain root cause, dashboards do not tell them what to do, or logs are too overwhelming to use as an entry point.

The tools were complementary, not interchangeable. Once that became clear, the platform was easier to operate under pressure.

4. Dashboards and Alerts Were Not the Same Thing

One of the strongest practical lessons was that dashboards and alerts need to serve different purposes.

A dashboard is for understanding. It gives context, trends, and shape. It lets an engineer investigate, compare, and reason about behavior. An alert is for action. It interrupts someone because the system believes human attention is required now.

When those two roles get blurred, the observability model starts working against the people using it.

The easiest way to see that failure mode is in alert design. Teams often turn any technically interesting threshold into a notification because it feels safer to be told more. CPU spikes, memory movement, restarts, short-lived saturation, noisy log bursts, and local anomalies all become alerts. Eventually the alert stream stops representing urgency and starts representing everything the platform happens to notice about itself.

That is operationally destructive. It teaches engineers that a notification does not necessarily mean a decision is needed. Once that trust is gone, the signal-to-noise problem is no longer theoretical. It is embedded in human behavior.

The cleaner rule was much simpler: a proper alert means human action is required now. If the signal does not ask for a decision, it probably belongs somewhere else. It may still belong in a dashboard. It may still matter for daytime review. It may still deserve a ticket, a weekly summary, or a trend report. But it should not compete with real production signals for human attention.

That distinction turned out to be one of the most important parts of making observability useful instead of merely complete.

5. Noise Was a Human Systems Failure, Not Just a Technical One

I do not think alert noise is mainly a monitoring flaw. I think it is a human systems design flaw.

When a production environment generates too many notifications, the problem is not just that the tooling is verbose. The deeper problem is that the platform has lost the ability to express urgency clearly. Engineers begin to receive the same delivery mechanism for very different classes of events. Something mildly interesting and something user-visible arrive through the same channel, with similar language, at similar times, and eventually they are treated with similar skepticism.

That is how teams end up waking people at night for things that could have waited until morning, while also missing the early shape of incidents that truly mattered.

The most useful framing I found was this: noise means the system is talking, but nobody needs to act yet. A proper alert means the system is asking for intervention. That sounds almost too simple, but it cleaned up a lot of confusion very quickly because it forced every candidate alert to justify itself in human terms rather than technical terms.

A sustained user-facing latency breach, a material error-rate increase on a critical path, or service unavailability clearly fit that bar. A single pod restart, a brief CPU excursion, or one noisy error pattern without visible impact usually did not. Those lower-level signals still mattered. They just mattered as context or investigation inputs, not as primary incident entry points.
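That bar can be made concrete in rule form. The sketch below shows the contrast as two Prometheus alerting rules; the metric names (`http_request_duration_seconds_bucket`, `kube_pod_container_status_restarts_total`) are common conventions from client libraries and kube-state-metrics rather than anything specific to this platform, and the `severity` label values are ones we chose, so treat the expressions as illustrative:

```yaml
groups:
  - name: symptom-alerts
    rules:
      # User-facing symptom: p95 latency above 500ms, sustained for 10 minutes.
      # This asks for human action now, so it carries paging severity.
      - alert: HighUserFacingLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 latency for {{ $labels.service }} above 500ms for 10 minutes"

      # Supporting telemetry: pod restarts. Worth seeing in daytime review,
      # not worth interrupting anyone, so it only ever reaches a digest.
      - alert: PodRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        labels:
          severity: info
        annotations:
          summary: "{{ $labels.pod }} restarted more than 3 times in the last hour"
```

The `for: 10m` clause is doing real work in the first rule: it is what turns "latency crossed a line" into "latency crossed a line and stayed there," which is the difference between a blip and a breach.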

Once the team started treating alerting as a human trust problem rather than a metric-threshold problem, the observability model improved much faster.

6. Delivery Channels Needed Different Meanings

Another detail that mattered more than many teams admit was where the signals were sent.

Not every alert belongs in the same channel, and not every channel carries the same meaning. If everything is delivered everywhere, the system is not becoming more visible. It is becoming more repetitive.

In practice, Teams and email served different roles. Teams worked well for shared operational awareness during working hours, for degraded conditions that were worth watching but not yet severe, and for keeping the platform team aligned during an active incident. It was a good place for visibility that might lead to action, but did not always justify immediate interruption.

Email had a different shape. It was slower and more durable, which made it more appropriate for wider distribution, summaries, persistent records, and notifications that needed to be visible beyond the engineers actively sitting in operational chat. Email was not the right medium for urgent real-time response, but it was often the better place for structured visibility that should not vanish into a busy chat stream.

The point was not the tools themselves. The point was that delivery path should match urgency and ownership. Once that mapping was clearer, the notification model became easier to trust because the route itself carried meaning. If a signal arrived in one place rather than another, engineers already had a better hint about how seriously to treat it.
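Mechanically, this mapping lived in routing configuration. A hedged sketch of what it looks like in Alertmanager terms, where the receiver names and webhook URLs are hypothetical placeholders (Teams delivery here is shown as a generic webhook, and the email receiver assumes SMTP settings in the global section):

```yaml
route:
  receiver: teams-ops            # default: shared operational awareness
  routes:
    # Urgent, user-impacting: interrupt a human immediately.
    - matchers:
        - severity="page"
      receiver: oncall-pager
    # Worth watching during working hours: operational chat.
    - matchers:
        - severity="warning"
      receiver: teams-ops
    # Durable, non-urgent visibility: batched into email.
    - matchers:
        - severity="info"
      receiver: email-digest
      group_interval: 12h

receivers:
  - name: oncall-pager
    webhook_configs:
      - url: https://example.internal/pager-hook    # hypothetical endpoint
  - name: teams-ops
    webhook_configs:
      - url: https://example.internal/teams-hook    # hypothetical Teams connector
  - name: email-digest
    email_configs:
      - to: platform-team@example.internal          # hypothetical address
```

The structure is the point, not the specific receivers: the `severity` label decides the route, so the channel an alert arrives in already encodes how seriously to treat it.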

Observability gets much calmer when the channels stop competing with each other.

7. The Alerting Rule That Changed Everything Was Very Simple

The most useful internal rule I found was also the least sophisticated.

If an alert fires, someone should immediately understand what kind of decision it is asking for.

That decision might be to roll back a deployment. It might be to investigate a user-facing service urgently. It might be to confirm whether autoscaling is failing, whether a dependency is down, or whether traffic should be shifted or reduced. It might even be to acknowledge that the event is informational and no urgent action is needed. But the class of decision should be obvious.

If the first reaction to an alert is "that is interesting," then it probably does not belong in an urgent alert stream. Interesting is what dashboards, trends, and daily review loops are for. Urgent alerts should create operational clarity, not intellectual curiosity.
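One way to enforce that rule mechanically is to refuse any alert whose annotations cannot name the decision it is asking for. A sketch, with the caveat that `action` and `runbook_url` are annotation names of our own choosing rather than anything Prometheus requires, and the `checkout` service and wiki URL are hypothetical:

```yaml
- alert: CheckoutErrorRateHigh
  expr: |
    sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "checkout is failing more than 5% of requests"
    # The class of decision, stated up front rather than left to be inferred:
    action: "Check the last rollout first; roll back if the spike aligns with it, otherwise escalate to the dependency owner"
    runbook_url: https://wiki.example.internal/runbooks/checkout-errors  # hypothetical
```

If writing the `action` annotation for a candidate alert feels impossible, that is usually the signal that the candidate belongs in a dashboard or a trend report instead.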

This rule also helped keep incident response disciplined. During a real production problem, the right sequence is usually stabilize first, investigate second. Good alerts supported that sequence because they pointed toward the safest next operational move. Bad alerts disrupted it because they dragged the team into analysis before the situation was under control.

I did not need a more elegant rule than that. The practical value came from applying it consistently.

8. Good Dashboards Were Smaller and More Opinionated Than I Expected

Grafana made it very easy to build large, ambitious dashboards, and that was part of the problem.

At some point most teams realize they can graph almost everything: request rate, error rate, latency, CPU, memory, disk, network, pod count, node state, queue depth, database health, ingress trends, deployment history, namespace saturation, and any custom application metric they can expose. That can produce dashboards that look comprehensive and feel reassuring.

The issue is that incident dashboards are not museums. Their job is not to display everything the platform knows. Their job is to shorten the path from confusion to the next good question.

The dashboards that actually helped in production were usually much smaller. A good service dashboard answered, in order, whether the service was healthy from a user perspective, whether something had changed recently, whether the issue looked local or systemic, and whether the bottleneck was more likely to be traffic, compute, memory, rollout behavior, or a downstream dependency.

That meant prioritizing error rate, latency, throughput, saturation indicators, rollout markers, and a handful of supporting resource trends. Everything else had to justify its place.
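Those prioritized signals can be precomputed as Prometheus recording rules so the incident dashboard stays small, fast, and opinionated. This is a sketch following the common `level:metric:operations` naming convention; the underlying metric names are assumptions about fairly standard instrumentation, not this platform's actual series:

```yaml
groups:
  - name: service-dashboard-signals
    rules:
      # User-perspective health: error ratio and p95 latency.
      - record: service:error_ratio:rate5m
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
      - record: service:latency_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
      # Traffic context: is load changing at the same time?
      - record: service:request_rate:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      # Saturation context: CPU usage against limits, per namespace.
      - record: namespace:cpu_saturation:rate5m
        expr: |
          sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
            / sum by (namespace) (kube_pod_container_resource_limits{resource="cpu"})
```

A dashboard built on four or five rules like these answers the ordered questions above in one screen, which is roughly the difference between scrolling and deciding.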

This was one of those lessons that sounds stylistic until you feel the difference in an incident. Large dashboards make engineers scroll. Smaller dashboards make engineers decide.

9. Logs Were Essential, but Rarely a Good Starting Point

Graylog was extremely valuable, but only when it was used at the right point in the flow.

Logs are where a lot of the raw explanation lives. Exceptions, dependency failures, authentication problems, configuration mistakes, and rollout-specific errors often become obvious there. But logs are also the fastest way to drown in detail if you start with them too early.

The pattern that worked best was consistent enough that it became a habit. Use an alert or symptom to confirm that something user-visible may be happening. Use Prometheus and Grafana to narrow the scope, affected service, and time window. Then use Graylog to explain what the service or dependency was actually doing inside that narrowed window.

Once that narrowing had happened, Graylog became much more useful. Repeated exceptions across pods, timeouts to a specific dependency, a bad configuration value introduced during rollout, or a sudden shift in one class of application errors could usually be spotted much more quickly. Without that narrowing, the log surface was simply too large and too mentally expensive to treat as the first operational step.

I think this is one of the more underrated observability lessons in Kubernetes environments. Teams often collect logs successfully long before they learn how to use them efficiently under pressure.

10. Example: A Latency Spike Was Not a Logging Problem

One recurring pattern looked something like this. A service suddenly showed a sustained latency increase on a user-facing path. The first temptation, especially from people who knew the application well, was to dive straight into logs and start reading exceptions or request traces at random.

That rarely worked well.

What worked better was to begin with the signal. Prometheus showed that latency had crossed a meaningful threshold and stayed there long enough to matter. Grafana then helped narrow the shape of the problem. Did the increase start directly after a deployment? Was it isolated to one service or visible in downstream dependencies too? Was throughput increasing at the same time? Was resource pressure building, or did the application look healthy from a CPU and memory perspective?

Only once that picture was clearer did Graylog become the right tool. At that point, the logs might show repeated dependency timeouts, exceptions after a specific configuration change, or a failing path that matched the exact interval visible in Grafana. The value of the logs came from the fact that the earlier tools had already made the search tractable.

The lesson was simple: logs are often the explanation layer, not the detection layer. Treating them as the entry point slowed incident understanding more often than it sped it up.

11. Example: The Alert Storm Was Not the Real Incident

Another pattern showed up when one service began failing noisily enough to drag half the platform into the conversation.

What should have been one incident often arrived as an alert storm. Error rate alarms fired. Latency alarms fired. Pod restart alerts fired. Resource pressure warnings fired. Downstream services started reporting secondary symptoms. On paper, the monitoring stack was doing a thorough job. Operationally, it was creating confusion about whether there were multiple incidents or one incident with many side effects.

This is where the difference between symptom alerts and supporting telemetry became critical.

The cleaner approach was to let a user-impacting signal open the incident and let the lower-level signals support understanding once someone was already investigating. That did not mean suppressing the rest of the data. It meant refusing to give every derivative symptom the same status as the primary problem.
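Alertmanager has a mechanism for exactly this refusal: inhibition. A sketch under the assumption that alerts carry the `severity` and `service` labels used earlier in this article:

```yaml
inhibit_rules:
  # While a user-impacting page is firing for a service, suppress the
  # derivative warnings (restarts, saturation, downstream symptoms)
  # that share the same service label. The data still exists in
  # Prometheus and Grafana; it just stops arriving as notifications.
  - source_matchers:
      - severity="page"
    target_matchers:
      - severity=~"warning|info"
    equal: ["service"]
```

The `equal` clause is the important part: suppression only applies when the lower-severity alert shares the same `service` value as the firing page, so an unrelated incident elsewhere still gets through.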

Once that shift happened, incidents became much easier to reason about. The platform had not necessarily become more stable in the moment, but the team was no longer losing time untangling its own instrumentation before getting to the actual issue.

This was one of the clearest examples of observability affecting reliability directly. A noisy stack does not only annoy engineers. It delays correct action.

12. Example: A Rollout Looked Fine Until the Logs Told the Truth

Some of the most instructive production incidents started with a deployment that appeared healthy at first glance.

ArgoCD showed the new version as synced. Pods were running. Basic platform health looked acceptable. Then user-facing behavior started drifting in the wrong direction. Error rates moved up or latency worsened just enough that something was clearly off.

Metrics and dashboards were still the first useful tools here because they answered the immediate questions. Did the change line up with the deployment? Was the issue concentrated in one service? Was the service under unusual resource pressure or was the shape of failure pointing somewhere else? Once that scope was narrowed, Graylog usually exposed the explanation much faster than raw graph-reading could. A dependency started timing out. A new configuration path was invalid. One class of exception exploded immediately after rollout. Something that looked like a generic service regression was often much more specific once the logs were being read inside the right context.
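One small piece of plumbing makes the "did the change line up with the deployment" question much cheaper to answer: splitting the error ratio by version. This assumes a `version` label is attached to request metrics, for example via a pod label and relabeling, which is a deliberate instrumentation choice rather than something Prometheus provides by default:

```yaml
# Recording rule: error ratio broken out per deployed version, so a
# regression introduced by a rollout separates visually from the
# still-healthy old version in a single Grafana panel.
- record: service:error_ratio_by_version:rate5m
  expr: |
    sum by (service, version) (rate(http_requests_total{code=~"5.."}[5m]))
      / sum by (service, version) (rate(http_requests_total[5m]))
```

When a rollout is the cause, the new version's line diverges while the old one stays flat, which narrows the Graylog search window before a single log line has been read.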

This kind of incident reinforced the same point again and again: observability works best as a sequence. Signals first. Shape second. Explanation third. When the team followed that sequence, production got easier to reason about.

13. What I Stopped Alerting On Changed the Quality of the Whole System

One of the most meaningful improvements came not from adding new signals, but from removing or downgrading weak ones.

Single pod restarts by themselves rarely deserved urgent escalation. Brief CPU spikes without any visible user impact usually belonged in trend review, not in the middle of a working day. One noisy error class that self-resolved without changing availability or latency generally did not need to interrupt people in real time. The platform still observed those conditions. It simply stopped pretending they all carried the same urgency.
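Downgrading often meant keeping the expression but changing what it feeds. A sketch of the pattern: the former alert conditions survive as recording rules that feed trend panels and weekly review, while the urgent alerts themselves are simply deleted rather than endlessly retuned (metric names here are standard cAdvisor and kube-state-metrics series, used illustratively):

```yaml
groups:
  - name: trend-review
    rules:
      # Previously an urgent alert; now only feeds a trend panel.
      - record: namespace:pod_restarts:increase1h
        expr: sum by (namespace) (increase(kube_pod_container_status_restarts_total[1h]))
      # Brief CPU throttling excursions: recorded, never paged on.
      - record: namespace:cpu_throttling_ratio:rate5m
        expr: |
          sum by (namespace) (rate(container_cpu_cfs_throttled_periods_total[5m]))
            / sum by (namespace) (rate(container_cpu_cfs_periods_total[5m]))
```

Nothing is lost by this change. The conditions are still observed and still queryable; they have just stopped claiming the same urgency as user-facing symptoms.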

That change improved more than the alert stream. It improved trust.

Once engineers saw that an alert usually meant a real decision might be needed, the observability model started working with them instead of against them. Teams channels became more readable. Email became more meaningful. Escalation stopped competing with background commentary. The platform was not quieter because it knew less. It was quieter because it had become more deliberate about what deserved interruption.

This is one of the reasons I think observability maturity is measured as much by what a team removes as by what it adds.

14. The Trade-Offs Were Real

Observability has its own trade-offs, and pretending otherwise usually produces bad systems.

If alerts are too sensitive, the platform detects trouble earlier at the cost of noise and distrust. If they are too conservative, the stack stays quiet longer while genuine problems gather user impact. If dashboards are too broad, they become hard to use. If they are too narrow, they may miss useful context. If log collection is too limited, explanation becomes difficult. If it is too broad and poorly structured, the platform pays a mental and sometimes financial cost for data nobody can use effectively.

There is also a trade-off between completeness and operability. Engineers often prefer the idea of seeing everything. People handling incidents usually benefit more from seeing the right things in the right order.

I do not think those trade-offs disappear. I think good observability comes from acknowledging them early and designing around human response patterns rather than around tool capability alone.

15. What I Would Do Earlier

Looking back, I would push a few things much earlier in the lifecycle of a platform.

I would define the distinction between alerts and informational signals from the beginning instead of letting the alert stream become crowded first and cleaning it later. I would make dashboard design more opinionated sooner, especially for service-level views used during incidents. I would teach teams earlier that logs are most powerful after scope has been narrowed rather than at the beginning of a production mystery. I would also make delivery channels more intentional from the start so that Teams, email, and true urgent notifications never drift into the same semantic bucket.

Most of all, I would treat observability design as part of platform design from day one, not as something that gets layered on once the services are already running.

The earlier the platform starts expressing urgency and context cleanly, the less often engineers have to learn those lessons under pressure.

16. Why This Still Felt Like Platform Engineering

This work mattered because it was not only about better dashboards or cleaner alerts. It was about making the production environment easier to understand and safer to operate.

That is why I think observability belongs naturally inside platform engineering. A platform is not complete when workloads can be deployed. It becomes much more useful when the people responsible for those workloads can tell what is happening, what is urgent, and what to do next without fighting their own instrumentation first.

Across the rest of the series, the same pattern keeps showing up in different forms. The landing zone work was about clear cloud boundaries. The developer platform work was about reducing cognitive load for application teams. The networking work was about making private infrastructure usable. The GitOps work was about making deployment state understandable. The reliability work was about building safer response habits around failure. Observability sits directly beside that. Its job is to turn telemetry into operational clarity.

The platform did not become better because Prometheus scraped more metrics or because Graylog held more logs. It became better when the important signals became easier to trust and the path from signal to action became shorter.

That is what observability actually helped with in production.