Reliability in Practice: What Actually Breaks and How I Handle It
What Production Failures on Kubernetes Taught Me About Recovery, Signals, and Operational Discipline
1. Reliability Was Never About Preventing Failure
By the time reliability became a serious topic, most of the visible platform work was already in place. The Azure foundation existed. Private AKS access was working. GitLab CI/CD and ArgoCD had established a deployment path. Platform control planes and workload clusters had been separated. The environment model was much clearer than it had been at the beginning.
That kind of progress can create a false sense of safety.
Once a platform looks well designed on paper, people naturally start expecting stability to follow from structure. Sometimes it does. More often, structure simply changes the kind of failures you see. The environment becomes more governable, but production still finds the weak assumptions. Traffic patterns change. Dependencies respond differently under load than they do in test. Resource limits that seemed reasonable during onboarding turn out to be badly tuned once real user behavior arrives. Health checks look fine until startup takes longer than usual. A deployment succeeds, but the service behind it is not actually ready for live traffic.
That was the point where reliability stopped feeling like a monitoring topic and started feeling like an operating discipline.
I do not think serious platform teams should define reliability as "the system does not fail." That standard is neither honest nor useful. Real systems fail. Dependencies slow down. Nodes get pressured. Configuration mistakes get through review. The better question is whether failure becomes visible quickly, whether the signals make sense, and whether the recovery path is disciplined enough that a bad situation does not get worse through confusion.
That changed how I thought about production. The goal was not to build a platform where nothing ever broke. The goal was to build one where failure was easier to contain, faster to understand, and safer to recover from.
2. What Actually Broke Was Usually Ordinary
One of the more useful lessons from production work is that major incidents are often made of very ordinary parts.
The failures I kept seeing were rarely dramatic in the way architecture diagrams imply. Most of them were not full-site outages caused by one spectacular design flaw. They were smaller operational weaknesses that lined up badly enough to become user-visible. Pods got OOMKilled because limits and actual usage had drifted apart. Readiness checks reported healthy too early. Services came up before a dependency was actually reachable. A rollout technically completed while latency quietly climbed in the background. A cluster or application component restarted repeatedly because the health checks were punishing a slow startup instead of detecting a dead process.
There were also the failures that did not look like failures at first. A service still responded, but slower. Error rates were low enough that nobody declared an outage immediately, yet high enough that customers were having a bad experience. Timeouts appeared only during specific traffic windows. A downstream dependency degraded just enough to create retries, queueing, or partial failures that spread into other services.
That kind of reliability work is harder than the dramatic version because it resists easy storytelling. Nothing has completely collapsed, but the system is no longer trustworthy. The platform is still running, but the margin is thinner than it looked yesterday. Recovery often starts before anyone can confidently explain root cause.
This is why I have grown skeptical of reliability writing that focuses only on idealized incident categories. In real environments, the things that break most often are rarely exotic. They are usually the operational details that teams assume are under control until production proves otherwise.
3. The Hard Part Was Not Detecting That Something Was Wrong
At first glance, that sounds backwards. Surely the hard part of reliability is noticing that something is failing.
Sometimes it is. More often, the harder part is recognizing what kind of failure you are looking at and deciding what to do first.
Most mature platforms already produce a lot of data. Prometheus is scraping. Grafana is full of dashboards. Logs are flowing. ArgoCD shows deployment history. GitLab shows what changed and when. The problem is that data by itself does not create operational clarity. During an incident, the platform does not reward the team with extra time just because the monitoring stack is well populated.
This is where many reliability efforts become less effective than they should be. Teams gather far more telemetry than they can use under pressure, then assume visibility must be good because the graphs are detailed. In reality, the first minutes of an incident are usually dominated by much simpler questions. Is this user-visible? Did something change recently? Is the fastest safe move to roll back, scale, restart, fail over, or reduce traffic? Are we looking at a service problem, a dependency problem, or a platform problem?
If the signals do not help answer those questions quickly, then the environment may be observable in a technical sense while still being hard to operate.
That distinction became central to how I thought about reliability. Reliability is not improved by accumulating more data than humans can use. It is improved by making the first operational decisions easier to get right.
4. OOMKilled Pods Were Usually a Truth Problem
One of the most common reliability issues in Kubernetes was also one of the least glamorous: containers getting OOMKilled.
This is one of those failures that is easy to underestimate because the first few occurrences often look like a local application problem. A pod restarts. The service comes back. The deployment remains technically healthy enough that nobody treats it as urgent. Then traffic increases, the pattern repeats, and what looked like a small runtime issue becomes a reliability problem.
In practice, OOMKilled pods were often exposing a gap between the story we had told the cluster and the behavior the application actually had. Requests and limits might have been chosen early, copied from another service, or based on a test environment that never exercised the same memory profile as production. From the scheduler's perspective, the configuration was the truth. From the workload's perspective, the real memory demand was the truth. Production was where those two truths collided.
This mattered because the failure rarely stayed isolated. Restarting pods cause request failures, slow down recovery, and add misleading noise to dashboards. If the service sits behind retries or depends on other components that are also under pressure, the restart loop becomes part of a wider degradation pattern rather than a single bad pod event.
The fix was usually not "give it more memory and move on," at least not if the goal was to improve reliability rather than silence a symptom. The better approach was to look at actual runtime behavior over time, align requests and limits with that behavior, and treat recurring memory pressure as something worth understanding instead of something worth hiding. In some cases the application genuinely needed more headroom. In others, the configuration had simply remained wrong for too long because nobody revisited it after the service matured.
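One way to make that drift visible before the next OOMKill is to alert when sustained working-set memory approaches the configured limit. Here is a minimal sketch, assuming cAdvisor and kube-state-metrics are being scraped by Prometheus; the 90% threshold and the durations are illustrative, not recommendations:

```yaml
# Illustrative Prometheus rule: flag containers whose sustained memory usage
# sits close to their configured limit, before the kernel starts killing them.
groups:
  - name: memory-vs-limits
    rules:
      - alert: MemoryNearLimit
        expr: |
          max_over_time(container_memory_working_set_bytes{container!=""}[30m])
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory"} > 0.9
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: >-
            {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }})
            has spent 30 minutes above 90% of its memory limit
```

Whether 90% is conservative or aggressive depends on the workload. The point is that the comparison between observed behavior and declared limits becomes a first-class signal instead of a post-incident discovery.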
This is one of the reasons I think reliability and platform engineering overlap so heavily. A lot of recurring production pain is not caused by one code defect. It comes from the platform tolerating stale assumptions for too long.
5. Health Checks Broke Services More Often Than Teams Expected
Another recurring class of failure came from liveness, readiness, and startup behavior.
Health checks are a good example of a platform feature that looks simple until it starts making the wrong decision automatically. A readiness probe that turns green too early can expose a service before it is ready to handle traffic. A liveness probe that is too aggressive can turn a slow startup or transient dependency issue into a restart loop. A service that technically starts but depends on a database connection, secret mount, external API, or cache warm-up phase can look healthy to Kubernetes while still being operationally unavailable.
This showed up most clearly after otherwise normal deployments. The rollout completed. ArgoCD showed the application synced. Pods were running. Then error rate and latency started climbing because the new pods were accepting traffic before the application had actually stabilized. From the outside, it looked like a mysterious regression. In reality, the cluster had done exactly what the probes told it to do.
These incidents were a useful reminder that Kubernetes is not judging application readiness intelligently. It is enforcing the contract you define. If that contract is optimistic, shallow, or borrowed from another service with different behavior, the platform will enforce the wrong thing with great consistency.
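As a sketch of what a more honest contract can look like, the fragment below separates slow startup from a dead process and keeps readiness tied to real dependencies. The endpoint paths, image name, and timings are assumptions for illustration:

```yaml
# Probe contract sketch: startupProbe absorbs slow boots, readiness gates
# traffic on real dependencies, liveness only detects a dead process.
containers:
  - name: api
    image: registry.example.com/api:1.2.3   # hypothetical image
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30        # tolerate up to ~150s of startup
    readinessProbe:
      httpGet:
        path: /ready              # should verify DB, cache, downstream deps
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /healthz            # shallow on purpose: process-alive only
        port: 8080
      periodSeconds: 10
      failureThreshold: 6
```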
The improvements here were rarely exotic. Better startup behavior, more realistic readiness checks, and more careful probe timing prevented a surprising amount of avoidable pain. The hard part was not knowing that health checks matter. The hard part was resisting the temptation to treat them as boilerplate.
6. Degradation Was Harder Than Outage
Full outages are ugly, but they are often easier to reason about than partial failure.
If a service is completely down, everyone agrees something is wrong. The incident gets attention quickly. The recovery objective is obvious. Partial degradation is more difficult because the system is still doing enough to confuse people. Requests succeed sometimes. Dashboards show activity. Internal metrics may look acceptable depending on where you are staring. The service is technically up, yet users are clearly having a worse experience.
This kind of problem appeared often enough that it changed how I thought about production signals. Latency spikes, intermittent timeouts, elevated but not catastrophic error rates, and slowly worsening response times were often more operationally dangerous than hard failures because they invited hesitation. Teams started debating whether the incident was real while customers were already experiencing it.
The platform made this harder when alerts were tied too closely to internal component thresholds rather than user-visible symptoms. CPU or memory alerts might fire early, late, or not at all depending on the shape of the failure. Restart counts could be informative but still secondary. What actually mattered during these incidents was usually much closer to the edge: request success, request latency, saturation symptoms, and the timing of recent changes.
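For those incidents, an edge-level signal such as tail latency was usually a better incident entry point than any component threshold. A hedged sketch, assuming a conventional http_request_duration_seconds histogram exists; the metric name and the one-second threshold are assumptions:

```yaml
# Illustrative user-facing latency alert: p99 sustained above one second.
groups:
  - name: user-latency
    rules:
      - alert: HighTailLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for {{ $labels.service }} has been above 1s for 10 minutes"
```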
Once I saw enough of those incidents, I stopped thinking of reliability primarily in terms of uptime. Availability matters, but a service that technically responds while being operationally unreliable is still a reliability problem. Teams that only optimize for "is it up?" miss a large amount of what users actually experience as broken.
7. What Did Not Help
Some of the early responses to reliability problems were well-intentioned but not especially useful.
The first weak instinct was to add more alerts. On paper, this looked responsible. CPU thresholds, memory thresholds, restart thresholds, pod health, node conditions, latency, and error rates all got attention. The result was not better reliability. The result was alert fatigue and slower incident understanding. Multiple alerts described the same underlying issue from different angles, and the people on call learned to distrust the noise before they learned to trust the signal.
The second weak instinct was to build dashboards that were technically rich but operationally unhelpful. Grafana made it easy to create detailed views, and detailed views are often satisfying to build. That did not mean they were useful during live incidents. A dashboard that requires careful interpretation under pressure is not much of an incident tool. In several cases, the most detailed dashboards were the least helpful because they invited analysis before the service had been stabilized.
The third weak instinct was debugging too early.
This is a very common engineer instinct. Something breaks, and the team immediately wants root cause. That impulse is understandable, especially for capable engineers who do not like uncertainty. But during a live reliability event, early debugging often competes with the more important goal: reduce blast radius and restore sane behavior as quickly as possible. If the service is degraded, the priority should be stabilization. Root cause analysis matters, but it matters more after the system is no longer actively hurting users.
None of these were useless practices in themselves. Monitoring matters. Dashboards matter. Debugging matters. They just mattered in the wrong order when an incident was already in progress.
8. The Response Model That Worked Better
The response model that helped most was conceptually simple and operationally disciplined.
First, decide whether the issue is user-visible and whether it is getting worse. That sounds obvious, but it immediately changes how you prioritize. Not every alert deserves the same level of urgency. Not every odd graph shape is an incident. The faster the team can answer "is this affecting users right now?" the more sensible the next decision becomes.
Second, stabilize before investigating deeply. If a recent deployment is the likely cause, roll it back or revert it in Git and let the environment reconcile. If traffic needs to be reduced, do that. If a service is clearly under-provisioned and adding headroom is the safest move, do that. If a specific bad instance or pod set is making things worse, replace it. The point is not to guess wildly. The point is to prefer reversible actions that reduce harm.
Third, correlate aggressively. Reliability incidents often sit close to a recent deployment, config change, dependency change, traffic pattern shift, or platform event. GitLab history, ArgoCD sync history, Prometheus metrics, and logs all become more useful once the team has stabilized the situation enough to read them in sequence instead of in panic.
Only after that did deeper investigation become worthwhile. At that stage, the team could ask better questions. Was the failure mode exposed by a bad rollout setting, a resource mismatch, a dependency contract, or something the service was doing under real load that tests never captured? Those are good questions. They are just not always the first questions.
This sounds like straightforward incident discipline because it is. The difference is that a surprising amount of production pain comes from not following it consistently.
9. Reliability Improved When Signals Became More Actionable
One of the biggest improvements came from treating alert quality as a reliability concern in its own right.
The simplest filter I found was also the most useful: if this alert fires, what should someone actually do next? If the answer was vague, theoretical, or "it depends, go investigate," the alert probably was not good enough to interrupt a human.
That immediately changed the shape of the alerting model. User-impact signals mattered more than internal discomfort signals. Error rate, latency, and service availability generally deserved more attention than raw CPU, memory, or restart counts on their own. That did not make internal metrics irrelevant. It made them supporting evidence rather than primary incident entry points in many cases.
Prometheus and Grafana were already capable of showing the necessary data. The work was in reducing the distance between a signal and an operational decision. Good alerts did not merely say that the platform was behaving strangely. They narrowed the likely problem enough that the team could decide whether to roll back, scale, pause, or escalate.
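A sketch of what that looked like in practice follows. The metric name, threshold, and annotation text are assumptions, but the structure is the point: the alert fires on user impact and tells the responder what to check first.

```yaml
# Illustrative user-impact alert with an explicit "what to do next" annotation.
groups:
  - name: user-impact
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.service }} is failing more than 5% of requests"
          action: >-
            Check recent syncs in ArgoCD and pipelines in GitLab. If a change
            correlates, roll back first and investigate afterwards.
```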
This was also where reliability and observability stopped being interchangeable in my head. Observability is the broader ability to inspect the system. Reliability improves when the important parts of that visibility are structured into signals that help people act correctly under pressure.
10. Standardization Removed Entire Classes of Incidents
One of the less glamorous but more effective reliability improvements came from standardization.
By this point in the platform journey, a lot of the earlier work had already been about reducing repeated decisions. Golden templates, clearer delivery paths, controlled GitOps flows, and environment standards all helped application teams avoid re-solving the same platform questions from scratch. Reliability benefited from the same approach.
When probes, resource defaults, rollout patterns, and exposure models were left entirely to individual interpretation, recurring incidents multiplied. Not because teams were careless, but because production behavior is hard to predict and every service was effectively inventing its own operational contract. Once better defaults existed, entire categories of failure became less common.
That did not mean forcing every service into exactly the same shape. Some workloads genuinely needed different tuning. But it did mean that the platform could stop treating obviously failure-prone decisions as if they deserved infinite flexibility. Conservative defaults around readiness, reasonable requests and limits, safer rollout behavior, and consistent service patterns reduced the number of times the same incident had to be relearned under a new name.
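One small example of what a conservative default can look like at the platform level: a LimitRange that gives containers sane requests and limits when a team has not declared any. The namespace and values are illustrative:

```yaml
# Namespace-level fallback: containers without explicit requests/limits
# inherit conservative values instead of running unconstrained.
apiVersion: v1
kind: LimitRange
metadata:
  name: conservative-defaults
  namespace: team-apps          # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: "1"
        memory: 512Mi
```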
Reliability is often discussed as if it were mainly an incident management discipline. In practice, it improves a lot when the platform quietly prevents repeated operational mistakes from reaching production at all.
11. Example: A Deployment That Looked Healthy and Wasn't
One reliability pattern I saw more than once was a deployment that appeared completely normal from the delivery pipeline's perspective while being operationally wrong in production.
The GitLab pipeline passed. ArgoCD synced the new version. Kubernetes showed running pods. On paper, this looked like success. Then latency started rising and a portion of requests began failing because the new pods were reporting readiness before a dependency path had actually settled. Sometimes that was a database connection path. Sometimes it was a downstream internal service. Sometimes it was a startup routine that technically launched the process but had not finished the real work needed before serving traffic.
This kind of incident was useful because it exposed a gap between delivery success and runtime readiness.
The immediate handling was usually straightforward once the pattern was recognized. Revert or roll back the recent change, let the stable version recover service, and confirm the symptoms disappear. The more important work happened afterwards: make the readiness contract more honest, revisit startup behavior, and stop treating pod status as if it were the same thing as application health.
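Part of closing that gap can be expressed directly in the rollout contract. The fragment below is a sketch with assumed values: minReadySeconds forces new pods to hold Ready for a window before they count as available, and a zero maxUnavailable keeps the stable version serving in the meantime.

```yaml
# Rollout settings that make "Ready" mean something before old pods go away.
spec:
  minReadySeconds: 30            # a new pod must stay Ready for 30s to count
  progressDeadlineSeconds: 600   # mark the rollout failed if it stalls
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0          # never take stable capacity away early
```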
A surprising amount of reliability work is about closing exactly that gap. Platforms are very good at telling you whether they executed your instructions. They are much less capable of telling you whether your instructions represented reality.
12. Example: A Service That Failed Only During Peak Hours
Another common pattern was the service that behaved acceptably most of the time and then fell apart during the period when users actually needed it most.
In one form or another, this often came down to memory pressure, concurrency assumptions, or request behavior that looked fine in low-volume conditions and much worse during the daily peak. Outside those windows, the service appeared stable enough that the configuration passed review and the urgency stayed low. During peak usage, pods restarted, latency climbed, and the service began to look unreliable in a way that was highly visible to users even though it never became completely unavailable.
The first temptation in those incidents was to treat them as purely application-level defects. Sometimes they were. Just as often, the platform configuration was part of the story. Requests and limits were too optimistic. Autoscaling thresholds did not line up with the actual pressure signal. The team had reasonable telemetry, but not the habit of revisiting it against real production load.
What helped was correlating the runtime pattern with the actual traffic window instead of only staring at isolated pod failures. Once the service behavior was understood in the context of user demand, the fix usually became clearer. Adjust headroom where it was genuinely needed. Align the resource model with observed usage instead of inherited defaults. Then watch whether the improvement survives the next peak rather than declaring success immediately.
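When the bottleneck was genuinely capacity, part of the fix was pointing the autoscaler at the signal that actually mattered during the peak. A sketch, assuming memory is the real pressure signal for this service; the workload name, replica counts, and target are illustrative:

```yaml
# HPA sketch keyed to the resource that actually saturates at peak.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api             # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
```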
This is another area where reliability stops being theoretical very quickly. Production rarely cares whether the configuration was written with good intentions. It cares whether the service survives the period when it is actually needed.
13. Example: The Alert Storm Was Not the Incident
Some of the worst on-call moments were not caused by one massive platform failure. They were caused by a manageable failure arriving through an alerting model that made it look chaotic.
One service degraded. That should have been the incident. Instead, the response began with a flood of related alerts from pod restarts, node symptoms, latency alarms, downstream retry patterns, and secondary errors from services that depended on the original failing path. The team was not short on data. It was short on a clean entry point into the problem.
This is where alert design proved to be directly relevant to reliability rather than a separate observability concern. A noisy system does not merely annoy the on-call engineer. It delays the point at which the real incident gets understood accurately.
The solution was not to suppress everything. It was to distinguish between the alert that should open the incident and the supporting signals that help explain it once someone is already looking. A user-visible symptom should generally start the conversation. Lower-level component symptoms should enrich it, not compete with it.
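Alertmanager's inhibition rules are one concrete way to encode that relationship. A minimal sketch, assuming alerts carry severity and service labels:

```yaml
# Inhibition sketch: while the paging, user-visible alert for a service is
# firing, suppress the warning-level component alerts that share its label.
inhibit_rules:
  - source_matchers:
      - severity="page"
    target_matchers:
      - severity="warning"
    equal: ["service"]
```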
Once the alerting model moved in that direction, incident response got calmer very quickly. The platform had not become magically more reliable overnight. The team had simply stopped tripping over its own instrumentation on the way to the real issue.
14. The Trade-Offs Were Real
Reliability work is full of trade-offs that are easy to state and harder to live with.
More sensitive alerts can detect trouble earlier, but they also create noise and unnecessary interruption if they are not designed carefully. More dashboards can make the environment richer to inspect, but they can also slow decision-making if the operational path through them is unclear. Standardization reduces repeated mistakes, but too much rigidity can make it harder for services with genuinely unusual needs to operate correctly. Faster recovery actions, such as rollback, can reduce user pain quickly, but they may delay full understanding if the team never comes back for proper analysis afterwards.
There is also a deeper trade-off between elegance and operability. Engineers naturally like clean explanations and precise root cause. Production often rewards teams that can take safe, imperfect, stabilizing action before the whole story is known. That can feel unsatisfying in the moment, but it is usually the more mature posture.
I do not think reliability improves by pretending these trade-offs disappear. It improves when the platform and the team make them consciously instead of backing into them during the middle of an incident.
15. What I Would Do Earlier
Looking back, there are a few things I would push much earlier in the lifecycle of a platform.
I would define alerting principles sooner and make teams defend why an alert deserves to wake a human. I would standardize resource defaults and health-check patterns earlier, especially for services entering Kubernetes for the first time. I would spend more time teaching the difference between deployment success and runtime readiness because that misunderstanding causes more production pain than many teams realize. I would also make incident handling discipline more explicit, especially the habit of stabilizing first and investigating second.
Most importantly, I would treat recurring operational symptoms as design feedback much earlier. Repeated pod restarts, repeated memory pressure, repeated dependency startup issues, and repeated alert storms are usually not just bad luck. They are the platform telling you that a default, a contract, or a habit is wrong.
The earlier that feedback is taken seriously, the less often reliability work turns into repeated firefighting.
16. Why This Still Felt Like Platform Engineering
What made this reliability work meaningful was that it was never just about being better at incidents.
The incidents mattered, but the bigger lesson was how much of reliability is shaped before the incident starts. Platform defaults, workload contracts, alert quality, rollout behavior, resource conventions, and the clarity of the recovery path all influence whether failure stays small or becomes expensive. That is why I think reliability belongs naturally inside platform engineering. It is not only about operating the system after something breaks. It is also about designing the system so that common failures are easier to survive.
By this point in the broader series, that pattern should feel familiar. The landing zone work was about making cloud structure governable. The platform work was about reducing developer dependence on raw infrastructure. The networking work was about making private connectivity operable. The GitOps work was about making deployment state understandable. The multi-environment work was about separating change flows honestly. Reliability was another version of the same underlying discipline: remove ambiguity, make the important paths more predictable, and design the platform so that people can recover sensibly when reality stops matching the diagram.
The goal was never zero failure.
The goal was a platform where failure becomes visible quickly, signals remain trustworthy, and recovery is disciplined enough that the system earns trust again.