<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ammar's devops]]></title><description><![CDATA[Ammar's devops]]></description><link>https://blog.ammarplatform.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 18:40:45 GMT</lastBuildDate><atom:link href="https://blog.ammarplatform.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Observability in Practice: Noise, Signals, and Alerts in Production]]></title><description><![CDATA[1. Observability Was Not the Same Thing as Instrumentation
By the time observability became a serious topic, the platform already had most of the building blocks you would expect. Prometheus was scrap]]></description><link>https://blog.ammarplatform.com/observability-in-practice-noise-signals-and-alerts-in-production</link><guid isPermaLink="true">https://blog.ammarplatform.com/observability-in-practice-noise-signals-and-alerts-in-production</guid><category><![CDATA[observability]]></category><category><![CDATA[#prometheus]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[Grafana Monitoring]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Mon, 02 Feb 2026 09:30:00 GMT</pubDate><content:encoded><![CDATA[<h2>1. Observability Was Not the Same Thing as Instrumentation</h2>
<p>By the time observability became a serious topic, the platform already had most of the building blocks you would expect. Prometheus was scraping metrics. Grafana was full of dashboards. Graylog was collecting logs. Alerts existed. Teams channels and email routes existed. On paper, that sounds like observability.</p>
<p>It was not, at least not automatically.</p>
<p>The thing that production taught me very quickly was that collecting data and understanding a system are not the same activity. A platform can be full of telemetry and still be hard to operate. In fact, a lot of noisy environments have exactly that problem: they produce more information than the humans responding to incidents can use.</p>
<p>That is why I stopped thinking about observability mainly as a tooling topic. The stack mattered, of course. Prometheus, Grafana, and Graylog each solved real problems. But the more important question was operational rather than technical. When something starts going wrong in production, does the observability model help the team understand the issue quickly enough to reduce impact? Or does it bury the team in signals that are technically correct and operationally unhelpful?</p>
<p>That distinction mattered more than the tooling itself. The stack was there to reduce ambiguity. If it created more of it, then it was not doing the job as well as it looked from a diagram.</p>
<h2>2. The Problem Was Never Lack of Data</h2>
<p>The early instinct in most teams is easy to recognize. Something breaks, or an incident is harder to debug than it should have been, so the response is to add more metrics, more alerts, more dashboards, and more logs.</p>
<p>That instinct sounds responsible. In practice, it often makes the environment harder to operate.</p>
<p>The reason is simple. Most production systems do not suffer because they have too little telemetry. They suffer because the telemetry is not organized around decision-making. Engineers under pressure do not need infinite detail. They need a reliable path through the detail. They need to know what is user-visible, what changed recently, what is likely causal versus merely correlated, and what action is safest right now.</p>
<p>Without that structure, observability degrades into accumulation. A dashboard exists because it might be useful one day. A metric is scraped because Prometheus can scrape it. A log stream is retained because someone might need it later. An alert fires because the threshold exists, not because waking a human is warranted. Eventually the stack becomes rich in data and poor in guidance.</p>
<p>That was the point where I started treating observability as an operating interface for production rather than as a reporting layer. The question was not whether the platform knew a lot about itself. The question was whether the people responsible for it could make better decisions because of that knowledge.</p>
<h2>3. Prometheus, Grafana, and Graylog Had Different Jobs</h2>
<p>One of the more useful shifts was getting more disciplined about what each tool was actually for.</p>
<p>Prometheus was the signal source. It was where the most useful production symptoms first became visible. Error rate, latency, saturation, resource pressure, restart patterns, and workload health all showed up there before anyone had a full explanation. Prometheus was good at telling the team that the system's behavior had changed and that something worth attention might be happening.</p>
<p>Grafana was the investigation surface. Once Prometheus or an alert indicated that something was wrong, Grafana helped answer the next layer of questions. Is this isolated to one service or broader? Did latency climb before or after the rollout? Is memory use growing steadily or spiking sharply? Is one namespace unhealthy, or is the whole cluster under pressure? In other words, Grafana helped shape the problem.</p>
<p>Graylog was the explanation layer. Metrics showed that behavior had shifted. Dashboards narrowed the scope. Logs were often where the raw narrative became visible. Exceptions after a rollout, dependency timeouts, authentication failures, bad configuration values, recurring connection errors, or repeated application-level faults became much easier to interpret once the time window and affected scope were already known.</p>
<p>This separation sounds obvious when written down, but it made a real operational difference. Without it, teams tend to expect every observability tool to answer every kind of question. Then they become disappointed when metrics do not explain root cause, dashboards do not tell them what to do, or logs are too overwhelming to use as an entry point.</p>
<p>The tools were complementary, not interchangeable. Once that became clear, the platform was easier to operate under pressure.</p>
<h2>4. Dashboards and Alerts Were Not the Same Thing</h2>
<p>One of the strongest practical lessons was that dashboards and alerts need to serve different purposes.</p>
<p>A dashboard is for understanding. It gives context, trends, and shape. It lets an engineer investigate, compare, and reason about behavior. An alert is for action. It interrupts someone because the system believes human attention is required now.</p>
<p>When those two roles get blurred, the observability model starts working against the people using it.</p>
<p>The easiest way to see that failure mode is in alert design. Teams often turn any technically interesting threshold into a notification because it feels safer to be told more. CPU spikes, memory movement, restarts, short-lived saturation, noisy log bursts, and local anomalies all become alerts. Eventually the alert stream stops representing urgency and starts representing everything the platform happens to notice about itself.</p>
<p>That is operationally destructive. It teaches engineers that a notification does not necessarily mean a decision is needed. Once that trust is gone, the signal-to-noise problem is no longer theoretical. It is embedded in human behavior.</p>
<p>The cleaner rule was much simpler: a proper alert means human action is required now. If the signal does not ask for a decision, it probably belongs somewhere else. It may still belong in a dashboard. It may still matter for daytime review. It may still deserve a ticket, a weekly summary, or a trend report. But it should not compete with real production signals for human attention.</p>
<p>That distinction turned out to be one of the most important parts of making observability useful instead of merely complete.</p>
<h2>5. Noise Was a Human Systems Failure, Not Just a Technical One</h2>
<p>I do not think alert noise is mainly a monitoring flaw. I think it is a human systems design flaw.</p>
<p>When a production environment generates too many notifications, the problem is not just that the tooling is verbose. The deeper problem is that the platform has lost the ability to express urgency clearly. Engineers begin to receive the same delivery mechanism for very different classes of events. Something mildly interesting and something user-visible arrive through the same channel, with similar language, at similar times, and eventually they are treated with similar skepticism.</p>
<p>That is how teams end up waking people at night for things that could have waited until morning, while also missing the early shape of incidents that truly mattered.</p>
<p>The most useful framing I found was this: noise means the system is talking, but nobody needs to act yet. A proper alert means the system is asking for intervention. That sounds almost too simple, but it cleaned up a lot of confusion very quickly because it forced every candidate alert to justify itself in human terms rather than technical terms.</p>
<p>A sustained user-facing latency breach, a material error-rate increase on a critical path, or service unavailability clearly fit that bar. A single pod restart, a brief CPU excursion, or one noisy error pattern without visible impact usually did not. Those lower-level signals still mattered. They just mattered as context or investigation inputs, not as primary incident entry points.</p>
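<p>That bar can be encoded directly in the alerting rules themselves. The sketch below is illustrative only: the metric names, labels, and thresholds are placeholders, not this platform's actual configuration. The key detail is the <code>for:</code> clause, which enforces the "sustained" part so that a brief excursion never pages anyone.</p>

```yaml
# Hypothetical Prometheus alerting rules. Metric names, labels, and
# thresholds are placeholders, not the platform's real configuration.
groups:
  - name: user-facing-slos
    rules:
      - alert: HighUserFacingLatency
        # p95 latency per service over 5-minute windows
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{tier="user-facing"}[5m])) by (le, service)
          ) > 1.5
        # "for" is what makes the breach sustained rather than momentary
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 latency for {{ $labels.service }} has been above 1.5s for 10 minutes"
      - alert: HighErrorRate
        # error ratio on the critical path, not raw error count
        expr: |
          sum(rate(http_requests_total{tier="user-facing", status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total{tier="user-facing"}[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: page
```

<p>Both rules alert on user-visible symptoms (latency, error ratio), not on the lower-level signals that merely explain them.</p>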
<p>Once the team started treating alerting as a human trust problem rather than a metric-threshold problem, the observability model improved much faster.</p>
<h2>6. Delivery Channels Needed Different Meanings</h2>
<p>Another detail that mattered more than many teams admit was where the signals were sent.</p>
<p>Not every alert belongs in the same channel, and not every channel carries the same meaning. If everything is delivered everywhere, the system is not becoming more visible. It is becoming more repetitive.</p>
<p>In practice, Teams and email served different roles. Teams worked well for shared operational awareness during working hours, for degraded conditions that were worth watching but not yet severe, and for keeping the platform team aligned during an active incident. It was a good place for visibility that might lead to action, but did not always justify immediate interruption.</p>
<p>Email had a different shape. It was slower and more durable, which made it more appropriate for wider distribution, summaries, persistent records, and notifications that needed to be visible beyond the engineers actively sitting in operational chat. Email was not the right medium for urgent real-time response, but it was often the better place for structured visibility that should not vanish into a busy chat stream.</p>
<p>The point was not the tools themselves. The point was that delivery path should match urgency and ownership. Once that mapping was clearer, the notification model became easier to trust because the route itself carried meaning. If a signal arrived in one place rather than another, engineers already had a better hint about how seriously to treat it.</p>
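<p>One way to make the route itself carry meaning is severity-based routing in Alertmanager. A minimal sketch, assuming a <code>severity</code> label on every alert; the receiver names, addresses, and the webhook bridge for Teams are all placeholders:</p>

```yaml
# Hypothetical Alertmanager routing: receivers and URLs are placeholders.
route:
  receiver: teams-ops            # default: shared awareness, not interruption
  routes:
    - matchers:
        - severity = "page"
      receiver: oncall-email     # urgent: human action required now
      repeat_interval: 1h
    - matchers:
        - severity = "ticket"
      receiver: daily-email      # durable record, reviewed in working hours

receivers:
  - name: teams-ops
    webhook_configs:
      # e.g. a bridge forwarding to a Microsoft Teams incoming webhook
      - url: https://alerts.example.internal/teams-bridge
  - name: oncall-email
    email_configs:
      - to: oncall@example.com
  - name: daily-email
    email_configs:
      - to: platform-team@example.com
```

<p>The design choice is that severity is decided once, at the rule, and the delivery path follows from it mechanically.</p>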
<p>Observability gets much calmer when the channels stop competing with each other.</p>
<h2>7. The Alerting Rule That Changed Everything Was Very Simple</h2>
<p>The most useful internal rule I found was also the least sophisticated.</p>
<p>If an alert fires, someone should immediately understand what kind of decision it is asking for.</p>
<p>That decision might be to roll back a deployment. It might be to investigate a user-facing service urgently. It might be to confirm whether autoscaling is failing, whether a dependency is down, or whether traffic should be shifted or reduced. It might even be to acknowledge that the event is informational and no urgent action is needed. But the class of decision should be obvious.</p>
<p>If the first reaction to an alert is "that is interesting," then it probably does not belong in an urgent alert stream. Interesting is what dashboards, trends, and daily review loops are for. Urgent alerts should create operational clarity, not intellectual curiosity.</p>
<p>This rule also helped keep incident response disciplined. During a real production problem, the right sequence is usually stabilize first, investigate second. Good alerts supported that sequence because they pointed toward the safest next operational move. Bad alerts disrupted it because they dragged the team into analysis before the situation was under control.</p>
<p>I did not need a more elegant rule than that. The practical value came from applying it consistently.</p>
<h2>8. Good Dashboards Were Smaller and More Opinionated Than I Expected</h2>
<p>Grafana made it very easy to build large, ambitious dashboards, and that was part of the problem.</p>
<p>At some point most teams realize they can graph almost everything: request rate, error rate, latency, CPU, memory, disk, network, pod count, node state, queue depth, database health, ingress trends, deployment history, namespace saturation, and any custom application metric they can expose. That can produce dashboards that look comprehensive and feel reassuring.</p>
<p>The issue is that incident dashboards are not museums. Their job is not to display everything the platform knows. Their job is to shorten the path from confusion to the next good question.</p>
<p>The dashboards that actually helped in production were usually much smaller. A good service dashboard answered, in order, whether the service was healthy from a user perspective, whether something had changed recently, whether the issue looked local or systemic, and whether the bottleneck was more likely to be traffic, compute, memory, rollout behavior, or a downstream dependency.</p>
<p>That meant prioritizing error rate, latency, throughput, saturation indicators, rollout markers, and a handful of supporting resource trends. Everything else had to justify its place.</p>
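<p>That short list of panels maps to a correspondingly short list of queries. The PromQL below is a sketch; the metric and label names are assumptions, not the platform's real series:</p>

```promql
# Illustrative PromQL for a small, opinionated service dashboard.
# Metric and label names are placeholders.

# 1. Healthy from a user perspective? (error ratio)
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="checkout"}[5m]))

# 2. Latency (p95)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))

# 3. Throughput
sum(rate(http_requests_total{service="checkout"}[5m]))

# 4. Saturation (CPU usage against limits)
sum(rate(container_cpu_usage_seconds_total{namespace="checkout"}[5m]))
  / sum(kube_pod_container_resource_limits{namespace="checkout", resource="cpu"})
```

<p>Four questions, four queries, plus rollout annotations. Anything beyond that had to earn its panel.</p>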
<p>This was one of those lessons that sounds stylistic until you feel the difference in an incident. Large dashboards make engineers scroll. Smaller dashboards make engineers decide.</p>
<h2>9. Logs Were Essential, but Rarely a Good Starting Point</h2>
<p>Graylog was extremely valuable, but only when it was used at the right point in the flow.</p>
<p>Logs are where a lot of the raw explanation lives. Exceptions, dependency failures, authentication problems, configuration mistakes, and rollout-specific errors often become obvious there. But logs are also the fastest way to drown in detail if you start with them too early.</p>
<p>The pattern that worked best was consistent enough that it became a habit. Use an alert or symptom to confirm that something user-visible may be happening. Use Prometheus and Grafana to narrow the scope, affected service, and time window. Then use Graylog to explain what the service or dependency was actually doing inside that narrowed window.</p>
<p>Once that narrowing had happened, Graylog became much more useful. Repeated exceptions across pods, timeouts to a specific dependency, a bad configuration value introduced during rollout, or a sudden shift in one class of application errors could usually be spotted much more quickly. Without that narrowing, the log surface was simply too large and too mentally expensive to treat as the first operational step.</p>
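<p>In practice, the narrowed window translated into a very small Graylog search rather than open-ended log reading. A hypothetical example in Graylog's Lucene-style search syntax, with placeholder field names and values, run against the time range already identified in Grafana:</p>

```text
# Scope first: the affected service and namespace, errors only
kubernetes_namespace:checkout AND source:checkout-api AND level:ERROR

# Tighten once a pattern appears, e.g. dependency timeouts
kubernetes_namespace:checkout AND message:"connection timed out"
```

<p>The queries are trivial on purpose. The hard work of narrowing had already been done upstream by metrics and dashboards.</p>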
<p>I think this is one of the more underrated observability lessons in Kubernetes environments. Teams often collect logs successfully long before they learn how to use them efficiently under pressure.</p>
<h2>10. Example: A Latency Spike Was Not a Logging Problem</h2>
<p>One recurring pattern looked something like this. A service suddenly showed a sustained latency increase on a user-facing path. The first temptation, especially from people who knew the application well, was to dive straight into logs and start reading exceptions or request traces at random.</p>
<p>That rarely worked well.</p>
<p>What worked better was to begin with the signal. Prometheus showed that latency had crossed a meaningful threshold and stayed there long enough to matter. Grafana then helped narrow the shape of the problem. Did the increase start directly after a deployment? Was it isolated to one service or visible in downstream dependencies too? Was throughput increasing at the same time? Was resource pressure building, or did the application look healthy from a CPU and memory perspective?</p>
<p>Only once that picture was clearer did Graylog become the right tool. At that point, the logs might show repeated dependency timeouts, exceptions after a specific configuration change, or a failing path that matched the exact interval visible in Grafana. The value of the logs came from the fact that the earlier tools had already made the search tractable.</p>
<p>The lesson was simple: logs are often the explanation layer, not the detection layer. Treating them as the entry point slowed incident understanding more often than it sped it up.</p>
<h2>11. Example: The Alert Storm Was Not the Real Incident</h2>
<p>Another pattern showed up when one service began failing noisily enough to drag half the platform into the conversation.</p>
<p>What should have been one incident often arrived as an alert storm. Error rate alarms fired. Latency alarms fired. Pod restart alerts fired. Resource pressure warnings fired. Downstream services started reporting secondary symptoms. On paper, the monitoring stack was doing a thorough job. Operationally, it was creating confusion about whether there were multiple incidents or one incident with many side effects.</p>
<p>This is where the difference between symptom alerts and supporting telemetry became critical.</p>
<p>The cleaner approach was to let a user-impacting signal open the incident and let the lower-level signals support understanding once someone was already investigating. That did not mean suppressing the rest of the data. It meant refusing to give every derivative symptom the same status as the primary problem.</p>
<p>Once that shift happened, incidents became much easier to reason about. The platform had not necessarily become more stable in the moment, but the team was no longer losing time untangling its own instrumentation before getting to the actual issue.</p>
<p>This was one of the clearest examples of observability affecting reliability directly. A noisy stack does not only annoy engineers. It delays correct action.</p>
<h2>12. Example: A Rollout Looked Fine Until the Logs Told the Truth</h2>
<p>Some of the most instructive production incidents started with a deployment that appeared healthy at first glance.</p>
<p>ArgoCD showed the new version as synced. Pods were running. Basic platform health looked acceptable. Then user-facing behavior started drifting in the wrong direction. Error rates moved up or latency worsened just enough that something was clearly off.</p>
<p>Metrics and dashboards were still the first useful tools here because they answered the immediate questions. Did the change line up with the deployment? Was the issue concentrated in one service? Was the service under unusual resource pressure or was the shape of failure pointing somewhere else? Once that scope was narrowed, Graylog usually exposed the explanation much faster than raw graph-reading could. A dependency started timing out. A new configuration path was invalid. One class of exception exploded immediately after rollout. Something that looked like a generic service regression was often much more specific once the logs were being read inside the right context.</p>
<p>This kind of incident reinforced the same point again and again: observability works best as a sequence. Signals first. Shape second. Explanation third. When the team followed that sequence, production got easier to reason about.</p>
<h2>13. What I Stopped Alerting On Changed the Quality of the Whole System</h2>
<p>One of the most meaningful improvements came not from adding new signals, but from removing or downgrading weak ones.</p>
<p>Single pod restarts by themselves rarely deserved urgent escalation. Brief CPU spikes without any visible user impact usually belonged in trend review, not in the middle of a working day. One noisy error class that self-resolved without changing availability or latency generally did not need to interrupt people in real time. The platform still observed those conditions. It simply stopped pretending they all carried the same urgency.</p>
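<p>Downgrading a signal did not mean deleting it; it usually meant raising the bar and lowering the severity. A sketch of the restart signal rewritten as a trend-level rule (the kube-state-metrics metric name is real; the threshold and labels are assumptions):</p>

```yaml
# Restarts only become a signal when they repeat, and even then they
# open a ticket rather than paging anyone. Threshold is illustrative.
groups:
  - name: platform-trends
    rules:
      - alert: PodRestartLoop
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 15m
        labels:
          severity: ticket   # routed to daytime review, not to on-call
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in the last hour"
```

<p>A single restart still shows up in dashboards and logs. It just no longer competes with user-facing signals for attention.</p>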
<p>That change improved more than the alert stream. It improved trust.</p>
<p>Once engineers saw that an alert usually meant a real decision might be needed, the observability model started working with them instead of against them. Teams channels became more readable. Email became more meaningful. Escalation stopped competing with background commentary. The platform was not quieter because it knew less. It was quieter because it had become more deliberate about what deserved interruption.</p>
<p>This is one of the reasons I think observability maturity is measured as much by what a team removes as by what it adds.</p>
<h2>14. The Trade-Offs Were Real</h2>
<p>Observability has its own trade-offs, and pretending otherwise usually produces bad systems.</p>
<p>If alerts are too sensitive, the platform detects trouble earlier at the cost of noise and distrust. If they are too conservative, the stack stays quiet longer while genuine problems gather user impact. If dashboards are too broad, they become hard to use. If they are too narrow, they may miss useful context. If log collection is too limited, explanation becomes difficult. If it is too broad and poorly structured, the platform pays a mental and sometimes financial cost for data nobody can use effectively.</p>
<p>There is also a trade-off between completeness and operability. Engineers often prefer the idea of seeing everything. People handling incidents usually benefit more from seeing the right things in the right order.</p>
<p>I do not think those trade-offs disappear. I think good observability comes from acknowledging them early and designing around human response patterns rather than around tool capability alone.</p>
<h2>15. What I Would Do Earlier</h2>
<p>Looking back, I would push a few things much earlier in the lifecycle of a platform.</p>
<p>I would define the distinction between alerts and informational signals from the beginning instead of letting the alert stream become crowded first and cleaning it later. I would make dashboard design more opinionated sooner, especially for service-level views used during incidents. I would teach teams earlier that logs are most powerful after scope has been narrowed rather than at the beginning of a production mystery. I would also make delivery channels more intentional from the start so that Teams, email, and true urgent notifications never drift into the same semantic bucket.</p>
<p>Most of all, I would treat observability design as part of platform design from day one, not as something that gets layered on once the services are already running.</p>
<p>The earlier the platform starts expressing urgency and context cleanly, the less often engineers have to learn those lessons under pressure.</p>
<h2>16. Why This Still Felt Like Platform Engineering</h2>
<p>This work mattered because it was not only about better dashboards or cleaner alerts. It was about making the production environment easier to understand and safer to operate.</p>
<p>That is why I think observability belongs naturally inside platform engineering. A platform is not complete when workloads can be deployed. It becomes much more useful when the people responsible for those workloads can tell what is happening, what is urgent, and what to do next without fighting their own instrumentation first.</p>
<p>Across the rest of the series, the same pattern keeps showing up in different forms. The landing zone work was about clear cloud boundaries. The developer platform work was about reducing cognitive load for application teams. The networking work was about making private infrastructure usable. The GitOps work was about making deployment state understandable. The reliability work was about building safer response habits around failure. Observability sits directly beside that. Its job is to turn telemetry into operational clarity.</p>
<p>The platform did not become better because Prometheus scraped more metrics or because Graylog held more logs. It became better when the important signals became easier to trust and the path from signal to action became shorter.</p>
<p>That is what observability actually helped with in production.</p>
]]></content:encoded></item><item><title><![CDATA[Building a Kubernetes Platform on AKS: Private Clusters, GitOps, and Workload Separation]]></title><description><![CDATA[In this article, I explain how I designed and implemented a private AKS-based platform with clear separation between platform and workload clusters.
The focus is on real-world decisions around network]]></description><link>https://blog.ammarplatform.com/building-a-kubernetes-platform-on-aks-private-clusters-gitops-and-workload-separation</link><guid isPermaLink="true">https://blog.ammarplatform.com/building-a-kubernetes-platform-on-aks-private-clusters-gitops-and-workload-separation</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Azure]]></category><category><![CDATA[aks]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[gitops]]></category><category><![CDATA[cloud architecture]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Tue, 13 Jan 2026 08:00:00 GMT</pubDate><content:encoded><![CDATA[<p>In this article, I explain how I designed and implemented a private AKS-based platform with clear separation between platform and workload clusters.</p>
<p>The focus is on real-world decisions around networking, GitOps, security, and operating models rather than theoretical architecture.</p>
<h2>1. Introduction</h2>
<p>After setting up the Azure landing zone and defining the platform structure, the next step was to enable teams to run workloads on Kubernetes in a controlled and scalable way.</p>
<p>The goal was not just to create AKS clusters, but to design a <strong>platform model</strong> where:</p>
<ul>
<li><p>infrastructure and platform tooling are separated from workloads</p>
</li>
<li><p>deployments are consistent and controlled</p>
</li>
<li><p>teams can use Kubernetes without needing deep expertise</p>
</li>
</ul>
<p>This led to designing a <strong>multi-cluster architecture</strong>, where different clusters had clearly defined responsibilities.</p>
<h2>2. What the Platform Needed to Solve</h2>
<p>The platform had to support:</p>
<ul>
<li><p>secure Kubernetes clusters without public exposure</p>
</li>
<li><p>separation between platform tooling and application workloads</p>
</li>
<li><p>consistent deployment patterns</p>
</li>
<li><p>integration with existing GitLab CI workflows</p>
</li>
<li><p>centralized secrets management</p>
</li>
<li><p>onboarding teams with minimal Kubernetes knowledge</p>
</li>
</ul>
<p>Instead of a single cluster or loosely structured setup, I needed a model that would scale cleanly across environments and teams.</p>
<h2>3. AKS Architecture: Platform vs Workload Clusters</h2>
<img src="https://cdn.hashnode.com/uploads/covers/63837243df107a0ef5751e3b/3eb5aebe-5a94-4ed5-a1e8-28c1739761b6.png" alt="" style="display:block;margin:0 auto" />

<p>The core design decision was to separate <strong>platform clusters</strong> from <strong>workload clusters</strong>.</p>
<h3>Platform clusters</h3>
<p>These were hosted under the <strong>Platform management group</strong>, with different subscriptions:</p>
<ul>
<li><p><code>platform_test</code></p>
</li>
<li><p><code>platform_nonprod</code></p>
</li>
<li><p><code>platform_prod</code></p>
</li>
</ul>
<p>Each of these had its own AKS cluster:</p>
<ul>
<li><p><code>platform_test AKS</code></p>
</li>
<li><p><code>platform_nonprod AKS</code></p>
</li>
<li><p><code>platform_prod AKS</code></p>
</li>
</ul>
<h3>Responsibilities of platform clusters</h3>
<p>These clusters acted as the <strong>platform control layer</strong>, not workload environments. They did not run application workloads. Instead, they hosted shared <strong>platform-level components</strong>, such as:</p>
<ul>
<li><p>ArgoCD (GitOps control plane)</p>
</li>
<li><p>GitLab runners (for CI/CD execution inside the cluster network)</p>
</li>
<li><p>Kyverno for Kubernetes policy enforcement</p>
</li>
<li><p>admission control policies (OPA/Kyverno-based patterns)</p>
</li>
<li><p>cluster-level monitoring components</p>
</li>
<li><p>supporting platform services</p>
</li>
</ul>
<p>The idea was to centralize platform capabilities rather than:</p>
<ul>
<li><p>duplicating tooling in every workload cluster</p>
</li>
<li><p>mixing platform concerns with application workloads</p>
</li>
</ul>
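<p>Centralizing policy in the platform clusters meant rules like the following lived in one place and applied uniformly. A minimal Kyverno sketch; the specific rule is illustrative, not one of the platform's actual policies:</p>

```yaml
# Example Kyverno ClusterPolicy: reject workload pods that do not set
# resource requests and limits. Illustrative only.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```

<p>Because policies were managed centrally, a change like this could be tested in platform_test and promoted, instead of being edited cluster by cluster.</p>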
<h3>Workload clusters</h3>
<p>Separate AKS clusters were deployed in workload subscriptions:</p>
<ul>
<li><p>dev</p>
</li>
<li><p>test</p>
</li>
<li><p>staging</p>
</li>
<li><p>prod</p>
</li>
</ul>
<p>These clusters were intentionally kept <strong>minimal</strong>.</p>
<p>They only contained:</p>
<ul>
<li><p>application workloads</p>
</li>
<li><p>required runtime dependencies</p>
</li>
<li><p>monitoring agents</p>
</li>
</ul>
<p>They did not include:</p>
<ul>
<li><p>CI/CD tools</p>
</li>
<li><p>GitOps controllers</p>
</li>
<li><p>platform-level policy engines (centrally managed instead)</p>
</li>
</ul>
<h3>Why this separation</h3>
<p>This design provided:</p>
<ul>
<li><p>isolation between platform and applications</p>
</li>
<li><p>ability to upgrade platform tooling independently</p>
</li>
<li><p>reduced blast radius</p>
</li>
<li><p>clearer ownership boundaries</p>
</li>
</ul>
<p>It also made it easier to enforce consistency across clusters, since platform logic was centralized.</p>
<h2>4. Platform Cluster Lifecycle and Promotion Strategy</h2>
<p>Each platform cluster had a specific role in the lifecycle of platform changes.</p>
<h3>platform_test</h3>
<p>This cluster was used for:</p>
<ul>
<li><p>testing new platform components</p>
</li>
<li><p>trying new versions of tools (ArgoCD, Kyverno, etc.)</p>
</li>
<li><p>validating breaking changes</p>
</li>
</ul>
<p>After validation:</p>
<ul>
<li><p>workloads were scaled down to zero</p>
</li>
<li><p>cluster remained available for future testing</p>
</li>
</ul>
<p>This ensured that experiments did not impact stable environments.</p>
<h3>platform_nonprod</h3>
<p>This cluster hosted stable platform tooling for non-production environments.</p>
<p>It included:</p>
<ul>
<li><p>ArgoCD (non-prod control plane)</p>
</li>
<li><p>GitLab runners</p>
</li>
<li><p>Kyverno policies for non-prod clusters</p>
</li>
<li><p>supporting services</p>
</li>
</ul>
<p>An important detail: ArgoCD in this cluster was responsible for managing:</p>
<ul>
<li><p>dev clusters</p>
</li>
<li><p>test clusters</p>
</li>
<li><p>staging clusters</p>
</li>
</ul>
<p>This created a clear separation between:</p>
<ul>
<li><p>experimentation (platform_test)</p>
</li>
<li><p>stable non-prod operations</p>
</li>
</ul>
<h3>platform_prod</h3>
<p>This cluster hosted production-grade platform tooling.</p>
<p>It included:</p>
<ul>
<li><p>ArgoCD (production control plane)</p>
</li>
<li><p>GitLab runners</p>
</li>
<li><p>Kyverno / policy enforcement</p>
</li>
<li><p>platform-level observability components</p>
</li>
</ul>
<p>ArgoCD here was responsible for:</p>
<ul>
<li>managing production workload clusters</li>
</ul>
<p>This ensured that:</p>
<ul>
<li><p>production deployments were isolated</p>
</li>
<li><p>no non-prod logic or experiments could affect production</p>
</li>
</ul>
<h3>Promotion model</h3>
<p>Changes followed a flow:</p>
<ol>
<li><p>Tested in platform_test</p>
</li>
<li><p>Promoted to platform_nonprod</p>
</li>
<li><p>Validated against non-prod workload clusters</p>
</li>
<li><p>Promoted to platform_prod</p>
</li>
</ol>
<p>This created a <strong>controlled promotion pipeline for platform changes</strong>, not just applications.</p>
<h2>5. Private AKS and Access Model</h2>
<p>All clusters were deployed as <strong>private AKS clusters</strong>.</p>
<p>This meant:</p>
<ul>
<li><p>no public API server</p>
</li>
<li><p>no direct internet exposure</p>
</li>
</ul>
<h3>Access design</h3>
<p>To enable secure access, I implemented:</p>
<ul>
<li><p>VPN Gateway in the platform network</p>
</li>
<li><p>Azure VPN client for engineers</p>
</li>
<li><p>access routed through private networking</p>
</li>
</ul>
<p>This allowed:</p>
<ul>
<li><p>secure kubectl access</p>
</li>
<li><p>no exposure of cluster endpoints</p>
</li>
</ul>
<h3>DNS resolution across clusters</h3>
<p>Private clusters introduced a challenge:</p>
<p>AKS API endpoints use private FQDNs, which must be resolvable across VNets and subscriptions.</p>
<p>To solve this, I implemented:</p>
<ul>
<li><p>VNet peering across platform and workload networks</p>
</li>
<li><p>centralized Private DNS zones</p>
</li>
<li><p>Azure Private DNS Resolver</p>
</li>
</ul>
<p>This ensured:</p>
<ul>
<li><p>consistent name resolution</p>
</li>
<li><p>access across multiple clusters</p>
</li>
</ul>
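<p>As a rough illustration of the DNS piece, the zone-to-VNet linking can be expressed in OpenTofu with the <code>azurerm</code> provider. This is a sketch with hypothetical names (the zone region, resource group, and hub VNet references are illustrative), not the exact code from the platform:</p>
<pre><code class="language-hcl"># Link the AKS private DNS zone into the hub VNet so that private
# API-server FQDNs resolve from peered networks as well.
resource "azurerm_private_dns_zone" "aks" {
  name                = "privatelink.westeurope.azmk8s.io"
  resource_group_name = azurerm_resource_group.dns.name
}

resource "azurerm_private_dns_zone_virtual_network_link" "hub" {
  name                  = "hub-link"
  resource_group_name   = azurerm_resource_group.dns.name
  private_dns_zone_name = azurerm_private_dns_zone.aks.name
  virtual_network_id    = azurerm_virtual_network.hub.id
}
</code></pre>
<p>With the zones linked and the VNets peered, an engineer connected through the VPN resolves the same private FQDNs as the clusters themselves.</p>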
<h3>Alternative access patterns</h3>
<p>In some cases:</p>
<ul>
<li>a jumpbox VM was used for debugging</li>
</ul>
<p>However, the primary model remained:</p>
<ul>
<li>VPN-based access with private DNS</li>
</ul>
<h2>6. GitOps Control Plane with ArgoCD</h2>
<p>GitOps was implemented as the <strong>primary deployment model</strong>.</p>
<h3>Control plane separation</h3>
<ul>
<li><p>ArgoCD in platform_nonprod → manages non-prod clusters</p>
</li>
<li><p>ArgoCD in platform_prod → manages production clusters</p>
</li>
</ul>
<p>This ensured:</p>
<ul>
<li><p>strict separation between environments</p>
</li>
<li><p>no accidental cross-environment deployments</p>
</li>
</ul>
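<p>To make the separation concrete: each ArgoCD control plane only knew about the clusters in its own tier. Workload clusters can be registered declaratively as cluster secrets. The sketch below uses hypothetical names, a hypothetical private endpoint, and a placeholder token; it follows ArgoCD's documented cluster-secret format rather than the exact manifests from this setup:</p>
<pre><code class="language-yaml"># Registers a dev workload cluster with the non-prod ArgoCD instance.
apiVersion: v1
kind: Secret
metadata:
  name: dev-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: dev
  server: https://dev-aks.example.internal:443   # private API FQDN
  config: |
    {
      "bearerToken": "REDACTED_SERVICE_ACCOUNT_TOKEN",
      "tlsClientConfig": { "insecure": false }
    }
</code></pre>
<p>The production ArgoCD instance simply never held cluster secrets for non-prod clusters, which made cross-environment deployments structurally impossible rather than merely discouraged.</p>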
<h3>Application management</h3>
<p>Applications were defined using:</p>
<ul>
<li><p>ArgoCD Applications</p>
</li>
<li><p>ApplicationSets</p>
</li>
</ul>
<p>ApplicationSets allowed:</p>
<ul>
<li><p>dynamic generation of apps</p>
</li>
<li><p>multi-environment deployments</p>
</li>
<li><p>standardized patterns</p>
</li>
</ul>
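<p>A minimal ApplicationSet illustrating the pattern might look like the following. The repository URL, paths, and cluster endpoints are hypothetical; the structure follows the upstream ApplicationSet API:</p>
<pre><code class="language-yaml"># One Application per non-prod environment, generated from a list.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-service
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev
            cluster: https://dev-aks.example.internal:443
          - env: test
            cluster: https://test-aks.example.internal:443
  template:
    metadata:
      name: 'payments-{{env}}'
    spec:
      project: default
      source:
        repoURL: https://gitlab.example.com/platform/deployments.git
        targetRevision: main
        path: 'payments/{{env}}'
      destination:
        server: '{{cluster}}'
        namespace: payments
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
</code></pre>
<p>Adding an environment becomes a one-element change in the generator instead of a copy-pasted Application definition.</p>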
<h3>Drift and reconciliation</h3>
<p>ArgoCD continuously ensured:</p>
<ul>
<li><p>desired state = actual state</p>
</li>
<li><p>drift detection</p>
</li>
<li><p>automatic reconciliation</p>
</li>
</ul>
<p>This removed the need for:</p>
<ul>
<li><p>manual kubectl deployments</p>
</li>
<li><p>ad-hoc changes</p>
</li>
</ul>
<h2>7. Application Deployment Flow</h2>
<p>The deployment model integrated GitLab CI with GitOps.</p>
<h3>Flow</h3>
<ol>
<li><p>Developer pushes code</p>
</li>
<li><p>GitLab CI builds container image</p>
</li>
<li><p>Image pushed to:</p>
<ul>
<li><p>GitLab Container Registry</p>
</li>
<li><p>Azure Container Registry (ACR)</p>
</li>
</ul>
</li>
<li><p>Deployment triggered (pipeline or Git change)</p>
</li>
<li><p>ArgoCD syncs state into cluster</p>
</li>
</ol>
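<p>The build-and-push part of the flow can be sketched as a <code>.gitlab-ci.yml</code> fragment. Variables beginning with <code>ACR_</code> are hypothetical project variables; the <code>CI_*</code> variables are standard GitLab predefined ones:</p>
<pre><code class="language-yaml">stages:
  - build
  - mirror

build-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # Build and push to the GitLab Container Registry.
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

mirror-to-acr:
  stage: mirror
  image: docker:24
  services:
    - docker:24-dind
  script:
    # Mirror the same image into ACR for workloads that pull from Azure.
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
    - docker pull "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    - docker login -u "$ACR_USERNAME" -p "$ACR_PASSWORD" "$ACR_LOGIN_SERVER"
    - docker tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" "$ACR_LOGIN_SERVER/app:$CI_COMMIT_SHORT_SHA"
    - docker push "$ACR_LOGIN_SERVER/app:$CI_COMMIT_SHORT_SHA"
</code></pre>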
<h3>Helm-based deployments</h3>
<p>Applications were packaged as Helm charts.</p>
<p>This allowed:</p>
<ul>
<li><p>environment-specific values</p>
</li>
<li><p>reusable templates</p>
</li>
<li><p>consistent deployments</p>
</li>
</ul>
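<p>Environment-specific behavior lived in per-environment values files layered over the chart defaults. A hypothetical <code>values-staging.yaml</code> might look like this (keys, tags, and hostnames are illustrative):</p>
<pre><code class="language-yaml">replicaCount: 2
image:
  tag: "abc1234"   # the image tag promoted into this environment
ingress:
  host: payments.staging.example.internal
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
</code></pre>
<p>Each environment's Application then pointed at the shared chart plus the matching values file, so promotion was mostly a values change rather than a template change.</p>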
<h3>Reality of the setup</h3>
<p>This was not a fully pure GitOps implementation.</p>
<p>Instead, it was:</p>
<ul>
<li><p>GitOps for cluster state</p>
</li>
<li><p>CI-driven triggers for deployments</p>
</li>
</ul>
<p>This approach worked well in a hybrid environment and allowed gradual adoption.</p>
<h2>8. Policy and Governance inside Kubernetes</h2>
<p>Kubernetes governance was enforced using <strong>Kyverno and policy-based controls</strong>.</p>
<h3>Why policy enforcement was needed</h3>
<p>Without policies:</p>
<ul>
<li><p>teams can deploy inconsistent resources</p>
</li>
<li><p>security risks increase</p>
</li>
<li><p>cluster behavior becomes unpredictable</p>
</li>
</ul>
<h3>Tools used</h3>
<ul>
<li><p>Kyverno for policy enforcement</p>
</li>
<li><p>admission control patterns</p>
</li>
<li><p>validation and mutation rules</p>
</li>
</ul>
<p>Conceptually aligned with:</p>
<ul>
<li>OPA/Gatekeeper-style governance</li>
</ul>
<h3>Example controls</h3>
<p>Policies enforced:</p>
<ul>
<li><p>required labels and annotations</p>
</li>
<li><p>resource limits and requests</p>
</li>
<li><p>restrictions on privileged containers</p>
</li>
<li><p>namespace-level controls</p>
</li>
</ul>
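<p>As one hedged example of what such a control can look like, the following Kyverno ClusterPolicy requires CPU and memory requests and limits on every container. The policy name is hypothetical, but the pattern syntax is standard Kyverno:</p>
<pre><code class="language-yaml">apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resources
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-requests-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    cpu: "?*"
                    memory: "?*"
</code></pre>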
<h3>Benefits</h3>
<ul>
<li><p>consistent deployments across clusters</p>
</li>
<li><p>reduced risk</p>
</li>
<li><p>centralized governance</p>
</li>
</ul>
<h2>9. Secrets Management</h2>
<p>Secrets were handled using Azure-native integration.</p>
<h3>Structure</h3>
<ul>
<li><p>separate Key Vaults per:</p>
<ul>
<li><p>team</p>
</li>
<li><p>environment</p>
</li>
</ul>
</li>
</ul>
<h3>Integration</h3>
<ul>
<li><p>External Secrets Operator deployed in clusters</p>
</li>
<li><p>secrets pulled from Key Vault into Kubernetes</p>
</li>
</ul>
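<p>The wiring can be sketched with two resources: a SecretStore pointing at a team's Key Vault and an ExternalSecret projecting one entry into the namespace. The vault URL, names, and keys below are hypothetical:</p>
<pre><code class="language-yaml">apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: team-a-kv
  namespace: team-a
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: https://team-a-nonprod-kv.vault.azure.net
      serviceAccountRef:
        name: team-a-eso
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: team-a
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: team-a-kv
    kind: SecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: db-password
</code></pre>
<p>Pods then reference the resulting Kubernetes Secret as usual; Git only ever contains the references, never the values.</p>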
<h3>Access control</h3>
<ul>
<li><p>managed through Entra ID groups</p>
</li>
<li><p>scoped per team</p>
</li>
</ul>
<h3>Outcome</h3>
<ul>
<li><p>no secrets stored in Git</p>
</li>
<li><p>centralized control</p>
</li>
<li><p>clear ownership</p>
</li>
</ul>
<h2>10. Networking and Ingress</h2>
<p>Networking followed a <strong>private-first, hub-spoke model</strong>.</p>
<h3>Cluster placement</h3>
<ul>
<li><p>clusters deployed in spoke VNets</p>
</li>
<li><p>connected to central hub</p>
</li>
</ul>
<h3>Traffic control</h3>
<ul>
<li><p>controlled ingress paths</p>
</li>
<li><p>internal service communication via private networking</p>
</li>
</ul>
<h3>Design goal</h3>
<ul>
<li><p>minimize public exposure</p>
</li>
<li><p>keep communication predictable</p>
</li>
</ul>
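<p>The spoke side of that topology is a short piece of OpenTofu. The resource below is a sketch with hypothetical VNet and resource group references; <code>use_remote_gateways</code> is what lets spoke clusters reach the VPN gateway in the hub:</p>
<pre><code class="language-hcl">resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                      = "spoke-to-hub"
  resource_group_name       = azurerm_resource_group.spoke.name
  virtual_network_name      = azurerm_virtual_network.spoke.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true
  use_remote_gateways       = true
}
</code></pre>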
<h2>11. Developer Workflow</h2>
<img src="https://cdn.hashnode.com/uploads/covers/63837243df107a0ef5751e3b/2afedb28-6a94-4f80-9edb-338e9d5f60a7.png" alt="Developer workflow diagram" style="display:block;margin:0 auto" />

<p>Developers interacted with the platform through:</p>
<ul>
<li><p>Git repositories</p>
</li>
<li><p>CI pipelines</p>
</li>
<li><p>Helm values</p>
</li>
</ul>
<h3>What developers do</h3>
<ul>
<li><p>write code</p>
</li>
<li><p>push changes</p>
</li>
<li><p>update configs</p>
</li>
</ul>
<h3>What platform handles</h3>
<ul>
<li><p>infrastructure</p>
</li>
<li><p>networking</p>
</li>
<li><p>policies</p>
</li>
<li><p>deployment</p>
</li>
</ul>
<h3>Key principle</h3>
<p>Enablement over access.</p>
<p>Teams were not required to understand:</p>
<ul>
<li><p>Kubernetes internals</p>
</li>
<li><p>Azure networking</p>
</li>
<li><p>security policies</p>
</li>
</ul>
<h2>12. Challenges and Trade-offs</h2>
<p>Building this platform was not just a technical exercise. Most of the complexity came from <strong>working within real constraints</strong> rather than designing in isolation.</p>
<p>One of the biggest challenges was operating in a <strong>hybrid environment</strong>. Some applications were still running on-premises and had to continue functioning while we were introducing Kubernetes on AKS. This meant I could not design everything as a clean, cloud-native system from the start. For example, the decision to push images to both GitLab Container Registry and Azure Container Registry was not ideal from a purity standpoint, but it was necessary to support existing workflows. The goal was to move forward without breaking what already worked.</p>
<p>Networking was another major challenge, especially with <strong>private AKS clusters</strong>. While private clusters significantly improve security, they introduce complexity around access and DNS resolution. I had to ensure that engineers could access clusters securely through VPN, while also making sure that private FQDNs resolved correctly across multiple VNets and subscriptions. This required careful planning of VNet peering, Private DNS zones, and the introduction of Azure Private DNS Resolver. These are not things that are easy to change later, so getting them right early was critical.</p>
<p>There was also a challenge around <strong>scaling connectivity</strong>. As more environments, regions, and external integrations were introduced, the network design needed to handle increasing complexity. Decisions around VPN Gateway sizing, NAT behavior, and routing were not static. They had to evolve as requirements grew, which meant the initial design needed to be flexible enough to adapt.</p>
<p>Another important challenge was <strong>developer adoption</strong>. Many teams were not familiar with Kubernetes or cloud-native practices. If I had simply provided clusters and access, the result would likely have been inconsistent deployments and operational issues. Instead, I had to design the platform in a way that guided teams toward the right patterns. This sometimes meant not implementing exactly what teams initially asked for. In many cases, requests were based on existing habits rather than what would work well in the new platform. It required balancing listening to requirements with making decisions that would scale long term.</p>
<p>There was also a constant trade-off between <strong>control and flexibility</strong>. Centralizing platform components like ArgoCD, policy enforcement, and secrets management improved consistency and security, but it reduced the level of direct control that application teams had. This was intentional, but it required careful design to ensure that teams still felt enabled rather than restricted.</p>
<p>Another trade-off was between <strong>pure GitOps and practical workflows</strong>. In an ideal setup, everything would be fully driven from Git with automated promotion between environments. In reality, we integrated GitOps with existing GitLab CI pipelines, including manual triggers where needed. While this was not a textbook GitOps implementation, it worked well in practice and allowed teams to adopt the model gradually instead of forcing a complete shift.</p>
<p>Finally, there was the challenge of <strong>changing established ways of working</strong>. Some processes had been followed for years, and moving to Infrastructure as Code, GitOps, and platform-driven workflows required a mindset shift. This was not something that could be solved purely with tooling. It required gradual introduction, clear patterns, and consistent reinforcement.</p>
<p>Overall, the main challenge was not designing the platform itself, but <strong>integrating it into an existing ecosystem</strong> with real constraints, existing systems, and varying levels of maturity.</p>
<h2>13. Lessons Learned</h2>
<p>Looking back, several important lessons came out of building and operating this platform.</p>
<p>One of the most important lessons was that <strong>separating platform and workload concerns early makes everything easier later</strong>. By keeping platform tooling (ArgoCD, runners, policies) in dedicated clusters and keeping workload clusters minimal, it became much easier to manage upgrades, enforce consistency, and reduce risk. Without this separation, platform components tend to get tightly coupled with workloads, making changes harder over time.</p>
<p>Another key lesson was that <strong>private clusters are worth the complexity</strong>, but only if networking is designed properly from the beginning. The security benefits are clear, but they come with a cost in terms of DNS, access, and connectivity. Investing time early in designing VNet structure, DNS resolution, and access patterns avoids much bigger problems later.</p>
<p>I also learned that <strong>GitOps adoption should be incremental, not forced</strong>. While the idea of full GitOps is appealing, teams need time to adapt. Integrating GitOps with existing CI/CD pipelines allowed us to introduce the model gradually, without disrupting existing workflows. Over time, this can evolve toward a more complete GitOps approach, but starting with a practical implementation was the right decision.</p>
<p>Another important lesson was around <strong>policy enforcement</strong>. Without centralized policies, Kubernetes environments quickly become inconsistent. Introducing tools like Kyverno allowed us to enforce standards such as resource limits, labeling, and security controls. This ensured that even as more teams onboarded, the platform remained predictable.</p>
<p>One of the strongest takeaways was that <strong>enablement is more effective than access</strong>. Giving teams full access to infrastructure does not necessarily lead to better outcomes, especially when they are new to the platform. Providing clear templates, workflows, and guardrails allowed teams to move faster with fewer errors. The role of the platform was not just to provide infrastructure, but to guide how it should be used.</p>
<p>I also learned that <strong>real-world constraints should shape design decisions</strong>. It is easy to aim for ideal architectures, but in practice, existing systems, organizational structure, and team maturity all play a role. Supporting hybrid environments, dual registries, and gradual migration was not ideal from a theoretical perspective, but it was necessary to move forward without disruption.</p>
<p>Finally, I realized that <strong>platform engineering is as much about people as it is about technology</strong>. The success of the platform depended not only on the technical design, but also on how well it aligned with teams, how easily it could be adopted, and how clearly it communicated the right way to work.</p>
<p>These lessons shaped not just the platform itself, but also how I approached designing systems in general.</p>
<h2>14. Conclusion</h2>
<p>The goal of this platform was never just to run Kubernetes clusters, but to create a system that teams could rely on without needing to understand all of its internal complexity.</p>
<p>By separating platform and workload clusters, I was able to keep responsibilities clear. Platform tooling such as ArgoCD, runners, and policy enforcement remained centralized, while workload clusters stayed focused on running applications. This made the overall system easier to operate, scale, and evolve over time.</p>
<p>Running everything as private AKS clusters improved security, but more importantly, it forced a more disciplined approach to networking, access, and connectivity. Decisions around VPN access, DNS resolution, and VNet design became foundational rather than afterthoughts.</p>
<p>Introducing GitOps provided consistency in how applications were deployed, even though the implementation was intentionally pragmatic and integrated with existing CI/CD workflows. This allowed teams to adopt new patterns gradually instead of forcing a complete shift upfront.</p>
<p>At the same time, the platform was designed with enablement in mind. Instead of exposing raw infrastructure, I focused on providing structured workflows, templates, and guardrails. This proved to be more effective, especially for teams that were new to Kubernetes and cloud environments.</p>
<p>Looking back, the most important part of this work was not any individual tool or technology, but the combination of decisions around structure, ownership, and workflows. These are the things that ultimately determine whether a platform is usable in practice.</p>
<p>This setup provided a foundation that could scale with the organization, support both existing systems and new workloads, and evolve over time without requiring constant redesign.</p>
]]></content:encoded></item><item><title><![CDATA[Designing Multi-Environment Platforms: What Actually Works in Practice]]></title><description><![CDATA[1. Multi-Environment Was Not the Hard Part
By the time I reached this stage of the platform, a lot of the visible foundation work was already in place. The Azure landing zone existed. The network mode]]></description><link>https://blog.ammarplatform.com/designing-multi-environment-platforms-what-actually-works-in-practice</link><guid isPermaLink="true">https://blog.ammarplatform.com/designing-multi-environment-platforms-what-actually-works-in-practice</guid><category><![CDATA[aks]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Multi Environment]]></category><category><![CDATA[gitops]]></category><category><![CDATA[opentofu]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Mon, 05 Jan 2026 09:30:00 GMT</pubDate><content:encoded><![CDATA[<h2>1. Multi-Environment Was Not the Hard Part</h2>
<p>By the time I reached this stage of the platform, a lot of the visible foundation work was already in place. The Azure landing zone existed. The network model existed. Private AKS access existed. The separation between platform control planes and workload clusters existed. GitLab CI/CD and ArgoCD were already doing real work. From a distance, it looked like the difficult part should have been over.</p>
<p>It was not.</p>
<p>What became obvious at that point was that "having multiple environments" is rarely the real problem. Almost every engineering organization has some version of <code>dev</code>, <code>test</code>, <code>staging</code>, and <code>prod</code>. The vocabulary is familiar enough that people assume the design is straightforward. In practice, most of the friction does not come from the number of environments. It comes from the fact that different layers of the system are borrowing the same words for different jobs.</p>
<p>Infrastructure environments, platform environments, and application environments do not move at the same speed, do not carry the same risk, and do not belong to the same owners. A networking change is not the same kind of change as a service rollout. An AKS platform validation environment is not the same thing as an application testing environment. A production cluster foundation and a production application release may share the word <code>prod</code>, but operationally they are not the same object.</p>
<p>When those distinctions get flattened into one environment story, the platform becomes harder to reason about than it should be. Teams start asking simple questions that should have simple answers and discovering that they do not. Which repository should change? Is this a platform promotion or an application promotion? Does this issue belong to Azure, AKS, ArgoCD, or the service itself? Are we testing a new platform capability, or are we testing business functionality? Under delivery pressure, that ambiguity becomes expensive very quickly.</p>
<p>That was the real lesson. Multi-environment design is not mainly a naming problem or a folder structure problem. It is an operating model problem.</p>
<h2>2. Where Most Multi-Environment Designs Go Wrong</h2>
<p>The most common mistake I see is treating every environment as if it represents the same layer of the system.</p>
<p>On paper, a single environment ladder looks tidy. You create <code>dev</code>, <code>test</code>, <code>staging</code>, and <code>prod</code>, then assume everything moves through that path together. Infrastructure, clusters, shared platform services, and business workloads all inherit the same labels. It feels consistent because the words repeat.</p>
<p>The problem is that consistency in naming is not the same thing as consistency in operation.</p>
<p>A platform team validating a new ingress pattern, a workload team testing a feature branch, and an infrastructure team changing DNS or identity integration are not doing the same kind of work. They should not be forced into the same promotion shape just because the environment labels match. If they are, one of two things usually happens. Either everything becomes tightly coupled and slow, or teams quietly bypass the model because it does not reflect how the system actually changes.</p>
<p>This is where multi-environment setups often become harder than the single-environment prototypes they replaced. Not because they are too large, but because the boundaries are dishonest. The environment names stop telling you what can change there, who owns the change, and what kind of failure that environment is meant to contain.</p>
<p>That last point matters more than people often admit. An environment name is only useful if it carries operational meaning. If <code>test</code> can mean "a place to validate OpenTofu changes," "a place to test ArgoCD behavior," "a place for developers to exercise a feature," and "a place to try a new secrets pattern," then the label is doing very little work. The platform team ends up translating the meaning manually every time a change or incident happens.</p>
<p>That is not scalability. That is a support burden with better branding.</p>
<h2>3. The Environment Model That Worked Better</h2>
<p>What worked better in practice was separating platform and infrastructure environments from application and workload environments, even though both still used familiar labels.</p>
<p>At the platform and infrastructure layer, the model was closer to <code>test</code>, <code>non_prod</code>, and <code>prod</code>. These environments existed to validate and promote the Azure and AKS foundation itself. This included subscriptions, networking, cluster creation, private connectivity, platform control-plane components, identity integration, observability foundations, and the shared services the runtime depended on.</p>
<p>At the application layer, the model was closer to <code>dev</code>, <code>test</code>, <code>staging</code>, and <code>prod</code>. These environments existed for normal workload lifecycle: building, validating, promoting, and operating business services.</p>
<p>The difference was not academic. It changed how the platform behaved.</p>
<p>A platform <code>test</code> environment was where I wanted to validate a cluster-level or control-plane change without dragging application release pressure into the decision. An application <code>test</code> environment was where a team wanted to validate the behavior of its service. Both were legitimate. They were just not the same thing.</p>
<p>The same was true in production. A production platform environment represented the AKS and Azure foundation that production workloads depended on. A production application environment represented the workload actually serving production traffic. The names overlapped, but the responsibilities did not.</p>
<p>The names themselves were not sacred. Another team could choose different labels and still arrive at a healthy model. The important part was the separation. Multi-environment platforms work much better when the environment structure reflects the real operating layers of the system instead of pretending everything moves through one universal pipeline.</p>
<h2>4. Not Every Application Environment Needed Its Own Copy of the Platform</h2>
<p>One of the more important design decisions was refusing to create a one-to-one mapping between every application environment and a separate copy of the entire platform.</p>
<p>This sounds obvious when written down, but a surprising amount of environment sprawl comes from chasing symmetry. If the application lifecycle has <code>dev</code>, <code>test</code>, <code>staging</code>, and <code>prod</code>, it is tempting to assume the platform should expose four complete copies of Azure resources, AKS foundations, networking constructs, shared services, and operational tooling in exactly the same shape.</p>
<p>In a growing microservices environment, that tends to become expensive, noisy, and harder to govern than people expect. Every extra copy brings more OpenTofu state, more DNS, more secrets boundaries, more monitoring, more upgrade paths, and more places for drift to hide. A design can look very clean on a whiteboard while creating far too much operational surface area in real life.</p>
<p>What worked better was being explicit about where isolation actually mattered.</p>
<p>The platform and infrastructure layer needed clear separation between <code>test</code>, <code>non_prod</code>, and <code>prod</code> because the foundation itself required controlled validation and promotion. Workload environments needed their own lifecycle because services had to move more quickly and more frequently. But that did not imply that every workload environment needed a separate end-to-end copy of the platform underneath it.</p>
<p>In practice, non-production application environments could share a non-production platform foundation while still remaining distinct at the workload layer through repository structure, namespaces, policy, and promotion rules. Production kept stricter isolation because the risk justified it. That approach preserved the operating boundaries that mattered without multiplying the entire platform every time an application team wanted another stage in its delivery flow.</p>
<p>Symmetry is attractive on a slide. In operation, it is often just a more expensive form of confusion.</p>
<h2>5. Structuring the Platform Layer with OpenTofu on Azure</h2>
<p>Once the environment model was separated by layer, the infrastructure side needed to make that separation real.</p>
<p>For Azure and AKS foundation work, I kept reusable infrastructure logic separate from environment instantiation. In practical terms, that meant one repository held reusable OpenTofu modules and another held the actual environment definitions. In my case, that was roughly an <code>azure-modules</code> layer and an <code>azure-environments</code> layer. The modules repository represented the standard building blocks: network patterns, AKS cluster definitions, private connectivity, DNS integration, identity plumbing, and other pieces the platform needed repeatedly. The environment repository represented the real deployments of those building blocks into <code>test</code>, <code>non_prod</code>, and <code>prod</code>.</p>
<p>That separation mattered because it forced the platform to distinguish between two different questions.</p>
<p>The first question was, "What is our standard way to build this component?" The second was, "How is this environment using that component right now?" Those should not be answered in the same place.</p>
<p>Without that line, infrastructure repositories tend to become a mixture of templates, overrides, special cases, and half-embedded environment logic. They still work for a while, but they get harder to change safely because nobody can tell whether they are modifying a reusable platform capability or making a one-off adjustment for a single environment.</p>
<p>With the split in place, platform evolution became more deliberate. A module could change independently. An environment could adopt that change independently. Those are different operations with different review expectations, and the repository model made that visible instead of hiding it.</p>
<p>This also aligned well with governance. The platform was already running in Azure with subscription boundaries, RBAC, private networking, and controlled access patterns established earlier in the series. Separating the module layer from the environment layer helped keep that structure auditable. The Azure platform did not become easier because the code was cleaner. It became easier because the code finally matched the operating model.</p>
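<p>In practice the split looked something like the fragment below, offered as a hedged sketch rather than the real code: the module source, version tag, and inputs are hypothetical, but the shape is the point. The environment file answers "how is this environment using the component," while the module answers "what is our standard way to build it":</p>
<pre><code class="language-hcl"># azure-environments/non_prod/main.tf (illustrative)
module "aks" {
  source = "git::https://gitlab.example.com/platform/azure-modules.git//aks?ref=v1.4.0"

  name               = "aks-nonprod"
  environment        = "non_prod"
  private_cluster    = true
  vnet_address_space = ["10.20.0.0/16"]
}
</code></pre>
<p>A module upgrade is a change to <code>ref</code>; an environment-specific adjustment is a change to the inputs. The review conversation differs accordingly.</p>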
<h2>6. Structuring the Workload Layer with GitLab, ArgoCD, and AKS</h2>
<p>The workload side needed a different shape because it was solving a different problem.</p>
<p>Application teams were not trying to promote clusters, DNS zones, or shared platform services. They were trying to ship software. That meant the common path had to live where developers already worked and think in terms that matched application delivery, not infrastructure assembly.</p>
<p>GitLab CI/CD remained responsible for build and workflow logic. It built images, ran tests, enforced checks, and handled the sequencing around application delivery. ArgoCD remained responsible for reconciliation and desired cluster state. That split had already proven useful in the earlier GitOps work and it stayed useful here.</p>
<p>What changed at the multi-environment level was the discipline around how application environments were represented. The shape of delivery had to stay structurally similar across <code>dev</code>, <code>test</code>, <code>staging</code>, and <code>prod</code>, even when the approval model, config values, and promotion rules differed. If every environment became a separate ritual, the platform team would end up debugging the differences instead of operating a platform.</p>
<p>The point was not to remove all environment-specific behavior. That would have been unrealistic. Production should behave more carefully than development. Some services needed different configuration, tighter review, or stricter policy in later stages. But the path itself needed to remain understandable. Teams should be able to answer basic questions without reverse-engineering the platform each time. Which image is being promoted? Which config changed? Which Git repository represents desired state? Which ArgoCD application will reconcile it into AKS?</p>
<p>That clarity reduced a lot of coordination tax. A service moving from <code>test</code> to <code>staging</code> should not need an unrelated OpenTofu change or a platform engineer interpreting cluster internals just to keep the release moving. If the platform already provides the capability, the workload should move through its own lifecycle without asking the infrastructure layer for permission every time.</p>
<h2>7. Example: Adding a New Service Without Reopening Infrastructure Design</h2>
<p>One of the clearest signs that the environment model was working was what happened when a team needed to onboard a new service.</p>
<p>In a weaker platform model, a "new service" request often becomes an accidental infrastructure project. The team has code, but then immediately runs into a chain of platform questions. Which cluster should it land on? Does it need a new namespace? How should the ingress look? Where do the secrets come from? Which environment gets created first? Does this require Azure changes? Which DNS entry is correct? Which pipeline pattern should it follow? In theory, those are solvable questions. In practice, they often turn into a series of DevOps dependencies.</p>
<p>The platform model was healthier when most of those decisions were already absorbed.</p>
<p>A service team could start from the standard delivery shape, use the existing GitLab workflow, declare how the service should be exposed, and plug into the existing GitOps structure for <code>dev</code>, <code>test</code>, <code>staging</code>, and <code>prod</code>. Secrets followed the platform contract instead of an ad hoc approach. Environment-specific behavior lived where the deployment model expected it to live. ArgoCD reconciled the declared state into the right workload environment on AKS.</p>
<p>The more important part was what did not happen. The team did not need to reopen the design of the entire Azure and AKS foundation just because a new microservice appeared. If the requested behavior fit the platform contract, the service moved through the application lifecycle. Only if the request introduced a genuinely new platform capability did it become a platform-layer discussion.</p>
<p>That distinction saved a lot of unnecessary work.</p>
<p>It also created a healthier conversation between application teams and the platform team. Instead of every onboarding exercise becoming a vague request for "help with Kubernetes," the question became much sharper: are you asking for something the platform already supports, or are you asking for the platform contract to evolve? That is a much more scalable interface.</p>
<h2>8. Example: A Platform Change Should Not Ride Along With an Application Release</h2>
<p>Another place where the two-layer environment model proved its value was when the platform itself had to change.</p>
<p>One practical example was improving how workloads consumed secrets and identity. In an Azure and AKS environment, there are several ways to get this wrong. Teams can overuse CI variables, create Kubernetes secrets manually, or build one-off patterns that work for a service today and become support debt later. Moving toward a cleaner Key Vault-backed model with predictable workload identity behavior was the right platform direction, but it was not the kind of change that should have been coupled to a random application release.</p>
<p>That sort of change belongs to the platform lifecycle first.</p>
<p>The Azure and AKS foundations had to be validated. Identity plumbing, cluster integration, access boundaries, and the expected deployment patterns had to work consistently. The right place to prove that was the platform <code>test</code> environment, then the broader <code>non_prod</code> platform environment, and only after that the production platform environment. Application teams still shipped their services through <code>dev</code>, <code>test</code>, <code>staging</code>, and <code>prod</code>, but they were not forced to synchronize their delivery cadence with the rollout of the underlying platform capability.</p>
<p>That separation was important because it prevented the usual coupling mistakes. A service release was not blocked just because the platform team was validating a cluster-level change. A platform rollout was not rushed because an application team wanted to get a feature into production. Each layer could move on its own timeline inside a controlled model.</p>
<p>Once the platform capability was established, service teams could consume it through the existing application path. That is what a good platform should do. It should absorb the complexity of foundational change first, then expose a stable contract to the teams building on top of it.</p>
<h2>9. Example: Debugging Got Easier Once the Environment Boundaries Were Honest</h2>
<p>The value of a multi-environment design only really shows up when something is going wrong.</p>
<p>One of the recurring benefits of the clearer model was faster triage when a workload behaved differently across environments. In a muddled environment structure, the first phase of incident response is often spent figuring out which layer might be responsible. People start checking pipelines, cluster settings, secrets, ingress, DNS, and recent infrastructure changes all at once because the boundaries are not clear enough to narrow the search.</p>
<p>That got easier once the environment model became more honest.</p>
<p>If a service was healthy in <code>test</code> but failing in <code>staging</code>, and both lived on the same non-production platform foundation, that told you something immediately. The problem was less likely to be "the whole platform is broken" and more likely to be in the workload promotion path, the service configuration, or a dependency visible only in the later application stage. If several workloads began failing after a platform rollout into <code>non_prod</code>, the direction of investigation shifted quickly toward the platform layer instead of wasting time treating every service as an isolated mystery.</p>
<p>Prometheus and Grafana also became more useful in this model because the environment labels finally matched something operationally meaningful. Metrics, dashboards, and alerts were easier to interpret when <code>prod</code> could be understood in the right context and when platform-level concerns were not mixed carelessly with workload-level ones. ArgoCD history helped for the same reason. A change trail is far more valuable when you already know which kind of environment change you are looking for.</p>
<p>This may sound like a small improvement, but in practice it changes the tone of operational work. The platform becomes easier to reason about under pressure because the environment model gives you a better first hypothesis.</p>
<h2>10. Governance Only Works When It Follows the Same Boundaries</h2>
<p>Another lesson from this work was that environment design and access control have to reinforce each other.</p>
<p>It is not enough to say that platform environments and application environments are different if the access model ignores that distinction. If everyone can change everything through the same path, the boundary is mostly conceptual.</p>
<p>The healthier model was to keep direct Azure and AKS access narrower at the platform layer and make the common application path self-service through Git-based workflows. Platform and infrastructure repositories carried the controls appropriate for higher-blast-radius changes. Production paths were tighter than non-production paths. Application teams did not need broad direct access to platform internals just to move a service forward. They interacted with the platform through GitLab CI/CD, GitOps-managed state, and the reusable patterns the platform exposed intentionally.</p>
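<p>One lightweight way to encode that boundary in the Git layer is ownership rules on the state repository. A hypothetical GitLab <code>CODEOWNERS</code> sketch, with made-up paths and group handles:</p>
<pre><code># Changes to production state and platform internals need platform approval.
# Earlier application stages stay self-service for the owning teams.
services/**/prod/*    @platform-team
platform/**           @platform-team
</code></pre>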
<p>That was not about restriction for its own sake. It was about matching access to responsibility.</p>
<p>If the platform is designed well, most service changes should not require a developer to hold wide Azure permissions or cluster-admin access. Giving broad rights to compensate for a weak platform interface is a common trap. It feels flexible in the moment and creates far more governance and audit pain later.</p>
<p>Multi-environment design becomes much more durable when the repository model, the promotion model, and the RBAC model all describe the same boundaries.</p>
<h2>11. What Stayed Hard</h2>
<p>Even with a clearer model, multi-environment platforms do not become effortless.</p>
<p>One persistent challenge was naming. The two-layer model was operationally better, but it still required people to unlearn the assumption that the same word always referred to the same layer. Newer engineers understandably asked why a platform <code>test</code> environment and an application <code>test</code> environment were both called <code>test</code> if they meant different things. The honest answer was that the names were familiar, but familiarity does not eliminate the need for clear explanation.</p>
<p>Another challenge was deciding where standardization should stop. In a microservices environment, there is always pressure for exceptions. One team wants an extra pre-production stage. Another wants different promotion semantics because its release risk is higher. Another wants more direct access because its service has unusual operational needs. Some exceptions are justified. Many are just local optimizations that weaken the shared model if you accept them too easily.</p>
<p>There was also a judgment call around isolation. Not every non-production workload deserved its own cluster, but not every workload belonged in the same place either. Those decisions had to be made with some discipline around blast radius, regulatory sensitivity, noisy-neighbor risk, and operational burden. A senior platform design rarely comes down to one universal answer. It usually comes down to applying a consistent decision framework and resisting arbitrary divergence.</p>
<p>In other words, the model reduced ambiguity, but it did not remove the need for engineering judgment.</p>
<h2>12. The Trade-Offs Were Real</h2>
<p>I do not think there is a serious multi-environment design that avoids trade-offs. The useful ones are the designs where the trade-offs are deliberate.</p>
<p>Separating platform environments from application environments added conceptual overhead at first. There were more repositories, more boundaries to explain, and more care required in how changes moved. A flatter model would have looked simpler to someone seeing it for the first time.</p>
<p>But that flatter model would also have hidden the real costs. It would have coupled unrelated changes, blurred ownership, and forced the platform team to act as a constant interpreter between infrastructure and application delivery. That kind of simplicity tends to collapse at exactly the point where the platform is supposed to scale.</p>
<p>There was also a trade-off between flexibility and repeatability. The more opinionated the environment model became, the less room there was for every team to invent its own lifecycle. That was intentional. Standardization moves some decision-making away from individual teams and into the platform. Done badly, that becomes rigidity. Done well, it removes repeated low-value decisions and lets teams focus on the work that actually belongs to them.</p>
<p>The same applied to governance. Controlled workflows are slower than unconstrained access if you only measure the first five minutes of a change. They are usually much faster if you measure the full lifecycle of auditing, rollback, incident response, and long-term operability.</p>
<p>The goal was never to make the platform infinitely flexible. It was to make the common path safe, clear, and scalable.</p>
<h2>13. What I Would Do Differently</h2>
<p>If I were designing the same model again, I would make a few parts of it explicit earlier.</p>
<p>The first is environment language. The separation between platform and application environments was the right decision, but I would spend more time up front giving teams a clearer mental model of what each environment meant, what kind of changes belonged there, and which repositories represented that change. A lot of avoidable confusion in platform work comes from people making reasonable assumptions based on incomplete naming.</p>
<p>I would also encode more of the boundary rules directly into automation. If a change belongs to the platform layer, the repository and pipeline structure should make that obvious and hard to bypass. If an application promotion is expected to follow a certain shape, the GitLab and ArgoCD path should reinforce that instead of relying on tribal memory.</p>
<p>I would probably invest earlier in environment-level observability conventions as well. Dashboards, labels, and alert routing become much more valuable when they line up cleanly with the operating model from the start. Once teams trust that the environment boundaries mean something, operational tooling becomes easier to read.</p>
<p>None of those are arguments against the model. They are the things I would tighten sooner because the model proved worth keeping.</p>
<h2>14. Why This Was Platform Engineering</h2>
<p>This part of the work reinforced something I have come to believe quite strongly: multi-environment design is not about creating more copies of infrastructure. It is about designing a system that different kinds of engineering work can move through without constantly colliding with each other.</p>
<p>By this point in the broader platform journey, the landing zone, private networking, AKS separation model, GitOps workflow, and reusable deployment patterns all existed for a reason. The multi-environment design was where those earlier decisions either became a coherent operating model or remained a collection of good components.</p>
<p>What made the difference was not the number of environments. It was the quality of the boundaries between them.</p>
<p>A good platform is not measured by how much infrastructure it exposes. It is measured by how rarely application teams need to care about that infrastructure to do normal work safely. In the same way, a good multi-environment model is not measured by how many stages it names. It is measured by whether engineers can understand what each environment is for, where a change belongs, and how to move it forward without unnecessary coordination.</p>
<p>That is why I think this is platform engineering rather than just environment management. The work was not to produce another set of Azure resources or another set of AKS clusters. The work was to design an operating model that reduced ambiguity, preserved governance, and let more teams move independently on top of the same foundation.</p>
<p>That is what actually worked in practice.</p>
]]></content:encoded></item><item><title><![CDATA[Designing Azure Landing Zones for Enterprise Cloud Adoption: Tenants, Management Groups, and Subscription Strategy]]></title><description><![CDATA[1. Introduction
In one of my recent roles, I was hired to set up the foundation for moving workloads from a primarily on-prem environment toward Azure. The starting point was not a greenfield set]]></description><link>https://blog.ammarplatform.com/designing-azure-landing-zones-for-enterprise-cloud-adoption-tenants-management-groups-and-subscription-strategy</link><guid isPermaLink="true">https://blog.ammarplatform.com/designing-azure-landing-zones-for-enterprise-cloud-adoption-tenants-management-groups-and-subscription-strategy</guid><category><![CDATA[Landing Zone]]></category><category><![CDATA[Azure]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[opentofu]]></category><category><![CDATA[cloud architecture]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[aks]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Mon, 29 Dec 2025 08:00:00 GMT</pubDate><content:encoded><![CDATA[

<h2>1. Introduction</h2>
<p>In one of my recent roles, I was hired to set up the foundation for moving workloads from a primarily on-prem environment toward Azure. The starting point was not a greenfield setup, but rather an existing landscape with established systems, evolving cloud requirements, and no clearly defined Azure operating model in place.</p>
<p>Before onboarding any workloads, it became clear that we first needed to define how the cloud environment itself should be structured and managed. Instead of jumping straight into deploying services, I spent time understanding how the organization operated, how teams were structured, how responsibilities were divided, and what kind of environments would be required both immediately and in the future. This involved discussions with IT, infrastructure, security, and application stakeholders to make sure the design aligned with real workflows rather than purely technical assumptions.</p>
<p>One of the key realizations early on was that simply creating subscriptions and deploying resources would not scale. Without a clear structure, access model, and governance approach, the environment would quickly become difficult to manage as more teams and workloads were introduced. The goal, therefore, was to design a landing zone that could act as a stable and scalable foundation: one that supports multiple environments, enforces consistency, and enables controlled growth.</p>
<p>The work focused on defining the core building blocks of the Azure platform: how tenants, management groups, and subscriptions should be structured; how access should be controlled through RBAC; and how governance and security should be applied from the beginning. This was less about individual resource deployment and more about establishing a cloud operating model that would guide how infrastructure is provisioned and managed over time.</p>
<p>In the following sections, I will walk through the key decisions behind this design, including how the environment was structured, how access and governance were handled, and the trade-offs involved along the way.</p>
<h2>2. What the Landing Zone Needed to Solve</h2>
<p>Before defining any architecture, the first step was to clearly understand what problems the landing zone needed to address. This was not just a technical exercise, but a combination of organizational, operational, and security considerations that would shape how the platform would evolve over time.</p>
<p>One of the primary challenges was the lack of a consistent structure in Azure. Without clear boundaries, there was a risk that resources would be created in an ad hoc way, leading to unclear ownership, inconsistent configurations, and increasing operational overhead. As more teams started adopting cloud services, this kind of setup would quickly become difficult to control.</p>
<p>Another key requirement was environment separation. Different workloads needed to run across development, testing, and production environments, each with different levels of access, stability, and governance. These environments could not simply coexist in the same space without introducing risks around accidental changes, access leakage, or unintended impact on production systems.</p>
<p>Access control was also a major concern. Multiple teams with different responsibilities needed access to the platform, but with clearly defined boundaries. The goal was to ensure that engineers had the access they needed to do their work, while avoiding overly broad permissions that could lead to security or operational risks. This required a structured approach to RBAC that aligned with real team responsibilities.</p>
<p>From a governance perspective, there was a need to introduce consistency without slowing teams down. This included standardizing how resources are named, how they are organized, and what baseline configurations are required. At the same time, it was important to avoid overly restrictive controls that would block development or introduce unnecessary friction. My goal was enablement for developers and the infrastructure team, with guardrails rather than gatekeeping.</p>
<p>Networking and connectivity were another important area. The platform needed to support secure communication between workloads, as well as controlled connectivity to external systems and, where needed, existing on-premises environments. These decisions had to be made early, as they would influence how services are deployed and consumed later.</p>
<p>Finally, the landing zone needed to support future growth. This meant designing with the expectation that more workloads, teams, and environments would be added over time. The structure had to be scalable, predictable, and easy to extend without requiring major redesigns.</p>
<p>Taken together, the landing zone was not just about organizing resources in Azure. It was about creating a structured and governed environment that could support real-world operations balancing flexibility for engineering teams with control, security, and long-term maintainability.</p>
<p>A major future requirement was supporting Kubernetes-based workloads in a structured way, which influenced decisions around networking, identity, environment separation, and automation from the start.</p>
<h2>3. Initial Challenges and Design Goals</h2>
<p>Before defining the structure, there were a few key challenges that shaped the design.</p>
<h3><strong>Challenges:</strong></h3>
<ul>
<li><p>There was no established cloud operating model, which meant decisions around structure, access, and ownership had to be defined from scratch. At the same time, the design needed to align with how teams actually worked, not just how things look on paper.</p>
</li>
<li><p>Environment separation was another important concern. It was not just about dev and prod, but about <strong>clearly isolating risk, access, and stability</strong>. Without this, it would be easy for changes in non-production to impact production systems.</p>
</li>
<li><p>Access control also required careful planning. Different teams needed different levels of access, and without a structured approach, permissions could quickly become too broad or inconsistent. At the same time, overly strict controls could slow down development.</p>
</li>
<li><p>Networking decisions had to be made early, as they would impact connectivity, security, and how services interact. These are difficult to change later, so they needed to be thought through upfront.</p>
</li>
<li><p><strong>Finally, there was a constant need to avoid overengineering: designing something scalable, but still simple enough to operate and understand.</strong></p>
</li>
</ul>
<h3><strong>Design Goals</strong></h3>
<p>Based on these challenges, a few clear goals guided the design.</p>
<ul>
<li><p>The first was <strong>clear environment separation</strong>, ensuring that development, testing, and production were isolated in a meaningful way.</p>
</li>
<li><p>The second was <strong>alignment with ownership</strong>, so that subscriptions, access, and resources reflected real team responsibilities.</p>
</li>
<li><p>Scalability was also important, allowing new workloads and environments to be added without redesigning the structure.</p>
</li>
<li><p>Consistency was another key goal, with standardized naming, organization, and baseline configurations to keep the platform predictable.</p>
</li>
<li><p>Security and governance were built in from the start, with guardrails that protect the platform without blocking teams.</p>
</li>
<li><p>Finally, the design needed to be <strong>practical and maintainable</strong>, implemented through infrastructure as code and understandable by the teams operating it.</p>
</li>
</ul>
<p>These principles guided all further decisions in the landing zone design.</p>
<h2><strong>4. Tenant and Identity Boundary Decisions</strong></h2>
<p>One of the first areas that needed clarity was the tenant and identity boundary, as this defines how access, authentication, and overall control of the platform are managed.</p>
<p>The environment was built within an existing Azure tenant, which meant working within established identity and governance constraints. Rather than creating a separate tenant, the focus was on structuring access and responsibilities correctly within the current one. This required close coordination with stakeholders responsible for identity and security to ensure alignment with organizational policies.</p>
<p>A key decision was to separate concerns between tenant-level administration and platform-level operations. Tenant-wide permissions were kept limited, while <strong>most operational responsibilities were handled at management group and subscription level</strong>. This helped reduce risk and avoided unnecessary exposure of high-privilege roles.</p>
<p><strong>Access</strong> was designed around <strong>groups</strong> rather than individual users. Instead of assigning permissions directly, roles were mapped to Entra ID groups representing different teams and responsibilities. This made access easier to manage, especially as team members changed over time.</p>
<p>Different types of identities were also handled differently. User access was separated from automation, with <strong>service principals</strong> or <strong>managed identities</strong> used for CI/CD pipelines and infrastructure provisioning. These identities were granted <strong>only the permissions required for their specific scope</strong>, avoiding overly broad access.</p>
<p>Another important aspect was ensuring that access boundaries aligned with how teams worked. Platform, networking, and application teams each had clearly defined scopes, reducing overlap and making ownership more explicit.</p>
<p>Overall, the goal at this level was to establish a clean and controlled identity model that supports secure access, scales with the organization, and integrates well with the rest of the landing zone design.</p>
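<p>In OpenTofu/Terraform terms, this pattern comes down to role assignments whose principal is an Entra ID group rather than a user. The group, variable, and resource names below are hypothetical, a sketch rather than the actual code:</p>
<pre><code>variable "dev_subscription_id" {
  type = string    # GUID of the dev workload subscription
}

# Look up the team's Entra ID group instead of individual users.
data "azuread_group" "app_team_devs" {
  display_name = "grp-app-team-developers"
}

data "azurerm_subscription" "dev" {
  subscription_id = var.dev_subscription_id
}

# Membership changes in the group; the role assignment never has to.
resource "azurerm_role_assignment" "app_team_dev" {
  scope                = data.azurerm_subscription.dev.id
  role_definition_name = "Contributor"
  principal_id         = data.azuread_group.app_team_devs.object_id
}
</code></pre>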
<h2>5. Management Group Hierarchy Design</h2>
<p>With the identity boundary defined, the next step was structuring the management group hierarchy. This was a key part of the design, as it defines how governance, policies, and access scale across the platform.</p>
<p>The hierarchy was intentionally kept simple and built around three primary areas:</p>
<ul>
<li><p><strong>Platform</strong></p>
</li>
<li><p><strong>Workloads</strong></p>
</li>
<li><p><strong>Sandboxes</strong></p>
</li>
</ul>
<p>This structure was designed to reflect both ownership and usage patterns, rather than just technical grouping.</p>
<p>The <strong>Platform</strong> management group was not treated as one large catch-all area. It was intentionally split into four platform domains: <strong>Identity</strong>, <strong>Connectivity</strong>, <strong>Management</strong>, and <strong>Shared Services</strong>. That separation exists for a practical reason: these domains have different blast radii, different access requirements, and different operational lifecycles. Identity and Management sit closer to the shared control plane and therefore need tighter governance; the Management domain in particular hosted cross-cutting operational capabilities such as monitoring, diagnostics, security visibility, and other platform-level management tooling. Connectivity affects every connected workload and has to be centrally controlled. Shared Services provide reusable capabilities, but should not become the place where application runtimes are hidden.</p>
<p>The <strong>Workloads</strong> management group was where application environments lived and where the actual runtime of the business services was deployed. This distinction mattered throughout the design: the platform layer hosted shared control-plane services, while workload subscriptions hosted the components that actually run the applications.</p>
<p>The <strong>Sandboxes</strong> management group was designed for experimentation and non-critical usage. This allowed engineers to test ideas or explore services without impacting structured environments. Governance here was intentionally more relaxed, while still maintaining basic guardrails.</p>
<p>One of the key considerations was balancing <strong>control and simplicity</strong>. Instead of creating a deep or overly complex hierarchy, this structure provided clear separation of concerns while remaining easy to understand and operate.</p>
<p>Another important aspect was leveraging inheritance. By assigning policies and RBAC at the management group level, baseline configurations could be enforced consistently across all child subscriptions. This reduced duplication and ensured that new subscriptions automatically followed the same standards.</p>
<p>Overall, this approach provided a clean and scalable foundation. It clearly separated platform responsibilities, workload environments, and experimental usage, while keeping the structure flexible enough to grow over time without requiring major changes.</p>
<p>Conceptually, the workload side was managed through a higher-level NonProd vs Prod operating model, while still exposing environment-specific subscriptions such as dev, staging, and prod for day-to-day deployment and ownership boundaries.</p>
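<p>Expressed in OpenTofu/Terraform, the hierarchy itself stays small. This is a sketch with illustrative names, not the actual code:</p>
<pre><code># Management groups: Platform (with its domains), Workloads, Sandboxes.
resource "azurerm_management_group" "root" {
  display_name = "org-root"
}

resource "azurerm_management_group" "platform" {
  display_name               = "platform"
  parent_management_group_id = azurerm_management_group.root.id
}

# One child per platform domain; Connectivity shown as an example.
resource "azurerm_management_group" "platform_connectivity" {
  display_name               = "connectivity"
  parent_management_group_id = azurerm_management_group.platform.id
}

resource "azurerm_management_group" "workloads" {
  display_name               = "workloads"
  parent_management_group_id = azurerm_management_group.root.id
}

resource "azurerm_management_group" "sandboxes" {
  display_name               = "sandboxes"
  parent_management_group_id = azurerm_management_group.root.id
}
</code></pre>
<p>Policies and RBAC assigned at <code>platform</code> or <code>workloads</code> then flow down to every child subscription automatically.</p>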
<h2>6. Subscription Strategy</h2>
<p>After defining the management group hierarchy, the next step was designing the subscription model. Subscriptions were used as the primary boundary for isolation, access control, and operational ownership.</p>
<p>Under the <strong>Platform</strong> management group, subscriptions were separated by environment:</p>
<ul>
<li><p><code>platform_nonprod</code></p>
</li>
<li><p><code>platform_test</code>After defining the management group hierarchy, the next step was designing the subscription model. Subscriptions were used as the primary boundary for isolation, access control, and operational ownership.</p>
<p>Under the <strong>Platform</strong> management group, subscriptions were organized around platform domains and, where needed, split between <strong>NonProd</strong> and <strong>Prod</strong>:</p>
<ul>
<li><p><code>identity_nonprod</code> / <code>identity_prod</code></p>
</li>
<li><p><code>connectivity_nonprod</code> / <code>connectivity_prod</code></p>
</li>
<li><p><code>management_nonprod</code> / <code>management_prod</code></p>
</li>
<li><p><code>sharedservices_nonprod</code> / <code>sharedservices_prod</code></p>
</li>
</ul>
<p>This separation made the platform easier to reason about. Identity-related dependencies, hub networking, management tooling, and shared capabilities could evolve independently, and a change in one platform domain did not automatically expand the blast radius into all the others.</p>
<p>Under the <strong>Workloads</strong> management group, subscriptions were organized by application environments:</p>
<ul>
<li><p><code>dev</code></p>
</li>
<li><p><code>staging</code></p>
</li>
<li><p><code>prod</code></p>
</li>
</ul>
<p>In practice, the most important operational boundary was <strong>NonProd vs Prod</strong>. Development and staging sat on the non-production side, where teams could validate infrastructure and application changes more freely. Production remained isolated with tighter RBAC, stricter policy enforcement, and more controlled deployment processes.</p>
<p>The <strong>Sandboxes</strong> management group contained separate subscriptions (<code>sandbox1</code>, <code>sandbox2</code>, <code>sandbox3</code>) used for experimentation. These were intentionally isolated from both platform and workload environments, allowing engineers to test new ideas or services without affecting structured environments.</p>
<p>This overall structure provided clear separation between:</p>
<ul>
<li><p>shared platform control-plane services</p>
</li>
<li><p>application runtime environments</p>
</li>
<li><p>experimental usage</p>
</li>
</ul>
<p>It also helped reduce risk by limiting the blast radius of changes and made it easier to apply different governance and access controls across environments.</p>
<p>One observation from this setup is that naming should always reflect the real operating model. If non-production serves several purposes such as experimentation, integration, and pre-production validation, that needs to be visible in the structure so teams understand where a change belongs.</p>
</li>
<li><p><code>platform_prod</code></p>
</li>
</ul>
<p>This separation allowed platform-level changes to be tested safely before reaching production. Core infrastructure such as networking and shared services could be validated in non-production environments without impacting critical workloads. At the same time, the production platform remained tightly controlled with stricter access and governance.</p>
<p>Under the <strong>Workloads</strong> management group, subscriptions were organized by application environments:</p>
<ul>
<li><p><code>dev</code></p>
</li>
<li><p><code>staging</code></p>
</li>
<li><p><code>prod</code></p>
</li>
</ul>
<p>This ensured clear isolation between development and production workloads. It also allowed different levels of access, policy enforcement, and operational control depending on the environment. For example, production environments were more restricted, while development and staging allowed more flexibility.</p>
<p>The <strong>Sandboxes</strong> management group contained separate subscriptions (<code>sandbox1</code>, <code>sandbox2</code>, <code>sandbox3</code>) used for experimentation. These were intentionally isolated from both platform and workload environments, allowing engineers to test new ideas or services without affecting structured environments.</p>
<p>This overall structure provided clear separation between:</p>
<ul>
<li><p>platform infrastructure</p>
</li>
<li><p>application workloads</p>
</li>
<li><p>experimental usage</p>
</li>
</ul>
<p>It also helped reduce risk by limiting the blast radius of changes and made it easier to apply different governance and access controls across environments.</p>
<p>One observation from this setup is that while separating platform environments added safety, it also introduced some overlap in naming and structure (for example, <code>test</code> vs <code>nonprod</code>). In future iterations, this could be simplified to reduce cognitive overhead while still maintaining the same level of isolation.</p>
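<p>To make the hierarchy concrete, a minimal OpenTofu sketch of the management group layout might look like the following. All display names and the subscription ID are placeholders, not the actual values used:</p>
<pre><code class="language-hcl"># Top-level management groups for the landing zone (illustrative names only)
resource "azurerm_management_group" "platform" {
  display_name = "Platform"
}

resource "azurerm_management_group" "workloads" {
  display_name = "Workloads"
}

resource "azurerm_management_group" "sandboxes" {
  display_name = "Sandboxes"
}

# Environment-level group under Workloads, with a subscription attached
resource "azurerm_management_group" "workloads_prod" {
  display_name               = "prod"
  parent_management_group_id = azurerm_management_group.workloads.id
  subscription_ids           = ["00000000-0000-0000-0000-000000000000"] # placeholder
}
</code></pre>
<p>Because subscriptions are attached declaratively, moving one between management groups becomes a reviewable code change rather than a portal operation.</p>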
<h2>7. Governance Model</h2>
<p>Governance was treated as a foundational part of the landing zone rather than something added later. The goal was to introduce enough structure to keep the environment consistent and secure, while still allowing teams to move quickly.</p>
<p>One of the first steps was defining <strong>basic standards</strong> that would apply across all subscriptions. This included naming conventions, resource organization, and tagging to ensure that resources were easy to identify, track, and manage. Keeping these consistent was important not just for readability, but also for automation, cost management, and operational clarity.</p>
<p>Governance was also aligned with the <strong>management group hierarchy</strong>. Policies and baseline RBAC were assigned at the management group level and inherited down into child subscriptions. That inheritance model was important because a newly vended subscription did not start empty. It inherited the expected guardrails, access model, and baseline standards from day one. This also allowed different levels of control depending on the environment: stricter for production and platform resources, and more flexible for sandboxes.</p>
<p>Another important aspect was ensuring that governance did not become a blocker. Instead of introducing overly restrictive controls from the start, the approach was to apply <strong>practical guardrails</strong> that addressed real risks. For example, ensuring that critical resources followed standard configurations and limiting risky patterns in production environments, while keeping non-production environments more open for development.</p>
<p>There was also a focus on <strong>ownership and accountability</strong>. Subscriptions and resources were structured in a way that made it clear which team was responsible for what. This reduced ambiguity and made it easier to manage changes, troubleshoot issues, and enforce standards over time.</p>
<p>From an implementation perspective, governance was closely tied to infrastructure as code. Baseline configurations, policy assignments, role bindings, and budget settings were embedded into OpenTofu modules and deployment workflows, ensuring that new resources followed the same patterns by default rather than relying on manual enforcement.</p>
<p>Overall, the governance model aimed to strike a balance by providing enough control to keep the platform stable and secure, while remaining lightweight enough to support ongoing development and growth.</p>
<p>Cost management was also considered as part of governance. Budgets were defined at subscription level with <strong>daily, weekly, and monthly monitoring</strong>, along with alerts to ensure visibility into spending. In other words, governance was not only about security guardrails, but also about keeping access, compliance, and cost behavior predictable.</p>
<p>This was particularly important in non-production and sandbox environments, where automated cleanup and usage patterns could otherwise lead to unnecessary costs. By combining budget alerts with tagging and subscription boundaries, it was possible to maintain accountability and control as the platform scaled.</p>
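<p>As a rough sketch of the budget side of this, a subscription-level budget with an alert threshold can be declared with the <code>azurerm</code> provider roughly as below. The name, amount, dates, and subscription ID are all illustrative:</p>
<pre><code class="language-hcl"># Minimal subscription budget with an alert threshold (all values illustrative)
resource "azurerm_consumption_budget_subscription" "sandbox" {
  name            = "sandbox-monthly-budget"
  subscription_id = "/subscriptions/00000000-0000-0000-0000-000000000000"

  amount     = 500
  time_grain = "Monthly"

  time_period {
    start_date = "2026-01-01T00:00:00Z"
  }

  # Notify the platform team when 80% of the budget is consumed
  notification {
    enabled        = true
    operator       = "GreaterThan"
    threshold      = 80
    contact_emails = ["platform-team@example.com"]
  }
}
</code></pre>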
<h2>8. RBAC and Access Control Strategy</h2>
<p>Access control was one of the most important parts of the landing zone design, as it directly impacts both security and day-to-day operations. The goal was to ensure that teams had the access they needed to work effectively, while keeping permissions scoped and controlled.</p>
<p>The approach was based on <strong>role-based access control aligned with responsibilities</strong>, rather than assigning broad permissions by default. Instead of granting access at individual resource level, permissions were primarily assigned at <strong>management group and subscription level</strong>, allowing inheritance to handle most use cases. This reduced duplication and made access easier to manage as the environment grew.</p>
<p>Access was structured using <strong>Entra ID groups</strong>, with roles mapped to specific team responsibilities such as platform, networking, and application teams. This avoided direct user-level assignments and made it easier to onboard or offboard users without changing role assignments across the platform.</p>
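<p>The group-based pattern reduces to a small amount of code: look up an existing Entra ID group and bind a role to it at subscription scope. The group name and subscription ID here are hypothetical:</p>
<pre><code class="language-hcl"># Look up an existing Entra ID group and bind a role at subscription scope
data "azuread_group" "app_team_dev" {
  display_name = "app-team-dev" # hypothetical group name
}

resource "azurerm_role_assignment" "app_team_dev_contributor" {
  scope                = "/subscriptions/00000000-0000-0000-0000-000000000000" # placeholder
  role_definition_name = "Contributor"
  principal_id         = data.azuread_group.app_team_dev.object_id
}
</code></pre>
<p>Onboarding and offboarding then happen through group membership alone, without touching role assignments.</p>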
<p>The <strong>Platform management group</strong> had more restricted and controlled access, as it contained shared infrastructure that impacted all environments. Only the platform team and a limited set of administrators had elevated permissions here.</p>
<p>Under the <strong>Workloads management group</strong>, access was further separated by environment. Development and staging subscriptions allowed broader access for application teams to deploy and test, while production access was more tightly controlled and typically limited to specific roles or controlled processes.</p>
<p>For automation, separate identities were used instead of relying on user credentials. CI/CD pipelines (e.g., GitLab) were integrated using service principals or managed identities, with permissions scoped only to the subscriptions or resources they needed to manage. This ensured that automation remained controlled and auditable.</p>
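<p>One way to wire a GitLab pipeline to a scoped identity without long-lived secrets is OIDC federation against a user-assigned managed identity; whichever identity type is used, the shape is similar. Every name, the GitLab URL, and the subject claim below are assumptions for illustration:</p>
<pre><code class="language-hcl">resource "azurerm_user_assigned_identity" "gitlab_ci" {
  name                = "id-gitlab-ci-dev"     # hypothetical
  location            = "westeurope"
  resource_group_name = "rg-platform-identity" # hypothetical
}

# Trust tokens issued by the GitLab instance for a specific project and branch
resource "azurerm_federated_identity_credential" "gitlab_main" {
  name                = "gitlab-environments-main"
  resource_group_name = "rg-platform-identity"
  parent_id           = azurerm_user_assigned_identity.gitlab_ci.id
  issuer              = "https://gitlab.example.com"
  audience            = ["https://gitlab.example.com"]
  subject             = "project_path:platform/environments:ref_type:branch:ref:main"
}

# Scope the identity to only the subscription it manages
resource "azurerm_role_assignment" "gitlab_ci_dev" {
  scope                = "/subscriptions/00000000-0000-0000-0000-000000000000" # placeholder
  role_definition_name = "Contributor"
  principal_id         = azurerm_user_assigned_identity.gitlab_ci.principal_id
}
</code></pre>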
<p>One important consideration was minimizing the use of overly privileged roles such as Owner. Wherever possible, more scoped roles were used to limit access while still enabling necessary operations. This helped reduce risk, especially in production environments.</p>
<p>Overall, the RBAC strategy focused on <strong>clear boundaries, group-based access, and least privilege</strong>, ensuring that access scaled with the platform while remaining secure and manageable.</p>
<h2>9. Policy, Compliance, and Guardrails</h2>
<p>Alongside RBAC, policies were used to enforce baseline standards and prevent common misconfigurations. The goal was not to restrict everything, but to introduce <strong>practical guardrails</strong> that kept the platform consistent and secure as it scaled.</p>
<p>Policies were applied primarily at the <strong>management group level</strong>, allowing them to be inherited by all underlying subscriptions. This ensured that new subscriptions automatically followed the same baseline without requiring manual setup each time, which is exactly what you want if subscription creation is being automated.</p>
<p>The approach differed slightly across management groups. In the <strong>Platform</strong> and <strong>production workload</strong> environments, policies were stricter to protect critical infrastructure and ensure compliance with security expectations. In contrast, <strong>non-production and sandbox environments</strong> had more relaxed policies to allow experimentation and faster iteration.</p>
<p>Some of the key areas covered by policies included:</p>
<ul>
<li><p>enforcing <strong>required tags</strong> for ownership and cost tracking</p>
</li>
<li><p>restricting <strong>allowed regions</strong> to maintain consistency</p>
</li>
<li><p>ensuring <strong>baseline configurations</strong> for resources and diagnostics</p>
</li>
<li><p>preventing certain risky exposure patterns in production environments</p>
</li>
</ul>
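<p>As an example of one of these guardrails, assigning the built-in "Allowed locations" policy at management group scope can be sketched as follows; the management group ID and region list are placeholders:</p>
<pre><code class="language-hcl"># Assign the built-in "Allowed locations" policy at management group scope
data "azurerm_policy_definition" "allowed_locations" {
  display_name = "Allowed locations"
}

resource "azurerm_management_group_policy_assignment" "allowed_regions" {
  name                 = "allowed-regions"
  management_group_id  = "/providers/Microsoft.Management/managementGroups/workloads" # placeholder
  policy_definition_id = data.azurerm_policy_definition.allowed_locations.id

  parameters = jsonencode({
    listOfAllowedLocations = {
      value = ["westeurope", "northeurope"]
    }
  })
}
</code></pre>
<p>Because the assignment sits at the management group, every vended subscription underneath inherits it with no per-subscription setup.</p>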
<p>A key consideration was avoiding overly aggressive enforcement early on. Instead of applying a large number of strict policies upfront, the approach was to introduce controls incrementally based on actual needs. This helped avoid blocking teams while still moving toward a more governed environment.</p>
<p>Policies were also closely aligned with the overall structure of the landing zone. By combining management group hierarchy, subscription boundaries, and policy inheritance, governance could be applied consistently without becoming difficult to manage. These guardrails worked alongside RBAC and subscription-level budget controls, rather than replacing them.</p>
<p>Over time, this created a balance where teams could work with flexibility in non-production environments, while production and platform layers remained controlled and predictable.</p>
<h2>10. Platform Security Foundations</h2>
<p>Security was treated as a foundational aspect of the landing zone rather than something applied at the workload level later. Many of the key security controls were built directly into the platform design, reducing the need for reactive fixes as the environment grew.</p>
<p>One of the primary decisions was to enforce <strong>isolation through structure</strong>. By separating platform, workloads, and sandbox environments into different management groups and subscriptions, the risk of unintended access or impact was significantly reduced. Production environments were especially isolated, with stricter access controls and governance.</p>
<p>Access control itself played a major role in platform security. RBAC was designed around least privilege, with permissions scoped to roles and responsibilities rather than individuals. High-privilege access was limited, especially in platform and production subscriptions, reducing the overall attack surface.</p>
<p>Where automation or service-to-service access was needed, managed identities were preferred over long-lived credentials. This reduced secret sprawl and made permissions easier to scope, review, and rotate.</p>
<p><strong>Defender for Cloud</strong> was also part of the cross-cutting security model. It provided a useful baseline across subscriptions by surfacing recommendations, highlighting configuration gaps, and making it easier to track whether the platform was drifting away from expected security posture over time.</p>
<p>Networking was another key component of the security foundation. The design leaned toward <strong>private connectivity wherever possible</strong>, limiting public exposure of services. Private endpoints became a recurring pattern for PaaS dependencies, and this approach influenced how services were deployed and accessed, ensuring that internal communication between components remained controlled.</p>
<p>Baseline protections were also considered at the platform level. This included enforcing standard configurations through policies, ensuring resources followed expected patterns, and avoiding insecure defaults. While not all controls were applied at once, the structure allowed them to be introduced gradually without requiring major changes.</p>
<p>Another important aspect was <strong>separation of concerns</strong>. Platform-level resources, such as shared infrastructure, were kept isolated from application workloads. This ensured that changes or issues in one area would not directly affect others, and allowed tighter control over critical components.</p>
<p>Finally, the platform was designed with auditability in mind. By structuring access, policies, and deployments consistently, it became easier to track changes, understand ownership, and maintain visibility across the environment.</p>
<p>Overall, security was not treated as a separate layer, but as an integral part of how the platform was structured and operated from the beginning.</p>
<h2>11. Networking Foundations</h2>
<p>Networking was one of the most critical parts of the landing zone, as it defined how services communicate, how access is controlled, and how the platform integrates with existing systems.</p>
<p>The design followed a <strong>hub-and-spoke model</strong>. The <strong>Connectivity</strong> subscription acted as the hub, and the workload subscriptions acted as the spokes. This allowed shared network control to stay centralized while keeping workload environments isolated from one another.</p>
<p>Each workload VNet was connected to the hub through <strong>VNet peering</strong>. That made it possible for workloads to consume shared connectivity services without flattening everything into a single network boundary.</p>
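<p>The spoke side of such a peering looks roughly like this in OpenTofu; the hub side needs a matching peering back (typically with gateway transit enabled). Resource and VNet names here are hypothetical:</p>
<pre><code class="language-hcl"># Hub VNet, looked up from the Connectivity subscription
data "azurerm_virtual_network" "hub" {
  name                = "vnet-hub"            # hypothetical
  resource_group_name = "rg-connectivity-hub" # hypothetical
}

# Spoke-to-hub peering for a workload VNet
resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                      = "peer-dev-to-hub"     # hypothetical
  resource_group_name       = "rg-workload-dev-net" # hypothetical
  virtual_network_name      = "vnet-workload-dev"
  remote_virtual_network_id = data.azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true # traffic may transit the hub firewall
  use_remote_gateways       = true # reuse the hub's VPN gateway
}
</code></pre>
<p>Cross-subscription lookups like this need an aliased provider per subscription, which fits naturally with the environment-repository structure described later.</p>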
<p>The hub hosted the shared networking control plane: <strong>Azure Firewall</strong>, centralized <strong>routing</strong>, and <strong>private DNS</strong>. Keeping firewalling, route control, and name resolution in the Connectivity subscription meant those patterns were defined once and consumed consistently, rather than being reimplemented differently by each workload team.</p>
<p>A key decision was to move toward <strong>private connectivity by default</strong>. Wherever possible, services were not exposed publicly, and communication between components was handled through private endpoints and internal networking paths. This aligned with the overall security model and reduced unnecessary exposure of critical services.</p>
<p>Networking was also closely aligned with the subscription and management group structure. Platform-level networking components lived in the Connectivity subscription, while workload environments owned their own virtual networks, subnetting, private endpoints, and application-facing load balancers. This separation ensured clear ownership and reduced the risk of cross-environment impact.</p>
<p>At the exposure layer, I kept a deliberate distinction. <strong>Azure Firewall</strong> remained in the Connectivity hub because it is a shared inspection and egress control point. <strong>Application Gateway</strong> or <strong>AKS ingress</strong> components sat close to the workloads they exposed, because they are part of the application entry path. Workload-specific load balancers also stayed in the workload layer rather than being pulled into the platform.</p>
<p>Before rolling out networking to production, all core components were first implemented and validated in the <strong>non-production connectivity environment</strong>. This included setting up virtual networks, defining address spaces, and testing connectivity patterns.</p>
<p>A key part of this phase was ensuring that IP ranges did not conflict with existing on-premises infrastructure. This required coordination with internal IT teams and careful planning of address spaces to support both current and future connectivity requirements.</p>
<p>Core networking components such as VPN gateways, private DNS resolution, firewall rules, and connectivity patterns were tested in non-production first. Once validated, the same setup was replicated in the production connectivity environment. This approach reduced risk and ensured that production networking was based on tested and predictable configurations rather than assumptions.</p>
<p>DNS and service discovery were also an important part of the design, particularly with the use of private endpoints. Shared private DNS lived with the hub, while workload-owned private endpoints stayed with the workloads that depended on them. Ensuring consistent name resolution across subscriptions and environments required careful planning, especially as more services were introduced.</p>
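<p>A hub-owned private DNS zone for Key Vault private endpoints, linked to the hub VNet, can be sketched as below (resolution from spokes additionally depends on how DNS forwarding is set up; resource group and VNet names are hypothetical):</p>
<pre><code class="language-hcl"># Shared private DNS zone for Key Vault private endpoints, owned by the hub
resource "azurerm_private_dns_zone" "keyvault" {
  name                = "privatelink.vaultcore.azure.net"
  resource_group_name = "rg-connectivity-dns" # hypothetical
}

# Hub VNet the zone is linked to
data "azurerm_virtual_network" "hub" {
  name                = "vnet-hub"            # hypothetical
  resource_group_name = "rg-connectivity-hub" # hypothetical
}

resource "azurerm_private_dns_zone_virtual_network_link" "keyvault_hub" {
  name                  = "link-kv-hub"
  resource_group_name   = "rg-connectivity-dns"
  private_dns_zone_name = azurerm_private_dns_zone.keyvault.name
  virtual_network_id    = data.azurerm_virtual_network.hub.id
}
</code></pre>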
<p>Overall, the networking foundation focused on <strong>centralized control, environment isolation, and secure connectivity</strong>, providing a structure that could support both current workloads and future expansion without major redesign.</p>
<h2>12. Shared Services and Platform Capabilities</h2>
<p>In addition to the core structure, a set of shared services was established to support workloads across all environments. These were placed within the <strong>platform subscriptions</strong>, ensuring they were centrally managed and consistently available.</p>
<p>The goal was to centralize capabilities that are common across multiple workloads, while avoiding unnecessary duplication and keeping control within the platform layer.</p>
<p>The most important design boundary here was between <strong>platform</strong> and <strong>workload</strong>. The platform layer hosted shared control-plane services: identity-related infrastructure, centralized connectivity, management tooling, reusable secrets patterns, registries, and observability. The workload layer hosted the application runtime: <strong>AKS or other compute</strong>, <strong>messaging</strong>, <strong>data services</strong>, <strong>storage</strong>, and the <strong>private endpoints</strong> required by those applications.</p>
<p>That distinction mattered because it is easy to accidentally push too much into "platform." Services such as <strong>Kafka</strong>, <strong>ActiveMQ</strong>, application databases, and workload storage were not treated as platform services. Even when shared by a particular application landscape, they still belonged in workload subscriptions because their lifecycle, scaling, failure modes, and ownership were part of the workload, not the shared control plane.</p>
<p>The same logic applied to <strong>MongoDB Atlas</strong>. Atlas was treated as an external managed service rather than something living inside the platform layer. Even though it sits outside native Azure resource ownership, architecturally it was still a workload dependency and was handled through the workload's connectivity and security model.</p>
<p>One of the key areas was <strong>network-related shared services</strong>. Components such as VPN gateways, private DNS resolution, and connectivity services were hosted in the platform layer, allowing workload environments to consume them without needing to manage their own implementations.</p>
<p>Another important area was <strong>secrets management</strong>. Azure Key Vault was used as the central mechanism for storing and managing sensitive data. Instead of using a single shared vault, <strong>separate Key Vaults were created per team or service</strong>, with further separation across environments (dev, test, prod). This aligned with the overall structure of the platform and ensured clear isolation of secrets.</p>
<p>Access to Key Vaults was controlled through <strong>Entra ID groups</strong>, allowing teams to access only the secrets relevant to their services and environments. This approach simplified access management while maintaining strong security boundaries.</p>
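<p>Put together, a per-team, per-environment vault with RBAC authorization and group-based access might be declared roughly like this; the vault name, resource group, and group object ID are placeholders:</p>
<pre><code class="language-hcl">data "azurerm_client_config" "current" {}

# One vault per team and environment, locked to RBAC and private access
resource "azurerm_key_vault" "team_dev" {
  name                          = "kv-teamx-dev"   # hypothetical; must be globally unique
  location                      = "westeurope"
  resource_group_name           = "rg-secrets-dev" # hypothetical
  tenant_id                     = data.azurerm_client_config.current.tenant_id
  sku_name                      = "standard"
  enable_rbac_authorization     = true  # access via Entra ID role assignments
  public_network_access_enabled = false # reached through a private endpoint
}

# Grant the team's Entra ID group read access to secrets
resource "azurerm_role_assignment" "team_dev_secrets" {
  scope                = azurerm_key_vault.team_dev.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = "00000000-0000-0000-0000-000000000000" # group object ID placeholder
}
</code></pre>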
<p>Within Kubernetes environments (AKS), secrets were integrated using the <strong>External Secrets Operator</strong>, allowing workloads to securely retrieve secrets from Azure Key Vault without embedding them directly into application configurations. This created a clear separation between secret storage and application deployment.</p>
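<p>An External Secrets Operator binding typically looks something like the following, shown here as a <code>kubernetes_manifest</code> resource to stay in HCL. The namespace, store name, and secret keys are illustrative, and the referenced <code>ClusterSecretStore</code> is assumed to exist:</p>
<pre><code class="language-hcl"># ExternalSecret that materializes a Key Vault secret as a Kubernetes Secret
resource "kubernetes_manifest" "db_credentials" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "db-credentials"
      namespace = "orders" # hypothetical
    }
    spec = {
      refreshInterval = "1h"
      secretStoreRef = {
        kind = "ClusterSecretStore"
        name = "azure-keyvault-dev" # hypothetical store pointing at the team vault
      }
      target = {
        name = "db-credentials" # resulting Kubernetes Secret
      }
      data = [{
        secretKey = "password"
        remoteRef = { key = "db-password" } # secret name in Key Vault
      }]
    }
  }
}
</code></pre>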
<p>Container image management reflected a hybrid setup. Azure Container Registry (ACR) was used as the primary registry for cloud workloads, while an existing on-premises GitLab setup required images to be available in GitLab as well. To support both environments, images were built and pushed through GitLab CI pipelines to <strong>both GitLab's registry and Azure Container Registry</strong>. While this introduced some duplication, it allowed compatibility with existing workflows and supported a gradual transition toward cloud-native deployments.</p>
<p>Operational tooling was also centralized where it made sense, particularly for monitoring and observability. This helped maintain consistency across environments and reduced duplication of effort.</p>
<p>A key consideration throughout was deciding <strong>what should be centralized and what should remain within workloads</strong>. Foundational capabilities such as networking, secret management, and shared operational tooling were centralized, while application-specific runtime resources remained within workload subscriptions.</p>
<p>A typical workload subscription therefore contained the runtime components needed by the application itself: an AKS cluster or other compute layer, messaging components, data services, storage accounts, and workload-specific private endpoints. The platform provided the shared foundations around those workloads, not the workloads themselves.</p>
<p>Overall, the shared services layer provided reusable building blocks that supported all environments, reinforced consistency, and enabled teams to operate securely without duplicating core infrastructure components.</p>
<h2>13. Infrastructure as Code Approach</h2>
<p>The landing zone was implemented using <strong>Infrastructure as Code (IaC)</strong> to ensure consistency, repeatability, and controlled changes across the platform. In practice, <strong>OpenTofu</strong> and <strong>GitLab CI</strong> became the mechanism for <strong>subscription vending</strong>, baseline platform setup, and consistent provisioning across the estate. Rather than creating resources manually, all core components including management groups, subscriptions, networking, and shared services were defined through code.</p>
<p>The implementation was structured across <strong>three separate repositories</strong>, each with a clear responsibility.</p>
<p>The first repository handled the creation of <strong>remote state backends</strong>. For each subscription, storage accounts and containers were provisioned through GitLab CI pipelines to store OpenTofu state. This ensured proper isolation of state per environment and avoided conflicts between different parts of the platform.</p>
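<p>Each environment then pointed its configuration at one of those backends with a block along these lines; the resource group, storage account, and state key names are placeholders:</p>
<pre><code class="language-hcl"># Backend block consumed by the environment repository; one state key per area
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"       # hypothetical
    storage_account_name = "sttfstatenonprod" # hypothetical; must be globally unique
    container_name       = "platform"
    key                  = "connectivity.tfstate"
  }
}
</code></pre>
<p>Keeping one state key per environment and area is what prevents unrelated parts of the platform from contending over the same state file.</p>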
<p>The second repository contained the <strong>core infrastructure modules</strong>. This included reusable modules for <strong>subscription vending</strong>, management group placement, policy assignment, networking, and other shared building blocks. The goal here was to define the building blocks of the platform in a modular and reusable way.</p>
<p>The third repository was used for <strong>environment-specific configurations</strong>, consuming the modules defined in the module repository. This separation allowed infrastructure logic to remain reusable, while environments could be defined and managed independently.</p>
<p>A key part of the workflow was the use of <strong>versioned modules</strong>. Changes to infrastructure were implemented through small, incremental updates aligned with individual tasks (for example, vending a new subscription, assigning baseline policies, adding a VPN gateway, or provisioning AKS). Each change was merged into the main branch of the modules repository and resulted in a new <strong>semantic version release</strong>.</p>
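<p>In the environment repository, consuming a released module version reduces to pinning a git tag in the module source. The repository URL, module path, and input names below are hypothetical; the actual module interface will differ:</p>
<pre><code class="language-hcl"># Environment repository pinning a module release by semantic version tag
module "subscription_vending" {
  source = "git::https://gitlab.example.com/platform/modules.git//subscription-vending?ref=v1.4.0"

  # Hypothetical inputs for illustration
  subscription_display_name = "workloads-dev"
  management_group_id       = "/providers/Microsoft.Management/managementGroups/workloads"
  tags = {
    owner       = "platform-team"
    environment = "dev"
  }
}
</code></pre>
<p>Bumping <code>ref</code> in a merge request is what makes each module upgrade an explicit, reviewable change rather than a silent drift.</p>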
<p>New subscriptions were not created as empty containers. They were vended through code, attached to the correct management group, and received their initial RBAC, policy, and baseline configuration through the same automated path. That made the landing zone easier to scale because new environments inherited the platform model instead of being hand-crafted.</p>
<p>These module releases were then propagated to the environment repository. For each change, a corresponding branch (aligned with the task or ticket) was used, and updates triggered the creation of merge requests in the environment repository. This ensured that infrastructure changes were explicitly reviewed and applied in a controlled manner.</p>
<p>The workflow was tightly integrated with <strong>GitLab CI/CD pipelines</strong>, which handled validation, planning, and application of changes. It was also connected to <strong>Jira</strong>, allowing changes to be tracked from requirement to implementation. This made it easier for teams to understand the status of infrastructure changes and maintain visibility across the platform.</p>
<p>This approach provided a clear separation between:</p>
<ul>
<li><p>infrastructure logic (modules)</p>
</li>
<li><p>environment configuration</p>
</li>
<li><p>state management</p>
</li>
</ul>
<p>It also ensured that all changes were traceable, versioned, and applied in a consistent way across environments.</p>
<p>Overall, the Infrastructure as Code setup allowed the platform to be managed as a structured system rather than a collection of manual configurations, making it easier to scale, maintain, and evolve over time.</p>
<h2>14. CI/CD and Deployment Workflow for the Platform</h2>
<p>Infrastructure changes were not applied manually, but went through a structured CI/CD workflow to ensure consistency, visibility, and control across the platform.</p>
<p>The workflow was built around <strong>GitLab CI/CD pipelines</strong>, which handled validation, planning, subscription vending, policy assignment, and applying infrastructure changes. Every change started as a task (tracked in Jira) and was implemented through a dedicated branch aligned with that task.</p>
<p>Changes were first introduced in the <strong>modules repository</strong>, typically as small, incremental updates (for example, adding a resource group, VPN gateway, or AKS cluster). Each change went through peer review within the team before being merged. The team consisted of four engineers, and while everyone contributed changes, merges to the main branch were controlled to maintain consistency and avoid conflicts.</p>
<p>Once a change was merged into the main branch, a new <strong>versioned release</strong> of the module was created automatically. This ensured that infrastructure changes were versioned, traceable, and could be consumed in a controlled way.</p>
<p>These module updates were then propagated to the <strong>environment repository</strong>, where the new version triggered a corresponding branch and <strong>merge request</strong>. This allowed changes to be reviewed again in the context of specific environments before being applied.</p>
<p>The pipeline followed a clear flow:</p>
<ul>
<li><p>validate configuration</p>
</li>
<li><p>vend or update the subscription baseline</p>
</li>
<li><p>generate plan</p>
</li>
<li><p>review changes</p>
</li>
<li><p>apply changes</p>
</li>
</ul>
<p>To improve visibility, the pipeline included tooling that surfaced <strong>planned infrastructure changes directly in merge requests</strong>, showing what resources would be created, updated, or destroyed. This made it easier for reviewers to understand the impact of changes before approval. The same workflow was also used to assign or update policy sets through code, which kept governance changes reviewable rather than hidden in the portal.</p>
<p>Before applying changes to production, updates were first tested in <strong>sandbox or non-production environments</strong>. Changes were applied with <code>tofu apply</code> and validated through pipeline logs, allowing the team to observe exactly what was being created, modified, or removed. Only after this validation were changes promoted to production environments.</p>
<p>For production, additional care was taken with controlled application and review, ensuring that changes were predictable and aligned with expectations.</p>
<p>This workflow ensured that infrastructure changes were:</p>
<ul>
<li><p><strong>reviewed</strong> (through team peer review and merge requests)</p>
</li>
<li><p><strong>controlled</strong> (restricted merge access and staged rollout)</p>
</li>
<li><p><strong>visible</strong> (clear plans and logs in CI pipelines)</p>
</li>
<li><p><strong>traceable</strong> (linked to Jira tasks and versioned releases)</p>
</li>
</ul>
<p>Overall, the CI/CD approach treated infrastructure as a continuously managed system, with clear processes for validation, review, and promotion across environments.</p>
<h2>15. Environment Separation: Dev, Staging, Prod, Sandbox</h2>
<p>Environment separation was a core principle of the landing zone design, ensuring that workloads could be developed, tested, and operated without introducing unnecessary risk to production systems.</p>
<p>At a higher level, the key operational split was between <strong>NonProd</strong> and <strong>Prod</strong>, even though the workload layer still exposed <strong>dev</strong>, <strong>staging</strong>, and <strong>prod</strong> as separate subscriptions.</p>
<p>Under the <strong>Workloads management group</strong>, subscriptions were organized by environment:</p>
<ul>
<li><p><code>dev</code></p>
</li>
<li><p><code>staging</code></p>
</li>
<li><p><code>prod</code></p>
</li>
</ul>
<p>This structure provided clear isolation between environments, both in terms of infrastructure and access. Development and staging environments formed the non-production side for building and validating changes, while production remained stable and tightly controlled.</p>
<p>The same principle existed in the platform layer, where Identity, Connectivity, Management, and Shared Services had non-production and production boundaries of their own. That allowed platform changes to be validated safely before affecting the live control plane.</p>
<p>Access and governance differed across environments. Non-production environments allowed more flexibility for development and testing, enabling teams to iterate quickly. In contrast, production environments had stricter access controls, tighter governance, more review, and fewer exceptions to reduce risk.</p>
<p>This separation also aligned with the CI/CD workflow. Changes were first applied and validated in <strong>sandbox or non-production environments</strong>, where infrastructure updates could be tested safely. Only after validation were changes promoted to production, ensuring that deployments were based on tested configurations rather than assumptions.</p>
<p>The <strong>Sandboxes management group</strong> provided additional isolation for experimentation. The platform team (consisting of four engineers) had access to multiple sandbox subscriptions, which were used for testing new features and infrastructure changes.</p>
<p>To optimize this process, CI pipelines dynamically selected a sandbox subscription where resources were not currently deployed and used it for testing. This allowed parallel experimentation without conflicts between team members.</p>
<p>To avoid unnecessary costs, sandbox resources were treated as <strong>ephemeral</strong>. Infrastructure deployed for testing was automatically cleaned up using scheduled jobs (cron-based pipelines in GitLab CI), typically running at the end of the day. This ensured that unused resources did not persist beyond their purpose. In cases where longer testing was required, this cleanup behavior could be adjusted or disabled as needed.</p>
<p>Another important aspect was consistency across environments. While access levels and governance differed, the underlying infrastructure patterns remained the same. The same OpenTofu modules and deployment workflows were used across dev, staging, and prod, minimizing drift and ensuring predictable behavior when promoting changes.</p>
<p>Overall, environment separation ensured <strong>clear boundaries, controlled risk, and efficient resource usage</strong>, supporting both rapid development and stable production operations.</p>
<h2>16. Operational Model and Team Responsibilities</h2>
<p>Beyond the technical design, it was important to define a clear operational model regarding who owns what, how changes are made, and how responsibilities are divided across teams.</p>
<p>The platform was managed by a <strong>small platform engineering team of four members</strong>, responsible for designing, maintaining, and evolving the landing zone and its core components. This included management groups, subscriptions, networking, shared services, and infrastructure modules.</p>
<p>A key principle was <strong>clear ownership boundaries</strong>. Platform-level resources, such as networking, shared services, and foundational infrastructure, were owned and managed by the platform team. This ensured consistency and avoided fragmentation of critical components.</p>
<p>A useful way to think about the operating model is that the platform team owned the shared control plane, while workload teams owned the runtime behavior of their applications. Even when the platform team provided templates or automation for AKS, messaging, or data services, those components still belonged architecturally to the workload boundary rather than the shared platform layer.</p>
<p>Application teams operated within the <strong>workload subscriptions</strong>, but direct access to the Azure portal was intentionally limited. Instead of broad access, the focus was on <strong>enablement through self-service</strong>. The platform provided predefined, reusable patterns (golden templates) that teams could use to deploy their services without needing deep knowledge of Azure, Kubernetes, or underlying infrastructure.</p>
<p>This approach reduced the risk of misconfigurations while also lowering the barrier for teams that were not yet familiar with cloud-native concepts. Rather than requiring every team to understand the full platform, the responsibility was shifted toward the platform team to provide a reliable and easy-to-use interface.</p>
<p>In exceptional cases, <strong>break-glass access</strong> was available for debugging or emergency scenarios, but this was tightly controlled and not part of normal operations.</p>
<p>Infrastructure changes were handled exclusively through <strong>Infrastructure as Code and CI/CD workflows</strong>, ensuring that all changes were versioned, reviewed, and consistent. This avoided manual changes in the portal and kept the platform predictable.</p>
<p>The operational model also involved collaboration with internal IT and security teams, particularly around networking, identity, and access decisions. This ensured that the platform aligned with broader organizational requirements rather than operating in isolation.</p>
<p>Overall, the model focused on <strong>centralized control with decentralized usage</strong>: the platform team owned and operated the infrastructure, while application teams were enabled to use it through standardized, self-service patterns.</p>
<h2>17. Key Trade-offs and Decisions</h2>
<p>Designing the landing zone involved a number of trade-offs between control, flexibility, and simplicity. Rather than aiming for a "perfect" architecture, the goal was to make practical decisions that aligned with the organization's needs and maturity level.</p>
<p>One of the main trade-offs was between <strong>centralized control and team autonomy</strong>. Direct access to the Azure portal was limited, and most operations were handled through predefined templates and CI/CD workflows. This reduced the risk of misconfiguration and improved consistency, but also meant that teams relied on the platform layer rather than having full control. Given that many teams were still early in their cloud adoption, this trade-off favored stability and enablement over flexibility.</p>
<p>Another decision was around <strong>subscription and environment separation</strong>. Splitting environments (dev, staging, prod) across separate subscriptions improved isolation and reduced risk, but introduced additional management overhead. Similarly, separating platform subscriptions into non-production and production added safety, but increased complexity in terms of structure and naming.</p>
<p>There was also a balance between <strong>strong governance and developer experience</strong>. Applying too many policies or restrictions early on could slow down teams, while too little governance would lead to inconsistency and potential security risks. The approach taken was to introduce guardrails gradually, focusing on practical controls rather than enforcing everything upfront.</p>
<p>In networking, adopting a <strong>private-first approach</strong> improved security and control, but added complexity in areas such as DNS, connectivity, and troubleshooting. This required additional effort upfront, but provided a more secure and scalable foundation in the long term.</p>
<p>Another trade-off was in <strong>shared services vs workload ownership</strong>. Centralizing networking, policy, and secrets management improved consistency and control, but I did not want the platform layer to become a dumping ground for runtime dependencies. Components such as Kafka, ActiveMQ, databases, and storage might be common within an application landscape, but they still belonged closer to the workload subscriptions because their scaling, availability, and incident ownership were tied to the applications consuming them.</p>
<p>Finally, the <strong>hybrid setup for container registries</strong> (GitLab and Azure Container Registry) introduced some duplication in CI/CD pipelines. However, this decision was necessary to maintain compatibility with existing on-premises workflows while enabling a gradual transition toward cloud-native practices.</p>
<p>Overall, these decisions were guided by the principle of building a platform that was <strong>secure, scalable, and usable</strong>, while acknowledging the constraints of existing systems and team maturity.</p>
<h2>18. Challenges Encountered</h2>
<p>While the overall structure provided a solid foundation, implementing the landing zone came with several practical challenges, both technical and organizational.</p>
<p>One of the main challenges was <strong>operating in a hybrid environment</strong>. Existing systems needed to continue functioning on-premises while new workloads were being introduced in Azure. For example, certain applications had to remain operational in their original setup while being gradually migrated and tested in AKS. This required careful coordination to ensure both environments could coexist without disruption.</p>
<p><strong>Networking and connectivity</strong> were also complex, particularly as requirements grew. As new regions and external partners were introduced, ensuring reliable and scalable connectivity became more challenging. This led to exploring solutions such as VPN Gateway configurations (including higher-tier SKUs) and addressing NAT and routing considerations to support expanding connectivity needs.</p>
<p>Another significant challenge was <strong>adoption and enablement of development teams</strong>. Many teams were not familiar with cloud, Kubernetes, or infrastructure concepts. While input from teams was important, it was not always directly actionable. In some cases, requirements reflected existing ways of working rather than future needs. This required balancing feedback with a forward-looking approach, similar to the idea that, if asked, users might request incremental improvements to what they already know rather than a fundamentally better model.</p>
<p>There was also resistance to <strong>changing established practices</strong>. Some processes had been followed in a certain way for a long time, and moving toward infrastructure as code, self-service models, and cloud-native patterns required a shift in mindset. This was not purely a technical change, but an organizational one.</p>
<p>At the same time, it was important to <strong>align with real requirements</strong>. While introducing new patterns and improvements, the platform still needed to support existing workflows and constraints. This meant finding a balance between innovation and compatibility, rather than enforcing change too aggressively.</p>
<p>Overall, many of the challenges were not just about designing the platform, but about <strong>integrating it into an existing ecosystem</strong>: balancing legacy systems, new technologies, and team readiness.</p>
<h2>19. Lessons Learned</h2>
<p>Looking back, several key lessons stood out from designing and implementing the landing zone.</p>
<p>One of the most important was that <strong>structure should follow ownership and operations</strong>, not just technical best practices. Decisions around management groups, subscriptions, and access only worked well when they reflected how teams actually operated.</p>
<p>Another key lesson was to <strong>keep the design as simple as possible, but not simpler</strong>. It is easy to overengineer early, especially when trying to account for future scale. In practice, a clear and understandable structure proved more valuable than a highly complex one.</p>
<p><strong>Access control needs to be designed early</strong>. RBAC becomes difficult to fix later, and unclear ownership or overly broad permissions can quickly create problems as the platform grows. Investing time upfront in defining roles and boundaries pays off significantly.</p>
<p><strong>Networking decisions have long-term impact</strong>. Address space planning, connectivity models, and private networking choices are difficult to change later. Taking time to validate assumptions, especially with existing on-premises systems, was critical.</p>
<p>Another important lesson was around <strong>enablement over control</strong>. Instead of giving teams direct access and expecting them to manage infrastructure, providing self-service patterns and templates proved more effective, especially for teams new to cloud and Kubernetes.</p>
<p>Working in a hybrid environment also reinforced the importance of <strong>pragmatism over idealism</strong>. Not all decisions can follow best practices when existing systems and constraints are involved. Supporting both on-premises and cloud workflows required flexibility and incremental change rather than a complete redesign.</p>
<p>Finally, <strong>platform work is as much organizational as it is technical</strong>. Aligning with teams, managing expectations, and gradually introducing new ways of working were just as important as the technical design itself.</p>
<p>These lessons helped shape not just the landing zone, but also how the platform evolved over time.</p>
<h2>20. What I Would Do Differently</h2>
<p>With the benefit of hindsight, there are several areas where the approach could be improved or simplified.</p>
<p>One area is <strong>simplifying environment and naming consistency</strong>, particularly within platform subscriptions. While separating platform domains across non-production and production added safety, it also introduced some overlap and cognitive overhead. A more streamlined naming approach could achieve the same isolation with less complexity.</p>
<p>Another improvement would be to <strong>define and document the operating model earlier</strong>. While many decisions were aligned with how teams worked, having clearer documentation and onboarding guidance from the beginning would have made it easier for other teams to understand and adopt the platform.</p>
<p><strong>Governance could also be introduced more progressively but with clearer direction</strong>. While avoiding overly strict controls early on helped with flexibility, having a more defined roadmap for governance and policy enforcement would make long-term alignment easier.</p>
<p>In networking, while the design worked well, earlier alignment on <strong>future connectivity requirements</strong> (such as expanding regions, new partners, and scaling VPN capacity) could have reduced the need for later adjustments.</p>
<p>Another area for improvement is <strong>developer onboarding and enablement</strong>. While self-service patterns and templates were introduced, investing earlier in documentation, examples, and clear workflows could have reduced the learning curve for teams less familiar with cloud and Kubernetes.</p>
<p>Finally, in a hybrid environment, it would be beneficial to <strong>plan the transition strategy more explicitly</strong>. Supporting both on-premises and cloud workflows was necessary, but having a clearer roadmap for gradual migration could help reduce complexity over time.</p>
<p>Overall, most improvements are not about changing the core design, but about <strong>simplifying, documenting, and aligning earlier</strong>, making the platform easier to adopt and evolve.</p>
<h2>21. How the Landing Zone Enabled Later Platform Work</h2>
<p>Once the landing zone was in place, it provided a stable and predictable foundation for building higher-level platform capabilities.</p>
<p>With <strong>clear subscription boundaries and management group structure</strong>, it became straightforward to onboard new workloads without redefining access, governance, or networking each time. Teams could be onboarded into predefined environments rather than starting from scratch. A new workload could be placed into the correct subscription model, inherit baseline policies and RBAC, and connect its spoke network to the central connectivity layer without redesigning the foundations each time.</p>
<p>The <strong>networking foundation</strong> enabled secure deployment of services such as AKS, with private connectivity, controlled ingress/egress, and integration with existing systems. Because address spaces, peering patterns, firewall control, and DNS behavior were already defined and validated, new services could be deployed without rethinking network design.</p>
<p>The <strong>RBAC and identity model</strong> allowed controlled access to both infrastructure and applications. This made it possible to integrate CI/CD pipelines and automation safely, as permissions were already scoped and aligned with responsibilities.</p>
<p>The use of <strong>Infrastructure as Code and CI/CD workflows</strong> meant that new components such as Kubernetes clusters, networking resources, or shared services could be deployed in a consistent and repeatable way. This significantly reduced the risk of configuration drift and made scaling the platform much easier.</p>
<p>Shared services such as <strong>Key Vault, container registries, and centralized networking</strong> provided reusable building blocks that application teams could rely on, rather than reimplementing core infrastructure for each workload. At the same time, runtime components such as AKS, messaging, databases, and storage stayed within workload boundaries, which kept ownership clearer when applications were onboarded.</p>
<p>This foundation also enabled the introduction of <strong>GitOps patterns and Kubernetes-based workloads</strong>, where deployments could be managed in a structured and automated way, building on top of the existing platform.</p>
<p>Overall, the landing zone transformed Azure from a set of individual resources into a <strong>cohesive platform</strong>, where infrastructure, security, and operations were aligned. This allowed the focus to shift from setting up environments to actually running and scaling workloads.</p>
<p>In the next part, I will go deeper into how this foundation was used to build and operate a Kubernetes platform, including GitOps workflows and application onboarding.</p>
<h2>22. Additional Design Considerations</h2>
<p>In addition to the core landing zone design, there were several supporting considerations that helped keep the platform consistent and operationally manageable.</p>
<p><strong>Naming conventions and tagging</strong> were introduced early to maintain clarity across resources. Subscriptions, resource groups, and services followed consistent naming patterns, while tags such as environment, ownership, and team helped with identification, cost tracking, and operational visibility.</p>
<p>At the resource level, a clear structure was followed to separate <strong>platform, networking, and application resources</strong>. Resource groups were organized based on responsibility and lifecycle, ensuring that shared infrastructure remained distinct from workload-specific components.</p>
<p>Connectivity to on-premises systems was an important aspect of the design. The platform needed to integrate with existing infrastructure while supporting future expansion. This required careful planning of <strong>VPN connectivity, address spaces, and DNS resolution</strong>, as well as coordination with internal IT teams to avoid conflicts and maintain trust boundaries between environments.</p>
<p>For automation, <strong>service principals and managed identities</strong> were used instead of user-based access. CI/CD pipelines (GitLab) were granted scoped permissions aligned with their responsibilities, ensuring that infrastructure changes could be applied securely and consistently without exposing unnecessary privileges.</p>
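<p>The post does not specify the exact credential mechanism, but one way to give a GitLab pipeline scoped, secretless access to Azure is OIDC federation, where the job exchanges a short-lived ID token for an Azure login instead of storing a client secret. A sketch (variable names and audience are illustrative; the client and tenant IDs would come from masked CI/CD variables, and the service principal holds only the roles its pipeline needs):</p>

```yaml
# GitLab CI job (sketch): authenticate to Azure with a short-lived
# federated token rather than a long-lived client secret.
plan-infra:
  stage: plan
  id_tokens:
    AZURE_JWT:
      aud: api://AzureADTokenExchange   # must match the federated credential
  script:
    - az login --service-principal
        --username "$AZURE_CLIENT_ID"
        --tenant "$AZURE_TENANT_ID"
        --federated-token "$AZURE_JWT"
    - tofu plan -input=false
```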
<p>Basic audit and monitoring considerations were also included, such as ensuring that activity logs, diagnostic settings, and Defender for Cloud coverage were available where needed. While not the primary focus of the landing zone, this provided a foundation for future observability and security monitoring.</p>
<p>These additional elements supported the overall goal of creating a platform that was not only structured and secure, but also maintainable and scalable in day-to-day operations.</p>
<p>Taken together, these decisions helped turn Azure from a collection of cloud resources into a structured operating model that could support secure growth, repeatable delivery, and future platform evolution.</p>
]]></content:encoded></item><item><title><![CDATA[Designing a Developer Platform: From Infrastructure to Self-Service]]></title><description><![CDATA[1. Infrastructure Was Not the Hard Part
Earlier in this series, I wrote about the Azure foundation work: landing zones, subscription boundaries, RBAC, networking, and the operating model needed to mak]]></description><link>https://blog.ammarplatform.com/designing-a-developer-platform-from-infrastructure-to-self-service</link><guid isPermaLink="true">https://blog.ammarplatform.com/designing-a-developer-platform-from-infrastructure-to-self-service</guid><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Azure]]></category><category><![CDATA[aks]]></category><category><![CDATA[gitops]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[DeveloperExperience]]></category><category><![CDATA[opentofu]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Tue, 16 Dec 2025 09:30:00 GMT</pubDate><content:encoded><![CDATA[<h2>1. Infrastructure Was Not the Hard Part</h2>
<p>Earlier in this series, I wrote about the Azure foundation work: landing zones, subscription boundaries, RBAC, networking, and the operating model needed to make cloud adoption manageable. That work mattered, but it was not the point where application teams actually felt enabled. It was the point where the real platform problem became visible.</p>
<p>Once the Azure side was structured and AKS clusters were available, the assumption from the outside was often that the difficult part was over. The organization had cloud infrastructure, CI/CD pipelines, Kubernetes, and the usual set of modern tooling. On paper, that sounds like enablement. In practice, it only meant the raw ingredients were now present. The day-to-day experience for developers was still far more complicated than it needed to be.</p>
<p>This is a gap I have seen repeatedly. Teams ask for Kubernetes, infrastructure as code, CI/CD, or cloud resources, and those things get delivered. But giving people access to powerful systems is not the same as making them productive with those systems. A running AKS cluster does not automatically become a usable application platform. A GitLab pipeline does not become a deployment model just because it exists. If every team still depends on the platform team to interpret manifests, fix ingress, manage secrets, explain environment behavior, or rescue broken deployments, then infrastructure has been provisioned but the platform has not really been designed.</p>
<p>That distinction became central to the work. The problem was no longer how to stand up Azure resources. The problem was how to turn Azure, AKS, GitLab CI/CD, ArgoCD, OpenTofu, Prometheus, and Grafana into something that application teams could use safely and repeatedly without needing a DevOps engineer every time they wanted to make a change.</p>
<h2>2. Where Developers Were Actually Struggling</h2>
<p>The environment was built around Azure and Kubernetes, supporting a growing microservices landscape. From a platform perspective, that was a reasonable direction. From a developer perspective, it came with a large amount of operational surface area that most teams had no reason to become experts in.</p>
<p>What slowed teams down was not usually the application code itself. It was everything around the code. A team could build a service, but getting it from repository to reliable runtime meant dealing with Kubernetes manifests, image build conventions, service exposure, ingress behavior, environment-specific configuration, secret handling, rollout behavior, and runtime debugging. Even small mistakes in those areas could cause deployments to fail in ways that were difficult to reason about if your day job was building product features rather than operating clusters.</p>
<p>Kubernetes YAML was a common source of friction, but the issue was broader than syntax. A manifest is not just a configuration file. It encodes operational decisions. A developer writing a Deployment, Service, or Ingress definition is making decisions about health checks, scaling assumptions, network exposure, restart behavior, labels, selectors, and configuration layout, whether they realize it or not. In a microservices environment, those decisions get repeated over and over again across services, environments, and teams. If each team makes them differently, inconsistency becomes normal very quickly.</p>
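<p>To make that concrete, here is a minimal, illustrative Deployment fragment with the embedded decisions annotated; the service name, image path, and values are placeholders, but every commented field is a choice that would otherwise be re-made per service, per team:</p>

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api                # naming convention decision
  labels:
    app: orders-api               # label/selector scheme decision
spec:
  replicas: 2                     # scaling assumption
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.4.2  # tagging convention
          ports:
            - containerPort: 8080
          readinessProbe:         # health-check semantics
            httpGet:
              path: /healthz
              port: 8080
          resources:              # capacity and eviction behavior
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
```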
<p>Azure introduced a second layer of complexity on top of Kubernetes. Networking alone could become a significant tax: private endpoints, private DNS, internal versus external exposure, ingress patterns, and the difference between something being reachable inside the cluster, inside a VNet, or from outside the environment entirely. Then there was secrets management, where developers needed a safe way to consume application secrets without hardcoding values, embedding them in repo variables indefinitely, or treating Kubernetes secrets as if they were a complete secrets strategy.</p>
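<p>As one small example of that Azure-plus-Kubernetes surface: on AKS, whether a Service gets a public IP or an internal load balancer inside the VNet can hinge on a single annotation (the service name here is hypothetical):</p>

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-api
  annotations:
    # On AKS this provisions an internal load balancer in the VNet instead
    # of a public IP, exactly the kind of detail a platform default should
    # decide once rather than every team rediscovering it the hard way.
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: orders-api
  ports:
    - port: 80
      targetPort: 8080
```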
<p>CI/CD was another pain point. Developers did not just need a pipeline; they needed to understand how images were tagged, where artifacts were published, what promoted a change from one environment to another, how deployment state was represented, and why a pipeline passed while the workload still failed after deployment. That distinction between build success and runtime success often created confusion. The question was rarely "Did the code compile?" It was more often "Why is the application healthy in one environment, but not in another?" or "Why did the cluster accept this change but the service still is not reachable?"</p>
<p>The natural consequence was dependency on the platform or DevOps team. Requests came in under different labels, but many of them meant the same thing: something about the platform was harder than the application team should have to absorb. Sometimes that showed up as a deployment request. Sometimes it was a networking question. Sometimes it was a secrets issue, an ArgoCD sync problem, or a pod repeatedly crashing for reasons that were obvious only if you already understood the runtime. Over time, the platform team becomes a human API for infrastructure and operations, which is not scalable for either side.</p>
<h2>3. Why Raw Kubernetes Was the Wrong Interface</h2>
<p>One of the important lessons in this work was that the answer was not to insist that every developer learn Kubernetes more deeply. A certain level of platform awareness is useful, and application teams should understand the operational basics of the systems they run on. But there is a difference between healthy operational ownership and pushing infrastructure complexity downstream because the platform has not been productized.</p>
<p>AKS is a strong runtime when used well. GitLab and ArgoCD are good building blocks. Azure provides the necessary primitives for identity, networking, and secrets. None of that changes the fact that the combined abstraction level is still too low for most product teams to work against directly. Expecting every backend engineer to think fluently in terms of ingress classes, RBAC scopes, managed identities, private DNS resolution, rollout health, and GitOps reconciliation is usually a sign that the platform team has exposed implementation details as the user interface.</p>
<p>That is not a criticism of developers. It is a design problem. Most application teams are trying to ship business capability. Their mental model starts with endpoints, dependencies, configuration, latency, failure handling, and domain behavior. When a team has to become part-time cluster operator just to release a service safely, the platform is asking them to spend cognitive energy on the wrong layer.</p>
<p>This mattered even more in a microservices model. A monolith can hide a lot of infrastructure complexity simply because the deployment surface is smaller. A microservices landscape does the opposite. It multiplies the number of deployable units, network paths, secrets, dashboards, and failure modes. That makes standardization and abstraction more valuable, not less. Without them, every new service adds not only application behavior but another copy of the same infrastructure decisions.</p>
<p>Kubernetes was not the problem. Exposing it directly to developers as the default interface was.</p>
<p>The goal, then, was not to teach every team deep Kubernetes internals. The goal was to make sure they did not need deep Kubernetes internals for the common path. That is a very different design problem from simply giving people access to a cluster.</p>
<h2>4. The Shift From Tooling to Product Thinking</h2>
<p>The language shift from "providing infrastructure" to "designing a platform" sounds cosmetic until you feel the difference in day-to-day work. When the job is framed as infrastructure delivery, success is easy to define in component terms. The cluster exists. The pipeline runs. The OpenTofu applies cleanly. ArgoCD is installed. Prometheus is scraping. Those are all useful milestones, but they still say very little about whether an application team can get a service into production without tripping over platform internals.</p>
<p>Once the work was treated as platform design, the questions changed. What does a sane onboarding path look like for the next microservice, not the current one? Which parts of Azure and Kubernetes should be invisible to an application team most of the time? Where do we want flexibility, and where do we want one opinionated answer because variation only creates support load? If a team needs to ship a routine change, can they do it safely without broad Azure permissions, a kubeconfig, or a side conversation with the platform team?</p>
<p>Those questions led to a small set of principles that were useful precisely because they were not theoretical. Reduce cognitive load instead of moving it around. Prefer one good path over five loosely supported ones. Encode security and governance into the workflow rather than relying on everyone to remember them. Automate the repetitive parts. Make self-service real, but keep the blast radius controlled. Self-service without guardrails is just delegated risk.</p>
<p>Once that became the frame, the tooling started to fall into place. OpenTofu was the way to keep the Azure and AKS foundation consistent. GitLab CI/CD was the obvious interface because developers already lived there. ArgoCD gave us a reconciler and an audit trail instead of a collection of imperative deploy steps. Prometheus and Grafana stopped being side projects and became part of what it meant to run on the platform. Key Vault was not just where secrets lived; it was part of the expected way services consumed sensitive configuration.</p>
<p>The point was not to make the infrastructure look simpler than it was. The point was to keep it out of the developer's critical path.</p>
<h2>5. Platform Architecture at a Glance</h2>
<p>The easiest way to explain the platform is to follow a single change. A developer opens or merges a change in GitLab. CI builds the image, tags it, and runs the expected checks. Deployment state is updated in Git rather than by calling into the cluster directly. ArgoCD notices the change and reconciles AKS toward that declared state. The workload starts behind an approved networking pattern, and its runtime behavior is already visible through the shared observability stack.</p>
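<p>The reconciliation half of that flow can be sketched as an Argo CD Application; the repository URL, path, and namespaces below are placeholders, not the actual setup:</p>

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/platform/deployments.git
    targetRevision: main
    path: orders-api/prod      # CI commits the new image tag here
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true              # remove resources deleted from Git
      selfHeal: true           # revert manual drift in the cluster
```

The `automated` sync policy is what replaces imperative deploy steps: CI only writes the desired state to Git, and the controller converges the cluster toward it.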
<p>Under that developer-facing path sat the Azure and Kubernetes foundation. OpenTofu provisioned the repeatable Azure structure, the AKS integration points, and the surrounding platform dependencies. The important detail was not that developers never touched Azure. It was that they did not need to think directly in terms of DNS zones, ingress plumbing, RBAC assignments, or secret wiring to ship a normal service change.</p>
<p>Secrets followed the same general idea. Sensitive values lived in Azure Key Vault, and the platform defined how those values became available to workloads. Observability followed it too. Prometheus and Grafana were not optional extras teams had to discover later; they were part of the runtime contract.</p>
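<p>The post does not name the exact integration mechanism, but a common pattern on AKS is the Azure Key Vault provider for the Secrets Store CSI driver, roughly like this (the vault name, identity IDs, and secret name are all placeholders):</p>

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: orders-api-secrets
spec:
  provider: azure
  parameters:
    keyvaultName: kv-platform-prod            # placeholder vault name
    clientID: <workload-identity-client-id>   # placeholder identity
    tenantId: <tenant-id>
    objects: |
      array:
        - |
          objectName: orders-db-password      # placeholder secret
          objectType: secret
```

With this in place, pods mount secrets from Key Vault at runtime instead of values being hardcoded or parked in repository variables.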
<p>That mental model turned out to be important. If a team cannot explain the deployment path in a few sentences, they usually do not trust it. Developer to GitLab, GitLab to Git state, ArgoCD to AKS, then metrics and dashboards available by default was simple enough to hold in your head even though the platform underneath was not simple at all.</p>
<h2>6. Designing the Platform Contract</h2>
<p>Once the platform was treated as something engineers would consume rather than admire from a diagram, the next step was defining the contract with application teams. If that contract lives in tribal knowledge, the platform does not scale. People start succeeding based on who they know, which repository they copied from last, or which engineer happens to remember why a particular service was set up differently two years ago.</p>
<p>The contract needed to answer a few basic questions very clearly. What does a service team provide? What does the platform generate, enforce, or manage for them? Which decisions still belong to the application, and which ones are intentionally taken off the table? That boundary matters because production incidents have a habit of finding any responsibility that was left ambiguous.</p>
<p>In our case, application teams owned their code, service-specific configuration, health semantics, and the runtime behavior of what they built. The platform owned the repeated scaffolding around that code: deployment structure, GitOps mechanics, secrets integration, exposure patterns, environment layout, and the defaults that should not be re-decided from repository to repository.</p>
<p>That is why the developer interface could not be raw AKS or the Azure portal. It had to live where developers already worked: repository structure, standard configuration, merge requests, and CI/CD. A developer should not need a kubeconfig to deploy a routine change.</p>
<p>A lot of weak self-service models fail exactly here. They claim to abstract complexity but still force teams to think in cluster terms for everyday work. If the normal deployment path still depends on people understanding namespaces, ingress annotations, ArgoCD behavior, and Azure resource relationships in detail, the platform has only renamed the problem.</p>
<h2>7. Golden Paths and Reusable Templates</h2>
<p>The most concrete part of that contract was the introduction of golden templates and reusable deployment patterns. This was where the platform stopped being theoretical and started changing the daily experience of building and releasing services.</p>
<p>Before that work, too many teams were solving the same problems slightly differently. One service had one pipeline structure, another had a different tagging model, another used a different deployment layout, and another copied a manifest from an older repository and adjusted it by trial and error. Those differences were rarely deliberate architecture decisions. Most of the time they were just accumulated variation. That kind of variation becomes expensive very quickly because the platform team now has to support not only the applications, but every historical interpretation of how an application might be deployed.</p>
<p>The golden path was designed to remove that unnecessary variation. GitLab CI/CD templates standardized how services were built, tagged, scanned, and promoted. Deployment templates standardized how a service described its runtime needs. The configuration structure across environments was made consistent so teams did not have to invent their own model for dev, test, and production every time a new service was onboarded.</p>
<p>This did not mean every application became identical. It meant the common path became predictable. A team starting a new service no longer had to assemble the delivery model from scratch. They inherited a working pattern. The platform templates already knew how to build a container image, publish it through the approved path, update the GitOps source of truth, and let ArgoCD reconcile the workload into AKS. Default labels, common probes, naming patterns, environment structure, and other repetitive details were handled the same way across services unless there was a valid reason not to.</p>
<p>For a typical service, that meant the team started from a standard GitLab template, filled in the service-specific inputs, and stayed focused on the behavior of the application itself. They still decided what healthy looked like, what dependencies the service had, whether it should be internal or externally reachable, and what runtime profile it needed. They no longer had to rebuild the surrounding deployment model each time or guess which pieces were mandatory because a previous repository happened to include them.</p>
<p>That changed the nature of the work for application teams. Instead of writing and maintaining a large amount of repetitive Kubernetes and pipeline configuration, teams mainly provided the parts that were genuinely specific to the service. What port does the application listen on? Should it be exposed internally or externally? Which secrets does it need? Does it need more than the default resource profile? What does healthy look like? Those are meaningful questions. Requiring every team to also handcraft the surrounding deployment machinery was not.</p>
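<p>To make those service-specific inputs concrete, they might be declared in a small values file like the one below. This is an illustrative sketch, not the platform's actual template schema; every field name here is hypothetical.</p>

```yaml
# Hypothetical team-facing values file for a platform template.
# Field names are illustrative, not the real schema.
service:
  name: orders-api
  port: 8080                  # what port the application listens on
  exposure: internal          # internal | external | cluster-only
  resources: default          # or a named larger runtime profile
  healthPath: /healthz        # what "healthy" looks like for this service
  secrets:
    - DB_CONNECTION_STRING    # resolved through the Key Vault-backed pattern
```

<p>Everything else — image build, tagging, GitOps update, probes, labels — would come from the shared template defaults.</p>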
<p>The best templates do more than save time. They shrink the number of decisions that can go wrong. In platform work, reducing the decision surface is often more valuable than adding more options.</p>
<h2>8. Abstraction Without Losing Operational Ownership</h2>
<p>One of the easiest mistakes in platform work is to confuse abstraction with hiding reality. That was never the aim. The aim was to remove the repetitive, fragile infrastructure work from the daily developer path without pretending that operational responsibility had vanished.</p>
<p>There is a difference between hiding Kubernetes and hiding the consequences of running on Kubernetes. Application teams still needed to understand their own probes, scaling behavior, dependency timeouts, startup patterns, and failure modes. If a service fell over because it could not handle a database reconnect or because its readiness endpoint was misleading, that was still an application problem. No amount of template work changes that.</p>
<p>What the platform absorbed were the mechanics that were both necessary and endlessly repeated: how deployment state was rendered, how secrets were supplied, how the GitOps update happened, how approved service exposure worked, and how the standard build and promotion path behaved. Those were not strategic decisions each team needed to make for itself. They were recurring opportunities for drift and support tickets.</p>
<p>You could see the difference in mundane tasks. Before the platform, getting a service reachable might mean arguing with ingress annotations, checking whether the Service selector matched the Deployment labels, and discovering that development and production had evolved slightly different conventions. A pipeline could go green while the pod still landed in <code>CrashLoopBackOff</code> because the expected secret key was missing or the readiness probe assumed a path that no longer existed. After the abstractions were in place, teams still had to declare intent, but they did it through a narrower interface and with fewer ways to get the plumbing wrong.</p>
<p>That is the useful kind of abstraction. It reduces friction without diluting ownership.</p>
<h2>9. Self-Service Through GitLab, GitOps, and ArgoCD</h2>
<p>The self-service model worked because it used an interface developers already trusted. GitLab was already where code changed, where merge requests were reviewed, and where pipelines were expected to run. It made more sense to expose platform capabilities there than to ask application teams to become occasional Azure operators or cluster administrators.</p>
<p>The flow itself was straightforward, which was exactly the point. A change started in the application repository. GitLab built the image, ran the expected checks, and pushed the artifact through the approved path. The deployable state was then updated in Git rather than applied directly to the cluster. ArgoCD watched that declared state and reconciled AKS toward it.</p>
<p>That changed more than the mechanics. It removed a whole category of half-manual work that tends to accumulate around weak deployment models. A green pipeline no longer meant someone still had to grab the right context, apply manifests by hand, or fix the environment after the fact. The deploy step stopped being tribal knowledge.</p>
<p>It also gave us a cleaner operating model. Git became the record of intent. Merge requests became the place where deployment-affecting changes were reviewed. ArgoCD reduced the drift that creeps in as soon as manual cluster changes become normal. The platform team no longer had to treat direct <code>kubectl</code> access as the standard path, which made the state of the environment far easier to reason about later.</p>
<p>The important part was not just that developers could deploy for themselves. It was that they could do it without broad AKS or Azure permissions. The workflow was the interface. That is a better kind of autonomy than handing out elevated access and hoping discipline scales.</p>
<p>The same model held for promotion. Moving from development to test or production was not a different ritual with a different toolset. It was the same path with tighter controls and environment-specific differences made explicit.</p>
<h2>10. Concrete Examples of the Platform in Practice</h2>
<h3>Onboarding a New Internal API Service</h3>
<p>One of the clearest ways to explain the difference this made is to look at a very ordinary case: a backend microservice exposing an internal API for other services in the environment. Nothing about that kind of service is unusual. That is exactly why it is a useful example. If the platform cannot make the common case easy, it does not matter how sophisticated the underlying tooling is.</p>
<p>Before the platform patterns were in place, onboarding a service like this involved more infrastructure decision-making than most application teams wanted to own. The team would build the container, then start asking the familiar questions. Which manifest structure should be used? Does this need an Ingress or only a Service? How should it be exposed internally? Where do the secrets go? Which variables belong in the pipeline, and which belong in the cluster? Which environment-specific settings need to be duplicated by hand? A pipeline might build successfully, but that still left plenty of ways for the deployment to fail later. The Service selector might not match the Deployment labels. The ingress path might be correct for one environment and wrong for another. The pod might come up only to fail its readiness probe because the expected secret was missing or mounted under a different key.</p>
<p>That usually led to the same kind of support loop. Someone from the platform team would diff manifests, inspect the namespace, compare the service with an older repository that was "close enough," and work backwards from the symptom. Even when the issue was fixed, the model had not improved. The next team would run into a slightly different version of the same problem.</p>
<p>After the platform model settled, the same service followed a much narrower path. The repository started from the standard GitLab template. The team supplied the application-specific inputs, declared that the service was internal rather than externally published, referenced the required secrets through the approved Key Vault-backed pattern, and let the pipeline handle the rest. GitLab built the image, the deployment state was updated through Git, and ArgoCD reconciled the change into AKS. The service became reachable through the approved internal route without the team needing to re-design ingress, DNS behavior, or secret delivery from scratch.</p>
<p>Promotion worked the same way. Moving the service forward was not a separate deployment ritual. It was a controlled change through the same model, with environment-specific configuration where needed and stricter review where it mattered. The point was not that no one ever needed help. The point was that the common path stopped depending on expert intervention.</p>
<h3>Turning Service Exposure Into a Platform Decision</h3>
<p>Another recurring problem was service exposure. In an Azure and AKS environment with private networking, ingress, private DNS, and different internal and external paths, the question "Should this service be reachable, and by whom?" had a lot more behind it than most teams expected. A service was not simply public or private. It could be cluster-internal only, private inside the wider environment, or deliberately published through an approved external route. Each option implied different ingress behavior, DNS records, certificates, and access boundaries.</p>
<p>Left to individual teams, this became one of the most reliable ways to create inconsistency. Some services were exposed too broadly because the quickest route in development got copied forward. Others were harder to consume than they needed to be because the team did not have a stable model for internal reachability. The symptom looked simple: the workload was running, but the caller could not reach it, or it was reachable from places it should never have been reachable from.</p>
<p>The fix was to stop treating exposure as a low-level implementation detail each repository had to solve independently. The platform reduced it to a small set of supported intent-based choices. A team could say a service was cluster-internal, private internal, or externally published through the approved route. From there, the templates and GitOps structure mapped that decision to the right ingress and DNS behavior. The team still owned whether the service should be exposed. They no longer had to own every underlying networking decision as well.</p>
<p>That sounds like a small abstraction, but it removed a disproportionate amount of support load. It also closed off a class of configuration drift that is hard to detect until a service is already in use. This is the kind of problem platform engineering should solve once rather than asking every team to learn it independently.</p>
<h3>Fixing Secrets Sprawl Without Blocking Delivery</h3>
<p>Another problem that surfaced quickly was secrets sprawl. In the absence of a strong platform path, teams will use whatever gets them moving. Some values ended up in GitLab variables because that was quick. Some were created as Kubernetes secrets by hand. Some were copied between environments with too much manual handling. That does not usually begin as a dramatic security failure. It begins as convenience. The trouble starts later, when a value needs to be rotated, audited, or made consistent across environments and nobody is fully sure which copy is authoritative.</p>
<p>The core issue was not just where a value lived. It was that each team was being forced to invent its own model for sensitive configuration. That is exactly the kind of design failure a platform should prevent. The fix was to standardize around Azure Key Vault as the system of record and make secret consumption part of the supported path rather than a per-service improvisation.</p>
<p>That meant a service declared which secrets it needed through the agreed configuration structure, and the platform handled the delivery into the workload. Where managed identity or a cleaner Azure-native access path made sense, that was better because it removed secret distribution entirely. Where concrete values were still required, they came through the Key Vault-backed pattern rather than through manual cluster changes or scattered CI variables.</p>
<p>This paid off most obviously when a secret changed under real operating conditions. Rotation should not require every team to understand the internals of Kubernetes secret objects or to log into the cluster. It should be a controlled platform operation that leaves the application-facing contract alone.</p>
<h3>Promoting the Same Artifact Across Environments</h3>
<p>One of the quieter but more important platform problems was environment drift caused by promotion models that were not actually promotion models. If a service was effectively rebuilt, reconfigured by hand, or subtly reinterpreted at each environment boundary, then development, test, and production were not really running the same thing. At that point, debugging becomes more of a comparison exercise than an engineering one because you can never be fully sure whether a difference in behavior is caused by the application or by the path it took to get deployed.</p>
<p>The fix was to move to a build-once, promote-forward model. GitLab built the artifact, tagged it immutably, and the change moved through environments by updating Git-declared desired state rather than by rebuilding each time. ArgoCD then reconciled that state into AKS, which meant the platform could reason about deployments as versioned state instead of as a blend of pipeline history and cluster-side improvisation.</p>
<p>That made promotion easier to audit because the change was visible in Git. It made rollback less theatrical because reverting desired state is much cleaner than trying to reconstruct what somebody applied manually three days earlier. It also made environment differences easier to reason about, because the intended differences were explicit configuration or policy boundaries, not a separate deployment craft at every stage.</p>
<p>This is one of those design choices that looks procedural until you have lived without it. Once teams are rebuilding artifacts differently or treating each environment as its own hand-tuned process, the platform loses one of the things it most needs: predictability.</p>
<h2>11. Access Control Was an Enabler, Not a Restriction</h2>
<p>Limiting direct access to Azure and AKS was an intentional design choice, and it is one of the areas where platform engineering often gets misunderstood. Restricting broad portal or cluster access was not about gatekeeping. It was about designing an operating model that could scale, remain auditable, and avoid turning every engineer into an infrastructure administrator.</p>
<p>If everyone can make direct changes in the portal, apply manifests manually, or alter cluster state outside the standard workflow, you do not really have a platform. You have shared infrastructure with weak boundaries. That can feel fast in the moment, especially for experienced engineers, but the hidden cost shows up later as configuration drift, unclear ownership, inconsistent practices, and deployments that behave differently from what the repositories say should exist.</p>
<p>RBAC was used to align access with responsibilities. Application teams had the permissions they needed to use the platform, not to reconfigure its control plane. The platform team retained ownership over the foundational Azure resources, AKS configuration, and the parts of the stack where a mistake would have cross-team impact. Automation identities were also scoped carefully. GitLab runners, deployment jobs, and GitOps-related automation used the permissions required for their purpose and no more.</p>
<p>That distinction mattered in practice. Nobody needed broad Owner rights on a subscription or wide-open access to AKS just to ship an application change. Routine delivery moved through the same governed path every time, which is exactly what made it scalable.</p>
<p>This model made the overall system safer, but it also made it easier to work with. When the expected path for change is a Git-based workflow backed by ArgoCD, everyone knows where to look when something changes, who reviewed it, and how it can be rolled back. When the primary path is "someone changed something directly," every incident starts with detective work.</p>
<p>There were still situations where deeper access was needed for investigation or exceptional cases, but that was treated as the exception rather than the platform contract. A self-service model should minimize dependence on privileged access, not normalize it.</p>
<h2>12. Secrets, Networking, and the Infrastructure Teams Should Not Have to Re-Explain</h2>
<p>Some of the most valuable platform work lived in the areas nobody finds glamorous and everybody rediscovers the hard way if they are not standardized.</p>
<p>Secrets management was one of those areas. Azure Key Vault became the authoritative place for sensitive values, and the platform defined the standard path for making those values available to workloads. That avoided a common anti-pattern where every team evolves its own mix of pipeline variables, manually created Kubernetes secrets, copied configuration, and half-documented workarounds. Even when the application requirement was simple, the delivery path needed to be safe and predictable.</p>
<p>Networking was another area where raw infrastructure complexity easily leaks into developer workflows. Private networking, DNS behavior, ingress rules, and internal versus external exposure all matter a great deal in Azure and AKS, but they are poor candidates for every team to solve independently. In a private-first setup, the number of moving parts grows quickly. It is not enough for a container to be running. It has to be reachable by the right systems, through the right path, with the right name resolution and the right exposure boundary.</p>
<p>Without platform patterns, these concerns turn into repeated support requests and repeated mistakes. One service is exposed too broadly. Another is reachable only inside the cluster when it needs to be available internally across the environment. A DNS assumption works in development but not in production. An ingress change resolves one issue while introducing another. None of that is especially interesting work for application teams, and none of it should need to be solved from scratch for each repository.</p>
<p>The fix was to treat these as shared platform concerns rather than as application-by-application craft work. A service team should be able to say whether a workload is internal, externally published, or only cluster-internal, and let the platform map that intent onto the right ingress, DNS, and networking behavior. The same logic applied to identity and secret consumption. Where direct secret usage was necessary, it followed a consistent Key Vault-backed pattern. Where a service could use managed identity or another Azure-native access model, that path was preferred because it removed a whole class of secret handling entirely.</p>
<p>These were not the parts of the platform anyone liked presenting on slides, but they were the parts that consumed time week after week if they were not standardized. Good platform work solves that class of problem once.</p>
<h2>13. Observability Had to Be Part of the Platform</h2>
<p>A platform is not finished once it can deploy workloads. It also has to make those workloads legible after they start. That is why observability was part of the platform from the start rather than a separate improvement project for later.</p>
<p>Prometheus and Grafana were already in the stack, but the important step was making them part of the normal operating path for anything running on AKS. If a team deployed a new service, there needed to be a predictable place to look for health, resource pressure, and runtime signals without building a bespoke observability setup around every repository.</p>
<p>That sounds obvious, but it changes the quality of operational conversations. Without shared observability, "self-service" often means a team can deploy independently and then immediately ask the platform team what they are looking at. With shared dashboards and known signals, the first conversation starts from data instead of from instrumentation archaeology.</p>
<p>Observability also benefited from the same standardization as the rest of the platform. When services follow common deployment patterns, label conventions, namespace layout, and scrape behavior stop being incidental details and start becoming useful shared structure. That is what lets a platform team support many services without turning each one into a unique monitoring problem.</p>
<p>Application teams still needed to own service-specific telemetry where it mattered, but the baseline had to be there by default. A deployed workload should not become opaque the moment it leaves CI.</p>
<h2>14. Standardization Across Environments</h2>
<p>One of the quieter but more important results of the platform was consistency across development, test, and production. This is where OpenTofu, GitOps, and reusable deployment patterns reinforced each other.</p>
<p>The Azure and AKS foundation was provisioned through OpenTofu modules so the environment shape did not drift without anyone noticing. Networking, cluster integration, secrets handling, and the shared platform dependencies followed the same general structure across environments even when production had tighter controls and different sizing. That matters because inconsistent environments create fake confidence. Something appears to work in development, but only because development has drifted into a completely different system.</p>
<p>The application delivery model followed the same logic. The GitLab pipeline shape was consistent. The GitOps structure was consistent. ArgoCD reconciled the same style of desired state in every environment. Teams did not have to learn one model for development and another for production. The things that differed were the things that should differ: configuration, policy, approval, and scale.</p>
<p>This is where standardization earns its keep. It reduces cognitive load, but more importantly it reduces the number of places a problem can hide. When the platform shape is predictable, environment-specific issues are easier to reason about because the platform itself is not introducing accidental variation.</p>
<p>Consistency also made governance easier to apply without turning production into a foreign country. Production could be more tightly controlled than development while still following the same basic operating model.</p>
<h2>15. The Trade-Offs Were Real</h2>
<p>None of this came without trade-offs, and it is important to be honest about them because platform work becomes fragile when it is described as if there were no downsides.</p>
<p>The most obvious trade-off was flexibility versus standardization. A strongly opinionated platform makes the common path easier, but it also means some teams cannot do everything exactly the way they would choose if left alone. That is not automatically a problem. In most cases, the variation being removed is not producing business value. But the tension is real, especially with experienced engineers who are used to tailoring pipelines and runtime configuration closely.</p>
<p>There was also a trade-off between direct access and controlled workflows. Direct <code>kubectl</code> access or broad Azure permissions can feel faster for the person holding them. The problem is that this speed does not scale as an operating model. It shifts complexity into hidden state and makes the platform harder to govern and support. The GitLab-plus-ArgoCD approach was more disciplined and more repeatable, but it required accepting that convenience for a few power users could not be the main design target.</p>
<p>Another trade-off sat between abstraction and freedom. If the platform abstracts too little, developers remain buried in infrastructure concerns. If it abstracts too much, teams can feel disconnected from how their software really behaves in production. The right balance was to abstract the repetitive infrastructure mechanics while keeping application teams close to the runtime characteristics they still needed to own.</p>
<p>There was also an ongoing question about how far to take standardization. Not everything should be templated. A platform becomes brittle when it tries to turn every edge case into a first-class built-in feature. Part of the job was deciding what belonged in the golden path, what should be possible through extension points, and what should remain a deliberate exception handled with platform involvement. That boundary matters because a platform that tries to support every possible use case eventually becomes another form of complexity.</p>
<h2>16. Making the Platform Useful Without Making It Rigid</h2>
<p>The hardest part of this work was not choosing tools. It was deciding where the standard path should end and where application-specific freedom should begin.</p>
<p>There was some initial resistance, which was not surprising. Teams that have struggled with slow infrastructure processes often interpret standardization as another form of control being added around them. If the platform team is not careful, that is exactly what it becomes. The way through that is not messaging. It is making the golden path genuinely easier than the ad hoc alternatives.</p>
<p>That required iteration. Early templates are rarely correct in all the important ways. Some are too narrow and force unnatural workarounds. Others try to be so flexible that they become hard to understand and hard to maintain. A usable platform usually emerges through repeated refinement: watching where teams still get stuck, where the abstractions are leaking, which defaults are working, and which ones are generating support load instead of reducing it.</p>
<p>It also required deciding what should not be standardized. Some services are long-running APIs. Others are workers, scheduled jobs, or integration components with very different runtime expectations. Some need external exposure. Others must remain internal. Some can use a straightforward secret model. Others need more careful identity handling. If a platform treats all of those as identical, it becomes unrealistic. If it treats each one as entirely unique, it loses the benefits of being a platform. The useful middle ground is a constrained set of patterns with well-understood variation points.</p>
<p>Another practical challenge was that support load often rises before it falls. During the transition, the platform team is still supporting the old way of working while teaching and refining the new one. That is normal. It is one reason platform engineering is as much about product thinking and operating model design as it is about YAML, pipelines, and cloud services.</p>
<h2>17. Before and After the Platform</h2>
<p>The difference before and after the platform was not mainly about tool choice. It was about the operating model around those tools.</p>
<p>Before the platform, deployments were technically possible but operationally inconsistent. Teams could get services into AKS, but they often did so through slightly different pipelines, slightly different manifests, and slightly different assumptions about networking, secrets, and environment behavior. That made the platform team a bottleneck because every inconsistency eventually surfaced as a support request, a failed rollout, or a production question nobody wanted to answer for the first time under pressure.</p>
<p>After the platform, the default path became much more predictable. A service followed a standard template, deployments moved through GitLab and GitOps, ArgoCD reconciled the desired state, and observability was already part of the runtime model. Developers still owned their applications, but they no longer had to become part-time experts in Azure and Kubernetes mechanics just to make routine changes safely.</p>
<p>That is the change I care about most. The platform did not remove operational responsibility. It removed avoidable infrastructure complexity from the day-to-day path of delivering software.</p>
<h2>18. What Changed Once the Model Settled</h2>
<p>The outcome was not that infrastructure complexity disappeared. It was that the right parts of that complexity moved into the platform, where they could be solved once and reused, instead of being rediscovered by every team and every service.</p>
<p>The immediate effect was reduced dependence on the platform team for routine delivery work. Application teams could use Git-based workflows to build, deploy, and promote services through a predictable path. They did not need broad AKS access to get code running. They did not need to understand every Azure networking detail to expose a service correctly. They did not need to invent a new deployment shape for each repository.</p>
<p>That improved developer experience in a practical sense rather than a cosmetic one. Teams had fewer infrastructure decisions to make for ordinary service delivery. Deployments became more repeatable. Configuration drift was reduced. Environment behavior became more predictable. When issues did happen, teams were not starting from a blank page; the observability, deployment path, and runtime conventions were already there.</p>
<p>The platform team benefited as well, but in a more important way than simply getting fewer messages. The nature of the work shifted. Less time went into acting as a release team, a YAML debugging service, or the final escalation point for every ingress or secret issue. More time could be spent improving shared capabilities, refining templates, hardening workflows, and thinking ahead about where the platform needed to evolve as more services were added.</p>
<p>That is the scaling effect that matters. A platform should improve not only the speed of one deployment, but the sustainability of the operating model as the number of teams, services, and environments grows.</p>
<h2>19. Why I See This as Platform Engineering</h2>
<p>This experience changed how I think about the line between DevOps work and platform engineering. Infrastructure automation was part of the job, but it was not the part that mattered most. The more significant work was deciding how other engineers should experience that infrastructure and which trade-offs should be encoded into the default path.</p>
<p>Provisioning Azure with OpenTofu, running AKS, wiring GitLab CI/CD, installing ArgoCD, and operating Prometheus and Grafana are all useful capabilities. They become platform engineering when they are assembled into a system other engineers can rely on without needing to understand every internal detail. That means choosing defaults, defining boundaries, deciding where flexibility is worth the cost, and being deliberate about which problems the platform absorbs so application teams do not have to.</p>
<p>The important result was not that the environment used a modern stack. It was that developers had less irrelevant infrastructure to think about while governance, security, and consistency improved instead of being negotiated away. At that point, the job stops feeling like "running Kubernetes" and starts feeling much closer to product design for engineers.</p>
<p>This experience also changed how I think about DevOps itself. The hard part is rarely building infrastructure. The hard part is building systems other engineers can depend on without first having to reverse-engineer them.</p>
<p>If I were taking this further, I would invest even more in service onboarding, platform documentation, and eventually a stronger internal developer portal on top of the existing workflows. But the lesson I would keep is straightforward. A platform is successful when developers can use it well without needing to understand how it is implemented. The measure of success is how much irrelevant infrastructure complexity stays out of their way.</p>
]]></content:encoded></item><item><title><![CDATA[GitOps in Production]]></title><description><![CDATA[1. GitOps Looked Like the Right Answer
We had a problem that pipelines could not solve.
Deployments could be technically successful without making the environment understandable. A pipeline could buil]]></description><link>https://blog.ammarplatform.com/gitops-in-production</link><guid isPermaLink="true">https://blog.ammarplatform.com/gitops-in-production</guid><category><![CDATA[gitops]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Wed, 19 Nov 2025 09:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63837243df107a0ef5751e3b/96cdb524-be2a-433b-905d-caa5bfc7a04e.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. GitOps Looked Like the Right Answer</h2>
<p>We had a problem that pipelines could not solve.</p>
<p>Deployments could be technically successful without making the environment understandable. A pipeline could build an image, run tests, and even deploy cleanly while leaving the most important question unanswered: what is actually running in the cluster right now, and how did it get there?</p>
<p>By the time GitOps became a serious topic, the platform already had most of the pieces covered in the earlier posts. The Azure foundation existed. The AKS platform model existed. Private networking and controlled access were in place. The separation between platform control planes and workload clusters had already been established. GitLab CI/CD already handled builds and a lot of the workflow logic around application delivery. The next problem was not how to push code into a cluster. It was how to make deployments understandable, repeatable, and auditable as the environment grew.</p>
<p>Before GitOps, that ambiguity showed up in familiar ways. Git suggested one thing. The cluster sometimes contained another. Manual changes accumulated because they were convenient in the moment. A quick fix applied with <code>kubectl</code> solved a problem now and created uncertainty later. When something failed, debugging often started with reconstructing state rather than addressing the actual issue.</p>
<p>GitOps was appealing because it offered a cleaner answer. Put desired state in Git. Let ArgoCD reconcile the cluster toward that state. Stop treating the cluster as the place where truth lives by accident. That promise was strong enough to be worth pursuing.</p>
<p>What mattered later was realizing that GitOps is not automatically good just because the words sound disciplined. Installing ArgoCD is easy. Designing a deployment system around it is where the real work starts.</p>
<h2>2. ArgoCD Was Not the Design</h2>
<p>One of the first lessons was that saying "we use ArgoCD" does not actually explain much.</p>
<p>It does not tell you where environment-specific configuration lives. It does not tell you how changes move from development into production. It does not tell you whether CI still owns part of the deployment process, whether images are promoted or rebuilt, whether teams touch one repository or several, or how production changes are controlled. It certainly does not tell you whether developers find the system understandable.</p>
<p>ArgoCD is a reconciler. That is useful, but it is not a deployment model on its own.</p>
<p>This was one of the reasons GitOps had to be treated as an operating decision rather than as a tooling milestone. Most of the complexity was not inside ArgoCD. It sat around it: repository structure, promotion paths, CI boundaries, ownership boundaries, how much indirection teams had to tolerate, and what the actual source of deployment truth was supposed to be at each stage.</p>
<p>That is also why GitOps discussions often become strangely unhelpful in practice. People talk about purity before they have settled the operating model. They argue about whether a flow is "real GitOps" before they can answer much simpler questions, like whether the team understands where to make a change or whether production state is easier to reason about than it was before.</p>
<p>I cared a lot less about purity than about clarity.</p>
<p>The platform was not being built from scratch around ArgoCD. GitLab CI/CD already existed and was already doing useful work. It built images, ran tests, handled sequencing, and enforced checks the teams depended on. Replacing that whole layer just to make the architecture look cleaner would have been a mistake.</p>
<p>So the real question was not whether ArgoCD would replace CI. It was where CI should stop and where GitOps should start.</p>
<p>That boundary turned out to be the most important design decision in the whole GitOps model. If CI owns too much, then GitOps becomes a thin decorative layer and the cluster can still drift from what Git suggests should exist. If GitOps is asked to do too much, teams start forcing workflow logic, sequencing, and build concerns into a tool that was not designed for them. Neither extreme is good.</p>
<p>The hybrid nature of the environment made this a practical decision rather than a philosophical one. Good platform design has to meet the system where it actually is.</p>
<h2>3. What GitOps Solved Immediately</h2>
<p>Even with those constraints, GitOps solved several problems quickly.</p>
<p>The biggest one was drift.</p>
<p>Without GitOps, the cluster has a habit of becoming the real source of truth even when nobody intends that. A manual fix is applied under pressure. A pipeline updates something indirectly. A configuration change lands in one environment and not another. Over time, the repositories stop being dependable representations of runtime state. At that point, the operational cost is not just technical. It is cognitive. Engineers stop trusting what they read, and every issue begins with checking whether the environment is really what it claims to be.</p>
<p>GitOps improved that immediately because it made declared state matter again. ArgoCD continuously compared what the cluster was running with what Git said it should be running. That did not eliminate every source of complexity, but it did make silent drift much harder to ignore.</p>
<p>One recurring pattern made that concrete. A production issue would be mitigated with a direct <code>kubectl</code> change because speed mattered more than elegance in the moment. The service would recover, but the fix lived only in the cluster. Git still described the old state, so the next deployment change could quietly overwrite the fix and put everyone back into the same confusion. With ArgoCD in place, that mismatch stopped being invisible. The application was visibly out of sync, which forced a more honest decision: either commit the intended state to Git or accept that reconciliation would move the cluster back.</p>
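<p>A concrete way to picture that loop, using the kind of naming that appears later in this post (the service name and replica counts here are illustrative, not a real incident record):</p>
<pre><code class="language-text">Incident: a direct kubectl edit raises replicas on payments-api
  -&gt; Cluster now runs 8 replicas; Git still declares 6
  -&gt; ArgoCD marks the application OutOfSync
  -&gt; Team decides: commit the new state to Git, or let reconciliation revert it
  -&gt; Either way, the divergence is visible instead of silent
</code></pre>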
<p>That changed behavior more effectively than policy language ever did. Hidden state became harder to normalize.</p>
<p>It also improved visibility. Git became a more meaningful place to understand deployment intent. That alone is a significant improvement in a multi-service environment where ad hoc operational knowledge does not scale.</p>
<p>Another benefit was consistency. Once the model settled, deployments followed a more repeatable path: change the right source in Git, let ArgoCD see it, and let the cluster reconcile toward that state. That is much easier to reason about than a mixture of direct deploy steps, pipeline-side mutation, and cluster-side exceptions.</p>
<p>GitOps did not remove the need for good operational judgment. It removed a class of hidden state that had been making that judgment harder.</p>
<h2>4. Why Production Needed a Different GitOps Model From Non-Production</h2>
<p>One of the important decisions was not to treat all environments the same.</p>
<p>Non-production exists to allow iteration. Production exists to carry consequences. That difference needs to appear not only in RBAC and policy, but in how the deployment model itself behaves.</p>
<p>A naive GitOps setup can accidentally weaken that distinction. If any valid change to Git can propagate quickly and automatically everywhere, then the system is clean in theory and too casual in practice. Production should not feel like a faster version of non-production with more anxious people around it.</p>
<p>The platform already had separate cluster-management boundaries between production and non-production. GitOps needed to reinforce that. Production changes needed clearer ownership, a more deliberate promotion path, and less room for accidental propagation from lower environments. Non-production could remain more flexible because that is where experimentation and iteration belonged.</p>
<p>This mattered more than it first seemed to. GitOps often gets described as if it makes environment promotion obvious. It does not. It makes it possible to model promotion clearly, which is a different thing. Whether you actually do that depends on repository design, ownership, promotion rules, and how many places a team has to touch to move a change forward.</p>
<p>The useful question was not "Are we using GitOps everywhere?" It was "Does production state move through a path that is more disciplined than before?"</p>
<h2>5. The Hybrid Model: GitLab for Workflow, ArgoCD for State</h2>
<p>The model that ended up working was not pure GitOps. It was a hybrid, and that was the right answer for this environment.</p>
<p>GitLab CI/CD remained responsible for building images, running tests, enforcing checks, and handling workflow logic. ArgoCD remained responsible for cluster reconciliation and state alignment. That split was not a compromise born of weakness. It was a recognition that CI systems and GitOps controllers are good at different things.</p>
<p>A simplified version of the flow looked like this:</p>
<pre><code class="language-text">Application code change
  -&gt; GitLab CI/CD builds, tests, and publishes an image
  -&gt; Desired deployment state is updated in Git for the target environment
  -&gt; ArgoCD detects the Git change
  -&gt; Target cluster reconciles toward that state
  -&gt; Production promotion happens through a separate, deliberate Git change
</code></pre>
<p>The repository split mattered just as much as the controller split. In practice, the shape was deliberately boring:</p>
<pre><code class="language-text">payments-api/
  .gitlab-ci.yml
  Dockerfile
  deploy/
    chart/
      Chart.yaml
      values.yaml
      templates/
  src/
  tests/

platform-gitops/
  applicationsets/
    workloads.yaml
  environments/
    nonprod/
      azure/
        westeurope-01/
          payments-api/
            values.yaml
      aws/
        eu-central-1-01/
          payments-api/
            values.yaml
      gcp/
        europe-west4-01/
          payments-api/
            values.yaml
      oci/
        eu-frankfurt-1-01/
          payments-api/
            values.yaml
    prod/
      azure/
        westeurope-01/
          payments-api/
            values.yaml
      aws/
        eu-central-1-01/
          payments-api/
            values.yaml
      gcp/
        europe-west4-01/
          payments-api/
            values.yaml
      oci/
        eu-frankfurt-1-01/
          payments-api/
            values.yaml
</code></pre>
<p>The exact names were different, but the pattern was the important part. Application repositories owned code, tests, images, and deployable charts. The GitOps repository owned environment and cluster-specific overrides, promotion, and the ArgoCD definitions that connected those things together. The cluster destinations themselves were registered in ArgoCD separately; the GitOps repo was mapping workloads onto already-known targets, not inventing cluster inventory from scratch. In this kind of model, the same service chart could target AKS, EKS, GKE, and OKE while keeping the per-cloud differences mostly inside environment values.</p>
<p>Promotion was also not a single fan-out change to all four clouds at once. The same image digest usually moved through a smaller rollout ring first, then into broader production targets through separate Git changes, which kept validation and blast radius much easier to reason about.</p>
<p>At that scale, the Git-owned deployment state usually looked less like custom control-plane YAML and more like ordinary environment values that CI updated through a merge request when a release was promoted. The file below is one concrete example for Azure production; the same pattern existed for AWS, GCP, and OCI with cloud-specific differences kept in their own environment paths rather than hidden inside CI logic:</p>
<pre><code class="language-yaml"># platform-gitops/environments/prod/azure/westeurope-01/payments-api/values.yaml
global:
  environment: prod
  cloud: azure
  region: westeurope
  cluster: azure-westeurope-01

image:
  repository: registry.xxxcompany.com/commerce/payments-api
  digest: sha256:2e8f1a317b4f6dc5c53fd3a5f0a9f9d6f73be3dc11d2a6b5bb48d03e8a0ab912

autoscaling:
  enabled: true
  minReplicas: 6
  maxReplicas: 18
  targetCPUUtilizationPercentage: 70

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

ingress:
  enabled: true
  className: istio-internal
  hosts:
    - host: payments.prod.eu.xxxcompany.com
      paths:
        - path: /
          pathType: Prefix

externalSecrets:
  enabled: true
  secretStoreRef:
    name: prod-cluster-secrets

podDisruptionBudget:
  minAvailable: 4
</code></pre>
<p>Equivalent files existed under <code>environments/prod/aws/...</code>, <code>environments/prod/gcp/...</code>, and <code>environments/prod/oci/...</code>, differing only in their cloud-specific values.</p>
<p>The important part was not that every cloud used identical values. It was that the same release model could be promoted across Azure, AWS, GCP, and OCI while keeping cloud-specific differences explicit and reviewable in Git.</p>
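<p>The promotion step itself was deliberately unexciting: a CI job edited the environment file and turned the change into a merge request. A minimal sketch of what such a job can look like, assuming yq v4, GitLab push options, and a CI token with write access to the GitOps repository (the job name, variable names, and branch naming here are illustrative, not the exact pipeline):</p>
<pre><code class="language-yaml">promote-to-prod-azure:
  stage: promote
  when: manual
  script:
    # Clone the GitOps repo with a token that is allowed to push branches.
    - git clone "https://ci:${GITOPS_TOKEN}@git.xxxcompany.com/platform/platform-gitops.git"
    - cd platform-gitops
    - git checkout -b "promote-payments-api-${CI_COMMIT_SHORT_SHA}"
    # Pin the already-built image digest (IMAGE_DIGEST is assumed to be
    # exported by an earlier build stage) in the target environment file.
    - yq -i '.image.digest = strenv(IMAGE_DIGEST)' environments/prod/azure/westeurope-01/payments-api/values.yaml
    - git commit -am "Promote payments-api to prod azure"
    # GitLab push options turn the pushed branch into a merge request for review.
    - git push origin HEAD -o merge_request.create -o merge_request.target=main
</code></pre>
<p>The exact commands matter less than the shape: promotion stays a reviewable Git change rather than a side effect buried in pipeline logic.</p>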
<p>The ApplicationSet layer then generated ArgoCD applications from the environment directories instead of asking teams to handcraft per-cluster <code>Application</code> objects:</p>
<pre><code class="language-yaml">apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: workloads
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions:
    - missingkey=error
  generators:
    - git:
        repoURL: https://git.xxxcompany.com/platform/platform-gitops.git
        revision: main
        directories:
          - path: environments/*/*/*/*
  template:
    metadata:
      name: '{{ index .path.segments 4 }}-{{ index .path.segments 1 }}-{{ index .path.segments 2 }}-{{ index .path.segments 3 }}'
      labels:
        service: '{{ index .path.segments 4 }}'
        environment: '{{ index .path.segments 1 }}'
        cloud: '{{ index .path.segments 2 }}'
        cluster: '{{ index .path.segments 3 }}'
    spec:
      project: '{{ index .path.segments 1 }}'
      destination:
        name: '{{ printf "%s-%s" (index .path.segments 2) (index .path.segments 3) }}'
        namespace: '{{ index .path.segments 4 }}'
      sources:
        - repoURL: 'https://git.xxxcompany.com/apps/{{ index .path.segments 4 }}.git'
          targetRevision: main
          path: deploy/chart
          helm:
            releaseName: '{{ index .path.segments 4 }}'
            valueFiles:
              - $values/{{ .path.path }}/values.yaml
        - repoURL: https://git.xxxcompany.com/platform/platform-gitops.git
          targetRevision: main
          ref: values
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
          - ApplyOutOfSyncOnly=true
      revisionHistoryLimit: 3
</code></pre>
<p>That is where the phrase "source of truth" becomes useful only if it is precise. Git was the source of desired cluster state. CI was still part of the system that produced deployable artifacts and, in some cases, the changes that moved state forward. That is not a contradiction, but it does mean the system has to be explained honestly.</p>
<p>The hybrid model worked because it made those responsibilities legible instead of pretending everything had become magically simple just because ArgoCD was present.</p>
<h2>6. ApplicationSets Helped, but They Raised the Cost of Understanding</h2>
<p>As the number of services grew, ApplicationSets became valuable. Managing every ArgoCD application individually does not scale well in a larger microservices environment. Once patterns begin to repeat, a templated way to generate those application definitions becomes useful very quickly.</p>
<p>ApplicationSets helped with onboarding, consistency, and reducing repetitive configuration. That was the upside. The downside was abstraction.</p>
<p>Every new abstraction makes a system more scalable for the people maintaining it and potentially less obvious for the people using it. Once ApplicationSets entered the model, understanding deployments no longer meant understanding only Kubernetes and ArgoCD. It also meant understanding the structure that generated ArgoCD objects in the first place.</p>
<p>That cost showed up most clearly in developer experience. GitOps improved system integrity faster than it improved day-to-day clarity for teams. A developer may now have to understand which repository holds application code, which repository or path holds deployment state, what CI changes automatically, what requires a Git change, and when ArgoCD will actually reconcile the cluster. None of that is impossible. But it is more to hold than "push code and watch a deploy happen."</p>
<p>This is why I do not think GitOps should be discussed only as a control or compliance improvement. The real test is whether teams can use it without losing too much situational clarity. A good platform notices that tension early instead of celebrating the abstraction and leaving teams to deal with the confusion.</p>
<h2>7. What GitOps Helped Expose</h2>
<p>One of the useful things GitOps did was force ambiguity into the open.</p>
<p>For example, when a service appeared healthy in CI but was not behaving as expected in the cluster, GitOps made it easier to ask the right question: is the declared state wrong, or is the cluster not matching the declared state? That is a much better starting point than trying to reconstruct who applied what by hand.</p>
<p>It also exposed problems in promotion logic. If moving from one environment to another involved too many hidden transformations or too many repositories touched in inconsistent ways, GitOps did not hide that. It made it obvious that the promotion model itself needed work.</p>
<p>Another recurring issue was ownership confusion. If a change involved both CI behavior and GitOps state changes, teams naturally asked which system truly owned deployment intent. That was not a flaw in GitOps so much as a signal that the boundary between workflow and state needed to be explained and, in some cases, simplified.</p>
<p>GitOps also made indirect complexity visible. Multiple repositories, pipeline-generated changes, and ApplicationSet indirection all become more noticeable once the system starts depending on Git as the place where operational truth is supposed to live. That can be uncomfortable, but it is useful. A platform cannot simplify what it refuses to see.</p>
<h2>8. What I Changed, and What Actually Matters</h2>
<p>The improvements I cared about most were all about clarity.</p>
<p>The first was clearer boundaries between CI and GitOps. CI should own build, validation, and artifact creation. GitOps should own declared deployment state and reconciliation. Once that line gets blurry, the system becomes harder to debug and harder to explain.</p>
<p>The second was a simpler promotion model. Promotion should be explicit, visible in Git, and understandable without cross-referencing too many systems. If moving a change from development toward production feels like chasing state through a maze, the model is too indirect.</p>
<p>The third was reducing unnecessary indirection. More repositories, more layers of generation, and more transformation steps all increase cognitive load. Some indirection is worth it. Too much turns clarity into ceremony.</p>
<p>I would also invest earlier in the developer-facing interface. Teams should not need to understand ArgoCD internals, ApplicationSet behavior, or the entire GitOps control plane to make routine changes safely. That is exactly the sort of complexity a platform should absorb.</p>
<p>None of those changes replace GitOps. They make it easier to live with.</p>
<p>That is also why I am skeptical of GitOps writing that treats the whole topic as a purity contest. The real trade-offs are simpler and harder: more consistency, more auditability, and less drift, but also more abstraction, more indirection, and potentially worse developer experience if the platform does not provide a better interface on top.</p>
<p>In real environments, the question is not whether the system is close to theory. The question is whether it is clearer, safer, and more sustainable than what it replaced.</p>
<h2>9. What This Taught Me About GitOps</h2>
<p>The most important lesson was that GitOps is not really a tool choice. It is an operating model.</p>
<p>Installing ArgoCD is easy. Designing a deployment system where state is trustworthy, promotion is understandable, responsibilities are clear, and teams can still work effectively is much harder. That is the part that determines whether GitOps reduces ambiguity or merely reorganizes it.</p>
<p>That may be the most practical summary I can give. ArgoCD mattered because reconciliation mattered. But the real success or failure had much more to do with the design around it than with the tool itself.</p>
<p>This also reinforced a broader point from the rest of the series. Platform engineering is rarely about choosing the right tool in isolation. It is about deciding how the system around that tool should work so other engineers can rely on it without having to reverse-engineer it first.</p>
<p>The landing zone work established the boundaries. The AKS and networking work made private Kubernetes operational. The platform design work made Kubernetes usable for application teams. GitOps was the next layer in that same progression. It answered a different question: once the platform exists, how do you make deployment state disciplined enough to trust?</p>
<p>That is why I do not see GitOps as a separate specialty topic. In this environment, it was part of the same platform story. It sat directly on top of the subscription model, the private AKS networking model, the cluster-separation model, and the broader goal of reducing unnecessary infrastructure complexity for application teams.</p>
<p>The result I cared about was not "we use ArgoCD." It was that the deployment model became more understandable, more auditable, and less dependent on hidden cluster state than it had been before.</p>
<p>That is the version of GitOps that is worth talking about in production.</p>
]]></content:encoded></item><item><title><![CDATA[Designing Private AKS Access: VPN, DNS, and Hub-Spoke Networking]]></title><description><![CDATA[1. Private AKS Was a Security Decision With Operational Consequences
In the previous posts, I wrote about the Azure landing zone and the platform model that sat on top of it. This is the part where th]]></description><link>https://blog.ammarplatform.com/designing-private-aks-access-vpn-dns-and-hub-spoke-networking</link><guid isPermaLink="true">https://blog.ammarplatform.com/designing-private-aks-access-vpn-dns-and-hub-spoke-networking</guid><category><![CDATA[AzureNetworking ]]></category><category><![CDATA[aks]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Tue, 28 Oct 2025 08:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63837243df107a0ef5751e3b/c6106e0c-bc85-466d-9696-8de69b71d5dc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>1. Private AKS Was a Security Decision With Operational Consequences</h2>
<p>In the previous posts, I wrote about the Azure landing zone and the platform model that sat on top of it. This is the part where those decisions became real. Private AKS networking was one of the most demanding pieces of the platform, not because AKS itself was especially difficult to provision, but because the moment you decide the control plane should not be publicly reachable, you stop having a simple cluster problem and start having a network design problem.</p>
<p>That decision was not made for aesthetics. The broader Azure environment was being designed around controlled access, private networking, and clear security boundaries between workloads, shared services, and administrative paths. A public Kubernetes API endpoint would have cut across that model. Even if the cluster was still protected with RBAC and identity controls, the operating assumption would have been very different: the control plane would be on the internet, and the main question would be who can authenticate to it. In our case, the stronger requirement was that the control plane should not be publicly exposed in the first place.</p>
<p>That sounds straightforward when written in an architecture document. In practice, it changes almost everything around the cluster. The API server still has to be reachable by the people and systems that operate it. DNS still has to resolve. Routes still have to exist. Peering still has to carry traffic where it needs to go. Operators still need a workable access path from outside Azure. On-premises systems, VPN-connected engineers, shared hub services, and workload spokes all become part of the story.</p>
<p>This is the part of cloud networking that tends to get flattened into diagrams. A private AKS cluster is often described as if it were the same cluster with public access toggled off. It is not. The day you remove the public endpoint, you inherit responsibility for every path by which that cluster will ever be reached.</p>
<p>That is why I see this area as such a strong differentiator in platform work. Plenty of engineers can create an AKS cluster. Far fewer are comfortable owning what happens when the control plane is private, DNS spans multiple networks, engineers connect over VPN, and the cluster becomes unreachable even though Azure insists everything is healthy.</p>
<h2>2. What Private AKS Actually Changes</h2>
<p>With a public cluster, the access story is comparatively simple. The Kubernetes API is reachable over a public endpoint, and the main controls sit at the identity and authorization layers. You still need to think about RBAC, API exposure rules, IP restrictions if you use them, and how cluster credentials are handled, but the network path itself is not usually the main source of friction. A laptop on the internet can reach the endpoint if policy allows it.</p>
<p>Private AKS changes that shape completely. The cluster API becomes reachable only over private network paths. That means cluster administration is no longer just a Kubernetes concern. It is tied to VNet design, peering, route propagation, DNS resolution, VPN access, and hybrid connectivity if the wider environment includes on-premises systems.</p>
<p>This is where people often underestimate the work. The cluster can be provisioned successfully and still be functionally inaccessible from the places that matter. The node pool can come up. The control plane can be healthy. The Azure resource can show no visible faults. And yet <code>kubectl</code> still fails because the private FQDN does not resolve on the engineer's machine, the route from the VPN client never reaches the spoke, or the name resolves to the right address but the traffic has no valid path back.</p>
<p>That distinction became important very quickly. The cluster was not the problem. The path to the cluster was the problem.</p>
<p>A simplified version of the AKS side looked roughly like this:</p>
<pre><code class="language-hcl">resource "azurerm_kubernetes_cluster" "workload" {
  name                = "aks-workload-nonprod"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = "aks-workload-nonprod"

  private_cluster_enabled = true
  private_dns_zone_id     = azurerm_private_dns_zone.aks_api.id

  default_node_pool {
    name           = "system"
    vm_size        = "Standard_D4s_v5"
    node_count     = 3
    vnet_subnet_id = azurerm_subnet.aks_nodes.id
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.aks.id]
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "azure"
  }
}
</code></pre>
<p>The important part was not the exact resource block. It was that the cluster was deliberately placed into a spoke subnet, the control plane was made private from the start, and the cluster was tied to a custom private DNS zone instead of treating name resolution as something to sort out later. In a real deployment, that identity also needs the right permissions on the private DNS zone and network resources.</p>
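<p>Concretely, that usually means a pair of role assignments alongside the cluster, sketched here with the same illustrative resource names as above (the exact scopes depend on how zones and networks are split across resource groups):</p>
<pre><code class="language-hcl"># The cluster identity must be able to write API records into the
# custom private DNS zone it was handed.
resource "azurerm_role_assignment" "aks_dns" {
  scope                = azurerm_private_dns_zone.aks_api.id
  role_definition_name = "Private DNS Zone Contributor"
  principal_id         = azurerm_user_assigned_identity.aks.principal_id
}

# It also needs network rights on the subnet the nodes are placed into.
resource "azurerm_role_assignment" "aks_network" {
  scope                = azurerm_subnet.aks_nodes.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_user_assigned_identity.aks.principal_id
}
</code></pre>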
<p>Private AKS also forces you to separate two conversations that are often mixed together. One is control-plane access: how administrators, automation, or debugging tools reach the Kubernetes API. The other is application exposure: how workloads running inside the cluster are reached by other services, users, or external systems. Keeping the control plane private does not automatically mean every application endpoint is private. Those are separate decisions with separate security and networking implications. Treating them as one problem is a reliable way to make the architecture harder to reason about.</p>
<p>What private AKS really did was expose the quality of the surrounding network design. If hub-and-spoke, VPN, DNS, and address planning are sound, private clusters fit naturally into that model. If those pieces are vague, private AKS makes the vagueness impossible to ignore.</p>
<h2>3. The Architecture I Built Around It</h2>
<p>The network model was based on hub-and-spoke, but not inside a single subscription. The hub VNet lived in a dedicated connectivity or platform subscription and acted as the central place for shared network services. That included the VPN entry point, the core routing patterns, and the DNS components that needed to be shared across environments. Workload VNets lived in spokes in separate subscriptions, typically split by environment and ownership so that development, staging, production, and shared platform domains could remain isolated in ways that matched governance and responsibility.</p>
<p>Private AKS clusters sat in workload spokes rather than in the hub. That part was intentional. The hub was there to provide common network capabilities and controlled connectivity, not to become the place where application runtimes accumulated. Each cluster belonged with the workload environment it served. That kept blast radius and ownership cleaner, and it aligned better with the rest of the Azure operating model.</p>
<p>Those spokes were peered back to the hub. The peering design was not there only for east-west traffic between VNets. It was also what made centralized access and name resolution workable across subscriptions. The goal was simple: connect once to the point-to-site VPN in the platform subscription, then reach any private AKS control plane that lived in the peered spoke subscriptions without maintaining separate VPN entry points per environment. Once the VPN gateway, DNS infrastructure, and shared controls are centralized, the spokes need a dependable way to use them without re-creating the same components in every environment.</p>
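<p>Most of that behavior comes down to a few flags on the peering itself. A sketch of the two sides, assuming illustrative hub and spoke resource names (real environments also involve route tables and firewall rules that this fragment does not show):</p>
<pre><code class="language-hcl"># Hub side: offer the hub's VPN gateway to peered spokes.
resource "azurerm_virtual_network_peering" "hub_to_spoke" {
  name                      = "hub-to-workload-spoke"
  resource_group_name       = azurerm_resource_group.hub.name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke.id
  allow_gateway_transit     = true
  allow_forwarded_traffic   = true
}

# Spoke side: use the hub gateway instead of owning one per environment.
resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                      = "workload-spoke-to-hub"
  resource_group_name       = azurerm_resource_group.spoke.name
  virtual_network_name      = azurerm_virtual_network.spoke.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  use_remote_gateways       = true
  allow_forwarded_traffic   = true
}
</code></pre>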
<p>The access story then became layered. Routine deployments did not rely on engineers opening direct cluster sessions from their laptops. GitLab CI/CD and GitOps already handled the normal path for application delivery. Direct access was mainly for cluster administration, deeper troubleshooting, and those moments where platform engineers need to inspect the runtime directly rather than infer it from pipelines and dashboards.</p>
<p>DNS became part of the architecture rather than an afterthought. Private DNS zones and a centralized DNS strategy were necessary to make the AKS private API FQDN resolvable from the right places. In a more complex environment, that also pushed us toward a hub-centered resolver pattern rather than leaving every spoke or connected client to solve name resolution independently.</p>
<p>One useful lesson here is that hub-and-spoke diagrams are often too neat. The real architecture is not just hub, spoke, and lines between them. It is the sum of what those lines carry: route propagation, gateway usage, DNS queries, private endpoint access, and administrative traffic. Private AKS forces you to care about all of that.</p>
<h2>4. Why I Chose VPN Access Over Public Endpoints or a Jumpbox</h2>
<p>Once the control plane is private, the next question is how humans actually reach it. There are only a few realistic options. You can expose the API publicly after all, which defeats the design goal. You can force administrative access through a jumpbox or bastion-style machine inside Azure. Or you can give approved operators a private network path from their own workstation into the environment.</p>
<p>Public API access was the easiest option technically and the wrong one architecturally. It would have created a clean short-term answer by weakening the very control we were trying to introduce. That was not a serious option for this environment.</p>
<p>Using a jumpbox was more realistic and is a pattern many teams fall back to. It has some advantages. The machine sits inside the network, the tooling can be controlled centrally, and the cluster can remain private. But it also creates its own problems. The operational experience gets worse quickly when every non-routine debugging step has to happen through a shared remote host. Tooling drifts. Session state accumulates. File handling becomes awkward. DNS testing becomes less honest because you are no longer seeing what the operator machine sees. And in practice, jumpboxes tend to become semi-permanent shortcuts for work that should have clearer access patterns.</p>
<p>I preferred VPN for the primary administrative path. The model we used centered on an Azure VPN Gateway in the hub, with point-to-site connectivity through the Azure VPN Client for the engineers who genuinely needed cluster-level access. In practical terms, that meant an approved operator could establish a private path from their workstation into the hub-and-spoke environment and work against the cluster as if they were inside the network, while still keeping the API private.</p>
<p>The choice of a VpnGw2 tier was not about chasing a premium SKU for its own sake. It was a pragmatic middle ground for a platform that needed to support real operator access, hybrid connectivity considerations, and enough headroom that the gateway itself would not immediately become the next bottleneck. Networking decisions should leave some room for growth. If a design works only at the exact moment it is drawn, it usually does not work.</p>
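<p>The gateway side of that model can be sketched in Terraform roughly like this. To be clear, the names, address ranges, and the choice of OpenVPN below are illustrative placeholders rather than the original configuration, and the Azure AD client authentication settings for the Azure VPN Client are omitted for brevity:</p>
<pre><code class="language-hcl"># Illustrative sketch only: resource names and address ranges are placeholders.
resource "azurerm_subnet" "gateway" {
  name                 = "GatewaySubnet" # Azure requires this exact subnet name
  resource_group_name  = azurerm_resource_group.hub.name
  virtual_network_name = azurerm_virtual_network.hub.name
  address_prefixes     = ["10.10.255.0/27"]
}

resource "azurerm_public_ip" "vpn" {
  name                = "pip-vpn-gateway"
  resource_group_name = azurerm_resource_group.hub.name
  location            = azurerm_resource_group.hub.location
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_virtual_network_gateway" "hub" {
  name                = "vgw-hub"
  resource_group_name = azurerm_resource_group.hub.name
  location            = azurerm_resource_group.hub.location

  type = "Vpn"
  sku  = "VpnGw2"

  ip_configuration {
    public_ip_address_id = azurerm_public_ip.vpn.id
    subnet_id            = azurerm_subnet.gateway.id
  }

  vpn_client_configuration {
    # The client pool must not overlap any hub, spoke, or on-premises range.
    address_space        = ["172.16.100.0/24"]
    vpn_client_protocols = ["OpenVPN"]
  }
}
</code></pre>
<p>The client address pool is easy to treat as an afterthought, but it participates in the same address plan as everything else. An overlap there produces exactly the kind of "connected but unreachable" sessions described later.</p>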
<p>In Terraform, the hub-and-spoke access path depended less on the AKS resource itself and more on getting peering behavior right. In the real environment, those peerings often crossed subscription boundaries even though the example below keeps the code simplified:</p>
<pre><code class="language-hcl">resource "azurerm_virtual_network_peering" "hub_to_spoke" {
  name                      = "hub-to-aks-spoke"
  resource_group_name       = azurerm_resource_group.hub.name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.aks_spoke.id

  allow_virtual_network_access = true
  allow_forwarded_traffic      = true
  allow_gateway_transit        = true
}

resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                      = "aks-spoke-to-hub"
  resource_group_name       = azurerm_resource_group.spoke.name
  virtual_network_name      = azurerm_virtual_network.aks_spoke.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id

  allow_virtual_network_access = true
  allow_forwarded_traffic      = true
  use_remote_gateways          = true
}
</code></pre>
<p>That was one of the easiest places to make the environment look connected while still breaking the real operator path. A successful VPN session did not help much if the spoke could not actually use the hub gateway the way the design assumed. The whole point of the model was to let operators use one centralized VPN in the hub subscription and still reach clusters in multiple spoke subscriptions.</p>
<p>What mattered just as much as the gateway choice was the operating model around it. Not everyone needed VPN-based cluster access. In fact, most people should not have it. Normal deployments still moved through GitLab and ArgoCD. VPN access existed for the people responsible for platform operations, deeper debugging, and controlled administrative work. That distinction kept the access story cleaner and aligned with the broader principle of self-service for routine changes and tighter access for control-plane operations.</p>
<p>The client profile mattered too. A point-to-site tunnel is only half the answer if the connected machine is still asking the wrong DNS servers or lacks the routes for the spoke address space. One of the recurring lessons in this work was that "connected to VPN" and "able to operate the cluster" are not the same thing.</p>
<p>There was still room for a jumpbox as a break-glass or comparison tool, especially when isolating whether a problem was on the operator machine, the VPN path, or inside Azure itself. But it was not the primary interface. If private networking only works reliably from a manually maintained jump host, then the access model has not really been solved.</p>
<h2>5. DNS Is Where Private AKS Stops Being Simple</h2>
<p>DNS was the part that separated a private AKS diagram from a working private AKS platform.</p>
<p>On paper, the AKS control plane has a private FQDN and a private endpoint. In conversation, that often gets compressed into "the cluster is private." What matters operationally is that the right machines, in the right networks, need to resolve that FQDN to the right private address every time. If they cannot, the cluster may as well not exist for them.</p>
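<p>On the Terraform side, making the control plane private is only a few attributes. The sketch below is illustrative rather than the original module: most cluster settings are omitted, and the identity and zone references are assumed names. The detail worth noticing is that pointing the cluster at a shared private DNS zone, instead of a cluster-managed one, requires a user-assigned identity with rights on that zone:</p>
<pre><code class="language-hcl"># Simplified sketch; only the private-API-relevant settings are shown.
resource "azurerm_kubernetes_cluster" "workload" {
  name                       = "aks-workload"
  resource_group_name        = azurerm_resource_group.spoke.name
  location                   = azurerm_resource_group.spoke.location
  dns_prefix_private_cluster = "aks-workload"

  # The API server gets a private endpoint and a private FQDN;
  # nothing on the public internet can reach the control plane.
  private_cluster_enabled = true

  # Registering the API record in a shared zone is what lets the
  # hub own name resolution instead of the cluster.
  private_dns_zone_id = azurerm_private_dns_zone.aks_api.id

  default_node_pool {
    name           = "system"
    node_count     = 3
    vm_size        = "Standard_D4s_v5"
    vnet_subnet_id = azurerm_subnet.aks_nodes.id
  }

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.aks.id] # needs rights on the zone
  }
}
</code></pre>
<p>None of that guarantees the FQDN resolves from anywhere in particular. The record only exists in whichever zone the cluster registers into, and reaching that record is a separate design problem.</p>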
<p>This becomes more complicated the moment you move beyond a single VNet. In a hub-and-spoke design, the cluster lives in a spoke. Administrators may connect through a VPN terminating in the hub. Shared services may also live in the hub. Other workloads may be in separate spokes. On-premises DNS infrastructure may still exist. Peering does not magically solve name resolution across all of that. Private DNS only feels transparent when the environment is small enough that its assumptions have not been tested yet.</p>
<p>The AKS private API FQDN problem usually shows up in one of two ways. Either the name does not resolve at all from the place you are testing, or it resolves differently depending on where the query originated. Both are dangerous because they create the illusion that the cluster is "sometimes available" when the real issue is that the DNS path is inconsistent.</p>
<p>To make this reliable, private DNS had to be treated as shared infrastructure rather than as a side effect of cluster creation. Private DNS zones had to be linked deliberately where needed. DNS forwarding had to be explicit. In a multi-VNet and hybrid environment, the resolver path mattered just as much as the zone itself. In other words, the private DNS zone was necessary, but in this design it was not sufficient on its own.</p>
<p>That is what pushed the design toward a centralized DNS model in the hub. Instead of letting every spoke or connected client improvise, the environment benefited from having a clear place where private resolution was handled. Azure DNS Private Resolver became part of that answer. With a resolver pattern in the hub, it becomes much easier to define how on-premises systems, VPN-connected clients, and peered networks find private names in Azure without duplicating custom behavior in too many places. That was especially important because the design goal was not access to one cluster in one VNet, but consistent access from one hub to private AKS clusters spread across multiple spoke subscriptions.</p>
<p>The private DNS zone side looked roughly like this:</p>
<pre><code class="language-hcl">resource "azurerm_private_dns_zone" "aks_api" {
  name                = "privatelink.westeurope.azmk8s.io"
  resource_group_name = azurerm_resource_group.connectivity.name
}

resource "azurerm_private_dns_zone_virtual_network_link" "hub" {
  name                  = "hub-link"
  resource_group_name   = azurerm_resource_group.connectivity.name
  private_dns_zone_name = azurerm_private_dns_zone.aks_api.name
  virtual_network_id    = azurerm_virtual_network.hub.id
}

resource "azurerm_private_dns_zone_virtual_network_link" "aks_spoke" {
  name                  = "aks-spoke-link"
  resource_group_name   = azurerm_resource_group.connectivity.name
  private_dns_zone_name = azurerm_private_dns_zone.aks_api.name
  virtual_network_id    = azurerm_virtual_network.aks_spoke.id
}
</code></pre>
<p>This was the kind of configuration that made the difference between "the cluster exists" and "the cluster can actually be reached from the places that matter." But in this model, zone links alone were not enough. VPN-connected clients and the wider DNS estate still needed a clear resolver path into Azure from the hub.</p>
<p>A simplified version of the resolver side looked like this:</p>
<pre><code class="language-hcl">resource "azurerm_subnet" "dns_inbound" {
  name                 = "snet-dns-inbound"
  resource_group_name  = azurerm_resource_group.connectivity.name
  virtual_network_name = azurerm_virtual_network.hub.name
  address_prefixes     = ["10.10.10.0/28"]

  delegation {
    name = "dns-resolver"

    service_delegation {
      name    = "Microsoft.Network/dnsResolvers"
      actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
    }
  }
}

resource "azurerm_private_dns_resolver" "hub" {
  name                = "hub-dns-resolver"
  resource_group_name = azurerm_resource_group.connectivity.name
  location            = azurerm_resource_group.connectivity.location
  virtual_network_id  = azurerm_virtual_network.hub.id
}

resource "azurerm_private_dns_resolver_inbound_endpoint" "hub" {
  name                    = "hub-inbound-endpoint"
  private_dns_resolver_id = azurerm_private_dns_resolver.hub.id
  location                = azurerm_resource_group.connectivity.location

  ip_configurations {
    private_ip_allocation_method = "Dynamic"
    subnet_id                    = azurerm_subnet.dns_inbound.id
  }
}
</code></pre>
<p>That inbound endpoint was what gave the hub a stable DNS entry point for VPN-connected clients and existing DNS servers. In a broader hybrid setup, outbound endpoints and forwarding rules can also sit alongside it, but the main architectural point here was that the private AKS zone and the resolver path had to be designed together.</p>
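<p>As a hypothetical sketch of that outbound side, with placeholder names, an invented on-premises domain, and placeholder target addresses, the shape looks like this:</p>
<pre><code class="language-hcl"># Illustrative only: the domain and target DNS servers are placeholders.
resource "azurerm_subnet" "dns_outbound" {
  name                 = "snet-dns-outbound"
  resource_group_name  = azurerm_resource_group.connectivity.name
  virtual_network_name = azurerm_virtual_network.hub.name
  address_prefixes     = ["10.10.11.0/28"]

  delegation {
    name = "dns-resolver"

    service_delegation {
      name    = "Microsoft.Network/dnsResolvers"
      actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
    }
  }
}

resource "azurerm_private_dns_resolver_outbound_endpoint" "hub" {
  name                    = "hub-outbound-endpoint"
  private_dns_resolver_id = azurerm_private_dns_resolver.hub.id
  location                = azurerm_resource_group.connectivity.location
  subnet_id               = azurerm_subnet.dns_outbound.id
}

resource "azurerm_private_dns_resolver_dns_forwarding_ruleset" "hub" {
  name                                       = "hub-forwarding-ruleset"
  resource_group_name                        = azurerm_resource_group.connectivity.name
  location                                   = azurerm_resource_group.connectivity.location
  private_dns_resolver_outbound_endpoint_ids = [azurerm_private_dns_resolver_outbound_endpoint.hub.id]
}

resource "azurerm_private_dns_resolver_forwarding_rule" "onprem" {
  name                      = "corp-example"
  dns_forwarding_ruleset_id = azurerm_private_dns_resolver_dns_forwarding_ruleset.hub.id
  domain_name               = "corp.example.com." # trailing dot is required
  enabled                   = true

  target_dns_servers {
    ip_address = "10.0.0.10" # placeholder on-premises resolver
    port       = 53
  }
}
</code></pre>
<p>The inbound endpoint answers private Azure names for clients that forward to the hub; the outbound side does the reverse, letting workloads in Azure resolve names that still live on existing DNS servers. Both directions have to be designed, not assumed.</p>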
<p>In practical terms, that meant the existing DNS estate had to participate. On-premises resolvers needed conditional forwarding for the relevant private zones, and VPN-connected clients needed to use a DNS path that could actually answer private Azure names. Without that, the cluster might resolve correctly from one network and disappear from another even though nothing about AKS itself had changed.</p>
<p>The practical value of this is hard to overstate. Without a consistent resolver path, debugging cluster access becomes guesswork. A name can resolve inside one spoke, fail on a VPN-connected laptop, work on a jumpbox, fail from a peered VNet, then appear to work again because a local cache still holds a stale answer. That is not really an AKS problem. It is a DNS operating model problem.</p>
<p>Private AKS made that impossible to ignore. The cluster API was one of the cleanest examples of why DNS has to be designed, not assumed.</p>
<h2>6. Concrete Problems I Actually Had to Solve</h2>
<p>This is where private AKS networking stopped being architectural theory and turned into real platform engineering work. The interesting problems were rarely "Can Azure create the cluster?" They were almost always about why a private design that looked correct in a diagram still failed under real usage.</p>
<h3>The Cluster Was Healthy, But Nobody Could Resolve Its Name</h3>
<p>One of the first recurring issues was that the cluster existed, the node pools were healthy, the private control plane had been created, and yet operators still could not reach it. The failure mode was not dramatic. <code>kubectl</code> simply failed because the private API FQDN did not resolve from the place the engineer was working.</p>
<p>This is the sort of problem that wastes time because it looks like a cluster problem until you prove otherwise. The natural reaction is to inspect the AKS deployment, check role assignments, or assume the cluster provisioning did not finish cleanly. In reality, the control plane was fine. The broken piece was that the DNS path between the operator and the cluster had never truly been established.</p>
<p>In a private setup, it is not enough that the zone exists somewhere in Azure. It has to be reachable through the actual resolver path used by the calling machine. That is where cross-VNet resolution becomes very real. A peered network is not automatically a correctly resolving network. A VPN-connected laptop is definitely not one unless you make it one.</p>
<p>The fix was to stop thinking about name resolution as local to the cluster deployment and start treating it as part of the shared network architecture. The zone linkage and forwarding model had to be explicit. Queries needed to follow a resolver path that made sense from the hub, from the spokes, and from the VPN-connected client. Once that was made consistent, the issue stopped looking mysterious. Before that, it was easy to lose time investigating the wrong layer.</p>
<h3>The VPN Was Connected, But the API Server Still Timed Out</h3>
<p>Another common failure mode was even more misleading. DNS would resolve correctly, which made everyone feel closer to the answer, but the API server still timed out from the operator machine. At that point, people often assume the VPN itself is fine because it connected successfully. That is not a safe assumption.</p>
<p>A connected VPN icon tells you almost nothing about whether the route you need is actually usable.</p>
<p>In a hub-and-spoke model, the access path from a point-to-site VPN client to a private AKS cluster in a spoke depends on more than the gateway existing. Peering configuration matters. Gateway transit matters. Whether the spoke is using remote gateways matters. Address spaces need to be advertised correctly. User-defined routes, if present, need to do the right thing. A single incorrect assumption there is enough to produce a perfectly connected VPN session that still cannot reach the cluster.</p>
<p>This was one of the places where working methodically mattered. Once DNS returned the correct private address, the question changed from "Can I resolve it?" to "Can traffic from this specific source reach that specific destination, and can the return path work?" The fastest way to solve it was to stop staring only at the cluster and start validating the network hop by hop. Testing from a known-good machine inside Azure, then from the hub path, then from the VPN client made it much easier to isolate whether the fault sat in routing, peering configuration, or client-side resolution.</p>
<p>Private networking punishes vague troubleshooting. If you skip layers, you end up changing the wrong thing.</p>
<h3>Overlapping CIDR Ranges Turned Into Delayed Pain</h3>
<p>CIDR planning was another area where the problems arrived later than the decisions that caused them. A network design can look fine during cluster creation and still become a trap when hybrid connectivity or additional spokes are introduced.</p>
<p>The issue with overlapping ranges is not just that they are theoretically undesirable. It is that they create ambiguity in places where the platform desperately needs clarity. If an on-premises network, a spoke VNet, a VPN client address pool, or the AKS service and pod ranges overlap in the wrong way, traffic starts following paths that are hard to reason about and even harder to debug under pressure.</p>
<p>This kind of problem rarely announces itself cleanly. It tends to show up as intermittent reachability, routes that look correct from one perspective but not another, or debugging sessions where one engineer can reach a service and another cannot because they are effectively standing on different address assumptions. When a private cluster depends on hub connectivity, VPN access, and cross-environment communication, address space mistakes stop being local mistakes.</p>
<p>That is why I treat CIDR planning as foundational work rather than a spreadsheet exercise. It is much easier to reserve sensible space early than to redesign around overlaps later when clusters, VPNs, and hybrid links already depend on the current plan. If I had to summarize the lesson plainly, it would be this: CIDR debt is real debt. It accumulates quietly and gets expensive at exactly the wrong time.</p>
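<p>In practice, that foundational work amounts to something as unglamorous as one deliberate address map. The ranges below are invented for illustration, but the structure is the point: the VPN client pool, the cluster-internal ranges, and explicit room for growth are reserved in the same place as the VNets, so overlaps are caught at planning time instead of during an incident:</p>
<pre><code class="language-hcl"># Illustrative address plan; the real ranges were environment-specific.
locals {
  address_plan = {
    hub              = "10.10.0.0/16"    # gateway, DNS resolver, shared services
    spoke_dev        = "10.20.0.0/16"
    spoke_staging    = "10.30.0.0/16"
    spoke_prod       = "10.40.0.0/16"
    vpn_clients      = "172.16.100.0/24" # point-to-site pool, outside VNet space
    aks_service_cidr = "192.168.0.0/20"  # virtual range, but it must not collide either
    reserved_future  = "10.50.0.0/15"    # deliberately left empty for growth
  }
}
</code></pre>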
<h3>Debugging the Private Endpoint Was Mostly About Eliminating Assumptions</h3>
<p>The hardest part of debugging private endpoints is that many failures look the same from the outside. The API is unreachable. That does not tell you whether the problem is name resolution, route propagation, peering behavior, client-side DNS, NSG policy, a custom route, or a firewall path that is doing something unexpected.</p>
<p>The discipline that helped most was treating the path as a chain that had to be proven one link at a time. Does the private name resolve? Does it resolve to the expected address? From which networks? Does traffic from the source actually have a route to that address? Does the return path exist? Are the relevant peering settings correct? Is the problem still present when tested from a machine known to be inside the right network boundary?</p>
<p>That sounds basic, but it is the difference between debugging and thrashing. In private environments, it is easy to jump to conclusions because the symptom is simply "it does not connect." One of the most useful habits I built in this work was to stop treating the cluster as the first suspect. Very often, the cluster was fine. The environment around it was not.</p>
<h2>7. The Model That Ended Up Working</h2>
<p>The model that held up best was the one with the fewest hidden exceptions.</p>
<p>AKS clusters lived in workload spokes. The hub handled shared connectivity concerns from a dedicated platform subscription. VPN access terminated centrally through the hub gateway, and the spoke VNets in other subscriptions were designed to use that shared path rather than exposing separate administrative entry points. DNS was treated as a first-class design concern, with a resolver strategy in the hub and private name resolution made explicit instead of accidental. Private DNS zones, resolver endpoints, and forwarding rules were designed to support how operators, spokes, and hybrid paths actually worked, not just how the cluster was created.</p>
<p>That structure also made responsibility clearer. Workload teams did not need to become experts in private DNS resolution or VPN route propagation just to run services on the platform. The platform team owned those shared network decisions. In turn, application teams got a more predictable environment, and the support burden reduced because the same networking questions did not need to be solved from scratch for every cluster or service.</p>
<p>The human access path also became much cleaner. Routine deployments still went through GitOps and did not depend on direct cluster sessions. VPN access existed for the smaller group of people who actually needed it for cluster administration and deeper troubleshooting. That separation made the platform more governable and easier to audit.</p>
<p>Just as important, the working model made troubleshooting repeatable. When private cluster access failed, there was a known path to validate: resolve the name, verify the expected answer, confirm the network path from the current source, compare behavior from a known-good Azure vantage point, then move outward. Good architecture helps prevent incidents. Good operating models help end them.</p>
<h2>8. The Trade-Offs Private AKS Introduced</h2>
<p>Private AKS was the right choice for this environment, but it was not the cheap choice.</p>
<p>The most obvious trade-off was operational complexity. A public control plane lets you lean much more heavily on identity and authorization as the primary access controls. A private control plane pulls networking into every cluster conversation. Access, debugging, onboarding, and hybrid integration all become more involved because the control plane is now part of a wider private network design.</p>
<p>DNS was the largest recurring cost in that decision. If the resolver path is not clear, cluster access fails in ways that look inconsistent and waste time. That is why I would not recommend private AKS as a default posture for every team regardless of context. If the surrounding platform does not have a real answer for private DNS, peering, VPN access, and route planning, the cluster will inherit those weaknesses immediately.</p>
<p>There was also a trade-off between security posture and ease of access. A public API with strict identity controls and limited source IPs is operationally simpler. A private API is harder to get wrong from an exposure perspective, but only if the rest of the network model is competently designed. Otherwise you trade one kind of risk for another and simply move the pain into day-two operations.</p>
<p>The human access model had similar trade-offs. VPN access gave a better operator experience than forcing everything through a jumpbox, but it also meant the VPN path itself had to be designed and supported properly. That includes route propagation, client configuration, and DNS behavior on the operator machine, none of which can be hand-waved away just because the gateway connected successfully.</p>
<p>What made the trade-off worthwhile was that it aligned with the rest of the platform direction. The environment was already moving toward private connectivity, controlled access, and explicit network boundaries. Private AKS was consistent with that model. It would have been much harder to justify if the rest of the platform still operated as if public control-plane access were the normal answer.</p>
<h2>9. What I Would Do Differently</h2>
<p>With hindsight, there are a few things I would tighten earlier.</p>
<p>The first is CIDR planning. I would spend even more time upfront reserving address space with future spokes, future regions, VPN client pools, and hybrid connectivity in mind. This is the sort of work that feels overly cautious until it saves you from a redesign later. Once private clusters, route tables, VPN paths, and peering relationships depend on the existing ranges, changing them becomes painful quickly.</p>
<p>The second is DNS ownership. I would define the private DNS and resolver model earlier and more explicitly instead of letting cluster creation and later troubleshooting gradually reveal what the right structure should have been. Private DNS is not support glue around private AKS. It is part of the design. Treating it that way from day one would reduce a lot of drift and a lot of confused debugging.</p>
<p>I would also formalize the validation path sooner. Once you know private clusters are part of the platform, there is no reason to rely on memory for testing. A small, repeatable checklist for name resolution, route validation, peering assumptions, and known-good test points saves a surprising amount of time. When networking fails, the difference between a runbook and a hunch is enormous.</p>
<p>If I were scaling the model further, I would also look earlier at whether Azure Firewall or a stronger centralized egress and inspection pattern should sit more visibly in the design. Not because every private AKS deployment needs maximum network complexity, but because once the environment grows, ad hoc egress and inspection decisions become another source of inconsistent behavior.</p>
<p>I would keep the same core direction, though. The biggest changes I would make are about designing the invisible pieces earlier, not replacing the model itself.</p>
<h2>10. Why This Work Mattered</h2>
<p>This work mattered because it was not just about creating a Kubernetes cluster. It was about making a private platform operable.</p>
<p>Private AKS networking is one of those areas where platform engineering stops being abstract very quickly. You cannot solve it with pipelines alone. You cannot solve it with a clean Terraform or OpenTofu module alone. You have to understand how routing, DNS, VPN access, private endpoints, and governance interact in a real environment where multiple teams and networks already exist.</p>
<p>That is also why I think this kind of work differentiates platform engineers from people who only want to stay at the tool layer. Networking is where a lot of cloud implementations become vague. People know the words hub, spoke, VPN, and DNS, but the real signal is whether they can explain why a private cluster is healthy and still unreachable, and whether they know how to fix that without turning the design back into a public one.</p>
<p>The platform value here was not that every developer had to understand private DNS zones, gateway transit, or resolver paths. It was the opposite. The platform needed to absorb that complexity so the environment remained secure and usable without every workload team becoming a networking specialist.</p>
<p>The hardest part of private AKS was not creating the cluster. It was making sure the name resolved and the route existed from the places that mattered.</p>
<p>That is the kind of work I increasingly associate with senior platform engineering. Not just provisioning infrastructure, but taking responsibility for the invisible systems around it so other engineers can rely on them without constantly rediscovering how they work.</p>
]]></content:encoded></item><item><title><![CDATA[Reliability in Practice: What Actually Breaks and How I Handle It]]></title><description><![CDATA[1. Reliability Was Never About Preventing Failure
By the time reliability became a serious topic, most of the visible platform work was already in place. The Azure foundation existed. Private AKS acce]]></description><link>https://blog.ammarplatform.com/reliability-in-practice-what-actually-breaks-and-how-i-handle-it</link><guid isPermaLink="true">https://blog.ammarplatform.com/reliability-in-practice-what-actually-breaks-and-how-i-handle-it</guid><category><![CDATA[SRE]]></category><category><![CDATA[aks]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[observability]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Tue, 15 Jul 2025 08:30:00 GMT</pubDate><content:encoded><![CDATA[<h2>1. Reliability Was Never About Preventing Failure</h2>
<p>By the time reliability became a serious topic, most of the visible platform work was already in place. The Azure foundation existed. Private AKS access was working. GitLab CI/CD and ArgoCD had established a deployment path. Platform control planes and workload clusters had been separated. The environment model was much clearer than it had been at the beginning.</p>
<p>That kind of progress can create a false sense of safety.</p>
<p>Once a platform looks well designed on paper, people naturally start expecting stability to follow from structure. Sometimes it does. More often, structure simply changes the kinds of failure you see. The environment becomes more governable, but production still finds the weak assumptions. Traffic patterns change. Dependencies respond differently under load than they do in test. Resource limits that seemed reasonable during onboarding turn out to be badly tuned once real user behavior arrives. Health checks look fine until startup takes longer than usual. A deployment succeeds, but the service behind it is not actually ready for live traffic.</p>
<p>That was the point where reliability stopped feeling like a monitoring topic and started feeling like an operating discipline.</p>
<p>I do not think serious platform teams should define reliability as "the system does not fail." That standard is neither honest nor useful. Real systems fail. Dependencies slow down. Nodes get pressured. Configuration mistakes get through review. The better question is whether failure becomes visible quickly, whether the signals make sense, and whether the recovery path is disciplined enough that a bad situation does not get worse through confusion.</p>
<p>That changed how I thought about production. The goal was not to build a platform where nothing ever broke. The goal was to build one where failure was easier to contain, faster to understand, and safer to recover from.</p>
<h2>2. What Actually Broke Was Usually Ordinary</h2>
<p>One of the more useful lessons from production work is that major incidents are often made of very ordinary parts.</p>
<p>The failures I kept seeing were rarely dramatic in the way architecture diagrams imply. Most of them were not full-site outages caused by one spectacular design flaw. They were smaller operational weaknesses that lined up badly enough to become user-visible. Pods got <code>OOMKilled</code> because limits and actual usage had drifted apart. Readiness checks reported healthy too early. Services came up before a dependency was actually reachable. A rollout technically completed while latency quietly climbed in the background. A cluster or application component restarted repeatedly because the health checks were punishing a slow startup instead of detecting a dead process.</p>
<p>There were also the failures that did not look like failures at first. A service still responded, but slower. Error rates were low enough that nobody declared an outage immediately, yet high enough that customers were having a bad experience. Timeouts appeared only during specific traffic windows. A downstream dependency degraded just enough to create retries, queueing, or partial failures that spread into other services.</p>
<p>That kind of reliability work is harder than the dramatic version because it resists easy storytelling. Nothing has completely collapsed, but the system is no longer trustworthy. The platform is still running, but the margin is thinner than it looked yesterday. Recovery often starts before anyone can confidently explain root cause.</p>
<p>This is why I have grown skeptical of reliability writing that focuses only on idealized incident categories. In real environments, the things that break most often are rarely exotic. They are usually the operational details that teams assume are under control until production proves otherwise.</p>
<h2>3. The Hard Part Was Not Detecting That Something Was Wrong</h2>
<p>At first glance, that sounds backwards. Surely the hard part of reliability is noticing that something is failing.</p>
<p>Sometimes it is. More often, the harder part is recognizing what kind of failure you are looking at and deciding what to do first.</p>
<p>Most mature platforms already produce a lot of data. Prometheus is scraping. Grafana is full of dashboards. Logs are flowing. ArgoCD shows deployment history. GitLab shows what changed and when. The problem is that data by itself does not create operational clarity. During an incident, the platform does not reward the team with extra time just because the monitoring stack is well populated.</p>
<p>This is where many reliability efforts become less effective than they should be. Teams gather far more telemetry than they can use under pressure, then assume visibility must be good because the graphs are detailed. In reality, the first minutes of an incident are usually dominated by much simpler questions. Is this user-visible? Did something change recently? Is the fastest safe move to roll back, scale, restart, fail over, or reduce traffic? Are we looking at a service problem, a dependency problem, or a platform problem?</p>
<p>If the signals do not help answer those questions quickly, then the environment may be observable in a technical sense while still being hard to operate.</p>
<p>That distinction became central to how I thought about reliability. Reliability is not improved by accumulating more data than humans can use. It is improved by making the first operational decisions easier to get right.</p>
<h2>4. OOMKilled Pods Were Usually a Truth Problem</h2>
<p>One of the most common reliability issues in Kubernetes was also one of the least glamorous: containers getting <code>OOMKilled</code>.</p>
<p>This is one of those failures that is easy to underestimate because the first few occurrences often look like a local application problem. A pod restarts. The service comes back. The deployment remains technically healthy enough that nobody treats it as urgent. Then traffic increases, the pattern repeats, and what looked like a small runtime issue becomes a reliability problem.</p>
<p>In practice, <code>OOMKilled</code> pods were often exposing a gap between the story we had told the cluster and the behavior the application actually had. Requests and limits might have been chosen early, copied from another service, or based on a test environment that never exercised the same memory profile as production. From the scheduler's perspective, the configuration was the truth. From the workload's perspective, the real memory demand was the truth. Production was where those two truths collided.</p>
<p>This mattered because the failure rarely stayed isolated. Restarting pods create request failures, slower recovery, and misleading noise in dashboards. If the service sits behind retries or depends on other components that are also under pressure, the restart loop becomes part of a wider degradation pattern rather than a single bad pod event.</p>
<p>The fix was usually not "give it more memory and move on," at least not if the goal was to improve reliability rather than silence a symptom. The better approach was to look at actual runtime behavior over time, align requests and limits with that behavior, and treat recurring memory pressure as something worth understanding instead of something worth hiding. In some cases the application genuinely needed more headroom. In others, the configuration had simply remained wrong for too long because nobody revisited it after the service matured.</p>
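<p>To make that concrete, the shape of the fix usually looked something like the following deployment fragment. The service name and numbers here are illustrative, not from a real incident, but the principle is the one described above: derive the values from observed runtime behavior, then leave deliberate headroom.</p>

```yaml
# Hypothetical deployment fragment: requests and limits derived from
# observed usage rather than copied from another service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api            # illustrative service name
spec:
  template:
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.4.2
          resources:
            requests:
              memory: "512Mi"   # near the observed typical working set
              cpu: "250m"
            limits:
              memory: "768Mi"   # headroom above the observed p99, not a guess
              # CPU limit deliberately omitted here: throttling often hurts
              # latency more than it protects the node.
```

<p>The omitted CPU limit is a judgment call rather than a rule. The point is that every number has a reason attached, and gets revisited when the observed behavior drifts.</p>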
<p>This is one of the reasons I think reliability and platform engineering overlap so heavily. A lot of recurring production pain is not caused by one code defect. It comes from the platform tolerating stale assumptions for too long.</p>
<h2>5. Health Checks Broke Services More Often Than Teams Expected</h2>
<p>Another recurring class of failure came from liveness, readiness, and startup behavior.</p>
<p>Health checks are a good example of a platform feature that looks simple until it starts making the wrong decision automatically. A readiness probe that turns green too early can expose a service before it is ready to handle traffic. A liveness probe that is too aggressive can turn a slow startup or transient dependency issue into a restart loop. A service that technically starts but depends on a database connection, secret mount, external API, or cache warm-up phase can look healthy to Kubernetes while still being operationally unavailable.</p>
<p>This showed up most clearly after otherwise normal deployments. The rollout completed. ArgoCD showed the application synced. Pods were running. Then error rate and latency started climbing because the new pods were accepting traffic before the application had actually stabilized. From the outside, it looked like a mysterious regression. In reality, the cluster had done exactly what the probes told it to do.</p>
<p>These incidents were a useful reminder that Kubernetes is not judging application readiness intelligently. It is enforcing the contract you define. If that contract is optimistic, shallow, or borrowed from another service with different behavior, the platform will enforce the wrong thing with great consistency.</p>
<p>The improvements here were rarely exotic. Better startup behavior, more realistic readiness checks, and more careful probe timing prevented a surprising amount of avoidable pain. The hard part was not knowing that health checks matter. The hard part was resisting the temptation to treat them as boilerplate.</p>
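<p>As a sketch of what "more careful probe timing" meant in practice, a configuration along these lines separates a slow startup from a genuinely dead process, and makes readiness check something deeper than "the port is open". Paths, ports, and thresholds are illustrative and would need tuning per service.</p>

```yaml
# Hypothetical pod spec fragment: startupProbe absorbs slow boots so the
# livenessProbe only restarts a process that is actually stuck.
containers:
  - name: orders-api                # illustrative name
    ports:
      - containerPort: 8080
    startupProbe:
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 5
      failureThreshold: 30          # allows up to 30 * 5s = 150s to start
    readinessProbe:
      httpGet: { path: /ready, port: 8080 }   # should verify real dependencies
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
      failureThreshold: 6           # tolerate transient slowness before restarting
```

<p>The `/ready` endpoint only earns its keep if it reflects the dependencies the service actually needs before serving traffic, which is exactly the contract question discussed above.</p>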
<h2>6. Degradation Was Harder Than Outage</h2>
<p>Full outages are ugly, but they are often easier to reason about than partial failure.</p>
<p>If a service is completely down, everyone agrees something is wrong. The incident gets attention quickly. The recovery objective is obvious. Partial degradation is more difficult because the system is still doing enough to confuse people. Requests succeed sometimes. Dashboards show activity. Internal metrics may look acceptable depending on where you are staring. The service is technically up, yet users are clearly having a worse experience.</p>
<p>This kind of problem appeared often enough that it changed how I thought about production signals. Latency spikes, intermittent timeouts, elevated but not catastrophic error rates, and slowly worsening response times were often more operationally dangerous than hard failures because they invited hesitation. Teams started debating whether the incident was real while customers were already experiencing it.</p>
<p>The platform made this harder when alerts were tied too closely to internal component thresholds rather than user-visible symptoms. CPU or memory alerts might fire early, late, or not at all depending on the shape of the failure. Restart counts could be informative but still secondary. What actually mattered during these incidents was usually much closer to the edge: request success, request latency, saturation symptoms, and the timing of recent changes.</p>
<p>Once I saw enough of those incidents, I stopped thinking of reliability primarily in terms of uptime. Availability matters, but a service that technically responds while being operationally unreliable is still a reliability problem. Teams that only optimize for "is it up?" miss a large amount of what users actually experience as broken.</p>
<h2>7. What Did Not Help</h2>
<p>Some of the early responses to reliability problems were well-intentioned but not especially useful.</p>
<p>The first weak instinct was to add more alerts. On paper, this looked responsible. CPU thresholds, memory thresholds, restart thresholds, pod health, node conditions, latency, and error rates all got attention. The result was not better reliability. The result was alert fatigue and slower incident understanding. Multiple alerts described the same underlying issue from different angles, and the people on call learned to distrust the noise before they learned to trust the signal.</p>
<p>The second weak instinct was to build dashboards that were technically rich but operationally unhelpful. Grafana made it easy to create detailed views, and detailed views are often satisfying to build. That did not mean they were useful during live incidents. A dashboard that requires careful interpretation under pressure is not much of an incident tool. In several cases, the most detailed dashboards were the least helpful because they invited analysis before the service had been stabilized.</p>
<p>The third mistake was trying to debug too early.</p>
<p>This is a very common engineer instinct. Something breaks, and the team immediately wants root cause. That impulse is understandable, especially for capable engineers who do not like uncertainty. But during a live reliability event, early debugging often competes with the more important goal: reduce blast radius and restore sane behavior as quickly as possible. If the service is degraded, the priority should be stabilization. Root cause analysis matters, but it matters more after the system is no longer actively hurting users.</p>
<p>None of these were useless practices in themselves. Monitoring matters. Dashboards matter. Debugging matters. They just mattered in the wrong order when an incident was already in progress.</p>
<h2>8. The Response Model That Worked Better</h2>
<p>The response model that helped most was conceptually simple and operationally disciplined.</p>
<p>First, decide whether the issue is user-visible and whether it is getting worse. That sounds obvious, but it immediately changes how you prioritize. Not every alert deserves the same level of urgency. Not every odd graph shape is an incident. The faster the team can answer "is this affecting users right now?" the more sensible the next decision becomes.</p>
<p>Second, stabilize before investigating deeply. If a recent deployment is the likely cause, roll it back or revert it in Git and let the environment reconcile. If traffic needs to be reduced, do that. If a service is clearly under-provisioned and adding headroom is the safest move, do that. If a specific bad instance or pod set is making things worse, replace it. The point is not to guess wildly. The point is to prefer reversible actions that reduce harm.</p>
<p>Third, correlate aggressively. Reliability incidents often sit close to a recent deployment, config change, dependency change, traffic pattern shift, or platform event. GitLab history, ArgoCD sync history, Prometheus metrics, and logs all become more useful once the team has stabilized the situation enough to read them in sequence instead of in panic.</p>
<p>Only after that did deeper investigation become worthwhile. At that stage, the team could ask better questions. Was the failure mode exposed by a bad rollout setting, a resource mismatch, a dependency contract, or something the service was doing under real load that tests never captured? Those are good questions. They are just not always the first questions.</p>
<p>This sounds like straightforward incident discipline because it is. The difference is that a surprising amount of production pain comes from not following it consistently.</p>
<h2>9. Reliability Improved When Signals Became More Actionable</h2>
<p>One of the biggest improvements came from treating alert quality as a reliability concern in its own right.</p>
<p>The simplest filter I found was also the most useful: if this alert fires, what should someone actually do next? If the answer was vague, theoretical, or "it depends, go investigate," the alert probably was not good enough to interrupt a human.</p>
<p>That immediately changed the shape of the alerting model. User-impact signals mattered more than internal discomfort signals. Error rate, latency, and service availability generally deserved more attention than raw CPU, memory, or restart counts on their own. That did not make internal metrics irrelevant. It made them supporting evidence rather than primary incident entry points in many cases.</p>
<p>Prometheus and Grafana were already capable of showing the necessary data. The work was in reducing the distance between a signal and an operational decision. Good alerts did not merely say that the platform was behaving strangely. They narrowed the likely problem enough that the team could decide whether to roll back, scale, pause, or escalate.</p>
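<p>One way to apply that filter is to alert on the user-visible symptom and attach the expected next action directly to the alert. The following is an illustrative Prometheus rule with hypothetical metric labels and thresholds, not an exact production rule.</p>

```yaml
groups:
  - name: user-impact
    rules:
      - alert: HighErrorRate
        # Fires on the user-visible symptom (failed requests), not on
        # internal discomfort like CPU or restart counts.
        expr: |
          sum(rate(http_requests_total{job="orders-api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="orders-api"}[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "orders-api error rate above 5% for 5 minutes"
          # The alert carries its own first move, so the responder is not
          # left with "it depends, go investigate".
          runbook: "Check the last deploy in ArgoCD; roll back if it correlates."
```

<p>The `for: 5m` clause and the 5% threshold are exactly the kind of values that deserve debate per service; the structure, where the alert names an action, is the part that generalizes.</p>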
<p>This was also where reliability and observability stopped being interchangeable in my head. Observability is the broader ability to inspect the system. Reliability improves when the important parts of that visibility are structured into signals that help people act correctly under pressure.</p>
<h2>10. Standardization Removed Entire Classes of Incidents</h2>
<p>One of the less glamorous but more effective reliability improvements came from standardization.</p>
<p>By this point in the platform journey, a lot of the earlier work had already been about reducing repeated decisions. Golden templates, clearer delivery paths, controlled GitOps flows, and environment standards all helped application teams avoid re-solving the same platform questions from scratch. Reliability benefited from the same approach.</p>
<p>When probes, resource defaults, rollout patterns, and exposure models were left entirely to individual interpretation, recurring incidents multiplied. Not because teams were careless, but because production behavior is hard to predict and every service was effectively inventing its own operational contract. Once better defaults existed, entire categories of failure became less common.</p>
<p>That did not mean forcing every service into exactly the same shape. Some workloads genuinely needed different tuning. But it did mean that the platform could stop treating obviously failure-prone decisions as if they deserved infinite flexibility. Conservative defaults around readiness, reasonable requests and limits, safer rollout behavior, and consistent service patterns reduced the number of times the same incident had to be relearned under a new name.</p>
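<p>In practice this often took the form of a shared template with conservative defaults that teams inherited unless they had a reason to override them. A hypothetical values fragment for such a golden template might look like this; every name and number below is illustrative.</p>

```yaml
# Hypothetical shared-chart defaults: conservative behavior by default,
# overridden per service only with a stated reason.
replicaCount: 2                     # no single-replica services by default
resources:
  requests: { cpu: 250m, memory: 256Mi }
  limits:   { memory: 512Mi }
probes:
  readiness: { path: /ready, periodSeconds: 10 }
  liveness:  { path: /healthz, failureThreshold: 6 }
strategy:
  rollingUpdate:
    maxUnavailable: 0               # never drop below current capacity mid-rollout
    maxSurge: 1
podDisruptionBudget:
  minAvailable: 1                   # voluntary disruptions cannot empty the service
```

<p>Defaults like these do not prevent every incident. They prevent the same avoidable ones from being rediscovered by each new service.</p>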
<p>Reliability is often discussed as if it were mainly an incident management discipline. In practice, it improves a lot when the platform quietly prevents repeated operational mistakes from reaching production at all.</p>
<h2>11. Example: A Deployment That Looked Healthy and Wasn't</h2>
<p>One reliability pattern I saw more than once was a deployment that appeared completely normal from the delivery pipeline's perspective while being operationally wrong in production.</p>
<p>The GitLab pipeline passed. ArgoCD synced the new version. Kubernetes showed running pods. On paper, this looked like success. Then latency started rising and a portion of requests began failing because the new pods were reporting readiness before a dependency path had actually settled. Sometimes that was a database connection path. Sometimes it was a downstream internal service. Sometimes it was a startup routine that technically launched the process but had not finished the real work needed before serving traffic.</p>
<p>This kind of incident was useful because it exposed a gap between delivery success and runtime readiness.</p>
<p>The immediate handling was usually straightforward once the pattern was recognized. Revert or roll back the recent change, let the stable version recover service, and confirm the symptoms disappear. The more important work happened afterwards: make the readiness contract more honest, revisit startup behavior, and stop treating pod status as if it were the same thing as application health.</p>
<p>A surprising amount of reliability work is about closing exactly that gap. Platforms are very good at telling you whether they executed your instructions. They are much less capable of telling you whether your instructions represented reality.</p>
<h2>12. Example: A Service That Failed Only During Peak Hours</h2>
<p>Another common pattern was the service that behaved acceptably most of the time and then fell apart during the period when users actually needed it most.</p>
<p>In one form or another, this often came down to memory pressure, concurrency assumptions, or request behavior that looked fine in low-volume conditions and much worse during the daily peak. Outside those windows, the service appeared stable enough that the configuration passed review and the urgency stayed low. During peak usage, pods restarted, latency climbed, and the service began to look unreliable in a way that was highly visible to users even though it never became completely unavailable.</p>
<p>The first temptation in those incidents was to treat them as purely application-level defects. Sometimes they were. Just as often, the platform configuration was part of the story. Requests and limits were too optimistic. Autoscaling thresholds did not line up with the actual pressure signal. The team had reasonable telemetry, but not the habit of revisiting it against real production load.</p>
<p>What helped was correlating the runtime pattern with the actual traffic window instead of only staring at isolated pod failures. Once the service behavior was understood in the context of user demand, the fix usually became clearer. Adjust headroom where it was genuinely needed. Align the resource model with observed usage instead of inherited defaults. Then watch whether the improvement survives the next peak rather than declaring success immediately.</p>
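<p>A simple way to see that correlation is to put restarts and traffic on the same Grafana timeline. The metric names below are the standard kube-state-metrics and HTTP instrumentation ones; the namespace and label values are illustrative.</p>

```promql
# Restarts per hour for the service's pods (kube-state-metrics)
sum(increase(kube_pod_container_status_restarts_total{namespace="prod", pod=~"orders-.*"}[1h]))

# Request rate overlaid on the same window, to expose the traffic peak
sum(rate(http_requests_total{job="orders-api"}[5m]))
```

<p>When the restart spikes line up with the daily peak, the conversation shifts from "this pod is flaky" to "the resource model does not survive real load", which is the more useful framing.</p>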
<p>This is another area where reliability stops being theoretical very quickly. Production rarely cares whether the configuration was written with good intentions. It cares whether the service survives the period when it is actually needed.</p>
<h2>13. Example: The Alert Storm Was Not the Incident</h2>
<p>Some of the worst on-call moments were not caused by one massive platform failure. They were caused by a manageable failure arriving through an alerting model that made it look chaotic.</p>
<p>One service degraded. That should have been the incident. Instead, the response began with a flood of related alerts from pod restarts, node symptoms, latency alarms, downstream retry patterns, and secondary errors from services that depended on the original failing path. The team was not short on data. It was short on a clean entry point into the problem.</p>
<p>This is where alert design proved to be directly relevant to reliability rather than a separate observability concern. A noisy system does not merely annoy the on-call engineer. It delays the point at which the real incident gets understood accurately.</p>
<p>The solution was not to suppress everything. It was to distinguish between the alert that should open the incident and the supporting signals that help explain it once someone is already looking. A user-visible symptom should generally start the conversation. Lower-level component symptoms should enrich it, not compete with it.</p>
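<p>Alertmanager's inhibition rules are one mechanism for exactly this separation: a fragment like the following lets the paging alert open the incident while component-level alerts for the same service stay visible without paging anyone. The label names and receivers here are assumptions, not a lifted production config.</p>

```yaml
route:
  receiver: team-oncall
  routes:
    - matchers:
        - severity = "info"
      receiver: team-dashboard-only   # visible in chat/Grafana, never pages

inhibit_rules:
  # While a user-visible "page" alert is firing for a service, suppress the
  # supporting "info" alerts (pod restarts, node symptoms) for that service.
  - source_matchers:
      - severity = "page"
    target_matchers:
      - severity = "info"
    equal: ["service"]                # only inhibit within the same service
```

<p>The `equal` clause is what keeps this honest: an unrelated service's component alerts still come through, because they may be the start of a different incident.</p>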
<p>Once the alerting model moved in that direction, incident response got calmer very quickly. The platform had not become magically more reliable overnight. The team had simply stopped tripping over its own instrumentation on the way to the real issue.</p>
<h2>14. The Trade-Offs Were Real</h2>
<p>Reliability work is full of trade-offs that are easy to state and harder to live with.</p>
<p>More sensitive alerts can detect trouble earlier, but they also create noise and unnecessary interruption if they are not designed carefully. More dashboards can make the environment richer to inspect, but they can also slow decision-making if the operational path through them is unclear. Standardization reduces repeated mistakes, but too much rigidity can make it harder for services with genuinely unusual needs to operate correctly. Faster recovery actions, such as rollback, can reduce user pain quickly, but they may delay full understanding if the team never comes back for proper analysis afterwards.</p>
<p>There is also a deeper trade-off between elegance and operability. Engineers naturally like clean explanations and precise root cause. Production often rewards teams that can take safe, imperfect, stabilizing action before the whole story is known. That can feel unsatisfying in the moment, but it is usually the more mature posture.</p>
<p>I do not think reliability improves by pretending these trade-offs disappear. It improves when the platform and the team make them consciously instead of backing into them during the middle of an incident.</p>
<h2>15. What I Would Do Earlier</h2>
<p>Looking back, there are a few things I would push much earlier in the lifecycle of a platform.</p>
<p>I would define alerting principles sooner and make teams defend why an alert deserves to wake a human. I would standardize resource defaults and health-check patterns earlier, especially for services entering Kubernetes for the first time. I would spend more time teaching the difference between deployment success and runtime readiness because that misunderstanding causes more production pain than many teams realize. I would also make incident handling discipline more explicit, especially the habit of stabilizing first and investigating second.</p>
<p>Most importantly, I would treat recurring operational symptoms as design feedback much earlier. Repeated pod restarts, repeated memory pressure, repeated dependency startup issues, and repeated alert storms are usually not just bad luck. They are the platform telling you that a default, a contract, or a habit is wrong.</p>
<p>The earlier that feedback is taken seriously, the less often reliability work turns into repeated firefighting.</p>
<h2>16. Why This Still Felt Like Platform Engineering</h2>
<p>What made this reliability work meaningful was that it was never just about being better at incidents.</p>
<p>The incidents mattered, but the bigger lesson was how much of reliability is shaped before the incident starts. Platform defaults, workload contracts, alert quality, rollout behavior, resource conventions, and the clarity of the recovery path all influence whether failure stays small or becomes expensive. That is why I think reliability belongs naturally inside platform engineering. It is not only about operating the system after something breaks. It is also about designing the system so that common failures are easier to survive.</p>
<p>By this point in the broader series, that pattern should feel familiar. The landing zone work was about making cloud structure governable. The platform work was about reducing developer dependence on raw infrastructure. The networking work was about making private connectivity operable. The GitOps work was about making deployment state understandable. The multi-environment work was about separating change flows honestly. Reliability was another version of the same underlying discipline: remove ambiguity, make the important paths more predictable, and design the platform so that people can recover sensibly when reality stops matching the diagram.</p>
<p>The goal was never zero failure.</p>
<p>The goal was a platform where failure becomes visible quickly, signals remain trustworthy, and recovery is disciplined enough that the system earns trust again.</p>
]]></content:encoded></item><item><title><![CDATA[Cloud Costs in Practice: What Actually Helped Reduce Spend]]></title><description><![CDATA[Cloud Costs in Practice: What Actually Helped Reduce Spend
FinOps Lessons from Running EKS, EC2, RDS, and Supporting Platform Services on AWS
1. Cost Work Came From a Different Cloud Estate, but the L]]></description><link>https://blog.ammarplatform.com/cloud-costs-in-practice-what-actually-helped-reduce-spend</link><guid isPermaLink="true">https://blog.ammarplatform.com/cloud-costs-in-practice-what-actually-helped-reduce-spend</guid><category><![CDATA[finops]]></category><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><category><![CDATA[rds]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Syed Ammar]]></dc:creator><pubDate>Tue, 20 May 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<h1>Cloud Costs in Practice: What Actually Helped Reduce Spend</h1>
<p><em>FinOps Lessons from Running EKS, EC2, RDS, and Supporting Platform Services on AWS</em></p>
<h2>1. Cost Work Came From a Different Cloud Estate, but the Lesson Was the Same</h2>
<p>Most of the earlier posts in this series focused on Azure, AKS, private networking, platform separation, GitOps, and operating model design. This post comes from a different environment on AWS, but I have kept it in the portfolio because it shaped my thinking in exactly the same way. The cloud provider was different. The underlying lesson was not.</p>
<p>Cloud cost is usually discussed as if it belongs to finance, procurement, or a reporting function on the edge of engineering. In practice, the biggest savings I saw came from platform decisions, workload behavior, and the discipline to distinguish between what the system genuinely needed and what it was simply carrying by habit.</p>
<p>That mattered a lot in an AWS estate built around EKS, EC2, RDS, S3, and a mix of supporting platform services. By the time cost became a serious topic, the bill was already large enough that vague optimization advice was not going to help. Nobody needed another reminder to "be mindful of spend." What we needed was a clearer view of which parts of the platform were predictably valuable, which parts were genuinely variable, and which parts were just expensive because nobody had challenged them properly.</p>
<p>The useful part of FinOps, at least in this environment, was not the label. It was the discipline of making cost legible to engineers.</p>
<h2>2. FinOps Was Useful Only When It Stopped Being Abstract</h2>
<p>I have never found FinOps especially helpful when it is treated as a parallel management activity full of dashboards, allocation models, and generic cost optimization advice. It becomes useful when it is tied directly to how the platform actually behaves.</p>
<p>That meant starting from engineering reality rather than finance language.</p>
<p>The platform had a normal set of cost drivers for a modern SaaS environment. EKS provided the orchestration layer. EC2 carried a large portion of the compute footprint, including worker capacity and other supporting workloads. Aurora PostgreSQL sat under a meaningful part of the application. S3 stored a very large amount of data. GitLab runners and a few heavier job types introduced their own kind of bursty compute demand. None of that was surprising. What mattered was understanding how much of it was steady, how much of it was seasonal or bursty, and how much of it existed because past decisions had simply never been revisited.</p>
<p>That is where FinOps stopped sounding like a corporate program and started sounding like engineering work. Once the discussion moved away from "reduce the monthly bill" and toward "separate baseline, burst, and waste," the decisions became much easier to defend.</p>
<h2>3. Cost Was Not the Problem at First. Visibility Was.</h2>
<p>When cloud spend starts climbing, the first instinct is often to look for savings instruments, new tooling, or provider discounts. Those can help, but they are not where I started.</p>
<p>The first real problem was visibility.</p>
<p>Without a reliable picture of where spend was going, most optimization effort turns into guesswork. You can right-size a handful of instances and still miss the much larger pattern. You can talk about reservations before you understand the steady floor of the platform. You can argue about whether Kubernetes is expensive without knowing whether the problem is actually EKS itself, oversized node groups, idle environments, or storage growth that nobody is watching closely enough.</p>
<p>The cost work only became productive once it was possible to answer practical questions quickly. Which parts of the platform were stable enough to commit to? Which services or environments were disproportionately expensive? Which resources were heavily used during working hours but mostly idle at night? Which line items reflected deliberate architecture decisions, and which ones were just leftovers from earlier stages of the platform?</p>
<p>That visibility came from usage history, environment knowledge, and cost breakdowns that engineering teams could actually map back to real workloads. It was much less glamorous than a FinOps pitch deck and much more useful.</p>
<h2>4. Usage Patterns Told Us More Than the Invoice Did</h2>
<p>One of the more useful things about this environment was that the application usage profile was not random. It was a SaaS platform in a construction context, which meant traffic was strongly tied to working hours.</p>
<p>During the day, especially between roughly <code>8 AM</code> and <code>8 PM</code>, usage was predictably higher. Evenings dropped off. Weekends were materially quieter. That pattern mattered because it told us something the invoice alone could not: a meaningful part of the compute footprint was steady enough to plan around, but not everything needed to be paid for at on-demand rates all the time.</p>
<p>This is where a lot of cost work goes wrong. Teams jump straight from "the bill is high" to "we should optimize everything" without separating baseline demand from burst demand. Once those two are mixed together, almost every decision gets worse. You under-commit because peak usage looks scary, or you over-commit because the platform is large and the discounts look attractive.</p>
<p>Historical EKS usage trends were particularly useful here. Looking at node usage over time gave a much more honest picture of what the platform consistently needed in order to operate safely and what only showed up during predictable peaks or occasional spikes. That made later decisions around Reserved Instances much less speculative.</p>
<p>The important step was not identifying the highest traffic hour. It was understanding the floor of the platform well enough to commit to it confidently.</p>
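<p>The "floor" idea can be made concrete with a small calculation: take a history of hourly instance counts and treat a low percentile as the committable baseline, with everything above it as burst. This is an illustrative sketch, not the actual tooling we used.</p>

```python
def split_baseline_and_burst(hourly_usage, floor_percentile=10):
    """Split a usage history (instances in use per hour) into a baseline
    worth committing to (a low percentile, i.e. capacity that is almost
    always in use) and the average burst above that baseline.
    """
    if not hourly_usage:
        return 0, 0.0
    ordered = sorted(hourly_usage)
    # Index of the chosen low percentile in the sorted history.
    idx = (floor_percentile * (len(ordered) - 1)) // 100
    baseline = ordered[idx]
    # Burst: demand above the baseline, averaged over the whole window.
    burst = sum(max(0, u - baseline) for u in hourly_usage) / len(hourly_usage)
    return baseline, burst

# A toy day: quiet nights around 4 instances, a working-hours peak of 10-12.
history = [4] * 12 + [10] * 8 + [12] * 4
base, burst = split_baseline_and_burst(history)
# base is the floor worth considering for reservations; burst stays on-demand.
```

<p>Real analysis used weeks of history rather than one toy day, but the shape of the question was the same: what is almost always running, regardless of traffic?</p>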
<h2>5. Reserved Instances for EC2 Had the Biggest Financial Impact</h2>
<p>The single most effective cost measure in this environment was Reserved Instances for the EC2 footprint that represented the platform's steady baseline.</p>
<p>A large portion of the compute layer sat on compute-optimized instances in the <code>c5</code> family. Those were not chosen because they were fashionable. They matched the actual workload profile well enough that they had become the normal shape of a lot of the platform's compute demand. Once usage history made it clear that a substantial amount of that demand was persistent rather than occasional, keeping it all on on-demand pricing stopped making sense.</p>
<p>This is where the useful part of cost optimization was not "buy Reserved Instances." Anyone can say that. The real work was identifying how much of the EC2 footprint was stable enough to reserve without painting the platform into a corner.</p>
<p>That is a more careful decision than it sounds. Overcommitting can be just as bad as staying entirely on-demand. If you reserve too aggressively, you lock yourself into assumptions the platform may outgrow or invalidate. If you avoid reservations entirely because uncertainty feels safer, you end up paying on-demand rates for capacity that is effectively permanent.</p>
<p>What worked well was reserving the baseline rather than the peaks. Some commitments were made on <code>1-year</code> terms and some on <code>3-year</code> terms, depending on how stable the underlying usage looked. The point was not to maximize the reservation percentage for its own sake. The point was to cover the part of the platform we were already confident would exist regardless of day-to-day traffic variation.</p>
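<p>The economics of reserving the baseline rather than the peak can be sketched with toy numbers. The rates below are invented for illustration; real Reserved Instance pricing depends on instance family, region, term, and payment option.</p>

```python
HOURS_PER_MONTH = 730  # rough average hours in a month

def monthly_compute_cost(baseline, burst_avg, reserved, od_rate, ri_rate):
    """Monthly cost of running `baseline + burst_avg` average instances
    when `reserved` of them are covered by Reserved Instances.

    RIs are billed whether used or not; anything above the reserved count
    runs at the on-demand rate. Rates here are illustrative, not AWS prices.
    """
    avg_in_use = baseline + burst_avg
    reserved_cost = reserved * ri_rate * HOURS_PER_MONTH
    on_demand_cost = max(0.0, avg_in_use - reserved) * od_rate * HOURS_PER_MONTH
    return reserved_cost + on_demand_cost

# Toy c5-style rates: on-demand $0.17/hr, 1-year RI effective $0.11/hr.
all_od   = monthly_compute_cost(10, 3, reserved=0,  od_rate=0.17, ri_rate=0.11)
cover_10 = monthly_compute_cost(10, 3, reserved=10, od_rate=0.17, ri_rate=0.11)
cover_20 = monthly_compute_cost(10, 3, reserved=20, od_rate=0.17, ri_rate=0.11)
# Covering the 10-instance baseline is cheaper than staying fully on-demand,
# and also cheaper than over-reserving at 20 and paying for idle commitment.
```

<p>With these toy rates, over-reserving barely beats staying on-demand, which is exactly the trap described above: commitment only pays off on capacity that is genuinely always there.</p>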
<p>That produced the largest savings because it addressed the biggest recurring line item without relying on risky architectural change.</p>
<h2>6. RDS Was an Easier Commitment Than Compute</h2>
<p>If EC2 required some judgment, the database layer was even more straightforward.</p>
<p>Aurora PostgreSQL was carrying a meaningful and relatively steady portion of the platform's workload, and the database shape was much less burst-driven than parts of the application tier. In this kind of environment, that matters. Stateless compute often moves around. Database capacity tends to change more slowly and with more caution.</p>
<p>That made the reservation decision simpler.</p>
<p>For the Aurora PostgreSQL footprint, a <code>1-year</code> reservation on <code>db.r5.2xlarge</code> was a very easy win. The operational risk was low because the demand was stable and the database was not the kind of component that was likely to disappear or shrink dramatically in the near term. It was exactly the kind of spend that should not have been living at full on-demand pricing once the usage pattern was clear.</p>
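<p>The arithmetic behind "easy win" is worth making explicit: for a database that runs 24/7, the saving is simply the rate gap times the hours in the term. The rates below are invented for illustration and are not real AWS pricing.</p>

```python
def annual_savings(on_demand_hourly, effective_reserved_hourly, hours=8760):
    """Annual savings from covering a steadily running instance with a
    reserved rate instead of on-demand. 8760 = hours in a year."""
    return (on_demand_hourly - effective_reserved_hourly) * hours

# Hypothetical rates for a db.r5.2xlarge-shaped instance. A database
# that never stops pays the full gap, every hour, all year.
print(round(annual_savings(1.00, 0.62), 2))  # → 3328.8
```

<p>This is why stable, slow-moving layers like the database tier are the natural first candidates for commitment: there is no utilization uncertainty to discount the saving by.</p>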
<p>I think this is one of the more practical parts of FinOps that gets lost when people only talk at the portfolio level. Different parts of the platform deserve different commitment strategies. Databases are not application node groups. Bursty job runners are not Aurora. Treating them all as one commitment problem is a good way to make mediocre decisions in every direction.</p>
<p>The database layer was a good reminder that cost optimization improves when the platform is discussed as a system of workloads with different behaviors, not as one giant number.</p>
<h2>7. DoiT Helped With the Remaining On-Demand Usage</h2>
<p>Even after reservations, there was still a meaningful amount of on-demand usage that was not sensible to commit to.</p>
<p>That included the kinds of workloads most platforms always have some amount of: bursty usage, less predictable demand, and capacity that would have been risky to lock into a long commitment. This is where DoiT was useful.</p>
<p>It was not the biggest lever in the environment, and I would not pretend otherwise. The larger savings came from getting the commitment strategy right on the steady-state compute and database footprint. But for the remaining on-demand capacity, DoiT helped deliver roughly <code>10%</code> savings without forcing awkward engineering changes just to chase a discount.</p>
<p>That was valuable precisely because it addressed the part of the bill that reservations were never meant to solve.</p>
<p>I think this is an important point because cost stories often become too clean in hindsight. They make it sound as if one strategy solved everything. It did not. Reservations were right for baseline demand. DoiT helped with the still-variable on-demand layer. Those were complementary decisions, not competing ones.</p>
<p>The engineering lesson was simple: do not force a financial mechanism to solve the wrong class of usage.</p>
<h2>8. Storage Tiering Mattered More Than People Expected</h2>
<p>Storage was another major cost area, especially once the S3 footprint moved past <code>200 TB</code>.</p>
<p>At that scale, it stops making sense to talk about storage as one flat bucket of data. Different data has different access patterns, different business value, and different expectations around retrieval speed. If all of it sits in the same storage class indefinitely, the platform is paying for convenience it does not actually need.</p>
<p>Lifecycle policies made a real difference here because they introduced a more honest relationship between access pattern and storage cost. Frequently used data could remain where it needed to remain. Less frequently accessed data could move to cheaper tiers. Rarely accessed or archival material could move much further down the cost curve.</p>
<p>For some of the archive-heavy use cases, <code>Glacier One Zone</code> was a sensible fit. This was mostly data not directly needed by customers in a day-to-day operational path and more often touched by data warehouse or downstream analytical use cases. In other words, it did not carry the same retrieval expectations as customer-facing transactional data.</p>
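<p>A lifecycle policy of the kind described here can be expressed in a few lines with boto3. This is a hedged sketch: the bucket name, prefix, day thresholds, and the specific storage classes are illustrative stand-ins, since the right values depend entirely on measured access patterns.</p>

```python
# Illustrative tiering rules for archive-heavy analytics data.
LIFECYCLE_RULES = {
    "Rules": [
        {
            "ID": "tier-analytics-archive",
            "Filter": {"Prefix": "analytics/"},  # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [
                # Still occasionally read: cheaper, same retrieval speed.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                # Warehouse-only data: much further down the cost curve.
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

def apply_lifecycle(bucket: str) -> None:
    """Attach the tiering rules to a bucket."""
    import boto3  # assumed available where this runs
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE_RULES
    )

if __name__ == "__main__":
    apply_lifecycle("example-platform-archive")  # hypothetical bucket
```

<p>The useful property of a policy like this is that the tiering decision is made once, in code, rather than re-litigated object by object.</p>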
<p>What mattered here was not just the storage class choice. It was acknowledging that "keep everything in the expensive tier forever" is usually a product of indecision, not of actual access requirements.</p>
<p>At <code>200 TB</code> and beyond, even modest improvements in lifecycle policy discipline become real money. This was one of those areas where the savings were not flashy, but they were undeniable.</p>
<h2>9. EKS Was Part of the Story, but Worker Behavior Mattered More Than the Control Plane</h2>
<p>It is easy to blame Kubernetes itself for a high bill because it is a visible part of the platform architecture. In practice, the EKS control plane fee was not the heart of the problem. The more important questions lived underneath it.</p>
<p>How large were the worker footprints during the hours that mattered? How much of that size reflected real demand versus inherited assumptions? Which node groups were carrying stable application load, and which were mostly there to absorb variability that could have been treated differently?</p>
<p>This is where historical EKS usage trends paid off again. Once the underlying worker demand was understood, the conversation stopped being "EKS is expensive" and became much more precise. The platform was paying for a combination of baseline worker capacity, daytime peaks, and a handful of supporting workloads that behaved very differently from the core application.</p>
<p>That precision mattered because it prevented the wrong kind of reaction. The answer was not to make the platform fragile by squeezing worker capacity too hard. The answer was to reserve what was demonstrably stable, leave the unpredictable part flexible, and stop confusing variability with waste.</p>
<p>Kubernetes cost work often sounds more complicated than it really is. Most of the time, it comes back to understanding how much of the worker estate is structural and how much of it is situational.</p>
<h2>10. GitLab Runners and Ephemeral Compute Were Easy Wins</h2>
<p>Some of the cleanest cost wins came from workloads that never needed to be running continuously in the first place.</p>
<p>GitLab runners were a good example. A few of the job types required relatively large EC2 instances, and some workloads occasionally needed GPU-backed machines that were completely outside the normal EKS pattern. Keeping those instances alive full time would have been a very expensive way to avoid a small amount of orchestration work.</p>
<p>The better approach was to make them genuinely ephemeral.</p>
<p>Instances were brought up when a job started and shut down again after the job completed. GitLab automation handled the mechanics, which meant the platform did not rely on someone remembering to clean up expensive build infrastructure later. That mattered especially for the larger or more specialized instances, where the financial penalty for laziness would have been obvious very quickly.</p>
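<p>One way to make the lifecycle self-enforcing, sketched below with boto3. <code>InstanceInitiatedShutdownBehavior</code> is a real EC2 launch option; the tag names, instance types, and job-tag mapping are hypothetical, and this is not the actual GitLab automation used on the platform.</p>

```python
def instance_type_for(job_tags):
    """Pick an instance shape from a CI job's runner tags.
    Tag names and type choices here are illustrative."""
    if "gpu" in job_tags:
        return "g4dn.xlarge"   # GPU jobs outside the normal EKS pattern
    if "heavy-build" in job_tags:
        return "c5.4xlarge"
    return "c5.xlarge"

def launch_ephemeral_runner(ami_id, job_tags):
    """Launch a runner that removes itself when the job finishes:
    an OS-level shutdown at job end terminates the instance,
    so there is no cleanup step to forget."""
    import boto3  # assumed available in the scheduler environment
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type_for(job_tags),
        MinCount=1,
        MaxCount=1,
        # Shutdown from inside the box terminates it rather than
        # leaving a stopped (but still EBS-billed) instance behind.
        InstanceInitiatedShutdownBehavior="terminate",
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "lifecycle", "Value": "ephemeral"}],
        }],
    )
    return resp["Instances"][0]["InstanceId"]
```

<p>The design point is that the expensive instance cannot outlive the job by default; surviving would require someone to actively prevent the shutdown.</p>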
<p>This was one of the clearest examples of a broader principle: turning something off is often more effective than endlessly optimizing something that should not have been running in the first place.</p>
<p>There is a certain kind of cloud waste that comes not from wrong sizing but from unnecessary runtime. Ephemeral compute is where that waste is easiest to challenge because the system itself can enforce the lifecycle instead of hoping people do the right thing manually.</p>
<h2>11. Tagging Temporary Resources and Shutting Them Down at Midnight Helped More Than It Sounds Like It Would</h2>
<p>The same logic applied beyond GitLab runners.</p>
<p>Some resources were clearly temporary or non-essential outside working hours, but they still had a habit of surviving overnight simply because nobody was actively thinking about them after the workday ended. Once that pattern exists, the bill fills up with small amounts of runtime that nobody would ever defend individually and nobody gets around to removing systematically.</p>
<p>The simple answer was tagging and scheduled shutdown.</p>
<p>Resources designated as ephemeral were tagged accordingly and automatically shut down around <code>00:00</code> each day. This was not a sophisticated piece of cost engineering, but it was effective precisely because it did not depend on intention surviving the end of the day. If a resource genuinely needed to stay alive, it should not have been in the ephemeral category. If it did not need to stay alive, the platform should not have left the choice to memory.</p>
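<p>A sweep like this is small enough to sketch in full. The tag key <code>lifecycle=ephemeral</code> and the function names are hypothetical; the scheduling mechanism (cron, EventBridge, or similar) is assumed rather than shown.</p>

```python
def ephemeral_instance_ids(reservations):
    """Given describe_instances()-shaped reservations, return the IDs
    of running instances tagged lifecycle=ephemeral."""
    ids = []
    for res in reservations:
        for inst in res["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if (inst["State"]["Name"] == "running"
                    and tags.get("lifecycle") == "ephemeral"):
                ids.append(inst["InstanceId"])
    return ids

def midnight_sweep():
    """Intended to run on a 00:00 schedule. Anything still tagged
    ephemeral at midnight gets stopped, no questions asked."""
    import boto3  # assumed available where the sweep runs
    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "tag:lifecycle", "Values": ["ephemeral"]}]
    )
    for page in pages:
        ids = ephemeral_instance_ids(page["Reservations"])
        if ids:
            ec2.stop_instances(InstanceIds=ids)
```

<p>The filtering logic is deliberately dumb: if a resource needed to survive the night, the fix is to move it out of the ephemeral tag, not to argue with the sweep.</p>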
<p>There is a tendency in cloud cost discussions to look for sophisticated optimization first. In my experience, a lot of spend disappears once the platform gets better at enforcing lifecycle on things that were never meant to be permanent.</p>
<p>The platform did not become cheaper because we found a clever algorithm. It became cheaper because we stopped paying for nighttime inertia.</p>
<h2>12. Tagging, Review Loops, and Cost Ownership Made the Savings Stick</h2>
<p>One-off savings are easy to lose if nobody owns the follow-through.</p>
<p>That is why tagging mattered for more than just reporting. Resources needed enough metadata around environment, service, and ownership that cost analysis could be tied back to a real engineering conversation. If a workload was unusually expensive, it should be possible to identify it quickly. If a platform service had grown well beyond what was expected, that should be visible before the end of the quarter. If an environment cost changed materially, the right people should not discover that by accident weeks later.</p>
<p>Regular review loops helped keep the optimization work from turning into a one-time cleanup exercise. Daily checks, weekly summaries, and broader monthly or quarterly reviews were useful not because more meetings are inherently good, but because cost drift is rarely dramatic at the start. It accumulates. The earlier it is made visible, the easier it is to correct without disrupting platform work.</p>
<p>Alerts were part of that as well. The platform team and the relevant departmental leadership could see meaningful changes before they turned into unpleasant surprises. That kept cost discussions grounded in recent data rather than in stale assumptions.</p>
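<p>The drift-detection part of that loop is mechanically simple. A hedged sketch: the data shape, service names, and threshold below are invented; in practice the series would be fed from Cost Explorer grouped by a service or ownership tag.</p>

```python
def cost_drift_alerts(daily_costs, threshold=0.2):
    """Flag services whose latest daily cost moved more than
    `threshold` (as a fraction) versus their trailing average.
    daily_costs: {service_name: [day1_cost, day2_cost, ...]}"""
    alerts = {}
    for service, series in daily_costs.items():
        if len(series) < 2:
            continue
        baseline = sum(series[:-1]) / len(series[:-1])
        if baseline == 0:
            continue
        change = (series[-1] - baseline) / baseline
        if abs(change) > threshold:
            alerts[service] = round(change, 2)
    return alerts

# Hypothetical per-service daily spend: one service jumped, one is steady.
print(cost_drift_alerts({
    "checkout": [100, 102, 98, 101, 160],
    "search":   [50, 49, 51, 50, 52],
}))  # → {'checkout': 0.6}
```

<p>Because drift accumulates gradually, the value is in running this daily against fresh data rather than discovering the trend in a quarterly review.</p>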
<p>This is one of the places where FinOps, when done properly, is just operational hygiene. Visibility only matters if it feeds a loop that people trust and respond to.</p>
<h2>13. What Did Not Help</h2>
<p>Some approaches were consistently less useful than they sounded.</p>
<p>Over-optimizing very small resources rarely moved the needle. It created noise and made people feel busy, but it did not address the meaningful parts of the bill. The real savings came from baseline compute, database commitments, storage lifecycle policy, and runtime discipline around ephemeral workloads.</p>
<p>Trying to optimize everything at once was also a mistake. Large platforms always have more potential savings ideas than anyone has time to pursue well. The right move was to start with the most structurally important cost drivers and only then work downward. That is much more effective than scattering attention across dozens of small items with unclear impact.</p>
<p>I was also cautious about making commitment decisions too early. Reserved capacity is powerful when it matches reality. It is much less attractive when it is being used to paper over a platform that has never been properly understood. The right order was visibility first, then baseline analysis, then commitments.</p>
<p>The same was true of tooling. Tools can help, but they do not replace judgment. Cost optimization only becomes durable when the platform model itself is sane enough that the numbers mean something.</p>
]]></content:encoded></item></channel></rss>