
Designing a Developer Platform: From Infrastructure to Self-Service

How I Reduced Cloud and Kubernetes Complexity for Application Teams


1. Infrastructure Was Not the Hard Part

Earlier in this series, I wrote about the Azure foundation work: landing zones, subscription boundaries, RBAC, networking, and the operating model needed to make cloud adoption manageable. That work mattered, but it was not the point where application teams actually felt enabled. It was the point where the real platform problem became visible.

Once the Azure side was structured and AKS clusters were available, the assumption from the outside was often that the difficult part was over. The organization had cloud infrastructure, CI/CD pipelines, Kubernetes, and the usual set of modern tooling. On paper, that sounds like enablement. In practice, it only meant the raw ingredients were now present. The day-to-day experience for developers was still far more complicated than it needed to be.

This is a gap I have seen repeatedly. Teams ask for Kubernetes, infrastructure as code, CI/CD, or cloud resources, and those things get delivered. But giving people access to powerful systems is not the same as making them productive with those systems. A running AKS cluster does not automatically become a usable application platform. A GitLab pipeline does not become a deployment model just because it exists. If every team still depends on the platform team to interpret manifests, fix ingress, manage secrets, explain environment behavior, or rescue broken deployments, then infrastructure has been provisioned but the platform has not really been designed.

That distinction became central to the work. The problem was no longer how to stand up Azure resources. The problem was how to turn Azure, AKS, GitLab CI/CD, ArgoCD, OpenTofu, Prometheus, and Grafana into something that application teams could use safely and repeatedly without needing a DevOps engineer every time they wanted to make a change.

2. Where Developers Were Actually Struggling

The environment was built around Azure and Kubernetes, supporting a growing microservices landscape. From a platform perspective, that was a reasonable direction. From a developer perspective, it came with a large amount of operational surface area that most teams had no reason to become experts in.

What slowed teams down was not usually the application code itself. It was everything around the code. A team could build a service, but getting it from repository to reliable runtime meant dealing with Kubernetes manifests, image build conventions, service exposure, ingress behavior, environment-specific configuration, secret handling, rollout behavior, and runtime debugging. Even small mistakes in those areas could cause deployments to fail in ways that were difficult to reason about if your day job was building product features rather than operating clusters.

Kubernetes YAML was a common source of friction, but the issue was broader than syntax. A manifest is not just a configuration file. It encodes operational decisions. A developer writing a Deployment, Service, or Ingress definition is making decisions about health checks, scaling assumptions, network exposure, restart behavior, labels, selectors, and configuration layout, whether they realize it or not. In a microservices environment, those decisions get repeated over and over again across services, environments, and teams. If each team makes them differently, inconsistency becomes normal very quickly.
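To make that concrete, here is a minimal sketch of the decisions even a small service drags in. The names, values, and conventions are illustrative, not the platform's actual defaults:

```yaml
# Illustrative only: even a "simple" Deployment + Service pair encodes
# operational decisions about health, scaling, routing, and naming.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api            # naming convention: re-decided per team unless standardized
  labels:
    app: orders-api
spec:
  replicas: 2                 # scaling assumption
  selector:
    matchLabels:
      app: orders-api         # must match the pod template labels, or nothing is managed
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.4.2   # tagging convention
          ports:
            - containerPort: 8080
          readinessProbe:     # health semantics the team is committing to
            httpGet:
              path: /healthz
              port: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: orders-api
spec:
  selector:
    app: orders-api           # a typo here breaks routing silently
  ports:
    - port: 80
      targetPort: 8080
```

Every field above is a place where two teams can drift apart without either of them making a deliberate decision.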

Azure introduced a second layer of complexity on top of Kubernetes. Networking alone could become a significant tax: private endpoints, private DNS, internal versus external exposure, ingress patterns, and the difference between something being reachable inside the cluster, inside a VNet, or from outside the environment entirely. Then there was secrets management, where developers needed a safe way to consume application secrets without hardcoding values, embedding them in repo variables indefinitely, or treating Kubernetes secrets as if they were a complete secrets strategy.

CI/CD was another pain point. Developers did not just need a pipeline; they needed to understand how images were tagged, where artifacts were published, what promoted a change from one environment to another, how deployment state was represented, and why a pipeline passed while the workload still failed after deployment. That distinction between build success and runtime success often created confusion. The question was rarely "Did the code compile?" It was more often "Why is the application healthy in one environment, but not in another?" or "Why did the cluster accept this change but the service still is not reachable?"

The natural consequence was dependency on the platform or DevOps team. Requests came in under different labels, but many of them meant the same thing: something about the platform was harder than the application team should have to absorb. Sometimes that showed up as a deployment request. Sometimes it was a networking question. Sometimes it was a secrets issue, an ArgoCD sync problem, or a pod repeatedly crashing for reasons that were obvious only if you already understood the runtime. Over time, the platform team becomes a human API for infrastructure and operations, which is not scalable for either side.

3. Why Raw Kubernetes Was the Wrong Interface

One of the important lessons in this work was that the answer was not to insist that every developer learn Kubernetes more deeply. A certain level of platform awareness is useful, and application teams should understand the operational basics of the systems they run on. But there is a difference between healthy operational ownership and pushing infrastructure complexity downstream because the platform has not been productized.

AKS is a strong runtime when used well. GitLab and ArgoCD are good building blocks. Azure provides the necessary primitives for identity, networking, and secrets. None of that changes the fact that the combined abstraction level is still too low for most product teams to work against directly. Expecting every backend engineer to think fluently in terms of ingress classes, RBAC scopes, managed identities, private DNS resolution, rollout health, and GitOps reconciliation is usually a sign that the platform team has exposed implementation details as the user interface.

That is not a criticism of developers. It is a design problem. Most application teams are trying to ship business capability. Their mental model starts with endpoints, dependencies, configuration, latency, failure handling, and domain behavior. When a team has to become part-time cluster operator just to release a service safely, the platform is asking them to spend cognitive energy on the wrong layer.

This mattered even more in a microservices model. A monolith can hide a lot of infrastructure complexity simply because the deployment surface is smaller. A microservices landscape does the opposite. It multiplies the number of deployable units, network paths, secrets, dashboards, and failure modes. That makes standardization and abstraction more valuable, not less. Without them, every new service adds not only application behavior but another copy of the same infrastructure decisions.

Kubernetes was not the problem. Exposing it directly to developers as the default interface was.

The goal, then, was not to teach every team deep Kubernetes internals. The goal was to make sure they did not need deep Kubernetes internals for the common path. That is a very different design problem from simply giving people access to a cluster.

4. The Shift From Tooling to Product Thinking

The language shift from "providing infrastructure" to "designing a platform" sounds cosmetic until you feel the difference in day-to-day work. When the job is framed as infrastructure delivery, success is easy to define in component terms. The cluster exists. The pipeline runs. The OpenTofu configuration applies cleanly. ArgoCD is installed. Prometheus is scraping. Those are all useful milestones, but they still say very little about whether an application team can get a service into production without tripping over platform internals.

Once the work was treated as platform design, the questions changed. What does a sane onboarding path look like for the next microservice, not the current one? Which parts of Azure and Kubernetes should be invisible to an application team most of the time? Where do we want flexibility, and where do we want one opinionated answer because variation only creates support load? If a team needs to ship a routine change, can they do it safely without broad Azure permissions, a kubeconfig, or a side conversation with the platform team?

Those questions led to a small set of principles that were useful precisely because they were not theoretical. Reduce cognitive load instead of moving it around. Prefer one good path over five loosely supported ones. Encode security and governance into the workflow rather than relying on everyone to remember them. Automate the repetitive parts. Make self-service real, but keep the blast radius controlled. Self-service without guardrails is just delegated risk.

Once that became the frame, the tooling started to fall into place. OpenTofu was the way to keep the Azure and AKS foundation consistent. GitLab CI/CD was the obvious interface because developers already lived there. ArgoCD gave us a reconciler and an audit trail instead of a collection of imperative deploy steps. Prometheus and Grafana stopped being side projects and became part of what it meant to run on the platform. Key Vault was not just where secrets lived; it was part of the expected way services consumed sensitive configuration.

The point was not to make the infrastructure look simpler than it was. The point was to keep it out of the developer's critical path.

5. Platform Architecture at a Glance

The easiest way to explain the platform is to follow a single change. A developer opens or merges a change in GitLab. CI builds the image, tags it, and runs the expected checks. Deployment state is updated in Git rather than by calling into the cluster directly. ArgoCD notices the change and reconciles AKS toward that declared state. The workload starts behind an approved networking pattern, and its runtime behavior is already visible through the shared observability stack.

Under that developer-facing path sat the Azure and Kubernetes foundation. OpenTofu provisioned the repeatable Azure structure, the AKS integration points, and the surrounding platform dependencies. The important detail was not that developers never touched Azure. It was that they did not need to think directly in terms of DNS zones, ingress plumbing, RBAC assignments, or secret wiring to ship a normal service change.

Secrets followed the same general idea. Sensitive values lived in Azure Key Vault, and the platform defined how those values became available to workloads. Observability followed it too. Prometheus and Grafana were not optional extras teams had to discover later; they were part of the runtime contract.

That mental model turned out to be important. If a team cannot explain the deployment path in a few sentences, they usually do not trust it. The chain "developer to GitLab, GitLab to Git state, ArgoCD to AKS, then metrics and dashboards available by default" was simple enough to hold in your head even though the platform underneath was not simple at all.

6. Designing the Platform Contract

Once the platform was treated as something engineers would consume rather than admire from a diagram, the next step was defining the contract with application teams. If that contract lives in tribal knowledge, the platform does not scale. People start succeeding based on who they know, which repository they copied from last, or which engineer happens to remember why a particular service was set up differently two years ago.

The contract needed to answer a few basic questions very clearly. What does a service team provide? What does the platform generate, enforce, or manage for them? Which decisions still belong to the application, and which ones are intentionally taken off the table? That boundary matters because production incidents have a habit of finding any responsibility that was left ambiguous.

In our case, application teams owned their code, service-specific configuration, health semantics, and the runtime behavior of what they built. The platform owned the repeated scaffolding around that code: deployment structure, GitOps mechanics, secrets integration, exposure patterns, environment layout, and the defaults that should not be re-decided from repository to repository.

That is why the developer interface could not be raw AKS or the Azure portal. It had to live where developers already worked: repository structure, standard configuration, merge requests, and CI/CD. A developer should not need a kubeconfig to deploy a routine change.

A lot of weak self-service models fail exactly here. They claim to abstract complexity but still force teams to think in cluster terms for everyday work. If the normal deployment path still depends on people understanding namespaces, ingress annotations, ArgoCD behavior, and Azure resource relationships in detail, the platform has only renamed the problem.

7. Golden Paths and Reusable Templates

The most concrete part of that contract was the introduction of golden templates and reusable deployment patterns. This was where the platform stopped being theoretical and started changing the daily experience of building and releasing services.

Before that work, too many teams were solving the same problems slightly differently. One service had one pipeline structure, another had a different tagging model, another used a different deployment layout, and another copied a manifest from an older repository and adjusted it by trial and error. Those differences were rarely deliberate architecture decisions. Most of the time they were just accumulated variation. That kind of variation becomes expensive very quickly because the platform team now has to support not only the applications, but every historical interpretation of how an application might be deployed.

The golden path was designed to remove that unnecessary variation. GitLab CI/CD templates standardized how services were built, tagged, scanned, and promoted. Deployment templates standardized how a service described its runtime needs. The configuration structure across environments was made consistent so teams did not have to invent their own model for dev, test, and production every time a new service was onboarded.

This did not mean every application became identical. It meant the common path became predictable. A team starting a new service no longer had to assemble the delivery model from scratch. They inherited a working pattern. The platform templates already knew how to build a container image, publish it through the approved path, update the GitOps source of truth, and let ArgoCD reconcile the workload into AKS. Default labels, common probes, naming patterns, environment structure, and other repetitive details were handled the same way across services unless there was a valid reason not to.
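From the consuming side, a golden path like this typically reduces a service pipeline to an include plus a handful of inputs. The template project, file name, and variable names below are hypothetical, but the shape is what the text describes:

```yaml
# .gitlab-ci.yml of a service repository -- a sketch, assuming a central
# template project. Project path, ref, and variable names are illustrative.
include:
  - project: platform/ci-templates
    ref: v2                      # templates are versioned like any other dependency
    file: service-pipeline.yml   # defines build, scan, publish, and GitOps update

variables:
  SERVICE_NAME: orders-api
  SERVICE_PORT: "8080"
```

Everything else in the delivery model (image build, tagging, scanning, publishing, updating the GitOps source of truth) comes from the shared template rather than from the repository itself.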

For a typical service, that meant the team started from a standard GitLab template, filled in the service-specific inputs, and stayed focused on the behavior of the application itself. They still decided what healthy looked like, what dependencies the service had, whether it should be internal or externally reachable, and what runtime profile it needed. They no longer had to rebuild the surrounding deployment model each time or guess which pieces were mandatory because a previous repository happened to include them.

That changed the nature of the work for application teams. Instead of writing and maintaining a large amount of repetitive Kubernetes and pipeline configuration, teams mainly provided the parts that were genuinely specific to the service. What port does the application listen on? Should it be exposed internally or externally? Which secrets does it need? Does it need more than the default resource profile? What does healthy look like? Those are meaningful questions. Requiring every team to also handcraft the surrounding deployment machinery was not.
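Those service-specific questions can be captured in a small declarative input file that the templates consume. The keys below are a hypothetical schema, not the platform's real contract:

```yaml
# Hypothetical per-service input file consumed by the platform templates.
service:
  name: orders-api
  port: 8080
  exposure: private-internal     # cluster-internal | private-internal | external
  resources: default             # or a named larger profile
  health:
    readinessPath: /healthz      # what "healthy" looks like is still the team's call
  secrets:
    - DB_CONNECTION_STRING       # resolved through the Key Vault-backed pattern
```

The team answers the questions that are genuinely theirs; the surrounding deployment machinery is generated the same way for every service.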

The best templates do more than save time. They shrink the number of decisions that can go wrong. In platform work, reducing the decision surface is often more valuable than adding more options.

8. Abstraction Without Losing Operational Ownership

One of the easiest mistakes in platform work is to confuse abstraction with hiding reality. That was never the aim. The aim was to remove the repetitive, fragile infrastructure work from the daily developer path without pretending that operational responsibility had vanished.

There is a difference between hiding Kubernetes and hiding the consequences of running on Kubernetes. Application teams still needed to understand their own probes, scaling behavior, dependency timeouts, startup patterns, and failure modes. If a service fell over because it could not handle a database reconnect or because its readiness endpoint was misleading, that was still an application problem. No amount of template work changes that.

What the platform absorbed were the mechanics that were both necessary and endlessly repeated: how deployment state was rendered, how secrets were supplied, how the GitOps update happened, how approved service exposure worked, and how the standard build and promotion path behaved. Those were not strategic decisions each team needed to make for itself. They were recurring opportunities for drift and support tickets.

You could see the difference in mundane tasks. Before the platform, getting a service reachable might mean arguing with ingress annotations, checking whether the Service selector matched the Deployment labels, and discovering that development and production had evolved slightly different conventions. A pipeline could go green while the pod still landed in CrashLoopBackOff because the expected secret key was missing or the readiness probe assumed a path that no longer existed. After the abstractions were in place, teams still had to declare intent, but they did it through a narrower interface and with fewer ways to get the plumbing wrong.

That is the useful kind of abstraction. It reduces friction without diluting ownership.

9. Self-Service Through GitLab, GitOps, and ArgoCD

The self-service model worked because it used an interface developers already trusted. GitLab was already where code changed, where merge requests were reviewed, and where pipelines were expected to run. It made more sense to expose platform capabilities there than to ask application teams to become occasional Azure operators or cluster administrators.

The flow itself was straightforward, which was exactly the point. A change started in the application repository. GitLab built the image, ran the expected checks, and pushed the artifact through the approved path. The deployable state was then updated in Git rather than applied directly to the cluster. ArgoCD watched that declared state and reconciled AKS toward it.
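The "ArgoCD watches declared state" half of that flow can be sketched as an Argo CD Application. Repository URL, paths, and project names are illustrative:

```yaml
# Sketch of an Argo CD Application pointing at the Git-declared state.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-dev
  namespace: argocd
spec:
  project: platform-services
  source:
    repoURL: https://gitlab.example.com/platform/deployments.git
    targetRevision: main
    path: orders-api/overlays/dev    # CI commits the new image tag here
  destination:
    server: https://kubernetes.default.svc
    namespace: orders-dev
  syncPolicy:
    automated:
      prune: true      # resources deleted from Git are removed from the cluster
      selfHeal: true   # manual cluster drift is reverted back to declared state
```

With `selfHeal` enabled, "someone changed something directly in the cluster" stops being a durable state; Git wins.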

That changed more than the mechanics. It removed a whole category of half-manual work that tends to accumulate around weak deployment models. A green pipeline no longer meant someone still had to grab the right context, apply manifests by hand, or fix the environment after the fact. The deploy step stopped being tribal knowledge.

It also gave us a cleaner operating model. Git became the record of intent. Merge requests became the place where deployment-affecting changes were reviewed. ArgoCD reduced the drift that creeps in as soon as manual cluster changes become normal. The platform team no longer had to treat direct kubectl access as the standard path, which made the state of the environment far easier to reason about later.

The important part was not just that developers could deploy for themselves. It was that they could do it without broad AKS or Azure permissions. The workflow was the interface. That is a better kind of autonomy than handing out elevated access and hoping discipline scales.

The same model held for promotion. Moving from development to test or production was not a different ritual with a different toolset. It was the same path with tighter controls and environment-specific differences made explicit.

10. Concrete Examples of the Platform in Practice

Onboarding a New Internal API Service

One of the clearest ways to explain the difference this made is to look at a very ordinary case: a backend microservice exposing an internal API for other services in the environment. Nothing about that kind of service is unusual. That is exactly why it is a useful example. If the platform cannot make the common case easy, it does not matter how sophisticated the underlying tooling is.

Before the platform patterns were in place, onboarding a service like this involved more infrastructure decision-making than most application teams wanted to own. The team would build the container, then start asking the familiar questions. Which manifest structure should be used? Does this need an Ingress or only a Service? How should it be exposed internally? Where do the secrets go? Which variables belong in the pipeline, and which belong in the cluster? Which environment-specific settings need to be duplicated by hand? A pipeline might build successfully, but that still left plenty of ways for the deployment to fail later. The Service selector might not match the Deployment labels. The ingress path might be correct for one environment and wrong for another. The pod might come up only to fail its readiness probe because the expected secret was missing or mounted under a different key.

That usually led to the same kind of support loop. Someone from the platform team would diff manifests, inspect the namespace, compare the service with an older repository that was "close enough," and work backwards from the symptom. Even when the issue was fixed, the model had not improved. The next team would run into a slightly different version of the same problem.

After the platform model settled, the same service followed a much narrower path. The repository started from the standard GitLab template. The team supplied the application-specific inputs, declared that the service was internal rather than externally published, referenced the required secrets through the approved Key Vault-backed pattern, and let the pipeline handle the rest. GitLab built the image, the deployment state was updated through Git, and ArgoCD reconciled the change into AKS. The service became reachable through the approved internal route without the team needing to re-design ingress, DNS behavior, or secret delivery from scratch.

Promotion worked the same way. Moving the service forward was not a separate deployment ritual. It was a controlled change through the same model, with environment-specific configuration where needed and stricter review where it mattered. The point was not that no one ever needed help. The point was that the common path stopped depending on expert intervention.

Turning Service Exposure Into a Platform Decision

Another recurring problem was service exposure. In an Azure and AKS environment with private networking, ingress, private DNS, and different internal and external paths, the question "Should this service be reachable, and by whom?" had a lot more behind it than most teams expected. A service was not simply public or private. It could be cluster-internal only, private inside the wider environment, or deliberately published through an approved external route. Each option implied different ingress behavior, DNS records, certificates, and access boundaries.

Left to individual teams, this became one of the most reliable ways to create inconsistency. Some services were exposed too broadly because the quickest route in development got copied forward. Others were harder to consume than they needed to be because the team did not have a stable model for internal reachability. The symptom looked simple: the workload was running, but the caller could not reach it, or it was reachable from places it should never have been reachable from.

The fix was to stop treating exposure as a low-level implementation detail each repository had to solve independently. The platform reduced it to a small set of supported intent-based choices. A team could say a service was cluster-internal, private internal, or externally published through the approved route. From there, the templates and GitOps structure mapped that decision to the right ingress and DNS behavior. The team still owned whether the service should be exposed. They no longer had to own every underlying networking decision as well.
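One way to picture that mapping: the team states intent in one line, and the templates render the approved networking pattern from it. The schema, class name, and hostname convention below are illustrative:

```yaml
# The team declares intent (hypothetical schema):
service:
  name: orders-api
  exposure: private-internal     # cluster-internal | private-internal | external
---
# ...and for "private-internal" the templates might render an Ingress bound
# to the private ingress class, with DNS and certificates handled by the
# platform. Class name and host are examples, not the real convention.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders-api
spec:
  ingressClassName: private-internal
  rules:
    - host: orders-api.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders-api
                port:
                  number: 80
```

The team never writes this Ingress by hand, which is exactly how the "quickest route in development" stops getting copied into production.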

That sounds like a small abstraction, but it removed a disproportionate amount of support load. It also closed off a class of configuration drift that is hard to detect until a service is already in use. This is the kind of problem platform engineering should solve once rather than asking every team to learn it independently.

Fixing Secrets Sprawl Without Blocking Delivery

Another problem that surfaced quickly was secrets sprawl. In the absence of a strong platform path, teams will use whatever gets them moving. Some values ended up in GitLab variables because that was quick. Some were created as Kubernetes secrets by hand. Some were copied between environments with too much manual handling. That does not usually begin as a dramatic security failure. It begins as convenience. The trouble starts later, when a value needs to be rotated, audited, or made consistent across environments and nobody is fully sure which copy is authoritative.

The core issue was not just where a value lived. It was that each team was being forced to invent its own model for sensitive configuration. That is exactly the kind of design failure a platform should prevent. The fix was to standardize around Azure Key Vault as the system of record and make secret consumption part of the supported path rather than a per-service improvisation.

That meant a service declared which secrets it needed through the agreed configuration structure, and the platform handled the delivery into the workload. Where managed identity or a cleaner Azure-native access path made sense, that was better because it removed secret distribution entirely. Where concrete values were still required, they came through the Key Vault-backed pattern rather than through manual cluster changes or scattered CI variables.
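The text does not name the exact delivery mechanism, so as one possible implementation of that Key Vault-backed pattern, here is how the Azure Secrets Store CSI driver expresses it. Vault name, identity, and object names are placeholders:

```yaml
# One common Azure-native way to deliver Key Vault values to a workload,
# shown as an assumption rather than the platform's confirmed mechanism.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: orders-api-secrets
  namespace: orders-dev
spec:
  provider: azure
  parameters:
    useVMManagedIdentity: "true"           # no secret needed to fetch secrets
    userAssignedIdentityID: "<client-id>"  # placeholder, left unfilled on purpose
    keyvaultName: "kv-platform-dev"
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: db-connection-string
          objectType: secret
```

Whatever the concrete mechanism, the property that matters is the same: rotation happens in Key Vault, and the application-facing contract does not change.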

This paid off most obviously when a secret changed under real operating conditions. Rotation should not require every team to understand the internals of Kubernetes secret objects or to log into the cluster. It should be a controlled platform operation that leaves the application-facing contract alone.

Promoting the Same Artifact Across Environments

One of the quieter but more important platform problems was environment drift caused by promotion models that were not actually promotion models. If a service was effectively rebuilt, reconfigured by hand, or subtly reinterpreted at each environment boundary, then development, test, and production were not really running the same thing. At that point, debugging becomes more of a comparison exercise than an engineering one because you can never be fully sure whether a difference in behavior is caused by the application or by the path it took to get deployed.

The fix was to move to a build-once, promote-forward model. GitLab built the artifact, tagged it immutably, and the change moved through environments by updating Git-declared desired state rather than by rebuilding each time. ArgoCD then reconciled that state into AKS, which meant the platform could reason about deployments as versioned state instead of as a blend of pipeline history and cluster-side improvisation.
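A build-once, promote-forward pipeline can be sketched in two jobs. Job names, repository paths, and the yq-based update are illustrative, not the actual platform template:

```yaml
# Sketch only: one immutable image per commit, promotion as a Git change.
stages: [build, promote]

build-image:
  stage: build
  script:
    # one immutable tag per commit; this exact artifact moves forward
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

promote-to-test:
  stage: promote
  when: manual          # promotion is a deliberate, reviewable act
  script:
    # promotion = updating declared state in Git, never a rebuild;
    # ArgoCD reconciles the cluster from the resulting commit
    - git clone "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.example.com/platform/deployments.git"
    - cd deployments
    - yq -i ".images[0].newTag = \"$CI_COMMIT_SHORT_SHA\"" orders-api/overlays/test/kustomization.yaml
    - git commit -am "Promote orders-api $CI_COMMIT_SHORT_SHA to test"
    - git push
```

Rollback falls out of the same model: revert the commit that changed the tag, and the reconciler does the rest.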

That made promotion easier to audit because the change was visible in Git. It made rollback less theatrical because reverting desired state is much cleaner than trying to reconstruct what somebody applied manually three days earlier. It also made environment differences easier to reason about, because the intended differences were explicit configuration or policy boundaries, not a separate deployment craft at every stage.

This is one of those design choices that looks procedural until you have lived without it. Once teams are rebuilding artifacts differently or treating each environment as its own hand-tuned process, the platform loses one of the things it most needs: predictability.

11. Access Control Was an Enabler, Not a Restriction

Limiting direct access to Azure and AKS was an intentional design choice, and it is one of the areas where platform engineering often gets misunderstood. Restricting broad portal or cluster access was not about gatekeeping. It was about designing an operating model that could scale, remain auditable, and avoid turning every engineer into an infrastructure administrator.

If everyone can make direct changes in the portal, apply manifests manually, or alter cluster state outside the standard workflow, you do not really have a platform. You have shared infrastructure with weak boundaries. That can feel fast in the moment, especially for experienced engineers, but the hidden cost shows up later as configuration drift, unclear ownership, inconsistent practices, and deployments that behave differently from what the repositories say should exist.

RBAC was used to align access with responsibilities. Application teams had the permissions they needed to use the platform, not to reconfigure its control plane. The platform team retained ownership over the foundational Azure resources, AKS configuration, and the parts of the stack where a mistake would have cross-team impact. Automation identities were also scoped carefully. GitLab runners, deployment jobs, and GitOps-related automation used the permissions required for their purpose and no more.
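On the Kubernetes side, "permissions to use the platform, not reconfigure it" often reduces to namespace-scoped bindings. The group identifier and names below are examples, assuming the AKS Entra ID integration:

```yaml
# Illustrative namespace-scoped access: the team can inspect and debug its
# own workloads without any cluster-wide rights.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: orders-team-view
  namespace: orders-dev
subjects:
  - kind: Group
    name: "aad-group-object-id-orders-team"   # Entra ID group object ID (placeholder)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view               # built-in read-only role, bound only in this namespace
  apiGroup: rbac.authorization.k8s.io
```

Deployment itself needs no human binding at all, because the write path belongs to the GitOps automation identity.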

That distinction mattered in practice. Nobody needed broad Owner rights on a subscription or wide-open access to AKS just to ship an application change. Routine delivery moved through the same governed path every time, which is exactly what made it scalable.

This model made the overall system safer, but it also made it easier to work with. When the expected path for change is a Git-based workflow backed by ArgoCD, everyone knows where to look when something changes, who reviewed it, and how it can be rolled back. When the primary path is "someone changed something directly," every incident starts with detective work.

There were still situations where deeper access was needed for investigation or exceptional cases, but that was treated as the exception rather than the platform contract. A self-service model should minimize dependence on privileged access, not normalize it.

12. Secrets, Networking, and the Infrastructure Teams Should Not Have to Re-Explain

Some of the most valuable platform work lived in the areas nobody finds glamorous and everybody rediscovers the hard way if they are not standardized.

Secrets management was one of those areas. Azure Key Vault became the authoritative place for sensitive values, and the platform defined the standard path for making those values available to workloads. That avoided a common anti-pattern where every team evolves its own mix of pipeline variables, manually created Kubernetes secrets, copied configuration, and half-documented workarounds. Even when the application requirement was simple, the delivery path needed to be safe and predictable.

Networking was another area where raw infrastructure complexity easily leaks into developer workflows. Private networking, DNS behavior, ingress rules, and internal versus external exposure all matter a great deal in Azure and AKS, but they are poor candidates for every team to solve independently. In a private-first setup, the number of moving parts grows quickly. It is not enough for a container to be running. It has to be reachable by the right systems, through the right path, with the right name resolution and the right exposure boundary.

Without platform patterns, these concerns turn into repeated support requests and repeated mistakes. One service is exposed too broadly. Another is reachable only inside the cluster when it needs to be available internally across the environment. A DNS assumption works in development but not in production. An ingress change resolves one issue while introducing another. None of that is especially interesting work for application teams, and none of it should need to be solved from scratch for each repository.

The fix was to treat these as shared platform concerns rather than as application-by-application craft work. A service team should be able to say whether a workload is internal, externally published, or only cluster-internal, and let the platform map that intent onto the right ingress, DNS, and networking behavior. The same logic applied to identity and secret consumption. Where direct secret usage was necessary, it followed a consistent Key Vault-backed pattern. Where a service could use managed identity or another Azure-native access model, that path was preferred because it removed a whole class of secret handling entirely.
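One way to express that intent contract is a small per-service values file consumed by a shared deployment chart. This is a sketch of the idea, not the actual interface; the field names and mapping are assumptions.

```yaml
# Hypothetical per-service values file for a shared platform chart.
# The single "exposure" field is the developer-facing contract; the platform
# maps it onto ingress class, DNS records, and load balancer placement.
service:
  name: orders-api          # hypothetical service name
  exposure: internal        # internal | external | cluster-only

# The chart might then render, for example:
#   exposure: external     -> public ingress class + public DNS record
#   exposure: internal     -> internal ingress on the private network
#   exposure: cluster-only -> plain ClusterIP Service, no ingress at all
```

The developer states what should be reachable and by whom; the ingress annotations, DNS zones, and private-link details stay inside the platform.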

These were not the parts of the platform anyone liked presenting on slides, but they were the parts that consumed support capacity week after week if they were not standardized. Good platform work solves that class of problem once.


13. Observability Had to Be Part of the Platform

A platform is not finished once it can deploy workloads. It also has to make those workloads legible after they start. That is why observability was part of the platform from the start rather than a separate improvement project for later.

Prometheus and Grafana were already in the stack, but the important step was making them part of the normal operating path for anything running on AKS. If a team deployed a new service, there needed to be a predictable place to look for health, resource pressure, and runtime signals without building a bespoke observability setup around every repository.

That sounds obvious, but it changes the quality of operational conversations. Without shared observability, "self-service" often means a team can deploy independently and then immediately ask the platform team what they are looking at. With shared dashboards and known signals, the first conversation starts from data instead of from instrumentation archaeology.

Observability also benefited from the same standardization as the rest of the platform. When services follow common deployment patterns, label conventions, namespace layout, and scrape behavior stop being incidental details and start becoming useful shared structure. That is what lets a platform team support many services without turning each one into a unique monitoring problem.
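As an example of how those conventions pay off, a prometheus-operator `ServiceMonitor` can target every conforming service through the standard labels instead of per-service scrape config. The names below are illustrative, not the actual configuration.

```yaml
# Illustrative ServiceMonitor relying on the platform's label conventions.
# Service and label names are hypothetical examples.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orders-api
  namespace: orders
  labels:
    release: platform-prometheus    # hypothetical label Prometheus selects on
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: orders-api  # shared label convention
  endpoints:
    - port: metrics       # convention: every service exposes a port named "metrics"
      path: /metrics
      interval: 30s
```

Because the port name, path, and labels are conventions rather than per-team choices, a new service is scraped and visible in Grafana without anyone writing monitoring configuration for it.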

Application teams still needed to own service-specific telemetry where it mattered, but the baseline had to be there by default. A deployed workload should not become opaque the moment it leaves CI.

14. Standardization Across Environments

One of the quieter but more important results of the platform was consistency across development, test, and production. This is where OpenTofu, GitOps, and reusable deployment patterns reinforced each other.

The Azure and AKS foundation was provisioned through OpenTofu modules so the environment shape did not drift without anyone noticing. Networking, cluster integration, secrets handling, and the shared platform dependencies followed the same general structure across environments even when production had tighter controls and different sizing. That matters because inconsistent environments create false confidence. Something appears to work in development, but only because development has drifted into a completely different system.

The application delivery model followed the same logic. The GitLab pipeline shape was consistent. The GitOps structure was consistent. ArgoCD reconciled the same style of desired state in every environment. Teams did not have to learn one model for development and another for production. The things that differed were the things that should differ: configuration, policy, approval, and scale.
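In ArgoCD terms, that consistency can look like one `Application` per service per environment, identical in shape and differing only in the overlay path it points at. The repository URL and paths below are hypothetical stand-ins for the real structure.

```yaml
# Illustrative ArgoCD Application for one environment; the dev/test/prod
# variants would differ only in name and overlay path. Names are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/platform/gitops.git  # hypothetical repo
    targetRevision: main
    path: apps/orders-api/overlays/dev   # only this path changes per environment
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert out-of-band cluster changes
```

Promoting a change then means moving the same desired state through overlays, not re-learning a different deployment model per environment.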

This is where standardization earns its keep. It reduces cognitive load, but more importantly it reduces the number of places a problem can hide. When the platform shape is predictable, environment-specific issues are easier to reason about because the platform itself is not introducing accidental variation.

Consistency also made governance easier to apply without turning production into a foreign country. Production could be more tightly controlled than development while still following the same basic operating model.

15. The Trade-Offs Were Real

None of this came without trade-offs, and it is important to be honest about them because platform work becomes fragile when it is described as if there were no downsides.

The most obvious trade-off was flexibility versus standardization. A strongly opinionated platform makes the common path easier, but it also means some teams cannot do everything exactly the way they would choose if left alone. That is not automatically a problem. In most cases, the variation being removed is not producing business value. But the tension is real, especially with experienced engineers who are used to tailoring pipelines and runtime configuration closely.

There was also a trade-off between direct access and controlled workflows. Direct kubectl access or broad Azure permissions can feel faster for the person holding them. The problem is that this speed does not scale as an operating model. It shifts complexity into hidden state and makes the platform harder to govern and support. The GitLab-plus-ArgoCD approach was more disciplined and more repeatable, but it required accepting that convenience for a few power users could not be the main design target.

Another trade-off sat between abstraction and freedom. If the platform abstracts too little, developers remain buried in infrastructure concerns. If it abstracts too much, teams can feel disconnected from how their software really behaves in production. The right balance was to abstract the repetitive infrastructure mechanics while keeping application teams close to the runtime characteristics they still needed to own.

There was also an ongoing question about how far to take standardization. Not everything should be templated. A platform becomes brittle when it tries to turn every edge case into a first-class built-in feature. Part of the job was deciding what belonged in the golden path, what should be possible through extension points, and what should remain a deliberate exception handled with platform involvement. That boundary matters because a platform that tries to support every possible use case eventually becomes another form of complexity.

16. Making the Platform Useful Without Making It Rigid

The hardest part of this work was not choosing tools. It was deciding where the standard path should end and where application-specific freedom should begin.

There was some initial resistance, which was not surprising. Teams that have struggled with slow infrastructure processes often interpret standardization as another form of control being added around them. If the platform team is not careful, that is exactly what it becomes. The way through that is not messaging. It is making the golden path genuinely easier than the ad hoc alternatives.

That required iteration. Early templates are rarely correct in all the important ways. Some are too narrow and force unnatural workarounds. Others try to be so flexible that they become hard to understand and hard to maintain. A usable platform usually emerges through repeated refinement: watching where teams still get stuck, where the abstractions are leaking, which defaults are working, and which ones are generating support load instead of reducing it.

It also required deciding what should not be standardized. Some services are long-running APIs. Others are workers, scheduled jobs, or integration components with very different runtime expectations. Some need external exposure. Others must remain internal. Some can use a straightforward secret model. Others need more careful identity handling. If a platform treats all of those as identical, its abstractions stop matching how the services actually run. If it treats each one as entirely unique, it loses the benefits of being a platform. The useful middle ground is a constrained set of patterns with well-understood variation points.
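That constrained-patterns-with-variation-points idea can be sketched as a GitLab CI setup where the application repository pins a platform template and only sets the agreed knobs. The project, file, and variable names here are assumptions for illustration.

```yaml
# Hypothetical application .gitlab-ci.yml: the repo includes a versioned
# platform template and configures only the supported variation points.
include:
  - project: platform/ci-templates      # hypothetical shared template project
    ref: v2                             # pinned template version, upgraded deliberately
    file: service-pipeline.yml

variables:
  SERVICE_TYPE: "api"      # variation point: api | worker | scheduled-job
  EXPOSURE: "internal"     # variation point consumed by the deploy stage
```

Everything not expressed through those variables stays inside the template, which is what keeps the golden path both easy to follow and possible to evolve centrally.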

Another practical challenge was that support load often rises before it falls. During the transition, the platform team is still supporting the old way of working while teaching and refining the new one. That is normal. It is one reason platform engineering is as much about product thinking and operating model design as it is about YAML, pipelines, and cloud services.

17. Before and After the Platform

The difference before and after the platform was not mainly about tool choice. It was about the operating model around those tools.

Before the platform, deployments were technically possible but operationally inconsistent. Teams could get services into AKS, but they often did so through slightly different pipelines, slightly different manifests, and slightly different assumptions about networking, secrets, and environment behavior. That made the platform team a bottleneck because every inconsistency eventually surfaced as a support request, a failed rollout, or a production question nobody wanted to answer for the first time under pressure.

After the platform, the default path became much more predictable. A service followed a standard template, deployments moved through GitLab and GitOps, ArgoCD reconciled the desired state, and observability was already part of the runtime model. Developers still owned their applications, but they no longer had to become part-time experts in Azure and Kubernetes mechanics just to make routine changes safely.

That is the change I care about most. The platform did not remove operational responsibility. It removed avoidable infrastructure complexity from the day-to-day path of delivering software.

18. What Changed Once the Model Settled

The outcome was not that infrastructure complexity disappeared. It was that the right parts of that complexity moved into the platform, where they could be solved once and reused, instead of being rediscovered by every team and every service.

The immediate effect was reduced dependence on the platform team for routine delivery work. Application teams could use Git-based workflows to build, deploy, and promote services through a predictable path. They did not need broad AKS access to get code running. They did not need to understand every Azure networking detail to expose a service correctly. They did not need to invent a new deployment shape for each repository.

That improved developer experience in a practical sense rather than a cosmetic one. Teams had fewer infrastructure decisions to make for ordinary service delivery. Deployments became more repeatable. Configuration drift was reduced. Environment behavior became more predictable. When issues did happen, teams were not starting from a blank page; the observability, deployment path, and runtime conventions were already there.

The platform team benefited as well, but in a more important way than simply getting fewer messages. The nature of the work shifted. Less time went into acting as a release team, a YAML debugging service, or the final escalation point for every ingress or secret issue. More time could be spent improving shared capabilities, refining templates, hardening workflows, and thinking ahead about where the platform needed to evolve as more services were added.

That is the scaling effect that matters. A platform should improve not only the speed of one deployment, but the sustainability of the operating model as the number of teams, services, and environments grows.

19. Why I See This as Platform Engineering

This experience changed how I think about the line between DevOps work and platform engineering. Infrastructure automation was part of the job, but it was not the part that mattered most. The more significant work was deciding how other engineers should experience that infrastructure and which trade-offs should be encoded into the default path.

Provisioning Azure with OpenTofu, running AKS, wiring GitLab CI/CD, installing ArgoCD, and operating Prometheus and Grafana are all useful capabilities. They become platform engineering when they are assembled into a system other engineers can rely on without needing to understand every internal detail. That means choosing defaults, defining boundaries, deciding where flexibility is worth the cost, and being deliberate about which problems the platform absorbs so application teams do not have to.

The important result was not that the environment used a modern stack. It was that developers had less irrelevant infrastructure to think about while governance, security, and consistency improved instead of being negotiated away. At that point, the job stops feeling like "running Kubernetes" and starts feeling much closer to product design for engineers.

This experience also changed how I think about DevOps itself. The hard part is rarely building infrastructure. The hard part is building systems other engineers can depend on without first having to reverse-engineer them.

If I were taking this further, I would invest even more in service onboarding, platform documentation, and eventually a stronger internal developer portal on top of the existing workflows. But the lesson I would keep is straightforward. A platform is successful when developers can use it well without needing to understand how it is implemented. The measure of success is how much irrelevant infrastructure complexity stays out of their way.