Building a Kubernetes Platform on AKS: Private Clusters, GitOps, and Workload Separation

In this article, I explain how I designed and implemented a private AKS-based platform with clear separation between platform and workload clusters.

The focus is on real-world decisions around networking, GitOps, security, and operating models rather than theoretical architecture.

1. Introduction

After setting up the Azure landing zone and defining the platform structure, the next step was to enable teams to run workloads on Kubernetes in a controlled and scalable way.

The goal was not just to create AKS clusters, but to design a platform model where:

  • infrastructure and platform tooling are separated from workloads

  • deployments are consistent and controlled

  • teams can use Kubernetes without needing deep expertise

This led to designing a multi-cluster architecture, where different clusters had clearly defined responsibilities.

2. What the Platform Needed to Solve

The platform had to support:

  • secure Kubernetes clusters without public exposure

  • separation between platform tooling and application workloads

  • consistent deployment patterns

  • integration with existing GitLab CI workflows

  • centralized secrets management

  • onboarding teams with minimal Kubernetes knowledge

Instead of a single cluster or loosely structured setup, I needed a model that would scale cleanly across environments and teams.

3. AKS Architecture: Platform vs Workload Clusters

The core design decision was to separate platform clusters from workload clusters.

Instead of running everything inside a single cluster or duplicating tooling everywhere, I designed a multi-cluster architecture with clear responsibilities.

Platform clusters

Under the Platform management group, I created three dedicated subscriptions:

  • platform_test

  • platform_nonprod

  • platform_prod

Each subscription had its own AKS cluster:

  • platform_test AKS

  • platform_nonprod AKS

  • platform_prod AKS

These clusters acted as the platform control layer, not workload environments.

What runs in platform clusters

These clusters hosted all shared platform components, including:

  • ArgoCD (GitOps control plane)

  • GitLab runners (for CI/CD execution inside the cluster network)

  • Kyverno for Kubernetes policy enforcement

  • admission control policies (OPA/Kyverno-based patterns)

  • cluster-level monitoring components

  • supporting platform services

The goal was to avoid:

  • duplicating tooling in every cluster

  • mixing platform concerns with application workloads

Workload clusters

Separate AKS clusters were deployed in workload subscriptions:

  • dev

  • test

  • staging

  • prod

These clusters were intentionally kept minimal.

They only contained:

  • application workloads

  • required runtime dependencies

  • monitoring agents

They did not include:

  • CI/CD tools

  • GitOps controllers

  • platform-level policy engines (centrally managed instead)

Why this separation

This design provided:

  • isolation between platform and applications

  • ability to upgrade platform tooling independently

  • reduced blast radius

  • clearer ownership boundaries

It also made it easier to enforce consistency across clusters, since platform logic was centralized.

4. Platform Cluster Lifecycle and Promotion Strategy

Each platform cluster had a specific role in the lifecycle of platform changes.

platform_test

This cluster was used for:

  • testing new platform components

  • trying new versions of tools (ArgoCD, Kyverno, etc.)

  • validating breaking changes

After validation:

  • workloads were scaled down to zero

  • the cluster remained available for future testing

This ensured that experiments did not impact stable environments.

platform_nonprod

This cluster hosted stable platform tooling for non-production environments.

It included:

  • ArgoCD (non-prod control plane)

  • GitLab runners

  • Kyverno policies for non-prod clusters

  • supporting services

Important detail:

ArgoCD in this cluster was responsible for managing:

  • dev clusters

  • test clusters

  • staging clusters

This created a clear separation between:

  • experimentation (platform_test)

  • stable non-prod operations

platform_prod

This cluster hosted production-grade platform tooling.

It included:

  • ArgoCD (production control plane)

  • GitLab runners

  • Kyverno / policy enforcement

  • platform-level observability components

ArgoCD here was responsible for:

  • managing production workload clusters

This ensured that:

  • production deployments were isolated

  • no non-prod logic or experiments could affect production

Promotion model

Changes followed a flow:

  1. Tested in platform_test

  2. Promoted to platform_nonprod

  3. Validated against non-prod workload clusters

  4. Promoted to platform_prod

This created a controlled promotion pipeline for platform changes, not just applications.
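
To make this concrete, the sketch below shows one way such a promotion could be expressed: each platform ArgoCD Application pins a tool to an explicit chart version, and promotion is just bumping that pin per environment. The repository URL and version number are illustrative, not the exact setup used here.

```yaml
# Illustrative only: a platform Application pinned to a validated chart version.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kyverno
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://kyverno.github.io/kyverno   # upstream Helm repository
    chart: kyverno
    targetRevision: 3.2.6     # example pin; bumped test -> nonprod -> prod
  destination:
    server: https://kubernetes.default.svc       # the platform cluster itself
    namespace: kyverno
```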

5. Private AKS and Access Model

All clusters were deployed as private AKS clusters.

This meant:

  • no public API server

  • no direct internet exposure

Access design

To enable secure access, I implemented:

  • VPN Gateway in the platform network

  • Azure VPN client for engineers

  • access routed through private networking

This allowed:

  • secure kubectl access

  • no exposure of cluster endpoints

DNS resolution across clusters

Private clusters introduced a challenge:

AKS API endpoints use private FQDNs, which must be resolvable across VNets and subscriptions.

To solve this, I implemented:

  • VNet peering across platform and workload networks

  • centralized Private DNS zones

  • Azure Private DNS Resolver

This ensured:

  • consistent name resolution

  • access across multiple clusters

Alternative access patterns

In some cases:

  • a jumpbox VM was used for debugging

However, the primary model remained:

  • VPN-based access with private DNS

6. GitOps Control Plane with ArgoCD

GitOps was implemented as the primary deployment model.

Control plane separation

  • ArgoCD in platform_nonprod → manages non-prod clusters

  • ArgoCD in platform_prod → manages production clusters

This ensured:

  • strict separation between environments

  • no accidental cross-environment deployments
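
The article does not detail how each ArgoCD instance knew about its workload clusters; one common, declarative option is a cluster Secret per target cluster. A minimal sketch with placeholder names, a placeholder private API FQDN, and placeholder credentials:

```yaml
# Registers a workload cluster with the ArgoCD instance in platform_nonprod.
apiVersion: v1
kind: Secret
metadata:
  name: aks-dev
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this is a cluster entry
type: Opaque
stringData:
  name: aks-dev
  server: https://dev-aks-xxxxxx.privatelink.westeurope.azmk8s.io:443  # placeholder
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
```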

Application management

Applications were defined using:

  • ArgoCD Applications

  • ApplicationSets

ApplicationSets allowed:

  • dynamic generation of apps

  • multi-environment deployments

  • standardized patterns
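
As an example of the pattern, the following ApplicationSet sketch generates one Application per non-prod environment from a list generator. The repository URL, chart path, and cluster addresses are placeholders, not the actual definitions used on this platform.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: sample-service
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev
            server: https://dev-aks.example.internal:443
          - env: test
            server: https://test-aks.example.internal:443
          - env: staging
            server: https://staging-aks.example.internal:443
  template:
    metadata:
      name: 'sample-service-{{env}}'
    spec:
      project: workloads
      source:
        repoURL: https://gitlab.example.com/apps/sample-service.git
        targetRevision: main
        path: chart
        helm:
          valueFiles:
            - values-{{env}}.yaml       # environment-specific Helm values
      destination:
        server: '{{server}}'
        namespace: sample-service
```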

Drift and reconciliation

ArgoCD continuously ensured:

  • desired state in Git = actual state in the cluster

  • drift detection

  • automatic reconciliation

This removed the need for:

  • manual kubectl deployments

  • ad-hoc changes
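
Concretely, this behaviour comes from the sync policy on each Application (or on the ApplicationSet template shown above). A typical configuration, shown as a fragment of an Application spec:

```yaml
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
    syncOptions:
      - CreateNamespace=true
```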

7. Application Deployment Flow

The deployment model integrated GitLab CI with GitOps.

Flow

  1. Developer pushes code

  2. GitLab CI builds container image

  3. Image pushed to:

    • GitLab Container Registry

    • Azure Container Registry (ACR)

  4. Deployment triggered (pipeline or Git change)

  5. ArgoCD syncs state into cluster
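
A simplified .gitlab-ci.yml illustrating steps 2 to 4 of this flow. The job names, the ACR_* variables, and the tag-bump script are assumptions made for the sketch; only the CI_* variables are standard GitLab predefined variables, and this is not the actual pipeline used on the platform.

```yaml
stages:
  - build
  - deploy

build-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # build and push to the GitLab Container Registry
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    # mirror the image to ACR (ACR_* are assumed CI/CD variables)
    - docker login -u "$ACR_USERNAME" -p "$ACR_PASSWORD" "$ACR_LOGIN_SERVER"
    - docker tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" "$ACR_LOGIN_SERVER/sample-service:$CI_COMMIT_SHORT_SHA"
    - docker push "$ACR_LOGIN_SERVER/sample-service:$CI_COMMIT_SHORT_SHA"

update-deployment:
  stage: deploy
  image: alpine/git:latest
  script:
    # bump the image tag in the Git state that ArgoCD watches;
    # the exact mechanism (script, commit, or manual trigger) varied per team
    - ./ci/bump-image-tag.sh "$CI_COMMIT_SHORT_SHA"
```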

Helm-based deployments

Applications were packaged as Helm charts.

This allowed:

  • environment-specific values

  • reusable templates

  • consistent deployments
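
For illustration, environment-specific values files often differ only in scale and sizing while the chart stays identical. The names and numbers below are placeholders:

```yaml
# values-dev.yaml (placeholder values)
replicaCount: 1
image:
  repository: myregistry.azurecr.io/sample-service
  tag: "abc1234"
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi

# values-prod.yaml (placeholder values)
# replicaCount: 3, larger requests/limits, same chart and templates
```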

Reality of the setup

This was not fully pure GitOps.

Instead, it was:

  • GitOps for cluster state

  • CI-driven triggers for deployments

This approach worked well in a hybrid environment and allowed gradual adoption.

8. Policy and Governance inside Kubernetes

Kubernetes governance was enforced using Kyverno and policy-based controls.

Why policy enforcement was needed

Without policies:

  • teams could deploy inconsistent resources

  • security risks would increase

  • cluster behavior would become unpredictable

Tools used

  • Kyverno for policy enforcement

  • admission control patterns

  • validation and mutation rules

Conceptually aligned with:

  • OPA/Gatekeeper-style governance

Example controls

Policies enforced:

  • required labels and annotations

  • resource limits and requests

  • restrictions on privileged containers

  • namespace-level controls
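
The sketch below shows what two of these controls could look like as a Kyverno ClusterPolicy. The label name and resource kinds are illustrative; the actual policies used on the platform are not reproduced here.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels-and-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "The label 'team' is required."
        pattern:
          metadata:
            labels:
              team: "?*"        # any non-empty value
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required on all containers."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
```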

Benefits

  • consistent deployments across clusters

  • reduced risk

  • centralized governance

9. Secrets Management

Secrets were handled using Azure-native integration.

Structure

  • separate Key Vaults per:

    • team

    • environment

Integration

  • External Secrets Operator used in clusters

  • pulls secrets from Key Vault into Kubernetes
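
A minimal sketch of this integration, assuming workload identity for authentication (the article does not specify the auth mode); vault URL, namespaces, and secret names are placeholders:

```yaml
# SecretStore pointing at a team/environment Key Vault
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: team-a-keyvault
  namespace: team-a
spec:
  provider:
    azurekv:
      vaultUrl: https://kv-team-a-dev.vault.azure.net   # placeholder vault
      authType: WorkloadIdentity
      serviceAccountRef:
        name: team-a-eso
---
# ExternalSecret syncing a Key Vault secret into a Kubernetes Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: team-a
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: team-a-keyvault
    kind: SecretStore
  target:
    name: app-db-credentials      # resulting Kubernetes Secret
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: db-password          # secret name in Key Vault
```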

Access control

  • managed through Entra ID groups

  • scoped per team

Outcome

  • no secrets stored in Git

  • centralized control

  • clear ownership

10. Networking and Ingress

Networking followed a private-first, hub-spoke model.

Cluster placement

  • clusters deployed in spoke VNets

  • connected to central hub

Traffic control

  • controlled ingress paths

  • internal service communication via private networking

Design goal

  • minimize public exposure

  • keep communication predictable
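
One concrete piece of this model is keeping ingress on the internal load balancer. A hedged sketch of an ingress controller Service using the AKS internal load balancer annotation (the article does not name the ingress controller used):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"  # private IP only
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: https
      port: 443
      targetPort: 443
```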

11. Developer Workflow

Developers interacted with the platform through:

  • Git repositories

  • CI pipelines

  • Helm values

What developers do

  • write code

  • push changes

  • update configs

What platform handles

  • infrastructure

  • networking

  • policies

  • deployment

Key principle

Enablement over access.

Teams were not required to understand:

  • Kubernetes internals

  • Azure networking

  • security policies

12. Challenges and Trade-offs

Building this platform was not just a technical exercise. Most of the complexity came from working within real constraints rather than designing in isolation.

One of the biggest challenges was operating in a hybrid environment. Some applications were still running on-premises and had to continue functioning while we were introducing Kubernetes on AKS. This meant I could not design everything as a clean, cloud-native system from the start. For example, the decision to push images to both GitLab Container Registry and Azure Container Registry was not ideal from a purity standpoint, but it was necessary to support existing workflows. The goal was to move forward without breaking what already worked.

Networking was another major challenge, especially with private AKS clusters. While private clusters significantly improve security, they introduce complexity around access and DNS resolution. I had to ensure that engineers could access clusters securely through VPN, while also making sure that private FQDNs resolved correctly across multiple VNets and subscriptions. This required careful planning of VNet peering, Private DNS zones, and the introduction of Azure Private DNS Resolver. These are not things that are easy to change later, so getting them right early was critical.

There was also a challenge around scaling connectivity. As more environments, regions, and external integrations were introduced, the network design needed to handle increasing complexity. Decisions around VPN Gateway sizing, NAT behavior, and routing were not static. They had to evolve as requirements grew, which meant the initial design needed to be flexible enough to adapt.

Another important challenge was developer adoption. Many teams were not familiar with Kubernetes or cloud-native practices. If I had simply provided clusters and access, the result would likely have been inconsistent deployments and operational issues. Instead, I had to design the platform in a way that guided teams toward the right patterns. This sometimes meant not implementing exactly what teams initially asked for. In many cases, requests were based on existing habits rather than what would work well in the new platform. It required balancing listening to requirements with making decisions that would scale long term.

There was also a constant trade-off between control and flexibility. Centralizing platform components like ArgoCD, policy enforcement, and secrets management improved consistency and security, but it reduced the level of direct control that application teams had. This was intentional, but it required careful design to ensure that teams still felt enabled rather than restricted.

Another trade-off was between pure GitOps and practical workflows. In an ideal setup, everything would be fully driven from Git with automated promotion between environments. In reality, we integrated GitOps with existing GitLab CI pipelines, including manual triggers where needed. While this was not a textbook GitOps implementation, it worked well in practice and allowed teams to adopt the model gradually instead of forcing a complete shift.

Finally, there was the challenge of changing established ways of working. Some processes had been followed for years, and moving to Infrastructure as Code, GitOps, and platform-driven workflows required a mindset shift. This was not something that could be solved purely with tooling. It required gradual introduction, clear patterns, and consistent reinforcement.

Overall, the main challenge was not designing the platform itself, but integrating it into an existing ecosystem with real constraints, existing systems, and varying levels of maturity.

13. Lessons Learned

Looking back, several important lessons came out of building and operating this platform.

One of the most important lessons was that separating platform and workload concerns early makes everything easier later. By keeping platform tooling (ArgoCD, runners, policies) in dedicated clusters and keeping workload clusters minimal, it became much easier to manage upgrades, enforce consistency, and reduce risk. Without this separation, platform components tend to get tightly coupled with workloads, making changes harder over time.

Another key lesson was that private clusters are worth the complexity, but only if networking is designed properly from the beginning. The security benefits are clear, but they come with a cost in terms of DNS, access, and connectivity. Investing time early in designing VNet structure, DNS resolution, and access patterns avoids much bigger problems later.

I also learned that GitOps adoption should be incremental, not forced. While the idea of full GitOps is appealing, teams need time to adapt. Integrating GitOps with existing CI/CD pipelines allowed us to introduce the model gradually, without disrupting existing workflows. Over time, this can evolve toward a more complete GitOps approach, but starting with a practical implementation was the right decision.

Another important lesson was around policy enforcement. Without centralized policies, Kubernetes environments quickly become inconsistent. Introducing tools like Kyverno allowed us to enforce standards such as resource limits, labeling, and security controls. This ensured that even as more teams onboarded, the platform remained predictable.

One of the strongest takeaways was that enablement is more effective than access. Giving teams full access to infrastructure does not necessarily lead to better outcomes, especially when they are new to the platform. Providing clear templates, workflows, and guardrails allowed teams to move faster with fewer errors. The role of the platform was not just to provide infrastructure, but to guide how it should be used.

I also learned that real-world constraints should shape design decisions. It is easy to aim for ideal architectures, but in practice, existing systems, organizational structure, and team maturity all play a role. Supporting hybrid environments, dual registries, and gradual migration was not ideal from a theoretical perspective, but it was necessary to move forward without disruption.

Finally, I realized that platform engineering is as much about people as it is about technology. The success of the platform depended not only on the technical design, but also on how well it aligned with teams, how easily it could be adopted, and how clearly it communicated the right way to work.

These lessons shaped not just the platform itself, but also how I approached designing systems in general.

14. Conclusion

The goal of this platform was never just to run Kubernetes clusters, but to create a system that teams could rely on without needing to understand all of its internal complexity.

By separating platform and workload clusters, I was able to keep responsibilities clear. Platform tooling such as ArgoCD, runners, and policy enforcement remained centralized, while workload clusters stayed focused on running applications. This made the overall system easier to operate, scale, and evolve over time.

Running everything as private AKS clusters improved security, but more importantly, it forced a more disciplined approach to networking, access, and connectivity. Decisions around VPN access, DNS resolution, and VNet design became foundational rather than afterthoughts.

Introducing GitOps provided consistency in how applications were deployed, even though the implementation was intentionally pragmatic and integrated with existing CI/CD workflows. This allowed teams to adopt new patterns gradually instead of forcing a complete shift upfront.

At the same time, the platform was designed with enablement in mind. Instead of exposing raw infrastructure, I focused on providing structured workflows, templates, and guardrails. This proved to be more effective, especially for teams that were new to Kubernetes and cloud environments.

Looking back, the most important part of this work was not any individual tool or technology, but the combination of decisions around structure, ownership, and workflows. These are the things that ultimately determine whether a platform is usable in practice.

This setup provided a foundation that could scale with the organization, support both existing systems and new workloads, and evolve over time without requiring constant redesign.