Building a Kubernetes Platform on AKS: Private Clusters, GitOps, and Workload Separation

In this article, I explain how I designed and implemented a private AKS-based platform with clear separation between platform and workload clusters.

The focus is on real-world decisions around networking, GitOps, security, and operating models rather than theoretical architecture.

1. Introduction

After setting up the Azure landing zone and defining the platform structure, the next step was to enable teams to run workloads on Kubernetes in a controlled and scalable way.

The goal was not just to create AKS clusters, but to design a platform model where:

  • infrastructure and platform tooling are separated from workloads

  • deployments are consistent and controlled

  • teams can use Kubernetes without needing deep expertise

This led to designing a multi-cluster architecture, where different clusters had clearly defined responsibilities.

2. What the Platform Needed to Solve

The platform had to support:

  • secure Kubernetes clusters without public exposure

  • separation between platform tooling and application workloads

  • consistent deployment patterns

  • integration with existing GitLab CI workflows

  • centralized secrets management

  • onboarding teams with minimal Kubernetes knowledge

Instead of a single cluster or loosely structured setup, I needed a model that would scale cleanly across environments and teams.

3. AKS Architecture: Platform vs Workload Clusters

The core design decision was to separate platform clusters from workload clusters.

Instead of running everything inside a single cluster or duplicating tooling everywhere, I designed a multi-cluster architecture with clear responsibilities.

Platform clusters

Under the Platform management group, I created three dedicated subscriptions:

  • platform_test

  • platform_nonprod

  • platform_prod

Each subscription had its own AKS cluster:

  • platform_test AKS

  • platform_nonprod AKS

  • platform_prod AKS

These clusters acted as the platform control layer, not workload environments.

What runs in platform clusters

These clusters hosted all shared platform components, including:

  • ArgoCD (GitOps control plane)

  • GitLab runners (for CI/CD execution inside the cluster network)

  • Kyverno for Kubernetes policy enforcement

  • admission control policies (OPA/Kyverno-based patterns)

  • cluster-level monitoring components

  • supporting platform services

The goal was to avoid:

  • duplicating tooling in every cluster

  • mixing platform concerns with application workloads

Workload clusters

Separate AKS clusters were deployed in workload subscriptions:

  • dev

  • test

  • staging

  • prod

These clusters were intentionally kept minimal.

They only contained:

  • application workloads

  • required runtime dependencies

  • monitoring agents

They did not include:

  • CI/CD tools

  • GitOps controllers

  • platform-level policy engines (centrally managed instead)

Why this separation

This design provided:

  • isolation between platform and applications

  • ability to upgrade platform tooling independently

  • reduced blast radius

  • clearer ownership boundaries

It also made it easier to enforce consistency across clusters, since platform logic was centralized.

4. Platform Cluster Lifecycle and Promotion Strategy

Each platform cluster had a specific role in the lifecycle of platform changes.

platform_test

This cluster was used for:

  • testing new platform components

  • trying new versions of tools (ArgoCD, Kyverno, etc.)

  • validating breaking changes

After validation:

  • workloads were scaled down to zero

  • the cluster remained available for future testing

This ensured that experiments did not impact stable environments.

platform_nonprod

This cluster hosted stable platform tooling for non-production environments.

It included:

  • ArgoCD (non-prod control plane)

  • GitLab runners

  • Kyverno policies for non-prod clusters

  • supporting services

Important detail:

ArgoCD in this cluster was responsible for managing:

  • dev clusters

  • test clusters

  • staging clusters

This created a clear separation between:

  • experimentation (platform_test)

  • stable non-prod operations

platform_prod

This cluster hosted production-grade platform tooling.

It included:

  • ArgoCD (production control plane)

  • GitLab runners

  • Kyverno / policy enforcement

  • platform-level observability components

ArgoCD here was responsible for:

  • managing production workload clusters

This ensured that:

  • production deployments were isolated

  • no non-prod logic or experiments could affect production

Promotion model

Changes followed a flow:

  1. Tested in platform_test

  2. Promoted to platform_nonprod

  3. Validated against non-prod workload clusters

  4. Promoted to platform_prod

This created a controlled promotion pipeline for platform changes, not just applications.
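
To make this concrete, the sketch below shows one way such a promotion could be expressed: each platform ArgoCD Application pins a tool to an explicit chart version, and promotion is just bumping that pin per environment. The repository URL and version number are illustrative, not the exact setup used here.

```yaml
# Illustrative only: a platform Application pinned to a validated chart version.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kyverno
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://kyverno.github.io/kyverno   # upstream Helm repository
    chart: kyverno
    targetRevision: 3.2.6     # example pin; bumped test -> nonprod -> prod
  destination:
    server: https://kubernetes.default.svc       # the platform cluster itself
    namespace: kyverno
```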

5. Private AKS and Access Model

All clusters were deployed as private AKS clusters.

This meant:

  • no public API server

  • no direct internet exposure

Access design

To enable secure access, I implemented:

  • VPN Gateway in the platform network

  • Azure VPN client for engineers

  • access routed through private networking

This allowed:

  • secure kubectl access

  • no exposure of cluster endpoints

DNS resolution across clusters

Private clusters introduced a challenge:

AKS API endpoints use private FQDNs, which must be resolvable across VNets and subscriptions.

To solve this, I implemented:

  • VNet peering across platform and workload networks

  • centralized Private DNS zones

  • Azure Private DNS Resolver

This ensured:

  • consistent name resolution

  • access across multiple clusters

Alternative access patterns

In some cases:

  • a jumpbox VM was used for debugging

However, the primary model remained:

  • VPN-based access with private DNS

6. GitOps Control Plane with ArgoCD

GitOps was implemented as the primary deployment model.

Control plane separation

  • ArgoCD in platform_nonprod → manages non-prod clusters

  • ArgoCD in platform_prod → manages production clusters

This ensured:

  • strict separation between environments

  • no accidental cross-environment deployments
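
The article does not detail how each ArgoCD instance knew about its workload clusters; one common, declarative option is a cluster Secret per target cluster. A minimal sketch with placeholder names, a placeholder private API FQDN, and placeholder credentials:

```yaml
# Registers a workload cluster with the ArgoCD instance in platform_nonprod.
apiVersion: v1
kind: Secret
metadata:
  name: aks-dev
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # tells ArgoCD this is a cluster entry
type: Opaque
stringData:
  name: aks-dev
  server: https://dev-aks-xxxxxx.privatelink.westeurope.azmk8s.io:443  # placeholder
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
```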

Application management

Applications were defined using:

  • ArgoCD Applications

  • ApplicationSets

ApplicationSets allowed:

  • dynamic generation of apps

  • multi-environment deployments

  • standardized patterns
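
As an example of the pattern, the following ApplicationSet sketch generates one Application per non-prod environment from a list generator. The repository URL, chart path, and cluster addresses are placeholders, not the actual definitions used on this platform.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: sample-service
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev
            server: https://dev-aks.example.internal:443
          - env: test
            server: https://test-aks.example.internal:443
          - env: staging
            server: https://staging-aks.example.internal:443
  template:
    metadata:
      name: 'sample-service-{{env}}'
    spec:
      project: workloads
      source:
        repoURL: https://gitlab.example.com/apps/sample-service.git
        targetRevision: main
        path: chart
        helm:
          valueFiles:
            - values-{{env}}.yaml       # environment-specific Helm values
      destination:
        server: '{{server}}'
        namespace: sample-service
```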

Drift and reconciliation

ArgoCD continuously ensured:

  • desired state in Git = actual state in the cluster

  • drift detection

  • automatic reconciliation

This removed the need for:

  • manual kubectl deployments

  • ad-hoc changes
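
Concretely, this behaviour comes from the sync policy on each Application (or on the ApplicationSet template shown above). A typical configuration, shown as a fragment of an Application spec:

```yaml
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
    syncOptions:
      - CreateNamespace=true
```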

7. Application Deployment Flow

The deployment model integrated GitLab CI with GitOps.

Flow

  1. Developer pushes code

  2. GitLab CI builds container image

  3. Image pushed to:

    • GitLab Container Registry

    • Azure Container Registry (ACR)

  4. Deployment triggered (pipeline or Git change)

  5. ArgoCD syncs state into cluster
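
A simplified .gitlab-ci.yml illustrating steps 2 to 4 of this flow. The job names, the ACR_* variables, and the tag-bump script are assumptions made for the sketch; only the CI_* variables are standard GitLab predefined variables, and this is not the actual pipeline used on the platform.

```yaml
stages:
  - build
  - deploy

build-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # build and push to the GitLab Container Registry
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    # mirror the image to ACR (ACR_* are assumed CI/CD variables)
    - docker login -u "$ACR_USERNAME" -p "$ACR_PASSWORD" "$ACR_LOGIN_SERVER"
    - docker tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" "$ACR_LOGIN_SERVER/sample-service:$CI_COMMIT_SHORT_SHA"
    - docker push "$ACR_LOGIN_SERVER/sample-service:$CI_COMMIT_SHORT_SHA"

update-deployment:
  stage: deploy
  image: alpine/git:latest
  script:
    # bump the image tag in the Git state that ArgoCD watches;
    # the exact mechanism (script, commit, or manual trigger) varied per team
    - ./ci/bump-image-tag.sh "$CI_COMMIT_SHORT_SHA"
```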

Helm-based deployments

Applications were packaged as Helm charts.

This allowed:

  • environment-specific values

  • reusable templates

  • consistent deployments
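
For illustration, environment-specific values files often differ only in scale and sizing while the chart stays identical. The names and numbers below are placeholders:

```yaml
# values-dev.yaml (placeholder values)
replicaCount: 1
image:
  repository: myregistry.azurecr.io/sample-service
  tag: "abc1234"
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi

# values-prod.yaml (placeholder values)
# replicaCount: 3, larger requests/limits, same chart and templates
```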

Reality of the setup

This was not fully pure GitOps.

Instead, it was:

  • GitOps for cluster state

  • CI-driven triggers for deployments

This approach worked well in a hybrid environment and allowed gradual adoption.

8. Policy and Governance inside Kubernetes

Kubernetes governance was enforced using Kyverno and policy-based controls.

Why policy enforcement was needed

Without policies:

  • teams could deploy inconsistent resources

  • security risks would increase

  • cluster behavior would become unpredictable

Tools used

  • Kyverno for policy enforcement

  • admission control patterns

  • validation and mutation rules

Conceptually aligned with:

  • OPA/Gatekeeper-style governance

Example controls

Policies enforced:

  • required labels and annotations

  • resource limits and requests

  • restrictions on privileged containers

  • namespace-level controls
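
The sketch below shows what two of these controls could look like as a Kyverno ClusterPolicy. The label name and resource kinds are illustrative; the actual policies used on the platform are not reproduced here.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels-and-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "The label 'team' is required."
        pattern:
          metadata:
            labels:
              team: "?*"        # any non-empty value
    - name: require-resource-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required on all containers."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
```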

Benefits

  • consistent deployments across clusters

  • reduced risk

  • centralized governance

9. Secrets Management

Secrets were handled using Azure-native integration.

Structure

  • separate Key Vaults per:

    • team

    • environment

Integration

  • External Secrets Operator used in clusters

  • pulls secrets from Key Vault into Kubernetes
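
A minimal sketch of this integration, assuming workload identity for authentication (the article does not specify the auth mode); vault URL, namespaces, and secret names are placeholders:

```yaml
# SecretStore pointing at a team/environment Key Vault
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: team-a-keyvault
  namespace: team-a
spec:
  provider:
    azurekv:
      vaultUrl: https://kv-team-a-dev.vault.azure.net   # placeholder vault
      authType: WorkloadIdentity
      serviceAccountRef:
        name: team-a-eso
---
# ExternalSecret syncing a Key Vault secret into a Kubernetes Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials
  namespace: team-a
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: team-a-keyvault
    kind: SecretStore
  target:
    name: app-db-credentials      # resulting Kubernetes Secret
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: db-password          # secret name in Key Vault
```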

Access control

  • managed through Entra ID groups

  • scoped per team

Outcome

  • no secrets stored in Git

  • centralized control

  • clear ownership

10. Networking and Ingress

Networking followed a private-first, hub-spoke model.

Cluster placement

  • clusters deployed in spoke VNets

  • connected to central hub

Traffic control

  • controlled ingress paths

  • internal service communication via private networking

Design goal

  • minimize public exposure

  • keep communication predictable
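
One concrete piece of this model is keeping ingress on the internal load balancer. A hedged sketch of an ingress controller Service using the AKS internal load balancer annotation (the article does not name the ingress controller used):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"  # private IP only
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: https
      port: 443
      targetPort: 443
```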

11. Developer Workflow

Developers interacted with the platform through:

  • Git repositories

  • CI pipelines

  • Helm values

What developers do

  • write code

  • push changes

  • update configs

What platform handles

  • infrastructure

  • networking

  • policies

  • deployment

Key principle

Enablement over access.

Teams were not required to understand:

  • Kubernetes internals

  • Azure networking

  • security policies

12. Challenges and Trade-offs

Building this platform was not just a technical exercise. Most of the complexity came from working within real constraints rather than designing in isolation.

One of the biggest challenges was operating in a hybrid environment. Some applications were still running on-premises and had to continue functioning while we were introducing Kubernetes on AKS. This meant I could not design everything as a clean, cloud-native system from the start. For example, the decision to push images to both GitLab Container Registry and Azure Container Registry was not ideal from a purity standpoint, but it was necessary to support existing workflows. The goal was to move forward without breaking what already worked.

Networking was another major challenge, especially with private AKS clusters. While private clusters significantly improve security, they introduce complexity around access and DNS resolution. I had to ensure that engineers could access clusters securely through VPN, while also making sure that private FQDNs resolved correctly across multiple VNets and subscriptions. This required careful planning of VNet peering, Private DNS zones, and the introduction of Azure Private DNS Resolver. These are not things that are easy to change later, so getting them right early was critical.

There was also a challenge around scaling connectivity. As more environments, regions, and external integrations were introduced, the network design needed to handle increasing complexity. Decisions around VPN Gateway sizing, NAT behavior, and routing were not static. They had to evolve as requirements grew, which meant the initial design needed to be flexible enough to adapt.

Another important challenge was developer adoption. Many teams were not familiar with Kubernetes or cloud-native practices. If I had simply provided clusters and access, the result would likely have been inconsistent deployments and operational issues. Instead, I had to design the platform in a way that guided teams toward the right patterns. This sometimes meant not implementing exactly what teams initially asked for. In many cases, requests were based on existing habits rather than what would work well in the new platform. It required balancing listening to requirements with making decisions that would scale long term.

There was also a constant trade-off between control and flexibility. Centralizing platform components like ArgoCD, policy enforcement, and secrets management improved consistency and security, but it reduced the level of direct control that application teams had. This was intentional, but it required careful design to ensure that teams still felt enabled rather than restricted.

Another trade-off was between pure GitOps and practical workflows. In an ideal setup, everything would be fully driven from Git with automated promotion between environments. In reality, we integrated GitOps with existing GitLab CI pipelines, including manual triggers where needed. While this was not a textbook GitOps implementation, it worked well in practice and allowed teams to adopt the model gradually instead of forcing a complete shift.

Finally, there was the challenge of changing established ways of working. Some processes had been followed for years, and moving to Infrastructure as Code, GitOps, and platform-driven workflows required a mindset shift. This was not something that could be solved purely with tooling. It required gradual introduction, clear patterns, and consistent reinforcement.

Overall, the main challenge was not designing the platform itself, but integrating it into an existing ecosystem with real constraints, existing systems, and varying levels of maturity.

13. Lessons Learned

Looking back, several important lessons came out of building and operating this platform.

One of the most important lessons was that separating platform and workload concerns early makes everything easier later. By keeping platform tooling (ArgoCD, runners, policies) in dedicated clusters and keeping workload clusters minimal, it became much easier to manage upgrades, enforce consistency, and reduce risk. Without this separation, platform components tend to get tightly coupled with workloads, making changes harder over time.

Another key lesson was that private clusters are worth the complexity, but only if networking is designed properly from the beginning. The security benefits are clear, but they come with a cost in terms of DNS, access, and connectivity. Investing time early in designing VNet structure, DNS resolution, and access patterns avoids much bigger problems later.

I also learned that GitOps adoption should be incremental, not forced. While the idea of full GitOps is appealing, teams need time to adapt. Integrating GitOps with existing CI/CD pipelines allowed us to introduce the model gradually, without disrupting existing workflows. Over time, this can evolve toward a more complete GitOps approach, but starting with a practical implementation was the right decision.

Another important lesson was around policy enforcement. Without centralized policies, Kubernetes environments quickly become inconsistent. Introducing tools like Kyverno allowed us to enforce standards such as resource limits, labeling, and security controls. This ensured that even as more teams onboarded, the platform remained predictable.

One of the strongest takeaways was that enablement is more effective than access. Giving teams full access to infrastructure does not necessarily lead to better outcomes, especially when they are new to the platform. Providing clear templates, workflows, and guardrails allowed teams to move faster with fewer errors. The role of the platform was not just to provide infrastructure, but to guide how it should be used.

I also learned that real-world constraints should shape design decisions. It is easy to aim for ideal architectures, but in practice, existing systems, organizational structure, and team maturity all play a role. Supporting hybrid environments, dual registries, and gradual migration was not ideal from a theoretical perspective, but it was necessary to move forward without disruption.

Finally, I realized that platform engineering is as much about people as it is about technology. The success of the platform depended not only on the technical design, but also on how well it aligned with teams, how easily it could be adopted, and how clearly it communicated the right way to work.

These lessons shaped not just the platform itself, but also how I approached designing systems in general.

14. Conclusion

The goal of this platform was never just to run Kubernetes clusters, but to create a system that teams could rely on without needing to understand all of its internal complexity.

By separating platform and workload clusters, I was able to keep responsibilities clear. Platform tooling such as ArgoCD, runners, and policy enforcement remained centralized, while workload clusters stayed focused on running applications. This made the overall system easier to operate, scale, and evolve over time.

Running everything as private AKS clusters improved security, but more importantly, it forced a more disciplined approach to networking, access, and connectivity. Decisions around VPN access, DNS resolution, and VNet design became foundational rather than afterthoughts.

Introducing GitOps provided consistency in how applications were deployed, even though the implementation was intentionally pragmatic and integrated with existing CI/CD workflows. This allowed teams to adopt new patterns gradually instead of forcing a complete shift upfront.

At the same time, the platform was designed with enablement in mind. Instead of exposing raw infrastructure, I focused on providing structured workflows, templates, and guardrails. This proved to be more effective, especially for teams that were new to Kubernetes and cloud environments.

Looking back, the most important part of this work was not any individual tool or technology, but the combination of decisions around structure, ownership, and workflows. These are the things that ultimately determine whether a platform is usable in practice.

This setup provided a foundation that could scale with the organization, support both existing systems and new workloads, and evolve over time without requiring constant redesign.