Building a Kubernetes Platform on AKS: Private Clusters, GitOps, and Workload Separation
In this article, I explain how I designed and implemented a private AKS-based platform with clear separation between platform and workload clusters.
The focus is on real-world decisions around networking, GitOps, security, and operating models rather than theoretical architecture.
1. Introduction
After setting up the Azure landing zone and defining the platform structure, the next step was to enable teams to run workloads on Kubernetes in a controlled and scalable way.
The goal was not just to create AKS clusters, but to design a platform model where:
infrastructure and platform tooling are separated from workloads
deployments are consistent and controlled
teams can use Kubernetes without needing deep expertise
This led to designing a multi-cluster architecture, where different clusters had clearly defined responsibilities.
2. What the Platform Needed to Solve
The platform had to support:
secure Kubernetes clusters without public exposure
separation between platform tooling and application workloads
consistent deployment patterns
integration with existing GitLab CI workflows
centralized secrets management
onboarding teams with minimal Kubernetes knowledge
Instead of a single cluster or loosely structured setup, I needed a model that would scale cleanly across environments and teams.
3. AKS Architecture: Platform vs Workload Clusters
The core design decision was to separate platform clusters from workload clusters.
Platform clusters
Under the Platform management group, I created three dedicated subscriptions:
platform_test
platform_nonprod
platform_prod
Each subscription had its own AKS cluster:
platform_test AKS
platform_nonprod AKS
platform_prod AKS
These clusters acted as the platform control layer, not workload environments.
What runs in platform clusters
These clusters hosted all shared platform components, including:
ArgoCD (GitOps control plane)
GitLab runners (for CI/CD execution inside cluster network)
Kyverno for Kubernetes policy enforcement
admission control policies (OPA/Kyverno-based patterns)
cluster-level monitoring components
supporting platform services
The goal was to avoid:
duplicating tooling in every cluster
mixing platform concerns with application workloads
Workload clusters
Separate AKS clusters were deployed in workload subscriptions:
dev
test
staging
prod
These clusters were intentionally kept minimal.
They only contained:
application workloads
required runtime dependencies
monitoring agents
They did not include:
CI/CD tools
GitOps controllers
platform-level policy engines (centrally managed instead)
Why this separation
This design provided:
isolation between platform and applications
ability to upgrade platform tooling independently
reduced blast radius
clearer ownership boundaries
It also made it easier to enforce consistency across clusters, since platform logic was centralized.
4. Platform Cluster Lifecycle and Promotion Strategy
Each platform cluster had a specific role in the lifecycle of platform changes.
platform_test
This cluster was used for:
testing new platform components
trying new versions of tools (ArgoCD, Kyverno, etc.)
validating breaking changes
After validation:
workloads were scaled down to zero
cluster remained available for future testing
This ensured that experiments did not impact stable environments.
platform_nonprod
This cluster hosted stable platform tooling for non-production environments.
It included:
ArgoCD (non-prod control plane)
GitLab runners
Kyverno policies for non-prod clusters
supporting services
Important detail:
ArgoCD in this cluster was responsible for managing:
dev clusters
test clusters
staging clusters
This created a clear separation between:
experimentation (platform_test)
stable non-prod operations (platform_nonprod)
platform_prod
This cluster hosted production-grade platform tooling.
It included:
ArgoCD (production control plane)
GitLab runners
Kyverno / policy enforcement
platform-level observability components
ArgoCD here was responsible for managing the production workload clusters.
This ensured that:
production deployments were isolated
no non-prod logic or experiments could affect production
Promotion model
Changes followed a flow:
1. Tested in platform_test
2. Promoted to platform_nonprod
3. Validated against non-prod workload clusters
4. Promoted to platform_prod
This created a controlled promotion pipeline for platform changes, not just applications.
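One way to make this promotion pipeline concrete is to mirror it in the Git repository that drives the platform clusters. The layout below is a hypothetical sketch of that idea, not the actual repository structure:

```
platform-config/
├── platform-test/      # new tool versions land here first
│   ├── argocd/
│   └── kyverno/
├── platform-nonprod/   # promoted after validation in platform_test
│   ├── argocd/
│   └── kyverno/
└── platform-prod/      # promoted after non-prod validation
    ├── argocd/
    └── kyverno/
```

Promotion then becomes a reviewable merge request that copies a validated version from one directory to the next.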
5. Private AKS and Access Model
All clusters were deployed as private AKS clusters.
This meant:
no public API server
no direct internet exposure
Access design
To enable secure access, I implemented:
VPN Gateway in the platform network
Azure VPN client for engineers
access routed through private networking
This allowed:
secure kubectl access
no exposure of cluster endpoints
DNS resolution across clusters
Private clusters introduced a challenge:
the AKS API server is exposed only through a private FQDN, which must be resolvable across VNets and subscriptions.
To solve this, I implemented:
VNet peering across platform and workload networks
centralized Private DNS zones
Azure Private DNS Resolver
This ensured:
consistent name resolution
access across multiple clusters
Alternative access patterns
In some cases, a jumpbox VM was used for debugging. However, the primary model remained VPN-based access with private DNS.
6. GitOps Control Plane with ArgoCD
GitOps was implemented as the primary deployment model.
Control plane separation
ArgoCD in platform_nonprod → manages non-prod clusters
ArgoCD in platform_prod → manages production clusters
This ensured:
strict separation between environments
no accidental cross-environment deployments
Application management
Applications were defined using:
ArgoCD Applications
ApplicationSets
ApplicationSets allowed:
dynamic generation of apps
multi-environment deployments
standardized patterns
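As an illustration, a minimal ApplicationSet using a list generator might look like the following. The application name, repository URL, and cluster endpoints are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp                    # hypothetical application name
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev
            cluster: https://dev-aks.example.internal   # hypothetical API endpoint
          - env: test
            cluster: https://test-aks.example.internal
  template:
    metadata:
      name: 'myapp-{{env}}'
    spec:
      project: default
      source:
        repoURL: https://gitlab.example.com/team/myapp-deploy.git
        targetRevision: main
        path: 'envs/{{env}}'     # one directory per environment
      destination:
        server: '{{cluster}}'
        namespace: myapp
```

Adding an environment then only requires adding an element to the generator list.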
Drift and reconciliation
ArgoCD continuously ensured:
desired state = actual state
drift detection
automatic reconciliation
This removed the need for:
manual kubectl deployments
ad-hoc changes
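Drift correction is configured per application through the sync policy. A minimal sketch, with hypothetical names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/team/myapp-deploy.git
    targetRevision: main
    path: envs/dev
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true     # delete cluster resources that were removed from Git
      selfHeal: true  # revert manual kubectl changes back to the Git state
    syncOptions:
      - CreateNamespace=true
```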
7. Application Deployment Flow
The deployment model integrated GitLab CI with GitOps.
Flow
Developer pushes code
GitLab CI builds container image
Image pushed to:
GitLab Container Registry
Azure Container Registry (ACR)
Deployment triggered (pipeline or Git change)
ArgoCD syncs state into cluster
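A simplified .gitlab-ci.yml for the build-and-push steps could look like this. The `CI_*` variables are GitLab's built-in pipeline variables; the `ACR_*` variables and the image name are assumptions for the sketch:

```yaml
stages:
  - build
  - publish

build-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # build and push to the GitLab Container Registry
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

publish-to-acr:
  stage: publish
  image: docker:24
  services:
    - docker:24-dind
  script:
    # re-tag the same image and push it to Azure Container Registry
    - docker pull "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    - docker login -u "$ACR_USER" -p "$ACR_PASSWORD" "$ACR_NAME.azurecr.io"
    - docker tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" "$ACR_NAME.azurecr.io/myapp:$CI_COMMIT_SHORT_SHA"
    - docker push "$ACR_NAME.azurecr.io/myapp:$CI_COMMIT_SHORT_SHA"
```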
Helm-based deployments
Applications were packaged as Helm charts.
This allowed:
environment-specific values
reusable templates
consistent deployments
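Environment differences were expressed through per-environment values files. A hypothetical pair, illustrating the pattern rather than real values:

```yaml
# values-dev.yaml (hypothetical)
replicaCount: 1
image:
  repository: myregistry.azurecr.io/myapp
  tag: latest
resources:
  requests:
    cpu: 100m
    memory: 128Mi
---
# values-prod.yaml (hypothetical)
replicaCount: 3
image:
  repository: myregistry.azurecr.io/myapp
  tag: "1.4.2"        # pinned version in production
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```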
Reality of the setup
This was not fully pure GitOps.
Instead, it was:
GitOps for cluster state
CI-driven triggers for deployments
This approach worked well in a hybrid environment and allowed gradual adoption.
8. Policy and Governance inside Kubernetes
Kubernetes governance was enforced using Kyverno and policy-based controls.
Why policy enforcement was needed
Without policies:
teams could deploy inconsistent resources
security risks would increase
cluster behavior would become unpredictable
Tools used
Kyverno for policy enforcement
admission control patterns
validation and mutation rules
Conceptually, this aligned with OPA/Gatekeeper-style governance.
Example controls
Policies enforced:
required labels and annotations
resource limits and requests
restrictions on privileged containers
namespace-level controls
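For instance, a Kyverno ClusterPolicy enforcing a required label and resource requests could look roughly like this (the policy name and label are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-requirements    # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "A 'team' label is required on every Deployment."
        pattern:
          metadata:
            labels:
              team: "?*"         # any non-empty value
    - name: require-resource-requests
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
```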
Benefits
consistent deployments across clusters
reduced risk
centralized governance
9. Secrets Management
Secrets were handled using Azure-native integration.
Structure
separate Key Vaults per:
team
environment
Integration
External Secrets Operator used in clusters
pulls secrets from Key Vault into Kubernetes
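In practice this pairs a SecretStore pointing at a team's Key Vault with ExternalSecret resources in that team's namespace. A sketch assuming workload identity, with hypothetical names:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: team-a-keyvault          # hypothetical store name
  namespace: team-a
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: https://kv-team-a-dev.vault.azure.net
      serviceAccountRef:
        name: team-a-eso
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: team-a
spec:
  refreshInterval: 1h            # re-sync from Key Vault hourly
  secretStoreRef:
    name: team-a-keyvault
    kind: SecretStore
  target:
    name: db-credentials         # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: db-password         # secret name inside Key Vault
```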
Access control
managed through Entra ID groups
scoped per team
Outcome
no secrets stored in Git
centralized control
clear ownership
10. Networking and Ingress
Networking followed a private-first, hub-spoke model.
Cluster placement
clusters deployed in spoke VNets
connected to central hub
Traffic control
controlled ingress paths
internal service communication via private networking
Design goal
minimize public exposure
keep communication predictable
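On AKS, keeping a service off the public internet can be done by annotating it so Azure provisions an internal load balancer with a private IP in the spoke VNet. A minimal sketch with a hypothetical service name:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  annotations:
    # tells AKS to create an internal (private IP) load balancer
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
```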
11. Developer Workflow
Developers interacted with the platform through:
Git repositories
CI pipelines
Helm values
What developers do
write code
push changes
update configs
What platform handles
infrastructure
networking
policies
deployment
Key principle
Enablement over access.
Teams were not required to understand:
Kubernetes internals
Azure networking
security policies
12. Challenges and Trade-offs
Building this platform was not just a technical exercise. Most of the complexity came from working within real constraints rather than designing in isolation.
One of the biggest challenges was operating in a hybrid environment. Some applications were still running on-premises and had to continue functioning while we were introducing Kubernetes on AKS. This meant I could not design everything as a clean, cloud-native system from the start. For example, the decision to push images to both GitLab Container Registry and Azure Container Registry was not ideal from a purity standpoint, but it was necessary to support existing workflows. The goal was to move forward without breaking what already worked.
Networking was another major challenge, especially with private AKS clusters. While private clusters significantly improve security, they introduce complexity around access and DNS resolution. I had to ensure that engineers could access clusters securely through VPN, while also making sure that private FQDNs resolved correctly across multiple VNets and subscriptions. This required careful planning of VNet peering, Private DNS zones, and the introduction of Azure Private DNS Resolver. These are not things that are easy to change later, so getting them right early was critical.
There was also a challenge around scaling connectivity. As more environments, regions, and external integrations were introduced, the network design needed to handle increasing complexity. Decisions around VPN Gateway sizing, NAT behavior, and routing were not static. They had to evolve as requirements grew, which meant the initial design needed to be flexible enough to adapt.
Another important challenge was developer adoption. Many teams were not familiar with Kubernetes or cloud-native practices. If I had simply provided clusters and access, the result would likely have been inconsistent deployments and operational issues. Instead, I had to design the platform in a way that guided teams toward the right patterns. This sometimes meant not implementing exactly what teams initially asked for. In many cases, requests were based on existing habits rather than what would work well in the new platform. It required balancing listening to requirements with making decisions that would scale long term.
There was also a constant trade-off between control and flexibility. Centralizing platform components like ArgoCD, policy enforcement, and secrets management improved consistency and security, but it reduced the level of direct control that application teams had. This was intentional, but it required careful design to ensure that teams still felt enabled rather than restricted.
Another trade-off was between pure GitOps and practical workflows. In an ideal setup, everything would be fully driven from Git with automated promotion between environments. In reality, we integrated GitOps with existing GitLab CI pipelines, including manual triggers where needed. While this was not a textbook GitOps implementation, it worked well in practice and allowed teams to adopt the model gradually instead of forcing a complete shift.
Finally, there was the challenge of changing established ways of working. Some processes had been followed for years, and moving to Infrastructure as Code, GitOps, and platform-driven workflows required a mindset shift. This was not something that could be solved purely with tooling. It required gradual introduction, clear patterns, and consistent reinforcement.
Overall, the main challenge was not designing the platform itself, but integrating it into an existing ecosystem with real constraints, existing systems, and varying levels of maturity.
13. Lessons Learned
Looking back, several important lessons came out of building and operating this platform.
One of the most important lessons was that separating platform and workload concerns early makes everything easier later. By keeping platform tooling (ArgoCD, runners, policies) in dedicated clusters and keeping workload clusters minimal, it became much easier to manage upgrades, enforce consistency, and reduce risk. Without this separation, platform components tend to get tightly coupled with workloads, making changes harder over time.
Another key lesson was that private clusters are worth the complexity, but only if networking is designed properly from the beginning. The security benefits are clear, but they come with a cost in terms of DNS, access, and connectivity. Investing time early in designing VNet structure, DNS resolution, and access patterns avoids much bigger problems later.
I also learned that GitOps adoption should be incremental, not forced. While the idea of full GitOps is appealing, teams need time to adapt. Integrating GitOps with existing CI/CD pipelines allowed us to introduce the model gradually, without disrupting existing workflows. Over time, this can evolve toward a more complete GitOps approach, but starting with a practical implementation was the right decision.
Another important lesson was around policy enforcement. Without centralized policies, Kubernetes environments quickly become inconsistent. Introducing tools like Kyverno allowed us to enforce standards such as resource limits, labeling, and security controls. This ensured that even as more teams onboarded, the platform remained predictable.
One of the strongest takeaways was that enablement is more effective than access. Giving teams full access to infrastructure does not necessarily lead to better outcomes, especially when they are new to the platform. Providing clear templates, workflows, and guardrails allowed teams to move faster with fewer errors. The role of the platform was not just to provide infrastructure, but to guide how it should be used.
I also learned that real-world constraints should shape design decisions. It is easy to aim for ideal architectures, but in practice, existing systems, organizational structure, and team maturity all play a role. Supporting hybrid environments, dual registries, and gradual migration was not ideal from a theoretical perspective, but it was necessary to move forward without disruption.
Finally, I realized that platform engineering is as much about people as it is about technology. The success of the platform depended not only on the technical design, but also on how well it aligned with teams, how easily it could be adopted, and how clearly it communicated the right way to work.
These lessons shaped not just the platform itself, but also how I approached designing systems in general.
14. Conclusion
The goal of this platform was never just to run Kubernetes clusters, but to create a system that teams could rely on without needing to understand all of its internal complexity.
By separating platform and workload clusters, I was able to keep responsibilities clear. Platform tooling such as ArgoCD, runners, and policy enforcement remained centralized, while workload clusters stayed focused on running applications. This made the overall system easier to operate, scale, and evolve over time.
Running everything as private AKS clusters improved security, but more importantly, it forced a more disciplined approach to networking, access, and connectivity. Decisions around VPN access, DNS resolution, and VNet design became foundational rather than afterthoughts.
Introducing GitOps provided consistency in how applications were deployed, even though the implementation was intentionally pragmatic and integrated with existing CI/CD workflows. This allowed teams to adopt new patterns gradually instead of forcing a complete shift upfront.
At the same time, the platform was designed with enablement in mind. Instead of exposing raw infrastructure, I focused on providing structured workflows, templates, and guardrails. This proved to be more effective, especially for teams that were new to Kubernetes and cloud environments.
Looking back, the most important part of this work was not any individual tool or technology, but the combination of decisions around structure, ownership, and workflows. These are the things that ultimately determine whether a platform is usable in practice.
This setup provided a foundation that could scale with the organization, support both existing systems and new workloads, and evolve over time without requiring constant redesign.