Designing Azure Landing Zones for Enterprise Cloud Adoption: Tenants, Management Groups, and Subscription Strategy
Design decisions, tradeoffs, and structure behind a production-ready Azure foundation
1. Introduction
In one of my recent roles, I was hired to set up the foundation for moving workloads from a primarily on-prem environment toward Azure. The starting point was not a greenfield setup, but rather an existing landscape with established systems, evolving cloud requirements, and no clearly defined Azure operating model in place.
Before onboarding any workloads, it became clear that we needed to first define how the cloud environment itself should be structured and managed. Instead of jumping straight into deploying services, I spent time understanding how the organization operated, how teams were structured, how responsibilities were divided, and what kind of environments would be required both immediately and in the future. This involved discussions with IT, infrastructure, security, and application stakeholders to make sure the design aligned with real workflows rather than purely technical assumptions.
One of the key realizations early on was that simply creating subscriptions and deploying resources would not scale. Without a clear structure, access model, and governance approach, the environment would quickly become difficult to manage as more teams and workloads were introduced. The goal, therefore, was to design a landing zone that could act as a stable and scalable foundation, supporting multiple environments, enforcing consistency, and enabling controlled growth.
The work focused on defining the core building blocks of the Azure platform: how tenants, management groups, and subscriptions should be structured; how access should be controlled through RBAC; and how governance and security should be applied from the beginning. This was less about individual resource deployment and more about establishing a cloud operating model that would guide how infrastructure is provisioned and managed over time.
In the following sections, I will walk through the key decisions behind this design, including how the environment was structured, how access and governance were handled, and the trade-offs involved along the way.
2. What the Landing Zone Needed to Solve
Before defining any architecture, the first step was to clearly understand what problems the landing zone needed to address. This was not just a technical exercise, but a combination of organizational, operational, and security considerations that would shape how the platform would evolve over time.
One of the primary challenges was the lack of a consistent structure in Azure. Without clear boundaries, there was a risk that resources would be created in an ad hoc way, leading to unclear ownership, inconsistent configurations, and increasing operational overhead. As more teams started adopting cloud services, this kind of setup would quickly become difficult to control.
Another key requirement was environment separation. Different workloads needed to run across development, testing, and production environments, each with different levels of access, stability, and governance. These environments could not simply coexist in the same space without introducing risks around accidental changes, access leakage, or unintended impact on production systems.
Access control was also a major concern. Multiple teams with different responsibilities needed access to the platform, but with clearly defined boundaries. The goal was to ensure that engineers had the access they needed to do their work, while avoiding overly broad permissions that could lead to security or operational risks. This required a structured approach to RBAC that aligned with real team responsibilities.
From a governance perspective, there was a need to introduce consistency without slowing teams down. This included standardizing how resources are named, how they are organized, and what baseline configurations are required. At the same time, it was important to avoid overly restrictive controls that would block development or introduce unnecessary friction. My goal was enablement for developers and the infrastructure team, with guardrails rather than gatekeeping.
Networking and connectivity were another important area. The platform needed to support secure communication between workloads, as well as controlled connectivity to external systems and, where needed, existing on-premises environments. These decisions had to be made early, as they would influence how services are deployed and consumed later.
Finally, the landing zone needed to support future growth. This meant designing with the expectation that more workloads, teams, and environments would be added over time. The structure had to be scalable, predictable, and easy to extend without requiring major redesigns.
Taken together, the landing zone was not just about organizing resources in Azure. It was about creating a structured and governed environment that could support real-world operations, balancing flexibility for engineering teams with control, security, and long-term maintainability.
A major future requirement was supporting Kubernetes-based workloads in a structured way, which influenced decisions around networking, identity, environment separation, and automation from the start.
3. Initial Challenges and Design Goals
Before defining the structure, there were a few key challenges that shaped the design.
Challenges:
There was no established cloud operating model, which meant decisions around structure, access, and ownership had to be defined from scratch. At the same time, the design needed to align with how teams actually worked, not just how things looked on paper.
Environment separation was another important concern. It was not just about dev and prod, but about clearly isolating risk, access, and stability. Without this, it would be easy for changes in non-production to impact production systems.
Access control also required careful planning. Different teams needed different levels of access, and without a structured approach, permissions could quickly become too broad or inconsistent. At the same time, overly strict controls could slow down development.
Networking decisions had to be made early, as they would impact connectivity, security, and how services interact. These are difficult to change later, so they needed to be thought through upfront.
Finally, there was a constant need to avoid overengineering: designing something scalable, but still simple enough to operate and understand.
Design Goals:
Based on these challenges, a few clear goals guided the design.
The first was clear environment separation, ensuring that development, testing, and production were isolated in a meaningful way.
The second was alignment with ownership, so that subscriptions, access, and resources reflected real team responsibilities.
Scalability was also important, allowing new workloads and environments to be added without redesigning the structure.
Consistency was another key goal, with standardized naming, organization, and baseline configurations to keep the platform predictable.
Security and governance were built in from the start, with guardrails that protect the platform without blocking teams.
Finally, the design needed to be practical and maintainable, implemented through infrastructure as code and understandable by the teams operating it.
These principles guided all further decisions in the landing zone design.
4. Tenant and Identity Boundary Decisions
One of the first areas that needed clarity was the tenant and identity boundary, as this defines how access, authentication, and overall control of the platform are managed.
The environment was built within an existing Azure tenant, which meant working within established identity and governance constraints. Rather than creating a separate tenant, the focus was on structuring access and responsibilities correctly within the current one. This required close coordination with stakeholders responsible for identity and security to ensure alignment with organizational policies.
A key decision was to separate concerns between tenant-level administration and platform-level operations. Tenant-wide permissions were kept limited, while most operational responsibilities were handled at management group and subscription level. This helped reduce risk and avoided unnecessary exposure of high-privilege roles.
Access was designed around groups rather than individual users. Instead of assigning permissions directly, roles were mapped to Entra ID groups representing different teams and responsibilities. This made access easier to manage, especially as team members changed over time.
Different types of identities were also handled differently. User access was separated from automation, with service principals or managed identities used for CI/CD pipelines and infrastructure provisioning. These identities were granted only the permissions required for their specific scope, avoiding overly broad access.
Another important aspect was ensuring that access boundaries aligned with how teams worked. Platform, networking, and application teams each had clearly defined scopes, reducing overlap and making ownership more explicit.
Overall, the goal at this level was to establish a clean and controlled identity model that supports secure access, scales with the organization, and integrates well with the rest of the landing zone design.
5. Management Group Hierarchy Design
With the identity boundary defined, the next step was structuring the management group hierarchy. This was a key part of the design, as it defines how governance, policies, and access scale across the platform.
The hierarchy was intentionally kept simple and built around three primary areas:
Platform
Workloads
Sandboxes
This structure was designed to reflect both ownership and usage patterns, rather than just technical grouping.
The Platform management group was not treated as one large catch-all area. It was intentionally split into four platform domains: Identity, Connectivity, Management, and Shared Services. That separation exists for a practical reason: these domains have different blast radius, different access requirements, and different operational lifecycles. Identity and Management sit closer to the shared control plane and therefore need tighter governance; the Management domain in particular hosted cross-cutting operational capabilities such as monitoring, diagnostics, security visibility, and other platform-level management tooling. Connectivity affects every connected workload and has to be centrally controlled. Shared Services provide reusable capabilities, but should not become the place where application runtimes are hidden.
The Workloads management group was where application environments lived and where the actual runtime of the business services was deployed. This distinction mattered throughout the design: the platform layer hosted shared control-plane services, while workload subscriptions hosted the components that actually run the applications.
The Sandboxes management group was designed for experimentation and non-critical usage. This allowed engineers to test ideas or explore services without impacting structured environments. Governance here was intentionally more relaxed, while still maintaining basic guardrails.
One of the key considerations was balancing control and simplicity. Instead of creating a deep or overly complex hierarchy, this structure provided clear separation of concerns while remaining easy to understand and operate.
Another important aspect was leveraging inheritance. By assigning policies and RBAC at the management group level, baseline configurations could be enforced consistently across all child subscriptions. This reduced duplication and ensured that new subscriptions automatically followed the same standards.
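To make the inheritance idea concrete, here is a small sketch that models it. This is not Azure's implementation or API; the hierarchy and assignment names are illustrative placeholders, and the point is simply that a child scope receives everything assigned along the path from the root down to it.

```python
# Illustrative model of management group inheritance (not Azure's API):
# an assignment made at a scope applies to every scope beneath it.
HIERARCHY = {
    "root": ["platform", "workloads", "sandboxes"],
    "platform": ["identity", "connectivity", "management", "shared_services"],
    "workloads": ["dev", "staging", "prod"],
    "sandboxes": ["sandbox1", "sandbox2", "sandbox3"],
}

ASSIGNMENTS = {  # scope -> policies/roles assigned directly at that scope
    "root": ["require-tags"],
    "workloads": ["allowed-regions"],
    "prod": ["deny-public-ip"],
}

def path_to(scope, node="root"):
    """Return the chain of scopes from the root down to `scope`."""
    if node == scope:
        return [node]
    for child in HIERARCHY.get(node, []):
        sub = path_to(scope, child)
        if sub:
            return [node] + sub
    return []

def effective_assignments(scope):
    """Everything inherited along the path, plus direct assignments."""
    result = []
    for node in path_to(scope):
        result.extend(ASSIGNMENTS.get(node, []))
    return result

# A prod subscription picks up the root and workloads baselines automatically.
print(effective_assignments("prod"))
```

This is why a newly vended subscription never starts empty: placing it under the right management group is enough to give it the baseline guardrails.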
Overall, this approach provided a clean and scalable foundation. It clearly separated platform responsibilities, workload environments, and experimental usage, while keeping the structure flexible enough to grow over time without requiring major changes.
Conceptually, the workload side was managed through a higher-level NonProd vs Prod operating model, while still exposing environment-specific subscriptions such as dev, staging, and prod for day-to-day deployment and ownership boundaries.
6. Subscription Strategy
After defining the management group hierarchy, the next step was designing the subscription model. Subscriptions were used as the primary boundary for isolation, access control, and operational ownership.
Under the Platform management group, subscriptions were organized around platform domains and, where needed, split between NonProd and Prod:
identity_nonprod / identity_prod
connectivity_nonprod / connectivity_prod
management_nonprod / management_prod
sharedservices_nonprod / sharedservices_prod
This separation made the platform easier to reason about. Identity-related dependencies, hub networking, management tooling, and shared capabilities could evolve independently, and a change in one platform domain did not automatically expand the blast radius into all the others.
Under the Workloads management group, subscriptions were organized by application environments:
dev
staging
prod
In practice, the most important operational boundary was NonProd vs Prod. Development and staging sat on the non-production side, where teams could validate infrastructure and application changes more freely. Production remained isolated with tighter RBAC, stricter policy enforcement, and more controlled deployment processes.
The Sandboxes management group contained separate subscriptions (sandbox1, sandbox2, sandbox3) used for experimentation. These were intentionally isolated from both platform and workload environments, allowing engineers to test new ideas or services without affecting structured environments.
This overall structure provided clear separation between:
shared platform control-plane services
application runtime environments
experimental usage
It also helped reduce risk by limiting the blast radius of changes and made it easier to apply different governance and access controls across environments.
One observation from this setup is that naming should always reflect the real operating model. If non-production serves several purposes such as experimentation, integration, and pre-production validation, that needs to be visible in the structure so teams understand where a change belongs.
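One way to keep that naming discipline honest is to validate names against the operating model. The following sketch classifies subscription names into their management groups; the domains and environment suffixes come from the structure above, but the validation logic itself is an illustrative assumption, not tooling that existed on the platform.

```python
import re

# Illustrative validator for the subscription naming model described above.
PLATFORM_DOMAINS = {"identity", "connectivity", "management", "sharedservices"}
PLATFORM_ENVS = {"nonprod", "prod"}
WORKLOAD_ENVS = {"dev", "staging", "prod"}

def classify_subscription(name):
    """Return which management group a subscription name belongs under,
    or raise ValueError if it does not fit the operating model."""
    if name in WORKLOAD_ENVS:
        return "workloads"                      # dev / staging / prod
    if re.fullmatch(r"sandbox\d+", name):
        return "sandboxes"                      # sandbox1, sandbox2, ...
    m = re.fullmatch(r"([a-z]+)_([a-z]+)", name)
    if m and m.group(1) in PLATFORM_DOMAINS and m.group(2) in PLATFORM_ENVS:
        return "platform"                       # e.g. connectivity_prod
    raise ValueError(f"name does not fit the operating model: {name}")

print(classify_subscription("connectivity_prod"))  # -> platform
```

A check like this, run as part of subscription vending, makes structural drift visible before it accumulates.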
The NonProd/Prod split allowed platform-level changes to be tested safely before reaching production. Core infrastructure such as networking and shared services could be validated in non-production platform subscriptions without impacting critical workloads. At the same time, the production platform remained tightly controlled with stricter access and governance.
Another observation is that while separating platform environments added safety, it also introduced some overlap in naming and structure across the NonProd and Prod domain pairs. In future iterations, this could be simplified to reduce cognitive overhead while still maintaining the same level of isolation.
7. Governance Model
Governance was treated as a foundational part of the landing zone rather than something added later. The goal was to introduce enough structure to keep the environment consistent and secure, while still allowing teams to move quickly.
One of the first steps was defining basic standards that would apply across all subscriptions. This included naming conventions, resource organization, and tagging to ensure that resources were easy to identify, track, and manage. Keeping these consistent was important not just for readability, but also for automation, cost management, and operational clarity.
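A naming convention only pays off when it is applied mechanically. The sketch below builds resource names from a `<type>-<workload>-<env>-<region>` pattern; the abbreviations and separator are assumptions for illustration rather than the organization's actual standard, though the storage account constraint (lowercase alphanumerics, maximum 24 characters) is a real Azure limit worth encoding.

```python
# Illustrative name builder; abbreviations and pattern are assumptions.
ABBREVIATIONS = {
    "resource_group": "rg",
    "virtual_network": "vnet",
    "key_vault": "kv",
    "storage_account": "st",
}

def resource_name(rtype, workload, env, region):
    name = f"{ABBREVIATIONS[rtype]}-{workload}-{env}-{region}"
    if rtype == "storage_account":
        # Storage accounts allow only lowercase alphanumerics, max 24 chars.
        name = name.replace("-", "")[:24]
    return name

print(resource_name("virtual_network", "payments", "prod", "weu"))
# -> vnet-payments-prod-weu
```

Embedding a function like this in the provisioning modules means every team gets identical names for free, which is what makes tagging, cost reporting, and automation reliable downstream.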
Governance was also aligned with the management group hierarchy. Policies and baseline RBAC were assigned at the management group level and inherited down into child subscriptions. That inheritance model was important because a newly vended subscription did not start empty. It inherited the expected guardrails, access model, and baseline standards from day one. This also allowed different levels of control depending on the environment: stricter for production and platform resources, and more flexible for sandboxes.
Another important aspect was ensuring that governance did not become a blocker. Instead of introducing overly restrictive controls from the start, the approach was to apply practical guardrails that addressed real risks. For example, ensuring that critical resources followed standard configurations and limiting risky patterns in production environments, while keeping non-production environments more open for development.
There was also a focus on ownership and accountability. Subscriptions and resources were structured in a way that made it clear which team was responsible for what. This reduced ambiguity and made it easier to manage changes, troubleshoot issues, and enforce standards over time.
From an implementation perspective, governance was closely tied to infrastructure as code. Baseline configurations, policy assignments, role bindings, and budget settings were embedded into OpenTofu modules and deployment workflows, ensuring that new resources followed the same patterns by default rather than relying on manual enforcement.
Overall, the governance model aimed to strike a balance by providing enough control to keep the platform stable and secure, while remaining lightweight enough to support ongoing development and growth.
Cost management was also considered as part of governance. Budgets were defined at subscription level with daily, weekly, and monthly monitoring, along with alerts to ensure visibility into spending. In other words, governance was not only about security guardrails, but also about keeping access, compliance, and cost behavior predictable.
This was particularly important in non-production and sandbox environments, where automated cleanup and usage patterns could otherwise lead to unnecessary costs. By combining budget alerts with tagging and subscription boundaries, it was possible to maintain accountability and control as the platform scaled.
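The budget alerting itself is conceptually simple: compare accumulated spend against thresholds expressed as a fraction of the budget. The threshold values below are illustrative, not the ones actually configured.

```python
# Hedged sketch of subscription budget monitoring: report which alert
# thresholds (fractions of the budget) current spend has crossed.
def budget_alerts(budget, spend, thresholds=(0.5, 0.8, 1.0)):
    """Return the thresholds that have been reached or exceeded."""
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]

# e.g. a monthly budget of 5000 with 4200 already spent
print(budget_alerts(5000, 4200))   # crossed the 50% and 80% marks
```

Running the same check against daily, weekly, and monthly windows gives early warning in fast-burning sandbox subscriptions as well as slow drift in production.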
8. RBAC and Access Control Strategy
Access control was one of the most important parts of the landing zone design, as it directly impacts both security and day-to-day operations. The goal was to ensure that teams had the access they needed to work effectively, while keeping permissions scoped and controlled.
The approach was based on role-based access control aligned with responsibilities, rather than assigning broad permissions by default. Instead of granting access at individual resource level, permissions were primarily assigned at management group and subscription level, allowing inheritance to handle most use cases. This reduced duplication and made access easier to manage as the environment grew.
Access was structured using Entra ID groups, with roles mapped to specific team responsibilities such as platform, networking, and application teams. This avoided direct user-level assignments and made it easier to onboard or offboard users without changing role assignments across the platform.
The Platform management group had more restricted and controlled access, as it contained shared infrastructure that impacted all environments. Only the platform team and a limited set of administrators had elevated permissions here.
Under the Workloads management group, access was further separated by environment. Development and staging subscriptions allowed broader access for application teams to deploy and test, while production access was more tightly controlled and typically limited to specific roles or controlled processes.
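The environment-dependent access model can be pictured as a matrix of group and environment to role. The group names and role choices below are assumptions made for the sketch, not the platform's actual assignments, but they show the shape of the model: the same team gets broader rights in non-production than in production.

```python
# Illustrative (group, environment) -> role matrix; names are assumptions.
ACCESS_MATRIX = {
    ("app-team", "dev"): "Contributor",
    ("app-team", "staging"): "Contributor",
    ("app-team", "prod"): "Reader",          # prod changes go via pipelines
    ("platform-team", "prod"): "Contributor",
}

def role_for(group, environment):
    """Role a group holds in an environment, or None for no access."""
    return ACCESS_MATRIX.get((group, environment))

print(role_for("app-team", "prod"))   # -> Reader
```

Defining access as data like this also makes it straightforward to render the matrix into role assignments in infrastructure as code.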
For automation, separate identities were used instead of relying on user credentials. CI/CD pipelines (e.g., GitLab) were integrated using service principals or managed identities, with permissions scoped only to the subscriptions or resources they needed to manage. This ensured that automation remained controlled and auditable.
One important consideration was minimizing the use of overly privileged roles such as Owner. Wherever possible, more scoped roles were used to limit access while still enabling necessary operations. This helped reduce risk, especially in production environments.
Overall, the RBAC strategy focused on clear boundaries, group-based access, and least privilege, ensuring that access scaled with the platform while remaining secure and manageable.
9. Policy, Compliance, and Guardrails
Alongside RBAC, policies were used to enforce baseline standards and prevent common misconfigurations. The goal was not to restrict everything, but to introduce practical guardrails that kept the platform consistent and secure as it scaled.
Policies were applied primarily at the management group level, allowing them to be inherited by all underlying subscriptions. This ensured that new subscriptions automatically followed the same baseline without requiring manual setup each time, which is exactly what you want if subscription creation is being automated.
The approach differed slightly across management groups. In the Platform and production workload environments, policies were stricter to protect critical infrastructure and ensure compliance with security expectations. In contrast, non-production and sandbox environments had more relaxed policies to allow experimentation and faster iteration.
Some of the key areas covered by policies included:
enforcing required tags for ownership and cost tracking
restricting allowed regions to maintain consistency
ensuring baseline configurations for resources and diagnostics
preventing certain risky exposure patterns in production environments
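The guardrails above can be sketched as a simple evaluation in the spirit of Azure Policy deny effects. The rule values and the resource shape here are illustrative assumptions, not the actual policy definitions used.

```python
# Sketch of guardrail evaluation; rules and resource shape are illustrative.
REQUIRED_TAGS = {"owner", "cost-center"}
ALLOWED_REGIONS = {"westeurope", "northeurope"}

def violations(resource, environment):
    """Return the guardrails a resource would violate at deploy time."""
    found = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        found.append(f"missing tags: {sorted(missing)}")
    if resource.get("location") not in ALLOWED_REGIONS:
        found.append(f"region not allowed: {resource.get('location')}")
    if environment == "prod" and resource.get("public_access"):
        found.append("public exposure denied in prod")
    return found

vm = {"tags": {"owner": "team-a"}, "location": "eastus", "public_access": True}
print(violations(vm, "prod"))
```

Notice that the third rule only fires in production, mirroring the stricter enforcement described above for platform and production scopes.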
A key consideration was avoiding overly aggressive enforcement early on. Instead of applying a large number of strict policies upfront, the approach was to introduce controls incrementally based on actual needs. This helped avoid blocking teams while still moving toward a more governed environment.
Policies were also closely aligned with the overall structure of the landing zone. By combining management group hierarchy, subscription boundaries, and policy inheritance, governance could be applied consistently without becoming difficult to manage. These guardrails worked alongside RBAC and subscription-level budget controls, rather than replacing them.
Over time, this created a balance where teams could work with flexibility in non-production environments, while production and platform layers remained controlled and predictable.
10. Platform Security Foundations
Security was treated as a foundational aspect of the landing zone rather than something applied at the workload level later. Many of the key security controls were built directly into the platform design, reducing the need for reactive fixes as the environment grew.
One of the primary decisions was to enforce isolation through structure. By separating platform, workloads, and sandbox environments into different management groups and subscriptions, the risk of unintended access or impact was significantly reduced. Production environments were especially isolated, with stricter access controls and governance.
Access control itself played a major role in platform security. RBAC was designed around least privilege, with permissions scoped to roles and responsibilities rather than individuals. High-privilege access was limited, especially in platform and production subscriptions, reducing the overall attack surface.
Where automation or service-to-service access was needed, managed identities were preferred over long-lived credentials. This reduced secret sprawl and made permissions easier to scope, review, and rotate.
Defender for Cloud was also part of the cross-cutting security model. It provided a useful baseline across subscriptions by surfacing recommendations, highlighting configuration gaps, and making it easier to track whether the platform was drifting away from expected security posture over time.
Networking was another key component of the security foundation. The design leaned toward private connectivity wherever possible, limiting public exposure of services. Private endpoints became a recurring pattern for PaaS dependencies, and this approach influenced how services were deployed and accessed, ensuring that internal communication between components remained controlled.
Baseline protections were also considered at the platform level. This included enforcing standard configurations through policies, ensuring resources followed expected patterns, and avoiding insecure defaults. While not all controls were applied at once, the structure allowed them to be introduced gradually without requiring major changes.
Another important aspect was separation of concerns. Platform-level resources, such as shared infrastructure, were kept isolated from application workloads. This ensured that changes or issues in one area would not directly affect others, and allowed tighter control over critical components.
Finally, the platform was designed with auditability in mind. By structuring access, policies, and deployments consistently, it became easier to track changes, understand ownership, and maintain visibility across the environment.
Overall, security was not treated as a separate layer, but as an integral part of how the platform was structured and operated from the beginning.
11. Networking Foundations
Networking was one of the most critical parts of the landing zone, as it defined how services communicate, how access is controlled, and how the platform integrates with existing systems.
The design followed a hub-and-spoke model. The Connectivity subscription acted as the hub, and the workload subscriptions acted as the spokes. This allowed shared network control to stay centralized while keeping workload environments isolated from one another.
Each workload VNet was connected to the hub through VNet peering. That made it possible for workloads to consume shared connectivity services without flattening everything into a single network boundary.
The hub hosted the shared networking control plane: Azure Firewall, centralized routing, and private DNS. Keeping firewalling, route control, and name resolution in the Connectivity subscription meant those patterns were defined once and consumed consistently, rather than being reimplemented differently by each workload team.
A key decision was to move toward private connectivity by default. Wherever possible, services were not exposed publicly, and communication between components was handled through private endpoints and internal networking paths. This aligned with the overall security model and reduced unnecessary exposure of critical services.
Networking was also closely aligned with the subscription and management group structure. Platform-level networking components lived in the Connectivity subscription, while workload environments owned their own virtual networks, subnetting, private endpoints, and application-facing load balancers. This separation ensured clear ownership and reduced the risk of cross-environment impact.
At the exposure layer, I kept a deliberate distinction. Azure Firewall remained in the Connectivity hub because it is a shared inspection and egress control point. Application Gateway or AKS ingress components sat close to the workloads they exposed, because they are part of the application entry path. Workload-specific load balancers also stayed in the workload layer rather than being pulled into the platform.
Before rolling out networking to production, all core components were first implemented and validated in the non-production connectivity environment. This included setting up virtual networks, defining address spaces, and testing connectivity patterns.
A key part of this phase was ensuring that IP ranges did not conflict with existing on-premises infrastructure. This required coordination with internal IT teams and careful planning of address spaces to support both current and future connectivity requirements.
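The overlap check itself is easy to automate with the standard library. The address ranges below are made up for the sketch; the real work was agreeing on ranges with the internal IT teams, but a check like this catches conflicts before a peering or VPN connection is ever established.

```python
import ipaddress

# Hedged sketch of address-space planning; the ranges are illustrative.
ON_PREM = [ipaddress.ip_network("10.0.0.0/12")]      # assumed on-prem space
HUB = ipaddress.ip_network("10.100.0.0/22")          # assumed hub VNet

def conflicts(candidate, existing):
    """Return the existing networks that overlap a candidate range."""
    net = ipaddress.ip_network(candidate)
    return [str(other) for other in existing if net.overlaps(other)]

print(conflicts("10.4.0.0/16", ON_PREM))            # overlaps on-prem space
print(conflicts("10.100.4.0/22", ON_PREM + [HUB]))  # free to allocate
```

Keeping the allocated ranges in version control alongside a check like this turns address planning from tribal knowledge into something reviewable.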
Core networking components such as VPN gateways, private DNS resolution, firewall rules, and connectivity patterns were tested in non-production first. Once validated, the same setup was replicated in the production connectivity environment. This approach reduced risk and ensured that production networking was based on tested and predictable configurations rather than assumptions.
DNS and service discovery were also an important part of the design, particularly with the use of private endpoints. Shared private DNS lived with the hub, while workload-owned private endpoints stayed with the workloads that depended on them. Ensuring consistent name resolution across subscriptions and environments required careful planning, especially as more services were introduced.
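In practice this meant maintaining a mapping from each PaaS service type to its private DNS zone. The zone names below are the commonly documented ones for these services, but they should be verified against current Azure documentation rather than taken from this sketch.

```python
# Sketch of the service -> private DNS zone mapping used with private
# endpoints; verify zone names against current Azure documentation.
PRIVATE_DNS_ZONES = {
    "key_vault": "privatelink.vaultcore.azure.net",
    "blob_storage": "privatelink.blob.core.windows.net",
    "container_registry": "privatelink.azurecr.io",
    "azure_sql": "privatelink.database.windows.net",
}

def zone_for(service):
    """Private DNS zone a private endpoint for this service registers in."""
    return PRIVATE_DNS_ZONES[service]

print(zone_for("key_vault"))
```

Centralizing these zones in the hub, and linking them to every spoke VNet, is what keeps name resolution consistent as new private endpoints appear.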
Overall, the networking foundation focused on centralized control, environment isolation, and secure connectivity, providing a structure that could support both current workloads and future expansion without major redesign.
12. Shared Services and Platform Capabilities
In addition to the core structure, a set of shared services was established to support workloads across all environments. These were placed within the platform subscriptions, ensuring they were centrally managed and consistently available.
The goal was to centralize capabilities that are common across multiple workloads, while avoiding unnecessary duplication and keeping control within the platform layer.
The most important design boundary here was between platform and workload. The platform layer hosted shared control-plane services: identity-related infrastructure, centralized connectivity, management tooling, reusable secrets patterns, registries, and observability. The workload layer hosted the application runtime: AKS or other compute, messaging, data services, storage, and the private endpoints required by those applications.
That distinction mattered because it is easy to accidentally push too much into "platform." Services such as Kafka, ActiveMQ, application databases, and workload storage were not treated as platform services. Even when shared by a particular application landscape, they still belonged in workload subscriptions because their lifecycle, scaling, failure modes, and ownership were part of the workload, not the shared control plane.
The same logic applied to MongoDB Atlas. Atlas was treated as an external managed service rather than something living inside the platform layer. Even though it sits outside native Azure resource ownership, architecturally it was still a workload dependency and was handled through the workload's connectivity and security model.
One of the key areas was network-related shared services. Components such as VPN gateways, private DNS resolution, and connectivity services were hosted in the platform layer, allowing workload environments to consume them without needing to manage their own implementations.
Another important area was secrets management. Azure Key Vault was used as the central mechanism for storing and managing sensitive data. Instead of using a single shared vault, separate Key Vaults were created per team or service, with further separation across environments (dev, test, prod). This aligned with the overall structure of the platform and ensured clear isolation of secrets.
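A small sketch of the per-team, per-environment vault layout, using a `for_each` over the team/environment combinations. Team names, the naming convention, and resource groups here are hypothetical illustrations, not the actual convention used.

```hcl
# Sketch only: team names, naming pattern, and resource groups are illustrative.
locals {
  teams        = ["payments", "ordering"]
  environments = ["dev", "test", "prod"]

  # One vault per team/environment combination.
  vaults = {
    for pair in setproduct(local.teams, local.environments) :
    "${pair[0]}-${pair[1]}" => { team = pair[0], env = pair[1] }
  }
}

resource "azurerm_key_vault" "team" {
  for_each = local.vaults

  name                = "kv-${each.value.team}-${each.value.env}"
  location            = var.location
  resource_group_name = "rg-secrets-${each.value.env}"
  tenant_id           = var.tenant_id
  sku_name            = "standard"

  # Prefer Azure RBAC over access policies so Entra ID groups drive access.
  enable_rbac_authorization = true
}
```

Enabling RBAC authorization on the vaults is what allows the Entra ID group model described below to control access, rather than per-vault access policies.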
Access to Key Vaults was controlled through Entra ID groups, allowing teams to access only the secrets relevant to their services and environments. This approach simplified access management while maintaining strong security boundaries.
Within Kubernetes environments (AKS), secrets were integrated using the External Secrets Operator, allowing workloads to securely retrieve secrets from Azure Key Vault without embedding them directly into application configurations. This created a clear separation between secret storage and application deployment.
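As an illustration of that separation, an ExternalSecret can be declared alongside the rest of the infrastructure code via the Kubernetes provider. The store name, namespace, and secret names below are hypothetical, and the sketch assumes the External Secrets Operator CRDs are already installed in the cluster with a SecretStore pointing at the team's Key Vault.

```hcl
# Sketch only: names are illustrative; assumes ESO is installed and a
# SecretStore for the team vault already exists.
resource "kubernetes_manifest" "db_password" {
  manifest = {
    apiVersion = "external-secrets.io/v1beta1"
    kind       = "ExternalSecret"
    metadata = {
      name      = "app-db-password"
      namespace = "payments"
    }
    spec = {
      refreshInterval = "1h"
      secretStoreRef = {
        name = "kv-payments-prod" # SecretStore pointing at the team vault
        kind = "SecretStore"
      }
      target = {
        name = "db-credentials" # Kubernetes Secret created by ESO
      }
      data = [
        {
          secretKey = "password"
          remoteRef = { key = "db-password" } # secret name in Key Vault
        }
      ]
    }
  }
}
```

The application only ever mounts the resulting Kubernetes Secret; the Key Vault reference stays in platform-managed configuration.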
Container image management reflected a hybrid setup. Azure Container Registry (ACR) was used as the primary registry for cloud workloads, while an existing on-premises GitLab setup required images to be available in GitLab as well. To support both environments, images were built and pushed through GitLab CI pipelines to both GitLab's registry and Azure Container Registry. While this introduced some duplication, it allowed compatibility with existing workflows and supported a gradual transition toward cloud-native deployments.
Operational tooling was also centralized where it made sense, particularly for monitoring and observability. This helped maintain consistency across environments and reduced duplication of effort.
A key consideration throughout was deciding what should be centralized and what should remain within workloads. Foundational capabilities such as networking, secret management, and shared operational tooling were centralized, while application-specific runtime resources remained within workload subscriptions.
A typical workload subscription therefore contained the runtime components needed by the application itself: an AKS cluster or other compute layer, messaging components, data services, storage accounts, and workload-specific private endpoints. The platform provided the shared foundations around those workloads, not the workloads themselves.
Overall, the shared services layer provided reusable building blocks that supported all environments, reinforced consistency, and enabled teams to operate securely without duplicating core infrastructure components.
13. Infrastructure as Code Approach
The landing zone was implemented using Infrastructure as Code (IaC) to ensure consistency, repeatability, and controlled changes across the platform. In practice, OpenTofu and GitLab CI became the mechanism for subscription vending, baseline platform setup, and consistent provisioning across the estate. Rather than creating resources manually, all core components, including management groups, subscriptions, networking, and shared services, were defined through code.
The implementation was structured across three separate repositories, each with a clear responsibility.
The first repository handled the creation of remote state backends. For each subscription, storage accounts and containers were provisioned through GitLab CI pipelines to store OpenTofu state. This ensured proper isolation of state per environment and avoided conflicts between different parts of the platform.
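Once a backend exists, each environment's configuration points at its own state container. A minimal sketch of such a backend block, with hypothetical storage account and container names:

```hcl
# Sketch only: storage account, container, and key names are illustrative.
# One state backend per subscription/environment keeps state isolated.
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate-prod"
    storage_account_name = "sttfstateprod"
    container_name       = "platform"
    key                  = "connectivity.tfstate"
  }
}
```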
The second repository contained the core infrastructure modules. This included reusable modules for subscription vending, management group placement, policy assignment, networking, and other shared building blocks. The goal here was to define the building blocks of the platform in a modular and reusable way.
The third repository was used for environment-specific configurations, consuming the modules defined in the module repository. This separation allowed infrastructure logic to remain reusable, while environments could be defined and managed independently.
A key part of the workflow was the use of versioned modules. Changes to infrastructure were implemented through small, incremental updates aligned with individual tasks (for example, vending a new subscription, assigning baseline policies, adding a VPN gateway, or provisioning AKS). Each change was merged into the main branch of the modules repository and resulted in a new semantic version release.
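In the environment repository, each module consumption then pins an explicit release, so upgrades arrive as reviewable merge requests rather than silent drift. The GitLab path, module name, and tag below are hypothetical:

```hcl
# Sketch only: the GitLab group/project path, module path, and tag are
# illustrative, not the actual repository layout.
module "subscription_vending" {
  source = "git::https://gitlab.example.com/platform/iac-modules.git//modules/subscription-vending?ref=v1.4.0"

  subscription_name   = "sub-workload-payments-prod"
  management_group_id = var.prod_management_group_id
}
```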
New subscriptions were not created as empty containers. They were vended through code, attached to the correct management group, and received their initial RBAC, policy, and baseline configuration through the same automated path. That made the landing zone easier to scale because new environments inherited the platform model instead of being hand-crafted.
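Inside a vending module, this can be sketched as creating the subscription, attaching it to the right management group, and assigning baseline RBAC in one pass. Billing scope, variables, and the role choice here are hypothetical placeholders:

```hcl
# Sketch only: billing scope, group IDs, and role names are illustrative.
resource "azurerm_subscription" "workload" {
  subscription_name = var.subscription_name
  billing_scope_id  = var.billing_scope_id
}

# Place the new subscription under the correct management group.
resource "azurerm_management_group_subscription_association" "placement" {
  management_group_id = var.management_group_id
  subscription_id     = "/subscriptions/${azurerm_subscription.workload.subscription_id}"
}

# Baseline RBAC for the owning team's Entra ID group.
resource "azurerm_role_assignment" "team_reader" {
  scope                = "/subscriptions/${azurerm_subscription.workload.subscription_id}"
  role_definition_name = "Reader"
  principal_id         = var.team_group_object_id
}
```

Because policy assignments inherit from the management group, the subscription picks up its guardrails from its placement alone.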
These module releases were then propagated to the environment repository. For each change, a corresponding branch (aligned with the task or ticket) was used, and updates triggered the creation of merge requests in the environment repository. This ensured that infrastructure changes were explicitly reviewed and applied in a controlled manner.
The workflow was tightly integrated with GitLab CI/CD pipelines, which handled validation, planning, and application of changes. It was also connected to Jira, allowing changes to be tracked from requirement to implementation. This made it easier for teams to understand the status of infrastructure changes and maintain visibility across the platform.
This approach provided a clear separation between:
infrastructure logic (modules)
environment configuration
state management
It also ensured that all changes were traceable, versioned, and applied in a consistent way across environments.
Overall, the Infrastructure as Code setup allowed the platform to be managed as a structured system rather than a collection of manual configurations, making it easier to scale, maintain, and evolve over time.
14. CI/CD and Deployment Workflow for the Platform
Infrastructure changes were not applied manually, but went through a structured CI/CD workflow to ensure consistency, visibility, and control across the platform.
The workflow was built around GitLab CI/CD pipelines, which handled validation, planning, subscription vending, policy assignment, and applying infrastructure changes. Every change started as a task (tracked in Jira) and was implemented through a dedicated branch aligned with that task.
Changes were first introduced in the modules repository, typically as small, incremental updates (for example, adding a resource group, VPN gateway, or AKS cluster). Each change went through peer review within the team before being merged. The team consisted of four engineers, and while everyone contributed changes, merges to the main branch were controlled to maintain consistency and avoid conflicts.
Once a change was merged into the main branch, a new versioned release of the module was created automatically. This ensured that infrastructure changes were versioned, traceable, and could be consumed in a controlled way.
These module updates were then propagated to the environment repository, where the new version triggered a corresponding branch and merge request. This allowed changes to be reviewed again in the context of specific environments before being applied.
The pipeline followed a clear flow:
validate configuration
vend or update the subscription baseline
generate plan
review changes
apply changes
To improve visibility, the pipeline included tooling that surfaced planned infrastructure changes directly in merge requests, showing what resources would be created, updated, or destroyed. This made it easier for reviewers to understand the impact of changes before approval. The same workflow was also used to assign or update policy sets through code, which kept governance changes reviewable rather than hidden in the portal.
Before applying changes to production, updates were first tested in sandbox or non-production environments. Using tofu apply, changes were validated through pipeline logs, allowing the team to observe exactly what was being created, modified, or removed. Only after this validation were changes promoted to production environments.
For production, additional care was taken with controlled application and review, ensuring that changes were predictable and aligned with expectations.
This workflow ensured that infrastructure changes were:
reviewed (through team peer review and merge requests)
controlled (restricted merge access and staged rollout)
visible (clear plans and logs in CI pipelines)
traceable (linked to Jira tasks and versioned releases)
Overall, the CI/CD approach treated infrastructure as a continuously managed system, with clear processes for validation, review, and promotion across environments.
15. Environment Separation: Dev, Staging, Prod, Sandbox
Environment separation was a core principle of the landing zone design, ensuring that workloads could be developed, tested, and operated without introducing unnecessary risk to production systems.
At a higher level, the key operational split was between NonProd and Prod, even though the workload layer still exposed dev, staging, and prod as separate subscriptions.
Under the Workloads management group, subscriptions were organized by environment:
dev
staging
prod
This structure provided clear isolation between environments, both in terms of infrastructure and access. Development and staging environments formed the non-production side for building and validating changes, while production remained stable and tightly controlled.
The same principle existed in the platform layer, where Identity, Connectivity, Management, and Shared Services had non-production and production boundaries of their own. That allowed platform changes to be validated safely before affecting the live control plane.
Access and governance differed across environments. Non-production environments allowed more flexibility for development and testing, enabling teams to iterate quickly. In contrast, production environments had stricter access controls, tighter governance, more review, and fewer exceptions to reduce risk.
This separation also aligned with the CI/CD workflow. Changes were first applied and validated in sandbox or non-production environments, where infrastructure updates could be tested safely. Only after validation were changes promoted to production, ensuring that deployments were based on tested configurations rather than assumptions.
The Sandboxes management group provided additional isolation for experimentation. The platform team (consisting of four engineers) had access to multiple sandbox subscriptions, which were used for testing new features and infrastructure changes.
To optimize this process, CI pipelines dynamically selected a sandbox subscription where resources were not currently deployed and used it for testing. This allowed parallel experimentation without conflicts between team members.
To avoid unnecessary costs, sandbox resources were treated as ephemeral. Infrastructure deployed for testing was automatically cleaned up using scheduled jobs (cron-based pipelines in GitLab CI), typically running at the end of the day. This ensured that unused resources did not persist beyond their purpose. In cases where longer testing was required, this cleanup behavior could be adjusted or disabled as needed.
Another important aspect was consistency across environments. While access levels and governance differed, the underlying infrastructure patterns remained the same. The same OpenTofu modules and deployment workflows were used across dev, staging, and prod, minimizing drift and ensuring predictable behavior when promoting changes.
Overall, environment separation ensured clear boundaries, controlled risk, and efficient resource usage, supporting both rapid development and stable production operations.
16. Operational Model and Team Responsibilities
Beyond the technical design, it was important to define a clear operational model: who owns what, how changes are made, and how responsibilities are divided across teams.
The platform was managed by a small platform engineering team of four members, responsible for designing, maintaining, and evolving the landing zone and its core components. This included management groups, subscriptions, networking, shared services, and infrastructure modules.
A key principle was clear ownership boundaries. Platform-level resources, such as networking, shared services, and foundational infrastructure, were owned and managed by the platform team. This ensured consistency and avoided fragmentation of critical components.
A useful way to think about the operating model is that the platform team owned the shared control plane, while workload teams owned the runtime behavior of their applications. Even when the platform team provided templates or automation for AKS, messaging, or data services, those components still belonged architecturally to the workload boundary rather than the shared platform layer.
Application teams operated within the workload subscriptions, but direct access to the Azure portal was intentionally limited. Instead of broad access, the focus was on enablement through self-service. The platform provided predefined, reusable patterns (golden templates) that teams could use to deploy their services without needing deep knowledge of Azure, Kubernetes, or underlying infrastructure.
This approach reduced the risk of misconfigurations while also lowering the barrier for teams that were not yet familiar with cloud-native concepts. Rather than requiring every team to understand the full platform, the responsibility was shifted toward the platform team to provide a reliable and easy-to-use interface.
In exceptional cases, break-glass access was available for debugging or emergency scenarios, but this was tightly controlled and not part of normal operations.
Infrastructure changes were handled exclusively through Infrastructure as Code and CI/CD workflows, ensuring that all changes were versioned, reviewed, and consistent. This avoided manual changes in the portal and kept the platform predictable.
The operational model also involved collaboration with internal IT and security teams, particularly around networking, identity, and access decisions. This ensured that the platform aligned with broader organizational requirements rather than operating in isolation.
Overall, the model focused on centralized control with decentralized usage: the platform team owned and operated the infrastructure, while application teams were enabled to use it through standardized, self-service patterns.
17. Key Trade-offs and Decisions
Designing the landing zone involved a number of trade-offs between control, flexibility, and simplicity. Rather than aiming for a "perfect" architecture, the goal was to make practical decisions that aligned with the organization's needs and maturity level.
One of the main trade-offs was between centralized control and team autonomy. Direct access to the Azure portal was limited, and most operations were handled through predefined templates and CI/CD workflows. This reduced the risk of misconfiguration and improved consistency, but also meant that teams relied on the platform layer rather than having full control. Given that many teams were still early in their cloud adoption, this trade-off favored stability and enablement over flexibility.
Another decision was around subscription and environment separation. Splitting environments (dev, staging, prod) across separate subscriptions improved isolation and reduced risk, but introduced additional management overhead. Similarly, separating platform subscriptions into non-production and production added safety, but increased complexity in terms of structure and naming.
There was also a balance between strong governance and developer experience. Applying too many policies or restrictions early on could slow down teams, while too little governance would lead to inconsistency and potential security risks. The approach taken was to introduce guardrails gradually, focusing on practical controls rather than enforcing everything upfront.
In networking, adopting a private-first approach improved security and control, but added complexity in areas such as DNS, connectivity, and troubleshooting. This required additional effort upfront, but provided a more secure and scalable foundation in the long term.
Another trade-off was in shared services vs workload ownership. Centralizing networking, policy, and secrets management improved consistency and control, but I did not want the platform layer to become a dumping ground for runtime dependencies. Components such as Kafka, ActiveMQ, databases, and storage might be common within an application landscape, but they still belonged closer to the workload subscriptions because their scaling, availability, and incident ownership were tied to the applications consuming them.
Finally, the hybrid setup for container registries (GitLab and Azure Container Registry) introduced some duplication in CI/CD pipelines. However, this decision was necessary to maintain compatibility with existing on-premises workflows while enabling a gradual transition toward cloud-native practices.
Overall, these decisions were guided by the principle of building a platform that was secure, scalable, and usable, while acknowledging the constraints of existing systems and team maturity.
18. Challenges Encountered
While the overall structure provided a solid foundation, implementing the landing zone came with several practical challenges, both technical and organizational.
One of the main challenges was operating in a hybrid environment. Existing systems needed to continue functioning on-premises while new workloads were being introduced in Azure. For example, certain applications had to remain operational in their original setup while being gradually migrated and tested in AKS. This required careful coordination to ensure both environments could coexist without disruption.
Networking and connectivity were also complex, particularly with growing requirements. As new regions and external partners were introduced, ensuring reliable and scalable connectivity became more challenging. This led to exploring solutions such as VPN Gateway configurations (including higher-tier SKUs) and addressing NAT and routing considerations to support expanding connectivity needs.
Another significant challenge was adoption and enablement of development teams. Many teams were not familiar with cloud, Kubernetes, or infrastructure concepts. While input from teams was important, it was not always directly actionable; in some cases, requirements reflected existing ways of working rather than future needs. This required balancing feedback with a forward-looking approach, much like the classic observation that users, if asked, tend to request incremental improvements to what they already know rather than a fundamentally better model.
There was also resistance to changing established practices. Some processes had been followed in a certain way for a long time, and moving toward infrastructure as code, self-service models, and cloud-native patterns required a shift in mindset. This was not purely a technical change, but an organizational one.
At the same time, it was important to align with real requirements. While introducing new patterns and improvements, the platform still needed to support existing workflows and constraints. This meant finding a balance between innovation and compatibility, rather than enforcing change too aggressively.
Overall, many of the challenges were not just about designing the platform, but about integrating it into an existing ecosystem: balancing legacy systems, new technologies, and team readiness.
19. Lessons Learned
Looking back, several key lessons stood out from designing and implementing the landing zone.
One of the most important was that structure should follow ownership and operations, not just technical best practices. Decisions around management groups, subscriptions, and access only worked well when they reflected how teams actually operated.
Another key lesson was to keep the design as simple as possible, but not simpler. It is easy to overengineer early, especially when trying to account for future scale. In practice, a clear and understandable structure proved more valuable than a highly complex one.
Access control needs to be designed early. RBAC becomes difficult to fix later, and unclear ownership or overly broad permissions can quickly create problems as the platform grows. Investing time upfront in defining roles and boundaries pays off significantly.
Networking decisions have long-term impact. Address space planning, connectivity models, and private networking choices are difficult to change later. Taking time to validate assumptions, especially with existing on-premises systems, was critical.
Another important lesson was around enablement over control. Instead of giving teams direct access and expecting them to manage infrastructure, providing self-service patterns and templates proved more effective, especially for teams new to cloud and Kubernetes.
Working in a hybrid environment also reinforced the importance of pragmatism over idealism. Not all decisions can follow best practices when existing systems and constraints are involved. Supporting both on-premises and cloud workflows required flexibility and incremental change rather than a complete redesign.
Finally, platform work is as much organizational as it is technical. Aligning with teams, managing expectations, and gradually introducing new ways of working were just as important as the technical design itself.
These lessons helped shape not just the landing zone, but also how the platform evolved over time.
20. What I Would Do Differently
With the benefit of hindsight, there are several areas where the approach could be improved or simplified.
One area is simplifying environment and naming consistency, particularly within platform subscriptions. While separating platform domains across non-production and production added safety, it also introduced some overlap and cognitive overhead. A more streamlined naming approach could achieve the same isolation with less complexity.
Another improvement would be to define and document the operating model earlier. While many decisions were aligned with how teams worked, having clearer documentation and onboarding guidance from the beginning would have made it easier for other teams to understand and adopt the platform.
Governance could also be introduced more progressively but with clearer direction. While avoiding overly strict controls early on helped with flexibility, having a more defined roadmap for governance and policy enforcement would make long-term alignment easier.
In networking, while the design worked well, earlier alignment on future connectivity requirements (such as expanding regions, new partners, and scaling VPN capacity) could have reduced the need for later adjustments.
Another area for improvement is developer onboarding and enablement. While self-service patterns and templates were introduced, investing earlier in documentation, examples, and clear workflows could have reduced the learning curve for teams less familiar with cloud and Kubernetes.
Finally, in a hybrid environment, it would be beneficial to plan the transition strategy more explicitly. Supporting both on-premises and cloud workflows was necessary, but having a clearer roadmap for gradual migration could help reduce complexity over time.
Overall, most improvements are not about changing the core design, but about simplifying, documenting, and aligning earlier, making the platform easier to adopt and evolve.
21. How the Landing Zone Enabled Later Platform Work
Once the landing zone was in place, it provided a stable and predictable foundation for building higher-level platform capabilities.
With clear subscription boundaries and a defined management group structure, onboarding new workloads no longer meant redefining access, governance, or networking each time. A new workload could be placed into the correct subscription model, inherit baseline policies and RBAC, and connect its spoke network to the central connectivity layer, rather than starting from scratch.
The networking foundation enabled secure deployment of services such as AKS, with private connectivity, controlled ingress/egress, and integration with existing systems. Because address spaces, peering patterns, firewall control, and DNS behavior were already defined and validated, new services could be deployed without rethinking network design.
The RBAC and identity model allowed controlled access to both infrastructure and applications. This made it possible to integrate CI/CD pipelines and automation safely, as permissions were already scoped and aligned with responsibilities.
The use of Infrastructure as Code and CI/CD workflows meant that new components such as Kubernetes clusters, networking resources, or shared services could be deployed in a consistent and repeatable way. This significantly reduced the risk of configuration drift and made scaling the platform much easier.
Shared services such as Key Vault, container registries, and centralized networking provided reusable building blocks that application teams could rely on, rather than reimplementing core infrastructure for each workload. At the same time, runtime components such as AKS, messaging, databases, and storage stayed within workload boundaries, which kept ownership clearer when applications were onboarded.
This foundation also enabled the introduction of GitOps patterns and Kubernetes-based workloads, where deployments could be managed in a structured and automated way, building on top of the existing platform.
Overall, the landing zone transformed Azure from a set of individual resources into a cohesive platform, where infrastructure, security, and operations were aligned. This allowed the focus to shift from setting up environments to actually running and scaling workloads.
In the next part, I will go deeper into how this foundation was used to build and operate a Kubernetes platform, including GitOps workflows and application onboarding.
22. Additional Design Considerations
In addition to the core landing zone design, there were several supporting considerations that helped keep the platform consistent and operationally manageable.
Naming conventions and tagging were introduced early to maintain clarity across resources. Subscriptions, resource groups, and services followed consistent naming patterns, while tags such as environment, ownership, and team helped with identification, cost tracking, and operational visibility.
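A small sketch of how such a convention can be encoded in the modules, so naming and tagging are applied consistently rather than remembered. The tag keys and naming pattern below are illustrative, not the organization's actual convention:

```hcl
# Sketch only: tag keys and the naming pattern are illustrative.
locals {
  workload    = "payments"
  environment = "prod"

  common_tags = {
    environment = local.environment
    owner       = "team-payments"
    managed_by  = "opentofu"
    cost_center = "cc-1234"
  }
}

resource "azurerm_resource_group" "app" {
  name     = "rg-${local.workload}-${local.environment}"
  location = var.location
  tags     = local.common_tags
}
```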
At the resource level, a clear structure was followed to separate platform, networking, and application resources. Resource groups were organized based on responsibility and lifecycle, ensuring that shared infrastructure remained distinct from workload-specific components.
Connectivity to on-premises systems was an important aspect of the design. The platform needed to integrate with existing infrastructure while supporting future expansion. This required careful planning of VPN connectivity, address spaces, and DNS resolution, as well as coordination with internal IT teams to avoid conflicts and maintain trust boundaries between environments.
For automation, service principals and managed identities were used instead of user-based access. CI/CD pipelines (GitLab) were granted scoped permissions aligned with their responsibilities, ensuring that infrastructure changes could be applied securely and consistently without exposing unnecessary privileges.
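One way to wire this up without stored credentials is a user-assigned managed identity with a federated credential, so GitLab CI jobs authenticate via OIDC. The GitLab host, project path, and identity names below are hypothetical:

```hcl
# Sketch only: GitLab host, project path, names, and scopes are illustrative.
resource "azurerm_user_assigned_identity" "ci" {
  name                = "id-gitlab-ci-platform"
  location            = var.location
  resource_group_name = "rg-platform-identity"
}

# Trust tokens issued by the GitLab instance for a specific project/branch.
resource "azurerm_federated_identity_credential" "gitlab" {
  name                = "gitlab-main-branch"
  resource_group_name = "rg-platform-identity"
  parent_id           = azurerm_user_assigned_identity.ci.id
  audience            = ["https://gitlab.example.com"]
  issuer              = "https://gitlab.example.com"
  subject             = "project_path:platform/environments:ref_type:branch:ref:main"
}

# Scope the identity to exactly what the pipeline needs.
resource "azurerm_role_assignment" "ci_contributor" {
  scope                = var.platform_subscription_scope
  role_definition_name = "Contributor"
  principal_id         = azurerm_user_assigned_identity.ci.principal_id
}
```

Restricting the federated subject to a branch on a specific project means only pipelines from that branch can assume the identity.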
Basic audit and monitoring considerations were also included, such as ensuring that activity logs, diagnostic settings, and Defender for Cloud coverage were available where needed. While not the primary focus of the landing zone, this provided a foundation for future observability and security monitoring.
These additional elements supported the overall goal of creating a platform that was not only structured and secure, but also maintainable and scalable in day-to-day operations.
Taken together, these decisions helped turn Azure from a collection of cloud resources into a structured operating model that could support secure growth, repeatable delivery, and future platform evolution.