Azure Kubernetes Service (AKS) Production Guide: Complete Enterprise Deployment
Azure Kubernetes Service (AKS) is Microsoft's managed Kubernetes offering that simplifies deploying and managing containerized applications on Azure. This guide walks through cluster creation, Azure service integrations, monitoring, security, and cost optimization for production-grade AKS clusters.
Why Choose Azure AKS?
Key Benefits
- Fully Managed Control Plane: Azure operates the Kubernetes control plane, at no charge on the Free tier
- Azure Integration: Native integration with Azure AD, ACR, Azure Monitor, Key Vault, and more
- Enterprise Features: Virtual nodes, Azure Policy, Defender for Cloud integration
- Cost Efficiency: On the Free tier you pay only for worker nodes; the Standard tier adds a control plane charge in exchange for an uptime SLA
- Developer Tools: Seamless integration with Visual Studio, VS Code, and Azure DevOps
- Hybrid Capabilities: Azure Arc for on-premises Kubernetes management
AKS vs. Other Managed Kubernetes
| Feature | AKS | EKS | GKE |
|---|---|---|---|
| Control Plane Cost | Free tier: free; Standard tier: $0.10/hr | $0.10/hr | $0.10/hr (free-tier credit covers one zonal or Autopilot cluster) |
| Upgrade Process | In-place | Rolling | Auto (Autopilot) |
| Windows Containers | Yes | Yes | Yes (Standard mode) |
| Virtual Nodes | Yes (ACI) | Fargate | Autopilot |
| IDE Integration | Excellent | Good | Good |
Getting Started with AKS
Architecture Overview
┌──────────────────────────────────────────────────────────────┐
│                      Azure Subscription                      │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │        AKS Control Plane (Azure Managed - Free)        │  │
│  │   ┌──────────┐    ┌──────────┐    ┌──────────┐         │  │
│  │   │   API    │    │   etcd   │    │Scheduler │         │  │
│  │   │  Server  │    │          │    │          │         │  │
│  │   └──────────┘    └──────────┘    └──────────┘         │  │
│  └────────────────────────────────────────────────────────┘  │
│                              │                               │
│  ┌───────────────────────────┼────────────────────────────┐  │
│  │ Virtual Network           │                            │  │
│  │                           ▼                            │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │          Node Pools (Your Subscription)          │  │  │
│  │  │                                                  │  │  │
│  │  │   ┌─────────┐   ┌─────────┐   ┌─────────┐        │  │  │
│  │  │   │ Node 1  │   │ Node 2  │   │ Node 3  │        │  │  │
│  │  │   │ (AZ 1)  │   │ (AZ 2)  │   │ (AZ 3)  │        │  │  │
│  │  │   │         │   │         │   │         │        │  │  │
│  │  │   │┌───────┐│   │┌───────┐│   │┌───────┐│        │  │  │
│  │  │   ││ Pods  ││   ││ Pods  ││   ││ Pods  ││        │  │  │
│  │  │   │└───────┘│   │└───────┘│   │└───────┘│        │  │  │
│  │  │   └─────────┘   └─────────┘   └─────────┘        │  │  │
│  │  │                                                  │  │  │
│  │  │   ┌────────────────────────────────────┐         │  │  │
│  │  │   │  Virtual Nodes (Azure Container    │         │  │  │
│  │  │   │  Instances - Serverless)           │         │  │  │
│  │  │   └────────────────────────────────────┘         │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Prerequisites
- Azure CLI 2.50+ installed
- kubectl 1.28+ installed
- Azure subscription with appropriate permissions
- (Optional) Terraform or Bicep for infrastructure as code
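The version floors listed above can be checked before going further. The following is a minimal sketch rather than an official check: `version_ge` and `check_tool` are hypothetical helper names, the comparison assumes a `sort` that supports `-V` version ordering (as in GNU coreutils), and the `sed` expression assumes the JSON layout printed by `kubectl version --client -o json`.

```shell
#!/usr/bin/env bash
# version_ge A B (hypothetical helper): succeeds when version A >= version B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME FOUND_VERSION MIN_VERSION: report whether a tool meets its floor.
check_tool() {
  if version_ge "$2" "$3"; then
    echo "$1 $2 OK (>= $3)"
  else
    echo "$1 $2 is older than $3"
    return 1
  fi
}

# Only query tools that are actually installed.
if command -v az >/dev/null 2>&1; then
  check_tool "azure-cli" "$(az version --query '"azure-cli"' -o tsv)" "2.50" || true
fi

if command -v kubectl >/dev/null 2>&1; then
  # Extract e.g. 1.29.0 from "gitVersion": "v1.29.0"
  kubectl_version="$(kubectl version --client -o json \
    | sed -n 's/.*"gitVersion": *"v\([^"]*\)".*/\1/p' | head -n 1)"
  check_tool "kubectl" "$kubectl_version" "1.28" || true
fi
```

The helper compares whole version strings, so `1.9.0` is correctly treated as older than `1.28`, which a plain string comparison would get wrong.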
Creating Your First AKS Cluster
Using Azure CLI (Quick Start)
# Create resource group
az group create \
  --name rg-aks-production \
  --location eastus

# Create AKS cluster
az aks create \
  --resource-group rg-aks-production \
  --name aks-production-cluster \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --enable-managed-identity \
  --enable-azure-rbac \
  --enable-aad \
  --enable-addons monitoring,azure-policy \
  --network-plugin azure \
  --network-policy azure \
  --zones 1 2 3 \
  --kubernetes-version 1.29.0 \
  --generate-ssh-keys

# Get credentials
az aks get-credentials \
  --resource-group rg-aks-production \
  --name aks-production-cluster

# Verify connection
kubectl get nodes

Using Terraform (Production Recommended)
terraform {
  required_version = ">= 1.6"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.80"
    }
  }
}

provider "azurerm" {
  features {}
}

# Resource Group
resource "azurerm_resource_group" "aks" {
  name     = "rg-aks-production"
  location = "East US"
}

# Virtual Network
resource "azurerm_virtual_network" "aks" {
  name                = "vnet-aks"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  address_space       = ["10.0.0.0/16"]
}

# Subnet for AKS nodes
resource "azurerm_subnet" "aks_nodes" {
  name                 = "snet-aks-nodes"
  resource_group_name  = azurerm_resource_group.aks.name
  virtual_network_name = azurerm_virtual_network.aks.name
  address_prefixes     = ["10.0.1.0/24"]
}

# Subnet for AKS pods (Azure CNI)
resource "azurerm_subnet" "aks_pods" {
  name                 = "snet-aks-pods"
  resource_group_name  = azurerm_resource_group.aks.name
  virtual_network_name = azurerm_virtual_network.aks.name
  address_prefixes     = ["10.0.64.0/18"]

  delegation {
    name = "aks-delegation"

    service_delegation {
      name    = "Microsoft.ContainerService/managedClusters"
      actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
    }
  }
}

# Log Analytics Workspace
resource "azurerm_log_analytics_workspace" "aks" {
  name                = "log-aks-production"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}
# Object ID of the Azure AD group that administers the cluster
# (referenced below as var.aks_admin_group_id)
variable "aks_admin_group_id" {
  type        = string
  description = "Object ID of the Azure AD group granted cluster admin access"
}

# AKS Cluster
resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-production-cluster"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = "aks-prod"

  # A minor-version alias lets the "patch" upgrade channel below move the
  # patch version without Terraform trying to pin it back.
  kubernetes_version = "1.29"

  # Automatically upgrade to latest patch version
  automatic_channel_upgrade = "patch"

  # Network Profile
  network_profile {
    network_plugin    = "azure"
    network_policy    = "azure"
    dns_service_ip    = "10.0.128.10"
    service_cidr      = "10.0.128.0/18"
    load_balancer_sku = "standard"
    outbound_type     = "loadBalancer"
  }

  # Default Node Pool (System)
  default_node_pool {
    name            = "system"
    node_count      = 3
    vm_size         = "Standard_D4s_v5"
    os_disk_size_gb = 128
    # Ephemeral OS disks need a VM size with enough cache or temp storage;
    # confirm the chosen size supports them before applying.
    os_disk_type        = "Ephemeral"
    vnet_subnet_id      = azurerm_subnet.aks_nodes.id
    pod_subnet_id       = azurerm_subnet.aks_pods.id
    zones               = [1, 2, 3]
    enable_auto_scaling = true
    min_count           = 3
    max_count           = 10
    max_pods            = 50

    upgrade_settings {
      max_surge = "33%"
    }

    node_labels = {
      "workload-type" = "system"
    }

    tags = {
      Environment = "Production"
    }
  }

  # Identity
  identity {
    type = "SystemAssigned"
  }

  # Azure AD Integration
  azure_active_directory_role_based_access_control {
    managed                = true
    azure_rbac_enabled     = true
    admin_group_object_ids = [var.aks_admin_group_id]
  }

  # Add-ons
  oms_agent {
    log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
  }

  azure_policy_enabled = true

  key_vault_secrets_provider {
    secret_rotation_enabled  = true
    secret_rotation_interval = "2m"
  }

  # Enable Workload Identity
  workload_identity_enabled = true
  oidc_issuer_enabled       = true

  # Maintenance Windows
  maintenance_window_auto_upgrade {
    frequency   = "Weekly"
    interval    = 1
    duration    = 4
    day_of_week = "Sunday"
    start_time  = "02:00"
    utc_offset  = "+00:00"
  }

  maintenance_window_node_os {
    frequency   = "Weekly"
    interval    = 1
    duration    = 4
    day_of_week = "Sunday"
    start_time  = "02:00"
    utc_offset  = "+00:00"
  }

  tags = {
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}
# User Node Pool - General Purpose
resource "azurerm_kubernetes_cluster_node_pool" "user_general" {
  name                  = "general"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 3
  os_disk_size_gb       = 128
  os_disk_type          = "Ephemeral"
  vnet_subnet_id        = azurerm_subnet.aks_nodes.id
  pod_subnet_id         = azurerm_subnet.aks_pods.id
  zones                 = [1, 2, 3]
  enable_auto_scaling   = true
  min_count             = 3
  max_count             = 20
  max_pods              = 50

  node_labels = {
    "workload-type" = "general"
  }

  tags = {
    Environment = "Production"
  }
}

# User Node Pool - Memory Optimized
resource "azurerm_kubernetes_cluster_node_pool" "user_memory" {
  name                  = "memory"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_E8s_v5"
  node_count            = 2
  enable_auto_scaling   = true
  min_count             = 2
  max_count             = 10
  vnet_subnet_id        = azurerm_subnet.aks_nodes.id
  pod_subnet_id         = azurerm_subnet.aks_pods.id
  zones                 = [1, 2, 3]

  node_labels = {
    "workload-type" = "memory-intensive"
  }

  node_taints = [
    "workload-type=memory-intensive:NoSchedule"
  ]

  tags = {
    Environment = "Production"
  }
}

# User Node Pool - Spot Instances
resource "azurerm_kubernetes_cluster_node_pool" "user_spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D4s_v5"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1 # Pay up to on-demand price
  node_count            = 3
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 20
  vnet_subnet_id        = azurerm_subnet.aks_nodes.id
  pod_subnet_id         = azurerm_subnet.aks_pods.id

  # AKS adds the kubernetes.azure.com/scalesetpriority=spot label to Spot
  # nodes itself; the kubernetes.azure.com prefix is reserved, so only the
  # custom label is declared here.
  node_labels = {
    "workload-type" = "spot"
  }

  # AKS also applies this taint to Spot pools automatically; declaring it
  # keeps the Terraform configuration in line with the actual pool.
  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ]

  tags = {
    Environment = "Production"
  }
}
# Azure Container Registry
resource "azurerm_container_registry" "acr" {
  name                = "acrproduction${random_string.suffix.result}"
  resource_group_name = azurerm_resource_group.aks.name
  location            = azurerm_resource_group.aks.location
  sku                 = "Premium"
  admin_enabled       = false

  georeplications {
    location = "West US"
    tags     = {}
  }

  network_rule_set {
    default_action = "Deny"

    ip_rule {
      action   = "Allow"
      ip_range = "0.0.0.0/0" # Replace with your IP ranges
    }
  }
}

# Grant AKS access to ACR
resource "azurerm_role_assignment" "aks_acr" {
  principal_id                     = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
  role_definition_name             = "AcrPull"
  scope                            = azurerm_container_registry.acr.id
  skip_service_principal_aad_check = true
}

resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

# Outputs
output "cluster_name" {
  value = azurerm_kubernetes_cluster.aks.name
}

output "kube_config" {
  value     = azurerm_kubernetes_cluster.aks.kube_config_raw
  sensitive = true
}

output "oidc_issuer_url" {
  value = azurerm_kubernetes_cluster.aks.oidc_issuer_url
}

Azure-Specific Integrations
1. Azure Active Directory Integration
Workload Identity (Recommended)
# Already enabled in Terraform with:
#   workload_identity_enabled = true
#   oidc_issuer_enabled       = true

# Create Azure AD application
APP_NAME="aks-workload-identity"
APP_ID=$(az ad app create --display-name "$APP_NAME" --query appId -o tsv)

# Create service principal
SP_ID=$(az ad sp create --id "$APP_ID" --query id -o tsv)

# Create Kubernetes service account
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workload-identity-sa
  namespace: default
  annotations:
    azure.workload.identity/client-id: $APP_ID
EOF

# Create federated credential
AKS_OIDC_ISSUER=$(az aks show \
  --resource-group rg-aks-production \
  --name aks-production-cluster \
  --query oidcIssuerProfile.issuerUrl -o tsv)

az ad app federated-credential create \
  --id "$APP_ID" \
  --parameters '{
    "name": "kubernetes-federated-credential",
    "issuer": "'"$AKS_OIDC_ISSUER"'",
    "subject": "system:serviceaccount:default:workload-identity-sa",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Grant permissions (example: Key Vault access)
KEYVAULT_NAME="kv-production"
az keyvault set-policy \
  --name "$KEYVAULT_NAME" \
  --spn "$APP_ID" \
  --secret-permissions get list

# Use workload identity in deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-identity-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: workload-identity-demo
  template:
    metadata:
      labels:
        app: workload-identity-demo
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: workload-identity-sa
      containers:
      - name: app
        image: myapp:latest
        env:
        # "$APP_ID" is a placeholder: a static manifest is not shell-expanded,
        # so substitute the application (client) ID. The workload identity
        # webhook also injects AZURE_CLIENT_ID and AZURE_TENANT_ID into
        # labeled pods, so these entries are optional.
        - name: AZURE_CLIENT_ID
          value: "$APP_ID"
        - name: AZURE_TENANT_ID
          value: "your-tenant-id"

2. Azure Key Vault Integration
Using CSI Secret Store Driver
# Install CSI Secret Store Driver (if not using add-on)
helm repo add csi-secrets-store-provider-azure https://azure.github.io/secrets-store-csi-driver-provider-azure/charts
helm install csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --generate-name \
  --namespace kube-system

# SecretProviderClass
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-keyvault-secrets
  namespace: default
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "$APP_ID" # From workload identity
    keyvaultName: "kv-production"
    cloudName: ""
    objects: |
      array:
        - |
          objectName: database-username
          objectType: secret
          objectVersion: ""
        - |
          objectName: database-password
          objectType: secret
          objectVersion: ""
    tenantId: "your-tenant-id"
  secretObjects:
  - secretName: database-credentials
    type: Opaque
    data:
    - objectName: database-username
      key: username
    - objectName: database-password
      key: password
---
# Use in deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-secrets
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-with-secrets
  template:
    metadata:
      labels:
        app: app-with-secrets
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: workload-identity-sa
      containers:
      - name: app
        image: myapp:latest
        volumeMounts:
        - name: secrets-store
          mountPath: "/mnt/secrets"
          readOnly: true
        env:
        - name: DB_USERNAME
          valueFrom:
            secretKeyRef:
              name: database-credentials
              key: username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: database-credentials
              key: password
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "azure-keyvault-secrets"

3. Azure Load Balancer and Application Gateway
Public Load Balancer Service
apiVersion: v1
kind: Service
metadata:
  name: web-app-lb
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: "rg-aks-production"
    service.beta.kubernetes.io/azure-pip-name: "pip-web-app"
    service.beta.kubernetes.io/azure-dns-label-name: "myapp"
spec:
  type: LoadBalancer
  selector:
    app: web-app
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP

Application Gateway Ingress Controller (AGIC)
# Install AGIC using Helm
helm repo add application-gateway-kubernetes-ingress \
  https://appgwingress.blob.core.windows.net/ingress-azure-helm-package/
helm install ingress-azure \
  application-gateway-kubernetes-ingress/ingress-azure \
  --namespace default \
  --set appgw.subscriptionId=$SUBSCRIPTION_ID \
  --set appgw.resourceGroup=rg-aks-production \
  --set appgw.name=appgw-aks \
  --set armAuth.type=workloadIdentity \
  --set armAuth.identityClientID=$APP_ID

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    appgw.ingress.kubernetes.io/backend-protocol: "http"
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
    appgw.ingress.kubernetes.io/request-timeout: "30"
spec:
  tls:
  - secretName: web-app-tls
    hosts:
    - app.example.com
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-app
            port:
              number: 80

4. Azure Storage Integration
Azure Disk (Persistent Volumes)
# StorageClass for Premium SSD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-ssd
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: Managed
  cachingmode: ReadOnly
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium-ssd
  resources:
    requests:
      storage: 100Gi

Azure Files (Shared Storage)
# StorageClass for Azure Files
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-premium
provisioner: file.csi.azure.com
parameters:
  skuName: Premium_LRS
  location: eastus
  resourceGroup: rg-aks-production
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
mountOptions:
- dir_mode=0777
- file_mode=0777
- uid=0
- gid=0
- mfsymlinks
- cache=strict
- actimeo=30
---
# PersistentVolumeClaim for shared storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-storage
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: azurefile-premium
  resources:
    requests:
      storage: 100Gi

5. Virtual Nodes (Azure Container Instances)
# Enable virtual nodes
az aks enable-addons \
  --resource-group rg-aks-production \
  --name aks-production-cluster \
  --addons virtual-node \
  --subnet-name snet-virtual-nodes

# Deploy to virtual nodes for burst capacity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: burst-workload
spec:
  replicas: 10
  selector:
    matchLabels:
      app: burst-workload
  template:
    metadata:
      labels:
        app: burst-workload
    spec:
      nodeSelector:
        kubernetes.io/role: agent
        type: virtual-kubelet
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: 500m
            memory: 1Gi

Monitoring and Observability
Azure Monitor Container Insights
# Already enabled with Terraform oms_agent add-on
# View metrics in Azure Portal or query with CLI
az monitor metrics list \
  --resource /subscriptions/$SUBSCRIPTION_ID/resourceGroups/rg-aks-production/providers/Microsoft.ContainerService/managedClusters/aks-production-cluster \
  --metric "node_cpu_usage_percentage"

Prometheus and Grafana on AKS
# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=managed-premium-ssd \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=managed-premium-ssd \
  --set grafana.persistence.size=10Gi \
  --set grafana.ingress.enabled=true \
  --set grafana.ingress.ingressClassName=azure-application-gateway \
  --set 'grafana.ingress.hosts[0]=grafana.example.com'

Security Best Practices
1. Azure Policy for AKS
# Azure Policy is already enabled in Terraform
# View built-in policy initiatives
az policy set-definition list --query "[?contains(displayName, 'Kubernetes')]" -o table

# Assign Kubernetes cluster pod security baseline standards
az policy assignment create \
  --name "aks-pod-security-baseline" \
  --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/rg-aks-production" \
  --policy-set-definition "/providers/Microsoft.Authorization/policySetDefinitions/a8640138-9b0a-4a28-b8cb-1666c838647d"

2. Microsoft Defender for Containers
# Enable Defender for Containers
az security pricing create \
  --name Containers \
  --tier Standard

3. Network Security
# Network Policy (Azure CNI Network Policy)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ingress-controller
    ports:
    - protocol: TCP
      port: 8080
  # Egress is limited to the database pods. This also blocks DNS lookups
  # unless a further rule allows traffic to the cluster DNS service.
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432

Cost Optimization
1. Cluster Autoscaler
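# Already configured in Terraform with enable_auto_scaling
# On AKS the cluster autoscaler runs in the managed control plane, so there is
# no deployment to inspect. Check its status ConfigMap instead:
kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml

2. Spot Node Pools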
Already configured in Terraform. Deploy fault-tolerant workloads:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      nodeSelector:
        workload-type: spot
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      containers:
      - name: processor
        image: batch-processor:latest

3. Azure Advisor Recommendations
# Get cost optimization recommendations
az advisor recommendation list \
  --category Cost \
  --query "[?contains(impactedValue, 'aks-production-cluster')]"

Production Checklist
Infrastructure
- Multi-zone cluster deployment
- VNet with proper subnet sizing
- Azure AD integration configured
- Diagnostic settings enabled
- Multiple node pools configured
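The multi-zone item can be spot-checked by counting nodes per availability zone. This is a minimal sketch, assuming nodes carry the standard `topology.kubernetes.io/zone` label; `count_by_zone` is a hypothetical helper, and the node names in the sample input are made up for illustration.

```shell
#!/usr/bin/env bash
# count_by_zone (hypothetical helper): reads "NODE ZONE" lines on stdin
# and prints "ZONE COUNT" for each zone, sorted by zone name.
count_by_zone() {
  awk 'NF >= 2 { count[$2]++ } END { for (z in count) print z, count[z] }' | sort
}

# Illustrative input only; these node names are invented.
printf '%s\n' \
  'aks-system-000000 eastus-1' \
  'aks-system-000001 eastus-2' \
  'aks-system-000002 eastus-3' \
  'aks-general-000000 eastus-1' \
  | count_by_zone
# prints:
#   eastus-1 2
#   eastus-2 1
#   eastus-3 1

# Against a live cluster, feed it real data:
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' \
#     | count_by_zone
```

A zone that is missing from the output, or one holding most of the nodes, suggests the pool is not spread the way the `zones = [1, 2, 3]` setting intends.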
Security
- Workload identity enabled
- Azure Policy configured
- Defender for Containers enabled
- Network policies enforced
- Key Vault integration configured
- Private cluster endpoint (if needed)
Networking
- Application Gateway or Load Balancer configured
- Azure CNI with proper IP planning
- Network policies configured
- Azure DNS or External DNS setup
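For the IP planning item, the subnet sizes from the Terraform configuration can be compared with the autoscaler limits using plain arithmetic. The sketch below rests on stated assumptions: Azure reserves 5 addresses in every subnet, every pool is treated as running 50 pods per node (only the system and general pools set `max_pods = 50` explicitly), and `usable_ips` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# usable_ips PREFIX_LENGTH (hypothetical helper): IPv4 addresses in a subnet
# of that prefix length, minus the 5 addresses Azure reserves per subnet.
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

node_subnet_prefix=24   # snet-aks-nodes, 10.0.1.0/24
pod_subnet_prefix=18    # snet-aks-pods, 10.0.64.0/18
max_pods_per_node=50    # assumption: applied to every pool

# Sum of max_count across the system, general, memory, and spot pools.
max_nodes=$(( 10 + 20 + 10 + 20 ))

node_ips=$(usable_ips "$node_subnet_prefix")
pod_ips=$(usable_ips "$pod_subnet_prefix")
pods_at_full_scale=$(( max_nodes * max_pods_per_node ))

echo "Node subnet: $node_ips usable IPs for up to $max_nodes nodes"
echo "Pod subnet:  $pod_ips usable IPs for up to $pods_at_full_scale pods"
```

Under these assumptions both subnets have room to spare at full scale. Upgrades add temporary surge nodes, so it is worth keeping some headroom in the node subnet beyond the autoscaler maximum.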
Storage
- Azure Disk CSI driver enabled
- Azure Files provisioner configured
- Storage classes defined
- Backup strategy implemented
Monitoring
- Container Insights enabled
- Prometheus and Grafana deployed
- Application Insights integrated
- Alert rules configured
Operations
- Automated upgrade schedule configured
- Cluster autoscaler enabled
- Spot instances for appropriate workloads
- Disaster recovery plan documented
- GitOps pipeline configured
Conclusion
Azure Kubernetes Service provides a comprehensive, enterprise-ready platform for running containerized workloads on Azure. With deep Azure integration, robust security features, and excellent tooling support, AKS is an excellent choice for organizations already invested in the Azure ecosystem.
Success with AKS requires understanding both Kubernetes fundamentals and Azure-specific features. Start with a solid foundation, leverage Azure-native services, and continuously optimize for security, reliability, and cost.