Vladimir Chavkov

Azure Kubernetes Service (AKS) Production Guide: Complete Enterprise Deployment


Azure Kubernetes Service (AKS) is Microsoft’s managed Kubernetes offering that simplifies deploying and managing containerized applications on Azure. This comprehensive guide covers everything needed to build and operate production-grade AKS clusters.

Why Choose Azure AKS?

Key Benefits

- Free managed control plane; you pay only for worker nodes
- Deep Azure AD integration with Azure RBAC for Kubernetes authorization
- Built-in monitoring through Azure Monitor Container Insights
- Policy enforcement with Azure Policy and threat detection with Microsoft Defender for Containers
- Elastic capacity via the cluster autoscaler, spot node pools, and virtual nodes (ACI)

AKS vs. Other Managed Kubernetes

| Feature | AKS | EKS | GKE |
|---|---|---|---|
| Control Plane Cost | Free | $0.10/hr | Free (Autopilot) |
| Upgrade Process | In-place | Rolling | Auto (Autopilot) |
| Windows Containers | Yes | Yes | No |
| Virtual Nodes | Yes (ACI) | Fargate | Autopilot |
| IDE Integration | Excellent | Good | Good |
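
To put the control-plane row in perspective, a quick back-of-envelope calculation (assuming the table's $0.10/hr EKS rate and roughly 730 hours per month) shows the recurring cost the AKS free control plane avoids:

```shell
# Back-of-envelope: monthly EKS control-plane cost at $0.10/hr,
# vs. $0 for the AKS free control plane (rates from the table above)
hours_per_month=730
rate_cents_per_hour=10
monthly_dollars=$(( hours_per_month * rate_cents_per_hour / 100 ))
echo "EKS control plane: ~\$${monthly_dollars}/month per cluster"
```

That works out to roughly $73 per cluster per month, which adds up quickly if you run many small clusters.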

Getting Started with AKS

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│ Azure Subscription │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ AKS Control Plane (Azure Managed - Free) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ API │ │ etcd │ │Scheduler │ │ │
│ │ │ Server │ │ │ │ │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────────┼──────────────────────────────┐ │
│ │ Virtual Network │ │ │
│ │ ▼ │ │
│ │ ┌────────────────────────────────────────────────┐ │ │
│ │ │ Node Pools (Your Subscription) │ │ │
│ │ │ │ │ │
│ │ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │ │
│ │ │ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ │ │
│ │ │ │ (AZ 1) │ │ (AZ 2) │ │ (AZ 3) │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │┌───────┐│ │┌───────┐│ │┌───────┐│ │ │ │
│ │ │ ││ Pods ││ ││ Pods ││ ││ Pods ││ │ │ │
│ │ │ │└───────┘│ │└───────┘│ │└───────┘│ │ │ │
│ │ │ └─────────┘ └─────────┘ └─────────┘ │ │ │
│ │ │ │ │ │
│ │ │ ┌────────────────────────────────────┐ │ │ │
│ │ │ │ Virtual Nodes (Azure Container │ │ │ │
│ │ │ │ Instances - Serverless) │ │ │ │
│ │ │ └────────────────────────────────────┘ │ │ │
│ │ └────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘

Prerequisites

- An Azure subscription with permissions to create resource groups, networking, and AKS clusters
- Azure CLI (az) installed and logged in
- kubectl installed
- Terraform >= 1.6 if following the infrastructure-as-code path
- An Azure AD group for cluster administrators (referenced as var.aks_admin_group_id in the Terraform below)

Creating Your First AKS Cluster

Using Azure CLI (Quick Start)

```bash
# Create resource group
az group create \
  --name rg-aks-production \
  --location eastus

# Create AKS cluster
az aks create \
  --resource-group rg-aks-production \
  --name aks-production-cluster \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --enable-managed-identity \
  --enable-azure-rbac \
  --enable-aad \
  --enable-addons monitoring,azure-policy \
  --network-plugin azure \
  --network-policy azure \
  --zones 1 2 3 \
  --kubernetes-version 1.29.0 \
  --generate-ssh-keys

# Get credentials
az aks get-credentials \
  --resource-group rg-aks-production \
  --name aks-production-cluster

# Verify connection
kubectl get nodes
```
Using Terraform (Production)

main.tf

```hcl
terraform {
  required_version = ">= 1.6"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.80"
    }
  }
}

provider "azurerm" {
  features {}
}

variable "aks_admin_group_id" {
  description = "Object ID of the Azure AD group granted cluster-admin access"
  type        = string
}

# Resource Group
resource "azurerm_resource_group" "aks" {
  name     = "rg-aks-production"
  location = "East US"
}

# Virtual Network
resource "azurerm_virtual_network" "aks" {
  name                = "vnet-aks"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  address_space       = ["10.0.0.0/16"]
}

# Subnet for AKS nodes
resource "azurerm_subnet" "aks_nodes" {
  name                 = "snet-aks-nodes"
  resource_group_name  = azurerm_resource_group.aks.name
  virtual_network_name = azurerm_virtual_network.aks.name
  address_prefixes     = ["10.0.1.0/24"]
}

# Subnet for AKS pods (Azure CNI)
resource "azurerm_subnet" "aks_pods" {
  name                 = "snet-aks-pods"
  resource_group_name  = azurerm_resource_group.aks.name
  virtual_network_name = azurerm_virtual_network.aks.name
  address_prefixes     = ["10.0.64.0/18"]

  delegation {
    name = "aks-delegation"

    service_delegation {
      name    = "Microsoft.ContainerService/managedClusters"
      actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
    }
  }
}

# Log Analytics Workspace
resource "azurerm_log_analytics_workspace" "aks" {
  name                = "log-aks-production"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# AKS Cluster
resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-production-cluster"
  location            = azurerm_resource_group.aks.location
  resource_group_name = azurerm_resource_group.aks.name
  dns_prefix          = "aks-prod"
  kubernetes_version  = "1.29.0"

  # Automatically upgrade to latest patch version
  automatic_channel_upgrade = "patch"

  # Network Profile
  network_profile {
    network_plugin    = "azure"
    network_policy    = "azure"
    dns_service_ip    = "10.0.128.10"
    service_cidr      = "10.0.128.0/18"
    load_balancer_sku = "standard"
    outbound_type     = "loadBalancer"
  }

  # Default Node Pool (System)
  default_node_pool {
    name                = "system"
    node_count          = 3
    vm_size             = "Standard_D4s_v5"
    os_disk_size_gb     = 128
    os_disk_type        = "Ephemeral"
    vnet_subnet_id      = azurerm_subnet.aks_nodes.id
    pod_subnet_id       = azurerm_subnet.aks_pods.id
    zones               = [1, 2, 3]
    enable_auto_scaling = true
    min_count           = 3
    max_count           = 10
    max_pods            = 50

    upgrade_settings {
      max_surge = "33%"
    }

    node_labels = {
      "workload-type" = "system"
    }

    tags = {
      Environment = "Production"
    }
  }

  # Identity
  identity {
    type = "SystemAssigned"
  }

  # Azure AD Integration
  azure_active_directory_role_based_access_control {
    managed                = true
    azure_rbac_enabled     = true
    admin_group_object_ids = [var.aks_admin_group_id]
  }

  # Add-ons
  oms_agent {
    log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
  }

  azure_policy_enabled = true

  key_vault_secrets_provider {
    secret_rotation_enabled  = true
    secret_rotation_interval = "2m"
  }

  # Enable Workload Identity
  workload_identity_enabled = true
  oidc_issuer_enabled       = true

  # Maintenance Windows
  maintenance_window_auto_upgrade {
    frequency   = "Weekly"
    interval    = 1
    duration    = 4
    day_of_week = "Sunday"
    start_time  = "02:00"
    utc_offset  = "+00:00"
  }

  maintenance_window_node_os {
    frequency   = "Weekly"
    interval    = 1
    duration    = 4
    day_of_week = "Sunday"
    start_time  = "02:00"
    utc_offset  = "+00:00"
  }

  tags = {
    Environment = "Production"
    ManagedBy   = "Terraform"
  }
}

# User Node Pool - General Purpose
resource "azurerm_kubernetes_cluster_node_pool" "user_general" {
  name                  = "general"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 3
  os_disk_size_gb       = 128
  os_disk_type          = "Ephemeral"
  vnet_subnet_id        = azurerm_subnet.aks_nodes.id
  pod_subnet_id         = azurerm_subnet.aks_pods.id
  zones                 = [1, 2, 3]
  enable_auto_scaling   = true
  min_count             = 3
  max_count             = 20
  max_pods              = 50

  node_labels = {
    "workload-type" = "general"
  }

  tags = {
    Environment = "Production"
  }
}

# User Node Pool - Memory Optimized
resource "azurerm_kubernetes_cluster_node_pool" "user_memory" {
  name                  = "memory"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_E8s_v5"
  node_count            = 2
  enable_auto_scaling   = true
  min_count             = 2
  max_count             = 10
  vnet_subnet_id        = azurerm_subnet.aks_nodes.id
  pod_subnet_id         = azurerm_subnet.aks_pods.id
  zones                 = [1, 2, 3]

  node_labels = {
    "workload-type" = "memory-intensive"
  }

  node_taints = [
    "workload-type=memory-intensive:NoSchedule"
  ]

  tags = {
    Environment = "Production"
  }
}

# User Node Pool - Spot Instances
resource "azurerm_kubernetes_cluster_node_pool" "user_spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D4s_v5"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1 # Pay up to on-demand price
  node_count            = 3
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 20
  vnet_subnet_id        = azurerm_subnet.aks_nodes.id
  pod_subnet_id         = azurerm_subnet.aks_pods.id

  node_labels = {
    "workload-type"                         = "spot"
    "kubernetes.azure.com/scalesetpriority" = "spot"
  }

  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ]

  tags = {
    Environment = "Production"
  }
}

# Azure Container Registry
resource "azurerm_container_registry" "acr" {
  name                = "acrproduction${random_string.suffix.result}"
  resource_group_name = azurerm_resource_group.aks.name
  location            = azurerm_resource_group.aks.location
  sku                 = "Premium"
  admin_enabled       = false

  georeplications {
    location = "West US"
    tags     = {}
  }

  network_rule_set {
    default_action = "Deny"

    ip_rule {
      action   = "Allow"
      ip_range = "0.0.0.0/0" # Replace with your IP ranges
    }
  }
}

# Grant AKS access to ACR
resource "azurerm_role_assignment" "aks_acr" {
  principal_id                     = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
  role_definition_name             = "AcrPull"
  scope                            = azurerm_container_registry.acr.id
  skip_service_principal_aad_check = true
}

resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

# Outputs
output "cluster_name" {
  value = azurerm_kubernetes_cluster.aks.name
}

output "kube_config" {
  value     = azurerm_kubernetes_cluster.aks.kube_config_raw
  sensitive = true
}

output "oidc_issuer_url" {
  value = azurerm_kubernetes_cluster.aks.oidc_issuer_url
}
```

Azure-Specific Integrations

1. Azure Active Directory Integration

```bash
# Workload identity and the OIDC issuer are already enabled in Terraform with:
#   workload_identity_enabled = true
#   oidc_issuer_enabled       = true

# Create Azure AD application
APP_NAME="aks-workload-identity"
APP_ID=$(az ad app create --display-name $APP_NAME --query appId -o tsv)

# Create service principal
SP_ID=$(az ad sp create --id $APP_ID --query id -o tsv)

# Create Kubernetes service account
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workload-identity-sa
  namespace: default
  annotations:
    azure.workload.identity/client-id: $APP_ID
EOF

# Create federated credential
AKS_OIDC_ISSUER=$(az aks show \
  --resource-group rg-aks-production \
  --name aks-production-cluster \
  --query oidcIssuerProfile.issuerUrl -o tsv)

az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "kubernetes-federated-credential",
    "issuer": "'$AKS_OIDC_ISSUER'",
    "subject": "system:serviceaccount:default:workload-identity-sa",
    "audiences": ["api://AzureADTokenExchange"]
  }'

# Grant permissions (example: Key Vault access)
KEYVAULT_NAME="kv-production"
az keyvault set-policy \
  --name $KEYVAULT_NAME \
  --spn $APP_ID \
  --secret-permissions get list
```

To use workload identity in a deployment, reference the service account and add the `azure.workload.identity/use` label to the pod template:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workload-identity-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: workload-identity-demo
  template:
    metadata:
      labels:
        app: workload-identity-demo
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: workload-identity-sa
      containers:
        - name: app
          image: myapp:latest
          env:
            - name: AZURE_CLIENT_ID
              value: "$APP_ID"
            - name: AZURE_TENANT_ID
              value: "your-tenant-id"
```

2. Azure Key Vault Integration

Using CSI Secret Store Driver

```bash
# Install CSI Secret Store Driver (skip if using the
# key_vault_secrets_provider add-on enabled in Terraform)
helm repo add csi-secrets-store-provider-azure https://azure.github.io/secrets-store-csi-driver-provider-azure/charts
helm install csi-secrets-store-provider-azure/csi-secrets-store-provider-azure \
  --generate-name \
  --namespace kube-system
```

Define a SecretProviderClass that maps Key Vault secrets into the pod and, optionally, into a Kubernetes Secret:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-keyvault-secrets
  namespace: default
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "$APP_ID" # From workload identity
    keyvaultName: "kv-production"
    cloudName: ""
    objects: |
      array:
        - |
          objectName: database-username
          objectType: secret
          objectVersion: ""
        - |
          objectName: database-password
          objectType: secret
          objectVersion: ""
    tenantId: "your-tenant-id"
  secretObjects:
    - secretName: database-credentials
      type: Opaque
      data:
        - objectName: database-username
          key: username
        - objectName: database-password
          key: password
```

Then mount the secrets in a deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-secrets
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-with-secrets
  template:
    metadata:
      labels:
        app: app-with-secrets
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: workload-identity-sa
      containers:
        - name: app
          image: myapp:latest
          volumeMounts:
            - name: secrets-store
              mountPath: "/mnt/secrets"
              readOnly: true
          env:
            - name: DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: password
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: "azure-keyvault-secrets"
```

3. Azure Load Balancer and Application Gateway

Public Load Balancer Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app-lb
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-resource-group: "rg-aks-production"
    service.beta.kubernetes.io/azure-pip-name: "pip-web-app"
    service.beta.kubernetes.io/azure-dns-label-name: "myapp"
spec:
  type: LoadBalancer
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
```
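
For services that should only be reachable inside the virtual network, the same pattern works with the internal load balancer annotation; a minimal sketch (the subnet annotation is optional, and the subnet name here assumes the node subnet from the Terraform above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app-internal
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    # Optional: place the frontend IP in a specific subnet
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "snet-aks-nodes"
spec:
  type: LoadBalancer
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
```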

Application Gateway Ingress Controller (AGIC)

```bash
# Install AGIC using Helm
helm repo add application-gateway-kubernetes-ingress \
  https://appgwingress.blob.core.windows.net/ingress-azure-helm-package/

helm install ingress-azure \
  application-gateway-kubernetes-ingress/ingress-azure \
  --namespace default \
  --set appgw.subscriptionId=$SUBSCRIPTION_ID \
  --set appgw.resourceGroup=rg-aks-production \
  --set appgw.name=appgw-aks \
  --set armAuth.type=workloadIdentity \
  --set armAuth.identityClientID=$APP_ID
```

Route traffic through the Application Gateway with an Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    appgw.ingress.kubernetes.io/backend-protocol: "http"
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
    appgw.ingress.kubernetes.io/request-timeout: "30"
spec:
  tls:
    - secretName: web-app-tls
      hosts:
        - app.example.com
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app
                port:
                  number: 80
```

4. Azure Storage Integration

Azure Disk (Persistent Volumes)

```yaml
# StorageClass for Premium SSD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-ssd
provisioner: disk.csi.azure.com
parameters:
  skuname: Premium_LRS
  kind: Managed
  cachingmode: ReadOnly
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-premium-ssd
  resources:
    requests:
      storage: 100Gi
```

Azure Files (Shared Storage)

```yaml
# StorageClass for Azure Files
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-premium
provisioner: file.csi.azure.com
parameters:
  skuName: Premium_LRS
  location: eastus
  resourceGroup: rg-aks-production
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - actimeo=30
---
# PersistentVolumeClaim for shared storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-storage
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-premium
  resources:
    requests:
      storage: 100Gi
```
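
Because Azure Files supports ReadWriteMany, a single claim can be mounted by every replica at once, which is the main reason to choose it over Azure Disk. A minimal sketch consuming the claim above (the image name is a placeholder):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-storage-consumer
spec:
  replicas: 3
  selector:
    matchLabels:
      app: shared-storage-consumer
  template:
    metadata:
      labels:
        app: shared-storage-consumer
    spec:
      containers:
        - name: app
          image: myapp:latest # placeholder image
          volumeMounts:
            - name: shared
              mountPath: /mnt/shared
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: shared-storage # all replicas share this RWX volume
```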

5. Virtual Nodes (Azure Container Instances)

```bash
# Enable virtual nodes
az aks enable-addons \
  --resource-group rg-aks-production \
  --name aks-production-cluster \
  --addons virtual-node \
  --subnet-name snet-virtual-nodes
```

Deploy to virtual nodes for burst capacity:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: burst-workload
spec:
  replicas: 10
  selector:
    matchLabels:
      app: burst-workload
  template:
    metadata:
      labels:
        app: burst-workload
    spec:
      nodeSelector:
        kubernetes.io/role: agent
        type: virtual-kubelet
      tolerations:
        - key: virtual-kubelet.io/provider
          operator: Exists
      containers:
        - name: app
          image: nginx
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
```

Monitoring and Observability

Azure Monitor Container Insights

```bash
# Container Insights is already enabled via the Terraform oms_agent add-on.
# View metrics in the Azure Portal or query with the CLI:
az monitor metrics list \
  --resource /subscriptions/$SUBSCRIPTION_ID/resourceGroups/rg-aks-production/providers/Microsoft.ContainerService/managedClusters/aks-production-cluster \
  --metric "node_cpu_usage_percentage"
```

Prometheus and Grafana on AKS

```bash
# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=managed-premium-ssd \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=managed-premium-ssd \
  --set grafana.persistence.size=10Gi \
  --set grafana.ingress.enabled=true \
  --set grafana.ingress.ingressClassName=azure-application-gateway \
  --set grafana.ingress.hosts[0]=grafana.example.com
```
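
Once the stack is running, application metrics are scraped by creating ServiceMonitor resources. A sketch, assuming your application's Service exposes a port named metrics and that the chart's default selector matches on the Helm release label:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-app-metrics
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # must match the Helm release name above
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: web-app
  endpoints:
    - port: metrics # assumes a port named "metrics" on the Service
      interval: 30s
```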

Security Best Practices

1. Azure Policy for AKS

```bash
# Azure Policy is already enabled in Terraform
# View built-in policy initiatives
az policy set-definition list --query "[?contains(displayName, 'Kubernetes')]" -o table

# Assign Kubernetes cluster pod security baseline standards
az policy assignment create \
  --name "aks-pod-security-baseline" \
  --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/rg-aks-production" \
  --policy-set-definition "/providers/Microsoft.Authorization/policySetDefinitions/a8640138-9b0a-4a28-b8cb-1666c838647d"
```

2. Microsoft Defender for Containers

```bash
# Enable Defender for Containers
az security pricing create \
  --name Containers \
  --tier Standard
```

3. Network Security

```yaml
# Network Policy (Azure CNI Network Policy)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress-controller
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
```
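
One caveat with the egress rule above: once a pod is selected by any egress policy, everything not explicitly allowed is blocked, including DNS, so name resolution of the database Service will fail. A companion rule permitting DNS to kube-system is usually needed; a sketch (the `kubernetes.io/metadata.name` label is set automatically on namespaces in recent Kubernetes versions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {} # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```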

Cost Optimization

1. Cluster Autoscaler

```bash
# Already configured in Terraform with enable_auto_scaling.
# Note: on AKS the cluster autoscaler runs inside the managed control plane,
# so it does not appear as a kube-system deployment. Check its profile and
# per-pool settings with the CLI instead:
az aks show \
  --resource-group rg-aks-production \
  --name aks-production-cluster \
  --query autoScalerProfile

az aks nodepool show \
  --resource-group rg-aks-production \
  --cluster-name aks-production-cluster \
  --name general \
  --query '{enabled:enableAutoScaling,min:minCount,max:maxCount}'
```

2. Spot Node Pools

Already configured in Terraform. Deploy fault-tolerant workloads:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      nodeSelector:
        workload-type: spot
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: processor
          image: batch-processor:latest
```
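
Spot nodes can be reclaimed with little warning, so it helps to pair spot workloads with a PodDisruptionBudget that keeps a minimum number of replicas up while nodes are drained (this protects against voluntary disruptions such as drains and upgrades, not the eviction itself). A sketch for the deployment above; the minAvailable value is an assumption to tune for your workload:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-processor-pdb
spec:
  minAvailable: 5 # keep at least half of the 10 replicas running during drains
  selector:
    matchLabels:
      app: batch-processor
```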

3. Azure Advisor Recommendations

```bash
# Get cost optimization recommendations
az advisor recommendation list \
  --category Cost \
  --query "[?contains(impactedValue, 'aks-production-cluster')]"
```

Production Checklist

Infrastructure

- Nodes spread across availability zones 1-3
- Separate system and user node pools
- Cluster autoscaler enabled with sensible min/max counts
- Automatic patch upgrades and maintenance windows configured

Security

- Azure AD integration with Azure RBAC enabled
- Workload identity and OIDC issuer enabled
- Azure Policy and Microsoft Defender for Containers enabled
- Secrets sourced from Key Vault via the CSI driver, with rotation on

Networking

- Azure CNI with network policies enforced
- Default deny-all ingress in production namespaces
- ACR network rules restricted to known IP ranges (not 0.0.0.0/0)

Storage

- StorageClasses defined for Premium SSD (zonal) and Azure Files (shared)
- WaitForFirstConsumer binding for zonal disks
- Volume expansion allowed

Monitoring

- Container Insights (Log Analytics) enabled
- Prometheus and Grafana deployed with persistent storage

Operations

- All cluster resources managed through Terraform
- Spot pools tainted and used only for fault-tolerant workloads
- Azure Advisor cost recommendations reviewed regularly

Conclusion

Azure Kubernetes Service provides a comprehensive, enterprise-ready platform for running containerized workloads on Azure. With deep Azure integration, robust security features, and excellent tooling support, AKS is an excellent choice for organizations already invested in the Azure ecosystem.

Success with AKS requires understanding both Kubernetes fundamentals and Azure-specific features. Start with a solid foundation, leverage Azure-native services, and continuously optimize for security, reliability, and cost.


Ready to master Azure Kubernetes Service? Our Azure training programs cover AKS from basics to advanced enterprise patterns. Contact us for customized training designed for your team.

