Azure AKS Networking and Ingress in Production: Practical Guide
AKS is easy to create and deceptively hard to get networking right—especially at scale. Most AKS operational incidents I see are rooted in one of these:
- IP planning mistakes
- Outbound connectivity misunderstandings
- DNS and private cluster misconfiguration
- Ingress controller and TLS operational drift
This guide focuses on production-grade AKS networking and ingress patterns you can standardize across clusters.
Mental model: AKS networking layers
- Node network: VM NICs in an Azure VNet subnet
- Pod network: depends on CNI (kubenet vs Azure CNI)
- Service networking: ClusterIP/NodePort/LoadBalancer + kube-proxy
- Ingress: L7 entrypoint routing to services
- Outbound: NAT for egress (load balancer SNAT, NAT Gateway, or user-defined routing)
Choose the right CNI: kubenet vs Azure CNI
kubenet (legacy/simple)
- Pods get IPs from a separate range and are NATed behind node IPs
- Smaller VNet IP consumption
- Historically simpler for small clusters
Trade-offs:
- More moving parts
- NAT can become a bottleneck
- Some advanced scenarios are harder
Azure CNI (recommended for most production clusters)
- Pods get IPs from your VNet subnet (routable in the VNet)
- Easier integration with Azure networking features
- Often better for enterprise networking requirements
Trade-offs:
- Consumes VNet IPs heavily (planning matters)
- Subnet exhaustion is a very real production failure mode
IP planning: the most important design step
You must plan:
- Node subnet size (VM NICs)
- Pod IP consumption (Azure CNI)
- Service CIDR (cluster internal services)
- DNS service IP (must be inside service CIDR)
Quick rules of thumb
- Plan subnets for growth, not current size
- For Azure CNI, allocate enough IPs for:
  (max nodes) * (max pods per node) + operational buffer

In practice, allocate a subnet much larger than your immediate need.
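As a sanity check, the rule above can be computed directly. The numbers here (50 nodes, 30 pods per node, a 500-IP buffer) are hypothetical placeholders, not recommendations:

```shell
# Back-of-envelope Azure CNI subnet sizing; all numbers are hypothetical
MAX_NODES=50          # planned maximum node count
MAX_PODS_PER_NODE=30  # the --max-pods setting per node
BUFFER=500            # headroom for surge nodes, upgrades, internal LBs
# Each node consumes one IP for its NIC plus one per pod slot
REQUIRED=$(( MAX_NODES + MAX_NODES * MAX_PODS_PER_NODE + BUFFER ))
echo "Plan for at least ${REQUIRED} IPs in the node subnet"
```

Round up to the next subnet size with room to spare; Azure also reserves five addresses in every subnet, so a subnet that "just fits" will not.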
Outbound traffic: understand your egress path
In AKS, outbound behavior depends on how the cluster is set up.
Common outbound types
- Load balancer SNAT (default in many setups)
- NAT Gateway (highly recommended for production)
- User-defined routing (UDR) via firewall/NVA
Why NAT Gateway is often best
- Predictable SNAT behavior
- Scales better than default LB SNAT
- Stable outbound IPs for allowlists
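A minimal wiring sketch with the Azure CLI; the resource names (`rg-net`, `vnet-aks`, `snet-aks-nodes`) are hypothetical:

```shell
# Sketch: attach a NAT Gateway to the AKS node subnet (names are hypothetical)
az network public-ip create -g rg-net -n pip-natgw --sku Standard
az network nat gateway create -g rg-net -n natgw-aks \
  --public-ip-addresses pip-natgw --idle-timeout 4
az network vnet subnet update -g rg-net --vnet-name vnet-aks \
  -n snet-aks-nodes --nat-gateway natgw-aks
```

New clusters can also opt in at creation time with `--outbound-type userAssignedNATGateway` on `az aks create`.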
Private AKS clusters
Private clusters reduce exposure by keeping the API server private.
Operational considerations:
- You need private DNS to resolve the API server
- Your CI/CD and SRE access paths need private connectivity (VPN/ExpressRoute/bastion)
- Plan for troubleshooting when the API endpoint isn’t reachable from the public internet
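For reference, a private cluster is typically created with flags along these lines (resource names hypothetical; `--private-dns-zone system` lets AKS manage the private DNS zone for the API server):

```shell
# Sketch: private API server with AKS-managed private DNS (names hypothetical)
az aks create -g rg-aks -n aks-prod \
  --enable-private-cluster \
  --private-dns-zone system
```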
DNS in AKS: common pitfalls
CoreDNS basics
- Pods use the cluster DNS service
- The kube-dns service IP is inside the service CIDR
Common issues:
- Upstream DNS timeouts
- Split-horizon DNS surprises in private clusters
- Excessive DNS query volume from misbehaving apps
Signals:
- Intermittent service discovery failures
- SERVFAIL/timeout errors in app logs
- CPU spikes on CoreDNS pods
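For the split-horizon and upstream-forwarding cases, AKS lets you extend CoreDNS through a `coredns-custom` ConfigMap instead of editing the managed config. The zone and forwarder IPs below are hypothetical:

```yaml
# Hypothetical conditional forwarder for an on-prem DNS zone
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom   # this name and namespace are what AKS looks for
  namespace: kube-system
data:
  onprem.server: |
    corp.example.com:53 {
        errors
        cache 30
        forward . 10.240.0.4 10.240.0.5
    }
```

CoreDNS pods need a restart to pick up the change (`kubectl -n kube-system rollout restart deployment coredns`).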
Ingress options in AKS
You usually pick one of:
- NGINX Ingress Controller (most common)
- Azure Application Gateway Ingress Controller (AGIC)
NGINX Ingress Controller
Pros:
- Kubernetes-native and portable
- Mature ecosystem, lots of examples
- Great for multi-tenant routing patterns
Cons:
- You operate it like any controller (upgrades, tuning)
- Need to handle WAF separately if required
AGIC (Application Gateway)
Pros:
- Integrates with Application Gateway features
- Works well with orgs standardizing on App Gateway + WAF
- Centralized L7 policies
Cons:
- More Azure-specific
- Reconciliation issues can be more opaque
- Requires careful role and subnet configuration
Production pattern: NGINX Ingress + cert-manager
This is a widely used approach for flexible ingress + automated TLS.
Example: NGINX Ingress install (Helm)
```shell
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.service.type=LoadBalancer
```

Example: Ingress resource

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```

Production pattern: AGIC (high-level checklist)
Key items that usually cause incidents:
- App Gateway subnet is dedicated (don’t mix with nodes)
- Correct permissions for AGIC identity
- Correct listeners, backend pools, and health probes
- Ensure NSGs allow health probe traffic
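For comparison with the NGINX example, a minimal AGIC-style Ingress looks like this; the health-probe annotation and hostnames are assumptions about a typical setup, not taken from a specific cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: app
  annotations:
    # Point App Gateway's health probe at a real application endpoint
    appgw.ingress.kubernetes.io/health-probe-path: /healthz
spec:
  ingressClassName: azure-application-gateway
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```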
TLS strategy
Recommended standard:
- Use cert-manager to automate certificates (ACME/Let's Encrypt or internal PKI)
- Store certs in Kubernetes secrets
- Enforce TLS 1.2+ and strong ciphers at the ingress layer
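A typical cert-manager issuer for the ACME/Let's Encrypt path looks like the sketch below (the email is a placeholder; an internal PKI would use a CA issuer instead):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```

With an Ingress annotated `cert-manager.io/cluster-issuer: letsencrypt-prod`, cert-manager provisions and renews the referenced TLS secret automatically.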
Operational best practices:
- Alert on certificate expiry
- Standardize domains and wildcard usage
- Prefer external-dns to manage DNS records consistently
Troubleshooting runbook
1) Debug ingress routing
```shell
kubectl -n ingress-nginx get pods
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=200
kubectl -n app describe ingress app
kubectl -n app get endpoints app
```

2) Debug LoadBalancer provisioning

```shell
kubectl -n ingress-nginx get svc
kubectl -n ingress-nginx describe svc ingress-nginx-controller
```

Common causes:
- Subnet/NSG restrictions
- Quota limits
- Misconfigured cloud-provider integration
3) Debug DNS inside the cluster
```shell
kubectl run -it --rm dnsutils \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sh

# then, inside the pod shell:
nslookup kubernetes.default.svc.cluster.local
nslookup app.example.com
```

4) Debug outbound connectivity

```shell
kubectl run -it --rm curl \
  --image=curlimages/curl:8.5.0 \
  --restart=Never -- sh

# then, inside the pod shell:
curl -I https://example.com
```

If egress fails, check:
- Route tables (UDR)
- NAT Gateway association
- Firewall rules
- Azure policy restrictions
Guardrails to standardize across clusters
Use a platform module to create:
- VNet + subnets
- NAT Gateway (or firewall) for egress
- Standard node pools
- Standard ingress and TLS automation
Enforce:
- naming conventions
- tags
- RBAC and workload identity baseline
- network policies where required
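Where network policies are required, a per-namespace default-deny baseline is a common starting point. This assumes a network policy engine (Azure NPM or Calico) is enabled on the cluster, and the namespace name is hypothetical:

```yaml
# Deny all ingress to pods in the namespace unless another policy allows it
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app   # hypothetical namespace
spec:
  podSelector: {}  # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
```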
Conclusion
AKS production networking is mostly about preventing avoidable problems: IP exhaustion, unpredictable outbound SNAT, private DNS drift, and ingress/TLS operational inconsistency. Standardize your CNI choice, size your subnets conservatively, make outbound explicit (NAT Gateway/UDR), and adopt a repeatable ingress + cert automation pattern.