Azure AKS Networking and Ingress in Production: Practical Guide
AKS is easy to create and deceptively hard to get networking right—especially at scale. Most AKS operational incidents I see are rooted in one of these:
- IP planning mistakes
- Outbound connectivity misunderstandings
- DNS and private cluster misconfiguration
- Ingress controller and TLS operational drift
This guide focuses on production-grade AKS networking and ingress patterns you can standardize across clusters.
Mental model: AKS networking layers
- Node network: VM NICs in an Azure VNet subnet
- Pod network: depends on CNI (kubenet vs Azure CNI)
- Service networking: ClusterIP/NodePort/LoadBalancer + kube-proxy
- Ingress: L7 entrypoint routing to services
- Outbound: NAT for egress (load balancer SNAT, NAT Gateway, or user-defined routing)
Choose the right CNI: kubenet vs Azure CNI
kubenet (legacy/simple)
- Pods get IPs from a separate range and are NATed behind node IPs
- Smaller VNet IP consumption
- Historically simpler for small clusters
Trade-offs:
- More moving parts
- NAT can become a bottleneck
- Some advanced scenarios are harder
Azure CNI (recommended for most production clusters)
- Pods get IPs from your VNet subnet (routable in the VNet)
- Easier integration with Azure networking features
- Often better for enterprise networking requirements
Trade-offs:
- Consumes VNet IPs heavily (planning matters)
- Subnet exhaustion is a very real production failure mode
IP planning: the most important design step
You must plan:
- Node subnet size (VM NICs)
- Pod IP consumption (Azure CNI)
- Service CIDR (cluster internal services)
- DNS service IP (must be inside service CIDR)
Quick rules of thumb
- Plan subnets for growth, not current size
- For Azure CNI, allocate enough IPs for:
  (max nodes) * (max pods per node) + operational buffer

In practice, allocate a subnet much larger than your immediate need.
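As a sanity check, the rule above can be computed directly. The numbers here (50 nodes, 30 pods per node, a 500-IP buffer) are hypothetical placeholders, not recommendations:

```shell
# Back-of-envelope Azure CNI subnet sizing; all numbers are hypothetical
MAX_NODES=50          # planned maximum node count
MAX_PODS_PER_NODE=30  # the --max-pods setting per node
BUFFER=500            # headroom for surge nodes, upgrades, internal LBs
# Each node consumes one IP for its NIC plus one per pod slot
REQUIRED=$(( MAX_NODES + MAX_NODES * MAX_PODS_PER_NODE + BUFFER ))
echo "Plan for at least ${REQUIRED} IPs in the node subnet"
```

Round up to the next subnet size with room to spare; Azure also reserves five addresses in every subnet, so a subnet that "just fits" will not.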
Outbound traffic: understand your egress path
In AKS, outbound behavior depends on how the cluster is set up.
Common outbound types
- Load balancer SNAT (default in many setups)
- NAT Gateway (highly recommended for production)
- User-defined routing (UDR) via firewall/NVA
Why NAT Gateway is often best
- Predictable SNAT behavior
- Scales better than default LB SNAT
- Stable outbound IPs for allowlists
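A minimal wiring sketch with the Azure CLI; the resource names (`rg-net`, `vnet-aks`, `snet-aks-nodes`) are hypothetical:

```shell
# Sketch: attach a NAT Gateway to the AKS node subnet (names are hypothetical)
az network public-ip create -g rg-net -n pip-natgw --sku Standard
az network nat gateway create -g rg-net -n natgw-aks \
  --public-ip-addresses pip-natgw --idle-timeout 4
az network vnet subnet update -g rg-net --vnet-name vnet-aks \
  -n snet-aks-nodes --nat-gateway natgw-aks
```

New clusters can also opt in at creation time with `--outbound-type userAssignedNATGateway` on `az aks create`.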
Private AKS clusters
Private clusters reduce exposure by keeping the API server private.
Operational considerations:
- You need private DNS to resolve the API server
- Your CI/CD and SRE access paths need private connectivity (VPN/ExpressRoute/bastion)
- Plan for troubleshooting when the API endpoint isn’t reachable from the public internet
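For reference, a private cluster is typically created with flags along these lines (resource names hypothetical; `--private-dns-zone system` lets AKS manage the private DNS zone for the API server):

```shell
# Sketch: private API server with AKS-managed private DNS (names hypothetical)
az aks create -g rg-aks -n aks-prod \
  --enable-private-cluster \
  --private-dns-zone system
```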
DNS in AKS: common pitfalls
CoreDNS basics
- Pods use the cluster DNS service
- The kube-dns service IP is inside the service CIDR
Common issues:
- Upstream DNS timeouts
- Split-horizon DNS surprises in private clusters
- Excessive DNS query volume from misbehaving apps
Signals:
- Intermittent service discovery failures
- SERVFAIL/timeout errors in app logs
- CPU spikes on CoreDNS pods
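For the split-horizon and upstream-forwarding cases, AKS lets you extend CoreDNS through a `coredns-custom` ConfigMap instead of editing the managed config. The zone and forwarder IPs below are hypothetical:

```yaml
# Hypothetical conditional forwarder for an on-prem DNS zone
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom   # this name and namespace are what AKS looks for
  namespace: kube-system
data:
  onprem.server: |
    corp.example.com:53 {
        errors
        cache 30
        forward . 10.240.0.4 10.240.0.5
    }
```

CoreDNS pods need a restart to pick up the change (`kubectl -n kube-system rollout restart deployment coredns`).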
Ingress options in AKS
You usually pick one of:
- NGINX Ingress Controller (most common)
- Azure Application Gateway Ingress Controller (AGIC)
NGINX Ingress Controller
Pros:
- Kubernetes-native and portable
- Mature ecosystem, lots of examples
- Great for multi-tenant routing patterns
Cons:
- You operate it like any controller (upgrades, tuning)
- Need to handle WAF separately if required
AGIC (Application Gateway)
Pros:
- Integrates with Application Gateway features
- Works well with orgs standardizing on App Gateway + WAF
- Centralized L7 policies
Cons:
- More Azure-specific
- Reconciliation issues can be more opaque
- Requires careful role and subnet configuration
Production pattern: NGINX Ingress + cert-manager
This is a widely used approach for flexible ingress + automated TLS.
Example: NGINX Ingress install (Helm)
```shell
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.service.type=LoadBalancer
```

Example: Ingress resource

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```

Production pattern: AGIC (high-level checklist)
Key items that usually cause incidents:
- App Gateway subnet is dedicated (don’t mix with nodes)
- Correct permissions for AGIC identity
- Correct listeners, backend pools, and health probes
- Ensure NSGs allow health probe traffic
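For comparison with the NGINX example, a minimal AGIC-style Ingress looks like this; the health-probe annotation and hostnames are assumptions about a typical setup, not taken from a specific cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: app
  annotations:
    # Point App Gateway's health probe at a real application endpoint
    appgw.ingress.kubernetes.io/health-probe-path: /healthz
spec:
  ingressClassName: azure-application-gateway
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```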
TLS strategy
Recommended standard:
- Use cert-manager to automate certificates (ACME/Let's Encrypt or internal PKI)
- Store certs in Kubernetes secrets
- Enforce TLS 1.2+ and strong ciphers at the ingress layer
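A typical cert-manager issuer for the ACME/Let's Encrypt path looks like the sketch below (the email is a placeholder; an internal PKI would use a CA issuer instead):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```

With an Ingress annotated `cert-manager.io/cluster-issuer: letsencrypt-prod`, cert-manager provisions and renews the referenced TLS secret automatically.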
Operational best practices:
- Alert on certificate expiry
- Standardize domains and wildcard usage
- Prefer external-dns to manage DNS records consistently
Troubleshooting runbook
1) Debug ingress routing
```shell
kubectl -n ingress-nginx get pods
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=200
kubectl -n app describe ingress app
kubectl -n app get endpoints app
```

2) Debug LoadBalancer provisioning

```shell
kubectl -n ingress-nginx get svc
kubectl -n ingress-nginx describe svc ingress-nginx-controller
```

Common causes:
- Subnet/NSG restrictions
- Quota limits
- Misconfigured cloud-provider integration
3) Debug DNS inside the cluster
```shell
kubectl run -it --rm dnsutils \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sh

# then, inside the pod shell:
nslookup kubernetes.default.svc.cluster.local
nslookup app.example.com
```

4) Debug outbound connectivity

```shell
kubectl run -it --rm curl \
  --image=curlimages/curl:8.5.0 \
  --restart=Never -- sh

# then, inside the pod shell:
curl -I https://example.com
```

If egress fails, check:
- Route tables (UDR)
- NAT Gateway association
- Firewall rules
- Azure policy restrictions
Guardrails to standardize across clusters
Use a platform module to create:
- VNet + subnets
- NAT Gateway (or firewall) for egress
- Standard node pools
- Standard ingress and TLS automation
Enforce:
- naming conventions
- tags
- RBAC and workload identity baseline
- network policies where required
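Where network policies are required, a per-namespace default-deny baseline is a common starting point. This assumes a network policy engine (Azure NPM or Calico) is enabled on the cluster, and the namespace name is hypothetical:

```yaml
# Deny all ingress to pods in the namespace unless another policy allows it
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app   # hypothetical namespace
spec:
  podSelector: {}  # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
```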
Conclusion
AKS production networking is mostly about preventing avoidable problems: IP exhaustion, unpredictable outbound SNAT, private DNS drift, and ingress/TLS operational inconsistency. Standardize your CNI choice, size your subnets conservatively, make outbound explicit (NAT Gateway/UDR), and adopt a repeatable ingress + cert automation pattern.