Amazon EKS Upgrades and Day-2 Operations: Practical Production Guide
If you’re already running Amazon EKS, the hardest part isn’t creating a cluster—it’s operating it reliably over time. “Day-2” EKS work includes upgrades, node rotations, add-on lifecycle management, autoscaling, IAM integration, security posture, and repeatable runbooks.
This guide focuses on practical, low-risk operational patterns you can apply to existing clusters.
Core principles for safe EKS operations
1) Treat upgrades as a routine, not an emergency
- Upgrade frequently rather than infrequently
- Prefer small version jumps (e.g., 1.27 -> 1.28) to reduce unknowns
- Automate pre-checks and standardize change windows
- Always rotate nodes as part of upgrades (don’t keep ancient AMIs)
2) Standardize your baseline components
A stable baseline reduces upgrade complexity:
- Managed node groups (or Karpenter) for node lifecycle
- IRSA for pods needing AWS APIs
- EKS managed add-ons when possible (VPC CNI, CoreDNS, kube-proxy)
- Ingress via AWS Load Balancer Controller (ALB/NLB)
- Metrics + logging baked in
Upgrade overview: what actually needs upgrading
In EKS, you’ll typically upgrade:
- Control plane version (EKS-managed)
- Worker nodes (node group AMI / launch template)
- Managed add-ons (VPC CNI, CoreDNS, kube-proxy)
- Cluster add-ons (AWS LB Controller, ExternalDNS, CSI driver, etc.)
A good default upgrade order is:
- Validate cluster readiness (PDBs, capacity, compatibility)
- Upgrade control plane
- Upgrade managed add-ons
- Rotate/upgrade worker nodes
- Upgrade third-party controllers/operators
- Post-upgrade verification and burn-in
Pre-upgrade checklist
1) Kubernetes API deprecations and compatibility
- Confirm the Kubernetes version’s removed APIs.
- Check add-ons compatibility matrices.
Useful signals:
- Admission webhook failures
- Controllers stuck in CrashLoop
- Deprecated API usage from manifests
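A minimal sketch of the manifest-scanning pre-check, assuming your rendered manifests are available as plain YAML (tools like Fairwinds' pluto do this more thoroughly; the sample manifest and the single removed API below are illustrative only):

```shell
# Write a sample manifest to scan; in practice, point this at your rendered GitOps output.
cat > /tmp/sample-manifest.yaml <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
EOF

# flowcontrol.apiserver.k8s.io/v1beta2 is removed in Kubernetes 1.29.
# Extend this list from the official deprecated-API migration guide for your target version.
for removed in "flowcontrol.apiserver.k8s.io/v1beta2"; do
  if grep -q "apiVersion: ${removed}" /tmp/sample-manifest.yaml; then
    echo "REMOVED API IN USE: ${removed}"
  fi
done
```

A grep-based check will miss APIs used only by controllers at runtime, so pair it with apiserver audit logs or deprecation metrics before the actual upgrade.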
2) PodDisruptionBudgets and drainability
Un-drainable nodes are the #1 upgrade/rotation killer.
- Ensure workloads have >1 replica where HA is needed
- Validate PDBs allow at least one pod to be disrupted
Example PDB:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

3) Capacity headroom
You need spare capacity during rotations.
- Target: at least 20-30% headroom for critical namespaces
- Ensure cluster autoscaler/Karpenter can scale up
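The headroom target above is simple arithmetic. A sketch with illustrative numbers (in practice, sum allocatable CPU and pod requests from `kubectl describe nodes` or your metrics pipeline):

```shell
# Illustrative headroom math: allocatable vs. requested CPU across a node group.
allocatable_mcpu=16000   # e.g., 4 nodes x 4000m allocatable each
requested_mcpu=11200     # sum of CPU requests for pods scheduled on those nodes
headroom_pct=$(( (allocatable_mcpu - requested_mcpu) * 100 / allocatable_mcpu ))
echo "CPU headroom: ${headroom_pct}%"   # 30% here, within the 20-30% target
```

Run the same math for memory; whichever resource has less headroom is your real constraint during a rotation.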
4) Operational safety controls
- Confirm alerting works (PagerDuty/Slack etc.)
- Define rollback paths (node group rollback + app rollback; note that the EKS control plane itself cannot be downgraded)
- Snapshot critical config if needed
Control plane upgrade (EKS)
You can upgrade via console, eksctl, Terraform, or AWS CLI.
Example using eksctl
```shell
eksctl upgrade cluster \
  --name my-eks \
  --region us-east-1 \
  --version 1.29
```

Verify control plane version

```shell
aws eks describe-cluster \
  --name my-eks \
  --region us-east-1 \
  --query 'cluster.version' \
  --output text
```

Managed add-ons upgrades
EKS managed add-ons reduce operational burden but still require conscious lifecycle management.
List add-ons
```shell
aws eks list-addons --cluster-name my-eks --region us-east-1
```

Describe add-on versions

```shell
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.29 \
  --region us-east-1
```

Upgrade add-on

```shell
aws eks update-addon \
  --cluster-name my-eks \
  --addon-name vpc-cni \
  --addon-version <version-from-describe-addon-versions> \
  --resolve-conflicts OVERWRITE \
  --region us-east-1
```

Recommended order:
1. kube-proxy
2. vpc-cni
3. coredns
Node rotation strategies
Strategy A: Managed Node Group rolling update (simple)
If you use managed node groups, update the node group version/AMI and allow EKS to roll.
```shell
eksctl upgrade nodegroup \
  --cluster my-eks \
  --name ng-general \
  --kubernetes-version 1.29
```

Strategy B: Blue/green node groups (safer)
- Create a new node group ng-general-v2
- Cordon/drain old nodes
- Delete old node group
This is excellent for:
- Big AMI jumps
- Instance family changes (e.g., m5 -> m7g)
- Migrating from Bottlerocket/AL2 to another base
Drain nodes safely
```shell
kubectl cordon <node>
kubectl drain <node> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=10m
```

Autoscaling: Cluster Autoscaler vs Karpenter
Cluster Autoscaler
- Integrates well with managed node groups
- Simpler mental model
- Scaling speed depends on node group constraints
Karpenter
- Faster, more flexible provisioning
- Better binpacking, diverse instance selection
- Great for cost optimization and bursty workloads
Day-2 guidance:
- Use Karpenter for stateless compute and CA for stable node groups (or go all-in on Karpenter once mature)
- Separate critical system workloads onto a stable node group
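For the Karpenter route, a hedged sketch of a NodePool for stateless compute (field names follow the Karpenter v1 API; verify against the version you actually run, and treat the pool name, limits, and EC2NodeClass reference as placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        # Mix spot and on-demand for cost; restrict further for stateful workloads
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"   # hard cap on total provisioned CPU for this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

Keeping system controllers on a separate stable node group means a bad NodePool change cannot take out the components you need to fix it.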
IAM for Service Accounts (IRSA)
IRSA is foundational for day-2 operations.
Example: service account annotated for IRSA
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: networking
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/external-dns-irsa
```

Best practices:
- One role per controller/app where practical
- Tight IAM permissions (least privilege)
- Use AWS managed policies only when their scope is acceptable; otherwise write custom policies
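The IAM role side of IRSA is the part most often misconfigured. A sketch of the trust policy that pins the role to one service account (the account ID matches the example above; `EXAMPLE` stands in for your cluster's OIDC provider ID):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:networking:external-dns",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```

The `sub` condition is what enforces "one role per controller": without it, any pod in the cluster could assume the role.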
Networking operations
VPC CNI tuning
Common day-2 tasks:
- Avoid IP exhaustion
- Configure prefix delegation where appropriate
- Separate node groups across subnets/AZs
Signals of IP exhaustion:
- Pods stuck in ContainerCreating
- CNI errors in aws-node logs
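To reason about whether exhaustion is plausible, it helps to know the per-node IP budget. A sketch using AWS's published formula for max pods without prefix delegation, with m5.large figures (3 ENIs, 10 IPv4 addresses per ENI):

```shell
# Max pods without prefix delegation: ENIs * (IPs per ENI - 1) + 2
# (one IP per ENI is reserved as the primary; +2 covers host-networking pods)
enis=3
ips_per_eni=10
max_pods=$(( enis * (ips_per_eni - 1) + 2 ))
echo "max pods without prefix delegation: ${max_pods}"   # 29 for m5.large
```

With prefix delegation enabled on Nitro instances, each secondary slot hands out a /28 (16 IPs) instead of a single address, which is why it is the usual fix when pod density, not subnet size, is the bottleneck.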
Observability runbook (minimum viable)
At minimum you want:
- Control plane logs enabled
- Container logs centralized
- Cluster metrics and dashboards
- Alerting for node/pod saturation
Enable EKS control plane logs
```shell
aws eks update-cluster-config \
  --name my-eks \
  --region us-east-1 \
  --logging 'clusterLogging=[{types=[api,audit,authenticator,controllerManager,scheduler],enabled=true}]'
```

Backup and disaster recovery
- Use Velero for Kubernetes resources + PV snapshots (if supported)
- Back up critical secrets and GitOps source of truth
- Practice restore to a staging cluster
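A sketch of a recurring Velero backup, assuming Velero is installed in the `velero` namespace (the schedule name, cron expression, and retention are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 3 * * *"        # daily at 03:00 UTC
  template:
    includedNamespaces: ["*"]  # narrow this to critical namespaces if backups are large
    ttl: 720h0m0s              # retain 30 days
```

A backup you have never restored is a hypothesis, not a DR plan; schedule the staging-cluster restore drill alongside the backup itself.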
Security hardening (day-2 essentials)
- Private API endpoint where possible
- Restrict system:masters mapping
- Use Pod Security Standards (or policy engine)
- Scan images and enforce signed images if required
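Pod Security Standards are applied per namespace via labels. A minimal example (the namespace name is a placeholder; `enforce` blocks violating pods while `warn` only surfaces them):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```

Rolling out `warn` cluster-wide first, then flipping namespaces to `enforce`, avoids breaking existing workloads in one step.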
Example: restrict public endpoint
```shell
aws eks update-cluster-config \
  --name my-eks \
  --region us-east-1 \
  --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true
```

Post-upgrade verification checklist
- kubectl get nodes shows expected version and readiness
- CoreDNS healthy
- Ingress controller healthy and can reconcile
- Critical apps pass smoke tests
- No elevated error rate / latency regressions
- Cluster autoscaler/Karpenter functioning
- Run a 24h burn-in period for non-trivial upgrades
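The first checklist item is easy to script. A sketch that checks every node reports the expected kubelet minor version; the sample data stands in for real `kubectl` output (in practice: `kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'`):

```shell
expected="v1.29"
# Sample kubelet versions; replace with the kubectl jsonpath output above.
node_versions="v1.29.3-eks-aaaa111
v1.29.3-eks-aaaa111
v1.29.3-eks-aaaa111"
off_version=$(printf '%s\n' "$node_versions" | grep -cv "^${expected}" || true)
echo "nodes off-version: ${off_version}"   # expect 0 after a clean rotation
```

Wire the same assertion into your post-upgrade pipeline so a partially rolled node group fails loudly instead of lingering.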
Recommended operating cadence
- Weekly: review failed pods, node pressure events, deployment health
- Monthly: patch/AMI rotation, add-on review, capacity and cost review
- Quarterly: Kubernetes minor upgrades, resilience game days
Conclusion
EKS day-2 operations become straightforward when you standardize your platform baseline and make upgrades routine. Keep headroom, keep PDBs sane, rotate nodes regularly, and treat add-ons as first-class lifecycle components. Over time, you’ll reduce downtime risk and regain predictability for both upgrades and incident response.