Amazon EKS Upgrades and Day-2 Operations: Practical Production Guide
If you’re already running Amazon EKS, the hardest part isn’t creating a cluster—it’s operating it reliably over time. “Day-2” EKS work includes upgrades, node rotations, add-on lifecycle management, autoscaling, IAM integration, security posture, and repeatable runbooks.
This guide focuses on practical, low-risk operational patterns you can apply to existing clusters.
Core principles for safe EKS operations
1) Treat upgrades as a routine, not an emergency
- Upgrade frequently rather than infrequently
- Prefer small version jumps (e.g., 1.27 -> 1.28) to reduce unknowns
- Automate pre-checks and standardize change windows
- Always rotate nodes as part of upgrades (don’t keep ancient AMIs)
2) Standardize your baseline components
A stable baseline reduces upgrade complexity:
- Managed node groups (or Karpenter) for node lifecycle
- IRSA for pods needing AWS APIs
- EKS managed add-ons when possible (VPC CNI, CoreDNS, kube-proxy)
- Ingress via AWS Load Balancer Controller (ALB/NLB)
- Metrics + logging baked in
Upgrade overview: what actually needs upgrading
In EKS, you’ll typically upgrade:
- Control plane version (EKS-managed)
- Worker nodes (node group AMI / launch template)
- Managed add-ons (VPC CNI, CoreDNS, kube-proxy)
- Cluster add-ons (AWS LB Controller, ExternalDNS, CSI driver, etc.)
A good default upgrade order is:
- Validate cluster readiness (PDBs, capacity, compatibility)
- Upgrade control plane
- Upgrade managed add-ons
- Rotate/upgrade worker nodes
- Upgrade third-party controllers/operators
- Post-upgrade verification and burn-in
Pre-upgrade checklist
1) Kubernetes API deprecations and compatibility
- Confirm the Kubernetes version’s removed APIs.
- Check add-ons compatibility matrices.
Useful signals:
- Admission webhook failures
- Controllers stuck in CrashLoop
- Deprecated API usage from manifests
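A minimal sketch of the manifest-scanning pre-check, assuming your rendered manifests are available as plain YAML (tools like Fairwinds' pluto do this more thoroughly; the sample manifest and the single removed API below are illustrative only):

```shell
# Write a sample manifest to scan; in practice, point this at your rendered GitOps output.
cat > /tmp/sample-manifest.yaml <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
EOF

# flowcontrol.apiserver.k8s.io/v1beta2 is removed in Kubernetes 1.29.
# Extend this list from the official deprecated-API migration guide for your target version.
for removed in "flowcontrol.apiserver.k8s.io/v1beta2"; do
  if grep -q "apiVersion: ${removed}" /tmp/sample-manifest.yaml; then
    echo "REMOVED API IN USE: ${removed}"
  fi
done
```

A grep-based check will miss APIs used only by controllers at runtime, so pair it with apiserver audit logs or deprecation metrics before the actual upgrade.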
2) PodDisruptionBudgets and drainability
Un-drainable nodes are the #1 upgrade/rotation killer.
- Ensure workloads have >1 replica where HA is needed
- Validate PDBs allow at least one pod to be disrupted
Example PDB:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

3) Capacity headroom
You need spare capacity during rotations.
- Target: at least 20-30% headroom for critical namespaces
- Ensure cluster autoscaler/Karpenter can scale up
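The headroom target above is simple arithmetic. A sketch with illustrative numbers (in practice, sum allocatable CPU and pod requests from `kubectl describe nodes` or your metrics pipeline):

```shell
# Illustrative headroom math: allocatable vs. requested CPU across a node group.
allocatable_mcpu=16000   # e.g., 4 nodes x 4000m allocatable each
requested_mcpu=11200     # sum of CPU requests for pods scheduled on those nodes
headroom_pct=$(( (allocatable_mcpu - requested_mcpu) * 100 / allocatable_mcpu ))
echo "CPU headroom: ${headroom_pct}%"   # 30% here, within the 20-30% target
```

Run the same math for memory; whichever resource has less headroom is your real constraint during a rotation.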
4) Operational safety controls
- Confirm alerting works (PagerDuty/Slack etc.)
- Define rollback paths (node group rollback + app rollback; note that the EKS control plane itself cannot be downgraded)
- Snapshot critical config if needed
Control plane upgrade (EKS)
You can upgrade via console, eksctl, Terraform, or AWS CLI.
Example using eksctl
```shell
eksctl upgrade cluster \
  --name my-eks \
  --region us-east-1 \
  --version 1.29
```

Verify control plane version

```shell
aws eks describe-cluster \
  --name my-eks \
  --region us-east-1 \
  --query 'cluster.version' \
  --output text
```

Managed add-ons upgrades
EKS managed add-ons reduce operational burden but still require conscious lifecycle management.
List add-ons
```shell
aws eks list-addons --cluster-name my-eks --region us-east-1
```

Describe add-on versions

```shell
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.29 \
  --region us-east-1
```

Upgrade add-on

```shell
aws eks update-addon \
  --cluster-name my-eks \
  --addon-name vpc-cni \
  --addon-version <version-from-describe-addon-versions> \
  --resolve-conflicts OVERWRITE \
  --region us-east-1
```

Recommended order:
1. kube-proxy
2. vpc-cni
3. coredns
Node rotation strategies
Strategy A: Managed Node Group rolling update (simple)
If you use managed node groups, update the node group version/AMI and allow EKS to roll.
```shell
eksctl upgrade nodegroup \
  --cluster my-eks \
  --name ng-general \
  --kubernetes-version 1.29
```

Strategy B: Blue/green node groups (safer)
- Create a new node group ng-general-v2
- Cordon/drain old nodes
- Delete old node group
This is excellent for:
- Big AMI jumps
- Instance family changes (e.g., m5 -> m7g)
- Migrating from Bottlerocket/AL2 to another base
Drain nodes safely
```shell
kubectl cordon <node>
kubectl drain <node> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=10m
```

Autoscaling: Cluster Autoscaler vs Karpenter
Cluster Autoscaler
- Integrates well with managed node groups
- Simpler mental model
- Scaling speed depends on node group constraints
Karpenter
- Faster, more flexible provisioning
- Better binpacking, diverse instance selection
- Great for cost optimization and bursty workloads
Day-2 guidance:
- Use Karpenter for stateless compute and CA for stable node groups (or go all-in on Karpenter once mature)
- Separate critical system workloads onto a stable node group
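For the Karpenter route, a hedged sketch of a NodePool for stateless compute (field names follow the Karpenter v1 API; verify against the version you actually run, and treat the pool name, limits, and EC2NodeClass reference as placeholders):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        # Mix spot and on-demand for cost; restrict further for stateful workloads
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"   # hard cap on total provisioned CPU for this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```

Keeping system controllers on a separate stable node group means a bad NodePool change cannot take out the components you need to fix it.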
IAM for Service Accounts (IRSA)
IRSA is foundational for day-2 operations.
Example: service account annotated for IRSA
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: networking
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/external-dns-irsa
```

Best practices:
- One role per controller/app where practical
- Tight IAM permissions (least privilege)
- Use AWS managed policies only when their scope is acceptable; otherwise write custom policies
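The IAM role side of IRSA is the part most often misconfigured. A sketch of the trust policy that pins the role to one service account (the account ID matches the example above; `EXAMPLE` stands in for your cluster's OIDC provider ID):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:networking:external-dns",
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```

The `sub` condition is what enforces "one role per controller": without it, any pod in the cluster could assume the role.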
Networking operations
VPC CNI tuning
Common day-2 tasks:
- Avoid IP exhaustion
- Configure prefix delegation where appropriate
- Separate node groups across subnets/AZs
Signals of IP exhaustion:
- Pods stuck in ContainerCreating
- CNI errors in aws-node logs
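To reason about whether exhaustion is plausible, it helps to know the per-node IP budget. A sketch using AWS's published formula for max pods without prefix delegation, with m5.large figures (3 ENIs, 10 IPv4 addresses per ENI):

```shell
# Max pods without prefix delegation: ENIs * (IPs per ENI - 1) + 2
# (one IP per ENI is reserved as the primary; +2 covers host-networking pods)
enis=3
ips_per_eni=10
max_pods=$(( enis * (ips_per_eni - 1) + 2 ))
echo "max pods without prefix delegation: ${max_pods}"   # 29 for m5.large
```

With prefix delegation enabled on Nitro instances, each secondary slot hands out a /28 (16 IPs) instead of a single address, which is why it is the usual fix when pod density, not subnet size, is the bottleneck.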
Observability runbook (minimum viable)
At minimum you want:
- Control plane logs enabled
- Container logs centralized
- Cluster metrics and dashboards
- Alerting for node/pod saturation
Enable EKS control plane logs
```shell
aws eks update-cluster-config \
  --name my-eks \
  --region us-east-1 \
  --logging 'clusterLogging=[{types=[api,audit,authenticator,controllerManager,scheduler],enabled=true}]'
```

Backup and disaster recovery
- Use Velero for Kubernetes resources + PV snapshots (if supported)
- Back up critical secrets and GitOps source of truth
- Practice restore to a staging cluster
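A sketch of a recurring Velero backup, assuming Velero is installed in the `velero` namespace (the schedule name, cron expression, and retention are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 3 * * *"        # daily at 03:00 UTC
  template:
    includedNamespaces: ["*"]  # narrow this to critical namespaces if backups are large
    ttl: 720h0m0s              # retain 30 days
```

A backup you have never restored is a hypothesis, not a DR plan; schedule the staging-cluster restore drill alongside the backup itself.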
Security hardening (day-2 essentials)
- Private API endpoint where possible
- Restrict system:masters mapping
- Use Pod Security Standards (or policy engine)
- Scan images and enforce signed images if required
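Pod Security Standards are applied per namespace via labels. A minimal example (the namespace name is a placeholder; `enforce` blocks violating pods while `warn` only surfaces them):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```

Rolling out `warn` cluster-wide first, then flipping namespaces to `enforce`, avoids breaking existing workloads in one step.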
Example: restrict public endpoint
```shell
aws eks update-cluster-config \
  --name my-eks \
  --region us-east-1 \
  --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true
```

Post-upgrade verification checklist
- kubectl get nodes shows expected version and readiness
- CoreDNS healthy
- Ingress controller healthy and can reconcile
- Critical apps pass smoke tests
- No elevated error rate / latency regressions
- Cluster autoscaler/Karpenter functioning
- Run a 24h burn-in period for non-trivial upgrades
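The first checklist item is easy to script. A sketch that checks every node reports the expected kubelet minor version; the sample data stands in for real `kubectl` output (in practice: `kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'`):

```shell
expected="v1.29"
# Sample kubelet versions; replace with the kubectl jsonpath output above.
node_versions="v1.29.3-eks-aaaa111
v1.29.3-eks-aaaa111
v1.29.3-eks-aaaa111"
off_version=$(printf '%s\n' "$node_versions" | grep -cv "^${expected}" || true)
echo "nodes off-version: ${off_version}"   # expect 0 after a clean rotation
```

Wire the same assertion into your post-upgrade pipeline so a partially rolled node group fails loudly instead of lingering.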
Recommended operating cadence
- Weekly: review failed pods, node pressure events, deployment health
- Monthly: patch/AMI rotation, add-on review, capacity and cost review
- Quarterly: Kubernetes minor upgrades, resilience game days
Conclusion
EKS day-2 operations become straightforward when you standardize your platform baseline and make upgrades routine. Keep headroom, keep PDBs sane, rotate nodes regularly, and treat add-ons as first-class lifecycle components. Over time, you’ll reduce downtime risk and regain predictability for both upgrades and incident response.