Vladimir Chavkov

Amazon EKS Upgrades and Day-2 Operations: Practical Production Guide

If you’re already running Amazon EKS, the hardest part isn’t creating a cluster; it’s operating it reliably over time. “Day-2” EKS work includes upgrades, node rotations, add-on lifecycle management, autoscaling, IAM integration, security posture, and repeatable runbooks.

This guide focuses on practical, low-risk operational patterns you can apply to existing clusters.

Core principles for safe EKS operations

1) Treat upgrades as a routine, not an emergency

2) Standardize your baseline components

A stable baseline reduces upgrade complexity:

Upgrade overview: what actually needs upgrading

In EKS, you’ll typically upgrade:

A good default upgrade order is:

  1. Validate cluster readiness (PDBs, capacity, compatibility)
  2. Upgrade control plane
  3. Upgrade managed add-ons
  4. Rotate/upgrade worker nodes
  5. Upgrade third-party controllers/operators
  6. Post-upgrade verification and burn-in

Pre-upgrade checklist

1) Kubernetes API deprecations and compatibility

Useful signals:

2) PodDisruptionBudgets and drainability

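Nodes that cannot be drained are a common reason upgrades and rotations stall. A PDB that currently allows zero disruptions blocks eviction until the drain times out.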

Example PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
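
One way to spot PDBs that would block a drain is sketched below. It assumes kubectl is pointed at the cluster; the function name blocking_pdbs is hypothetical. It filters rows of namespace, name, and status.disruptionsAllowed, and prints the PDBs that currently allow no disruptions.

```shell
# Print "namespace/name" for every PDB row whose allowed disruptions is 0.
# Input: whitespace-separated rows of NAMESPACE NAME ALLOWED on stdin.
blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage against a live cluster (assumes kubectl access; not executed here):
# kubectl get pdb -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | blocking_pdbs
```

Any PDB this prints is worth fixing (more replicas, or a looser budget) before starting a rotation.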

3) Capacity headroom

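You need spare capacity during rotations: pods evicted from a draining node must fit on the remaining nodes until replacement nodes are Ready.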
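
A rough headroom check can be sketched as simple arithmetic. This is a CPU-only simplification that ignores memory, pod-count limits, and scheduling constraints; the function name and the idea of testing against the largest node are assumptions for illustration.

```shell
# Succeed if the cluster can still fit current CPU requests
# after losing its largest node. All values are in millicores.
can_lose_largest_node() {
  total_allocatable=$1
  total_requested=$2
  largest_node=$3
  [ $((total_allocatable - largest_node)) -ge "$total_requested" ]
}

# Example: 12 vCPU allocatable, 7 vCPU requested, largest node has 4 vCPU.
# can_lose_largest_node 12000 7000 4000 && echo "enough headroom"
```

If the check fails, scale the node group up before draining rather than relying on the autoscaler to react mid-rotation.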

4) Operational safety controls

Control plane upgrade (EKS)

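You can upgrade via console, eksctl, Terraform, or AWS CLI. EKS moves the control plane one minor version at a time, so plan a multi-version jump as a sequence of upgrades.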

Example using eksctl

# --approve applies the change; without it eksctl only prints the plan
eksctl upgrade cluster \
  --name my-eks \
  --region us-east-1 \
  --version 1.29 \
  --approve

Verify control plane version

aws eks describe-cluster \
  --name my-eks \
  --region us-east-1 \
  --query 'cluster.version' \
  --output text
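
Because EKS upgrades the control plane one minor version at a time, a small guard in a runbook script can catch an accidental skip. The sketch below uses a hypothetical helper name and assumes both versions are in MAJOR.MINOR form, as returned by the command above.

```shell
# Succeed only if target is exactly one minor version above current,
# e.g. 1.28 -> 1.29. Both arguments use the MAJOR.MINOR form.
is_single_minor_step() {
  current=$1
  target=$2
  # Major versions must match.
  [ "${current%%.*}" = "${target%%.*}" ] || return 1
  # Minor versions must differ by exactly one.
  [ $(( ${target#*.} - ${current#*.} )) -eq 1 ]
}

# Usage (assumes AWS CLI access; not executed here):
# current=$(aws eks describe-cluster --name my-eks --region us-east-1 \
#   --query 'cluster.version' --output text)
# is_single_minor_step "$current" 1.29 || echo "refusing: not a single-step upgrade"
```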

Managed add-ons upgrades

EKS managed add-ons reduce operational burden but still require conscious lifecycle management.

List add-ons

aws eks list-addons --cluster-name my-eks --region us-east-1

Describe add-on versions

aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.29 \
  --region us-east-1

Upgrade add-on

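# <addon-version>: a version returned by describe-addon-versions
# OVERWRITE resets fields you changed on the add-on to the EKS defaults
aws eks update-addon \
  --cluster-name my-eks \
  --addon-name vpc-cni \
  --addon-version <addon-version> \
  --resolve-conflicts OVERWRITE \
  --region us-east-1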
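
An add-on update is asynchronous, so a script should wait for it to settle before moving on to the next add-on. A minimal sketch, assuming the AWS CLI is configured; the function name is hypothetical, and PRESERVE is used here in place of OVERWRITE so existing custom settings are kept.

```shell
# Update one managed add-on and block until EKS reports it ACTIVE.
# Arguments: cluster name, add-on name, add-on version, region.
update_addon_and_wait() {
  cluster=$1
  addon=$2
  version=$3
  region=$4
  aws eks update-addon \
    --cluster-name "$cluster" \
    --addon-name "$addon" \
    --addon-version "$version" \
    --resolve-conflicts PRESERVE \
    --region "$region" || return 1
  # The waiter polls until the add-on status is ACTIVE or it gives up.
  aws eks wait addon-active \
    --cluster-name "$cluster" \
    --addon-name "$addon" \
    --region "$region"
}
```

Calling it once per add-on keeps updates sequential, which makes a failed step easy to attribute.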

Recommended order:

Node rotation strategies

Strategy A: Managed Node Group rolling update (simple)

If you use managed node groups, update the node group version/AMI and allow EKS to roll.

eksctl upgrade nodegroup \
  --cluster my-eks \
  --name ng-general \
  --kubernetes-version 1.29

Strategy B: Blue/green node groups (safer)

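  1. Create a new node group ng-general-v2 and wait for its nodes to become Ready
  2. Cordon and drain the old nodes so workloads reschedule onto the new group
  3. Delete the old node group once workloads are healthy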

This is excellent for:

Drain nodes safely

kubectl cordon <node>
kubectl drain <node> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=10m
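
For a whole node group, the same commands can be wrapped in a loop. The sketch below assumes managed node groups, whose nodes carry the eks.amazonaws.com/nodegroup label; the function name is hypothetical. It cordons every old node first so evicted pods do not land on another node that is about to be drained.

```shell
# Cordon every node in a managed node group, then drain them one by one.
# Argument: node group name.
drain_nodegroup() {
  nodegroup=$1
  nodes=$(kubectl get nodes \
    -l "eks.amazonaws.com/nodegroup=$nodegroup" \
    -o name) || return 1
  for node in $nodes; do
    # "-o name" prints "node/<name>"; strip the prefix.
    kubectl cordon "${node#node/}" || return 1
  done
  for node in $nodes; do
    kubectl drain "${node#node/}" \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --grace-period=60 \
      --timeout=10m || return 1
  done
}
```

The function stops at the first failed drain, which is usually the point where a PDB or a stuck pod needs a human look.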

Autoscaling: Cluster Autoscaler vs Karpenter

Cluster Autoscaler

Karpenter

Day-2 guidance:

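IAM Roles for Service Accounts (IRSA)

IRSA lets a pod assume an IAM role through its Kubernetes service account, so controllers and workloads do not depend on the node's instance role. That makes it foundational for day-2 operations.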

Example: service account annotated for IRSA

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: networking
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/external-dns-irsa
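
The annotation only works if the role's trust policy allows that exact service account. A minimal sketch of generating such a policy follows; the function name is hypothetical, and the OIDC provider value is whatever your cluster reports, passed in without the https:// prefix.

```shell
# Print an IAM trust policy that lets one service account assume a role.
# Arguments: AWS account ID, OIDC provider (host/path, no https://),
# namespace, service account name.
irsa_trust_policy() {
  account_id=$1
  oidc_provider=$2
  namespace=$3
  service_account=$4
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/${oidc_provider}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${oidc_provider}:sub": "system:serviceaccount:${namespace}:${service_account}",
          "${oidc_provider}:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF
}

# Usage (assumes AWS CLI access; not executed here):
# oidc=$(aws eks describe-cluster --name my-eks --region us-east-1 \
#   --query 'cluster.identity.oidc.issuer' --output text | sed 's|^https://||')
# irsa_trust_policy 123456789012 "$oidc" networking external-dns
```

Pinning the sub condition to one namespace and service account keeps other pods in the cluster from assuming the role.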

Best practices:

Networking operations

VPC CNI tuning

Common day-2 tasks:

Signals of IP exhaustion:

Observability runbook (minimum viable)

At minimum you want:

Enable EKS control plane logs

aws eks update-cluster-config \
  --name my-eks \
  --region us-east-1 \
  --logging 'clusterLogging=[{types=[api,audit,authenticator,controllerManager,scheduler],enabled=true}]'
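
To confirm the setting took effect, a runbook can compare the enabled log types against the five requested above. A small sketch with a hypothetical helper name; the describe-cluster query in the comment is an assumption about how you would feed it.

```shell
# Print the control plane log types that are NOT in the enabled list.
# Arguments: the enabled log types, e.g. api audit
missing_log_types() {
  for wanted in api audit authenticator controllerManager scheduler; do
    found=no
    for enabled in "$@"; do
      [ "$enabled" = "$wanted" ] && found=yes
    done
    [ "$found" = yes ] || echo "$wanted"
  done
}

# Usage (assumes AWS CLI access; not executed here):
# missing_log_types $(aws eks describe-cluster --name my-eks --region us-east-1 \
#   --query 'cluster.logging.clusterLogging[?enabled==`true`].types[]' --output text)
```

An empty result means every log type is enabled.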

Backup and disaster recovery

Security hardening (day-2 essentials)

Example: restrict public endpoint

aws eks update-cluster-config \
  --name my-eks \
  --region us-east-1 \
  --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true

Post-upgrade verification checklist
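
One check that fits here is confirming that no node was left behind on the old kubelet version. A minimal sketch, assuming kubectl access; the function name is hypothetical. It reads rows of node name and kubelet version and prints nodes that are not on the target minor version.

```shell
# Print nodes whose kubelet version is not on the target minor version.
# Argument: target version in MAJOR.MINOR form, e.g. 1.29
# Input: rows of NAME KUBELET_VERSION (e.g. "v1.29.x-eks-...") on stdin.
nodes_not_on_version() {
  target=$1
  # Keep rows whose version string does not start with "v<target>."
  awk -v want="v${target}." 'index($2, want) != 1 { print $1 }'
}

# Usage against a live cluster (not executed here):
# kubectl get nodes --no-headers \
#   -o custom-columns=NAME:.metadata.name,VER:.status.nodeInfo.kubeletVersion \
#   | nodes_not_on_version 1.29
```

An empty result means every node has been rotated onto the new version.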

Conclusion

EKS day-2 operations become straightforward when you standardize your platform baseline and make upgrades routine. Keep headroom, keep PDBs sane, rotate nodes regularly, and treat add-ons as first-class lifecycle components. Over time, you’ll reduce downtime risk and regain predictability for both upgrades and incident response.

