Vladimir Chavkov

Amazon EKS Production Guide: Building and Running Kubernetes on AWS


Amazon Elastic Kubernetes Service (EKS) is AWS's managed Kubernetes offering: AWS operates the control plane for you while integrating deeply with services such as IAM, VPC, and Elastic Load Balancing. This guide covers everything you need to build production-grade EKS clusters.

Why Choose Amazon EKS?

Key Benefits

- Managed, multi-AZ control plane: AWS runs and patches the API server and etcd
- Automated control plane upgrades
- Native integration with IAM, VPC, ELB, ECR, and CloudWatch
- Certified Kubernetes conformance, so standard tooling (kubectl, Helm) works unchanged
- Lower operational overhead than self-managed clusters

EKS vs. Self-Managed Kubernetes on EC2

| Feature | EKS | Self-Managed |
| --- | --- | --- |
| Control Plane | AWS-managed | You manage |
| Upgrades | Automated | Manual |
| HA Setup | Built-in multi-AZ | Manual configuration |
| Cost | $0.10/hour per cluster + nodes | Node costs only |
| AWS Integration | Native | Requires configuration |
| Operational Overhead | Low | High |
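The $0.10/hour control-plane fee is small next to node spend. A quick back-of-envelope sketch; the control-plane rate comes from the table above, while the node price and count are illustrative assumptions, not a quote:

```python
# Rough monthly cost sketch for one EKS cluster (illustrative numbers).
HOURS_PER_MONTH = 730  # average hours in a month

control_plane_rate = 0.10  # USD/hour per cluster (from the table above)
node_rate = 0.0832         # USD/hour, illustrative on-demand t3.large price
node_count = 3

control_plane_monthly = control_plane_rate * HOURS_PER_MONTH
nodes_monthly = node_rate * node_count * HOURS_PER_MONTH
total = control_plane_monthly + nodes_monthly

print(f"Control plane:      ${control_plane_monthly:.2f}/month")
print(f"Nodes (3x t3.large): ${nodes_monthly:.2f}/month")
print(f"Total:              ${total:.2f}/month")
```

Even at this small scale the nodes dominate the bill, which is why the cost-optimization section below focuses on node capacity rather than the cluster fee.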

Getting Started with EKS

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│ AWS Cloud (Region) │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ EKS Control Plane (AWS Managed) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ API Server │ │ etcd │ │ Scheduler │ │ │
│ │ │ (Multi-AZ) │ │ (Multi-AZ) │ │ │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼────────────────────────────┐ │
│ │ Your VPC │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Worker Nodes (Your Account) │ │ │
│ │ │ │ │ │
│ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │
│ │ │ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ │ │
│ │ │ │ (AZ-a) │ │ (AZ-b) │ │ (AZ-c) │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ ┌───────┐ │ │ ┌───────┐ │ │ ┌───────┐ │ │ │ │
│ │ │ │ │Pods │ │ │ │Pods │ │ │ │Pods │ │ │ │ │
│ │ │ │ └───────┘ │ │ └───────┘ │ │ └───────┘ │ │ │ │
│ │ │ └──────────┘ └──────────┘ └──────────┘ │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

Prerequisites

- An AWS account with permissions to create EKS, EC2, VPC, and IAM resources
- AWS CLI v2 installed and configured
- eksctl and kubectl installed
- Terraform (optional, for the infrastructure-as-code path below)

Creating Your First EKS Cluster

Terminal window
# Create a production-ready cluster
eksctl create cluster \
  --name production-cluster \
  --region us-east-1 \
  --version 1.29 \
  --nodegroup-name standard-workers \
  --node-type t3.large \
  --nodes 3 \
  --nodes-min 3 \
  --nodes-max 10 \
  --managed \
  --with-oidc \
  --ssh-access \
  --ssh-public-key my-key \
  --asg-access \
  --external-dns-access \
  --full-ecr-access \
  --alb-ingress-access \
  --vpc-nat-mode Single
main.tf
locals {
  cluster_name = "production-cluster"
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = local.cluster_name
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # Enable IRSA (IAM Roles for Service Accounts)
  enable_irsa = true

  # Cluster endpoint access
  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  # Cluster addons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_driver_irsa.iam_role_arn
    }
  }

  # EKS Managed Node Groups
  eks_managed_node_groups = {
    general = {
      name           = "general-purpose"
      instance_types = ["t3.large"]
      capacity_type  = "ON_DEMAND"

      min_size     = 3
      max_size     = 10
      desired_size = 3

      labels = {
        workload-type = "general"
      }

      tags = {
        Environment = "production"
        ManagedBy   = "terraform"
      }
    }

    compute = {
      name           = "compute-optimized"
      instance_types = ["c6i.2xlarge"]
      capacity_type  = "ON_DEMAND"

      min_size     = 2
      max_size     = 20
      desired_size = 2

      labels = {
        workload-type = "compute-intensive"
      }

      # EKS managed node group taints use the AWS API casing (NO_SCHEDULE)
      taints = [{
        key    = "workload-type"
        value  = "compute-intensive"
        effect = "NO_SCHEDULE"
      }]
    }

    spot = {
      name           = "spot-workers"
      instance_types = ["t3.large", "t3a.large", "t3.xlarge"]
      capacity_type  = "SPOT"

      min_size     = 1
      max_size     = 10
      desired_size = 3

      labels = {
        workload-type = "spot"
      }

      taints = [{
        key    = "spot-instance"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }

  # Cluster security group rules
  cluster_security_group_additional_rules = {
    ingress_nodes_ephemeral_ports_tcp = {
      description                = "Nodes on ephemeral ports"
      protocol                   = "tcp"
      from_port                  = 1025
      to_port                    = 65535
      type                       = "ingress"
      source_node_security_group = true
    }
  }

  # Node security group rules
  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# EBS CSI Driver IRSA
module "ebs_csi_driver_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name             = "ebs-csi-driver"
  attach_ebs_csi_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:ebs-csi-controller-sa"]
    }
  }
}

# VPC Module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "eks-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false # Multi-AZ NAT for HA

  enable_dns_hostnames = true
  enable_dns_support   = true

  # Kubernetes tags for subnet discovery
  public_subnet_tags = {
    "kubernetes.io/role/elb"                      = "1"
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb"             = "1"
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
  }

  tags = {
    Environment = "production"
  }
}
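Subnet sizing deserves a sanity check before the VPC is created, because the VPC CNI hands pod IPs straight out of these subnets. A small sketch of usable capacity per /24, using the fact that AWS reserves five addresses in every subnet:

```python
import ipaddress

# One of the /24 private subnets from the VPC configuration above.
subnet = ipaddress.ip_network("10.0.1.0/24")

# AWS reserves the first four addresses and the last one in every subnet
# (network, VPC router, DNS, future use, broadcast).
AWS_RESERVED_PER_SUBNET = 5
usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET

print(f"{subnet}: {usable} usable IPs")  # 251
```

With three node groups scaling into double digits and every pod consuming a VPC address, 251 IPs per AZ can run out faster than expected; the prefix-delegation setting discussed in the networking section is one way to relieve that pressure.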

Configure kubectl Access

Terminal window
# Update kubeconfig
aws eks update-kubeconfig \
  --region us-east-1 \
  --name production-cluster

# Verify connection
kubectl get nodes
kubectl get pods -A

AWS-Specific Integrations

1. IAM Roles for Service Accounts (IRSA)

IRSA allows Kubernetes pods to assume AWS IAM roles without storing credentials:

Terminal window
# Create OIDC provider (if not already done)
eksctl utils associate-iam-oidc-provider \
  --cluster production-cluster \
  --approve
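Under the hood, what IRSA creates for each service account is an IAM role whose trust policy federates to the cluster's OIDC provider. A sketch of that trust policy, with ACCOUNT_ID and OIDC_ID as placeholders and the subject matching the `default/s3-access-sa` service account used in the S3 example:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/OIDC_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/OIDC_ID:sub": "system:serviceaccount:default:s3-access-sa",
          "oidc.eks.us-east-1.amazonaws.com/id/OIDC_ID:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```

The `sub` condition pins the role to one namespace/service-account pair, which is why a pod in another namespace cannot assume the same role even if it copies the service account name.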

Example: S3 Access for Pods

Terminal window
# Create IAM policy
cat > s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket/*",
        "arn:aws:s3:::my-app-bucket"
      ]
    }
  ]
}
EOF

# Create IAM role for service account
eksctl create iamserviceaccount \
  --name s3-access-sa \
  --namespace default \
  --cluster production-cluster \
  --attach-policy-arn $(aws iam create-policy \
    --policy-name S3AccessPolicy \
    --policy-document file://s3-policy.json \
    --query 'Policy.Arn' --output text) \
  --approve
# Deploy application using IRSA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: s3-app
  template:
    metadata:
      labels:
        app: s3-app
    spec:
      serviceAccountName: s3-access-sa # Uses IRSA
      containers:
        - name: app
          image: my-app:latest
          env:
            - name: AWS_REGION
              value: us-east-1
          # No AWS credentials needed - IRSA handles authentication

2. AWS Load Balancer Controller

The AWS Load Balancer Controller supersedes the deprecated ALB Ingress Controller:

Terminal window
# Install AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Create IAM policy
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json
aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam-policy.json

# Create service account with IAM role
eksctl create iamserviceaccount \
  --cluster=production-cluster \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
  --approve

# Install controller
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=production-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

Application Load Balancer (ALB) Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  annotations:
    # ALB Configuration
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
    # Certificate
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:ACCOUNT:certificate/CERT_ID
    # Health check
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '15'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '2'
    # WAF
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:ACCOUNT:regional/webacl/NAME/ID
    # Access logs
    alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=true,access_logs.s3.bucket=my-logs-bucket
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app
                port:
                  number: 80

Network Load Balancer (NLB) Service

apiVersion: v1
kind: Service
metadata:
  name: web-app-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/health"
spec:
  type: LoadBalancer
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP

3. Amazon EBS CSI Driver

For persistent storage with EBS volumes:

# StorageClass for gp3 volumes (recommended)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 100Gi
---
# StatefulSet using PVC
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-gp3
        resources:
          requests:
            storage: 100Gi

4. Amazon EFS CSI Driver

For shared storage across multiple pods:

Terminal window
# Install EFS CSI Driver
kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

# Create EFS filesystem
aws efs create-file-system \
  --region us-east-1 \
  --performance-mode generalPurpose \
  --throughput-mode bursting \
  --encrypted \
  --tags Key=Name,Value=eks-efs

# StorageClass for EFS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-1234567890abcdef0
  directoryPerms: "700"
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-storage
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
# Deployment using shared EFS storage
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          image: nginx
          volumeMounts:
            - name: shared-data
              mountPath: /usr/share/nginx/html
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: shared-storage

EKS Networking Best Practices

VPC CNI Configuration

The AWS VPC CNI uses ENIs (Elastic Network Interfaces) to assign VPC IP addresses to pods:

Terminal window
# The VPC CNI is configured through environment variables on the aws-node
# DaemonSet (the managed vpc-cni addon can set the same values via its
# configuration), not through a plain ConfigMap.

# Enable custom networking
kubectl set env daemonset aws-node -n kube-system \
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true \
  ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

# Enable prefix delegation for more IPs per node
kubectl set env daemonset aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true

# Enable pod ENIs (required for security groups for pods)
kubectl set env daemonset aws-node -n kube-system \
  ENABLE_POD_ENI=true

# Network policy enforcement is enabled through the vpc-cni managed addon
# (configuration value enableNetworkPolicy: "true")
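Prefix delegation matters because pod density is otherwise capped by ENI slots. The commonly cited formula is max pods = ENIs x (IPv4 addresses per ENI - 1) + 2. A sketch under stated assumptions: the 16-address /28 prefixes and the 110-pod cap for smaller instances follow AWS's max-pods guidance, so treat the result as an approximation rather than an exact quota:

```python
def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False) -> int:
    """Approximate EKS max-pods calculation for ENI-based IP assignment.

    Without prefix delegation each ENI contributes (ips_per_eni - 1) pod IPs,
    since one slot is the ENI's primary address; the +2 accounts for
    host-network pods. With prefix delegation each slot holds a /28 prefix
    (16 addresses), and the recommended value is capped at 110 for smaller
    instance types.
    """
    if prefix_delegation:
        return min(enis * (ips_per_eni - 1) * 16 + 2, 110)
    return enis * (ips_per_eni - 1) + 2

# t3.large: 3 ENIs, 12 IPv4 addresses per ENI
print(max_pods(3, 12))        # 35 without prefix delegation
print(max_pods(3, 12, True))  # 110 (capped) with prefix delegation
```

This is why a t3.large node defaults to roughly 35 pods, and why enabling `ENABLE_PREFIX_DELEGATION` (together with a raised kubelet max-pods value) is the usual fix when nodes run out of pod slots long before they run out of CPU or memory.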

Security Groups for Pods

Assign security groups directly to pods:

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: database-pods-sg
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: database
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: postgres
          image: postgres:15

EKS Security Best Practices

1. Pod Security Standards

# Enforce restricted pod security standard
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

2. Network Policies with Calico

Install Calico for network policy support:

Terminal window
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico-vxlan.yaml

# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress-controller
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432

3. Secrets Management with AWS Secrets Manager

Terminal window
# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets-system \
  --create-namespace

# Create IAM role for External Secrets
eksctl create iamserviceaccount \
  --name external-secrets \
  --namespace external-secrets-system \
  --cluster production-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/SecretsManagerReadWrite \
  --approve

# SecretStore
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: default
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
---
# ExternalSecret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: postgres-secret
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef:
        key: prod/database/postgres
        property: username
    - secretKey: password
      remoteRef:
        key: prod/database/postgres
        property: password

Monitoring and Observability

Amazon CloudWatch Container Insights

Terminal window
# Install CloudWatch agent and Fluent Bit
ClusterName=production-cluster
RegionName=us-east-1
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/;s/{{http_server_toggle}}/\"On\"/;s/{{http_server_port}}/${FluentBitHttpPort}/;s/{{read_from_head}}/${FluentBitReadFromHead}/" | \
kubectl apply -f -

Prometheus and Grafana

Terminal window
# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=ebs-gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=ebs-gp3 \
  --set grafana.persistence.size=10Gi

Cost Optimization

1. Use Spot Instances for Fault-Tolerant Workloads

# Deploy to spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      nodeSelector:
        workload-type: spot
      tolerations:
        - key: spot-instance
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: processor
          image: batch-processor:latest
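A rough sense of why this is worth the interruption risk. The prices below are hypothetical placeholders, not quotes; spot prices fluctuate per AZ and instance type:

```python
# Illustrative spot-vs-on-demand comparison (prices are assumptions).
HOURS_PER_MONTH = 730

on_demand_rate = 0.0832  # USD/hour, illustrative t3.large on-demand price
spot_rate = 0.0250       # USD/hour, hypothetical spot price
fleet_size = 10          # nodes backing the batch fleet above

on_demand_monthly = on_demand_rate * fleet_size * HOURS_PER_MONTH
spot_monthly = spot_rate * fleet_size * HOURS_PER_MONTH
savings_pct = 100 * (1 - spot_monthly / on_demand_monthly)

print(f"On-demand: ${on_demand_monthly:.2f}/month")
print(f"Spot:      ${spot_monthly:.2f}/month")
print(f"Savings:   {savings_pct:.0f}%")
```

At these assumed rates the spot fleet runs at roughly a 70% discount, which is in the range AWS advertises for spot capacity; the trade-off is the two-minute interruption notice, which is why only fault-tolerant workloads belong on these nodes.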

2. Cluster Autoscaler

Terminal window
# Install Cluster Autoscaler
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=production-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/cluster-autoscaler

3. Karpenter (Next-Gen Autoscaling)

Terminal window
# Install Karpenter
helm repo add karpenter https://charts.karpenter.sh
helm install karpenter karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/karpenter-controller \
  --set settings.aws.clusterName=production-cluster \
  --set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile

# Karpenter Provisioner (v1alpha5 API; newer Karpenter releases replace
# Provisioner/AWSNodeTemplate with NodePool/EC2NodeClass)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.large", "t3.xlarge", "c6i.large", "c6i.xlarge"]
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: production-cluster
  securityGroupSelector:
    karpenter.sh/discovery: production-cluster
  instanceProfile: KarpenterNodeInstanceProfile
  tags:
    ManagedBy: Karpenter

Upgrade Strategy

Control Plane Upgrade

Terminal window
# Upgrade EKS control plane
aws eks update-cluster-version \
  --name production-cluster \
  --kubernetes-version 1.29

# Wait for upgrade to complete
aws eks describe-update \
  --name production-cluster \
  --update-id <update-id>

Node Group Upgrade

Terminal window
# Update managed node group
aws eks update-nodegroup-version \
  --cluster-name production-cluster \
  --nodegroup-name general-purpose \
  --kubernetes-version 1.29

# Or using eksctl
eksctl upgrade nodegroup \
  --cluster production-cluster \
  --name general-purpose \
  --kubernetes-version 1.29

Disaster Recovery

Backup with Velero

Terminal window
# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backup-bucket \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# Create backup
velero backup create full-cluster-backup \
  --include-namespaces '*' \
  --snapshot-volumes

# Schedule daily backups
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces '*'

Production Checklist

Infrastructure

- Multi-AZ node groups with sensible min/max sizes
- Cluster and VPC defined as code (Terraform or an eksctl config file)
- Worker nodes in private subnets; one NAT gateway per AZ for HA
- Cluster add-ons (CoreDNS, kube-proxy, VPC CNI, EBS CSI) managed and pinned

Security

- IRSA enabled; no long-lived AWS credentials in pods
- Pod Security Standards enforced on production namespaces
- Network policies with a default-deny baseline
- Secrets sourced from AWS Secrets Manager, not committed to manifests

Networking

- Subnets sized for pod IP consumption; prefix delegation enabled where needed
- AWS Load Balancer Controller installed; ALB/NLB annotations reviewed
- Security groups for pods where workloads need network isolation

Storage

- Encrypted gp3 StorageClass as the default
- Volume expansion enabled; WaitForFirstConsumer binding mode
- EFS for ReadWriteMany workloads

Monitoring

- CloudWatch Container Insights or Prometheus/Grafana deployed
- Alerts on node, pod, and control plane health
- Log aggregation (Fluent Bit) configured

Cost Management

- Spot instances for fault-tolerant workloads
- Cluster Autoscaler or Karpenter scaling nodes to demand
- Instance types right-sized per workload class

Backup and DR

- Velero installed with scheduled backups
- Volume snapshots enabled
- Restore procedure tested, not just assumed

Conclusion

Amazon EKS provides a robust, scalable, and secure platform for running Kubernetes on AWS. By following these best practices and leveraging AWS-native integrations, you can build production-grade container platforms that are reliable, cost-effective, and easy to operate.

The key to success with EKS is understanding both Kubernetes fundamentals and AWS service integrations. Start with a simple cluster, gradually add features as needed, and always prioritize security and reliability.


Ready to master Amazon EKS? Our AWS training programs cover EKS in depth, from basic deployments to advanced multi-cluster architectures. Contact us for customized training tailored to your team’s needs.

