Amazon EKS Production Guide: Building and Running Kubernetes on AWS
Amazon Elastic Kubernetes Service (EKS) is AWS’s managed Kubernetes offering: AWS operates the control plane for you and integrates it deeply with the rest of the AWS platform. This guide covers what you need to build production-grade EKS clusters.
Why Choose Amazon EKS?
Key Benefits
- Fully Managed Control Plane: AWS manages Kubernetes control plane availability and updates
- AWS Integration: Native integration with IAM, VPC, ALB, EBS, EFS, and more
- High Availability: Multi-AZ control plane by default
- Security: AWS-managed security patches and compliance certifications
- Scalability: Scales from small dev clusters to thousands of nodes
- EKS Anywhere: Run EKS on-premises with consistent tooling
EKS vs. Self-Managed Kubernetes on EC2
| Feature | EKS | Self-Managed |
|---|---|---|
| Control Plane | AWS-managed | You manage |
| Upgrades | One command, AWS-managed | Manual |
| HA Setup | Built-in multi-AZ | Manual configuration |
| Cost | $0.10/hour per cluster + nodes | Node costs only |
| AWS Integration | Native | Requires configuration |
| Operational Overhead | Low | High |
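To put the cost row in perspective: the control-plane fee is usually a rounding error next to node spend. A quick back-of-the-envelope calculation (the node hourly price is an illustrative assumption, not a current quote):

```python
# Rough monthly cost of the managed control plane vs. worker nodes.
# The $0.10/hour control-plane fee is from the table above; the node
# price below is an illustrative assumption.
HOURS_PER_MONTH = 730  # common AWS billing approximation

def monthly_control_plane_cost(hourly_fee: float = 0.10) -> float:
    """Fixed EKS control-plane fee per cluster."""
    return hourly_fee * HOURS_PER_MONTH

def monthly_node_cost(hourly_price: float, node_count: int) -> float:
    """Worker node cost, which you pay either way (EKS or self-managed)."""
    return hourly_price * node_count * HOURS_PER_MONTH

cp = monthly_control_plane_cost()      # ~$73/month per cluster
nodes = monthly_node_cost(0.0832, 3)   # 3 nodes at an assumed rate
print(f"control plane: ${cp:.2f}/mo, nodes: ${nodes:.2f}/mo")
```

At roughly $73/month per cluster, consolidating many tiny clusters usually moves the needle more than the fee itself.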
Getting Started with EKS
Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│ AWS Cloud (Region)                                          │
│                                                             │
│   ┌───────────────────────────────────────────────────┐     │
│   │ EKS Control Plane (AWS Managed)                   │     │
│   │ ┌────────────┐ ┌────────────┐ ┌────────────┐      │     │
│   │ │ API Server │ │ etcd       │ │ Scheduler  │      │     │
│   │ │ (Multi-AZ) │ │ (Multi-AZ) │ │            │      │     │
│   │ └────────────┘ └────────────┘ └────────────┘      │     │
│   └─────────────────────────┬─────────────────────────┘     │
│                             │                               │
│   ┌─────────────────────────┼─────────────────────────┐     │
│   │ Your VPC                ▼                          │     │
│   │ ┌─────────────────────────────────────────────┐   │     │
│   │ │ Worker Nodes (Your Account)                 │   │     │
│   │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐      │   │     │
│   │ │ │ Node 1   │ │ Node 2   │ │ Node 3   │      │   │     │
│   │ │ │ (AZ-a)   │ │ (AZ-b)   │ │ (AZ-c)   │      │   │     │
│   │ │ │ ┌──────┐ │ │ ┌──────┐ │ │ ┌──────┐ │      │   │     │
│   │ │ │ │ Pods │ │ │ │ Pods │ │ │ │ Pods │ │      │   │     │
│   │ │ │ └──────┘ │ │ └──────┘ │ │ └──────┘ │      │   │     │
│   │ │ └──────────┘ └──────────┘ └──────────┘      │   │     │
│   │ └─────────────────────────────────────────────┘   │     │
│   └───────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
```

Prerequisites
- AWS CLI v2.x installed and configured
- kubectl 1.28+ installed
- eksctl CLI tool (optional but recommended)
- AWS IAM permissions for EKS and related services
Creating Your First EKS Cluster
Using eksctl (Recommended for Getting Started)
```bash
# Create a production-ready cluster
eksctl create cluster \
  --name production-cluster \
  --region us-east-1 \
  --version 1.29 \
  --nodegroup-name standard-workers \
  --node-type t3.large \
  --nodes 3 \
  --nodes-min 3 \
  --nodes-max 10 \
  --managed \
  --with-oidc \
  --ssh-access \
  --ssh-public-key my-key \
  --asg-access \
  --external-dns-access \
  --full-ecr-access \
  --alb-ingress-access \
  --vpc-nat-mode Single
```

Using Terraform (Recommended for Production)
```hcl
# Shared name, referenced by both the EKS module and the subnet tags below
locals {
  cluster_name = "production-cluster"
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = local.cluster_name
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # Enable IRSA (IAM Roles for Service Accounts)
  enable_irsa = true

  # Cluster endpoint access
  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  # Cluster addons
  cluster_addons = {
    coredns    = { most_recent = true }
    kube-proxy = { most_recent = true }
    vpc-cni    = { most_recent = true }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_driver_irsa.iam_role_arn
    }
  }

  # EKS Managed Node Groups
  eks_managed_node_groups = {
    general = {
      name           = "general-purpose"
      instance_types = ["t3.large"]
      capacity_type  = "ON_DEMAND"

      min_size     = 3
      max_size     = 10
      desired_size = 3

      labels = {
        workload-type = "general"
      }

      tags = {
        Environment = "production"
        ManagedBy   = "terraform"
      }
    }

    compute = {
      name           = "compute-optimized"
      instance_types = ["c6i.2xlarge"]
      capacity_type  = "ON_DEMAND"

      min_size     = 2
      max_size     = 20
      desired_size = 2

      labels = {
        workload-type = "compute-intensive"
      }

      taints = [{
        key    = "workload-type"
        value  = "compute-intensive"
        effect = "NO_SCHEDULE" # EKS API enum, not the Kubernetes "NoSchedule" spelling
      }]
    }

    spot = {
      name           = "spot-workers"
      instance_types = ["t3.large", "t3a.large", "t3.xlarge"]
      capacity_type  = "SPOT"

      min_size     = 1
      max_size     = 10
      desired_size = 3

      labels = {
        workload-type = "spot"
      }

      taints = [{
        key    = "spot-instance"
        value  = "true"
        effect = "NO_SCHEDULE"
      }]
    }
  }

  # Cluster security group rules
  cluster_security_group_additional_rules = {
    ingress_nodes_ephemeral_ports_tcp = {
      description                = "Nodes on ephemeral ports"
      protocol                   = "tcp"
      from_port                  = 1025
      to_port                    = 65535
      type                       = "ingress"
      source_node_security_group = true
    }
  }

  # Node security group rules
  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# EBS CSI Driver IRSA
module "ebs_csi_driver_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "ebs-csi-driver"

  attach_ebs_csi_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:ebs-csi-controller-sa"]
    }
  }
}

# VPC Module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "eks-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false # Multi-AZ NAT for HA
  enable_dns_hostnames = true
  enable_dns_support   = true

  # Kubernetes tags for subnet discovery
  public_subnet_tags = {
    "kubernetes.io/role/elb"                      = "1"
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb"             = "1"
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
  }

  tags = {
    Environment = "production"
  }
}
```

Configure kubectl Access
```bash
# Update kubeconfig
aws eks update-kubeconfig \
  --region us-east-1 \
  --name production-cluster

# Verify connection
kubectl get nodes
kubectl get pods -A
```

AWS-Specific Integrations
1. IAM Roles for Service Accounts (IRSA)
IRSA allows Kubernetes pods to assume AWS IAM roles without storing credentials. The cluster’s OIDC provider signs a projected service-account token, and the AWS SDK inside the pod exchanges that token for temporary credentials via `sts:AssumeRoleWithWebIdentity`.
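The `eksctl` commands in this section wire all of this up automatically, but it helps to see what they generate: an IAM role whose trust policy only a specific Kubernetes service account can satisfy. A sketch of that trust-policy document (the account ID and OIDC provider ID are placeholders):

```python
import json

# Hypothetical identifiers - substitute your account ID and the cluster's
# OIDC provider ID (visible via `aws eks describe-cluster`).
ACCOUNT_ID = "111122223333"
OIDC_PROVIDER = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"

def irsa_trust_policy(namespace: str, service_account: str) -> dict:
    """IAM trust policy allowing exactly one service account to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{ACCOUNT_ID}:oidc-provider/{OIDC_PROVIDER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Scope the role to a single namespace/service-account pair
                    f"{OIDC_PROVIDER}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    f"{OIDC_PROVIDER}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }

print(json.dumps(irsa_trust_policy("default", "s3-access-sa"), indent=2))
```

The `sub` condition is what prevents every pod in the cluster from borrowing the role: only tokens issued for that exact service account match.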
```bash
# Create OIDC provider (if not already done)
eksctl utils associate-iam-oidc-provider \
  --cluster production-cluster \
  --approve
```

Example: S3 Access for Pods
```bash
# Create IAM policy
cat > s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket/*",
        "arn:aws:s3:::my-app-bucket"
      ]
    }
  ]
}
EOF

# Create IAM role for service account
eksctl create iamserviceaccount \
  --name s3-access-sa \
  --namespace default \
  --cluster production-cluster \
  --attach-policy-arn $(aws iam create-policy \
    --policy-name S3AccessPolicy \
    --policy-document file://s3-policy.json \
    --query 'Policy.Arn' --output text) \
  --approve
```

```yaml
# Deploy application using IRSA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: s3-app
  template:
    metadata:
      labels:
        app: s3-app
    spec:
      serviceAccountName: s3-access-sa  # Uses IRSA
      containers:
        - name: app
          image: my-app:latest
          env:
            - name: AWS_REGION
              value: us-east-1
          # No AWS credentials needed - IRSA handles authentication
```

2. AWS Load Balancer Controller
The AWS Load Balancer Controller replaces the deprecated ALB Ingress Controller, provisioning ALBs for Ingress resources and NLBs for Services:
```bash
# Install AWS Load Balancer Controller
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Create IAM policy
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json

aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam-policy.json

# Create service account with IAM role
eksctl create iamserviceaccount \
  --cluster=production-cluster \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
  --approve

# Install controller
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=production-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller
```

Application Load Balancer (ALB) Ingress
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app
  annotations:
    # ALB Configuration
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'

    # Certificate
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:ACCOUNT:certificate/CERT_ID

    # Health check
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '15'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '2'

    # WAF
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:ACCOUNT:regional/webacl/NAME/ID

    # Access logs
    alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=true,access_logs.s3.bucket=my-logs-bucket
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app
                port:
                  number: 80
```

Network Load Balancer (NLB) Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/health"
spec:
  type: LoadBalancer
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
```

3. Amazon EBS CSI Driver
For persistent storage with EBS volumes:
```yaml
# StorageClass for gp3 volumes (recommended)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 100Gi
---
# StatefulSet using PVC
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-gp3
        resources:
          requests:
            storage: 100Gi
```

4. Amazon EFS CSI Driver
For shared storage across multiple pods:
```bash
# Install EFS CSI Driver
kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"

# Create EFS filesystem
aws efs create-file-system \
  --region us-east-1 \
  --performance-mode generalPurpose \
  --throughput-mode bursting \
  --encrypted \
  --tags Key=Name,Value=eks-efs
```

```yaml
# StorageClass for EFS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-1234567890abcdef0
  directoryPerms: "700"
---
# PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-storage
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
# Deployment using shared EFS storage
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          image: nginx
          volumeMounts:
            - name: shared-data
              mountPath: /usr/share/nginx/html
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: shared-storage
```

EKS Networking Best Practices
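Because the VPC CNI (covered next) hands pods real VPC addresses, subnet sizing is a pod-capacity decision, not just a networking one. AWS reserves five addresses in every subnet; applying that to the /24 private subnets from the Terraform example earlier:

```python
import ipaddress

# AWS reserves 5 addresses per subnet: network, VPC router, DNS,
# one for future use, and broadcast.
AWS_RESERVED_PER_SUBNET = 5

def usable_ips(cidr: str) -> int:
    """Addresses a subnet can actually hand out to nodes and pods."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET

for cidr in ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]:
    print(cidr, "->", usable_ips(cidr), "usable IPs")  # 251 each
```

251 usable IPs per AZ disappears quickly when every pod consumes one, so for dense clusters consider larger private subnets (such as /20) from day one; resizing later means creating new subnets.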
VPC CNI Configuration
The AWS VPC CNI attaches Elastic Network Interfaces (ENIs) to each node and assigns pods real VPC IP addresses from them, which means the instance type’s ENI and per-ENI IP limits bound how many pods a node can run.
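The `ENABLE_PREFIX_DELEGATION` setting in the config below matters because it changes that pod-density arithmetic. A sketch using t3.large’s published limits (3 ENIs with 12 IPv4 addresses each) and the commonly recommended 110-pod ceiling for smaller instances:

```python
def max_pods_standard(enis: int, ips_per_eni: int) -> int:
    """Classic VPC CNI: one IP per ENI is reserved for the node itself,
    plus 2 for host-network pods (the standard max-pods formula)."""
    return enis * (ips_per_eni - 1) + 2

def max_pods_prefix_delegation(enis: int, ips_per_eni: int, cap: int = 110) -> int:
    """With prefix delegation, each IP slot holds a /28 prefix (16 addresses),
    so the raw limit explodes and the recommended per-node cap kicks in."""
    return min(enis * (ips_per_eni - 1) * 16 + 2, cap)

# t3.large: 3 ENIs x 12 IPv4 addresses each
print(max_pods_standard(3, 12))           # 35
print(max_pods_prefix_delegation(3, 12))  # 110 (capped)
```

Without prefix delegation, a t3.large tops out at 35 pods regardless of CPU or memory headroom; with it, the kubelet-side max-pods setting becomes the effective limit.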
```yaml
# Configure VPC CNI for custom networking
apiVersion: v1
kind: ConfigMap
metadata:
  name: amazon-vpc-cni
  namespace: kube-system
data:
  # Enable custom networking
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: "true"
  ENI_CONFIG_LABEL_DEF: "topology.kubernetes.io/zone"

  # Enable prefix delegation for more IPs per node
  ENABLE_PREFIX_DELEGATION: "true"

  # Network policy enforcement
  AWS_VPC_K8S_CNI_NETWORK_POLICY_ENFORCING_MODE: "standard"

  # Pod security group
  ENABLE_POD_ENI: "true"
```

Security Groups for Pods
Assign security groups directly to pods:
```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: database-pods-sg
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: database
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database
    spec:
      containers:
        - name: postgres
          image: postgres:15
```

EKS Security Best Practices
1. Pod Security Standards
```yaml
# Enforce restricted pod security standard
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

2. Network Policies with Calico
Install Calico for network policy support:
```bash
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico-vxlan.yaml
```

```yaml
# Default deny all traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-app
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ingress-controller
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
```

3. Secrets Management with AWS Secrets Manager
```bash
# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets \
  external-secrets/external-secrets \
  -n external-secrets-system \
  --create-namespace

# Create IAM role for External Secrets
eksctl create iamserviceaccount \
  --name external-secrets \
  --namespace external-secrets-system \
  --cluster production-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/SecretsManagerReadWrite \
  --approve
```

```yaml
# SecretStore
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: default
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
---
# ExternalSecret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: default
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: postgres-secret
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef:
        key: prod/database/postgres
        property: username
    - secretKey: password
      remoteRef:
        key: prod/database/postgres
        property: password
```

Monitoring and Observability
Amazon CloudWatch Container Insights
```bash
# Install CloudWatch agent and Fluent Bit
ClusterName=production-cluster
RegionName=us-east-1
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'

curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
  sed "s/{{cluster_name}}/${ClusterName}/;s/{{region_name}}/${RegionName}/;s/{{http_server_toggle}}/\"On\"/;s/{{http_server_port}}/${FluentBitHttpPort}/;s/{{read_from_head}}/${FluentBitReadFromHead}/" | \
  kubectl apply -f -
```

Prometheus and Grafana
```bash
# Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=ebs-gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.storageClassName=ebs-gp3 \
  --set grafana.persistence.size=10Gi
```

Cost Optimization
1. Use Spot Instances for Fault-Tolerant Workloads
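Spot capacity is typically discounted 60–90% relative to On-Demand, with prices that fluctuate by instance type and AZ. A rough savings sketch for a ten-node worker pool (both hourly prices are illustrative assumptions, not quotes):

```python
def spot_savings(on_demand_hourly: float, spot_hourly: float,
                 node_count: int, hours: float = 730) -> dict:
    """Compare monthly cost of a node pool On-Demand vs. Spot."""
    on_demand = on_demand_hourly * node_count * hours
    spot = spot_hourly * node_count * hours
    return {
        "on_demand": round(on_demand, 2),
        "spot": round(spot, 2),
        "savings_pct": round(100 * (on_demand - spot) / on_demand, 1),
    }

# Illustrative t3.large-class prices: assumed $0.0832 On-Demand, $0.025 Spot
print(spot_savings(0.0832, 0.025, node_count=10))
```

The catch is the two-minute interruption notice, which is why the deployment below tolerates the `spot-instance` taint and should only run workloads that can be rescheduled safely.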
```yaml
# Deploy to spot instances
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      nodeSelector:
        workload-type: spot
      tolerations:
        - key: spot-instance
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: processor
          image: batch-processor:latest
```

2. Cluster Autoscaler
```bash
# Install Cluster Autoscaler
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=production-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/cluster-autoscaler
```

3. Karpenter (Next-Gen Autoscaling)
```bash
# Install Karpenter
helm repo add karpenter https://charts.karpenter.sh
helm install karpenter karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/karpenter-controller \
  --set settings.aws.clusterName=production-cluster \
  --set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile
```

```yaml
# Karpenter Provisioner
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.large", "t3.xlarge", "c6i.large", "c6i.xlarge"]
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: production-cluster
  securityGroupSelector:
    karpenter.sh/discovery: production-cluster
  instanceProfile: KarpenterNodeInstanceProfile
  tags:
    ManagedBy: Karpenter
```

Upgrade Strategy
Control Plane Upgrade
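EKS upgrades the control plane one minor version per operation, so a cluster two versions behind needs two full upgrade cycles, and worker nodes, kubelets, and add-ons must be validated at each hop. A small helper to sketch the hop sequence before you start:

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """Minor-version hops required, since EKS upgrades one minor at a time."""
    major, cur_minor = (int(x) for x in current.split("."))
    _, tgt_minor = (int(x) for x in target.split("."))
    if tgt_minor < cur_minor:
        raise ValueError("EKS does not support downgrades")
    return [f"{major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]

print(upgrade_path("1.27", "1.29"))  # ['1.28', '1.29']
```

Check the Kubernetes deprecation notes for every intermediate version in the path, not just the target.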
```bash
# Upgrade EKS control plane
aws eks update-cluster-version \
  --name production-cluster \
  --kubernetes-version 1.29

# Wait for upgrade to complete
aws eks describe-update \
  --name production-cluster \
  --update-id <update-id>
```

Node Group Upgrade
```bash
# Update managed node group
aws eks update-nodegroup-version \
  --cluster-name production-cluster \
  --nodegroup-name general-purpose \
  --kubernetes-version 1.29

# Or using eksctl
eksctl upgrade nodegroup \
  --cluster production-cluster \
  --name general-purpose \
  --kubernetes-version 1.29
```

Disaster Recovery
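Before configuring backups, decide on your recovery point objective (RPO), because the backup schedule is what actually sets it. The relationship is simple arithmetic:

```python
def worst_case_rpo_hours(backups_per_day: int) -> float:
    """With evenly spaced backups, worst-case data loss is one full interval."""
    if backups_per_day < 1:
        raise ValueError("need at least one backup per day")
    return 24 / backups_per_day

print(worst_case_rpo_hours(1))  # 24.0 - a once-daily schedule
print(worst_case_rpo_hours(4))  # 6.0  - every six hours
```

The daily 02:00 Velero schedule shown in this section implies a worst-case RPO of roughly 24 hours; namespaces holding critical state may warrant their own more frequent schedule.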
Backup with Velero
```bash
# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backup-bucket \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero

# Create backup
velero backup create full-cluster-backup \
  --include-namespaces '*' \
  --snapshot-volumes

# Schedule daily backups
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces '*'
```

Production Checklist
Infrastructure
- Multi-AZ cluster deployment
- VPC with private and public subnets
- NAT Gateways in each AZ
- VPC Flow Logs enabled
- Control plane logging enabled
Security
- IRSA configured for workload IAM access
- Pod Security Standards enforced
- Network policies configured
- Secrets stored in AWS Secrets Manager
- Security groups properly configured
- AWS WAF on ALB (if needed)
Networking
- AWS Load Balancer Controller installed
- VPC CNI properly configured
- Network policies enforced
- DNS (Route53 or External DNS) configured
Storage
- EBS CSI Driver installed
- EFS CSI Driver installed (if needed)
- Snapshot policies configured
- Storage classes defined
Monitoring
- CloudWatch Container Insights enabled
- Prometheus and Grafana deployed
- Application metrics exposed
- Alerting configured
- Log aggregation setup
Cost Management
- Cluster Autoscaler or Karpenter installed
- Spot instances for appropriate workloads
- Resource quotas configured
- Cost allocation tags applied
Backup and DR
- Velero backup solution deployed
- Regular backup schedule configured
- DR runbook documented
- Recovery tested
Conclusion
Amazon EKS provides a robust, scalable, and secure platform for running Kubernetes on AWS. By following these best practices and leveraging AWS-native integrations, you can build production-grade container platforms that are reliable, cost-effective, and easy to operate.
The key to success with EKS is understanding both Kubernetes fundamentals and AWS service integrations. Start with a simple cluster, gradually add features as needed, and always prioritize security and reliability.
Ready to master Amazon EKS? Our AWS training programs cover EKS in depth, from basic deployments to advanced multi-cluster architectures. Contact us for customized training tailored to your team’s needs.