Kubernetes Production Best Practices: From Deployment to Day 2 Operations
Running Kubernetes in production is fundamentally different from running it in development. This comprehensive guide covers battle-tested practices for deploying, securing, and operating Kubernetes clusters at scale.
Architecture Design Principles
Cluster Design Patterns
1. Multi-Cluster vs. Single Large Cluster
Multi-Cluster Approach (Recommended for most organizations):
```
Production Cluster (us-east-1)
├── Critical workloads
├── High availability requirements
└── Production data

Staging Cluster (us-east-1)
├── Pre-production testing
└── Integration tests

Development Cluster (us-west-2)
├── Developer experimentation
└── CI/CD testing
```

Benefits:
- Blast radius containment
- Environment isolation
- Independent upgrade cycles
- Multi-region deployments
Trade-offs:
- Higher operational overhead
- Multiple control planes to manage
- Cross-cluster networking complexity
2. Node Pool Strategy
Separate workloads by resource requirements and characteristics:
```yaml
# Example GKE node pool configuration
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: high-cpu-pool
spec:
  cluster: production-cluster
  nodeCount: 3
  nodeConfig:
    machineType: n2-highcpu-8
    diskSizeGb: 100
    diskType: pd-ssd
    labels:
      workload-type: cpu-intensive
    taints:
      - effect: NoSchedule
        key: workload-type
        value: cpu-intensive
  autoscaling:
    minNodeCount: 3
    maxNodeCount: 20
```

Common Node Pools:
- System Pool: Core cluster services (ingress, monitoring, logging)
- General Purpose Pool: Standard application workloads
- High-Memory Pool: Data processing, caching services
- High-CPU Pool: Compute-intensive workloads
- GPU Pool: Machine learning, rendering
- Spot/Preemptible Pool: Cost-optimized for fault-tolerant workloads
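A tainted pool like the high-CPU example above only accepts pods that explicitly tolerate its taint. A sketch of the matching pod-spec fragment, assuming the label and taint shown in the node pool config:

```yaml
# Pod spec fragment targeting the cpu-intensive pool
spec:
  nodeSelector:
    workload-type: cpu-intensive    # Matches the node pool label
  tolerations:
    - key: workload-type
      operator: Equal
      value: cpu-intensive
      effect: NoSchedule            # Tolerates the pool's taint
```

The taint keeps general workloads off the specialized nodes; the nodeSelector keeps the specialized workload from landing elsewhere. You typically need both.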
Namespace Strategy
Organize by teams, environments, or business units:
```yaml
# Production namespace with resource quotas and limits
apiVersion: v1
kind: Namespace
metadata:
  name: ecommerce-prod
  labels:
    environment: production
    team: ecommerce
    cost-center: engineering
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ecommerce-prod-quota
  namespace: ecommerce-prod
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "20"
    services.loadbalancers: "5"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ecommerce-prod-limits
  namespace: ecommerce-prod
spec:
  limits:
    - max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 200m
        memory: 256Mi
      type: Container
```

Resource Management Best Practices
1. Always Define Resource Requests and Limits
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: app
          image: myapp:v1.2.3
          resources:
            requests:
              cpu: 200m       # Guaranteed CPU
              memory: 256Mi   # Guaranteed memory
            limits:
              cpu: 500m       # Maximum CPU
              memory: 512Mi   # Maximum memory (hard limit)
```

Resource Request Guidelines:
- CPU: Set based on average usage, not peaks
- Memory: Set based on maximum expected usage (hard limit causes OOMKill)
- Requests = Limits for critical workloads (guaranteed QoS)
- Requests < Limits for burstable workloads
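Tuning requests against observed usage usually starts with converting Kubernetes quantity strings into plain numbers. A small sketch of two helpers for capacity-planning scripts (hypothetical, not part of any client library; they cover the common suffixes used in this article, not the full quantity grammar):

```python
# Factors for the binary (Ki/Mi/Gi/Ti) and decimal (K/M/G/T) memory suffixes.
MEMORY_SUFFIXES = {
    'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4,
    'K': 1000, 'M': 1000**2, 'G': 1000**3, 'T': 1000**4,
}

def parse_cpu(quantity: str) -> float:
    """Parse a CPU quantity: '500m' -> 0.5 cores, '2' -> 2.0 cores."""
    if quantity.endswith('m'):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)

def parse_memory(quantity: str) -> int:
    """Parse a memory quantity into bytes: '512Mi' -> 536870912."""
    # Check two-letter suffixes first so 'Ki' is not mistaken for 'K'.
    for suffix, factor in MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)  # Plain bytes, no suffix
```

For example, `parse_cpu('200m')` returns `0.2` and `parse_memory('256Mi')` returns `268435456`, which makes it straightforward to compare manifest values against usage numbers pulled from your metrics stack.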
2. Quality of Service (QoS) Classes
Kubernetes assigns QoS based on resource specifications:
```yaml
# Guaranteed QoS - highest priority
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 1Gi
```

```yaml
# Burstable QoS - medium priority
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi
```

```yaml
# BestEffort QoS - lowest priority (NOT recommended for production)
# No resources defined
```

3. Horizontal Pod Autoscaling (HPA)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 4
          periodSeconds: 30
      selectPolicy: Max
```

4. Vertical Pod Autoscaling (VPA)
For workloads with unpredictable resource needs:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"  # Or "Recreate" for stateful apps
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
```

High Availability and Reliability
1. Pod Disruption Budgets (PDB)
Protect against voluntary disruptions (node drains, upgrades):
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: web-app
```

2. Pod Anti-Affinity
Spread pods across nodes and zones:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web-app
              topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web-app
                topologyKey: topology.kubernetes.io/zone
```

3. Liveness and Readiness Probes
```yaml
spec:
  containers:
    - name: app
      image: myapp:v1.2.3
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        timeoutSeconds: 3
        failureThreshold: 3
      startupProbe:  # For slow-starting apps
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 0
        periodSeconds: 10
        failureThreshold: 30  # 300 seconds max startup time
```

Probe Best Practices:
- Liveness: Detects deadlocked processes (restart container)
- Readiness: Detects if app can serve traffic (remove from endpoints)
- Startup: Protects slow-starting apps from premature liveness kills
- Keep probes lightweight (< 100ms response time)
- Avoid checking external dependencies in liveness probes
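A minimal sketch of the `/healthz` and `/ready` endpoints the probes above expect, using only the Python standard library. The paths and port follow the probe config shown earlier; the `app_ready` flag is a stand-in for your real warm-up logic:

```python
import http.server
import threading

# Set once caches are warm, connections are open, etc.
app_ready = threading.Event()

class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/healthz':
            # Liveness: the process responds; no external dependencies checked
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
        elif self.path == '/ready':
            # Readiness: only 200 once the app can actually serve traffic
            self.send_response(200 if app_ready.is_set() else 503)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # Keep probe traffic out of the application logs

def serve(port=8080):
    """Start the health server on a background thread and return it."""
    server = http.server.ThreadingHTTPServer(('', port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Note that `/healthz` stays deliberately dumb: it proves the process is alive, nothing more. Dependency checks belong behind `/ready`, where a failure removes the pod from endpoints instead of restarting it.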
4. Graceful Shutdown
```yaml
spec:
  terminationGracePeriodSeconds: 30  # Pod-level, not per-container
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]
```

Application code should handle SIGTERM:
```python
import signal
import threading

shutdown_event = threading.Event()

def signal_handler(sig, frame):
    print('Received SIGTERM, shutting down gracefully...')
    shutdown_event.set()  # Stop accepting new requests
    # Drain existing connections
    # Close database connections

signal.signal(signal.SIGTERM, signal_handler)

# In your main loop
while not shutdown_event.is_set():
    process_request()
```

Note that the handler only sets the event; calling `sys.exit()` inside it would kill in-flight requests before the main loop has a chance to drain them.

Security Best Practices
1. Pod Security Standards
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

The Restricted Pod Security Standard enforces:
- Run as non-root
- Drop all capabilities
- No privileged containers
- No host namespaces
- No host ports
- Limited volume types
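A pod spec that passes the restricted profile looks roughly like this (a sketch; image and names are illustrative):

```yaml
# Pod-level and container-level settings required by the restricted profile
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:v1.2.3
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

Rolling this out on an existing namespace is easiest with `warn` and `audit` first, so violations surface in admission warnings and audit logs before `enforce` starts rejecting pods.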
2. Network Policies
Default-deny all traffic, then explicitly allow:
```yaml
# Deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:  # Allow DNS
        - namespaceSelector:
            matchLabels:
              name: kube-system
        - podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
```

3. Secrets Management
Never commit secrets to Git or put them in ConfigMaps:
```yaml
# Use sealed secrets or external secrets operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: postgres-secret
  data:
    - secretKey: username
      remoteRef:
        key: prod/database/postgres
        property: username
    - secretKey: password
      remoteRef:
        key: prod/database/postgres
        property: password
```

Best Practices:
- Use AWS Secrets Manager, HashiCorp Vault, or Google Secret Manager
- Rotate secrets regularly
- Use different secrets per environment
- Enable encryption at rest for etcd
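Inside the pod, rotation works best when secrets are read from mounted files rather than environment variables, since files under a Secret volume are updated in place while env vars are frozen at container start. A sketch (the mount path and helper are assumptions, not a standard API):

```python
import os
from pathlib import Path

def read_secret(name: str, mount_dir: str = '/var/run/secrets/app') -> str:
    """Read a secret from a mounted Secret volume, falling back to env vars.

    mount_dir is a hypothetical volumeMount path; adjust to your pod spec.
    """
    secret_file = Path(mount_dir) / name
    if secret_file.is_file():
        return secret_file.read_text().strip()
    # Fallback: 'db-password' -> DB_PASSWORD
    value = os.environ.get(name.upper().replace('-', '_'))
    if value is None:
        raise KeyError(f'secret {name!r} not found in {mount_dir} or environment')
    return value
```

Re-reading the file on each use (or on a timer) means a rotated secret takes effect without a pod restart, which pairs well with the `refreshInterval` in the ExternalSecret above.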
4. RBAC (Role-Based Access Control)
```yaml
# Principle of least privilege
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: production
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "deployments"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["patch"]  # For scaling only
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: production
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```

Observability and Monitoring
1. Structured Logging
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  parsers.conf: |
    [PARSER]
        Name        json
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ
```

Application logging:
```python
import json
import logging
import os
import sys

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            'timestamp': self.formatTime(record, self.datefmt),
            'level': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
            'pod': os.getenv('HOSTNAME'),
            'namespace': os.getenv('POD_NAMESPACE'),
        }
        if record.exc_info:
            log_obj['exception'] = self.formatException(record.exc_info)
        return json.dumps(log_obj)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

2. Prometheus Metrics
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: web-app
  ports:
    - port: 8080
```

Application metrics:
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counters
requests_total = Counter('http_requests_total', 'Total HTTP requests',
                         ['method', 'endpoint', 'status'])

# Histograms
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration',
                             ['method', 'endpoint'])

# Gauges
active_connections = Gauge('active_connections', 'Number of active connections')

# In your request handler
@request_duration.labels(method='GET', endpoint='/api/users').time()
def get_users():
    # Your code
    requests_total.labels(method='GET', endpoint='/api/users', status=200).inc()
```

3. Distributed Tracing
```yaml
# OpenTelemetry Collector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  template:
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector:latest
          ports:
            - containerPort: 4317  # OTLP gRPC
            - containerPort: 4318  # OTLP HTTP
```

GitOps and Deployment Strategies
1. Blue-Green Deployments
```yaml
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
      version: green
  template:
    metadata:
      labels:
        app: web-app
        version: green
    spec:
      containers:
        - name: app
          image: myapp:v2.0.0
---
# Service switches between blue and green
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
    version: green  # Switch from 'blue' to 'green'
```

2. Canary Deployments with Flagger
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
```

Day 2 Operations Checklist
Pre-Production
- Multi-zone cluster for high availability
- Node pools configured per workload type
- Resource quotas and limit ranges defined
- Network policies configured (default-deny)
- Pod Security Standards enforced
- RBAC roles and bindings configured
- Secrets management solution deployed
- GitOps tooling configured (ArgoCD/Flux)
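For the GitOps item, a minimal Argo CD Application sketch (repo URL, path, and namespaces are placeholders for your own layout):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-manifests  # Placeholder repo
    targetRevision: main
    path: apps/web-app/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: ecommerce-prod
  syncPolicy:
    automated:
      prune: true      # Delete resources removed from Git
      selfHeal: true   # Revert manual drift back to Git state
```

With `prune` and `selfHeal` enabled, Git becomes the single source of truth: manual `kubectl` edits are reverted and deleted manifests are cleaned up automatically.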
Monitoring and Observability
- Prometheus and Grafana deployed
- Application metrics exposed
- Structured logging configured
- Log aggregation setup (ELK/Loki)
- Distributed tracing enabled
- Alerting rules configured
- On-call procedures documented
Reliability
- All workloads have resource requests/limits
- HPA configured for scalable workloads
- PodDisruptionBudgets for critical services
- Pod anti-affinity rules for HA
- Health checks (liveness/readiness) configured
- Graceful shutdown implemented
- Backup strategy for stateful workloads
Security
- Network policies enforced
- Pod Security Standards enabled
- Secrets encrypted at rest
- Regular vulnerability scanning
- Audit logging enabled
- Compliance requirements met (PCI, HIPAA, etc.)
Cost Optimization
- Right-sized node pools
- Cluster autoscaler configured
- Spot instances for fault-tolerant workloads
- Resource requests tuned based on actual usage
- Unused resources identified and removed
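Tuning requests from actual usage typically means taking a high percentile of observed consumption plus headroom, so the request covers normal load without paying for idle peaks. A hypothetical helper over usage samples you would export from your metrics stack:

```python
import math

def recommend_request(samples: list[float], percentile: float = 95.0,
                      headroom: float = 1.2) -> float:
    """Recommend a resource request from observed usage samples.

    Takes the given usage percentile (nearest-rank method) and multiplies
    by a safety margin. Units are whatever the samples use (cores, bytes).
    """
    if not samples:
        raise ValueError('no usage samples')
    ordered = sorted(samples)
    # Nearest-rank percentile: index of the value at or above `percentile`%
    rank = max(0, math.ceil(percentile / 100 * len(ordered)) - 1)
    return ordered[rank] * headroom
```

For example, feeding in a week of per-pod CPU samples and comparing the result against the manifest's `requests.cpu` quickly surfaces the chronically over-provisioned workloads.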
Conclusion
Running Kubernetes in production requires careful planning, robust automation, and operational discipline. Focus on reliability, security, and observability from day one. Automate everything possible, monitor ruthlessly, and always have a rollback plan.
Remember: Kubernetes gives you powerful primitives, but it’s your responsibility to use them correctly. Invest time in understanding these best practices—they’ll save you from costly outages and security incidents down the road.
Need help running Kubernetes in production? Our Kubernetes training programs cover everything from fundamentals to advanced Day 2 operations with hands-on labs and real-world scenarios. Explore Kubernetes training or contact us for customized enterprise training.