Vladimir Chavkov

Kubernetes Production Best Practices: From Deployment to Day 2 Operations


Running Kubernetes in production is fundamentally different from running it in development. This comprehensive guide covers battle-tested practices for deploying, securing, and operating Kubernetes clusters at scale.

Architecture Design Principles

Cluster Design Patterns

1. Multi-Cluster vs. Single Large Cluster

Multi-Cluster Approach (Recommended for most organizations):

```
Production Cluster (us-east-1)
├── Critical workloads
├── High availability requirements
└── Production data

Staging Cluster (us-east-1)
├── Pre-production testing
└── Integration tests

Development Cluster (us-west-2)
├── Developer experimentation
└── CI/CD testing
```

Benefits:

- Blast-radius isolation: a failed upgrade or misconfiguration affects one environment, not all of them
- Hard security and compliance boundaries between production and non-production
- Independent Kubernetes version upgrades and maintenance windows per cluster

Trade-offs:

- Higher baseline cost from duplicated control planes and system services
- More operational overhead: every cluster must be patched, monitored, and secured

2. Node Pool Strategy

Separate workloads by resource requirements and characteristics:

```yaml
# Example GKE node pool configuration (Config Connector resource)
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerNodePool
metadata:
  name: high-cpu-pool
spec:
  cluster: production-cluster
  nodeCount: 3
  nodeConfig:
    machineType: n2-highcpu-8
    diskSizeGb: 100
    diskType: pd-ssd
    labels:
      workload-type: cpu-intensive
    taints:
    - effect: NoSchedule
      key: workload-type
      value: cpu-intensive
  autoscaling:
    minNodeCount: 3
    maxNodeCount: 20
```

Common Node Pools:

  1. System Pool: Core cluster services (ingress, monitoring, logging)
  2. General Purpose Pool: Standard application workloads
  3. High-Memory Pool: Data processing, caching services
  4. High-CPU Pool: Compute-intensive workloads
  5. GPU Pool: Machine learning, rendering
  6. Spot/Preemptible Pool: Cost-optimized for fault-tolerant workloads
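
To land pods on a tainted pool such as the `high-cpu-pool` above, the workload needs a matching toleration and node selector. A minimal sketch, where the label and taint keys mirror the example node pool configuration and the container name and image are illustrative:

```yaml
# Pod template fragment targeting the cpu-intensive pool
spec:
  nodeSelector:
    workload-type: cpu-intensive
  tolerations:
  - key: workload-type
    operator: Equal
    value: cpu-intensive
    effect: NoSchedule
  containers:
  - name: batch-worker         # hypothetical workload
    image: myapp-batch:v1.0.0  # hypothetical image
```

The toleration only permits scheduling onto the tainted nodes; the nodeSelector is what actually pins the pod there.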

Namespace Strategy

Organize by teams, environments, or business units:

```yaml
# Production namespace with resource quotas and limits
apiVersion: v1
kind: Namespace
metadata:
  name: ecommerce-prod
  labels:
    environment: production
    team: ecommerce
    cost-center: engineering
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ecommerce-prod-quota
  namespace: ecommerce-prod
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "20"
    services.loadbalancers: "5"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: ecommerce-prod-limits
  namespace: ecommerce-prod
spec:
  limits:
  - max:
      cpu: "4"
      memory: 8Gi
    min:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 200m
      memory: 256Mi
    type: Container
```
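
With the LimitRange above, a container that omits a resources block is still admitted and receives the namespace defaults at admission time. A sketch of such a pod (the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: defaults-demo  # hypothetical pod
  namespace: ecommerce-prod
spec:
  containers:
  - name: app
    image: nginx:1.27
    # No resources block: the LimitRange injects
    # requests of 200m CPU / 256Mi and limits of 500m CPU / 512Mi.
```

This is a safety net, not a substitute for setting requests explicitly per workload.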

Resource Management Best Practices

1. Always Define Resource Requests and Limits

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: app
        image: myapp:v1.2.3
        resources:
          requests:
            cpu: 200m      # Guaranteed CPU
            memory: 256Mi  # Guaranteed memory
          limits:
            cpu: 500m      # Maximum CPU
            memory: 512Mi  # Maximum memory (hard limit)
```

Resource Request Guidelines:

- Base requests on observed usage (for example, the P95 over a representative period), not guesses
- Set memory limits equal to or close to requests to make OOM behavior predictable
- Be cautious with CPU limits: they cause throttling, and many teams set CPU requests only

2. Quality of Service (QoS) Classes

Kubernetes assigns QoS based on resource specifications:

```yaml
# Guaranteed QoS - highest priority (requests equal limits)
resources:
  requests:
    cpu: 1000m
    memory: 1Gi
  limits:
    cpu: 1000m
    memory: 1Gi

# Burstable QoS - medium priority (requests below limits)
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

# BestEffort QoS - lowest priority (NOT recommended for production)
# No requests or limits defined
```

3. Horizontal Pod Autoscaling (HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 4
        periodSeconds: 30
      selectPolicy: Max
```

4. Vertical Pod Autoscaling (VPA)

For workloads with unpredictable resource needs:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Auto"  # Or "Recreate" for stateful apps
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
```

High Availability and Reliability

1. Pod Disruption Budgets (PDB)

Protect against voluntary disruptions (node drains, upgrades):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: web-app
```

2. Pod Anti-Affinity

Spread pods across nodes and zones:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web-app
            topologyKey: kubernetes.io/hostname
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web-app
              topologyKey: topology.kubernetes.io/zone
```
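
A related mechanism for the zone-spreading half of this goal is topologySpreadConstraints, which expresses even distribution declaratively instead of through pairwise anti-affinity. A sketch using the same labels as above:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway  # soft, like preferred anti-affinity
    labelSelector:
      matchLabels:
        app: web-app
```

Spread constraints scale better than anti-affinity on large clusters because the scheduler only tracks per-topology counts rather than evaluating pod pairs.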

3. Liveness and Readiness Probes

```yaml
spec:
  containers:
  - name: app
    image: myapp:v1.2.3
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe:  # For slow-starting apps
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      failureThreshold: 30  # Up to 300 seconds of startup time
```

Probe Best Practices:

- Keep probe endpoints cheap; never call databases or downstream services from a liveness probe
- Readiness should reflect real dependencies; liveness should fail only when a restart would actually help
- Prefer a startupProbe over a long initialDelaySeconds for slow-starting applications
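
The probe configuration above assumes the application actually serves `/healthz` and `/ready`. A minimal sketch of those handlers using only Python's standard library; a real service would wire the same logic into its web framework:

```python
# Minimal liveness/readiness endpoints matching the probe paths above.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

app_ready = threading.Event()  # set once caches are warm, connections open, etc.

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: 200 as long as the process can serve requests at all.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        elif self.path == "/ready":
            # Readiness: 200 only after startup work has completed.
            self.send_response(200 if app_ready.is_set() else 503)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of stdout

def serve(port=8080):
    """Start the health server in a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real service you would call `serve()` at startup and set `app_ready` once initialization finishes; until then the readiness probe fails and the pod receives no traffic.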

4. Graceful Shutdown

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]
```

Application code should handle SIGTERM:

```python
import signal
import threading

shutdown_event = threading.Event()

def signal_handler(sig, frame):
    print('Received SIGTERM, shutting down gracefully...')
    # Stop accepting new requests, drain existing connections,
    # close database connections, then let the main loop exit.
    shutdown_event.set()

signal.signal(signal.SIGTERM, signal_handler)

# In your main loop
while not shutdown_event.is_set():
    process_request()
```

Security Best Practices

1. Pod Security Standards

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

The Restricted Pod Security Standard enforces:

- Containers run as non-root with privilege escalation disallowed
- All capabilities dropped (only NET_BIND_SERVICE may be added back)
- A seccomp profile set to RuntimeDefault or Localhost
- No host namespaces, host ports, or hostPath volumes

2. Network Policies

Default-deny all traffic, then explicitly allow:

```yaml
# Deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-app-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: web-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  - to:  # Allow DNS lookups to kube-dns only
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
```

3. Secrets Management

Never commit secrets to Git or store them in ConfigMaps:

```yaml
# Use sealed secrets or the External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: postgres-secret
  data:
  - secretKey: username
    remoteRef:
      key: prod/database/postgres
      property: username
  - secretKey: password
    remoteRef:
      key: prod/database/postgres
      property: password
```

Best Practices:

- Keep the source of truth in an external secrets manager (AWS Secrets Manager, Vault) and sync into the cluster
- Enable encryption at rest for etcd
- Rotate credentials regularly and scope who can read Secrets with RBAC
- Never bake secrets into container images or commit them to Git

4. RBAC (Role-Based Access Control)

```yaml
# Principle of least privilege
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: production
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "pods/log", "deployments"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments/scale"]  # scale subresource: scaling only, no other patches
  verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: production
subjects:
- kind: Group
  name: developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```

Observability and Monitoring

1. Structured Logging

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  parsers.conf: |
    [PARSER]
        Name        json
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ
```

Application logging:

```python
import json
import logging
import os
import sys

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            'timestamp': self.formatTime(record, self.datefmt),
            'level': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
            'pod': os.getenv('HOSTNAME'),
            'namespace': os.getenv('POD_NAMESPACE'),
        }
        if record.exc_info:
            log_obj['exception'] = self.formatException(record.exc_info)
        return json.dumps(log_obj)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

2. Prometheus Metrics

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: web-app
  ports:
  - port: 8080
```

Application metrics:

```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Counters
requests_total = Counter('http_requests_total', 'Total HTTP requests',
                         ['method', 'endpoint', 'status'])
# Histograms
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration',
                             ['method', 'endpoint'])
# Gauges
active_connections = Gauge('active_connections', 'Number of active connections')

# Expose /metrics for Prometheus to scrape
start_http_server(8080)

# In your request handler
@request_duration.labels(method='GET', endpoint='/api/users').time()
def get_users():
    # Your code
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
```

3. Distributed Tracing

```yaml
# OpenTelemetry Collector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector:latest  # pin a version in production
        ports:
        - containerPort: 4317  # OTLP gRPC
        - containerPort: 4318  # OTLP HTTP
```

GitOps and Deployment Strategies

1. Blue-Green Deployments

```yaml
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
      version: green
  template:
    metadata:
      labels:
        app: web-app
        version: green
    spec:
      containers:
      - name: app
        image: myapp:v2.0.0
---
# Service switches between blue and green
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
    version: green  # Switch from 'blue' to 'green'
  ports:
  - port: 8080
```

2. Canary Deployments with Flagger

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
```

Day 2 Operations Checklist

Pre-Production

- Resource requests and limits defined for every workload
- PodDisruptionBudgets and anti-affinity or spread rules in place
- Liveness, readiness, and startup probes configured and tested
- Load testing and failure injection completed in staging

Monitoring and Observability

- Metrics, logs, and traces shipped to a central backend
- Alerts for SLO burn rate, crash loops, and node pressure
- Dashboards for cluster health and per-service golden signals

Reliability

- Backup and restore tested (etcd and persistent volumes)
- Rollback procedure documented and rehearsed
- Cluster upgrade process validated in staging first

Security

- Pod Security Standards enforced on all namespaces
- Default-deny network policies with explicit allows
- RBAC reviewed; no broad cluster-admin bindings
- Images scanned and pulled from a trusted registry

Cost Optimization

- HPA, VPA, and the cluster autoscaler configured where appropriate
- Spot/preemptible node pools for fault-tolerant workloads
- Resource quotas enforced and idle workloads cleaned up

Conclusion

Running Kubernetes in production requires careful planning, robust automation, and operational discipline. Focus on reliability, security, and observability from day one. Automate everything possible, monitor ruthlessly, and always have a rollback plan.

Remember: Kubernetes gives you powerful primitives, but it’s your responsibility to use them correctly. Invest time in understanding these best practices—they’ll save you from costly outages and security incidents down the road.


Need help running Kubernetes in production? Our Kubernetes training programs cover everything from fundamentals to advanced Day 2 operations with hands-on labs and real-world scenarios. Explore Kubernetes training or contact us for customized enterprise training.

