Production Observability with Prometheus and Grafana: Complete Guide
Observability is the ability to understand a system’s internal state from its external outputs. Modern applications require comprehensive monitoring, metrics, logging, and tracing to maintain reliability and performance. This guide covers building production-grade observability with Prometheus and Grafana.
The Three Pillars of Observability
1. Metrics (Prometheus)
Time-series data showing system behavior over time.
2. Logs (Loki/ELK)
Detailed event records for debugging and auditing.
3. Traces (Jaeger/Tempo)
Request flow across distributed services.
Prometheus Fundamentals
Architecture Overview
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Target 1   │    │   Target 2   │    │   Target 3   │
│  (Exporter)  │    │ (Application)│    │  (K8s API)   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │  (Scrape Metrics)
                           ▼
                  ┌────────────────┐
                  │   Prometheus   │
                  │     Server     │
                  └────────┬───────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
   ┌───────────┐   ┌──────────────┐   ┌───────────┐
   │  Grafana  │   │ AlertManager │   │    API    │
   │(Visualize)│   │   (Alerts)   │   │ (Queries) │
   └───────────┘   └──────────────┘   └───────────┘
```
Installation on Kubernetes
```bash
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml
```
prometheus-values.yaml:
```yaml
prometheus:
  prometheusSpec:
    # Retention period
    retention: 30d
    retentionSize: "50GB"

    # Storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
          storageClassName: gp3

    # Resource limits
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi

    # Service monitors
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

    # External labels
    externalLabels:
      cluster: production
      environment: prod

    # Remote write (for long-term storage)
    remoteWrite:
      - url: https://prometheus-long-term-storage.company.com/api/v1/write
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: 'go_.*|process_.*'
            action: drop

grafana:
  adminPassword: "changeme"

  # Persistence
  persistence:
    enabled: true
    size: 10Gi

  # Datasources
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-kube-prometheus-prometheus:9090
          isDefault: true
          jsonData:
            timeInterval: 30s

alertmanager:
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-critical'
      routes:
        - match:
            severity: critical
          receiver: slack-critical
        - match:
            severity: warning
          receiver: slack-warning

    receivers:
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: 'Critical Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: 'slack-warning'
        slack_configs:
          - channel: '#alerts-warning'
            title: 'Warning: {{ .GroupLabels.alertname }}'
```
Application Instrumentation
1. Python with Prometheus Client
```python
from prometheus_client import Counter, Histogram, Gauge, make_wsgi_app
from flask import Flask, request
from werkzeug.middleware.dispatcher import DispatcherMiddleware
import time

app = Flask(__name__)

# Add Prometheus WSGI middleware to route /metrics requests
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {
    '/metrics': make_wsgi_app()
})

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Number of active requests'
)

DATABASE_CONNECTIONS = Gauge(
    'database_connections_active',
    'Number of active database connections'
)

# Business metrics
ORDER_TOTAL = Counter(
    'orders_total',
    'Total number of orders',
    ['status', 'product_category']
)

ORDER_VALUE = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 25, 50, 100, 250, 500, 1000]
)

@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()

    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.path
    ).observe(request_duration)

    ACTIVE_REQUESTS.dec()
    return response

@app.route('/api/orders', methods=['POST'])
def create_order():
    order_data = request.json

    # Process order
    order_value = order_data['amount']
    category = order_data['category']

    # Record business metrics
    ORDER_TOTAL.labels(status='created', product_category=category).inc()
    ORDER_VALUE.observe(order_value)

    return {'order_id': '12345', 'status': 'created'}, 201

if __name__ == '__main__':
    app.run(port=8080)
```
2. Go with Prometheus Client
```go
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)

	activeRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_requests_active",
			Help: "Number of active requests",
		},
	)

	// Business metrics
	orderTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "orders_total",
			Help: "Total number of orders",
		},
		[]string{"status", "product_category"},
	)

	orderValue = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "order_value_dollars",
			Help:    "Order value in dollars",
			Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
		},
	)
)

func prometheusMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		activeRequests.Inc()
		defer activeRequests.Dec()

		// Wrap the response writer to capture the status code
		wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

		next.ServeHTTP(wrapped, r)

		duration := time.Since(start).Seconds()

		httpRequestsTotal.WithLabelValues(
			r.Method,
			r.URL.Path,
			// Use the numeric code ("200"), not http.StatusText ("OK"),
			// so the status=~"5.." PromQL queries below match
			strconv.Itoa(wrapped.statusCode),
		).Inc()

		httpRequestDuration.WithLabelValues(
			r.Method,
			r.URL.Path,
		).Observe(duration)
	})
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func main() {
	mux := http.NewServeMux()

	// Application endpoints
	mux.HandleFunc("/api/orders", createOrder)

	// Metrics endpoint
	mux.Handle("/metrics", promhttp.Handler())

	// Wrap with the Prometheus middleware
	log.Fatal(http.ListenAndServe(":8080", prometheusMiddleware(mux)))
}

func createOrder(w http.ResponseWriter, r *http.Request) {
	// Business logic
	orderTotal.WithLabelValues("created", "electronics").Inc()
	orderValue.Observe(99.99)

	w.WriteHeader(http.StatusCreated)
	w.Write([]byte(`{"order_id": "12345"}`))
}
```
3. Node.js with prom-client
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Registry
const register = new promClient.Registry();

// Add default metrics
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
  registers: [register]
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'endpoint'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const activeRequests = new promClient.Gauge({
  name: 'http_requests_active',
  help: 'Number of active requests',
  registers: [register]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestsTotal.labels(req.method, req.path, res.statusCode).inc();
    httpRequestDuration.labels(req.method, req.path).observe(duration);
    activeRequests.dec();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.post('/api/orders', (req, res) => {
  // Business logic
  res.status(201).json({ order_id: '12345' });
});

app.listen(8080, () => {
  console.log('Server listening on port 8080');
});
```
ServiceMonitor Configuration
Auto-discover application metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      # Metric relabeling
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop
      # Add custom labels
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
```
PromQL Queries
1. Request Rate (QPS)
```promql
# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Total QPS
sum(rate(http_requests_total[5m]))
```
2. Error Rate
```promql
# Error percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

# Errors per second by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
```
3. Latency Percentiles
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# P50, P95, P99
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
4. Resource Utilization
```promql
# CPU usage
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage
container_memory_usage_bytes{namespace="production"}

# Disk I/O
rate(container_fs_reads_bytes_total[5m])
rate(container_fs_writes_bytes_total[5m])

# Network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
```
5. Kubernetes Metrics
```promql
# Pod restarts
increase(kube_pod_container_status_restarts_total[1h])

# Pods not ready
kube_pod_status_phase{phase!="Running"}

# Node CPU reserved by the system (capacity minus allocatable)
# Note: kube-state-metrics v2 replaced the old *_cpu_cores metrics
# with a resource label
kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}

# Persistent Volume usage (percent)
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
```
Alert Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
spec:
  groups:
    - name: application
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High latency on {{ $labels.endpoint }}"
            description: "P95 latency is {{ $value }}s (threshold: 2s)"

        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_phase{phase!="Running",namespace="production"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} not ready"
            description: "Pod has been in {{ $labels.phase }} state for > 5 minutes"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            (
              container_memory_usage_bytes{namespace="production"}
              /
              container_spec_memory_limit_bytes{namespace="production"}
            ) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"

        # Node disk pressure
        - alert: NodeDiskPressure
          expr: |
            kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} under disk pressure"
            description: "Node is experiencing disk pressure"

        # Certificate expiring soon
        - alert: CertificateExpiringSoon
          expr: |
            (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate expiring soon"
            description: "Certificate for {{ $labels.instance }} expires in {{ $value }} days"
```
Grafana Dashboards
1. Application Overview Dashboard
```json
{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{ endpoint }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency Percentiles",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```
2. RED Method Dashboard
Rate, Errors, Duration for each service:
- Rate: Request volume
- Errors: Failed requests
- Duration: Latency distribution
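Assuming the metric names used in the instrumentation sections above (`http_requests_total`, `http_request_duration_seconds`) and a `job` label set by Prometheus scraping, the three RED panels for a service can be generated with a small helper. This is a sketch, not a fixed API; adapt the label names to your own instrumentation:

```python
def red_queries(job: str, window: str = "5m") -> dict:
    """Build the three RED-method PromQL queries for one service (job label)."""
    total = f'rate(http_requests_total{{job="{job}"}}[{window}])'
    errors = f'rate(http_requests_total{{job="{job}",status=~"5.."}}[{window}])'
    return {
        # Rate: requests per second, summed across instances
        "rate": f"sum({total})",
        # Errors: fraction of requests that returned a 5xx status
        "errors": f"sum({errors}) / sum({total})",
        # Duration: P95 latency from the histogram buckets
        "duration": (
            "histogram_quantile(0.95, sum(rate("
            f'http_request_duration_seconds_bucket{{job="{job}"}}[{window}]'
            ")) by (le))"
        ),
    }

if __name__ == "__main__":
    for name, query in red_queries("myapp").items():
        print(f"{name}: {query}")
```

The resulting strings can be pasted into Grafana panels or sent to Prometheus's `GET /api/v1/query` endpoint via the `query` parameter.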
3. USE Method Dashboard
For resources (CPU, Memory, Disk):
- Utilization: Percentage used
- Saturation: Queue depth
- Errors: Error count
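As one illustration of translating USE into PromQL, the helper below builds the three CPU queries from standard node_exporter metric names (`node_cpu_seconds_total`, `node_load1`; the error counter assumes the edac collector is enabled). A sketch under those assumptions; other resources (memory, disk) follow the same pattern:

```python
def use_queries_cpu(window: str = "5m") -> dict:
    """USE-method queries for CPU, using standard node_exporter metric names."""
    idle = f'rate(node_cpu_seconds_total{{mode="idle"}}[{window}])'
    return {
        # Utilization: fraction of CPU time not spent idle, per node
        "utilization": f"1 - avg by (instance) ({idle})",
        # Saturation: 1-minute load average relative to core count
        "saturation": (
            'node_load1 / count by (instance) '
            '(node_cpu_seconds_total{mode="idle"})'
        ),
        # Errors: hardware error counters, where the edac collector exposes them
        "errors": f"rate(node_edac_correctable_errors_total[{window}])",
    }

if __name__ == "__main__":
    for dimension, query in use_queries_cpu().items():
        print(f"{dimension}: {query}")
```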
Long-Term Storage
Thanos for Multi-Cluster Monitoring
```yaml
# Thanos sidecar
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  thanos:
    image: quay.io/thanos/thanos:v0.32.0
    version: v0.32.0
    objectStorageConfig:
      key: objstore.yml
      name: thanos-objstore-config
---
# Object storage configuration
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
stringData:
  objstore.yml: |
    type: S3
    config:
      bucket: thanos-metrics
      endpoint: s3.amazonaws.com
      region: us-east-1
```
Production Checklist
- Prometheus deployed with persistent storage
- Grafana configured with authentication
- Applications instrumented with metrics
- ServiceMonitors configured for auto-discovery
- Alert rules defined and tested
- AlertManager routing configured
- Dashboards created for key services
- Long-term storage configured (Thanos/Cortex)
- Backup and disaster recovery plan
- Team trained on PromQL and Grafana
- SLOs/SLIs defined
- On-call runbooks documented
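When verifying the instrumentation items above, it helps to sanity-check what a `/metrics` endpoint actually exposes. The stdlib-only sketch below parses just enough of the Prometheus text exposition format to list metric names; the sample payload is illustrative, echoing the metrics defined earlier in this guide:

```python
import re

# Illustrative /metrics payload, as produced by the clients shown above
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/orders",status="200"} 42
http_requests_total{method="POST",endpoint="/api/orders",status="201"} 7
# TYPE http_requests_active gauge
http_requests_active 3
"""

def metric_names(payload: str) -> set:
    """Extract metric names from a Prometheus text-format payload."""
    names = set()
    for line in payload.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = re.match(r"([a-zA-Z_:][a-zA-Z0-9_:]*)", line)
        if m:
            names.add(m.group(1))
    return names

if __name__ == "__main__":
    print(sorted(metric_names(SAMPLE)))
    # → ['http_requests_active', 'http_requests_total']
    # Against a live service, fetch the payload first, e.g.:
    #   urllib.request.urlopen("http://localhost:8080/metrics").read().decode()
```

For anything beyond a quick check, the `prometheus_client` package ships a full parser (`prometheus_client.parser`), but the point here is that the exposition format is plain text you can inspect by hand.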
Conclusion
Production observability requires comprehensive metrics collection, visualization, and alerting. Prometheus and Grafana provide powerful, scalable solutions for monitoring modern cloud-native applications.
Focus on instrumenting applications with meaningful metrics, defining actionable alerts, and building dashboards that help teams understand system behavior. Observability is not a one-time setup—it’s an ongoing practice that evolves with your applications.
Need help building observability? Our DevOps training covers Prometheus, Grafana, distributed tracing, and SRE practices with hands-on labs. Explore observability training or contact us for monitoring architecture consulting.