Vladimir Chavkov

Production Observability with Prometheus and Grafana: Complete Guide

Observability is the ability to understand a system’s internal state from its external outputs. Modern applications require comprehensive monitoring, metrics, logging, and tracing to maintain reliability and performance. This guide covers building production-grade observability with Prometheus and Grafana.

The Three Pillars of Observability

1. Metrics (Prometheus)

Time-series data showing system behavior over time.

2. Logs (Loki/ELK)

Detailed event records for debugging and auditing.

3. Traces (Jaeger/Tempo)

Request flow across distributed services.

Prometheus Fundamentals

Architecture Overview

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Target 1   │   │   Target 2   │   │   Target 3   │
│  (Exporter)  │   │ (Application)│   │  (K8s API)   │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┴──────────────────┘
                          │
                  (Scrape Metrics)
                          │
                 ┌────────┴───────┐
                 │   Prometheus   │
                 │     Server     │
                 └────────┬───────┘
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
   ┌────────────┐  ┌─────────────┐  ┌───────────┐
   │  Grafana   │  │ AlertManager│  │    API    │
   │ (Visualize)│  │  (Alerts)   │  │ (Queries) │
   └────────────┘  └─────────────┘  └───────────┘
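The pull model in the diagram maps directly to a scrape configuration. As a minimal standalone sketch (target addresses are placeholders; on Kubernetes, the kube-prometheus-stack installed below generates this configuration for you from ServiceMonitors):

```yaml
global:
  scrape_interval: 30s               # how often targets are scraped

scrape_configs:
  - job_name: myapp                  # an instrumented application (Target 2)
    metrics_path: /metrics
    static_configs:
      - targets: ["myapp:8080"]      # placeholder address

  - job_name: node                   # a node_exporter (Target 1)
    static_configs:
      - targets: ["node-exporter:9100"]
```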

Installation on Kubernetes

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml

prometheus-values.yaml:

prometheus:
  prometheusSpec:
    # Retention period
    retention: 30d
    retentionSize: "50GB"
    # Storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
          storageClassName: gp3
    # Resource limits
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi
    # Service monitors
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    # External labels
    externalLabels:
      cluster: production
      environment: prod
    # Remote write (for long-term storage)
    remoteWrite:
      - url: https://prometheus-long-term-storage.company.com/api/v1/write
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: 'go_.*|process_.*'
            action: drop

grafana:
  adminPassword: "changeme"
  # Persistence
  persistence:
    enabled: true
    size: 10Gi
  # Datasources
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-kube-prometheus-prometheus:9090
          isDefault: true
          jsonData:
            timeInterval: 30s

alertmanager:
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-critical'
      routes:
        - match:
            severity: critical
          receiver: slack-critical
        - match:
            severity: warning
          receiver: slack-warning
    receivers:
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: 'Critical Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'slack-warning'
        slack_configs:
          - channel: '#alerts-warning'
            title: 'Warning: {{ .GroupLabels.alertname }}'

Application Instrumentation

1. Python with Prometheus Client

from prometheus_client import Counter, Histogram, Gauge
from prometheus_client import make_wsgi_app
from flask import Flask, request
from werkzeug.middleware.dispatcher import DispatcherMiddleware
import time

app = Flask(__name__)

# Add prometheus wsgi middleware to route /metrics requests
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {
    '/metrics': make_wsgi_app()
})

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)
ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Number of active requests'
)
DATABASE_CONNECTIONS = Gauge(
    'database_connections_active',
    'Number of active database connections'
)

# Business metrics
ORDER_TOTAL = Counter(
    'orders_total',
    'Total number of orders',
    ['status', 'product_category']
)
ORDER_VALUE = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 25, 50, 100, 250, 500, 1000]
)

@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.path
    ).observe(request_duration)
    ACTIVE_REQUESTS.dec()
    return response

@app.route('/api/orders', methods=['POST'])
def create_order():
    order_data = request.json
    # Process order
    order_value = order_data['amount']
    category = order_data['category']
    # Record business metrics
    ORDER_TOTAL.labels(status='created', product_category=category).inc()
    ORDER_VALUE.observe(order_value)
    return {'order_id': '12345', 'status': 'created'}, 201

if __name__ == '__main__':
    app.run(port=8080)

2. Go with Prometheus Client

package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
    activeRequests = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "http_requests_active",
            Help: "Number of active requests",
        },
    )

    // Business metrics
    orderTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "orders_total",
            Help: "Total number of orders",
        },
        []string{"status", "product_category"},
    )
    orderValue = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "order_value_dollars",
            Help:    "Order value in dollars",
            Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
        },
    )
)

func prometheusMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeRequests.Inc()
        defer activeRequests.Dec()

        // Create response writer wrapper to capture status code
        wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
        next.ServeHTTP(wrapped, r)

        duration := time.Since(start).Seconds()
        httpRequestsTotal.WithLabelValues(
            r.Method,
            r.URL.Path,
            strconv.Itoa(wrapped.statusCode), // numeric code such as "200", not http.StatusText's "OK"
        ).Inc()
        httpRequestDuration.WithLabelValues(
            r.Method,
            r.URL.Path,
        ).Observe(duration)
    })
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func main() {
    mux := http.NewServeMux()

    // Application endpoints
    mux.HandleFunc("/api/orders", createOrder)

    // Metrics endpoint
    mux.Handle("/metrics", promhttp.Handler())

    // Wrap with prometheus middleware
    http.ListenAndServe(":8080", prometheusMiddleware(mux))
}

func createOrder(w http.ResponseWriter, r *http.Request) {
    // Business logic
    orderTotal.WithLabelValues("created", "electronics").Inc()
    orderValue.Observe(99.99)

    w.WriteHeader(http.StatusCreated)
    w.Write([]byte(`{"order_id": "12345"}`))
}

3. Node.js with prom-client

const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Registry
const register = new promClient.Registry();

// Add default metrics
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
  registers: [register]
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'endpoint'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const activeRequests = new promClient.Gauge({
  name: 'http_requests_active',
  help: 'Number of active requests',
  registers: [register]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  activeRequests.inc();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.labels(req.method, req.path, res.statusCode).inc();
    httpRequestDuration.labels(req.method, req.path).observe(duration);
    activeRequests.dec();
  });
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.post('/api/orders', (req, res) => {
  // Business logic
  res.status(201).json({ order_id: '12345' });
});

app.listen(8080, () => {
  console.log('Server listening on port 8080');
});

ServiceMonitor Configuration

Auto-discover application metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      # Metric relabeling
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop
      # Add custom labels
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
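A ServiceMonitor selects Services, not Pods, so the application also needs a Service whose labels match `spec.selector` and whose port name matches `endpoints[].port`. A minimal sketch (the port number is an assumption to match the examples above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: myapp            # matches the application Pods
  ports:
    - name: metrics       # must equal endpoints[].port in the ServiceMonitor
      port: 8080
      targetPort: 8080
```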

PromQL Queries

1. Request Rate (QPS)

# Requests per second
rate(http_requests_total[5m])
# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
# Total QPS
sum(rate(http_requests_total[5m]))

2. Error Rate

# Error percentage
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
# Errors per second by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)

3. Latency Percentiles

# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
# P50, P95, P99
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
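`histogram_quantile` estimates percentiles from the cumulative `_bucket` counters by linear interpolation inside the bucket that contains the requested rank. A small Python sketch of the idea (not the actual Prometheus implementation, just the same interpolation logic):

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative (le, count) buckets,
    interpolating linearly within the bucket that contains the rank."""
    total = buckets[-1][1]            # the +Inf bucket holds all observations
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le        # rank falls in +Inf: return last finite bound
            # linear interpolation inside this bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

# 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # ~0.778, interpolated in the 0.5-1.0s bucket
```

This is why bucket boundaries matter: the estimate is only as precise as the bucket that contains the quantile, so choose buckets around your SLO thresholds.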

4. Resource Utilization

# CPU usage
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])
# Memory usage
container_memory_usage_bytes{namespace="production"}
# Disk I/O
rate(container_fs_reads_bytes_total[5m])
rate(container_fs_writes_bytes_total[5m])
# Network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])

5. Kubernetes Metrics

# Pod restarts
increase(kube_pod_container_status_restarts_total[1h])
# Pods not in Running phase
kube_pod_status_phase{phase!="Running"}
# Unallocatable CPU per node (capacity minus allocatable; kube-state-metrics v2 metric names)
kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}
# Persistent Volume usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100

Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
spec:
  groups:
    - name: application
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High latency on {{ $labels.endpoint }}"
            description: "P95 latency is {{ $value }}s (threshold: 2s)"
        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_phase{phase!="Running",namespace="production"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} not ready"
            description: "Pod has been in {{ $labels.phase }} state for > 5 minutes"
        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            (
              container_memory_usage_bytes{namespace="production"}
              /
              container_spec_memory_limit_bytes{namespace="production"}
            ) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"
        # Node disk pressure
        - alert: NodeDiskPressure
          expr: |
            kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} under disk pressure"
            description: "Node is experiencing disk pressure"
        # Certificate expiring soon
        - alert: CertificateExpiringSoon
          expr: |
            (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate expiring soon"
            description: "Certificate for {{ $labels.instance }} expires in {{ $value }} days"

Grafana Dashboards

1. Application Overview Dashboard

{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{ endpoint }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency Percentiles",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

2. RED Method Dashboard

Track Rate, Errors, and Duration for each service.
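RED panels can be driven by queries like the following. The `service` label here is an assumption; in this setup it could come from the ServiceMonitor relabelings shown earlier:

```promql
# Rate: requests per second, per service
sum(rate(http_requests_total[5m])) by (service)

# Errors: fraction of requests returning 5xx, per service
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Duration: P95 latency, per service
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```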

3. USE Method Dashboard

Track Utilization, Saturation, and Errors for each resource (CPU, memory, disk, network).
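As a sketch, node-level USE panels can be built from standard node_exporter metrics (assuming node_exporter is scraped, as it is by default with kube-prometheus-stack):

```promql
# Utilization: fraction of CPU time spent non-idle, per node
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Utilization: fraction of memory in use, per node
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Saturation: 5-minute load average relative to core count
node_load5 / count(node_cpu_seconds_total{mode="idle"}) by (instance)

# Errors: network interface receive errors
rate(node_network_receive_errs_total[5m])
```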

Long-Term Storage

Thanos for Multi-Cluster Monitoring

# Thanos Sidecar
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  thanos:
    image: quay.io/thanos/thanos:v0.32.0
    version: v0.32.0
    objectStorageConfig:
      key: objstore.yml
      name: thanos-objstore-config
---
# Object storage configuration
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
stringData:
  objstore.yml: |
    type: S3
    config:
      bucket: thanos-metrics
      endpoint: s3.amazonaws.com
      region: us-east-1

Production Checklist

- Size retention and storage (retention, retentionSize, PVC size) for your ingestion rate
- Instrument every service with request rate, error, and duration metrics plus key business metrics
- Create a ServiceMonitor (with matching Service labels and port names) for each application
- Define actionable alert rules with severity labels and route them through Alertmanager
- Build RED dashboards per service and USE dashboards per resource
- Configure remote write or Thanos for long-term, multi-cluster storage
- Replace default credentials such as the Grafana admin password before going live

Conclusion

Production observability requires comprehensive metrics collection, visualization, and alerting. Prometheus and Grafana provide powerful, scalable solutions for monitoring modern cloud-native applications.

Focus on instrumenting applications with meaningful metrics, defining actionable alerts, and building dashboards that help teams understand system behavior. Observability is not a one-time setup—it’s an ongoing practice that evolves with your applications.


Need help building observability? Our DevOps training covers Prometheus, Grafana, distributed tracing, and SRE practices with hands-on labs. Explore observability training or contact us for monitoring architecture consulting.

