Production Observability with Prometheus and Grafana: Complete Guide
Observability is the ability to understand a system’s internal state from its external outputs. Modern applications require comprehensive monitoring, metrics, logging, and tracing to maintain reliability and performance. This guide covers building production-grade observability with Prometheus and Grafana.
The Three Pillars of Observability
1. Metrics (Prometheus)
Time-series data showing system behavior over time.
2. Logs (Loki/ELK)
Detailed event records for debugging and auditing.
3. Traces (Jaeger/Tempo)
Request flow across distributed services.
Prometheus Fundamentals
Architecture Overview
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Target 1   │    │   Target 2   │    │   Target 3   │
│  (Exporter)  │    │ (Application)│    │  (K8s API)   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │  (Scrape Metrics)
                           ▼
                  ┌────────────────┐
                  │   Prometheus   │
                  │     Server     │
                  └────────┬───────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
   ┌───────────┐   ┌──────────────┐   ┌───────────┐
   │  Grafana  │   │ AlertManager │   │    API    │
   │(Visualize)│   │   (Alerts)   │   │ (Queries) │
   └───────────┘   └──────────────┘   └───────────┘
```
Installation on Kubernetes
```bash
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml
```
prometheus-values.yaml:
```yaml
prometheus:
  prometheusSpec:
    # Retention period
    retention: 30d
    retentionSize: "50GB"

    # Storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
          storageClassName: gp3

    # Resource limits
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi

    # Service monitors
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

    # External labels
    externalLabels:
      cluster: production
      environment: prod

    # Remote write (for long-term storage)
    remoteWrite:
      - url: https://prometheus-long-term-storage.company.com/api/v1/write
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: 'go_.*|process_.*'
            action: drop

grafana:
  adminPassword: "changeme"

  # Persistence
  persistence:
    enabled: true
    size: 10Gi

  # Datasources
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-kube-prometheus-prometheus:9090
          isDefault: true
          jsonData:
            timeInterval: 30s

alertmanager:
  config:
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-critical'
      routes:
        - match:
            severity: critical
          receiver: slack-critical
        - match:
            severity: warning
          receiver: slack-warning

    receivers:
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: 'Critical Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: 'slack-warning'
        slack_configs:
          - channel: '#alerts-warning'
            title: 'Warning: {{ .GroupLabels.alertname }}'
```
Application Instrumentation
1. Python with Prometheus Client
```python
from prometheus_client import Counter, Histogram, Gauge, make_wsgi_app
from flask import Flask, request
from werkzeug.middleware.dispatcher import DispatcherMiddleware
import time

app = Flask(__name__)

# Add Prometheus WSGI middleware to route /metrics requests
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {
    '/metrics': make_wsgi_app()
})

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Number of active requests'
)

DATABASE_CONNECTIONS = Gauge(
    'database_connections_active',
    'Number of active database connections'
)

# Business metrics
ORDER_TOTAL = Counter(
    'orders_total',
    'Total number of orders',
    ['status', 'product_category']
)

ORDER_VALUE = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 25, 50, 100, 250, 500, 1000]
)

@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time

    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()

    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.path
    ).observe(request_duration)

    ACTIVE_REQUESTS.dec()
    return response

@app.route('/api/orders', methods=['POST'])
def create_order():
    order_data = request.json

    # Process order
    order_value = order_data['amount']
    category = order_data['category']

    # Record business metrics
    ORDER_TOTAL.labels(status='created', product_category=category).inc()
    ORDER_VALUE.observe(order_value)

    return {'order_id': '12345', 'status': 'created'}, 201

if __name__ == '__main__':
    app.run(port=8080)
```
2. Go with Prometheus Client
```go
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)

	activeRequests = promauto.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_requests_active",
			Help: "Number of active requests",
		},
	)

	// Business metrics
	orderTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "orders_total",
			Help: "Total number of orders",
		},
		[]string{"status", "product_category"},
	)

	orderValue = promauto.NewHistogram(
		prometheus.HistogramOpts{
			Name:    "order_value_dollars",
			Help:    "Order value in dollars",
			Buckets: []float64{10, 25, 50, 100, 250, 500, 1000},
		},
	)
)

func prometheusMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		activeRequests.Inc()
		defer activeRequests.Dec()

		// Wrap the response writer to capture the status code
		wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}

		next.ServeHTTP(wrapped, r)

		duration := time.Since(start).Seconds()

		httpRequestsTotal.WithLabelValues(
			r.Method,
			r.URL.Path,
			// Use the numeric code ("200"), not http.StatusText ("OK"),
			// so the status=~"5.." PromQL queries below match
			strconv.Itoa(wrapped.statusCode),
		).Inc()

		httpRequestDuration.WithLabelValues(
			r.Method,
			r.URL.Path,
		).Observe(duration)
	})
}

type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func main() {
	mux := http.NewServeMux()

	// Application endpoints
	mux.HandleFunc("/api/orders", createOrder)

	// Metrics endpoint
	mux.Handle("/metrics", promhttp.Handler())

	// Wrap with the Prometheus middleware
	log.Fatal(http.ListenAndServe(":8080", prometheusMiddleware(mux)))
}

func createOrder(w http.ResponseWriter, r *http.Request) {
	// Business logic
	orderTotal.WithLabelValues("created", "electronics").Inc()
	orderValue.Observe(99.99)

	w.WriteHeader(http.StatusCreated)
	w.Write([]byte(`{"order_id": "12345"}`))
}
```
3. Node.js with prom-client
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a Registry
const register = new promClient.Registry();

// Add default metrics
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
  registers: [register]
});

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'endpoint'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const activeRequests = new promClient.Gauge({
  name: 'http_requests_active',
  help: 'Number of active requests',
  registers: [register]
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();
  activeRequests.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestsTotal.labels(req.method, req.path, res.statusCode).inc();
    httpRequestDuration.labels(req.method, req.path).observe(duration);
    activeRequests.dec();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.post('/api/orders', (req, res) => {
  // Business logic
  res.status(201).json({ order_id: '12345' });
});

app.listen(8080, () => {
  console.log('Server listening on port 8080');
});
```
ServiceMonitor Configuration
Auto-discover application metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      # Metric relabeling
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop
      # Add custom labels
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
```
PromQL Queries
1. Request Rate (QPS)
```promql
# Requests per second
rate(http_requests_total[5m])

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Total QPS
sum(rate(http_requests_total[5m]))
```
2. Error Rate
```promql
# Error percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

# Errors per second by endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
```
3. Latency Percentiles
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

# P50, P95, P99
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
4. Resource Utilization
```promql
# CPU usage
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

# Memory usage
container_memory_usage_bytes{namespace="production"}

# Disk I/O
rate(container_fs_reads_bytes_total[5m])
rate(container_fs_writes_bytes_total[5m])

# Network I/O
rate(container_network_receive_bytes_total[5m])
rate(container_network_transmit_bytes_total[5m])
```
5. Kubernetes Metrics
```promql
# Pod restarts
increase(kube_pod_container_status_restarts_total[1h])

# Pods not ready
kube_pod_status_phase{phase!="Running"}

# Node CPU reserved by the system (capacity minus allocatable)
# Note: kube-state-metrics v2 replaced the old *_cpu_cores metrics
# with a resource label
kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}

# Persistent Volume usage (percent)
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
```
Alert Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
spec:
  groups:
    - name: application
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            (
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High latency on {{ $labels.endpoint }}"
            description: "P95 latency is {{ $value }}s (threshold: 2s)"

        # Pod not ready
        - alert: PodNotReady
          expr: |
            kube_pod_status_phase{phase!="Running",namespace="production"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} not ready"
            description: "Pod has been in {{ $labels.phase }} state for > 5 minutes"

        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            (
              container_memory_usage_bytes{namespace="production"}
              /
              container_spec_memory_limit_bytes{namespace="production"}
            ) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in {{ $labels.pod }}"
            description: "Memory usage is {{ $value | humanizePercentage }} of limit"

        # Node disk pressure
        - alert: NodeDiskPressure
          expr: |
            kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} under disk pressure"
            description: "Node is experiencing disk pressure"

        # Certificate expiring soon
        - alert: CertificateExpiringSoon
          expr: |
            (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate expiring soon"
            description: "Certificate for {{ $labels.instance }} expires in {{ $value }} days"
```
Grafana Dashboards
1. Application Overview Dashboard
```json
{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{ endpoint }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error %"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Latency Percentiles",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "P99"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```
2. RED Method Dashboard
Rate, Errors, Duration for each service:
- Rate: Request volume
- Errors: Failed requests
- Duration: Latency distribution
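Assuming the metric names used in the instrumentation sections above (`http_requests_total`, `http_request_duration_seconds`) and a `job` label set by Prometheus scraping, the three RED panels for a service can be generated with a small helper. This is a sketch, not a fixed API; adapt the label names to your own instrumentation:

```python
def red_queries(job: str, window: str = "5m") -> dict:
    """Build the three RED-method PromQL queries for one service (job label)."""
    total = f'rate(http_requests_total{{job="{job}"}}[{window}])'
    errors = f'rate(http_requests_total{{job="{job}",status=~"5.."}}[{window}])'
    return {
        # Rate: requests per second, summed across instances
        "rate": f"sum({total})",
        # Errors: fraction of requests that returned a 5xx status
        "errors": f"sum({errors}) / sum({total})",
        # Duration: P95 latency from the histogram buckets
        "duration": (
            "histogram_quantile(0.95, sum(rate("
            f'http_request_duration_seconds_bucket{{job="{job}"}}[{window}]'
            ")) by (le))"
        ),
    }

if __name__ == "__main__":
    for name, query in red_queries("myapp").items():
        print(f"{name}: {query}")
```

The resulting strings can be pasted into Grafana panels or sent to Prometheus's `GET /api/v1/query` endpoint via the `query` parameter.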
3. USE Method Dashboard
For resources (CPU, Memory, Disk):
- Utilization: Percentage used
- Saturation: Queue depth
- Errors: Error count
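As one illustration of translating USE into PromQL, the helper below builds the three CPU queries from standard node_exporter metric names (`node_cpu_seconds_total`, `node_load1`; the error counter assumes the edac collector is enabled). A sketch under those assumptions; other resources (memory, disk) follow the same pattern:

```python
def use_queries_cpu(window: str = "5m") -> dict:
    """USE-method queries for CPU, using standard node_exporter metric names."""
    idle = f'rate(node_cpu_seconds_total{{mode="idle"}}[{window}])'
    return {
        # Utilization: fraction of CPU time not spent idle, per node
        "utilization": f"1 - avg by (instance) ({idle})",
        # Saturation: 1-minute load average relative to core count
        "saturation": (
            'node_load1 / count by (instance) '
            '(node_cpu_seconds_total{mode="idle"})'
        ),
        # Errors: hardware error counters, where the edac collector exposes them
        "errors": f"rate(node_edac_correctable_errors_total[{window}])",
    }

if __name__ == "__main__":
    for dimension, query in use_queries_cpu().items():
        print(f"{dimension}: {query}")
```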
Long-Term Storage
Thanos for Multi-Cluster Monitoring
```yaml
# Thanos sidecar
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  thanos:
    image: quay.io/thanos/thanos:v0.32.0
    version: v0.32.0
    objectStorageConfig:
      key: objstore.yml
      name: thanos-objstore-config
---
# Object storage configuration
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
stringData:
  objstore.yml: |
    type: S3
    config:
      bucket: thanos-metrics
      endpoint: s3.amazonaws.com
      region: us-east-1
```
Production Checklist
- Prometheus deployed with persistent storage
- Grafana configured with authentication
- Applications instrumented with metrics
- ServiceMonitors configured for auto-discovery
- Alert rules defined and tested
- AlertManager routing configured
- Dashboards created for key services
- Long-term storage configured (Thanos/Cortex)
- Backup and disaster recovery plan
- Team trained on PromQL and Grafana
- SLOs/SLIs defined
- On-call runbooks documented
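When verifying the instrumentation items above, it helps to sanity-check what a `/metrics` endpoint actually exposes. The stdlib-only sketch below parses just enough of the Prometheus text exposition format to list metric names; the sample payload is illustrative, echoing the metrics defined earlier in this guide:

```python
import re

# Illustrative /metrics payload, as produced by the clients shown above
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/orders",status="200"} 42
http_requests_total{method="POST",endpoint="/api/orders",status="201"} 7
# TYPE http_requests_active gauge
http_requests_active 3
"""

def metric_names(payload: str) -> set:
    """Extract metric names from a Prometheus text-format payload."""
    names = set()
    for line in payload.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        m = re.match(r"([a-zA-Z_:][a-zA-Z0-9_:]*)", line)
        if m:
            names.add(m.group(1))
    return names

if __name__ == "__main__":
    print(sorted(metric_names(SAMPLE)))
    # → ['http_requests_active', 'http_requests_total']
    # Against a live service, fetch the payload first, e.g.:
    #   urllib.request.urlopen("http://localhost:8080/metrics").read().decode()
```

For anything beyond a quick check, the `prometheus_client` package ships a full parser (`prometheus_client.parser`), but the point here is that the exposition format is plain text you can inspect by hand.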
Conclusion
Production observability requires comprehensive metrics collection, visualization, and alerting. Prometheus and Grafana provide powerful, scalable solutions for monitoring modern cloud-native applications.
Focus on instrumenting applications with meaningful metrics, defining actionable alerts, and building dashboards that help teams understand system behavior. Observability is not a one-time setup—it’s an ongoing practice that evolves with your applications.
Need help building observability? Our DevOps training covers Prometheus, Grafana, distributed tracing, and SRE practices with hands-on labs. Explore observability training or contact us for monitoring architecture consulting.