Vladimir Chavkov

Observability and Monitoring Training

Build production-grade visibility into your systems with this 3-day observability and monitoring training. Learn how to collect metrics, logs, and traces, design actionable alerts, and use observability data to reduce incident impact and improve reliability.

Duration: 3 days (24 hours)
Level: Intermediate
Delivery: In-person, Live online, Hybrid
Certification: N/A

Who Should Attend

  • DevOps and SRE teams operating production systems
  • Platform engineers building shared observability services
  • Developers responsible for instrumenting services
  • Technical leads improving operational visibility and incident response

After completing this training, you’ll be able to:

  • Design an observability stack for cloud-native systems
  • Collect and query metrics, logs, and traces
  • Build dashboards that support engineering and operations workflows
  • Create alerts that are actionable and low noise
  • Instrument applications using modern observability standards
  • Troubleshoot incidents using correlated telemetry

Day 1: Metrics and Monitoring Fundamentals

Module 1: Observability Concepts

  • Monitoring vs observability
  • Golden signals and RED/USE methods
  • Telemetry design principles
  • Hands-on: Define a service observability model
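
As a rough illustration of the RED method (Rate, Errors, Duration) named above, the sketch below summarizes a batch of request records. The `Request` type, the `red_summary` helper, and the choice to count only 5xx statuses as errors are assumptions made for this example, not part of the course material.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: List[Request], window_seconds: float,
                percentile: int = 95) -> Dict[str, float]:
    """Summarize requests observed during one window as Rate, Errors, Duration."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile, using integer ceiling division to avoid float drift.
    rank = max(-(-percentile * len(durations) // 100), 1)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "duration_ms": durations[rank - 1],
    }
```

In practice these values usually come from a metrics backend rather than raw request lists; the point here is only which three numbers the method asks for.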

Module 2: Metrics with Prometheus

  • Exporters, scraping, and service discovery
  • PromQL basics and recording rules
  • Capacity and performance dashboards
  • Hands-on: Collect and query application metrics
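
To make the counter semantics behind PromQL's `rate()` more concrete, here is a simplified sketch of a per-second rate over raw counter samples. It handles counter resets the way Prometheus does (a drop is treated as a restart from zero) but deliberately omits the window-boundary extrapolation that the real function applies. The name `simple_rate` is hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the given samples.

    Simplified on purpose: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A decrease means the counter reset (for example, a process restart),
        # so the whole current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed
```

For example, samples of 100, 130, 10, and 40 taken 15 seconds apart contain one reset and yield an increase of 70 over 45 seconds.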

Module 3: Visualization with Grafana

  • Dashboard design and panel selection
  • Variables, drill-downs, and annotations
  • Sharing dashboards across teams
  • Hands-on: Build operational dashboards

Module 4: Logging Architecture

  • Structured logging patterns
  • Centralized collection with Loki or ELK
  • Log search, retention, and cost tradeoffs
  • Hands-on: Centralize and query logs
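
One common structured logging pattern is to emit each record as a single JSON object so that a backend such as Loki or Elasticsearch can index its fields. The sketch below uses only Python's standard `logging` module; the `JsonFormatter` class, the `build_logger` helper, and the convention of passing fields under a `context` key are assumptions for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}
        # and those keys are merged into the top-level JSON object.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

A call such as `log.info("payment failed", extra={"context": {"order_id": "A-17"}})` then produces a line whose `order_id` field can be queried directly instead of being parsed out of free text.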

Module 5: Distributed Tracing

  • Trace context and spans
  • OpenTelemetry instrumentation
  • Service dependency analysis
  • Hands-on: Trace a request through services
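
Trace context is typically propagated between services in the W3C `traceparent` HTTP header, which has the form `version-traceid-parentid-flags`. The following sketch builds and parses that header by hand to show what an OpenTelemetry SDK would normally do for you. The function names are hypothetical, and only version `00` is generated.

```python
import re
import secrets
from typing import Dict, Union

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Dict[str, Union[str, bool]]:
    """Validate a traceparent header and split it into its fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header: str) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    parent = parse_traceparent(header)
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Because every hop keeps the trace ID and replaces only the span ID, a tracing backend can reassemble the spans of one request into a single tree.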

Module 6: Alerting and Incident Response

  • Alert routing and severity models
  • SLOs, burn rates, and symptom-based alerting
  • Reducing noisy alerts
  • Hands-on: Create actionable alerts
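
The burn rate mentioned above is the observed error ratio divided by the error budget (1 minus the SLO target): a burn rate of 1 spends the budget exactly over the SLO period. The sketch below shows a two-window paging check. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from the widely used convention of paging when 2% of a 30-day budget is consumed within one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if not 0.0 < budget < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The long window (for example 1 hour) shows the burn is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

Requiring both windows is one way to reduce noise: an incident that has already recovered stops paging once the short window drops back below the threshold.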

Module 7: Platform Patterns

  • Multi-tenant observability
  • Data retention and governance
  • Scaling collectors and storage backends
  • Hands-on: Design a shared observability platform

Module 8: Troubleshooting Workflows

  • Correlating metrics, logs, and traces
  • Incident timelines and root cause analysis
  • Post-incident review patterns
  • Hands-on: Diagnose a production incident

Module 9: Adoption and Instrumentation Strategy

  • Instrumentation standards for teams
  • Developer experience and self-service observability
  • Maturity roadmap and ownership model
  • Hands-on: Draft an observability rollout plan

Prerequisites

  • Experience running or developing applications in production
  • Basic Linux and networking familiarity
  • Understanding of containers and distributed systems concepts
  • Helpful but not required: prior Prometheus or Grafana exposure

Format | Description
In-Person | On-site at your company’s location, hands-on with direct interaction
Live Online | Interactive virtual sessions with screen sharing and real-time labs
Hybrid | Combination of on-site and remote sessions, flexible scheduling

All formats include hands-on labs, dashboards, alerting examples, and post-training support.