RUNBOOK

Last updated: 1/31/2026

Observability Operations & Monitoring Infrastructure Runbook

1. Overview

This runbook covers production operational procedures for deploying, managing, and maintaining observability stacks including metrics collection, alerting, logging, and distributed tracing systems.

Scope: Prometheus setup, Grafana dashboards, alerting rules, ELK/logging, distributed tracing, SLO management
Target Audience: DevOps engineers, SREs, platform engineers, on-call engineers
Prerequisite: CONCEPT.md (three pillars, four golden signals)


2. Prometheus Deployment

2.1 Installation (Kubernetes)

Using Prometheus Operator:

# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Operator and stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
  --set grafana.adminPassword=<SECURE_PASSWORD> \
  --set alertmanager.enabled=true
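
Verify the stack before moving on. A quick check along these lines works, assuming the release name prometheus and the monitoring namespace used above (the prometheus-operated Service is created automatically by the operator):

# Confirm all monitoring components are running
kubectl get pods -n monitoring

# Reach the Prometheus UI locally and confirm it reports ready
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/-/ready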

2.2 Prometheus Configuration

scrape_configs.yaml (job definitions):

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
      external_labels:
        cluster: prod-us-east1
        environment: production
    
    # Alertmanager for alert routing
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093
    
    # Alert rule files
    rule_files:
    - /etc/prometheus/rules/*.yml
    
    # Scrape jobs
    scrape_configs:
    
    # Kubernetes API server
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    
    # Kubernetes nodes
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    
    # Kubernetes pods (scrape all pods with metrics)
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
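
The kubernetes-pods job above only keeps pods annotated with prometheus.io/scrape: "true". A quick way to list which pods currently opt in (assumes jq is installed):

# List pods that carry the prometheus.io/scrape annotation
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["prometheus.io/scrape"] == "true")
      | "\(.metadata.namespace)/\(.metadata.name)"'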

2.3 ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-application
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
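
A ServiceMonitor selects Services (not Pods) by label, and the named port must exist on the Service. With kube-prometheus-stack, the ServiceMonitor must also carry a label matching the Prometheus serviceMonitorSelector (by default the Helm release label) unless serviceMonitorSelectorNilUsesHelmValues is set to false. A rough verification sketch, assuming the Service is labelled app: my-application:

# Confirm the ServiceMonitor exists and a matching Service exposes a port named "metrics"
kubectl get servicemonitor app-metrics -n monitoring
kubectl get svc --all-namespaces -l app=my-application \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.ports[*].name}{"\n"}{end}'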

3. Alerting Configuration

3.1 Alert Rules

Create PrometheusRule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
spec:
  groups:
  - name: application.rules
    interval: 30s
    rules:
    
    # Latency Alerts (Golden Signal #1)
    - alert: HighLatency
      expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1.0
      for: 5m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "High request latency detected"
        description: "P99 latency is {{ $value }}s on {{ $labels.instance }}"
    
    # Error Rate Alerts (Golden Signal #3)
    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.05
      for: 5m
      labels:
        severity: critical
        team: platform
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
    
    # Traffic Alerts (Golden Signal #2)
    - alert: TrafficSpike
      expr: rate(http_requests_total[5m]) > 10000
      for: 2m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "Unusual traffic spike detected"
        description: "Traffic is {{ $value }} requests/sec on {{ $labels.service }}"
    
    # Saturation Alerts (Golden Signal #4)
    - alert: HighCPUUsage
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 10m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "High CPU usage detected"
        description: "CPU usage is {{ $value | humanizePercentage }} on {{ $labels.pod }}"
    
    - alert: HighMemoryUsage
      expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
      for: 5m
      labels:
        severity: warning
        team: platform
      annotations:
        summary: "High memory usage detected"
        description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.pod }}"
    
    # Database Connection Pool
    - alert: DBConnectionPoolExhausted
      expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
      for: 5m
      labels:
        severity: critical
        team: database
      annotations:
        summary: "Database connection pool near exhaustion"
        description: "{{ $value | humanizePercentage }} of connections in use"

3.2 Alertmanager Configuration

alertmanager.yml (alert routing & notifications):

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    
    # Default routing
    route:
      receiver: platform-default
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 5m            # batch follow-up notifications for a group (10s floods receivers)
      repeat_interval: 12h
      
      # Routes for specific teams
      routes:
      
      # Critical alerts → immediate PagerDuty
      - match:
          severity: critical
        receiver: pagerduty-critical
        continue: true
        repeat_interval: 5m
      
      # Warning alerts → Slack
      - match:
          severity: warning
        receiver: slack-warnings
        repeat_interval: 1h
      
      # Database team alerts
      - match_re:
          team: database|mysql|postgres
        receiver: database-team
        continue: true
    
    receivers:
    
    # Slack notifications
    - name: slack-warnings
      slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true
    
    # PagerDuty for critical
    - name: pagerduty-critical
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.service }}'
        details:
          severity: '{{ .GroupLabels.severity }}'
          alerts: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    # Database team
    - name: database-team
      slack_configs:
      - channel: '#database-alerts'
        title: '[{{ .GroupLabels.severity | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: platform-default
      slack_configs:
      - channel: '#platform-alerts'
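
Routing trees are easy to get subtly wrong, so validate the file and fire a synthetic alert before trusting it. A sketch using amtool, which ships with Alertmanager (flag names assume a recent release):

# Lint the configuration
amtool check-config alertmanager.yml

# Send a synthetic warning alert and confirm it reaches #alerts-warnings
amtool alert add alertname=RoutingTest severity=warning team=platform \
  --annotation=description="synthetic alert to verify Slack routing" \
  --alertmanager.url=http://localhost:9093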

4. Grafana Dashboard Management

4.1 Dashboard Creation

Create Dashboard via API:

# Export an existing dashboard as JSON by uid (the /api/dashboards/db/:slug GET was removed in Grafana 8)
curl -s "http://grafana:3000/api/dashboards/uid/<DASHBOARD_UID>" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" | jq '.dashboard' > dashboard.json

# Create a new dashboard (the POST payload must wrap the dashboard object; id is cleared so Grafana assigns one)
jq '{dashboard: (. + {id: null}), overwrite: true}' dashboard.json \
  | curl -s -X POST "http://grafana:3000/api/dashboards/db" \
      -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
      -H "Content-Type: application/json" \
      -d @-

4.2 Key Dashboards

System Overview Dashboard:

{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "Request Latency (P99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
          }
        ]
      },
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)"
          }
        ]
      }
    ]
  }
}
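
With kube-prometheus-stack, dashboards can also be provisioned declaratively: the Grafana sidecar imports ConfigMaps labelled grafana_dashboard (the chart's default; verify against your values). A minimal sketch using the JSON above saved as dashboard.json:

# Ship the dashboard as a ConfigMap for the Grafana sidecar to pick up
kubectl create configmap system-overview-dashboard \
  --namespace monitoring \
  --from-file=system-overview.json=dashboard.json
kubectl label configmap system-overview-dashboard -n monitoring grafana_dashboard=1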

5. Logging Stack (ELK)

5.1 Elasticsearch Installation

# Install Elasticsearch on Kubernetes with Helm
helm repo add elastic https://helm.elastic.co
helm repo update

helm install elasticsearch elastic/elasticsearch \
  --namespace logging \
  --create-namespace \
  --set replicas=3 \
  --set resources.requests.memory="2Gi" \
  --set resources.requests.cpu="500m" \
  --set volumeClaimTemplate.storageClassName="fast-ssd"
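
Confirm the cluster reports green before wiring Logstash to it. The service and label names below are the elastic/elasticsearch chart defaults; 8.x chart versions enable TLS and authentication by default, in which case https and credentials are required:

# Check pod status and cluster health
kubectl get pods -n logging -l app=elasticsearch-master
kubectl port-forward -n logging svc/elasticsearch-master 9200:9200 &
curl -s http://localhost:9200/_cluster/health?pretty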

5.2 Logstash Configuration

logstash.conf (log processing):

input {
  # Receive logs from Filebeat
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/certs/logstash.crt"
    ssl_key => "/etc/logstash/certs/logstash.key"
  }
  
  # Kubernetes audit logs
  file {
    path => "/var/log/kubernetes/audit.log"
    tags => ["k8s-audit"]
    codec => json
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }
  
  # Extract fields
  mutate {
    add_field => { "[@metadata][index_name]" => "logs-%{+YYYY.MM.dd}" }
  }
  
  # Grok patterns for nginx
  if [program] == "nginx" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }
  
  # Kubernetes metadata enrichment
  if [kubernetes] {
    mutate {
      add_field => { "pod_name" => "%{[kubernetes][pod][name]}" }
      add_field => { "namespace" => "%{[kubernetes][namespace]}" }
    }
  }
}

output {
  # Send to Elasticsearch
  elasticsearch {
    hosts => ["elasticsearch.logging:9200"]
    index => "%{[@metadata][index_name]}"
  }
  
  # Archive to S3 for long-term storage
  s3 {
    region => "us-east-1"
    bucket => "log-archive"
    prefix => "logs/%{[kubernetes][namespace]}/%{+YYYY/MM/dd}/"
  }
}
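
The pipeline definition can be dry-run before deployment; a sketch using the official image (the tag is illustrative):

# Validate the pipeline syntax and exit without starting Logstash
docker run --rm \
  -v "$PWD/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro" \
  docker.elastic.co/logstash/logstash:8.12.0 \
  bin/logstash --config.test_and_exit -f /usr/share/logstash/pipeline/logstash.conf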

5.3 Kibana Queries

Common Query Patterns:

# Find 5xx errors in last 24h
response.code >= 500 AND @timestamp:[now-24h TO now]

# Pod crash logs
level: ERROR AND (OOMKilled OR CrashLoopBackOff)

# Slow database queries (> 5 seconds)
query.duration_ms > 5000 AND database: production

# Authentication failures
event.type: authentication AND result: failure

# Log count by service: KQL cannot aggregate in the query bar, so filter here and
# aggregate in a Lens visualization or via the Elasticsearch API (see the query below)
service:* AND @timestamp:[now-1h TO now]
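
For the per-service count, the aggregation can also be run directly against Elasticsearch (the field name service.keyword is an assumption about the index mapping):

# Log count by service over the last hour via the aggregations API
curl -s -X POST "http://elasticsearch.logging:9200/logs-*/_search" \
  -H 'Content-Type: application/json' -d '{
    "size": 0,
    "query": { "range": { "@timestamp": { "gte": "now-1h" } } },
    "aggs": { "by_service": { "terms": { "field": "service.keyword", "size": 20 } } }
  }'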

6. Distributed Tracing (Jaeger)

6.1 Jaeger Deployment

# Install Jaeger via the Helm chart, backed by the Elasticsearch cluster from Section 5.1
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

helm install jaeger jaegertracing/jaeger \
  --namespace tracing \
  --create-namespace \
  --set provisionDataStore.cassandra=false \
  --set storage.type=elasticsearch \
  --set storage.elasticsearch.host=elasticsearch-master.logging
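
A quick smoke test after the install (the jaeger-query name assumes the chart defaults for a release named jaeger):

# Confirm the tracing components are running and open the Jaeger UI locally
kubectl get pods -n tracing
kubectl port-forward -n tracing deploy/jaeger-query 16686:16686
# UI: http://localhost:16686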

6.2 Application Instrumentation

Python Example (OpenTelemetry):

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Jaeger exporter (Thrift agent protocol; newer OpenTelemetry releases
# deprecate this exporter in favour of OTLP)
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)

# Create tracer provider tagged with a service name so traces are attributable
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "payment"}))
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)

# Instrument application
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", 12345)
    span.set_attribute("customer.id", 67890)
    
    # Nested span
    with tracer.start_as_current_span("validate_order"):
        # Validation logic
        pass
    
    with tracer.start_as_current_span("process_payment"):
        # Payment logic
        pass
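
As an alternative to manual spans, OpenTelemetry's zero-code auto-instrumentation can be used. The sketch below assumes the Jaeger collector exposes its OTLP gRPC port (4317, enabled by default in recent Jaeger releases) as jaeger-collector.tracing:

# Install the distro and instrumentation packages, then run the app under the agent
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

OTEL_SERVICE_NAME=payment \
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-collector.tracing:4317 \
opentelemetry-instrument python app.py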

7. SLO & SLA Management

7.1 Define SLOs

# Example SLOs for payment service
SLOs:
  Availability:
    target: 99.9%
    window: 30 days
    query: |
      avg(avg_over_time(up{service="payment"}[30d])) * 100
  
  Latency:
    target: P99 < 500ms
    window: 7 days
    query: |
      histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payment"}[7d])) < 0.5
  
  Error Rate:
    target: < 0.1%
    window: 7 days
    query: |
      sum(rate(http_requests_total{service="payment", status=~"5.."}[7d])) / sum(rate(http_requests_total{service="payment"}[7d])) < 0.001

7.2 Error Budget Tracking

# Error budget consumed over the 30-day window (fraction of budget used, SLO = 99.9%)
(sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999)

# Alert when the budget is exhausted: the 30-day success ratio has fallen below the SLO target
(1 - sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) < 0.999

8. Troubleshooting

8.1 Prometheus Issues

Prometheus Not Scraping Targets:

# Check Prometheus UI
curl http://localhost:9090/api/v1/targets

# Look for error state targets
# Check scrape config and target availability

# Check Prometheus logs
kubectl logs -n monitoring prometheus-0 | grep -i error | tail -20

# Common issues:
# 1. Target not reachable (firewall, DNS)
# 2. Authentication failure (missing bearer token)
# 3. Scrape timeout too short

High Memory Usage:

# Prometheus metric cardinality explosion
curl http://localhost:9090/api/v1/labels

# Check for high-cardinality labels
curl 'http://localhost:9090/api/v1/label/{label_name}/values'

# Solutions:
# 1. Add relabel_configs to drop high-cardinality labels
# 2. Use metric_relabel_configs to filter metrics
# 3. Reduce retention period
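
The TSDB status endpoint summarises the worst offenders in one call, which is usually faster than walking labels by hand:

# Top series counts per metric name and per label name (head block statistics)
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data | {seriesCountByMetricName, labelValueCountByLabelName}'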

8.2 Alert Troubleshooting

Alerts Not Firing:

# Check alert evaluation
curl 'http://localhost:9090/api/v1/query?query=up'

# Check alert rule file syntax
promtool check rules /etc/prometheus/rules/*.yml

# Verify alert is firing (even if not notified)
curl 'http://localhost:9090/api/v1/alerts'

# Check Alertmanager
curl http://localhost:9093/api/v1/alerts

# Verify webhook delivery
# Check Alertmanager logs
kubectl logs -n monitoring alertmanager-0 | grep -i error
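
If the alert fires in Prometheus but the wrong channel (or nothing) is notified, the routing decision can be reproduced offline against the configuration file:

# Show which receiver a given label set would be routed to
amtool config routes test --config.file=alertmanager.yml severity=critical team=database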

9. Operational Checklists

Pre-Production

  • Prometheus configured and scraping all targets
  • Alert rules created for all SLOs
  • Alertmanager routing configured and tested
  • Grafana dashboards created for key metrics
  • Logging stack deployed (Elasticsearch, Logstash, Kibana)
  • Distributed tracing (Jaeger) deployed
  • On-call rotation configured
  • Alert escalation procedures documented
  • Load testing with observability validated

Daily Operations

  • Review alert firing rate (should be low)
  • Check Prometheus target status (all green)
  • Verify log ingestion rate
  • Monitor error budget status

Weekly Maintenance

  • Review alert tuning (false positives/negatives)
  • Update dashboards based on new metrics
  • Archive old logs
  • Test backup/restore of observability data

10. Performance Tuning

Prometheus Storage Optimization:

global:
  scrape_interval: 30s          # Increase from 15s if possible
  evaluation_interval: 30s      # Match scrape interval

# Reduce cardinality in scrape configs (metric names are only available at the
# metric_relabel_configs stage, which runs after the scrape)
metric_relabel_configs:
- source_labels: [__name__]
  regex: 'go_.*'                # Drop Go runtime metrics if not needed
  action: drop

Elasticsearch Performance:

# Increase refresh interval for bulk indexing
PUT logs-*/_settings
{
  "index": {
    "refresh_interval": "30s"   # From default 1s
  }
}

# Tune shard count based on volume
# Rule: 50GB per shard, 1-3 shards per node
PUT logs-new
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
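
Beyond shard sizing, retention is usually enforced with an ILM policy so old log indices roll over and are eventually deleted. A minimal sketch (policy name and thresholds are illustrative):

# Roll over hot indices at 50GB / 1 day and delete them after 30 days
curl -s -X PUT "http://elasticsearch.logging:9200/_ilm/policy/logs-retention" \
  -H 'Content-Type: application/json' -d '{
    "policy": {
      "phases": {
        "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "1d" } } },
        "delete": { "min_age": "30d", "actions": { "delete": {} } }
      }
    }
  }'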

Last Updated: January 2026
Maintained by: Platform Observability Team
Version: 1.0.0