WORKSHOP
Observability Workshop: Hands-On Monitoring Lab
Overview
This workshop provides a practical, hands-on introduction to observability by setting up a local monitoring stack and generating real alerts.
- Duration: ~140 minutes (6 parts, 33 tasks)
- Prerequisites: Docker with Docker Compose, curl; Kubernetes knowledge is only needed for the Next Steps section
- Outcome: A working monitoring stack with metrics, dashboards, and alerts
Part 1: Local Monitoring Stack Setup (15 min)
Objective
Set up Prometheus and Grafana locally using Docker Compose.
Task 1.1: Create docker-compose.yml
Create a file docker-compose.yml:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_SECURITY_ADMIN_USER: admin
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  sample_app:
    image: kennethreitz/httpbin:latest
    container_name: sample_app
    ports:
      - "5000:80"
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
```
Task 1.2: Create Prometheus Configuration
Create prometheus.yml:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Inside the compose network the app listens on port 80 (5000 is only the host mapping).
  # httpbin does not expose a Prometheus /metrics endpoint, so expect this target to be
  # DOWN; it is replaced by an instrumented app in Part 2.
  - job_name: 'sample_app'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['sample_app:80']
```
Task 1.3: Start the Stack
```bash
# Navigate to the directory with docker-compose.yml
docker-compose up -d

# Verify the containers are running
docker ps | grep -E "prometheus|grafana|sample_app"

# Expected output (IDs will differ):
# CONTAINER ID   IMAGE                         PORTS
# abc123def      prom/prometheus:latest        0.0.0.0:9090->9090/tcp
# def456abc      grafana/grafana:latest        0.0.0.0:3000->3000/tcp
# ghi789jkl      kennethreitz/httpbin:latest   0.0.0.0:5000->80/tcp
```
Task 1.4: Verify Prometheus Health
```bash
# Confirm the Prometheus UI responds
curl http://localhost:9090

# Check scrape targets
curl http://localhost:9090/api/v1/targets

# Expected: the "prometheus" target is up; the "sample_app" target will stay
# down until it is replaced by the instrumented app in Part 2
```
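If jq is installed locally, the target list is easier to scan when reduced to job name and health; a quick sketch:

```bash
# Summarize target health (assumes jq is available on the host)
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```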
Task 1.5: Access Grafana
```bash
# Open a browser at http://localhost:3000
# Username: admin
# Password: admin
# Then change the password (recommended): Profile → Change Password
```
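Grafana also exposes a simple health endpoint, which is convenient if you want to script this check rather than open a browser:

```bash
# Returns the Grafana version and database status
curl -s http://localhost:3000/api/health
```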
Part 2: Metrics Generation & Collection (30 min)
Objective
Create a Python application that exposes Prometheus metrics.
Task 2.1: Create Flask Metrics App
Create app.py:
```python
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time
import random

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=(0.1, 0.5, 1.0, 2.5, 5.0, 10.0)
)

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

# Track connections and request timing
@app.before_request
def before_request():
    ACTIVE_CONNECTIONS.inc()
    request.start_time = time.time()

@app.after_request
def after_request(response):
    ACTIVE_CONNECTIONS.dec()
    # Record metrics. Note: request.path includes raw path parameters
    # (e.g. /api/users/42), which matters for the cardinality discussion in Part 5.
    latency = time.time() - request.start_time
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.path
    ).observe(latency)
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    return response

# Application endpoints
@app.route('/login', methods=['POST'])
def login():
    # Simulate random latency (50-200 ms)
    time.sleep(random.uniform(0.05, 0.2))
    return {'status': 'ok'}, 200

@app.route('/api/users/<int:user_id>', methods=['GET'])
def get_user(user_id):
    # Occasionally fail (5% error rate)
    if random.random() < 0.05:
        return {'error': 'user not found'}, 404
    time.sleep(random.uniform(0.02, 0.1))
    return {'id': user_id, 'name': 'User'}, 200

@app.route('/api/orders', methods=['POST'])
def create_order():
    # Slow endpoint (200-500 ms)
    time.sleep(random.uniform(0.2, 0.5))
    return {'order_id': random.randint(1000, 9999)}, 201

@app.route('/health', methods=['GET'])
def health():
    return {'status': 'healthy'}, 200

# Expose metrics
@app.route('/metrics', methods=['GET'])
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080, debug=False)
```
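Before containerizing, a quick local smoke test can catch typos; a minimal sketch, assuming a host Python 3 environment where you can pip install:

```bash
# Run the app directly and hit a couple of endpoints
python3 -m pip install flask prometheus_client
python3 app.py &
APP_PID=$!
sleep 2
curl -s http://localhost:8080/health
curl -s http://localhost:8080/metrics | head -5
kill $APP_PID
```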
Task 2.2: Create Dockerfile for App
Create Dockerfile:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN pip install flask prometheus_client
COPY app.py .
EXPOSE 8080
CMD ["python", "app.py"]
```
Task 2.3: Update docker-compose.yml
Update docker-compose.yml to include the app:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_SECURITY_ADMIN_USER: admin
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  app:
    build: .
    container_name: metrics_app
    ports:
      - "8080:8080"
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
```
Task 2.4: Update prometheus.yml
Add the app to scrape config:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'metrics_app'
    static_configs:
      - targets: ['app:8080']
```
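It is worth validating the edited config before restarting anything. promtool ships inside the prom/prometheus image, so something along these lines should work without a local install:

```bash
# Lint prometheus.yml using the promtool bundled with the Prometheus image
docker run --rm -v "$(pwd)/prometheus.yml:/prometheus.yml" \
  --entrypoint promtool prom/prometheus:latest check config /prometheus.yml
```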
Task 2.5: Build and Run App
```bash
# Build the Docker image
docker-compose build app

# Start all services
docker-compose up -d

# Wait ~30 seconds so Prometheus has scraped at least once
sleep 30

# Verify the app exposes metrics
curl http://localhost:8080/metrics | head -20

# The output starts with prometheus_client's default process/GC metrics;
# http_requests_total series appear once traffic has been generated (Task 2.6), e.g.:
# # HELP http_requests_total Total HTTP requests
# # TYPE http_requests_total counter
# http_requests_total{endpoint="/login",method="POST",status="200"} 5.0
# ...
```
Task 2.6: Generate Traffic
Generate traffic to create metrics:
```bash
# Generate login requests (fast)
for i in {1..100}; do
  curl -X POST http://localhost:8080/login &
done
wait

# Generate user API requests
for i in {1..50}; do
  curl http://localhost:8080/api/users/$((RANDOM % 100)) &
done
wait

# Generate order API requests (slower)
for i in {1..20}; do
  curl -X POST http://localhost:8080/api/orders &
done
wait

echo "Traffic generation complete"
```
Part 3: Dashboards & Visualization (30 min)
Objective
Create Grafana dashboards to visualize metrics.
Task 3.1: Add Prometheus Data Source
- Open Grafana: http://localhost:3000
- Navigate to: Configuration (gear icon) → Data Sources
- Click "Add data source"
- Select "Prometheus"
- Set URL: `http://prometheus:9090`
- Click "Save & Test"

Expected: "Data source is working"
Task 3.2: Create Requests Dashboard
- Click "+" (New Dashboard)
- Click "Add panel"
- Select Prometheus query type
- Enter query: `sum(rate(http_requests_total[5m])) by (endpoint)`
- Set title: "Requests per Endpoint"
- Click "Save" → Name: "Application Metrics"
Task 3.3: Add Latency Panel
- Click "Add panel"
- Query: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))`
- Title: "P99 Latency by Endpoint"
- Save
Task 3.4: Add Error Rate Panel
- Click "Add panel"
- Query: `sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint) / sum(rate(http_requests_total[5m])) by (endpoint)`
- Title: "Error Rate by Endpoint"
- Format as a percentage (note: this lab app only produces 4xx errors, so you may prefer `status=~"(4|5).."` here if you want the panel to show non-zero values)
- Save
Task 3.5: Add Connection Gauge
- Click "Add panel"
- Query: `active_connections`
- Title: "Active Connections"
- Visualization: "Stat" or "Gauge"
- Save
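Dashboards are worth keeping in version control. One way to export the finished dashboard as JSON, sketched with the default admin credentials and a placeholder `<uid>` you would copy from the search output:

```bash
# Find the dashboard UID, then export its JSON
curl -s "http://admin:admin@localhost:3000/api/search?query=Application%20Metrics" | jq
curl -s "http://admin:admin@localhost:3000/api/dashboards/uid/<uid>" | jq '.dashboard' > dashboard.json
```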
Part 4: Alerting (20 min)
Objective
Create alert rules and test triggering alerts.
Task 4.1: Create Alert Rules File
Create alert_rules.yml:
```yaml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        # 4xx is included here so the sample app's 404s can trigger the alert;
        # production rules usually match only 5xx.
        expr: |
          sum(rate(http_requests_total{status=~"(4|5).."}[5m])) by (endpoint)
            /
          sum(rate(http_requests_total[5m])) by (endpoint)
          > 0.05
        for: 2m
        annotations:
          summary: "High error rate on {{ $labels.endpoint }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
        labels:
          severity: warning

      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)) > 1
        for: 2m
        annotations:
          summary: "High latency on {{ $labels.endpoint }}"
          description: "P99 latency is {{ $value }}s"
        labels:
          severity: warning

      - alert: HighConnectionCount
        expr: active_connections > 50
        for: 1m
        annotations:
          summary: "High connection count"
          description: "Active connections: {{ $value }}"
        labels:
          severity: info
```
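As with the main config, the rule file can be linted before Prometheus ever loads it, again using the promtool bundled in the image:

```bash
# Lint the alert rules file
docker run --rm -v "$(pwd)/alert_rules.yml:/alert_rules.yml" \
  --entrypoint promtool prom/prometheus:latest check rules /alert_rules.yml
```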
Task 4.2: Update prometheus.yml with Alert Rules
Update prometheus.yml to reference the rule file. The container also needs to see the file, so add `- ./alert_rules.yml:/etc/prometheus/alert_rules.yml` under the prometheus service's volumes in docker-compose.yml:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'metrics_app'
    static_configs:
      - targets: ['app:8080']
```
Task 4.3: Restart Prometheus
```bash
# Recreate the container so the new alert_rules.yml mount is picked up
docker-compose up -d prometheus

# Wait ~30 seconds for Prometheus to load the rules
sleep 30

# Verify the rules loaded
curl -s http://localhost:9090/api/v1/rules | grep -o '"name":"[^"]*"' | head -10

# Expected: alert names such as "HighErrorRate" and "HighLatency"
```
Task 4.4: Trigger High Error Rate Alert
The /api/users endpoint already produces 404s: a 5% random failure, plus a guaranteed 404 for any non-numeric user ID (which is why the rule above also matches 4xx). Flood it to push one endpoint's error rate above the 5% threshold:

```bash
# Generate ~100 guaranteed 404s (a non-numeric id does not match the route)
for i in {1..100}; do
  curl http://localhost:8080/api/users/invalid &
done
wait

# Wait for the alert to fire (the rule uses "for: 2m")
echo "Waiting for alert to fire..."
sleep 150

# Check alert status
curl -s http://localhost:9090/api/v1/alerts | grep -i "HighErrorRate"
```
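Rather than sleeping blindly, you can poll the alerts endpoint and watch the state move from "pending" to "firing" (sketch; `watch` may not be installed by default on macOS):

```bash
# Poll alert states every 15 seconds
watch -n 15 'curl -s http://localhost:9090/api/v1/alerts | jq ".data.alerts[] | {alert: .labels.alertname, state: .state}"'
```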
Task 4.5: View Fired Alerts in Prometheus
- Open http://localhost:9090/alerts
- Look for alerts with status "FIRING" (red)
- Click on alert to see details
- Note: Severity label and annotation message
Expected: the alert shows "High error rate on /api/users/invalid" (the raw request path) with a value close to 100%; individual /api/users/<id> paths may also hover around the 5% threshold
Task 4.6: View Alerts in Grafana
- Open Grafana: http://localhost:3000
- Navigate to: Alerting (bell icon) → Alert Rules
- Should see list of configured alerts
- Click on fired alert to see details
Part 5: Metrics Cardinality Analysis (20 min)
Objective
Understand and optimize metric cardinality.
Task 5.1: Query Cardinality
Check current metric cardinality:
```bash
# Number of distinct metric names
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count(count by (__name__) ({__name__=~".+"}))' \
  | jq -r '.data.result[0].value[1]'

# Expected: a few hundred on this small stack (mostly Prometheus self-metrics)

# Top 10 metric names by number of series
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))' \
  | jq '.data.result[] | {metric: .metric.__name__, series: .value[1]}'
```
Task 5.2: Identify High Cardinality Metrics
```bash
# Count the series behind http_requests_total
curl -s 'http://localhost:9090/api/v1/query?query=count(http_requests_total)' \
  | jq -r '.data.result[0].value[1]'

# Expected: one series per (method, endpoint, status) combination

# Show all combinations
curl -s 'http://localhost:9090/api/v1/query?query=http_requests_total' \
  | jq '.data.result[] | {metric: .metric, value: .value[1]}'
```
Task 5.3: Calculate Histogram Cardinality Impact
Metric: http_request_duration_seconds (a histogram)

How the series count grows:
- Base cardinality = number of unique (method, endpoint) label combinations
- Each combination produces one `_bucket` series per configured bucket, plus the implicit `+Inf` bucket, plus `_count` and `_sum`
- With the app's 6 buckets: 6 + 1 + 2 = 9 series per combination

Example:
- method: [GET, POST] (2 values)
- endpoint: [/login, /api/users, /api/orders] (3 values)
- Base cardinality = 2 × 3 = 6
- Histogram series = 6 × 9 = 54

Note that in this app the endpoint label is the raw request.path, so every distinct /api/users/<id> path adds another combination (see Task 5.4).
Task 5.4: Generate High Cardinality Scenario
Simulate a problematic metric (do NOT use user_id as a label in production):

```
# HIGH cardinality (bad): one series per user per endpoint
user_request_total{user_id="...", endpoint="..."}
# 1M users × 10 endpoints = 10M unique series (expensive!)

# Fix: drop user_id or aggregate before exposing
request_total{endpoint="..."}   # cardinality = 10
```

The lab app already has a mild version of this problem: because it labels by request.path, /api/users/42 and /api/users/43 become separate series. Labeling with the matched route template instead (for example Flask's request.url_rule) keeps the label's value set bounded.
Task 5.5: Cardinality Budget Recommendations
An example budget:

- Environment: production
- Service: order-api
- Budget: 1,000 unique metrics per service

Breakdown:
- Core metrics (5 services): 200
- Custom business metrics: 300
- Infrastructure (CPU, memory): 200
- Dependencies (DB, cache): 150
- Reserve: 150
- Total: 1,000 (at budget)

Alert if cardinality exceeds 1,100 (10% overage).
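A rough way to check the lab stack against such a budget from the command line; the 1,100 threshold mirrors the 10% overage rule above, and the budget itself is the hypothetical one from this task:

```bash
# Compare the current series count against a 1,000-series budget
TOTAL=$(curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count({__name__=~".+"})' | jq -r '.data.result[0].value[1]')
echo "current series: $TOTAL"
if [ "$TOTAL" -gt 1100 ]; then echo "over budget (>10% overage)"; else echo "within budget"; fi
```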
Part 6: Incident Response Simulation (25 min)
Objective
Practice responding to an alert using observability data.
Task 6.1: Create Performance Degradation
Modify app.py to add latency:
```python
# Add after the imports
SLOW_MODE = False

# Replace the existing create_order() handler:
@app.route('/api/orders', methods=['POST'])
def create_order():
    if SLOW_MODE:
        time.sleep(random.uniform(2.0, 5.0))  # much slower
    else:
        time.sleep(random.uniform(0.2, 0.5))
    return {'order_id': random.randint(1000, 9999)}, 201

# Add endpoints to toggle slow mode (the disable endpoint is used in Task 6.5)
@app.route('/debug/enable_slow_mode', methods=['POST'])
def enable_slow_mode():
    global SLOW_MODE
    SLOW_MODE = True
    return {'message': 'Slow mode enabled'}

@app.route('/debug/disable_slow_mode', methods=['POST'])
def disable_slow_mode():
    global SLOW_MODE
    SLOW_MODE = False
    return {'message': 'Slow mode disabled'}
```
Rebuild and restart:
```bash
docker-compose build app
docker-compose up -d app
```
Task 6.2: Trigger Incident
```bash
# Enable slow mode
curl -X POST http://localhost:8080/debug/enable_slow_mode

# Generate order requests to push P99 latency over the 1s threshold
for i in {1..100}; do
  curl -X POST http://localhost:8080/api/orders &
done
wait

# Wait for the alert to fire ("for: 2m" plus the evaluation interval)
echo "Alert should fire in ~2-3 minutes..."
sleep 180

# Check alert status
curl -s http://localhost:9090/api/v1/alerts | grep -i "HighLatency"
```
Task 6.3: Investigation Using Metrics
```bash
# Step 1: request rate for /api/orders
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total[5m])) by (endpoint)' \
  | jq '.data.result[] | select(.metric.endpoint == "/api/orders")'

# Step 2: P99 latency for /api/orders
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))' \
  | jq '.data.result[] | select(.metric.endpoint == "/api/orders")'

# Step 3: error count for /api/orders (expect no results: the endpoint is slow, not failing)
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=http_requests_total{endpoint="/api/orders", status=~"(4|5).."}' \
  | jq '.data.result[] | .value'
```
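To see exactly when the latency jumped, a range query over the last 15 minutes is useful (sketch; `date -d` is GNU date, so use `gdate` on macOS):

```bash
# Pull 15 minutes of P99 latency at 30s resolution and show the last few points for /api/orders
START=$(date -u -d '15 minutes ago' +%s); END=$(date -u +%s)
curl -s -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))' \
  --data-urlencode "start=$START" --data-urlencode "end=$END" --data-urlencode 'step=30' \
  | jq '.data.result[] | select(.metric.endpoint == "/api/orders") | .values[-5:]'
```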
Task 6.4: View in Dashboards
- Open the Grafana dashboard
- Observe:
  - Requests per Endpoint: /api/orders traffic increased
  - P99 Latency: /api/orders jumped to 2-5 seconds
  - Error Rate: likely no errors (requests still complete, just slowly)
- Together these metrics tell you: "the service is slow, not broken"
Task 6.5: Resolution
```bash
# Disable slow mode (simulate the fix)
curl -X POST http://localhost:8080/debug/disable_slow_mode

# Generate normal traffic
for i in {1..50}; do
  curl -X POST http://localhost:8080/api/orders &
done
wait

# Wait for latency to return to normal; the 5m rate window means the alert
# can take a few extra minutes to clear
echo "Waiting for recovery..."
sleep 120

# Check alert status (should eventually show no HighLatency entry)
curl -s http://localhost:9090/api/v1/alerts | grep -i "HighLatency"
```
Task 6.6: Post-Incident Review
Document findings using a template like the one below (the timestamps and root cause here are illustrative, not from this lab run):

```markdown
## Incident: High Latency on Order API

**Duration**: 10:15 - 10:45 UTC (30 minutes)
**Severity**: WARNING → CRITICAL

**Detection**:
- Alert: HighLatency fired at 10:15
- P99 latency: 50ms → 2500ms (50x increase)

**Root Cause**:
- Slow database query triggered
- Connection pool exhaustion suspected

**Resolution**:
- Restarted order service
- Latency returned to normal

**Prevention**:
- Add index to frequently queried column
- Implement connection pooling limits
- Test with load to catch issues earlier
```
Cleanup & Validation
Task: Cleanup
```bash
# Stop containers
docker-compose down

# Remove volumes as well (optional)
docker-compose down -v

# Verify cleanup
docker ps | grep -i prometheus
# (should print nothing)
```
Validation Checklist
- Prometheus collected metrics from app
- Grafana dashboard created with 4+ panels
- Alert rules loaded in Prometheus
- Alert fired when threshold exceeded
- Metrics visualized over time
- Cardinality analyzed and understood
- Incident response practiced
- Dashboard showed performance degradation
Common Issues & Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Prometheus shows "DOWN" for app | App not running | Check: docker ps and logs |
| No metrics appearing | Scrape interval not elapsed | Wait 30s after app starts |
| Grafana can't connect to Prometheus | Network issue | Verify containers on same network |
| Alert won't fire | Threshold too high | Lower threshold and test |
| High cardinality metrics | Too many tag values | Remove or group tags |
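For most of the issues above, the container logs and the shared network are the first places to look; a few commands worth keeping at hand (the network name is prefixed with your compose project/directory name, shown here as a placeholder):

```bash
# Tail recent logs from the stack
docker-compose logs --tail=50 prometheus grafana app

# Inspect the shared bridge network
docker network ls | grep monitoring
docker network inspect <project>_monitoring
```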
Next Steps
- Deploy to Kubernetes: Use Prometheus Operator for production
- Implement APM: Add distributed tracing with Jaeger
- Multi-cluster monitoring: Aggregate metrics from multiple clusters
- Custom dashboards: Build service-specific dashboards
- Alert routing: Configure PagerDuty/OpsGenie integration
Workshop Completion
- Estimated time: ~140 minutes
- Skills gained: metrics collection, dashboard creation, alerting, incident response