RUNBOOK

Last updated: January 31, 2026

Elasticsearch & OpenSearch: Production Runbook

Purpose: Operational procedures for deploying, managing, and maintaining Elasticsearch/OpenSearch clusters in production.


Table of Contents

  1. Overview
  2. Standard Deployment Configuration
  3. Cluster Management
  4. Monitoring & Health Checks
  5. Upgrades & Maintenance
  6. Disaster Recovery
  7. Troubleshooting
  8. Essential Commands Reference

1. Overview

This runbook covers operational procedures for Elasticsearch and OpenSearch (the Apache 2.0-licensed fork of Elasticsearch 7.10) clusters in production environments. The two platforms have very similar operational requirements; platform-specific differences are noted where applicable.

Assumed Audience: Infrastructure/Database engineers with Linux and search engine knowledge.


2. Standard Deployment Configuration

2.1 Pre-Deployment Checklist

Before deployment, verify (a pre-flight script to automate these checks follows the list):

# System requirements
- [ ] Minimum 3 nodes for production (quorum)
- [ ] Each node: 8+ CPU cores, 32GB+ RAM
- [ ] Storage: SSD (3-5K IOPS per node)
- [ ] Network: 1Gbps+ bandwidth between nodes
- [ ] OS: Linux (RHEL 8+, Ubuntu 20.04+)
- [ ] Java: OpenJDK 11+ or vendor JDK
- [ ] Disk space: 2-3x data size minimum
- [ ] Open ports: 9200 (HTTP), 9300 (node communication)
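
A quick pre-flight sketch can automate most of these checks; the thresholds mirror the checklist above, and the script name and paths are illustrative:

#!/bin/bash
# preflight.sh - sanity-check a node against the deployment checklist

CORES=$(nproc)
MEM_GB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 / 1024 ))

[ "$CORES" -ge 8 ]   && echo "OK: $CORES CPU cores"  || echo "FAIL: only $CORES cores"
[ "$MEM_GB" -ge 30 ] && echo "OK: ${MEM_GB}GB RAM"   || echo "FAIL: only ${MEM_GB}GB RAM"

# Java present?
java -version 2>&1 | head -1

# Required ports free?
for p in 9200 9300; do
  ss -ltn | grep -q ":$p " && echo "FAIL: port $p in use" || echo "OK: port $p free"
done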

2.2 System Tuning

Apply these kernel parameters before deployment:

# Increase memory-map area limit (required by Elasticsearch's bootstrap checks)
sudo sysctl -w vm.max_map_count=262144

# Persist settings
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf

# Verify
sysctl vm.max_map_count
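
Two related OS settings are worth handling at the same time, per Elastic's bootstrap-check guidance. The package's systemd unit already raises the file-descriptor limit, so the limits.conf line mainly matters for tarball installs:

# Raise the open-file-descriptor limit for the elasticsearch user
echo "elasticsearch - nofile 65535" | sudo tee -a /etc/security/limits.conf

# Disable swap, which badly degrades JVM performance
sudo swapoff -a
# (also comment out swap entries in /etc/fstab to persist across reboots)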

2.3 Installation Steps

Using Package Manager (Ubuntu):

# 1. Add repository
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | \
  sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | \
  sudo tee /etc/apt/sources.list.d/elastic-8.x.list

# 2. Install Elasticsearch
sudo apt-get update
sudo apt-get install -y elasticsearch

# 3. Enable and start service
sudo systemctl enable elasticsearch
sudo systemctl start elasticsearch

# 4. Verify (if 8.x auto-configured TLS, use https:// plus
#    --cacert /etc/elasticsearch/certs/http_ca.crt and the generated password)
curl -u elastic:password localhost:9200

Using Docker (Recommended for testing):

# Pull image
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.6.0

# Run a single-node test container (the multi-node discovery settings used
# in production would leave a lone container waiting for other masters)
docker run -d --name elasticsearch \
  -e discovery.type=single-node \
  -e xpack.security.enabled=true \
  -e ELASTIC_PASSWORD=password123 \
  -e ES_JAVA_OPTS="-Xms1g -Xmx1g" \
  -p 9200:9200 \
  -p 9300:9300 \
  docker.elastic.co/elasticsearch/elasticsearch:8.6.0

2.4 Configuration (elasticsearch.yml)

Minimal Production Config:

cluster.name: production-cluster
node.name: es-node-1

# Node roles
node.roles: [master, data]

# Network
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

# Discovery
discovery.seed_hosts: ["es-node-2:9300", "es-node-3:9300"]
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]

# Memory: heap (-Xms/-Xmx) is a JVM option, not a yml setting;
# see the heap note after this config block

# Thread pools (thread_pool.bulk was renamed to thread_pool.write in 6.x)
thread_pool.search.size: 48
thread_pool.search.queue_size: 1000
thread_pool.write.size: 16
thread_pool.write.queue_size: 300

# Index management
action.auto_create_index: "+logs-*,+metrics-*,-.watches,-*daily*,-*monthly*"

# Security
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

# Monitoring
xpack.monitoring.collection.enabled: true
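
Heap size belongs in the JVM options, not elasticsearch.yml. On package installs, drop a file under /etc/elasticsearch/jvm.options.d/; the usual sizing rule is 50% of available RAM, capped near 31GB so compressed object pointers stay enabled:

# /etc/elasticsearch/jvm.options.d/heap.options
-Xms16g
-Xmx16g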

2.5 Cluster Bootstrap

After installing all 3+ nodes, verify cluster formation:

# 1. Check cluster health
curl -u elastic:password localhost:9200/_cluster/health
# Expected: "status": "green"

# 2. List nodes
curl -u elastic:password localhost:9200/_cat/nodes

# 3. View cluster state
curl -u elastic:password localhost:9200/_cluster/state?pretty | head -50

3. Cluster Management

3.1 Adding Nodes to Cluster

Steps:

# 1. Install Elasticsearch on new node
# (Follow Section 2.3 installation steps)

# 2. Configure elasticsearch.yml (a sample config follows this block) with:
# - cluster.name: production-cluster (must match the existing cluster)
# - node.name: es-node-4 (unique name)
# - discovery.seed_hosts: [list of existing nodes]
# - Do NOT set cluster.initial_master_nodes (bootstrap-only setting)

# 3. Start Elasticsearch
sudo systemctl start elasticsearch

# 4. Verify node joined
curl -u elastic:password "localhost:9200/_cat/nodes?v"

# 5. Check cluster health (may be yellow during rebalancing)
curl -u elastic:password localhost:9200/_cluster/health?wait_for_status=green
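
A sample elasticsearch.yml for the joining node, under the assumptions above (hostnames are illustrative):

cluster.name: production-cluster
node.name: es-node-4
node.roles: [data]
network.host: 0.0.0.0
discovery.seed_hosts: ["es-node-1:9300", "es-node-2:9300", "es-node-3:9300"]
# cluster.initial_master_nodes is deliberately omitted: it applies only
# when bootstrapping a brand-new cluster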

3.2 Removing Nodes Safely

To remove a node without data loss:

# 1. Exclude node from shard allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "transient": {
    "cluster.routing.allocation.exclude._name": "es-node-4"
  }
}'

# 2. Wait for shards to migrate off the node (monitor progress)
watch -n 5 'curl -s -u elastic:password localhost:9200/_cat/allocation?v'
# Wait until the excluded node holds 0 shards and relocating_shards is 0

# 3. Stop service on node
ssh es-node-4
sudo systemctl stop elasticsearch

# 4. Remove node from discovery list in other nodes' config
# (Optional - node won't rejoin without a manual restart)

# 5. Clear the allocation exclusion so a future node can reuse the name
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "transient": {
    "cluster.routing.allocation.exclude._name": null
  }
}'

# 6. Verify cluster health
curl -u elastic:password localhost:9200/_cluster/health
# Should return "green" with one fewer node

3.3 Index Lifecycle Management (ILM)

Create ILM Policy:

curl -X PUT "localhost:9200/_ilm/policy/logs_policy" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "policy": "logs_policy",
  "phases": {
    "hot": {
      "min_age": "0d",
      "actions": {
        "rollover": {
          "max_primary_shard_size": "50GB",
          "max_age": "1d"
        }
      }
    },
    "warm": {
      "min_age": "7d",
      "actions": {
        "set_priority": { "priority": 25 },
        "forcemerge": { "max_num_segments": 1 }
      }
    },
    "cold": {
      "min_age": "30d",
      "actions": {
        "set_priority": { "priority": 0 }
      }
    },
    "delete": {
      "min_age": "90d",
      "actions": {
        "delete": {}
      }
    }
  }
}'

Apply to Index:

curl -X PUT "localhost:9200/logs/_settings" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "settings": {
    "index.lifecycle.name": "logs_policy",
    "index.lifecycle.rollover_alias": "logs"
  }
}'
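
To confirm the policy attached and see which phase each index is in, check the ILM explain API. Note that rollover additionally requires the index to be the write index behind the logs alias, which is normally wired up via an index template:

curl -u elastic:password "localhost:9200/logs*/_ilm/explain?pretty"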

4. Monitoring & Health Checks

4.1 Daily Health Check

Run this every morning:

#!/bin/bash
# health_check.sh

ES_HOST="localhost:9200"
ES_USER="elastic"
ES_PASS="password123"

echo "=== Cluster Health ==="
curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cluster/health" | jq '.status'

echo "=== Disk Usage ==="
curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cat/allocation?v" | head -10

echo "=== Heap Usage ==="
curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_nodes/stats" | \
  jq '.nodes[] | {name, heap_percent: .jvm.mem.heap_percent}'

echo "=== Unassigned Shards ==="
curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cluster/health" | \
  jq '.unassigned_shards'

echo "=== Index Count ==="
curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cat/indices" | wc -l

echo "=== JVM GC Time ==="
curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_nodes/stats/jvm" | \
  jq '.nodes[] | {name, gc_time_ms: .jvm.gc.collection_time_in_millis}'
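
To run this automatically, a crontab entry along these lines works (script path and log file are illustrative):

# Run the daily health check at 07:00 and keep a log
0 7 * * * /opt/scripts/health_check.sh >> /var/log/es_health_check.log 2>&1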

4.2 Key Monitoring Metrics

Metric               Healthy Range    Alert If
-------------------  ---------------  -------------
Cluster Status       green            yellow or red
Heap Usage           < 85%            > 90%
Disk Usage           < 85%            > 90%
JVM GC Time          < 1% of total    > 2%
Query Latency p99    < 200ms          > 500ms
Indexing Latency     < 100ms          > 500ms
Unassigned Shards    0                > 0

4.3 Monitoring Tools Setup

Using Prometheus + Elasticsearch Exporter:

# 1. Install exporter
docker run -d --name elasticsearch-exporter \
  -p 9114:9114 \
  prometheuscommunity/elasticsearch-exporter \
  --es.uri=http://es-node-1:9200

# 2. Add to Prometheus scrape config (indented to sit under scrape_configs:)
cat >> /etc/prometheus/prometheus.yml << 'EOF'
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['localhost:9114']
EOF

# 3. Verify metrics collected
curl -s localhost:9114/metrics | grep elasticsearch_
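
With the exporter scraped, alert rules can encode the thresholds from Section 4.2. A minimal Prometheus rule sketch; the metric names below follow the exporter's conventions, but verify them against your exporter version's /metrics output before relying on this:

# /etc/prometheus/rules/elasticsearch.yml
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchHeapHigh
        expr: >
          elasticsearch_jvm_memory_used_bytes{area="heap"}
          / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Heap above 90% on {{ $labels.name }}"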

5. Upgrades & Maintenance

5.1 Minor Version Upgrade (e.g., 8.5 → 8.6)

Rolling upgrade (zero downtime):

# 1. Disable shard allocation
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'

# 2. Stop one node
sudo systemctl stop elasticsearch

# 3. Upgrade the package
sudo apt-get update
sudo apt-get install --only-upgrade elasticsearch

# 4. Start node and wait for it to rejoin
sudo systemctl start elasticsearch
sleep 30
curl -u elastic:password localhost:9200/_cluster/health?wait_for_status=yellow

# 5. Re-enable allocation (step 6) and wait for green (step 7),
#    then repeat from step 1 for the next node

# 6. Re-enable shard allocation
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'

# 7. Wait for cluster to be green
curl -u elastic:password localhost:9200/_cluster/health?wait_for_status=green
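
Between nodes, confirm that each restarted node reports the expected version before moving on:

curl -u elastic:password "localhost:9200/_cat/nodes?v&h=name,version"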

5.2 Snapshot Backup

Create repository (S3 example):

curl -X PUT "localhost:9200/_snapshot/s3-backup" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-elasticsearch-backups",
    "region": "us-east-1",
    "compress": true
  }
}'

Create snapshot:

curl -X PUT "localhost:9200/_snapshot/s3-backup/snapshot-$(date +%Y%m%d)" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "indices": "logs-*,metrics-*",
  "include_global_state": true,
  "wait_for_completion": false
}'

# Monitor progress
curl -u elastic:password localhost:9200/_snapshot/s3-backup/_status

5.3 Maintenance Window Procedure

Schedule: 02:00-04:00 UTC (low-traffic window)

# 1. Notify team (do this 24 hours before)
echo "Maintenance window: 2024-02-07 02:00-04:00 UTC"

# 2. 10 minutes before: Disable alerts
# (Pause alerting in monitoring system)

# 3. Execute upgrade (Section 5.1)

# 4. Verify everything working
curl -u elastic:password localhost:9200/_cluster/health
curl -u elastic:password localhost:9200/_cat/indices?v | head -20

# 5. Test search functionality
curl -u elastic:password "localhost:9200/logs-*/_search?size=1"

# 6. Re-enable alerts
# (Resume alerting in monitoring system)

# 7. Send completion notification
echo "Maintenance completed successfully"

6. Disaster Recovery

6.1 Backup Strategy

Recommended:

  • Daily snapshots (kept for 30 days)
  • Weekly snapshots (kept for 1 year)
  • Monthly snapshots (kept for 7 years)
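
On Elasticsearch, snapshot lifecycle management (SLM) can implement the daily tier natively instead of via cron (OpenSearch ships a separate snapshot-management plugin with a different API). A sketch of an SLM policy matching the daily schedule and 30-day retention above:

curl -X PUT "localhost:9200/_slm/policy/daily-snapshots" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "s3-backup",
  "config": {
    "indices": ["logs-*", "metrics-*"],
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}'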

Automated backup script:

#!/bin/bash
# backup.sh - Run daily via cron

CLUSTER="https://elastic:password@localhost:9200"
REPO="s3-backup"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Create snapshot
curl -X PUT "$CLUSTER/_snapshot/$REPO/snapshot_$TIMESTAMP" \
  -H 'Content-Type: application/json' \
  -d '{
  "indices": "logs-*,metrics-*",
  "include_global_state": true
}'

# Wait for completion (bail out on failure instead of looping forever)
while true; do
  STATE=$(curl -s "$CLUSTER/_snapshot/$REPO/snapshot_$TIMESTAMP" | jq -r '.snapshots[0].state')
  if [ "$STATE" = "SUCCESS" ]; then
    echo "Snapshot $TIMESTAMP completed"
    break
  elif [ "$STATE" = "FAILED" ] || [ "$STATE" = "PARTIAL" ]; then
    echo "Snapshot $TIMESTAMP ended in state $STATE" >&2
    exit 1
  fi
  sleep 10
done

6.2 Restore from Snapshot

Partial restore (specific indices):

# 1. Check available snapshots
curl -u elastic:password localhost:9200/_snapshot/s3-backup/_all

# 2. Restore specific index
curl -X POST "localhost:9200/_snapshot/s3-backup/snapshot-20240131/_restore" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "indices": "logs-2024.01.*",
  "rename_pattern": "(.+)",
  "rename_replacement": "$1-restored"
}'

# 3. Wait for restore to complete
watch -n 5 'curl -s -u elastic:password localhost:9200/_cluster/health?pretty'

Full cluster restore (complete disaster recovery):

# 1. Spin up new Elasticsearch cluster (3+ nodes)

# 2. Restore all indices from snapshot (target indices must not already
#    exist as open indices in the new cluster)
curl -X POST "localhost:9200/_snapshot/s3-backup/snapshot-20240131/_restore" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "indices": "*",
  "include_global_state": true
}'

# 3. Wait for recovery
curl -u elastic:password localhost:9200/_cluster/health?wait_for_status=green

# 4. Verify data integrity
curl -u elastic:password "localhost:9200/_cat/indices?v" | wc -l

6.3 Cross-Cluster Replication (CCR)

For remote disaster recovery site:

# On the FOLLOWER cluster: register the leader as a remote cluster
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "persistent": {
    "cluster.remote.leader.seeds": ["leader-cluster-node:9300"]
  }
}'

# Still on the follower: auto-follow all logs-* indices from the leader
# (wildcards require an auto-follow pattern; the per-index follow API
#  is PUT /<follower_index>/_ccr/follow)
curl -X PUT "localhost:9200/_ccr/auto_follow/logs" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "remote_cluster": "leader",
  "leader_index_patterns": ["logs-*"]
}'
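
Verify that replication is flowing by querying the CCR stats API on the follower cluster:

curl -u elastic:password "localhost:9200/_ccr/stats?pretty"
# Check follow_stats for advancing checkpoints on each follower index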

7. Troubleshooting

7.1 Common Issues

Issue: Cluster Status RED

# 1. Check cluster health
curl -u elastic:password localhost:9200/_cluster/health?pretty

# 2. Find missing primary shards
curl -u elastic:password "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | \
  grep UNASSIGNED

# 3. Options:
# Option A: Wait (if node recovering)
# Option B: Add more nodes
# Option C: Force allocate (data loss risk):
curl -X POST "localhost:9200/_cluster/reroute?allow_primary=true" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "commands": [{
    "allocate_empty_primary": {
      "index": "logs-2024.01.31",
      "shard": 0,
      "node": "es-node-1"
    }
  }]
}'
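
When it is unclear why a shard stays unassigned, the allocation explain API reports the exact blocker for a specific shard (the index name below is an example):

curl -X POST "localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "index": "logs-2024.01.31",
  "shard": 0,
  "primary": true
}'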

Issue: High Heap Usage (> 90%)

# 1. Check what's using memory
curl -u elastic:password localhost:9200/_nodes/stats/jvm | \
  jq '.nodes[] | {name, heap_max_in_bytes, heap_used_in_bytes}'

# 2. Check for long-running queries
curl -u elastic:password "localhost:9200/_tasks?detailed=true&actions=*search*"

# 3. Options:
# Option A: Restart the node (the old _nodes/_shutdown API was removed;
#           use the service manager on the node itself)
ssh es-node-1
sudo systemctl restart elasticsearch

# Option B: Increase heap size (edit jvm.options, restart)
# Option C: Add more nodes (distribute load)

Issue: Slow Queries

# 1. Enable slow query logging (thresholds are index-level settings,
#    not cluster settings)
curl -X PUT "localhost:9200/logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -u elastic:password -d '{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "100ms"
}'

# 2. Check slow query logs (file name includes the cluster name)
tail -100f /var/log/elasticsearch/production-cluster_index_search_slowlog.log

# 3. Optimize (pick one or more):
# - Add timestamp filter (reduce data scanned)
# - Simplify aggregations
# - Use smaller shards (< 50GB)
# - Add more data nodes
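
For a single suspicious query, the search profile API breaks down where time goes per shard and per Lucene stage. Output is verbose, so narrow the query first; the field name here is a hypothetical example:

curl -u elastic:password "localhost:9200/logs-*/_search?pretty" \
  -H 'Content-Type: application/json' -d '{
  "profile": true,
  "size": 0,
  "query": { "match": { "message": "timeout" } }
}'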

7.2 Emergency Procedures

Cluster Won't Start

# 1. Check logs
tail -100 /var/log/elasticsearch/elasticsearch.log

# 2. Common causes:
# - Not enough disk space (clear old indices)
# - Bad configuration file (syntax error)
# - Java version mismatch (upgrade Java)
# - Permission issues (check file ownership)

# 3. Recover:
sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo systemctl start elasticsearch

Complete Data Loss Recovery

# 1. If snapshot exists:
# Follow Section 6.2 full cluster restore

# 2. If no snapshot:
# - Accept data loss
# - Restart cluster
# - Restart data ingestion

# Prevention: Always maintain backups!

8. Essential Commands Reference

Cluster Operations

# Health status
curl -u elastic:password localhost:9200/_cluster/health

# Cluster info
curl -u elastic:password localhost:9200/

# Nodes information
curl -u elastic:password localhost:9200/_nodes

# Cluster settings
curl -u elastic:password localhost:9200/_cluster/settings

# Update cluster setting
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -u elastic:password \
  -d '{"transient": {...}}'

Index Operations

# List all indices
curl -u elastic:password localhost:9200/_cat/indices

# Index statistics
curl -u elastic:password localhost:9200/<index>/_stats

# Get index settings
curl -u elastic:password localhost:9200/<index>/_settings

# Update index settings
curl -X PUT "localhost:9200/<index>/_settings" \
  -H 'Content-Type: application/json' -u elastic:password \
  -d '{"settings": {...}}'

# Delete old indices (8.x blocks wildcard deletes unless
# action.destructive_requires_name is set to false)
curl -X DELETE -u elastic:password "localhost:9200/logs-2024.01.*"

Shard Operations

# List shard allocation
curl -u elastic:password localhost:9200/_cat/shards

# Explain shard allocation
curl -u elastic:password "localhost:9200/_cluster/allocation/explain?pretty"

# Force shard move
curl -X POST "localhost:9200/_cluster/reroute" \
  -H 'Content-Type: application/json' -u elastic:password -d '{
  "commands": [{
    "move": {
      "index": "logs-2024.01.31",
      "shard": 0,
      "from_node": "es-node-1",
      "to_node": "es-node-2"
    }
  }]
}'

Snapshot Operations

# Create snapshot
curl -X PUT -u elastic:password "localhost:9200/_snapshot/repo/snapshot-name" \
  -H 'Content-Type: application/json' -d '{...}'

# List snapshots
curl -u elastic:password localhost:9200/_snapshot/repo/_all

# Restore from snapshot
curl -X POST -u elastic:password "localhost:9200/_snapshot/repo/snapshot-name/_restore"

# Delete snapshot
curl -X DELETE -u elastic:password "localhost:9200/_snapshot/repo/snapshot-name"

Runbook Maintenance

Last Updated: January 31, 2026
Maintained By: Database & Search Team
Contact: ops-team@example.com

Critical Contacts (for immediate help):

  • On-Call DBA: [Phone/Pager]
  • Database Team Slack: #elasticsearch-support
  • Escalation: Database Team Lead