Apache Kafka Operations & Infrastructure Runbook

1. Overview

This runbook covers production operational procedures for deploying, managing, and troubleshooting Apache Kafka clusters, including broker management, topic administration, producer/consumer operations, and disaster recovery.

Scope: Kafka cluster setup, topic management, performance tuning, monitoring, troubleshooting
Target Audience: DevOps engineers, SREs, platform engineers, Kafka administrators
Prerequisite: CONCEPT.md (architecture, core concepts)

2. Kafka Cluster Deployment

2.1 Broker Configuration (Production)

Broker Properties (server.properties):

# Basic Configuration
# Unique ID per broker (.properties files only treat '#' as a comment at the start of a line)
broker.id=1
listeners=PLAINTEXT://kafka-broker-1:9092,SSL://kafka-broker-1:9093
advertised.listeners=PLAINTEXT://kafka-broker-1:9092,SSL://kafka-broker-1:9093
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL

# Zookeeper/KRaft Coordination
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181
zookeeper.session.timeout.ms=18000

# Log Configuration
log.dirs=/var/kafka-logs
# Retain data for 7 days
log.retention.hours=168
# Retain at most 1 GB per partition
log.retention.bytes=1073741824
# Roll segments at 1 GB
log.segment.bytes=1073741824
# Use 'compact' instead for compacted topics
log.cleanup.policy=delete

# Performance Tuning
num.network.threads=8
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

# Replication
min.insync.replicas=2
default.replication.factor=3
auto.leader.rebalance.enable=true

# Topic Defaults (default.replication.factor is set once, under Replication)
num.partitions=3
# Require topics to be created explicitly
auto.create.topics.enable=false

# Metrics & Monitoring
metrics.num.samples=3
metrics.sample.window.ms=30000
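
A file like the one above is easy to get subtly wrong: a key defined twice, or a trailing annotation that the Java properties parser reads as part of the value. The following is a minimal sketch of a pre-deployment check, assuming simple key=value lines; the helper names and the sample excerpt are illustrative and not part of Kafka.

```python
def parse_properties(text):
    """Parse key=value lines roughly the way Java .properties files are read:
    only lines that START with '#' or '!' count as comments."""
    settings, duplicates = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in settings:
            duplicates.append(key)
        settings[key] = value
    return settings, duplicates


def inline_comment_keys(settings):
    """Keys whose value contains '#', usually a misplaced inline comment."""
    return [key for key, value in settings.items() if "#" in value]


# Hypothetical excerpt used only to exercise the checks.
sample = """
# Basic Configuration
broker.id=1   # Unique ID per broker
default.replication.factor=3
num.partitions=3
default.replication.factor=3
"""

settings, duplicates = parse_properties(sample)
print("duplicate keys:", duplicates)
print("values with inline comments:", inline_comment_keys(settings))
```

Running a check of this kind in CI before a rollout is one option; it does not replace starting a broker against the file in a staging environment.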

2.2 Kubernetes Deployment (Strimzi Operator)

Install Strimzi Operator:

# Add Helm repository
helm repo add strimzi https://strimzi.io/charts
helm repo update

# Install operator
helm install strimzi strimzi/strimzi-kafka-operator \
  --namespace kafka \
  --create-namespace \
  --set watchAnyNamespace=true

# Verify operator running
kubectl get pods -n kafka

Create Kafka Cluster:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: prod-cluster
  namespace: kafka
spec:
  # Broker Configuration
  kafka:
    version: 3.7.0
    replicas: 3
    
    listeners:
    - name: plain
      port: 9092
      type: internal
      tls: false
    - name: tls
      port: 9093
      type: internal
      tls: true
    - name: external
      port: 9094
      type: loadbalancer
      tls: true
    
    config:
      log.retention.hours: 168
      log.retention.bytes: 1073741824
      num.network.threads: 8
      num.io.threads: 8
      default.replication.factor: 3
      min.insync.replicas: 2
      auto.create.topics.enable: "false"
      compression.type: "snappy"
    
    # Storage
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd
    
    # Resources
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "4"
        memory: "8Gi"
    
    # Pod template (graceful shutdown, anti-affinity across nodes)
    template:
      pod:
        terminationGracePeriodSeconds: 300
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: strimzi.io/cluster
                    operator: In
                    values:
                    - prod-cluster
                topologyKey: kubernetes.io/hostname
  
  # Zookeeper Configuration (or use KRaft mode)
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      class: fast-ssd
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "2Gi"
  
  # Entity Operator (Topic Operator and User Operator)
  entityOperator:
    topicOperator:
      watchedNamespace: kafka
      reconciliationIntervalSeconds: 60
    userOperator:
      watchedNamespace: kafka

Deploy cluster:

kubectl apply -f kafka-cluster.yaml

# Monitor rollout
kubectl get kafka prod-cluster -n kafka -w
kubectl describe kafka prod-cluster -n kafka

# Verify brokers
kubectl get pods -n kafka -l strimzi.io/cluster=prod-cluster

3. Topic Management

3.1 Creating Topics

Using Kafka CLI:

# Create topic with 3 partitions, replication factor 3
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 3 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config compression.type=snappy

# Verify creation
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic orders

Using Kubernetes CRD (Strimzi):

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster
spec:
  partitions: 3
  replicationFactor: 3
  config:
    retention.ms: 604800000              # 7 days
    compression.type: snappy
    segment.ms: 86400000                 # 1 day segments
    cleanup.policy: delete
    min.cleanable.dirty.ratio: 0.5

Topic Configuration Best Practices:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: user-events
  namespace: kafka
  labels:
    strimzi.io/cluster: prod-cluster      # Required for the Topic Operator to manage the topic
spec:
  partitions: 6                           # Scale based on throughput
  replicationFactor: 3                    # HA requirement
  config:
    retention.ms: 2592000000              # 30 days for events
    compression.type: snappy              # Reduce disk usage
    segment.ms: 3600000                   # 1 hour segments
    cleanup.policy: delete                # TTL-based cleanup
    min.insync.replicas: 2                # Durability
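
The "scale based on throughput" guidance above can be made concrete with a common rule of thumb: take the target throughput, divide it by what a single partition sustains on the producer side and on the consumer side, and use the larger result. The sketch below assumes you have measured those per-partition rates yourself; the numbers are hypothetical.

```python
import math


def partitions_needed(target_mb_s, producer_mb_s_per_partition,
                      consumer_mb_s_per_partition):
    """Smallest partition count that satisfies both producer and consumer
    throughput, given measured per-partition rates (all in MB/s)."""
    by_producer = math.ceil(target_mb_s / producer_mb_s_per_partition)
    by_consumer = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(by_producer, by_consumer)


# Hypothetical measurements: 60 MB/s target, 20 MB/s per partition when
# producing, 10 MB/s per partition when consuming.
print(partitions_needed(60, 20, 10))  # 6
```

Leaving headroom above the computed value is usually cheaper than adding partitions later, because adding partitions changes the key-to-partition mapping.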

3.2 Partition Reassignment

Rebalance Partitions (balance broker load):

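# Create topics.json listing the topics to move:
{
  "topics": [
    {"topic": "orders"},
    {"topic": "user-events"}
  ],
  "version": 1
}

# Generate reassignment plan (the output contains the current AND the proposed assignment)
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3" \
  --generate | tee reassignment-plan.txt

# Keep the current assignment for rollback and extract the proposed plan
grep -A1 "Current partition replica assignment" reassignment-plan.txt | tail -1 > rollback.json
grep -A1 "Proposed partition reassignment configuration" reassignment-plan.txt | tail -1 > reassignment.json

# Review plan
cat reassignment.json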

# Execute reassignment
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json \
  --execute

# Monitor progress
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json \
  --verify

Scaling Partitions (increase parallelism):

# Add partitions (can only increase, not decrease)
kafka-topics.sh --alter \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --partitions 6

# Verify
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic orders
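
Adding partitions changes which partition a keyed record lands on, so per-key ordering is not preserved across the change for keys that move. The sketch below illustrates the effect of going from 3 to 6 partitions. It uses CRC32 as a stand-in hash (an assumption for illustration: Kafka's default partitioner uses murmur2), and the key names are made up.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition as hash(key) mod partition count.
    CRC32 is a stand-in here; Kafka's default partitioner uses murmur2."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keys.
keys = [f"customer-{i}" for i in range(1000)]
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 6)]
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

If consumers depend on per-key ordering, one option is to drain the topic before altering it, or to create a new topic with the target partition count and migrate producers to it.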

4. Producer Operations

4.1 Producer Configuration (Performance)

# Batch Settings (for high throughput)
batch.size=32768                     # 32 KB batches
linger.ms=100                        # Wait 100ms to batch
compression.type=snappy              # Reduce network/disk

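# Reliability (trade-off with latency)
acks=all                             # Wait for all in-sync replicas (durability)
enable.idempotence=true              # No duplicates or reordering on retry
retries=2147483647                   # Retry until delivery.timeout.ms expires
max.in.flight.requests.per.connection=5  # Pipeline for throughput (max 5 with idempotence)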

# Timeouts
request.timeout.ms=30000             # 30 seconds
delivery.timeout.ms=300000           # 5 minutes total

# Buffer Management
buffer.memory=67108864               # 64 MB buffer pool
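
The timeout settings above are related: the Java producer rejects a configuration in which delivery.timeout.ms is smaller than linger.ms + request.timeout.ms. A minimal sketch of that check, using the values from this section; the function name is illustrative.

```python
def check_producer_timeouts(config):
    """Return (is_valid, minimum_delivery_timeout_ms) for a producer config.
    delivery.timeout.ms must be at least linger.ms + request.timeout.ms."""
    minimum = int(config["linger.ms"]) + int(config["request.timeout.ms"])
    return int(config["delivery.timeout.ms"]) >= minimum, minimum


producer_config = {
    "linger.ms": "100",
    "request.timeout.ms": "30000",
    "delivery.timeout.ms": "300000",
}
print(check_producer_timeouts(producer_config))  # (True, 30100)
```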

4.2 Monitoring Producer Performance

# Start a console producer to generate test traffic, with JMX enabled
# (kafka-run-class.sh reads JMX_PORT; 9999 is an example port)
JMX_PORT=9999 kafka-console-producer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --property parse.key=true \
  --property key.separator=:

# In another terminal, attach JConsole to the producer JVM
jconsole localhost:9999

# Check producer metrics via JMX:
# kafka.producer:type=producer-metrics,client-id=*
#   - record-send-rate (messages/sec)
#   - record-error-rate (errors/sec)
#   - record-queue-time-avg (batching delay)

5. Consumer Operations

5.1 Consumer Configuration

# Offset Management
group.id=order-processing-service
auto.offset.reset=earliest           # Start from beginning if no offset found
enable.auto.commit=false              # Manual offset management (safer)
auto.commit.interval.ms=1000         # If auto-commit enabled

# Performance
fetch.min.bytes=1024
fetch.max.wait.ms=500
max.partition.fetch.bytes=1048576    # 1 MB per partition
session.timeout.ms=30000

# Rebalancing
heartbeat.interval.ms=10000          # Send heartbeat every 10 sec
max.poll.records=500
max.poll.interval.ms=300000          # 5 minutes to process records

# Isolation Level
isolation.level=read_committed       # Only committed messages
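
Two relationships in the settings above are worth checking whenever they change. The Kafka documentation recommends keeping heartbeat.interval.ms at no more than one third of session.timeout.ms, and max.poll.interval.ms divided by max.poll.records gives the average time the application may spend on each record before the consumer is removed from the group. A small sketch using the values from this section; the function name is illustrative.

```python
def consumer_timing_report(config):
    """Derive two sanity figures from a consumer configuration."""
    session = int(config["session.timeout.ms"])
    heartbeat = int(config["heartbeat.interval.ms"])
    poll_interval = int(config["max.poll.interval.ms"])
    poll_records = int(config["max.poll.records"])
    return {
        # Heartbeat should fit at least three times into the session timeout.
        "heartbeat_ok": heartbeat * 3 <= session,
        # Average processing budget per record for one full poll batch.
        "per_record_budget_ms": poll_interval / poll_records,
    }


consumer_config = {
    "session.timeout.ms": "30000",
    "heartbeat.interval.ms": "10000",
    "max.poll.interval.ms": "300000",
    "max.poll.records": "500",
}
print(consumer_timing_report(consumer_config))
# {'heartbeat_ok': True, 'per_record_budget_ms': 600.0}
```

If measured processing time per record approaches the budget, lowering max.poll.records or raising max.poll.interval.ms avoids rebalances caused by slow polls.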

5.2 Consumer Group Management

List Consumer Groups:

kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --list

# Output:
# order-processing-service
# user-events-processor
# analytics-consumer

Monitor Consumer Group:

kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group order-processing-service \
  --describe

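# Output shows (recent Kafka versions also print HOST and CLIENT-ID columns):
# GROUP                     TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
# order-processing-service  orders  0          1000            1500            500  consumer-1
# order-processing-service  orders  1          2000            2000            0    consumer-2
# order-processing-service  orders  2          1800            2000            200  consumer-3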
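
For scripted checks, the describe output can be parsed rather than read by eye. Column positions differ between Kafka versions, so the sketch below locates the LAG column from the header row instead of hard-coding an index. It is a minimal example that assumes whitespace-separated columns; the sample text is illustrative.

```python
def total_lag(describe_output):
    """Sum the LAG column of `kafka-consumer-groups.sh --describe` output."""
    lag_index = None
    total = 0
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "LAG" in fields:            # header row: remember where LAG sits
            lag_index = fields.index("LAG")
            continue
        if lag_index is None or len(fields) <= lag_index:
            continue
        value = fields[lag_index]
        if value.isdigit():            # '-' means no committed offset yet
            total += int(value)
    return total


sample = """
GROUP                     TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
order-processing-service  orders  0          1000            1500            500  consumer-1
order-processing-service  orders  1          2000            2000            0    consumer-2
order-processing-service  orders  2          1800            2000            200  consumer-3
"""
print(total_lag(sample))  # 700
```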

Reset Consumer Offset:

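# Offsets can only be reset while the group has no active members.
# Replace --execute with --dry-run to preview the result first.

# Reset to earliest (from beginning)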
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group order-processing-service \
  --topic orders \
  --reset-offsets \
  --to-earliest \
  --execute

# Reset to specific offset
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group order-processing-service \
  --topic orders \
  --reset-offsets \
  --to-offset 1000 \
  --execute

# Reset to timestamp
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group order-processing-service \
  --topic orders \
  --reset-offsets \
  --to-datetime 2024-01-15T10:00:00.000 \
  --execute

6. Monitoring & Alerting

6.1 Key Metrics

# Broker Health
# kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
# Alert if > 0 (indicates replica lag)

# Consumer Lag
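kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group order-processing-service \
  --describe | awk '$6 ~ /^[0-9]+$/ && $6 > 1000 {print "HIGH LAG: " $0}'

# LAG is column 6 when the output starts with a GROUP column; use $5 if it does not.
# The numeric test skips the header row and rows where LAG is "-".
# Alert if LAG > threshold (e.g., 1000 messages)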

# Broker Disk Usage
kafka-log-dirs.sh \
  --bootstrap-server localhost:9092 \
  --describe

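# kafka-log-dirs.sh reports bytes used per log directory and partition;
# compare against volume capacity (e.g., df -h /var/kafka-logs)
# Alert if disk usage > 80%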

# Replication Status
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Alert if any output (indicates lagging or offline replicas)

6.2 Prometheus Metrics

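# PodMonitor for Strimzi (Prometheus Operator)
# Strimzi exposes metrics on the broker pods, not on a Service, and only
# when metricsConfig is set in the Kafka resource
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-metrics
  namespace: kafka
spec:
  selector:
    matchLabels:
      strimzi.io/cluster: prod-cluster
  podMetricsEndpoints:
  - port: tcp-prometheus
    path: /metrics
    interval: 30s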

Key Alert Rules:

groups:
- name: kafka.rules
  interval: 30s
  rules:
  - alert: KafkaUnderReplicatedPartitions
    expr: kafka_server_replicamanager_underreplicatedpartitions > 0
    for: 5m
    annotations:
      summary: "Kafka under-replicated partitions: {{ $value }}"
  
  - alert: KafkaConsumerLagHigh
    expr: kafka_consumer_lag{consumer_group=~"order.*"} > 10000
    for: 10m
    annotations:
      summary: "Consumer group {{ $labels.consumer_group }} lag: {{ $value }}"
  
  - alert: KafkaBrokerDiskUsage
    expr: kafka_log_log_size_bytes / kafka_log_log_max_size_bytes > 0.8
    for: 5m
    annotations:
      summary: "Broker {{ $labels.broker_id }} disk usage: {{ $value | humanizePercentage }}"

7. Troubleshooting

7.1 Broker Issues

Broker Won't Start:

# Check logs
tail -f /var/log/kafka/server.log | grep ERROR

# Common issues:
# 1. Port already in use
lsof -i :9092

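# 2. Broker ID conflict (compare broker.id with the IDs already registered
#    and with the ID stored in the log directory)
zookeeper-shell.sh localhost:2181 ls /brokers/ids
cat /var/kafka-logs/meta.properties

# 3. Zookeeper connection failed (check that the ensemble answers)
zookeeper-shell.sh localhost:2181 ls /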

Leader Election Issues:

# Check controller status
zookeeper-shell.sh localhost:2181 get /controller

# Check broker leadership
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic orders

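# Expected: leaders spread evenly across brokers
# If one broker leads most partitions, trigger a preferred leader election:
kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions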

7.2 Consumer Issues

High Consumer Lag:

# 1. Check consumer status
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group slow-consumer \
  --describe

# 2. Check partition assignment
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group slow-consumer \
  --describe --members

# 3. Increase consumer instances (up to partition count)
# 4. Tune fetch settings:
max.poll.records=1000               # Increase if memory allows
fetch.min.bytes=10240               # Larger batches

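# 5. Monitor consumer processing time
# In application code (metrics and process() are the application's own):
import time
start_time = time.monotonic()
process(message)
processing_time = time.monotonic() - start_time
metrics.histogram('message_processing_ms', processing_time * 1000)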

Message Loss:

# 1. Verify producer acks setting
# Should be: acks=all

# 2. Verify min.insync.replicas >= 2 (--all also shows values inherited from broker defaults)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --all --entity-type topics --entity-name orders

# 3. Check replication factor
kafka-topics.sh --describe --bootstrap-server localhost:9092 \
  --topic orders

# Expected: ReplicationFactor: 3 (or at least 2)
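
The three checks above interact: with acks=all, a write succeeds only while at least min.insync.replicas replicas are in sync, so the number of brokers that can be down without blocking writes is the replication factor minus min.insync.replicas. A short sketch of that arithmetic; the function name is illustrative.

```python
def write_fault_tolerance(replication_factor, min_insync_replicas):
    """Replicas that can be unavailable while acks=all writes still succeed."""
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_insync_replicas


for rf, min_isr in [(3, 2), (3, 3), (2, 2)]:
    print(f"RF={rf} min.insync.replicas={min_isr} "
          f"tolerates {write_fault_tolerance(rf, min_isr)} replica(s) down")
```

With a replication factor of 3 and min.insync.replicas of 2, one broker can fail without interrupting producers; setting both to the same value means any single failure blocks acks=all writes.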

8. Backup & Disaster Recovery

8.1 Topic Backup

# Export topic data with partition and offset metadata (for inspection)
# --timeout-ms makes the consumer exit once no new records arrive
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --from-beginning \
  --timeout-ms 10000 \
  --property print.offset=true \
  --property print.partition=true > orders-backup.txt

# Or export keys and values only, tab-separated (suitable for replay)
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --from-beginning \
  --timeout-ms 10000 \
  --property print.key=true \
  --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
  --property value.deserializer=org.apache.kafka.common.serialization.StringDeserializer > orders-backup.tsv

8.2 Topic Restore

# Restore from the key/value export. Do not replay orders-backup.txt:
# its partition/offset prefixes would become part of each message.
kafka-console-producer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders-restored \
  --property parse.key=true < orders-backup.tsv

# This text-based approach loses headers, timestamps, and partition placement,
# and breaks on binary or multi-line payloads. For production, use
# MirrorMaker 2 or Confluent Replicator for better control and durability.

9. Operational Checklists

Pre-Production

Cluster deployed with 3+ brokers
Replication factor set to 3 on all topics
min.insync.replicas = 2
Monitoring and alerting configured
Consumer groups created and tested
Backup/restore procedure tested
Load testing completed (throughput verified)
Failover testing completed (broker down scenario)

Daily Operations

Check broker health (no under-replicated partitions)
Monitor consumer lag (should be < 1000 messages)
Verify disk usage (< 80%)
Check for rebalancing events

Weekly Maintenance

Review slow consumer groups
Optimize topic partitioning if needed
Verify replication status
Test backup/restore procedure

10. Performance Tuning

High Throughput Settings:

# Broker
num.network.threads=16
num.io.threads=16
compression.type=snappy

# Producer
batch.size=65536
linger.ms=100
compression.type=snappy

# Consumer
fetch.min.bytes=10240
fetch.max.wait.ms=500
max.partition.fetch.bytes=10485760

Low Latency Settings:

# Producer
batch.size=1024
linger.ms=10
acks=1                           # Trade-off with durability

# Consumer
fetch.min.bytes=1
fetch.max.wait.ms=100
max.poll.records=100
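
The batch.size and linger.ms values in both profiles trade latency for throughput: a batch is sent when it fills up or when linger.ms expires, whichever comes first. The sketch below estimates the resulting wait for a single partition. It ignores compression and assumes a steady record rate; the traffic figures are hypothetical.

```python
def batch_wait_ms(batch_size_bytes, linger_ms, records_per_sec, avg_record_bytes):
    """Estimated time a batch waits before being sent: the smaller of
    linger.ms and the time needed to fill batch.size at the given rate."""
    bytes_per_ms = records_per_sec * avg_record_bytes / 1000
    if bytes_per_ms == 0:
        return float(linger_ms)
    fill_ms = batch_size_bytes / bytes_per_ms
    return min(float(linger_ms), fill_ms)


# Hypothetical traffic: 5000 records/s of 1000 bytes each to one partition.
high_throughput = batch_wait_ms(65536, 100, 5000, 1000)
low_latency = batch_wait_ms(1024, 10, 5000, 1000)
print(f"high-throughput profile: ~{high_throughput:.1f} ms")
print(f"low-latency profile:     ~{low_latency:.1f} ms")
```

At low traffic the batch rarely fills, so the wait approaches linger.ms; in that case linger.ms, not batch.size, is the setting that determines latency.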

Last Updated: January 2026
Maintained by: Platform Engineering Team
Version: 1.0.0
