CONCEPT
Infrastructure Debugging Concepts & Methodologies
Overview
This guide provides a comprehensive framework for debugging infrastructure issues across Kubernetes, VMs, networking, and cloud systems. Rather than offering quick fixes, it teaches debugging methodologies that help you develop problem-solving skills and build your own experience.
Core Principle: These are general guides that point in rough directions. Every infrastructure issue is unique—use these frameworks to train your diagnostic thinking.
1. Debugging Framework & Strategy
1.1 The Six-Step Debugging Process
┌─────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE DEBUGGING PROCESS │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. OBSERVE 2. CORRELATE 3. NARROW DOWN │
│ ↓ ↓ ↓ │
│ Collect logs, Match events to Identify scope: │
│ metrics, timeline and - Time window │
│ events symptoms - Affected requests │
│ - Location (node) │
│ │
│ 4. REPLICATE 5. ROOT CAUSE 6. FIX & VERIFY │
│ ↓ ↓ ↓ │
│ Can you stably Find the origin Implement fix, │
│ reproduce? of the problem verify metrics │
│ Pattern analysis normalize │
│ for intermittent │
│ │
└─────────────────────────────────────────────────────────────┘
1.2 Observation Phase: The Investigation Pyramid
Start with the broadest signals, then narrow down:
Level 1: SYSTEM HEALTH
├─ Overall availability (Is anything down?)
├─ Cluster/service status
└─ High-level metrics (CPU, memory, disk)
Level 2: APPLICATION SIGNALS
├─ Error rates & status codes
├─ Request latency (P50, P99)
└─ Log volume & severity
Level 3: REQUEST FLOW
├─ Request path (DNS → LB → Ingress → Pod)
├─ Log presence at each stage
└─ Response codes at each hop
Level 4: DETAILED DIAGNOSTICS
├─ Pod logs, events, resource usage
├─ Network captures (tcpdump)
└─ Application internals (debug endpoints)
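A broad-to-narrow first pass that roughly follows this pyramid might look like the sketch below (pod and namespace are placeholders; kubectl top assumes metrics-server is installed):
kubectl get nodes -o wide                                    # Level 1: any node NotReady?
kubectl top nodes                                            # Level 1: CPU/memory pressure
kubectl get events -A --sort-by='.lastTimestamp' | tail -20  # Level 2: recent warnings
kubectl logs <pod> -n <ns> --tail=100                        # Level 4: detailed diagnostics on the suspect pod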
1.3 Correlation Strategy: Building the Timeline
Key questions to correlate events:
Time Correlation:
- When did the issue start? (exact timestamp)
- Is it still happening? (persistent vs intermittent)
- Has it happened before? (historical pattern)
- Any scheduled events at this time? (deployments, maintenance)
Request Correlation:
- Does it happen for ALL requests? (scope)
- Or specific endpoints? (path filtering)
- Or specific parameters? (request characteristics)
- Or specific clients? (geographic, user agent)
Location Correlation:
- Does it happen in load balancer? (early in chain)
- Or only in pods? (late in chain)
- One worker node or all? (distributed problem)
- One region or multiple? (infrastructure vs application)
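A minimal sketch of time correlation in practice, assuming you can pull logs from each hop; the time window, service, and namespace below are placeholders, and all timestamps should be read in UTC:
# Same five-minute window from three sources
journalctl -u nginx --utc --since "2024-11-15 14:00:00" --until "2024-11-15 14:05:00"
kubectl logs deploy/api-service -n <namespace> --since-time="2024-11-15T14:00:00Z" --timestamps
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# A deployment, scaling, or node event just before the first error is a strong lead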
1.4 Narrowing Down: Scope Reduction
Decompose the problem into smaller domains:
| Domain | Questions | Examples |
|---|---|---|
| Time | Duration? Sporadic or continuous? | 2 min duration, 15:30-15:32 UTC |
| Request | Path? Parameters? Method? Size? | GET /api/users?id=123, 50KB payload |
| Geography | Region? Zone? Node? Pod? | us-east1-b, worker-3, pod-replica-2 |
| Services | Which service affected? Dependencies? | api-service → database connection |
| Users | All users? Specific user types? | All users, or premium users only |
Example:
- Observed: 5xx errors for 5 minutes
- Narrow down: Only in us-west region, only POST requests, only >1MB payloads, only to order-service
- Conclusion: Likely network buffer size issue in us-west load balancer
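As a rough sketch, the request-scope slice above can come from a quick pass over load balancer or ingress access logs; the field numbers below assume nginx's default combined log format and will need adjusting for other formats:
awk '$9 ~ /^5/ {print $6, $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head   # which methods/paths return 5xx
# Repeat per region or per load balancer to see whether the pattern is localized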
2. Common Scenarios & Diagnostic Flows
2.1 Kubernetes Service Returning 4xx/5xx Errors
Decision Tree:
HTTP Error Response
│
├─ 404 Not Found
│ ├─ Does route exist in service?
│ │ ├─ YES → Check DNS pointing to correct IP
│ │ │ → Check Imperva/WAF routing
│ │ │ → Check ingress resource hostname matches
│ │ │ → Check no duplicate ingress resources
│ │ └─ NO → Endpoint doesn't exist, fix application route
│ │
│ └─ Is request reaching application logs?
│ ├─ YES in app → Route not found (expected)
│ └─ NO in app → Request stopped earlier (ingress/network)
│
├─ 502 Bad Gateway
│ ├─ Is 502 returned immediately?
│ │ ├─ YES → Check access logs for upstream routing
│ │ │ → Check upstream process health (WSGI/FastCGI)
│ │ │ → Check request sizes (buffer issue)
│ │ └─ NO (delay 2-5s) → Network issue, see Connection Timeout
│ │
│ └─ Common cause: Malformed HTTP response from backend
│
├─ 503 Service Unavailable
│ ├─ Check k8s service selector matches pods
│ ├─ kubectl get endpoints <service> (should show pod IPs)
│ └─ No healthy pods? Check pod status:
│ ├─ Pending → Check resource requests (no node capacity)
│ ├─ CrashLoopBackOff → Check pod logs
│ └─ Running but unhealthy → Check liveness probe
│
├─ 504 Gateway Timeout
│ ├─ Check if request reached upstream
│ │ ├─ YES → Upstream slow (slow query, blocked I/O)
│ │ └─ NO → Network timeout, see Connection Timeout
│ │
│ └─ Check access logs for actual duration vs timeout config
│
└─ Other errors (500, 429, etc.)
└─ Usually application error, check app logs
Verification Commands:
# 1. Check if route exists
kubectl get ingress -A | grep <hostname>
# 2. Check service selector
kubectl describe service <service-name> -n <namespace>
kubectl get pod -n <namespace> --show-labels
# 3. Verify pod health
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
# 4. Check service endpoints
kubectl get endpoints <service-name> -n <namespace>
# 5. Check ingress logs
kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=100
2.2 VM Service Not Working (Kafka, Elasticsearch, MongoDB, PostgreSQL)
Diagnostic Checklist:
┌─ Service Status
│ ├─ systemctl status <service> (process running?)
│ ├─ ps -ef | grep <process> (verify process exists)
│ └─ netstat -tlnp (verify listening port)
│
├─ Logs
│ ├─ journalctl -u <service> -n 100 (recent logs)
│ ├─ tail -f /var/log/<service>/<service>.log (live logs)
│ └─ Check for ERROR or FATAL messages
│
├─ Cluster Health (for distributed services)
│ ├─ curl localhost:9200/_cluster/health (Elasticsearch)
│ ├─ kafka-broker-api-versions.sh --bootstrap-server localhost:9092 (Kafka)
│ ├─ mongo localhost:27017 (MongoDB)
│ └─ psql -U postgres -d postgres (PostgreSQL)
│
├─ Resource Constraints
│ ├─ htop or top (CPU, memory usage)
│ ├─ iotop (disk I/O contention)
│ ├─ df -h (disk space full?)
│ └─ free -h (memory available?)
│
└─ Network Connectivity
├─ ss -tlnp (listening on expected interface?)
├─ ip r (correct routing?)
└─ Can other nodes reach this node?
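For the last check, a quick reachability test run from a peer node (IP and port are placeholders; 9092 is used as a Kafka example):
ping -c 3 10.0.1.20 (basic reachability; ICMP may be blocked even when the service works)
nc -zv 10.0.1.20 9092 (can a TCP connection be opened to the service port?)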
2.3 Connection Issues (Timeout, Refused, Reset)
Connection Timeout (no response after 30s):
Likely causes (in order of frequency):
1. DNS fails → hostname doesn't resolve
└─ dig/nslookup hostname (check resolution)
2. Firewall blocks traffic → packets never arrive
└─ traceroute, tcpdump (see if packets reach destination)
3. Routing broken → packets go wrong direction
└─ ip r get <destination> (check route table)
4. Network device issue → port exhaustion or congestion
└─ Intermittent, hard to diagnose (see tcpdump analysis)
Verification:
ping <destination> (ICMP, tests connectivity)
nc -zv <host> <port> (TCP port test)
traceroute <destination> (trace path)
tcpdump -i eth0 'host <destination>' (packet capture)
Connection Refused (immediate error, port closed):
Means: Network good, but destination port not listening
Check destination:
1. Is service running?
└─ systemctl status <service>
2. Is it listening on correct port/interface?
└─ ss -tlnp (t=tcp, l=listening, n=numeric, p=process)
└─ Example output: LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("app",pid=1234))
3. Is firewall blocking?
└─ iptables -L (local firewall rules)
If cloud load balancer returns "Refused":
└─ No healthy backends configured
└─ Check: Does backend health check pass?
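If the load balancer is a GCP one, a hedged way to confirm the "no healthy backends" theory (backend service name is a placeholder):
gcloud compute backend-services get-health <backend-service> --global
# Every backend should report HEALTHY; an empty or UNHEALTHY list explains the refused responses
# For Kubernetes-managed load balancers, also compare: kubectl get endpoints <service> -n <namespace>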
Connection Reset by Peer (TCP RST packet):
Server actively closes connection (not just timeout)
Possible causes:
1. Server hitting connection limit
└─ Too many simultaneous connections
└─ Fix: Increase max connections or reduce client connections
2. TCP timeout on server side
└─ Request takes longer than timeout config
└─ Fix: Increase server timeout or optimize slow operations
3. Application logic rejects connection
└─ Authentication fails, rate limit, etc.
└─ Check: Application logs
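A few quick checks for cause 1 (connection limits), assuming a Linux server you can shell into; the psql line only applies if the backend is PostgreSQL:
ss -s (socket summary: totals per TCP state)
ss -tan state established | wc -l (current established connections)
sysctl net.core.somaxconn (kernel listen backlog limit)
psql -U postgres -c 'SHOW max_connections;' (application-level cap, PostgreSQL example)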
2.4 SSH Connection Issues
Timeout (no response):
[Step 1] Verify you're on correct VPN
$ dig +short bastion.example.com
10.0.16.5
$ ip r get 10.0.16.5
10.0.16.5 via 10.0.16.1 dev tun0 (✓ correct, goes through VPN)
# vs
10.0.16.5 via 192.168.1.1 dev eth0 (✗ wrong, doesn't use VPN)
[Step 2] Verify using correct bastion
$ ssh -v <host>
Look for: "Setting implicit ProxyCommand from ProxyJump: ssh -v -W..."
This means it's correctly using bastion as jump host
[Step 3] Test firewall/routing
$ traceroute <host> (see if path exists)
Permission Denied:
[Step 1] Verify username is correct
$ gcloud compute os-login describe-profile (see expected username)
# Usually: firstname_lastname_company_com (dots/@ replaced with _)
# For cross-org access: ext_firstname_lastname_company_com
[Step 2] Verify SSH key is authorized
$ gcloud compute os-login ssh-keys list (on GCP)
$ aws ec2 describe-instances --instance-ids <id> (on AWS; check the KeyName field for the authorized key pair)
[Step 3] Check permissions
GCP: Need roles/compute.osAdminLogin role
Azure: Need "Virtual Machine Administrator Login" role
AWS: Need ec2-instance-connect permissions
[Step 4] Check server logs (requires infra access)
/var/log/auth.log (authentication attempts)
journalctl -u ssh (SSH service logs)
2.5 Networking: DNS, Firewall, Routing Issues
DNS Resolution Failure:
Symptom: "Name or service not known"
[Step 1] Can you resolve locally?
$ nslookup service.default.svc.cluster.local
$ dig @10.0.0.10 myservice.example.com
[Step 2] Check DNS service health
kubectl get pod -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
[Step 3] Check DNS config
cat /etc/resolv.conf (what nameservers are configured?)
[Step 4] Check if pod can reach DNS server
kubectl exec -it <pod> -- nslookup example.com 8.8.8.8 (query an external resolver directly)
Firewall/Network Policy Blocks Traffic:
Symptom: Specific connection times out, other connections work
[Step 1] Identify source and destination clearly
Source: 10.1.0.5 (pod A)
Destination: 10.2.0.10:443 (pod B)
[Step 2] Check cloud firewall rules (GCP, AWS, Azure)
gcloud compute firewall-rules list --filter="<SOURCE_RANGE> AND <DEST_RANGE>"
[Step 3] Check k8s network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy <policy> -n <namespace>
Does it allow ingress from source?
Does source have egress rule to destination?
[Step 4] Verify with tcpdump
# On destination node
tcpdump -i any "host <source> and host <destination>"
# Packets arrive at the node but the connection still fails = local firewall / network policy drop
# No packets at all = blocked upstream (cloud firewall, routing, or DNS resolving to the wrong IP)
Routing Issue:
Packets don't reach destination (wrong path)
[Step 1] Trace the route
$ traceroute <destination>
$ mtr <destination> (continuous monitoring)
[Step 2] Check local routing table
$ ip r (Kernel route table)
$ ip r get <destination> (where will this IP go?)
Example output:
10.2.0.0/16 via 10.1.0.1 dev eth0
(means: to reach 10.2.0.0/16, send to gateway 10.1.0.1)
[Step 3] Verify connectivity exists
If A→B times out, check if B→A works
- If B→A works but A→B doesn't = asymmetric routing
- If both don't work = full firewall/routing issue
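A concrete way to test both directions (hypothetical hosts A=10.1.0.5 and B=10.2.0.10, port 443):
# On B: watch for A's connection attempts
tcpdump -i any -n 'host 10.1.0.5 and tcp port 443'
# On A: try to connect
nc -zv 10.2.0.10 443
# SYNs visible on B but no answer arriving back at A = suspect asymmetric return routing
# No SYNs on B at all = suspect firewall or forward-path routing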
3. Debugging Tools & Commands Reference
3.1 Kubernetes Debugging
# Pod status & logs
kubectl describe pod <pod> -n <ns> # Full pod info & events
kubectl logs <pod> -n <ns> --all-containers=true # All container logs
kubectl logs <pod> -n <ns> -c <container> # Specific container
kubectl logs <pod> -n <ns> --previous # Crashed container logs
kubectl logs <pod> -n <ns> -f # Stream logs (tail -f)
# Service & networking
kubectl get svc -n <ns> -o wide # Services & IPs
kubectl get endpoints <svc> -n <ns> # Service backend pods
kubectl get networkpolicy -n <ns> # Network policies
kubectl describe networkpolicy <np> -n <ns> # Policy details
# Ingress & routing
kubectl get ingress -A # All ingresses
kubectl describe ingress <ing> -n <ns> # Ingress config
kubectl get ingress -A | grep <hostname> # Find ingress by hostname
# Events & cluster info
kubectl get events -n <ns> --sort-by='.lastTimestamp' # Recent events
kubectl cluster-info # Cluster endpoints
kubectl get nodes -o wide # Node status & IPs
# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/bash # Connect to pod
kubectl port-forward <pod> 8080:8080 -n <ns> # Port forward to pod
kubectl port-forward svc/<svc> 8080:80 -n <ns> # Port forward service
3.2 VM & System Debugging
# Service & process
systemctl status <service> # Service status & logs
systemctl list-units --type=service --all # All services
ps aux | grep <process> # Find process
sudo journalctl -u <service> -n 50 -f # Stream service logs
# Logs
tail -f /var/log/<service>/<service>.log # Stream service log
grep ERROR /var/log/<service>/<service>.log # Find errors
journalctl -p err -n 50 # All errors, last 50
# System resources
top or htop # CPU, memory (interactive)
free -h # Memory usage
df -h # Disk usage
iotop # Disk I/O usage
lsof -i :<port> # Process using port
# Network
ss -tlnp # Listening ports & process
netstat -an | grep ESTABLISHED # Active connections
ip a # Network interfaces & IPs
ip r # Routing table
ip r get <destination> # Where does this IP go?
3.3 Networking Tools
# DNS
dig <hostname> # Full DNS lookup
nslookup <hostname> # Simple DNS lookup
host <hostname> # Quick DNS check
dig +trace <hostname> # Trace DNS resolution chain
# Connectivity
ping <host> # ICMP reachability
nc -zv <host> <port> # TCP port test
telnet <host> <port> # TCP connection test
# Routing
traceroute <destination> # Trace path to destination
mtr <destination> # Continuous trace (better than traceroute)
# Packet capture
tcpdump -i any 'host <ip>' # Capture all packets to/from IP
tcpdump -i eth0 'tcp port 8080' # Capture port 8080
tcpdump -i any -n 'src 10.0.0.1 and dst 10.0.0.2' # Specific flow
4. Special Cases & Non-Obvious Issues
4.1 "No Space Left on Device" (Misleading Error)
Possible causes:
1. Actual Disk Full
- Symptoms: df -h shows 100% usage
- Solution: Delete old logs, increase disk size
2. Inode Exhaustion (not disk space)
- Symptoms: df -h shows usage well below 100%, but mkdir/touch fails with "No space left on device"
- Check: df -i (inode count)
- Solution: Delete many small files or increase inodes
3. Kubernetes Memory Limit Bug
- Symptoms: Mount fails with "No space", but disk has space
- Cause: Memory limit set incorrectly (512m vs 512Mi)
- Solution: Fix resource limits
- Example: resources.limits.memory: 512Mi (not 512m)
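A quick way to spot this on a live pod (pod and namespace are placeholders):
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'
# Expect a binary unit such as 512Mi or 1Gi; a bare 512m is the milli-unit typo described above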
4.2 Intermittent TCP Port Exhaustion
Symptoms:
- Random connection timeouts that recover quickly
- Not consistent (hard to reproduce)
- More common during high load
Causes:
- NAT device running out of ports (SNAT port table)
- Client-side port reuse too aggressive
- Proxy holding TIME_WAIT connections too long
Diagnosis:
netstat -an | grep TIME_WAIT | wc -l # Too many TIME_WAIT?
cat /proc/sys/net/ipv4/tcp_tw_reuse # Check reuse setting (should be 1)
cat /proc/sys/net/netfilter/nf_conntrack_max # Max connections
cat /proc/sys/net/netfilter/nf_conntrack_count # Current connections
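Two more counters worth checking on the client or NAT side (standard Linux paths):
cat /proc/sys/net/ipv4/ip_local_port_range # Ephemeral ports available for outgoing connections
ss -s # Per-state socket totals (a very large timewait count supports port exhaustion)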
4.3 Istio Service Mesh Debugging
Why debugging Istio is hard:
- Istio returns 503 for many different problems
- Requests go through sidecar proxies (Envoy)
- Hard to trace where error actually originates
Approach:
1. Check if issue happens immediately or after delay
- Immediate = wrong route configuration
- After 2-5s = network issue (see timeout troubleshooting)
2. Check Envoy/Istio logs in sidecar
kubectl logs <pod> -c istio-proxy
3. Look for PassthroughCluster in logs (wrong routing)
- Check VirtualService config
- Check ServiceEntry definitions
- Verify hostnames match
4. Compare with non-Istio pod
- Deploy same app without Istio
- If non-Istio works, it's Istio config issue
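If istioctl is installed, these checks can speed up steps 2-3 (pod and namespace are placeholders):
istioctl analyze -n <namespace> # Flags common VirtualService/Gateway/DestinationRule misconfigurations
istioctl proxy-config routes <pod> -n <namespace> # The routes Envoy actually received
kubectl get virtualservice,serviceentry -n <namespace> # The configs you intended to apply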
5. Building Debugging Experience
5.1 Common Debugging Mistakes to Avoid
| Mistake | Better Approach |
|---|---|
| Assume something works without testing | Test each step: DNS → firewall → routing → app |
| Change multiple things at once | Change one thing, test, verify metrics normalize |
| Look at only one metric | Correlate multiple signals (logs + metrics + events) |
| Start debugging at the app | Start from the edge (DNS, firewall) and work inward |
| Trust that "it should work" | Actually verify each component |
| Ignore timezone in timestamps | Always convert to UTC for correlation |
5.2 Systematic Narrowing Process
Given: Service returns 502 errors
│
├─ Step 1: WHEN does it happen?
│ └─ Only during 14:00-14:05 UTC on Nov 15
│
├─ Step 2: WHICH requests are affected?
│ └─ Only POST /api/users with >1MB payload
│
├─ Step 3: WHERE in the stack?
│ └─ Check: ingress logs (yes, requests received)
│ └─ Check: app logs (no, requests never arrived)
│ └─ Conclusion: Blocked between ingress and app pods
│
├─ Step 4: WHY is it blocked?
│ └─ Check: Network policies (found: allow ingress, but port wrong)
│ └─ Check: Ingress config (found: forwarding to port 8080, but app listening 8081)
│
└─ Resolution: Fix port mismatch, verify metrics normalize
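For the port-mismatch class of root cause, a hedged end-to-end check of the port chain (names are placeholders):
kubectl describe ingress <ingress> -n <namespace> | grep -iA2 backend # Which service:port the ingress targets
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.ports[*].targetPort}' # Where the service forwards
kubectl get pod -n <namespace> -l app=<app> -o jsonpath='{.items[0].spec.containers[*].ports[*].containerPort}'
# All three must line up; in the walkthrough above, the service forwarded to 8080 while the app listened on 8081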
6. When to Escalate vs. Self-Debug
Escalate to Specialists When:
Infrastructure/Network Issues:
- Consistent timeout across multiple pods/nodes
- Cloud firewall rules not working as expected
- VPN/routing issues (contact networking team)
- Cloud provider API errors
Security Issues:
- Unauthorized access attempts in audit logs
- Suspected compromise or intrusion
- Permission/RBAC issues beyond scope
Hardware/VM Issues:
- VM won't start
- Physical disk/network hardware failure
- Cloud quota exceeded
When to Self-Debug:
- Application errors (check app logs first)
- Kubernetes configuration (check YAML, selectors)
- Local service misconfiguration
- Networking between components you own
Last Updated: January 2026
Maintained by: Platform Engineering Team
Version: 1.0.0