Compute Infrastructure: Comprehensive Technical Reference
Complete guide to compute technologies from physical servers to serverless platforms.
Table of Contents
- Compute Spectrum Overview
- Physical Servers & Bare Metal
- Virtualization & Hypervisors
- Cloud Compute Platforms
- Container Orchestration
- Resource Management
- Performance Tuning
- Scaling Strategies
- Cost Optimization
- Disaster Recovery & High Availability
- Security & Compliance
- Monitoring & Observability
- Migration Strategies
- Best Practices
- Production Checklist
1. Compute Spectrum Overview
The Compute Continuum
┌─────────────────────────────────────────────────────────────┐
│ Compute Technology Spectrum │
├─────────────────────────────────────────────────────────────┤
│ │
│ Physical Servers │
│ ├─ Bare metal hardware │
│ ├─ On-premise data centers │
│ └─ High performance compute (HPC) │
│ ↓ (Control, Performance) │
│ │
│ Virtualization │
│ ├─ VMware vSphere │
│ ├─ KVM (Linux) │
│ └─ Hyper-V (Windows) │
│ ↓ (Efficiency, Consolidation) │
│ │
│ Cloud Compute │
│ ├─ AWS EC2, Azure VMs, GCP Compute Engine │
│ ├─ Dedicated hosts vs. shared infrastructure │
│ └─ Burstable vs. reserved instances │
│ ↓ (Flexibility, Scale) │
│ │
│ Container Orchestration │
│ ├─ Kubernetes (self-managed vs. managed) │
│ ├─ ECS, AKS, GKE │
│ └─ Serverless containers (Fargate, Cloud Run) │
│ ↓ (Density, Automation) │
│ │
│ Serverless │
│ ├─ AWS Lambda, Azure Functions, GCP Cloud Functions │
│ ├─ Event-driven, auto-scaling │
│ └─ Managed infrastructure │
│ ↓ (Scale-to-zero, Simplicity) │
│ │
└─────────────────────────────────────────────────────────────┘
Tradeoff Axis:
Control ← → Simplicity
Performance ← → Elasticity
Cost ← → Flexibility
Key Characteristics by Layer
| Layer | Control | Performance | Cost | Scaling | Time to Deploy |
|---|---|---|---|---|---|
| Physical | Maximum | Optimal | High | Slow | Weeks |
| Virtualization | High | Good | Medium | Medium | Hours |
| Cloud VMs | Medium | Good | Variable | Fast | Minutes |
| Containers | Medium | Good | Low-Medium | Very Fast | Seconds |
| Serverless | Low | Fair | Pay-as-used | Automatic | Milliseconds |
2. Physical Servers & Bare Metal
Characteristics
Bare Metal Hardware:
├─ No hypervisor overhead → optimal performance
├─ Direct hardware access → specialized workloads
├─ On-premise or dedicated cloud (AWS Bare Metal, Azure Dedicated Host)
├─ Predictable performance (no noisy neighbor)
└─ High upfront cost, long lifecycle (5-7 years)
Use Cases
Physical servers optimal for:
├─ High-performance computing (HPC) clusters
├─ Databases with strict latency requirements (<1ms)
├─ GPU/ML workloads requiring full performance
├─ Financial systems (trading, settlement)
├─ Regulated compliance (dedicated hardware required)
└─ Very long-term workloads (5+ year horizon)
Server Components
CPU Architecture:
Intel Xeon: Enterprise standard, 2-socket, 24-96 cores
AMD EPYC: Competitive, 2-socket, up to 128 cores
ARM (e.g., AWS Graviton, Ampere Altra): Cloud-native, energy efficient
Memory:
Per-server: 128GB - 2TB+
NUMA aware (architecture consideration for applications)
Storage:
NVMe SSDs: Ultra-high IOPS (1M+)
SAS/SATA SSDs: High endurance, enterprise-grade
RAID: Hardware RAID 1/5/6/10 with battery-backed write cache for reliability
Network:
10GbE Standard, 25GbE/40GbE/100GbE for high-performance
NIC bonding for redundancy
Management Tools
Physical Server Management:
├─ IPMI (Intelligent Platform Management Interface)
├─ Redfish API (modern replacement)
├─ Out-of-band management (separate network)
├─ Power management, console access
├─ Firmware/BIOS updates
└─ Health monitoring (temperature, fans, power)
TCO Calculation
5-Year Total Cost of Ownership:
Hardware Cost:
1x Server (48-core, 512GB): $25,000
Network switches, cables: $5,000
Rack, power, cooling: $10,000
Subtotal: $40,000
Operational Costs (5 years):
Power: 1000W × 24hrs × 365 × 5 × $0.12/kWh = $5,260
Cooling: ~equal to power = $5,260
Space: $500/month × 60 months = $30,000
Staffing: 0.5 FTE × $150K/year × 5 = $375,000
Maintenance: $5K/year × 5 = $25,000
Subtotal: $440,520
Total 5-Year Cost: ~$480,520
Per-Year Cost: ~$96,104
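As a rough sketch, the same 5-year TCO arithmetic can be scripted; every input below is the illustrative figure from the table above, not a vendor quote.

```python
# Rough 5-year TCO sketch for a single bare-metal server.
# All inputs are the illustrative assumptions from the table above.

def five_year_tco(
    hardware=25_000,                  # 48-core, 512GB server
    network=5_000,                    # switches, cables
    facilities=10_000,                # rack, power distribution, cooling gear
    power_watts=1_000,                # average draw
    kwh_rate=0.12,                    # $/kWh
    space_per_month=500,              # colocation / floor space
    staffing_per_year=0.5 * 150_000,  # 0.5 FTE
    maintenance_per_year=5_000,
    years=5,
):
    capex = hardware + network + facilities
    power = power_watts / 1000 * 24 * 365 * years * kwh_rate
    cooling = power                   # assume cooling roughly equals power cost
    space = space_per_month * 12 * years
    staffing = staffing_per_year * years
    maintenance = maintenance_per_year * years
    return capex + power + cooling + space + staffing + maintenance

total = five_year_tco()
print(f"5-year TCO: ${total:,.0f}  (~${total / 5:,.0f}/year)")
```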
3. Virtualization & Hypervisors
Hypervisor Types
Type 1: Bare Metal Hypervisors
ESXi (VMware) - Enterprise standard, no OS
Hyper-V - Microsoft, integrated with Windows Server
KVM - Linux-based, open-source, production-grade
Xen - Lightweight, paravirtualization support
Characteristics:
- Direct hardware access
- Lower overhead (no host OS layer)
- Better performance
- Higher cost/complexity
Type 2: Hosted Hypervisors
VirtualBox - Free, development/testing
VMware Workstation / Fusion - Windows/Linux (Workstation) and macOS (Fusion), development
Parallels Desktop - Mac, consumer-focused
Characteristics:
- Run on existing OS
- Lower performance
- Simpler deployment
- Development/testing focus
VM Resource Allocation
vCPU Allocation:
├─ 1 vCPU: Web servers, caches, development
├─ 2-4 vCPU: Application servers, small databases
├─ 8-16 vCPU: Database servers, batch processing
└─ 32+ vCPU: Analytics, specialized workloads
Memory Allocation:
├─ 512MB-1GB: Web servers
├─ 2-4GB: Application servers
├─ 8-32GB: Database servers
├─ 64GB+: In-memory caches, analytics
└─ Note: Memory over-subscription is possible (via ballooning) but not recommended for production
Storage:
├─ Thin provisioning: Allocate max, use as needed
├─ Thick provisioning: Pre-allocate all space (performance)
├─ Snapshots: Point-in-time copies (use carefully)
└─ Consider I/O patterns: IOPS, latency, throughput
VM Density Calculation
Example: 2-socket server with 48 cores, 512GB RAM
Conservative deployment (for production):
CPU: 48 cores ÷ 2 vCPU per VM = 24 VMs max
Memory: 512GB ÷ 8GB per VM = 64 VMs max
→ Limited by CPU: ~20-24 VMs for production
Moderate deployment:
CPU: 48 cores ÷ 4 vCPU per VM = 12 VMs
Memory: 512GB ÷ 32GB per VM = 16 VMs
→ Balanced: ~12-14 VMs per server
High-density deployment (dev/test):
CPU: 48 cores ÷ 1 vCPU per VM = 48 VMs
Memory: 512GB ÷ 2GB per VM = 256 VMs
→ Limited by CPU: ~40-48 small VMs
Production Rule of Thumb:
CPU oversubscription: 1:3 to 1:5 (1 physical core : 3-5 vCPU)
Memory oversubscription: 1:1.2 to 1:1.5 (limited ballooning)
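A minimal sketch of the density math above, with the CPU and memory oversubscription ratios from the rule of thumb passed in as parameters:

```python
# VM-density sketch: take the tighter of the CPU and memory constraints,
# optionally applying the 1:3-1:5 CPU / ~1:1.2 memory oversubscription ratios.

def max_vms(host_cores, host_mem_gb, vcpu_per_vm, mem_per_vm_gb,
            cpu_oversub=1.0, mem_oversub=1.0):
    by_cpu = (host_cores * cpu_oversub) // vcpu_per_vm
    by_mem = (host_mem_gb * mem_oversub) // mem_per_vm_gb
    limit = "CPU" if by_cpu <= by_mem else "memory"
    return int(min(by_cpu, by_mem)), limit

# Conservative production profile: 2 vCPU / 8GB VMs, no oversubscription
print(max_vms(48, 512, 2, 8))                  # (24, 'CPU')

# Same profile with 1:3 CPU oversubscription
print(max_vms(48, 512, 2, 8, cpu_oversub=3))   # (64, 'memory')
```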
vSphere Architecture
ESXi Cluster Configuration:
┌──────────────────────────────────────────┐
│ vCenter Server (Management) │
└──────────────────────────────────────────┘
│
┌────┼────┬─────┐
▼ ▼ ▼ ▼
┌─────┬─────┬─────┬─────┐
│ESXi1│ESXi2│ESXi3│ESXi4│ (4-node cluster)
├─────┼─────┼─────┼─────┤
│ VM1 │ VM2 │ VM3 │ VM4 │
│ VM5 │ VM6 │ VM7 │ VM8 │
│ ... │ ... │ ... │ ... │
└─────┴─────┴─────┴─────┘
│ │ │ │
└───────┴───────┴───────┘
Shared Storage (SAN/NFS)
- VMFS datastore
- VM files, snapshots
- HA/DRS enabled
Key Features:
- vSphere HA: VM restart on host failure
- vSphere DRS: Load balancing, performance optimization
- Storage vMotion: Live migration between datastores
- vSphere Replication: Disaster recovery
- vSAN: Distributed storage (no SAN required)
4. Cloud Compute Platforms
AWS EC2
Instance Types:
├─ General Purpose (M-series): Web, app servers, development
├─ Compute Optimized (C-series): Batch, HPC, analytics
├─ Memory Optimized (R/X-series): Databases, in-memory caches
├─ GPU/Accelerated (G/P/F-series): ML, graphics, HPC
├─ Storage Optimized (I/D/H-series): NoSQL, data warehousing
└─ Burstable (T-series): Low-traffic workloads, development
Instance Size Strategy:
├─ Right-sizing: Match workload requirements
├─ Monitor CloudWatch metrics: CPU, memory, network
├─ Adjust size based on actual utilization
└─ Rule of thumb: 30-40% average utilization optimal
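A hedged right-sizing sketch along the lines above: given average utilization samples (hypothetical numbers here, e.g. exported from CloudWatch), flag over- and under-provisioned instances. The 30%/70% bands are guideline figures, not hard rules.

```python
# Right-sizing sketch: compare average CPU/memory utilization against
# illustrative low/high bands and suggest a direction. Sample data is made up.
from statistics import mean

def rightsize_hint(cpu_samples, mem_samples, low=0.30, high=0.70):
    cpu, mem = mean(cpu_samples), mean(mem_samples)
    if cpu < low and mem < low:
        return f"over-provisioned (cpu {cpu:.0%}, mem {mem:.0%}): consider a smaller instance"
    if cpu > high or mem > high:
        return f"under-provisioned (cpu {cpu:.0%}, mem {mem:.0%}): consider a larger instance or scale out"
    return f"sized about right (cpu {cpu:.0%}, mem {mem:.0%})"

print(rightsize_hint([0.12, 0.18, 0.15], [0.22, 0.19, 0.20]))
```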
Azure VMs
VM Types:
├─ General Purpose (B/D/E-series): Development, testing
├─ Compute Optimized (F/H-series): Web, app servers
├─ Memory Optimized (D-series): Databases, in-memory
├─ GPU Optimized (N-series): ML, graphics, HPC
├─ Storage Optimized (L-series): NoSQL, data warehousing
└─ High Performance Compute (H-series): Complex simulations
Spot Instances:
├─ Up to 90% discount vs. on-demand
├─ Subject to eviction if capacity needed
├─ Good for: batch jobs, non-critical workloads
└─ Not recommended for: production databases, transactional systems
GCP Compute Engine
Machine Types:
├─ Predefined general-purpose: E2, N1, N2, N2D (cost vs. performance optimized)
├─ Compute/memory optimized: C2 and M1/M2 for specialized workloads
├─ Custom: Create an exact vCPU/memory combination
└─ Shared-core: e2-micro/e2-small, burstable, for low-traffic workloads
Commitment Discounts:
├─ 1-year: 25-30% discount
├─ 3-year: 52-70% discount (deeper savings, long-term commitment)
└─ Use for: Baseline/steady-state workloads
Pricing Models
On-Demand (Hourly):
Cost: Pay per hour used
Best for: Unpredictable workloads, development/testing
Commitment: None
Reserved Instances (1-3 years):
AWS: 31-72% discount, upfront payment
Azure: 40-72% discount
GCP: 25-70% discount
Best for: Baseline usage, production steady-state
Commitment: Fixed term
Spot/Preemptible Instances (Dynamic):
Cost: 60-90% discount vs. on-demand
Availability: Not guaranteed, eviction possible
Best for: Batch jobs, development, non-critical workloads
Commitment: None (but expect interruptions)
Savings Plans:
AWS: Compute Savings Plans (any instance type)
Flexibility: Use across regions, instance families
Commitment: 1-3 year commitment
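For a feel of the pricing models, a small comparison sketch for one always-on instance; the $0.10/hr rate and the discount percentages are assumptions within the ranges quoted above, not published prices.

```python
# Pricing-model comparison sketch for one always-on instance.
HOURS_PER_MONTH = 730
ON_DEMAND_RATE = 0.10          # $/hr, assumed

models = {
    "on-demand": 0.00,         # no discount
    "reserved (1yr)": 0.40,    # ~40% discount
    "reserved (3yr)": 0.60,    # ~60% discount
    "spot": 0.80,              # ~80% discount, interruptible
}

for name, discount in models.items():
    monthly = ON_DEMAND_RATE * (1 - discount) * HOURS_PER_MONTH
    print(f"{name:15s} ${monthly:7.2f}/month")
```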
Multi-Cloud Comparison
| Feature | AWS EC2 | Azure VMs | GCP Compute |
|---|---|---|---|
| Instance variety | 700+ | 200+ | 100+ |
| Spot discount | 70-90% | 80-90% | 80-90% |
| Reserved term | 1-3 years | 1-3 years | 1-3 years |
| Startup time | 30-60 sec | 30-60 sec | 20-40 sec |
| Network bandwidth | 10-100 Gbps | 10-100 Gbps | 10-100 Gbps |
| Committed use | RI + Savings Plan | Reserved | Commitment |
| Maturity | Most mature | Mature | Growing |
5. Container Orchestration
Kubernetes Fundamentals
Kubernetes Architecture:
┌────────────────────────────────────────┐
│ Control Plane │
├────────────────────────────────────────┤
│ API Server │ Scheduler │ Controller │
│ Manager │ etcd │
└────────────────────────────────────────┘
│
┌──────┼──────┬──────┐
▼ ▼ ▼ ▼
┌──────┬──────┬──────┬──────┐
│Node1 │Node2 │Node3 │Node4 │
├──────┼──────┼──────┼──────┤
│ Pod │ Pod │ Pod │ Pod │
│ Pod │ Pod │ Pod │ Pod │
│ Pod │ Pod │ Pod │ Pod │
└──────┴──────┴──────┴──────┘
kubelet + Container Runtime (containerd, CRI-O) on each node
Key Concepts:
- Pod: Smallest deployable unit (1+ containers)
- Service: Stable virtual IP that load-balances across a set of pods
- Deployment: Declarative pod management
- StatefulSet: Ordered, stable pod identities
- DaemonSet: Run pod on every node
Managed vs. Self-Managed
Self-Managed Kubernetes:
├─ Full control over all components
├─ Operator responsibility: high
├─ Cost: Infrastructure + management effort
├─ Flexibility: Maximum
└─ Complexity: High
Managed Kubernetes:
├─ EKS (AWS), AKS (Azure), GKE (GCP)
├─ Control plane managed by cloud provider
├─ Operator responsibility: Nodes, networking
├─ Cost: Control plane fee + node cost
├─ Flexibility: Good
└─ Complexity: Lower than self-managed
Pod Density & Node Sizing
Node Resource Reservation:
Kubernetes system pods: 5-10% CPU, 10-15% memory
Kubelet daemon: 2-3% CPU, 100-200MB memory
Container runtime: 1-2% CPU, 50-100MB memory
──────────────────────────────────
System overhead: ~10% of node capacity
Usable Capacity Calculation:
16 vCPU, 64GB memory node
System overhead: 1.6 vCPU, 6.4GB
Available for pods: 14.4 vCPU, 57.6GB
Pod Density Example:
Pod requirement: 0.5 vCPU, 512MB memory
Number of pods per node: min(14.4 ÷ 0.5 vCPU, 57.6 ÷ 0.5 GB) = min(28.8, 115.2) ≈ 28 pods
Conservative production: 15-20 pods per node (leave headroom)
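The usable-capacity and pod-density arithmetic above as a small sketch; the 10% system overhead and 70% packing headroom are the assumptions from this section.

```python
# Pod-density sketch: reserve ~10% of node capacity for system overhead,
# then take the tighter of the CPU and memory constraints.

def pods_per_node(node_vcpu, node_mem_gb, pod_vcpu, pod_mem_gb,
                  system_overhead=0.10, headroom=0.70):
    usable_vcpu = node_vcpu * (1 - system_overhead)
    usable_mem = node_mem_gb * (1 - system_overhead)
    raw = min(usable_vcpu / pod_vcpu, usable_mem / pod_mem_gb)
    # headroom keeps the node below full packing for production safety
    return int(raw), int(raw * headroom)

raw, conservative = pods_per_node(16, 64, pod_vcpu=0.5, pod_mem_gb=0.5)
print(f"theoretical: {raw} pods, conservative: ~{conservative} pods")
```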
Scaling in Kubernetes
Horizontal Pod Autoscaling (HPA):
├─ Scale replicas based on metrics (CPU, memory, custom)
├─ Min/max replicas configured
├─ Response time: 15-60 seconds
└─ Good for: Traffic spikes, predictable patterns
Vertical Pod Autoscaling (VPA):
├─ Adjust resource requests/limits
├─ Requires pod restart
├─ Response time: Minutes to hours
└─ Good for: Finding optimal resource requests
Cluster Autoscaling:
├─ Scale cluster nodes up/down
├─ Responds to pending pods
├─ Response time: 30 seconds to 5 minutes
└─ Good for: Matching node capacity to aggregate pod demand
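The HPA replica calculation follows a documented target-tracking rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric); a minimal sketch with illustrative min/max bounds:

```python
# Sketch of the HPA target-tracking rule, clamped to min/max replica bounds.
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=2, max_replicas=20):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(desired, max_replicas))

# 5 replicas averaging 90% CPU against a 60% target -> scale to 8
print(hpa_desired_replicas(5, current_metric=0.90, target_metric=0.60))
```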
ECS vs. Kubernetes
ECS (AWS Elastic Container Service):
├─ Simpler than Kubernetes
├─ Deep AWS integration (IAM, CloudWatch, ALB)
├─ Less operational overhead
├─ Smaller learning curve
└─ Trade-off: Less flexible, AWS-specific
Kubernetes:
├─ More complex than ECS
├─ Cloud-agnostic (AWS, Azure, GCP, on-prem)
├─ Larger ecosystem (Helm, operators, addons)
├─ Steeper learning curve
└─ Trade-off: Maximum flexibility, more ops work
6. Resource Management
CPU & Memory Limits
Container Resource Requests:
Requests: Minimum guaranteed resources
Limits: Maximum resources allowed
Good Practice:
├─ Set requests = expected usage
├─ Set limits = maximum burst capacity
├─ Limits > Requests (buffer for spikes)
└─ Example: Request 500m CPU, Limit 1000m CPU
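A small sketch of the request/limit guidance: parse Kubernetes-style CPU quantities and check that the limit leaves burst headroom above the request. The 1.5× headroom factor is an assumption, not a Kubernetes default.

```python
# Parse CPU quantities ("500m" = 0.5 core) and check limit vs. request headroom.

def parse_cpu(quantity: str) -> float:
    """Convert '500m' -> 0.5, '2' -> 2.0 (millicores vs. whole cores)."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

def check_container(request: str, limit: str, min_headroom=1.5):
    req, lim = parse_cpu(request), parse_cpu(limit)
    ok = lim >= req * min_headroom
    return f"request={req} cores, limit={lim} cores, headroom ok: {ok}"

print(check_container("500m", "1000m"))   # 2x headroom -> ok
```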
NUMA Awareness
Non-Uniform Memory Architecture:
┌─────────────────────────────────┐
│ CPU Socket 0 │ CPU Socket 1 │
├───────────────────┼───────────────┤
│ L3 Cache │ L3 Cache │
│ Local NUMA 0 │ Local NUMA 1 │
│ ↓ │ ↓ │
│ Memory Bank 0 │ Memory Bank 1 │
│ (fast) │ (fast) │
│ │ │
└─────────────────────────────────┘
↓ Remote Latency
Cross-socket memory access (30-50% slower)
Optimization:
- Pin vCPU/processes to NUMA node
- Allocate memory on same NUMA node
- CPU affinity settings
- Benefits: 10-30% latency reduction
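A minimal Linux-only sketch of CPU pinning to one NUMA node, assuming the standard sysfs layout; memory binding (numactl/libnuma) is not shown.

```python
# Pin the current process to the CPUs of NUMA node 0 using os.sched_setaffinity
# (Linux only). CPU list is read from sysfs; memory binding is out of scope here.
import os
from pathlib import Path

def cpus_of_numa_node(node=0):
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text().strip()
    cpus = set()
    for part in cpulist.split(","):
        if "-" in part:
            start, end = map(int, part.split("-"))
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return cpus

if hasattr(os, "sched_setaffinity"):           # Linux only
    os.sched_setaffinity(0, cpus_of_numa_node(0))
    print("affinity:", sorted(os.sched_getaffinity(0)))
```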
CPU Throttling
CPU Throttling Scenario:
Pod resource limit: 1000m (1 core)
Pod CPU usage: 1200m (wants 1.2 cores)
├─ Actual CPU granted: 1000m
├─ Remaining demand: 200m (throttled)
└─ Result: Pod performance degrades 17%
Prevention:
├─ Set realistic limits (leave 20% headroom)
├─ Monitor CPU throttling metrics
├─ Implement HPA based on CPU usage
└─ Use CPU requests for scheduling
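Throttling can be estimated from the container's cgroup cpu.stat counters; this sketch assumes cgroup v2 field names (nr_periods, nr_throttled, throttled_usec); cgroup v1 uses slightly different names and units.

```python
# Estimate CPU throttling from a container's cgroup v2 cpu.stat contents.

def throttle_ratio(cpu_stat_text):
    stats = dict(line.split() for line in cpu_stat_text.splitlines() if line)
    periods = int(stats.get("nr_periods", 0))
    throttled = int(stats.get("nr_throttled", 0))
    return throttled / periods if periods else 0.0

sample = """usage_usec 1200000
nr_periods 1000
nr_throttled 170
throttled_usec 450000"""
print(f"throttled in {throttle_ratio(sample):.0%} of CFS periods")
```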
7. Performance Tuning
Network Optimization
Network Tuning:
├─ TCP buffer sizes (rmem_max, wmem_max)
├─ Number of connections (ulimits, net.core.somaxconn)
├─ TCP window scaling (net.ipv4.tcp_window_scaling)
├─ Selective acknowledgments (net.ipv4.tcp_sack)
└─ NIC interrupt coalescing
Expected Improvements:
Throughput: +30-50%
Latency: -10-20%
Connection rate: +50-100%
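A read-only sketch that reports the current values of the sysctls above on a Linux host; applying changes (sysctl -w or /etc/sysctl.d/) is deliberately left out.

```python
# Read-only check of the network sysctls listed above via /proc/sys (Linux only).
from pathlib import Path

SYSCTLS = [
    "net.core.rmem_max",
    "net.core.wmem_max",
    "net.core.somaxconn",
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_sack",
]

for name in SYSCTLS:
    path = Path("/proc/sys") / name.replace(".", "/")
    value = path.read_text().strip() if path.exists() else "unavailable"
    print(f"{name} = {value}")
```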
Disk I/O Tuning
I/O Scheduler Options:
├─ CFQ: Fair queuing, desktop-friendly (legacy single-queue)
├─ Deadline: Predictable latency
├─ NOOP/none: Minimal scheduling, best for SSDs/NVMe (device handles ordering)
└─ mq-deadline: Multi-queue deadline variant for modern kernels
Block Device Tuning:
├─ Read-ahead: Increase from 128KB to 256KB-1MB
├─ nr_requests: Increase queue depth
├─ rq_affinity: Bind to CPU core
└─ Expected: +20-40% throughput improvement
JVM Tuning
JVM Flags for Production:
├─ -XX:+UseG1GC: Modern garbage collector
├─ -XX:MaxGCPauseMillis=200: GC pause target
├─ -XX:InitiatingHeapOccupancyPercent=45: When to start concurrent GC
├─ -XX:+ParallelRefProcEnabled: Reference processing parallelism
└─ -Xlog:gc* (JDK 9+) or -XX:+PrintGCDetails (JDK 8): GC logging for diagnostics
Heap Sizing:
Initial heap (-Xms): Set = Max (avoid dynamic resizing)
Max heap (-Xmx): 75% of container memory limit
Example: 8GB container → -Xms6g -Xmx6g
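The heap-sizing guideline as a short sketch (75% of the container limit is the rule of thumb above; adjust for metaspace, threads, and other off-heap usage):

```python
# Derive -Xms/-Xmx from a container memory limit using the ~75% guideline.

def jvm_heap_flags(container_mem_gb, heap_fraction=0.75):
    heap_gb = int(container_mem_gb * heap_fraction)
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"

print(jvm_heap_flags(8))   # -Xms6g -Xmx6g
```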
Database Tuning
PostgreSQL:
├─ shared_buffers: 25% of system RAM
├─ effective_cache_size: 50-75% of system RAM
├─ work_mem: RAM / (max_connections × 2)
├─ max_connections: 100-400 (depends on workload)
└─ fsync=off only for disposable, non-critical data (write speed boost at the cost of crash safety)
MySQL:
├─ innodb_buffer_pool_size: 50-75% of RAM
├─ innodb_log_file_size: 256MB-1GB
├─ max_connections: 100-400
├─ query_cache: Disable (removed entirely in MySQL 8.0)
└─ Consider NUMA binding for large buffers
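A sketch of the PostgreSQL sizing heuristics above as starting points; real values should be validated against the workload.

```python
# PostgreSQL sizing heuristics: shared_buffers ~25% of RAM,
# effective_cache_size ~75%, work_mem = RAM / (max_connections * 2).

def pg_settings(ram_gb, max_connections=200):
    ram_mb = ram_gb * 1024
    return {
        "shared_buffers": f"{int(ram_mb * 0.25)}MB",
        "effective_cache_size": f"{int(ram_mb * 0.75)}MB",
        "work_mem": f"{int(ram_mb / (max_connections * 2))}MB",
        "max_connections": max_connections,
    }

for key, value in pg_settings(64).items():
    print(f"{key} = {value}")
```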
8. Scaling Strategies
Horizontal Scaling
Add more instances:
├─ Web tier: Stateless, easy to scale
├─ Application tier: May require session management
├─ Database: Complex, requires replication/sharding
└─ Load balancer: Distributes traffic
Pattern:
1 instance → N instances
Typical: 5-50 instances per tier
Kubernetes: 1-1000+ pods easily
Vertical Scaling
Increase instance size:
├─ Bigger CPU, memory
├─ Same application code
├─ Simpler than horizontal for stateful workloads
└─ Limited by max instance size available
Limits:
AWS m6i.32xlarge: 128 vCPU, 512GB memory (largest general-purpose m6i size)
Beyond: Requires horizontal scaling or architecture change
Auto-Scaling Triggers
Metrics to Monitor:
├─ CPU utilization: 60-70% target
├─ Memory utilization: 70-80% target
├─ Request rate: Scale when 80% of capacity
├─ Custom metrics: Application-specific (queue depth, latency)
└─ Predictive scaling: Historical patterns
Scaling Response Timing:
Scale up: Respond within 30-60 seconds
Scale down: Wait 5-10 minutes (cooldown) to avoid thrashing
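A sketch of an asymmetric scaling decision: scale up quickly, scale down only after a sustained quiet period. The 70%/40% thresholds and 10-minute cooldown are illustrative values in line with the guidance above.

```python
# Asymmetric scaling decision with a scale-down cooldown to avoid thrashing.
import time

class ScalingPolicy:
    def __init__(self, scale_up_at=0.70, scale_down_at=0.40, down_cooldown_s=600):
        self.scale_up_at = scale_up_at
        self.scale_down_at = scale_down_at
        self.down_cooldown_s = down_cooldown_s
        self.below_since = None

    def decide(self, cpu_util, now=None):
        now = time.time() if now is None else now
        if cpu_util >= self.scale_up_at:
            self.below_since = None
            return "scale_up"
        if cpu_util <= self.scale_down_at:
            if self.below_since is None:
                self.below_since = now
            if now - self.below_since >= self.down_cooldown_s:
                return "scale_down"
            return "hold (cooldown)"
        self.below_since = None
        return "hold"

policy = ScalingPolicy()
print(policy.decide(0.85))            # scale_up
print(policy.decide(0.30, now=0))     # hold (cooldown)
print(policy.decide(0.30, now=900))   # scale_down
```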
Multi-Region Scaling
Geographic Distribution:
┌─────────────────────┐ ┌─────────────────────┐
│ US Region │ │ EU Region │
│ ├─ 50+ instances │ │ ├─ 30+ instances │
│ └─ Active-Active │ │ └─ Active-Active │
└─────────────────────┘ └─────────────────────┘
↓ ↓
DNS Failover / Global Load Balancer
Benefits:
├─ High availability (regional failure tolerance)
├─ Reduced latency (serve users locally)
├─ Compliance (data residency requirements)
└─ Disaster recovery (RTO/RPO improvement)
9. Cost Optimization
Instance Right-Sizing
Before (Over-provisioned):
├─ 48-core instance
├─ 256GB memory
├─ Average utilization: CPU 15%, Memory 20%
├─ Monthly cost: $3,000
After (Right-sized):
├─ 8-core instance
├─ 32GB memory
├─ Average utilization: CPU 60%, Memory 65%
├─ Monthly cost: $400
└─ Savings: $2,600/month (87% reduction)
Commitment Strategy
Capacity Planning:
Baseline workload (minimum): 20 instances
Peak workload (maximum): 100 instances
On-demand for bursts: 50 instances
Cost Optimization:
├─ Reserved instances: 20 baseline (70% discount)
├─ Savings plans: Flexible coverage
├─ Spot instances: 30 instances (85% discount)
└─ On-demand: up to 50 instances for bursts
Monthly Cost Example (assuming ~$0.10/hr on-demand, similar to m5.large):
Baseline reserved: 20 × $0.10/hr × 730 hrs × (1 - 70%) = $438
Spot (peak demand): 30 × $0.10/hr × 200 hrs × (1 - 85%) = $90
On-demand (spike): 50 × $0.10/hr × 100 hrs = $500
────────────────────────────────────────────
Total: ~$1,028/month (vs. ~$7,300 running 100 instances on-demand 24/7)
Savings: ~86%
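The blended-cost arithmetic above as a sketch; the hourly rate, discounts, and usage hours are the same illustrative assumptions, not quoted prices.

```python
# Blended monthly cost across reserved, spot, and on-demand capacity.
ON_DEMAND = 0.10   # $/hr, assumed

fleet = [
    # (label, instances, hours/month, discount vs. on-demand)
    ("reserved baseline", 20, 730, 0.70),
    ("spot for peaks",    30, 200, 0.85),
    ("on-demand spikes",  50, 100, 0.00),
]

blended = sum(n * h * ON_DEMAND * (1 - d) for _, n, h, d in fleet)
static_peak = 100 * 730 * ON_DEMAND   # running 100 on-demand instances 24/7

print(f"blended: ${blended:,.0f}/month")
print(f"static peak provisioning: ${static_peak:,.0f}/month")
print(f"savings: {1 - blended / static_peak:.0%}")
```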
Containerization ROI
Physical Servers:
5 servers × $25K hardware + operations = $150K/year
Virtualization:
2 servers × $25K + software licensing = $60K/year
Savings: 60%
Containerization:
Kubernetes cluster (3-5 nodes) = $40K/year
Plus cloud compute savings
Overall savings: 73%
10. Disaster Recovery & High Availability
RTO/RPO Targets
Recovery Time Objective (RTO):
Critical systems: 15 minutes - 1 hour
Important systems: 1-4 hours
Non-critical: 24+ hours
Recovery Point Objective (RPO):
Critical systems: 15 minutes (real-time replication)
Important systems: 1-4 hours (frequent backups)
Non-critical: 24 hours (daily backups)
Cost tradeoff:
RTO/RPO 15 min: 3-5x cost (real-time replication)
RTO/RPO 4 hours: 1.5-2x cost (frequent scheduled backups)
RTO/RPO 24 hours: 1.1x cost (daily backups)
High Availability Patterns
Multi-AZ Deployment:
┌──────────────────────┐ ┌──────────────────────┐
│ Availability Zone 1 │ │ Availability Zone 2 │
│ ├─ Instance 1 │ │ ├─ Instance 2 │
│ ├─ Instance 3 │ │ ├─ Instance 4 │
│ └─ Database primary │ │ └─ Database replica │
└──────────────────────┘ └──────────────────────┘
↓ ↓
└────────────────────────────┘
Load Balancer (health checks)
RTO: 30 seconds (auto-failover)
Backup Strategy
3-2-1 Rule:
├─ 3 copies of data (original + 2 backups)
├─ 2 different storage types (disk + cloud)
├─ 1 copy offsite (different region/provider)
Implementation:
├─ Daily snapshots (local EBS/disk)
├─ Weekly full backups (cloud object storage)
├─ Monthly archive (Glacier/cold storage)
└─ Quarterly restore tests (verify recovery works)
11. Security & Compliance
Network Security
Security Layers:
┌─────────────────────────────────────┐
│ Internet │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ AWS Security Group / NSG │
│ ├─ Inbound rules (ports, IPs) │
│ ├─ Outbound rules │
│ └─ Stateful filtering │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ OS Firewall (iptables, Windows) │
│ ├─ Host-based rules │
│ └─ Application-specific │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ Application │
│ ├─ Authentication │
│ ├─ Authorization │
│ └─ Encryption │
└─────────────────────────────────────┘
Compliance Requirements
PCI DSS (Payment Card Industry):
├─ Data encryption (in transit, at rest)
├─ Network segmentation
├─ Regular security assessments
├─ Audit logging
└─ Access controls
HIPAA (Healthcare):
├─ Encryption (AES-256)
├─ Access logging (all access recorded)
├─ Network isolation
├─ Annual risk assessment
└─ Business Associate Agreements (BAAs)
SOC 2 / ISO 27001:
├─ Security controls documentation
├─ Regular audit trails
├─ Incident response procedures
├─ Configuration management
└─ Annual third-party audits
Instance Hardening
Hardening Checklist:
├─ Disable unnecessary services
├─ Close unused ports (firewall)
├─ Update OS regularly (security patches)
├─ Configure SELinux / AppArmor
├─ Remove default accounts
├─ Implement SSH key authentication (no passwords)
├─ Enable audit logging
├─ Use encrypted filesystems
├─ Configure AIDE for file integrity
└─ Regular security scanning (Nessus, OpenVAS)
12. Monitoring & Observability
Key Metrics to Monitor
Compute Performance:
├─ CPU utilization: Should be 30-70% (headroom)
├─ Memory utilization: Should be 40-80%
├─ Disk I/O: Monitor IOPS and throughput
├─ Network throughput: Monitor bandwidth usage
└─ Load average: Should be < number of cores
Application Health:
├─ Response time: P50/P95/P99 latency percentiles
├─ Error rate: Percentage of failed requests (4xx/5xx)
├─ Throughput: Requests per second
├─ Resource consumption: Memory/CPU per request
└─ Queue depth: Buffered requests
Observability Stack
┌─────────────────────────────────────┐
│ Application │
│ └─ Instrumentation (metrics, logs) │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ Collection Layer │
│ ├─ Prometheus (metrics) │
│ ├─ ELK/Loki (logs) │
│ └─ Jaeger (traces) │
└──────────────┬──────────────────────┘
↓
┌─────────────────────────────────────┐
│ Visualization & Alerting │
│ ├─ Grafana (dashboards) │
│ ├─ AlertManager (alerting) │
│ └─ PagerDuty (incident response) │
└─────────────────────────────────────┘
Alert Thresholds
CPU Utilization:
Warning: > 75% for > 5 minutes
Critical: > 90% for > 2 minutes
Memory Utilization:
Warning: > 80% used
Critical: > 90% used (risk of OOM)
Disk Space:
Warning: > 80% used
Critical: > 95% used
Network Errors:
Warning: > 1% packet loss
Critical: > 5% packet loss
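A sketch that evaluates readings against the thresholds above; the sustained-duration conditions (e.g. "> 5 minutes") are omitted for brevity, and the metric source is left abstract.

```python
# Evaluate current readings against the warning/critical thresholds above.

THRESHOLDS = {
    # metric: (warning, critical) as fraction used / error ratio
    "cpu_utilization":    (0.75, 0.90),
    "memory_utilization": (0.80, 0.90),
    "disk_used":          (0.80, 0.95),
    "packet_loss":        (0.01, 0.05),
}

def evaluate(readings):
    for metric, value in readings.items():
        warn, crit = THRESHOLDS[metric]
        level = "CRITICAL" if value >= crit else "WARNING" if value >= warn else "ok"
        print(f"{metric:20s} {value:6.1%}  {level}")

evaluate({"cpu_utilization": 0.93, "memory_utilization": 0.72,
          "disk_used": 0.84, "packet_loss": 0.002})
```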
13. Migration Strategies
Physical to Virtual
Lift & Shift:
├─ Minimal changes to application
├─ Reduced upfront effort
├─ Potential for optimization post-migration
└─ Risk: Misses virtualization benefits
Phased Migration:
├─ Migrate non-critical workloads first
├─ Learn processes before critical systems
├─ Reduce deployment risk
└─ Take advantage of optimization opportunities
P2V (Physical to Virtual) Tools:
├─ VMware vCenter Converter
├─ AWS Application Migration Service (MGN), Azure Migrate, GCP Migrate to Virtual Machines
├─ Snapshot-based migration
└─ Agent-based migration
Virtual to Cloud
VM Import:
├─ AWS VM Import/Export
├─ Azure Migrate
├─ GCP VM Import
Network Considerations:
├─ VPC/VNet configuration
├─ Security groups/NSGs
├─ Route tables
├─ Private connectivity (Direct Connect, ExpressRoute)
└─ DNS configuration
Cloud to Kubernetes
Containerization Steps:
├─ 1. Application analysis (dependencies, resources)
├─ 2. Create Dockerfile
├─ 3. Build container image
├─ 4. Test in local environment
├─ 5. Push to container registry
├─ 6. Create Kubernetes manifests (Deployment, Service)
├─ 7. Deploy to staging cluster
├─ 8. Test in staging
├─ 9. Deploy to production
└─ 10. Monitor and optimize
Benefits:
├─ Better resource efficiency
├─ Simpler scaling
├─ Improved portability
└─ 50-70% cost reduction possible
14. Best Practices
Capacity Planning
1. Collect Historical Data
├─ Track usage over 6-12 months
├─ Identify trends (growth rate)
└─ Note seasonal patterns
2. Forecast Future Demand
├─ Apply growth trend (e.g., 20% YoY)
├─ Add buffer for peak (30-50%)
└─ Account for new features
3. Right-Size Infrastructure
├─ Match forecast + buffer
├─ Avoid over-provisioning (cost waste)
├─ Avoid under-provisioning (performance issues)
└─ Review quarterly, adjust as needed
Change Management
Change Process:
├─ 1. Document proposed change
├─ 2. Impact assessment (blast radius)
├─ 3. Rollback plan (if things go wrong)
├─ 4. Change approval (CAB)
├─ 5. Schedule maintenance window
├─ 6. Execute change
├─ 7. Monitor closely (first 30 minutes)
├─ 8. Verify success
└─ 9. Document lessons learned
Maintenance Windows:
├─ Schedule during low-traffic periods
├─ Communicate with stakeholders in advance
├─ Maintain change log
└─ Post-implementation review
Tagging Strategy
Cloud Resource Tags:
Cost Allocation:
├─ environment (production, staging, development)
├─ project (team, business unit)
├─ cost-center (billing department)
└─ owner (responsible person/team)
Operational:
├─ application (app name)
├─ version (app version)
├─ data-classification (public, internal, confidential)
├─ backup-policy (frequency, retention)
└─ backup-status (latest backup date)
Mandatory Tags:
├─ All resources must have: environment, owner, cost-center
├─ Enforce via IAM policy
├─ Regular audits for compliance
└─ Use for cost tracking and forecasting
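A sketch of a tag-compliance audit over an inventory export; the resource records are hypothetical and would normally come from a cloud inventory listing or API.

```python
# Flag resources missing any mandatory tag (environment, owner, cost-center).

MANDATORY = {"environment", "owner", "cost-center"}

resources = [
    {"id": "i-0abc", "tags": {"environment": "production", "owner": "team-a",
                              "cost-center": "1234"}},
    {"id": "i-0def", "tags": {"environment": "staging"}},
]

for resource in resources:
    missing = MANDATORY - resource["tags"].keys()
    status = "ok" if not missing else f"missing: {', '.join(sorted(missing))}"
    print(f"{resource['id']}: {status}")
```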
Documentation
Infrastructure as Code (IaC):
├─ Terraform: Multi-cloud (AWS, Azure, GCP)
├─ CloudFormation: AWS-specific
├─ ARM Templates: Azure-specific
├─ Version control: Track all changes
Benefits:
├─ Reproducible infrastructure
├─ Consistent across environments
├─ Easy disaster recovery
├─ Audit trail of all changes
└─ Self-documenting (code is documentation)
15. Production Checklist
Pre-Launch Checklist
Capacity & Performance:
☐ Load testing completed (2-3x expected peak)
☐ CPU/memory utilization acceptable (< 70% target)
☐ Response time within SLA
☐ Disk I/O performance acceptable
☐ Network latency measured and acceptable
☐ Database query performance optimized
High Availability:
☐ Multi-AZ deployment configured
☐ Load balancer health checks enabled
☐ Auto-scaling policies configured
☐ Failover tested manually
☐ Disaster recovery plan documented
Security:
☐ Security groups/NSGs locked down (least privilege)
☐ Firewall rules verified
☐ Encryption enabled (data in transit, at rest)
☐ SSH/RDP access restricted to admins
☐ Security scanning passed
☐ Secrets management configured
Monitoring & Alerting:
☐ All metrics collected (CPU, memory, disk, network)
☐ Alerts configured for critical thresholds
☐ Dashboards created
☐ Logging configured and verified
☐ Incident response procedures documented
☐ On-call rotation established
Backup & Disaster Recovery:
☐ Backup policy configured
☐ Backup verification tests passed (restore procedure works)
☐ Recovery time objective (RTO) < threshold
☐ Recovery point objective (RPO) < threshold
☐ Disaster recovery runbook created
Documentation:
☐ Architecture diagram created
☐ Runbook for common operations created
☐ Troubleshooting guide created
☐ Change log started
☐ Team trained on new infrastructure
☐ On-call playbook created
Runbook Template
Common Operations Runbook:
1. Scaling up
├─ Manual: AWS CLI command
├─ Auto-scaling: Monitor trigger metrics
└─ Estimated time: 5-10 minutes
2. Restarting instance/pod
├─ Command: systemctl restart <service> / kubectl rollout restart deployment <name>
├─ Health check verification
└─ Estimated time: 2-5 minutes
3. Patching/Updates
├─ Staging verification first
├─ Multi-step rolling update
└─ Rollback procedure if issues
4. Emergency failover
├─ Automated: DNS failover triggers
├─ Manual: DNS change + load balancer adjustment
├─ Communications: Notify stakeholders
└─ Estimated time: < 5 minutes
5. Incident response
├─ Escalation path
├─ Decision tree (rollback vs. fix forward)
├─ Communications template
└─ Post-incident review
Summary
Compute infrastructure encompasses a wide spectrum from physical servers to serverless platforms. Key success factors:
- Right-sizing: Match compute resources to workload requirements
- Scaling: Implement auto-scaling for dynamic workloads
- Reliability: Multi-AZ deployment, health checks, auto-recovery
- Cost optimization: Reserved instances, spot, right-sizing
- Monitoring: Observe all metrics, alert on anomalies
- Automation: IaC, deployment pipelines, auto-remediation
The optimal solution combines technologies across the spectrum based on workload characteristics and business requirements.
Document Version: 1.0
Last Updated: January 31, 2026
Audience: Infrastructure Engineers, DevOps, Architecture
Contact: Infrastructure Team