CONCEPT
Cloud-Native PostgreSQL (CNPG): Architecture & Concepts
Overview
CloudNativePG (CNPG) is a Kubernetes operator that manages the full lifecycle of PostgreSQL clusters natively on Kubernetes. It provides production-grade high availability, automated backups, self-healing, and Point-in-Time Recovery (PITR).
Core Benefits:
- ✅ High Availability (HA) with automatic failover
- ✅ Synchronous replication for data safety
- ✅ Automated WAL archiving and PITR
- ✅ Rolling updates with zero downtime
- ✅ Declarative cluster management (GitOps)
- ✅ No external HA dependencies (no Patroni or etcd; failover logic runs in the operator and instance manager)
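All of this is managed through a single declarative resource. A minimal sketch of a Cluster manifest (name, image tag, and sizes are illustrative, not prescriptions):
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-cluster
spec:
  instances: 3                                       # 1 primary + 2 replicas
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2  # official CNPG image
  storage:
    size: 20Gi    # PGDATA volume
  walStorage:
    size: 5Gi     # dedicated WAL volume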
1. PostgreSQL Basics
1.1 Replication Concepts
Primary Node (Writer):
- Single leader that accepts writes
- WAL (Write-Ahead Log) records changes
- Streams WAL to replicas
Replica Nodes (Read-Only):
- Receive WAL from primary
- Apply changes asynchronously or synchronously
- Can be promoted to primary if original fails
┌──────────────────────────────────────────────────────┐
│ PostgreSQL HA Architecture (CNPG) │
├──────────────────────────────────────────────────────┤
│ │
│ Client Applications │
│ (Write to Primary, Read from Replicas) │
│ │ │
│ ├────────┬─────────────┬────────────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Primary │ │ Replica 1 │ │ Replica 2 │ │
│ │ (Writer) │ │ (Read-Only) │ │ (Read-Only) │ │
│ └──────┬───────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ WAL Stream (Sync or Async) │
│ ├─ Replication Slots (retain WAL per standby) │
│ └─ Quorum Commit (synchronous mode) │
│ │
│ Persistent Storage (PVCs) │
│ ├─ Data Volume (20GB+) │
│ └─ WAL Volume (5GB+) │
│ │
└──────────────────────────────────────────────────────┘
1.2 Synchronous vs Asynchronous Replication
Asynchronous (Fast, risky):
Primary writes → Replica receives later
Risk: Data loss if primary crashes before replica receives
Use: Non-critical databases, high-throughput scenarios
Synchronous (Safe, slower):
Primary waits → replica acknowledges → commit returns to the client
Safety: No committed data is lost; if the primary crashes, an acknowledging replica already holds every committed transaction
Use: Production, mission-critical data, financial systems
1.3 Replication Slots
Purpose: Retain WAL files on primary until replica has consumed them.
Primary Disk
├─ WAL files (keep until consumed by replicas)
└─ Replication Slots track consumer position
Without slots: the primary may recycle WAL before a slow replica has received it → the replica falls irrecoverably behind and must be re-cloned
With slots: the primary retains WAL until it is consumed → replicas can always catch up (at the cost of disk usage on the primary)
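CNPG can manage these slots for you, one per standby. A sketch of the relevant Cluster fields:
spec:
  replicationSlots:
    highAvailability:
      enabled: true        # operator creates and maintains one slot per standby
    updateInterval: 30     # seconds between slot position syncs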
Quorum Commit (Synchronous mode):
spec:
  minSyncReplicas: 1   # At least 1 standby must ACK each commit
  maxSyncReplicas: 1   # Operator keeps synchronous_standby_names within these bounds
Behavior:
With 3 instances (1 primary + 2 standbys):
- Primary writes
- Waits for ANY 1 replica to ACK
- Then confirms write to application
If 1 replica crashes:
- Primary still waits for other replica
- No impact on availability
If 2 replicas crash:
- Primary keeps serving writes (synchronous replication is relaxed rather than blocking)
- Degraded durability until the operator recreates the replicas
2. CNPG Operator Architecture
2.1 Key Components
┌─────────────────────────────────────────────────────┐
│ CNPG Operator (Control Plane) │
├─────────────────────────────────────────────────────┤
│ │
│ CRDs (Custom Resource Definitions) │
│ ├─ Cluster: Define PostgreSQL cluster │
│ ├─ ScheduledBackup: Backup policy │
│ └─ Pooler: Connection pooling (pgBouncer) │
│ │
│ Controllers │
│ ├─ Cluster Controller: Manage instances │
│ ├─ Bootstrap Controller: Initialize clusters │
│ └─ Backup Controller: Handle WAL archiving │
│ │
│ Status & Reconciliation │
│ └─ Continuously reconcile desired vs actual state │
│ │
└─────────────────────────────────────────────────────┘
│
│ Manages
▼
┌─────────────────────────────────────────────────────┐
│ PostgreSQL Cluster (Data Plane) │
├─────────────────────────────────────────────────────┤
│ │
│ Instance Pods (operator-managed; no StatefulSet) │
│ ├─ <cluster>-1 (Primary) │
│ ├─ <cluster>-2 (Replica) │
│ └─ <cluster>-3 (Replica) │
│ │
│ PersistentVolumes (Storage) │
│ ├─ Data PVC (20GB) │
│ ├─ WAL PVC (5GB) │
│ └─ PGDATA directory │
│ │
│ Services │
│ ├─ rw: Primary (read-write) │
│ ├─ ro: Replicas (read-only) │
│ ├─ r: Any instance (reader) │
│ └─ Metrics endpoint: each pod, port 9187 │
│ │
└─────────────────────────────────────────────────────┘
2.2 Instance Lifecycle
1. CREATE Cluster manifest
│
▼
2. CNPG Operator detects Cluster
│
▼
3. CREATE N instance pods + PVCs (the operator manages them directly; no StatefulSet)
│
▼
4. BOOTSTRAP (Initialize database)
├─ First pod: Primary
├─ Remaining pods: Replicas
└─ Configure replication
│
▼
5. READY (All pods running, replication healthy)
│
▼
6. ROLLING UPDATE (e.g., version upgrade)
├─ Update replicas first (no downtime)
├─ Perform switchover (primary → replica)
├─ Update old primary
└─ Switchback (if desired)
│
▼
7. HEALTHY (Cluster operational)
3. Storage Architecture
3.1 Storage Types
Data Volume (PGDATA):
Location: /var/lib/postgresql/data/pgdata
Size: Depends on database size (typically 20GB+)
Usage: PostgreSQL data files, indexes, tables
I/O Pattern: Random read/write (needs fast disk)
Retention: Permanent (until cluster deleted)
WAL Volume (Write-Ahead Log):
Location: /var/lib/postgresql/wal
Size: Depends on write throughput (typically 5GB)
Usage: Write-Ahead Log segments; every change is logged here before the data files are updated
I/O Pattern: Sequential, fsync-heavy writes; commit latency depends on this volume
Retention: Until archived to S3/backup store
Benefit: Separating WAL improves I/O performance
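This split maps directly onto two Cluster spec fields (sizes illustrative):
spec:
  storage:
    size: 20Gi      # PGDATA (data files, indexes)
  walStorage:
    size: 5Gi       # pg_wal on its own volume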
Backup Storage (S3/MinIO):
Location: s3://bucket/cluster-name/base/ (base backups)
+ s3://bucket/cluster-name/wal/ (WAL archives)
Size: Compressed base backup + WAL archives
Usage: PITR and disaster recovery
Retention: Based on backup policy (e.g., 30 days)
3.2 Storage Expansion
Online Expansion (No downtime):
# Update manifest
storage:
size: 30Gi # Increase from 20Gi
# Apply change
kubectl apply -f cluster.yaml
# Operator progressively expands each PVC
# Users experience no downtime
Limitations:
- ❌ Cannot shrink storage (only expand)
- ❌ Requires StorageClass that supports expansion
- ❌ Expansion takes time (depends on I/O speed)
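The StorageClass limitation in practice: the PVCs can only grow if the class sets allowVolumeExpansion. A sketch (the provisioner and parameters are illustrative):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage-resizable
provisioner: ebs.csi.aws.com     # example CSI driver
allowVolumeExpansion: true       # required for online expansion
parameters:
  type: gp3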
3.3 Storage Classes
# Fast, resizable SSD for data (PGDATA)
storage:
  storageClass: fast-storage-resizable
  size: 50Gi
# WAL on its own (possibly cheaper) resizable class
walStorage:
  storageClass: standard-storage-resizable
  size: 20Gi
# Backup storage (object store)
backup:
  barmanObjectStore:
    destinationPath: s3://my-bucket/
    endpointURL: https://s3.amazonaws.com
Note: storage settings apply cluster-wide; the primary and replicas cannot use different classes.
4. High Availability & Failover
4.1 Automatic Failover
Scenario: Primary node crashes
Timeline:
0s: Primary pod dies
2s: Kubernetes detects pod is down
5s: CNPG operator notices missing primary
10s: Operator promotes healthy replica to primary
15s: New primary is ready, accepts connections
20s: Service updates to point to new primary
Result: ~20 seconds with writes unavailable (reads via the ro service keep working), then normal operation
4.2 Quorum-Based Failover
spec:
  minSyncReplicas: 1   # At least 1 standby must ACK writes
  maxSyncReplicas: 1
Cluster: 1 Primary + 2 Replicas
Scenario 1: Primary fails
✅ Replicas still exist
✅ Operator promotes best replica (least lag)
✅ New cluster: 1 Primary + 1 Replica
✅ Cluster recovers automatically
Scenario 2: 1 Replica fails
✅ Primary + 1 Replica still healthy
✅ Replication continues
✅ Operator recreates lost replica
✅ Eventually: 1 Primary + 2 Replicas again
Scenario 3: 2 Replicas fail (Primary only)
⚠️ Primary operates without replicas (degraded)
⚠️ Synchronous replication relaxed (writes no longer wait for an ACK; durability rests on one node)
✅ Operator recreates replicas
✅ Eventually: 1 Primary + 2 Replicas again
4.3 Switchover (Planned Failover)
Purpose: Move primary to a different node with zero data loss.
Reason: Maintenance, node drain, load balancing
Steps:
1. Current primary: Flush and sync WAL
2. Current primary: Stop accepting writes
3. Replica: Catch up to primary
4. Replica: Promoted to primary
5. Old primary: Demoted to replica
6. Resume: New cluster is ready
Downtime: Brief (typically a few seconds while clients reconnect)
Data Loss: Zero (fully synchronous)
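Rolling updates use this same mechanism when it is selected on the Cluster spec:
spec:
  primaryUpdateStrategy: unsupervised   # operator proceeds on its own
  primaryUpdateMethod: switchover       # promote a replica instead of restarting the primary in place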
5. Upgrades & Maintenance
5.1 PostgreSQL Minor Version Upgrade
Example: PostgreSQL 16.1 → 16.2
Impact: Near-zero downtime (with switchover enabled)
Procedure:
1. Edit manifest: imageName: ghcr.io/cloudnative-pg/postgresql:16.2
2. Apply change: kubectl apply -f cluster.yaml
3. Operator updates replicas first (no downtime)
4. Operator performs switchover (primary → replica)
5. Operator updates old primary
6. Optional: Switch back to original primary
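Step 1 is a one-line change on the Cluster spec (official image registry shown):
spec:
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2   # minor bump from 16.1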
Validation:
- Check pod status: kubectl get pods
- Check cluster version: kubectl cnpg status <cluster>
- Monitor logs: kubectl logs -f pod/<cluster>-1
5.2 PostgreSQL Major Version Upgrade
Example: PostgreSQL 15.x → 16.x
⚠️ WARNING: Major upgrades involve pg_upgrade (can take time)
Prerequisites:
1. Full backup (backups taken on the old major version cannot be used for PITR of the upgraded cluster)
2. Test in staging cluster first
3. Ensure sufficient disk space (for pg_upgrade)
Procedure:
1. Create base backup before upgrade
2. Edit manifest: imageName: ghcr.io/cloudnative-pg/postgresql:16
3. Set: primaryUpdateMethod: switchover
4. Apply: kubectl apply -f cluster.yaml
5. Monitor upgrade progress (check logs)
6. If pod stuck: Force delete: kubectl delete pod <pod-name> --force
Post-Upgrade:
⚠️ CRITICAL: Trigger new base backup immediately
Old backups are invalid for PITR
New PITR starts from backup after upgrade
Rollback:
❌ NOT POSSIBLE (pg_upgrade is in-place)
Have full backup if rollback needed
5.3 CNPG Operator Upgrade
Risk Level: High (can cause unexpected cluster restarts)
Safety Procedure:
Step 1: Freeze all clusters (supervised mode)
spec:
primaryUpdateStrategy: supervised # Don't auto-update
Apply to ALL clusters (kubectl patch has no --all flag, so loop over them):
for c in $(kubectl get clusters.postgresql.cnpg.io -n <namespace> -o name); do
  kubectl patch "$c" -n <namespace> --type=merge \
    -p '{"spec":{"primaryUpdateStrategy":"supervised"}}'
done
Step 2: Upgrade operator
helm upgrade cnpg cnpg/cloudnative-pg --version X.Y.Z
Step 3: Monitor operator
kubectl logs -f deployment/cnpg-controller-manager -n cnpg-system
Step 4: Verify operator health
kubectl get deployment -n cnpg-system cnpg-controller-manager
# Should show all replicas ready
Step 5: Unfreeze clusters one by one
spec:
primaryUpdateStrategy: unsupervised # Resume normal updates
Apply per cluster to validate each works correctly
6. Backup & Disaster Recovery
6.1 WAL Archiving
WAL (Write-Ahead Log): every change is written to the log before the corresponding data files are modified.
Without archiving:
Primary disk: Keeps only recent WAL (segments are recycled after checkpoints)
Crash recovery: Local WAL still replays committed transactions, but nothing older survives
Old backups: Cannot be rolled forward → no PITR
With archiving:
Primary disk: Keeps recent WAL (cleaned up once archived)
S3/MinIO: Keeps ALL WAL for 30+ days
Any base backup in that window: Can PITR to any point in the last 30 days
Configuration:
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://my-bucket/
      endpointURL: https://s3.amazonaws.com
      s3Credentials:
        accessKeyId:
          name: backup-creds       # Kubernetes Secret holding the keys
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip          # Reduce storage
        maxParallel: 4             # Parallel archive streams
6.2 Scheduled Backups
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: cluster-daily-backup
spec:
schedule: "0 1 * * *" # Daily at 1 AM UTC
backupOwnerReference: cluster # Auto-delete old backups
cluster:
name: <cluster-name>
# Retention: keep 7 daily backups
retention: 7
6.3 Point-in-Time Recovery (PITR)
Scenario: Delete critical table at 14:30, discover at 15:00
# Create NEW cluster (don't overwrite existing)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: restored-cluster
spec:
instances: 1
bootstrap:
recovery:
source: cluster # Source cluster name
recoveryTarget:
targetTime: "2024-01-28T14:25:00Z" # Just before deletion
# All data up to this timestamp will be restored
7. Connection Management
7.1 Connection Pooling with PgBouncer
Problem: Each PostgreSQL connection is expensive (a dedicated backend process plus memory)
Without pooling:
10,000 app instances → 10,000 connections → PostgreSQL overload
With pooling:
10,000 app clients → PgBouncer → ~100 pooled server connections → PostgreSQL
Configuration:
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: cluster-pooler
spec:
  cluster:
    name: <cluster-name>
  type: rw                        # Pool connections to the primary
  pgbouncer:
    poolMode: transaction         # Reuse a server connection per transaction
    parameters:
      max_client_conn: "1000"     # Max client connections to PgBouncer
      default_pool_size: "25"     # Server connections per user/database pair
7.2 Service Discovery
# Read-Write (Primary only)
svc/<cluster>-rw.default.svc.cluster.local:5432
# Read-Only (Replicas only)
svc/<cluster>-ro.default.svc.cluster.local:5432
# Read (Any instance, load balanced)
svc/<cluster>-r.default.svc.cluster.local:5432
# Monitoring (metrics are exposed on port 9187 of each instance pod and
# scraped via a PodMonitor; CNPG does not create a dedicated metrics Service)
<instance-pod>:9187/metrics
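On the application side, one way to wire a Deployment to the rw Service is via the app Secret that CNPG generates at bootstrap (a sketch; names are illustrative):
containers:
  - name: app
    env:
      - name: PGHOST
        value: <cluster>-rw.default.svc.cluster.local
      - name: PGUSER
        valueFrom:
          secretKeyRef:
            name: <cluster>-app   # created by CNPG for the default app database
            key: username
      - name: PGPASSWORD
        valueFrom:
          secretKeyRef:
            name: <cluster>-app
            key: password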
8. Monitoring & Observability
8.1 Key Metrics
| Metric | Normal | Warning | Critical |
|---|---|---|---|
| Replication Lag | < 1s | > 10s | > 60s |
| WAL Archiving | Caught up | Behind 1h | Behind 24h |
| PVC Usage | < 70% | > 85% | > 95% |
| Connection Count | < 50% pool | > 70% | > 90% |
| Transaction Rate | Baseline | +50% | +100% |
| Slow Queries | < 1% | > 5% | > 10% |
8.2 Prometheus Metrics
# Replication lag in seconds (from the default monitoring queries)
cnpg_pg_replication_lag
# WAL archiver status
cnpg_pg_stat_archiver_seconds_since_last_archival
cnpg_pg_stat_archiver_archived_count
cnpg_pg_stat_archiver_failed_count
# Pod metrics (standard)
up{job="cnpg"} # Cluster health
container_memory_usage_bytes{pod=~"<cluster>.*"}
container_cpu_usage_seconds_total{pod=~"<cluster>.*"}
# PVC usage
kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"<cluster>.*"}
kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"<cluster>.*"}
8.3 Alert Rules
- alert: CNPGReplicationLagHigh
  expr: cnpg_pg_replication_lag > 60
  for: 5m
  annotations:
    summary: "CNPG replication lag > 60s"
- alert: CNPGWALArchivingFailing
  expr: cnpg_pg_stat_archiver_seconds_since_last_archival > 3600
  for: 15m
  annotations:
    summary: "WAL archiving not running (> 1 hour behind)"
- alert: CNPGPVCAlmostFull
  expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.95
  for: 10m
  annotations:
    summary: "CNPG PVC > 95% full"
9. Best Practices
9.1 Production Configuration Checklist
✅ Replicas: Minimum 3 (1 primary + 2 replicas)
✅ Synchronous Replication: minSyncReplicas: 1
✅ Storage: Separate data + WAL PVCs
✅ Resources: Explicit CPU/memory limits
✅ WAL Archiving: Enabled to S3/MinIO
✅ Backup Policy: Scheduled daily backups
✅ Update Strategy: primaryUpdateMethod: switchover
✅ Monitoring: PodMonitor configured
✅ Connection Pooling: PgBouncer for high concurrency
✅ Secrets: PostgreSQL passwords in K8s Secrets
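Pulled together, a production-shaped Cluster covering this checklist might look like the following sketch (names, classes, sizes, and endpoints are all illustrative):
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: prod-db
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2
  minSyncReplicas: 1
  maxSyncReplicas: 1
  primaryUpdateStrategy: unsupervised
  primaryUpdateMethod: switchover
  storage:
    storageClass: fast-storage-resizable
    size: 50Gi
  walStorage:
    storageClass: fast-storage-resizable
    size: 10Gi
  resources:
    requests:
      memory: 4Gi
      cpu: "2"
    limits:
      memory: 4Gi
      cpu: "2"
  monitoring:
    enablePodMonitor: true
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://my-bucket/prod-db/
      endpointURL: https://s3.amazonaws.com
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: ACCESS_SECRET_KEY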
9.2 Capacity Planning
Determine requirements:
1. Database size today
2. Growth rate (GB/month)
3. Write throughput (queries/sec)
4. Backup frequency and retention
Storage calculation:
Data volume = current_size × 1.5 (growth buffer)
WAL volume ≈ peak WAL write rate × longest tolerated archiving outage, plus headroom
Backup storage = base_backup_size × number_of_retained_backups + WAL archive size
Example:
Current: 10GB
Growth: 1GB/month
Writes: 1000 q/s
Data PVC: 10GB × 1.5 = 15GB
WAL PVC: 5GB (comfortable buffer for ~1000 writes/s)
30-day retention (10 base backups, one every 3 days): 10GB × 10 = 100GB in S3, plus WAL archives
9.3 Performance Tuning
# Resource-heavy workloads
resources:
requests:
memory: "8Gi"
cpu: "4"
limits:
memory: "16Gi"
cpu: "8"
# Increase max connections
postgresql:
parameters:
max_connections: "1000"
shared_buffers: "2GB"
effective_cache_size: "8GB"
10. Essential kubectl Commands
# Cluster status
kubectl cnpg status <cluster-name>
# PSQL shell (run queries)
kubectl cnpg psql <cluster-name>
# Replication status
kubectl cnpg psql <cluster-name> -- -c "SELECT * FROM pg_stat_replication;"
# Check replication slots
kubectl cnpg psql <cluster-name> -- -c "SELECT * FROM pg_replication_slots;"
# Force promote replica to primary
kubectl cnpg promote <cluster-name> <pod-name>
# Restart instance
kubectl cnpg restart <cluster-name> <instance-name>
# Destroy a single instance and its PVCs (use with care;
# if spec.instances still requires it, the operator creates a fresh replica)
kubectl cnpg destroy <cluster-name> <instance-id>
11. Troubleshooting Common Issues
Replica Stuck in "Waiting" Status
Symptom: Replica pod runs but doesn't join cluster
Causes:
1. Replication slot issue
2. Network connectivity problem
3. Storage initialization problem
Fix:
# Check logs
kubectl logs pod/<cluster>-1
# If slot stuck, recreate pod
kubectl delete pod <cluster>-1
# CNPG will automatically recreate and rejoin
Authentication Failures
Error: "FATAL: role 'app' does not exist"
Fix: Ensure the managed role name matches the username in its password Secret
Manifest:
spec:
  managed:
    roles:
      - name: app
        login: true
        passwordSecret:
          name: app-credentials   # kubernetes.io/basic-auth Secret
Secret:
Must contain username: app (matching the role name)
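A matching Secret might look like this sketch (name and password are placeholders):
apiVersion: v1
kind: Secret
metadata:
  name: app-credentials
type: kubernetes.io/basic-auth
stringData:
  username: app        # must match managed.roles[].name
  password: change-me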
Switchover Timeout
Symptom: Switchover hangs or times out
Fix: Check primaryUpdateStrategy
If supervised: Manually promote
kubectl cnpg promote <cluster> <pod>
Change to unsupervised:
spec:
primaryUpdateStrategy: unsupervised
primaryUpdateMethod: switchover
12. Key Takeaways
- HA by default: 3+ replicas with synchronous replication
- Storage separation: Data + WAL on different volumes
- Zero-downtime updates: Use switchover method
- PITR capability: Enable WAL archiving to S3
- Self-healing: Operator handles failover automatically
- Monitoring required: Track replication lag and WAL archiving
- Capacity planning: Predict storage growth upfront
- Backup discipline: Automate backups and test recovery
Additional Resources
- CloudNativePG Documentation
- PostgreSQL Replication Guide
- CNPG GitHub Repository
- Production Deployment Guide
Last Updated: January 2026