FinOps: Implementation Runbook

Last updated: 1/31/2026

Complete operational guide for deploying and managing FinOps infrastructure and processes across AWS, Azure, and GCP.


Overview

This runbook provides step-by-step instructions for implementing FinOps across multi-cloud environments, including cost visibility setup, optimization implementation, and governance enforcement. Each phase includes provider-specific commands, verification procedures, and best practices.

Target Audience: Cloud architects, FinOps engineers, infrastructure teams, finance teams
Expected Timeline: 8-12 weeks for full implementation
Expected Cost Reduction: 25-40% with complete implementation


Standard Deployment

Phase 1: Cost Visibility Foundation (Week 1-2)

1.1 AWS Cost Visibility Setup

Prerequisites:

  • AWS Account with billing admin or Cost Management access
  • IAM permissions: ce:*, budgets:*, cloudtrail:*, organizations:*
  • CloudTrail enabled with S3 logging configured
  • Management account access (for organization-wide policies)
  • S3 bucket for cost and usage reports

Key AWS Tools:

  • Cost Explorer: Visual cost analysis and trend forecasting
  • Cost Anomaly Detection: ML-based unusual spending detection
  • AWS Budgets: Threshold-based alerts and forecasting
  • Cost and Usage Reports: Granular billing data export
  • Compute Optimizer: Instance right-sizing recommendations
  • Trusted Advisor: General infrastructure optimization

Steps:

# 1. Verify prerequisites and permissions
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "Working with AWS Account: $ACCOUNT_ID"

# Check IAM permissions
aws iam get-user --query 'User.UserId' && echo "✓ IAM access verified" || exit 1

# Verify CloudTrail is enabled
aws cloudtrail describe-trails --query 'trailList[0].S3BucketName' && echo "✓ CloudTrail enabled" || exit 1

# 2. Create S3 bucket for cost and usage reports (if not exists)
BUCKET_NAME="finops-reports-${ACCOUNT_ID}"
aws s3 ls "s3://${BUCKET_NAME}" 2>/dev/null || {
  aws s3 mb "s3://${BUCKET_NAME}" --region us-east-1
  aws s3api put-bucket-versioning \
    --bucket "${BUCKET_NAME}" \
    --versioning-configuration Status=Enabled
  echo "✓ Created S3 bucket for reports"
}

# 3. Create a cost category keyed to the CostCenter tag (requires Cost Explorer to be enabled)
aws ce create-cost-category-definition \
  --name "CostCenter" \
  --rule-version "CostCategoryExpression.v1" \
  --rules '[{
    "Type": "INHERITED_VALUE",
    "InheritedValue": {
      "DimensionName": "TAG",
      "DimensionKey": "CostCenter"
    }
  }]' 2>/dev/null || echo "Cost category may already exist"

# 4. Set up comprehensive tagging policy for organizations
aws organizations create-policy \
  --name "finops-tagging-policy" \
  --description "Mandatory FinOps cost allocation tags" \
  --content '{
    "tags": {
      "Environment": {
        "tag_key": {"@@assign": "Environment"},
        "enforced_for": {"@@assign": ["ec2:*", "rds:*", "s3:*", "dynamodb:*", "lambda:*"]},
        "tag_value": {"@@assign": ["production", "staging", "development", "test"]}
      },
      "CostCenter": {
        "tag_key": {"@@assign": "CostCenter"},
        "enforced_for": {"@@assign": ["ec2:*", "s3:*", "rds:*", "lambda:*"]}
      },
      "Owner": {
        "tag_key": {"@@assign": "Owner"},
        "enforced_for": {"@@assign": ["ec2:*", "rds:*"]}
      },
      "Application": {
        "tag_key": {"@@assign": "Application"},
        "enforced_for": {"@@assign": ["ec2:*", "rds:*", "lambda:*"]}
      }
    }
  }' \
  --type TAG_POLICY 2>/dev/null || echo "Policy may already exist"

# 5. Create additional cost categories mirroring the other allocation tags
for TAG_NAME in Application Owner Team Project; do
  aws ce create-cost-category-definition \
    --name "${TAG_NAME}" \
    --rule-version "CostCategoryExpression.v1" \
    --rules "[{\"Type\": \"INHERITED_VALUE\", \"InheritedValue\": {\"DimensionName\": \"TAG\", \"DimensionKey\": \"${TAG_NAME}\"}}]" \
    2>/dev/null || echo "Cost category ${TAG_NAME} exists"
done

# 6. Generate Cost and Usage Report (detailed billing)
aws cur put-report-definition \
  --region us-east-1 \
  --report-definition '{
    "ReportName": "FinOps-Daily-Cost-Report",
    "TimeUnit": "DAILY",
    "Format": "Parquet",
    "Compression": "Parquet",
    "AdditionalSchemaElements": ["RESOURCES"],
    "RefreshClosedReports": true,
    "ReportVersioning": "OVERWRITE_REPORT",
    "S3Bucket": "'"${BUCKET_NAME}"'",
    "S3Prefix": "cost-reports/",
    "S3Region": "us-east-1"
  }' 2>/dev/null || echo "Report definition may already exist"

# 7. Create Budget with multiple alert thresholds
BUDGET_ID="monthly-production-budget"
aws budgets create-budget \
  --account-id "${ACCOUNT_ID}" \
  --budget '{
    "BudgetName": "'"${BUDGET_ID}"'",
    "BudgetLimit": {
      "Amount": "5000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
      "TagKeyValue": ["Environment$production"]
    }
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 75,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "team@company.com"}
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 90,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "finance@company.com"}
      ]
    },
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {"SubscriptionType": "EMAIL", "Address": "cfo@company.com"}
      ]
    }
  ]' 2>/dev/null || echo "Budget may already exist"

# 8. Enable Cost Anomaly Detection (ML-based)
# Create a service-level anomaly monitor
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "finops-service-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }' 2>/dev/null || echo "Anomaly monitor may already exist"

# Subscribe the team to daily anomaly summaries (impact threshold in USD)
MONITOR_ARN=$(aws ce get-anomaly-monitors \
  --query 'AnomalyMonitors[0].MonitorArn' --output text 2>/dev/null)
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "finops-anomaly-alerts",
    "MonitorArnList": ["'"${MONITOR_ARN}"'"],
    "Subscribers": [{"Type": "EMAIL", "Address": "team@company.com"}],
    "Frequency": "DAILY",
    "Threshold": 100
  }' 2>/dev/null || echo "Anomaly subscription may already exist"

# 9. Create EventBridge rule for cost anomalies
aws events put-rule \
  --name FinOps-Anomaly-Alert \
  --event-bus-name default \
  --state ENABLED \
  --event-pattern '{
    "source": ["aws.ce"],
    "detail-type": ["Cost Anomaly Detection"],
    "detail": {
      "detector": [{"state": ["ACTIVE"]}]
    }
  }' 2>/dev/null || echo "Rule may already exist"

# 10. Create SNS topic for alerts
SNS_TOPIC_ARN=$(aws sns create-topic \
  --name finops-alerts \
  --query 'TopicArn' \
  --output text 2>/dev/null)
echo "SNS Topic for alerts: $SNS_TOPIC_ARN"

# Add email subscription
aws sns subscribe \
  --topic-arn "${SNS_TOPIC_ARN}" \
  --protocol email \
  --notification-endpoint "team@company.com"

# 11. Create CloudWatch dashboard for cost monitoring
aws cloudwatch put-dashboard \
  --dashboard-name FinOps-Cost-Dashboard \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "properties": {
          "metrics": [["AWS/Billing", "EstimatedCharges", {"stat": "Average"}]],
          "period": 86400,
          "stat": "Sum",
          "region": "us-east-1",
          "title": "Estimated Monthly Charges"
        }
      },
      {
        "type": "log",
        "properties": {
          "query": "fields @timestamp, @message | stats sum(@message) by bin(5m)",
          "region": "us-east-1",
          "title": "Hourly Cost Trends"
        }
      }
    ]
  }' 2>/dev/null || echo "Dashboard may already exist"

Verification Steps:

# 1. Verify cost categories exist
aws ce list-cost-category-definitions --query 'CostCategoryReferences[*].CostCategoryArn'

# 2. Verify tagging policy
aws organizations list-policies --filter TAG_POLICY --query 'Policies[*].Name'

# 3. Verify budgets configured
aws budgets describe-budgets \
  --account-id "${ACCOUNT_ID}" \
  --query 'Budgets[*].[BudgetName,BudgetLimit.Amount,TimeUnit]' \
  --output table

# 4. Verify anomaly detection is active
aws ce get-anomaly-monitors \
  --query 'AnomalyMonitors[*].[MonitorArn,MonitorType,MonitorDimension]'

# 5. Check Cost and Usage Report generation
aws cur describe-report-definitions \
  --query 'ReportDefinitions[*].[ReportName,ReportFormat,TimeUnit]'

# 6. Verify S3 bucket has cost data (wait 24-48 hours for first report)
aws s3 ls "s3://${BUCKET_NAME}/cost-reports/" --recursive --human-readable --summarize

# 7. Test cost explorer query
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0]' --output table
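
Note that tag keys only show up in Cost Explorer and the Cost and Usage Report after they are activated as cost allocation tags. A minimal sketch, assuming a CLI version that includes the cost allocation tag commands:

# List user-defined tag keys that are not yet active
aws ce list-cost-allocation-tags --type UserDefined \
  --query 'CostAllocationTags[?Status==`Inactive`].[TagKey,Status]' --output table

# Activate the standard FinOps tag keys
aws ce update-cost-allocation-tags-status \
  --cost-allocation-tags-status '[
    {"TagKey": "Environment", "Status": "Active"},
    {"TagKey": "CostCenter", "Status": "Active"},
    {"TagKey": "Owner", "Status": "Active"},
    {"TagKey": "Application", "Status": "Active"}
  ]'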

1.2 Azure Cost Visibility Setup

Prerequisites:

  • Azure subscription with Owner or Contributor role
  • Enterprise Agreement (EA) or pay-as-you-go billing
  • Resource Group for cost management resources
  • Storage account for cost export

Key Azure Tools:

  • Cost Management + Billing: Central cost analysis and budgeting
  • Azure Monitor: Real-time metrics and alerts
  • Resource Graph: Query resources across subscriptions
  • Azure Advisor: Optimization recommendations
  • Log Analytics: Historical cost and usage data
  • Power BI: Advanced cost analytics and reporting

Steps:

# 1. Verify prerequisites
SUBSCRIPTION_ID=$(az account show --query id --output tsv)
RESOURCE_GROUP="finops-rg"
echo "Working with subscription: $SUBSCRIPTION_ID"

# Verify access
az account show --query '[name, state]' && echo "✓ Azure access verified" || exit 1

# 2. Create resource group for finops resources
az group create \
  --name "${RESOURCE_GROUP}" \
  --location eastus \
  --tags Environment=finops Purpose=cost-management \
  2>/dev/null || echo "Resource group may already exist"

# 3. Create storage account for cost exports
STORAGE_NAME="finopsstg$(date +%s | tail -c 6)"
az storage account create \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${STORAGE_NAME}" \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --access-tier Hot \
  2>/dev/null || echo "Storage account may already exist"

# 4. Create container for exports
az storage container create \
  --account-name "${STORAGE_NAME}" \
  --name cost-exports 2>/dev/null || echo "Container may already exist"

# 5. Set up cost management export (daily)
EXPORT_NAME="daily-cost-export"
az costmanagement export create \
  --scope "/subscriptions/${SUBSCRIPTION_ID}" \
  --name "${EXPORT_NAME}" \
  --type "Usage" \
  --storage-account-id "$(az storage account show --resource-group "${RESOURCE_GROUP}" --name "${STORAGE_NAME}" --query id --output tsv)" \
  --storage-container cost-exports \
  --storage-directory daily \
  --timeframe MonthToDate \
  --recurrence Daily \
  --recurrence-period from="2025-01-01T00:00:00Z" to="2025-12-31T23:59:59Z" \
  --schedule-status Active 2>/dev/null || echo "Export may already exist"

# 6. Create Log Analytics workspace for cost data
WORKSPACE_NAME="finops-workspace-${SUBSCRIPTION_ID:0:8}"
az monitor log-analytics workspace create \
  --resource-group "${RESOURCE_GROUP}" \
  --workspace-name "${WORKSPACE_NAME}" \
  --location eastus 2>/dev/null || echo "Workspace may already exist"

# Capture the workspace GUID (customerId), which query commands expect
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group "${RESOURCE_GROUP}" \
  --workspace-name "${WORKSPACE_NAME}" \
  --query customerId --output tsv)
echo "Log Analytics Workspace ID: $WORKSPACE_ID"

# 7. Create action group for alerts
ACTION_GROUP_NAME="FinOps-Alerts"
az monitor action-group create \
  --resource-group "${RESOURCE_GROUP}" \
  --name "${ACTION_GROUP_NAME}" \
  --short-name "FinOps" \
  --action email TeamEmail team@company.com \
  2>/dev/null || echo "Action group may already exist"

# 8. Create budget with alerts at multiple thresholds
az costmanagement budget create \
  --scope "/subscriptions/${SUBSCRIPTION_ID}" \
  --name "monthly-budget" \
  --category "Cost" \
  --amount 5000 \
  --time-period Start="2025-01-01T00:00:00Z" End="2025-12-31T23:59:59Z" \
  --time-grain Monthly 2>/dev/null || echo "Budget may already exist"

# 9. Create metric alerts for high spending
az monitor metrics alert create \
  --name "HighSpendAlert-75Percent" \
  --resource-group "${RESOURCE_GROUP}" \
  --scopes "/subscriptions/${SUBSCRIPTION_ID}" \
  --description "Alert when spending reaches 75% of budget" \
  --condition "total > 3750" \
  --window-size PT1H \
  --evaluation-frequency PT1H \
  --action "${ACTION_GROUP_NAME}" 2>/dev/null || echo "Alert may already exist"

# 10. Create resource groups with cost allocation tags
for ENV in production staging development; do
  az group create \
    --name "workload-${ENV}-rg" \
    --location eastus \
    --tags Environment="${ENV}" CostCenter="cc-001" Owner="team@company.com" \
    2>/dev/null || echo "RG workload-${ENV}-rg exists"
done

# 11. Create Azure Policy for mandatory tagging
POLICY_RULE='{
  "if": {
    "field": "tags",
    "exists": "false"
  },
  "then": {
    "effect": "deny"
  }
}'

az policy definition create \
  --name "require-resource-tags" \
  --description "Enforce mandatory resource tagging" \
  --mode Indexed \
  --rules "${POLICY_RULE}" \
  2>/dev/null || echo "Policy definition may exist"

# 12. Create Power BI or Grafana dashboard for cost visualization
# This typically requires manual setup, but export configuration is ready
echo "Cost export configured. Set up Power BI dashboard manually at Azure Portal"

Verification Steps:

# 1. Verify storage account setup
az storage account list --resource-group "${RESOURCE_GROUP}" --query '[*].[name,kind]' --output table

# 2. Verify cost export configuration
az costmanagement export list --scope "/subscriptions/${SUBSCRIPTION_ID}" \
  --query '[*].[name,definition.type,schedule.status]' --output table

# 3. Verify Log Analytics workspace
az monitor log-analytics workspace list \
  --resource-group "${RESOURCE_GROUP}" \
  --query '[*].[name,provisioningState]' --output table

# 4. Verify budget exists
az costmanagement budget list --scope "/subscriptions/${SUBSCRIPTION_ID}" \
  --query '[*].[name,amount,category]' --output table

# 5. Verify action group
az monitor action-group list --resource-group "${RESOURCE_GROUP}" \
  --query '[*].[name,shortName]' --output table

# 6. Test cost data extraction (run after first export completes)
az monitor log-analytics query \
  --workspace "${WORKSPACE_ID}" \
  --analytics-query "AzureBillingData | summarize TotalCost=sum(ChargeAmount) by BillingPeriod" \
  2>/dev/null || echo "Run after first cost export completes"

# 7. Check tag compliance
az policy state summarize \
  --resource-group "${RESOURCE_GROUP}"
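
With the tagging policy from step 11 in place, Resource Graph (one of the Key Azure Tools above) can surface resources that are still missing cost allocation tags. A minimal sketch, assuming the resource-graph CLI extension is installed:

# az extension add --name resource-graph
az graph query -q "Resources | where isnull(tags['CostCenter']) or isnull(tags['Environment']) | project name, type, resourceGroup | limit 50" --output table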

1.3 GCP Cost Visibility Setup

Prerequisites:

  • GCP Project with billing enabled
  • Organization or Billing Account admin role
  • BigQuery API enabled
  • Cloud Logging API enabled
  • Sufficient IAM permissions

Key GCP Tools:

  • Cloud Billing: Cost analysis and budgeting
  • BigQuery: Billing data analysis and custom queries
  • Data Studio: Visualize billing data
  • Recommender API: Automatic optimization recommendations
  • Cloud Monitoring: Metrics and alerting
  • Cloud Audit Logs: Resource creation and modification tracking

Steps:

# 1. Set up environment variables
PROJECT_ID=$(gcloud config get-value project)
BILLING_ACCOUNT=$(gcloud billing accounts list --format='value(name)' | head -1)
echo "Working with project: $PROJECT_ID, Billing: $BILLING_ACCOUNT"

# 2. Enable required APIs
gcloud services enable \
  cloudresourcemanager.googleapis.com \
  cloudbilling.googleapis.com \
  bigquery.googleapis.com \
  logging.googleapis.com \
  monitoring.googleapis.com \
  recommender.googleapis.com \
  --project="${PROJECT_ID}"

# 3. Create BigQuery dataset for billing data
bq mk \
  --dataset \
  --description="FinOps billing and cost analysis" \
  --location=US \
  --default_table_expiration=7776000 \
  finops_dataset

# 4. Create an overall monthly budget with alert thresholds
# (threshold-rule percent values are fractions: 0.5 = 50%)
gcloud billing budgets create \
  --billing-account="${BILLING_ACCOUNT}" \
  --display-name="Monthly-Overall-Budget" \
  --budget-amount=5000 \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.75 \
  --threshold-rule=percent=1.0

# 5. Link BigQuery billing export (may require manual setup via console for first-time)
echo "Enabling BigQuery export for billing data..."
# This requires at least one budget to be created; export happens automatically

# 6. Optionally create separate per-threshold budgets
# (percent values are fractions: 0.9 = 90%)
for THRESHOLD in 0.5 0.75 0.9 1.0; do
  gcloud billing budgets create \
    --billing-account="${BILLING_ACCOUNT}" \
    --display-name="Alert-${THRESHOLD}" \
    --budget-amount=5000 \
    --threshold-rule=percent="${THRESHOLD}" \
    2>/dev/null || echo "Budget threshold ${THRESHOLD} may exist"
done

# 7. Create monitoring alert policy for billing
gcloud alpha monitoring policies create \
  --notification-channels=[CHANNEL_ID] \
  --display-name="High-Daily-Cost" \
  --condition-display-name="Daily cost exceeds threshold" \
  --condition-threshold-value=200 \
  --condition-threshold-duration=300s \
  2>/dev/null || echo "Alert policy may exist"

# 8. Create Pub/Sub topic for budget alerts
gcloud pubsub topics create finops-budget-alerts \
  --project="${PROJECT_ID}" 2>/dev/null || echo "Topic may exist"

# Create subscription
gcloud pubsub subscriptions create finops-budget-alerts-sub \
  --topic=finops-budget-alerts \
  --project="${PROJECT_ID}" 2>/dev/null || echo "Subscription may exist"

# 9. Create Cloud Function to process budget alerts
cat > process_budget_alert.py << 'EOF'
import json
import functions_framework
from google.cloud import logging_v2

@functions_framework.cloud_event
def process_budget_alert(cloud_event):
    """Process budget alert from Pub/Sub"""
    import base64
    
    pubsub_message = base64.b64decode(cloud_event.data["message"]["data"]).decode()
    budget_alert = json.loads(pubsub_message)
    
    logging_client = logging_v2.Client()
    logger = logging_client.logger('finops-budget-alerts')
    
    if budget_alert.get('budgetExceeded'):
        logger.log_struct({
            'severity': 'CRITICAL',
            'message': 'Budget threshold exceeded',
            'budget_name': budget_alert.get('budgetDisplayName'),
            'alert_threshold': budget_alert.get('alertThresholdExceeded'),
        })
        # Could also trigger remediation actions here
    
    return 'Alert processed', 200
EOF

gcloud functions deploy process-budget-alert \
  --runtime python39 \
  --trigger-topic finops-budget-alerts \
  --entry-point process_budget_alert \
  --project="${PROJECT_ID}" 2>/dev/null || echo "Function may exist"

# 10. Pull machine-type right-sizing recommendations via the Recommender API
# (this recommender is zonal; repeat for each zone in use)
gcloud recommender recommendations list \
  --recommender=google.compute.instance.MachineTypeRecommender \
  --location=us-central1-a \
  --project="${PROJECT_ID}" \
  --format=json > machine_type_recommendations.json

# 11. Create Cloud Monitoring dashboard
gcloud monitoring dashboards create --config='{
  "displayName": "FinOps-Cost-Dashboard",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Estimated Monthly Cost",
          "xyChart": {
            "chartOptions": {"mode": "COLOR"},
            "timeSeries": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "metric.type=\"billing.googleapis.com/gcp_billing_export_v1\" resource.type=\"billing_export\"",
                  "aggregation": {"alignmentPeriod": "86400s", "perSeriesAligner": "ALIGN_SUM"}
                }
              }
            }]
          }
        }
      }
    ]
  }
}' 2>/dev/null || echo "Dashboard may exist"

# 12. Set up Cloud Audit Logs for compliance
gcloud logging sinks create finops-audit-sink \
  bigquery.googleapis.com/projects/"${PROJECT_ID}"/datasets/audit_logs \
  --log-filter='protoPayload.serviceName="compute.googleapis.com" OR protoPayload.serviceName="storage.googleapis.com"' \
  --project="${PROJECT_ID}" 2>/dev/null || echo "Sink may exist"

Verification Steps:

# 1. Verify APIs are enabled
gcloud services list --enabled --filter="name:(billing|bigquery|logging|monitoring)" \
  --project="${PROJECT_ID}"

# 2. Verify BigQuery dataset
bq ls -d --project_id="${PROJECT_ID}" | grep finops_dataset

# 3. Verify budgets
gcloud billing budgets list --billing-account="${BILLING_ACCOUNT}" \
  --format='table(displayName, amount.specifiedAmount.units, name)'

# 4. Verify Pub/Sub resources
gcloud pubsub topics list --project="${PROJECT_ID}" | grep finops

# 5. Query billing data (wait 24-48 hours after enabling export)
bq query --nouse_legacy_sql \
  'SELECT SUM(cost) as total_cost, DATE(usage_start_time) as date 
   FROM finops_dataset.gcp_billing_export_v1 
   WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
   GROUP BY date ORDER BY date DESC LIMIT 30'

# 6. Check recommendations
gcloud recommender recommendations list \
  --recommender=google.compute.instances.MachineTypeRecommender \
  --location=global --project="${PROJECT_ID}" \
  --format='table(name, description, stateInfo.state, content.overview)'
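
Before relying on the budget pipeline, it is worth pushing a synthetic alert through Pub/Sub to confirm the Cloud Function fires end to end. A sketch with a hand-crafted payload (the field names simply mirror what the function reads, not an official schema):

# Publish a fake budget notification to the alerts topic
gcloud pubsub topics publish finops-budget-alerts \
  --project="${PROJECT_ID}" \
  --message='{"budgetDisplayName": "Monthly-Overall-Budget", "alertThresholdExceeded": 0.75, "budgetExceeded": true}'

# Confirm the function processed it
gcloud functions logs read process-budget-alert --project="${PROJECT_ID}" --limit=20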

Phase 2: Cost Optimization (Week 3-5)

2.1 AWS Reserved Instance & Savings Plan Strategy

Reserved Instance Types:

  • Compute Savings Plans: 10-17% discount (1-year), 20-25% (3-year) - flexible across instance types/sizes/regions
  • EC2 Instance Savings Plans: 19% discount (1-year), 29% (3-year) - specific to instance family/region
  • EC2 Reserved Instances: 20-40% discount - most restrictive but highest savings
  • RDS/DynamoDB Reserved Capacity: 30-35% savings for database workloads

Strategy:

  1. Analyze 12 months of usage (don't rely on current traffic alone)
  2. Use Compute Optimizer for recommendations (ML-based)
  3. Mix RI/Savings Plans based on flexibility needs
  4. Reserve only baseline stable workloads (20-70% of capacity); a sizing sketch follows this list
  5. Use On-Demand for spiky/variable workloads
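
A rough way to size the baseline in step 4 is to take a low percentile of recent daily compute spend; the sketch below uses the 20th percentile of the last 30 days as a conservative starting point (adjust the percentile and service filter to taste):

# Daily EC2 compute spend for the last 30 days; 20th percentile ~ "always-on" baseline
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Elastic Compute Cloud - Compute"]}}' \
  --query 'ResultsByTime[*].Total.UnblendedCost.Amount' \
  --output text | tr '\t' '\n' | sort -n | awk '
    { v[NR] = $1 }
    END { i = int(NR * 0.2); if (i < 1) i = 1; print "20th percentile daily spend:", v[i] }'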

Steps:

# 1. Comprehensive instance analysis
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Get detailed instance metrics (last 30 days)
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,LaunchTime,Tags[?Key==`Name`].Value|[0]]' \
  --output table > running_instances.txt

# Collect CPU and Network metrics
for INSTANCE_ID in $(aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query 'Reservations[*].Instances[*].InstanceId' --output text); do
  echo "Analyzing $INSTANCE_ID..."
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="${INSTANCE_ID}" \
    --statistics Average,Maximum \
    --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 3600 \
    --query 'Datapoints | [avg(@[*].Average), max(@[*].Maximum)]' \
    --output text >> instance_metrics.txt
done

# 2. Get Compute Optimizer recommendations (ML-based analysis)
aws compute-optimizer get-ec2-instance-recommendations \
  --recommendation-preferences \
    includeExistingRecommendations=true,lookBackPeriod=THIRTY_DAYS \
  --query 'instanceRecommendations[*].[instanceId,currentInstanceType,recommendationOptions[0].instanceType,recommendationOptions[0].savingsOpportunity.percentage]' \
  --output table > right_sizing_recommendations.txt

# 3. Analyze Savings Plans coverage
aws ce get-savings-plans-coverage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics CoverageBenefit,OnDemandCost,SavingsPlansAmortizedCost \
  --group-by Type=DIMENSION,Key=REGION

# 4. Get Reserved Instance recommendations
aws ce get-reservation-purchase-recommendation \
  --service "Amazon Elastic Compute Cloud - Compute" \
  --lookback-period-in-days THIRTY_DAYS \
  --payment-option ALL_UPFRONT \
  --term-in-years THREE_YEARS \
  --filter '{
    "Dimensions": {
      "Key": "REGION",
      "Values": ["us-east-1"]
    }
  }' \
  --query 'Recommendations[0].RecommendationDetails[*].[
    InstanceDetails.EC2InstanceDetails.InstanceType,
    RecommendedNumberOfInstancesToPurchase,
    EstimatedMonthlySavingsAmount,
    EstimatedMonthlyOnDemandCost
  ]' --output table

# 5. Purchase Compute Savings Plans (most flexible)
# First, identify the right amount to reserve (typically 30-50% of on-demand spend)
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["EC2 - Compute"]
    }
  }' \
  --query 'ResultsByTime[*].Total.UnblendedCost' --output text

echo "Compute Savings Plans pricing available at: https://aws.amazon.com/savingsplans/pricing/"
echo "Purchase Savings Plans manually via AWS Console or via AWS Marketplace"

# 6. For aggressive optimization, purchase 1-3 year RIs for baseline workloads
# Example: 10 t3.medium instances in us-east-1
aws ec2 purchase-reserved-instances-offering \
  --reserved-instances-offering-id "438012d3-644d-4496-9d84-1234567890ab" \
  --instance-count 10 \
  --dry-run 2>&1 | head -20 || echo "Remove --dry-run to execute purchase"

# 7. Monitor RI/Savings Plans utilization
aws ce get-reservation-coverage \
  --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost,BlendedCost \
  --group-by Type=DIMENSION,Key=LINKED_ACCOUNT \
  --query 'CoveragesByTime[*].[
    TimePeriod.Start,
    Total.CoverageHours.OnDemandHours,
    Total.CoverageHours.ReservedHours,
    Total.CoverageNormalizedUnits.UnderCoverageQuantity
  ]' --output table

# 8. Set up CloudWatch alerts for low RI utilization
aws cloudwatch put-metric-alarm \
  --alarm-name "RI-Utilization-Low-Alert" \
  --alarm-description "Alert if RI utilization drops below 50%" \
  --metric-name EstimatedReservedInstancesNormalizedUnitsUtilization \
  --namespace AWS/Billing \
  --statistic Average \
  --period 86400 \
  --threshold 50 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 1

Verification:

# 1. Verify RI/Savings Plans are applied
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '7 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics AmortizedCost,NetAmortizedCost,NetUnblendedCost \
  --query 'ResultsByTime[].{Date: TimePeriod.Start, Cost: Total.NetAmortizedCost.amount}' \
  --output table

# 2. Show actual savings achieved
aws ce get-savings-plans-utilization-details \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --query 'SavingsPlansUtilizationsByTime[*].[
    TimePeriod.Start,
    Total.UtilizationPercentage,
    Total.AmortizedCommitment,
    Total.TotalCommitmentToDate
  ]' --output table

# 3. Track monthly savings trend
for MONTH in {1..6}; do
  START=$(date -d "${MONTH} months ago" +%Y-%m-01)
  END=$(date -d "${MONTH} months ago +1 month -1 day" +%Y-%m-%d)
  SAVINGS=$(aws ce get-reservation-coverage \
    --time-period Start="${START}",End="${END}" \
    --query 'CoveragesByTime[*].Total.SavingsMonthly.amount' --output text)
  echo "${START}: ${SAVINGS}"
done

2.2 Azure Reserved Instance & Savings Plan Strategy

Reservation Types:

  • Reserved Instances: 30-35% discount (1-year), 60-70% (3-year) - specific to VM size/region
  • Azure Hybrid Benefit: Up to 40% additional savings (bring your own license); see the eligibility check after this list
  • Spot VMs: 50-90% discount - for fault-tolerant workloads
  • Commitment-based discounts: Database and App Service discounts
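
To see where the Azure Hybrid Benefit from the list above is not yet applied, the check below lists Windows VMs whose licenseType is still unset (field names as returned by az vm list):

# Windows VMs not yet using Azure Hybrid Benefit
az vm list \
  --query "[?storageProfile.osDisk.osType=='Windows' && licenseType==null].{Name:name, ResourceGroup:resourceGroup, Size:hardwareProfile.vmSize}" \
  --output table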

Steps:

# 1. Set up environment
SUBSCRIPTION_ID=$(az account show --query id --output tsv)
RESOURCE_GROUP="finops-rg"

# 2. Analyze current VM deployment and costs
az vm list --query '[*].[name,hardwareProfile.vmSize,osProfile.osType]' --output table

# 3. Get reservation recommendations
az reservations recommendations list \
  --resource-group "${RESOURCE_GROUP}" \
  --recommendation-location eastus \
  --filter "properties.recommendationKind eq 'Recommended'"

# 4. Purchase 3-year reserved instances for stable workloads
# Example: Reserve 5 Standard_D2s_v3 instances in eastus
RESERVATION_ORDER=$(az reservations reservation-order create \
  --sku "Standard_D2s_v3" \
  --location "eastus" \
  --term "P3Y" \
  --billing-scope "/subscriptions/${SUBSCRIPTION_ID}" \
  --quantity 5 \
  --friendly-name "prod-vm-reservation" \
  --query 'name' --output tsv 2>/dev/null)

echo "Reservation Order: ${RESERVATION_ORDER}"

# 5. Enable Azure Hybrid Benefit for Windows Server and SQL Server licenses
az vm update \
  --resource-group production-rg \
  --name prod-vm-01 \
  --license-type Windows_Server \
  --no-wait

az vm update \
  --resource-group production-rg \
  --name prod-vm-02 \
  --license-type SLES_BYOS \
  --no-wait

# 6. Monitor reservation utilization
az reservations reservations list \
  --query '[*].[id, name, expiryDate, utilization.utilized, utilization.trend, utilization.aggregated_kind]' \
  --output table

# 7. Set up spending alerts based on reservation thresholds
az monitor metrics alert create \
  --name "ReservationUtilizationLow" \
  --resource-group "${RESOURCE_GROUP}" \
  --scopes "/subscriptions/${SUBSCRIPTION_ID}" \
  --description "Alert if reservation utilization below 70%" \
  --condition "avg(ReservationUtilizationPercentage) < 70" \
  --window-size PT1H \
  --evaluation-frequency PT1H

# 8. Use Spot VMs for batch and non-critical workloads
az vm create \
  --resource-group dev-workloads \
  --name batch-job-vm \
  --image UbuntuLTS \
  --priority Spot \
  --eviction-policy Delete \
  --max-price 0.05 \
  --size Standard_B2s \
  --no-wait

# 9. Query reservation coverage
az reservations reservations list \
  --query "[?properties.sku.name=='Standard_D2s_v3'].{
    Name: name,
    SKU: properties.sku.name,
    Term: properties.term,
    ExpiryDate: properties.expiryDate,
    Utilization: properties.utilization.utilized
  }"

Verification:

# 1. Verify reservations are applied
az reservations reservations list \
  --query '[*].[name, provisioningState, expiryDate]' --output table

# 2. Check cost impact
az costmanagement query --timeframe MonthToDate \
  --type "Usage" \
  --dataset \
    granularity=Daily \
    aggregation='{
      totalCost: {name: PreTaxCost, function: Sum}
    }' \
  --scope "/subscriptions/${SUBSCRIPTION_ID}"

# 3. Verify Hybrid Benefit is active
az vm show --resource-group production-rg --name prod-vm-01 \
  --query 'licenseType' --output table

# 4. Monitor Spot VM evictions
az vm get-instance-view \
  --resource-group dev-workloads \
  --name batch-job-vm \
  --query 'instanceView.statuses[*].[code, displayStatus]' --output table

2.3 GCP Committed Use Discount (CUD) Strategy

Commitment Types:

  • Compute Engine commitments: 25-52% discount (1-year), 40-65% (3-year); a quick sizing sketch follows this list
  • Cloud SQL commitments: 25-35% discount
  • Datastore/Firestore commitments: 35% discount
  • Flexible Slots (BigQuery): 25% discount
  • GPU/TPU commitments: Available for ML workloads
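
As a quick first pass at sizing a Compute Engine commitment, the sketch below sums vCPUs across currently running instances by parsing the machine-type name; treat it as a point-in-time snapshot and confirm against 30-day usage before committing:

# Works for standard families whose names end in the vCPU count (n1-standard-4, e2-standard-8, ...)
gcloud compute instances list \
  --filter="status=RUNNING" \
  --format="value(machineType.basename())" | \
  awk -F- '{ if ($NF ~ /^[0-9]+$/) total += $NF } END { print "Running vCPUs (snapshot):", total }'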

Steps:

# 1. Set environment
PROJECT_ID=$(gcloud config get-value project)
BILLING_ACCOUNT=$(gcloud billing accounts list --format='value(name)' | head -1)

# 2. Analyze current compute usage
gcloud compute instances list \
  --format='table(name,zone,machineType.basename(),status)' \
  > compute_instances.txt

# Collect CPU and memory metrics
for INSTANCE in $(gcloud compute instances list --format='value(name)'); do
  for ZONE in $(gcloud compute instances list --format='value(zone)' --filter="name=${INSTANCE}"); do
    echo "Analyzing ${INSTANCE}..."
    gcloud compute instances get-serial-port-output "${INSTANCE}" \
      --zone="${ZONE}" | grep -i "cpu\|memory" || true
  done
done

# 3. Get Recommender recommendations (ML-based sizing)
# (zonal recommender: repeat for each zone in use)
gcloud recommender recommendations list \
  --recommender=google.compute.instance.MachineTypeRecommender \
  --location=us-central1-a \
  --project="${PROJECT_ID}" \
  --format='table(
    name,
    description,
    content.overview.resourceName,
    stateInfo.state,
    primaryImpact.costProjection.cost.units
  )'

# 4. Analyze 30-day usage for commitment planning
# Query usage data from BigQuery
bq query --nouse_legacy_sql \
  'SELECT 
    resource.name,
    resource.location,
    DATE(usage_time) as date,
    ROUND(SUM(CAST(compute.vcpu_seconds AS FLOAT64) / 3600 / 24), 2) as avg_vcpus,
    ROUND(SUM(CAST(compute.memory_byte_seconds AS FLOAT64) / 3600 / 24 / 1024 / 1024 / 1024), 2) as avg_memory_gb
  FROM `'"${PROJECT_ID}"'.gcp_billing_export_v1.gcp_billing_export_*`
  WHERE _TABLE_SUFFIX BETWEEN 
    FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE("%Y%m%d", CURRENT_DATE())
  GROUP BY resource.name, resource.location, date
  ORDER BY avg_vcpus DESC'

# 5. Purchase Compute Engine commitments (1-year for fast ROI)
# Commitments are resource-based: commit to vCPU/memory totals, e.g. the
# footprint of ten n1-standard-4 VMs (40 vCPU / 150 GB) in us-central1
gcloud compute commitments create web-server-commitment \
  --plan=12-month \
  --resources=vcpu=40,memory=150GB \
  --region=us-central1 \
  --project="${PROJECT_ID}"

# 6. Purchase 3-year commitments for stable production workloads
# (higher savings, roughly 40-65% discount); e.g. two m1-ultramem-40 VMs
gcloud compute commitments create prod-db-commitment \
  --plan=36-month \
  --type=memory-optimized \
  --resources=vcpu=80,memory=1922GB \
  --region=us-central1 \
  --project="${PROJECT_ID}"

# 7. GPU/TPU commitments are also available for ML workloads; see
# "gcloud compute commitments create --help" for the accelerator resource flags

# 8. Set up commitment alerts and dashboards
gcloud monitoring dashboards create --config='{
  "displayName": "GCP-Commitment-Dashboard",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 12,
        "height": 4,
        "widget": {
          "title": "Active Commitments",
          "scorecard": {
            "timeSeriesQuery": {
              "timeSeriesFilter": {
                "filter": "resource.type=\"commitment\" AND metric.type=\"compute.googleapis.com/commitment/commitment_utilization\""
              }
            }
          }
        }
      }
    ]
  }
}' 2>/dev/null || echo "Dashboard config syntax may need adjustment"

# 9. Monitor commitment usage and buy more if needed
gcloud compute commitments list \
  --project="${PROJECT_ID}" \
  --format='table(
    name,
    plan,
    region,
    resourcesValues[].size,
    creationTimestamp,
    endTimestamp
  )'

# 10. Set up Pub/Sub alerts when commitments are running low
gcloud pubsub topics create gcp-commitment-alerts \
  --project="${PROJECT_ID}" 2>/dev/null || echo "Topic exists"

# Cloud Function to monitor commitments
cat > monitor_commitments.py << 'EOF'
import os
import json

import functions_framework
from google.cloud import compute_v1
from google.cloud import pubsub_v1

@functions_framework.cloud_event
def check_commitment_usage(cloud_event):
    """Monitor commitment usage and alert if low"""
    
    client = compute_v1.CommitmentsClient()
    project_id = os.environ.get('GCP_PROJECT')
    
    request = compute_v1.AggregatedListCommitmentsRequest(project=project_id)
    agg_list = client.aggregated_list(request=request)
    
    for zone, commitment_list in agg_list:
        for commitment in commitment_list.commitments:
            if commitment.status == 'ACTIVE':
                # Calculate utilization
                # Alert if utilization < 50%
                pass
EOF

# Deploy the function
gcloud functions deploy monitor-commitments \
  --runtime python39 \
  --trigger-topic gcp-commitment-alerts \
  --entry-point check_commitment_usage \
  --project="${PROJECT_ID}"

Verification:

# 1. List all active commitments
gcloud compute commitments list --project="${PROJECT_ID}" \
  --format='table(name, plan, status, region, endTimestamp)'

# 2. Check commitment utilization
gcloud compute commitments describe <COMMITMENT_NAME> \
  --region=us-central1 --project="${PROJECT_ID}" \
  --format='json' | grep -i utilization

# 3. Calculate savings from commitments
bq query --nouse_legacy_sql \
  'SELECT 
    -- credits (including committed use discounts) are negative amounts that offset cost
    ROUND(SUM((SELECT IFNULL(SUM(c.amount), 0) FROM UNNEST(credits) c)), 2) as credit_savings,
    ROUND(SUM(CAST(cost AS FLOAT64)), 2) as total_cost
  FROM `'"${PROJECT_ID}"'.gcp_billing_export_v1.gcp_billing_export_*`
  WHERE _TABLE_SUFFIX BETWEEN 
    FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE("%Y%m%d", CURRENT_DATE())'

# 4. Verify no resource is over-committed
gcloud compute instances list --project="${PROJECT_ID}" \
  --filter='status:RUNNING' --format='json' | \
  jq '.[] | select(.machineType | contains("n1-standard")) | .name'

2.4 Instance Right-Sizing

# AWS: Using Compute Optimizer
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[?recommendation==`Underprovisioned`]' \
  --output table

# Find underutilized instances (CPU < 20%, Memory < 30%)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx

# Resize instance (stop first, then change type)
aws ec2 stop-instances --instance-ids i-xxxxxxxxx
aws ec2 modify-instance-attribute \
  --instance-id i-xxxxxxxxx \
  --instance-type "{\"Value\": \"t3.small\"}"
aws ec2 start-instances --instance-ids i-xxxxxxxxx
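
The same right-sizing flow on Azure and GCP, for parity (resource group, VM, and instance names below are placeholders):

# Azure: resize a VM in place (it may need to be deallocated if the
# target size is not available on the current hardware cluster)
az vm resize \
  --resource-group production-rg \
  --name prod-vm-01 \
  --size Standard_B2s

# GCP: the machine type can only be changed while the instance is stopped
gcloud compute instances stop my-instance --zone=us-central1-a
gcloud compute instances set-machine-type my-instance \
  --zone=us-central1-a \
  --machine-type=e2-small
gcloud compute instances start my-instance --zone=us-central1-a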

2.5 Storage Optimization

# Create S3 lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "Id": "Archive-old-data",
        "Status": "Enabled",
        "Prefix": "logs/",
        "Transitions": [
          {
            "Days": 30,
            "StorageClass": "STANDARD_IA"
          },
          {
            "Days": 90,
            "StorageClass": "GLACIER"
          }
        ],
        "Expiration": {
          "Days": 365
        }
      }
    ]
  }'

# Find and delete orphaned snapshots
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?VolumeSize==`100` && StartTime<`2024-01-01`]' \
  --output json

# Delete unused volumes
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[*].[VolumeId,Size,CreateTime]'
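
Once the unattached volumes above have been reviewed (and snapshotted if the data matters), a cautious cleanup loop can print the delete commands for a second pair of eyes before anything is removed:

# Print (do not execute) delete commands for unattached EBS volumes
for VOL in $(aws ec2 describe-volumes \
    --filters "Name=status,Values=available" \
    --query 'Volumes[*].VolumeId' --output text); do
  echo "aws ec2 delete-volume --volume-id ${VOL}"
done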

Phase 3: Governance & Automation (Week 6-8)

3.1 AWS Policy Enforcement

# Create Lambda function for resource cleanup
cat > cleanup.py << 'EOF'
import boto3
import json
from datetime import datetime, timedelta

ec2 = boto3.client('ec2')
s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Terminate stopped instances launched more than 30 days ago.
    # (LaunchTime is a conservative proxy; the exact stop time would have to be
    # parsed out of StateTransitionReason.)
    response = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['stopped']}]
    )

    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            if datetime.now() - instance['LaunchTime'].replace(tzinfo=None) > timedelta(days=30):
                ec2.terminate_instances(InstanceIds=[instance['InstanceId']])
                print(f"Terminated {instance['InstanceId']}")
    
    return {
        'statusCode': 200,
        'body': json.dumps('Cleanup completed')
    }
EOF

# Deploy Lambda
aws lambda create-function \
  --function-name FinOps-Cleanup \
  --runtime python3.9 \
  --role arn:aws:iam::ACCOUNT_ID:role/lambda-role \
  --handler cleanup.lambda_handler \
  --zip-file fileb://cleanup.zip

# Schedule with EventBridge (daily)
aws events put-rule \
  --name finops-cleanup-schedule \
  --schedule-expression "cron(0 2 * * ? *)"
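
The schedule only takes effect once the Lambda is attached as a target and EventBridge is allowed to invoke it; a sketch using the same placeholders as above (ACCOUNT_ID, us-east-1):

# Attach the cleanup Lambda as the rule target
aws events put-targets \
  --rule finops-cleanup-schedule \
  --targets "Id"="finops-cleanup","Arn"="arn:aws:lambda:us-east-1:ACCOUNT_ID:function:FinOps-Cleanup"

# Allow EventBridge to invoke the function
aws lambda add-permission \
  --function-name FinOps-Cleanup \
  --statement-id finops-cleanup-schedule \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn "arn:aws:events:us-east-1:ACCOUNT_ID:rule/finops-cleanup-schedule"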

3.2 Azure Policy Enforcement

# Create policy to require tagging
az policy definition create \
  --name "require-environment-tag" \
  --mode All \
  --rules '{
    "if": {
      "field": "tags[environment]",
      "exists": "false"
    },
    "then": {
      "effect": "deny"
    }
  }'

# Assign policy
az policy assignment create \
  --policy "require-environment-tag" \
  --scope "/subscriptions/{subscription-id}/resourcegroups/production-rg"

# Create auto-shutdown policy
az policy definition create \
  --name "auto-shutdown-dev-resources" \
  --rules '{
    "if": {
      "allOf": [
        {
          "field": "tags[environment]",
          "equals": "development"
        },
        {
          "field": "type",
          "equals": "Microsoft.Compute/virtualMachines"
        }
      ]
    },
    "then": {
      "effect": "deployIfNotExists",
      "details": {
        "type": "Microsoft.DevTestLab/schedules"
      }
    }
  }'
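
For individual development VMs, the lighter-weight alternative to the policy above is the built-in per-VM auto-shutdown schedule (VM and resource group names are the examples used earlier):

# Shut the batch VM down at 19:00 UTC every day
az vm auto-shutdown \
  --resource-group dev-workloads \
  --name batch-job-vm \
  --time 1900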

3.3 GCP Policy Enforcement

# Create org policy: block external IPs on VMs (also avoids per-IP charges)
cat > no-external-ip-policy.yaml << 'EOF'
constraint: constraints/compute.vmExternalIpAccess
listPolicy:
  allValues: DENY
EOF
gcloud resource-manager org-policies set-policy no-external-ip-policy.yaml \
  --project=PROJECT_ID

# Create budget with alerts
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Monthly-Budget" \
  --budget-amount=5000 \
  --threshold-rule=percent=0.75 \
  --threshold-rule=percent=1.0

# Set up Cloud Function for cost optimization
gcloud functions deploy finops-cleanup \
  --runtime python39 \
  --trigger-topic cost-optimization \
  --entry-point cleanup
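
The cleanup function only runs when something publishes to the cost-optimization topic; Cloud Scheduler can provide the daily trigger, mirroring the EventBridge schedule used on AWS. A sketch, assuming the Cloud Scheduler API is enabled in the project:

# Publish to the cost-optimization topic every day at 02:00
gcloud scheduler jobs create pubsub finops-cleanup-daily \
  --schedule="0 2 * * *" \
  --topic=cost-optimization \
  --message-body='{"action": "cleanup"}' \
  --location=us-central1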

Phase 4: Monitoring & Continuous Optimization

4.1 Cost Anomaly Detection

# AWS: Create an anomaly monitor (skip if already created in Phase 1)
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "service-spend-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Get anomalies
aws ce get-anomalies \
  --date-interval Start=2025-01-01,End=2025-01-31

# Azure: Create metric alert for anomalies
az monitor metrics alert create \
  --name "CostAnomalyAlert" \
  --resource-group finops-rg \
  --scopes "/subscriptions/{subscription-id}" \
  --condition "total SpendingAmount > Baseline + 20%"

4.2 Reporting & Dashboards

# AWS: Generate cost report
aws ce get-cost-and-usage \
  --time-period Start=2025-01-01,End=2025-01-31 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=CostCenter \
  --output json > cost_report.json

# Create email report
cat > send_report.sh << 'EOF'
#!/bin/bash
python3 << 'PYTHON'
import json
import subprocess
from datetime import datetime

# Get cost data
result = subprocess.run(['aws', 'ce', 'get-cost-and-usage', ...], capture_output=True, text=True)
costs = json.loads(result.stdout)

# Generate HTML report
html = "<html><body>"
html += f"<h1>Monthly Cost Report - {datetime.now().strftime('%B %Y')}</h1>"
for item in costs['ResultsByTime']:
    html += f"<p>{item}</p>"
html += "</body></html>"

# Send email
subprocess.run(['mail', '-s', 'Monthly FinOps Report', 'team@company.com'], input=html, text=True)
PYTHON
EOF

chmod +x send_report.sh
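
To deliver the report automatically on the first of each month, a cron entry on the reporting host is enough (the script path is a placeholder):

# Run at 08:00 on the 1st of every month
(crontab -l 2>/dev/null; echo "0 8 1 * * /opt/finops/send_report.sh") | crontab -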

Storage

S3 Cost Optimization

Implement tiered storage:

# Create lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket production-data \
  --lifecycle-configuration '{
    "Rules": [
      {
        "Id": "ArchiveOldLogs",
        "Status": "Enabled",
        "Prefix": "logs/",
        "Transitions": [
          {"Days": 30, "StorageClass": "STANDARD_IA"},
          {"Days": 60, "StorageClass": "GLACIER"},
          {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
        ],
        "Expiration": {"Days": 2555}
      }
    ]
  }'

# Monitor storage usage
aws s3 ls s3://production-data --recursive --summarize

Azure Storage Tiering

# Create storage account with tiering
az storage account create \
  --name finopsstorage \
  --resource-group finops-rg \
  --access-tier Hot

# Enable soft delete (blob delete retention)
az storage account blob-service-properties update \
  --account-name finopsstorage \
  --resource-group finops-rg \
  --enable-delete-retention true
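
The command above covers soft delete; the actual Hot to Cool to Archive transitions are configured through a lifecycle management policy. A sketch that mirrors the S3 rules (rule JSON follows the management-policy schema; adjust prefixes and day counts to your data):

# Move blobs under logs/ to Cool after 30 days, Archive after 90, delete after 365
cat > lifecycle-policy.json << 'EOF'
{
  "rules": [
    {
      "enabled": true,
      "name": "archive-old-logs",
      "type": "Lifecycle",
      "definition": {
        "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
        "actions": {
          "baseBlob": {
            "tierToCool": {"daysAfterModificationGreaterThan": 30},
            "tierToArchive": {"daysAfterModificationGreaterThan": 90},
            "delete": {"daysAfterModificationGreaterThan": 365}
          }
        }
      }
    }
  ]
}
EOF

az storage account management-policy create \
  --account-name finopsstorage \
  --resource-group finops-rg \
  --policy @lifecycle-policy.json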

Upgrades

Reserved Capacity Renewal

Review quarterly:

# AWS: Check RI expiration
aws ec2 describe-reserved-instances \
  --filters "Name=state,Values=active" \
  --query 'ReservedInstances[*].[ReservedInstancesId,InstanceType,End,State]'

# Purchase new RIs before expiration
aws ec2 purchase-reserved-instances-offering \
  --reserved-instances-offering-id rifxxxxxxxx \
  --instance-count 10

Disaster Recovery

Cost-Optimized DR Setup

Multi-region strategy:

# Primary region: full capacity
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t3.xlarge \
  --min-count 5 \
  --max-count 5 \
  --region us-east-1

# DR region: on-demand ready, scaled down
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t3.xlarge \
  --min-count 2 \
  --max-count 2 \
  --region us-west-2

# Use Reserved Instances in primary, On-demand/Spot in DR
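
To keep the DR standby pool cheap, the warm capacity can be requested as Spot at launch time; a sketch (the AMI ID is a placeholder, as above):

# DR region: request the standby capacity as Spot instances
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t3.xlarge \
  --min-count 2 \
  --max-count 2 \
  --instance-market-options 'MarketType=spot' \
  --region us-west-2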

Troubleshooting

Common Issues and Solutions

Issue 1: High Cloud Costs Without Obvious Root Cause

Symptoms:

  • Monthly bill 20%+ higher than expected
  • No recent major deployments
  • Multiple cost drivers unclear

Diagnostic Procedure:

# AWS Diagnosis
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# 1. Analyze costs by service in last 30 days
echo "=== Cost by Service (Last 30 Days) ==="
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[].{Date:TimePeriod.Start, Groups:Groups[?Keys[0]!="Unblended Cost"].{Service:Keys[0],Amount:Metrics.UnblendedCost.Amount}}' \
  --output table | sort -k3 -nr | head -20

# 2. Check for anomalies reported by Cost Anomaly Detection
echo "=== Cost Anomalies ==="
aws ce get-anomalies \
  --date-interval Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --query 'Anomalies[*].[DimensionValues,TotalExpectedSpend,TotalActualSpend,TotalImpact]' \
  --output table

# 3. Find newly created resources
echo "=== Resources Created in Last 7 Days ==="
for EVENT_NAME in RunInstances CreateDBInstance CreateFunction20150331; do
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue="${EVENT_NAME}" \
    --start-time $(date -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
    --max-results 50 \
    --query 'Events[*].[EventTime,EventName,Username]' --output table
done

# 4. Check for unattached/unused resources
echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[*].[VolumeId,Size,CreateTime,State]' --output table

echo "=== Unattached Elastic IPs ==="
aws ec2 describe-addresses \
  --filters "Name=association-state,Values=disassociated" \
  --query 'Addresses[*].[PublicIp,AllocationId,AssociationId]' --output table

echo "=== Stopped Instances (>30 days) ==="
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=stopped" \
  --query 'Reservations[*].Instances[?LaunchTime<=`'$(date -d '30 days ago' +%Y-%m-%dT%H:%M:%S)'`].[InstanceId,InstanceType,LaunchTime,BlockDeviceMappings[*].Ebs.VolumeId]' --output table

# 5. Check data transfer costs (common culprit)
echo "=== Data Transfer Analysis ==="
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost,UsageQuantity \
  --group-by Type=DIMENSION,Key=OPERATION \
  --filter '{"Dimensions":{"Key":"SERVICE","Values":["EC2 - Data Transfer"]}}' \
  --output table

# 6. Check for Reserved Instance underutilization
echo "=== RI Utilization ==="
aws ce get-reservation-coverage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --group-by Type=DIMENSION,Key=INSTANCE_TYPE \
  --query 'CoveragesByTime[*].Groups[?Keys[0]!="Unblended Cost"]' --output table

# 7. Check for unused NAT Gateways
echo "=== NAT Gateway Charges ==="
aws ec2 describe-nat-gateways \
  --filter "Name=state,Values=available" \
  --query 'NatGateways[*].[NatGatewayId,State,CreateTime,PublicIpAddress]' --output table

Azure Diagnosis:

# 1. Top cost drivers by service
az costmanagement query --timeframe LastMonth \
  --type "Usage" \
  --scope "/subscriptions/${SUBSCRIPTION_ID}" \
  --dataset \
    granularity=Daily \
    aggregation='{
      totalCost: {name: PreTaxCost, function: Sum}
    }' \
    grouping='[
      {type: "Dimension", name: "ServiceName"}
    ]'

# 2. Find unused resources
echo "=== Deallocated VMs (not deleted) ==="
az vm list -d --query "[?powerState=='VM deallocated'].{Name:name, ResourceGroup:resourceGroup}" --output table

echo "=== Unattached Disks ==="
az disk list --query "[?managedBy==null].{Name:name, SizeGb:diskSizeGb, TimeCreated:timeCreated}" --output table

echo "=== Public IPs Not Associated ==="
az network public-ip list --query "[?ipConfiguration==null].{Name:name, IpAddress:ipAddress, ProvisioningState:provisioningState}" --output table

# 3. Check data egress costs
az monitor metrics list-definitions \
  --resource <public-ip-name> \
  --resource-group finops-rg \
  --resource-type "Microsoft.Network/publicIPAddresses" \
  --query "[?contains(name.value, 'Bytes')].name.value"

GCP Diagnosis:

# 1. Top services by cost
bq query --nouse_legacy_sql \
  'SELECT 
    service.description,
    ROUND(SUM(CAST(cost AS FLOAT64)), 2) as total_cost,
    COUNT(*) as line_items
  FROM `'"${PROJECT_ID}"'.gcp_billing_export_v1.gcp_billing_export_*`
  WHERE _TABLE_SUFFIX BETWEEN 
    FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE("%Y%m%d", CURRENT_DATE())
  GROUP BY service.description
  ORDER BY total_cost DESC
  LIMIT 20'

# 2. Find idle resources
echo "=== Idle Compute Instances ==="
gcloud compute instances list --format=json | \
  jq '.[] | select(.status=="RUNNING") | 
  {name: .name, zone: .zone, creationTimestamp: .creationTimestamp}'

# 3. Check committed discounts usage
echo "=== Commitment Usage ==="
gcloud compute commitments list --project="${PROJECT_ID}" \
  --format='table(name, plan, status, creationTimestamp)'

Resolution Steps:

  1. Terminate unattached volumes and stopped instances
  2. Release unassociated IP addresses (see the sketch after this list)
  3. Verify RI/CUD utilization is >80%
  4. Check for data transfer - consider CloudFront/CDN
  5. Analyze recent changes via CloudTrail/Activity Logs
  6. Contact cloud support if cost increase >50%
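
For item 2, once the unassociated Elastic IPs have been confirmed unused, a loop can print the release commands for review before running them:

# Print release commands for Elastic IPs with no association
for ALLOC_ID in $(aws ec2 describe-addresses \
    --query 'Addresses[?AssociationId==null].AllocationId' --output text); do
  echo "aws ec2 release-address --allocation-id ${ALLOC_ID}"
done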

Issue 2: RI/Savings Plan Underutilization

Symptoms:

  • RI/CUD coverage <70%
  • Unused capacity "wasting" money

Diagnosis:

# AWS
aws ce get-reservation-coverage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --query 'CoveragesByTime[*].{
    Date: TimePeriod.Start,
    Coverage: Total.CoveragePercentage,
    UnderCovered: Total.CoverageHours.UnCoveredHours
  }' --output table

# Azure
az reservations reservations list --query "[*].[name, expiryDate, utilization.utilized]"

# GCP
gcloud compute commitments list --project="${PROJECT_ID}" \
  --format='table(name, status, region)' | while read name; do
  gcloud compute commitments describe "$name" --region=us-central1
done

Resolution:

  • Buy only 40-60% of peak capacity as reserved
  • Use On-Demand/Spot for variable workloads
  • Exchange/modify commitments to match actual usage
  • Set up automated scaling to match commitment capacity

Issue 3: Budget Alerts Not Triggering

AWS Resolution:

# 1. Verify budget exists
aws budgets describe-budgets \
  --account-id "${ACCOUNT_ID}" \
  --query 'Budgets[*].[BudgetName, BudgetStatus]'

# 2. Test manually with spike
# Create test resources temporarily to verify alerts trigger

# 3. Check SNS subscription confirmed
aws sns list-subscriptions-by-topic --topic-arn $TOPIC_ARN
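
A faster end-to-end check than spinning up test resources is to publish directly to the alert topic and confirm delivery (the email subscription must already be confirmed):

aws sns publish \
  --topic-arn "${TOPIC_ARN}" \
  --subject "FinOps budget alert test" \
  --message "Test notification from the FinOps runbook"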

Azure Resolution:

# Verify alert rule is enabled
az monitor metrics alert show \
  --name "HighSpendAlert-75Percent" \
  --resource-group "${RESOURCE_GROUP}" \
  --query 'enabled'

# Test alert (requires manual cost spike)

GCP Resolution:

# Verify Pub/Sub topic has subscribers
gcloud pubsub topics list-subscriptions finops-budget-alerts \
  --project="${PROJECT_ID}"

# Check Cloud Function is deployed
gcloud functions list --project="${PROJECT_ID}" | grep budget

Quick Reference: Provider-Specific Commands

AWS Quick Commands

# Get yesterday's costs
aws ce get-cost-and-usage \
  --time-period Start=$(date -d 'yesterday' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity DAILY --metrics UnblendedCost --group-by Type=DIMENSION,Key=SERVICE

# Stop all dev instances
aws ec2 stop-instances --instance-ids $(aws ec2 describe-instances --filters "Name=tag:Environment,Values=development" --query 'Reservations[*].Instances[*].InstanceId' --output text)

# Find idle RDS instances
aws rds describe-db-instances --query 'DBInstances[?DBInstanceStatus==`available`].[DBInstanceIdentifier,DBInstanceClass,Engine]'

Azure Quick Commands

# Get last month's daily costs for the subscription
az costmanagement query --timeframe LastMonth --type Usage \
  --scope "/subscriptions/${SUBSCRIPTION_ID}" --dataset granularity=Daily aggregation='{totalCost:{name:PreTaxCost,function:Sum}}'

# Stop all dev VMs
az vm list --query "[?tags.Environment=='development'].id" -o tsv | xargs -I {} az vm deallocate --ids {}

# Find unmanaged disks
az disk list --query "[?managedBy==null]"

GCP Quick Commands

# Get top services by cost
bq query --nouse_legacy_sql 'SELECT service.description, SUM(CAST(cost AS FLOAT64)) as total FROM `'"$PROJECT_ID"'.gcp_billing_export_v1.gcp_billing_export_*` GROUP BY service.description ORDER BY total DESC'

# Delete idle instances
gcloud compute instances list --filter="lastStopTimestamp<2025-01-01" --format='value(name,zone)' | while read name zone; do
  gcloud compute instances delete "$name" --zone="$zone"
done

# Check commitment coverage
gcloud compute commitments list --format='table(name, status, plan)'

With cost visibility, optimization, and governance in place, the FinOps infrastructure is fully operational.

Estimated implementation timeline: 8-12 weeks for full optimization.
Typical cost reduction: 25-40% with all initiatives implemented.