Building Highly Available AWS Infrastructure: Graceful Failure - Part 4

Your infrastructure scales perfectly, but what happens when your database goes down? Learn strategies for graceful degradation, from application-level error handling to Route 53 health checks and CloudFront functions. Turn ugly 500 errors into elegant maintenance pages.

You've done everything right. You followed Part 1 and built a highly available setup with ALB and Auto Scaling. You containerized with Part 2 using ECS. Maybe you even went serverless with Part 3 and Fargate.

Your application layer is bulletproof. It scales beautifully. Health checks are perfect. Your ALB is distributing traffic like a champ.

Then, at 2 AM on a Friday, your database goes down.

Suddenly, every single request returns a 500 error. Your perfectly scaled infrastructure becomes a perfectly scaled error generator. Your users see this:

500 Internal Server Error
The server encountered an internal error and was unable to complete your request.

Welcome to Part 4, where we talk about the harsh truth: No matter how well your applications scale, there are links in the chain that can still break.

🔗 The Weakest Link Problem

Your architecture looks like this:

User → CloudFront → ALB → Fargate Tasks → RDS Database
                                              ↓
                                          (💀 DOWN)

When that database fails:

  • Your Fargate tasks are healthy ✅
  • Your ALB is healthy ✅
  • Health checks pass (they only check /health, not the database) ✅
  • But every real request fails with 500 ❌

The problem: Your infrastructure health checks don't reflect your actual application health.

🎭 The User Experience Crisis

What your users see when your database is down:

<!-- What your users get: the app server's default error page, passed through by the ALB -->
<!DOCTYPE html>
<html>
<head><title>500 Internal Server Error</title></head>
<body>
<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx/1.21.6</center>
</body>
</html>

Problems with this:

  • Looks broken and unprofessional
  • No information about what's happening
  • No estimated time to resolution
  • No alternatives or status page link
  • Makes users think your entire site is broken

What you want users to see:

<!-- A graceful maintenance page -->
<!DOCTYPE html>
<html>
<head><title>Scheduled Maintenance</title></head>
<body style="font-family: Arial; text-align: center; padding: 50px;">
<h1>🔧 We'll be right back!</h1>
<p>We're currently performing scheduled maintenance.</p>
<p>We'll be back online shortly. Thank you for your patience!</p>
<p><a href="https://status.example.com">Check our status page</a></p>
</body>
</html>

💡 Solution 1: Application-Level Graceful Degradation

The first line of defense is your application itself.

Strategy: Fail Gracefully

Instead of crashing when the database is down, catch the error and return a pretty maintenance page:

# Python/Flask example
import os
from contextlib import closing
from functools import wraps

import psycopg2
from flask import Flask, jsonify, render_template

app = Flask(__name__)

# Global flag to track database availability
db_available = True

def get_db_connection():
    # DATABASE_URL is an assumed environment variable; the short timeout keeps
    # requests from hanging while the database is unreachable
    return psycopg2.connect(os.environ['DATABASE_URL'], connect_timeout=3)

def graceful_degradation(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        global db_available
        try:
            # Try to execute the route
            return f(*args, **kwargs)
        except psycopg2.OperationalError:
            db_available = False
            # Return a nice maintenance page instead of 500
            return render_template('maintenance.html'), 503
        except Exception as e:
            # Log the error for debugging
            app.logger.error(f"Unexpected error: {e}")
            return render_template('maintenance.html'), 503
    return decorated_function

@app.route('/api/users')
@graceful_degradation
def get_users():
    # This will fail gracefully if database is down
    # (psycopg2 runs queries through a cursor; closing() releases the connection)
    with closing(get_db_connection()) as conn, conn.cursor() as cur:
        cur.execute('SELECT * FROM users')
        users = cur.fetchall()
    return jsonify(users)

# Health check that ACTUALLY checks dependencies
@app.route('/health')
def health_check():
    global db_available

    # Check database connectivity
    try:
        with closing(get_db_connection()) as conn, conn.cursor() as cur:
            cur.execute('SELECT 1')
        db_available = True
        return {'status': 'healthy', 'database': 'connected'}, 200
    except Exception as e:
        db_available = False
        # Log the details server-side instead of exposing them on a public endpoint
        app.logger.error(f"Health check failed: {e}")
        # Return 503 so ALB marks as unhealthy
        return {'status': 'unhealthy', 'database': 'disconnected'}, 503

# Fallback route - show maintenance page for all other routes
@app.errorhandler(503)
def service_unavailable(e):
    return render_template('maintenance.html'), 503
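
The db_available flag in the example is written but never read. One possible use, sketched here under assumptions, is to skip the database call for a short cooldown after a failure so that requests do not each wait on a connection timeout. AvailabilityGate and its cooldown parameter are hypothetical names for illustration, not part of Flask or psycopg2.

```python
import time


class AvailabilityGate:
    """Remembers a recent dependency failure for `cooldown` seconds."""

    def __init__(self, cooldown=10.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock        # injectable so the gate can be tested without sleeping
        self.failed_at = None     # time of the most recent recorded failure

    def record_failure(self):
        self.failed_at = self.clock()

    def record_success(self):
        self.failed_at = None

    def should_try(self):
        """True if no failure is recorded or the cooldown has elapsed."""
        if self.failed_at is None:
            return True
        return self.clock() - self.failed_at >= self.cooldown
```

Inside the decorator, a route could call should_try() first and return the maintenance page immediately when it is False, then call record_failure() or record_success() depending on how the database call went.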

.NET 9 / ASP.NET Core Implementation

Here's the same concept in modern .NET:

// Program.cs - ASP.NET Core 9
using System.Data.Common;   // DbException
using Microsoft.EntityFrameworkCore;

var builder = WebApplication.CreateBuilder(args);

// Add services
builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseNpgsql(builder.Configuration.GetConnectionString("DefaultConnection")));
builder.Services.AddControllers();

var app = builder.Build();

// Global exception handler middleware
app.Use(async (context, next) =>
{
    try
    {
        await next(context);
    }
    catch (DbException ex)
    {
        context.RequestServices.GetRequiredService<ILogger<Program>>()
            .LogError(ex, "Database connection failed");
        
        context.Response.StatusCode = 503;
        context.Response.ContentType = "text/html";
        await context.Response.WriteAsync(GetMaintenancePageHtml());
    }
    catch (Exception ex)
    {
        context.RequestServices.GetRequiredService<ILogger<Program>>()
            .LogError(ex, "Unexpected error occurred");
        
        context.Response.StatusCode = 503;
        context.Response.ContentType = "text/html";
        await context.Response.WriteAsync(GetMaintenancePageHtml());
    }
});

app.MapControllers();

// Health check endpoint
app.MapGet("/health", async (AppDbContext db, ILogger<Program> logger) =>
{
    try
    {
        // Check database connectivity
        await db.Database.ExecuteSqlRawAsync("SELECT 1");
        
        return Results.Ok(new 
        { 
            status = "healthy", 
            database = "connected",
            timestamp = DateTime.UtcNow 
        });
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Health check failed - database unavailable");
        
        return Results.Problem(
            statusCode: 503,
            title: "Service Unhealthy",
            detail: "Database connection failed"
        );
    }
});

app.Run();

static string GetMaintenancePageHtml() => @"
<!DOCTYPE html>
<html lang=""en"">
<head>
    <meta charset=""UTF-8"">
    <meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">
    <title>Maintenance - We'll Be Right Back</title>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            display: flex;
            align-items: center;
            justify-content: center;
            height: 100vh;
            margin: 0;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
        }
        .container {
            text-align: center;
            padding: 2rem;
        }
        h1 { font-size: 3rem; margin: 0; }
        p { font-size: 1.2rem; opacity: 0.9; }
    </style>
</head>
<body>
    <div class=""container"">
        <h1>🔧 We'll Be Right Back!</h1>
        <p>We're experiencing technical difficulties. Please try again in a few minutes.</p>
        <p style=""font-size: 0.9rem; margin-top: 2rem;"">
            If this persists, contact support@acme.com
        </p>
    </div>
</body>
</html>";

Key .NET Features Used:

✅ Global exception middleware - Catches unhandled exceptions
✅ Minimal API health endpoint - Clean, simple health checks
✅ Entity Framework error handling - Gracefully handles DbException
✅ Inline HTML generation - No template engine needed for simple pages
✅ Structured logging - Integrates with ASP.NET Core logging

Alternative: Using Middleware Class

For larger applications, create a dedicated middleware:

// GracefulDegradationMiddleware.cs
using System.Data.Common;   // DbException

public class GracefulDegradationMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<GracefulDegradationMiddleware> _logger;
    
    public GracefulDegradationMiddleware(
        RequestDelegate next,
        ILogger<GracefulDegradationMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }
    
    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (DbException ex)
        {
            _logger.LogError(ex, "Database error on {Path}", context.Request.Path);
            await HandleFailureAsync(context, "Database temporarily unavailable");
        }
        catch (HttpRequestException ex)
        {
            _logger.LogError(ex, "External service error on {Path}", context.Request.Path);
            await HandleFailureAsync(context, "External service temporarily unavailable");
        }
    }
    
    private static async Task HandleFailureAsync(HttpContext context, string message)
    {
        context.Response.StatusCode = 503;
        context.Response.ContentType = "application/json";
        
        await context.Response.WriteAsJsonAsync(new
        {
            status = "service_unavailable",
            message = message,
            timestamp = DateTime.UtcNow
        });
    }
}

// Register in Program.cs
app.UseMiddleware<GracefulDegradationMiddleware>();

The Maintenance Page Template

<!-- templates/maintenance.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Maintenance - We'll Be Right Back</title>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Arial, sans-serif;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            display: flex;
            align-items: center;
            justify-content: center;
            min-height: 100vh;
            margin: 0;
            padding: 20px;
        }
        .container {
            text-align: center;
            max-width: 600px;
            background: rgba(255, 255, 255, 0.1);
            backdrop-filter: blur(10px);
            padding: 60px 40px;
            border-radius: 20px;
            box-shadow: 0 8px 32px rgba(0, 0, 0, 0.3);
        }
        h1 { font-size: 3em; margin: 0 0 20px; }
        p { font-size: 1.2em; margin: 15px 0; opacity: 0.9; }
        .status-link {
            display: inline-block;
            margin-top: 30px;
            padding: 12px 30px;
            background: white;
            color: #667eea;
            text-decoration: none;
            border-radius: 25px;
            font-weight: 600;
            transition: transform 0.2s;
        }
        .status-link:hover { transform: translateY(-2px); }
        .icon { font-size: 4em; margin-bottom: 20px; }
    </style>
</head>
<body>
    <div class="container">
        <div class="icon">🔧</div>
        <h1>We'll Be Right Back!</h1>
        <p>We're currently experiencing technical difficulties.</p>
        <p>Our team has been notified and is working to resolve the issue.</p>
        <p style="font-size: 0.9em; opacity: 0.7;">Estimated resolution time: 15-30 minutes</p>
        <a href="https://status.example.com" class="status-link">Check Status Page</a>
    </div>
    <script>
        // Auto-refresh every 30 seconds
        setTimeout(() => location.reload(), 30000);
    </script>
</body>
</html>

Pros and Cons

✅ Pros:

  • Full control: Customize the message, styling, and behavior
  • Context-aware: Different errors can show different messages
  • Fast response: No external dependencies, served directly from your app
  • Works everywhere: Functions regardless of your DNS or CDN setup
  • Can include logic: Show cached data, degraded functionality, etc.

❌ Cons:

  • Requires code changes: Every application needs to implement this
  • Still hits your infrastructure: Requests still go through ALB → instances → app
  • Resource usage: Even maintenance pages consume compute resources
  • Multiple apps: If you have microservices, each needs this logic
  • Not helpful if app completely crashes: Only works if app can catch errors

💡 Solution 2: Route 53 Health Checks with Failover

Take control at the DNS level before requests even reach your infrastructure.

Strategy: DNS Failover to Static Maintenance Site

Normal Operation:
User → DNS (app.example.com) → ALB → Your App

Database Down:
User → DNS (app.example.com) → S3 Static Site (Maintenance Page)

Setting It Up

1. Create a maintenance page in S3:

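# Create S3 bucket for maintenance page.
# For a Route 53 alias to an S3 website endpoint, the bucket name must match
# the record name exactly (here: app.example.com)
aws s3 mb s3://app.example.com

# Upload your maintenance page
aws s3 cp maintenance.html s3://app.example.com/index.html \
  --content-type "text/html" \
  --cache-control "no-cache, no-store, must-revalidate"

# Configure bucket for static website hosting
# (index.html doubles as the error document so every path shows the page)
aws s3 website s3://app.example.com \
  --index-document index.html \
  --error-document index.html

# Allow a public bucket policy (new buckets block public policies by default)
aws s3api put-public-access-block \
  --bucket app.example.com \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=false,RestrictPublicBuckets=false

# Make it public (or use CloudFront for better security).
# S3 website endpoints serve HTTP only, so HTTPS visitors need CloudFront in front.
aws s3api put-bucket-policy \
  --bucket app.example.com \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::app.example.com/*"
    }]
  }'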

2. Create Route 53 health check:

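# Health check that monitors your actual app health.
# It targets the ALB's own DNS name rather than app.example.com: after a
# failover, app.example.com resolves to S3, so a check against that name
# would keep failing and traffic would never return to the primary.
aws route53 create-health-check \
  --health-check-config '{
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "FullyQualifiedDomainName": "my-alb-123456.us-east-1.elb.amazonaws.com",
    "Port": 443,
    "RequestInterval": 30,
    "FailureThreshold": 3,
    "MeasureLatency": true,
    "EnableSNI": true
  }' \
  --caller-reference "app-health-check-$(date +%s)"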

3. Configure Route 53 failover records:

# Primary record (your main ALB)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "Primary",
        "Failover": "PRIMARY",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "my-alb-123456.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        },
        "HealthCheckId": "abc123-health-check-id"
      }
    }]
  }'

# Secondary record (S3 maintenance page)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "Secondary",
        "Failover": "SECONDARY",
        "AliasTarget": {
          "HostedZoneId": "Z3AQBSTGFYJSTF",
          "DNSName": "s3-website-us-east-1.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
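
The two records differ only in their role, their alias target, and the health check attached to the primary. A minimal Python sketch that builds the same change batch as a dict, using the placeholder IDs from the commands above; the helper names are made up for illustration, and UPSERT is used instead of CREATE so the batch can be applied more than once:

```python
import json


def failover_record(name, role, alias_zone_id, alias_dns_name, health_check_id=None):
    """Build one alias A record of a Route 53 failover pair."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.capitalize(),   # "Primary" / "Secondary"
        "Failover": role.upper(),             # "PRIMARY" / "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,
            "DNSName": alias_dns_name,
            "EvaluateTargetHealth": False,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(primary, secondary, action="UPSERT"):
    """Wrap both records in the structure change-resource-record-sets expects."""
    return {
        "Changes": [
            {"Action": action, "ResourceRecordSet": record}
            for record in (primary, secondary)
        ]
    }


batch = failover_change_batch(
    primary=failover_record(
        "app.example.com", "primary", "Z35SXDOTRQ7X7K",
        "my-alb-123456.us-east-1.elb.amazonaws.com",
        health_check_id="abc123-health-check-id",
    ),
    secondary=failover_record(
        "app.example.com", "secondary", "Z3AQBSTGFYJSTF",
        "s3-website-us-east-1.amazonaws.com",
    ),
)
print(json.dumps(batch, indent=2))
```

The printed JSON could be saved to a file and passed to the CLI with --change-batch file://, which keeps both records in one reviewable document.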

Pros and Cons

✅ Pros:

  • Completely offloads traffic: When failing over, zero load on your infrastructure
  • Works even if app crashes: DNS-level failover doesn't require your app to be running
  • Simple maintenance page: Just static HTML in S3
  • Cost-effective during outages: S3 hosting is pennies compared to running instances
  • Automatic failover: Route 53 detects failure and switches automatically

❌ Cons:

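  • Failover delay: Detection alone takes about RequestInterval × FailureThreshold (30s × 3 = 90s with the settings above) before DNS answers change
  • TTL complications: Clients and resolvers cache DNS for the TTL duration (typically 60-300 seconds), so some users keep hitting the failed endpoint after failover
  • All or nothing: Either all traffic goes to maintenance page or none
  • Limited customization: Static page can't show dynamic information
  • Health check costs: Route 53 health checks start at $0.50/month each, and optional features such as HTTPS and latency measurement are billed on top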
  • Not granular: Can't fail over specific routes, only entire domains
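
To make the delay concrete, here is a rough back-of-the-envelope estimate in Python. It is a simplification that assumes failover time is just detection time plus one full TTL of cached answers; the 60-second TTL is an assumed value, and resolvers that ignore TTLs can stretch the real figure further.

```python
def estimated_failover_seconds(request_interval, failure_threshold, record_ttl):
    """Rough estimate: time to detect the failure plus time for cached DNS answers to expire."""
    detection = request_interval * failure_threshold
    return detection + record_ttl


# Health check settings from above (30s interval, 3 failures), assumed 60s TTL
estimate = estimated_failover_seconds(request_interval=30, failure_threshold=3, record_ttl=60)
print(estimate)  # 150
```

Shortening the request interval or the failure threshold reduces the detection part, at the price of more sensitivity to brief blips.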

💡 Solution 3: CloudFront with Edge Functions

Intercept and handle errors at the edge, closest to your users.

Strategy: CloudFront Functions or Lambda@Edge

CloudFront sits in front of your entire infrastructure and can inspect/modify responses:

User → CloudFront (Edge Location) → ALB → Your App
           ↓
    (Detects 5xx error)
           ↓
    (Returns pretty maintenance page)

Option A: CloudFront Functions (Lightweight)

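CloudFront Functions run in under a millisecond and are well suited to simple transformations. One caveat worth checking against the current CloudFront documentation before relying on this option: CloudFront does not invoke edge functions on viewer-response events when the origin returns a status code of 400 or higher, so the function below may never run for origin 5xx responses. Lambda@Edge on origin-response (Option B) and custom error pages are not subject to that restriction: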

// CloudFront Function (viewer-response event)
function handler(event) {
    var response = event.response;
    var statusCode = response.statusCode;
    
    // If origin returned 5xx error, return maintenance page
    if (statusCode >= 500 && statusCode < 600) {
        return {
            statusCode: 503,
            statusDescription: 'Service Unavailable',
            headers: {
                'content-type': { value: 'text/html; charset=utf-8' },
                'cache-control': { value: 'no-cache, no-store, must-revalidate' }
            },
            body: `<!DOCTYPE html>
<html>
<head>
    <title>Maintenance - We'll Be Right Back</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            display: flex;
            align-items: center;
            justify-content: center;
            min-height: 100vh;
            margin: 0;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            text-align: center;
        }
        .container {
            background: rgba(255, 255, 255, 0.1);
            padding: 40px;
            border-radius: 20px;
            backdrop-filter: blur(10px);
        }
        h1 { font-size: 2.5em; margin: 0 0 20px; }
        p { font-size: 1.1em; margin: 10px 0; }
    </style>
</head>
<body>
    <div class="container">
        <h1>🔧 We'll Be Right Back!</h1>
        <p>We're experiencing technical difficulties.</p>
        <p>Our team is working to resolve the issue.</p>
        <p style="font-size: 0.9em; opacity: 0.8;">Please try again in a few minutes.</p>
    </div>
    <script>setTimeout(() => location.reload(), 30000);</script>
</body>
</html>`
        };
    }
    
    // Return original response if no error
    return response;
}

Deploying the function:

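# Create function
aws cloudfront create-function \
  --name error-handler \
  --function-config Comment="Handle 5xx errors gracefully",Runtime="cloudfront-js-1.0" \
  --function-code fileb://error-handler.js

# Publish function (--if-match takes the ETag returned by create-function)
aws cloudfront publish-function \
  --name error-handler \
  --if-match ETVABCDEF12345

# Associate with CloudFront distribution.
# update-distribution replaces the whole configuration, so fetch the current
# config and its ETag first
ETAG=$(aws cloudfront get-distribution-config --id E1234ABCD \
  --query 'ETag' --output text)
aws cloudfront get-distribution-config --id E1234ABCD \
  --query 'DistributionConfig' > distribution-config.json

# Edit distribution-config.json so DefaultCacheBehavior contains:
#
#   "FunctionAssociations": {
#     "Quantity": 1,
#     "Items": [{
#       "FunctionARN": "arn:aws:cloudfront::123456:function/error-handler",
#       "EventType": "viewer-response"
#     }]
#   }

aws cloudfront update-distribution \
  --id E1234ABCD \
  --if-match "$ETAG" \
  --distribution-config file://distribution-config.json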

Option B: Lambda@Edge (Full Power)

For more complex logic, use Lambda@Edge:

# Lambda@Edge function (origin-response event)
def lambda_handler(event, context):
    response = event['Records'][0]['cf']['response']
    status = int(response['status'])
    
    # If 5xx error, check if it's a database issue
    if 500 <= status < 600:
        # Could check CloudWatch metrics, or RDS status here
        # For simplicity, return maintenance page for all 5xx
        
        maintenance_page = """<!DOCTYPE html>
<html>
<head>
    <title>Maintenance</title>
    <style>
        body { 
            font-family: Arial, sans-serif; 
            text-align: center; 
            padding: 50px;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
        }
        .container { 
            max-width: 600px; 
            margin: 0 auto; 
            background: rgba(255,255,255,0.1);
            padding: 40px;
            border-radius: 20px;
        }
        h1 { font-size: 2.5em; }
    </style>
</head>
<body>
    <div class="container">
        <h1>🔧 Under Maintenance</h1>
        <p>We're currently performing maintenance.</p>
        <p>We'll be back shortly!</p>
    </div>
</body>
</html>"""
        
        return {
            'status': '503',
            'statusDescription': 'Service Unavailable',
            'headers': {
                'content-type': [{'key': 'Content-Type', 'value': 'text/html'}],
                'cache-control': [{'key': 'Cache-Control', 'value': 'no-cache'}]
            },
            'body': maintenance_page
        }
    
    return response
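
Edge functions are awkward to debug once deployed, so it can help to exercise the handler locally first. The sketch below uses a condensed copy of the handler and a hand-built event that contains only the fields the handler reads; real CloudFront events carry more fields, and origin_response_event is a hypothetical helper, not an AWS API.

```python
MAINTENANCE_HTML = "<!DOCTYPE html><html><body><h1>Under Maintenance</h1></body></html>"


def lambda_handler(event, context):
    """Condensed version of the origin-response handler shown above."""
    response = event["Records"][0]["cf"]["response"]
    if 500 <= int(response["status"]) < 600:
        return {
            "status": "503",
            "statusDescription": "Service Unavailable",
            "headers": {
                "content-type": [{"key": "Content-Type", "value": "text/html"}],
                "cache-control": [{"key": "Cache-Control", "value": "no-cache"}],
            },
            "body": MAINTENANCE_HTML,
        }
    return response


def origin_response_event(status):
    """Hand-built event holding only the fields the handler reads."""
    return {"Records": [{"cf": {"response": {"status": str(status), "headers": {}}}}]}


# A 502 from the origin is replaced, a 200 passes through untouched
replaced = lambda_handler(origin_response_event(502), None)
passed = lambda_handler(origin_response_event(200), None)
print(replaced["status"], passed["status"])  # 503 200
```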

CloudFront Custom Error Pages (Simplest Option)

CloudFront also supports custom error pages without any code:

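# update-distribution replaces the whole configuration, so fetch the current
# config and its ETag first
ETAG=$(aws cloudfront get-distribution-config --id E1234ABCD \
  --query 'ETag' --output text)
aws cloudfront get-distribution-config --id E1234ABCD \
  --query 'DistributionConfig' > distribution-config.json

# Edit distribution-config.json so it contains this block:
#
#   "CustomErrorResponses": {
#     "Quantity": 3,
#     "Items": [
#       { "ErrorCode": 500, "ResponsePagePath": "/maintenance.html", "ResponseCode": "503", "ErrorCachingMinTTL": 10 },
#       { "ErrorCode": 502, "ResponsePagePath": "/maintenance.html", "ResponseCode": "503", "ErrorCachingMinTTL": 10 },
#       { "ErrorCode": 503, "ResponsePagePath": "/maintenance.html", "ResponseCode": "503", "ErrorCachingMinTTL": 10 }
#     ]
#   }

aws cloudfront update-distribution \
  --id E1234ABCD \
  --if-match "$ETAG" \
  --distribution-config file://distribution-config.json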

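Then host maintenance.html in an S3 origin and add a cache behavior for /maintenance.html that routes to it, so the error page does not depend on the failing ALB.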
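
Because update-distribution expects the complete distribution configuration rather than a fragment, the error-page settings have to be merged into the existing config before submitting it. A minimal Python sketch of that merge, assuming distribution_config is the DistributionConfig object returned by get-distribution-config; the function name is made up for illustration:

```python
import copy


def with_custom_error_responses(distribution_config, error_codes, page_path,
                                response_code="503", min_ttl=10):
    """Return a copy of a DistributionConfig dict with CustomErrorResponses replaced."""
    config = copy.deepcopy(distribution_config)   # leave the caller's dict untouched
    items = [
        {
            "ErrorCode": code,
            "ResponsePagePath": page_path,
            "ResponseCode": response_code,
            "ErrorCachingMinTTL": min_ttl,
        }
        for code in error_codes
    ]
    config["CustomErrorResponses"] = {"Quantity": len(items), "Items": items}
    return config


# Stand-in for a real DistributionConfig, which has many more keys
current = {"Comment": "example", "CustomErrorResponses": {"Quantity": 0}}
updated = with_custom_error_responses(current, [500, 502, 503], "/maintenance.html")
print(updated["CustomErrorResponses"]["Quantity"])  # 3
```

Keeping Quantity derived from the item list avoids the mismatch errors that hand-edited configs tend to produce.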

Pros and Cons

✅ Pros:

  • Edge-level response: Handled at CloudFront POPs, closest to users
  • Fast failover: No DNS propagation delays
  • Reduced origin load: Errors intercepted before hitting origin repeatedly
  • Granular control: Can handle different error codes differently
  • Custom logic: Lambda@Edge can check metrics, databases, etc.
  • Consistent UX: Same error page for all users globally
  • Low error cache TTL: Can recover quickly once origin is healthy

❌ Cons:

  • Requires CloudFront: Additional infrastructure and cost
  • CloudFront Functions limitations: 10KB size limit, limited runtime
  • Lambda@Edge complexity: More expensive ($0.60 per 1M requests), longer latency
  • Deployment time: Function updates take 15-30 minutes to propagate
  • Cold starts: Lambda@Edge can have cold start latency
  • Debugging challenges: Edge functions are harder to test and debug

🏆 Comparison Matrix

| Feature | App-Level | Route 53 Failover | CloudFront Functions | Lambda@Edge | Custom Error Pages |
|---|---|---|---|---|---|
| Response Time | Instant | 30-60s (DNS TTL) | Instant | Instant | Instant |
| Infrastructure Load | High | None (failover) | Low | Low | Low |
| Customization | Full | Limited (static) | Medium | High | Low (static) |
| Code Required | Yes | No | Yes (simple) | Yes (complex) | No |
| Cost | App compute | $0.50/month | $0.10 per 1M | $0.60 per 1M | Included |
| Maintenance | Per app | DNS + S3 | Function updates | Function updates | Config only |
| Granularity | Per route | Per domain | Per distribution | Per distribution | Per error code |
| Works if app crashes | No | Yes | Yes | Yes | Yes |
| Edge/Global | No | Yes (DNS) | Yes | Yes | Yes |

💎 The Hybrid Approach (Best Practice)

Don't choose just one; layer your defenses:

Layer 1: Application-Level (First Line)

# Catch expected failures, show degraded functionality
@app.route('/api/users')
def get_users():
    try:
        return fetch_users_from_db()
    except DatabaseError:
        # Return cached data with a warning
        return {
            'users': get_cached_users(),
            'warning': 'Using cached data - live data temporarily unavailable'
        }, 200
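
The snippet above leaves get_cached_users undefined. One way the fallback could work, sketched here under assumptions, is a small last-known-good cache that is refreshed on every successful fetch. LastGoodCache and fetch_with_fallback are hypothetical names, and the broad except is for brevity; real code would catch the database driver's specific exception.

```python
import time


class LastGoodCache:
    """Keeps the most recent successful result and when it was stored."""

    def __init__(self, max_age=300.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock        # injectable so expiry can be tested without sleeping
        self.value = None
        self.stored_at = None

    def put(self, value):
        self.value = value
        self.stored_at = self.clock()

    def get(self):
        """Return the cached value, or None if empty or older than max_age."""
        if self.stored_at is None or self.clock() - self.stored_at > self.max_age:
            return None
        return self.value


def fetch_with_fallback(fetch, cache):
    """Try the live source; on failure fall back to the cache. Returns (payload, status)."""
    try:
        data = fetch()
    except Exception:
        cached = cache.get()
        if cached is None:
            # Nothing usable in the cache: let the next layer show the maintenance page
            return {"error": "temporarily unavailable"}, 503
        return {
            "users": cached,
            "warning": "Using cached data - live data temporarily unavailable",
        }, 200
    cache.put(data)
    return {"users": data}, 200
```

The max_age limit is a design choice: serving stale data for a few minutes is usually acceptable, while serving hour-old data silently may not be.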

Layer 2: CloudFront Custom Error Pages (Second Line)

CustomErrorResponses:
  - ErrorCode: 503
    ResponsePagePath: /maintenance.html
    ResponseCode: 503
    ErrorCachingMinTTL: 10  # Short TTL for quick recovery

Layer 3: Route 53 Failover (Nuclear Option)

# Only kicks in if health checks fail completely
PRIMARY: app.example.com → ALB
SECONDARY: app.example.com → S3 (Full maintenance mode)

The Flow

1. Database goes down
2. App catches error, returns cached data or 503
3. If app returns 503, CloudFront shows pretty maintenance page
4. If entire app/ALB fails health checks, Route 53 fails over to S3
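
The four steps can be read as a decision function. This is only a toy model of the flow as written, with made-up names, useful mainly for checking that every failure combination ends at a defined layer:

```python
def who_answers(db_up, app_up, has_cached_data):
    """Return which layer produces the response the user sees."""
    if not app_up:
        # Step 4: health checks fail, Route 53 serves the secondary record
        return "route53-failover: S3 maintenance site"
    if db_up:
        return "app: normal response"
    if has_cached_data:
        # Step 2: app degrades instead of failing
        return "app: cached data with warning"
    # Steps 2-3: app returns 503, CloudFront swaps in the custom error page
    return "cloudfront: custom error page"


for case in [(True, True, False), (False, True, True), (False, True, False), (False, False, False)]:
    print(case, "->", who_answers(*case))
```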

🎯 Real-World Implementation

Let's put it all together for a production setup:

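#!/bin/bash
# Setup script for graceful failure handling
# Assumes DISTRIBUTION_ID and ZONE_ID are set in the environment
set -euo pipefail

# 1. Create S3 bucket for maintenance page
#    (bucket name must match the DNS record name for the Route 53 alias)
aws s3 mb s3://app.example.com
aws s3 cp maintenance.html s3://app.example.com/index.html
aws s3 website s3://app.example.com --index-document index.html

# 2. Create Route 53 health check against the ALB's own DNS name
#    (a check against app.example.com could not recover after failover)
HEALTH_CHECK_ID=$(aws route53 create-health-check \
  --health-check-config Type=HTTPS,ResourcePath=/health,FullyQualifiedDomainName=my-alb-123456.us-east-1.elb.amazonaws.com,Port=443 \
  --caller-reference "health-$(date +%s)" \
  --query 'HealthCheck.Id' --output text)
echo "Health check: $HEALTH_CHECK_ID (reference it in failover-config.json)"

# 3. Create and publish CloudFront function for error handling
FUNCTION_ETAG=$(aws cloudfront create-function \
  --name error-handler \
  --function-config Comment="Handle 5xx errors gracefully",Runtime="cloudfront-js-1.0" \
  --function-code fileb://error-handler.js \
  --query 'ETag' --output text)
aws cloudfront publish-function \
  --name error-handler \
  --if-match "$FUNCTION_ETAG"

# 4. Update CloudFront to use custom error pages
#    (distribution-config.json must hold the complete DistributionConfig)
DIST_ETAG=$(aws cloudfront get-distribution-config \
  --id "$DISTRIBUTION_ID" --query 'ETag' --output text)
aws cloudfront update-distribution \
  --id "$DISTRIBUTION_ID" \
  --if-match "$DIST_ETAG" \
  --distribution-config file://distribution-config.json

# 5. Configure Route 53 failover
aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --change-batch file://failover-config.json

echo "✅ Graceful failure handling configured!"
echo "Test by:"
echo "1. Taking down database"
echo "2. Watching CloudWatch metrics"
echo "3. Verifying users see maintenance page"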

📊 Monitoring and Alerting

Set up alerts to know when things go wrong:

# CloudWatch Alarms
DatabaseConnectionFailures:
  Metric: DatabaseConnectionErrors
  Threshold: > 10 in 5 minutes
  Action: SNS notification to ops team

ALB5xxErrors:
  Metric: HTTPCode_Target_5XX_Count
  Threshold: > 50 in 2 minutes
  Action: Page on-call engineer

Route53HealthCheckFailed:
  Metric: HealthCheckStatus
  Threshold: < 1
  Action: Trigger failover + alert

CloudFrontErrorRate:
  Metric: 5xxErrorRate
  Threshold: > 5%
  Action: Escalate to engineering lead
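
As one concrete instance, the ALB5xxErrors entry could map onto CloudWatch's put_metric_alarm parameters roughly as follows. This is a sketch under assumptions: the load balancer dimension value and SNS topic ARN are placeholders, "more than 50 in 2 minutes" is interpreted as a Sum over one 120-second period, and treating missing data as not breaching is a choice made here because the metric is only reported when 5xx responses occur.

```python
def alb_5xx_alarm_params(load_balancer, sns_topic_arn, threshold=50, window_seconds=120):
    """Keyword arguments for CloudWatch put_metric_alarm, mirroring the ALB5xxErrors entry."""
    return {
        "AlarmName": "ALB5xxErrors",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": window_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


params = alb_5xx_alarm_params(
    load_balancer="app/my-alb/0123456789abcdef",                # placeholder
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:oncall",  # placeholder
)
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```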

🎬 Testing Your Graceful Failure

Always test before you need it:

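# 1. Test application-level graceful degradation
# Temporarily block database access from your app. Security groups only allow
# traffic, so "blocking" means revoking the rule on the database's security
# group (sg-db, a placeholder) that admits the app's security group.
aws ec2 revoke-security-group-ingress \
  --group-id sg-db \
  --protocol tcp --port 5432 \
  --source-group sg-app

# Check: Do you see the maintenance page?

# Restore access afterwards
aws ec2 authorize-security-group-ingress \
  --group-id sg-db \
  --protocol tcp --port 5432 \
  --source-group sg-app

# 2. Test Route 53 failover
# Mark primary as unhealthy manually by inverting the health check
# (a disabled health check is treated as healthy, so --disabled won't fail over)
aws route53 update-health-check \
  --health-check-id $HEALTH_CHECK_ID \
  --inverted

# Wait a minute or two, check DNS resolution
dig app.example.com
# Should point to S3 maintenance site

# Restore normal evaluation
aws route53 update-health-check \
  --health-check-id $HEALTH_CHECK_ID \
  --no-inverted

# 3. Test CloudFront error handling
# Force a 503 from your app (assumes your app exposes an endpoint like this)
curl -X POST https://app.example.com/admin/maintenance-mode

# Check: CloudFront should show custom error page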

🎓 Key Takeaways

  1. Perfect infrastructure isn't enough: Dependencies like databases can fail
  2. Layer your defenses: Use multiple strategies together
  3. Fail gracefully: Never show ugly 500 errors to users
  4. Test regularly: Simulate failures in staging and production
  5. Monitor everything: Know about failures before your users complain
  6. Set expectations: Maintenance pages should be informative and professional
  7. Recover quickly: Short cache TTLs and auto-refresh help users see recovery

🚀 What's Next?

You now have a complete picture of building highly available infrastructure on AWS:

  • Part 1: ALB, Auto Scaling, and EC2 fundamentals
  • Part 2: ECS with containers and two-dimensional scaling
  • Part 3: Fargate serverless simplicity
  • Part 4: Graceful failure handling and error recovery

Your infrastructure can now:

  • Scale automatically based on demand ✅
  • Handle instance failures ✅
  • Distribute traffic intelligently ✅
  • Fail gracefully when dependencies break ✅
  • Provide great UX even during outages ✅

The final lesson: High availability isn't about preventing all failures; it's about handling them gracefully when they inevitably happen.


"Hope for the best, plan for the worst, and prepare to be surprised." Build systems that fail gracefully, monitor continuously, and always have a plan B (and C, and D).

Questions about graceful failure handling? Find me on social media or leave a comment below!

© 2025 Geek Cafe LLC. All rights reserved.