Published November 21, 2025 · 17 min read · by Eric Wilson

AWS High Availability CloudFront Route 53 Resilience

Your infrastructure scales perfectly, but what happens when your database goes down? Learn strategies for graceful degradation, from application-level error handling to Route 53 health checks and CloudFront functions. Turn ugly 500 errors into elegant maintenance pages.

Building Highly Available AWS Infrastructure: Graceful Failure - Part 4

You've done everything right. You followed Part 1 and built a highly available setup with ALB and Auto Scaling. You containerized with Part 2 using ECS. Maybe you even went serverless with Part 3 and Fargate.

Your application layer is bulletproof. It scales beautifully. Health checks are perfect. Your ALB is distributing traffic like a champ.

Then, at 2 AM on a Friday, your database goes down.

Suddenly, every single request returns a 500 error. Your perfectly scaled infrastructure becomes a perfectly scaled error generator. Your users see this:

500 Internal Server Error
The server encountered an internal error and was unable to complete your request.

Welcome to Part 4, where we talk about the harsh truth: No matter how well your applications scale, there are links in the chain that can still break.

🔗 The Weakest Link Problem

Your architecture looks like this:

User → CloudFront → ALB → Fargate Tasks → RDS Database
                                        ↓
                                     (💀 DOWN)

When that database fails:

Your Fargate tasks are healthy ✅
Your ALB is healthy ✅
Health checks pass (they only check /health, not the database) ✅
But every real request fails with 500 ❌

The problem: Your infrastructure health checks don't reflect your actual application health.
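
To make the gap between a shallow and a deep health check concrete, here is a minimal sketch in plain Python. The `deep_health` function and the `database_probe_down` probe are hypothetical names used only for illustration; a real probe would run something like `SELECT 1` against your database.

```python
from typing import Callable, Dict, Tuple


def deep_health(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run every dependency probe; a probe signals failure by raising."""
    results = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {type(exc).__name__}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


def database_probe_down() -> None:
    # Hypothetical probe standing in for a real "SELECT 1" during an outage
    raise ConnectionError("database unreachable")


status, body = deep_health({"database": database_probe_down})
print(status, body)
```

A shallow check is the same function called with no probes: it always reports 200, which is exactly why the load balancer stays green while every real request fails.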

🎭 The User Experience Crisis

What your users see when your database is down:

<!-- What the ALB passes through from the nginx server behind it -->
<!DOCTYPE html>
<html>
<head><title>500 Internal Server Error</title></head>
<body>
<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx/1.21.6</center>
</body>
</html>

Problems with this:

Looks broken and unprofessional
No information about what's happening
No estimated time to resolution
No alternatives or status page link
Makes users think your entire site is broken

What you want users to see:

<!-- A graceful maintenance page -->
<!DOCTYPE html>
<html>
<head><title>Scheduled Maintenance</title></head>
<body style="font-family: Arial; text-align: center; padding: 50px;">
<h1>🔧 We'll be right back!</h1>
<p>We're currently performing scheduled maintenance.</p>
<p>We'll be back online shortly. Thank you for your patience!</p>
<p><a href="https://status.example.com">Check our status page</a></p>
</body>
</html>

💡 Solution 1: Application-Level Graceful Degradation

The first line of defense is your application itself.

Strategy: Fail Gracefully

Instead of crashing when the database is down, catch the error and return a pretty maintenance page:

# Python/Flask example
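import os
from functools import wraps

import psycopg2
from flask import Flask, jsonify, render_template

app = Flask(__name__)

# Global flag to track database availability
db_available = True

def get_db_connection():
    # Assumption: the connection string lives in a DATABASE_URL environment variable.
    # A short connect timeout keeps requests from hanging during an outage.
    return psycopg2.connect(os.environ["DATABASE_URL"], connect_timeout=3)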

def graceful_degradation(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        global db_available
        try:
            # Try to execute the route
            return f(*args, **kwargs)
        except psycopg2.OperationalError:
            db_available = False
            # Return a nice maintenance page instead of 500
            return render_template('maintenance.html'), 503
        except Exception as e:
            # Log the error for debugging
            app.logger.error(f"Unexpected error: {e}")
            return render_template('maintenance.html'), 503
    return decorated_function

@app.route('/api/users')
@graceful_degradation
def get_users():
    # This will fail gracefully if database is down
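    conn = get_db_connection()
    try:
        # psycopg2 runs queries through a cursor, not on the connection itself
        with conn.cursor() as cur:
            cur.execute('SELECT * FROM users')
            users = cur.fetchall()
    finally:
        conn.close()
    return jsonify(users)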

# Health check that ACTUALLY checks dependencies
@app.route('/health')
def health_check():
    global db_available
    
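    # Check database connectivity
    try:
        conn = get_db_connection()
        try:
            with conn.cursor() as cur:
                cur.execute('SELECT 1')
        finally:
            conn.close()
        db_available = True
        return {'status': 'healthy', 'database': 'connected'}, 200
    except Exception as e:
        db_available = False
        # Log the details instead of exposing them on a public endpoint
        app.logger.error(f"Health check failed: {e}")
        # Return 503 so ALB marks as unhealthy
        # (if every target is unhealthy, the ALB fails open and still routes traffic)
        return {'status': 'unhealthy', 'database': 'disconnected'}, 503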

# Fallback route - show maintenance page for all other routes
@app.errorhandler(503)
def service_unavailable(e):
    return render_template('maintenance.html'), 503
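
The `db_available` flag in the example above is set but never read. One way to put that idea to work is to skip the database entirely for a short cooldown after a failure, so requests get the maintenance page immediately instead of each waiting on a connection timeout. The sketch below is framework-free and the `DependencyGate` class is a hypothetical helper, not part of Flask or psycopg2.

```python
import time


class DependencyGate:
    """Remembers a recent failure and tells callers to skip the dependency for a while."""

    def __init__(self, cooldown_seconds: float = 10.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failed_at = None  # None means no recent failure is recorded

    def record_failure(self) -> None:
        self._failed_at = self._clock()

    def record_success(self) -> None:
        self._failed_at = None

    def should_try(self) -> bool:
        """True when no failure is recorded or the cooldown has elapsed."""
        if self._failed_at is None:
            return True
        return self._clock() - self._failed_at >= self.cooldown_seconds


# Demonstration with a fake clock so the behaviour is deterministic
now = [0.0]
gate = DependencyGate(cooldown_seconds=10.0, clock=lambda: now[0])
gate.record_failure()
blocked = gate.should_try()        # still inside the cooldown window
now[0] = 11.0
retry_allowed = gate.should_try()  # cooldown elapsed, let a request probe the database
print(blocked, retry_allowed)
```

In the decorator, you would call `should_try()` before running the route, `record_failure()` in the `OperationalError` branch, and `record_success()` from the health check. The 10-second cooldown is an arbitrary starting point.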

.NET 9 / ASP.NET Core Implementation

Here's the same concept in modern .NET:

// Program.cs - ASP.NET Core 9
using System.Data.Common; // DbException
using Microsoft.EntityFrameworkCore;

var builder = WebApplication.CreateBuilder(args);

// Add services
builder.Services.AddDbContext<AppDbContext>(options =>
    options.UseNpgsql(builder.Configuration.GetConnectionString("DefaultConnection")));
builder.Services.AddControllers();

var app = builder.Build();

// Global exception handler middleware
app.Use(async (context, next) =>
{
    try
    {
        await next(context);
    }
    catch (DbException ex)
    {
        context.RequestServices.GetRequiredService<ILogger<Program>>()
            .LogError(ex, "Database connection failed");
        
        context.Response.StatusCode = 503;
        context.Response.ContentType = "text/html";
        await context.Response.WriteAsync(GetMaintenancePageHtml());
    }
    catch (Exception ex)
    {
        context.RequestServices.GetRequiredService<ILogger<Program>>()
            .LogError(ex, "Unexpected error occurred");
        
        context.Response.StatusCode = 503;
        context.Response.ContentType = "text/html";
        await context.Response.WriteAsync(GetMaintenancePageHtml());
    }
});

app.MapControllers();

// Health check endpoint
app.MapGet("/health", async (AppDbContext db, ILogger<Program> logger) =>
{
    try
    {
        // Check database connectivity
        await db.Database.ExecuteSqlRawAsync("SELECT 1");
        
        return Results.Ok(new 
        { 
            status = "healthy", 
            database = "connected",
            timestamp = DateTime.UtcNow 
        });
    }
    catch (Exception ex)
    {
        logger.LogError(ex, "Health check failed - database unavailable");
        
        return Results.Problem(
            statusCode: 503,
            title: "Service Unhealthy",
            detail: "Database connection failed"
        );
    }
});

app.Run();

static string GetMaintenancePageHtml() => @"
<!DOCTYPE html>
<html lang=""en"">
<head>
    <meta charset=""UTF-8"">
    <meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">
    <title>Maintenance - We'll Be Right Back</title>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            display: flex;
            align-items: center;
            justify-content: center;
            height: 100vh;
            margin: 0;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
        }
        .container {
            text-align: center;
            padding: 2rem;
        }
        h1 { font-size: 3rem; margin: 0; }
        p { font-size: 1.2rem; opacity: 0.9; }
    </style>
</head>
<body>
    <div class=""container"">
        <h1>🔧 We'll Be Right Back!</h1>
        <p>We're experiencing technical difficulties. Please try again in a few minutes.</p>
        <p style=""font-size: 0.9rem; margin-top: 2rem;"">
            If this persists, contact support@acme.com
        </p>
    </div>
</body>
</html>";

Key .NET Features Used:

✅ Global exception middleware - Catches unhandled exceptions
✅ Minimal API health endpoint - Clean, simple health checks
✅ Entity Framework error handling - Gracefully handles DbException
✅ Inline HTML generation - No template engine needed for simple pages
✅ Structured logging - Integrates with ASP.NET Core logging

Alternative: Using Middleware Class

For larger applications, create a dedicated middleware:

// GracefulDegradationMiddleware.cs
using System.Data.Common; // DbException
public class GracefulDegradationMiddleware
{
    private readonly RequestDelegate _next;
    private readonly ILogger<GracefulDegradationMiddleware> _logger;
    
    public GracefulDegradationMiddleware(
        RequestDelegate next,
        ILogger<GracefulDegradationMiddleware> logger)
    {
        _next = next;
        _logger = logger;
    }
    
    public async Task InvokeAsync(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (DbException ex)
        {
            _logger.LogError(ex, "Database error on {Path}", context.Request.Path);
            await HandleFailureAsync(context, "Database temporarily unavailable");
        }
        catch (HttpRequestException ex)
        {
            _logger.LogError(ex, "External service error on {Path}", context.Request.Path);
            await HandleFailureAsync(context, "External service temporarily unavailable");
        }
    }
    
    private static async Task HandleFailureAsync(HttpContext context, string message)
    {
        context.Response.StatusCode = 503;
        context.Response.ContentType = "application/json";
        
        await context.Response.WriteAsJsonAsync(new
        {
            status = "service_unavailable",
            message = message,
            timestamp = DateTime.UtcNow
        });
    }
}

// Register in Program.cs
app.UseMiddleware<GracefulDegradationMiddleware>();

The Maintenance Page Template

<!-- templates/maintenance.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Maintenance - We'll Be Right Back</title>
    <style>
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Arial, sans-serif;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            display: flex;
            align-items: center;
            justify-content: center;
            min-height: 100vh;
            margin: 0;
            padding: 20px;
        }
        .container {
            text-align: center;
            max-width: 600px;
            background: rgba(255, 255, 255, 0.1);
            backdrop-filter: blur(10px);
            padding: 60px 40px;
            border-radius: 20px;
            box-shadow: 0 8px 32px rgba(0, 0, 0, 0.3);
        }
        h1 { font-size: 3em; margin: 0 0 20px; }
        p { font-size: 1.2em; margin: 15px 0; opacity: 0.9; }
        .status-link {
            display: inline-block;
            margin-top: 30px;
            padding: 12px 30px;
            background: white;
            color: #667eea;
            text-decoration: none;
            border-radius: 25px;
            font-weight: 600;
            transition: transform 0.2s;
        }
        .status-link:hover { transform: translateY(-2px); }
        .icon { font-size: 4em; margin-bottom: 20px; }
    </style>
</head>
<body>
    <div class="container">
        <div class="icon">🔧</div>
        <h1>We'll Be Right Back!</h1>
        <p>We're currently experiencing technical difficulties.</p>
        <p>Our team has been notified and is working to resolve the issue.</p>
        <p style="font-size: 0.9em; opacity: 0.7;">Estimated resolution time: 15-30 minutes</p>
        <a href="https://status.example.com" class="status-link">Check Status Page</a>
    </div>
    <script>
        // Auto-refresh every 30 seconds
        setTimeout(() => location.reload(), 30000);
    </script>
</body>
</html>

Pros and Cons

✅ Pros:

Full control: Customize the message, styling, and behavior
Context-aware: Different errors can show different messages
Fast response: No external dependencies, served directly from your app
Works everywhere: Functions regardless of your DNS or CDN setup
Can include logic: Show cached data, degraded functionality, etc.

❌ Cons:

Requires code changes: Every application needs to implement this
Still hits your infrastructure: Requests still go through ALB → instances → app
Resource usage: Even maintenance pages consume compute resources
Multiple apps: If you have microservices, each needs this logic
Not helpful if app completely crashes: Only works if app can catch errors

💡 Solution 2: Route 53 Health Checks with Failover

Take control at the DNS level before requests even reach your infrastructure.

Strategy: DNS Failover to Static Maintenance Site

Normal Operation:
User → DNS (app.example.com) → ALB → Your App

Database Down:
User → DNS (app.example.com) → S3 Static Site (Maintenance Page)

Setting It Up

1. Create a maintenance page in S3:

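# Create S3 bucket for maintenance page
# (for a Route 53 alias to an S3 website endpoint, the bucket name must match the record name)
aws s3 mb s3://app.example.com

# Upload your maintenance page
aws s3 cp maintenance.html s3://app.example.com/index.html \
  --content-type "text/html" \
  --cache-control "no-cache, no-store, must-revalidate"

# Configure bucket for static website hosting
# (the error document makes deep links show the maintenance page too)
aws s3 website s3://app.example.com \
  --index-document index.html \
  --error-document index.html

# New buckets block public policies by default, so relax that first
aws s3api put-public-access-block \
  --bucket app.example.com \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=false,RestrictPublicBuckets=false

# Make it public (or use CloudFront for better security)
aws s3api put-bucket-policy \
  --bucket app.example.com \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::app.example.com/*"
    }]
  }'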

2. Create Route 53 health check:

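# Health check that monitors your actual app health.
# Point it at the ALB's own DNS name rather than app.example.com: AWS warns that results
# are unpredictable when the checked domain matches the failover record it controls.
aws route53 create-health-check \
  --health-check-config '{
    "Type": "HTTPS",
    "ResourcePath": "/health",
    "FullyQualifiedDomainName": "my-alb-123456.us-east-1.elb.amazonaws.com",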
    "Port": 443,
    "RequestInterval": 30,
    "FailureThreshold": 3,
    "MeasureLatency": true,
    "EnableSNI": true
  }' \
  --caller-reference "app-health-check-$(date +%s)"

3. Configure Route 53 failover records:

# Primary record (your main ALB)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "Primary",
        "Failover": "PRIMARY",
        "AliasTarget": {
          "HostedZoneId": "Z35SXDOTRQ7X7K",
          "DNSName": "my-alb-123456.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        },
        "HealthCheckId": "abc123-health-check-id"
      }
    }]
  }'

# Secondary record (S3 maintenance page)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "Secondary",
        "Failover": "SECONDARY",
        "AliasTarget": {
          "HostedZoneId": "Z3AQBSTGFYJSTF",
          "DNSName": "s3-website-us-east-1.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
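
The two change batches above differ only in a handful of fields, so it can help to generate them rather than hand-edit JSON. The following is a minimal sketch, assuming you write the result to a file and pass it to `aws route53 change-resource-record-sets --change-batch file://...`; the `failover_change` helper is a hypothetical name, and the IDs are the same placeholders used above.

```python
import json
from typing import Optional


def failover_change(name: str, role: str, alias_zone_id: str, alias_dns: str,
                    health_check_id: Optional[str] = None) -> dict:
    """Build one CREATE change for a PRIMARY or SECONDARY failover alias record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.capitalize(),
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,
            "DNSName": alias_dns,
            "EvaluateTargetHealth": False,
        },
    }
    if health_check_id:
        # Only the primary needs a health check attached
        record["HealthCheckId"] = health_check_id
    return {"Action": "CREATE", "ResourceRecordSet": record}


batch = {"Changes": [
    failover_change("app.example.com", "PRIMARY", "Z35SXDOTRQ7X7K",
                    "my-alb-123456.us-east-1.elb.amazonaws.com", "abc123-health-check-id"),
    failover_change("app.example.com", "SECONDARY", "Z3AQBSTGFYJSTF",
                    "s3-website-us-east-1.amazonaws.com"),
]}
print(json.dumps(batch, indent=2))
```

Submitting both records in one batch also means they are created together, so there is never a moment with a primary and no secondary.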

Pros and Cons

✅ Pros:

Completely offloads traffic: When failing over, zero load on your infrastructure
Works even if app crashes: DNS-level failover doesn't require your app to be running
Simple maintenance page: Just static HTML in S3
Cost-effective during outages: S3 hosting is pennies compared to running instances
Automatic failover: Route 53 detects failure and switches automatically

❌ Cons:

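Failover delay: With the settings above, detection alone takes roughly 90 seconds (3 failed checks × 30-second interval) before DNS answers change
TTL complications: Clients and resolvers cache DNS for the TTL duration (typically 60-300 seconds), and some hold answers longer
HTTP only: S3 website endpoints don't serve HTTPS, so visitors arriving over HTTPS need CloudFront in front of the bucket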
All or nothing: Either all traffic goes to maintenance page or none
Limited customization: Static page can't show dynamic information
Health check costs: Route 53 health checks cost $0.50/month each
Not granular: Can't fail over specific routes, only entire domains
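
To get a feel for the delay, here is a rough back-of-the-envelope model: detection time is the check interval multiplied by the failure threshold, and resolvers may keep serving the old answer for up to one TTL after that. This is a simplification (it ignores that Route 53 runs checks from several locations), and the 60-second TTL is an assumption.

```python
def worst_case_failover_seconds(request_interval: int, failure_threshold: int,
                                record_ttl: int) -> int:
    """Detection time (interval x threshold) plus the time a resolver may cache the old answer."""
    detection = request_interval * failure_threshold
    return detection + record_ttl


# Settings from the health check above, with an assumed 60-second record TTL
standard = worst_case_failover_seconds(30, 3, 60)
# Same threshold with the faster 10-second request interval
fast = worst_case_failover_seconds(10, 3, 60)
print(standard, fast)
```

Under these assumptions the standard configuration leaves users on the broken site for up to two and a half minutes, which is why DNS failover works better as a last resort than as the first responder.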

💡 Solution 3: CloudFront with Edge Functions

Intercept and handle errors at the edge, closest to your users.

Strategy: CloudFront Functions or Lambda@Edge

CloudFront sits in front of your entire infrastructure and can inspect/modify responses:

User → CloudFront (Edge Location) → ALB → Your App
           ↓
    (Detects 5xx error)
           ↓
    (Returns pretty maintenance page)

Option A: CloudFront Functions (Lightweight)

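CloudFront Functions run in under a millisecond and are perfect for simple transformations. One caveat from the CloudFront documentation: edge functions attached to the viewer-response event are not invoked when the origin returns a status code of 400 or higher. Test this approach against a real 5xx before relying on it; custom error pages or a Lambda@Edge origin-response trigger are the safer routes for origin errors.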

// CloudFront Function (viewer-response event)
function handler(event) {
    var response = event.response;
    var statusCode = response.statusCode;
    
    // If origin returned 5xx error, return maintenance page
    if (statusCode >= 500 && statusCode < 600) {
        return {
            statusCode: 503,
            statusDescription: 'Service Unavailable',
            headers: {
                'content-type': { value: 'text/html; charset=utf-8' },
                'cache-control': { value: 'no-cache, no-store, must-revalidate' }
            },
            body: `<!DOCTYPE html>
<html>
<head>
    <title>Maintenance - We'll Be Right Back</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            display: flex;
            align-items: center;
            justify-content: center;
            min-height: 100vh;
            margin: 0;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
            text-align: center;
        }
        .container {
            background: rgba(255, 255, 255, 0.1);
            padding: 40px;
            border-radius: 20px;
            backdrop-filter: blur(10px);
        }
        h1 { font-size: 2.5em; margin: 0 0 20px; }
        p { font-size: 1.1em; margin: 10px 0; }
    </style>
</head>
<body>
    <div class="container">
        <h1>🔧 We'll Be Right Back!</h1>
        <p>We're experiencing technical difficulties.</p>
        <p>Our team is working to resolve the issue.</p>
        <p style="font-size: 0.9em; opacity: 0.8;">Please try again in a few minutes.</p>
    </div>
    <script>setTimeout(() => location.reload(), 30000);</script>
</body>
</html>`
        };
    }
    
    // Return original response if no error
    return response;
}

Deploying the function:

# Create function
aws cloudfront create-function \
  --name error-handler \
  --function-config Comment="Handle 5xx errors gracefully",Runtime="cloudfront-js-1.0" \
  --function-code fileb://error-handler.js

# Publish function
aws cloudfront publish-function \
  --name error-handler \
  --if-match ETVABCDEF12345

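# Associate with CloudFront distribution
# Note: the JSON below shows only the fragment to change. update-distribution expects the
# complete config from get-distribution-config, plus --if-match with the current ETag.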
aws cloudfront update-distribution \
  --id E1234ABCD \
  --distribution-config '{
    "DefaultCacheBehavior": {
      "FunctionAssociations": {
        "Quantity": 1,
        "Items": [{
          "FunctionARN": "arn:aws:cloudfront::123456:function/error-handler",
          "EventType": "viewer-response"
        }]
      }
    }
  }'

Option B: Lambda@Edge (Full Power)

For more complex logic, use Lambda@Edge:

# Lambda@Edge function (origin-response event)
import json
import boto3

def lambda_handler(event, context):
    response = event['Records'][0]['cf']['response']
    status = int(response['status'])
    
    # If 5xx error, check if it's a database issue
    if 500 <= status < 600:
        # Could check CloudWatch metrics, or RDS status here
        # For simplicity, return maintenance page for all 5xx
        
        maintenance_page = """<!DOCTYPE html>
<html>
<head>
    <title>Maintenance</title>
    <style>
        body { 
            font-family: Arial, sans-serif; 
            text-align: center; 
            padding: 50px;
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            color: white;
        }
        .container { 
            max-width: 600px; 
            margin: 0 auto; 
            background: rgba(255,255,255,0.1);
            padding: 40px;
            border-radius: 20px;
        }
        h1 { font-size: 2.5em; }
    </style>
</head>
<body>
    <div class="container">
        <h1>🔧 Under Maintenance</h1>
        <p>We're currently performing maintenance.</p>
        <p>We'll be back shortly!</p>
    </div>
</body>
</html>"""
        
        return {
            'status': '503',
            'statusDescription': 'Service Unavailable',
            'headers': {
                'content-type': [{'key': 'Content-Type', 'value': 'text/html'}],
                'cache-control': [{'key': 'Cache-Control', 'value': 'no-cache'}]
            },
            'body': maintenance_page
        }
    
    return response
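
Because Lambda@Edge sees the request as well as the response, it can tailor the error to the caller. The sketch below is one possible variant, not the handler above: browsers get HTML while API clients get JSON. It assumes the `Accept` header is forwarded to the origin (otherwise it will not appear in the event), and `fake_event` is a hand-built stand-in for local testing.

```python
import json

MAINTENANCE_HTML = "<!DOCTYPE html><html><body><h1>We'll Be Right Back!</h1></body></html>"


def wants_html(request: dict) -> bool:
    """True when the viewer's Accept header mentions text/html."""
    accept = request.get("headers", {}).get("accept", [])
    return any("text/html" in header.get("value", "") for header in accept)


def lambda_handler(event, context):
    cf = event["Records"][0]["cf"]
    response = cf["response"]
    if not 500 <= int(response["status"]) < 600:
        return response  # pass non-5xx responses through untouched
    html = wants_html(cf["request"])
    body = MAINTENANCE_HTML if html else json.dumps({"status": "service_unavailable"})
    content_type = "text/html; charset=utf-8" if html else "application/json"
    return {
        "status": "503",
        "statusDescription": "Service Unavailable",
        "headers": {
            "content-type": [{"key": "Content-Type", "value": content_type}],
            "cache-control": [{"key": "Cache-Control", "value": "no-cache"}],
        },
        "body": body,
    }


def fake_event(status: str, accept: str) -> dict:
    """Hand-built event for local testing; real events carry many more fields."""
    return {"Records": [{"cf": {
        "request": {"headers": {"accept": [{"key": "Accept", "value": accept}]}},
        "response": {"status": status, "headers": {}},
    }}]}


api_result = lambda_handler(fake_event("500", "application/json"), None)
page_result = lambda_handler(fake_event("502", "text/html,application/xhtml+xml"), None)
print(api_result["headers"]["content-type"][0]["value"], page_result["status"])
```

Running the handler against fake events like this is also a cheap way around the debugging pain of edge functions: most logic errors show up locally before you wait for a deployment.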

CloudFront Custom Error Pages (Simplest Option)

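CloudFront also supports custom error pages without any code. The command below shows only the CustomErrorResponses fragment; update-distribution expects the complete distribution config (fetch it with get-distribution-config) together with --if-match set to the current ETag: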

aws cloudfront update-distribution \
  --id E1234ABCD \
  --distribution-config '{
    "CustomErrorResponses": {
      "Quantity": 3,
      "Items": [
        {
          "ErrorCode": 500,
          "ResponsePagePath": "/maintenance.html",
          "ResponseCode": "503",
          "ErrorCachingMinTTL": 10
        },
        {
          "ErrorCode": 502,
          "ResponsePagePath": "/maintenance.html",
          "ResponseCode": "503",
          "ErrorCachingMinTTL": 10
        },
        {
          "ErrorCode": 503,
          "ResponsePagePath": "/maintenance.html",
          "ResponseCode": "503",
          "ErrorCachingMinTTL": 10
        }
      ]
    }
  }'

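Then host maintenance.html in an S3 origin, with a cache behavior that routes /maintenance.html to that origin. If the page lives on the failing origin, CloudFront can't fetch it either.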
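
Since the custom error settings are one small part of a large distribution config, a small script can patch them into the config you fetched. The sketch below works on a plain dictionary and makes no AWS calls; `with_custom_error_pages` is a hypothetical helper, and `current` is a stand-in for the `DistributionConfig` object returned by `get-distribution-config`.

```python
import copy


def with_custom_error_pages(distribution_config: dict, error_codes, page_path: str,
                            response_code: str = "503", min_ttl: int = 10) -> dict:
    """Return a copy of the config whose CustomErrorResponses cover the given codes.

    Note: this replaces any existing custom error responses rather than merging.
    """
    updated = copy.deepcopy(distribution_config)
    items = [
        {
            "ErrorCode": code,
            "ResponsePagePath": page_path,
            "ResponseCode": response_code,
            "ErrorCachingMinTTL": min_ttl,
        }
        for code in error_codes
    ]
    updated["CustomErrorResponses"] = {"Quantity": len(items), "Items": items}
    return updated


# Stand-in for the real DistributionConfig, which has many more keys
current = {"Comment": "example", "CustomErrorResponses": {"Quantity": 0}}
patched = with_custom_error_pages(current, [500, 502, 503], "/maintenance.html")
print(patched["CustomErrorResponses"]["Quantity"])
```

Working on a copy keeps the fetched config untouched, which makes it easy to diff the two before sending the update.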

Pros and Cons

✅ Pros:

Edge-level response: Handled at CloudFront POPs, closest to users
Fast failover: No DNS propagation delays
Reduced origin load: Errors intercepted before hitting origin repeatedly
Granular control: Can handle different error codes differently
Custom logic: Lambda@Edge can check metrics, databases, etc.
Consistent UX: Same error page for all users globally
Low error cache TTL: Can recover quickly once origin is healthy

❌ Cons:

Requires CloudFront: Additional infrastructure and cost
CloudFront Functions limitations: 10KB size limit, limited runtime
Lambda@Edge complexity: More expensive ($0.60 per 1M requests), longer latency
Deployment time: Function updates have to propagate to every edge location, which takes minutes rather than seconds
Cold starts: Lambda@Edge can have cold start latency
Debugging challenges: Edge functions are harder to test and debug

🏆 Comparison Matrix

Feature	App-Level	Route 53 Failover	CloudFront Functions	Lambda@Edge	Custom Error Pages
Response Time	Instant	30-60s (DNS TTL)	Instant	Instant	Instant
Infrastructure Load	High	None (failover)	Low	Low	Low
Customization	Full	Limited (static)	Medium	High	Low (static)
Code Required	Yes	No	Yes (simple)	Yes (complex)	No
Cost	App compute	$0.50/month	$0.10 per 1M	$0.60 per 1M	Included
Maintenance	Per app	DNS + S3	Function updates	Function updates	Config only
Granularity	Per route	Per domain	Per distribution	Per distribution	Per error code
Works if app crashes	No	Yes	No (viewer-response functions are skipped on origin 4xx/5xx)	Yes	Yes
Edge/Global	No	Yes (DNS)	Yes	Yes	Yes

💎 The Hybrid Approach (Best Practice)

Don't choose just one—layer your defenses:

Layer 1: Application-Level (First Line)

# Catch expected failures, show degraded functionality
@app.route('/api/users')
def get_users():
    try:
        return fetch_users_from_db()
    except DatabaseError:
        # Return cached data with a warning
        return {
            'users': get_cached_users(),
            'warning': 'Using cached data - live data temporarily unavailable'
        }, 200
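
The snippet above leans on helpers it doesn't define. Here is a self-contained sketch of the same stale-cache fallback, assuming an in-memory cache is good enough for illustration (a real deployment would more likely use Redis or similar). `StaleCache`, `fetch_with_fallback`, and `failing_fetch` are hypothetical names.

```python
import time


class StaleCache:
    """Keeps the last good value per key along with when it was stored."""

    def __init__(self):
        self._store = {}

    def put(self, key, value, now=None):
        self._store[key] = (value, time.time() if now is None else now)

    def get(self, key):
        return self._store.get(key)  # (value, stored_at) or None


def fetch_with_fallback(key, fetch, cache, now=None):
    """Try the live source; on failure serve the cached copy and flag it as stale."""
    now = time.time() if now is None else now
    try:
        value = fetch()
        cache.put(key, value, now)
        return {"data": value, "stale": False}, 200
    except Exception:
        cached = cache.get(key)
        if cached is None:
            return {"error": "temporarily unavailable"}, 503
        value, stored_at = cached
        return {"data": value, "stale": True, "age_seconds": int(now - stored_at)}, 200


def failing_fetch():
    raise ConnectionError("database is down")  # simulated outage


cache = StaleCache()
first = fetch_with_fallback("users", lambda: ["ada", "lin"], cache, now=100)
second = fetch_with_fallback("users", failing_fetch, cache, now=160)
print(first, second)
```

Returning the age of the data alongside the stale flag lets the front end decide how loudly to warn the user.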

Layer 2: CloudFront Custom Error Pages (Second Line)

CustomErrorResponses:
  - ErrorCode: 503
    ResponsePagePath: /maintenance.html
    ResponseCode: 503
    ErrorCachingMinTTL: 10  # Short TTL for quick recovery

Layer 3: Route 53 Failover (Nuclear Option)

# Only kicks in if health checks fail completely
PRIMARY: app.example.com → ALB
SECONDARY: app.example.com → S3 (Full maintenance mode)

The Flow

1. Database goes down
2. App catches error, returns cached data or 503
3. If app returns 503, CloudFront shows pretty maintenance page
4. If entire app/ALB fails health checks, Route 53 fails over to S3
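
The flow can be written down as a tiny decision function, which is handy for reasoning about which layer a given failure will land on. This is only a model of the four steps above; the layer names are labels made up for this sketch.

```python
def responding_layer(db_up: bool, app_up: bool, cache_available: bool) -> str:
    """Which layer answers the user, following the flow described above."""
    if not app_up:
        return "route53-failover"       # health checks fail, DNS moves to S3
    if db_up:
        return "app-normal"
    if cache_available:
        return "app-cached"             # layer 1 serves degraded data with a 200
    return "cloudfront-error-page"      # app returns 503, CloudFront swaps in the page


scenarios = [
    (True, True, False),
    (False, True, True),
    (False, True, False),
    (False, False, False),
]
for state in scenarios:
    print(state, "->", responding_layer(*state))
```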

🎯 Real-World Implementation

Let's put it all together for a production setup:

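#!/bin/bash
# Setup script for graceful failure handling
set -euo pipefail

# Assumed inputs: replace these placeholders with your own values
DISTRIBUTION_ID="E1234ABCD"
ZONE_ID="Z1234567890ABC"
ALB_DNS="my-alb-123456.us-east-1.elb.amazonaws.com"

# 1. Create S3 bucket for maintenance page (bucket name must match the failover record name)
# The public-read bucket policy from Solution 2 is still required and is omitted here.
aws s3 mb s3://app.example.com
aws s3 cp maintenance.html s3://app.example.com/index.html --content-type "text/html"
aws s3 website s3://app.example.com --index-document index.html --error-document index.html

# 2. Create Route 53 health check against the ALB itself
HEALTH_CHECK_ID=$(aws route53 create-health-check \
  --health-check-config "Type=HTTPS,ResourcePath=/health,FullyQualifiedDomainName=${ALB_DNS},Port=443" \
  --caller-reference "health-$(date +%s)" \
  --query 'HealthCheck.Id' --output text)
echo "Health check created: ${HEALTH_CHECK_ID} (reference it in failover-config.json)"

# 3. Create and publish CloudFront function for error handling
FUNCTION_ETAG=$(aws cloudfront create-function \
  --name error-handler \
  --function-config Comment="Handle 5xx errors",Runtime="cloudfront-js-1.0" \
  --function-code fileb://error-handler.js \
  --query 'ETag' --output text)
aws cloudfront publish-function --name error-handler --if-match "$FUNCTION_ETAG"

# 4. Update CloudFront to use custom error pages
# distribution-config.json must hold the full DistributionConfig, not just the changed part
DIST_ETAG=$(aws cloudfront get-distribution-config --id "$DISTRIBUTION_ID" \
  --query 'ETag' --output text)
aws cloudfront update-distribution \
  --id "$DISTRIBUTION_ID" \
  --if-match "$DIST_ETAG" \
  --distribution-config file://distribution-config.json

# 5. Configure Route 53 failover
aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --change-batch file://failover-config.json

echo "✅ Graceful failure handling configured!"
echo "Test by:"
echo "1. Taking down database"
echo "2. Watching CloudWatch metrics"
echo "3. Verifying users see maintenance page"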

📊 Monitoring and Alerting

Set up alerts to know when things go wrong:

# CloudWatch Alarms
DatabaseConnectionFailures:
  Metric: DatabaseConnectionErrors
  Threshold: > 10 in 5 minutes
  Action: SNS notification to ops team

ALB5xxErrors:
  Metric: HTTPCode_Target_5XX_Count
  Threshold: > 50 in 2 minutes
  Action: Page on-call engineer

Route53HealthCheckFailed:
  Metric: HealthCheckStatus
  Threshold: < 1
  Action: Trigger failover + alert

CloudFrontErrorRate:
  Metric: 5xxErrorRate
  Threshold: > 5%
  Action: Escalate to engineering lead
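
The definitions above are shorthand rather than a real config format. As a sketch of how three of them could map onto CloudWatch, the snippet below builds dictionaries in the shape of the PutMetricAlarm parameters. It makes no AWS calls; the dimension values are placeholders, the periods for the Route 53 and CloudFront alarms are assumptions, and alarm actions (the SNS topics) are left out.

```python
import json


def alarm(name, namespace, metric, statistic, period, threshold, comparison, dimensions):
    """Build keyword arguments in the shape CloudWatch's PutMetricAlarm expects."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": statistic,
        "Period": period,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
    }


alarms = [
    # "> 50 in 2 minutes": sum the count over one 120-second period
    alarm("ALB5xxErrors", "AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "Sum", 120, 50,
          "GreaterThanThreshold", {"LoadBalancer": "app/my-alb/0123456789abcdef"}),
    # HealthCheckStatus is 1 when healthy and 0 when not
    alarm("Route53HealthCheckFailed", "AWS/Route53", "HealthCheckStatus", "Minimum", 60, 1,
          "LessThanThreshold", {"HealthCheckId": "abc123-health-check-id"}),
    alarm("CloudFrontErrorRate", "AWS/CloudFront", "5xxErrorRate", "Average", 300, 5,
          "GreaterThanThreshold", {"DistributionId": "E1234ABCD", "Region": "Global"}),
]
print(json.dumps(alarms[0], indent=2))
```

With boto3, each dictionary could be passed as `put_metric_alarm(**params)`. Route 53 and CloudFront publish their metrics in us-east-1, so those two alarms need to be created there.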

🎬 Testing Your Graceful Failure

Always test before you need it:

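# 1. Test application-level graceful degradation
# Temporarily block database access from your app. Security groups only allow traffic,
# so "blocking" means revoking the rule that lets the app reach the database.
# sg-db and sg-app are placeholder IDs for the database and app security groups.
aws ec2 revoke-security-group-ingress \
  --group-id sg-db \
  --protocol tcp \
  --port 5432 \
  --source-group sg-app
# Restore access afterwards with authorize-security-group-ingress and the same arguments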

# Check: Do you see the maintenance page?

# 2. Test Route 53 failover
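# Mark primary as unhealthy manually by inverting the health check
# (--disabled would not work here: Route 53 treats a disabled check as healthy)
aws route53 update-health-check \
  --health-check-id $HEALTH_CHECK_ID \
  --inverted
# Undo this afterwards with --no-inverted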

# Wait 60 seconds, check DNS resolution
dig app.example.com
# Should point to S3 maintenance site

# 3. Test CloudFront error handling
# Force a 503 from your app (the admin endpoint below is hypothetical; use whatever toggle your app provides)
curl -X POST https://app.example.com/admin/maintenance-mode

# Check: CloudFront should show custom error page
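
"Check that you see the maintenance page" is easier to repeat if the check is written down. The sketch below is a hypothetical checker that takes a status code, lower-cased response headers, and the body, and lists what falls short of a graceful outage response. It does no network I/O itself; feed it whatever your HTTP client returns.

```python
def outage_page_problems(status: int, headers: dict, body: str) -> list:
    """List the ways a response falls short of a graceful maintenance page."""
    problems = []
    if status != 503:
        problems.append(f"expected 503, got {status}")
    if "text/html" not in headers.get("content-type", ""):
        problems.append("content-type is not HTML")
    cache_control = headers.get("cache-control", "")
    if "no-cache" not in cache_control and "no-store" not in cache_control:
        problems.append("response may be cached")
    if "Internal Server Error" in body:
        problems.append("raw server error text is visible")
    return problems


raw_500 = outage_page_problems(
    500, {"content-type": "text/html"}, "<h1>500 Internal Server Error</h1>"
)
graceful = outage_page_problems(
    503,
    {"content-type": "text/html; charset=utf-8",
     "cache-control": "no-cache, no-store, must-revalidate"},
    "<h1>We'll Be Right Back!</h1>",
)
print(raw_500, graceful)
```

An empty list means the response looks like the page you intended; anything else is a to-do item before the next real outage.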

🎓 Key Takeaways

Perfect infrastructure isn't enough: Dependencies like databases can fail
Layer your defenses: Use multiple strategies together
Fail gracefully: Never show ugly 500 errors to users
Test regularly: Simulate failures in staging and production
Monitor everything: Know about failures before your users complain
Set expectations: Maintenance pages should be informative and professional
Recover quickly: Short cache TTLs and auto-refresh help users see recovery

🚀 What's Next?

You now have a complete picture of building highly available infrastructure on AWS:

Part 1: ALB, Auto Scaling, and EC2 fundamentals
Part 2: ECS with containers and two-dimensional scaling
Part 3: Fargate serverless simplicity
Part 4: Graceful failure handling and error recovery

Your infrastructure can now:

Scale automatically based on demand ✅
Handle instance failures ✅
Distribute traffic intelligently ✅
Fail gracefully when dependencies break ✅
Provide great UX even during outages ✅

The final lesson: High availability isn't about preventing all failures—it's about handling them gracefully when they inevitably happen.

"Hope for the best, plan for the worst, and prepare to be surprised." Build systems that fail gracefully, monitor continuously, and always have a plan B (and C, and D).

Questions about graceful failure handling? Find me on social media or leave a comment below!
