Advanced
45 mins

High Availability

Learn how to design and implement high availability architectures to ensure your applications remain operational even during component failures.

Prerequisites

  • Understanding of system architecture principles
  • Experience with cloud infrastructure
  • Knowledge of networking concepts
  • Familiarity with load balancing

High Availability Overview

Figure: High Availability Architecture. Redundant components deployed across multiple availability zones.

1

High Availability Architecture

Design a resilient high availability architecture:

// High availability architecture configuration
const haArchitecture = {
  // Redundancy levels
  redundancyLevels: {
    n: {
      description: 'Single component, no redundancy',
      availability: '99.5%', // Typical single component availability
      failureImpact: 'Service outage'
    },
    nPlusOne: {
      description: 'N+1 redundancy (one additional component)',
      availability: '99.9%', // Three nines
      failureImpact: 'No impact if single component fails'
    },
    nPlusTwo: {
      description: 'N+2 redundancy (two additional components)',
      availability: '99.99%', // Four nines
      failureImpact: 'No impact if two components fail'
    },
    twoN: {
      description: '2N redundancy (full component duplication)',
      availability: '99.999%', // Five nines
      failureImpact: 'No impact if entire subsystem fails'
    }
  },
  
  // System components
  components: {
    compute: {
      type: 'Virtual Machines',
      redundancyLevel: 'nPlusOne',
      distribution: 'Multiple availability zones',
      scalingStrategy: 'Auto-scaling group'
    },
    database: {
      type: 'Managed Database',
      redundancyLevel: 'nPlusOne',
      distribution: 'Primary with hot standby',
      replicationStrategy: 'Synchronous replication'
    },
    storage: {
      type: 'Distributed Storage',
      redundancyLevel: 'twoN',
      distribution: 'Multiple availability zones',
      replicationStrategy: 'Triple replication'
    },
    networking: {
      type: 'Load Balancers',
      redundancyLevel: 'nPlusOne',
      distribution: 'Multiple availability zones',
      failoverStrategy: 'Active-active'
    }
  },
  
  // Availability zones
  availabilityZones: {
    primary: {
      name: 'us-east-1a',
      region: 'us-east-1',
      components: ['compute', 'database', 'storage', 'networking']
    },
    secondary: {
      name: 'us-east-1b',
      region: 'us-east-1',
      components: ['compute', 'database', 'storage', 'networking']
    },
    tertiary: {
      name: 'us-east-1c',
      region: 'us-east-1',
      components: ['compute', 'storage', 'networking']
    }
  },
  
  // Calculate theoretical availability.
  // Assumes independent failures and N = 1, so N+1 and 2N both mean two
  // parallel copies and N+2 means three.
  calculateAvailability() {
    // Start from the availability of a single, non-redundant component. The
    // percentages listed per redundancy level are illustrative targets that
    // already include redundancy, so they are not fed into the formula again.
    const componentAvailability = this.parseAvailability(this.redundancyLevels.n.availability);
    let systemAvailability = 1.0;
    
    for (const component of Object.values(this.components)) {
      // Parallel redundancy: 1 - (1 - a)^copies
      let effectiveAvailability;
      
      switch (component.redundancyLevel) {
        case 'n':
          effectiveAvailability = componentAvailability;
          break;
        case 'nPlusOne':
          effectiveAvailability = 1 - Math.pow(1 - componentAvailability, 2);
          break;
        case 'nPlusTwo':
          effectiveAvailability = 1 - Math.pow(1 - componentAvailability, 3);
          break;
        case 'twoN':
          effectiveAvailability = 1 - Math.pow(1 - componentAvailability, 2);
          break;
        default:
          effectiveAvailability = componentAvailability;
      }
      
      // Multiply with system availability (series calculation)
      systemAvailability *= effectiveAvailability;
    }
    
    return this.formatAvailability(systemAvailability);
  },
  
  // Parse availability percentage string to decimal
  parseAvailability(availabilityStr) {
    return parseFloat(availabilityStr.replace('%', '')) / 100;
  },
  
  // Format availability decimal to percentage string
  formatAvailability(availability) {
    return (availability * 100).toFixed(3) + '%';
  }
}
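
The series and parallel arithmetic behind the calculation can be sketched on its own. This is a minimal illustration that assumes failures are independent (which is optimistic, since a zone outage can take down several copies at once) and that a single component is 99.5% available; the function names are hypothetical.

```javascript
// Availability of `copies` identical, independent components in parallel:
// the group is down only when every copy is down at the same time.
function parallelAvailability(singleAvailability, copies) {
  return 1 - Math.pow(1 - singleAvailability, copies);
}

// Availability of components in series: every one of them must be up.
function seriesAvailability(availabilities) {
  return availabilities.reduce((product, a) => product * a, 1);
}

const singleComponent = 0.995; // assumed availability of one component

// Four tiers (compute, database, storage, networking), each with two copies.
const tierAvailabilities = [
  parallelAvailability(singleComponent, 2),
  parallelAvailability(singleComponent, 2),
  parallelAvailability(singleComponent, 2),
  parallelAvailability(singleComponent, 2)
];

const systemPercent = (seriesAvailability(tierAvailabilities) * 100).toFixed(3) + '%';
console.log(systemPercent); // prints "99.990%"
```

Each duplicated tier reaches 99.9975%, but chaining four tiers in series pulls the system back to roughly 99.99%, which is why every tier on the request path needs its own redundancy.
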
2

Load Balancing Configuration

Set up load balancing for distributed traffic:

// Load balancer configuration
const loadBalancer = {
  // Load balancer types
  types: {
    application: {
      name: 'Application Load Balancer',
      layer: 'Layer 7 (HTTP/HTTPS)',
      features: [
        'Path-based routing',
        'Host-based routing',
        'HTTP header routing',
        'SSL termination',
        'Session stickiness'
      ],
      bestFor: ['Web applications', 'Microservices', 'Container-based applications']
    },
    network: {
      name: 'Network Load Balancer',
      layer: 'Layer 4 (TCP/UDP)',
      features: [
        'Ultra-low latency',
        'Millions of requests per second',
        'Static IP addresses',
        'Preserve client IP addresses'
      ],
      bestFor: ['TCP/UDP traffic', 'Extreme performance requirements', 'Static IP requirements']
    },
    global: {
      name: 'Global Load Balancer',
      layer: 'DNS-based',
      features: [
        'Geographic routing',
        'Weighted routing',
        'Latency-based routing',
        'Health check integration',
        'Disaster recovery'
      ],
      bestFor: ['Multi-region deployments', 'Global user base', 'Disaster recovery']
    }
  },
  
  // Algorithms
  algorithms: {
    roundRobin: {
      name: 'Round Robin',
      description: 'Distributes requests sequentially across all servers',
      advantages: ['Simple implementation', 'Equal distribution'],
      disadvantages: ["Doesn't account for server load", "Doesn't consider request complexity"]
    },
    leastConnections: {
      name: 'Least Connections',
      description: 'Directs traffic to server with fewest active connections',
      advantages: ['Accounts for varying connection times', 'Prevents server overload'],
      disadvantages: ["Doesn't account for connection complexity", 'Requires connection tracking']
    },
    leastResponseTime: {
      name: 'Least Response Time',
      description: 'Directs traffic to server with lowest response time',
      advantages: ['Accounts for server performance', 'Improves user experience'],
      disadvantages: ['More complex implementation', 'Requires response time monitoring']
    },
    ipHash: {
      name: 'IP Hash',
      description: 'Hashes the client IP to choose a server, so the same client keeps reaching the same server',
      advantages: ['Session persistence without cookies', 'Even distribution when client IPs are diverse'],
      disadvantages: ['Less flexible', 'Can be unbalanced with similar IPs']
    },
    weightedRoundRobin: {
      name: 'Weighted Round Robin',
      description: 'Round robin with server capacity weights',
      advantages: ['Accounts for different server capacities', 'Simple implementation'],
      disadvantages: ['Static weights may not reflect real-time conditions']
    }
  },
  
  // Health checks
  healthChecks: {
    types: {
      http: {
        protocol: 'HTTP/HTTPS',
        checks: ['Status code', 'Response body', 'Response time'],
        interval: 30, // seconds
        timeout: 5, // seconds
        thresholds: {
          healthy: 2,
          unhealthy: 3
        }
      },
      tcp: {
        protocol: 'TCP',
        checks: ['Connection establishment'],
        interval: 30, // seconds
        timeout: 5, // seconds
        thresholds: {
          healthy: 2,
          unhealthy: 3
        }
      }
    },
    
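    async configureHealthCheck(target, type, options = {}) {
      // Called as loadBalancer.healthChecks.configureHealthCheck(...), so `this`
      // is the healthChecks object itself
      const healthCheckType = this.types[type];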
      
      if (!healthCheckType) {
        throw new Error(`Unknown health check type: ${type}`);
      }
      
      return {
        target,
        type,
        protocol: options.protocol || healthCheckType.protocol,
        port: options.port || 80,
        path: options.path || '/health',
        interval: options.interval || healthCheckType.interval,
        timeout: options.timeout || healthCheckType.timeout,
        thresholds: {
          healthy: options.healthyThreshold || healthCheckType.thresholds.healthy,
          unhealthy: options.unhealthyThreshold || healthCheckType.thresholds.unhealthy
        }
      };
    }
  },
  
  // Configure load balancer
  async configure(type, options = {}) {
    const lbType = this.types[type];
    
    if (!lbType) {
      throw new Error(`Unknown load balancer type: ${type}`);
    }
    
    console.log(`Configuring ${lbType.name}`);
    
    const algorithm = options.algorithm || 'roundRobin';
    
    if (!this.algorithms[algorithm]) {
      throw new Error(`Unknown algorithm: ${algorithm}`);
    }
    
    const healthCheckType = options.healthCheckType || 'http';
    
    const config = {
      name: options.name || `${type}-lb`,
      type,
      algorithm,
      listeners: options.listeners || [
        { protocol: 'HTTP', port: 80 },
        { protocol: 'HTTPS', port: 443 }
      ],
      targets: [],
      healthCheck: await this.healthChecks.configureHealthCheck(
        options.healthCheckTarget || '/health',
        healthCheckType,
        options.healthCheckOptions
      ),
      stickinessEnabled: options.stickinessEnabled || false,
      stickinessType: options.stickinessType || 'cookie',
      stickinessExpiration: options.stickinessExpiration || 86400 // 1 day
    };
    
    // Add targets
    if (options.targets && options.targets.length > 0) {
      for (const target of options.targets) {
        config.targets.push({
          id: target.id,
          host: target.host,
          port: target.port || 80,
          weight: target.weight || 1,
          status: 'initial'
        });
      }
    }
    
    return config;
  }
}
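
The configuration above only names the balancing algorithms. A minimal sketch of how two of them choose a target might look like the following; the function names and the `activeConnections` field are hypothetical and not part of the configuration object.

```javascript
// Round robin: hand out targets in order and wrap around at the end.
function createRoundRobin(targets) {
  if (targets.length === 0) {
    throw new Error('At least one target is required');
  }
  let index = 0;
  return function next() {
    const target = targets[index];
    index = (index + 1) % targets.length;
    return target;
  };
}

// Least connections: pick the target with the fewest active connections.
// Ties go to the target that appears first in the list.
function pickLeastConnections(targets) {
  if (targets.length === 0) {
    throw new Error('At least one target is required');
  }
  return targets.reduce((best, candidate) =>
    candidate.activeConnections < best.activeConnections ? candidate : best
  );
}

const nextTarget = createRoundRobin(['app-1', 'app-2', 'app-3']);
console.log(nextTarget(), nextTarget(), nextTarget(), nextTarget());

console.log(pickLeastConnections([
  { id: 'app-1', activeConnections: 12 },
  { id: 'app-2', activeConnections: 4 }
]).id);
```

Round robin needs no per-target state beyond a counter, while least connections depends on the balancer tracking open connections, which matches the trade-offs listed in the algorithm descriptions.
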
3

Database High Availability

Implement database high availability:

// Database high availability configuration
const databaseHA = {
  // Replication types
  replicationTypes: {
    synchronous: {
      name: 'Synchronous Replication',
      description: 'Primary waits for replica acknowledgment before confirming write',
      consistency: 'Strong',
      latency: 'Higher',
      dataLoss: 'None for acknowledged writes',
      bestFor: ['Financial systems', 'Critical data', 'Regulatory requirements']
    },
    asynchronous: {
      name: 'Asynchronous Replication',
      description: 'Primary confirms write before replica acknowledgment',
      consistency: 'Eventually consistent',
      latency: 'Lower',
      dataLoss: 'Potential for small window of loss',
      bestFor: ['Geographically distributed systems', 'Performance-sensitive applications']
    },
    semiSynchronous: {
      name: 'Semi-Synchronous Replication',
      description: 'Hybrid approach with configurable durability guarantees',
      consistency: 'Configurable',
      latency: 'Moderate',
      dataLoss: 'Configurable risk',
      bestFor: ['Balance of performance and durability', 'Most web applications']
    }
  },
  
  // Database HA architectures
  architectures: {
    primaryReplica: {
      name: 'Primary-Replica',
      description: 'Single primary with one or more read replicas',
      components: ['Primary node', 'Replica nodes', 'Monitoring system'],
      advantages: ['Simple setup', 'Read scalability', 'Backup options'],
      disadvantages: ['Single write point', 'Failover requires promotion'],
      availability: '99.9% - 99.95%'
    },
    multiPrimary: {
      name: 'Multi-Primary (Multi-Master)',
      description: 'Multiple nodes accepting writes with replication',
      components: ['Multiple primary nodes', 'Replication manager', 'Conflict resolution'],
      advantages: ['No single write point', 'Higher write availability', 'Geographic distribution'],
      disadvantages: ['Complex conflict resolution', 'Consistency challenges', 'More complex setup'],
      availability: '99.95% - 99.99%'
    },
    sharded: {
      name: 'Sharded Cluster',
      description: 'Data partitioned across multiple servers',
      components: ['Shard servers', 'Config servers', 'Router servers'],
      advantages: ['Horizontal scalability', 'Performance for large datasets', 'Workload isolation'],
      disadvantages: ['Complex setup and maintenance', 'Uneven sharding challenges', 'Cross-shard transactions'],
      availability: '99.9% - 99.99%'
    }
  },
  
  // Failover strategies
  failoverStrategies: {
    automatic: {
      name: 'Automatic Failover',
      description: 'System automatically detects failure and promotes replica',
      components: ['Health monitoring', 'Failover manager', 'DNS updater'],
      timeToRecover: 'Seconds to minutes',
      humanIntervention: 'None'
    },
    manual: {
      name: 'Manual Failover',
      description: 'Human operator initiates and manages failover process',
      components: ['Monitoring alerts', 'Runbooks', 'Admin tools'],
      timeToRecover: 'Minutes to hours',
      humanIntervention: 'Required'
    },
    orchestrated: {
      name: 'Orchestrated Failover',
      description: 'Automated with approval gates and validation',
      components: ['Orchestration platform', 'Validation tests', 'Approval workflow'],
      timeToRecover: 'Minutes',
      humanIntervention: 'Approval only'
    }
  },
  
  // Configure database HA
  async configure(dbType, architecture, options = {}) {
    console.log(`Configuring ${dbType} with ${architecture} architecture`);
    
    const architectureConfig = this.architectures[architecture];
    
    if (!architectureConfig) {
      throw new Error(`Unknown architecture: ${architecture}`);
    }
    
    const replicationType = options.replicationType || 'synchronous';
    const replicationConfig = this.replicationTypes[replicationType];
    
    if (!replicationConfig) {
      throw new Error(`Unknown replication type: ${replicationType}`);
    }
    
    const failoverType = options.failoverType || 'automatic';
    const failoverConfig = this.failoverStrategies[failoverType];
    
    if (!failoverConfig) {
      throw new Error(`Unknown failover type: ${failoverType}`);
    }
    
    // Create configuration
    const config = {
      dbType,
      architecture,
      replication: {
        type: replicationType,
        ...replicationConfig
      },
      failover: {
        type: failoverType,
        ...failoverConfig
      },
      nodes: [],
      monitoring: {
        interval: options.monitoringInterval || 10, // seconds
        healthEndpoint: options.healthEndpoint || '/health',
        metrics: options.metrics || ['cpu', 'memory', 'connections', 'replication_lag']
      }
    };
    
    // Configure nodes
    if (options.nodes && options.nodes.length > 0) {
      config.nodes = options.nodes.map(node => ({
        id: node.id,
        host: node.host,
        port: node.port,
        role: node.role || 'replica',
        zone: node.zone,
        priority: node.priority || 1
      }));
    } else {
      // Create default nodes
      const zones = ['us-east-1a', 'us-east-1b', 'us-east-1c'];
      
      if (architecture === 'primaryReplica') {
        config.nodes.push({
          id: 'primary-1',
          host: 'db-primary',
          port: 5432,
          role: 'primary',
          zone: zones[0],
          priority: 1
        });
        
        for (let i = 0; i < 2; i++) {
          config.nodes.push({
            id: `replica-${i+1}`,
            host: `db-replica-${i+1}`,
            port: 5432,
            role: 'replica',
            zone: zones[i+1],
            priority: i+2
          });
        }
      } else if (architecture === 'multiPrimary') {
        for (let i = 0; i < 3; i++) {
          config.nodes.push({
            id: `primary-${i+1}`,
            host: `db-primary-${i+1}`,
            port: 5432,
            role: 'primary',
            zone: zones[i],
            priority: i+1
          });
        }
      } else if (architecture === 'sharded') {
        // Config servers
        for (let i = 0; i < 3; i++) {
          config.nodes.push({
            id: `config-${i+1}`,
            host: `db-config-${i+1}`,
            port: 27017,
            role: 'config',
            zone: zones[i % zones.length],
            priority: 1
          });
        }
        
        // Shard servers (2 shards with 3 nodes each)
        for (let shard = 0; shard < 2; shard++) {
          for (let node = 0; node < 3; node++) {
            const role = node === 0 ? 'primary' : 'replica';
            config.nodes.push({
              id: `shard-${shard+1}-${role}-${node+1}`,
              host: `db-shard-${shard+1}-${node+1}`,
              port: 27017,
              role: `shard-${role}`,
              zone: zones[node % zones.length],
              priority: node === 0 ? 1 : node+1,
              shardId: shard+1
            });
          }
        }
        
        // Router servers
        for (let i = 0; i < 2; i++) {
          config.nodes.push({
            id: `router-${i+1}`,
            host: `db-router-${i+1}`,
            port: 27017,
            role: 'router',
            zone: zones[i % zones.length],
            priority: 1
          });
        }
      }
    }
    
    return config;
  }
}
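
Automatic failover has to decide which replica to promote. The sketch below shows one possible selection rule, assuming that a lower `priority` number is preferred (as in the default nodes above) and that each node reports two hypothetical fields, `healthy` and `replicationLagSeconds`.

```javascript
// Choose the replica to promote: healthy, not too far behind the primary,
// best priority first, lowest replication lag as the tie-breaker.
function selectFailoverCandidate(nodes, maxLagSeconds = 60) {
  const candidates = nodes.filter(node =>
    node.role === 'replica' &&
    node.healthy &&
    node.replicationLagSeconds <= maxLagSeconds
  );

  if (candidates.length === 0) {
    return null; // nothing safe to promote; escalate to an operator
  }

  candidates.sort((a, b) =>
    a.priority - b.priority ||
    a.replicationLagSeconds - b.replicationLagSeconds
  );

  return candidates[0];
}

const clusterNodes = [
  { id: 'primary-1', role: 'primary', priority: 1, healthy: false, replicationLagSeconds: 0 },
  { id: 'replica-1', role: 'replica', priority: 2, healthy: true, replicationLagSeconds: 120 },
  { id: 'replica-2', role: 'replica', priority: 3, healthy: true, replicationLagSeconds: 4 }
];

const promoted = selectFailoverCandidate(clusterNodes);
console.log(promoted ? promoted.id : 'no candidate');
```

Filtering on replication lag before looking at priority keeps a badly lagging replica from being promoted, which would otherwise turn the lag into lost writes under asynchronous replication.
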
4

Auto-Scaling Implementation

Configure auto-scaling for dynamic capacity:

// Auto-scaling configuration
const autoScaling = {
  // Scaling types
  scalingTypes: {
    horizontal: {
      name: 'Horizontal Scaling (Scaling Out)',
      description: 'Add or remove instances to handle load',
      advantages: ['Linear scalability', 'Improved availability', 'No downtime'],
      disadvantages: ['State management challenges', 'Load balancing required', 'Potential data consistency issues'],
      bestFor: ['Stateless applications', 'Web servers', 'Microservices']
    },
    vertical: {
      name: 'Vertical Scaling (Scaling Up)',
      description: 'Increase resources (CPU, memory) of existing instances',
      advantages: ['Simpler implementation', 'Better for stateful applications', 'Less network overhead'],
      disadvantages: ['Limited by hardware', 'Potential downtime', 'Not linearly scalable'],
      bestFor: ['Databases', 'Monolithic applications', 'Memory-intensive workloads']
    },
    diagonal: {
      name: 'Diagonal Scaling (Hybrid)',
      description: 'Combination of horizontal and vertical scaling',
      advantages: ['Flexible resource utilization', 'Cost optimization', 'Adaptable to workload'],
      disadvantages: ['Complex implementation', 'Requires sophisticated monitoring', 'Advanced orchestration'],
      bestFor: ['Complex applications', 'Variable workloads', 'Enterprise systems']
    }
  },
  
  // Scaling metrics
  metrics: {
    cpu: {
      name: 'CPU Utilization',
      description: 'Percentage of CPU capacity in use',
      thresholds: {
        scaleOut: 70, // percentage
        scaleIn: 30 // percentage
      },
      period: 300, // seconds
      evaluationPeriods: 2
    },
    memory: {
      name: 'Memory Utilization',
      description: 'Percentage of memory in use',
      thresholds: {
        scaleOut: 75, // percentage
        scaleIn: 40 // percentage
      },
      period: 300, // seconds
      evaluationPeriods: 2
    },
    requests: {
      name: 'Request Count',
      description: 'Number of requests per instance',
      thresholds: {
        scaleOut: 1000, // requests per minute
        scaleIn: 300 // requests per minute
      },
      period: 60, // seconds
      evaluationPeriods: 3
    },
    responseTime: {
      name: 'Response Time',
      description: 'Average response time in milliseconds',
      thresholds: {
        scaleOut: 500, // milliseconds
        scaleIn: 200 // milliseconds
      },
      period: 60, // seconds
      evaluationPeriods: 3
    },
    custom: {
      name: 'Custom Metric',
      description: 'User-defined metric for scaling',
      thresholds: {
        scaleOut: null, // to be defined
        scaleIn: null // to be defined
      },
      period: 60, // seconds
      evaluationPeriods: 3
    }
  },
  
  // Scaling policies
  policies: {
    targetTracking: {
      name: 'Target Tracking',
      description: 'Maintain a specific target value for a metric',
      configuration: {
        targetValue: null, // to be defined
        metric: null, // to be defined
        scaleOutCooldown: 300, // seconds
        scaleInCooldown: 300 // seconds
      }
    },
    stepScaling: {
      name: 'Step Scaling',
      description: 'Scale based on metric thresholds with step adjustments',
      configuration: {
        metric: null, // to be defined
        adjustments: [
          { lower: null, upper: null, adjustment: null } // to be defined
        ],
        scaleOutCooldown: 300, // seconds
        scaleInCooldown: 300 // seconds
      }
    },
    scheduled: {
      name: 'Scheduled Scaling',
      description: 'Scale based on predictable schedules',
      configuration: {
        schedules: [
          { recurrence: null, minCapacity: null, maxCapacity: null } // to be defined
        ]
      }
    },
    predictive: {
      name: 'Predictive Scaling',
      description: 'Scale based on load forecasting and machine learning',
      configuration: {
        forecastingModel: 'ml.timeseries',
        metricPattern: 'daily', // daily, weekly, monthly
        lookbackDays: 14,
        forecastHorizon: 48, // hours
        scaleOutCooldown: 300, // seconds
        scaleInCooldown: 300 // seconds
      }
    }
  },
  
  // Configure auto-scaling
  async configure(resourceType, scalingType, options = {}) {
    console.log(`Configuring ${scalingType} scaling for ${resourceType}`);
    
    const scalingTypeConfig = this.scalingTypes[scalingType];
    
    if (!scalingTypeConfig) {
      throw new Error(`Unknown scaling type: ${scalingType}`);
    }
    
    // Create configuration
    const config = {
      resourceType,
      scalingType,
      capacity: {
        min: options.minCapacity || 2,
        max: options.maxCapacity || 10,
        desired: options.desiredCapacity || options.minCapacity || 2 // never default below the minimum
      },
      metrics: [],
      policies: [],
      cooldown: {
        scaleOut: options.scaleOutCooldown || 300, // seconds
        scaleIn: options.scaleInCooldown || 300 // seconds
      }
    };
    
    // Configure metrics
    if (options.metrics && options.metrics.length > 0) {
      for (const metricName of options.metrics) {
        const metric = this.metrics[metricName];
        
        if (!metric) {
          console.warn(`Unknown metric: ${metricName}, skipping`);
          continue;
        }
        
        config.metrics.push({
          name: metric.name,
          type: metricName,
          thresholds: { ...metric.thresholds },
          period: options.metricPeriod || metric.period,
          evaluationPeriods: options.evaluationPeriods || metric.evaluationPeriods
        });
      }
    } else {
      // Default to CPU utilization
      config.metrics.push({
        name: this.metrics.cpu.name,
        type: 'cpu',
        thresholds: { ...this.metrics.cpu.thresholds },
        period: options.metricPeriod || this.metrics.cpu.period,
        evaluationPeriods: options.evaluationPeriods || this.metrics.cpu.evaluationPeriods
      });
    }
    
    // Configure policies
    if (options.policies && options.policies.length > 0) {
      for (const policyConfig of options.policies) {
        const policyType = policyConfig.type;
        const policy = this.policies[policyType];
        
        if (!policy) {
          console.warn(`Unknown policy type: ${policyType}, skipping`);
          continue;
        }
        
        const policyConfiguration = {
          name: policy.name,
          type: policyType,
          ...JSON.parse(JSON.stringify(policy.configuration)) // Deep clone
        };
        
        // Override with provided configuration
        if (policyConfig.configuration) {
          Object.assign(policyConfiguration, policyConfig.configuration);
        }
        
        config.policies.push(policyConfiguration);
      }
    } else {
      // Default to target tracking policy
      config.policies.push({
        name: this.policies.targetTracking.name,
        type: 'targetTracking',
        targetValue: 70, // 70% CPU utilization
        metric: 'cpu',
        scaleOutCooldown: config.cooldown.scaleOut,
        scaleInCooldown: config.cooldown.scaleIn
      });
    }
    
    return config;
  }
}
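
The target tracking policy describes what to maintain but not how a new capacity is derived. One common approach is a proportional rule, similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler: scale the instance count by the ratio of the observed metric to the target, then clamp to the configured limits. The sketch below is an assumption about how such a rule could look, not the behavior of any particular cloud provider, and it ignores cooldowns.

```javascript
// Proportional target tracking: if the metric is 20% above target,
// run about 20% more instances (rounded up), within [min, max].
function desiredCapacity(currentCapacity, metricValue, targetValue, min, max) {
  if (targetValue <= 0) {
    throw new Error('targetValue must be positive');
  }
  const proportional = Math.ceil(currentCapacity * (metricValue / targetValue));
  return Math.min(max, Math.max(min, proportional));
}

// 4 instances at 90% CPU with a 70% target: scale out.
console.log(desiredCapacity(4, 90, 70, 2, 10));

// 4 instances at 35% CPU with a 70% target: scale in, but not below the minimum.
console.log(desiredCapacity(4, 35, 70, 2, 10));
```

Rounding up errs on the side of extra capacity, and the clamp guarantees that the result respects the `min` and `max` values from the capacity configuration.
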
5

Monitoring and Alerting

Set up monitoring and alerting for high availability:

// High availability monitoring system
const haMonitoring = {
  // Monitoring components
  components: {
    infrastructure: {
      name: 'Infrastructure Monitoring',
      metrics: [
        'cpu_utilization',
        'memory_utilization',
        'disk_usage',
        'network_throughput',
        'load_average'
      ],
      interval: 60, // seconds
      retention: 90 // days
    },
    application: {
      name: 'Application Monitoring',
      metrics: [
        'request_count',
        'error_rate',
        'response_time',
        'throughput',
        'apdex_score'
      ],
      interval: 30, // seconds
      retention: 30 // days
    },
    database: {
      name: 'Database Monitoring',
      metrics: [
        'query_performance',
        'connection_count',
        'replication_lag',
        'transaction_rate',
        'lock_wait_time'
      ],
      interval: 60, // seconds
      retention: 30 // days
    },
    availability: {
      name: 'Availability Monitoring',
      metrics: [
        'uptime',
        'endpoint_availability',
        'ssl_certificate_validity',
        'dns_resolution',
        'ping_response'
      ],
      interval: 60, // seconds
      retention: 365 // days
    }
  },
  
  // Alert thresholds
  alertThresholds: {
    critical: {
      cpu_utilization: 90, // percentage
      memory_utilization: 90, // percentage
      disk_usage: 90, // percentage
      error_rate: 5, // percentage
      response_time: 1000, // milliseconds
      replication_lag: 300, // seconds
      uptime: 99.9 // percentage (alert if below)
    },
    warning: {
      cpu_utilization: 80, // percentage
      memory_utilization: 80, // percentage
      disk_usage: 80, // percentage
      error_rate: 2, // percentage
      response_time: 500, // milliseconds
      replication_lag: 60, // seconds
      uptime: 99.95 // percentage (alert if below)
    }
  },
  
  // Health checks
  healthChecks: {
    types: {
      http: {
        protocol: 'HTTP/HTTPS',
        method: 'GET',
        expectedStatus: 200,
        timeout: 5, // seconds
        interval: 30 // seconds
      },
      tcp: {
        protocol: 'TCP',
        port: 80,
        timeout: 3, // seconds
        interval: 30 // seconds
      },
      dns: {
        protocol: 'DNS',
        recordType: 'A',
        timeout: 2, // seconds
        interval: 60 // seconds
      }
    },
    
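    async configureHealthCheck(name, type, endpoint, options = {}) {
      // Called as haMonitoring.healthChecks.configureHealthCheck(...), so `this`
      // is the healthChecks object itself
      const healthCheckType = this.types[type];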
      
      if (!healthCheckType) {
        throw new Error(`Unknown health check type: ${type}`);
      }
      
      return {
        name,
        type,
        endpoint,
        protocol: options.protocol || healthCheckType.protocol,
        interval: options.interval || healthCheckType.interval,
        timeout: options.timeout || healthCheckType.timeout,
        threshold: options.threshold || 3,
        regions: options.regions || ['us-east-1', 'us-west-2', 'eu-west-1'],
        alertChannels: options.alertChannels || ['email', 'slack', 'pagerduty']
      };
    }
  },
  
  // Dashboards
  dashboards: {
    types: {
      overview: {
        name: 'System Overview',
        description: 'High-level system health and performance',
        panels: [
          'system_health',
          'error_rates',
          'response_times',
          'throughput',
          'availability'
        ]
      },
      infrastructure: {
        name: 'Infrastructure Performance',
        description: 'Detailed infrastructure metrics',
        panels: [
          'cpu_usage',
          'memory_usage',
          'disk_performance',
          'network_traffic',
          'load_balancer_metrics'
        ]
      },
      availability: {
        name: 'Availability Metrics',
        description: 'System and component availability',
        panels: [
          'uptime',
          'outages',
          'response_success',
          'apdex_score',
          'sla_compliance'
        ]
      }
    },
    
    async createDashboard(type, options = {}) {
      // Called as haMonitoring.dashboards.createDashboard(...), so `this`
      // is the dashboards object itself
      const dashboardType = this.types[type];
      
      if (!dashboardType) {
        throw new Error(`Unknown dashboard type: ${type}`);
      }
      
      return {
        name: options.name || dashboardType.name,
        type,
        description: options.description || dashboardType.description,
        panels: options.panels || dashboardType.panels,
        refreshRate: options.refreshRate || 60, // seconds
        timeRange: options.timeRange || '24h',
        accessRoles: options.accessRoles || ['admin', 'operations']
      };
    }
  },
  
  // Configure monitoring
  async configure(components = [], options = {}) {
    console.log('Configuring high availability monitoring');
    
    const config = {
      components: {},
      healthChecks: [],
      dashboards: [],
      alerting: {
        channels: options.alertChannels || ['email', 'slack', 'pagerduty'],
        policies: options.alertPolicies || {
          critical: {
            channels: ['email', 'slack', 'pagerduty'],
            escalation: true,
            autoRemediation: options.autoRemediation || false
          },
          warning: {
            channels: ['email', 'slack'],
            escalation: false,
            autoRemediation: false
          }
        }
      }
    };
    
    // Configure components
    for (const component of components) {
      const componentConfig = this.components[component];
      
      if (!componentConfig) {
        console.warn(`Unknown component: ${component}, skipping`);
        continue;
      }
      
      config.components[component] = {
        ...componentConfig,
        enabled: true
      };
    }
    
    // Configure health checks
    if (options.healthChecks && options.healthChecks.length > 0) {
      for (const healthCheck of options.healthChecks) {
        config.healthChecks.push(
          await this.healthChecks.configureHealthCheck(
            healthCheck.name,
            healthCheck.type,
            healthCheck.endpoint,
            healthCheck.options
          )
        );
      }
    }
    
    // Configure dashboards
    if (options.dashboards && options.dashboards.length > 0) {
      for (const dashboard of options.dashboards) {
        config.dashboards.push(
          await this.dashboards.createDashboard(
            dashboard.type,
            dashboard.options
          )
        );
      }
    } else {
      // Create default overview dashboard
      config.dashboards.push(
        await this.dashboards.createDashboard('overview')
      );
    }
    
    return config;
  }
}
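
The alert thresholds above still need an evaluation step that turns a metric sample into a severity. A minimal sketch, using a subset of the thresholds and a hypothetical `classifyMetric` helper, could look like this. Note that `uptime` alerts when the value falls below the limit, while the other metrics alert when the value rises to or above it.

```javascript
const alertLimits = {
  critical: { cpu_utilization: 90, error_rate: 5, uptime: 99.9 },
  warning: { cpu_utilization: 80, error_rate: 2, uptime: 99.95 }
};

// Metrics where a lower value is worse.
const lowerIsWorse = new Set(['uptime']);

function classifyMetric(metric, value) {
  const breached = (limit) => {
    if (limit === undefined) return false; // no threshold defined for this metric
    return lowerIsWorse.has(metric) ? value < limit : value >= limit;
  };

  // Check the most severe level first.
  if (breached(alertLimits.critical[metric])) return 'critical';
  if (breached(alertLimits.warning[metric])) return 'warning';
  return 'ok';
}

console.log(classifyMetric('cpu_utilization', 85));
console.log(classifyMetric('uptime', 99.5));
```
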

Best Practices

Architecture Design

Best practices for HA architecture:

  • Eliminate single points of failure
  • Implement redundancy at all layers
  • Design for graceful degradation
  • Automate recovery processes

Load Balancing

Optimize load balancing:

  • Use health checks for all backends
  • Implement session persistence
  • Configure proper timeouts
  • Monitor balancer performance
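
Health checks are usually paired with consecutive-result thresholds so that one slow response does not eject a backend. The sketch below assumes a backend is marked unhealthy after three consecutive failures and healthy again after two consecutive successes, mirroring the threshold values used in the load balancer configuration earlier; the `createHealthTracker` helper is hypothetical.

```javascript
function createHealthTracker({ healthyThreshold = 2, unhealthyThreshold = 3 } = {}) {
  let status = 'healthy';
  let consecutiveSuccesses = 0;
  let consecutiveFailures = 0;

  return {
    // Record one health check result and return the resulting status.
    record(checkPassed) {
      if (checkPassed) {
        consecutiveSuccesses += 1;
        consecutiveFailures = 0;
        if (status === 'unhealthy' && consecutiveSuccesses >= healthyThreshold) {
          status = 'healthy';
        }
      } else {
        consecutiveFailures += 1;
        consecutiveSuccesses = 0;
        if (status === 'healthy' && consecutiveFailures >= unhealthyThreshold) {
          status = 'unhealthy';
        }
      }
      return status;
    },
    get status() {
      return status;
    }
  };
}

const backend = createHealthTracker();
[false, false, false, true, true].forEach(result => {
  console.log(result ? 'pass' : 'fail', '->', backend.record(result));
});
```
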

Data Management

Ensure data availability:

  • Implement data replication
  • Use distributed storage
  • Verify backups regularly
  • Plan for data recovery

High Availability Principles

Redundancy

Duplicate critical components and systems to eliminate single points of failure. When one component fails, the redundant component takes over.

Examples:

  • Multiple application servers behind a load balancer
  • Database primary with standby replicas
  • Redundant network paths
  • Multiple power supplies

Fault Isolation

Design systems so that failures in one component don't cascade to others. Isolate components to contain failures within boundaries.

Examples:

  • Multiple availability zones
  • Bulkhead pattern in microservices
  • Circuit breakers for API calls
  • Resource quotas and limits
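
Of these examples, the circuit breaker is the easiest to show in a few lines. The sketch below is a simplified version under stated assumptions: the circuit opens after a fixed number of consecutive failures, rejects calls while open, and lets a trial call through once a reset timeout has passed. The class, its option names, and the `callWithBreaker` helper are all hypothetical.

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, useful for testing
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    const elapsed = this.now() - this.openedAt;
    return elapsed >= this.resetTimeoutMs ? 'half-open' : 'open';
  }

  allowRequest() {
    return this.state !== 'open';
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    // A failed trial call in half-open state reopens the circuit immediately.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}

// Wrap any promise-returning operation with the breaker.
async function callWithBreaker(breaker, operation) {
  if (!breaker.allowRequest()) {
    throw new Error('Circuit open: request rejected');
  }
  try {
    const result = await operation();
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}
```

Rejecting calls while the circuit is open gives the failing dependency time to recover and keeps callers from piling up on timeouts, which is how the pattern stops a local failure from cascading.
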

Replication

Maintain multiple copies of data across different locations to ensure data availability even if some storage systems fail.

Examples:

  • Database replication
  • Distributed file systems
  • Content delivery networks
  • Multi-region data stores

Automated Recovery

Implement systems that can automatically detect failures and recover without human intervention to minimize downtime.

Examples:

  • Auto-scaling groups
  • Self-healing systems
  • Automated failover
  • Health checks with remediation

Understanding Availability Levels

Availability                      | Downtime per Year | Downtime per Month | Typical Use Case
99% ("Two Nines")                 | 3.65 days         | 7.3 hours          | Development environments, non-critical internal tools
99.9% ("Three Nines")             | 8.76 hours        | 43.8 minutes       | Internal business applications, content websites
99.95% ("Three and a Half Nines") | 4.38 hours        | 21.9 minutes       | E-commerce platforms, SaaS applications
99.99% ("Four Nines")             | 52.56 minutes     | 4.38 minutes       | Financial systems, critical business services
99.999% ("Five Nines")            | 5.26 minutes      | 26.3 seconds       | Telecommunications, emergency services, critical infrastructure

Note: Each additional nine cuts the permitted downtime by a factor of ten, and the investment in infrastructure, architecture, and operations needed to reach it rises steeply. Base the availability target on business requirements and cost considerations.
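
The downtime figures for each availability level follow from simple arithmetic. The sketch below assumes a 365-day year (8,760 hours) and an average month of one twelfth of that (730 hours); the helper name is hypothetical.

```javascript
const HOURS_PER_YEAR = 365 * 24;            // 8760
const HOURS_PER_MONTH = HOURS_PER_YEAR / 12; // 730

// Hours of permitted downtime in a period for a given availability percentage.
function allowedDowntimeHours(availabilityPercent, periodHours) {
  return periodHours * (1 - availabilityPercent / 100);
}

for (const level of [99, 99.9, 99.95, 99.99, 99.999]) {
  const perYearHours = allowedDowntimeHours(level, HOURS_PER_YEAR);
  const perMonthMinutes = allowedDowntimeHours(level, HOURS_PER_MONTH) * 60;
  console.log(
    `${level}%: ${perYearHours.toFixed(2)} hours/year, ` +
    `${perMonthMinutes.toFixed(2)} minutes/month`
  );
}
```
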

Common Challenges

Design Challenges

Common architecture issues:

  • Overlooked single points of failure
  • Improper redundancy implementation
  • Network partition handling
  • Cascading failures

Operational Challenges

Day-to-day operational issues:

  • Split-brain scenarios
  • Replication lag
  • Failover timing issues
  • Monitoring blind spots
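
Split-brain is commonly prevented with a majority quorum: a group of nodes may elect or keep a primary only if it can reach more than half of the cluster. A minimal sketch of that check, with a hypothetical helper name, could look like this.

```javascript
// A partition may act as the primary side only if it holds a strict majority.
function hasQuorum(reachableNodes, clusterSize) {
  return reachableNodes > Math.floor(clusterSize / 2);
}

// Three-node cluster split 2 / 1: only the two-node side may continue.
console.log(hasQuorum(2, 3), hasQuorum(1, 3));

// Four-node cluster split 2 / 2: neither side has a majority, so both stop
// accepting writes. This is why odd cluster sizes are usually preferred.
console.log(hasQuorum(2, 4), hasQuorum(2, 4));
```

Because two disjoint partitions can never both hold a strict majority, at most one side keeps accepting writes during a network partition.
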

Next Steps

Now that you understand high availability, explore these related topics: