
Service High Availability with ExaBGP

Building resilient, self-healing services with BGP-based failover

πŸ”„ Application-driven high availability - services control their own routing and failover


Overview

Service High Availability (HA) with ExaBGP lets each service instance announce its service IP via BGP while healthy and automatically withdraw it when unhealthy, so the network itself routes around failures.

The Traditional HA Problem

Without ExaBGP:

Load Balancer (Single Point of Failure)
       ↓
   β”Œβ”€β”€β”€β”΄β”€β”€β”€β”
   β–Ό       β–Ό
Server 1  Server 2

Issues:
- Load balancer is SPOF
- Expensive hardware
- Manual failover configuration
- Limited geographic distribution

With ExaBGP:

No central load balancer
Network routes to healthy instances

Server 1 (healthy) ──→ Announces route ──→ Receives traffic βœ…
Server 2 (healthy) ──→ Announces route ──→ Receives traffic βœ…
Server 3 (failed)  ──→ Withdraws route ──→ No traffic ❌

Benefits:
- No single point of failure
- Automatic failover (5-15 seconds)
- Geographic distribution
- Cost-effective

High Availability Concepts

Service Availability

Key metrics:

  • Uptime: Percentage of time service is available
  • MTBF (Mean Time Between Failures): Average time service runs
  • MTTR (Mean Time To Recover): Average time to restore service
  • RTO (Recovery Time Objective): Maximum acceptable downtime
  • RPO (Recovery Point Objective): Maximum acceptable data loss

HA Formula:

Availability = MTBF / (MTBF + MTTR)

Example:
MTBF = 720 hours (30 days)
MTTR = 0.25 hours (15 minutes)
Availability = 720 / (720 + 0.25) = 99.97%
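
The same arithmetic in Python, handy for quick what-if calculations:

def availability(mtbf_hours, mttr_hours):
    """Availability as a percentage, from MTBF and MTTR."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)

print(f"{availability(720, 0.25):.2f}%")   # 99.97%
print(f"{availability(720, 1.00):.2f}%")   # 99.86% - MTTR dominates once MTBF is high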

ExaBGP HA Advantages

ExaBGP's Key Advantage: No Single Point of Failure

Traditional Architecture (Load Balancer):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         Load Balancer (HAProxy/NGINX)       β”‚ ← Single Point of Failure
β”‚         (Central Device)                     β”‚ ← Must be in ONE location
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β–Ό           β–Ό           β–Ό
   Server 1    Server 2    Server 3

Problem: Load balancer MUST be centralized
- Cannot span multiple data centers without becoming SPOF
- Very fast failover (< 1 second) BUT only between backends
- Load balancer itself is single point of failure
- If DC with load balancer fails, entire service fails

ExaBGP Architecture (Distributed):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Server 1     β”‚    β”‚ Server 2     β”‚    β”‚ Server 3     β”‚
β”‚ + ExaBGP     β”‚    β”‚ + ExaBGP     β”‚    β”‚ + ExaBGP     β”‚
β”‚ (DC-1)       β”‚    β”‚ (DC-1)       β”‚    β”‚ (DC-2)       β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                    BGP Announcements

No single point of failure:
- Each instance independent
- Can span multiple data centers
- DC-1 fails β†’ DC-2 automatically takes over (BGP convergence: 5-15s)
- Slower failover than load balancer, but survives DC failure

Comparison with other HA mechanisms:

Layer 7 Load Balancer (HAProxy/NGINX):
- Very fast failover between backends (< 1 second)
- Works across Layer 3 (no Layer 2 requirement)
- BUT: Centralized architecture (single device)
- BUT: Cannot span data centers without becoming SPOF
- Best for: Fast failover within single location

ExaBGP:
- Slower failover (5-15 seconds BGP convergence)
- Fully distributed (no central device)
- Can span multiple data centers
- Survives entire DC failure
- Best for: Geographic redundancy, eliminating SPOF

Combined Architecture (Best of Both):
  ExaBGP β†’ Distribute traffic across multiple DCs
      ↓
  HAProxy/NGINX in each DC β†’ Fast local failover
      ↓
  Backend servers

DNS-based HA:
- Very slow (30-60 seconds due to DNS TTL)
- Client-side caching issues
- Best used with ExaBGP for multi-region routing

Common Use Case: ExaBGP Provides Resilience TO Load Balancers

ExaBGP announces load balancer VIPs:
- HAProxy-DC1 (healthy) β†’ announces 100.10.0.100 β†’ receives traffic
- HAProxy-DC2 (healthy) β†’ announces 100.10.0.100 β†’ receives traffic
- If HAProxy-DC1 fails β†’ withdraws route β†’ traffic goes to DC2

Result: Fast local failover + geographic redundancy
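
A minimal sketch of the ExaBGP process for this pattern, assuming HAProxy answers on local port 80 and the VIP shown (both illustrative):

#!/usr/bin/env python3
import socket
import sys
import time

VIP = "100.10.0.100"  # load balancer VIP announced from each DC

def haproxy_up():
    # TCP check against the local HAProxy frontend
    try:
        with socket.create_connection(("127.0.0.1", 80), timeout=2):
            return True
    except OSError:
        return False

time.sleep(2)
announced = False

while True:
    up = haproxy_up()
    if up and not announced:
        sys.stdout.write(f"announce route {VIP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not up and announced:
        sys.stdout.write(f"withdraw route {VIP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False
    time.sleep(5)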

Architecture Patterns

Pattern 1: Active-Active HA

Multiple active instances serving traffic simultaneously:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Server 1     β”‚    β”‚ Server 2     β”‚    β”‚ Server 3     β”‚
β”‚ Service: UP  β”‚    β”‚ Service: UP  β”‚    β”‚ Service: UP  β”‚
β”‚ ExaBGP: βœ…   β”‚    β”‚ ExaBGP: βœ…   β”‚    β”‚ ExaBGP: βœ…   β”‚
β”‚ Announces    β”‚    β”‚ Announces    β”‚    β”‚ Announces    β”‚
β”‚ 100.10.0.100 β”‚    β”‚ 100.10.0.100 β”‚    β”‚ 100.10.0.100 β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
                    Traffic distributed
                   (ECMP load balancing)

Characteristics:

  • All instances active
  • Traffic distributed via ECMP
  • Horizontal scaling (add more servers = more capacity)
  • No wasted standby capacity

Configuration:

# Each server runs the same check and announces the same IP
SERVICE_IP = "100.10.0.100"

if is_service_healthy():
    sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
    sys.stdout.flush()

Pattern 2: Active-Passive HA

One active instance, others on standby:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Primary      β”‚    β”‚ Secondary    β”‚
β”‚ Service: UP  β”‚    β”‚ Service: UP  β”‚
β”‚ ExaBGP: βœ…   β”‚    β”‚ ExaBGP: ⏸️   β”‚
β”‚ Announces    β”‚    β”‚ Silent       β”‚
β”‚ MED=100      β”‚    β”‚ (or MED=200) β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β–Ό
        Traffic to Primary

If Primary fails:
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚ Secondary    β”‚
                β”‚ Service: UP  β”‚
                β”‚ ExaBGP: βœ…   β”‚
                β”‚ Announces    β”‚
                β”‚ MED=100      β”‚
                β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                       β–Ό
              Traffic to Secondary

Implementation with MED:

# Primary
if is_service_healthy():
    sys.stdout.write("announce route 100.10.0.100/32 next-hop self med 100\n")
    sys.stdout.flush()

# Secondary
if is_service_healthy():
    sys.stdout.write("announce route 100.10.0.100/32 next-hop self med 200\n")  # higher MED = backup
    sys.stdout.flush()

Pattern 3: Geographic HA

Active instances in multiple regions:

Region A (US-East)         Region B (EU-West)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Servers 1-3  β”‚           β”‚ Servers 4-6  β”‚
β”‚ ExaBGP       β”‚           β”‚ ExaBGP       β”‚
β”‚ 100.10.0.100 β”‚           β”‚ 100.10.0.100 β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜           β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                          β”‚
       β–Ό                          β–Ό
US Clients routed to A     EU Clients routed to B

If Region A fails β†’ all traffic to Region B
If Region B fails β†’ all traffic to Region A

Benefits:

  • Disaster recovery
  • Low latency (geo-proximity routing)
  • Regulatory compliance (data residency)

Health Check Strategies

⭐ RECOMMENDED: Use Built-in Healthcheck Module

ExaBGP includes a production-ready exabgp healthcheck tool that handles all health check patterns below - no custom scripting required!

# Zero-code health check with rise/fall dampening, metrics, and execution hooks
exabgp healthcheck --cmd "curl -sf http://localhost/health" --ip 10.0.0.1/32 --rise 3 --fall 2

See Healthcheck Module for complete documentation with examples.

Custom scripts (shown below) are only needed for complex logic (10% of use cases). For most deployments, use the built-in module.
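
As a sketch of how the module wires into a configuration (process name illustrative; use an absolute path for run if your setup requires it, and reference the process from the neighbor's api block as in the examples later on this page):

process watch-service {
    run exabgp healthcheck --cmd "curl -sf http://localhost/health" --ip 10.0.0.1/32 --rise 3 --fall 2;
    encoder text;
}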


1. TCP Port Check (Basic)

Check if port is open:

import socket

def tcp_check(host, port, timeout=2):
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except OSError:
        return False

Pros:

  • Simple
  • Fast

Cons:

  • Doesn't verify service functionality
  • Port open β‰  service healthy

2. HTTP Endpoint Check

Check HTTP /health endpoint:

import urllib.request

def http_health_check(url='http://127.0.0.1/health', timeout=2):
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
        if response.getcode() == 200:
            # Optionally check response body
            body = response.read().decode('utf-8')
            return 'OK' in body
        return False
    except Exception:
        return False

Health endpoint example (Flask):

from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

@app.route('/health')
def health():
    # Check database connection
    try:
        conn = psycopg2.connect('dbname=mydb')
        conn.close()
        return jsonify({'status': 'healthy'}), 200
    except Exception:
        return jsonify({'status': 'unhealthy'}), 503

if __name__ == '__main__':
    app.run(port=8080)

Pros:

  • Verifies service responds
  • Can check dependencies (database, cache, etc.)
  • Application-specific logic

3. Comprehensive Health Check

Check all critical dependencies:

import socket
import urllib.request
import psycopg2
import redis

def comprehensive_health_check():
    checks = {
        'web': check_web_server(),
        'database': check_database(),
        'cache': check_redis(),
        'disk_space': check_disk_space(),
        'memory': check_memory(),
    }

    # All checks must pass
    return all(checks.values())

def check_web_server():
    try:
        response = urllib.request.urlopen('http://127.0.0.1:80/health', timeout=2)
        return response.getcode() == 200
    except Exception:
        return False

def check_database():
    try:
        conn = psycopg2.connect(host='127.0.0.1', database='mydb', user='monitor', password='secret')
        cursor = conn.cursor()
        cursor.execute('SELECT 1')
        result = cursor.fetchone()
        conn.close()
        return result[0] == 1
    except Exception:
        return False

def check_redis():
    try:
        r = redis.Redis(host='127.0.0.1', port=6379, socket_timeout=2)
        return r.ping()
    except Exception:
        return False

def check_disk_space():
    import shutil
    stat = shutil.disk_usage('/')
    free_percent = (stat.free / stat.total) * 100
    return free_percent > 10  # At least 10% free

def check_memory():
    import psutil
    mem = psutil.virtual_memory()
    return mem.available > 1024 * 1024 * 1024  # At least 1 GB free

4. Load-Based Health Checks

Health based on current load/performance:

⚠️ Important: BGP is Binary (All-or-Nothing)

BGP cannot do proportional/weighted traffic distribution. You can only:

  • Announce a route (receive traffic)
  • Withdraw a route (stop receiving traffic)

There is NO way to receive "50% of traffic" via BGP. When multiple instances announce the same prefix, routers use ECMP (Equal-Cost Multi-Path) which distributes traffic equally via flow-based hashing.

For TCP services: Withdrawing a route causes existing connections to break. Use high thresholds (e.g., 95% CPU) to avoid unnecessary disruptions.

import psutil

def load_based_health():
    """
    Binary health check based on load.
    Returns False only when server is severely overloaded.
    Use HIGH thresholds to avoid connection disruption.
    """
    # CPU load - very high threshold
    cpu_percent = psutil.cpu_percent(interval=1)
    if cpu_percent > 95:
        return False  # Severely overloaded

    # Memory - very high threshold
    mem = psutil.virtual_memory()
    if mem.percent > 95:
        return False  # Critical memory pressure

    # Connection count - very high threshold
    connections = len(psutil.net_connections(kind='inet'))
    if connections > 50000:
        return False  # Dangerously high connection count

    return True

Use case: Prevent complete service failure by removing severely overloaded instances

Not suitable for:

  • Proportional load balancing (use HAProxy/NGINX for Layer 7 weighted distribution)
  • Fine-grained traffic shaping
  • Gradual capacity management

Failover Mechanisms

Automatic Failover

ExaBGP script with automatic failover:

#!/usr/bin/env python3
"""
Automatic failover based on health checks
"""
import sys
import time
import socket

SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5

# Dampening: require N consecutive failures
FALL_THRESHOLD = 2
fall_count = 0
announced = False

def is_healthy():
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)
        result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
        sock.close()
        return result == 0
    except OSError:
        return False

time.sleep(2)  # give ExaBGP a moment to initialise its API pipe

while True:
    healthy = is_healthy()

    if healthy:
        fall_count = 0
        if not announced:
            # Service recovered, announce
            sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write(f"[FAILOVER] Service recovered, announcing route\n")
            announced = True

    else:
        fall_count += 1
        if fall_count >= FALL_THRESHOLD and announced:
            # Service failed, trigger failover
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write(f"[FAILOVER] Service failed, withdrawing route (traffic fails over to other instances)\n")
            announced = False

    time.sleep(CHECK_INTERVAL)

Failover timeline:

T+0s  : Service fails
T+5s  : Health check detects failure
T+10s : Second check confirms (fall threshold = 2)
T+10s : ExaBGP withdraws route
T+15s : BGP convergence complete
T+15s : Traffic fails over to healthy instances

Manual Failover (Maintenance Mode)

Gracefully drain traffic before maintenance:

#!/usr/bin/env python3
"""
Maintenance mode support
Create /var/run/maintenance file to drain traffic
"""
import sys
import time
import socket
import os

SERVICE_IP = "100.10.0.100"
MAINTENANCE_FILE = "/var/run/maintenance"

def is_maintenance_mode():
    return os.path.exists(MAINTENANCE_FILE)

def is_healthy():
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)
        result = sock.connect_ex(('127.0.0.1', 80))
        sock.close()
        return result == 0
    except OSError:
        return False

time.sleep(2)
announced = False

while True:
    if is_maintenance_mode():
        if announced:
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            sys.stderr.write(f"[MAINTENANCE] Entering maintenance mode\n")
            announced = False
    else:
        healthy = is_healthy()
        if healthy and not announced:
            sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = True
        elif not healthy and announced:
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = False

    time.sleep(5)

Maintenance workflow:

# 1. Enter maintenance mode (stops receiving new traffic)
touch /var/run/maintenance

# 2. Wait for existing connections to drain
watch 'ss -tan | grep :80 | grep ESTAB | wc -l'

# 3. Perform maintenance
systemctl restart nginx
systemctl restart application

# 4. Exit maintenance mode (resume receiving traffic)
rm /var/run/maintenance
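
The same workflow wrapped in a small helper script (a sketch; adjust the port, the maintenance path, and the drain condition for long-lived connections):

#!/bin/sh
# drain.sh - drain port-80 traffic, run a command, then resume announcements
touch /var/run/maintenance
while [ "$(ss -Htn state established '( sport = :80 )' | wc -l)" -gt 0 ]; do
    sleep 2    # wait until no established connections remain
done
"$@"           # e.g. ./drain.sh systemctl restart nginx
rm /var/run/maintenance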

Load Distribution

Equal Load Distribution (ECMP)

All servers announce with same metric:

# All servers run identical script
announce route 100.10.0.100/32 next-hop self

Router performs ECMP (Equal-Cost Multi-Path):

Router sees 3 equal-cost paths
β†’ Distributes traffic equally (hash-based)
β†’ Per-flow load balancing (same src/dst goes to same server)

Enable ECMP on routers:

# Cisco
router bgp 65000
 maximum-paths 8

# Juniper
set protocols bgp group servers multipath
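
If your routers run FRR, the equivalent knob (verify against your FRR version) sits under the address family:

# FRR
router bgp 65000
 address-family ipv4 unicast
  maximum-paths 8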

Weighted Load Distribution (Preference Ordering)

BGP MED controls which path is preferred, not how traffic is proportioned. Paths with different MEDs are no longer equal-cost, so a router sends all traffic to the lowest-MED announcement and only falls back to a higher-MED one when it is withdrawn:

# Preferred server (receives the traffic)
announce route 100.10.0.100/32 next-hop self med 50

# First backup
announce route 100.10.0.100/32 next-hop self med 100

# Second backup
announce route 100.10.0.100/32 next-hop self med 150

Note: Lower MED = preferred path. For true weighted distribution, put a Layer 7 load balancer behind the announced VIP (see the binary-routing warning above).


Dynamic Load-Based Distribution

Adjust MED based on current load:

#!/usr/bin/env python3
import sys
import time
import psutil

SERVICE_IP = "100.10.0.100"
BASE_MED = 100

def calculate_med():
    """Calculate MED based on CPU load"""
    cpu_percent = psutil.cpu_percent(interval=1)

    # Higher CPU = higher MED = less preferred
    load_factor = int(cpu_percent)
    med = BASE_MED + load_factor

    return med

time.sleep(2)

while True:
    med = calculate_med()

    sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self med {med}\n")
    sys.stdout.flush()

    sys.stderr.write(f"[LOAD] Announced with MED={med}\n")

    time.sleep(30)  # Update every 30 seconds

Result: the least-loaded server becomes the preferred path. Because differing MEDs break ECMP, this steers traffic to one instance at a time rather than splitting it proportionally; treat it as automated preference ordering, not load balancing.


Common HA Scenarios

Scenario 1: Web Application HA

Setup:

  • 3 web servers (NGINX + application)
  • Anycast IP: 100.10.0.80
  • Health check: HTTP /health endpoint
  • Active-active configuration

Configuration:

# /etc/exabgp/web-ha.conf
neighbor 192.168.1.1 {
    router-id 192.168.1.10;
    local-address 192.168.1.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ web-healthcheck ];
    }
}

process web-healthcheck {
    run /etc/exabgp/web-healthcheck.py;
    encoder text;
}

Health check script:

#!/usr/bin/env python3
import sys
import time
import urllib.request

SERVICE_IP = "100.10.0.80"

def is_web_healthy():
    try:
        response = urllib.request.urlopen('http://127.0.0.1/health', timeout=2)
        return response.getcode() == 200
    except Exception:
        return False

time.sleep(2)
announced = False

while True:
    healthy = is_web_healthy()   # check once per iteration

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)

Scenario 2: Database Read Replica HA

Setup:

  • 1 primary database (writes)
  • 3 read replicas (reads)
  • Anycast read VIP: 100.10.0.54 (PostgreSQL listening on port 5432)
  • Health check: replication lag

Health check:

#!/usr/bin/env python3
import sys
import time
import psycopg2

SERVICE_IP = "100.10.0.5432"
MAX_LAG_SECONDS = 10

def get_replication_lag():
    try:
        conn = psycopg2.connect(host='127.0.0.1', database='postgres', user='monitor')
        cursor = conn.cursor()

        cursor.execute("""
            SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
        """)

        lag = cursor.fetchone()[0]
        conn.close()
        return lag if lag else 0
    except Exception:
        return float('inf')

time.sleep(2)
announced = False

while True:
    lag = get_replication_lag()
    healthy = lag < MAX_LAG_SECONDS

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        sys.stderr.write(f"[DB] Replication lag OK ({lag:.1f}s), announcing\n")
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        sys.stderr.write(f"[DB] Replication lag too high ({lag:.1f}s), withdrawing\n")
        announced = False

    time.sleep(10)

Scenario 3: Multi-Region HA

Setup:

  • Region A: 3 servers
  • Region B: 3 servers
  • Same anycast IP in both regions
  • Clients routed to nearest region

Benefits:

  • Low latency (geo-proximity)
  • Disaster recovery (region failure)
  • Active-active across regions
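
Nothing on the ExaBGP side is region-specific; every server in both regions emits the same announcement, and the network does the region selection:

# Identical on every server in Region A and Region B
announce route 100.10.0.100/32 next-hop self

# Each client follows its shortest BGP/IGP path to 100.10.0.100,
# which lands in its local region while that region announces.
# If a region withdraws entirely, the other region absorbs the traffic.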

Implementation Examples

Complete HA Setup

1. Install ExaBGP on all servers:

pip3 install exabgp

2. Configure service IP on loopback:

ip addr add 100.10.0.100/32 dev lo

3. Create ExaBGP config:

# /etc/exabgp/ha.conf
neighbor 192.168.1.1 {
    router-id 192.168.1.10;
    local-address 192.168.1.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ ha-healthcheck ];
    }
}

process ha-healthcheck {
    run /etc/exabgp/ha-healthcheck.py;
    encoder text;
}

4. Create health check script:

#!/usr/bin/env python3
import sys
import time
import socket

SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80

def is_healthy():
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(2)
        result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
        sock.close()
        return result == 0
    except OSError:
        return False

time.sleep(2)
announced = False

while True:
    healthy = is_healthy()   # check once per iteration

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)

5. Start ExaBGP:

exabgp /etc/exabgp/ha.conf

6. Verify:

# Check route on router
show ip bgp 100.10.0.100

# Should see multiple paths (one per healthy server)

Best Practices

1. Use Rise/Fall Thresholds

Prevent route flapping:

RISE_THRESHOLD = 3  # 3 consecutive successes to announce
FALL_THRESHOLD = 2  # 2 consecutive failures to withdraw
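
A sketch of both counters in the announce/withdraw loop, reusing is_healthy and announce_route from elsewhere on this page (withdraw_route is the obvious counterpart):

rise_count = 0
fall_count = 0
announced = False

while True:
    if is_healthy():
        rise_count += 1
        fall_count = 0
        if rise_count >= RISE_THRESHOLD and not announced:
            announce_route(SERVICE_IP)   # only after 3 consecutive successes
            announced = True
    else:
        fall_count += 1
        rise_count = 0
        if fall_count >= FALL_THRESHOLD and announced:
            withdraw_route(SERVICE_IP)   # only after 2 consecutive failures
            announced = False
    time.sleep(5)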

2. Monitor BGP Session Health

import subprocess

def check_exabgp_running():
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    if result.returncode != 0:
        send_alert("ExaBGP not running!")  # send_alert: your own alerting hook
        return False
    return True

3. Log All Announcements

import sys
import logging

logging.basicConfig(filename='/var/log/exabgp-ha.log', level=logging.INFO)

def announce_route(ip):
    sys.stdout.write(f"announce route {ip}/32 next-hop self\n")
    sys.stdout.flush()
    logging.info(f"ANNOUNCE: {ip}")

4. Implement Maintenance Mode

Allow graceful traffic draining:

# Enter maintenance
touch /var/run/maintenance

# Wait for connections to drain
watch 'ss -tan | grep ESTAB | wc -l'

# Perform maintenance
systemctl restart service

# Exit maintenance
rm /var/run/maintenance

5. Test Failover Regularly

# Monthly failover drill
systemctl stop nginx
# Verify traffic failed over
sleep 60
systemctl start nginx

Monitoring and Alerting

Metrics to Monitor

1. Service Health:

  • Health check success rate
  • Time since last successful check
  • Health check latency

2. BGP State:

  • BGP session state (up/down)
  • Routes announced
  • Routes withdrawn
  • BGP convergence time

3. Failover Events:

  • Number of failovers
  • Time to failover
  • Failed node recovery time

Monitoring Script

#!/usr/bin/env python3
"""
Monitor HA metrics and export to Prometheus
"""
import time
import socket
from prometheus_client import start_http_server, Gauge, Counter

# Metrics
health_check_success = Gauge('ha_health_check_success', 'Health check status (1=healthy, 0=unhealthy)')
route_announced = Gauge('ha_route_announced', 'Route announcement status (1=announced, 0=withdrawn)')
failover_count = Counter('ha_failover_total', 'Total number of failovers')

def is_healthy():
    # Any of the health checks above works here; a TCP check keeps this self-contained
    try:
        with socket.create_connection(('127.0.0.1', 80), timeout=2):
            return True
    except OSError:
        return False

def monitor_ha():
    announced = False

    while True:
        healthy = is_healthy()
        health_check_success.set(1 if healthy else 0)

        if healthy and not announced:
            route_announced.set(1)
            announced = True
        elif not healthy and announced:
            route_announced.set(0)
            failover_count.inc()
            announced = False

        time.sleep(5)

if __name__ == '__main__':
    # Start Prometheus metrics server
    start_http_server(9100)
    monitor_ha()

Troubleshooting

Issue 1: Route Not Failing Over

Symptoms: Service down but traffic still routed to failed instance

Check:

# 1. Verify ExaBGP withdrew route
grep WITHDRAW /var/log/exabgp.log

# 2. Check BGP table on router
show ip bgp 100.10.0.100

# 3. Verify health check detecting failure
tail -f /var/log/exabgp.log

Common causes:

  • Health check not detecting failure
  • ExaBGP not running
  • BGP session down
  • Router not removing route

Issue 2: Route Flapping

Symptoms: Route repeatedly announced/withdrawn

Diagnosis:

# Monitor route changes on the router (with FRR, vtysh makes this scriptable)
watch -d 'vtysh -c "show ip bgp 100.10.0.100" | grep -i paths'

Solutions:

  • Implement rise/fall thresholds
  • Increase health check interval
  • Fix unstable service
  • Add dampening

Issue 3: Uneven Load Distribution

Symptoms: One server gets all traffic despite ECMP

Check:

# Verify ECMP enabled
show ip bgp 100.10.0.100
# Should show "multipath" or "ECMP"

# Check routing table
show ip route 100.10.0.100
# Should show multiple next-hops

Solutions:

# Enable ECMP
router bgp 65000
 maximum-paths 8

Next Steps


Ready to implement HA? See Quick Start β†’


πŸ‘» Ghost written by Claude (Anthropic AI)
