Service High Availability
Building resilient, self-healing services with BGP-based failover
Application-driven high availability - services control their own routing and failover
- Overview
- High Availability Concepts
- Architecture Patterns
- Health Check Strategies
- Failover Mechanisms
- Load Distribution
- Common HA Scenarios
- Implementation Examples
- Best Practices
- Monitoring and Alerting
- Troubleshooting
Service High Availability (HA) with ExaBGP enables services to automatically announce their availability via BGP and withdraw when unhealthy.
Without ExaBGP:
          Load Balancer (Single Point of Failure)
                         │
                   ┌─────┴─────┐
                   ▼           ▼
               Server 1    Server 2
Issues:
- Load balancer is SPOF
- Expensive hardware
- Manual failover configuration
- Limited geographic distribution
With ExaBGP:
No central load balancer
Network routes to healthy instances
Server 1 (healthy) ──── Announces route ───→ Receives traffic ✅
Server 2 (healthy) ──── Announces route ───→ Receives traffic ✅
Server 3 (failed)  ──── Withdraws route ───→ No traffic ❌
Benefits:
- No single point of failure
- Automatic failover (5-15 seconds)
- Geographic distribution
- Cost-effective
Key metrics:
- Uptime: Percentage of time service is available
- MTBF (Mean Time Between Failures): Average time service runs
- MTTR (Mean Time To Recover): Average time to restore service
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
HA Formula:
Availability = MTBF / (MTBF + MTTR)
Example:
MTBF = 720 hours (30 days)
MTTR = 0.25 hours (15 minutes)
Availability = 720 / (720 + 0.25) = 99.97%
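Worked in Python with the numbers above:

# Availability from MTBF and MTTR (values from the example above)
mtbf_hours = 720.0   # 30 days between failures
mttr_hours = 0.25    # 15 minutes to recover

availability = mtbf_hours / (mtbf_hours + mttr_hours)
downtime_minutes = (1 - availability) * 30 * 24 * 60  # over a 30-day window

print(f"Availability: {availability:.4%}")                     # ~99.97%
print(f"Expected downtime per 30 days: ~{downtime_minutes:.0f} minutes")  # ~15 minutes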
ExaBGP's Key Advantage: No Single Point of Failure
Traditional Architecture (Load Balancer):
┌───────────────────────────────────────────┐
│   Load Balancer (HAProxy/NGINX)           │  ← Single Point of Failure
│   (Central Device)                        │  ← Must be in ONE location
└────────────────────┬──────────────────────┘
                     │
         ┌───────────┼───────────┐
         ▼           ▼           ▼
     Server 1    Server 2    Server 3
Problem: Load balancer MUST be centralized
- Cannot span multiple data centers without becoming SPOF
- Very fast failover (< 1 second) BUT only between backends
- Load balancer itself is single point of failure
- If DC with load balancer fails, entire service fails
ExaBGP Architecture (Distributed):
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Server 1   │   │   Server 2   │   │   Server 3   │
│   + ExaBGP   │   │   + ExaBGP   │   │   + ExaBGP   │
│   (DC-1)     │   │   (DC-1)     │   │   (DC-2)     │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┴──────────────────┘
                          │
                  BGP Announcements
No single point of failure:
- Each instance independent
- Can span multiple data centers
- DC-1 fails → DC-2 automatically takes over (BGP convergence: 5-15s)
- Slower failover than load balancer, but survives DC failure
Comparison with other HA mechanisms:
Layer 7 Load Balancer (HAProxy/NGINX):
- Very fast failover between backends (< 1 second)
- Works across Layer 3 (no Layer 2 requirement)
- BUT: Centralized architecture (single device)
- BUT: Cannot span data centers without becoming SPOF
- Best for: Fast failover within single location
ExaBGP:
- Slower failover (5-15 seconds BGP convergence)
- Fully distributed (no central device)
- Can span multiple data centers
- Survives entire DC failure
- Best for: Geographic redundancy, eliminating SPOF
Combined Architecture (Best of Both):
ExaBGP → Distribute traffic across multiple DCs
    ↓
HAProxy/NGINX in each DC → Fast local failover
    ↓
Backend servers
DNS-based HA:
- Very slow (30-60 seconds due to DNS TTL)
- Client-side caching issues
- Best used with ExaBGP for multi-region routing
Common Use Case: ExaBGP Provides Resilience TO Load Balancers
ExaBGP announces load balancer VIPs:
- HAProxy-DC1 (healthy) → announces 100.10.0.100 → receives traffic
- HAProxy-DC2 (healthy) → announces 100.10.0.100 → receives traffic
- If HAProxy-DC1 fails → withdraws route → traffic goes to DC2
Result: Fast local failover + geographic redundancy
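A minimal sketch of this pattern, assuming HAProxy listens locally on port 80, the VIP 100.10.0.100 is configured on the loopback interface, and systemd manages the haproxy unit (all of these are illustrative assumptions):

#!/usr/bin/env python3
# Announce a load balancer VIP only while the local HAProxy is serving traffic
import socket
import subprocess
import sys
import time

VIP = "100.10.0.100"

def haproxy_healthy():
    # Process-level check: is the haproxy unit active?
    if subprocess.run(['systemctl', 'is-active', '--quiet', 'haproxy']).returncode != 0:
        return False
    # Traffic-level check: does the local frontend accept TCP connections?
    try:
        sock = socket.create_connection(('127.0.0.1', 80), timeout=2)
        sock.close()
        return True
    except OSError:
        return False

time.sleep(2)  # let the BGP session come up
announced = False
while True:
    healthy = haproxy_healthy()
    if healthy and not announced:
        sys.stdout.write(f"announce route {VIP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {VIP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False
    time.sleep(5)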
Multiple active instances serving traffic simultaneously:
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Server 1   │   │   Server 2   │   │   Server 3   │
│ Service: UP  │   │ Service: UP  │   │ Service: UP  │
│ ExaBGP: ✅   │   │ ExaBGP: ✅   │   │ ExaBGP: ✅   │
│ Announces    │   │ Announces    │   │ Announces    │
│ 100.10.0.100 │   │ 100.10.0.100 │   │ 100.10.0.100 │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┴──────────────────┘
                          │
                          ▼
                 Traffic distributed
                (ECMP load balancing)
Characteristics:
- All instances active
- Traffic distributed via ECMP
- Horizontal scaling (add more servers = more capacity)
- No wasted standby capacity
Configuration:
# Each server announces same IP
SERVICE_IP = "100.10.0.100"
if is_service_healthy():
announce route {SERVICE_IP}/32 next-hop self

One active instance, others on standby:

┌──────────────┐   ┌──────────────┐
│   Primary    │   │  Secondary   │
│ Service: UP  │   │ Service: UP  │
│ ExaBGP: ✅   │   │ ExaBGP: ⏸️   │
│ Announces    │   │ Silent       │
│ MED=100      │   │ (or MED=200) │
└──────┬───────┘   └──────┬───────┘
       │                  │
       └────────┬─────────┘
                ▼
       Traffic to Primary

If Primary fails:

┌──────────────┐
│  Secondary   │
│ Service: UP  │
│ ExaBGP: ✅   │
│ Announces    │
│ MED=100      │
└──────┬───────┘
       ▼
Traffic to Secondary
Implementation with MED:
# Primary
if is_service_healthy():
announce route 100.10.0.100/32 next-hop self med 100
# Secondary
if is_service_healthy():
announce route 100.10.0.100/32 next-hop self med 200 # Higher MED = backup
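A complete version of this pattern as an ExaBGP process script - a minimal sketch assuming a TCP check on port 80; MY_MED would be set to 100 on the primary and 200 on the secondary:

#!/usr/bin/env python3
# Active-passive announcement: both nodes announce, MED decides which one is preferred
import socket
import sys
import time

SERVICE_IP = "100.10.0.100"
MY_MED = 100  # 100 on the primary, 200 on the secondary

def is_service_healthy():
    try:
        sock = socket.create_connection(('127.0.0.1', 80), timeout=2)
        sock.close()
        return True
    except OSError:
        return False

time.sleep(2)  # let the BGP session establish
announced = False
while True:
    healthy = is_service_healthy()
    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self med {MY_MED}\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = False
    time.sleep(5)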
Active instances in multiple regions:

Region A (US-East)              Region B (EU-West)
┌──────────────┐                ┌──────────────┐
│ Servers 1-3  │                │ Servers 4-6  │
│    ExaBGP    │                │    ExaBGP    │
│ 100.10.0.100 │                │ 100.10.0.100 │
└──────┬───────┘                └──────┬───────┘
       │                               │
       ▼                               ▼
US Clients routed to A          EU Clients routed to B
If Region A fails → all traffic to Region B
If Region B fails → all traffic to Region A
Benefits:
- Disaster recovery
- Low latency (geo-proximity routing)
- Regulatory compliance (data residency)

✅ RECOMMENDED: Use Built-in Healthcheck Module
ExaBGP includes a production-ready exabgp healthcheck tool that handles all health check patterns below - no custom scripting required!

# Zero-code health check with rise/fall dampening, metrics, and execution hooks
exabgp healthcheck --cmd "curl -sf http://localhost/health" --ip 10.0.0.1/32 --rise 3 --fall 2

See Healthcheck Module for complete documentation with examples.
Custom scripts (shown below) are only needed for complex logic (10% of use cases). For most deployments, use the built-in module.
Check if port is open:
import socket
def tcp_check(host, port, timeout=2):
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(timeout)
result = sock.connect_ex((host, port))
sock.close()
return result == 0
except:
return False

Pros:
- Simple
- Fast
Cons:
- Doesn't verify service functionality
- Port open ≠ service healthy
Check HTTP /health endpoint:
import urllib.request
def http_health_check(url='http://127.0.0.1/health', timeout=2):
try:
response = urllib.request.urlopen(url, timeout=timeout)
if response.getcode() == 200:
# Optionally check response body
body = response.read().decode('utf-8')
return 'OK' in body
return False
except:
return False

Health endpoint example (Flask):
from flask import Flask, jsonify
import psycopg2
app = Flask(__name__)
@app.route('/health')
def health():
# Check database connection
try:
conn = psycopg2.connect('dbname=mydb')
conn.close()
return jsonify({'status': 'healthy'}), 200
except:
return jsonify({'status': 'unhealthy'}), 503
if __name__ == '__main__':
app.run(port=8080)

Pros:
- Verifies service responds
- Can check dependencies (database, cache, etc.)
- Application-specific logic
Check all critical dependencies:
import socket
import urllib.request
import psycopg2
import redis
def comprehensive_health_check():
checks = {
'web': check_web_server(),
'database': check_database(),
'cache': check_redis(),
'disk_space': check_disk_space(),
'memory': check_memory(),
}
# All checks must pass
return all(checks.values())
def check_web_server():
try:
response = urllib.request.urlopen('http://127.0.0.1:80/health', timeout=2)
return response.getcode() == 200
except:
return False
def check_database():
try:
conn = psycopg2.connect(host='127.0.0.1', database='mydb', user='monitor', password='secret')
cursor = conn.cursor()
cursor.execute('SELECT 1')
result = cursor.fetchone()
conn.close()
return result[0] == 1
except:
return False
def check_redis():
try:
r = redis.Redis(host='127.0.0.1', port=6379, socket_timeout=2)
return r.ping()
except:
return False
def check_disk_space():
import shutil
stat = shutil.disk_usage('/')
free_percent = (stat.free / stat.total) * 100
return free_percent > 10 # At least 10% free
def check_memory():
import psutil
mem = psutil.virtual_memory()
return mem.available > 1024 * 1024 * 1024 # At least 1 GB free

Health based on current load/performance:

⚠️ Important: BGP is Binary (All-or-Nothing)
BGP cannot do proportional/weighted traffic distribution. You can only:
- Announce a route (receive traffic)
- Withdraw a route (stop receiving traffic)
There is NO way to receive "50% of traffic" via BGP. When multiple instances announce the same prefix, routers use ECMP (Equal-Cost Multi-Path) which distributes traffic equally via flow-based hashing.
For TCP services: Withdrawing a route causes existing connections to break. Use high thresholds (e.g., 95% CPU) to avoid unnecessary disruptions.
import psutil
def load_based_health():
"""
Binary health check based on load.
Returns False only when server is severely overloaded.
Use HIGH thresholds to avoid connection disruption.
"""
# CPU load - very high threshold
cpu_percent = psutil.cpu_percent(interval=1)
if cpu_percent > 95:
return False # Severely overloaded
# Memory - very high threshold
mem = psutil.virtual_memory()
if mem.percent > 95:
return False # Critical memory pressure
# Connection count - very high threshold
connections = len(psutil.net_connections(kind='inet'))
if connections > 50000:
return False # Dangerously high connection count
return True

Use case: Prevent complete service failure by removing severely overloaded instances
Not suitable for:
- Proportional load balancing (use HAProxy/NGINX for Layer 7 weighted distribution)
- Fine-grained traffic shaping
- Gradual capacity management
ExaBGP script with automatic failover:
#!/usr/bin/env python3
"""
Automatic failover based on health checks
"""
import sys
import time
import socket
SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5
# Dampening: require N consecutive failures
FALL_THRESHOLD = 2
fall_count = 0
announced = False
def is_healthy():
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
sock.close()
return result == 0
except:
return False
time.sleep(2)
while True:
healthy = is_healthy()
if healthy:
fall_count = 0
if not announced:
# Service recovered, announce
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
sys.stderr.write(f"[FAILOVER] Service recovered, announcing route\n")
announced = True
else:
fall_count += 1
if fall_count >= FALL_THRESHOLD and announced:
# Service failed, trigger failover
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
sys.stderr.write(f"[FAILOVER] Service failed, withdrawing route (traffic fails over to other instances)\n")
announced = False
time.sleep(CHECK_INTERVAL)

Failover timeline:
T+0s : Service fails
T+5s : Health check detects failure
T+10s : Second check confirms (fall threshold = 2)
T+10s : ExaBGP withdraws route
T+15s : BGP convergence complete
T+15s : Traffic fails over to healthy instances
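The timeline follows directly from the script's parameters; the convergence figure below is an assumption and varies with the network:

# Worst-case failover time under the parameters used in the script above
CHECK_INTERVAL = 5    # seconds between health checks
FALL_THRESHOLD = 2    # consecutive failures required before withdrawing
BGP_CONVERGENCE = 5   # assumed convergence time; depends on topology and timers

detection = CHECK_INTERVAL * FALL_THRESHOLD   # up to 10s to confirm the failure
total = detection + BGP_CONVERGENCE           # ~15s until traffic moves away
print(f"Worst-case failover: ~{total} seconds")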
Gracefully drain traffic before maintenance:
#!/usr/bin/env python3
"""
Maintenance mode support
Create /var/run/maintenance file to drain traffic
"""
import sys
import time
import socket
import os
SERVICE_IP = "100.10.0.100"
MAINTENANCE_FILE = "/var/run/maintenance"
def is_maintenance_mode():
return os.path.exists(MAINTENANCE_FILE)
def is_healthy():
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
result = sock.connect_ex(('127.0.0.1', 80))
sock.close()
return result == 0
except:
return False
time.sleep(2)
announced = False
while True:
if is_maintenance_mode():
if announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
sys.stderr.write(f"[MAINTENANCE] Entering maintenance mode\n")
announced = False
else:
healthy = is_healthy()
if healthy and not announced:
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = True
elif not healthy and announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = False
time.sleep(5)

Maintenance workflow:
# 1. Enter maintenance mode (stops receiving new traffic)
touch /var/run/maintenance
# 2. Wait for existing connections to drain
watch 'ss -tan | grep :80 | grep ESTAB | wc -l'
# 3. Perform maintenance
systemctl restart nginx
systemctl restart application
# 4. Exit maintenance mode (resume receiving traffic)
rm /var/run/maintenance
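Step 2 (waiting for connections to drain) can also be scripted; a minimal sketch using psutil, assuming the service port is 80 and a 60-second drain timeout (both are illustrative):

#!/usr/bin/env python3
# Wait until established connections on the service port have drained
import time
import psutil

def established_connections(port=80):
    # May require root to see sockets owned by other processes
    return sum(
        1 for c in psutil.net_connections(kind='tcp')
        if c.laddr and c.laddr.port == port and c.status == psutil.CONN_ESTABLISHED
    )

def wait_for_drain(port=80, timeout=60, poll=2):
    deadline = time.time() + timeout
    while time.time() < deadline:
        remaining = established_connections(port)
        if remaining == 0:
            return True
        print(f"{remaining} connections still established on port {port}...")
        time.sleep(poll)
    return False  # timed out; some connections are still open

if __name__ == '__main__':
    drained = wait_for_drain()
    print("Drained" if drained else "Timeout: proceeding with maintenance anyway")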
All servers announce with the same metric:

# All servers run identical script
announce route 100.10.0.100/32 next-hop self

Router performs ECMP (Equal-Cost Multi-Path):
Router sees 3 equal-cost paths
→ Distributes traffic equally (hash-based)
→ Per-flow load balancing (same src/dst goes to same server)
Enable ECMP on routers:
# Cisco
router bgp 65000
maximum-paths 8
# Juniper
set protocols bgp group servers multipath
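To illustrate why ECMP is per-flow rather than per-packet, here is a toy model of hash-based path selection (real routers use their own hardware hash; this is purely conceptual):

import hashlib

SERVERS = ['server-1', 'server-2', 'server-3']

def pick_server(src_ip, src_port, dst_ip, dst_port, proto='tcp'):
    # Hash the 5-tuple: every packet of the same flow maps to the same server
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# The same flow always lands on the same server
print(pick_server('203.0.113.7', 51812, '100.10.0.100', 80))
# A different client (or source port) may hash to a different server
print(pick_server('198.51.100.9', 40022, '100.10.0.100', 80))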
Use BGP MED to control which instance is preferred:
# Preferred server (lowest MED wins)
announce route 100.10.0.100/32 next-hop self med 50
# First backup
announce route 100.10.0.100/32 next-hop self med 100
# Second backup
announce route 100.10.0.100/32 next-hop self med 150

Note: Lower MED = preferred path. Routers send traffic to the lowest-MED announcement; higher-MED announcements only take over once the preferred path is withdrawn (unless routers are configured for unequal-cost multipath). MED sets preference, not a traffic percentage.
Adjust MED based on current load:
#!/usr/bin/env python3
import sys
import time
import psutil
SERVICE_IP = "100.10.0.100"
BASE_MED = 100
def calculate_med():
"""Calculate MED based on CPU load"""
cpu_percent = psutil.cpu_percent(interval=1)
# Higher CPU = higher MED = less preferred
load_factor = int(cpu_percent)
med = BASE_MED + load_factor
return med
time.sleep(2)
while True:
med = calculate_med()
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self med {med}\n")
sys.stdout.flush()
sys.stderr.write(f"[LOAD] Announced with MED={med}\n")
time.sleep(30) # Update every 30 seconds

Result: Routers prefer the instance with the lowest MED, so traffic automatically shifts away from heavily loaded servers
Setup:
- 3 web servers (NGINX + application)
- Anycast IP: 100.10.0.80
- Health check: HTTP /health endpoint
- Active-active configuration
Configuration:
# /etc/exabgp/web-ha.conf
neighbor 192.168.1.1 {
router-id 192.168.1.10;
local-address 192.168.1.10;
local-as 65001;
peer-as 65001;
family {
ipv4 unicast;
}
api {
processes [ web-healthcheck ];
}
}
process web-healthcheck {
run /etc/exabgp/web-healthcheck.py;
encoder text;
}

Health check script:
#!/usr/bin/env python3
import sys
import time
import urllib.request
SERVICE_IP = "100.10.0.80"
def is_web_healthy():
try:
response = urllib.request.urlopen('http://127.0.0.1/health', timeout=2)
return response.getcode() == 200
except:
return False
time.sleep(2)
announced = False
while True:
if is_web_healthy() and not announced:
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = True
elif not is_web_healthy() and announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = False
time.sleep(5)

Setup:
- 1 primary database (writes)
- 3 read replicas (reads)
- Anycast read IP: 100.10.0.5432
- Health check: replication lag
Health check:
#!/usr/bin/env python3
import sys
import time
import psycopg2
SERVICE_IP = "100.10.0.5432"
MAX_LAG_SECONDS = 10
def get_replication_lag():
try:
conn = psycopg2.connect(host='127.0.0.1', database='postgres', user='monitor')
cursor = conn.cursor()
cursor.execute("""
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
""")
lag = cursor.fetchone()[0]
conn.close()
return lag if lag else 0
except:
return float('inf')
time.sleep(2)
announced = False
while True:
lag = get_replication_lag()
healthy = lag < MAX_LAG_SECONDS
if healthy and not announced:
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
sys.stderr.write(f"[DB] Replication lag OK ({lag:.1f}s), announcing\n")
announced = True
elif not healthy and announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
sys.stderr.write(f"[DB] Replication lag too high ({lag:.1f}s), withdrawing\n")
announced = False
time.sleep(10)

Setup:
- Region A: 3 servers
- Region B: 3 servers
- Same anycast IP in both regions
- Clients routed to nearest region
Benefits:
- Low latency (geo-proximity)
- Disaster recovery (region failure)
- Active-active across regions
1. Install ExaBGP on all servers:
pip3 install exabgp

2. Configure service IP on loopback:
ip addr add 100.10.0.100/32 dev lo

3. Create ExaBGP config:
neighbor 192.168.1.1 {
router-id 192.168.1.10;
local-address 192.168.1.10;
local-as 65001;
peer-as 65001;
family {
ipv4 unicast;
}
api {
processes [ ha-healthcheck ];
}
}
process ha-healthcheck {
run /etc/exabgp/ha-healthcheck.py;
encoder text;
}

4. Create health check script:
#!/usr/bin/env python3
import sys
import time
import socket
SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
def is_healthy():
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
sock.close()
return result == 0
except:
return False
time.sleep(2)
announced = False
while True:
if is_healthy() and not announced:
sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = True
elif not is_healthy() and announced:
sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
sys.stdout.flush()
announced = False
time.sleep(5)

5. Start ExaBGP:
exabgp /etc/exabgp/ha.conf

6. Verify:
# Check route on router
show ip bgp 100.10.0.100
# Should see multiple paths (one per healthy server)

Prevent route flapping:
RISE_THRESHOLD = 3 # 3 consecutive successes to announce
FALL_THRESHOLD = 2 # 2 consecutive failures to withdraw
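A minimal, self-contained sketch of how these thresholds dampen flapping (the TCP check, port, and service IP are illustrative):

# Rise/fall dampening: require consecutive results before changing BGP state
import socket
import sys
import time

RISE_THRESHOLD = 3
FALL_THRESHOLD = 2
SERVICE_IP = "100.10.0.100"

def is_healthy():
    # TCP check, same pattern as the earlier examples
    try:
        sock = socket.create_connection(('127.0.0.1', 80), timeout=2)
        sock.close()
        return True
    except OSError:
        return False

rise_count = 0
fall_count = 0
announced = False
while True:
    if is_healthy():
        rise_count += 1
        fall_count = 0
        if rise_count >= RISE_THRESHOLD and not announced:
            sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = True
    else:
        fall_count += 1
        rise_count = 0
        if fall_count >= FALL_THRESHOLD and announced:
            sys.stdout.write(f"withdraw route {SERVICE_IP}/32 next-hop self\n")
            sys.stdout.flush()
            announced = False
    time.sleep(5)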
Monitor that ExaBGP itself is running:

import subprocess
def check_exabgp_running():
result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
if result.returncode != 0:
send_alert("ExaBGP not running!")
return False
return True

Log all route changes:

import sys
import logging
logging.basicConfig(filename='/var/log/exabgp-ha.log', level=logging.INFO)
def announce_route(ip):
sys.stdout.write(f"announce route {ip}/32 next-hop self\n")
sys.stdout.flush()
logging.info(f"ANNOUNCE: {ip}")Allow graceful traffic draining:
# Enter maintenance
touch /var/run/maintenance
# Wait for connections to drain
watch 'ss -tan | grep ESTAB | wc -l'
# Perform maintenance
systemctl restart service
# Exit maintenance
rm /var/run/maintenance

Test failover regularly:

# Monthly failover drill
systemctl stop nginx
# Verify traffic failed over
sleep 60
systemctl start nginx

Key metrics to monitor:

1. Service Health:
- Health check success rate
- Time since last successful check
- Health check latency
2. BGP State:
- BGP session state (up/down)
- Routes announced
- Routes withdrawn
- BGP convergence time
3. Failover Events:
- Number of failovers
- Time to failover
- Failed node recovery time
#!/usr/bin/env python3
"""
Monitor HA metrics and export to Prometheus
"""
import time
from prometheus_client import start_http_server, Gauge, Counter
# Metrics
health_check_success = Gauge('ha_health_check_success', 'Health check status (1=healthy, 0=unhealthy)')
route_announced = Gauge('ha_route_announced', 'Route announcement status (1=announced, 0=withdrawn)')
failover_count = Counter('ha_failover_total', 'Total number of failovers')
def monitor_ha():
announced = False
while True:
healthy = is_healthy()  # is_healthy(): health check function from the failover examples above
health_check_success.set(1 if healthy else 0)
if healthy and not announced:
route_announced.set(1)
announced = True
elif not healthy and announced:
route_announced.set(0)
failover_count.inc()
announced = False
time.sleep(5)
if __name__ == '__main__':
# Start Prometheus metrics server
start_http_server(9100)
monitor_ha()

Symptoms: Service down but traffic still routed to failed instance
Check:
# 1. Verify ExaBGP withdrew route
grep WITHDRAW /var/log/exabgp.log
# 2. Check BGP table on router
show ip bgp 100.10.0.100
# 3. Verify health check detecting failure
tail -f /var/log/exabgp.log

Common causes:
- Health check not detecting failure
- ExaBGP not running
- BGP session down
- Router not removing route
Symptoms: Route repeatedly announced/withdrawn
Diagnosis:
# Monitor route changes
watch -d 'show ip bgp 100.10.0.100 | grep paths'

Solutions:
- Implement rise/fall thresholds
- Increase health check interval
- Fix unstable service
- Add dampening
Symptoms: One server gets all traffic despite ECMP
Check:
# Verify ECMP enabled
show ip bgp 100.10.0.100
# Should show "multipath" or "ECMP"
# Check routing table
show ip route 100.10.0.100
# Should show multiple next-hops
Solutions:
# Enable ECMP
router bgp 65000
maximum-paths 8
- Anycast Management - Anycast patterns
- DDoS Mitigation - DDoS protection
- Quick Start - Getting started
- Debugging - Troubleshooting
- Monitoring - Monitoring setup
- Configuration Syntax - Config reference
- API Overview - API patterns
Ready to implement HA? See Quick Start →
Ghost written by Claude (Anthropic AI)