Health Checks
Health checking is a critical component of using ExaBGP for anycast, high availability, and load balancing scenarios. ExaBGP provides flexible health checking capabilities through both a built-in module and custom health check scripts.
- Overview
- Built-in Healthcheck Module
- Custom Health Check Scripts
- Health Check Patterns
- Integration with Monitoring Systems
- Best Practices
- Troubleshooting
- See Also
Overview

Important: ExaBGP does NOT manipulate the routing table (RIB/FIB). Health checks determine when ExaBGP should announce or withdraw routes via BGP. The operating system or other routing software must install routes from BGP into the FIB.
Health checks enable ExaBGP to:
- Announce routes only when services are healthy - Prevents traffic black-holing
- Withdraw routes automatically on failure - Enables fast failover
- Support anycast architectures - Multiple servers advertise the same IP
- Enable graceful maintenance - Controlled traffic drainage
┌──────────────────┐
│   Health Check   │
│  Script/Module   │
└────────┬─────────┘
         │ checks service
         ▼
┌──────────────────┐
│  Local Service   │
│  (HTTP/DNS/etc)  │
└────────┬─────────┘
         │
         │ healthy/unhealthy
         ▼
┌──────────────────┐
│    ExaBGP API    │
│ announce/withdraw│
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   BGP Routers    │
│  receive updates │
└──────────────────┘
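
The flow in the diagram reduces to a small decision: compare the latest health result with whether the route is currently announced, and emit an API command only when that changes. The following is a minimal sketch, assuming the text encoder and the `announce route` / `withdraw route` command forms used in the scripts on this page. The `next_command` helper is a hypothetical name for illustration, not part of ExaBGP.

```python
from typing import Optional, Tuple


def next_command(healthy: bool, announced: bool, route: str, next_hop: str) -> Tuple[Optional[str], bool]:
    """Return (command to print for ExaBGP, or None) and the new announced state."""
    if healthy and not announced:
        return f"announce route {route} next-hop {next_hop}", True
    if not healthy and announced:
        return f"withdraw route {route} next-hop {next_hop}", False
    # No state change: stay silent so no redundant command is sent
    return None, announced


if __name__ == "__main__":
    announced = False
    # Simulated check results: healthy, healthy, then unhealthy
    for healthy in (True, True, False):
        command, announced = next_command(healthy, announced, "198.51.100.10/32", "192.0.2.10")
        if command is not None:
            print(command, flush=True)  # ExaBGP reads commands from the script's stdout
```

Keeping the decision in a pure function like this makes it possible to test the announce/withdraw logic without a running ExaBGP instance.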
Built-in Healthcheck Module

ExaBGP 5.x includes a built-in healthcheck module that eliminates the need for external scripts for simple HTTP/HTTPS health checks.

# /etc/exabgp/exabgp.conf
process healthcheck {
    run /usr/bin/python3 -m exabgp.application.healthcheck;
    encoder json;
}

neighbor 192.0.2.1 {
    router-id 192.0.2.10;
    local-address 192.0.2.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

Create /etc/exabgp/healthcheck.conf:
# /etc/exabgp/healthcheck.conf
# HTTP health check for web service
[check-web-http]
type = http
url = http://127.0.0.1:80/health
method = GET
timeout = 5
interval = 10
rise = 2
fall = 3
announce = 198.51.100.10/32
next-hop = 192.0.2.10
withdraw-on-down = true
# HTTPS health check with custom headers
[check-web-https]
type = https
url = https://127.0.0.1:443/healthz
method = GET
timeout = 5
interval = 10
rise = 2
fall = 3
expected-status = 200
headers = X-Health-Check: ExaBGP
announce = 198.51.100.20/32
next-hop = 192.0.2.10
# TCP port check
[check-dns]
type = tcp
host = 127.0.0.1
port = 53
timeout = 3
interval = 5
rise = 2
fall = 2
announce = 198.51.100.30/32
next-hop = 192.0.2.10

| Parameter | Description | Default |
|---|---|---|
| type | Check type: http, https, tcp, icmp | Required |
| url | URL to check (HTTP/HTTPS) | Required for HTTP |
| host | Hostname/IP to check (TCP/ICMP) | Required for TCP |
| port | TCP port to check | Required for TCP |
| method | HTTP method: GET, POST, HEAD | GET |
| timeout | Check timeout in seconds | 5 |
| interval | Check interval in seconds | 10 |
| rise | Consecutive successes before UP | 2 |
| fall | Consecutive failures before DOWN | 3 |
| expected-status | Expected HTTP status code | 200 |
| headers | Custom HTTP headers | None |
| announce | Route to announce when healthy | Required |
| next-hop | BGP next-hop for route | Required |
| withdraw-on-down | Withdraw route when unhealthy | true |
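
The `interval`, `timeout`, `rise`, and `fall` values together bound how quickly a failure or recovery shows up in BGP. The following is a rough worked example, assuming a simple loop that runs one check (taking at most `timeout` seconds) and then sleeps for `interval` seconds. The built-in module's actual scheduling may differ, BGP propagation time is not included, and the function names are illustrative only.

```python
def worst_case_withdraw_seconds(interval: float, timeout: float, fall: int) -> float:
    """Upper bound on the time from a failure to the withdraw command.

    Worst case: the service fails right after a check passed, so the loop
    first sleeps one full interval, then needs `fall` failing checks that
    each run into the timeout, with one interval of sleep between them.
    """
    return interval + fall * timeout + (fall - 1) * interval


def best_case_announce_seconds(interval: float, rise: int) -> float:
    """Lower bound on the time from recovery to the announce command.

    Best case: the service recovers just before a check and every check
    answers instantly, leaving only the sleeps between `rise` checks.
    """
    return (rise - 1) * interval


# Defaults from the parameter table: interval=10, timeout=5, rise=2, fall=3
print(worst_case_withdraw_seconds(10, 5, 3))  # 45
print(best_case_announce_seconds(10, 2))      # 10
```

Under these assumptions the defaults can leave traffic pointed at a failed service for up to about 45 seconds, which is worth weighing when choosing `interval` and `fall`.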
- No external script needed - Built into ExaBGP
- File-based configuration - Simple key = value settings, easy to manage
- Multiple check types - HTTP, HTTPS, TCP, ICMP
- Flap protection - Rise/fall thresholds prevent flapping
- Automatic route management - Announces and withdraws routes
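
Rise/fall flap protection can be pictured as two counters of consecutive results. The sketch below only illustrates the threshold semantics described in the parameter table; it is not the module's own code, and the `RiseFallTracker` class is a hypothetical name.

```python
class RiseFallTracker:
    """Track consecutive check results and report UP/DOWN transitions."""

    def __init__(self, rise: int = 2, fall: int = 3) -> None:
        self.rise = rise
        self.fall = fall
        self.successes = 0
        self.failures = 0
        self.up = False  # start DOWN: nothing is announced until checks pass

    def record(self, ok: bool) -> bool:
        """Record one check result; return True if the UP/DOWN state changed."""
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.up and self.successes >= self.rise:
                self.up = True
                return True
        else:
            self.failures += 1
            self.successes = 0
            if self.up and self.failures >= self.fall:
                self.up = False
                return True
        return False


tracker = RiseFallTracker(rise=2, fall=3)
results = [True, True, False, True, False, False, False]
states = []
for ok in results:
    tracker.record(ok)
    states.append("UP" if tracker.up else "DOWN")
print(states)
```

In this run the single failure in third position does not change the state; only the three consecutive failures at the end bring it back to DOWN.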
Custom Health Check Scripts

For more complex health checking logic, write custom scripts that communicate with ExaBGP via its API.
#!/usr/bin/env python3
# /etc/exabgp/healthcheck.py
import sys
import time

import requests

# Configuration
SERVICE_URL = "http://127.0.0.1:80/health"
CHECK_INTERVAL = 10
ROUTE = "198.51.100.10/32"
NEXT_HOP = "192.0.2.10"
RISE_THRESHOLD = 2
FALL_THRESHOLD = 3

def announce_route():
    """Announce route via ExaBGP API"""
    print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)

def withdraw_route():
    """Withdraw route via ExaBGP API"""
    print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)

def check_health():
    """Check if service is healthy"""
    try:
        response = requests.get(SERVICE_URL, timeout=5)
        return response.status_code == 200
    except requests.RequestException as e:
        sys.stderr.write(f"Health check failed: {e}\n")
        return False

def main():
    consecutive_successes = 0
    consecutive_failures = 0
    route_announced = False

    while True:
        healthy = check_health()
        if healthy:
            consecutive_successes += 1
            consecutive_failures = 0
            # Announce route if we've reached rise threshold
            if consecutive_successes >= RISE_THRESHOLD and not route_announced:
                announce_route()
                route_announced = True
                sys.stderr.write("Service UP - route announced\n")
        else:
            consecutive_failures += 1
            consecutive_successes = 0
            # Withdraw route if we've reached fall threshold
            if consecutive_failures >= FALL_THRESHOLD and route_announced:
                withdraw_route()
                route_announced = False
                sys.stderr.write("Service DOWN - route withdrawn\n")
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()

#!/bin/bash
# /etc/exabgp/healthcheck.sh
ROUTE="198.51.100.10/32"
NEXT_HOP="192.0.2.10"
SERVICE_URL="http://127.0.0.1:80/health"
CHECK_INTERVAL=10
RISE_THRESHOLD=2
FALL_THRESHOLD=3

consecutive_successes=0
consecutive_failures=0
route_announced=0

announce_route() {
    echo "announce route $ROUTE next-hop $NEXT_HOP"
}

withdraw_route() {
    echo "withdraw route $ROUTE next-hop $NEXT_HOP"
}

check_health() {
    # --max-time keeps a hung service from blocking the loop indefinitely
    curl -sf --max-time 5 "$SERVICE_URL" > /dev/null 2>&1
}

while true; do
    if check_health; then
        ((consecutive_successes++))
        consecutive_failures=0
        if [ "$consecutive_successes" -ge "$RISE_THRESHOLD" ] && [ "$route_announced" -eq 0 ]; then
            announce_route
            route_announced=1
            echo "Service UP - route announced" >&2
        fi
    else
        ((consecutive_failures++))
        consecutive_successes=0
        if [ "$consecutive_failures" -ge "$FALL_THRESHOLD" ] && [ "$route_announced" -eq 1 ]; then
            withdraw_route
            route_announced=0
            echo "Service DOWN - route withdrawn" >&2
        fi
    fi
    sleep "$CHECK_INTERVAL"
done

# /etc/exabgp/exabgp.conf
process healthcheck {
    run /etc/exabgp/healthcheck.py;
    encoder text;
}

neighbor 192.0.2.1 {
    router-id 192.0.2.10;
    local-address 192.0.2.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

Health Check Patterns

Multiple DNS servers advertise the same anycast IP. Each server only announces when its local DNS service is healthy.
#!/usr/bin/env python3
# /etc/exabgp/dns-healthcheck.py
import sys
import time
import socket

ANYCAST_IP = "198.51.100.53/32"
NEXT_HOP = "self"
DNS_PORT = 53
CHECK_INTERVAL = 5

def check_dns():
    """Check if DNS server is responding"""
    # DNS query for version.bind (TXT record, CHAOS class)
    query = b'\x00\x00\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x07version\x04bind\x00\x00\x10\x00\x03'
    try:
        # The context manager closes the socket even if the query times out
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(3)
            sock.sendto(query, ('127.0.0.1', DNS_PORT))
            sock.recvfrom(512)
        return True
    except OSError:
        return False

route_announced = False

while True:
    healthy = check_dns()
    if healthy and not route_announced:
        print(f"announce route {ANYCAST_IP} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write("DNS healthy - announcing anycast IP\n")
    elif not healthy and route_announced:
        print(f"withdraw route {ANYCAST_IP} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write("DNS unhealthy - withdrawing anycast IP\n")
    time.sleep(CHECK_INTERVAL)

Check multiple services before announcing a route. All services must be healthy.
#!/usr/bin/env python3
# /etc/exabgp/multi-service-healthcheck.py
import sys
import time
import socket

import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10

def check_http():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        return r.status_code == 200
    except requests.RequestException:
        return False

def check_database():
    try:
        sock = socket.create_connection(("127.0.0.1", 5432), timeout=3)
        sock.close()
        return True
    except OSError:
        return False

def check_cache():
    try:
        sock = socket.create_connection(("127.0.0.1", 6379), timeout=3)
        sock.close()
        return True
    except OSError:
        return False

route_announced = False

while True:
    # All services must be healthy
    all_healthy = check_http() and check_database() and check_cache()
    if all_healthy and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write("All services healthy - route announced\n")
    elif not all_healthy and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write("Service failure detected - route withdrawn\n")
    time.sleep(CHECK_INTERVAL)

Different services have different weights. Announce if the total health score meets or exceeds the threshold.
#!/usr/bin/env python3
# /etc/exabgp/weighted-healthcheck.py
import sys
import time

import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10
HEALTH_THRESHOLD = 70  # Announce if score >= 70

CHECKS = [
    {"name": "web", "url": "http://127.0.0.1:80/health", "weight": 50},
    {"name": "api", "url": "http://127.0.0.1:8080/health", "weight": 30},
    {"name": "cache", "url": "http://127.0.0.1:11211/health", "weight": 20},
]

def calculate_health_score():
    total_score = 0
    for check in CHECKS:
        try:
            r = requests.get(check["url"], timeout=3)
            if r.status_code == 200:
                total_score += check["weight"]
                sys.stderr.write(f"{check['name']}: OK (+{check['weight']})\n")
            else:
                sys.stderr.write(f"{check['name']}: FAIL (status {r.status_code})\n")
        except Exception as e:
            sys.stderr.write(f"{check['name']}: FAIL ({e})\n")
    return total_score

route_announced = False

while True:
    score = calculate_health_score()
    sys.stderr.write(f"Health score: {score}/100\n")
    if score >= HEALTH_THRESHOLD and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write(f"Score {score} >= threshold {HEALTH_THRESHOLD} - route announced\n")
    elif score < HEALTH_THRESHOLD and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write(f"Score {score} < threshold {HEALTH_THRESHOLD} - route withdrawn\n")
    time.sleep(CHECK_INTERVAL)

Detect a shutdown signal and withdraw routes before the service stops.
#!/usr/bin/env python3
# /etc/exabgp/graceful-healthcheck.py
import sys
import time
import signal

import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10

shutdown_requested = False

def signal_handler(signum, frame):
    global shutdown_requested
    shutdown_requested = True
    sys.stderr.write("Shutdown signal received - withdrawing route\n")

# Register signal handlers
signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

def check_health():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        return r.status_code == 200
    except requests.RequestException:
        return False

route_announced = False

while True:
    if shutdown_requested:
        if route_announced:
            print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
            route_announced = False
        sys.stderr.write("Graceful shutdown complete\n")
        sys.exit(0)

    healthy = check_health()
    if healthy and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
    elif not healthy and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False

    # Sleep in one-second slices so a shutdown signal is acted on promptly
    # (time.sleep resumes after a signal handler returns)
    for _ in range(CHECK_INTERVAL):
        if shutdown_requested:
            break
        time.sleep(1)

Integration with Monitoring Systems

Export health check metrics to Prometheus:
#!/usr/bin/env python3
# /etc/exabgp/healthcheck-with-metrics.py
import sys
import time

import requests
from prometheus_client import start_http_server, Gauge, Counter

# Prometheus metrics
health_status = Gauge('exabgp_health_status', 'Service health status (1=healthy, 0=unhealthy)')
route_announced = Gauge('exabgp_route_announced', 'Route announcement status (1=announced, 0=withdrawn)')
health_checks_total = Counter('exabgp_health_checks_total', 'Total health checks performed', ['result'])

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10
METRICS_PORT = 9101

def check_health():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        is_healthy = r.status_code == 200
        health_checks_total.labels(result='success' if is_healthy else 'fail').inc()
        health_status.set(1 if is_healthy else 0)
        return is_healthy
    except Exception:
        health_checks_total.labels(result='error').inc()
        health_status.set(0)
        return False

# Start Prometheus metrics server
start_http_server(METRICS_PORT)
sys.stderr.write(f"Prometheus metrics available on port {METRICS_PORT}\n")

is_announced = False

while True:
    healthy = check_health()
    if healthy and not is_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        is_announced = True
        route_announced.set(1)
    elif not healthy and is_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        is_announced = False
        route_announced.set(0)
    time.sleep(CHECK_INTERVAL)

Check if ExaBGP is running and the route is present:
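#!/bin/bash
# /usr/lib/nagios/plugins/check_exabgp_health.sh
ROUTE="198.51.100.100/32"
EXABGP_PID_FILE="/var/run/exabgp/exabgp.pid"

# Check if ExaBGP is running (PID file present)
if [ ! -f "$EXABGP_PID_FILE" ]; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# Check if the route is in the local routing table. ExaBGP does not install
# routes itself, so this only confirms the prefix is present locally, not
# that it is being announced to peers. "ip route show PREFIX" matches the
# exact prefix; grepping the full table would miss /32 routes, which are
# listed without their prefix length.
if [ -n "$(ip route show "$ROUTE" 2>/dev/null)" ]; then
    echo "OK: Route $ROUTE is present in the routing table"
    exit 0
else
    echo "WARNING: Route $ROUTE not in routing table"
    exit 1
fi

Best Practices

Prevent route flapping by requiring multiple consecutive successes/failures: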
RISE_THRESHOLD = 2   # Announce after 2 consecutive successes
FALL_THRESHOLD = 3   # Withdraw after 3 consecutive failures

CHECK_INTERVAL = 10  # Check every 10 seconds
CHECK_TIMEOUT = 5    # Individual check timeout (must be < interval)

Health checks should verify the local service, not remote dependencies:
# GOOD: Check local service
SERVICE_URL = "http://127.0.0.1:80/health"

# BAD: Check remote dependency
SERVICE_URL = "http://database.example.com:5432/"

Your application should provide a health endpoint that checks all critical components:
# Example Flask health endpoint
from flask import Flask, jsonify

app = Flask(__name__)

# check_database_connection(), check_cache_connection() and check_disk_space()
# stand for application-specific helpers that return True when healthy.
@app.route('/health')
def health():
    checks = {
        'database': check_database_connection(),
        'cache': check_cache_connection(),
        'disk_space': check_disk_space(),
    }
    if all(checks.values()):
        return jsonify({'status': 'healthy', 'checks': checks}), 200
    else:
        return jsonify({'status': 'unhealthy', 'checks': checks}), 503

Always log when routes are announced or withdrawn:
if healthy and not route_announced:
    print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
    sys.stderr.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} - Route announced\n")
    route_announced = True

Don't announce routes immediately on startup. Wait for initial health checks:
# Wait for initial checks before announcing
initial_checks = 0
while initial_checks < RISE_THRESHOLD:
    if check_health():
        initial_checks += 1
    else:
        initial_checks = 0
    time.sleep(CHECK_INTERVAL)

# Now start normal operation
announce_route()

Use a process supervisor (systemd, supervisord) to ensure health check scripts keep running:
# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target

[Service]
Type=simple
User=exabgp
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Troubleshooting

Symptoms: Health checks pass but routes aren't announced to peers.
Debugging steps:
- Check ExaBGP logs:
    tail -f /var/log/exabgp/exabgp.log
- Verify health check script output:
    # Health check should print to stdout
    /etc/exabgp/healthcheck.py
- Check API communication:
    # Enable ExaBGP API debugging
    exabgp.log.all = true
    exabgp.log.level = DEBUG
- Verify BGP session is established:
    # Check neighbor status in logs
    grep "Peer.*up" /var/log/exabgp/exabgp.log

Symptoms: Routes are constantly announced and withdrawn.
Solutions:
- Increase rise/fall thresholds:
    RISE_THRESHOLD = 3  # More conservative
    FALL_THRESHOLD = 5
- Increase check interval:
    CHECK_INTERVAL = 15  # Check less frequently
- Implement hysteresis:
    # Stay in current state for minimum time
    MIN_STATE_TIME = 60  # 60 seconds minimum
    # Update this timestamp every time the route is announced or withdrawn
    last_state_change = time.time()
    # Before the next announce/withdraw, require the minimum time to have passed
    if time.time() - last_state_change >= MIN_STATE_TIME:
        # Allow state change
        pass

Symptoms: Routes withdrawn and never come back.
Solutions:
- Add exception handling:
    def main():
        try:
            while True:
                # Health check logic
                pass
        except Exception as e:
            sys.stderr.write(f"Fatal error: {e}\n")
            # Withdraw routes before exiting
            withdraw_route()
            sys.exit(1)
- Use systemd to restart the process:
    [Service]
    Restart=always
    RestartSec=10
- Add logging to debug crashes:
    import logging
    logging.basicConfig(
        filename='/var/log/exabgp/healthcheck.log',
        level=logging.DEBUG,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )

Symptoms: Health checks take too long, affecting responsiveness.
Solutions:
- Use shorter timeouts:
    requests.get(url, timeout=3)  # 3 second timeout
- Run checks in parallel (for multiple services):
    import concurrent.futures

    # check_web, check_api and check_db stand for your own check functions
    def check_all_services():
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            futures = {
                executor.submit(check_web): "web",
                executor.submit(check_api): "api",
                executor.submit(check_db): "db",
            }
            results = {}
            for future in concurrent.futures.as_completed(futures):
                service = futures[future]
                results[service] = future.result()
            return all(results.values())
- Use lighter health check methods:
    # TCP connection check (faster than HTTP)
    import socket

    def check_tcp(host, port):
        try:
            sock = socket.create_connection((host, port), timeout=2)
            sock.close()
            return True
        except OSError:
            return False

See Also
- Service High Availability - HA patterns with ExaBGP
- Anycast Management - Anycast architectures
- API Overview - ExaBGP API documentation
- Monitoring - Production monitoring setup
- Debugging - Troubleshooting ExaBGP issues
- Healthcheck Module - Built-in healthcheck module details
Ghost written by Claude (Anthropic AI)