Health Checks

Thomas Mangin edited this page Nov 15, 2025 · 1 revision

Health Checks with ExaBGP

Health checking is a critical component of using ExaBGP for anycast, high availability, and load balancing scenarios. ExaBGP provides flexible health checking capabilities through both a built-in module and custom health check scripts.

Overview

Important: ExaBGP does NOT manipulate the routing table (RIB/FIB). Health checks determine when ExaBGP should announce or withdraw routes via BGP. The operating system or other routing software must install routes from BGP into the FIB.

Why Health Checks Matter

Health checks enable ExaBGP to:

  • Announce routes only when services are healthy - Prevents traffic black-holing
  • Withdraw routes automatically on failure - Enables fast failover
  • Support anycast architectures - Multiple servers advertise the same IP
  • Enable graceful maintenance - Controlled traffic drainage

Health Check Architecture

┌──────────────────┐
│  Health Check    │
│  Script/Module   │
└────────┬─────────┘
         │ checks service
         ▼
┌──────────────────┐
│  Local Service   │
│  (HTTP/DNS/etc)  │
└────────┬─────────┘
         │
         │ healthy/unhealthy
         ▼
┌──────────────────┐
│    ExaBGP API    │
│ announce/withdraw│
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   BGP Routers    │
│  receive updates │
└──────────────────┘

Built-in Healthcheck Module

ExaBGP 5.x includes a built-in healthcheck module that eliminates the need for external scripts for simple HTTP/HTTPS health checks.

Basic Configuration

# /etc/exabgp/exabgp.conf
process healthcheck {
    run /usr/bin/python3 -m exabgp.application.healthcheck;
    encoder json;
}

neighbor 192.0.2.1 {
    router-id 192.0.2.10;
    local-address 192.0.2.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

Healthcheck Configuration File

Create /etc/exabgp/healthcheck.conf:

# /etc/exabgp/healthcheck.conf

# HTTP health check for web service
[check-web-http]
type = http
url = http://127.0.0.1:80/health
method = GET
timeout = 5
interval = 10
rise = 2
fall = 3
announce = 198.51.100.10/32
next-hop = 192.0.2.10
withdraw-on-down = true

# HTTPS health check with custom headers
[check-web-https]
type = https
url = https://127.0.0.1:443/healthz
method = GET
timeout = 5
interval = 10
rise = 2
fall = 3
expected-status = 200
headers = X-Health-Check: ExaBGP
announce = 198.51.100.20/32
next-hop = 192.0.2.10

# TCP port check
[check-dns]
type = tcp
host = 127.0.0.1
port = 53
timeout = 3
interval = 5
rise = 2
fall = 2
announce = 198.51.100.30/32
next-hop = 192.0.2.10

Healthcheck Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| type | Check type: http, https, tcp, icmp | Required |
| url | URL to check (HTTP/HTTPS) | Required for HTTP |
| host | Hostname/IP to check (TCP/ICMP) | Required for TCP |
| port | TCP port to check | Required for TCP |
| method | HTTP method: GET, POST, HEAD | GET |
| timeout | Check timeout in seconds | 5 |
| interval | Check interval in seconds | 10 |
| rise | Consecutive successes before UP | 2 |
| fall | Consecutive failures before DOWN | 3 |
| expected-status | Expected HTTP status code | 200 |
| headers | Custom HTTP headers | None |
| announce | Route to announce when healthy | Required |
| next-hop | BGP next-hop for route | Required |
| withdraw-on-down | Withdraw route when unhealthy | true |
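
The "Required" column above lends itself to a quick sanity check before a configuration is deployed. The following is a minimal sketch, assuming the file uses the INI-style syntax shown earlier and that Python's `configparser` reads it acceptably. `validate_checks`, `COMMON_REQUIRED`, and `REQUIRED_BY_TYPE` are hypothetical names, not part of ExaBGP, and the ICMP requirement is inferred from the table rather than stated in it.

```python
import configparser

# Keys every check needs, plus the extra keys each check type needs.
# Derived from the parameter table; the "icmp" entry is an inference.
COMMON_REQUIRED = {"type", "announce", "next-hop"}
REQUIRED_BY_TYPE = {
    "http": {"url"},
    "https": {"url"},
    "tcp": {"host", "port"},
    "icmp": {"host"},
}


def validate_checks(text):
    """Return a list of human-readable problems found in an INI-style config."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    problems = []
    for name in parser.sections():
        keys = set(parser[name])
        check_type = parser[name].get("type")
        if check_type not in REQUIRED_BY_TYPE:
            problems.append(f"[{name}] unknown or missing type: {check_type!r}")
            continue
        required = COMMON_REQUIRED | REQUIRED_BY_TYPE[check_type]
        for key in sorted(required - keys):
            problems.append(f"[{name}] missing required key: {key}")
    return problems
```

An empty list means every section carries the keys its check type needs. The sketch does not verify values such as URLs or prefixes.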

Built-in Module Advantages

  • No external script needed - Built into ExaBGP
  • File-based configuration - Checks are declared in one INI-style file
  • Multiple check types - HTTP, HTTPS, TCP, ICMP
  • Flap protection - Rise/fall thresholds prevent flapping
  • Automatic route management - Announces and withdraws routes
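
To make the flap protection concrete, here is a small sketch of rise/fall counting in isolation. It is an illustration of the idea, not the module's actual code. `RiseFallTracker` and `transitions` are hypothetical names, and starting in the DOWN state is an assumption.

```python
class RiseFallTracker:
    """Track consecutive check results and report UP/DOWN transitions."""

    def __init__(self, rise=2, fall=3):
        self.rise = rise
        self.fall = fall
        self.successes = 0
        self.failures = 0
        self.up = False  # assumed start state: DOWN until `rise` successes

    def update(self, healthy):
        """Feed one result; return "up", "down", or None if nothing changed."""
        if healthy:
            self.successes += 1
            self.failures = 0
            if not self.up and self.successes >= self.rise:
                self.up = True
                return "up"
        else:
            self.failures += 1
            self.successes = 0
            if self.up and self.failures >= self.fall:
                self.up = False
                return "down"
        return None


def transitions(results, rise=2, fall=3):
    """Replay a sequence of check results and list the transition after each."""
    tracker = RiseFallTracker(rise, fall)
    return [tracker.update(result) for result in results]
```

With `rise=2` and `fall=3`, a single failed check between successes causes no transition. Only three failures in a row produce a `"down"`, and two successes in a row are needed to return to `"up"`.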

Custom Health Check Scripts

For more complex health checking logic, write custom scripts that communicate with ExaBGP via its API.
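
The scripts in this section drive ExaBGP by printing text commands such as `announce route ... next-hop ...` on stdout. A small helper can reject malformed prefixes before they reach ExaBGP. This is a sketch only: `route_command` and `send` are hypothetical names, and the validation uses Python's standard `ipaddress` module rather than anything ExaBGP provides.

```python
import ipaddress


def route_command(action, prefix, next_hop):
    """Build a text API line, raising ValueError on malformed input."""
    if action not in ("announce", "withdraw"):
        raise ValueError(f"unsupported action: {action}")
    # strict=True rejects prefixes with host bits set, e.g. 198.51.100.10/24
    network = ipaddress.ip_network(prefix, strict=True)
    if next_hop != "self":
        ipaddress.ip_address(next_hop)  # raises ValueError if not an IP address
    return f"{action} route {network} next-hop {next_hop}"


def send(command):
    """Write one command to ExaBGP; flush so it is not held in a buffer."""
    print(command, flush=True)
```

For example, `send(route_command("announce", "198.51.100.10/32", "192.0.2.10"))` prints the same line that the examples below print directly.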

Python Health Check Example

#!/usr/bin/env python3
# /etc/exabgp/healthcheck.py

import sys
import time
import requests

# Configuration
SERVICE_URL = "http://127.0.0.1:80/health"
CHECK_INTERVAL = 10
ROUTE = "198.51.100.10/32"
NEXT_HOP = "192.0.2.10"
RISE_THRESHOLD = 2
FALL_THRESHOLD = 3

def announce_route():
    """Announce route via ExaBGP API"""
    print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)

def withdraw_route():
    """Withdraw route via ExaBGP API"""
    print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)

def check_health():
    """Check if service is healthy"""
    try:
        response = requests.get(SERVICE_URL, timeout=5)
        return response.status_code == 200
    except Exception as e:
        sys.stderr.write(f"Health check failed: {e}\n")
        return False

def main():
    consecutive_successes = 0
    consecutive_failures = 0
    route_announced = False

    while True:
        healthy = check_health()

        if healthy:
            consecutive_successes += 1
            consecutive_failures = 0

            # Announce route if we've reached rise threshold
            if consecutive_successes >= RISE_THRESHOLD and not route_announced:
                announce_route()
                route_announced = True
                sys.stderr.write(f"Service UP - route announced\n")
        else:
            consecutive_failures += 1
            consecutive_successes = 0

            # Withdraw route if we've reached fall threshold
            if consecutive_failures >= FALL_THRESHOLD and route_announced:
                withdraw_route()
                route_announced = False
                sys.stderr.write(f"Service DOWN - route withdrawn\n")

        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()

Bash Health Check Example

#!/bin/bash
# /etc/exabgp/healthcheck.sh

ROUTE="198.51.100.10/32"
NEXT_HOP="192.0.2.10"
SERVICE_URL="http://127.0.0.1:80/health"
CHECK_INTERVAL=10
RISE_THRESHOLD=2
FALL_THRESHOLD=3

consecutive_successes=0
consecutive_failures=0
route_announced=0

announce_route() {
    echo "announce route $ROUTE next-hop $NEXT_HOP"
}

withdraw_route() {
    echo "withdraw route $ROUTE next-hop $NEXT_HOP"
}

check_health() {
    curl -sf "$SERVICE_URL" > /dev/null 2>&1
    return $?
}

while true; do
    if check_health; then
        ((consecutive_successes++))
        consecutive_failures=0

        if [ $consecutive_successes -ge $RISE_THRESHOLD ] && [ $route_announced -eq 0 ]; then
            announce_route
            route_announced=1
            echo "Service UP - route announced" >&2
        fi
    else
        ((consecutive_failures++))
        consecutive_successes=0

        if [ $consecutive_failures -ge $FALL_THRESHOLD ] && [ $route_announced -eq 1 ]; then
            withdraw_route
            route_announced=0
            echo "Service DOWN - route withdrawn" >&2
        fi
    fi

    sleep $CHECK_INTERVAL
done

ExaBGP Configuration for Custom Script

# /etc/exabgp/exabgp.conf
process healthcheck {
    run /etc/exabgp/healthcheck.py;
    encoder text;
}

neighbor 192.0.2.1 {
    router-id 192.0.2.10;
    local-address 192.0.2.10;
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

Health Check Patterns

Pattern 1: Anycast DNS with Health Checks

Multiple DNS servers advertise the same anycast IP. Each server only announces when its local DNS service is healthy.

#!/usr/bin/env python3
# /etc/exabgp/dns-healthcheck.py

import sys
import time
import socket

ANYCAST_IP = "198.51.100.53/32"
NEXT_HOP = "self"
DNS_PORT = 53
CHECK_INTERVAL = 5

def check_dns():
    """Check if the DNS server answers a query (any reply counts as healthy)"""
    # DNS query for version.bind (TXT record, CHAOS class)
    query = b'\x00\x00\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x07version\x04bind\x00\x00\x10\x00\x03'
    try:
        # The context manager closes the socket even when the query times out
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(3)
            sock.sendto(query, ('127.0.0.1', DNS_PORT))
            sock.recvfrom(512)
        return True
    except OSError:
        # Covers timeouts and ICMP port-unreachable errors
        return False

route_announced = False

while True:
    healthy = check_dns()

    if healthy and not route_announced:
        print(f"announce route {ANYCAST_IP} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write("DNS healthy - announcing anycast IP\n")
    elif not healthy and route_announced:
        print(f"withdraw route {ANYCAST_IP} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write("DNS unhealthy - withdrawing anycast IP\n")

    time.sleep(CHECK_INTERVAL)
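
The check above treats any reply as healthy, including error replies such as SERVFAIL. A stricter variant could inspect the 12-byte DNS header of the reply. The sketch below assumes the query ID is 0, as in the query bytes above; `dns_reply_ok` is a hypothetical helper name.

```python
import struct


def dns_reply_ok(data, query_id=0):
    """Return True if `data` is a DNS response to `query_id` with RCODE 0."""
    if len(data) < 12:  # a DNS header is always 12 bytes long
        return False
    reply_id, flags = struct.unpack("!HH", data[:4])
    is_response = bool(flags & 0x8000)  # QR bit: 1 means response
    rcode = flags & 0x000F              # low four bits: 0 means NOERROR
    return reply_id == query_id and is_response and rcode == 0
```

Some servers are configured to refuse `version.bind` queries. In that case a strict RCODE test would mark a working server as unhealthy, so it may be better to query a name the server is expected to answer.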

Pattern 2: Multi-Service Health Check

Check multiple services before announcing a route. All services must be healthy.

#!/usr/bin/env python3
# /etc/exabgp/multi-service-healthcheck.py

import sys
import time
import requests
import socket

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10

def check_http():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        return r.status_code == 200
    except requests.RequestException:
        return False

def check_database():
    try:
        sock = socket.create_connection(("127.0.0.1", 5432), timeout=3)
        sock.close()
        return True
    except OSError:
        return False

def check_cache():
    try:
        sock = socket.create_connection(("127.0.0.1", 6379), timeout=3)
        sock.close()
        return True
    except OSError:
        return False

route_announced = False

while True:
    # All services must be healthy
    all_healthy = check_http() and check_database() and check_cache()

    if all_healthy and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write("All services healthy - route announced\n")
    elif not all_healthy and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write("Service failure detected - route withdrawn\n")

    time.sleep(CHECK_INTERVAL)
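
Because the `and` chain above stops at the first failing check, the log cannot say which service caused a withdrawal. One possible refinement, sketched here with a hypothetical helper named `failing_checks`, runs every check and collects the names of those that failed.

```python
def failing_checks(checks):
    """Run every check and return the names of those that failed.

    `checks` maps a service name to a zero-argument callable returning bool.
    A check that raises is counted as failed instead of crashing the loop.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

In the loop above this could be used as `failed = failing_checks({"http": check_http, "database": check_database, "cache": check_cache})`, with `all_healthy = not failed` and the contents of `failed` written to stderr.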

Pattern 3: Weighted Health Check

Different services have different weights. Announce if total health score exceeds threshold.

#!/usr/bin/env python3
# /etc/exabgp/weighted-healthcheck.py

import sys
import time
import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10
HEALTH_THRESHOLD = 70  # Announce if score >= 70

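# Every entry must be an HTTP endpoint. memcached (port 11211) does not speak
# HTTP, so the "cache" URL assumes an HTTP health shim listening on that port.
CHECKS = [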
    {"name": "web", "url": "http://127.0.0.1:80/health", "weight": 50},
    {"name": "api", "url": "http://127.0.0.1:8080/health", "weight": 30},
    {"name": "cache", "url": "http://127.0.0.1:11211/health", "weight": 20},
]

def calculate_health_score():
    total_score = 0
    for check in CHECKS:
        try:
            r = requests.get(check["url"], timeout=3)
            if r.status_code == 200:
                total_score += check["weight"]
                sys.stderr.write(f"{check['name']}: OK (+{check['weight']})\n")
            else:
                sys.stderr.write(f"{check['name']}: FAIL (status {r.status_code})\n")
        except Exception as e:
            sys.stderr.write(f"{check['name']}: FAIL ({e})\n")

    return total_score

route_announced = False

while True:
    score = calculate_health_score()
    sys.stderr.write(f"Health score: {score}/100\n")

    if score >= HEALTH_THRESHOLD and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
        sys.stderr.write(f"Score {score} >= threshold {HEALTH_THRESHOLD} - route announced\n")
    elif score < HEALTH_THRESHOLD and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False
        sys.stderr.write(f"Score {score} < threshold {HEALTH_THRESHOLD} - route withdrawn\n")

    time.sleep(CHECK_INTERVAL)
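
It helps to work through the arithmetic of the weights separately from the HTTP calls. The sketch below reuses the weights and threshold from the script above; `health_score` and `should_announce` are hypothetical names for illustration.

```python
WEIGHTS = {"web": 50, "api": 30, "cache": 20}  # same weights as CHECKS above
HEALTH_THRESHOLD = 70


def health_score(results, weights=WEIGHTS):
    """Sum the weights of the services whose check passed."""
    return sum(weight for name, weight in weights.items() if results.get(name))


def should_announce(results, threshold=HEALTH_THRESHOLD, weights=WEIGHTS):
    """Return True when the combined score reaches the threshold."""
    return health_score(results, weights) >= threshold
```

With these numbers, losing the cache leaves 80 and losing the API leaves exactly 70, so the route stays announced in both cases. Losing the web service leaves 50, so the route is withdrawn. In effect the web service is mandatory and either of the other two is optional, which is worth verifying whenever weights or the threshold change.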

Pattern 4: Graceful Shutdown

Detect shutdown signal and withdraw routes before service stops.

#!/usr/bin/env python3
# /etc/exabgp/graceful-healthcheck.py

import sys
import time
import signal
import requests

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10

shutdown_requested = False

def signal_handler(signum, frame):
    global shutdown_requested
    shutdown_requested = True
    sys.stderr.write(f"Shutdown signal received - withdrawing route\n")

# Register signal handlers
signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

def check_health():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        return r.status_code == 200
    except requests.RequestException:
        return False

route_announced = False

while True:
    if shutdown_requested:
        if route_announced:
            print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
            route_announced = False
        sys.stderr.write("Graceful shutdown complete\n")
        sys.exit(0)

    healthy = check_health()

    if healthy and not route_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = True
    elif not healthy and route_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        route_announced = False

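    # Sleep in one-second slices so a shutdown signal is handled within about
    # a second instead of after the full check interval
    for _ in range(CHECK_INTERVAL):
        if shutdown_requested:
            break
        time.sleep(1)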
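
Planned maintenance can be handled in a similar spirit without stopping the script: an operator creates a flag file, the health check reports unhealthy, and the existing withdraw logic drains traffic. This is a sketch under assumptions: the path `/etc/exabgp/maintenance` and the function `effective_health` are hypothetical and not part of ExaBGP.

```python
import os

# Hypothetical path; any file the operator can create and remove will do.
MAINTENANCE_FLAG = "/etc/exabgp/maintenance"


def effective_health(service_healthy, flag_path=MAINTENANCE_FLAG):
    """Treat the node as unhealthy while the maintenance flag file exists."""
    if os.path.exists(flag_path):
        return False
    return service_healthy
```

In the loop above, `healthy = effective_health(check_health())` would withdraw the route on the next check after the file is created and announce it again once the file is removed and the service is healthy.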

Integration with Monitoring Systems

Prometheus Exporter Integration

Export health check metrics to Prometheus:

#!/usr/bin/env python3
# /etc/exabgp/healthcheck-with-metrics.py

import sys
import time
import requests
from prometheus_client import start_http_server, Gauge, Counter

# Prometheus metrics
health_status = Gauge('exabgp_health_status', 'Service health status (1=healthy, 0=unhealthy)')
route_announced = Gauge('exabgp_route_announced', 'Route announcement status (1=announced, 0=withdrawn)')
health_checks_total = Counter('exabgp_health_checks_total', 'Total health checks performed', ['result'])

ROUTE = "198.51.100.100/32"
NEXT_HOP = "192.0.2.10"
CHECK_INTERVAL = 10
METRICS_PORT = 9101

def check_health():
    try:
        r = requests.get("http://127.0.0.1:80/health", timeout=3)
        is_healthy = r.status_code == 200
        health_checks_total.labels(result='success' if is_healthy else 'fail').inc()
        health_status.set(1 if is_healthy else 0)
        return is_healthy
    except Exception as e:
        health_checks_total.labels(result='error').inc()
        health_status.set(0)
        return False

# Start Prometheus metrics server
start_http_server(METRICS_PORT)
sys.stderr.write(f"Prometheus metrics available on port {METRICS_PORT}\n")

is_announced = False

while True:
    healthy = check_health()

    if healthy and not is_announced:
        print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        is_announced = True
        route_announced.set(1)
    elif not healthy and is_announced:
        print(f"withdraw route {ROUTE} next-hop {NEXT_HOP}", flush=True)
        is_announced = False
        route_announced.set(0)

    time.sleep(CHECK_INTERVAL)

Nagios/Icinga Integration

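Check that the ExaBGP process is alive and that the service prefix is present on the host:

#!/bin/bash
# /usr/lib/nagios/plugins/check_exabgp_health.sh

ROUTE="198.51.100.100/32"
EXABGP_PID_FILE="/var/run/exabgp/exabgp.pid"

# Check that the PID file exists and that the process it names is still alive
pid=$(cat "$EXABGP_PID_FILE" 2>/dev/null)
if [ -z "$pid" ] || [ ! -d "/proc/$pid" ]; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# Check that the prefix is in the local routing table. ExaBGP never installs
# routes itself, so this only matches a prefix configured on the host (for
# example a static route); it does not prove that peers received the announcement.
if ip route show | grep -qF -- "$ROUTE"; then
    echo "OK: Route $ROUTE is present in the local routing table"
    exit 0
else
    echo "WARNING: Route $ROUTE not in routing table"
    exit 1
fi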

Best Practices

1. Use Rise/Fall Thresholds

Prevent route flapping by requiring multiple consecutive successes/failures:

RISE_THRESHOLD = 2   # Announce after 2 consecutive successes
FALL_THRESHOLD = 3   # Withdraw after 3 consecutive failures

2. Set Appropriate Timeouts

CHECK_INTERVAL = 10  # Check every 10 seconds
CHECK_TIMEOUT = 5    # Individual check timeout (must be < interval)
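
These values, together with the fall threshold, determine how long a failed service keeps attracting traffic. The sketch below gives rough bounds under the assumption that the script runs one check and then sleeps for the full interval, as the examples on this page do. `withdraw_delay_bounds` is a hypothetical name, and BGP propagation time is not included.

```python
def withdraw_delay_bounds(interval, timeout, fall):
    """Return (best, worst) seconds between a service failing and the withdraw.

    Best case: the service dies just before a check and failures return at once.
    Worst case: it dies just after a passing check and every failing check
    runs for the full timeout.
    """
    best = (fall - 1) * interval
    worst = fall * (interval + timeout)
    return best, worst
```

With `interval=10`, `timeout=5`, and `fall=3`, the withdraw happens between 20 and 45 seconds after the failure. Lowering the interval or the fall threshold shortens this window at the cost of more sensitivity to transient errors.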

3. Check Localhost Services

Health checks should verify the local service, not remote dependencies:

# GOOD: Check local service
SERVICE_URL = "http://127.0.0.1:80/health"

# BAD: Check remote dependency
SERVICE_URL = "http://database.example.com:5432/"

4. Implement Comprehensive Health Endpoints

Your application should provide a health endpoint that checks all critical components:

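# Example Flask health endpoint
# check_database_connection(), check_cache_connection() and check_disk_space()
# are application-specific helpers that return True or False (not shown here)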
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    checks = {
        'database': check_database_connection(),
        'cache': check_cache_connection(),
        'disk_space': check_disk_space(),
    }

    if all(checks.values()):
        return jsonify({'status': 'healthy', 'checks': checks}), 200
    else:
        return jsonify({'status': 'unhealthy', 'checks': checks}), 503

5. Log Health State Changes

Always log when routes are announced or withdrawn:

if healthy and not route_announced:
    print(f"announce route {ROUTE} next-hop {NEXT_HOP}", flush=True)
    sys.stderr.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} - Route announced\n")
    route_announced = True

6. Handle Script Startup

Don't announce routes immediately on startup. Wait for initial health checks:

# Wait for initial checks before announcing
initial_checks = 0
while initial_checks < RISE_THRESHOLD:
    if check_health():
        initial_checks += 1
    else:
        initial_checks = 0
    time.sleep(CHECK_INTERVAL)

# Now start normal operation
announce_route()

7. Monitor Health Check Script

Run ExaBGP under a process supervisor (systemd, supervisord). ExaBGP starts the health check script as a child process, so restarting ExaBGP also restarts the script:

# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target

[Service]
Type=simple
User=exabgp
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Troubleshooting

Problem: Routes Not Being Announced

Symptoms: Health checks pass but routes aren't announced to peers.

Debugging steps:

  1. Check ExaBGP logs:
tail -f /var/log/exabgp/exabgp.log
  2. Verify health check script output:
# Health check should print to stdout
/etc/exabgp/healthcheck.py
  3. Check API communication:
# Enable ExaBGP API debugging
exabgp.log.all = true
exabgp.log.level = DEBUG
  4. Verify BGP session is established:
# Check neighbor status in logs
grep "Peer.*up" /var/log/exabgp/exabgp.log

Problem: Routes Flapping

Symptoms: Routes are constantly announced and withdrawn.

Solutions:

  1. Increase rise/fall thresholds:
RISE_THRESHOLD = 3  # More conservative
FALL_THRESHOLD = 5
  2. Increase check interval:
CHECK_INTERVAL = 15  # Check less frequently
  3. Implement hysteresis:
# Stay in current state for minimum time
MIN_STATE_TIME = 60  # 60 seconds minimum
last_state_change = time.time()

if time.time() - last_state_change >= MIN_STATE_TIME:
    # Allow state change
    pass
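
The hysteresis fragment can be fleshed out into a small hold-down timer. This is one possible sketch rather than a prescribed implementation. `HoldDown` is a hypothetical class, and it uses `time.monotonic` so that wall-clock adjustments do not affect the timer.

```python
import time


class HoldDown:
    """Allow a state change only after the current state has lasted long enough."""

    def __init__(self, min_state_time=60, clock=time.monotonic):
        self.min_state_time = min_state_time
        self.clock = clock          # injectable so the timer can be tested
        self.last_change = None     # no change yet: the first one is allowed

    def allow_change(self):
        """Return True and restart the timer if a state change is permitted."""
        now = self.clock()
        if self.last_change is not None and now - self.last_change < self.min_state_time:
            return False
        self.last_change = now
        return True
```

A script would call `allow_change()` just before printing an announce or withdraw command and skip the change when it returns `False`. Note that a hold-down applied to withdrawals keeps traffic flowing to a failed service for up to `MIN_STATE_TIME`, so it may be safer to apply it to announcements only.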

Problem: Health Check Script Crashes

Symptoms: Routes withdrawn and never come back.

Solutions:

  1. Add exception handling:
def main():
    try:
        while True:
            # Health check logic
            pass
    except Exception as e:
        sys.stderr.write(f"Fatal error: {e}\n")
        # Withdraw routes before exiting
        withdraw_route()
        sys.exit(1)
  2. Use systemd to restart the process:
[Service]
Restart=always
RestartSec=10
  3. Add logging to debug crashes:
import logging

logging.basicConfig(
    filename='/var/log/exabgp/healthcheck.log',
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

Problem: Slow Health Checks

Symptoms: Health checks take too long, affecting responsiveness.

Solutions:

  1. Use shorter timeouts:
requests.get(url, timeout=3)  # 3 second timeout
  2. Run checks in parallel (for multiple services):
import concurrent.futures

def check_all_services():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {
            executor.submit(check_web): "web",
            executor.submit(check_api): "api",
            executor.submit(check_db): "db",
        }
        results = {}
        for future in concurrent.futures.as_completed(futures):
            service = futures[future]
            results[service] = future.result()
        return all(results.values())
  3. Use lighter health check methods:
# TCP connection check (faster than HTTP)
import socket

def check_tcp(host, port):
    try:
        sock = socket.create_connection((host, port), timeout=2)
        sock.close()
        return True
    except OSError:
        return False

👻 Ghost written by Claude (Anthropic AI)