Error Handling

Thomas Mangin edited this page Nov 13, 2025 · 5 revisions

Comprehensive guide to handling errors in ExaBGP API programs



Introduction

Robust error handling is critical for production ExaBGP deployments. Your API programs must gracefully handle errors without crashing or creating routing instability.

Error Handling Philosophy

Key principles:

  • ✅ Fail gracefully - Don't crash on errors
  • ✅ Log everything - Debug issues in production
  • ✅ Retry intelligently - Use exponential backoff
  • ✅ Degrade gracefully - Partial failure is better than total failure
  • ✅ Monitor errors - Alert on error rates

Important reminder:

🔴 ExaBGP does NOT manipulate RIB/FIB - When your program withdraws a route due to errors, ExaBGP sends the withdrawal via BGP. The router removes the route from its RIB/FIB. ExaBGP itself never touches routing tables.


Error Categories

1. BGP Protocol Errors

  • What: BGP session failures, NOTIFICATION messages
  • Impact: Lost connectivity to peer, routes withdrawn
  • Handling: Log, wait for reconnection, continue running


2. API Command Errors

  • What: Invalid syntax, malformed commands
  • Impact: Commands rejected by ExaBGP
  • Handling: Validate before sending, use ACK feature


3. Connection Errors

  • What: Network failures, timeouts, unreachable hosts
  • Impact: Health checks fail, service appears down
  • Handling: Retry with backoff, use circuit breaker


4. Parsing Errors

  • What: Invalid JSON, unexpected message format
  • Impact: Can't process BGP updates
  • Handling: Skip bad messages, log for debugging


5. Process Errors

  • What: Your program crashes, out of memory
  • Impact: ExaBGP restarts your process
  • Handling: Defensive programming, resource limits


BGP NOTIFICATION Messages

What Are NOTIFICATION Messages?

BGP NOTIFICATION messages indicate errors or session termination.

Structure:

{
  "type": "notification",
  "neighbor": {
    "address": {
      "local": "192.168.1.2",
      "peer": "192.168.1.1"
    },
    "message": {
      "notification": {
        "code": 6,
        "subcode": 4,
        "data": "Administrative Reset"
      }
    }
  }
}

Common NOTIFICATION Codes

Code 1: Message Header Error

Meaning: Malformed BGP message header

Subcodes:

  • 1 - Connection Not Synchronized
  • 2 - Bad Message Length
  • 3 - Bad Message Type

Example:

{
  "notification": {
    "code": 1,
    "subcode": 2,
    "data": "bad message length"
  }
}

Handling:

if notification['code'] == 1:
    log("[ERROR] BGP message header error - possible network corruption")
    # Let BGP reconnect automatically

Code 2: OPEN Message Error

Meaning: Error in BGP OPEN message

Subcodes:

  • 1 - Unsupported Version Number
  • 2 - Bad Peer AS
  • 3 - Bad BGP Identifier
  • 4 - Unsupported Optional Parameter
  • 5 - Authentication Failure
  • 6 - Unacceptable Hold Time

Example:

{
  "notification": {
    "code": 2,
    "subcode": 2,
    "data": "peer AS mismatch"
  }
}

Handling:

if notification['code'] == 2:
    subcode = notification['subcode']
    if subcode == 2:
        log("[FATAL] Peer AS mismatch - check configuration")
        # This requires config fix
    elif subcode == 5:
        log("[FATAL] Authentication failure - check MD5 password")

Code 3: UPDATE Message Error

Meaning: Error in BGP UPDATE message

Subcodes:

  • 1 - Malformed Attribute List
  • 2 - Unrecognized Well-known Attribute
  • 3 - Missing Well-known Attribute
  • 4 - Attribute Flags Error
  • 5 - Attribute Length Error
  • 6 - Invalid ORIGIN Attribute
  • 7 - AS Routing Loop
  • 8 - Invalid NEXT_HOP Attribute
  • 9 - Optional Attribute Error
  • 10 - Invalid Network Field
  • 11 - Malformed AS_PATH

Example:

{
  "notification": {
    "code": 3,
    "subcode": 3,
    "data": "missing ORIGIN attribute"
  }
}

Handling:

if notification['code'] == 3:
    log(f"[ERROR] UPDATE message error: subcode {notification['subcode']}")
    # ExaBGP bug or malformed command from your program
    # Check recent announcements

Code 4: Hold Timer Expired

Meaning: No KEEPALIVE or UPDATE received within hold time

Example:

{
  "notification": {
    "code": 4,
    "subcode": 0,
    "data": "hold timer expired"
  }
}

Handling:

if notification['code'] == 4:
    log("[WARN] Hold timer expired - network issue or peer overloaded")
    # BGP will reconnect automatically
    # If this happens frequently, check network or peer CPU

Code 6: Cease

Meaning: Session terminated (most common)

Subcodes:

  • 1 - Maximum Number of Prefixes Reached
  • 2 - Administrative Shutdown
  • 3 - Peer De-configured
  • 4 - Administrative Reset
  • 5 - Connection Rejected
  • 6 - Other Configuration Change
  • 7 - Connection Collision Resolution
  • 8 - Out of Resources

Example:

{
  "notification": {
    "code": 6,
    "subcode": 2,
    "data": "administrative shutdown"
  }
}

Handling:

if notification['code'] == 6:
    subcode = notification['subcode']

    if subcode == 1:
        log("[ERROR] Max prefixes exceeded - peer rejected our routes")
    elif subcode == 2:
        log("[INFO] Administrative shutdown - peer was shut down manually")
    elif subcode == 4:
        log("[INFO] Administrative reset - peer restarted")
    elif subcode == 8:
        log("[ERROR] Peer out of resources - may be overloaded")
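
The code and subcode tables above can be folded into a lookup table so that log lines carry names instead of bare numbers. The following is a minimal sketch: `describe_notification` and the two table names are illustrative helpers, not part of ExaBGP, and only the codes listed in this guide are included.

```python
# Names copied from the tables in this guide; anything else is reported as unknown.
NOTIFICATION_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    6: "Cease",
}

NOTIFICATION_SUBCODES = {
    (1, 1): "Connection Not Synchronized",
    (1, 2): "Bad Message Length",
    (1, 3): "Bad Message Type",
    (2, 1): "Unsupported Version Number",
    (2, 2): "Bad Peer AS",
    (2, 3): "Bad BGP Identifier",
    (2, 4): "Unsupported Optional Parameter",
    (2, 5): "Authentication Failure",
    (2, 6): "Unacceptable Hold Time",
    (3, 1): "Malformed Attribute List",
    (3, 2): "Unrecognized Well-known Attribute",
    (3, 3): "Missing Well-known Attribute",
    (3, 4): "Attribute Flags Error",
    (3, 5): "Attribute Length Error",
    (3, 6): "Invalid ORIGIN Attribute",
    (3, 7): "AS Routing Loop",
    (3, 8): "Invalid NEXT_HOP Attribute",
    (3, 9): "Optional Attribute Error",
    (3, 10): "Invalid Network Field",
    (3, 11): "Malformed AS_PATH",
    (6, 1): "Maximum Number of Prefixes Reached",
    (6, 2): "Administrative Shutdown",
    (6, 3): "Peer De-configured",
    (6, 4): "Administrative Reset",
    (6, 5): "Connection Rejected",
    (6, 6): "Other Configuration Change",
    (6, 7): "Connection Collision Resolution",
    (6, 8): "Out of Resources",
}


def describe_notification(code, subcode):
    """Return 'code name: subcode name' for a NOTIFICATION (hypothetical helper)."""
    code_name = NOTIFICATION_CODES.get(code, f"Unknown code {code}")
    subcode_name = NOTIFICATION_SUBCODES.get((code, subcode))
    if subcode_name is None:
        # Subcode 0 means "no subcode", so the code name alone is enough.
        return code_name if subcode == 0 else f"{code_name}: unknown subcode {subcode}"
    return f"{code_name}: {subcode_name}"
```

Calling `describe_notification(code, subcode)` inside a handler keeps the per-code `if` chains for decisions only, while the log message comes from one place.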

Processing NOTIFICATION Messages

Complete handler:

def handle_notification(msg):
    """Process BGP NOTIFICATION messages"""
    try:
        peer = msg['neighbor']['address']['peer']
        notification = msg['neighbor']['message']['notification']

        code = notification.get('code', 0)
        subcode = notification.get('subcode', 0)
        data = notification.get('data', '')

        log(f"[NOTIFICATION] From {peer}: code={code} subcode={subcode} data={data}")

        # Handle specific codes
        if code == 2:  # OPEN error
            if subcode == 2:
                alert("[CRITICAL] Peer AS mismatch - configuration error")
            elif subcode == 5:
                alert("[CRITICAL] Authentication failure - check MD5 password")

        elif code == 3:  # UPDATE error
            log("[ERROR] UPDATE error - check recent announcements")
            # Log last N commands sent to ExaBGP
            for cmd in recent_commands[-10:]:
                log(f"  Recent command: {cmd}")

        elif code == 4:  # Hold timer
            log("[WARN] Hold timer expired - network or peer issue")

        elif code == 6:  # Cease
            if subcode == 1:
                alert("[ERROR] Max prefixes exceeded at peer")
            elif subcode == 8:
                alert("[WARN] Peer out of resources")

    except Exception as e:
        log(f"[ERROR] Failed to process notification: {e}")
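
The handler above reads `recent_commands`, which this guide does not define. One way to populate it is to route every API command through a single wrapper that records what was sent. This is a sketch under assumptions: `send_command`, `MAX_HISTORY`, and the history size of 50 are illustrative choices, not ExaBGP features.

```python
import sys

# Most recent commands written to ExaBGP, oldest first.
recent_commands = []
MAX_HISTORY = 50  # assumed size; tune to taste


def send_command(command, stream=None):
    """Write one API command (to stdout by default) and remember it."""
    stream = stream if stream is not None else sys.stdout
    stream.write(command + "\n")
    stream.flush()  # ExaBGP only sees the command once the buffer is flushed
    recent_commands.append(command)
    # Trim in place so slices such as recent_commands[-10:] keep working
    # and the list never grows beyond MAX_HISTORY entries.
    del recent_commands[:-MAX_HISTORY]
```

A plain list is used rather than `collections.deque` because the handler slices the history, and a deque does not support slicing.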

API Command Errors

ExaBGP 4.x and Earlier: No Feedback

Problem: Commands are silently ignored if invalid

# Invalid command (missing next-hop)
print("announce route 100.10.0.0/24")
sys.stdout.flush()

# ExaBGP logs error but your program doesn't know
# Result: Route not announced, no feedback

Solution: Validate commands before sending

def validate_announce(prefix, nexthop):
    """Validate announcement before sending"""
    import ipaddress

    # Validate prefix
    try:
        ipaddress.ip_network(prefix)
    except ValueError as e:
        log(f"[ERROR] Invalid prefix {prefix}: {e}")
        return False

    # Validate next-hop
    if nexthop != 'self':
        try:
            ipaddress.ip_address(nexthop)
        except ValueError as e:
            log(f"[ERROR] Invalid next-hop {nexthop}: {e}")
            return False

    return True

# Use it
if validate_announce('100.10.0.0/24', 'self'):
    print("announce route 100.10.0.0/24 next-hop self")
    sys.stdout.flush()

ACK Feature (ExaBGP 4.x and 5.x)

ACK is enabled by default in both versions. When it is enabled, ExaBGP writes one of the following responses to your program's STDIN after each command:

  • done\n - Command succeeded
  • error\n - Command failed
  • shutdown\n - ExaBGP shutting down

Example with error handling:

import json
import select
import sys
import time

def wait_for_ack(expected_count=1, timeout=30):
    """
    Wait for ACK responses with polling loop.
    ExaBGP may not respond immediately, so we poll with sleep.

    Handles both text and JSON encoder formats:
    - Text: "done", "error", "shutdown"
    - JSON: {"answer": "done|error|shutdown", "message": "..."}
    """
    import json
    received = 0
    start_time = time.time()

    while received < expected_count:
        if time.time() - start_time >= timeout:
            return False

        ready, _, _ = select.select([sys.stdin], [], [], 0.1)
        if ready:
            line = sys.stdin.readline().strip()

            # Parse response (could be text or JSON)
            answer = None
            if line.startswith('{'):
                try:
                    data = json.loads(line)
                    answer = data.get('answer')
                except json.JSONDecodeError:
                    pass
            else:
                answer = line

            if answer == "done":
                received += 1
            elif answer == "error":
                return False
            elif answer == "shutdown":
                raise SystemExit(0)
        else:
            time.sleep(0.1)

    return True

def send_command_with_ack(command, timeout=30):
    """Send command and wait for ACK"""
    sys.stdout.write(command + "\n")
    sys.stdout.flush()
    return wait_for_ack(expected_count=1, timeout=timeout)

# Use it
if not send_command_with_ack("announce route 100.10.0.0/24 next-hop self"):
    # Command failed
    alert("[CRITICAL] Failed to announce route")
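
The response parsing inside `wait_for_ack` can be pulled out into a small pure function, which makes both encoder formats easy to unit test without a running ExaBGP. A sketch, assuming the text and JSON shapes described in the docstring above; `parse_ack` is a hypothetical helper name.

```python
import json

ACK_ANSWERS = ("done", "error", "shutdown")


def parse_ack(line):
    """Return 'done', 'error' or 'shutdown', or None for any other line."""
    line = line.strip()
    if line.startswith("{"):
        # JSON encoder: the verdict is carried in the "answer" field.
        try:
            answer = json.loads(line).get("answer")
        except json.JSONDecodeError:
            return None
    else:
        # Text encoder: the line itself is the verdict.
        answer = line
    return answer if answer in ACK_ANSWERS else None
```

Returning `None` for unrelated lines (for example BGP updates arriving on the same pipe) lets the caller skip them and keep waiting.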

Connection Failures

Network Timeouts

Problem: Health checks timeout, service appears down

def check_health():
    """Health check without timeout - WRONG"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(('service', 80))  # Hangs forever if service down
    sock.close()
    return True

Solution: Always use timeouts

def check_health(host, port, timeout=2):
    """Health check with timeout - CORRECT"""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)  # CRITICAL
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except socket.timeout:
        log(f"[TIMEOUT] Health check to {host}:{port}")
        return False
    except socket.error as e:
        log(f"[ERROR] Socket error: {e}")
        return False
    except Exception as e:
        log(f"[ERROR] Unexpected error: {e}")
        return False
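
A more compact variant of the same check uses `socket.create_connection`, which accepts the timeout directly, together with a `with` block so the socket is closed on every path. This is a sketch; the name `tcp_check` is illustrative.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # The 'with' block closes the socket even if an error occurs later.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts,
        # all of which are OSError subclasses in Python 3.
        return False
```

Catching `OSError` once replaces the separate timeout and socket error branches, at the cost of not distinguishing them in the log.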

HTTP Timeouts

import socket
import urllib.error
import urllib.request

def http_health_check(url, timeout=2):
    """HTTP health check with timeout"""
    try:
        req = urllib.request.Request(url)
        # The with block closes the response on every path
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        log(f"[ERROR] HTTP {e.code}: {e.reason}")
        return False
    except urllib.error.URLError as e:
        log(f"[ERROR] URL error: {e.reason}")
        return False
    except socket.timeout:
        log(f"[TIMEOUT] HTTP request to {url}")
        return False
    except Exception as e:
        log(f"[ERROR] Unexpected error: {e}")
        return False

DNS Failures

Problem: DNS lookup hangs or fails

def resolve_host(hostname, timeout=2):
    """Resolve hostname with timeout"""
    import socket
    import signal

    # Set alarm for timeout (Unix only)
    def timeout_handler(signum, frame):
        raise TimeoutError("DNS lookup timeout")

    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout)

    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror as e:
        log(f"[ERROR] DNS lookup failed for {hostname}: {e}")
        return None
    except TimeoutError:
        log(f"[TIMEOUT] DNS lookup for {hostname}")
        return None
    finally:
        # Always cancel the pending alarm, otherwise it fires after the
        # old handler is restored and can kill the process
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)
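
`signal.alarm` is only available on Unix and only works from the main thread. Where that is a problem, the lookup can instead run in a worker thread and be abandoned after a deadline. The following is a sketch: `resolve_host_threaded`, the pool size of 4, and the injectable `resolver` parameter (there to make testing easy) are all assumptions of this example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool. A stuck lookup keeps occupying one worker until the
# operating system resolver gives up; the caller simply stops waiting for it.
_resolver_pool = ThreadPoolExecutor(max_workers=4)


def resolve_host_threaded(hostname, timeout=2.0, resolver=socket.gethostbyname):
    """Resolve hostname in a worker thread; return None on failure or timeout."""
    future = _resolver_pool.submit(resolver, hostname)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # The deadline passed before the worker produced an answer.
        return None
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        return None
```

The trade-off is that a timed-out lookup is not cancelled, only ignored, so a long run of hanging lookups can fill the pool.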

Circuit Breaker Pattern

Prevent cascade failures:

class CircuitBreaker:
    """Circuit breaker for health checks"""

    def __init__(self, failure_threshold=5, timeout=60, half_open_timeout=10):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.half_open_timeout = half_open_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.last_success_time = None
        self.state = 'closed'  # closed, open, half_open

    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection"""
        now = time.time()

        if self.state == 'open':
            # Check if timeout expired
            if now - self.last_failure_time >= self.timeout:
                self.state = 'half_open'
                log(f"[CIRCUIT] State: half_open (trying again)")
            else:
                raise Exception(f"Circuit breaker OPEN (fails={self.failure_count})")

        try:
            result = func(*args, **kwargs)

            # Success
            if self.state == 'half_open':
                self.state = 'closed'
                self.failure_count = 0
                log(f"[CIRCUIT] State: closed (recovered)")

            self.last_success_time = now
            return result

        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = now

            if self.failure_count >= self.failure_threshold:
                if self.state != 'open':
                    self.state = 'open'
                    log(f"[CIRCUIT] State: OPEN after {self.failure_count} failures")

            raise

    def reset(self):
        """Manually reset circuit breaker"""
        self.state = 'closed'
        self.failure_count = 0
        log(f"[CIRCUIT] Manually reset to closed")

# Use it
breaker = CircuitBreaker(failure_threshold=5, timeout=60)

while True:
    try:
        healthy = breaker.call(check_health, 'service', 80)

        if healthy:
            announce_route()
        else:
            withdraw_route()

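    except Exception as e:
        # The breaker only counts exceptions, so the wrapped check must
        # raise (not just return False) for the circuit to open
        log(f"[ERROR] Circuit breaker: {e}")
        withdraw_route()  # Treat a raised error or an open circuit as unhealthy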

    time.sleep(5)

Parsing Errors

JSON Parsing Errors

Problem: Malformed JSON from ExaBGP

# WRONG - crashes on invalid JSON
msg = json.loads(line)

Solution: Always catch parse errors

import json

while True:
    line = sys.stdin.readline()
    if not line:
        break

    try:
        msg = json.loads(line.strip())

        # Process message
        if msg['type'] == 'update':
            handle_update(msg)

    except json.JSONDecodeError as e:
        log(f"[ERROR] JSON parse error at line {e.lineno}, col {e.colno}: {e.msg}")
        log(f"[ERROR] Invalid line: {line[:100]}")  # Log first 100 chars
        # Continue processing next message
        continue

    except KeyError as e:
        log(f"[ERROR] Missing field in message: {e}")
        log(f"[ERROR] Message: {json.dumps(msg)[:200]}")
        continue

    except Exception as e:
        log(f"[ERROR] Unexpected error processing message: {e}")
        continue

Missing Fields Handling

Defensive field access:

def safe_get(dct, *keys, default=None):
    """Safely get nested dictionary value"""
    for key in keys:
        try:
            dct = dct[key]
        except (KeyError, IndexError, TypeError):
            return default
    return dct

# Use it
peer = safe_get(msg, 'neighbor', 'address', 'peer', default='unknown')
nexthop = safe_get(msg, 'neighbor', 'message', 'update', 'announce',
                   'ipv4 unicast', '100.10.0.0/24', 0, 'next-hop',
                   default='unknown')

# Or use .get() chain
update = msg.get('neighbor', {}).get('message', {}).get('update', {})
announce = update.get('announce', {})
ipv4_routes = announce.get('ipv4 unicast', {})

Health Check Failures

Transient Failures

Problem: Single failed check triggers route withdrawal

# WRONG - flaps on transient failures
if not check_health():
    withdraw_route()  # Withdrawn on single failure

Solution: Use hysteresis (require N consecutive failures)

class HealthTracker:
    """Track health with hysteresis to prevent flapping"""

    def __init__(self, threshold_up=3, threshold_down=2):
        self.threshold_up = threshold_up
        self.threshold_down = threshold_down
        self.consecutive_up = 0
        self.consecutive_down = 0
        self.state = 'unknown'

    def update(self, healthy):
        """Update health state"""
        if healthy:
            self.consecutive_up += 1
            self.consecutive_down = 0

            if self.consecutive_up >= self.threshold_up:
                if self.state != 'up':
                    log(f"[HEALTH] Service UP (after {self.consecutive_up} checks)")
                    self.state = 'up'
        else:
            self.consecutive_down += 1
            self.consecutive_up = 0

            if self.consecutive_down >= self.threshold_down:
                if self.state != 'down':
                    log(f"[HEALTH] Service DOWN (after {self.consecutive_down} checks)")
                    self.state = 'down'

        return self.state

# Use it
health = HealthTracker(threshold_up=3, threshold_down=2)

while True:
    check_result = check_health()
    state = health.update(check_result)

    if state == 'up':
        announce_route()
    elif state == 'down':
        withdraw_route()

    time.sleep(5)
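
The loop above sends an announce or withdraw command on every pass, even when nothing has changed. A small state holder can emit a command only on transitions. This is a sketch: `RouteController` is a hypothetical class, and the command strings follow the forms used elsewhere in this guide.

```python
class RouteController:
    """Emit an announce/withdraw command only when the desired state changes."""

    def __init__(self, prefix, next_hop="self"):
        self.prefix = prefix
        self.next_hop = next_hop
        self.announced = None  # None until the first command has been produced

    def command_for(self, state):
        """Return the command for a health state, or None if nothing changed."""
        if state == "up" and self.announced is not True:
            self.announced = True
            return f"announce route {self.prefix} next-hop {self.next_hop}"
        if state == "down" and self.announced is not False:
            self.announced = False
            return f"withdraw route {self.prefix}"
        # 'unknown' or an unchanged state: send nothing.
        return None
```

In the loop, the result of `command_for(health.update(check_result))` would be written to STDOUT (and flushed) only when it is not `None`.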

Multiple Health Check Types

Combine multiple checks with fallback:

def comprehensive_health_check():
    """Multiple health checks with fallback"""
    checks = []

    # TCP port check
    try:
        tcp_ok = tcp_port_check('127.0.0.1', 80, timeout=2)
        checks.append(('tcp', tcp_ok))
    except Exception as e:
        log(f"[ERROR] TCP check failed: {e}")
        checks.append(('tcp', False))

    # HTTP health endpoint
    try:
        http_ok = http_health_check('http://127.0.0.1/health', timeout=2)
        checks.append(('http', http_ok))
    except Exception as e:
        log(f"[ERROR] HTTP check failed: {e}")
        checks.append(('http', False))

    # Process check
    try:
        proc_ok = process_running('nginx')
        checks.append(('process', proc_ok))
    except Exception as e:
        log(f"[ERROR] Process check failed: {e}")
        checks.append(('process', False))

    # Require majority to pass
    passed = sum(1 for _, ok in checks if ok)
    total = len(checks)

    log(f"[HEALTH] Checks: {passed}/{total} passed - {checks}")

    return passed > total / 2  # Strict majority must pass

Process Crashes and Recovery

ExaBGP Process Monitoring

ExaBGP automatically restarts crashed processes:

process my-program {
    run /etc/exabgp/api/my-program.py;
    encoder text;
    # ExaBGP automatically respawns if process exits
}

Your program should:

  • Not exit unexpectedly
  • Catch all exceptions in main loop
  • Log crashes for debugging

Crash-Resistant Main Loop

def main():
    """Crash-resistant main loop"""
    consecutive_errors = 0
    max_consecutive_errors = 10

    while True:
        try:
            # Your main logic
            healthy = check_health()

            if healthy:
                announce_route()
            else:
                withdraw_route()

            # Reset error counter on success
            consecutive_errors = 0

            time.sleep(5)

        except KeyboardInterrupt:
            log("[INFO] Interrupted by user")
            break

        except Exception as e:
            consecutive_errors += 1
            log(f"[ERROR] Main loop error ({consecutive_errors}): {e}")

            # If too many consecutive errors, exit and let ExaBGP restart
            if consecutive_errors >= max_consecutive_errors:
                log(f"[FATAL] Too many consecutive errors, exiting")
                sys.exit(1)

            # Back off on errors
            time.sleep(min(consecutive_errors * 2, 60))

if __name__ == '__main__':
    main()

Resource Limits

Cap memory and open files so a leak cannot exhaust the host:

import resource

def set_resource_limits():
    """Set resource limits to prevent runaway process"""
    # Max 512MB memory
    max_memory = 512 * 1024 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (max_memory, max_memory))

    # Max 100 open files
    resource.setrlimit(resource.RLIMIT_NOFILE, (100, 100))

    log("[INFO] Resource limits set")

# Call at startup
set_resource_limits()
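
`resource.setrlimit` raises an error if the requested soft limit is above the current hard limit, which can happen when the process already runs under a tighter limit. A more defensive sketch reads the existing limits first and changes only the soft value. `clamp_to_hard` and `set_soft_limit` are hypothetical helper names, and the `resource` module is Unix only.

```python
import resource


def clamp_to_hard(desired, hard):
    """Return the soft limit to request: never above the current hard limit."""
    if hard == resource.RLIM_INFINITY:
        return desired  # No hard cap, so the desired value is always allowed.
    return min(desired, hard)


def set_soft_limit(kind, desired):
    """Set only the soft limit for `kind`, leaving the hard limit untouched."""
    _, hard = resource.getrlimit(kind)
    soft = clamp_to_hard(desired, hard)
    resource.setrlimit(kind, (soft, hard))
    return soft
```

For example, `set_soft_limit(resource.RLIMIT_NOFILE, 100)` requests the same file limit as above. Because the hard limit is left alone, the soft limit can still be raised later without extra privileges.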

Recovery Strategies

Exponential Backoff

import random

def exponential_backoff(attempt, base_delay=1, max_delay=60):
    """Calculate delay with exponential backoff"""
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Add up to +/-5% random jitter to prevent thundering herd
    jitter = delay * 0.1 * random.uniform(-0.5, 0.5)
    return delay + jitter

# Use it
attempt = 0
while True:
    try:
        result = risky_operation()
        attempt = 0  # Reset on success
        break
    except Exception as e:
        delay = exponential_backoff(attempt)
        log(f"[RETRY] Attempt {attempt} failed, retrying in {delay:.1f}s")
        time.sleep(delay)
        attempt += 1

        if attempt >= 10:
            log("[FATAL] Too many retries, giving up")
            raise
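
To see what this schedule means in practice, it helps to list the delays without jitter. A short worked sketch using the same defaults (base delay 1 second, cap 60 seconds); `backoff_delays` is an illustrative helper.

```python
def backoff_delays(attempts, base_delay=1, max_delay=60):
    """Return the un-jittered delay for attempt numbers 0 .. attempts-1."""
    return [min(base_delay * (2 ** n), max_delay) for n in range(attempts)]


print(backoff_delays(8))        # [1, 2, 4, 8, 16, 32, 60, 60]
print(sum(backoff_delays(10)))  # 303
```

The delay doubles until it reaches the cap at the seventh attempt. With a ten-attempt limit, the loop spends about 303 seconds (roughly five minutes) sleeping before it gives up, ignoring jitter.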

Graceful Degradation

def announce_with_degradation():
    """Announce route with fallback metrics on errors"""
    try:
        # Try preferred announcement (low MED)
        announce_route('100.10.0.0/24', med=100)
    except Exception as e:
        log(f"[ERROR] Preferred announcement failed: {e}")

        try:
            # Fall back to backup announcement (high MED)
            announce_route('100.10.0.0/24', med=200)
            log("[DEGRADED] Using backup route with higher MED")
        except Exception as e:
            log(f"[FATAL] Backup announcement also failed: {e}")
            raise
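
The example above calls `announce_route(prefix, med=...)` without showing how the MED reaches ExaBGP. One possible shape is a helper that builds the command string. This is a sketch: `build_announce` is hypothetical, and it assumes your ExaBGP version accepts a `med` attribute in the text API `announce route` command, so check the command against your version before relying on it.

```python
def build_announce(prefix, next_hop="self", med=None):
    """Build an 'announce route' command string, optionally with a MED."""
    parts = ["announce", "route", prefix, "next-hop", next_hop]
    if med is not None:
        # Assumed attribute keyword; lower MED values are preferred by peers.
        parts += ["med", str(med)]
    return " ".join(parts)
```

`announce_route` would then write the returned string to STDOUT followed by a newline and a flush, as in the other examples in this guide.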

Logging Best Practices

Structured Logging

import json
import sys
import time

def log_structured(level, message, **kwargs):
    """Structured logging for easy parsing"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'message': message,
        **kwargs
    }
    sys.stderr.write(json.dumps(entry) + "\n")
    sys.stderr.flush()

# Use it
log_structured('INFO', 'Service healthy', service='web', port=80)
log_structured('ERROR', 'Health check failed', service='web', error='timeout')
log_structured('WARN', 'Route flapping detected', prefix='100.10.0.0/24', count=5)
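
The same JSON-lines output can also be produced with the standard `logging` module, which adds level filtering and handler configuration on top. A sketch under assumptions: `JsonFormatter`, `make_logger`, and the convention of passing extra fields under a `fields` key are choices made for this example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": record.created,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra fields arrive via logging's 'extra' argument under a 'fields' key.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name="healthcheck", stream=None):
    """Build a logger that writes JSON lines to stderr by default.

    STDOUT is reserved for commands to ExaBGP, so logs must not go there.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # Avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

A call then looks like `logger.error("Health check failed", extra={"fields": {"service": "web", "error": "timeout"}})`.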

Log Levels

import os
import sys
import time

LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')
LEVELS = {'DEBUG': 0, 'INFO': 1, 'WARN': 2, 'ERROR': 3, 'FATAL': 4}

def log(level, message):
    """Log with level filtering"""
    if LEVELS.get(level, 1) >= LEVELS.get(LOG_LEVEL, 1):
        timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
        sys.stderr.write(f"[{timestamp}] [{level}] {message}\n")
        sys.stderr.flush()

# Use it
log('DEBUG', 'Health check starting')  # Only if LOG_LEVEL=DEBUG
log('INFO', 'Route announced')
log('ERROR', 'Health check failed')
log('FATAL', 'Unrecoverable error')

Rate Limiting Logs

Prevent log spam:

class RateLimitedLogger:
    """Log with rate limiting to prevent spam"""

    def __init__(self, interval=60):
        self.interval = interval
        self.last_log = {}

    def log(self, key, message):
        """Log message with rate limiting per key"""
        now = time.time()

        if key not in self.last_log or now - self.last_log[key] >= self.interval:
            sys.stderr.write(f"{message}\n")
            sys.stderr.flush()
            self.last_log[key] = now
            return True
        return False

# Use it
rate_logger = RateLimitedLogger(interval=60)

while True:
    if not check_health():
        # Only log once per minute
        rate_logger.log('health_fail', '[ERROR] Health check failed')
    time.sleep(5)  # Pause between checks instead of spinning

Error Handling Patterns

Try-Except Hierarchy

def robust_operation():
    """Proper exception handling hierarchy"""
    try:
        # Risky operation
        result = perform_operation()

    except ConnectionError as e:
        # Specific exception
        log(f"[ERROR] Connection failed: {e}")
        # Handle connection error
        return None

    except TimeoutError as e:
        # Another specific exception
        log(f"[ERROR] Operation timeout: {e}")
        # Handle timeout
        return None

    except Exception as e:
        # Catch-all for unexpected errors
        log(f"[ERROR] Unexpected error: {e}")
        # Log traceback for debugging
        import traceback
        traceback.print_exc(file=sys.stderr)
        return None

    else:
        # Success path
        log("[INFO] Operation successful")
        return result

    finally:
        # Cleanup (always runs)
        cleanup_resources()

Context Managers

from contextlib import contextmanager

@contextmanager
def error_handler(operation_name):
    """Context manager for consistent error handling"""
    try:
        log(f"[START] {operation_name}")
        yield
        log(f"[SUCCESS] {operation_name}")
    except Exception as e:
        log(f"[ERROR] {operation_name} failed: {e}")
        raise
    finally:
        log(f"[END] {operation_name}")

# Use it
with error_handler("health check"):
    result = check_health()

with error_handler("route announcement"):
    announce_route('100.10.0.0/24')

Production Examples

Complete Error-Resistant Program

#!/usr/bin/env python3
"""
robust_healthcheck.py - Production-grade error handling
"""
import sys
import time
import socket
import signal
import json
from contextlib import contextmanager

# Configuration
CONFIG = {
    'service_ip': '100.10.0.100',
    'service_port': 80,
    'check_interval': 5,
    'health_threshold_up': 3,
    'health_threshold_down': 2,
    'max_consecutive_errors': 10,
}

# State
announced = False
consecutive_healthy = 0
consecutive_unhealthy = 0
consecutive_errors = 0

def log(level, message, **kwargs):
    """Structured logging"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'message': message,
        **kwargs
    }
    sys.stderr.write(json.dumps(entry) + "\n")
    sys.stderr.flush()

def signal_handler(signum, frame):
    """Handle shutdown gracefully"""
    log('INFO', 'Shutdown signal received', signal=signum)
    if announced:
        try:
            sys.stdout.write(f"withdraw route {CONFIG['service_ip']}/32\n")
            sys.stdout.flush()
            log('INFO', 'Route withdrawn on shutdown')
        except Exception as e:
            log('ERROR', 'Failed to withdraw route on shutdown', error=str(e))
    sys.exit(0)

@contextmanager
def timeout_context(seconds):
    """Timeout context manager"""
    def timeout_handler(signum, frame):
        raise TimeoutError(f"Operation timeout after {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

def check_health():
    """Health check with comprehensive error handling"""
    try:
        with timeout_context(2):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(2)
            result = sock.connect_ex(('127.0.0.1', CONFIG['service_port']))
            sock.close()
            return result == 0

    except TimeoutError:
        log('WARN', 'Health check timeout')
        return False
    except socket.error as e:
        log('ERROR', 'Socket error during health check', error=str(e))
        return False
    except Exception as e:
        log('ERROR', 'Unexpected health check error', error=str(e))
        return False

def announce_route():
    """Announce route with error handling"""
    global announced

    try:
        sys.stdout.write(f"announce route {CONFIG['service_ip']}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
        log('INFO', 'Route announced', prefix=f"{CONFIG['service_ip']}/32")
    except Exception as e:
        log('ERROR', 'Failed to announce route', error=str(e))
        raise

def withdraw_route():
    """Withdraw route with error handling"""
    global announced

    try:
        sys.stdout.write(f"withdraw route {CONFIG['service_ip']}/32\n")
        sys.stdout.flush()
        announced = False
        log('INFO', 'Route withdrawn', prefix=f"{CONFIG['service_ip']}/32")
    except Exception as e:
        log('ERROR', 'Failed to withdraw route', error=str(e))
        raise

def main():
    """Main loop with error handling"""
    global consecutive_healthy, consecutive_unhealthy, consecutive_errors

    # Register signal handlers
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)

    log('INFO', 'Starting health check daemon', config=CONFIG)
    time.sleep(2)  # Wait for ExaBGP
    log('INFO', 'Ready')

    while True:
        try:
            # Health check
            healthy = check_health()

            # Reset error counter on successful check
            consecutive_errors = 0

            # Update counters
            if healthy:
                consecutive_healthy += 1
                consecutive_unhealthy = 0
            else:
                consecutive_unhealthy += 1
                consecutive_healthy = 0

            # State transitions with hysteresis
            if consecutive_healthy >= CONFIG['health_threshold_up'] and not announced:
                announce_route()
            elif consecutive_unhealthy >= CONFIG['health_threshold_down'] and announced:
                withdraw_route()

            time.sleep(CONFIG['check_interval'])

        except KeyboardInterrupt:
            log('INFO', 'Interrupted by user')
            break

        except Exception as e:
            consecutive_errors += 1
            log('ERROR', 'Main loop error',
                error=str(e),
                consecutive_errors=consecutive_errors)

            if consecutive_errors >= CONFIG['max_consecutive_errors']:
                log('FATAL', 'Too many consecutive errors, exiting')
                sys.exit(1)

            # Exponential backoff on errors, capped at 60 seconds
            backoff = min(2 ** consecutive_errors, 60)
            time.sleep(backoff)

if __name__ == '__main__':
    main()

Troubleshooting Guide

Common Errors and Solutions

| Error | Cause | Solution |
|---|---|---|
| Routes not announced | Forgot to flush STDOUT | Add sys.stdout.flush() |
| Process exits immediately | No keep-alive loop | Add while True: sleep(60) |
| Health check hangs | No timeout on socket | Set sock.settimeout(2) |
| Route flapping | No hysteresis | Use HealthTracker pattern |
| Crashes on bad JSON | No error handling | Wrap json.loads() in try/except |
| BGP session down | Config error | Check peer AS, router-id |
| NOTIFICATION code 3 | Invalid UPDATE | Validate commands before sending |
| Memory leak | No cleanup | Use context managers, del objects |

👻 Ghost written by Claude (Anthropic AI)
