Error Handling
Comprehensive guide to handling errors in ExaBGP API programs
- Introduction
- Error Categories
- BGP NOTIFICATION Messages
- API Command Errors
- Connection Failures
- Parsing Errors
- Health Check Failures
- Process Crashes and Recovery
- Recovery Strategies
- Logging Best Practices
- Error Handling Patterns
- Production Examples
- Troubleshooting Guide
Robust error handling is critical for production ExaBGP deployments. Your API programs must gracefully handle errors without crashing or creating routing instability.
Key principles:
- Fail gracefully - Don't crash on errors
- Log everything - Debug issues in production
- Retry intelligently - Use exponential backoff
- Degrade gracefully - Partial failure is better than total failure
- Monitor errors - Alert on error rates
Important reminder:
ExaBGP does NOT manipulate the RIB/FIB. When your program withdraws a route due to errors, ExaBGP sends the withdrawal via BGP, and the router removes the route from its RIB/FIB. ExaBGP itself never touches routing tables.
What: BGP session failures, NOTIFICATION messages
Impact: Lost connectivity to peer, routes withdrawn
Handling: Log, wait for reconnection, continue running

What: Invalid syntax, malformed commands
Impact: Commands rejected by ExaBGP
Handling: Validate before sending, use ACK feature

What: Network failures, timeouts, unreachable hosts
Impact: Health checks fail, service appears down
Handling: Retry with backoff, use circuit breaker

What: Invalid JSON, unexpected message format
Impact: Can't process BGP updates
Handling: Skip bad messages, log for debugging

What: Your program crashes, out of memory
Impact: ExaBGP restarts your process
Handling: Defensive programming, resource limits
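One way to make these categories actionable is to map Python exceptions onto them, so each category can get its own handling path. The sketch below is illustrative only: `ErrorCategory` and `classify_error` are hypothetical names, and the mapping from exception types to categories is an assumption rather than anything ExaBGP defines. BGP protocol errors are left out because they arrive as messages, not as exceptions.

```python
import json
from enum import Enum


class ErrorCategory(Enum):
    """Hypothetical labels for the categories described above."""
    API_COMMAND = "api_command"
    CONNECTION = "connection"
    PARSING = "parsing"
    PROCESS = "process"


def classify_error(exc):
    """Map an exception to a category (assumed mapping, adjust as needed)."""
    # JSONDecodeError subclasses ValueError, so test it before ValueError.
    if isinstance(exc, (json.JSONDecodeError, KeyError)):
        return ErrorCategory.PARSING
    # e.g. ipaddress raising ValueError while validating a command
    if isinstance(exc, ValueError):
        return ErrorCategory.API_COMMAND
    # ConnectionError and TimeoutError are both OSError subclasses.
    if isinstance(exc, OSError):
        return ErrorCategory.CONNECTION
    # Anything else is treated as a problem in the program itself.
    return ErrorCategory.PROCESS
```

A main loop could call `classify_error(e)` in its `except Exception as e:` branch and decide whether to skip the message, retry with backoff, or exit.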
BGP NOTIFICATION messages indicate errors or session termination.
Structure:
{
  "type": "notification",
  "neighbor": {
    "address": {
      "local": "192.168.1.2",
      "peer": "192.168.1.1"
    },
    "message": {
      "notification": {
        "code": 6,
        "subcode": 4,
        "data": "Administrative Reset"
      }
    }
  }
}

Code 1 (Message Header Error)
Meaning: Malformed BGP message header
Subcodes:
- 1 - Connection Not Synchronized
- 2 - Bad Message Length
- 3 - Bad Message Type
Example:
{
  "notification": {
    "code": 1,
    "subcode": 2,
    "data": "bad message length"
  }
}

Handling:
if notification['code'] == 1:
    log("[ERROR] BGP message header error - possible network corruption")
    # Let BGP reconnect automatically

Code 2 (OPEN Message Error)
Meaning: Error in BGP OPEN message
Subcodes:
- 1 - Unsupported Version Number
- 2 - Bad Peer AS
- 3 - Bad BGP Identifier
- 4 - Unsupported Optional Parameter
- 5 - Authentication Failure
- 6 - Unacceptable Hold Time
Example:
{
  "notification": {
    "code": 2,
    "subcode": 2,
    "data": "peer AS mismatch"
  }
}

Handling:
if notification['code'] == 2:
    subcode = notification['subcode']
    if subcode == 2:
        log("[FATAL] Peer AS mismatch - check configuration")
        # This requires a config fix
    elif subcode == 5:
        log("[FATAL] Authentication failure - check MD5 password")

Code 3 (UPDATE Message Error)
Meaning: Error in BGP UPDATE message
Subcodes:
- 1 - Malformed Attribute List
- 2 - Unrecognized Well-known Attribute
- 3 - Missing Well-known Attribute
- 4 - Attribute Flags Error
- 5 - Attribute Length Error
- 6 - Invalid ORIGIN Attribute
- 7 - AS Routing Loop
- 8 - Invalid NEXT_HOP Attribute
- 9 - Optional Attribute Error
- 10 - Invalid Network Field
- 11 - Malformed AS_PATH
Example:
{
  "notification": {
    "code": 3,
    "subcode": 3,
    "data": "missing ORIGIN attribute"
  }
}

Handling:
if notification['code'] == 3:
    log(f"[ERROR] UPDATE message error: subcode {notification['subcode']}")
    # ExaBGP bug or malformed command from your program
    # Check recent announcements

Code 4 (Hold Timer Expired)
Meaning: No KEEPALIVE or UPDATE received within hold time
Example:
{
  "notification": {
    "code": 4,
    "subcode": 0,
    "data": "hold timer expired"
  }
}

Handling:
if notification['code'] == 4:
    log("[WARN] Hold timer expired - network issue or peer overloaded")
    # BGP will reconnect automatically
    # If this happens frequently, check network or peer CPU

Code 6 (Cease)
Meaning: Session terminated (most common)
Subcodes:
- 1 - Maximum Number of Prefixes Reached
- 2 - Administrative Shutdown
- 3 - Peer De-configured
- 4 - Administrative Reset
- 5 - Connection Rejected
- 6 - Other Configuration Change
- 7 - Connection Collision Resolution
- 8 - Out of Resources
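The code and subcode lists above lend themselves to a lookup table, so log lines can carry names instead of bare numbers. The following is a minimal sketch: the code names follow RFC 4271, the subcode names are copied from the lists above, and `describe_notification` is a hypothetical helper, not part of ExaBGP.

```python
# Code names per RFC 4271; subcode names copied from the lists above.
NOTIFICATION_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    6: "Cease",
}

NOTIFICATION_SUBCODES = {
    1: {1: "Connection Not Synchronized", 2: "Bad Message Length",
        3: "Bad Message Type"},
    2: {1: "Unsupported Version Number", 2: "Bad Peer AS",
        3: "Bad BGP Identifier", 4: "Unsupported Optional Parameter",
        5: "Authentication Failure", 6: "Unacceptable Hold Time"},
    6: {1: "Maximum Number of Prefixes Reached", 2: "Administrative Shutdown",
        3: "Peer De-configured", 4: "Administrative Reset",
        5: "Connection Rejected", 6: "Other Configuration Change",
        7: "Connection Collision Resolution", 8: "Out of Resources"},
}


def describe_notification(code, subcode):
    """Return a readable label for a (code, subcode) pair."""
    code_name = NOTIFICATION_CODES.get(code, f"Unknown code {code}")
    subcode_name = NOTIFICATION_SUBCODES.get(code, {}).get(subcode)
    if subcode_name is None:
        # No subcode table for this code, or an unlisted subcode.
        return code_name
    return f"{code_name}: {subcode_name}"
```

The UPDATE subcodes (code 3) are omitted here for brevity and can be added in the same shape.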
Example:
{
  "notification": {
    "code": 6,
    "subcode": 2,
    "data": "administrative shutdown"
  }
}

Handling:
if notification['code'] == 6:
    subcode = notification['subcode']
    if subcode == 1:
        log("[ERROR] Max prefixes exceeded - peer rejected our routes")
    elif subcode == 2:
        log("[INFO] Administrative shutdown - peer was shut down manually")
    elif subcode == 4:
        log("[INFO] Administrative reset - peer restarted")
    elif subcode == 8:
        log("[ERROR] Peer out of resources - may be overloaded")

Complete handler:
def handle_notification(msg):
    """Process BGP NOTIFICATION messages"""
    # Assumes log(), alert() and a recent_commands list are defined elsewhere
    try:
        peer = msg['neighbor']['address']['peer']
        notification = msg['neighbor']['message']['notification']
        code = notification.get('code', 0)
        subcode = notification.get('subcode', 0)
        data = notification.get('data', '')
        log(f"[NOTIFICATION] From {peer}: code={code} subcode={subcode} data={data}")

        # Handle specific codes
        if code == 2:  # OPEN error
            if subcode == 2:
                alert("[CRITICAL] Peer AS mismatch - configuration error")
            elif subcode == 5:
                alert("[CRITICAL] Authentication failure - check MD5 password")
        elif code == 3:  # UPDATE error
            log("[ERROR] UPDATE error - check recent announcements")
            # Log the last 10 commands sent to ExaBGP
            for cmd in recent_commands[-10:]:
                log(f"  Recent command: {cmd}")
        elif code == 4:  # Hold timer
            log("[WARN] Hold timer expired - network or peer issue")
        elif code == 6:  # Cease
            if subcode == 1:
                alert("[ERROR] Max prefixes exceeded at peer")
            elif subcode == 8:
                alert("[WARN] Peer out of resources")
    except Exception as e:
        log(f"[ERROR] Failed to process notification: {e}")

Problem: Invalid commands are rejected without any feedback to your program
# Invalid command (missing next-hop)
print("announce route 100.10.0.0/24")
sys.stdout.flush()
# ExaBGP logs error but your program doesn't know
# Result: Route not announced, no feedback

Solution: Validate commands before sending
import ipaddress
import sys

def validate_announce(prefix, nexthop):
    """Validate announcement before sending"""
    # Validate prefix
    try:
        ipaddress.ip_network(prefix)
    except ValueError as e:
        log(f"[ERROR] Invalid prefix {prefix}: {e}")
        return False

    # Validate next-hop ('self' is accepted as-is)
    if nexthop != 'self':
        try:
            ipaddress.ip_address(nexthop)
        except ValueError as e:
            log(f"[ERROR] Invalid next-hop {nexthop}: {e}")
            return False
    return True

# Use it
if validate_announce('100.10.0.0/24', 'self'):
    print("announce route 100.10.0.0/24 next-hop self")
    sys.stdout.flush()

ACK is enabled by default in both versions. To use ACK responses:
Responses (when ACK is enabled):
- done\n - Command succeeded
- error\n - Command failed
- shutdown\n - ExaBGP shutting down
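Before wiring these responses into a read loop, it can help to isolate the parsing in a small pure function that is easy to test. This is a minimal sketch: `parse_ack` is a hypothetical helper, and the JSON shape `{"answer": ...}` is assumed from the docstring of the example that follows.

```python
import json

VALID_ANSWERS = ("done", "error", "shutdown")


def parse_ack(line):
    """Return 'done', 'error' or 'shutdown', or None for any other line."""
    line = line.strip()
    if not line:
        return None
    answer = line
    if line.startswith("{"):
        # JSON encoder: the answer is carried in the "answer" field.
        try:
            answer = json.loads(line).get("answer")
        except json.JSONDecodeError:
            return None
    return answer if answer in VALID_ANSWERS else None
```

Returning `None` for unrelated lines lets the caller keep reading instead of mistaking a BGP event for an acknowledgement.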
Example with error handling:
import json
import select
import sys
import time

def wait_for_ack(expected_count=1, timeout=30):
    """
    Wait for ACK responses with polling loop.
    ExaBGP may not respond immediately, so we poll with sleep.
    Handles both text and JSON encoder formats:
    - Text: "done", "error", "shutdown"
    - JSON: {"answer": "done|error|shutdown", "message": "..."}
    """
    received = 0
    start_time = time.time()
    while received < expected_count:
        if time.time() - start_time >= timeout:
            return False
        ready, _, _ = select.select([sys.stdin], [], [], 0.1)
        if ready:
            line = sys.stdin.readline().strip()
            # Parse response (could be text or JSON)
            answer = None
            if line.startswith('{'):
                try:
                    data = json.loads(line)
                    answer = data.get('answer')
                except json.JSONDecodeError:
                    pass  # Not a valid JSON response; ignore this line
            else:
                answer = line
            if answer == "done":
                received += 1
            elif answer == "error":
                return False
            elif answer == "shutdown":
                raise SystemExit(0)
        else:
            time.sleep(0.1)
    return True

def send_command_with_ack(command, timeout=30):
    """Send command and wait for ACK"""
    sys.stdout.write(command + "\n")
    sys.stdout.flush()
    return wait_for_ack(expected_count=1, timeout=timeout)

# Use it
if not send_command_with_ack("announce route 100.10.0.0/24 next-hop self"):
    # Command failed
    alert("[CRITICAL] Failed to announce route")

Problem: Health checks time out, service appears down
def check_health():
    """Health check without timeout - WRONG"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(('service', 80))  # Hangs forever if service down
    sock.close()
    return True

Solution: Always use timeouts
import socket

def check_health(host, port, timeout=2):
    """Health check with timeout - CORRECT"""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)  # CRITICAL
        result = sock.connect_ex((host, port))
        sock.close()
        return result == 0
    except socket.timeout:
        log(f"[TIMEOUT] Health check to {host}:{port}")
        return False
    except socket.error as e:
        log(f"[ERROR] Socket error: {e}")
        return False
    except Exception as e:
        log(f"[ERROR] Unexpected error: {e}")
        return False

import socket
import urllib.error
import urllib.request

def http_health_check(url, timeout=2):
    """HTTP health check with timeout"""
    try:
        req = urllib.request.Request(url)
        response = urllib.request.urlopen(req, timeout=timeout)
        return response.status == 200
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        log(f"[ERROR] HTTP {e.code}: {e.reason}")
        return False
    except urllib.error.URLError as e:
        log(f"[ERROR] URL error: {e.reason}")
        return False
    except socket.timeout:
        log(f"[TIMEOUT] HTTP request to {url}")
        return False
    except Exception as e:
        log(f"[ERROR] Unexpected error: {e}")
        return False

Problem: DNS lookup hangs or fails
import signal
import socket

def resolve_host(hostname, timeout=2):
    """Resolve hostname with timeout (Unix only, main thread only)"""
    def timeout_handler(signum, frame):
        raise TimeoutError("DNS lookup timeout")

    # Set alarm for timeout; signal.alarm() takes whole seconds
    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout)
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror as e:
        log(f"[ERROR] DNS lookup failed for {hostname}: {e}")
        return None
    except TimeoutError:
        log(f"[TIMEOUT] DNS lookup for {hostname}")
        return None
    finally:
        signal.alarm(0)  # Always cancel the alarm, even on failure
        signal.signal(signal.SIGALRM, old_handler)

Prevent cascade failures:
import time

class CircuitBreaker:
    """Circuit breaker for health checks"""
    def __init__(self, failure_threshold=5, timeout=60, half_open_timeout=10):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.half_open_timeout = half_open_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.last_success_time = None
        self.state = 'closed'  # closed, open, half_open

    def call(self, func, *args, **kwargs):
        """Execute function with circuit breaker protection"""
        now = time.time()
        if self.state == 'open':
            # Check if timeout expired
            if now - self.last_failure_time >= self.timeout:
                self.state = 'half_open'
                log("[CIRCUIT] State: half_open (trying again)")
            else:
                raise Exception(f"Circuit breaker OPEN (fails={self.failure_count})")
        try:
            result = func(*args, **kwargs)
            # Success
            if self.state == 'half_open':
                self.state = 'closed'
                log("[CIRCUIT] State: closed (recovered)")
            self.failure_count = 0  # Count consecutive failures only
            self.last_success_time = now
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = now
            if self.failure_count >= self.failure_threshold:
                if self.state != 'open':
                    self.state = 'open'
                    log(f"[CIRCUIT] State: OPEN after {self.failure_count} failures")
            raise

    def reset(self):
        """Manually reset circuit breaker"""
        self.state = 'closed'
        self.failure_count = 0
        log("[CIRCUIT] Manually reset to closed")

# Use it
# Note: the breaker only counts exceptions, so the wrapped check must raise
# on failure (rather than return False) for the circuit to open.
breaker = CircuitBreaker(failure_threshold=5, timeout=60)
while True:
    try:
        healthy = breaker.call(check_health, 'service', 80)
        if healthy:
            announce_route()
        else:
            withdraw_route()
    except Exception as e:
        log(f"[ERROR] Circuit breaker: {e}")
        # Circuit open or check raised; treat the service as down
    time.sleep(5)

Problem: Malformed JSON from ExaBGP
# WRONG - crashes on invalid JSON
msg = json.loads(line)

Solution: Always catch parse errors
import json
import sys

while True:
    line = sys.stdin.readline()
    if not line:
        break  # EOF: ExaBGP closed the pipe
    try:
        msg = json.loads(line.strip())
        # Process message
        if msg['type'] == 'update':
            handle_update(msg)
    except json.JSONDecodeError as e:
        log(f"[ERROR] JSON parse error at line {e.lineno}, col {e.colno}: {e.msg}")
        log(f"[ERROR] Invalid line: {line[:100]}")  # Log first 100 chars
        # Continue processing next message
        continue
    except KeyError as e:
        log(f"[ERROR] Missing field in message: {e}")
        log(f"[ERROR] Message: {json.dumps(msg)[:200]}")
        continue
    except Exception as e:
        log(f"[ERROR] Unexpected error processing message: {e}")
        continue

Defensive field access:
def safe_get(dct, *keys, default=None):
    """Safely get nested dictionary or list value"""
    for key in keys:
        try:
            dct = dct[key]
        except (KeyError, IndexError, TypeError):
            # IndexError covers list positions such as the 0 used below
            return default
    return dct

# Use it
peer = safe_get(msg, 'neighbor', 'address', 'peer', default='unknown')
nexthop = safe_get(msg, 'neighbor', 'message', 'update', 'announce',
                   'ipv4 unicast', '100.10.0.0/24', 0, 'next-hop',
                   default='unknown')

# Or use .get() chain
update = msg.get('neighbor', {}).get('message', {}).get('update', {})
announce = update.get('announce', {})
ipv4_routes = announce.get('ipv4 unicast', {})

Problem: Single failed check triggers route withdrawal
# WRONG - flaps on transient failures
if not check_health():
    withdraw_route()  # Withdrawn on single failure

Solution: Use hysteresis (require N consecutive failures)
class HealthTracker:
    """Track health with hysteresis to prevent flapping"""
    def __init__(self, threshold_up=3, threshold_down=2):
        self.threshold_up = threshold_up
        self.threshold_down = threshold_down
        self.consecutive_up = 0
        self.consecutive_down = 0
        self.state = 'unknown'

    def update(self, healthy):
        """Update health state"""
        if healthy:
            self.consecutive_up += 1
            self.consecutive_down = 0
            if self.consecutive_up >= self.threshold_up:
                if self.state != 'up':
                    log(f"[HEALTH] Service UP (after {self.consecutive_up} checks)")
                self.state = 'up'
        else:
            self.consecutive_down += 1
            self.consecutive_up = 0
            if self.consecutive_down >= self.threshold_down:
                if self.state != 'down':
                    log(f"[HEALTH] Service DOWN (after {self.consecutive_down} checks)")
                self.state = 'down'
        return self.state

# Use it
health = HealthTracker(threshold_up=3, threshold_down=2)
while True:
    check_result = check_health()
    state = health.update(check_result)
    if state == 'up':
        announce_route()
    elif state == 'down':
        withdraw_route()
    time.sleep(5)

Combine multiple checks with fallback:
def comprehensive_health_check():
    """Multiple health checks with fallback"""
    # tcp_port_check() and process_running() are assumed to be defined elsewhere
    checks = []

    # TCP port check
    try:
        tcp_ok = tcp_port_check('127.0.0.1', 80, timeout=2)
        checks.append(('tcp', tcp_ok))
    except Exception as e:
        log(f"[ERROR] TCP check failed: {e}")
        checks.append(('tcp', False))

    # HTTP health endpoint
    try:
        http_ok = http_health_check('http://127.0.0.1/health', timeout=2)
        checks.append(('http', http_ok))
    except Exception as e:
        log(f"[ERROR] HTTP check failed: {e}")
        checks.append(('http', False))

    # Process check
    try:
        proc_ok = process_running('nginx')
        checks.append(('process', proc_ok))
    except Exception as e:
        log(f"[ERROR] Process check failed: {e}")
        checks.append(('process', False))

    # Require a strict majority to pass
    passed = sum(1 for _, ok in checks if ok)
    total = len(checks)
    log(f"[HEALTH] Checks: {passed}/{total} passed - {checks}")
    return passed > total / 2  # '>=' would accept a 1-of-2 tie

ExaBGP automatically restarts crashed processes:
process my-program {
    run /etc/exabgp/api/my-program.py;
    encoder text;
    # ExaBGP automatically respawns if process exits
}

Your program should:
- Not exit unexpectedly
- Catch all exceptions in main loop
- Log crashes for debugging
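For the last point, a one-line message often loses the detail needed to diagnose a crash after the fact. The sketch below writes the full traceback to stderr, which keeps it out of stdout where API commands are written. `log_crash` is a hypothetical helper name, and the `[CRASH]` prefix is an arbitrary choice.

```python
import sys
import traceback


def log_crash(exc, stream=sys.stderr):
    """Write the full traceback of exc to stream and flush it."""
    lines = traceback.format_exception(type(exc), exc, exc.__traceback__)
    stream.write("[CRASH] " + "".join(lines))
    stream.flush()
```

Calling `log_crash(e)` from the catch-all `except Exception as e:` branch of the main loop records where the failure happened before the loop backs off or exits.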
import sys
import time

def main():
    """Crash-resistant main loop"""
    consecutive_errors = 0
    max_consecutive_errors = 10
    while True:
        try:
            # Your main logic
            healthy = check_health()
            if healthy:
                announce_route()
            else:
                withdraw_route()
            # Reset error counter on success
            consecutive_errors = 0
            time.sleep(5)
        except KeyboardInterrupt:
            log("[INFO] Interrupted by user")
            break
        except Exception as e:
            consecutive_errors += 1
            log(f"[ERROR] Main loop error ({consecutive_errors}): {e}")
            # If too many consecutive errors, exit and let ExaBGP restart
            if consecutive_errors >= max_consecutive_errors:
                log("[FATAL] Too many consecutive errors, exiting")
                sys.exit(1)
            # Back off on errors (linear, capped at 60s)
            time.sleep(min(consecutive_errors * 2, 60))

if __name__ == '__main__':
    main()

Contain memory leaks with resource limits:
import resource

def set_resource_limits():
    """Set resource limits to prevent runaway process"""
    # Max 512MB memory
    max_memory = 512 * 1024 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (max_memory, max_memory))
    # Max 100 open files
    resource.setrlimit(resource.RLIMIT_NOFILE, (100, 100))
    log("[INFO] Resource limits set")

# Call at startup
set_resource_limits()

import random
import time

def exponential_backoff(attempt, base_delay=1, max_delay=60):
    """Calculate delay with exponential backoff"""
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Add up to +/-5% random jitter to prevent thundering herd
    jitter = delay * random.uniform(-0.05, 0.05)
    return delay + jitter

# Use it
attempt = 0
while True:
    try:
        result = risky_operation()
        attempt = 0  # Reset on success
        break
    except Exception as e:
        delay = exponential_backoff(attempt)
        log(f"[RETRY] Attempt {attempt} failed, retrying in {delay:.1f}s")
        time.sleep(delay)
        attempt += 1
        if attempt >= 10:
            log("[FATAL] Too many retries, giving up")
            raise

def announce_with_degradation():
    """Announce route with fallback metrics on errors"""
    try:
        # Try preferred announcement (low MED)
        announce_route('100.10.0.0/24', med=100)
    except Exception as e:
        log(f"[ERROR] Preferred announcement failed: {e}")
        try:
            # Fall back to backup announcement (high MED)
            announce_route('100.10.0.0/24', med=200)
            log("[DEGRADED] Using backup route with higher MED")
        except Exception as e:
            log(f"[FATAL] Backup announcement also failed: {e}")
            raise

import json
import sys
import time

def log_structured(level, message, **kwargs):
    """Structured logging for easy parsing"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'message': message,
        **kwargs
    }
    # Log to stderr: stdout is reserved for commands to ExaBGP
    sys.stderr.write(json.dumps(entry) + "\n")
    sys.stderr.flush()

# Use it
log_structured('INFO', 'Service healthy', service='web', port=80)
log_structured('ERROR', 'Health check failed', service='web', error='timeout')
log_structured('WARN', 'Route flapping detected', prefix='100.10.0.0/24', count=5)

import os
import sys
import time

LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')
LEVELS = {'DEBUG': 0, 'INFO': 1, 'WARN': 2, 'ERROR': 3, 'FATAL': 4}

def log(level, message):
    """Log with level filtering"""
    if LEVELS.get(level, 1) >= LEVELS.get(LOG_LEVEL, 1):
        timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
        sys.stderr.write(f"[{timestamp}] [{level}] {message}\n")
        sys.stderr.flush()

# Use it
log('DEBUG', 'Health check starting')  # Only if LOG_LEVEL=DEBUG
log('INFO', 'Route announced')
log('ERROR', 'Health check failed')
log('FATAL', 'Unrecoverable error')

Prevent log spam:
class RateLimitedLogger:
    """Log with rate limiting to prevent spam"""
    def __init__(self, interval=60):
        self.interval = interval
        self.last_log = {}

    def log(self, key, message):
        """Log message with rate limiting per key"""
        now = time.time()
        if key not in self.last_log or now - self.last_log[key] >= self.interval:
            sys.stderr.write(f"{message}\n")
            sys.stderr.flush()
            self.last_log[key] = now
            return True
        return False

# Use it
rate_logger = RateLimitedLogger(interval=60)
while True:
    if not check_health():
        # Only log once per minute
        rate_logger.log('health_fail', '[ERROR] Health check failed')
    time.sleep(5)  # Pause between checks to avoid a busy loop

def robust_operation():
    """Proper exception handling hierarchy"""
    try:
        # Risky operation
        result = perform_operation()
    except ConnectionError as e:
        # Specific exception
        log(f"[ERROR] Connection failed: {e}")
        # Handle connection error
        return None
    except TimeoutError as e:
        # Another specific exception
        log(f"[ERROR] Operation timeout: {e}")
        # Handle timeout
        return None
    except Exception as e:
        # Catch-all for unexpected errors
        log(f"[ERROR] Unexpected error: {e}")
        # Log traceback for debugging
        import traceback
        traceback.print_exc(file=sys.stderr)
        return None
    else:
        # Success path
        log("[INFO] Operation successful")
        return result
    finally:
        # Cleanup (always runs)
        cleanup_resources()

from contextlib import contextmanager

@contextmanager
def error_handler(operation_name):
    """Context manager for consistent error handling"""
    try:
        log(f"[START] {operation_name}")
        yield
        log(f"[SUCCESS] {operation_name}")
    except Exception as e:
        log(f"[ERROR] {operation_name} failed: {e}")
        raise
    finally:
        log(f"[END] {operation_name}")

# Use it
with error_handler("health check"):
    result = check_health()
with error_handler("route announcement"):
    announce_route('100.10.0.0/24')

#!/usr/bin/env python3
"""
robust_healthcheck.py - Production-grade error handling
"""
import sys
import time
import socket
import signal
import json
from contextlib import contextmanager

# Configuration
CONFIG = {
    'service_ip': '100.10.0.100',
    'service_port': 80,
    'check_interval': 5,
    'health_threshold_up': 3,
    'health_threshold_down': 2,
    'max_consecutive_errors': 10,
}

# State
announced = False
consecutive_healthy = 0
consecutive_unhealthy = 0
consecutive_errors = 0

def log(level, message, **kwargs):
    """Structured logging"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'message': message,
        **kwargs
    }
    sys.stderr.write(json.dumps(entry) + "\n")
    sys.stderr.flush()

def signal_handler(signum, frame):
    """Handle shutdown gracefully"""
    log('INFO', 'Shutdown signal received', signal=signum)
    if announced:
        try:
            sys.stdout.write(f"withdraw route {CONFIG['service_ip']}/32\n")
            sys.stdout.flush()
            log('INFO', 'Route withdrawn on shutdown')
        except Exception as e:
            log('ERROR', 'Failed to withdraw route on shutdown', error=str(e))
    sys.exit(0)

@contextmanager
def timeout_context(seconds):
    """Timeout context manager (Unix only, uses SIGALRM)"""
    def timeout_handler(signum, frame):
        raise TimeoutError(f"Operation timeout after {seconds}s")
    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

def check_health():
    """Health check with comprehensive error handling"""
    try:
        with timeout_context(2):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(2)
            result = sock.connect_ex(('127.0.0.1', CONFIG['service_port']))
            sock.close()
            return result == 0
    except TimeoutError:
        log('WARN', 'Health check timeout')
        return False
    except socket.error as e:
        log('ERROR', 'Socket error during health check', error=str(e))
        return False
    except Exception as e:
        log('ERROR', 'Unexpected health check error', error=str(e))
        return False

def announce_route():
    """Announce route with error handling"""
    global announced
    try:
        sys.stdout.write(f"announce route {CONFIG['service_ip']}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
        log('INFO', 'Route announced', prefix=f"{CONFIG['service_ip']}/32")
    except Exception as e:
        log('ERROR', 'Failed to announce route', error=str(e))
        raise

def withdraw_route():
    """Withdraw route with error handling"""
    global announced
    try:
        sys.stdout.write(f"withdraw route {CONFIG['service_ip']}/32\n")
        sys.stdout.flush()
        announced = False
        log('INFO', 'Route withdrawn', prefix=f"{CONFIG['service_ip']}/32")
    except Exception as e:
        log('ERROR', 'Failed to withdraw route', error=str(e))
        raise

def main():
    """Main loop with error handling"""
    global consecutive_healthy, consecutive_unhealthy, consecutive_errors

    # Register signal handlers
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)

    log('INFO', 'Starting health check daemon', config=CONFIG)
    time.sleep(2)  # Wait for ExaBGP
    log('INFO', 'Ready')

    while True:
        try:
            # Health check
            healthy = check_health()
            # Reset error counter on successful check
            consecutive_errors = 0
            # Update counters
            if healthy:
                consecutive_healthy += 1
                consecutive_unhealthy = 0
            else:
                consecutive_unhealthy += 1
                consecutive_healthy = 0
            # State transitions with hysteresis
            if consecutive_healthy >= CONFIG['health_threshold_up'] and not announced:
                announce_route()
            elif consecutive_unhealthy >= CONFIG['health_threshold_down'] and announced:
                withdraw_route()
            time.sleep(CONFIG['check_interval'])
        except KeyboardInterrupt:
            log('INFO', 'Interrupted by user')
            break
        except Exception as e:
            consecutive_errors += 1
            log('ERROR', 'Main loop error',
                error=str(e),
                consecutive_errors=consecutive_errors)
            if consecutive_errors >= CONFIG['max_consecutive_errors']:
                log('FATAL', 'Too many consecutive errors, exiting')
                sys.exit(1)
            # Linear backoff on errors, capped at 60 seconds
            backoff = min(consecutive_errors * 2, 60)
            time.sleep(backoff)

if __name__ == '__main__':
    main()

| Error | Cause | Solution |
|---|---|---|
| Routes not announced | Forgot to flush STDOUT | Add sys.stdout.flush() |
| Process exits immediately | No keep-alive loop | Add while True: time.sleep(60) |
| Health check hangs | No timeout on socket | Set sock.settimeout(2) |
| Route flapping | No hysteresis | Use HealthTracker pattern |
| Crashes on bad JSON | No error handling | Wrap json.loads() in try/except |
| BGP session down | Config error | Check peer AS, router-id |
| NOTIFICATION code 3 | Invalid UPDATE | Validate commands before sending |
| Memory leak | No cleanup | Use context managers, del objects |
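The first row of the table is easy to get wrong when commands are written from many places. One way to avoid it is to route every command through a single helper that always flushes. This is a minimal sketch: `send_command` is a hypothetical name, and the optional `stream` parameter exists only to make the helper testable.

```python
import sys


def send_command(command, stream=None):
    """Write one API command followed by a newline and flush immediately.

    Returns False if the stream is closed or the reader has gone away.
    """
    stream = stream if stream is not None else sys.stdout
    try:
        stream.write(command.rstrip("\n") + "\n")
        stream.flush()
        return True
    except (BrokenPipeError, ValueError):
        # BrokenPipeError: the reading end of the pipe is gone.
        # ValueError: the stream object was already closed.
        return False
```

A call such as `send_command("withdraw route 100.10.0.0/24")` then replaces the separate write and flush calls, and the return value gives the caller a place to log a lost pipe.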
- API Overview - API architecture
- Writing API Programs - Program structure
- Production Best Practices - Production deployment
- Service High Availability - HA patterns
- Debugging - Debugging guide
Ghost written by Claude (Anthropic AI)