Ætherlay - RPC Load Balancer

A lightweight, low-latency RPC load balancer written in Go. It is designed to manage and distribute requests to multiple upstream RPC endpoints based on their health status and request counts. The load balancer supports health checks and utilizes Valkey for state management.

Features

  • Round-Robin Load Balancing: Distributes requests to available endpoints in a round-robin manner, prioritizing those with fewer requests in the last 24 hours.
  • Intelligent Retry Logic: Configurable retry attempts with priority-based endpoint selection (primary endpoints, fallbacks, and optional public-first mode).
  • Public-First Mode: Optional prioritization of public RPC endpoints to reduce costs while maintaining reliability.
  • Flexible Timeout Control: Separate timeouts for overall requests and individual retry attempts.
  • Rate Limit Recovery: Detects when a provider is rate-limiting an endpoint and recovers with per-endpoint exponential backoff, avoiding extra pressure on a provider that is already throttling you.
  • Health Checks: Regularly checks the health of upstream endpoints and updates their status in Valkey.
  • Health Status Caching: Local per-pod caching with configurable TTL reduces Valkey load and race condition windows.
  • Graceful Shutdown: Proper cleanup of goroutines during pod termination with configurable timeout.
  • Standalone Health Checker: Optional standalone health checker service for efficient multi-pod deployments.
  • Static Configuration: Loads RPC endpoint configurations from a static JSON file.
  • Multi-threaded: Capable of handling multiple requests concurrently.
  • Kubernetes Ready: Designed to run in a Kubernetes environment with Horizontal Pod Autoscaling (HPA) enabled.
  • WebSocket Support: Full WebSocket proxy support for real-time applications.

Setup Instructions

Option 1: Integrated Health Checks

  1. Clone the Repository:

    git clone https://github.com/project-aethermesh/aetherlay
    cd aetherlay
    
  2. Install Dependencies:

    go mod tidy
    
  3. Configure Endpoints: Rename the configs/endpoints-example.json file to configs/endpoints.json and edit it to add all the RPC endpoints you want to load balance with this tool.

  4. Set up your .env file: Copy the .env.example file to .env and modify it as required:

    cp .env.example .env

    Edit the .env file to add your API keys and configuration. To run a single service that handles both health checks and load balancing, make sure to set STANDALONE_HEALTH_CHECKS=false.

  5. Run the Application:

    make run-lb
    

Option 2: Standalone Health Checker

  1. Clone the Repository:

    git clone https://github.com/project-aethermesh/aetherlay
    cd aetherlay
    
  2. Install Dependencies:

    go mod tidy
    
  3. Configure Endpoints: Rename the configs/endpoints-example.json file to configs/endpoints.json and edit it to add all the RPC endpoints you want to load balance with this tool.

  4. Set up your .env file: Copy the .env.example file to .env and modify it as required:

    cp .env.example .env

    Edit the .env file to add your API keys and configuration.

  5. Build and run both services in the background:

    make run

Option 3: Deploy to Kubernetes

Basic YAML files are provided for deploying to Kubernetes. It's recommended to review them and update them as required. Once that's done, simply run:

make k8s-deploy

Usage

The load balancer will listen for incoming requests on predefined endpoints that match the configured chains (e.g., /mainnet, /base, /optimism). It will proxy these requests to the available upstream endpoints based on their health status and request counts.

Available Endpoints

  • GET /health - Health check endpoint for the load balancer
  • GET /{chain} - WebSocket upgrade requests for a specific chain
  • POST /{chain} - HTTP RPC requests for a specific chain

Query Parameters

  • ?archive=true - Request archive node endpoints only
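
For illustration, the following minimal Go client sends a JSON-RPC request through the load balancer. It assumes Ætherlay is running locally on the default SERVER_PORT (8080) and that a mainnet chain is defined in your configs/endpoints.json:

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // eth_blockNumber request proxied through the load balancer.
    payload := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)

    // Append "?archive=true" to route the request to archive node endpoints only.
    resp, err := http.Post("http://localhost:8080/mainnet", "application/json", bytes.NewReader(payload))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // e.g. {"jsonrpc":"2.0","id":1,"result":"0x..."}
}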

Retry and Timeout Behavior

The load balancer implements intelligent retry logic with configurable timeouts:

How Retries Work

  1. Priority-based selection: Endpoint selection follows these priorities:
    • Normal mode: primary → fallback → public
    • Public-first mode (PUBLIC_FIRST=true): public → primary → fallback
  2. Configurable attempts: Retries up to PROXY_MAX_RETRIES times.
  3. Public endpoint limiting: When PUBLIC_FIRST=true, attempts to reach public endpoints are limited to the value of PUBLIC_FIRST_ATTEMPTS, after which the proxy tries using a primary or fallback endpoint.
  4. Endpoint rotation: Removes failed endpoints from the retry pool to avoid repeated failures.
  5. Dual timeout control: There are 2 settings that control how long requests take:
    • Total request timeout (PROXY_TIMEOUT): Maximum time for the entire request (this is what the end user "sees").
    • Per-try timeout (PROXY_TIMEOUT_PER_TRY): Maximum time per individual request sent from the proxy to each endpoint.
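
As an illustration of how the two timeouts interact, here is a rough sketch of such a retry loop. This is not the actual implementation: the function name and the try callback are made up, and proxyTimeout, perTryTimeout, and maxRetries stand in for PROXY_TIMEOUT, PROXY_TIMEOUT_PER_TRY, and PROXY_MAX_RETRIES:

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// proxyWithRetries is an illustrative sketch, not the actual implementation.
// The try callback stands in for the upstream call and is expected to fully
// consume the response before returning.
func proxyWithRetries(ctx context.Context, endpoints []string,
    try func(ctx context.Context, endpoint string) error,
    proxyTimeout, perTryTimeout time.Duration, maxRetries int) error {

    // Total budget for the whole request — this is what the end user "sees".
    ctx, cancel := context.WithTimeout(ctx, proxyTimeout)
    defer cancel()

    var lastErr error
    for attempt := 0; attempt < maxRetries && len(endpoints) > 0; attempt++ {
        // Each individual attempt gets at most perTryTimeout.
        tryCtx, tryCancel := context.WithTimeout(ctx, perTryTimeout)
        err := try(tryCtx, endpoints[0])
        tryCancel()

        if err == nil {
            return nil
        }
        lastErr = err
        endpoints = endpoints[1:] // drop the failed endpoint from the retry pool

        if ctx.Err() != nil {
            break // total budget exhausted
        }
    }
    return lastErr
}

func main() {
    // Toy usage where every endpoint fails, so all retries are exhausted.
    try := func(ctx context.Context, endpoint string) error {
        return errors.New("upstream error from " + endpoint)
    }
    err := proxyWithRetries(context.Background(),
        []string{"primary-1", "fallback-1", "public-1"},
        try, 15*time.Second, 5*time.Second, 3)
    fmt.Println(err)
}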

Benefits

  • Fast failover: Won't wait the whole PROXY_TIMEOUT on a single sluggish endpoint.
  • Improved responsiveness: Each endpoint gets, at most, PROXY_TIMEOUT_PER_TRY seconds to respond.
  • More success opportunities: You can set PROXY_MAX_RETRIES higher than PROXY_TIMEOUT / PROXY_TIMEOUT_PER_TRY, since failures often happen well before PROXY_TIMEOUT_PER_TRY is reached.

Configuration

Command Line Flags

| Flag | Default | Description |
| --- | --- | --- |
| --config-file | configs/endpoints.json | Path to endpoints configuration file |
| --cors-headers | Accept, Authorization, Content-Type, Origin, X-Requested-With | Allowed headers for CORS requests |
| --cors-methods | GET, POST, OPTIONS | Allowed HTTP methods for CORS requests |
| --cors-origin | * | Allowed origin for CORS requests |
| --endpoint-failure-threshold | 2 | Number of consecutive failures before marking endpoint unhealthy |
| --endpoint-success-threshold | 2 | Number of consecutive successes before marking endpoint healthy |
| --ephemeral-checks-healthy-threshold | 3 | Number of consecutive successful responses required to consider endpoint healthy again |
| --ephemeral-checks-interval | 30 | Interval in seconds for ephemeral health checks |
| --health-cache-ttl | 10 | Health status cache TTL in seconds |
| --health-check-concurrency | 20 | Maximum number of concurrent health checks during startup |
| --health-check-interval | 30 | Health check interval in seconds |
| --health-check-sync-status | true | Consider the sync status of the endpoints when deciding whether an endpoint is healthy or not. When enabled, endpoints that are syncing are considered to be unhealthy. |
| --health-checker-grace-period | 60 | Grace period in seconds for health checker downtime after initial check passes. During this period, the load balancer will remain ready even if the health checker is temporarily unavailable. |
| --health-checker-server-port | 8080 | Health checker HTTP server port |
| --health-checker-service-url | http://aetherlay-hc:8080 | Health checker service URL for readiness checks (used when --standalone-health-checks is enabled) |
| --log-level | info | Set the log level. Valid options are: debug, info, warn, error, fatal, panic |
| --metrics-enabled | true | Whether to enable Prometheus metrics |
| --metrics-port | 9090 | Port for the Prometheus metrics server |
| --proxy-retries | 3 | Maximum number of retries for proxy requests |
| --proxy-timeout | 15 | Total timeout for proxy requests in seconds |
| --proxy-timeout-per-try | 5 | Timeout per individual retry attempt in seconds |
| --public-first | false | Prioritize public endpoints over primary endpoints |
| --public-first-attempts | 2 | Number of attempts to make at public endpoints before trying primary/fallback |
| --server-port | 8080 | Port to use for the load balancer / proxy |
| --standalone-health-checks | true | Enable standalone health checks |
| --valkey-host | localhost | Valkey server hostname |
| --valkey-pass | - | Valkey server password |
| --valkey-port | 6379 | Valkey server port |
| --valkey-skip-tls-check | false | Whether to skip TLS certificate validation when connecting to Valkey |
| --valkey-use-tls | false | Whether to use TLS for connecting to Valkey |

Note: Command-line flags take precedence over environment variables if both are set.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| ALCHEMY_API_KEY | - | Example API key for Alchemy RPC endpoints. Only needed for the example config. The name must match the variable referenced in your configs/endpoints.json, if you need any. |
| INFURA_API_KEY | - | Example API key for Infura RPC endpoints. Only needed for the example config. The name must match the variable referenced in your configs/endpoints.json, if you need any. |
| CONFIG_FILE | configs/endpoints.json | Path to the endpoints configuration file |
| CORS_HEADERS | Accept, Authorization, Content-Type, Origin, X-Requested-With | Allowed headers for CORS requests |
| CORS_METHODS | GET, POST, OPTIONS | Allowed HTTP methods for CORS requests |
| CORS_ORIGIN | * | Allowed origin for CORS requests |
| ENDPOINT_FAILURE_THRESHOLD | 2 | Number of consecutive failures before marking endpoint unhealthy |
| ENDPOINT_SUCCESS_THRESHOLD | 2 | Number of consecutive successes before marking endpoint healthy |
| EPHEMERAL_CHECKS_HEALTHY_THRESHOLD | 3 | Number of consecutive successful responses from the endpoint required to consider it as being healthy again |
| EPHEMERAL_CHECKS_INTERVAL | 30 | Interval in seconds for ephemeral health checks |
| HEALTH_CACHE_TTL | 10 | Health status cache TTL in seconds |
| HEALTH_CHECK_CONCURRENCY | 20 | Maximum number of concurrent health checks during startup |
| HEALTH_CHECK_INTERVAL | 30 | Health check interval in seconds |
| HEALTH_CHECK_SYNC_STATUS | true | Consider the sync status of the endpoints when deciding whether an endpoint is healthy or not. When enabled, endpoints that are syncing are considered to be unhealthy. |
| HEALTH_CHECKER_GRACE_PERIOD | 60 | Grace period in seconds for health checker downtime after initial check passes. During this period, the load balancer will remain ready even if the health checker is temporarily unavailable. |
| HEALTH_CHECKER_SERVER_PORT | 8080 | Health checker HTTP server port |
| HEALTH_CHECKER_SERVICE_URL | http://aetherlay-hc:8080 | Health checker service URL for readiness checks (used when STANDALONE_HEALTH_CHECKS is enabled) |
| LOG_LEVEL | info | Set the log level |
| METRICS_ENABLED | true | Whether to enable Prometheus metrics |
| METRICS_PORT | 9090 | Port for the Prometheus metrics server |
| PROXY_MAX_RETRIES | 3 | Maximum number of retries for proxy requests |
| PROXY_TIMEOUT | 15 | Total timeout for proxy requests in seconds |
| PROXY_TIMEOUT_PER_TRY | 5 | Timeout per individual retry attempt in seconds |
| PUBLIC_FIRST | false | Prioritize public endpoints over primary and fallback endpoints |
| PUBLIC_FIRST_ATTEMPTS | 2 | Number of attempts to make at public endpoints before trying with a primary/fallback |
| SERVER_PORT | 8080 | Port to use for the load balancer / proxy |
| STANDALONE_HEALTH_CHECKS | true | Enable/disable the standalone mode of the health checker |
| VALKEY_HOST | localhost | Valkey server hostname |
| VALKEY_PASS | - | Valkey server password |
| VALKEY_PORT | 6379 | Valkey server port |
| VALKEY_SKIP_TLS_CHECK | false | Whether to skip TLS certificate validation when connecting to Valkey |
| VALKEY_USE_TLS | false | Whether to use TLS for connecting to Valkey |

Health Check Configuration

The service checks the health of an endpoint by sending these requests to it:

  • eth_blockNumber - Checks for successful response and that the block is not 0.
  • eth_syncing (unless you disable it by setting HEALTH_CHECK_SYNC_STATUS=false) - Checks for successful response and that the node is not syncing (i.e., it has already fully synced, so you get the latest data from it).

Integrated Health Checks

When STANDALONE_HEALTH_CHECKS=false, the load balancer will run integrated health checks using the HEALTH_CHECK_INTERVAL setting.

You can also disable periodic health checks by setting HEALTH_CHECK_INTERVAL to 0. This might affect the performance of the proxy, but it prevents the service from spending your RPC credits on constantly running health checks. In this case, health checks only run in an ephemeral fashion. For example:

  1. A user sends a request.
  2. The LB tries to proxy that request to RPC endpoint "A" but fails.
  3. Three things happen at the same time:
    • The RPC endpoint "A" is marked as unhealthy.
    • The LB tries to proxy that request to another RPC endpoint.
    • An ephemeral health checker starts running to monitor RPC endpoint "A" at the interval specified by EPHEMERAL_CHECKS_INTERVAL (default: 30s).
  4. As soon as RPC endpoint "A" is healthy again, the ephemeral health checker is stopped.

Ephemeral Health Checks

  • Trigger: Only when a request to an endpoint fails and health checks are otherwise disabled. The server marks the endpoint as unhealthy for the specific protocol (HTTP or WS) that failed.
  • Interval: Controlled by the EPHEMERAL_CHECKS_INTERVAL environment variable (in seconds).
  • Behavior: The health checker service observes the unhealthy status and starts ephemeral checks for the affected protocol. The system will monitor the failed endpoint at the specified interval and automatically start routing traffic to it as soon as it becomes healthy again.

Per-Protocol Unhealthy Marking

  • When a request to an endpoint fails (HTTP or WebSocket), the server marks that endpoint as unhealthy for the specific protocol that failed (e.g., HealthyHTTP = false or HealthyWS = false in Valkey).
  • The health checker service detects this change and starts ephemeral health checks for that protocol only.
  • Once the endpoint passes the configured number of consecutive health checks, it is marked healthy again and ephemeral checks stop.

Standalone Health Checker (Recommended)

For production deployments with multiple load balancer pods, use the standalone health checker:

  • Single Health Checker Instance: Prevents duplicate health checks
  • Multiple Load Balancer Pods: Scale independently without health check overhead
  • Resource Efficiency: Reduces RPC endpoint usage
  • Better Separation of Concerns: Health monitoring isolated from request handling

Public-First Mode

Ætherlay supports a "public-first" mode that prioritizes public RPC endpoints over primary and fallback endpoints to help reduce costs while maintaining reliability.

How Public-First Mode Works

  1. Enable public-first: Set PUBLIC_FIRST=true (or use the --public-first CLI flag)
  2. Configure attempts: Set PUBLIC_FIRST_ATTEMPTS to control how many public endpoints to try (default: 2)
  3. Endpoint hierarchy:
    • When enabled: public → primary → fallback
    • When disabled: primary → fallback → public

Configuration Example

In your endpoints.json, mark endpoints with "role": "public":

{
  "mainnet": {
    "publicnode-1": {
      "provider": "publicnode",
      "role": "public",
      "type": "archive",
      "http_url": "https://ethereum-rpc.publicnode.com",
      "ws_url": "wss://ethereum-rpc.publicnode.com"
    },
    "alchemy-1": {
      "provider": "alchemy",
      "role": "primary",
      "type": "archive",
      "http_url": "https://eth-mainnet.g.alchemy.com/v2/${ALCHEMY_API_KEY}"
    }
  }
}

Reliability Features

Ætherlay includes several advanced features to ensure high availability and prevent race conditions in multi-pod deployments.

Health Status Caching

Local per-pod caching of endpoint health status reduces Valkey load and minimizes race condition windows:

  • Configurable TTL: Cache entries expire after HEALTH_CACHE_TTL seconds (default: 10).
  • Automatic invalidation: Cache entries are invalidated when health status changes.
  • Reduced latency: Cache reads are <0.1ms vs 1-5ms for Valkey reads.
  • Fallback behavior: Falls back to Valkey on cache miss.
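
A minimal sketch of such a read-through cache is shown below. The types and the readFromValkey callback are illustrative assumptions, not the project's actual code:

package health

import (
    "sync"
    "time"
)

// Illustrative read-through cache for endpoint health status.
type cachedHealth struct {
    healthy   bool
    expiresAt time.Time
}

type healthCache struct {
    mu      sync.Mutex
    ttl     time.Duration // corresponds to HEALTH_CACHE_TTL
    entries map[string]cachedHealth
}

func newHealthCache(ttl time.Duration) *healthCache {
    return &healthCache{ttl: ttl, entries: make(map[string]cachedHealth)}
}

// isHealthy returns the cached status while it is fresh and falls back to Valkey on a miss.
func (c *healthCache) isHealthy(endpoint string, readFromValkey func(string) bool) bool {
    c.mu.Lock()
    defer c.mu.Unlock()

    if e, ok := c.entries[endpoint]; ok && time.Now().Before(e.expiresAt) {
        return e.healthy // fast local read instead of a round trip to Valkey
    }

    // Cache miss or expired entry: read from Valkey and refresh the local copy.
    healthy := readFromValkey(endpoint)
    c.entries[endpoint] = cachedHealth{healthy: healthy, expiresAt: time.Now().Add(c.ttl)}
    return healthy
}

// invalidate drops the local entry when a health status change is observed.
func (c *healthCache) invalidate(endpoint string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.entries, endpoint)
}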

Endpoint Health Debouncing

Prevents health status "flapping" by requiring multiple consecutive observations before changing state:

  • Failure threshold (ENDPOINT_FAILURE_THRESHOLD): Number of consecutive failures before marking endpoint unhealthy (default: 2).
  • Success threshold (ENDPOINT_SUCCESS_THRESHOLD): Number of consecutive successes before marking endpoint healthy (default: 2).
  • Per-protocol tracking: HTTP and WebSocket health tracked independently.
  • Reset on opposite event: Success resets failure counter, failure resets success counter.

Example with default thresholds (2):

Request 1: Success
Request 2: Failure (1/2 failures) -> endpoint stays healthy
Request 3: Failure (2/2 failures) -> endpoint is marked unhealthy
Request 4: Success (1/2 successes) -> endpoint stays unhealthy
Request 5: Success (2/2 successes) -> endpoint is marked healthy

Graceful Shutdown

Ensures proper cleanup during pod termination (e.g., Kubernetes rolling updates, Karpenter node replacements):

  • Context cancellation: All goroutines receive shutdown signal.
  • Coordinated cleanup: Rate limit scheduler shuts down before HTTP server.
  • Configurable timeout: 10-second default wait for goroutines to complete.
  • No orphaned goroutines: WaitGroup tracking ensures all background tasks finish.
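
The sketch below shows the general shape of this shutdown pattern in Go. It is illustrative only; the actual service wires this up internally with its own components:

package main

import (
    "context"
    "log"
    "net/http"
    "os/signal"
    "sync"
    "syscall"
    "time"
)

func main() {
    // ctx is cancelled when the pod receives SIGTERM/SIGINT (e.g., during a rolling update).
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    var wg sync.WaitGroup

    // A background worker (think: rate limit scheduler) that exits on context cancellation.
    wg.Add(1)
    go func() {
        defer wg.Done()
        <-ctx.Done()
        log.Println("background worker stopped")
    }()

    srv := &http.Server{Addr: ":8080"}
    go srv.ListenAndServe()

    <-ctx.Done() // wait for the termination signal

    // Give in-flight requests and goroutines up to 10 seconds to finish.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    srv.Shutdown(shutdownCtx)
    wg.Wait() // WaitGroup tracking: no orphaned goroutines
}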

Startup Synchronization

Prevents routing traffic to pods before initial health checks complete:

  • Blocks traffic: HTTP server doesn't accept requests until ready.
  • Kubernetes integration: /ready endpoint returns 503 until health check completes.
  • Initial check: All endpoints checked with configurable concurrency limit (default: 20 concurrent checks).
  • Fast startup: A typical initial check completes in 2-5 seconds, though this will vary with the number of endpoints you configure.
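
The initial check can be pictured as a bounded-concurrency fan-out, roughly as in the sketch below. The initialHealthCheck function and the checkEndpoint callback are hypothetical names used for illustration:

package main

import (
    "fmt"
    "sync"
)

// initialHealthCheck runs the startup check for all endpoints with at most
// `concurrency` checks in flight (maps to HEALTH_CHECK_CONCURRENCY).
func initialHealthCheck(endpoints []string, concurrency int, checkEndpoint func(string)) {
    sem := make(chan struct{}, concurrency) // buffered channel used as a semaphore
    var wg sync.WaitGroup

    for _, ep := range endpoints {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot (blocks once the limit is reached)
        go func(ep string) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            checkEndpoint(ep)
        }(ep)
    }

    wg.Wait() // only after this returns does the /ready endpoint stop returning 503
}

func main() {
    endpoints := []string{"alchemy-1", "publicnode-1", "infura-1"}
    initialHealthCheck(endpoints, 2, func(ep string) {
        fmt.Println("checked", ep) // stand-in for the real HTTP/WS health probes
    })
}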

Rate Limit Recovery

Ætherlay includes intelligent rate limit detection and recovery mechanisms to handle upstream provider rate limits gracefully. This system automatically detects when endpoints are rate-limited and implements recovery strategies to restore service.

How Rate Limit Recovery Works

  1. Detection: When a request returns a rate limit error (HTTP 429), the endpoint is automatically marked as rate-limited.
  2. Retries with backoff: The system only retries the endpoint after waiting for a configurable backoff period. The wait grows on each attempt according to a user-defined backoff multiplier.
  3. Automatic recovery: The endpoint is reintroduced into the load balancing pool after a configurable number of consecutive successful requests.
  4. Per-endpoint configuration: Each endpoint can have its own rate limit recovery strategy tailored to the provider's limits, or you can simply rely on the system's carefully chosen defaults.

Configuration Parameters

Rate limit recovery is configured per endpoint in your endpoints.json file:

{
  "mainnet": {
    "provider-1": {
      "provider": "example",
      "role": "primary",
      "type": "archive",
      "http_url": "https://api.example.com",
      "rate_limit_recovery": {
        "backoff_multiplier": 2.0,
        "initial_backoff": 300,
        "max_backoff": 3600,
        "max_retries": 10,
        "required_successes": 3,
        "reset_after": 86400
      }
    }
  }
}

Parameters:

  • backoff_multiplier (float): Exponential multiplier for backoff time (e.g., 2.0 doubles the wait time each attempt).
  • initial_backoff (int): Initial backoff time in seconds before the first recovery attempt.
  • max_backoff (int): Maximum backoff time in seconds (limits exponential growth).
  • max_retries (int): Maximum number of recovery attempts before giving up until reset_after.
  • required_successes (int): Number of consecutive successes needed to mark the endpoint as healthy.
  • reset_after (int): Time in seconds after which to reset the backoff state and start fresh.
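
As a concrete example, with the configuration shown above (initial_backoff of 300, backoff_multiplier of 2.0, max_backoff of 3600), the wait before successive recovery attempts grows as 300s, 600s, 1200s, 2400s, and is then capped at 3600s. A tiny sketch of that calculation:

package main

import "fmt"

func main() {
    // Backoff schedule for the example configuration above:
    // initial_backoff=300, backoff_multiplier=2.0, max_backoff=3600, max_retries=10.
    initialBackoff, multiplier, maxBackoff := 300.0, 2.0, 3600.0
    maxRetries := 10

    backoff := initialBackoff
    for attempt := 1; attempt <= maxRetries; attempt++ {
        fmt.Printf("recovery attempt %2d: wait %.0fs\n", attempt, backoff)
        backoff *= multiplier
        if backoff > maxBackoff {
            backoff = maxBackoff // exponential growth is capped at max_backoff
        }
    }
    // Prints 300, 600, 1200, 2400, then 3600 for every remaining attempt.
}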

Recovery Strategy Examples

Conservative:

"rate_limit_recovery": {
  "backoff_multiplier": 4.0,
  "initial_backoff": 300,
  "max_backoff": 3600,
  "max_retries": 5,
  "required_successes": 3,
  "reset_after": 86400
}

Aggressive:

"rate_limit_recovery": {
  "backoff_multiplier": 1.5,
  "initial_backoff": 60,
  "max_retries": 20,
  "max_backoff": 600,
  "required_successes": 1,
  "reset_after": 86400
}

Prometheus Metrics

This service uses a pull-based model for metrics collection, which is standard for Prometheus. This means the application exposes a /metrics endpoint, and a separate Prometheus server is responsible for periodically "scraping" (or pulling) data from it.

How It Works

  1. Exporter: The application acts as a Prometheus exporter. When enabled, it starts a dedicated server that holds all metric values (counters, gauges, and histograms) in memory.
  2. /metrics Endpoint: This server exposes the metrics in a text-based format at http://localhost:9090/metrics (or the port specified by METRICS_PORT).
  3. Prometheus Server: A separate Prometheus server must be configured to scrape this endpoint at regular intervals. The Prometheus server is responsible for all storage, querying, and alerting, ensuring that the application itself remains lightweight and stateless.

Because the metrics are stored in memory, they will be lost on every application restart. Persistence is the responsibility of the Prometheus server.
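
Conceptually, the exporter side is equivalent to serving the standard Prometheus client handler, roughly as in the sketch below. The metric shown is illustrative and not one of the project's actual metrics:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative counter held in memory; its value is lost whenever the process restarts.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
    Name: "example_requests_total",
    Help: "Total number of proxied requests (illustrative metric).",
})

func main() {
    requestsTotal.Inc()

    // Expose all registered metrics in the Prometheus text format at /metrics.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":9090", nil) // matches the METRICS_PORT default
}

A separate Prometheus server then scrapes http://<host>:9090/metrics on its own schedule and handles storage, querying, and alerting.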

Disabling Metrics

Metrics are enabled by default. If you don't want them, use the --metrics-enabled=false flag or set the METRICS_ENABLED environment variable to false.

Port Configuration

  • The metrics server runs on the port defined by METRICS_PORT (default: 9090).
  • Important: When running multiple services from this repository on the same machine (e.g., the load balancer and the standalone health checker), you must assign them different metrics ports to avoid conflicts. For example, you could run the health checker with --metrics-port=9090 and the load balancer with --metrics-port=9091.

Grafana Dashboard

An example Grafana dashboard is provided with the code to help you monitor your Ætherlay deployment. The dashboard includes comprehensive monitoring for:

  • Service Health Overview: Real-time health status of remote blockchain endpoints.
  • Load Balancer Performance: Request rates, error rates, response times, and in-flight requests.
  • System Resources: Memory usage, CPU usage, goroutines, and garbage collection metrics.
  • Network & Infrastructure: Network I/O, file descriptor usage, and Prometheus scrape rates.

Using the Dashboard

  1. Replace variables: Replace ${DATA_SOURCE} with the name of your Prometheus data source and ${NAMESPACE} with the Kubernetes namespace where you deployed the app.
  2. Import the Dashboard: Import the aetherlay-dashboard.json file into your Grafana instance.

The dashboard is designed to work with the standard Prometheus metrics exposed by both the load balancer and health checker services. It automatically detects pods by matching with .*hc.* for the health checker and .*lb.* for the load balancer, so you'll need to update that if you use different names for your pods.

Architecture Options

Option 1: Integrated Health Checks

[Diagram: Ætherlay architecture with integrated health checks]

Option 2: Standalone Health Checker (Recommended)

[Diagram: Ætherlay architecture with a standalone health checker]

Contributing

Contributions are welcome! Please submit a pull request or open an issue for any enhancements or bug fixes.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

You may use, modify, and distribute this software under the terms of the AGPL-3.0. See the LICENSE file for details.

TL;DR: The AGPL-3.0 ensures that all changes and derivative works must also be licensed under AGPL-3.0, and that attribution is preserved. If you run a modified version as a network service, you must make the source code available to users. This code is provided as-is, without warranties.
