A lightweight, low-latency RPC load balancer written in Go. It is designed to manage and distribute requests to multiple upstream RPC endpoints based on their health status and request counts. The load balancer supports health checks and utilizes Valkey for state management.
- Round-Robin Load Balancing: Distributes requests to available endpoints in a round-robin manner, prioritizing those with fewer requests in the last 24 hours.
- Intelligent Retry Logic: Configurable retry attempts with priority-based endpoint selection (primary endpoints, fallbacks, and optional public-first mode).
- Public-First Mode: Optional prioritization of public RPC endpoints to reduce costs while maintaining reliability.
- Flexible Timeout Control: Separate timeouts for overall requests and individual retry attempts.
- Rate Limit Recovery: Safe rate limit detection and recovery with exponential backoff strategies per endpoint, to avoid making things worse when a provider is rate-limiting you.
- Health Checks: Regularly checks the health of upstream endpoints and updates their status in Valkey.
- Health Status Caching: Local per-pod caching with configurable TTL reduces Valkey load and race condition windows.
- Graceful Shutdown: Proper cleanup of goroutines during pod termination with configurable timeout.
- Standalone Health Checker: Optional standalone health checker service for efficient multi-pod deployments.
- Static Configuration: Loads RPC endpoint configurations from a static JSON file.
- Multi-threaded: Capable of handling multiple requests concurrently.
- Kubernetes Ready: Designed to run in a Kubernetes environment with Horizontal Pod Autoscaling (HPA) enabled.
- WebSocket Support: Full WebSocket proxy support for real-time applications.
To run a single service that combines the load balancer and the health checker:

- Clone the Repository:

  ```bash
  git clone https://github.com/project-aethermesh/aetherlay
  cd aetherlay
  ```

- Install Dependencies:

  ```bash
  go mod tidy
  ```

- Configure Endpoints: Rename the `configs/endpoints-example.json` file to `configs/endpoints.json` and modify it as required in order to add all the RPC endpoints you want to load balance with this tool.

- Set up your .env file: Copy the `.env.example` file to `.env` and modify it as required:

  ```bash
  cp .env.example .env
  ```

  Edit the `.env` file to add your API keys and configuration. For running a single service with both the health check and load balancer, make sure to set `STANDALONE_HEALTH_CHECKS=false` (a minimal example is sketched after this list).

- Run the Application:

  ```bash
  make run-lb
  ```
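For the `.env` step above, a minimal sketch for single-service mode might look like the following. The values are placeholders, and the API key variables are only the ones referenced by the example config — set whichever variables your own `configs/endpoints.json` actually uses:

```bash
# Minimal .env for running the load balancer with integrated health checks
ALCHEMY_API_KEY=replace-with-your-alchemy-key   # only if endpoints.json references it
INFURA_API_KEY=replace-with-your-infura-key     # only if endpoints.json references it
STANDALONE_HEALTH_CHECKS=false                  # run health checks inside the load balancer
VALKEY_HOST=localhost
VALKEY_PORT=6379
```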
To run the load balancer and the standalone health checker as two separate services:

- Clone the Repository:

  ```bash
  git clone https://github.com/project-aethermesh/aetherlay
  cd aetherlay
  ```

- Install Dependencies:

  ```bash
  go mod tidy
  ```

- Configure Endpoints: Rename the `configs/endpoints-example.json` file to `configs/endpoints.json` and modify it as required in order to add all the RPC endpoints you want to load balance with this tool.

- Set up your .env file: Copy the `.env.example` file to `.env` and modify it as required:

  ```bash
  cp .env.example .env
  ```

  Edit the `.env` file to add your API keys and configuration.

- Build and run both services in the background:

  ```bash
  make run
  ```
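Once the services are up, a quick way to confirm the load balancer is responding (assuming the default `SERVER_PORT` of 8080):

```bash
curl -i http://localhost:8080/health
```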
Basic YAML files are provided for deploying to Kubernetes. It's recommended to check them out and update them as required. After that's done, simply run:

```bash
make k8s-deploy
```

The load balancer will listen for incoming requests on predefined endpoints that match the configured chains (e.g., `/mainnet`, `/base`, `/optimism`). It will proxy these requests to the available upstream endpoints based on their health status and request counts.
- `GET /health` - Health check endpoint for the load balancer
- `GET /{chain}` - WebSocket upgrade requests for a specific chain
- `POST /{chain}` - HTTP RPC requests for a specific chain
- `?archive=true` - Request archive node endpoints only
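As an illustration of the routes above, here is how a client might call a chain pool with `curl`. The `/mainnet` path assumes a `mainnet` chain is defined in your `endpoints.json`, and the default port of 8080 is assumed:

```bash
# Standard JSON-RPC request to the mainnet pool
curl -X POST http://localhost:8080/mainnet \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

# Same request, restricted to archive node endpoints only
curl -X POST "http://localhost:8080/mainnet?archive=true" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
```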
The load balancer implements intelligent retry logic with configurable timeouts:
- Priority-based selection: Endpoint selection follows these priorities:
  - Normal mode: primary → fallback → public
  - Public-first mode (`PUBLIC_FIRST=true`): public → primary → fallback
- Configurable attempts: Retries up to `PROXY_MAX_RETRIES` times.
- Public endpoint limiting: When `PUBLIC_FIRST=true`, attempts to reach public endpoints are limited to the value of `PUBLIC_FIRST_ATTEMPTS`, after which the proxy tries using a primary or fallback endpoint.
- Endpoint rotation: Removes failed endpoints from the retry pool to avoid repeated failures.
- Dual timeout control: There are 2 settings that control how long requests take:
  - Total request timeout (`PROXY_TIMEOUT`): Maximum time for the entire request (this is what the end user "sees").
  - Per-try timeout (`PROXY_TIMEOUT_PER_TRY`): Maximum time per individual request sent from the proxy to each endpoint.
- Fast failover: Won't wait the whole `PROXY_TIMEOUT` on a single sluggish endpoint.
- Improved responsiveness: Each endpoint gets, at most, `PROXY_TIMEOUT_PER_TRY` seconds to respond.
- More success opportunities: This allows you to use a `PROXY_MAX_RETRIES` that's greater than `PROXY_TIMEOUT` / `PROXY_TIMEOUT_PER_TRY`, since failures can happen way before `PROXY_TIMEOUT_PER_TRY` is reached.
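An illustrative combination of these settings (the values are examples, not recommendations):

```bash
# Total budget of 15s per client request, at most 5s per upstream attempt.
# A single slow endpoint can only consume 5s of the budget, while fast
# failures (e.g., connection refused) leave room for all 6 retries.
PROXY_TIMEOUT=15
PROXY_TIMEOUT_PER_TRY=5
PROXY_MAX_RETRIES=6
```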
| Flag | Default | Description |
|---|---|---|
| `--config-file` | `configs/endpoints.json` | Path to endpoints configuration file |
| `--cors-headers` | `Accept, Authorization, Content-Type, Origin, X-Requested-With` | Allowed headers for CORS requests |
| `--cors-methods` | `GET, POST, OPTIONS` | Allowed HTTP methods for CORS requests |
| `--cors-origin` | `*` | Allowed origin for CORS requests |
| `--endpoint-failure-threshold` | `2` | Number of consecutive failures before marking endpoint unhealthy |
| `--endpoint-success-threshold` | `2` | Number of consecutive successes before marking endpoint healthy |
| `--ephemeral-checks-healthy-threshold` | `3` | Number of consecutive successful responses required to consider endpoint healthy again |
| `--ephemeral-checks-interval` | `30` | Interval in seconds for ephemeral health checks |
| `--health-cache-ttl` | `10` | Health status cache TTL in seconds |
| `--health-check-concurrency` | `20` | Maximum number of concurrent health checks during startup |
| `--health-check-interval` | `30` | Health check interval in seconds |
| `--health-check-sync-status` | `true` | Consider the sync status of the endpoints when deciding whether an endpoint is healthy or not. When enabled, endpoints that are syncing are considered to be unhealthy. |
| `--health-checker-grace-period` | `60` | Grace period in seconds for health checker downtime after initial check passes. During this period, the load balancer will remain ready even if the health checker is temporarily unavailable. |
| `--health-checker-server-port` | `8080` | Health checker HTTP server port |
| `--health-checker-service-url` | `http://aetherlay-hc:8080` | Health checker service URL for readiness checks (used when `--standalone-health-checks` is enabled) |
| `--log-level` | `info` | Set the log level. Valid options are: `debug`, `info`, `warn`, `error`, `fatal`, `panic` |
| `--metrics-enabled` | `true` | Whether to enable Prometheus metrics |
| `--metrics-port` | `9090` | Port for the Prometheus metrics server |
| `--proxy-retries` | `3` | Maximum number of retries for proxy requests |
| `--proxy-timeout` | `15` | Total timeout for proxy requests in seconds |
| `--proxy-timeout-per-try` | `5` | Timeout per individual retry attempt in seconds |
| `--public-first` | `false` | Prioritize public endpoints over primary endpoints |
| `--public-first-attempts` | `2` | Number of attempts to make at public endpoints before trying primary/fallback |
| `--server-port` | `8080` | Port to use for the load balancer / proxy |
| `--standalone-health-checks` | `true` | Enable standalone health checks |
| `--valkey-host` | `localhost` | Valkey server hostname |
| `--valkey-pass` | - | Valkey server password |
| `--valkey-port` | `6379` | Valkey server port |
| `--valkey-skip-tls-check` | `false` | Whether to skip TLS certificate validation when connecting to Valkey |
| `--valkey-use-tls` | `false` | Whether to use TLS for connecting to Valkey |
Note: Command-line flags take precedence over environment variables if both are set.
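For example, if both are set, the flag wins. The binary path below is hypothetical — adjust it to however you build or run the service:

```bash
# Hypothetical invocation: --log-level overrides the LOG_LEVEL env var, so logging runs at debug.
LOG_LEVEL=info ./aetherlay --log-level=debug
```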
| Variable | Default | Description |
|---|---|---|
| `ALCHEMY_API_KEY` | - | Example API key for Alchemy RPC endpoints. Only needed for the example config. The name must match the variable referenced in your `configs/endpoints.json`, if you need any. |
| `INFURA_API_KEY` | - | Example API key for Infura RPC endpoints. Only needed for the example config. The name must match the variable referenced in your `configs/endpoints.json`, if you need any. |
| `CONFIG_FILE` | `configs/endpoints.json` | Path to the endpoints configuration file |
| `CORS_HEADERS` | `Accept, Authorization, Content-Type, Origin, X-Requested-With` | Allowed headers for CORS requests |
| `CORS_METHODS` | `GET, POST, OPTIONS` | Allowed HTTP methods for CORS requests |
| `CORS_ORIGIN` | `*` | Allowed origin for CORS requests |
| `ENDPOINT_FAILURE_THRESHOLD` | `2` | Number of consecutive failures before marking endpoint unhealthy |
| `ENDPOINT_SUCCESS_THRESHOLD` | `2` | Number of consecutive successes before marking endpoint healthy |
| `EPHEMERAL_CHECKS_HEALTHY_THRESHOLD` | `3` | Number of consecutive successful responses from the endpoint required to consider it as being healthy again |
| `EPHEMERAL_CHECKS_INTERVAL` | `30` | Interval in seconds for ephemeral health checks |
| `HEALTH_CACHE_TTL` | `10` | Health status cache TTL in seconds |
| `HEALTH_CHECK_CONCURRENCY` | `20` | Maximum number of concurrent health checks during startup |
| `HEALTH_CHECK_INTERVAL` | `30` | Health check interval in seconds |
| `HEALTH_CHECK_SYNC_STATUS` | `true` | Consider the sync status of the endpoints when deciding whether an endpoint is healthy or not. When enabled, endpoints that are syncing are considered to be unhealthy. |
| `HEALTH_CHECKER_GRACE_PERIOD` | `60` | Grace period in seconds for health checker downtime after initial check passes. During this period, the load balancer will remain ready even if the health checker is temporarily unavailable. |
| `HEALTH_CHECKER_SERVER_PORT` | `8080` | Health checker HTTP server port |
| `HEALTH_CHECKER_SERVICE_URL` | `http://aetherlay-hc:8080` | Health checker service URL for readiness checks (used when `STANDALONE_HEALTH_CHECKS` is enabled) |
| `LOG_LEVEL` | `info` | Set the log level |
| `METRICS_ENABLED` | `true` | Whether to enable Prometheus metrics |
| `METRICS_PORT` | `9090` | Port for the Prometheus metrics server |
| `PROXY_MAX_RETRIES` | `3` | Maximum number of retries for proxy requests |
| `PROXY_TIMEOUT` | `15` | Total timeout for proxy requests in seconds |
| `PROXY_TIMEOUT_PER_TRY` | `5` | Timeout per individual retry attempt in seconds |
| `PUBLIC_FIRST` | `false` | Prioritize public endpoints over primary and fallback endpoints |
| `PUBLIC_FIRST_ATTEMPTS` | `2` | Number of attempts to make at public endpoints before trying with a primary/fallback |
| `SERVER_PORT` | `8080` | Port to use for the load balancer / proxy |
| `STANDALONE_HEALTH_CHECKS` | `true` | Enable/disable the standalone mode of the health checker |
| `VALKEY_HOST` | `localhost` | Valkey server hostname |
| `VALKEY_PASS` | - | Valkey server password |
| `VALKEY_PORT` | `6379` | Valkey server port |
| `VALKEY_SKIP_TLS_CHECK` | `false` | Whether to skip TLS certificate validation when connecting to Valkey |
| `VALKEY_USE_TLS` | `false` | Whether to use TLS for connecting to Valkey |
The service checks the health of an endpoint by sending these requests to it:
- `eth_blockNumber` - Checks for a successful response and that the block number is not `0`.
- `eth_syncing` (unless you disable it by setting `HEALTH_CHECK_SYNC_STATUS=false`) - Checks for a successful response and that the node is not syncing (i.e., it has already fully synced, so you get the latest data from it).
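These are standard Ethereum JSON-RPC calls, so you can reproduce them manually against an upstream endpoint; the URL below is the public endpoint from the example config:

```bash
# Healthy node: returns a non-zero hex block number
curl -s -X POST https://ethereum-rpc.publicnode.com \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

# Healthy node: returns "result": false once fully synced
curl -s -X POST https://ethereum-rpc.publicnode.com \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":2}'
```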
When `STANDALONE_HEALTH_CHECKS=false`, the load balancer will run integrated health checks using the `HEALTH_CHECK_INTERVAL` setting.
You can also disable health checks altogether by setting `HEALTH_CHECK_INTERVAL` to `0`, which might affect the performance of the proxy but will prevent the service from wasting your RPC credits by constantly running health checks. In this case, health checks will be run in an ephemeral fashion. For example:
- A user sends a request.
- The LB tries to proxy that request to RPC endpoint "A" but fails.
- 3 things happen at the same time:
  I. The RPC endpoint "A" is marked as unhealthy.
  II. The LB tries to proxy that request to another RPC endpoint.
  III. An ephemeral health checker starts running to monitor RPC endpoint "A" at the interval specified by `EPHEMERAL_CHECKS_INTERVAL` (default: 30s).
- As soon as RPC endpoint "A" is healthy again, the ephemeral health checker is stopped.
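A possible configuration for this ephemeral-only mode (values are illustrative):

```bash
HEALTH_CHECK_INTERVAL=0                 # disable periodic health checks
EPHEMERAL_CHECKS_INTERVAL=30            # probe a failed endpoint every 30s
EPHEMERAL_CHECKS_HEALTHY_THRESHOLD=3    # require 3 consecutive successes to restore it
```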
- Trigger: Only when a request to an endpoint fails and health checks are otherwise disabled. The server marks the endpoint as unhealthy for the specific protocol (HTTP or WS) that failed.
- Interval: Controlled by the `EPHEMERAL_CHECKS_INTERVAL` environment variable (in seconds).
- Behavior: The health checker service observes the unhealthy status and starts ephemeral checks for the affected protocol. The system will monitor the failed endpoint at the specified interval and automatically start routing traffic to it as soon as it becomes healthy again.
- When a request to an endpoint fails (HTTP or WebSocket), the server marks that endpoint as unhealthy for the specific protocol that failed (e.g., `HealthyHTTP = false` or `HealthyWS = false` in Valkey).
- The health checker service detects this change and starts ephemeral health checks for that protocol only.
- Once the endpoint passes the configured number of consecutive health checks, it is marked healthy again and ephemeral checks stop.
For production deployments with multiple load balancer pods, use the standalone health checker:
- Single Health Checker Instance: Prevents duplicate health checks
- Multiple Load Balancer Pods: Scale independently without health check overhead
- Resource Efficiency: Reduces RPC endpoint usage
- Better Separation of Concerns: Health monitoring isolated from request handling
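A hedged sketch of the corresponding configuration, assuming the health checker is reachable through a Kubernetes service named `aetherlay-hc` (the default `HEALTH_CHECKER_SERVICE_URL`):

```bash
# Health checker deployment
STANDALONE_HEALTH_CHECKS=true
HEALTH_CHECKER_SERVER_PORT=8080

# Load balancer deployment(s)
STANDALONE_HEALTH_CHECKS=true
HEALTH_CHECKER_SERVICE_URL=http://aetherlay-hc:8080
```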
Ætherlay supports a "public-first" mode that prioritizes public RPC endpoints over primary and fallback endpoints to help reduce costs while maintaining reliability.
- Enable public-first: Set `PUBLIC_FIRST=true` (or use the `--public-first` CLI flag).
- Configure attempts: Set `PUBLIC_FIRST_ATTEMPTS` to control how many public endpoints to try (default: 2).
- Endpoint hierarchy:
  - When enabled: public → primary → fallback
  - When disabled: primary → fallback → public
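For example, to try up to two public endpoints before falling back to primary/fallback endpoints:

```bash
PUBLIC_FIRST=true
PUBLIC_FIRST_ATTEMPTS=2
```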
In your `endpoints.json`, mark endpoints with `"role": "public"`:

```json
{
"mainnet": {
"publicnode-1": {
"provider": "publicnode",
"role": "public",
"type": "archive",
"http_url": "https://ethereum-rpc.publicnode.com",
"ws_url": "wss://ethereum-rpc.publicnode.com"
},
"alchemy-1": {
"provider": "alchemy",
"role": "primary",
"type": "archive",
"http_url": "https://eth-mainnet.g.alchemy.com/v2/${ALCHEMY_API_KEY}"
}
}
}
```

Ætherlay includes several advanced features to ensure high availability and prevent race conditions in multi-pod deployments.
Local per-pod caching of endpoint health status reduces Valkey load and minimizes race condition windows:
- Configurable TTL: Cache entries expire after `HEALTH_CACHE_TTL` seconds (default: 10).
- Automatic invalidation: Cache entries are invalidated when health status changes.
- Reduced latency: Cache reads are <0.1ms vs 1-5ms for Valkey reads.
- Fallback behavior: Falls back to Valkey on cache miss.
Prevents health status "flapping" by requiring multiple consecutive observations before changing state:
- Failure threshold (`ENDPOINT_FAILURE_THRESHOLD`): Number of consecutive failures before marking endpoint unhealthy (default: 2).
- Success threshold (`ENDPOINT_SUCCESS_THRESHOLD`): Number of consecutive successes before marking endpoint healthy (default: 2).
- Per-protocol tracking: HTTP and WebSocket health tracked independently.
- Reset on opposite event: Success resets failure counter, failure resets success counter.
Example with default thresholds (2):
```
Request 1: Success
Request 2: Failure (1/2 failures) -> endpoint stays healthy
Request 3: Failure (2/2 failures) -> endpoint is marked unhealthy
Request 4: Success (1/2 successes) -> endpoint stays unhealthy
Request 5: Success (2/2 successes) -> endpoint is marked healthy
```
Ensures proper cleanup during pod termination (e.g., Kubernetes rolling updates, Karpenter node replacements):
- Context cancellation: All goroutines receive shutdown signal.
- Coordinated cleanup: Rate limit scheduler shuts down before HTTP server.
- Configurable timeout: 10-second default wait for goroutines to complete.
- No orphaned goroutines: WaitGroup tracking ensures all background tasks finish.
Prevents routing traffic to pods before initial health checks complete:
- Blocks traffic: HTTP server doesn't accept requests until ready.
- Kubernetes integration: The `/ready` endpoint returns 503 until the initial health check completes.
- Initial check: All endpoints are checked with a configurable concurrency limit (default: 20 concurrent checks).
- Fast startup: The initial check typically completes in 2-5 seconds, though this varies with the number of endpoints you configure.
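Readiness can also be probed by hand (default port assumed); a 503 simply means the initial health check hasn't finished yet:

```bash
curl -i http://localhost:8080/ready
```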
Ætherlay includes intelligent rate limit detection and recovery mechanisms to handle upstream provider rate limits gracefully. This system automatically detects when endpoints are rate-limited and implements recovery strategies to restore service.
- Detection: When a request returns a rate limit error (HTTP 429), the endpoint is automatically marked as rate-limited.
- Retries with "backoff": The system tries to reach the endpoint only after waiting for a specific amount of time, defined as a backoff, which is configurable by the user. This wait period increases each time, relative to another user-defined parameter (the backoff multiplier).
- Automatic recovery: The system will reintroduce the endpoint back into the load balancing pool after a certain amount of successful consecutive requests. Users can specify how many consecutive requests are required for endpoints to be marked again as healthy.
- Per-endpoint configuration: Each endpoint can have its own rate limit recovery strategy tailored to the provider's limits. You can also simply rely on the system's defaults, which have been carefully set.
Rate limit recovery is configured per endpoint in your `endpoints.json` file:

```json
{
"mainnet": {
"provider-1": {
"provider": "example",
"role": "primary",
"type": "archive",
"http_url": "https://api.example.com",
"rate_limit_recovery": {
"backoff_multiplier": 2.0,
"initial_backoff": 300,
"max_backoff": 3600,
"max_retries": 10,
"required_successes": 3,
"reset_after": 86400
}
}
}
}
```

- `backoff_multiplier` (float): Exponential multiplier for backoff time (e.g., 2.0 doubles the wait time each attempt).
- `initial_backoff` (int): Initial backoff time in seconds before the first recovery attempt.
- `max_backoff` (int): Maximum backoff time in seconds (limits exponential growth).
- `max_retries` (int): Maximum number of recovery attempts before giving up until `reset_after`.
- `required_successes` (int): Number of consecutive successes needed to mark the endpoint as healthy.
- `reset_after` (int): Time in seconds after which to reset the backoff state and start fresh.
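To make the backoff behavior concrete, here is how the example configuration above plays out; the timings are computed from the values shown, purely as an illustration:

```bash
# initial_backoff=300, backoff_multiplier=2.0, max_backoff=3600, max_retries=10
# Recovery attempts are spaced roughly at:
#   300s -> 600s -> 1200s -> 2400s -> 3600s -> 3600s -> ... (capped at max_backoff)
# After 10 attempts without recovery, the endpoint stays out of the pool
# until reset_after (86400s = 24h) elapses and the backoff state resets.
```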
Conservative:

```json
"rate_limit_recovery": {
"backoff_multiplier": 4.0,
"initial_backoff": 300,
"max_backoff": 3600,
"max_retries": 5,
"required_successes": 3,
"reset_after": 86400
}
```

Aggressive:

```json
"rate_limit_recovery": {
"backoff_multiplier": 1.5,
"initial_backoff": 60,
"max_retries": 20,
"max_backoff": 600,
"required_successes": 1,
"reset_after": 86400
}
```

This service uses a pull-based model for metrics collection, which is standard for Prometheus. This means the application exposes a `/metrics` endpoint, and a separate Prometheus server is responsible for periodically "scraping" (or pulling) data from it.
- Exporter: The application acts as a Prometheus exporter. When enabled, it starts a dedicated server that holds all metric values (counters, gauges, and histograms) in memory.
- `/metrics` Endpoint: This server exposes the metrics in a text-based format at `http://localhost:9090/metrics` (or the port specified by `METRICS_PORT`).
- Prometheus Server: A separate Prometheus server must be configured to scrape this endpoint at regular intervals. The Prometheus server is responsible for all storage, querying, and alerting, ensuring that the application itself remains lightweight and stateless.
Because the metrics are stored in memory, they will be lost on every application restart. Persistence is the responsibility of the Prometheus server.
Metrics are enabled by default. If you don't want them, use the `--metrics-enabled=false` flag or set the `METRICS_ENABLED` environment variable to `false`.
- The metrics server runs on the port defined by `METRICS_PORT` (default: `9090`).
- Important: When running multiple services from this repository on the same machine (e.g., the load balancer and the standalone health checker), you must assign them different metrics ports to avoid conflicts. For example, you could run the health checker with `--metrics-port=9090` and the load balancer with `--metrics-port=9091`.
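Continuing that example, you can verify each exporter manually:

```bash
curl -s http://localhost:9090/metrics | head   # health checker exporter
curl -s http://localhost:9091/metrics | head   # load balancer exporter
```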
An example Grafana dashboard is provided with the code to help you monitor your Ætherlay deployment. The dashboard includes comprehensive monitoring for:
- Service Health Overview: Real-time health status of remote blockchain endpoints.
- Load Balancer Performance: Request rates, error rates, response times, and in-flight requests.
- System Resources: Memory usage, CPU usage, goroutines, and garbage collection metrics.
- Network & Infrastructure: Network I/O, file descriptor usage, and Prometheus scrape rates.
- Replace variables: Replace `${DATA_SOURCE}` with the name of your Prometheus data source and `${NAMESPACE}` with the Kubernetes namespace where you deployed the app.
- Import the Dashboard: Import the `aetherlay-dashboard.json` file into your Grafana instance.
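One way to fill in those placeholders before importing; the data source name, namespace, and output filename below are just examples:

```bash
sed -e 's/${DATA_SOURCE}/prometheus/g' \
    -e 's/${NAMESPACE}/aetherlay/g' \
    aetherlay-dashboard.json > aetherlay-dashboard-ready.json
```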
The dashboard is designed to work with the standard Prometheus metrics exposed by both the load balancer and health checker services. It automatically detects pods by matching `.*hc.*` for the health checker and `.*lb.*` for the load balancer, so you'll need to update that if you use different names for your pods.
Contributions are welcome! Please submit a pull request or open an issue for any enhancements or bug fixes.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
You may use, modify, and distribute this software under the terms of the AGPL-3.0. See the LICENSE file for details.
TL;DR: The AGPL-3.0 ensures that all changes and derivative works must also be licensed under AGPL-3.0, and that attribution is preserved. If you run a modified version as a network service, you must make the source code available to users. This code is provided as-is, without warranties.

