
Conversation

@sanbotto
Member

@sanbotto sanbotto commented Dec 2, 2025

Some 4xx errors weren't being handled as errors because of the assumption that, since they are technically user errors, they should simply be returned to the user. However, some of them, such as 401 and 403, can be received by aetherlay acting as a client of the endpoints; from the end user's point of view those are really server errors, so they have to be handled accordingly.

There's an exception, though: a 400 error is checked for recurrence. If aetherlay sees the same 400 error coming from two different endpoints for the same request, that response is passed through to the user, since we assume it's truly a bad request sent by the end user (e.g., wrong payload or invalid signature). I used the BadRequestError struct type for this.
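
For reference, here is a minimal sketch of what that type can look like, based only on the fields mentioned in this PR (status code, message, body, headers); the canonical definition lives in internal/server/server.go:

// Sketch of BadRequestError: it carries enough of an upstream 400 response to
// either retry against another endpoint or replay the response to the client.
package server

import "net/http"

type BadRequestError struct {
	StatusCode int         // upstream HTTP status (400)
	Message    string      // e.g. "HTTP 400: 400 Bad Request"
	Body       []byte      // upstream response body, replayed if the 400 is confirmed
	Headers    http.Header // upstream headers, minus Access-Control-* when replayed
}

// Error makes BadRequestError usable wherever a plain error is expected.
func (e *BadRequestError) Error() string {
	return e.Message
}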

Summary by CodeRabbit

  • Bug Fixes

    • Error responses now include explicit HTTP status info for clearer troubleshooting.
    • Retry policy expanded to retry on all non-2xx responses, improving request resilience.
    • 400 Bad Request handling refined to retry across endpoints and surface the original error when appropriate.
    • WebSocket handshake and proxy error handling strengthened for additional HTTP error scenarios.
  • Tests

    • Removed a unit test covering previous retry logic.
  • Chores

    • CI workflow: reduced debug output and now emits an explicit empty tag value when no tags are provided.


@sanbotto sanbotto self-assigned this Dec 2, 2025
@coderabbitai

coderabbitai bot commented Dec 2, 2025

Walkthrough

Introduces a new BadRequestError carrying status, message, body, and headers. Forwarding now treats all non-2xx responses as retryable: the first 400 is cached so other endpoints can be attempted, and the cached 400 is returned if all endpoints are exhausted; a successful retry marks the first endpoint unhealthy. WebSocket handshake and proxying are expanded to handle non-2xx and 429 consistently and to update endpoint health/rate-limit state. The CI workflow emits an explicit empty tags output when none are provided and removes several debug echo steps.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately summarizes the main change: treating 4xx errors from endpoints as retryable errors rather than immediately returning them to users, with special handling for 400 responses.
  • Docstring Coverage (✅ Passed): docstring coverage is 100.00%, which meets the required 80.00% threshold.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch fix/catch-all-errors-from-endpoints



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
internal/health/checker.go (1)

425-425: Minor redundancy in error message format.

resp.Status already includes the status code (e.g., "401 Unauthorized"), so the resulting message will be "HTTP 401: 401 Unauthorized". Consider using just the reason phrase:

-		return nil, errors.New("HTTP " + strconv.Itoa(resp.StatusCode) + ": " + resp.Status)
+		return nil, errors.New("HTTP " + strconv.Itoa(resp.StatusCode) + ": " + http.StatusText(resp.StatusCode))

Alternatively, just use resp.Status directly since it's already descriptive.

internal/server/server.go (1)

504-513: Redundant continue blocks.

Lines 504-508 and 510-513 both check similar conditions and call continue. The first block is a subset of the second (when first400Error != nil), so the explicit 400 check is unnecessary since the general retry block below would handle it anyway.

Consider consolidating:

-				// If we got a 400 from first endpoint, continue retrying
-				if first400Error != nil && len(allEndpoints) > 0 && retryCount < s.maxRetries {
-					log.Debug().Str("chain", chain).Str("failed_endpoint", endpoint.ID).Int("public_attempt_count", publicAttemptCount).Int("remaining_endpoints", len(allEndpoints)).Int("retry", retryCount).Msg("Retrying HTTP request with different endpoint after 400")
-					continue
-				}
-
 				if len(allEndpoints) > 0 && retryCount < s.maxRetries {
-					log.Debug().Str("chain", chain).Str("failed_endpoint", endpoint.ID).Int("public_attempt_count", publicAttemptCount).Int("remaining_endpoints", len(allEndpoints)).Int("retry", retryCount).Msg("Retrying HTTP request with different endpoint")
+					reason := "error"
+					if first400Error != nil {
+						reason = "400"
+					}
+					log.Debug().Str("chain", chain).Str("failed_endpoint", endpoint.ID).Str("reason", reason).Int("remaining_endpoints", len(allEndpoints)).Int("retry", retryCount).Msg("Retrying HTTP request with different endpoint")
 					continue
 				}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90fe1f6 and 8a182c1.

📒 Files selected for processing (2)
  • internal/health/checker.go (1 hunks)
  • internal/server/server.go (8 hunks)
🔇 Additional comments (6)
internal/server/server.go (6)

37-47: LGTM!

The BadRequestError type is well-designed to carry the necessary information (status code, body, and headers) for pass-through scenarios.


423-424: LGTM!

The variables correctly track the first 400 error for the two-endpoint confirmation logic.


514-522: LGTM!

Correctly marks the first endpoint (that returned 400) as unhealthy when a subsequent endpoint succeeds, confirming the 400 was a server-side issue rather than a user error.


1150-1177: LGTM!

The WebSocket handshake error handling is correctly expanded to handle all non-2xx responses, with appropriate rate-limit handling for 429 and unhealthy marking for other error codes.


1096-1105: LGTM!

Response headers are correctly copied on success, with CORS headers appropriately excluded to avoid duplication.


1115-1119: shouldRetry is tested but verify its integration in retry logic.

The shouldRetry method is covered by tests in server_test.go, so it's not dead code. However, confirm whether this method is actually invoked during request handling or if the retry logic uses a different mechanism (such as err != nil checks).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
internal/server/server.go (1)

1063-1097: Consider reading response body only for 400 status codes.

Currently, the response body is read for all non-2xx responses (line 1068), but it's only used for 400 errors (line 1091). For 429 and other non-2xx responses, the body is read and immediately discarded, which is inefficient—especially for large error responses.

Consider restructuring to read the body only when needed:

 	// Check for non-2xx status codes, all of them should trigger retries
 	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
 		chain, endpointID, found := s.findChainAndEndpointByURL(targetURL)
 
-		// Read response body for logging and potential passing through
-		respBodyBytes, readErr := io.ReadAll(resp.Body)
-		if readErr != nil {
-			respBodyBytes = []byte{}
-		}
-
 		// Mark endpoint as unhealthy for any non-2xx response
 		if found {
 			if resp.StatusCode == 429 {
 				// For 429 (Too Many Requests), use the rate limit handler
 				s.markEndpointUnhealthyProtocol(chain, endpointID, "http")
 				s.handleRateLimit(chain, endpointID, "http")
 				log.Debug().Str("url", helpers.RedactAPIKey(targetURL)).Int("status_code", resp.StatusCode).Msg("Endpoint returned 429 (Too Many Requests), handling rate limit")
 			} else {
 				s.markEndpointUnhealthyProtocol(chain, endpointID, "http")
 				log.Debug().Str("url", helpers.RedactAPIKey(targetURL)).Int("status_code", resp.StatusCode).Msg("Endpoint returned non-2xx status, marked unhealthy")
 			}
 		}
 
 		// Special handling for 400 Bad Request
 		if resp.StatusCode == 400 {
+			// Read response body for passing through
+			respBodyBytes, readErr := io.ReadAll(resp.Body)
+			if readErr != nil {
+				respBodyBytes = []byte{}
+			}
 			return &BadRequestError{
 				StatusCode: resp.StatusCode,
 				Message:    fmt.Sprintf("HTTP %d: %s", resp.StatusCode, resp.Status),
 				Body:       respBodyBytes,
 				Headers:    resp.Header,
 			}
 		}
 
 		// For all other non-2xx errors, return a generic error
 		return fmt.Errorf("HTTP %d: %s", resp.StatusCode, resp.Status)
 	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a182c1 and 4d9e026.

📒 Files selected for processing (1)
  • internal/server/server.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
  • RedactAPIKey (253-277)
🔇 Additional comments (4)
internal/server/server.go (4)

37-47: LGTM: Well-structured error type for 400 handling.

The BadRequestError type properly captures the necessary information (status, message, body, headers) to implement the special 400 pass-through logic described in the PR objectives.


423-554: LGTM: 400 caching logic correctly implements the PR requirements.

The implementation properly:

  • Caches the first 400 response and retries with another endpoint
  • Passes through to the user if a second endpoint also returns 400 (treating it as a true bad request)
  • Marks the first endpoint as unhealthy only after a subsequent endpoint succeeds
  • Passes through the cached 400 if all endpoints are exhausted
  • Handles w.Write errors (lines 486-487, 550-551) as flagged in previous reviews
  • Filters Access-Control-* headers to prevent duplication with CORS middleware
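
To make the control flow above concrete, here is a minimal sketch of the caching pattern under assumed names (Endpoint, forwardOnce, markUnhealthy, and writeCached400 are hypothetical stand-ins; the real logic lives in internal/server/server.go and uses the BadRequestError type sketched earlier):

// Sketch of the two-endpoint 400 confirmation flow described above.
package server

import (
	"net/http"
	"strings"
)

type Endpoint struct{ ID string }

func handleWithRetry(
	w http.ResponseWriter,
	endpoints []Endpoint, // remaining endpoints within the retry budget
	forwardOnce func(Endpoint) error, // forwards the request, returns nil on 2xx
	markUnhealthy func(endpointID string),
) {
	var first400 *BadRequestError
	var first400Endpoint string

	for _, endpoint := range endpoints {
		err := forwardOnce(endpoint)
		if err == nil {
			// A later endpoint succeeded, so an earlier 400 was that endpoint's fault.
			if first400 != nil {
				markUnhealthy(first400Endpoint)
			}
			return
		}
		if badReq, ok := err.(*BadRequestError); ok {
			if first400 == nil {
				// First 400: cache it and try the next endpoint.
				first400, first400Endpoint = badReq, endpoint.ID
				continue
			}
			// A second endpoint also returned 400: treat it as a genuine client error.
			writeCached400(w, badReq)
			return
		}
		// Any other error: keep retrying with the next endpoint.
	}

	// Endpoints exhausted: replay the cached 400 if there is one, otherwise 502.
	if first400 != nil {
		writeCached400(w, first400)
		return
	}
	http.Error(w, "all endpoints unavailable", http.StatusBadGateway)
}

// writeCached400 replays an upstream 400 to the client, skipping CORS headers
// so they don't clash with the server's own CORS middleware.
func writeCached400(w http.ResponseWriter, e *BadRequestError) {
	for key, values := range e.Headers {
		if strings.HasPrefix(key, "Access-Control-") {
			continue
		}
		for _, value := range values {
			w.Header().Add(key, value)
		}
	}
	w.WriteHeader(e.StatusCode)
	_, _ = w.Write(e.Body)
}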

1120-1123: LGTM: Simplified retry logic aligns with the broader error handling strategy.

The updated shouldRetry function now treats all non-2xx responses as retryable, which is consistent with the changes in defaultForwardRequestWithBodyFunc and the PR objectives.
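
In effect, the broadened predicate amounts to the following sketch (the real method is defined on Server and its exact signature may differ):

func shouldRetry(statusCode int) bool {
	// Every non-2xx response is now considered retryable; retries are still
	// bounded elsewhere by maxRetries.
	return statusCode < 200 || statusCode >= 300
}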


1154-1182: LGTM: Enhanced WebSocket error handling is comprehensive and consistent.

The expanded error handling for WebSocket handshakes properly:

  • Distinguishes between rate-limit (429), other non-2xx, and network errors
  • Marks endpoints unhealthy appropriately for each error type
  • Returns RateLimitError for 429 responses, allowing the caller to trigger rate-limit recovery logic
  • Maintains consistency with the HTTP error handling approach
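
A sketch of that branching, with hypothetical markUnhealthy/handleRateLimit callbacks and a plain sentinel error standing in for the project's RateLimitError type:

// Sketch of the handshake dial-error classification described above.
package server

import (
	"context"
	"errors"
	"fmt"
	"net/http"

	"github.com/gorilla/websocket"
)

var errRateLimited = errors.New("backend returned 429 during WebSocket handshake")

func dialBackendWS(ctx context.Context, backendURL string, header http.Header,
	markUnhealthy func(), handleRateLimit func(),
) (*websocket.Conn, error) {
	conn, resp, err := websocket.DefaultDialer.DialContext(ctx, backendURL, header)
	if err == nil {
		return conn, nil
	}
	switch {
	case resp != nil && resp.StatusCode == http.StatusTooManyRequests:
		// 429: mark unhealthy and start rate-limit handling; the caller can
		// react to the typed error and pick another endpoint.
		markUnhealthy()
		handleRateLimit()
		return nil, errRateLimited
	case resp != nil && (resp.StatusCode < 200 || resp.StatusCode >= 300):
		// Any other non-2xx handshake response: the endpoint is misbehaving.
		markUnhealthy()
		return nil, fmt.Errorf("HTTP %d: %s", resp.StatusCode, resp.Status)
	default:
		// Network-level dial failure (no HTTP response at all).
		markUnhealthy()
		return nil, err
	}
}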


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
internal/server/server.go (3)

423-555: Clarify behavior when only one 400 is seen and remaining attempts fail with non‑400 errors

The caching logic around first400Error works as intended for the “two 400s from different endpoints” case: the second 400 is passed through and the first endpoint is later marked unhealthy on a success. However, the fallback at the end:

  • If any attempt produced a 400 (setting first400Error), and then all remaining attempts fail with non‑400 errors or you exhaust retries, you still pass that single cached 400 through to the user (lines 536‑554).

This means a sequence like:

  • Endpoint A → 400
  • Endpoint B/C → timeouts or 5xx

will surface A’s 400 to the client even though you never got a confirming 400 from another backend. The PR description implies the stricter rule “forward 400 only after observing it from at least two endpoints”; if that’s the intended contract, you may want an explicit “second 400 seen” flag and only use the cached 400 when that flag is set, otherwise fall back to the existing 5xx/“all endpoints failed” behavior.

If the current behavior is intentional (i.e., “use the 400 if it’s the only concrete response we got”), a short comment explaining this choice near the fallback would help future maintainers understand why a single 400 is treated as authoritative once all other options are exhausted.


1063-1097: Consider avoiding health penalties for endpoints on client‑side 4xx, especially confirmed 400s

Marking endpoints unhealthy for all non‑2xx responses centralizes the retry behavior nicely, and the special 400 case with BadRequestError is a good fit for the higher‑level logic. One side effect, though, is:

  • Every 4xx (including 400s ultimately treated as client errors) increments failure state via markEndpointUnhealthyProtocol.
  • A burst of bad client requests can therefore push otherwise healthy endpoints toward the failure threshold, especially if traffic is skewed.

Given you now distinguish 400s enough to surface them as client errors, it might be worth reconsidering whether 4xx — or at least 400s that you decide to pass through — should count as health “failures” at this layer, versus only penalizing 5xx/429/network issues and letting successes reset the counters.

If you choose to adjust this, one approach is to skip markEndpointUnhealthyProtocol in this function when resp.StatusCode == 400, and instead have the calling logic decide when a 400 implies a misbehaving endpoint (e.g., “first endpoint 400, second succeeds”) versus a genuine client error.


1153-1177: WebSocket handshake error handling is sound; consolidate redundant rate-limit calls

The handshake branch correctly differentiates 429 during dial, marks the endpoint unhealthy, and invokes the rate-limit handler. Returning RateLimitError allows handleRequestWS to apply its own retry logic, and marking endpoints unhealthy for other non-2xx statuses and network errors is appropriate.

The concern about resp reliability is addressed: gorilla/websocket v1.5.3 does return a non-nil *http.Response with status code on HTTP handshake failures (4xx/5xx), so your assumption is sound.

However, handleRateLimit is now called twice for 429 errors: once in the dialer and again in handleRequestWS when it sees the RateLimitError. This causes duplicate logging and redundant rate-limit state writes. Consider consolidating rate-limit side effects in one place (either the dialer or the caller) and using the typed error purely as a signal.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d9e026 and 39b50c1.

📒 Files selected for processing (1)
  • internal/server/server.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
  • RedactAPIKey (253-277)
🔇 Additional comments (2)
internal/server/server.go (2)

37-47: BadRequestError shape matches the intended 400 pass-through use case

The exported struct plus Error() method give callers enough detail (status, body, headers) while still behaving like a normal error. This aligns well with the later 400-caching logic; no changes needed here.


1120-1121: shouldRetry semantics now align with the new forwarding behavior

Updating shouldRetry to treat any non‑2xx as retryable matches the behavior in defaultForwardRequestWithBodyFunc and the PR’s goal of handling 4xx internally rather than surfacing them directly. As long as callers bound retries using maxRetries (as in this file), this looks consistent.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/server/server.go (1)

1215-1231: 400 handling looks correct, minor simplification possible.

The logic correctly returns BadRequestError for 400 responses without marking health, deferring to the caller's retry logic.

The body reading could be slightly simplified:

 		if resp.StatusCode == 400 {
 			// Read response body for logging and passing through
-			respBodyBytes := []byte{}
-			if resp.Body != nil {
-				bodyBytes, readErr := io.ReadAll(resp.Body)
-				if readErr == nil {
-					respBodyBytes = bodyBytes
-				}
-			}
+			var respBodyBytes []byte
+			if resp.Body != nil {
+				respBodyBytes, _ = io.ReadAll(resp.Body)
+			}
 			log.Debug().Str("url", helpers.RedactAPIKey(backendURL)).Int("status_code", resp.StatusCode).Msg("WebSocket handshake returned 400 Bad Request")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d54701 and 1aa8415.

📒 Files selected for processing (1)
  • internal/server/server.go (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
  • RedactAPIKey (253-277)
🔇 Additional comments (12)
internal/server/server.go (12)

37-47: LGTM!

The BadRequestError struct follows the existing RateLimitError pattern and captures all necessary details (status code, message, body, headers) for proper error propagation and client response construction.


423-424: LGTM!

Variables properly initialized to track the first 400 error and its source endpoint for the retry-based validation logic.


464-491: LGTM!

The 400 handling logic correctly:

  1. Caches the first 400 error and endpoint ID for retry validation
  2. On a second 400 from a different endpoint, confirms it's a client error and forwards to user
  3. Properly copies headers (excluding CORS) and logs any write errors

506-514: LGTM!

The separate continuation checks for 400-specific retries vs. general error retries provide clearer debugging context with distinct log messages.


517-523: LGTM!

Correctly marks the first endpoint as unhealthy when the second endpoint succeeds, confirming the 400 was endpoint-specific rather than a client error.


576-577: LGTM!

Consistent with the HTTP handler's approach to tracking first 400 errors.


617-662: LGTM!

The WebSocket 400 handling mirrors the HTTP approach. The inline endpoint removal at lines 645-653 duplicates the pattern at lines 710-718, but this is acceptable given the different control flow requirements for the 400 case.


726-732: LGTM!

Consistent with the HTTP handler's success-after-400 logic, correctly using "ws" protocol for endpoint health marking.


1100-1137: LGTM with a note on behavior scope.

The implementation correctly:

  1. Defers 400 health decisions to the caller's retry logic
  2. Marks endpoints unhealthy for all other non-2xx responses
  3. Handles 429 specially for rate limiting

Note that all non-2xx responses (except 400) now trigger endpoint health marking and retries. This is broader than just 401/403 mentioned in the PR description but provides consistent behavior. Verify this aligns with intended design.


1140-1149: LGTM!

Properly copies backend response headers to the client while filtering CORS headers to avoid conflicts with the server's CORS middleware.


1161-1162: LGTM!

The updated logic correctly broadens retry behavior to all non-2xx responses, aligning with the new error handling strategy.


1233-1244: LGTM!

Correctly marks endpoints unhealthy for non-2xx handshake responses and network errors, with appropriate fallback logging when the endpoint can't be identified.

@sanbotto
Member Author

sanbotto commented Dec 3, 2025

@coderabbitai review

@coderabbitai

coderabbitai bot commented Dec 3, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/server/server.go (2)

423-525: 400 retry/pass-through semantics: consider equality check and exhaustion behavior.

The overall flow (cache first 400 via BadRequestError, retry other endpoints, pass through on second 400, and mark the first endpoint unhealthy on later success) matches the intent to treat repeated 400s as likely client errors rather than endpoint failures.

Two edge cases you may want to tighten:

  • You currently treat “any two endpoints that both return 400” as a conclusive client error. If the goal is “the same 400 from two endpoints”, consider comparing status + body before deciding to pass through, e.g.:
-				if badReqErr, ok := err.(*BadRequestError); ok {
-					if first400Error == nil {
+				if badReqErr, ok := err.(*BadRequestError); ok {
+					if first400Error == nil {
 						// First 400 response. Cache it and retry with next endpoint.
 						first400Error = badReqErr
 						first400EndpointID = endpoint.ID
 						log.Debug().Str("endpoint", endpoint.ID).Msg("First endpoint returned 400, will retry with next endpoint")
-					} else {
-						// Second endpoint also returned 400. This is the user's fault, pass it through.
+					} else if badReqErr.StatusCode == first400Error.StatusCode && bytes.Equal(badReqErr.Body, first400Error.Body) {
+						// Second endpoint returned the same 400. Treat as client error and pass it through.
 						log.Debug().Str("first_endpoint", first400EndpointID).Str("second_endpoint", endpoint.ID).Msg("Both endpoints returned 400, passing through to the user.")
  • If you see a single 400 and then exhaust endpoints/maxRetries with only non-400 failures, the cached 400 is ignored and the user gets a generic 502. If you’d prefer to surface that 400 as a stronger client-signal once there are no more backends to try, you could fall back to first400Error before returning 502, e.g.:
-		// If we get here, all retries failed
-		if retryCount >= s.maxRetries {
+		// If we get here, all retries failed. If we observed a cached 400, prefer returning that
+		// to the client as a likely bad request signal.
+		if first400Error != nil {
+			for key, values := range first400Error.Headers {
+				if strings.HasPrefix(key, "Access-Control-") {
+					continue
+				}
+				for _, value := range values {
+					w.Header().Add(key, value)
+				}
+			}
+			w.WriteHeader(first400Error.StatusCode)
+			if _, err := w.Write(first400Error.Body); err != nil {
+				log.Debug().Err(err).Msg("Failed to write cached 400 response body to client")
+			}
+			return
+		}
+
+		// If we get here, all retries failed without a definitive 400
+		if retryCount >= s.maxRetries {
 			log.Error().Str("chain", chain).Strs("tried_endpoints", triedEndpoints).Int("max_retries", s.maxRetries).Msg("Max retries reached")
 			http.Error(w, "Max retries reached, all endpoints unavailable", http.StatusBadGateway)
 		} else {

If the current behavior is intentional, this can be left as-is; the above is mostly about making the “same 400 twice” heuristic explicit in code.

Also applies to: 506-510, 1100-1121


1100-1121: Non‑2xx HTTP handling is consistent; consider reusing an HTTP client.

The new behavior in defaultForwardRequestWithBodyFunc (treat all non‑2xx as errors, special‑case 400 via BadRequestError, and copy upstream headers on success while skipping CORS) looks coherent and lines up with the higher‑level retry logic.

One small performance/connection‑reuse improvement to consider (not blocking):

  • A new http.Client is created per forwarded request. Since NewRequestWithContext already enforces timeouts via the context, you could keep a single proxyHTTPClient *http.Client on Server (similar to healthCheckClient) and reuse it here to avoid repeated allocations and to better leverage connection pooling:
 type Server struct {
@@
-	healthCheckClient *http.Client
+	healthCheckClient *http.Client
+	proxyHTTPClient   *http.Client
@@
 func NewServer(cfg *config.Config, valkeyClient store.ValkeyClientIface, appConfig *helpers.LoadedConfig) *Server {
 	s := &Server{
@@
 	s.healthCheckClient = &http.Client{
 		Transport: transport,
 		Timeout:   2 * time.Second,
 	}
+
+	// Reusable HTTP client for proxying requests; relies on per-request context for timeouts.
+	s.proxyHTTPClient = &http.Client{
+		Transport: transport,
+	}
@@
 func (s *Server) defaultForwardRequestWithBodyFunc(w http.ResponseWriter, ctx context.Context, method, targetURL string, bodyBytes []byte, headers http.Header) error {
@@
-	// Forward the request. Use the context timeout instead of a fixed timeout.
-	client := &http.Client{}
-	resp, err := client.Do(req)
+	// Forward the request. Use the context timeout instead of a fixed timeout.
+	resp, err := s.proxyHTTPClient.Do(req)

This keeps behavior the same while reducing per‑request overhead.

Also applies to: 1123-1137, 1140-1149

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa8415 and 32b3da9.

📒 Files selected for processing (2)
  • internal/server/server.go (9 hunks)
  • internal/server/server_test.go (0 hunks)
💤 Files with no reviewable changes (1)
  • internal/server/server_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
  • RedactAPIKey (253-277)
🔇 Additional comments (1)
internal/server/server.go (1)

37-47: BadRequestError structure and Error implementation look good.

The fields and Error() method provide enough context (status, message, body, headers) for both HTTP and WebSocket callers to make routing and logging decisions; no changes needed here.

Comment on lines +576 to +577
var first400Error *BadRequestError
var first400EndpointID string

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 102


🏁 Script executed:

sed -n '570,590p' internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 690


🏁 Script executed:

sed -n '1175,1235p' internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 2855


🏁 Script executed:

sed -n '617,670p' internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 2536


🏁 Script executed:

sed -n '720,740p' internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 1451


🌐 Web query:

Go http.ResponseWriter after websocket.Upgrader.Upgrade - is connection hijacked, can you call WriteHeader or Write after upgrade?

💡 Result:

Short answer: Yes — Upgrade hijacks the underlying connection, so you must not call ResponseWriter.WriteHeader or Write after Upgrade; use the returned *websocket.Conn instead. If you do call Write/WriteHeader after the hijack you’ll get errors like "http: response.Write on hijacked connection". Also note hijack is only supported for HTTP/1.x (HTTP/2 ResponseWriters don’t implement Hijacker). [1][2]

Sources:

  • gorilla/websocket Upgrade docs (Upgrade calls hijack as part of the handshake). [1]
  • Explanation and common error ("response.Write on hijacked connection") on Stack Overflow. [2]

🌐 Web query:

gorilla/websocket Dial error - do I need to close resp.Body when Dial fails?

💡 Result:

No — you do not need to close resp.Body when gorilla/websocket Dial returns an error. The Dialer returns a non-nil *http.Response on handshake failure for caller inspection, but the response body may be incomplete and “does not need to be closed by the application.” [1][2]

Sources:

  • gorilla/websocket package docs. [1]
  • gorilla/websocket client.go (source). [2]

🏁 Script executed:

sed -n '560,580p' internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 877


🏁 Script executed:

sed -n '1170,1185p' internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 617


🏁 Script executed:

# Get the full function signature and structure to understand flow
grep -B 30 "var first400Error" internal/server/server.go | head -40

Repository: project-aethermesh/aetherlay

Length of output: 1515


🏁 Script executed:

# Understand the exact function this code is in
sed -n '540,580p' internal/server/server.go

Repository: project-aethermesh/aetherlay

Length of output: 1670


WebSocket handshake error handling: Upgrade happens too early in the retry loop, making HTTP response writes invalid.

The WS path attempts to mirror HTTP 400 logic, but there's a critical architectural issue:

  1. Premature Upgrade prevents HTTP error responses in the retry loop.

    • defaultProxyWebSocket upgrades the client connection immediately via upgrader.Upgrade(w, r, nil) (Line 1178). The connection is hijacked and http.ResponseWriter becomes invalid for further HTTP operations.
    • However, handleRequestWS calls defaultProxyWebSocket on each retry iteration (lines 576+). If the backend handshake fails with a 400 error on the first endpoint, the function returns a BadRequestError to handleRequestWS.
    • handleRequestWS then attempts to write HTTP headers and body via w.Header().Add, w.WriteHeader, and w.Write (lines 638–641) to pass through the error to the client. But since the connection was already hijacked by the Upgrade call in the previous defaultProxyWebSocket invocation, these writes will fail with "response.Write on hijacked connection" errors.
    • To fix this, the backend handshake should be completed (and its status code checked) before calling Upgrade. Only call Upgrade once you confirm the backend will accept the connection, or restructure so defaultProxyWebSocket does not upgrade on retry attempts that fail.
  2. Response body closure on handshake error (lines 1203–1207) is not required.

    • Per gorilla/websocket documentation, resp.Body does not need to be closed by the application when Dial returns an error. The current code reads the body without closing it, which is acceptable per the library's contract and does not cause resource leaks.

Given the hijack issue, the WS 400 pass-through path will not work as written. Consider redesigning the flow to probe or perform the backend handshake before upgrading the client connection.
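
One possible shape for that restructuring, sketched with assumed names (proxyWS is hypothetical; BadRequestError is the type sketched earlier in this thread): dial the backend first, and only upgrade the client connection once the backend handshake has succeeded, so backend 4xx responses can still be answered over plain HTTP.

// Sketch: backend handshake before client upgrade.
package server

import (
	"io"
	"net/http"

	"github.com/gorilla/websocket"
)

func proxyWS(w http.ResponseWriter, r *http.Request, backendURL string) error {
	// 1. Backend handshake first. On failure, gorilla/websocket returns a
	//    non-nil *http.Response that can be inspected and need not be closed.
	backendConn, resp, err := websocket.DefaultDialer.DialContext(r.Context(), backendURL, nil)
	if err != nil {
		if resp != nil && resp.StatusCode == http.StatusBadRequest {
			var body []byte
			if resp.Body != nil {
				body, _ = io.ReadAll(resp.Body)
			}
			// w has not been hijacked yet, so the caller may still replay this 400 over HTTP.
			return &BadRequestError{StatusCode: resp.StatusCode, Message: resp.Status, Body: body, Headers: resp.Header}
		}
		return err
	}
	defer backendConn.Close()

	// 2. Only now hijack the client connection. From this point on, errors must
	//    be reported over the WebSocket, not via w.
	var upgrader websocket.Upgrader
	clientConn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return err
	}
	defer clientConn.Close()

	// 3. Pump frames in both directions (omitted in this sketch).
	return nil
}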

🤖 Prompt for AI Agents
internal/server/server.go around lines 576-577: the WebSocket upgrade is being
performed inside defaultProxyWebSocket on each retry which hijacks the client
connection and makes subsequent HTTP error writes invalid; change the flow so
the backend handshake (dialing the backend and checking its HTTP status) is
completed and its status code examined before calling upgrader.Upgrade on the
client side—either by refactoring defaultProxyWebSocket to separate "probe
backend handshake without upgrading client" and only upgrade after a successful
backend response, or by moving the upgrade out of the retry loop so retries only
probe the backend; also remove or leave unchanged any explicit resp.Body.Close
on dial errors per gorilla/websocket contract (no body close required) and
ensure error responses are written to the client only when the connection has
not been hijacked.
