fix: Catch all errors from endpoints #25
base: main
Conversation
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
internal/health/checker.go (1)
425-425: Minor redundancy in error message format.
`resp.Status` already includes the status code (e.g., "401 Unauthorized"), so the resulting message will be "HTTP 401: 401 Unauthorized". Consider using just the reason phrase:

```diff
-	return nil, errors.New("HTTP " + strconv.Itoa(resp.StatusCode) + ": " + resp.Status)
+	return nil, errors.New("HTTP " + strconv.Itoa(resp.StatusCode) + ": " + http.StatusText(resp.StatusCode))
```

Alternatively, just use `resp.Status` directly since it's already descriptive.
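For illustration, the difference in the produced message (a throwaway snippet, not project code):

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

func main() {
	code := 401
	status := "401 Unauthorized" // what resp.Status typically contains

	fmt.Println("HTTP " + strconv.Itoa(code) + ": " + status)                // HTTP 401: 401 Unauthorized
	fmt.Println("HTTP " + strconv.Itoa(code) + ": " + http.StatusText(code)) // HTTP 401: Unauthorized
}
```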
internal/server/server.go (1)

504-513: Redundant continue blocks.

Lines 504-508 and 510-513 both check similar conditions and call `continue`. The first block is a subset of the second (when `first400Error != nil`), so the explicit 400 check is unnecessary since the general retry block below would handle it anyway. Consider consolidating:

```diff
-	// If we got a 400 from first endpoint, continue retrying
-	if first400Error != nil && len(allEndpoints) > 0 && retryCount < s.maxRetries {
-		log.Debug().Str("chain", chain).Str("failed_endpoint", endpoint.ID).Int("public_attempt_count", publicAttemptCount).Int("remaining_endpoints", len(allEndpoints)).Int("retry", retryCount).Msg("Retrying HTTP request with different endpoint after 400")
-		continue
-	}
-
 	if len(allEndpoints) > 0 && retryCount < s.maxRetries {
-		log.Debug().Str("chain", chain).Str("failed_endpoint", endpoint.ID).Int("public_attempt_count", publicAttemptCount).Int("remaining_endpoints", len(allEndpoints)).Int("retry", retryCount).Msg("Retrying HTTP request with different endpoint")
+		reason := "error"
+		if first400Error != nil {
+			reason = "400"
+		}
+		log.Debug().Str("chain", chain).Str("failed_endpoint", endpoint.ID).Str("reason", reason).Int("remaining_endpoints", len(allEndpoints)).Int("retry", retryCount).Msg("Retrying HTTP request with different endpoint")
 		continue
 	}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `internal/health/checker.go` (1 hunks)
- `internal/server/server.go` (8 hunks)
🔇 Additional comments (6)
internal/server/server.go (6)
37-47: LGTM! The `BadRequestError` type is well-designed to carry the necessary information (status code, body, and headers) for pass-through scenarios.
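For reference, a minimal sketch of what such an error type looks like, based on the fields referenced throughout this review (the exact definition in `internal/server/server.go` may differ):

```go
package server

import "net/http"

// BadRequestError carries an upstream 400 response's details so callers can
// either retry with another endpoint or pass the response through to the
// client. Field names follow the ones used in the review diffs; treat this
// as an illustrative sketch rather than the actual source.
type BadRequestError struct {
	StatusCode int         // HTTP status code (400)
	Message    string      // e.g. "HTTP 400: 400 Bad Request"
	Body       []byte      // upstream response body, for pass-through
	Headers    http.Header // upstream headers, minus CORS headers on write
}

// Error implements the error interface.
func (e *BadRequestError) Error() string {
	return e.Message
}
```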
423-424: LGTM! The variables correctly track the first 400 error for the two-endpoint confirmation logic.
514-522: LGTM! Correctly marks the first endpoint (that returned 400) as unhealthy when a subsequent endpoint succeeds, confirming the 400 was a server-side issue rather than a user error.
1150-1177: LGTM! The WebSocket handshake error handling is correctly expanded to handle all non-2xx responses, with appropriate rate-limit handling for 429 and unhealthy marking for other error codes.
1096-1105: LGTM! Response headers are correctly copied on success, with CORS headers appropriately excluded to avoid duplication.
1115-1119: `shouldRetry` is tested but verify its integration in retry logic. The `shouldRetry` method is covered by tests in `server_test.go`, so it's not dead code. However, confirm whether this method is actually invoked during request handling or if the retry logic uses a different mechanism (such as `err != nil` checks).
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/server/server.go (1)
1063-1097: Consider reading response body only for 400 status codes. Currently, the response body is read for all non-2xx responses (line 1068), but it's only used for 400 errors (line 1091). For 429 and other non-2xx responses, the body is read and immediately discarded, which is inefficient, especially for large error responses.
Consider restructuring to read the body only when needed:
```diff
 // Check for non-2xx status codes, all of them should trigger retries
 if resp.StatusCode < 200 || resp.StatusCode >= 300 {
 	chain, endpointID, found := s.findChainAndEndpointByURL(targetURL)

-	// Read response body for logging and potential passing through
-	respBodyBytes, readErr := io.ReadAll(resp.Body)
-	if readErr != nil {
-		respBodyBytes = []byte{}
-	}
-
 	// Mark endpoint as unhealthy for any non-2xx response
 	if found {
 		if resp.StatusCode == 429 {
 			// For 429 (Too Many Requests), use the rate limit handler
 			s.markEndpointUnhealthyProtocol(chain, endpointID, "http")
 			s.handleRateLimit(chain, endpointID, "http")
 			log.Debug().Str("url", helpers.RedactAPIKey(targetURL)).Int("status_code", resp.StatusCode).Msg("Endpoint returned 429 (Too Many Requests), handling rate limit")
 		} else {
 			s.markEndpointUnhealthyProtocol(chain, endpointID, "http")
 			log.Debug().Str("url", helpers.RedactAPIKey(targetURL)).Int("status_code", resp.StatusCode).Msg("Endpoint returned non-2xx status, marked unhealthy")
 		}
 	}

 	// Special handling for 400 Bad Request
 	if resp.StatusCode == 400 {
+		// Read response body for passing through
+		respBodyBytes, readErr := io.ReadAll(resp.Body)
+		if readErr != nil {
+			respBodyBytes = []byte{}
+		}
 		return &BadRequestError{
 			StatusCode: resp.StatusCode,
 			Message:    fmt.Sprintf("HTTP %d: %s", resp.StatusCode, resp.Status),
 			Body:       respBodyBytes,
 			Headers:    resp.Header,
 		}
 	}

 	// For all other non-2xx errors, return a generic error
 	return fmt.Errorf("HTTP %d: %s", resp.StatusCode, resp.Status)
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `internal/server/server.go` (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
`RedactAPIKey` (253-277)
🔇 Additional comments (4)
internal/server/server.go (4)
37-47: LGTM: Well-structured error type for 400 handling.

The `BadRequestError` type properly captures the necessary information (status, message, body, headers) to implement the special 400 pass-through logic described in the PR objectives.
423-554: LGTM: 400 caching logic correctly implements the PR requirements.

The implementation properly:

- Caches the first 400 response and retries with another endpoint
- Passes through to the user if a second endpoint also returns 400 (treating it as a true bad request)
- Marks the first endpoint as unhealthy only after a subsequent endpoint succeeds
- Passes through the cached 400 if all endpoints are exhausted
- Handles `w.Write` errors (lines 486-487, 550-551) as flagged in previous reviews
- Filters Access-Control-* headers to prevent duplication with CORS middleware
1120-1123: LGTM: Simplified retry logic aligns with the broader error handling strategy.

The updated `shouldRetry` function now treats all non-2xx responses as retryable, which is consistent with the changes in `defaultForwardRequestWithBodyFunc` and the PR objectives.
1154-1182: LGTM: Enhanced WebSocket error handling is comprehensive and consistent.

The expanded error handling for WebSocket handshakes properly:

- Distinguishes between rate-limit (429), other non-2xx, and network errors
- Marks endpoints unhealthy appropriately for each error type
- Returns `RateLimitError` for 429 responses, allowing the caller to trigger rate-limit recovery logic
- Maintains consistency with the HTTP error handling approach
Actionable comments posted: 0
🧹 Nitpick comments (3)
internal/server/server.go (3)
423-555: Clarify behavior when only one 400 is seen and remaining attempts fail with non-400 errors.

The caching logic around `first400Error` works as intended for the "two 400s from different endpoints" case: the second 400 is passed through and the first endpoint is later marked unhealthy on a success. However, the fallback at the end:

- If any attempt produced a 400 (setting `first400Error`), and then all remaining attempts fail with non-400 errors or you exhaust retries, you still pass that single cached 400 through to the user (lines 536-554).

This means a sequence like:

- Endpoint A → 400
- Endpoint B/C → timeouts or 5xx

will surface A's 400 to the client even though you never got a confirming 400 from another backend. The PR description implies the stricter rule "forward 400 only after observing it from at least two endpoints"; if that's the intended contract, you may want an explicit "second 400 seen" flag (a sketch follows below) and only use the cached 400 when that flag is set, otherwise fall back to the existing 5xx/"all endpoints failed" behavior.
If the current behavior is intentional (i.e., “use the 400 if it’s the only concrete response we got”), a short comment explaining this choice near the fallback would help future maintainers understand why a single 400 is treated as authoritative once all other options are exhausted.
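A minimal sketch of that stricter rule, using a hypothetical helper (`confirmBadRequest` does not exist in the PR; it only illustrates the confirmation idea):

```go
package server

import "errors"

// confirmBadRequest inspects the errors collected from successive endpoint
// attempts and returns the cached 400 only when at least two attempts
// produced one; a single 400 is not treated as authoritative.
func confirmBadRequest(attemptErrs []error) (*BadRequestError, bool) {
	var first *BadRequestError
	for _, attemptErr := range attemptErrs {
		var badReq *BadRequestError
		if !errors.As(attemptErr, &badReq) {
			continue // timeouts, 5xx, etc. don't count toward confirmation
		}
		if first == nil {
			first = badReq // cache the first 400 and keep looking
			continue
		}
		return first, true // second 400 seen: treat as a genuine client error
	}
	return first, false // zero or one 400: fall back to the 502 path
}
```

The caller would pass the cached 400 through only when the second return value is true, and otherwise use the existing 5xx/"all endpoints failed" response.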
1063-1097: Consider avoiding health penalties for endpoints on client-side 4xx, especially confirmed 400s.

Marking endpoints unhealthy for all non-2xx responses centralizes the retry behavior nicely, and the special 400 case with `BadRequestError` is a good fit for the higher-level logic. One side effect, though, is:

- Every 4xx (including 400s ultimately treated as client errors) increments failure state via `markEndpointUnhealthyProtocol`.
- A burst of bad client requests can therefore push otherwise healthy endpoints toward the failure threshold, especially if traffic is skewed.
Given you now distinguish 400s enough to surface them as client errors, it might be worth reconsidering whether 4xx — or at least 400s that you decide to pass through — should count as health “failures” at this layer, versus only penalizing 5xx/429/network issues and letting successes reset the counters.
If you choose to adjust this, one approach is to skip `markEndpointUnhealthyProtocol` in this function when `resp.StatusCode == 400`, and instead have the calling logic decide when a 400 implies a misbehaving endpoint (e.g., "first endpoint 400, second succeeds") versus a genuine client error. A rough sketch of that split follows.
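The helper name below is hypothetical; `markEndpointUnhealthyProtocol` and `handleRateLimit` are the methods referenced elsewhere in this review:

```go
package server

// recordFailureForStatus shows one way to keep 400s out of the health
// accounting: only 429 and other non-2xx responses penalize the endpoint,
// while a 400 is left to the caller's two-endpoint confirmation logic.
func (s *Server) recordFailureForStatus(chain, endpointID string, statusCode int) {
	switch {
	case statusCode == 400:
		// Possibly a genuine client error; don't penalize the endpoint here.
	case statusCode == 429:
		// Rate limited: mark unhealthy and trigger the rate-limit handler.
		s.markEndpointUnhealthyProtocol(chain, endpointID, "http")
		s.handleRateLimit(chain, endpointID, "http")
	default:
		// Remaining non-2xx responses count as endpoint-side failures.
		s.markEndpointUnhealthyProtocol(chain, endpointID, "http")
	}
}
```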
1153-1177: WebSocket handshake error handling is sound; consolidate redundant rate-limit calls.

The handshake branch correctly differentiates 429 during dial, marks the endpoint unhealthy, and invokes the rate-limit handler. Returning `RateLimitError` allows `handleRequestWS` to apply its own retry logic, and marking endpoints unhealthy for other non-2xx statuses and network errors is appropriate.

The concern about `resp` reliability is addressed: gorilla/websocket v1.5.3 does return a non-nil `*http.Response` with a status code on HTTP handshake failures (4xx/5xx), so your assumption is sound.

However, `handleRateLimit` is now called twice for 429 errors: once in the dialer and again in `handleRequestWS` when it sees the `RateLimitError`. This causes duplicate logging and redundant rate-limit state writes. Consider consolidating rate-limit side effects in one place (either the dialer or the caller) and using the typed error purely as a signal; one possible caller-side shape is sketched below.
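A sketch of the caller-side option, assuming the dialer stops calling `handleRateLimit` itself and only returns the typed error, and that `RateLimitError` is used as a pointer type the way `BadRequestError` is (the helper name is hypothetical):

```go
package server

import "errors"

// onWSDialError centralizes the side effects for a failed backend handshake:
// the dialer only signals the condition via a typed error, and this function
// records rate-limit state exactly once.
func (s *Server) onWSDialError(chain, endpointID string, dialErr error) {
	var rlErr *RateLimitError
	if errors.As(dialErr, &rlErr) {
		s.markEndpointUnhealthyProtocol(chain, endpointID, "ws")
		s.handleRateLimit(chain, endpointID, "ws")
		return
	}
	// Other handshake failures: mark unhealthy and let the retry loop move on.
	s.markEndpointUnhealthyProtocol(chain, endpointID, "ws")
}
```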
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `internal/server/server.go` (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
`RedactAPIKey` (253-277)
🔇 Additional comments (2)
internal/server/server.go (2)
37-47: BadRequestError shape matches the intended 400 pass-through use case.

The exported struct plus `Error()` method give callers enough detail (status, body, headers) while still behaving like a normal `error`. This aligns well with the later 400-caching logic; no changes needed here.
1120-1121: `shouldRetry` semantics now align with the new forwarding behavior.

Updating `shouldRetry` to treat any non-2xx as retryable matches the behavior in `defaultForwardRequestWithBodyFunc` and the PR's goal of handling 4xx internally rather than surfacing them directly. As long as callers bound retries using `maxRetries` (as in this file), this looks consistent.
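For reference, the described semantics boil down to a status check like this (a stand-in sketch, not the method's actual signature):

```go
package server

// shouldRetryStatus mirrors the described semantics: any response outside the
// 2xx range is considered retryable, and the caller bounds the number of
// attempts with maxRetries.
func shouldRetryStatus(statusCode int) bool {
	return statusCode < 200 || statusCode >= 300
}
```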
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/server/server.go (1)
1215-1231: 400 handling looks correct, minor simplification possible.

The logic correctly returns `BadRequestError` for 400 responses without marking health, deferring to the caller's retry logic. The body reading could be slightly simplified:
```diff
 if resp.StatusCode == 400 {
 	// Read response body for logging and passing through
-	respBodyBytes := []byte{}
-	if resp.Body != nil {
-		bodyBytes, readErr := io.ReadAll(resp.Body)
-		if readErr == nil {
-			respBodyBytes = bodyBytes
-		}
-	}
+	var respBodyBytes []byte
+	if resp.Body != nil {
+		respBodyBytes, _ = io.ReadAll(resp.Body)
+	}
 	log.Debug().Str("url", helpers.RedactAPIKey(backendURL)).Int("status_code", resp.StatusCode).Msg("WebSocket handshake returned 400 Bad Request")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `internal/server/server.go` (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
`RedactAPIKey` (253-277)
🔇 Additional comments (12)
internal/server/server.go (12)
37-47: LGTM! The `BadRequestError` struct follows the existing `RateLimitError` pattern and captures all necessary details (status code, message, body, headers) for proper error propagation and client response construction.
423-424: LGTM! Variables properly initialized to track the first 400 error and its source endpoint for the retry-based validation logic.
464-491: LGTM! The 400 handling logic correctly:
- Caches the first 400 error and endpoint ID for retry validation
- On a second 400 from a different endpoint, confirms it's a client error and forwards to user
- Properly copies headers (excluding CORS) and logs any write errors
506-514: LGTM! The separate continuation checks for 400-specific retries vs. general error retries provide clearer debugging context with distinct log messages.
517-523: LGTM! Correctly marks the first endpoint as unhealthy when the second endpoint succeeds, confirming the 400 was endpoint-specific rather than a client error.
576-577: LGTM! Consistent with the HTTP handler's approach to tracking first 400 errors.
617-662: LGTM! The WebSocket 400 handling mirrors the HTTP approach. The inline endpoint removal at lines 645-653 duplicates the pattern at lines 710-718, but this is acceptable given the different control flow requirements for the 400 case.
726-732: LGTM! Consistent with the HTTP handler's success-after-400 logic, correctly using "ws" protocol for endpoint health marking.
1100-1137: LGTM with a note on behavior scope. The implementation correctly:
- Defers 400 health decisions to the caller's retry logic
- Marks endpoints unhealthy for all other non-2xx responses
- Handles 429 specially for rate limiting
Note that all non-2xx responses (except 400) now trigger endpoint health marking and retries. This is broader than just 401/403 mentioned in the PR description but provides consistent behavior. Verify this aligns with intended design.
1140-1149: LGTM! Properly copies backend response headers to the client while filtering CORS headers to avoid conflicts with the server's CORS middleware.
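The general pattern, written as a hypothetical standalone helper (the real code does this inline):

```go
package server

import (
	"net/http"
	"strings"
)

// copyBackendHeaders copies upstream response headers to the client but skips
// Access-Control-* headers so they don't clash with the server's CORS
// middleware, which owns those headers.
func copyBackendHeaders(w http.ResponseWriter, upstream http.Header) {
	for key, values := range upstream {
		if strings.HasPrefix(key, "Access-Control-") {
			continue
		}
		for _, value := range values {
			w.Header().Add(key, value)
		}
	}
}
```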
1161-1162: LGTM! The updated logic correctly broadens retry behavior to all non-2xx responses, aligning with the new error handling strategy.
1233-1244: LGTM! Correctly marks endpoints unhealthy for non-2xx handshake responses and network errors, with appropriate fallback logging when the endpoint can't be identified.
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/server/server.go (2)
423-525: 400 retry/pass-through semantics: consider equality check and exhaustion behavior.

The overall flow (cache first 400 via `BadRequestError`, retry other endpoints, pass through on second 400, and mark the first endpoint unhealthy on later success) matches the intent to treat repeated 400s as likely client errors rather than endpoint failures. Two edge cases you may want to tighten:
- You currently treat “any two endpoints that both return 400” as conclusive client error. If the goal is “same 400 from two endpoints”, consider comparing status + body before deciding to pass through, e.g.:
```diff
-	if badReqErr, ok := err.(*BadRequestError); ok {
-		if first400Error == nil {
+	if badReqErr, ok := err.(*BadRequestError); ok {
+		if first400Error == nil {
 			// First 400 response. Cache it and retry with next endpoint.
 			first400Error = badReqErr
 			first400EndpointID = endpoint.ID
 			log.Debug().Str("endpoint", endpoint.ID).Msg("First endpoint returned 400, will retry with next endpoint")
-		} else {
-			// Second endpoint also returned 400. This is the user's fault, pass it through.
+		} else if badReqErr.StatusCode == first400Error.StatusCode && bytes.Equal(badReqErr.Body, first400Error.Body) {
+			// Second endpoint returned the same 400. Treat as client error and pass it through.
 			log.Debug().Str("first_endpoint", first400EndpointID).Str("second_endpoint", endpoint.ID).Msg("Both endpoints returned 400, passing through to the user.")
```
- If you see a single 400 and then exhaust endpoints/maxRetries with only non-400 failures, the cached 400 is ignored and the user gets a generic 502. If you'd prefer to surface that 400 as a stronger client-signal once there are no more backends to try, you could fall back to `first400Error` before returning 502, e.g.:

```diff
-	// If we get here, all retries failed
-	if retryCount >= s.maxRetries {
+	// If we get here, all retries failed. If we observed a cached 400, prefer returning that
+	// to the client as a likely bad request signal.
+	if first400Error != nil {
+		for key, values := range first400Error.Headers {
+			if strings.HasPrefix(key, "Access-Control-") {
+				continue
+			}
+			for _, value := range values {
+				w.Header().Add(key, value)
+			}
+		}
+		w.WriteHeader(first400Error.StatusCode)
+		if _, err := w.Write(first400Error.Body); err != nil {
+			log.Debug().Err(err).Msg("Failed to write cached 400 response body to client")
+		}
+		return
+	}
+
+	// If we get here, all retries failed without a definitive 400
+	if retryCount >= s.maxRetries {
 		log.Error().Str("chain", chain).Strs("tried_endpoints", triedEndpoints).Int("max_retries", s.maxRetries).Msg("Max retries reached")
 		http.Error(w, "Max retries reached, all endpoints unavailable", http.StatusBadGateway)
 	} else {
```

If the current behavior is intentional, this can be left as-is; the above is mostly about making the "same 400 twice" heuristic explicit in code.
Also applies to: 506-510, 1100-1121
1100-1121: Non-2xx HTTP handling is consistent; consider reusing an HTTP client.

The new behavior in `defaultForwardRequestWithBodyFunc` (treat all non-2xx as errors, special-case 400 via `BadRequestError`, and copy upstream headers on success while skipping CORS) looks coherent and lines up with the higher-level retry logic.

One small performance/connection-reuse improvement to consider (not blocking):

- A new `http.Client` is created per forwarded request. Since `NewRequestWithContext` already enforces timeouts via the context, you could keep a single `proxyHTTPClient *http.Client` on `Server` (similar to `healthCheckClient`) and reuse it here to avoid repeated allocations and to better leverage connection pooling:

```diff
 type Server struct {
 @@
-	healthCheckClient *http.Client
+	healthCheckClient *http.Client
+	proxyHTTPClient   *http.Client
 @@
 func NewServer(cfg *config.Config, valkeyClient store.ValkeyClientIface, appConfig *helpers.LoadedConfig) *Server {
 	s := &Server{
 @@
 	s.healthCheckClient = &http.Client{
 		Transport: transport,
 		Timeout:   2 * time.Second,
 	}
+
+	// Reusable HTTP client for proxying requests; relies on per-request context for timeouts.
+	s.proxyHTTPClient = &http.Client{
+		Transport: transport,
+	}
 @@
 func (s *Server) defaultForwardRequestWithBodyFunc(w http.ResponseWriter, ctx context.Context, method, targetURL string, bodyBytes []byte, headers http.Header) error {
 @@
-	// Forward the request. Use the context timeout instead of a fixed timeout.
-	client := &http.Client{}
-	resp, err := client.Do(req)
+	// Forward the request. Use the context timeout instead of a fixed timeout.
+	resp, err := s.proxyHTTPClient.Do(req)
```

This keeps behavior the same while reducing per-request overhead.
Also applies to: 1123-1137, 1140-1149
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `internal/server/server.go` (9 hunks)
- `internal/server/server_test.go` (0 hunks)
💤 Files with no reviewable changes (1)
- internal/server/server_test.go
🧰 Additional context used
🧬 Code graph analysis (1)
internal/server/server.go (1)
internal/helpers/helpers.go (1)
`RedactAPIKey` (253-277)
🔇 Additional comments (1)
internal/server/server.go (1)
37-47: BadRequestError structure and Error implementation look good.

The fields and `Error()` method provide enough context (status, message, body, headers) for both HTTP and WebSocket callers to make routing and logging decisions; no changes needed here.
```go
var first400Error *BadRequestError
var first400EndpointID string
```
🧩 Analysis chain
🏁 Script executed:
```shell
wc -l internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 102
🏁 Script executed:
```shell
sed -n '570,590p' internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 690
🏁 Script executed:
```shell
sed -n '1175,1235p' internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 2855
🏁 Script executed:
```shell
sed -n '617,670p' internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 2536
🏁 Script executed:
```shell
sed -n '720,740p' internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 1451
🌐 Web query:
Go http.ResponseWriter after websocket.Upgrader.Upgrade - is connection hijacked, can you call WriteHeader or Write after upgrade?
💡 Result:
Short answer: Yes — Upgrade hijacks the underlying connection, so you must not call ResponseWriter.WriteHeader or Write after Upgrade; use the returned *websocket.Conn instead. If you do call Write/WriteHeader after the hijack you’ll get errors like "http: response.Write on hijacked connection". Also note hijack is only supported for HTTP/1.x (HTTP/2 ResponseWriters don’t implement Hijacker). [1][2]
Sources:
- gorilla/websocket Upgrade docs (Upgrade calls hijack as part of the handshake). [1]
- Explanation and common error ("response.Write on hijacked connection") on Stack Overflow. [2]
🌐 Web query:
gorilla/websocket Dial error - do I need to close resp.Body when Dial fails?
💡 Result:
No — you do not need to close resp.Body when gorilla/websocket Dial returns an error. The Dialer returns a non-nil *http.Response on handshake failure for caller inspection, but the response body may be incomplete and “does not need to be closed by the application.” [1][2]
Sources:
- gorilla/websocket package docs. [1]
- gorilla/websocket client.go (source). [2]
🏁 Script executed:
```shell
sed -n '560,580p' internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 877
🏁 Script executed:
```shell
sed -n '1170,1185p' internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 617
🏁 Script executed:
```shell
# Get the full function signature and structure to understand flow
grep -B 30 "var first400Error" internal/server/server.go | head -40
```

Repository: project-aethermesh/aetherlay
Length of output: 1515
🏁 Script executed:
```shell
# Understand the exact function this code is in
sed -n '540,580p' internal/server/server.go
```

Repository: project-aethermesh/aetherlay
Length of output: 1670
WebSocket handshake error handling: Upgrade happens too early in the retry loop, making HTTP response writes invalid.
The WS path attempts to mirror HTTP 400 logic, but there's a critical architectural issue:
- Premature Upgrade prevents HTTP error responses in the retry loop.
  - `defaultProxyWebSocket` upgrades the client connection immediately via `upgrader.Upgrade(w, r, nil)` (line 1178). The connection is hijacked and `http.ResponseWriter` becomes invalid for further HTTP operations.
  - However, `handleRequestWS` calls `defaultProxyWebSocket` on each retry iteration (lines 576+). If the backend handshake fails with a 400 error on the first endpoint, the function returns a `BadRequestError` to `handleRequestWS`. `handleRequestWS` then attempts to write HTTP headers and body via `w.Header().Add`, `w.WriteHeader`, and `w.Write` (lines 638–641) to pass through the error to the client. But since the connection was already hijacked by the Upgrade call in the previous `defaultProxyWebSocket` invocation, these writes will fail with "response.Write on hijacked connection" errors.
  - To fix this, the backend handshake should be completed (and its status code checked) before calling `Upgrade`. Only call `Upgrade` once you confirm the backend will accept the connection, or restructure so `defaultProxyWebSocket` does not upgrade on retry attempts that fail.
- Response body closure on handshake error (lines 1203–1207) is not required. Per gorilla/websocket documentation, `resp.Body` does not need to be closed by the application when `Dial` returns an error. The current code reads the body without closing it, which is acceptable per the library's contract and does not cause resource leaks.
Given the hijack issue, the WS 400 pass-through path will not work as written. Consider redesigning the flow to probe or perform the backend handshake before upgrading the client connection.
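A rough sketch of such a probe-first flow, assuming the surrounding retry loop handles the returned errors (the function name is illustrative, header forwarding and message relaying are omitted):

```go
package server

import (
	"fmt"
	"net/http"

	"github.com/gorilla/websocket"
)

// proxyWebSocketProbeFirst dials the backend and checks the handshake status
// before upgrading the client. Because the client connection is not hijacked
// until the backend has accepted, the caller can still write HTTP error
// responses (including a cached 400 pass-through) on failed attempts.
func proxyWebSocketProbeFirst(w http.ResponseWriter, r *http.Request, backendURL string, upgrader websocket.Upgrader) error {
	// 1. Probe the backend handshake first (forwarding selected client headers is omitted).
	backendConn, resp, err := websocket.DefaultDialer.Dial(backendURL, nil)
	if err != nil {
		if resp != nil && resp.StatusCode == 400 {
			// Signal a possible client error; the caller decides whether to
			// retry another endpoint or pass the 400 through over plain HTTP.
			return &BadRequestError{StatusCode: resp.StatusCode, Message: fmt.Sprintf("HTTP %d: %s", resp.StatusCode, resp.Status), Headers: resp.Header}
		}
		return err
	}
	defer backendConn.Close()

	// 2. Only now hijack the client connection; failed backend handshakes
	//    never reach this point, so retries can still respond over HTTP.
	clientConn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return err
	}
	defer clientConn.Close()

	// 3. Relay messages between clientConn and backendConn (omitted).
	return nil
}
```

Deferring the upgrade until the backend handshake succeeds is what keeps the cached-400 pass-through path writable on the original `http.ResponseWriter`.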
🤖 Prompt for AI Agents
internal/server/server.go around lines 576-577: the WebSocket upgrade is being
performed inside defaultProxyWebSocket on each retry which hijacks the client
connection and makes subsequent HTTP error writes invalid; change the flow so
the backend handshake (dialing the backend and checking its HTTP status) is
completed and its status code examined before calling upgrader.Upgrade on the
client side—either by refactoring defaultProxyWebSocket to separate "probe
backend handshake without upgrading client" and only upgrade after a successful
backend response, or by moving the upgrade out of the retry loop so retries only
probe the backend; also remove or leave unchanged any explicit resp.Body.Close
on dial errors per gorilla/websocket contract (no body close required) and
ensure error responses are written to the client only when the connection has
not been hijacked.
Some 4xx errors weren't being handled as errors, on the assumption that, being technically user errors, they should be returned to the user. However, some errors, like 401 and 403, can be received by aetherlay because of problems with aetherlay itself as a user of the endpoints; from the end user's point of view these effectively become server errors and have to be handled accordingly.
There's an exception, though: a 400 error is checked for recurrence. If aetherlay sees the same 400 error coming from 2 different endpoints for the same request, that response is passed to the user, since we assume it's truly a bad request sent by the end user (e.g., wrong payload or invalid signature). I used the `BadRequestError` struct type for this.

Summary by CodeRabbit

- Bug Fixes
- Tests
- Chores