Skip to content

Skip listener reload when no certificate binding actually changed#591

Closed
NiccoMlt wants to merge 16 commits into
diennea:masterfrom
NiccoMlt:dcm-reload-guard
Closed

Skip listener reload when no certificate binding actually changed#591
NiccoMlt wants to merge 16 commits into
diennea:masterfrom
NiccoMlt:dcm-reload-guard

Conversation

@NiccoMlt
Copy link
Copy Markdown
Contributor

@NiccoMlt NiccoMlt commented May 29, 2026

This PR was opened against the wrong target (upstream instead of the maintainer's fork) and was closed for that reason. The original description has been redacted.

NiccoMlt and others added 16 commits May 27, 2026 16:26
Forks now publish to their own GitHub Packages registry instead of failing with a 403 against the upstream registry.
…quests (#9)

A malformed request yields an empty path that reactor-netty's metrics handler
passes to substring(1), throwing on the event loop; map empty URIs to "/".
exceedsCapacity returned true when a backend was under its safe capacity,
so cold backends within limits were deprioritized and overloaded ones preferred;
compare connections against the limit the right way round.
* fix: preserve cached client SSL contexts across config reload

The already-loaded branch never copied the existing context into the new map,
so a backend's custom CA was dropped on reload and its requests silently fell
back to the default trust store; carry the cached context forward.

* fix: reuse SSL context when multiple backends share a CA path

Without this, a fresh caCertificatePath shared by several backends would
trigger buildClientSslContextForCaFile once per backend during reload.

* fix: avoid TOCTOU on volatile clientSslContexts during reload

Two separate reads of the volatile field could race with a concurrent
reload or close() and yield a null put into newContexts.
doAfterResponseSuccess and onErrorResume could both fire when the backend
sent headers but the body then failed, decrementing the gauge and connection
count twice; do it once in doFinally on the returned publisher instead.
PR #13 moved the pending-request decrement into doFinally on the returned
publisher, so the dec now fires after sendResponseData completes — possibly
after the test client has read the response and the assertion runs. Wrap the
gauge assertion in TestUtils.waitForCondition so the test waits up to 10s
for the publisher terminal to land. Fixes the flake observed on Linux CI
for testNonHttpResponseThenClose; same race exists in StuckRequestsTest.
asProperties skips one char past the prefix to drop the dot separator, so
passing "zookeeper." with a trailing dot ate the first letter of every key;
pass the prefix without the dot, matching the db caller.
…#18)

* fix: isolate per-listener reload failures and shorten dispose timeout

Listeners.reloadConfiguration caught only InterruptedException, so an
IllegalStateException from disposeChannel's 10s blocking timeout or a bad
cert in bootListener would cascade and halt the rest of the reload;
observed in staging as ~296 disposeChannel timeouts cascading into ~50%
cert-manager mutex-acquire failures. Wrap each iteration in its own
try/catch and drop the dispose timeouts to 2s so they don't overlap the
30s cert-rotation cycle.

* fix: always close event-loop group on dispose and isolate Listeners.stop

disposeChannel skipped the channel-group close when disposeNow timed out,
leaking the DefaultEventExecutor and amplifying BUG-13; move that close
into a finally. Listeners.stop only caught InterruptedException, so a
RuntimeException from disposeChannel would abort stopping the remaining
listeners — match the per-listener isolation already applied to
reloadConfiguration.
…en CL.0 body-read (#14)

* fix: enforce HTTP/1.x request-smuggling checks at the proxy chokepoint

The CL.0/CL.TE check lived inside StandardEndpointMapper, so custom or test
mappers bypassed it; the CL.0 body-read also blocked the event loop without
ever catching the smuggle, since the decoder splits CL:0 + trailing bytes
into two pipelined requests. Run the check before map() and use a per-
connection taint flag to reject the next request after a Content-Length: 0.

* revert: drop the per-connection CL.0 taint

A legitimate Content-Length: 0 request followed by another request on the
same keep-alive connection is routine (CORS preflight, empty-body POSTs,
HEAD/DELETE/OPTIONS), so tainting the connection caused collateral 400s on
normal flows. The proxy layer cannot reliably distinguish a pipelined-burst
smuggle from sequential keep-alive reuse without channel-buffer inspection,
so keep only the header-only CL.TE check and accept that pure CL.0 smuggling
is not addressed at this layer.

* fix: smuggling check runs before filters can mutate CL/TE

Move the rejectAsSmuggling decision above the filter loop in processRequest
so a user-supplied request filter that strips or rewrites Content-Length or
Transfer-Encoding cannot bypass the check. The smuggling boolean is captured
on the raw inbound headers; filters and the mapper run as before.
processRequest's finally fired when the returned Publisher was assembled,
not when it completed, so access logs missed response status, backend
timing, and cache-hit info for proxied and cached requests; do the
logging once in doFinally on the returned publisher.
mapper, filters, currentConfiguration, and realm are written under
configurationLock during reload but read on the Netty hot path without it,
so the JMM does not guarantee readers ever see post-reload values; mark
them volatile.
decrementConnections checked > 0 then decrementAndGet as two separate atomic
ops; concurrent decrements or a set(0) from reportAsUnreachable could slip
between them and drive the counter negative. Use updateAndGet with a clamp
so the read, check, and decrement happen as one CAS.
…State> (#21)

* fix: make BackendHealthStatus transitions atomic via AtomicReference<State>

Status + unreachableSince + lastUnreachable + lastReachable used to be four
separate volatile fields written across multi-step transitions in
reportAsReachable / reportAsUnreachable, so concurrent callers (health-check
timer + Netty event loop on connection errors) could interleave and leave
torn end states like status=DOWN with unreachableSince=0. Hold the four
fields in a private immutable State record swapped through a single
AtomicReference.updateAndGet so each transition is one linearization point.

Also drop connections.set(0) from reportAsUnreachable: with the inc/dec
invariant restored (#13) and the atomic clamp (#17) in place, in-flight
decrements drain the counter naturally and SafeBackendSelector ignores
connections for DOWN backends anyway.

* fix: expose Snapshot accessor for atomic multi-field reads

BackendsResource called bhs.getStatus() twice and bhs.getUnreachableSince()
once on the same BackendHealthStatus, so two consecutive getter calls could
straddle a state transition and observe an inconsistent pair. Add a public
Snapshot record + snapshot() method that captures the four state-machine
fields atomically, and migrate the API reader to use it.
…loads (#19)

applyDynamicConfiguration wrote filters/realm/mapper/currentConfiguration
interleaved with subsystem reloads, so a failure mid-flight left request
readers seeing a torn old/new view (e.g. new filters with old mapper).
Build the new objects first, reload subsystems, and only then swap the
four fields back-to-back at the end of the try.
…standards (#22)

* feat: implement hop-by-hop header stripping for compliance with HTTP standards

* test: cover hop-by-hop stripping with an H2-capable backend

Adds a test variant that uses a Reactor Netty H2C HttpServer as the backend
so Carapace's outbound HttpClient negotiates H2 to it (rather than the
HTTP/1.1 WireMock case the other tests cover). Verifies the request still
succeeds and the backend never observes Keep-Alive — the original symptom
in the PR description was a StreamException on this exact path.
DynamicCertificatesManager.reloadCertificatesFromDB fires
server.getListeners().reloadCurrentConfiguration() on every poll and
on every ZK certificates_state_changed event, regardless of whether
the cert content actually changed. Combined with the reference-equality
short-circuit in Listeners.reloadConfiguration, this restarts every
SSL listener every 30 s while a single ACME order is stuck.

Compare the binding-relevant subset (keystore bytes and SANs) between
the previous and refreshed cert maps and trigger the listener reload
only when at least one entry actually changed. The Lombok-generated
equals on CertificateData is unsuitable because it includes
attemptsCount and message, which tick on stuck-ACME orders.

Also publish the certificates field as volatile and serialise the
three external callers (admin REST, two ZK callbacks) through
groupMembershipHandler.executeInMutex, matching the pattern that
already protects certificatesLifecycle. The internal call from
certificatesLifecycle bypasses the dispatcher to avoid mutex
reentrancy on the same thread.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 88e1e66d-48f6-4874-bb67-f3807ee938f3

📥 Commits

Reviewing files that changed from the base of the PR and between c5f67d9 and 3ef47f2.

📒 Files selected for processing (18)
  • .github/workflows/master-snapshot.yml
  • carapace-server/src/main/java/org/carapaceproxy/api/BackendsResource.java
  • carapace-server/src/main/java/org/carapaceproxy/configstore/CertificateData.java
  • carapace-server/src/main/java/org/carapaceproxy/core/HttpProxyServer.java
  • carapace-server/src/main/java/org/carapaceproxy/core/Listeners.java
  • carapace-server/src/main/java/org/carapaceproxy/core/ListeningChannel.java
  • carapace-server/src/main/java/org/carapaceproxy/core/ProxyRequestsManager.java
  • carapace-server/src/main/java/org/carapaceproxy/server/backends/BackendHealthManager.java
  • carapace-server/src/main/java/org/carapaceproxy/server/backends/BackendHealthStatus.java
  • carapace-server/src/main/java/org/carapaceproxy/server/certificates/DynamicCertificatesManager.java
  • carapace-server/src/main/java/org/carapaceproxy/server/mapper/StandardEndpointMapper.java
  • carapace-server/src/main/java/org/carapaceproxy/utils/HttpUtils.java
  • carapace-server/src/test/java/org/carapaceproxy/backends/StuckRequestsTest.java
  • carapace-server/src/test/java/org/carapaceproxy/backends/UnreachableBackendTest.java
  • carapace-server/src/test/java/org/carapaceproxy/configstore/CertificateDataBindingEquivalentTest.java
  • carapace-server/src/test/java/org/carapaceproxy/core/Http2HeadersTest.java
  • carapace-server/src/test/java/org/carapaceproxy/server/certificates/DynamicCertificatesManagerReloadGuardTest.java
  • carapace-server/src/test/java/org/carapaceproxy/utils/HttpUtilsTest.java

📝 Walkthrough

Walkthrough

This pull request refactors HTTP request/response header handling and backend health state management for improved correctness and consistency. It relocates HTTP/1.x request-smuggling validation from the mapper to the request manager, introduces hop-by-hop header stripping for both client requests and backend responses, refactors backend health tracking into atomic state transitions, marks configuration snapshot fields as volatile with a deferred-commit pattern to reduce inconsistency windows, and optimizes certificate-binding detection to gate unnecessary listener reloads. Tests are strengthened with gauge synchronization waits and header-stripping validation.

Poem

🐰 Hops through headers, stripe by stripe,
Atomic states keep health in type,
Config snapshots, now volatile, true—
No smuggling allowed, through and through!
Listeners reload when bindings change,
Rabbit runs the proxy exchange. 🌟

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@NiccoMlt
Copy link
Copy Markdown
Contributor Author

Closing — meant to target the NiccoMlt fork, not upstream. Will reopen there.

@NiccoMlt NiccoMlt closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant