[COR-23048] memcache: split TCP-connect / TLS-handshake timeouts + add dial hook#7
Open
ethanlin-coding wants to merge 1 commit into
Open
[COR-23048] memcache: split TCP-connect / TLS-handshake timeouts + add dial hook#7ethanlin-coding wants to merge 1 commit into
ethanlin-coding wants to merge 1 commit into
Conversation
A single DialTimeout budgets the TCP connect and the TLS handshake
together. When a server is CPU-bound the handshake slows down and can
exceed that budget, so the client aborts it mid-flight ("closed during
TLS handshake") — which, combined with immediate redial, exhausted FDs
on a Verkada ElastiCache node and required a manual reboot to recover.
Add independent ConnectTimeout and HandshakeTimeout fields. When either
is set, dial() takes a split path: net.Dialer.DialContext bounds the TCP
connect, then tls.Client(...).Handshake() runs under its own deadline.
This lets the connect budget stay tight (fail dead hosts fast) while the
handshake budget is generous enough to ride out a slow-but-healthy
server. When neither is set, the original single-budget path is used
unchanged (verified by existing tests).
Also add an optional OnDial(DialInfo) hook reporting connect/handshake
durations, TLS-resumption status, and outcome, so callers can emit dial
metrics without the library taking a metrics dependency. The hook runs
on a separate goroutine (so it can never add dial latency) with a
recover() guard (so a panicking hook cannot crash the process).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
00edca4 to
146a696
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lets callers budget the TCP connect and the TLS handshake independently, and adds an optional dial-outcome hook for metrics. Fully backward compatible: when the new fields are unset, dial behavior is unchanged.
Motivation
DialTimeoutis a single budget covering both the TCP connection and (for TLS) the handshake —tls.DialWithDialerapplies the dialer's timeout to "connection and TLS handshake as a whole." TCP connect intra-AZ is sub-millisecond, but a full TLS 1.3 handshake (ECDHE + cert verify) can be slow when the server is CPU-bound. A single tight budget therefore aborts handshakes mid-flight under load.This bit a downstream consumer (Verkada
vaccel): a connection disruption to a memcache node triggered full-handshake reconnects, which spiked server CPU, which pushed handshakes past the 175msDialTimeout, which aborted them mid-handshake ("closed during TLS handshake") — and combined with immediate redials this exhausted the node's file descriptors and required a manual reboot to recover.Changes
ConnectTimeout/HandshakeTimeout(newClientfields). When either is set,dial()takes a split path:net.Dialer.DialContextbounds the TCP connect, thentls.Client(conn, cfg).Handshake()runs under its own deadline. This lets the connect budget stay tight (fail a dead host fast) while the handshake budget is generous enough to ride out a slow-but-healthy server. When both are zero, the original single-DialTimeoutpath is used unchanged.OnDial func(DialInfo)(optional hook). Invoked after every connection attempt with connect/handshake durations, TLS-resumption status (DidResume), and outcome — so callers can emit dial-count / connect-duration / handshake-duration / resumption metrics without this library taking a metrics dependency. Runs on a separate goroutine (never adds dial latency) with arecover()guard (a panicking hook can't crash the process).Testing
go test -race ./memcache/— passes, including existing legacy-path tests (TestDialTimeoutMethod,TestBackwardCompatibility,TestSeparateTimeouts, TLS tests).TestUseSplitTimeouts,TestSplitTimeoutFallbacks,TestOnDialHookSuccess,TestOnDialHookConnectFailure,TestOnDialHookPanicRecovered.Backward compatibility
No behavior change for existing callers:
ConnectTimeout/HandshakeTimeout/OnDialall default to zero/nil, in which casedial()uses the pre-existing single-budget path. Verified by the unchanged legacy tests.Downstream follow-up
Once merged, the
vaccelconsumer will bump its pin, exposeWithTLSConnectTimeout/WithTLSHandshakeTimeout, setOnDialto emit dial metrics, and replace its single 500msmemcached_connection_timeoutwith a tight connect + generous handshake budget.Linear
Relates to COR-23048
🤖 Generated with Claude Code