Skip to content

[COR-23048] memcache: split TCP-connect / TLS-handshake timeouts + add dial hook#7

Open
ethanlin-coding wants to merge 1 commit into
masterfrom
ethanlin/cor-tls-split-handshake-timeout
Open

[COR-23048] memcache: split TCP-connect / TLS-handshake timeouts + add dial hook#7
ethanlin-coding wants to merge 1 commit into
masterfrom
ethanlin/cor-tls-split-handshake-timeout

Conversation

@ethanlin-coding
Copy link
Copy Markdown

@ethanlin-coding ethanlin-coding commented May 29, 2026

Summary

Lets callers budget the TCP connect and the TLS handshake independently, and adds an optional dial-outcome hook for metrics. Fully backward compatible: when the new fields are unset, dial behavior is unchanged.

Motivation

DialTimeout is a single budget covering both the TCP connection and (for TLS) the handshake — tls.DialWithDialer applies the dialer's timeout to "connection and TLS handshake as a whole." TCP connect intra-AZ is sub-millisecond, but a full TLS 1.3 handshake (ECDHE + cert verify) can be slow when the server is CPU-bound. A single tight budget therefore aborts handshakes mid-flight under load.

This bit a downstream consumer (Verkada vaccel): a connection disruption to a memcache node triggered full-handshake reconnects, which spiked server CPU, which pushed handshakes past the 175ms DialTimeout, which aborted them mid-handshake ("closed during TLS handshake") — and combined with immediate redials this exhausted the node's file descriptors and required a manual reboot to recover.

Changes

  • ConnectTimeout / HandshakeTimeout (new Client fields). When either is set, dial() takes a split path: net.Dialer.DialContext bounds the TCP connect, then tls.Client(conn, cfg).Handshake() runs under its own deadline. This lets the connect budget stay tight (fail a dead host fast) while the handshake budget is generous enough to ride out a slow-but-healthy server. When both are zero, the original single-DialTimeout path is used unchanged.
  • OnDial func(DialInfo) (optional hook). Invoked after every connection attempt with connect/handshake durations, TLS-resumption status (DidResume), and outcome — so callers can emit dial-count / connect-duration / handshake-duration / resumption metrics without this library taking a metrics dependency. Runs on a separate goroutine (never adds dial latency) with a recover() guard (a panicking hook can't crash the process).

Testing

  • go test -race ./memcache/ — passes, including existing legacy-path tests (TestDialTimeoutMethod, TestBackwardCompatibility, TestSeparateTimeouts, TLS tests).
  • New tests: TestUseSplitTimeouts, TestSplitTimeoutFallbacks, TestOnDialHookSuccess, TestOnDialHookConnectFailure, TestOnDialHookPanicRecovered.

Backward compatibility

No behavior change for existing callers: ConnectTimeout/HandshakeTimeout/OnDial all default to zero/nil, in which case dial() uses the pre-existing single-budget path. Verified by the unchanged legacy tests.

Downstream follow-up

Once merged, the vaccel consumer will bump its pin, expose WithTLSConnectTimeout/WithTLSHandshakeTimeout, set OnDial to emit dial metrics, and replace its single 500ms memcached_connection_timeout with a tight connect + generous handshake budget.

Linear

Relates to COR-23048

🤖 Generated with Claude Code

@ethanlin-coding ethanlin-coding requested a review from a team as a code owner May 29, 2026 03:08
A single DialTimeout budgets the TCP connect and the TLS handshake
together. When a server is CPU-bound the handshake slows down and can
exceed that budget, so the client aborts it mid-flight ("closed during
TLS handshake") — which, combined with immediate redial, exhausted FDs
on a Verkada ElastiCache node and required a manual reboot to recover.

Add independent ConnectTimeout and HandshakeTimeout fields. When either
is set, dial() takes a split path: net.Dialer.DialContext bounds the TCP
connect, then tls.Client(...).Handshake() runs under its own deadline.
This lets the connect budget stay tight (fail dead hosts fast) while the
handshake budget is generous enough to ride out a slow-but-healthy
server. When neither is set, the original single-budget path is used
unchanged (verified by existing tests).

Also add an optional OnDial(DialInfo) hook reporting connect/handshake
durations, TLS-resumption status, and outcome, so callers can emit dial
metrics without the library taking a metrics dependency. The hook runs
on a separate goroutine (so it can never add dial latency) with a
recover() guard (so a panicking hook cannot crash the process).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ethanlin-coding ethanlin-coding force-pushed the ethanlin/cor-tls-split-handshake-timeout branch from 00edca4 to 146a696 Compare May 29, 2026 03:51
@linear
Copy link
Copy Markdown

linear Bot commented May 29, 2026

COR-23048

@ethanlin-coding ethanlin-coding changed the title memcache: split TCP-connect / TLS-handshake timeouts + add dial hook [COR-23048] memcache: split TCP-connect / TLS-handshake timeouts + add dial hook May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant