feat(otel): OTel metrics phase 1 — SDK init, token counters, health state machine#32
Closed
lis186 wants to merge 10 commits into
…Tel metrics phase 1
Captures the design phase for adding optional OpenTelemetry metric export to ccxray. Phase 1 is metrics only; spans/drill-back deferred to Phase 2.
Includes:
- docs/otel-integration.html: design record with 11-risk pre-mortem, three integration plans, and recommended path (each accepted solution scored >= 9/10 on weighted criteria)
- docs/otel-change-walkthrough.html: visualization of the change with diagrams, animated data flow, and direct citations to spec sections
- openspec/changes/add-otel-metrics-phase1/: full proposal, design, six capability specs, and tasks (validated via openspec --strict)
Key constraints captured in spec:
- Default OFF (zero egress unless opted in)
- Three-tier opt-in (disabled / project-anonymous / personal-named) with project as upper bound and personal as lower bound
- ccxray.* namespace; coexists with Claude Code CLI OTel without double-counting; reconciliation diff metric surfaces accounting bugs
- Client-side emit (not hub); per-project tier/endpoint coexistence
- OTel failure never breaks the proxy: config errors fail fast, init errors degrade silently, runtime errors handled by bounded queue + circuit breaker
- Parser schema-ization with sentinel counters for drift detection
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/otel-change-walkthrough.html used Mermaid via CDN for the state machine and module dependency diagrams. Replaced with docs/otel-phase1-overview.html which renders the same diagrams as native SVG, eliminating the external dependency and making the file fully self-contained. Content is unchanged in substance — same 10 sections, same citations back to the OpenSpec change. Two diagrams (OTel health state machine and module dependency graph) re-authored in inline SVG with hand-laid arrow paths and the same color coding as the other diagrams on the page. Verified twice via subagent review against proposal.md, design.md, all six specs, and tasks.md — every metric name, numeric default, attribute list, state transition, and module annotation traces to source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndler fs is already imported at the top of server/index.js (line 5). The local require inside the catch block shadowed it without purpose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expert review (Sigelman / Majors / Sridharan) converged on dropping
ccxray.reconciliation.token_diff_pct{model} from Phase 1:
- Pre-aggregated diff gauge cannot answer "which request diverged"
- Legitimate non-zero diffs (SSE chunking, retries, prompt-cache edges)
produce alert fatigue
- Acquiring CLI counts in-process either couples ccxray to user storage
backends or turns ccxray into an OTLP receiver — violates instrumentation
neutrality and expands blast radius
Phase 1 now emits ccxray-internal invariants only (parser sums,
SSE truncation). Cross-source reconciliation moves to docs/otel-recon.md as
a downstream pattern (recording rules, sidecar, wide-event join on
request_id).
- tasks.md §4.7 reconciliation task removed
- tasks.md §4.8 replaced with internal invariants
- tasks.md §9.6 added (docs/otel-recon.md)
- specs/otel-export/spec.md requirement rewritten with 3 new scenarios
including explicit non-emit of ccxray.reconciliation.*
- proposal.md and design.md updated; design.md records the pivot rationale
openspec validate --strict passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls the --port / --hub-mode / --allow-upstream-loop / --no-browser flag detection and provider lookup out of server/index.js (793 LOC) into a new server/cli.js (63 LOC) so future CLI subcommands can be added without growing the entry-point file further.
Behaviour is preserved: parseArgs still mutates process.argv in place to strip consumed flags (matches existing assumptions), still exits on invalid --port values or unknown providers, and still derives DISPLAY_NAME through providers.getDisplayName.
Phase 0a-ii of the add-otel-metrics-phase1 OpenSpec change: clears space for §7 (status --otel / otel preview / parser report) without piling new subcommands onto an already long entry point.
Tests: 456 passing, 1 pre-existing Codex E2E failure (unchanged from baseline before this commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
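For reviewers who want a picture of the in-place stripping behaviour described above, here is a minimal sketch. It is not the actual server/cli.js: the flag names come from the commit message, while the return shape, the port range check, and the injectable `exit` parameter are assumptions made for illustration. Provider lookup is left out.

```javascript
// Hypothetical sketch of in-place flag stripping; not the real server/cli.js.
// The return shape, port range check, and injectable `exit` are assumptions.
const BOOLEAN_FLAGS = new Map([
  ['--hub-mode', 'hubMode'],
  ['--allow-upstream-loop', 'allowUpstreamLoop'],
  ['--no-browser', 'noBrowser'],
]);

function parseArgs(argv, exit = (code) => process.exit(code)) {
  const opts = { port: null, hubMode: false, allowUpstreamLoop: false, noBrowser: false };

  // Walk backwards so splice() never shifts an index we have yet to visit.
  for (let i = argv.length - 1; i >= 0; i--) {
    const arg = argv[i];
    if (BOOLEAN_FLAGS.has(arg)) {
      opts[BOOLEAN_FLAGS.get(arg)] = true;
      argv.splice(i, 1); // strip the consumed flag in place
    } else if (arg === '--port') {
      const port = Number(argv[i + 1]);
      if (!Number.isInteger(port) || port < 1 || port > 65535) {
        console.error(`Invalid --port value: ${argv[i + 1]}`);
        return exit(1);
      }
      opts.port = port;
      argv.splice(i, 2); // strip the flag and its value
    }
  }
  return opts;
}
```

Called as `parseArgs(process.argv)`, this leaves only the unconsumed arguments in `process.argv`, which is the property the commit says existing code relies on.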
Add server/emit.js — a minimal on/emit primitive with synchronous dispatch,
no subscribers by default (O(1) no-op for tier 0), and try/catch isolation
so a buggy subscriber cannot break the proxy path.
This is the "drum" for the OTel work (Phase 1 of the add-otel-metrics-phase1
plan): forward.js and store.js will emit events here in a later phase, and
the OTel SDK / parser sentinels will subscribe. Wiring callers is intentionally
deferred — this commit ships the contract only, keeping the surface review-able.
Defined events (payload shapes locked for Phase 1):
- entry_completed { entry }
- session_started { sessionId, provider, inferred }
- parser_unknown { provider, kind, token }
- parser_mismatch { type, expected, got, entryId? }
- parser_error { parser, errorType, message }
Refs: openspec/changes/add-otel-metrics-phase1 §6.1–6.3, §5.9–5.11
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
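The contract described in this commit (synchronous dispatch, no-op when nobody subscribes, try/catch isolation per subscriber) can be sketched roughly as follows. This is an assumed shape, not the actual server/emit.js; in particular, the unsubscribe function returned by `on` and the log format are illustrative additions.

```javascript
// Minimal on/emit sketch (assumed shape; not the actual server/emit.js).
const subscribers = new Map(); // event name -> array of handlers

function on(event, handler) {
  if (!subscribers.has(event)) subscribers.set(event, []);
  subscribers.get(event).push(handler);
  // Returning an unsubscribe function is an assumption, handy for tests.
  return () => {
    const list = subscribers.get(event) || [];
    const idx = list.indexOf(handler);
    if (idx !== -1) list.splice(idx, 1);
  };
}

function emit(event, payload) {
  const list = subscribers.get(event);
  if (!list || list.length === 0) return; // tier 0: one Map lookup, nothing else
  // Copy so a handler that unsubscribes mid-dispatch does not skip a sibling.
  for (const handler of [...list]) {
    try {
      handler(payload); // synchronous dispatch
    } catch (err) {
      // Isolation: a buggy subscriber must not break the proxy path.
      console.error(`[emit] subscriber for "${event}" threw:`, err);
    }
  }
}
```

With this shape, a throwing subscriber is logged and skipped while later subscribers for the same event still run.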
…ader
Phase 2a of the OpenSpec change add-otel-metrics-phase1 (vertical slice spike, dependency layer):
- package.json: pin minimal OTel deps (@opentelemetry/api, @opentelemetry/sdk-metrics, @opentelemetry/exporter-metrics-otlp-http, @opentelemetry/resources). No auto-instrumentation packages, per design decision D-deps in the change proposal.
- server/otel-lazy.js: tryRequire()/isAvailable() helpers so ccxray keeps running at tier 0 when OTel packages are absent (e.g. a slimmed install). Whitelist of known package names; unknown names throw to catch typos.
- server/config-loader.js: minimum-viable reader for .ccxray.json. Returns a frozen DEFAULT_CONFIG when the file is absent; parses and validates the otel block when present. Env interpolation, secret detection, gitignore amend, and personal-config (.ccxray.user.json) lookup land in later Phase 2 sub-phases per the change tasks.
- test/config-loader.test.js: 7 unit tests covering defaults, parsed values, malformed JSON, non-integer tier coercion, and the lazy require helper.
No existing runtime path imports these new modules yet; the proxy and hub behavior is unchanged.
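As a rough illustration of the whitelist-guarded lazy require described for server/otel-lazy.js, consider the sketch below. The helper names and the four package names come from the commit message; the injectable `requireFn` parameter and the choice to return `null` on any load failure are assumptions, not the module's confirmed behaviour.

```javascript
// Sketch of a whitelist-guarded lazy require (assumed shape; not the actual
// server/otel-lazy.js). Package names are the four pinned in this commit.
const KNOWN_PACKAGES = new Set([
  '@opentelemetry/api',
  '@opentelemetry/sdk-metrics',
  '@opentelemetry/exporter-metrics-otlp-http',
  '@opentelemetry/resources',
]);

// `requireFn` is injectable (an assumption) so tests can simulate a missing package.
function tryRequire(name, requireFn = require) {
  if (!KNOWN_PACKAGES.has(name)) {
    throw new Error(`otel-lazy: unknown package "${name}"`); // catches typos early
  }
  try {
    return requireFn(name);
  } catch {
    return null; // package absent or unloadable: caller stays at tier 0
  }
}

// True only when every whitelisted package can be loaded.
function isAvailable(requireFn = require) {
  return [...KNOWN_PACKAGES].every((name) => tryRequire(name, requireFn) !== null);
}
```

Throwing on unknown names while returning `null` for absent ones keeps the two failure modes distinct: a typo is a programming error, a missing package is an expected deployment state.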
Phase 2b of add-otel-metrics-phase1. Frames the subscriber wiring against the emit.js event bus introduced in Phase 1 without committing to a metric registry shape (that lands in Phase 2c+).
- server/otel-health.js: four-state machine (disabled / active / degraded / circuit_open) with validated transitions. Phase 2b ships the shell only; bounded queue, circuit breaker, and log rotation are deferred per tasks.md §3.2–3.4.
- server/otel.js: init(config) chooses behavior by tier. tier 0 returns early: no @opentelemetry/* require, no subscribers, zero cost. tier ≥ 1 resolves packages via otel-lazy; absent packages → degraded (proxy keeps running). Available packages → register five no-op subscribers on the emit.js bus (entry_completed, session_started, parser_unknown, parser_mismatch, parser_error) → active.
- Both modules accept dependency injection so tests can exercise the "packages missing" branch without uninstalling them.
- test/otel-init.test.js: 8 unit tests covering tier 0 no-op, tier ≥ 1 active path, packages-absent degraded path, idempotency, shutdown, and invalid-transition guards on the state machine.
No existing runtime path imports these modules yet; proxy and hub behavior unchanged. Phase 2c will require server/otel.js from server/index.js (or from a CLI bootstrap) and wire the first emit() call site.
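A four-state machine with validated transitions could look something like the sketch below. The state names are taken from the commit message, but the commit does not list which transitions are legal, so the `ALLOWED` table here is purely an assumption, as are the `createHealth` name and the treatment of same-state transitions as no-ops.

```javascript
// Sketch of a four-state health machine with validated transitions.
// State names come from the commit message; the ALLOWED table is an
// assumption, since the message does not list the legal transitions.
const ALLOWED = {
  disabled: ['active', 'degraded'],
  active: ['degraded', 'circuit_open', 'disabled'],
  degraded: ['active', 'circuit_open', 'disabled'],
  circuit_open: ['active', 'degraded', 'disabled'],
};

function createHealth(initial = 'disabled') {
  if (!Object.prototype.hasOwnProperty.call(ALLOWED, initial)) {
    throw new Error(`unknown state: ${initial}`);
  }
  let state = initial;
  return {
    get state() { return state; },
    transition(next) {
      if (next === state) return state; // re-entering the same state is a no-op
      if (!ALLOWED[state].includes(next)) {
        throw new Error(`invalid transition: ${state} -> ${next}`);
      }
      state = next;
      return state;
    },
  };
}
```

Under these assumed rules, `disabled -> circuit_open` is rejected because a breaker cannot open before export has ever been attempted.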
…CONNRESET
Two fixes for upstream socket error handling that previously crashed the proxy process in production:
1. forward.js: register a socket-level catch-all on proxyReq for the default (no HTTPS_PROXY) path. Anthropic occasionally returns 500 and then closes the TCP connection while ccxray still has a pending write to the underlying TLSSocket; the resulting EPIPE is emitted on the socket but not re-emitted on the ClientRequest, so without a listener the entire proxy crashes. Logs the error and lets the request fail gracefully via the existing proxyReq 'error' handler.
2. createTunnelAgent: guard the one-shot tls.connect callback with a "connected" flag. Pre-connect errors continue to flow into the agent callback as before; post-connect late errors now only log and never re-invoke the already-consumed callback. Addresses the HTTPS_PROXY path (Gemini-identified failure mode).
node --check passes. Existing test suite unaffected.
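The "connected" flag guard from the second fix can be sketched in isolation. This is not the actual createTunnelAgent code: `guardOneShot`, its parameters, and the log format are hypothetical names for illustration. `socket` stands in for any emitter that fires 'secureConnect' and 'error', as a TLS socket does.

```javascript
// Sketch of the "connected" flag guard (assumed shape; not the actual
// createTunnelAgent). `callback` is the one-shot (err, socket) agent callback.
function guardOneShot(socket, callback, log = console.error) {
  let connected = false;

  socket.once('secureConnect', () => {
    connected = true;
    callback(null, socket); // the callback is consumed here
  });

  socket.on('error', (err) => {
    if (connected) {
      // Late error (EPIPE, ECONNRESET): log only, never re-invoke the callback.
      log(`[tunnel] late socket error: ${err.code || err.message}`);
      return;
    }
    callback(err); // pre-connect failure still reaches the agent
  });
}
```

Keeping an 'error' listener attached for the socket's whole lifetime is also what prevents the unhandled 'error' event from crashing the process.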
Wire up the first end-to-end OTel chain: emit → MeterProvider →
OTLPMetricExporter → collector. Verify the rail works first, then go back and harden it.
- server/otel.js: real MeterProvider + OTLP HTTP exporter;
resource carries ccxray.source=ccxray-proxy; shutdown has a 2-second hard cap
- server/otel.js: register 4 token counters
(ccxray.tokens.{input,output,cache_read,cache_creation}_total)
with { provider, model } attributes
- server/forward.js: all three forward paths (Anthropic SSE, OpenAI SSE,
non-SSE) call emit('entry_completed', { entry })
- test/otel-vertical.test.js: 4 integration tests, including a mock OTLP collector
OpenSpec tasks: §3.5 and §4.1 complete; §4.5, §6.1, §10.3 partial.
Queue routing (§3.2) and cardinality budget (§4.2-4.4) are left for the next cut.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
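To make the counter wiring concrete, here is a sketch of an 'entry_completed' subscriber that feeds the four token counters. Counter names and the { provider, model } attributes come from the commit message. The `createTokenRecorder` name and the `entry.usage` field names are assumptions about ccxray's entry shape. `meter` is any object exposing `createCounter(name)` whose counters expose `add(value, attributes)`, which matches the OpenTelemetry Meter interface and also makes a test double trivial.

```javascript
// Sketch of mapping a completed entry onto the four token counters.
// The `entry.usage` field names below are an assumption, not confirmed
// by the PR; counter names and attributes follow the commit message.
const TOKEN_FIELDS = {
  input: 'input_tokens',
  output: 'output_tokens',
  cache_read: 'cache_read_input_tokens',
  cache_creation: 'cache_creation_input_tokens',
};

function createTokenRecorder(meter) {
  const counters = {};
  for (const kind of Object.keys(TOKEN_FIELDS)) {
    counters[kind] = meter.createCounter(`ccxray.tokens.${kind}_total`);
  }
  // Intended as the 'entry_completed' subscriber: receives { entry }.
  return function onEntryCompleted({ entry }) {
    if (!entry) return;
    const usage = entry.usage || {};
    const attrs = { provider: entry.provider, model: entry.model };
    for (const [kind, field] of Object.entries(TOKEN_FIELDS)) {
      const n = usage[field];
      // Counters are monotonic, so skip missing, zero, or negative values.
      if (Number.isFinite(n) && n > 0) counters[kind].add(n, attrs);
    }
  };
}
```

Keeping the attribute set to { provider, model } is what bounds cardinality here; anything per-request (such as a request id) would not belong on a counter.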
Withdrawn: the branch contains commits that were already merged into main via hotfix; it needs a rebase to clean those up before the PR is reopened.
Summary
- …anthropic.tokens.input, anthropic.tokens.output, anthropic.tokens.cache_read, anthropic.tokens.cache_creation) via OTLP on every proxy response
- …server/config-loader.js for OTLP endpoint/header config, server/otel-lazy.js for lazy require helper, server/otel.js for the full SDK init/shutdown/record pipeline, server/otel-health.js for the health FSM, and server/emit.js as the entry-point wiring
- …server/cli.js (clean separation from startup logic)
- …forward.js to handle late socket errors (EPIPE/ECONNRESET) so the proxy survives mid-stream disconnects
Test plan
- npm test: all 3 new test files pass (otel-vertical, otel-init, config-loader)
- …OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and verify metrics appear in an OTLP collector
- …uninitialized
- …ccxray claude still launches and proxies correctly end-to-end
🤖 Generated with Claude Code