feat(otel): OTel metrics phase 1 — SDK init, token counters, health state machine#32

Closed
lis186 wants to merge 10 commits into main from otel-metrics-phase1


Owner

@lis186 lis186 commented May 19, 2026

Summary

  • Add OpenTelemetry SDK integration with lazy initialization and graceful health state machine (uninitialized → starting → healthy → degraded → disabled)
  • Emit token usage counters (anthropic.tokens.input, anthropic.tokens.output, anthropic.tokens.cache_read, anthropic.tokens.cache_creation) via OTLP on every proxy response
  • Add server/config-loader.js for OTLP endpoint/header config, server/otel-lazy.js for lazy require helper, server/otel.js for the full SDK init/shutdown/record pipeline, server/otel-health.js for the health FSM, and server/emit.js as the entry-point wiring
  • Extract CLI argv parsing into server/cli.js (clean separation from startup logic)
  • Fix forward.js to handle late socket errors (EPIPE/ECONNRESET) so the proxy survives mid-stream disconnects
  • Add OpenSpec proposal, design docs, and per-spec files for config, export, health, introspection, tiers, and parser schemas
  • Add vertical-slice integration tests, OTel init tests, and config-loader unit tests

Test plan

  • npm test — all 3 new test files pass (otel-vertical, otel-init, config-loader)
  • Start proxy with OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and verify metrics appear in an OTLP collector
  • Start proxy without any OTel env vars — verify it starts normally with health state uninitialized
  • Kill the OTLP endpoint mid-session — verify health degrades but proxy keeps forwarding Claude requests
  • ccxray claude still launches and proxies correctly end-to-end

🤖 Generated with Claude Code

lis186 and others added 10 commits May 15, 2026 19:55
…Tel metrics phase 1

Captures the design phase for adding optional OpenTelemetry metric export
to ccxray. Phase 1 is metrics only; spans/drill-back deferred to Phase 2.

Includes:
- docs/otel-integration.html: design record with 11-risk pre-mortem,
  three integration plans, and recommended path (each accepted solution
  scored >= 9/10 on weighted criteria)
- docs/otel-change-walkthrough.html: visualization of the change with
  diagrams, animated data flow, and direct citations to spec sections
- openspec/changes/add-otel-metrics-phase1/: full proposal, design,
  six capability specs, and tasks (validated via openspec --strict)

Key constraints captured in spec:
- Default OFF (zero egress unless opted in)
- Three-tier opt-in (disabled / project-anonymous / personal-named)
  with project as upper bound and personal as lower bound
- ccxray.* namespace; coexists with Claude Code CLI OTel without
  double-counting; reconciliation diff metric surfaces accounting bugs
- Client-side emit (not hub); per-project tier/endpoint coexistence
- OTel failure never breaks the proxy: config errors fail fast,
  init errors degrade silently, runtime errors handled by bounded
  queue + circuit breaker
- Parser schema-ization with sentinel counters for drift detection

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/otel-change-walkthrough.html used Mermaid via CDN for the state machine
and module dependency diagrams. Replaced with docs/otel-phase1-overview.html
which renders the same diagrams as native SVG, eliminating the external
dependency and making the file fully self-contained.

Content is unchanged in substance — same 10 sections, same citations back to
the OpenSpec change. Two diagrams (OTel health state machine and module
dependency graph) re-authored in inline SVG with hand-laid arrow paths and
the same color coding as the other diagrams on the page.

Verified twice via subagent review against proposal.md, design.md, all six
specs, and tasks.md — every metric name, numeric default, attribute list,
state transition, and module annotation traces to source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndler

fs is already imported at the top of server/index.js (line 5). The local
require inside the catch block shadowed it without purpose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expert review (Sigelman / Majors / Sridharan) converged on dropping
ccxray.reconciliation.token_diff_pct{model} from Phase 1:

- Pre-aggregated diff gauge cannot answer "which request diverged"
- Legitimate non-zero diffs (SSE chunking, retries, prompt-cache edges)
  produce alert fatigue
- Acquiring CLI counts in-process either couples ccxray to user storage
  backends or turns ccxray into an OTLP receiver — violates instrumentation
  neutrality and expands blast radius

Phase 1 now emits ccxray-internal invariants only (parser sums,
SSE truncation). Cross-source reconciliation moves to docs/otel-recon.md as
a downstream pattern (recording rules, sidecar, wide-event join on
request_id).

- tasks.md §4.7 reconciliation task removed
- tasks.md §4.8 replaced with internal invariants
- tasks.md §9.6 added (docs/otel-recon.md)
- specs/otel-export/spec.md requirement rewritten with 3 new scenarios
  including explicit non-emit of ccxray.reconciliation.*
- proposal.md and design.md updated; design.md records the pivot rationale

openspec validate --strict passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls the --port / --hub-mode / --allow-upstream-loop / --no-browser flag
detection and provider lookup out of server/index.js (793 LOC) into a new
server/cli.js (63 LOC) so future CLI subcommands can be added without
growing the entry-point file further.

Behaviour is preserved: parseArgs still mutates process.argv in place to
strip consumed flags (matches existing assumptions), still exits on
invalid --port values or unknown providers, and still derives DISPLAY_NAME
through providers.getDisplayName.
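
The in-place argv stripping described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in server/cli.js: the returned `opts` shape is hypothetical, `--port` is assumed to take a space-separated value, and the sketch throws where the commit says the real parser exits.

```javascript
// Hypothetical sketch: consume known flags from an argv array in place,
// leaving every unrecognized argument where it was.
const BOOLEAN_FLAGS = ['--hub-mode', '--allow-upstream-loop', '--no-browser'];

function parseArgs(argv) {
  const opts = { port: null, flags: new Set() }; // shape is an assumption
  for (let i = 0; i < argv.length; ) {
    const arg = argv[i];
    if (arg === '--port') {
      const port = Number(argv[i + 1]);
      if (!Number.isInteger(port) || port < 1 || port > 65535) {
        // The real parser exits the process; throwing keeps the sketch testable.
        throw new Error(`invalid --port value: ${argv[i + 1]}`);
      }
      opts.port = port;
      argv.splice(i, 2); // drop the flag and its value; do not advance i
    } else if (BOOLEAN_FLAGS.includes(arg)) {
      opts.flags.add(arg);
      argv.splice(i, 1); // drop the flag; do not advance i
    } else {
      i += 1; // not ours: leave it for later stages
    }
  }
  return opts;
}
```

Because the array is mutated rather than copied, passing `process.argv` leaves later code seeing only the arguments that were not consumed, which is the behaviour the commit says existing callers rely on.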

Phase 0a-ii of the add-otel-metrics-phase1 OpenSpec change — clears space
for §7 (status --otel / otel preview / parser report) without piling new
subcommands onto an already long entry point.

Tests: 456 passing, 1 pre-existing Codex E2E failure (unchanged from
baseline before this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add server/emit.js — a minimal on/emit primitive with synchronous dispatch,
no subscribers by default (O(1) no-op for tier 0), and try/catch isolation
so a buggy subscriber cannot break the proxy path.

This is the "drum" for the OTel work (Phase 1 of the add-otel-metrics-phase1
plan): forward.js and store.js will emit events here in a later phase, and
the OTel SDK / parser sentinels will subscribe. Wiring callers is intentionally
deferred: this commit ships the contract only, keeping the surface reviewable.

Defined events (payload shapes locked for Phase 1):
- entry_completed { entry }
- session_started { sessionId, provider, inferred }
- parser_unknown  { provider, kind, token }
- parser_mismatch { type, expected, got, entryId? }
- parser_error    { parser, errorType, message }

Refs: openspec/changes/add-otel-metrics-phase1 §6.1–6.3, §5.9–5.11
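
A minimal sketch of an on/emit primitive with the properties listed above (synchronous dispatch, cheap no-op without subscribers, try/catch isolation). This is not the contents of server/emit.js: the unsubscribe return value and the `console.error` logging are assumptions made for the example.

```javascript
// Hypothetical sketch of a synchronous event bus with subscriber isolation.
const subscribers = new Map(); // event name -> array of handlers

function on(event, handler) {
  if (!subscribers.has(event)) subscribers.set(event, []);
  subscribers.get(event).push(handler);
  // Returning an unsubscribe function is an assumption, not from the commit.
  return () => {
    const list = subscribers.get(event);
    const idx = list ? list.indexOf(handler) : -1;
    if (idx !== -1) list.splice(idx, 1);
  };
}

function emit(event, payload) {
  const list = subscribers.get(event);
  if (!list || list.length === 0) return; // nobody listening: one lookup, no work
  // Iterate over a copy so a handler may unsubscribe during dispatch.
  for (const handler of [...list]) {
    try {
      handler(payload); // synchronous dispatch
    } catch (err) {
      // A throwing subscriber is logged and skipped; the caller never sees it.
      console.error(`[emit] subscriber for "${event}" threw:`, err);
    }
  }
}
```

With this shape, a call such as `emit('entry_completed', { entry })` on the proxy path costs a single `Map` lookup when nothing has subscribed.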

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ader

Phase 2a of the OpenSpec change add-otel-metrics-phase1 (vertical slice
spike, dependency layer):

- package.json: pin minimal OTel deps — @opentelemetry/api,
  @opentelemetry/sdk-metrics, @opentelemetry/exporter-metrics-otlp-http,
  @opentelemetry/resources. No auto-instrumentation packages, per design
  decision D-deps in the change proposal.
- server/otel-lazy.js: tryRequire()/isAvailable() helpers so ccxray keeps
  running at tier 0 when OTel packages are absent (e.g. a slimmed install).
  Whitelist of known package names; unknown names throw to catch typos.
- server/config-loader.js: minimum-viable reader for .ccxray.json. Returns
  a frozen DEFAULT_CONFIG when the file is absent; parses and validates the
  otel block when present. Env interpolation, secret detection, gitignore
  amend, and personal-config (.ccxray.user.json) lookup land in later
  Phase 2 sub-phases per the change tasks.
- test/config-loader.test.js: 7 unit tests covering defaults, parsed
  values, malformed JSON, non-integer tier coercion, and the lazy require
  helper.

No existing runtime path imports these new modules yet; the proxy and hub
behavior is unchanged.
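
The lazy require helper could be sketched as follows. This is an illustration, not server/otel-lazy.js: returning `null` for an absent package and the injectable `requireFn` parameter are assumptions; only the four package names come from the commit.

```javascript
// Hypothetical sketch of a whitelisted lazy require helper.
const KNOWN_PACKAGES = new Set([
  '@opentelemetry/api',
  '@opentelemetry/sdk-metrics',
  '@opentelemetry/exporter-metrics-otlp-http',
  '@opentelemetry/resources',
]);

function tryRequire(name, requireFn = require) {
  if (!KNOWN_PACKAGES.has(name)) {
    // Unknown names are treated as programmer errors (likely typos).
    throw new Error(`otel-lazy: unknown package "${name}"`);
  }
  try {
    return requireFn(name);
  } catch (err) {
    // Node sets code 'MODULE_NOT_FOUND' when resolution fails. Note that this
    // also fires when a transitive dependency of the package is missing.
    if (err && err.code === 'MODULE_NOT_FOUND') return null;
    throw err; // anything else is a real failure inside the package
  }
}

function isAvailable(name, requireFn = require) {
  return tryRequire(name, requireFn) !== null;
}
```

Injecting `requireFn` is one way to exercise the "package absent" branch in a unit test without uninstalling anything.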
Phase 2b of add-otel-metrics-phase1. Frames the subscriber wiring against
the emit.js event bus introduced in Phase 1 without committing to a
metric registry shape (that lands in Phase 2c+).

- server/otel-health.js: four-state machine (disabled / active / degraded /
  circuit_open) with validated transitions. Phase 2b ships the shell only;
  bounded queue, circuit breaker, and log rotation are deferred per tasks.md
  §3.2–3.4.
- server/otel.js: init(config) chooses behavior by tier. tier 0 returns
  early — no @opentelemetry/* require, no subscribers, zero cost. tier ≥ 1
  resolves packages via otel-lazy; absent packages → degraded (proxy keeps
  running). Available packages → register five no-op subscribers on the
  emit.js bus (entry_completed, session_started, parser_unknown,
  parser_mismatch, parser_error) → active.
- Both modules accept dependency injection so tests can exercise the
  "packages missing" branch without uninstalling them.
- test/otel-init.test.js: 8 unit tests covering tier 0 no-op, tier ≥ 1
  active path, packages-absent degraded path, idempotency, shutdown, and
  invalid-transition guards on the state machine.

No existing runtime path imports these modules yet; proxy and hub behavior
unchanged. Phase 2c will require server/otel.js from server/index.js (or
from a CLI bootstrap) and wire the first emit() call site.
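
A state machine with validated transitions, as described for server/otel-health.js, might be sketched like this. The four state names come from the commit; the transition table itself is an assumption, since the commit does not list which moves are allowed, and `createHealth` is a hypothetical name.

```javascript
// Hypothetical sketch of a four-state health machine with validated transitions.
// The allowed-transition table below is an assumption for illustration.
const TRANSITIONS = new Map([
  ['disabled', ['active', 'degraded']],
  ['active', ['degraded', 'disabled']],
  ['degraded', ['active', 'circuit_open', 'disabled']],
  ['circuit_open', ['degraded', 'disabled']],
]);

function createHealth(initial = 'disabled') {
  if (!TRANSITIONS.has(initial)) throw new Error(`unknown state: ${initial}`);
  let state = initial;
  return {
    get state() {
      return state;
    },
    transition(next) {
      if (!TRANSITIONS.has(next)) throw new Error(`unknown state: ${next}`);
      if (!TRANSITIONS.get(state).includes(next)) {
        throw new Error(`invalid transition: ${state} -> ${next}`);
      }
      state = next;
      return state;
    },
  };
}
```

Rejecting unlisted moves with an exception makes an accidental jump (for example from `circuit_open` straight back to `active` in this table) fail loudly in tests instead of silently corrupting the reported health.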
…CONNRESET

Two fixes for upstream socket error handling that previously crashed the
proxy process in production:

1. forward.js: register a socket-level catch-all on proxyReq for the
   default (no HTTPS_PROXY) path. Anthropic occasionally returns 500 and
   then closes the TCP connection while ccxray still has a pending write
   to the underlying TLSSocket; the resulting EPIPE is emitted on the
   socket but not re-emitted on the ClientRequest, so without a listener
   the entire proxy crashes. Logs the error and lets the request fail
   gracefully via the existing proxyReq 'error' handler.

2. createTunnelAgent: guard the one-shot tls.connect callback with a
   "connected" flag. Pre-connect errors continue to flow into the agent
   callback as before; post-connect late errors now only log and never
   re-invoke the already-consumed callback. Addresses the HTTPS_PROXY
   path (Gemini-identified failure mode).

node --check passes. Existing test suite unaffected.
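
The two guards described above can be sketched as below. This is not the code in forward.js: the function names, the `log` parameter, and the `WeakSet` used to avoid stacking listeners on reused keep-alive sockets are all assumptions. The `'socket'` event on `http.ClientRequest` is the standard Node.js hook for reaching the underlying socket.

```javascript
// Hypothetical sketch 1: catch-all 'error' listener on the request's socket,
// so a late EPIPE/ECONNRESET is logged instead of crashing the process.
const guarded = new WeakSet();

function guardSocketErrors(proxyReq, log = console.error) {
  proxyReq.on('socket', (socket) => {
    if (guarded.has(socket)) return; // keep-alive sockets may be reused
    guarded.add(socket);
    socket.on('error', (err) => {
      log(`[forward] late socket error: ${err.code || err.message}`);
    });
  });
}

// Hypothetical sketch 2: wrap a one-shot connect callback so it fires at most
// once. The commit describes a "connected" flag; this variant uses a single
// "settled" flag, which also covers repeated pre-connect errors.
function guardConnectCallback(callback, log = console.error) {
  let settled = false;
  return {
    onConnect(socket) {
      if (settled) return;
      settled = true;
      callback(null, socket);
    },
    onError(err) {
      if (settled) {
        log(`[tunnel] late error after callback: ${err.code || err.message}`);
        return; // never re-invoke the consumed callback
      }
      settled = true;
      callback(err);
    },
  };
}
```

An `EventEmitter` with no `'error'` listener throws when an error is emitted, which is why a single listener on the socket is enough to keep the process alive.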
Connect the first end-to-end OTel chain: emit → MeterProvider →
OTLPMetricExporter → collector. Prove the rail works first, then come back
and harden it.

- server/otel.js: a real MeterProvider + OTLP HTTP exporter, with the
  resource carrying ccxray.source=ccxray-proxy; shutdown has a 2-second hard cap
- server/otel.js: register 4 token counters
  (ccxray.tokens.{input,output,cache_read,cache_creation}_total)
  with { provider, model } attributes
- server/forward.js: all three forward paths (Anthropic SSE, OpenAI SSE,
  non-SSE) call emit('entry_completed', { entry })
- test/otel-vertical.test.js: 4 integration tests, including a mock OTLP collector

OpenSpec tasks: §3.5 and §4.1 complete; §4.5, §6.1, §10.3 partial.
Queue routing (§3.2) and cardinality budget (§4.2-4.4) are left for the next cut.
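
Registering the four token counters and feeding them from `entry_completed` could look like the sketch below. It is not server/otel.js: the `entry` shape (`entry.provider`, `entry.model`, `entry.tokens`) and the function name are assumptions. The `meter` argument is expected to follow the OpenTelemetry `Meter` interface, whose `createCounter(name, options)` returns a counter with `add(value, attributes)`.

```javascript
// Hypothetical sketch: create the token counters on an injected Meter and
// return a handler suitable for the entry_completed event.
const TOKEN_KINDS = ['input', 'output', 'cache_read', 'cache_creation'];

function registerTokenCounters(meter) {
  const counters = {};
  for (const kind of TOKEN_KINDS) {
    counters[kind] = meter.createCounter(`ccxray.tokens.${kind}_total`, {
      description: `Count of ${kind} tokens seen by the proxy`,
    });
  }
  // The payload shape { entry } matches the event contract; the fields read
  // from entry below are assumptions for illustration.
  return function onEntryCompleted({ entry }) {
    const attrs = { provider: entry.provider, model: entry.model };
    for (const kind of TOKEN_KINDS) {
      const n = entry.tokens && entry.tokens[kind];
      // Counters are monotonic: skip missing, zero, or non-numeric values.
      if (Number.isFinite(n) && n > 0) counters[kind].add(n, attrs);
    }
  };
}
```

Taking the meter as a parameter keeps the sketch independent of the SDK packages, so a test can pass a fake meter and inspect what was recorded.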

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Owner Author

lis186 commented May 19, 2026

Withdrawn: the branch contains commits that were already merged into main via a hotfix; it needs a rebase to clean those out before the PR is reopened.

@lis186 lis186 closed this May 19, 2026