From 639bc254f2eb6071e12445eefb093b87743e5dbb Mon Sep 17 00:00:00 2001
From: Justin Lee
Date: Tue, 12 May 2026 18:21:50 +0800
Subject: [PATCH 01/10] docs(otel): pre-mortem, design walkthrough, and OpenSpec change for OTel metrics phase 1

Captures the design phase for adding optional OpenTelemetry metric export
to ccxray. Phase 1 is metrics only; spans/drill-back deferred to Phase 2.

Includes:
- docs/otel-integration.html: design record with 11-risk pre-mortem, three
  integration plans, and recommended path (each accepted solution scored
  >= 9/10 on weighted criteria)
- docs/otel-change-walkthrough.html: visualization of the change with
  diagrams, animated data flow, and direct citations to spec sections
- openspec/changes/add-otel-metrics-phase1/: full proposal, design, six
  capability specs, and tasks (validated via openspec --strict)

Key constraints captured in spec:
- Default OFF (zero egress unless opted in)
- Three-tier opt-in (disabled / project-anonymous / personal-named) with
  project as upper bound and personal as lower bound
- ccxray.* namespace; coexists with Claude Code CLI OTel without
  double-counting; reconciliation diff metric surfaces accounting bugs
- Client-side emit (not hub); per-project tier/endpoint coexistence
- OTel failure never breaks the proxy: config errors fail fast, init errors
  degrade silently, runtime errors handled by bounded queue + circuit breaker
- Parser schema-ization with sentinel counters for drift detection

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/otel-change-walkthrough.html | 916 ++++++++++++++
 docs/otel-integration.html | 1124 +++++++++++++++++
 .../add-otel-metrics-phase1/.openspec.yaml | 2 +
 .../changes/add-otel-metrics-phase1/design.md | 166 +++
 .../add-otel-metrics-phase1/proposal.md | 47 +
 .../specs/otel-config/spec.md | 90 ++
 .../specs/otel-export/spec.md | 121 ++
 .../specs/otel-health/spec.md | 99 ++
 .../specs/otel-introspection/spec.md | 66 +
 .../specs/otel-tiers/spec.md | 79 ++
 .../specs/parser-schemas/spec.md | 83 ++
 .../changes/add-otel-metrics-phase1/tasks.md | 104 ++
 12 files changed, 2897 insertions(+)
 create mode 100644 docs/otel-change-walkthrough.html
 create mode 100644 docs/otel-integration.html
 create mode 100644 openspec/changes/add-otel-metrics-phase1/.openspec.yaml
 create mode 100644 openspec/changes/add-otel-metrics-phase1/design.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/proposal.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/specs/otel-health/spec.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/specs/otel-introspection/spec.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/specs/otel-tiers/spec.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/specs/parser-schemas/spec.md
 create mode 100644 openspec/changes/add-otel-metrics-phase1/tasks.md

diff --git a/docs/otel-change-walkthrough.html b/docs/otel-change-walkthrough.html
new file mode 100644
index 0000000..d3eec08
--- /dev/null
+++ b/docs/otel-change-walkthrough.html
@@ -0,0 +1,916 @@
+OpenSpec change: add-otel-metrics-phase1 — Mental Model
+ +

OpenSpec change: add-otel-metrics-phase1

+
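Everything in this document comes directly from the spec, and every section cites its source. It uses visuals (flowcharts, state machines, animations) rather than description to help you build an accurate mental model.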
+ + + + +

1. What this change is, and what it is not

+ +

What it is (Goals)

+
    +
  • Provide OTel metrics emitted by ccxray itself, covering cost, usage (tool/MCP/skill), quality, patterns, and governance
  • Off by default. Zero telemetry unless the user explicitly opts in per project
  • Three-tier opt-in (disabled / project-anonymous / personal-named): the project sets the upper bound, and personal config can only equal or downgrade it
  • Coexists with the Claude Code CLI's built-in OTel, with a reconciliation metric for tracing accounting bugs on either side
  • OTel failures never affect the proxy: config errors fail at startup, init errors degrade silently, and runtime errors are absorbed by a bounded queue plus circuit breaker
  • Parser drift must be visible: unrecognized events emit a sentinel counter and cannot silently become 0
  • Introspection commands: ccxray status --otel, ccxray otel preview, ccxray parser report
  • +
+
+

Each item in the list maps one-to-one to the Goals section of design.md.

+ 📄 design.md › Goals / Non-Goals +
+ +

What it is not (Non-Goals)

+
    +
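  • Traces / spans. Phase 1 emits metrics only. Spans, entry_id deep links, and the /entry/:id route all belong to Phase 2
  • Full payload export. Request/response bodies never leave the machine
  • Synthetic tool span timing. Estimating tool execution time from HTTP cadence would be misleading
  • A central ccxray hub for cross-machine aggregation. Each engineer's ccxray runs locally and independently
  • Auto-instrumentation. @opentelemetry/auto-instrumentations-node will not be introduced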
  • +
+
+ 📄 design.md › Non-Goals +
+ + +

2. The six capabilities at a glance

+ +

The spec defines 6 new capabilities (proposal.md › New Capabilities). How they relate:

+ +
+
+flowchart TB
+    CFG[otel-config<br/>read/validate .ccxray.json<br/>+ .ccxray.user.json]
+    TIERS[otel-tiers<br/>three-tier opt-in resolution]
+    HEALTH[otel-health<br/>state machine + queue + breaker]
+    EXPORT[otel-export<br/>SDK init + metrics emit]
+    PARSER[parser-schemas<br/>schemas + sentinels]
+    INTRO[otel-introspection<br/>status / preview / report]
+
+    CFG --> TIERS
+    TIERS --> EXPORT
+    HEALTH --> EXPORT
+    PARSER --> EXPORT
+    CFG --> INTRO
+    HEALTH --> INTRO
+    EXPORT --> INTRO
+    PARSER --> INTRO
+
+    style CFG fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style TIERS fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style HEALTH fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style EXPORT fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style PARSER fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style INTRO fill:#b48ead,stroke:#5e81ac,color:#0f1419
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Capability | Spec file | Core responsibility (quoted directly)
otel-config | specs/otel-config/spec.md | 「.ccxray.json and .ccxray.user.json schema, ${ENV_VAR} interpolation, literal-secret rejection, .gitignore auto-amend, project-upper-bound + personal-lower-bound merging rules.」
otel-export | specs/otel-export/spec.md | 「OTel SDK initialization (client-side, not hub), metric definitions under ccxray.* namespace, ccxray.source resource attribute, cardinality budget enforcement with _overflow_ fallback, CLI coexistence detection and complement-mode signaling, reconciliation diff metric.」
otel-tiers | specs/otel-tiers/spec.md | 「three-tier opt-in (disabled / project-anonymous / personal-named), tier resolution with project as upper bound and personal as lower bound, enduser.id attribute only in tier 2, opt-in acknowledgment timestamp persisted in personal config.」
otel-health | specs/otel-health/spec.md | 「failure state machine (disabled / active / degraded / circuit_open), bounded export queue with drop-oldest semantics, circuit breaker with exponential backoff, local failure log at ~/.ccxray/otel.log with rotation, never-block guarantee for the proxy path.」
parser-schemas | specs/parser-schemas/spec.md | 「extract skill / MCP / tool / agent-type detection into versioned JSON schemas, snapshot fixtures per provider (Anthropic + Codex), sentinel metrics for unknown events, reconciliation invariants run per entry, try/catch isolation so parser failure does not affect ccxray core.」
otel-introspection | specs/otel-introspection/spec.md | 「ccxray status --otel view (tier, endpoint, health, cardinality, dropped counts), ccxray otel preview dry-run, ccxray parser report for drift inspection, startup banner declaring active tier and CLI coexistence mode.」
+
+ 📄 proposal.md › Capabilities › New Capabilities +
+ + +

3. The journey of one request (data-flow animation)

+ +

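When ccxray intercepts an HTTP request, returns the response, and writes it to the local log, the request also travels the OTel path. The green dot below represents one metric event on its journey from creation to delivery at the OTLP collector: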

+ +
+
+
forward.js: request completes, usage obtained
parsers/*: parse tool / skill / MCP
otel.js: budget check + counter.add()
otel-health: queue + state machine
OTLP HTTP: sent to the collector
+
+
+
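Green dot = one metric event, flowing left to right through 5 stages. Note that this path runs in parallel with proxy forwarding, so a failure at any stage does not affect the proxy.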
+
+ +

Key guarantees defined by the spec

+ +
+

「OTel export operations SHALL NOT block the HTTP proxy path. All emit operations SHALL enqueue without awaiting export completion.」

+ 📄 otel-health/spec.md › Requirement: Never-block guarantee for the proxy +
+ +
+

「Parser code SHALL be wrapped in try/catch boundaries. On exception... Parser failure SHALL NOT propagate to the proxy path or terminate ccxray.」

+ 📄 parser-schemas/spec.md › Requirement: Parser error isolation +
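The two guarantees quoted above (enqueue without awaiting, and never throw into the proxy) can be pictured with a small wrapper. This is only a sketch under assumptions: `makeSafeEmitter`, the `queue` object, and the `onError` callback are hypothetical names for illustration and do not come from the ccxray codebase.

```javascript
// Hypothetical helper, not ccxray's actual code.
// Wraps a metric emit so that it only enqueues work and never throws
// into the caller (the proxy path).
function makeSafeEmitter(queue, onError) {
  return function safeEmit(event) {
    try {
      // Enqueue only; the export itself happens elsewhere and is never awaited here.
      queue.push(event);
      return true;
    } catch (err) {
      // Report locally and swallow, so the proxy path is unaffected.
      try { onError(err); } catch (_) { /* a failing logger must not throw either */ }
      return false;
    }
  };
}

// Usage: even a queue that always throws cannot break the caller.
const reportedErrors = [];
const brokenQueue = { push() { throw new Error('queue unavailable'); } };
const emitToBroken = makeSafeEmitter(brokenQueue, (e) => reportedErrors.push(e.message));
const healthyQueue = [];
const emitToHealthy = makeSafeEmitter(healthyQueue, () => {});
```

The return value lets a caller count dropped emits if it wants to, but the caller is never required to check it.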
+ +

Corresponding implementation locations in tasks.md

+
    +
  • forward.js changes: tasks.md §6.1
  • store.js changes: tasks.md §6.2
  • "No emit path can throw into the proxy": tasks.md §6.3
  • +
+ + +

4. Three-tier opt-in decision tree

+ +

The spec defines tier resolution as min(project_tier, personal_tier). The tree below lists every meaningful combination:

+ +
+ccxray startup
+├─ No config found?
+│   └─ ▶ tier 0 (disabled): OTel SDK is not loaded, no network connection is opened
+│
+├─ Only .ccxray.json (project enables tier 1)
+│   └─ ▶ tier 1 (project anonymous): carries only project.name / team, no enduser.id
+│
+├─ .ccxray.json tier 1 + .ccxray.user.json tier 0
+│   └─ ▶ tier 0: individual unilateral opt-out
+│
+├─ .ccxray.json tier 1 + .ccxray.user.json tier 2
+│   └─ ⚠ project has not authorized tier 2
+│      → clamped to tier 1, warning printed (mandated by the spec)
+│
+└─ .ccxray.json tier 2 + .ccxray.user.json tier 2
+    └─ ▶ tier 2 (personal named): carries enduser.id, records opt_in_acknowledged_at
+
+ +
+

「The effective tier SHALL be min(project_tier, personal_tier). If either side is absent, the present side SHALL be used. The minimum SHALL clamp downward; personal config SHALL NOT exceed project config.」

+ 📄 otel-tiers/spec.md › Requirement: Tier resolution rule +
+ +
+

「Any engineer SHALL be able to opt out of OTel emission for their own machine by setting tier: 0 in .ccxray.user.json, regardless of the project config.」

+ 📄 otel-tiers/spec.md › Requirement: Engineer unilateral opt-out +
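The resolution rule quoted above is small enough to sketch as a function. This is an illustration under assumptions: the function name and return shape are hypothetical, an absent config file is represented as `undefined`, and "no config at all" resolves to tier 0 as in the decision tree.

```javascript
// Hypothetical sketch of the tier resolution rule; not ccxray's actual code.
// projectTier / personalTier are 0, 1, 2, or undefined when that file is absent.
function resolveEffectiveTier(projectTier, personalTier) {
  const hasProject = projectTier !== undefined;
  const hasPersonal = personalTier !== undefined;

  // No config anywhere: disabled.
  if (!hasProject && !hasPersonal) return { tier: 0, clamped: false };
  // "If either side is absent, the present side SHALL be used."
  if (!hasPersonal) return { tier: projectTier, clamped: false };
  if (!hasProject) return { tier: personalTier, clamped: false };

  // Both present: take the minimum. `clamped` flags the case where the
  // personal config asked for more than the project authorizes, which is
  // where the spec calls for a warning.
  return {
    tier: Math.min(projectTier, personalTier),
    clamped: personalTier > projectTier,
  };
}
```

Because the result is a minimum, a personal `tier: 0` always wins, which is how the unilateral opt-out falls out of the same rule without a special case.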
+ +

Additional conditions for tier 2

+
    +
  • An identity string must be provided in .ccxray.user.json (a pseudonym is fine; a real name is not required): spec D1
  • The file must be gitignored. If ccxray detects that the file is tracked by git, it refuses to apply the personal identity (spec otel-tiers › Personal config gitignore enforcement)
  • On first enablement, an opt_in_acknowledged_at ISO 8601 timestamp is written automatically (spec otel-tiers › Opt-in acknowledgment timestamp)
  • +
+ + +

5. Emit on the client, not the hub

+ +

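ccxray's hub mode lets multiple projects share one proxy process. This change, however, explicitly requires that OTel emission happens on the client side rather than in the hub, so that each project can set its own tier / endpoint: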

+ +
+
+flowchart LR
+    subgraph ClientA [Client A: projectA<br/>.ccxray.json tier=1]
+        SDKA[OTel SDK A] --> CollA[collector-A]
+    end
+    subgraph ClientB [Client B: projectB<br/>.ccxray.json tier=0]
+        NoneB((no SDK))
+    end
+    subgraph ClientC [Client C: projectC<br/>.ccxray.user.json tier=2]
+        SDKC[OTel SDK C<br/>with enduser.id] --> CollC[collector-C]
+    end
+
+    ClientA -.HTTP only.-> Hub[ccxray hub<br/>does not emit business metrics]
+    ClientB -.HTTP only.-> Hub
+    ClientC -.HTTP only.-> Hub
+    Hub --> API[Anthropic / OpenAI]
+
+    style Hub fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style SDKA fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style SDKC fill:#b48ead,stroke:#5e81ac,color:#0f1419
+    style NoneB fill:#2a313c,stroke:#5e81ac,color:#d8dee9
+
+ +
+

「OTel SDK initialization and metric emission happen in the client process (the one that ran ccxray claude). The hub remains a pure HTTP proxy plus SSE broadcaster. The hub MAY emit its own operational metrics under ccxray.hub.* namespace using a separate config (~/.ccxray/hub-config.json), but it does NOT emit business metrics on behalf of clients.」

+ 📄 design.md › D2. Client-side emit, not hub-side +
+ +
+

「Hub does not emit business metrics: WHEN the ccxray hub forwards an HTTP request between a client and an upstream provider, THEN the hub SHALL NOT emit any business metric on behalf of the client, regardless of the client's tier setting.」

+ 📄 otel-export/spec.md › Requirement: Client-side OTel SDK initialization › Scenario: Hub does not emit business metrics +
+ +

Note: the ccxray.hub.* operational metrics are deferred in Phase 1. design.md Open Questions and tasks.md §8.3 explicitly say to defer them to a follow-up.

+ + +

6. Namespaces and CLI coexistence

+ +

When ccxray runs alongside the Claude Code CLI's built-in OTel, the two use different namespaces, and ccxray additionally emits a reconciliation diff metric:

+ +
+
+flowchart TB
+    CLI[Claude Code CLI<br/>built-in OTel] --> NSA[claude_code.*<br/>token / interaction / tool spans]
+    CCX[ccxray] --> NSB["ccxray.*<br/>+ resource: ccxray.source='ccxray-proxy'<br/>+ attr: ccxray.cli_otel_active=true (when both active)"]
+
+    NSA -.same tokens.-> COMP[reconciliation comparison]
+    NSB -.same tokens.-> COMP
+    COMP --> DIFF[ccxray.reconciliation.token_diff_pct<br/>a gap = a bug on one side]
+
+    style NSA fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style NSB fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style DIFF fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
+
+ +
+

「Every metric SHALL be named under the ccxray.<system>.<aspect> pattern. No metric SHALL be named identically to a Claude Code CLI metric or any other upstream OTel convention that would overlap.」

+ 📄 otel-export/spec.md › Requirement: ccxray.* namespace for all emitted metrics +
+ +
+

「ccxray SHALL detect the presence of CLAUDE_CODE_ENABLE_TELEMETRY=1 in the environment and, when detected, SHALL emit all metrics with an additional attribute ccxray.cli_otel_active=true. ccxray SHALL print a startup notice... ccxray SHALL NOT disable any of its own metrics based on CLI coexistence.」

+ 📄 otel-export/spec.md › Requirement: CLI OTel coexistence and complement mode +
+ +
+

「ccxray SHALL emit ccxray.reconciliation.token_diff_pct{model} as a gauge that compares ccxray's HTTP-observed token counts against the corresponding values reported by the CLI when both are active.」

+ 📄 otel-export/spec.md › Requirement: Reconciliation diff metric +
+ + +

7. Cardinality budget (animation)

+ +

Every metric has a capacity limit for each attribute. Once the number of unique values reaches the limit, subsequent unique values become the literal string _overflow_, and a sentinel counter is emitted:

+ +
+
+
attribute key: tool | budget=50 | 23 / 50 unique | healthy
attribute key: model | budget=10 | 4 / 10 unique | healthy
attribute key: mcp_server | budget=30 | 30 / 30, OVERFLOW | new values recorded as _overflow_, sentinel ++
+
+
+ +
+

「Every metric declares an allow-list of attribute keys and a per-key cardinality budget (e.g. tool=50, model=10, mcp_server=30). Attribute values are tracked in a Set per (metric, attribute); when the Set reaches budget size, subsequent unique values are recorded as the literal string _overflow_ and a sentinel counter ccxray.metrics.overflow_total{metric,attribute} increments.」

+ 📄 design.md › D4. Cardinality budget with overflow fallback +
+ +
+

「Attribute keys not in the allow-list are dropped at the View API layer (OTel SDK native enforcement). High-cardinality candidates that look attractive (bash.command_pattern, file_path) are explicitly NOT emitted as metric labels.」

+ 📄 design.md › D4 +
+ +

Three corresponding scenarios (all from the spec)

+ + + + + + + + + + + + + + + +
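Scenario | Spec behavior
Tool name is the 3rd unique value against a budget of 50 | Emits tool="Read" normally; the sentinel does not increment
The 51st unique tool name appears | Emits tool="_overflow_"; ccxray.metrics.overflow_total{metric="...",attribute="tool"} increments by 1
An attribute not in the allow-list is passed in (e.g. bash_command) | The attribute is dropped by the View API before emission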
+
+ 📄 otel-export/spec.md › Requirement: Cardinality budget enforcement › Scenarios +
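The budget mechanism described in D4 can be sketched with one Set per (metric, attribute) pair. This is a simplified illustration, not ccxray's implementation: the class and method names are hypothetical, and the allow-list drop is done inline here, whereas the design places it at the OTel SDK View layer.

```javascript
// Hypothetical sketch of a per-(metric, attribute) cardinality budget.
class CardinalityBudget {
  constructor(budgets) {
    this.budgets = budgets;          // allow-list with budgets, e.g. { tool: 50, model: 10 }
    this.seen = new Map();           // "metric|attribute" -> Set of admitted values
    this.overflowTotal = new Map();  // "metric|attribute" -> overflow count (the sentinel)
  }

  // Returns the value to emit: the original value, or "_overflow_" once the
  // budget is exhausted. Returns undefined for keys outside the allow-list.
  admit(metric, attribute, value) {
    if (!Object.prototype.hasOwnProperty.call(this.budgets, attribute)) {
      return undefined; // not in the allow-list: dropped
    }
    const key = `${metric}|${attribute}`;
    if (!this.seen.has(key)) this.seen.set(key, new Set());
    const values = this.seen.get(key);

    if (values.has(value)) return value;       // already known: no budget cost
    if (values.size < this.budgets[attribute]) {
      values.add(value);                       // new value, budget still available
      return value;
    }
    // Budget reached: record the overflow and emit the literal fallback.
    this.overflowTotal.set(key, (this.overflowTotal.get(key) || 0) + 1);
    return '_overflow_';
  }
}
```

Values that were admitted before the budget filled up keep their real names afterwards; only values first seen after that point collapse into `_overflow_`.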
+ + +

8. Failure state machine (animation)

+ +

OTel health has a fixed set of 4 states. The green active state below has a pulse animation to illustrate the "currently active" visual:

+ +
+
disabled: tier 0 or OTel package missing
active: SDK initialized successfully, exports healthy
degraded: SDK init failed, ccxray still runs normally
circuit_open: runtime failures, exports paused, periodic half-open trial
+
+
+ +

Transitions defined by the spec

+ +
+startup       [tier 0 / no OTel pkg]           → disabled
+startup       [tier ≥ 1, SDK init OK]          → active
+active        [SDK init throws]                → degraded
+active        [5 consecutive export failures]  → circuit_open (cooldown=60s)
+circuit_open  [cooldown elapsed]               → half_open (trial export)
+half_open     [success]                        → active (cooldown reset to 60s)
+half_open     [failure]                        → circuit_open (cooldown × 2, max 600s)
+
+ +
+

「ccxray SHALL maintain an OTel health state machine with exactly four states: disabled, active, degraded, and circuit_open. Transitions SHALL be driven exclusively by the conditions described in the subsequent requirements; no other code path SHALL mutate state.」

+ 📄 otel-health/spec.md › Requirement: Four-state OTel health machine +
+ +
+

「After 5 consecutive export failures, the state SHALL transition to circuit_open and exports SHALL be paused. After an initial cooldown of 60 seconds, the state SHALL transition to half_open and a single export SHALL be attempted. Success SHALL return the state to active. Failure SHALL keep the state at circuit_open and the cooldown SHALL double up to a maximum of 600 seconds.」

+ 📄 otel-health/spec.md › Requirement: Circuit breaker with exponential backoff +
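The breaker behavior quoted above (5 consecutive failures, 60 s initial cooldown, doubling up to 600 s, one trial export in half_open) can be sketched as a small class. The class name, method names, and the choice to pass the clock in as `nowMs` are assumptions made for illustration and testability; they are not taken from the ccxray codebase.

```javascript
// Hypothetical sketch of the export circuit breaker; not ccxray's actual code.
class ExportCircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold ?? 5;
    this.initialCooldownMs = options.initialCooldownMs ?? 60000;
    this.maxCooldownMs = options.maxCooldownMs ?? 600000;
    this.state = 'active';
    this.consecutiveFailures = 0;
    this.cooldownMs = this.initialCooldownMs;
    this.openedAtMs = 0;
  }

  // Returns true when an export may be attempted at time nowMs.
  // Moves circuit_open -> half_open once the cooldown has elapsed.
  tryAcquire(nowMs) {
    if (this.state === 'active') return true;
    if (this.state === 'circuit_open' && nowMs - this.openedAtMs >= this.cooldownMs) {
      this.state = 'half_open'; // exactly one trial export is allowed
      return true;
    }
    return false; // still cooling down, or a trial is already in flight
  }

  recordSuccess() {
    this.state = 'active';
    this.consecutiveFailures = 0;
    this.cooldownMs = this.initialCooldownMs; // cooldown resets to 60s
  }

  recordFailure(nowMs) {
    if (this.state === 'half_open') {
      // Trial failed: reopen and double the cooldown, capped at the maximum.
      this.cooldownMs = Math.min(this.cooldownMs * 2, this.maxCooldownMs);
      this.state = 'circuit_open';
      this.openedAtMs = nowMs;
      return;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'circuit_open';
      this.openedAtMs = nowMs;
    }
  }
}
```

Passing the time in explicitly keeps the class free of timers, so the caller decides when to poll `tryAcquire` and nothing here can block the proxy path.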
+ +

Queue behavior (also mandated by the spec)

+ +
+

「The OTel export queue SHALL be bounded by a configurable size (default 2048 entries). When the queue is full and a new export is attempted, the oldest queued entry SHALL be dropped to make room. Each drop SHALL increment ccxray.otel.exports_dropped_total{signal}.」

+ 📄 otel-health/spec.md › Requirement: Bounded export queue with drop-oldest semantics +
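Drop-oldest semantics are easy to get backwards, so here is a minimal sketch. The class name is hypothetical, a plain array stands in for whatever structure ccxray actually uses, and the `droppedTotal` field stands in for the `ccxray.otel.exports_dropped_total` counter.

```javascript
// Hypothetical sketch of a bounded export queue with drop-oldest semantics.
class BoundedExportQueue {
  constructor(maxSize = 2048) {
    this.maxSize = maxSize;
    this.items = [];
    this.droppedTotal = 0; // stand-in for ccxray.otel.exports_dropped_total
  }

  // Never rejects the new entry: when full, the oldest entry makes room.
  enqueue(item) {
    if (this.items.length >= this.maxSize) {
      this.items.shift();
      this.droppedTotal += 1;
    }
    this.items.push(item);
  }

  // Hands the current batch to the exporter and leaves the queue empty.
  drain() {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}
```

Dropping the oldest entry favors recent data during an outage. `Array.prototype.shift` is linear in the queue length, so a ring buffer would be the more careful choice if profiling showed this path to matter.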
+ +

How the three failure levels are routed

+ + + + + + + + + + + + + + + + + + +
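Failure level | Example | Handling defined by the spec
Config error | JSON syntax error, unresolved ${VAR} | Startup fails with a non-zero exit; the error identifies the file and line number
Init error | Malformed endpoint URL | Enters degraded; ccxray still starts, the proxy works normally, and status shows the state
Runtime error | Collector unreachable | Absorbed by the circuit breaker (enters circuit_open)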
+
+ 📄 otel-health/spec.md › Requirement: Config errors fail fast, init/runtime errors degrade +
+ + +

9. Config loading and secret rejection

+ +
+
+flowchart TB
+    A[ccxray startup]
+    B{Found .ccxray.json?<br/>walk from cwd up to git root}
+    C[Read .ccxray.json]
+    D{Found .ccxray.user.json?<br/>cwd or $HOME}
+    E[Read .ccxray.user.json]
+    F[JSON syntax OK?]
+    G[Schema OK?]
+    H["All ${VAR} resolved?"]
+    I[No literal secret pattern?]
+    J["tier_effective = min(project, personal)"]
+    K[Initialize SDK]
+    FAIL[non-zero exit<br/>fail fast]
+    NONE[tier 0<br/>SDK not started]
+
+    A --> B
+    B -- no --> NONE
+    B -- yes --> C
+    C --> F
+    F -- no --> FAIL
+    F -- yes --> G
+    G -- no --> FAIL
+    G -- yes --> H
+    H -- no --> FAIL
+    H -- yes --> I
+    I -- looks like a secret --> FAIL
+    I -- OK --> D
+    D -- no --> J
+    D -- yes --> E
+    E --> F
+    J --> K
+
+    style FAIL fill:#bf616a,stroke:#5e81ac,color:#0f1419
+    style NONE fill:#8b95a5,stroke:#5e81ac,color:#0f1419
+    style K fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+
+ +
+

「Both files support ${ENV_VAR} interpolation in string values. The schema validator rejects any string that looks like a literal secret (Bearer [A-Za-z0-9]{20,}, sk_live_*, ghp_*, JWT structure) when not wrapped in ${...}. First-time generation auto-amends .gitignore to include .ccxray.user.json.」

+ 📄 design.md › D6. Config: .ccxray.json + .ccxray.user.json +
+ +

Concrete scenarios (listed directly from the spec)

+
    +
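  • Valid interpolation: config contains "Authorization": "Bearer ${OTLP_TOKEN}" and env OTLP_TOKEN=abc123 is set → the loaded header value is "Bearer abc123", and the resolved literal value does not appear in any debug log line
  • Missing env var: config contains "Bearer ${MISSING_VAR}" and MISSING_VAR is unset → ccxray exits non-zero, and the error message includes the file path, line number, and the variable name MISSING_VAR
  • Literal token rejected: config contains the literal "Bearer abc123longtokenvalue..." → startup fails and prompts the user to switch to ${ENV_VAR}
  • Plain URLs pass: pure URLs and hostnames are allowed without ${...}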
  • +
+
+ 📄 otel-config/spec.md › Requirement: Environment variable interpolation, Requirement: Literal-secret rejection +
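The interpolation and literal-secret scenarios above can be sketched together. Treat this as an illustration under assumptions: the function names are hypothetical, and the exact regular expressions are a guess at what "Bearer [A-Za-z0-9]{20,}, sk_live_*, ghp_*, JWT structure" would look like in code. File path and line number reporting, which the spec requires, is left out to keep the sketch short.

```javascript
// Hypothetical sketch; the exact patterns are assumptions, not ccxray's code.
const SECRET_PATTERNS = [
  /Bearer [A-Za-z0-9]{20,}/,                               // long literal bearer token
  /sk_live_[A-Za-z0-9]+/,                                  // sk_live_* style key
  /ghp_[A-Za-z0-9]+/,                                      // ghp_* style token
  /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/,   // JWT-like three-part structure
];

const PLACEHOLDER = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;

function looksLikeLiteralSecret(value) {
  // Strip ${...} placeholders first so "Bearer ${OTLP_TOKEN}" is allowed.
  const withoutPlaceholders = value.replace(PLACEHOLDER, '');
  return SECRET_PATTERNS.some((pattern) => pattern.test(withoutPlaceholders));
}

// Resolves ${VAR} placeholders against `env`, failing fast on literal
// secrets and on variables that are not set.
function interpolate(value, env) {
  if (looksLikeLiteralSecret(value)) {
    throw new Error('literal secret detected; use ${ENV_VAR} instead');
  }
  return value.replace(PLACEHOLDER, (match, name) => {
    if (env[name] === undefined) {
      throw new Error(`unresolved environment variable: ${name}`);
    }
    return env[name];
  });
}
```

The secret check runs on the raw config string before any substitution, so a resolved token value is never inspected or echoed by the validator itself.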
+ + +

10. Parser pipeline and sentinels

+ +

The spec requires that tool / MCP / skill / agent-type detection no longer be scattered as inline strings across system-prompt.js / store.js / helpers.js, and instead be expressed as versioned JSON schemas:

+ +
+
+flowchart LR
+    REQ[an entry arrives] --> DISP[server/parsers/index.js<br/>dispatch]
+
+    DISP --> S1[anthropic-tools.schema.json]
+    DISP --> S2[anthropic-skills.schema.json]
+    DISP --> S3[anthropic-agent-types.schema.json]
+    DISP --> S4[mcp-tools.schema.json]
+    DISP --> S5[codex-tools.schema.json]
+
+    S1 --> OK[ccxray.tool.invocations_total<br/>ccxray.skill.activations_total<br/>...]
+    S1 --> UNK["ccxray.parser.unknown_tool_total{provider}"]
+
+    S2 --> OK
+    S2 --> UNK2["ccxray.parser.unknown_skill_marker_total"]
+
+    S4 --> OK
+    S4 --> UNK3["ccxray.parser.unknown_mcp_format_total"]
+
+    OK --> INV[invariants check]
+    INV --> MISMATCH[ccxray.parser.reconciliation_mismatch_total<br/>when tool_use block count ≠ extracted count]
+
+    style DISP fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style OK fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style UNK fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
+    style UNK2 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
+    style UNK3 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
+    style MISMATCH fill:#bf616a,stroke:#5e81ac,color:#0f1419
+
+ +
+

「Detection logic for tool / MCP / skill / agent-type SHALL be expressed as JSON schemas under server/parsers/. There SHALL be at minimum one schema per (concern, provider) pair: parsers/anthropic-tools.schema.json, parsers/anthropic-skills.schema.json, parsers/anthropic-agent-types.schema.json, parsers/mcp-tools.schema.json (provider-agnostic MCP naming convention), parsers/codex-tools.schema.json. Each schema SHALL include a version field (semver) and a last_verified_against field (ISO 8601 date).」

+ 📄 parser-schemas/spec.md › Requirement: Versioned parser schemas per concern and provider +
+ +

What does a sentinel mean?

+ +

"Unknown" is not 0. It means "I saw a token, but the schema does not recognize it." This is an early signal for detecting drift:

+ +
+

「When the parser encounters a token, marker, or block that does not match any registered pattern in the relevant schema, it SHALL increment one of: ccxray.parser.unknown_tool_total{provider}, ccxray.parser.unknown_skill_marker_total{provider}, ccxray.parser.unknown_mcp_format_total, ccxray.parser.fallback_used_total{parser,reason}. The unknown event SHALL also be recorded with a short sample to ~/.ccxray/parser-drift.log for later inspection via ccxray parser report.」

+ 📄 parser-schemas/spec.md › Requirement: Sentinel counters for unknown tokens +
+ +

Reconciliation invariants

+ +
+

「For every processed entry the parser SHALL verify the following invariants: Number of tool_use blocks in the response equals the number of tool entries extracted by the parser. Sum of input/output token counts attributed by the parser equals the corresponding values in the upstream usage block.」

+ 📄 parser-schemas/spec.md › Requirement: Reconciliation invariants +
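The two invariants quoted above amount to comparing what the parser extracted against what the upstream response actually contains. A rough sketch follows, assuming an Anthropic-style response (a `content` array of typed blocks and a `usage` object with `input_tokens` / `output_tokens`). The `parsed` shape and the function name are hypothetical.

```javascript
// Hypothetical sketch of the per-entry reconciliation invariants.
// Returns the list of violated invariants; an empty list means all hold.
function checkReconciliationInvariants(response, parsed) {
  const mismatches = [];

  // Invariant 1: tool_use blocks in the response == tools extracted by the parser.
  const toolUseBlocks = response.content.filter((block) => block.type === 'tool_use');
  if (toolUseBlocks.length !== parsed.tools.length) {
    mismatches.push('tool_count');
  }

  // Invariant 2: token counts attributed by the parser == upstream usage block.
  if (parsed.tokens.input !== response.usage.input_tokens) {
    mismatches.push('input_tokens');
  }
  if (parsed.tokens.output !== response.usage.output_tokens) {
    mismatches.push('output_tokens');
  }

  return mismatches;
}
```

A caller could increment `ccxray.parser.reconciliation_mismatch_total` once per returned item, so the dashboard shows which invariant drifted and not just that one did.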
+ +

Snapshot fixtures (the minimum set required by the spec)

+

The spec requires every provider to have at least these fixtures:

+
    +
  • Basic tool invocation
  • +
  • Tool invocation with a skill marker active
  • +
  • Subagent invocation (Anthropic Task tool)
  • +
  • MCP server tool invocation
  • +
  • An intentional unknown tool name
  • +
+
+ 📄 parser-schemas/spec.md › Requirement: Snapshot fixtures per provider +
+ + +

11. CLI command surface

+ +

The spec defines 3 new commands plus 1 startup banner:

+ +
+

ccxray status --otel

+

Displays:

+
    +
  • Effective tier (0/1/2) plus which config files contributed
  • Endpoint URL (${VAR} portions masked)
  • OTel state (disabled / active / degraded / circuit_open) plus the 3 most recent state transitions with timestamps
  • Remaining circuit breaker cooldown (if applicable)
  • Cardinality usage for each metric, formatted as current / budget (e.g. tool: 23/50)
  • Exports succeeded / failed / dropped (past 1 hour and past 24 hours)
  • opt_in_acknowledged_at when at tier 2
  • CLI coexistence indicator: whether CLAUDE_CODE_ENABLE_TELEMETRY was detected
  • +
+
+ 📄 otel-introspection/spec.md › Requirement: ccxray status --otel shows effective configuration and health +
+
+ +
+

ccxray otel preview

+
+

「The ccxray otel preview command SHALL print the exact JSON body that would be sent to the OTel collector on the next export, including all attribute values and resource attributes, WITHOUT sending any network request. Secrets resolved from ${ENV_VAR} SHALL be masked in the output.」

+ 📄 otel-introspection/spec.md › Requirement: ccxray otel preview dry-run +
+
+ +
+

ccxray parser report

+
+

「The ccxray parser report command SHALL print the top unknown tokens by frequency from the last 7 days of ~/.ccxray/parser-drift.log, grouped by category (tool / skill / MCP / fallback). The output SHALL include sample tokens and a GitHub issue body template the user can copy to file a drift report.」

+ 📄 parser-schemas/spec.md › Requirement: ccxray parser report command +
+
+ +
+

Startup banner

+
+

「When ccxray starts at tier ≥ 1, it SHALL print a one-line banner to stderr summarizing: tier value, endpoint (without secret), and complement-mode status (if CLI OTel is active). The banner SHALL NOT print when tier is 0.」

+ 📄 otel-introspection/spec.md › Requirement: Startup banner declares active tier and mode +
+

Scenario:

+
    +
  • Tier 1 standalone → stderr contains one line matching ccxray OTel tier: 1 (anonymous) → <endpoint>
  • Tier 1 with CLI active → the line contains tier: 1 and complement-mode: true
  • Tier 0 → no OTel banner text at all
  • +
+
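The three banner scenarios can be sketched as a pure formatting function. Several details here are assumptions, not taken from the spec: the function name, the "named" label for tier 2, where `complement-mode: true` sits on the line, and the premise that `endpoint` has already had any secret removed.

```javascript
// Hypothetical sketch of the startup banner line; layout details are assumed.
// Returns null at tier 0, where no banner is printed.
function formatOtelBanner({ tier, endpoint, cliOtelActive }) {
  if (tier === 0) return null;
  const label = tier === 1 ? 'anonymous' : 'named'; // tier 2 label is an assumption
  let line = `ccxray OTel tier: ${tier} (${label}) → ${endpoint}`;
  if (cliOtelActive) line += ' complement-mode: true';
  return line;
}

// Usage: print to stderr only when there is something to say.
const bannerLine = formatOtelBanner({
  tier: 1,
  endpoint: 'https://otel.example.com/v1/metrics', // placeholder endpoint
  cliOtelActive: false,
});
if (bannerLine !== null) console.error(bannerLine);
```

Returning `null` for tier 0 keeps the "print nothing at all" rule in one place, so no caller has to remember it.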
+ + +

12. Source index

+ +

All spec files (can be opened in VS Code):

+ + + + + + + + + + + +
File | Contents
proposal.md | Why / What Changes / Capabilities / Impact
design.md | Context / Goals / 8 decisions D1–D8 / Risks / Migration / Open Questions
specs/otel-config/spec.md | 6 Requirements with scenarios
specs/otel-export/spec.md | 8 Requirements with scenarios
specs/otel-tiers/spec.md | 6 Requirements with scenarios
specs/otel-health/spec.md | 7 Requirements with scenarios
specs/parser-schemas/spec.md | 6 Requirements with scenarios
specs/otel-introspection/spec.md | 4 Requirements with scenarios
tasks.md | 11 task groups, 60+ checkboxes
+ +

This visualization is a mental-model tool, not the spec. Any ambiguity is resolved by the OpenSpec change files.

+ +
diff --git a/docs/otel-integration.html b/docs/otel-integration.html
new file mode 100644
index 0000000..761621e
--- /dev/null
+++ b/docs/otel-integration.html
@@ -0,0 +1,1124 @@
+OTel Integration Exploration - ccxray
+ +

OTel Integration Exploration

+
Understanding what OpenTelemetry is, and comparing three ways ccxray could plug into the OTel ecosystem
+ + + + +

1. What is OTel?

+ +

OpenTelemetry (OTel from here on) is not a product but a set of standards for observability data. It defines:

+
    +
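  • The structure of the data (what a span looks like, what a metric looks like)
  • The transport protocol for the data (called OTLP, usually over HTTP/gRPC)
  • SDKs for various languages (Node, Python, Go...) that let you produce this data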
  • +
+

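The problem it solves: "Every observability backend (Datadog, New Relic, Honeycomb) used to have its own SDK, so switching backends meant changing code. Now everyone speaks OTel, so you emit once and can send the data anywhere."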

+ +
+
+flowchart LR
+    A[Your application<br/>e.g. ccxray] -->|OTel SDK<br/>produces standard data| B[OTel Collector<br/>optional relay]
+    B -->|OTLP protocol| C[Honeycomb]
+    B -->|OTLP protocol| D[Datadog]
+    B -->|OTLP protocol| E[Grafana / Jaeger]
+    B -->|OTLP protocol| F[Langfuse]
+    A -.sent directly.-> C
+
+    style A fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style B fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+
+ +
+Remember one thing: your program → OTel SDK → backend. What sits in between is a shared data format. +
+ + +

2. Three signals

+ +

OTel divides observability data into three categories, each independent and each individually switchable:

+ +
+
+

Traces

The timeline of one operation

A tree of multiple spans. Each span has a start/end time and a parent.

ccxray example: one Claude turn = one trace, containing 1 HTTP request span + N tool spans

+
+
+

Metrics

Numbers, counts, distributions

Counter (cumulative), Gauge (instantaneous value), Histogram (distribution). Cheap and easy to aggregate.

ccxray example: cumulative input_tokens, cumulative cost, cache hit rate

+
+
+

Logs

Structured log records

Like traditional logs, but structured (JSON) and linkable to traces and spans.

ccxray example: full request body, tool execution results

+
+
+ +

How do the three fit together?

+ +
+
+flowchart TB
+    subgraph Trace [Trace: one Claude turn]
+        S1["Span: HTTP POST /v1/messages<br/>200ms"]
+        S2["Span: tool_use Read<br/>50ms"]
+        S3["Span: tool_use Bash<br/>1200ms"]
+        S1 --> S2
+        S1 --> S3
+    end
+
+    subgraph Metrics [Metrics recorded at the same time]
+        M1["counter tokens.input += 2500"]
+        M2["counter cost.usd += 0.0125"]
+    end
+
+    subgraph Logs [Logs linked to spans]
+        L1["event user_prompt<br/>linked to S1"]
+        L2["event tool_result<br/>linked to S3"]
+    end
+
+    style S1 fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style S2 fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style S3 fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+
+ + +

3. What does Claude Code's built-in OTel already do?

+ +

When you set CLAUDE_CODE_ENABLE_TELEMETRY=1, the Claude Code CLI sends OTel data on its own, with no need for ccxray:

+ +
+
+flowchart LR
+    A[Claude Code CLI<br/>built-in OTel] -->|OTLP| B[Your Collector]
+    B --> C[Honeycomb / Datadog]
+
+    A2[Claude Code CLI<br/>no OTel configured] -->|plain HTTP| X[Anthropic API]
+
+    style A fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style A2 fill:#8b95a5,stroke:#5e81ac,color:#0f1419
+
+ +

Spans the CLI emits on its own:

+
    +
  • claude_code.interaction: one agent loop turn
  • claude_code.llm_request: each call to the Anthropic API
  • claude_code.tool: each tool call (including permission wait and execution)
  • claude_code.hook: each hook execution (beta)
  • +
+ +
+Note: this exists only in Anthropic's official Claude Code. Other providers such as Codex and Gemini have no OTel at all. +
+ + +

4. Sitting at the HTTP layer, what can and can't ccxray see?

+ +
+
+flowchart LR
+    CLI[Claude Code / Codex] -->|HTTP request| CCX[ccxray proxy]
+    CCX -->|forward| API[Anthropic / OpenAI API]
+    API -->|response| CCX
+    CCX -->|response| CLI
+
+    CCX -.writes.-> LOG[(~/.ccxray/logs)]
+    CCX -.SSE.-> UI[Dashboard]
+
+    style CCX fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+
+
+ + + + + + + +
ccxray can see ✅ | ccxray cannot see ❌
+
    +
  • The full payload of every HTTP request / response
  • model, input/output/cache tokens
  • cost (calculated with LiteLLM pricing)
  • latency (from request arrival to response completion)
  • tool_use blocks parsed from the response → knows which tool the LLM asked to run
  • The next request carrying tool_result back → knows the tool result
  • Cross-provider: Codex / Gemini are visible too
  • +
+
+
    +
  • Actual tool execution time (can only be inferred)
  • Permission prompt wait time
  • Hook execution
  • Local file I/O details
  • The user's act of typing the prompt
  • +
+
+ + +

5. Four integration plans

+ +

Plan A: Metrics Only (lightweight starting point)

+ +

Emit only numeric metrics: tokens, cost, request count, cache hit rate. No traces.

+ +
+
+flowchart LR
+    REQ[Each HTTP completion] --> M["counter tokens.input ++<br/>counter tokens.output ++<br/>counter cost.usd ++<br/>histogram latency ms"]
+    M -->|OTLP| COL[Collector]
+    COL --> GRA[Grafana / Datadog<br/>charts]
+
+    style M fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+
+ +
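+ Files touched +
  • New server/otel.js: starts @opentelemetry/sdk-node only when the env var is present
  • Modify server/forward.js: call counter.add(tokens) when a request finishes
+ Pros +
  • Simplest to implement (estimated 1-2 days)
  • No duplication with the CLI's built-in OTel (the CLI has metrics too, but ccxray adds Codex's)
  • Most useful to users: daily token / cost trends can be charted in Grafana
+ Cons +
  • Cannot see "why it was expensive" (no traces, so no way to know which turn burned the most)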
  • +
+
+ + + +

Plan B: Metrics + Synthetic Traces (moderate integration)

+ +

Adds traces, but the traces are "synthetic": real tool execution time is not visible and can only be inferred from HTTP.

+ +
+
+flowchart TB
+    subgraph Trace [Synthetic trace]
+        I["claude_code.interaction<br/>grouped by session_id"]
+        L["claude_code.llm_request<br/>real HTTP time"]
+        T1["ccxray.tool.synthetic<br/>inferred from tool_use"]
+        T2["ccxray.tool.synthetic<br/>inferred from tool_use"]
+        I --> L
+        L --> T1
+        L --> T2
+    end
+
+    style I fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style L fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style T1 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
+    style T2 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
+
+ +
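+ Files touched +
  • Everything in Plan A
  • Modify server/store.js: open the interaction span when session/turn is inferred
  • Modify server/forward.js: read the incoming traceparent header as the parent context
  • Parse tool_use blocks in the response and synthesize tool spans
+ Pros +
  • Shows the time distribution of each tool within a turn (although estimated)
  • If the user has CLI OTel enabled, ccxray's spans can attach automatically under their trace
  • For Codex, this is the only source of traces
+ Cons +
  • Running CLI OTel and ccxray together produces duplicate spans (the same llm_request is sent by both sides)
  • Tool time is "arrival of the next request minus end of the previous response", not real execution time
  • Implementation complexity estimated at 3-5 days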
  • +
+
+ + + +

Plan C: Full (Metrics + Traces + Log Events), heavy integration

+ +

Also emits the full payload ccxray sees as log events, so users can run full-text search in the OTel backend.

+ +
+
+flowchart LR
+    REQ[HTTP request] --> M[Metrics]
+    REQ --> T[Traces]
+    REQ --> L["Log Events<br/>full request / response JSON"]
+
+    M --> COL[Collector]
+    T --> COL
+    L --> COL
+
+    COL --> BE[Honeycomb / Langfuse<br/>full-text searchable payloads]
+
+    style L fill:#bf616a,stroke:#5e81ac,color:#0f1419
+
+ +
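+ Files touched +
  • Everything in Plan B
  • New log event emission: after each request/response finishes, send one log event with the full body
  • Needs chunking for large payloads and a PII-masking option
+ Pros +
  • Users can view full conversation history and tool results in Langfuse / Honeycomb
  • More complete than the CLI's built-in OTEL_LOG_RAW_API_BODIES (the CLI's version is truncated at 60KB)
  • Enables advanced analysis: prompt patterns, tool failure rates, overlong conversations
+ Cons +
  • Data volume explodes, and backend cost rises significantly
  • Privacy / compliance concerns (full conversation content is sent off the machine)
  • Overlaps somewhat with ccxray's own local log feature (users can already view it in the dashboard)
  • Implementation complexity estimated at 1-2 weeks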
  • +
+
+ + + +

Plan D ★ Recommended: cloud tracing + local drill-back (Hybrid)

+ +

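Send the metadata ccxray sees (model, tokens, cost, tool names, timing) to the cloud and keep the full payload local. Each span carries a ccxray.entry_id attribute, so after spotting a problem in Grafana you can go back to the ccxray dashboard and look up the full conversation.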

+ +
+
+flowchart LR
+    REQ[HTTP request arrives] --> CCX[ccxray proxy]
+    CCX -->|full payload<br/>~50KB/turn| LOG[(~/.ccxray/logs<br/>local)]
+    CCX -->|metadata + entry_id<br/>~1KB/turn| OTLP[OTLP Collector]
+    OTLP --> GRA[Grafana / Honeycomb<br/>aggregate queries]
+
+    GRA -.click entry_id<br/>jump back to local.-> UI[ccxray Dashboard<br/>view full payload]
+    LOG --> UI
+
+    style CCX fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+    style LOG fill:#a3be8c,stroke:#5e81ac,color:#0f1419
+    style UI fill:#b48ead,stroke:#5e81ac,color:#0f1419
+
+ +

Drill-back workflow

+ +
+
+sequenceDiagram
+    autonumber
+    actor U as Engineer
+    participant G as Grafana
+    participant D as ccxray Dashboard
+    participant F as Local log files
+
+    Note over G: Abnormal spike spotted<br/>cost suddenly surges
+    U->>G: Open the most expensive span
+    G-->>U: trace shows ccxray.entry_id=<br/>"2026-05-12T09-31-04-227"
+    U->>D: Open http://localhost:5577/entry/2026-...
+    D->>F: Read local _req.json / _res.json
+    F-->>D: Full payload
+    D-->>U: Show full conversation, tool calls, cache structure
+    Note over U: Cause found:<br/>a tool result<br/>stuffed 200KB of text into context
+
+ +
+ What the emitted span actually looks like (metadata-only) +
{
+  "name": "ccxray.llm_request",
+  "attributes": {
+    "ccxray.entry_id":        "2026-05-12T09-31-04-227",
+    "ccxray.dashboard_url":   "http://localhost:5577/entry/2026-05-12T09-31-04-227",
+    "ccxray.provider":        "anthropic",
+    "model":                  "claude-opus-4-7",
+    "tokens.input":            45230,
+    "tokens.output":            1820,
+    "tokens.cache_read":       38500,
+    "tokens.cache_creation":    6730,
+    "cost.usd":              0.0825,
+    "latency_ms":              4210,
+    "tools.count":                 3,
+    "tools.names":  ["Read","Bash","Edit"]
+  }
+}
+ Note: it contains no prompt text, no tool input, and no tool output. +
+ +
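+ Files touched +
    +
  • Everything in Plan A (metrics)
  • +
  • Add a section to server/otel.js that opens a span and attaches entry_id
  • +
  • Modify server/forward.js: emit the span after the entry is written, taking attributes directly from the metadata already in the store
  • +
  • Modify server/routes/api.js: add an /entry/:id route that deep-links straight to that entry
  • +
  • No need for large-payload chunking, PII masking, or log events (none of that is sent)
  • +
+ + Pros +
    +
  • Zero privacy risk: no conversation content leaves the machine
  • +
  • Tiny data volume: ~1KB per turn, which even the Grafana free tier can absorb
  • +
  • Clear drill-back path: spot something odd in Grafana → click the link → see the details in the ccxray dashboard
  • +
  • Complements the dashboard instead of overlapping with it: Grafana handles cross-cutting aggregation, the dashboard handles per-entry detail
  • +
  • Simpler to implement than B: no need to synthesize tool span timing (which was never accurate anyway); emit a single llm_request span plus entry_id. Estimated 2–3 days
  • +
+ + Cons / limitations +
    +
  • ⚠️ Drill-back only works on the same machine (inherent to local logs). If users are remote or share a host, decide first where the logs live
  • +
  • ⚠️ Once local logs are rotated or deleted, the entry_id on the trace becomes a dead link. The dashboard can show a "this entry has expired" notice
  • +
  • ⚠️ If the user runs Grafana without the ccxray dashboard, the link will not open (possible fallback: add a ccxray.summary attribute to the span carrying a 50-character summary)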
  • +
+
+ +
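+Why is this better than both B and C? +
    +
  • Better than B: no worrying about inaccurate synthetic tool span timing or duplication with the CLI (metadata-level duplication is harmless), and it adds drill-back
  • +
  • Better than C: nearly the same value (the full payload is still viewable), while winning across the board on privacy, cost, and implementation complexity
  • +
  • In essence, the design uses trace_id / entry_id as an index and outsources storage to the local machine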
  • +
+
+ + +

6. Four-plan comparison

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
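Aspect | Plan A: Metrics Only | Plan B: + Synthetic Traces | Plan C: Full payload | Plan D: Hybrid drill-back
Implementation effort | 1–2 days | 3–5 days | 1–2 weeks | 2–3 days
User value | Medium: cost / token trends | High: turn timing analysis | Situational: power users | Highest: aggregation plus detail
Conflict with the CLI's built-in OTel | None | Duplicate spans | Worse duplication | None (uses the ccxray.* namespace)
Codex / Gemini support | Yes (only source) | Yes (only source) | Yes (only source)
Data volume / backend cost | Very low | High, needs sampling | Low (~1KB/turn)
Privacy risk | None (payload never leaves the machine)
Degree of replacing the dashboard | No conflict at all | Partial overlap | Heavy overlap | Complementary
Requires the ccxray dashboard to stay running | No | No | No | Drill-back requires the local log to still exist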
+ + +

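7. The manager's view: what else can be seen?

+ +

Beyond cost / tokens, managers (team leads, platform owners) usually also want to know:

+
    +
  • Who uses which MCP server? How many times each? Which MCP has the highest failure rate?
  • +
  • Which tools are called most? What share do Bash, Read, Edit, WebSearch, and the rest each take?
  • +
  • Skill adoption? Which skills get triggered? Which were built but never used?
  • +
  • Differences per team / project? Both use Claude Code, so how does team A's behavior differ from team B's?
  • +
+ +

All of these are metrics plus attributes (labels). They fall within the capabilities of Plan A and Plan D and need neither traces nor full payloads.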

+ +

Example counters that can be emitted

+ +
+
# Invocations per MCP server
+ccxray.mcp.invocations_total {server="filesystem", tool="read_file"} = 1248
+ccxray.mcp.invocations_total {server="github",     tool="create_pr"} =   42
+ccxray.mcp.invocations_total {server="slack",      tool="post_message"} = 89
+
+# MCP failures
+ccxray.mcp.errors_total {server="github", error_type="timeout"} = 7
+
+# Built-in tool usage
+ccxray.tool.invocations_total {tool="Bash",   provider="anthropic"} = 5230
+ccxray.tool.invocations_total {tool="Read",   provider="anthropic"} = 8120
+ccxray.tool.invocations_total {tool="Edit",   provider="anthropic"} = 1840
+ccxray.tool.invocations_total {tool="WebSearch", provider="anthropic"} = 92
+
+# Skill activations (parsed from the system prompt)
+ccxray.skill.activations_total {skill="release",   provider="anthropic"} = 12
+ccxray.skill.activations_total {skill="git-commit", provider="anthropic"} = 87
+
+# Sessions per provider
+ccxray.sessions_total {provider="anthropic"} = 234
+ccxray.sessions_total {provider="codex"}     =  41
+
+# Dimensions compose: token consumption broken down by model
+ccxray.tokens.input_total {model="claude-opus-4-7", provider="anthropic"} = 12_500_000
+ccxray.tokens.input_total {model="claude-sonnet-4-6", provider="anthropic"} = 38_200_000
+
+ +

Data sources ccxray already has

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
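Metric to emit | Existing ccxray source | Difficulty
ccxray.tool.invocations_total | tool_use blocks in the response (already parsed) |
ccxray.mcp.invocations_total | tool names prefixed with mcp__<server>__<tool> (existing naming convention) |
ccxray.skill.activations_total | skill-trigger markers in the system prompt (system-prompt.js already parses them) | Medium (marker needs confirming)
ccxray.sessions_total | session inference in store.js |
ccxray.tokens.* / cost.* | pricing.js + the response usage field |
Per-user / per-team breakdown | needs new setup guidance for OTEL_RESOURCE_ATTRIBUTES=enduser.id=... | Medium (depends on users setting the env var)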
+ +
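The `mcp__<server>__<tool>` naming convention listed above makes MCP attribution a plain string operation. The following is a minimal sketch, not ccxray's actual parser: `classifyToolName` is a hypothetical helper, and it assumes server names never contain a double underscore.

```javascript
// Hypothetical helper for illustration; ccxray's real parser may differ.
// Splits a tool name taken from a tool_use block into built-in vs MCP parts,
// relying on the mcp__<server>__<tool> prefix convention.
function classifyToolName(name) {
  // Lazy match on the server part: assumes server names contain no "__".
  const match = /^mcp__(.+?)__(.+)$/.exec(name);
  if (!match) {
    return { kind: 'builtin', tool: name };
  }
  return { kind: 'mcp', server: match[1], tool: match[2] };
}

console.log(classifyToolName('mcp__github__create_pr'));
console.log(classifyToolName('Bash'));
```

The `server` and `tool` fields map directly onto the labels of `ccxray.mcp.invocations_total`, while built-in names would feed `ccxray.tool.invocations_total`.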

More metrics managers care about (all within Plan A's capabilities)

+ +

ccxray can derive every metric below from the data it sees over HTTP, with no need for traces or full payloads:

+ +
+ +
+

📈 Productivity / adoption

+
    +
  • ccxray.users.active_daily
    DAU / WAU, to gauge how well rollout is going
  • +
  • ccxray.sessions.duration_seconds
    histogram; average session length
  • +
  • ccxray.turns_per_session
    conversation turns per session
  • +
  • ccxray.first_token_latency_ms
    perceived UX speed
  • +
  • ccxray.agent_type.invocations
    usage counts for general / explore / plan / custom subagents
  • +
+
+ +
+

💰 Cost efficiency

+
    +
  • ccxray.cache.hit_ratio
    cache_read / total_input; below 70% suggests a prompt design problem
  • +
  • ccxray.cost_per_session_usd
    histogram; finds the money-burning outliers
  • +
  • ccxray.tokens.output_per_input_ratio
    output / input ratio; too low means wasted context
  • +
  • ccxray.quota.burn_rate_pct
    percentage of the 5h / weekly quota consumed
  • +
  • ccxray.retries_total
    retry count, an indirect cost
  • +
+
+ +
+

🚨 Quality / reliability

+
    +
  • ccxray.errors_total{type}
    rate_limit / overloaded / timeout / 500
  • +
  • ccxray.stop_reason{reason}
    end_turn / tool_use / max_tokens / stop_sequence
  • +
  • ccxray.max_tokens_hit_rate
    truncation rate; high means poor UX
  • +
  • ccxray.latency_ms{model,p95}
    SLA per model
  • +
  • ccxray.aborted_total
    share of user ctrl-c / Esc aborts
  • +
+
+ +
+

🧠 Usage patterns

+
    +
  • ccxray.context.utilization_pct
    histogram; how full the context window gets on average
  • +
  • ccxray.auto_compact.triggered_total
    number of compaction triggers, signalling "needs a larger context"
  • +
  • ccxray.subagent.invocations
    ratio of main agent to Task subagents
  • +
  • ccxray.tools_per_turn
    average number of tools called per turn
  • +
  • ccxray.thinking.token_ratio
    share of output taken by extended thinking
  • +
+
+ +
+

🛠️ Tool / MCP details

+
    +
  • ccxray.tool.latency_ms{tool}
    estimated tool execution time (arrival of the next request − end of the previous response)
  • +
  • ccxray.tool.result_size_bytes{tool}
    tool result size; oversized results eat context
  • +
  • ccxray.tool.failures_total{tool,reason}
    parsed from is_error on tool_result
  • +
  • ccxray.mcp.unique_servers
    how many MCP servers a user has connected
  • +
  • ccxray.bash.command_pattern{cmd}
    what bash runs most often (first token only; carries cardinality risk and needs an allow-list)
  • +
+
+ +
+

🔒 Governance / security

+
    +
  • ccxray.permission_mode.usage{mode}
    share of default / acceptEdits / bypassPermissions (yolo)
  • +
  • ccxray.dangerous_tool.invocations
    detection of rm -rf / force-push / drop table
  • +
  • ccxray.file_writes_total
    Edit + Write combined
  • +
  • ccxray.provider.distribution
    share of Anthropic vs Codex vs Gemini
  • +
  • ccxray.system_prompt.version_changes
    how many times the agent system prompt changed (shows who is customizing)
  • +
+
+ +
+ +
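+Cardinality warning: metrics carrying high-cardinality attributes such as {user}, {cmd}, or {file_path} will blow up the backend. When designing: +
    +
  • Low cardinality labels (tool name, model, provider, error type): use directly
  • +
  • Medium cardinality (user, team): use OTEL_RESOURCE_ATTRIBUTES, and cap the number of users
  • +
  • High cardinality (file path, full command, prompt text): never a metric label; only a trace attribute or log event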
  • +
+
+ +

Grafana / Datadog reports managers can build

+ +
+
+flowchart TB
+    subgraph Reports [Typical management reports]
+        R1["📊 Weekly token consumption / cost per team
(by enduser.id)"] + R2["🔧 Top 10 most-used tools
(by tool name)"] + R3["🔌 Usage heat per MCP server
(by server name)"] + R4["⚙️ Skill adoption ranking
(used vs unused)"] + R5["💸 Which model burns the most money
(by model + provider)"] + R6["🚨 MCP failure-rate alert
(error rate > X%)"] + end + + M["Metrics sent by ccxray
with attributes:
tool / mcp / skill / model / user"] --> Reports + + style M fill:#88c0d0,stroke:#5e81ac,color:#0f1419 +
+
+ +
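+Key insight: questions of the form "what was used, and how often" need only metrics (the core of Plan A), not traces or payloads. With cardinality under control (tool names and MCP server names are finite sets), even free-tier Grafana / Prometheus can handle the load. +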
+ + +

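8. Recommended path

+ +
+Split the work into two phases: Phase 1 implements Plan A (including the management-facing metrics), and Phase 2 upgrades to Plan D. +
+ +

Phase 1: Plan A, multi-faceted metrics (1–2 weeks)

+
    +
  • cost / token counters
  • +
  • tool / MCP / skill usage counters (the manager's view)
  • +
  • session / provider dimensions
  • +
  • Support OTEL_RESOURCE_ATTRIBUTES so users can tag team / project / user
  • +
  • Value: connect Grafana and management reports are available immediately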
  • +
+ +

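Phase 2: upgrade to Plan D, adding traces + drill-back (another 2–3 days)

+
    +
  • Open one metadata-only span per llm_request
  • +
  • Spans carry ccxray.entry_id + dashboard_url
  • +
  • Add the /entry/:id deep-link route
  • +
  • Value: see an anomaly in Grafana → jump back to the local machine in one click to view the full payload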
  • +
+ +

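Not recommended

+
    +
  • Plan B's synthetic tool span timing: the timing is inaccurate and misleading, and it collides head-on with the CLI's built-in OTel
  • +
  • Plan C's full payload export: the privacy risk is high, and it overlaps with ccxray's own role. If users truly need it, it should be a separate feature ("upload ccxray logs to S3 / a self-hosted backend") and not something squeezed into the OTel pipeline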
  • +
+ + +

9. Pre-mortem and solutions

+ +

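Before building, imagine: "if this has failed six months from now, what was the cause?" Each risk is scored on a weighted 10-point scale, and only solutions scoring ≥ 9 are accepted. There are 10 risks in total (1 skipped as an acceptable risk, 2 added by a follow-up scan), and all 9 solutions passed.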

+ + + +
+ +
+

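#1 Cardinality explosion (high impact, 9.4 / 10)

+

A user sets enduser.id to an email address or uses bash commands as a label, and the Grafana account gets throttled

+
Solution
+
    +
  • Attribute key allow-list (View API)
  • +
  • Per-(metric, attribute) cardinality budget; values beyond it are recorded as _overflow_
  • +
  • ccxray.metrics.overflow_total sentinel + ccxray status --metrics shows usage
  • +
  • Every new metric must register a schema; CI fails if it is missing
  • +
+
Verification
+
    +
  • Implementation: feed 51 unique values and assert that the 51st is recorded as overflow
  • +
  • In production: overflow counter > 0 → a warning surfaces automatically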
  • +
+
+ +
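The per-(metric, attribute) budget with `_overflow_` fallback described in #1 can be sketched as follows. This is a minimal illustration under stated assumptions: the `CardinalityBudget` class, its method names, and the default budget of 50 are hypothetical and not ccxray's actual API.

```javascript
// Hypothetical sketch; class and method names are illustrative only.
// Tracks the unique values admitted per (metric, attribute) pair and
// rewrites any value beyond the budget to the literal "_overflow_".
class CardinalityBudget {
  constructor(budgets = {}, defaultBudget = 50) {
    this.budgets = budgets;             // e.g. { tool: 50, model: 10 }
    this.defaultBudget = defaultBudget; // assumed fallback budget
    this.seen = new Map();              // "metric|attribute" -> Set of admitted values
    this.overflow = new Map();          // "metric|attribute" -> overflow count
  }

  // Returns the value to record: the original value while it fits the
  // budget, otherwise "_overflow_" (and bumps the sentinel count).
  admit(metric, attribute, value) {
    const key = `${metric}|${attribute}`;
    const limit = this.budgets[attribute] ?? this.defaultBudget;
    if (!this.seen.has(key)) this.seen.set(key, new Set());
    const values = this.seen.get(key);
    if (values.has(value)) return value;
    if (values.size < limit) {
      values.add(value);
      return value;
    }
    this.overflow.set(key, (this.overflow.get(key) ?? 0) + 1);
    return '_overflow_';
  }

  // Stand-in for the ccxray.metrics.overflow_total sentinel.
  overflowCount(metric, attribute) {
    return this.overflow.get(`${metric}|${attribute}`) ?? 0;
  }
}
```

Values already admitted keep passing through after the budget fills, so existing series stay stable and only new unique values collapse into the overflow bucket.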
+

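#2 Nobody uses it and the feature dies (high impact, 9.0 / 10)

+

Six months in, < 5% of users have enabled OTel and the maintenance cost becomes a sunk cost

+
Solution
+
    +
  • ccxray --otel-demo starts Grafana locally in one command, with data visible in 30 seconds
  • +
  • README tutorial with screenshots: connect Grafana in 90 seconds
  • +
  • Local heartbeat tracks the usage rate (never sent out)
  • +
  • Three-month sunset clock: stop the loss if there are < 10 GitHub mentions
  • +
+
Verification
+
    +
  • Implementation: 3 new users walk through the flow; median time to first data < 5 minutes
  • +
  • In production: an explicit three-month KPI gate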
  • +
+
+ +
+

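#3 Cross-machine drill-back breaks entirely (high impact, 9.4 / 10)

+

A manager sees a trace in Grafana, but the localhost link will not open (it points at the engineer's machine)

+
Solution
+
    +
  • Span carries entry_id + host + 50-character summary + local_url + optional public_url
  • +
  • When the dashboard's /entry/:id cannot find the entry, it degrades gracefully and shows an "on host=X" hint
  • +
  • Docs spell out the drill-back path for individuals, small teams, and large teams
  • +
+
Verification
+
    +
  • Implementation: CI runs two ccxray instances to simulate cross-machine drill-back
  • +
  • In production: deeplink_resolved_total{outcome} tracks the share of wrong_host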
  • +
+
+ +
+

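#4 CLI OTel conflict → double billing (medium impact, 9.5 / 10)

+

A user enables both CLI and ccxray OTel; tokens are counted twice and every budget alert is wrong

+
Solution
+
    +
  • Enforce the ccxray.* namespace; do not mirror claude_code.* fields
  • +
  • Detect CLAUDE_CODE_ENABLE_TELEMETRY, enter complement mode, and print a warning
  • +
  • Every emit carries the ccxray.source=ccxray-proxy resource attribute
  • +
  • ccxray.reconciliation.token_diff_pct: the reconciliation difference against the CLI
  • +
+
Verification
+
    +
  • Implementation: fixture test with both enabled; assert the source attribute keeps them apart
  • +
  • In production: alert when reconciliation diff > 5%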
  • +
+
+ +
+

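#5 Managers misuse metrics to monitor individuals (high impact, 9.7 / 10)

+

A team lead brings usage counts to a review meeting, and engineers collectively abandon ccxray

+
Solution
+
    +
  • Three tiers: OFF by default / project-anonymous / personal-named
  • +
  • The project is the upper bound and personal config the lower bound; individuals can downgrade and opt out at any time
  • +
  • Personal-named identity lives in .ccxray.user.json (gitignored) and never enters the repo
  • +
  • Startup banner + ccxray status --otel + ccxray otel preview dry-run
  • +
  • Docs state explicitly: do not use these metrics to evaluate individual performance
  • +
+
Verification
+
    +
  • Implementation: test the full matrix of 4 tier upgrade/downgrade combinations
  • +
  • In production: tier_distribution tracks adoption; if tier 2 < 5%, strengthen the docs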
  • +
+
+ +
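The tier rule in #5 (project config is the upper bound, personal config may only equal or downgrade) reduces to taking the minimum of the two tiers, which is how the design notes in this change state it. A minimal sketch, assuming both tiers have already been read as integers 0 to 2; `TIER` and `resolveEffectiveTier` are hypothetical names:

```javascript
// Hypothetical sketch of tier resolution; names are illustrative only.
const TIER = Object.freeze({
  DISABLED: 0,          // no SDK init, no network egress
  PROJECT_ANONYMOUS: 1, // project-level attributes, no individual identity
  PERSONAL_NAMED: 2,    // adds enduser.id
});

// effective_tier = min(project_tier, personal_tier):
// a personal setting can lower the project's tier but never raise it.
function resolveEffectiveTier(projectTier, personalTier) {
  for (const tier of [projectTier, personalTier]) {
    if (!Number.isInteger(tier) || tier < TIER.DISABLED || tier > TIER.PERSONAL_NAMED) {
      throw new RangeError(`invalid tier: ${tier}`);
    }
  }
  return Math.min(projectTier, personalTier);
}

// An engineer opting out on their own machine overrides any project setting.
console.log(resolveEffectiveTier(TIER.PERSONAL_NAMED, TIER.DISABLED));
```

How a missing personal config file should be treated is left out of the sketch on purpose, since that default is a policy decision and not arithmetic.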
+

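#6 Parser drift (skill / MCP / tool) (medium impact, 9.4 / 10)

+

Claude Code changes its prompt format, the skill detector reports all zeros, and nobody notices for six months

+
Solution
+
    +
  • Schema-based parsers (parsers/*.schema.json, versioned)
  • +
  • Snapshot fixtures: one fixed request/response set per provider
  • +
  • Sentinel metrics: ccxray.parser.unknown_*_total. Unrecognized input is not reported as 0; it means "seen but could not be classified"
  • +
  • Reconciliation invariants: the tool_use block count must match the extracted count
  • +
  • Parsers are wrapped in try/catch; a failure does not affect the ccxray core
  • +
  • ccxray parser report command shows the top 10 unknowns in one step
  • +
+
Verification
+
    +
  • Implementation: feed an unknown tool → assert the sentinel increments and nothing throws
  • +
  • In production: reconciliation_mismatch > 0 = bug; unknown_* persisting for 7 days triggers an automatic suggestion to check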
  • +
+
+ +
+

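#7 Bundle size bloat (acceptable, skipped)

+

@opentelemetry/sdk-node + auto-instrumentations grow ccxray from 3MB to 18MB

+
Disposition
+
    +
  • The user judged this an acceptable risk, so formal evaluation was skipped
  • +
  • Self-imposed constraint during implementation: import only the necessary modules (api, sdk-metrics, exporter-otlp-http) and leave out auto-instrumentations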
  • +
+
+ +
+

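#8 Hub mode env propagation (low impact, 9.5 / 10)

+

A user changes env and reruns, but the hub is still running in the background with the old settings; they believe it is fixed while data still goes to the wrong place

+
Solution
+
    +
  • Business OTel runs on the client side, not the hub (the hub only handles proxying + SSE broadcast)
  • +
  • Each client reads .ccxray.json + personal config + env for itself
  • +
  • Different tiers / endpoints coexist naturally under the same hub
  • +
  • The hub emits its own ccxray.hub.* operational metrics (uptime / requests / clients)
  • +
  • ccxray status shows each client's tier and env consistency
  • +
+
Verification
+
    +
  • Implementation: two clients with different configs on the same hub; assert each sends to its own destination
  • +
  • In production: env_inconsistency_total tracks the accumulated "changed but not restarted" cases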
  • +
+
+ +
+

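#11 Memory / behavior when the collector is down (medium impact, 9.4 / 10)

+

The collector goes down, the OTel SDK retries forever, and the buffer grows until ccxray OOMs

+
Solution
+
    +
  • Bounded queue (2048); drop oldest when full
  • +
  • Circuit breaker: 5 consecutive failures → open for 60s → half-open probe → on failure, back off (60→120→240→600s)
  • +
  • State + dropped counts are written to the local log, not sent over the network (the network is already down)
  • +
  • Design choice: losing data is better than dragging down ccxray, and the docs say so
  • +
+
Verification
+
    +
  • Implementation: mock collector returns 500; assert memory does not grow and the drop counter increments
  • +
  • In production: a long accumulated circuit_breaker_open_seconds = a persistent problem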
  • +
+
+ +
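The circuit breaker in #11 (5 consecutive failures, 60s open, half-open probe, 60→120→240→600s backoff) can be sketched as a small state holder. This is an illustration under assumptions: the class, its method names, and the caller-supplied `now` timestamp are hypothetical, and the backoff is modelled as a fixed schedule because plain doubling would not reproduce the 240→600 step.

```javascript
// Hypothetical sketch; names and the injected timestamp are illustrative only.
// Backoff schedule taken from the text above: 60 -> 120 -> 240 -> 600 seconds.
const BACKOFF_SCHEDULE_MS = [60_000, 120_000, 240_000, 600_000];

class CircuitBreaker {
  constructor({ failureThreshold = 5, schedule = BACKOFF_SCHEDULE_MS } = {}) {
    this.failureThreshold = failureThreshold;
    this.schedule = schedule;
    this.state = 'closed'; // 'closed' | 'open' | 'half_open'
    this.consecutiveFailures = 0;
    this.step = 0;         // index into the backoff schedule
    this.openedAt = 0;
  }

  get openDurationMs() {
    return this.schedule[this.step];
  }

  // Call before each export attempt; moves open -> half_open once the wait has elapsed.
  canAttempt(now) {
    if (this.state === 'open' && now - this.openedAt >= this.openDurationMs) {
      this.state = 'half_open';
    }
    return this.state !== 'open';
  }

  recordSuccess() {
    this.state = 'closed';
    this.consecutiveFailures = 0;
    this.step = 0;
  }

  recordFailure(now) {
    if (this.state === 'open') return; // already waiting; ignore stragglers
    if (this.state === 'half_open') {
      // Failed probe: advance the backoff, capped at the last schedule entry.
      this.step = Math.min(this.step + 1, this.schedule.length - 1);
      this.trip(now);
      return;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.trip(now);
  }

  trip(now) {
    this.state = 'open';
    this.openedAt = now;
    this.consecutiveFailures = 0;
  }
}
```

Passing `now` in explicitly keeps the sketch deterministic under test; a real caller would pass `Date.now()`.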
+

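#12 Config secret risk (medium impact, 9.5 / 10)

+

A user writes an Authorization token into .ccxray.json and commits it to git

+
Solution
+
    +
  • ${ENV_VAR} interpolation; tokens may live only in env
  • +
  • Schema rejects literals that look like secrets (patterns such as Bearer, JWT, ghp_)
  • +
  • On first generation of .ccxray.json, automatically add a .gitignore reminder
  • +
  • ccxray status scans git-tracked config for plaintext secrets
  • +
+
Verification
+
    +
  • Implementation: feed a literal Bearer value → schema rejects it and suggests a fix
  • +
  • Implementation: feed ${TOKEN} with the env var unset → startup fails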
  • +
+
+ +
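The two checks in #12 (reject literal-looking secrets, resolve `${ENV_VAR}` placeholders and fail when one is unset) can be sketched together. This is a minimal illustration, not ccxray's config loader: `resolveConfigString` is a hypothetical name, and the regular expressions are assumed approximations of the Bearer / JWT / `ghp_` patterns the text mentions.

```javascript
// Hypothetical sketch; function name and exact patterns are assumptions.
const PLACEHOLDER = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;

const SECRET_PATTERNS = [
  /Bearer\s+[A-Za-z0-9._-]{20,}/,                         // literal bearer token
  /\bghp_[A-Za-z0-9]{10,}/,                               // GitHub personal access token prefix
  /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/,  // JWT-shaped string
];

// Validates one config string, then substitutes ${VAR} from `env`.
function resolveConfigString(value, env) {
  // Strip placeholders first so "Bearer ${TOKEN}" is not flagged as a literal.
  const literalPart = value.replace(PLACEHOLDER, '');
  if (SECRET_PATTERNS.some((pattern) => pattern.test(literalPart))) {
    throw new Error('config contains a literal secret; use ${ENV_VAR} instead');
  }
  return value.replace(PLACEHOLDER, (_match, name) => {
    const resolved = env[name];
    if (resolved === undefined || resolved === '') {
      throw new Error(`unresolved environment variable: ${name}`);
    }
    return resolved;
  });
}

console.log(resolveConfigString('Bearer ${TOKEN}', { TOKEN: 'from-env' }));
```

Taking `env` as a parameter (a real caller would pass `process.env`) keeps the function pure and easy to exercise with both failure cases from the verification list.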
+

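#13 OTel failure fallback strategy (medium impact, 9.7 / 10)

+

The OTel config is wrong or the collector is down, and ccxray fails to start at all

+
Solution
+
    +
  • Three failure layers: config error (startup fails) / init error (degrade; ccxray keeps running) / runtime error (handled by #11)
  • +
  • State machine: disabled / active / degraded / circuit_open
  • +
  • ~/.ccxray/otel.log records the most recent 100 failures and rotates automatically
  • +
  • Core principle: OTel is an enhancement, not a requirement. Network problems do not block startup; config errors do
  • +
+
Verification
+
    +
  • Implementation: feed a bad endpoint URL → assert ccxray still starts, the proxy still forwards, and status shows degraded
  • +
  • In production: otel.state{state} shows the share of degraded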
  • +
+
+ +
+ +

Shared infrastructure

+ +

#11–#13 share one failure-handling framework, which reduces total effort:

+ +
+
server/otel-health.js        # failure-handling framework (shared)
+  ├─ State machine (active / degraded / circuit_open / disabled)
+  ├─ Bounded queue + drop counter
+  ├─ Circuit breaker
+  ├─ Local log writer (~/.ccxray/otel.log)
+  └─ Status reporter (feeds the ccxray status command)
+
+server/config-loader.js      # config loading (shared)
+  ├─ JSON Schema validation
+  ├─ ${ENV_VAR} interpolation
+  ├─ Secret pattern detection
+  └─ .gitignore check
+
+ +
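The "Bounded queue + drop counter" component listed under `server/otel-health.js` can be illustrated with a drop-oldest queue. A minimal sketch under assumptions: the `BoundedQueue` class and its methods are hypothetical, and only the default capacity of 2048 comes from the text.

```javascript
// Hypothetical sketch; class and method names are illustrative only.
// A fixed-capacity FIFO that discards the oldest entry when full and
// counts every discard, so memory stays flat while the collector is down.
class BoundedQueue {
  constructor(capacity = 2048) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0; // local stand-in for a dropped-exports counter
  }

  push(item) {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // drop oldest: newer data is assumed more useful
      this.dropped += 1;
    }
    this.items.push(item);
  }

  // Removes and returns up to `max` entries, oldest first.
  drain(max = this.items.length) {
    return this.items.splice(0, max);
  }

  get size() {
    return this.items.length;
  }
}
```

`Array.prototype.shift` is O(n), which is acceptable at this capacity; a ring buffer would be the natural replacement if the queue were much larger.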

Conclusion

+ +
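+All 9 pre-mortem solutions score ≥ 9, so the work can move to implementation. Each risk's post-launch monitoring metric is itself part of what ccxray emits over OTel, so the design is self-verifying: the system can keep checking whether it has broken. +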
+ +

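+This document lives at docs/otel-integration.html. It contains exploratory notes made before the decision; during implementation, treat the final PR as authoritative. +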

+ +
+ + + + + + diff --git a/openspec/changes/add-otel-metrics-phase1/.openspec.yaml b/openspec/changes/add-otel-metrics-phase1/.openspec.yaml new file mode 100644 index 0000000..40cc12f --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-05-12 diff --git a/openspec/changes/add-otel-metrics-phase1/design.md b/openspec/changes/add-otel-metrics-phase1/design.md new file mode 100644 index 0000000..e584e80 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/design.md @@ -0,0 +1,166 @@ +## Context + +ccxray currently emits no telemetry to external systems. All observation happens via the local dashboard reading from `~/.ccxray/logs/`. Adding OpenTelemetry export changes ccxray's blast radius — data starts leaving the user's machine — and intersects with three sensitive design surfaces: + +1. **Privacy.** Engineers run ccxray in their own dev environment. Any telemetry that identifies them by default would break that contract. +2. **Trust with managers.** Aggregated metrics are genuinely useful for engineering leaders, but a feature that lets a manager track individual tool usage will trigger a backlash that kills adoption. +3. **Provider neutrality.** Claude Code's CLI has built-in OTel for Anthropic; Codex/Gemini have none. ccxray must coexist with the CLI without double-counting, and must remain the only telemetry source for non-Anthropic providers. + +Before drafting this design, an 11-risk pre-mortem was completed and recorded in `docs/otel-integration.html`. Every accepted solution scored ≥ 9/10 on weighted criteria including verification mechanisms. The design below is the synthesis of those solutions. + +## Goals / Non-Goals + +**Goals:** + +- Provide ccxray-emitted OTel metrics covering cost, usage (tool/MCP/skill), quality (errors/latency/cache), patterns (context/subagent), and governance. +- Default OFF. Zero telemetry until the user explicitly opts in per-project. 
+- Three-tier opt-in (disabled / project-anonymous / personal-named) where the project sets an upper bound and personal config can only equal-or-downgrade. +- Coexist with Claude Code CLI's built-in OTel without overlap, with a reconciliation metric to surface accounting bugs on either side. +- Never let OTel failure break the proxy. Config errors fail at startup, init errors degrade silently, runtime errors are absorbed by a bounded queue + circuit breaker. +- Make parser drift visible. Unknown tools / skills / MCP markers must increment a sentinel counter rather than silently turn into zero. +- Provide introspection: `ccxray status --otel`, `ccxray otel preview` (dry-run), `ccxray parser report`. + +**Non-Goals:** + +- **Traces / spans.** Phase 1 emits metrics only. Spans, `entry_id` deep-link attributes, and `/entry/:id` drill-back UI are Phase 2. +- **Full payload export.** Request/response bodies never leave the machine. If a future user wants this, it belongs in a separate "ccxray log → S3 / self-hosted backend" product, not in the OTel pipeline. +- **Synthetic tool span timing.** Tool execution durations inferred from HTTP cadence would be misleading; the CLI emits accurate timing for Anthropic, and we will not compete with inaccurate data. +- **Central ccxray hub for team-wide aggregation.** Each engineer's ccxray remains local. Cross-machine correlation, if needed, is a Phase 2+ discussion. +- **Auto-instrumentation.** We will not pull in `@opentelemetry/auto-instrumentations-node`. ccxray controls every emit point explicitly to keep the dependency footprint and behavior predictable. + +## Decisions + +### D1. Default OFF with three-tier opt-in + +Three tier values: + +- **tier 0 (disabled)** — no OTel SDK initialization, no network egress. Default behavior when no config file or env override exists. +- **tier 1 (project anonymous)** — metrics emit with project-level attributes (`project.name`, optional `team`) but no individual identity. 
Activated by `.ccxray.json` checked into the repo. +- **tier 2 (personal named)** — adds `enduser.id` (a self-chosen string, not necessarily real name) to allow individual ccxray usage analytics. Activated by `.ccxray.user.json` in the working directory, which is gitignored. + +Resolution rule: `effective_tier = min(project_tier, personal_tier)`. Project config is the upper bound; personal config can only equal or downgrade. An engineer can always set tier 0 in personal config to opt out of project-level emit on their own machine. + +**Alternatives considered:** + +- *Always-on anonymous* — rejected. "Anonymous" telemetry has well-documented re-identification risks; defaulting to ON breaks the implicit trust contract. +- *Cookie-style consent prompt at startup* — rejected. Prompt fatigue leads to blanket yes; one-time `opt_in_acknowledged_at` timestamp in personal config achieves the same intent without nagging. +- *k-anonymity at the backend* — rejected. ccxray does not control the backend; small teams (k < 5) cannot rely on this guarantee. + +### D2. Client-side emit, not hub-side + +OTel SDK initialization and metric emission happen in the client process (the one that ran `ccxray claude`). The hub remains a pure HTTP proxy plus SSE broadcaster. The hub MAY emit its own operational metrics under `ccxray.hub.*` namespace using a separate config (`~/.ccxray/hub-config.json`), but it does NOT emit business metrics on behalf of clients. + +This means different projects connecting to the same hub can configure different tiers, endpoints, and `OTEL_RESOURCE_ATTRIBUTES` without interfering with each other. + +**Alternatives considered:** + +- *Hub-side emit with per-client config fanout* — rejected. Adds a routing/fan-out concern to the hub with no clear value; the hub would need to track which spans belong to which client config. +- *Hub-only emit, ignore per-project differences* — rejected. 
Conflicts with D1 and forces every project on a host to share one OTel destination. + +### D3. `ccxray.*` namespace, never mirror `claude_code.*` + +Every metric uses `ccxray..` (`ccxray.tokens.input_total`, `ccxray.tool.invocations_total`, etc.). Every emit carries the resource attribute `ccxray.source="ccxray-proxy"`. When the CLI's `CLAUDE_CODE_ENABLE_TELEMETRY=1` is detected, ccxray enters "complement mode" and adds `ccxray.cli_otel_active=true` to its emits, plus a startup notice explaining how to choose between the two metric families. + +A new reconciliation metric `ccxray.reconciliation.token_diff_pct{model}` exposes the percentage difference between ccxray's HTTP-observed token counts and what the CLI reports (when both are running). A persistent non-zero diff indicates a pricing or accounting bug on one side and is itself a high-value signal. + +**Alternatives considered:** + +- *Auto-disable ccxray emit when CLI is active* — rejected. Loses the reconciliation signal and forfeits ccxray's Codex/Gemini advantage. +- *Same metric names, different resource* — rejected. Backends commonly aggregate by metric name first; using the same names would force users to filter by resource attribute on every panel. + +### D4. Cardinality budget with overflow fallback + +Every metric declares an allow-list of attribute keys and a per-key cardinality budget (e.g. `tool=50`, `model=10`, `mcp_server=30`). Attribute values are tracked in a `Set` per (metric, attribute); when the Set reaches budget size, subsequent unique values are recorded as the literal string `_overflow_` and a sentinel counter `ccxray.metrics.overflow_total{metric,attribute}` increments. + +Attribute keys not in the allow-list are dropped at the View API layer (OTel SDK native enforcement). High-cardinality candidates that look attractive (`bash.command_pattern`, `file_path`) are explicitly NOT emitted as metric labels. 
+ +**Alternatives considered:** + +- *Trust the backend to handle cardinality* — rejected. Free-tier Grafana Cloud, open-source Prometheus, and many enterprise backends impose hard limits that result in dropped series or account-level throttling. +- *Silent drop on overflow* — rejected. Violates the "no silent failure" principle. + +### D5. Failure isolation via state machine + bounded queue + circuit breaker + +`server/otel-health.js` owns a state machine with four states: + +- `disabled` — OTel never initialized (tier 0 or no config). +- `active` — SDK initialized, exports succeeding. +- `degraded` — SDK init failed; ccxray continues without OTel; status command shows the error. +- `circuit_open` — runtime export failures triggered the circuit breaker; periodic half-open retries. + +The export queue is bounded (default 2048 entries, configurable). On overflow, oldest entries are dropped and `ccxray.otel.exports_dropped_total{signal}` increments locally (network is presumed unreachable when the queue overflows). + +Circuit breaker: 5 consecutive failures → `circuit_open` for 60s → `half_open` test → success returns to `active`, failure backs off (60 → 120 → 240 → 600s max). + +**Alternatives considered:** + +- *Unbounded queue with retries* — rejected. OOMs ccxray when the collector is down. +- *Fail-fast on first error* — rejected. Transient errors are common; one timeout should not disable telemetry for the rest of the session. + +### D6. Config: `.ccxray.json` + `.ccxray.user.json` with `${ENV_VAR}` interpolation + +Two-file config: + +- `.ccxray.json` — project root, checked into git, sets tier upper bound and shared settings (endpoint, headers, resource attributes). +- `.ccxray.user.json` — project root or `$HOME`, gitignored, sets personal identity and overrides (only ever equal-or-downgrade vs project config). + +Both files support `${ENV_VAR}` interpolation in string values. 
The schema validator rejects any string that looks like a literal secret (`Bearer [A-Za-z0-9]{20,}`, `sk_live_*`, `ghp_*`, JWT structure) when not wrapped in `${...}`. First-time generation auto-amends `.gitignore` to include `.ccxray.user.json`. + +Config errors (syntax, schema, unresolved `${VAR}`) fail at startup with a clear error pointing to the offending line. Init errors (bad endpoint format) transition to `degraded`. Runtime errors (collector down) transition to `circuit_open`. + +**Alternatives considered:** + +- *Single file with comments marking secrets* — rejected. JSON has no comments and the convention is too fragile. +- *Pure env-var configuration* — rejected. Loses per-project granularity; same shell environment cannot easily switch contexts when working across multiple repos. + +### D7. Parser schema-ization with sentinel counters + +Tool / MCP / skill / agent-type detection moves from inline strings in `system-prompt.js` / `store.js` / `helpers.js` to versioned JSON schemas under `server/parsers/`. Each schema declares the patterns it recognizes and carries a `last_verified_against` date. + +For every entry processed, parsers emit: + +- The recognized metrics (tool invocations, skill activations, etc.). +- `ccxray.parser.unknown_*_total{provider}` counters when a token/marker is seen but not recognized. +- `ccxray.parser.reconciliation_mismatch_total{type}` when invariants fail (e.g. count of `tool_use` blocks in response ≠ count of tools extracted by parser). + +Parsers are wrapped in try/catch; on exception, `ccxray.parser.error_total{parser}` increments and the entry continues to be written to local logs (degraded OTel, never blocked proxy). + +Snapshot fixtures under `test/fixtures/parser/` lock current behavior; changes require committing new snapshots and pass review. + +**Alternatives considered:** + +- *Keep inline parsing* — rejected. Already fragile (silent dependence on Claude Code's evolving prompt format) and cannot detect drift. 
+- *Server-side parser updates via remote schema fetch* — rejected. Adds a new failure surface and security concern. + +### D8. CLI surface: `status --otel`, `otel preview`, `parser report` + +- `ccxray status --otel` — current tier, endpoint, OTel state, cardinality usage (e.g. `tool: 23/50`), dropped event counters, circuit breaker state. +- `ccxray otel preview` — dry-run printing the next export's content without sending. Lets users see exactly what would be exported before enabling. +- `ccxray parser report` — last 7 days of unknown tool / skill / MCP markers grouped by frequency; generates a GitHub issue template body for drift reports. + +Startup banner declares the active tier and (if applicable) complement-mode coexistence with CLI OTel. + +## Risks / Trade-offs + +- **Risk: Adoption stalls because individual devs do not have an OTel backend.** → Ship `ccxray --otel-demo` that spins up a local Grafana + Prometheus via Docker Compose so a developer can see their own metrics in 30 seconds without joining any external service. Set a 3-month KPI gate: < 10 GitHub references → pause Phase 2 investment. +- **Risk: Manager misuse for individual surveillance.** → Default OFF + tier 2 requires personal opt-in by the engineer + explicit `docs/otel-ethics.md` distributed as part of the change ("these metrics are not for individual performance evaluation; the reasons follow…"). Track `ccxray.otel.tier_distribution`: if tier 2 share is < 5%, strengthen the docs. +- **Risk: Cardinality explosion despite budgets.** → Budgets enforced at SDK View API layer with sentinel counter for overflow visibility. CI lint blocks new metrics that lack a schema entry. `ccxray.metrics.overflow_total > 0` for sustained periods triggers an in-status warning. +- **Risk: Bundle bloat from OTel SDK.** → Import only `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources`. No auto-instrumentations. 
Optional dependency pattern so the package still resolves when OTel deps are absent (lazy require). +- **Risk: Hub-mode env changes don't propagate.** → Business OTel is client-side (D2); hub env only affects `ccxray.hub.*` operational metrics. `ccxray status` displays per-client tier/endpoint so users can see whether each client has picked up the env they expected. +- **Risk: Parser drift when Anthropic changes the prompt format.** → Sentinel counters (`ccxray.parser.unknown_*_total`) make drift visible within hours instead of months; `last_verified_against` dates trigger quarterly re-verification; `ccxray parser report` makes drift reports easy to file. +- **Risk: OTel semconv conventions evolve and our attribute names become out of date.** → All metric names live in the schema registry under `server/otel.js`; a future migration is a search-and-replace plus a deprecation period. +- **Trade-off: We do not compete with the CLI on Anthropic tool span timing.** → Acceptable. Our value is the HTTP-layer truth, Codex/Gemini coverage, the reconciliation diff, and the future Phase 2 drill-back. + +## Migration Plan + +- **Forward.** Phase 1 ships behind opt-in defaults; existing ccxray users see no behavior change. Adopters add a `.ccxray.json`, set an endpoint, and confirm with `ccxray otel preview` before traffic flows. The `--otel-demo` subcommand provides a zero-config local Grafana for evaluation. +- **Rollback.** Each `ccxray.*` metric is a contract; once shipped, names cannot be renamed without a deprecation cycle. The schema registry tracks every metric with its introduction version. +- **Phase 2 prerequisites.** Shared modules introduced here (`otel-health.js`, `config-loader.js`, parser schemas, sentinel framework, status surface) are designed to host Phase 2's span emit and `/entry/:id` route without rework. + +## Open Questions + +- Should `.ccxray.json` lookup walk up from cwd to the nearest enclosing dir (monorepo-friendly), or only check cwd? 
Recommendation: walk up to nearest git root, take the first match. +- Should we ship `--otel-demo` Docker Compose files in this PR or as a follow-up doc? Recommendation: follow-up, to keep Phase 1 scope tight. +- Should `ccxray.hub.*` operational metrics ship in Phase 1 or be deferred? Recommendation: defer to keep this change focused on the client side. +- For the auto-update of `.gitignore`, should the user be prompted or should it be automatic? Recommendation: prompt the first time, with a `--yes` flag for automation. +- Should `ccxray --otel-demo` be a documented dev tool only, or a supported feature? Recommendation: dev tool only (clearly labeled experimental). diff --git a/openspec/changes/add-otel-metrics-phase1/proposal.md b/openspec/changes/add-otel-metrics-phase1/proposal.md new file mode 100644 index 0000000..f07267f --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/proposal.md @@ -0,0 +1,47 @@ +## Why + +ccxray captures everything an agent does at the HTTP layer — full request/response, token counts, cost, tool calls, MCP server activity, skill activations — but the data lives only in the local dashboard. Teams that already operate Grafana / Datadog / Honeycomb cannot aggregate ccxray's signals into their existing observability pipeline. Claude Code's CLI has built-in OTel for Anthropic only and does not expose the HTTP-layer truth ccxray sees; Codex, Gemini, and future providers have no OTel at all. The full design rationale, pre-mortem (11 risks scored ≥ 9/10) and alternative options live at `docs/otel-integration.html`. + +This change adds Phase 1: emit ccxray's metrics over OTLP, gated behind a default-off tiered opt-in, with a failure model that never degrades the proxy. Phase 2 (metadata-only traces with `entry_id` drill-back) is a follow-up. 
+ +## What Changes + +- New optional metric export under `ccxray.*` namespace covering cost, usage (tool / MCP / skill / agent_type / provider), quality (errors, stop_reason, latency, max_tokens_hit_rate), patterns (context_utilization, auto_compact_triggered, subagent_ratio, tools_per_turn) and governance (permission_mode, dangerous_tool, file_writes). +- New configuration files: `.ccxray.json` (repo, project-level) and `.ccxray.user.json` (gitignored, personal). `${ENV_VAR}` interpolation. Schema rejects literal-looking secrets. Auto-add `.ccxray.user.json` to `.gitignore` if missing. +- Three-tier opt-in model: **tier 0 disabled (default)** / tier 1 anonymous project-level / tier 2 personal named. Project config is the upper bound; personal config can only equal or downgrade. Engineers can opt out unilaterally. +- Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and enter "complement mode" with `ccxray.cli_otel_active=true` attribute; every metric carries `ccxray.source="ccxray-proxy"` resource attribute. New reconciliation metric `ccxray.reconciliation.token_diff_pct` cross-checks ccxray vs CLI accounting. +- Cardinality budget per (metric, attribute) with `_overflow_` fallback and `ccxray.metrics.overflow_total` sentinel; attribute key allow-list enforced via OTel View API. +- Parser schema-ization: extract tool / MCP / skill detection into `server/parsers/*.schema.json` with snapshot fixtures, sentinel metrics (`ccxray.parser.unknown_*_total`), and reconciliation invariants (tool_use block count must equal extracted count). +- Failure fallback: config errors fail fast at startup; init errors degrade silently (ccxray keeps proxying); runtime errors handled by bounded queue (drop oldest) + circuit breaker (5 failures → open 60s → exponential backoff). OTel failures **never** break the proxy. 
+- New shared modules: `server/otel-health.js` (state machine, circuit breaker, bounded queue, local log writer) and `server/config-loader.js` (JSON schema validation, env interpolation, secret detection, gitignore check). +- OTel emit lives in the **client** process, not the hub. Each project's tier/endpoint coexists on the same hub. Hub gains its own operational metrics under `ccxray.hub.*` namespace. +- New CLI commands: `ccxray status --otel` (current tier, endpoint, health, cardinality usage), `ccxray otel preview` (dry-run printing the next export's content), `ccxray parser report` (recent unknown events for drift detection). +- Out of scope (Phase 2 follow-up): span emit (traces), `/entry/:id` deep-link route, `ccxray.entry_id` / `dashboard_url` attributes. + +## Capabilities + +### New Capabilities + +- `otel-config`: `.ccxray.json` and `.ccxray.user.json` schema, `${ENV_VAR}` interpolation, literal-secret rejection, `.gitignore` auto-amend, project-upper-bound + personal-lower-bound merging rules. +- `otel-export`: OTel SDK initialization (client-side, not hub), metric definitions under `ccxray.*` namespace, `ccxray.source` resource attribute, cardinality budget enforcement with `_overflow_` fallback, CLI coexistence detection and complement-mode signaling, reconciliation diff metric. +- `otel-tiers`: three-tier opt-in (disabled / project-anonymous / personal-named), tier resolution with project as upper bound and personal as lower bound, `enduser.id` attribute only in tier 2, opt-in acknowledgment timestamp persisted in personal config. +- `otel-health`: failure state machine (`disabled / active / degraded / circuit_open`), bounded export queue with drop-oldest semantics, circuit breaker with exponential backoff, local failure log at `~/.ccxray/otel.log` with rotation, never-block guarantee for the proxy path. 
+- `parser-schemas`: extract skill / MCP / tool / agent-type detection into versioned JSON schemas, snapshot fixtures per provider (Anthropic + Codex), sentinel metrics for unknown events, reconciliation invariants run per entry, try/catch isolation so parser failure does not affect ccxray core. +- `otel-introspection`: `ccxray status --otel` view (tier, endpoint, health, cardinality, dropped counts), `ccxray otel preview` dry-run, `ccxray parser report` for drift inspection, startup banner declaring active tier and CLI coexistence mode. + +### Modified Capabilities + +(None — Phase 1 is additive. Existing capabilities are not changed.) + +## Impact + +- New `server/otel.js`, `server/otel-health.js`, `server/config-loader.js`, `server/parsers/` directory tree (schemas + fixtures + unknown-handler). +- `server/forward.js` — emit metric on request completion (counters + histograms) via the otel-health-guarded queue; no behavior change when OTel is disabled. +- `server/store.js` — session / tool / skill / MCP / agent_type detection becomes a thin shim over `server/parsers/*`; reconciliation invariants run per entry; sentinel counters incremented on unknown. +- `server/system-prompt.js` — agent-type and skill marker detection moves into `parsers/anthropic-skills.schema.json`; existing parsing behavior preserved. +- `server/hub.js` — hub gains optional `ccxray.hub.*` operational metrics (uptime, request rate, connected clients) under its own config in `~/.ccxray/hub-config.json`. Hub does NOT emit business metrics; those stay client-side. +- `server/routes/api.js` — no new HTTP routes in Phase 1 (deep-link route is Phase 2). +- `bin/ccxray.js` or equivalent CLI entry — new subcommands: `status --otel`, `otel preview`, `parser report`. Existing commands unaffected when OTel is disabled. +- `package.json` — add minimal OTel dependencies (`@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources`). 
No auto-instrumentations. Optional dependency pattern so the package still works if OTel is not installed. +- New docs: `docs/otel-integration.html` (already exists, decision record), `docs/otel-ethics.md` (why these metrics are not for individual performance evaluation), `docs/otel-quickstart.md` (90-second Grafana onboarding). +- Tests: parser snapshot fixtures, cardinality budget enforcement tests, tier resolution matrix tests, failure-mode tests (collector down, bad endpoint, bad auth, malformed config). diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md new file mode 100644 index 0000000..b30283a --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md @@ -0,0 +1,90 @@ +## ADDED Requirements + +### Requirement: Project and personal config files + +ccxray SHALL read two optional configuration files at startup: `.ccxray.json` (project-level, repo-checked-in) and `.ccxray.user.json` (personal-level, gitignored). Both files use JSON. Missing files SHALL be treated as tier 0 (disabled). 
+#### Scenario: No config present + +- **WHEN** ccxray starts in a directory with neither `.ccxray.json` nor `.ccxray.user.json` +- **THEN** OTel SDK SHALL NOT initialize and no network egress SHALL occur + +#### Scenario: Project config present, no personal config + +- **WHEN** ccxray starts in a directory with `.ccxray.json` that enables tier 1 +- **THEN** OTel SDK SHALL initialize at tier 1 with project-level attributes only + +#### Scenario: Both project and personal config present + +- **WHEN** project config sets tier 2 and personal config sets tier 2 with `enduser.id` +- **THEN** the effective tier SHALL be tier 2 and `enduser.id` SHALL be attached to emitted metrics + +### Requirement: Tier resolution as upper bound and lower bound + +The effective tier SHALL be `min(project_tier, personal_tier)` so that the project config is an upper bound and personal config can only equal-or-downgrade. An engineer SHALL be able to unilaterally opt out by setting tier 0 in personal config. + +#### Scenario: Personal config downgrades from project + +- **WHEN** project config enables tier 1 and personal config explicitly sets tier 0 +- **THEN** no OTel emission SHALL occur for this engineer + +#### Scenario: Personal config cannot exceed project + +- **WHEN** project config enables tier 1 and personal config sets tier 2 +- **THEN** the effective tier SHALL be tier 2 only if the project explicitly authorizes tier 2; otherwise tier resolution SHALL clamp to tier 1 and emit a warning + +### Requirement: Environment variable interpolation + +All string values in config files SHALL support `${VAR}` interpolation, resolved at load time from `process.env`. Unresolved variables SHALL cause startup failure with a clear error message naming the missing variable.
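A minimal sketch of the interpolation rule above, assuming the config has already been parsed into a plain object. The helper name `interpolateEnv` and its `path` argument are hypothetical and not part of ccxray; the sketch only illustrates recursive `${VAR}` substitution that fails fast on a missing variable.

```javascript
// Hypothetical helper (not ccxray's actual API): resolves ${VAR} placeholders
// in every string value of a parsed config object.
function interpolateEnv(value, env = process.env, path = '$') {
  if (typeof value === 'string') {
    return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_, name) => {
      if (env[name] === undefined) {
        // Fail fast and name the missing variable, as the requirement asks.
        throw new Error(`Unresolved variable ${name} at ${path}`);
      }
      return env[name];
    });
  }
  if (Array.isArray(value)) {
    return value.map((item, i) => interpolateEnv(item, env, `${path}[${i}]`));
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, child]) => [
        key,
        interpolateEnv(child, env, `${path}.${key}`),
      ])
    );
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

Throwing rather than calling `process.exit` keeps the sketch testable; a caller would be expected to turn the error into a non-zero exit with the file path attached.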
+ +#### Scenario: Header value uses env var + +- **WHEN** config contains `"Authorization": "Bearer ${OTLP_TOKEN}"` and `OTLP_TOKEN=abc123` is set in the environment +- **THEN** the loaded header value SHALL be `"Bearer abc123"` and the literal string SHALL NOT appear in any debug log line + +#### Scenario: Missing env var + +- **WHEN** config contains `"Authorization": "Bearer ${MISSING_VAR}"` and `MISSING_VAR` is not set +- **THEN** ccxray SHALL exit non-zero with an error message that includes the file path, line, and the variable name `MISSING_VAR` + +### Requirement: Literal-secret rejection + +The schema validator SHALL reject any string value that matches a literal-secret pattern (`Bearer [A-Za-z0-9]{20,}`, `sk_live_*`, `sk_test_*`, `ghp_*`, JWT three-segment structure) unless the value is wrapped in `${...}`. Pure URLs and hostnames SHALL be allowed. + +#### Scenario: Literal bearer token rejected + +- **WHEN** config contains `"Authorization": "Bearer abc123longtokenvalue..."` +- **THEN** ccxray SHALL exit at startup with an error suggesting the user switch to `${ENV_VAR}` interpolation + +#### Scenario: Interpolated bearer token accepted + +- **WHEN** config contains `"Authorization": "Bearer ${TOKEN}"` and `TOKEN` is set +- **THEN** ccxray SHALL load successfully and use the resolved value + +### Requirement: Gitignore auto-amend on first generation + +When ccxray writes a new `.ccxray.user.json` for the first time, it SHALL check whether the file is covered by the project's `.gitignore`. If not, ccxray SHALL prompt the user (or apply automatically when `--yes` is passed) to append `.ccxray.user.json` to `.gitignore`. 
+ +#### Scenario: Gitignore missing entry + +- **WHEN** ccxray creates `.ccxray.user.json` in a repo whose `.gitignore` does not list it +- **THEN** ccxray SHALL prompt for permission to append `.ccxray.user.json` and reflect the choice in the next run + +#### Scenario: Gitignore already covers the file + +- **WHEN** ccxray creates `.ccxray.user.json` and `.gitignore` already contains an entry matching the file +- **THEN** no prompt SHALL appear and the file SHALL be written silently + +### Requirement: Config error fails fast at startup + +Config syntax errors, schema violations, unresolved `${VAR}` references, and literal-secret matches SHALL cause ccxray to exit non-zero at startup with an actionable error message. ccxray SHALL NOT silently continue with a partial config. + +#### Scenario: Invalid JSON + +- **WHEN** `.ccxray.json` contains malformed JSON +- **THEN** ccxray SHALL print a parse error citing the file path and the offending line/column, and SHALL exit non-zero + +#### Scenario: Schema violation + +- **WHEN** `.ccxray.json` sets `otel.tier` to an unknown value +- **THEN** ccxray SHALL print a schema error naming the field and listing valid values, and SHALL exit non-zero diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md new file mode 100644 index 0000000..5012246 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md @@ -0,0 +1,121 @@ +## ADDED Requirements + +### Requirement: Client-side OTel SDK initialization + +OTel SDK initialization SHALL occur in the client process (the one running `ccxray claude` or similar) and SHALL NOT occur in the hub process. The hub SHALL remain a pure HTTP proxy and SSE broadcaster. 
+#### Scenario: Client initializes OTel + +- **WHEN** a ccxray client process starts with tier ≥ 1 +- **THEN** the OTel SDK SHALL initialize within the client process and emit metrics tagged with that client's resource attributes + +#### Scenario: Hub does not emit business metrics + +- **WHEN** the ccxray hub forwards an HTTP request between a client and an upstream provider +- **THEN** the hub SHALL NOT emit any business metric on behalf of the client, regardless of the client's tier setting + +### Requirement: `ccxray.*` namespace for all emitted metrics + +Every metric SHALL be named under the `ccxray.*` namespace. No metric name SHALL collide with a Claude Code CLI metric or with any other upstream OTel naming convention. + +#### Scenario: Metric naming + +- **WHEN** an OTel metric is registered +- **THEN** its name SHALL start with the literal prefix `ccxray.` + +#### Scenario: Namespace collision prevention + +- **WHEN** code attempts to register a metric whose name matches a `claude_code.*` pattern +- **THEN** registration SHALL fail and tests SHALL flag it + +### Requirement: Source resource attribute on every emit + +Every metric SHALL carry the resource attribute `ccxray.source="ccxray-proxy"` so that backends can filter ccxray-emitted data from data emitted by other OTel sources running on the same host. + +#### Scenario: Source attribute present + +- **WHEN** any metric is exported by ccxray +- **THEN** its resource attributes SHALL include `ccxray.source="ccxray-proxy"` + +### Requirement: Cardinality budget enforcement + +Each metric SHALL declare its allowed attribute keys and a numeric cardinality budget per key. Attribute keys not in the allow-list SHALL be dropped via OTel View API. When the count of unique values for an allow-listed key reaches its budget, subsequent unique values SHALL be replaced with the literal string `_overflow_` and the sentinel counter `ccxray.metrics.overflow_total{metric,attribute}` SHALL increment.
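The budget rule above can be sketched as a small guard that runs before a value is recorded. This is an illustration under assumptions: the class name `CardinalityGuard` and its in-memory maps are hypothetical, and the allow-list filtering is done in plain JavaScript here as a stand-in for the OTel View API that the requirement names.

```javascript
// Hypothetical guard (not ccxray's actual code): tracks unique attribute
// values per (metric, attribute) and substitutes "_overflow_" past the budget.
class CardinalityGuard {
  constructor(budgets) {
    // budgets example: { 'ccxray.tool.invocations_total': { tool: 50 } }
    this.budgets = budgets;
    this.seen = new Map();          // "metric|attribute" -> Set of admitted values
    this.overflowTotal = new Map(); // stand-in for ccxray.metrics.overflow_total
  }

  sanitize(metric, attributes) {
    const allowed = this.budgets[metric] || {};
    const result = {};
    for (const [key, value] of Object.entries(attributes)) {
      // Keys outside the allow-list are dropped entirely.
      if (!Object.prototype.hasOwnProperty.call(allowed, key)) continue;
      const id = `${metric}|${key}`;
      if (!this.seen.has(id)) this.seen.set(id, new Set());
      const values = this.seen.get(id);
      if (values.has(value) || values.size < allowed[key]) {
        values.add(value); // already admitted, or still within budget
        result[key] = value;
      } else {
        result[key] = '_overflow_';
        this.overflowTotal.set(id, (this.overflowTotal.get(id) || 0) + 1);
      }
    }
    return result;
  }
}
```

Values admitted before the budget filled keep their own label afterwards; only new unique values are folded into `_overflow_`, which keeps existing dashboards stable once the budget is exhausted.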
+ +#### Scenario: Allowed attribute within budget + +- **WHEN** `ccxray.tool.invocations_total` receives an attribute `tool="Read"` and `Read` is the 3rd of 50 budgeted tool names +- **THEN** the metric SHALL emit with `tool="Read"` and `ccxray.metrics.overflow_total` SHALL NOT increment + +#### Scenario: Budget exhausted + +- **WHEN** the cardinality budget for `tool` is 50 and a 51st unique tool name arrives +- **THEN** the metric SHALL emit with `tool="_overflow_"` and `ccxray.metrics.overflow_total{metric="ccxray.tool.invocations_total",attribute="tool"}` SHALL increment by 1 + +#### Scenario: Unallowed attribute key + +- **WHEN** code attempts to record `ccxray.tool.invocations_total` with attribute `bash_command="rm -rf /tmp/foo"` while `bash_command` is not in the allow-list +- **THEN** the `bash_command` attribute SHALL be dropped before emission + +### Requirement: CLI OTel coexistence and complement mode + +ccxray SHALL detect the presence of `CLAUDE_CODE_ENABLE_TELEMETRY=1` in the environment and, when detected, SHALL emit all metrics with an additional attribute `ccxray.cli_otel_active=true`. ccxray SHALL print a startup notice explaining how to choose between ccxray and CLI metrics when both are active. ccxray SHALL NOT disable any of its own metrics based on CLI coexistence. 
+ +#### Scenario: CLI OTel detected + +- **WHEN** ccxray starts with `CLAUDE_CODE_ENABLE_TELEMETRY=1` set +- **THEN** ccxray SHALL print a startup notice indicating complement mode and SHALL add `ccxray.cli_otel_active=true` to all emitted metrics + +#### Scenario: CLI OTel not detected + +- **WHEN** ccxray starts without `CLAUDE_CODE_ENABLE_TELEMETRY` +- **THEN** ccxray SHALL print a notice indicating standalone mode and the attribute `ccxray.cli_otel_active` SHALL NOT be set + +### Requirement: Reconciliation diff metric + +ccxray SHALL emit `ccxray.reconciliation.token_diff_pct{model}` as a gauge that compares ccxray's HTTP-observed token counts against the corresponding values reported by the CLI when both are active. The metric SHALL be emitted only when both ccxray and CLI OTel signals are available. + +#### Scenario: Both active with matching counts + +- **WHEN** ccxray and CLI emit the same token count for the same request +- **THEN** `ccxray.reconciliation.token_diff_pct` SHALL be 0 for the affected model + +#### Scenario: Mismatch detected + +- **WHEN** ccxray observes input_tokens=1000 and CLI reports input_tokens=1050 for the same request +- **THEN** the metric SHALL emit approximately 5.0 (percent difference) + +### Requirement: Required metric families + +ccxray SHALL emit the following metric families when OTel is enabled: + +- **Cost**: `ccxray.tokens.input_total`, `ccxray.tokens.output_total`, `ccxray.tokens.cache_read_total`, `ccxray.tokens.cache_creation_total`, `ccxray.cost.usd_total`, `ccxray.cache.hit_ratio` (gauge). +- **Usage**: `ccxray.tool.invocations_total{tool,provider}`, `ccxray.mcp.invocations_total{server,tool}`, `ccxray.skill.activations_total{skill,provider}`, `ccxray.sessions_total{provider}`, `ccxray.agent_type.invocations_total{type}`. +- **Quality**: `ccxray.errors_total{type,provider}`, `ccxray.stop_reason_total{reason}`, `ccxray.latency_ms` (histogram, attributes: `model`,`provider`), `ccxray.max_tokens_hit_total{model}`. 
+- **Patterns**: `ccxray.context.utilization_pct` (histogram), `ccxray.auto_compact.triggered_total`, `ccxray.subagent.invocations_total`, `ccxray.tools_per_turn` (histogram). +- **Governance**: `ccxray.permission_mode.usage_total{mode}`, `ccxray.dangerous_tool.invocations_total{pattern}`, `ccxray.file_writes_total`, `ccxray.provider.distribution_total{provider}`. + +Each metric SHALL be registered with its allow-list of attribute keys and cardinality budget at SDK initialization. + +#### Scenario: Cost metric emission after a turn + +- **WHEN** ccxray completes forwarding a request and receives a usage block from the upstream provider +- **THEN** `ccxray.tokens.input_total`, `ccxray.tokens.output_total`, and `ccxray.cost.usd_total` SHALL each increment by the corresponding value + +#### Scenario: Tool invocation metric + +- **WHEN** ccxray detects a `tool_use` block named `Bash` in a response +- **THEN** `ccxray.tool.invocations_total` SHALL increment by 1 with attribute `tool="Bash"` + +### Requirement: Minimal optional dependencies + +The OTel-related Node.js dependencies SHALL be limited to `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, and `@opentelemetry/resources`. Auto-instrumentation packages SHALL NOT be included. Dependencies SHALL be resolved lazily so that ccxray remains functional even when OTel packages are absent (tier 0 only). 
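One way to read the lazy-resolution rule above is sketched below. The package list is taken from the requirement; the function name `loadOtel`, the injectable `requireFn` parameter, and the error wording are assumptions made for illustration.

```javascript
// Package names come from the requirement; everything else is hypothetical.
const OTEL_PACKAGES = [
  '@opentelemetry/api',
  '@opentelemetry/sdk-metrics',
  '@opentelemetry/exporter-metrics-otlp-http',
  '@opentelemetry/resources',
];

function loadOtel(effectiveTier, requireFn = require) {
  // Tier 0: return before any OTel package is referenced.
  if (effectiveTier === 0) return null;

  const loaded = {};
  const missing = [];
  for (const name of OTEL_PACKAGES) {
    try {
      loaded[name] = requireFn(name);
    } catch (err) {
      // Only "not installed" is treated as missing; other errors propagate.
      if (err && err.code === 'MODULE_NOT_FOUND') missing.push(name);
      else throw err;
    }
  }
  if (missing.length > 0) {
    throw new Error(
      `OTel tier ${effectiveTier} requires missing packages: ${missing.join(', ')}`
    );
  }
  return loaded;
}
```

The sketch throws and leaves the non-zero exit to the caller. Collecting every missing package before failing lets the error message list the full install set in one pass.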
+ +#### Scenario: OTel packages absent and tier 0 + +- **WHEN** OTel packages are not installed and effective tier is 0 +- **THEN** ccxray SHALL start normally without referencing any OTel package + +#### Scenario: OTel packages absent and tier ≥ 1 + +- **WHEN** OTel packages are not installed and effective tier is ≥ 1 +- **THEN** ccxray SHALL emit a clear error explaining which packages to install and SHALL exit non-zero diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-health/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-health/spec.md new file mode 100644 index 0000000..42ad9c2 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-health/spec.md @@ -0,0 +1,99 @@ +## ADDED Requirements + +### Requirement: Four-state OTel health machine + +ccxray SHALL maintain an OTel health state machine with exactly four states: `disabled`, `active`, `degraded`, and `circuit_open`. Transitions SHALL be driven exclusively by the conditions described in the subsequent requirements; no other code path SHALL mutate state. + +#### Scenario: Disabled at startup + +- **WHEN** effective tier is 0 or OTel packages are absent +- **THEN** the state SHALL be `disabled` and `ccxray.otel.state` SHALL emit only its disabled gauge (where possible) and otherwise stay silent + +#### Scenario: Active after successful init + +- **WHEN** effective tier is ≥ 1 and SDK initialization completes +- **THEN** the state SHALL be `active` + +### Requirement: Bounded export queue with drop-oldest semantics + +The OTel export queue SHALL be bounded by a configurable size (default 2048 entries). When the queue is full and a new export is attempted, the oldest queued entry SHALL be dropped to make room. Each drop SHALL increment `ccxray.otel.exports_dropped_total{signal}`. 
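The drop-oldest behaviour above can be sketched as follows. The class name `BoundedExportQueue`, the `{ signal, payload }` entry shape, and the choice to label a drop with the signal of the evicted entry are assumptions; the spec fixes only the default size of 2048 and the counter name.

```javascript
// Hypothetical queue (not ccxray's actual code): bounded, drops the oldest
// entry when full, and counts drops per signal.
class BoundedExportQueue {
  constructor(maxSize = 2048) {
    this.maxSize = maxSize;
    this.items = [];
    this.dropped = new Map(); // stand-in for ccxray.otel.exports_dropped_total{signal}
  }

  // entry is assumed to look like { signal: 'metrics', payload: ... }
  enqueue(entry) {
    if (this.items.length >= this.maxSize) {
      const oldest = this.items.shift(); // evict the oldest entry to make room
      this.dropped.set(oldest.signal, (this.dropped.get(oldest.signal) || 0) + 1);
    }
    this.items.push(entry);
  }

  // Hands the current batch to the exporter and empties the queue.
  drain() {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}
```

`enqueue` never awaits anything, so a caller on the proxy path pays only for an array push. `Array.prototype.shift` is linear in queue length, which is acceptable at 2048 entries; a ring buffer would be the usual replacement for much larger bounds.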
+#### Scenario: Queue under limit + +- **WHEN** the queue holds fewer than its configured maximum entries and a new export arrives +- **THEN** the new entry SHALL be appended and no drop SHALL occur + +#### Scenario: Queue at limit + +- **WHEN** the queue is at its configured maximum and a new export arrives +- **THEN** the oldest entry SHALL be removed, the new entry SHALL be appended, and `ccxray.otel.exports_dropped_total{signal}` SHALL increment by 1 + +### Requirement: Circuit breaker with exponential backoff + +After 5 consecutive export failures, the state SHALL transition to `circuit_open` and exports SHALL be paused. After an initial cooldown of 60 seconds, the state SHALL transition to `half_open` and a single export SHALL be attempted. Success SHALL return the state to `active`. Failure SHALL keep the state at `circuit_open` and the cooldown SHALL double up to a maximum of 600 seconds. + +#### Scenario: Trip on 5 consecutive failures + +- **WHEN** 5 consecutive export attempts return errors +- **THEN** the state SHALL transition to `circuit_open` and no further exports SHALL be attempted until the cooldown elapses + +#### Scenario: Half-open success returns to active + +- **WHEN** the cooldown elapses, the state moves to `half_open`, and the trial export succeeds +- **THEN** the state SHALL transition back to `active` and the cooldown SHALL reset to 60 seconds + +#### Scenario: Half-open failure increases cooldown + +- **WHEN** the trial export in `half_open` fails +- **THEN** the state SHALL remain `circuit_open` and the next cooldown SHALL be `min(previous_cooldown * 2, 600)` seconds + +### Requirement: Failure log on local disk + +Failed export attempts and state transitions SHALL be written to `~/.ccxray/otel.log` in append mode. The file SHALL be rotated once it exceeds a configurable size (default 1 MB). Rotated files SHALL be retained up to a configurable count (default 5).
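The rotation rule above can be sketched with synchronous `fs` calls. The function name `rotateIfNeeded` and its parameters are hypothetical; only the defaults (1 MB, 5 retained files) and the `otel.log.N` naming come from the spec.

```javascript
const fs = require('fs');

// Hypothetical helper (not ccxray's actual code). Returns true if a rotation
// happened. logPath would be something like ~/.ccxray/otel.log, already expanded.
function rotateIfNeeded(logPath, maxBytes = 1024 * 1024, keep = 5) {
  if (!fs.existsSync(logPath) || fs.statSync(logPath).size <= maxBytes) {
    return false; // nothing to rotate yet
  }
  // Remove the rotation that would fall beyond the retention count.
  fs.rmSync(`${logPath}.${keep}`, { force: true });
  // Shift otel.log.N to otel.log.N+1, highest index first so nothing is overwritten.
  for (let i = keep - 1; i >= 1; i--) {
    if (fs.existsSync(`${logPath}.${i}`)) {
      fs.renameSync(`${logPath}.${i}`, `${logPath}.${i + 1}`);
    }
  }
  fs.renameSync(logPath, `${logPath}.1`);
  fs.writeFileSync(logPath, ''); // start a fresh, empty log
  return true;
}
```

Synchronous calls keep the sketch short. Since the log is written off the proxy path, a real implementation could use the same steps with `fs.promises` to avoid blocking the event loop.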
+ +#### Scenario: Export error recorded + +- **WHEN** an export attempt fails with a network error +- **THEN** a single line SHALL be appended to `~/.ccxray/otel.log` containing the timestamp, the error class, and the queue depth at time of failure + +#### Scenario: File rotated at size limit + +- **WHEN** `~/.ccxray/otel.log` exceeds 1 MB +- **THEN** it SHALL be renamed to `otel.log.1` (with existing rotations shifted), a fresh `otel.log` SHALL be created, and files beyond the retention count SHALL be deleted + +### Requirement: Never-block guarantee for the proxy + +OTel export operations SHALL NOT block the HTTP proxy path. All emit operations SHALL enqueue without awaiting export completion. SDK shutdown during process exit SHALL be capped at 2 seconds and SHALL NOT prevent clean exit on timeout. + +#### Scenario: Collector unreachable + +- **WHEN** the OTLP endpoint is unreachable for the duration of a proxy request +- **THEN** the proxy SHALL forward the request and return the response with no additional latency from OTel + +#### Scenario: SDK shutdown timeout + +- **WHEN** the process is exiting and OTel SDK flush is in progress +- **THEN** the shutdown SHALL be aborted after 2 seconds and the process SHALL exit cleanly + +### Requirement: Config errors fail fast, init/runtime errors degrade + +Config parsing or schema errors SHALL cause non-zero process exit at startup with an actionable message. SDK initialization errors (e.g. invalid endpoint URL format) SHALL transition the state to `degraded` and SHALL NOT block ccxray startup. Runtime export errors SHALL be handled by the circuit breaker without affecting other ccxray behavior. 
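The circuit breaker that handles runtime export errors (trip after 5 consecutive failures, 60-second initial cooldown, doubling up to 600 seconds) can be sketched as a small state holder. The class name `ExportCircuitBreaker`, its method names, and the explicit `now` argument in milliseconds are assumptions made so the example is deterministic.

```javascript
// Hypothetical breaker (not ccxray's actual code). Times are in milliseconds.
class ExportCircuitBreaker {
  constructor({ threshold = 5, initialCooldownMs = 60000, maxCooldownMs = 600000 } = {}) {
    this.threshold = threshold;
    this.initialCooldownMs = initialCooldownMs;
    this.maxCooldownMs = maxCooldownMs;
    this.state = 'active';
    this.failures = 0;            // consecutive failures while active
    this.cooldownMs = initialCooldownMs;
    this.openedAt = 0;
  }

  // Returns true if an export may be attempted at time `now`.
  canAttempt(now) {
    if (this.state === 'active') return true;
    if (this.state === 'circuit_open' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open';   // cooldown elapsed: allow exactly one trial
      return true;
    }
    return false;                 // still cooling down, or a trial is in flight
  }

  recordSuccess() {
    this.state = 'active';
    this.failures = 0;
    this.cooldownMs = this.initialCooldownMs;
  }

  recordFailure(now) {
    if (this.state === 'half_open') {
      // Trial failed: reopen with a doubled, capped cooldown.
      this.cooldownMs = Math.min(this.cooldownMs * 2, this.maxCooldownMs);
      this.state = 'circuit_open';
      this.openedAt = now;
      return;
    }
    if (this.state !== 'active') return;
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.state = 'circuit_open';
      this.openedAt = now;
    }
  }
}
```

Passing `now` in, rather than reading `Date.now()` inside the class, makes the cooldown arithmetic easy to unit-test without fake timers.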
+ +#### Scenario: Bad endpoint URL + +- **WHEN** `.ccxray.json` sets `otel.endpoint` to a string that is not a valid URL +- **THEN** ccxray SHALL continue to start, the state SHALL be `degraded`, the dashboard and proxy SHALL function normally, and `ccxray status --otel` SHALL display the error + +#### Scenario: Missing required field + +- **WHEN** `.ccxray.json` enables tier 1 but omits `otel.endpoint` +- **THEN** ccxray SHALL exit non-zero at startup with an error pointing to the missing field + +### Requirement: Health state observable via metric and status command + +The current health state SHALL be observable through (a) a gauge `ccxray.otel.state{state}` (where possible — emitted only when state is `active` or `degraded`), and (b) the `ccxray status --otel` output regardless of state. + +#### Scenario: State visible in status command + +- **WHEN** an engineer runs `ccxray status --otel` +- **THEN** the output SHALL include the current state, the last 3 state transitions with timestamps, and the current circuit breaker cooldown remaining (if applicable) diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-introspection/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-introspection/spec.md new file mode 100644 index 0000000..53f1589 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-introspection/spec.md @@ -0,0 +1,66 @@ +## ADDED Requirements + +### Requirement: `ccxray status --otel` shows effective configuration and health + +The `ccxray status --otel` command SHALL print: + +- The current effective tier (0/1/2) and which config files contributed. +- The endpoint URL with any `${VAR}` masked. +- The OTel health state (`disabled / active / degraded / circuit_open`) and last 3 state transitions with timestamps. +- The circuit breaker cooldown remaining (when applicable). +- Per-metric cardinality usage in `current / budget` format (e.g. `tool: 23/50`). 
+- Total counts: exports succeeded, exports failed, exports dropped (last hour and last 24 hours). +- The `opt_in_acknowledged_at` timestamp for tier 2 (when applicable). +- CLI coexistence indicator: whether `CLAUDE_CODE_ENABLE_TELEMETRY` is detected. + +#### Scenario: Status at tier 1 + +- **WHEN** ccxray is running at tier 1 with a healthy collector +- **THEN** `ccxray status --otel` SHALL show `tier=1`, `state=active`, the endpoint, cardinality usage rows for each registered metric, and the export success/failure counts + +#### Scenario: Status at tier 0 + +- **WHEN** ccxray is running at tier 0 +- **THEN** `ccxray status --otel` SHALL show `tier=0`, `state=disabled`, and SHALL NOT attempt to read OTel runtime state + +### Requirement: `ccxray otel preview` dry-run + +The `ccxray otel preview` command SHALL print the exact JSON body that would be sent to the OTel collector on the next export, including all attribute values and resource attributes, WITHOUT sending any network request. Secrets resolved from `${ENV_VAR}` SHALL be masked in the output. + +#### Scenario: Preview before enabling + +- **WHEN** an engineer runs `ccxray otel preview` after setting up `.ccxray.json` +- **THEN** the command SHALL print a single JSON object representing the next export, with `Authorization` and similar header values shown as `Bearer ***` rather than the resolved token + +#### Scenario: Preview with no recent metrics + +- **WHEN** ccxray has no queued metrics to export +- **THEN** the command SHALL print a notice that no metrics are pending and SHALL exit zero + +### Requirement: Startup banner declares active tier and mode + +When ccxray starts at tier ≥ 1, it SHALL print a one-line banner to stderr summarizing: tier value, endpoint (without secret), and complement-mode status (if CLI OTel is active). The banner SHALL NOT print when tier is 0. 
+ +#### Scenario: Banner at tier 1 standalone + +- **WHEN** ccxray starts at tier 1 without CLI OTel +- **THEN** stderr SHALL contain a single line matching the pattern `ccxray OTel tier: 1 (anonymous) → ` followed by no further banner output for that launch + +#### Scenario: Banner at tier 1 complement + +- **WHEN** ccxray starts at tier 1 with `CLAUDE_CODE_ENABLE_TELEMETRY=1` +- **THEN** stderr SHALL contain a line indicating `tier: 1` and `complement-mode: true` + +#### Scenario: No banner at tier 0 + +- **WHEN** ccxray starts at tier 0 +- **THEN** stderr SHALL NOT contain any OTel-related banner line + +### Requirement: Secrets masking in all introspection output + +`ccxray status --otel` and `ccxray otel preview` SHALL mask any value resolved from a `${VAR}` interpolation. Masked values SHALL display as the prefix (up to 4 characters) followed by `***`. The full unmasked value SHALL never be printed by any introspection command. + +#### Scenario: Auth header masked + +- **WHEN** the resolved auth header is `Bearer abc123longtokenvalue` +- **THEN** introspection output SHALL display `Bearer abc1***` and SHALL NOT print the remainder of the token diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-tiers/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-tiers/spec.md new file mode 100644 index 0000000..e10c2c9 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-tiers/spec.md @@ -0,0 +1,79 @@ +## ADDED Requirements + +### Requirement: Three discrete tier values + +ccxray SHALL support exactly three tier values for OTel export: + +- **0 — disabled**: No SDK initialization, no network egress. +- **1 — project anonymous**: Emit with project-level resource attributes (`project.name`, optional `team`) but no individual identity. +- **2 — personal named**: Emit with `enduser.id` attached (a self-chosen string set by the engineer). 
+ +#### Scenario: Tier 0 produces no egress + +- **WHEN** the effective tier resolves to 0 +- **THEN** no OTel package SHALL be loaded and no network connection SHALL be opened for telemetry + +#### Scenario: Tier 1 omits identity + +- **WHEN** the effective tier resolves to 1 and a request completes +- **THEN** emitted metrics SHALL include `project.name` (if configured) but SHALL NOT include any `enduser.id` attribute + +#### Scenario: Tier 2 includes identity + +- **WHEN** the effective tier resolves to 2 and personal config provides `identity: "alice"` +- **THEN** emitted metrics SHALL include `enduser.id="alice"` as a resource attribute + +### Requirement: Tier resolution rule + +The effective tier SHALL be `min(project_tier, personal_tier)`. If either side is absent, the present side SHALL be used. The minimum SHALL clamp downward; personal config SHALL NOT exceed project config. + +#### Scenario: Personal lower than project + +- **WHEN** project tier is 1 and personal tier is 0 +- **THEN** the effective tier SHALL be 0 + +#### Scenario: Project lower than personal + +- **WHEN** project tier is 1 and personal tier is 2 without project authorization for tier 2 +- **THEN** the effective tier SHALL be 1 and ccxray SHALL emit a warning that personal tier is clamped + +#### Scenario: Equal tiers + +- **WHEN** project tier is 1 and personal tier is 1 +- **THEN** the effective tier SHALL be 1 + +### Requirement: Engineer unilateral opt-out + +Any engineer SHALL be able to opt out of OTel emission for their own machine by setting `tier: 0` in `.ccxray.user.json`, regardless of the project config. This opt-out SHALL take effect on the next ccxray launch. 
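The tier resolution rule and the unilateral opt-out above reduce to one small function. The sketch below is an illustration, not ccxray's implementation: the name `resolveTier` and the `{ tier, clamped }` return shape are hypothetical, and "absent" is assumed to mean `undefined` or `null`.

```javascript
// Hypothetical resolver (not ccxray's actual code).
// Returns the effective tier and whether the personal tier was clamped.
function resolveTier(projectTier, personalTier) {
  const hasProject = projectTier !== undefined && projectTier !== null;
  const hasPersonal = personalTier !== undefined && personalTier !== null;

  // Neither file present: tier 0 (disabled).
  if (!hasProject && !hasPersonal) return { tier: 0, clamped: false };
  // One side absent: the present side is used, as the rule states.
  if (!hasProject) return { tier: personalTier, clamped: false };
  if (!hasPersonal) return { tier: projectTier, clamped: false };

  // Both present: project is the upper bound, personal can only lower it.
  return {
    tier: Math.min(projectTier, personalTier),
    clamped: personalTier > projectTier, // caller should emit the clamp warning
  };
}
```

Because the result is a minimum, a personal tier of 0 always wins, which is exactly the opt-out guarantee; the `clamped` flag exists only so the caller can print the warning described in the "Project lower than personal" scenario.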
+ +#### Scenario: Opt-out overrides project tier + +- **WHEN** project config sets tier 2 and personal config sets tier 0 +- **THEN** the engineer's ccxray client SHALL emit no telemetry until personal config is changed + +### Requirement: Personal config gitignore enforcement + +The personal config file `.ccxray.user.json` SHALL be excluded from version control. ccxray SHALL refuse to load personal-tier identity from a file that is currently tracked by git and SHALL emit a warning explaining the risk. + +#### Scenario: Personal config tracked by git + +- **WHEN** `.ccxray.user.json` exists in the repo and is tracked by git +- **THEN** ccxray SHALL print a warning recommending `git rm --cached` and SHALL refuse to apply the personal identity until the file is untracked or moved to `$HOME` + +### Requirement: Opt-in acknowledgment timestamp + +When personal config sets tier 2 for the first time, the file SHALL record an `opt_in_acknowledged_at` ISO 8601 timestamp. This timestamp SHALL be displayed in `ccxray status --otel` so the engineer can confirm when they last opted in. + +#### Scenario: First-time tier 2 opt-in + +- **WHEN** a user creates `.ccxray.user.json` with tier 2 for the first time +- **THEN** ccxray SHALL write the current time into the file as `opt_in_acknowledged_at` and SHALL include it in subsequent `status --otel` output + +### Requirement: Tier distribution sentinel + +ccxray SHALL emit `ccxray.otel.tier_distribution{tier}` as a counter incremented once per process launch that initializes OTel, labeled with the effective tier value. This metric is meant to inform documentation strengthening decisions (e.g. low tier 2 share suggests trust concerns). 
+ +#### Scenario: Counter increments on launch + +- **WHEN** ccxray client process initializes at tier 1 +- **THEN** `ccxray.otel.tier_distribution{tier="1"}` SHALL increment by 1 diff --git a/openspec/changes/add-otel-metrics-phase1/specs/parser-schemas/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/parser-schemas/spec.md new file mode 100644 index 0000000..36e6b23 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/parser-schemas/spec.md @@ -0,0 +1,83 @@ +## ADDED Requirements + +### Requirement: Versioned parser schemas per concern and provider + +Detection logic for tool / MCP / skill / agent-type SHALL be expressed as JSON schemas under `server/parsers/`. There SHALL be at minimum one schema per (concern, provider) pair: + +- `parsers/anthropic-tools.schema.json` +- `parsers/anthropic-skills.schema.json` +- `parsers/anthropic-agent-types.schema.json` +- `parsers/mcp-tools.schema.json` (provider-agnostic MCP naming convention) +- `parsers/codex-tools.schema.json` + +Each schema SHALL include a `version` field (semver) and a `last_verified_against` field (ISO 8601 date). Inline string matching in `server/system-prompt.js`, `server/store.js`, or other code paths SHALL be removed in favor of the schema-driven parser. 
+ +#### Scenario: Schema referenced at runtime + +- **WHEN** ccxray processes an Anthropic response containing a `tool_use` block +- **THEN** the tool name SHALL be classified using `parsers/anthropic-tools.schema.json` and SHALL NOT be matched against any hardcoded list embedded in other files + +### Requirement: Snapshot fixtures per provider + +Test fixtures under `test/fixtures/parser/` SHALL cover at minimum the following cases per provider: + +- Basic tool invocation +- Tool invocation with a skill marker active +- Subagent invocation (Anthropic Task tool) +- MCP server tool invocation +- An intentional unknown tool name + +Each fixture SHALL pair an input (request or response JSON) with an expected parser output snapshot. Parser changes SHALL require committing new snapshots and SHALL pass review before merge. + +#### Scenario: Snapshot drift fails CI + +- **WHEN** parser code is changed in a way that alters fixture output +- **THEN** the test suite SHALL fail with a diff between old and new snapshot until the snapshot is updated and reviewed + +### Requirement: Sentinel counters for unknown tokens + +When the parser encounters a token, marker, or block that does not match any registered pattern in the relevant schema, it SHALL increment one of: + +- `ccxray.parser.unknown_tool_total{provider}` +- `ccxray.parser.unknown_skill_marker_total{provider}` +- `ccxray.parser.unknown_mcp_format_total` +- `ccxray.parser.fallback_used_total{parser,reason}` + +The unknown event SHALL also be recorded with a short sample to `~/.ccxray/parser-drift.log` for later inspection via `ccxray parser report`. 
+ +#### Scenario: Unknown tool name observed + +- **WHEN** ccxray sees a `tool_use` block whose `name` does not match any pattern in `parsers/anthropic-tools.schema.json` +- **THEN** `ccxray.parser.unknown_tool_total{provider="anthropic"}` SHALL increment by 1 and a sample SHALL be appended to `~/.ccxray/parser-drift.log` + +### Requirement: Reconciliation invariants + +For every processed entry the parser SHALL verify the following invariants: + +- Number of `tool_use` blocks in the response equals the number of tool entries extracted by the parser. +- Sum of input/output token counts attributed by the parser equals the corresponding values in the upstream usage block. + +When an invariant fails, `ccxray.parser.reconciliation_mismatch_total{type}` SHALL increment by 1 and the entry ID SHALL be appended to `~/.ccxray/parser-drift.log`. The mismatch SHALL NOT alter the entry's local log content. + +#### Scenario: Tool count mismatch + +- **WHEN** a response contains 3 `tool_use` blocks but the parser extracts only 2 tool entries +- **THEN** `ccxray.parser.reconciliation_mismatch_total{type="tool_count"}` SHALL increment and the entry ID SHALL be recorded in the drift log + +### Requirement: Parser error isolation + +Parser code SHALL be wrapped in try/catch boundaries. On exception, `ccxray.parser.error_total{parser,error_type}` SHALL increment and the originating entry SHALL still be written to local logs. The OTel span/metric for the affected entry SHALL be tagged `ccxray.parser.degraded=true`. Parser failure SHALL NOT propagate to the proxy path or terminate ccxray. 
+ +#### Scenario: Parser throws + +- **WHEN** the skill marker parser throws a runtime exception while processing a response +- **THEN** ccxray SHALL log the exception locally, increment `ccxray.parser.error_total{parser="anthropic-skills",error_type=""}`, write the entry to disk as usual, and continue forwarding subsequent requests + +### Requirement: `ccxray parser report` command + +The `ccxray parser report` command SHALL print the top unknown tokens by frequency from the last 7 days of `~/.ccxray/parser-drift.log`, grouped by category (tool / skill / MCP / fallback). The output SHALL include sample tokens and a GitHub issue body template the user can copy to file a drift report. + +#### Scenario: Reporting after seeing unknown markers + +- **WHEN** the engineer has accumulated unknown markers and runs `ccxray parser report` +- **THEN** the command SHALL print a categorized summary, the most recent 5 unique samples per category, and a formatted GitHub issue body diff --git a/openspec/changes/add-otel-metrics-phase1/tasks.md b/openspec/changes/add-otel-metrics-phase1/tasks.md new file mode 100644 index 0000000..ab15634 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/tasks.md @@ -0,0 +1,104 @@ +## 1. Dependencies and package wiring + +- [ ] 1.1 Add `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources` as `dependencies` in `package.json` (no auto-instrumentations) +- [ ] 1.2 Implement lazy require in a helper so ccxray still runs at tier 0 when OTel packages are absent +- [ ] 1.3 Update `package-lock.json` and confirm bundle size delta is within an acceptable bound + +## 2. 
Config loader (`server/config-loader.js`) + +- [ ] 2.1 Define JSON schema for `.ccxray.json` (project) and `.ccxray.user.json` (personal) covering: `otel.enabled`, `otel.tier`, `otel.endpoint`, `otel.headers`, `otel.resource_attributes`, `otel.cardinality_overrides` +- [ ] 2.2 Implement schema validation with line/column error reporting +- [ ] 2.3 Implement `${ENV_VAR}` interpolation across all string values; fail fast with named variable on unresolved +- [ ] 2.4 Implement literal-secret detector (Bearer/JWT/`sk_*`/`ghp_*`) that rejects values not wrapped in `${...}` +- [ ] 2.5 Implement project config lookup walking up from cwd to git root, taking the first `.ccxray.json` match +- [ ] 2.6 Implement personal config lookup: cwd first, then `$HOME` fallback +- [ ] 2.7 Implement tier resolution `effective = min(project_tier, personal_tier)` with downward clamp warning +- [ ] 2.8 Implement `.gitignore` check and auto-amend with `--yes` flag for `.ccxray.user.json` +- [ ] 2.9 Reject personal config that is currently tracked by git, with explanatory error +- [ ] 2.10 Persist `opt_in_acknowledged_at` ISO 8601 timestamp on first tier 2 enable +- [ ] 2.11 Unit tests covering all error paths, interpolation, secret rejection, tier resolution matrix + +## 3. 
OTel health module (`server/otel-health.js`) + +- [ ] 3.1 Implement state machine with four states: `disabled / active / degraded / circuit_open` and transitions only via documented APIs +- [ ] 3.2 Implement bounded export queue with drop-oldest semantics and `ccxray.otel.exports_dropped_total{signal}` increment per drop +- [ ] 3.3 Implement circuit breaker: 5 consecutive failures trips, 60s initial cooldown, half-open trial, exponential backoff to 600s max +- [ ] 3.4 Implement `~/.ccxray/otel.log` append writer with size-based rotation (default 1 MB, 5 file retention) +- [ ] 3.5 Implement SDK shutdown with 2-second hard cap to never block process exit +- [ ] 3.6 Surface state and metrics via a status reporter API consumed by the CLI status command +- [ ] 3.7 Unit tests with mock collector (200 / 500 / timeout) covering queue overflow, circuit transitions, half-open recovery, and exponential backoff + +## 4. OTel SDK initialization (`server/otel.js`) + +- [ ] 4.1 Implement SDK init for metrics only, with `ccxray.source="ccxray-proxy"` resource attribute +- [ ] 4.2 Define metric registry with allow-list of attribute keys and cardinality budgets per metric (View API) +- [ ] 4.3 Implement cardinality budget tracker with `_overflow_` fallback and `ccxray.metrics.overflow_total{metric,attribute}` sentinel +- [ ] 4.4 Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and apply `ccxray.cli_otel_active=true` attribute in complement mode +- [ ] 4.5 Register all metric families per `otel-export/spec.md`: cost, usage, quality, patterns, governance +- [ ] 4.6 Register sentinel metrics: overflow, parser unknowns, parser mismatches, otel state, reconciliation diff, tier distribution +- [ ] 4.7 Implement reconciliation diff gauge `ccxray.reconciliation.token_diff_pct{model}` computed against detected CLI telemetry +- [ ] 4.8 Implement export-time masking of any value resolved from `${ENV_VAR}` for log lines and trace dumps +- [ ] 4.9 Unit tests for namespace lint (no metric name starts with 
`claude_code.`), source attribute presence, budget enforcement, complement mode attribute, lazy SDK init at tier 0 + +## 5. Parser schema-ization (`server/parsers/`) + +- [ ] 5.1 Define the JSON schema format (fields: `version`, `last_verified_against`, `patterns`, `examples`) +- [ ] 5.2 Author `parsers/anthropic-tools.schema.json` covering current internal tool names +- [ ] 5.3 Author `parsers/anthropic-skills.schema.json` covering known skill marker formats from `system-prompt.js` +- [ ] 5.4 Author `parsers/anthropic-agent-types.schema.json` for general / explore / plan / known subagent types +- [ ] 5.5 Author `parsers/mcp-tools.schema.json` for `mcp____` naming +- [ ] 5.6 Author `parsers/codex-tools.schema.json` for OpenAI Responses tool patterns +- [ ] 5.7 Implement parser dispatch in `server/parsers/index.js` consuming the schemas +- [ ] 5.8 Replace inline string matching in `server/system-prompt.js`, `server/store.js`, and `server/helpers.js` with schema dispatch calls +- [ ] 5.9 Implement sentinel emission for unknown tools / skills / MCP markers and `~/.ccxray/parser-drift.log` append writer +- [ ] 5.10 Implement reconciliation invariants: tool_use block count equals extracted count; token attribution sums equal usage block values +- [ ] 5.11 Wrap parser calls in try/catch with `ccxray.parser.error_total{parser,error_type}` increment and `ccxray.parser.degraded=true` attribute on the affected entry +- [ ] 5.12 Author snapshot fixtures under `test/fixtures/parser/` for every (provider, scenario) pair listed in `parser-schemas/spec.md` +- [ ] 5.13 Wire snapshot tests into `npm test` + +## 6. 
Wire metrics into forward / store paths + +- [ ] 6.1 In `server/forward.js`, emit cost / token / latency / error / stop_reason metrics after each completed forward, using the otel-health queue +- [ ] 6.2 In `server/store.js`, emit usage / pattern / governance metrics as session/tool/skill/MCP detection runs through the new parsers +- [ ] 6.3 Ensure no emit path can throw into the proxy code path; all emits are best-effort +- [ ] 6.4 Add a unit test that verifies forward.js continues to function with OTel disabled, init-failed (degraded), and circuit_open states + +## 7. CLI introspection commands + +- [ ] 7.1 Implement `ccxray status --otel` per `otel-introspection/spec.md`: tier, endpoint (masked), state, transitions, cooldown, cardinality usage rows, success/failure/dropped counts, opt_in_acknowledged_at, CLI coexistence flag +- [ ] 7.2 Implement `ccxray otel preview` dry-run printing next-export JSON with secrets masked +- [ ] 7.3 Implement `ccxray parser report` command summarizing top unknown tokens and generating a GitHub issue body template +- [ ] 7.4 Add startup banner declaring tier and complement-mode status when tier ≥ 1 +- [ ] 7.5 Unit tests for each command and banner output + +## 8. Hub-side coexistence (minimal Phase 1 changes) + +- [ ] 8.1 Confirm the hub does NOT initialize OTel SDK for business metrics; document this explicitly in the hub module header comment +- [ ] 8.2 Make `ccxray status` aware of per-client OTel state via hub's existing client registration channel (so cross-client visibility works) +- [ ] 8.3 Defer `ccxray.hub.*` operational metrics to a follow-up change (per Open Questions in design.md) + +## 9. 
Documentation + +- [ ] 9.1 Add `docs/otel-ethics.md` (bilingual): why these metrics are not for individual performance evaluation; what acceptable uses look like +- [ ] 9.2 Add `docs/otel-quickstart.md` (bilingual): 90-second Grafana onboarding with screenshots +- [ ] 9.3 Reference `docs/otel-integration.html` (existing) as the design record from README +- [ ] 9.4 Update README with a single section: "Optional: send metrics to your observability backend" linking to quickstart and ethics docs +- [ ] 9.5 Update `CLAUDE.md` Architecture section to note the new modules and their roles + +## 10. Verification gates + +- [ ] 10.1 CI lint: every emitted metric name MUST exist in `server/otel.js` schema registry; new metrics without registry entries fail build +- [ ] 10.2 CI lint: no metric name SHALL start with `claude_code.`; assertion runs across all `server/**/*.js` +- [ ] 10.3 Integration test: spin a local OTLP collector (docker), run a synthetic ccxray session, assert collector received the expected metric families with correct attributes +- [ ] 10.4 Integration test: simulate collector returning 500 → assert circuit opens, queue drops oldest, ccxray continues forwarding +- [ ] 10.5 Integration test: simulate `CLAUDE_CODE_ENABLE_TELEMETRY=1` → assert `cli_otel_active` attribute appears on emitted metrics +- [ ] 10.6 Manual usability test: 3 new engineers walk README + quickstart, target median time-to-first-metric < 5 minutes +- [ ] 10.7 Set 3-month KPI gate in repo: track GitHub references to "otel" / "OTEL_EXPORTER"; if < 10 within 3 months of release, pause Phase 2 work and revisit + +## 11. 
Release prep + +- [ ] 11.1 Update CHANGELOG with new dependencies, default-off behavior, three-tier model, and link to design doc +- [ ] 11.2 Confirm npm publish package size delta and document in PR description +- [ ] 11.3 Open follow-up issue for Phase 2 (span emit + `/entry/:id` drill-back) +- [ ] 11.4 Open follow-up issue for `--otel-demo` Docker Compose helper +- [ ] 11.5 Open follow-up issue for `ccxray.hub.*` operational metrics From e347fe70d73cfc58393c1b8d1b6dfc0c0469bc57 Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Wed, 13 May 2026 13:51:24 +0800 Subject: [PATCH 02/10] docs(otel): replace mermaid walkthrough with pure-SVG overview MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit docs/otel-change-walkthrough.html used Mermaid via CDN for the state machine and module dependency diagrams. Replaced with docs/otel-phase1-overview.html which renders the same diagrams as native SVG, eliminating the external dependency and making the file fully self-contained. Content is unchanged in substance — same 10 sections, same citations back to the OpenSpec change. Two diagrams (OTel health state machine and module dependency graph) re-authored in inline SVG with hand-laid arrow paths and the same color coding as the other diagrams on the page. Verified twice via subagent review against proposal.md, design.md, all six specs, and tasks.md — every metric name, numeric default, attribute list, state transition, and module annotation traces to source. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/otel-change-walkthrough.html | 916 ----------------------- docs/otel-phase1-overview.html | 1138 +++++++++++++++++++++++++++++ 2 files changed, 1138 insertions(+), 916 deletions(-) delete mode 100644 docs/otel-change-walkthrough.html create mode 100644 docs/otel-phase1-overview.html diff --git a/docs/otel-change-walkthrough.html b/docs/otel-change-walkthrough.html deleted file mode 100644 index d3eec08..0000000 --- a/docs/otel-change-walkthrough.html +++ /dev/null @@ -1,916 +0,0 @@ - - - - -OpenSpec change: add-otel-metrics-phase1 — Mental Model - - - - -
- -

OpenSpec change: add-otel-metrics-phase1

-
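Everything in this document comes from the specs themselves, and every section cites its source. It uses visualizations (flowcharts, state machines, animations) rather than prose descriptions to help you build an accurate mental model.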
- - - - -

1. What this change is, and what it is not

- -

What it is (Goals)

-
    -
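  • Provide OTel metrics emitted by ccxray itself, covering cost, usage (tool/MCP/skill), quality, patterns, and governance
  • -
  • Off by default. No telemetry is emitted unless the user explicitly opts in per project
  • -
  • Three-tier opt-in (disabled / project-anonymous / personal-named): the project config sets the upper bound, and personal config can only match or lower it
  • -
  • Coexist with the Claude Code CLI's built-in OTel, with a reconciliation metric that surfaces accounting bugs on either side
  • -
  • OTel failure never affects the proxy: config errors fail at startup, init errors degrade silently, and runtime errors are absorbed by a bounded queue plus circuit breaker
  • -
  • Parser drift must be visible: unrecognized events emit a sentinel counter instead of silently reading as 0
  • -
  • Introspection commands: ccxray status --otel, ccxray otel preview, ccxray parser report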
  • -
-
-

Each item in this list maps one-to-one to the Goals section of design.md.

- 📄 design.md › Goals / Non-Goals -
- -

What it is not (Non-Goals)

-
    -
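  • Traces / spans. Phase 1 emits metrics only. Spans, entry_id deep links, and the /entry/:id route are all Phase 2
  • -
  • Exporting full payloads. Request/response bodies never leave the machine
  • -
  • Synthetic tool span timing. Estimating tool execution time from HTTP cadence would be misleading
  • -
  • A central ccxray hub that aggregates across machines. Each engineer's ccxray runs locally and independently
  • -
  • Auto-instrumentation. @opentelemetry/auto-instrumentations-node will not be pulled in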
  • -
-
- 📄 design.md › Non-Goals -
- - -

2. The six capabilities at a glance

- -

The spec defines 6 new capabilities (proposal.md › New Capabilities). Their relationships:

- -
-
-flowchart TB
-    CFG[otel-config<br/>load/validate .ccxray.json<br/>+ .ccxray.user.json]
-    TIERS[otel-tiers<br/>three-tier opt-in resolution]
-    HEALTH[otel-health<br/>state machine + queue + breaker]
-    EXPORT[otel-export<br/>SDK init + metrics emit]
-    PARSER[parser-schemas<br/>schema-ization + sentinels]
-    INTRO[otel-introspection<br/>status / preview / report]
-
-    CFG --> TIERS
-    TIERS --> EXPORT
-    HEALTH --> EXPORT
-    PARSER --> EXPORT
-    CFG --> INTRO
-    HEALTH --> INTRO
-    EXPORT --> INTRO
-    PARSER --> INTRO
-
-    style CFG fill:#88c0d0,stroke:#5e81ac,color:#0f1419
-    style TIERS fill:#88c0d0,stroke:#5e81ac,color:#0f1419
-    style HEALTH fill:#88c0d0,stroke:#5e81ac,color:#0f1419
-    style EXPORT fill:#a3be8c,stroke:#5e81ac,color:#0f1419
-    style PARSER fill:#88c0d0,stroke:#5e81ac,color:#0f1419
-    style INTRO fill:#b48ead,stroke:#5e81ac,color:#0f1419
-
-
| Capability | Spec file | Core responsibility (quoted directly) |
|---|---|---|
| otel-config | specs/otel-config/spec.md | ".ccxray.json and .ccxray.user.json schema, ${ENV_VAR} interpolation, literal-secret rejection, .gitignore auto-amend, project-upper-bound + personal-lower-bound merging rules." |
| otel-export | specs/otel-export/spec.md | "OTel SDK initialization (client-side, not hub), metric definitions under ccxray.* namespace, ccxray.source resource attribute, cardinality budget enforcement with _overflow_ fallback, CLI coexistence detection and complement-mode signaling, reconciliation diff metric." |
| otel-tiers | specs/otel-tiers/spec.md | "three-tier opt-in (disabled / project-anonymous / personal-named), tier resolution with project as upper bound and personal as lower bound, enduser.id attribute only in tier 2, opt-in acknowledgment timestamp persisted in personal config." |
| otel-health | specs/otel-health/spec.md | "failure state machine (disabled / active / degraded / circuit_open), bounded export queue with drop-oldest semantics, circuit breaker with exponential backoff, local failure log at ~/.ccxray/otel.log with rotation, never-block guarantee for the proxy path." |
| parser-schemas | specs/parser-schemas/spec.md | "extract skill / MCP / tool / agent-type detection into versioned JSON schemas, snapshot fixtures per provider (Anthropic + Codex), sentinel metrics for unknown events, reconciliation invariants run per entry, try/catch isolation so parser failure does not affect ccxray core." |
| otel-introspection | specs/otel-introspection/spec.md | "ccxray status --otel view (tier, endpoint, health, cardinality, dropped counts), ccxray otel preview dry-run, ccxray parser report for drift inspection, startup banner declaring active tier and CLI coexistence mode." |
-
- 📄 proposal.md › Capabilities › New Capabilities -
- - -

3. The journey of one request (data-flow animation)

- -

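When ccxray intercepts an HTTP request, returns the response, and writes the entry to the local log, the request also travels the OTel path. The green ball below represents one metric event on its journey from creation to delivery at the OTLP collector: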

- -
-
-
forward.js: request completes, usage is captured
- -
parsers/*: parse tool / skill / MCP
- -
otel.js: budget check + counter.add()
- -
otel-health: queue + state machine
- -
OTLP HTTP: send to collector
-
-
-
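Green ball = one metric event, flowing left to right through 5 stages. Note that this path runs in parallel with the proxy forward path, so a failure at any stage does not affect the proxy.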
-
- -

Key guarantees defined by the spec

- -
-

"OTel export operations SHALL NOT block the HTTP proxy path. All emit operations SHALL enqueue without awaiting export completion."

- 📄 otel-health/spec.md › Requirement: Never-block guarantee for the proxy -
- -
-

"Parser code SHALL be wrapped in try/catch boundaries. On exception... Parser failure SHALL NOT propagate to the proxy path or terminate ccxray."

- 📄 parser-schemas/spec.md › Requirement: Parser error isolation -
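The isolation requirement quoted above can be sketched roughly as follows. This is a minimal illustration, not ccxray's implementation: `safeParse`, the `counters` map (standing in for `ccxray.parser.error_total{parser,error_type}`), and the returned shape are all hypothetical names.

```javascript
// Hypothetical sketch: wrap a parser call so an exception never escapes
// into the proxy path. `counters` stands in for the real
// ccxray.parser.error_total{parser,error_type} metric.
function safeParse(parserName, parseFn, entry, counters) {
  try {
    return { result: parseFn(entry), degraded: false };
  } catch (err) {
    // Count the failure by parser name and error type.
    const key = `${parserName}|${err.name}`;
    counters.set(key, (counters.get(key) || 0) + 1);
    // The caller still writes the entry to local logs and tags it degraded.
    return { result: null, degraded: true };
  }
}
```

The caller would use the `degraded` flag to set `ccxray.parser.degraded=true` on the affected entry and carry on forwarding.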
- -

Corresponding implementation locations in tasks.md

-
    -
  • forward.js changes: tasks.md §6.1
  • -
  • store.js changes: tasks.md §6.2
  • -
  • "No emit path can throw into the proxy": tasks.md §6.3
  • -
- - -

4. Three-tier opt-in decision tree

- -

The spec sets the tier resolution rule as min(project_tier, personal_tier). The tree below lists every meaningful combination:

- -
-ccxray starts
-├─ No config found?
-│   └─ ▶ tier 0 (disabled): OTel SDK is not loaded, no network connection is opened
-│
-├─ Only .ccxray.json (project enables tier 1)
-│   └─ ▶ tier 1 (project anonymous): carries only project.name / team, no enduser.id
-│
-├─ .ccxray.json tier 1 + .ccxray.user.json tier 0
-│   └─ ▶ tier 0: personal unilateral opt-out
-│
-├─ .ccxray.json tier 1 + .ccxray.user.json tier 2
-│   └─ ⚠ project does not authorize tier 2
-│       → clamped to tier 1, warning printed (required by spec)
-│
-└─ .ccxray.json tier 2 + .ccxray.user.json tier 2
-    └─ ▶ tier 2 (personal named): carries enduser.id, records opt_in_acknowledged_at
-
- -
-

"The effective tier SHALL be min(project_tier, personal_tier). If either side is absent, the present side SHALL be used. The minimum SHALL clamp downward; personal config SHALL NOT exceed project config."

- 📄 otel-tiers/spec.md › Requirement: Tier resolution rule -
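The resolution rule quoted above is small enough to sketch in a few lines. The function name `resolveTier` and its return shape are assumptions made for illustration; the spec only fixes the `min(...)` rule, the "present side wins" fallback, and the clamp warning.

```javascript
// Hypothetical sketch of the tier resolution rule. A tier is 0, 1, or 2;
// `undefined` means that config file is absent.
function resolveTier(projectTier, personalTier) {
  const hasProject = Number.isInteger(projectTier);
  const hasPersonal = Number.isInteger(personalTier);
  // No config at all: tier 0 (disabled).
  if (!hasProject && !hasPersonal) return { effective: 0, clamped: false };
  // If either side is absent, the present side is used.
  if (!hasProject) return { effective: personalTier, clamped: false };
  if (!hasPersonal) return { effective: projectTier, clamped: false };
  // Both present: take the minimum; flag when personal was clamped down,
  // so the caller can print the warning the spec requires.
  return {
    effective: Math.min(projectTier, personalTier),
    clamped: personalTier > projectTier,
  };
}
```

For example, `resolveTier(1, 2)` yields tier 1 with `clamped` set, while `resolveTier(2, 0)` yields tier 0, which is the unilateral opt-out case.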
- -
-

"Any engineer SHALL be able to opt out of OTel emission for their own machine by setting tier: 0 in .ccxray.user.json, regardless of the project config."

- 📄 otel-tiers/spec.md › Requirement: Engineer unilateral opt-out -
- -

Additional conditions for tier 2

-
    -
  • An identity string must be provided in .ccxray.user.json (a pseudonym is fine; a real name is not required), per spec D1
  • -
  • The file must be gitignored. If ccxray detects that the file is tracked by git, it refuses to apply the personal identity (spec otel-tiers › Personal config gitignore enforcement)
  • -
  • On first enable, an opt_in_acknowledged_at ISO 8601 timestamp is written automatically (spec otel-tiers › Opt-in acknowledgment timestamp)
  • -
- - -

5. Emit on the client side, not the hub

- -

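ccxray's hub mode lets multiple projects share a single proxy process. This change, however, explicitly requires OTel emission to happen in the client rather than the hub, so that each project can set its own tier and endpoint: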

- -
-
-flowchart LR
-    subgraph ClientA [Client A: projectA<br/>.ccxray.json tier=1]
-        SDKA[OTel SDK A] --> CollA[collector-A]
-    end
-    subgraph ClientB [Client B: projectB<br/>.ccxray.json tier=0]
-        NoneB((no SDK))
-    end
-    subgraph ClientC [Client C: projectC<br/>.ccxray.user.json tier=2]
-        SDKC[OTel SDK C<br/>includes enduser.id] --> CollC[collector-C]
-    end
-
-    ClientA -.HTTP only.-> Hub[ccxray hub<br/>does not emit business metrics]
-    ClientB -.HTTP only.-> Hub
-    ClientC -.HTTP only.-> Hub
-    Hub --> API[Anthropic / OpenAI]
-
-    style Hub fill:#88c0d0,stroke:#5e81ac,color:#0f1419
-    style SDKA fill:#a3be8c,stroke:#5e81ac,color:#0f1419
-    style SDKC fill:#b48ead,stroke:#5e81ac,color:#0f1419
-    style NoneB fill:#2a313c,stroke:#5e81ac,color:#d8dee9
-
-
- -
-

"OTel SDK initialization and metric emission happen in the client process (the one that ran ccxray claude). The hub remains a pure HTTP proxy plus SSE broadcaster. The hub MAY emit its own operational metrics under ccxray.hub.* namespace using a separate config (~/.ccxray/hub-config.json), but it does NOT emit business metrics on behalf of clients."

- 📄 design.md › D2. Client-side emit, not hub-side -
- -
-

"Hub does not emit business metrics: WHEN the ccxray hub forwards an HTTP request between a client and an upstream provider, THEN the hub SHALL NOT emit any business metric on behalf of the client, regardless of the client's tier setting."

- 📄 otel-export/spec.md › Requirement: Client-side OTel SDK initialization › Scenario: Hub does not emit business metrics -
- -

Note: the ccxray.hub.* operational metrics are deferred in Phase 1; design.md Open Questions and tasks.md §8.3 state explicitly that they are deferred to a follow-up change.

- - -

6. Namespaces and CLI coexistence

- -

When ccxray runs alongside the Claude Code CLI's built-in OTel, the two use different namespaces, and ccxray additionally emits a reconciliation diff metric:

- -
-
-flowchart TB
-    CLI[Claude Code CLI<br/>built-in OTel] --> NSA[claude_code.*<br/>token / interaction / tool spans]
-    CCX[ccxray] --> NSB["ccxray.*<br/>+ resource: ccxray.source='ccxray-proxy'<br/>+ attr: ccxray.cli_otel_active=true (when both active)"]
-
-    NSA -.same tokens.-> COMP[reconciliation comparison]
-    NSB -.same tokens.-> COMP
-    COMP --> DIFF[ccxray.reconciliation.token_diff_pct<br/>a gap = a bug on one side]
-
-    style NSA fill:#88c0d0,stroke:#5e81ac,color:#0f1419
-    style NSB fill:#a3be8c,stroke:#5e81ac,color:#0f1419
-    style DIFF fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
-
-
- -
-

"Every metric SHALL be named under the ccxray.<system>.<aspect> pattern. No metric SHALL be named identically to a Claude Code CLI metric or any other upstream OTel convention that would overlap."

- 📄 otel-export/spec.md › Requirement: ccxray.* namespace for all emitted metrics -
- -
-

"ccxray SHALL detect the presence of CLAUDE_CODE_ENABLE_TELEMETRY=1 in the environment and, when detected, SHALL emit all metrics with an additional attribute ccxray.cli_otel_active=true. ccxray SHALL print a startup notice... ccxray SHALL NOT disable any of its own metrics based on CLI coexistence."

- 📄 otel-export/spec.md › Requirement: CLI OTel coexistence and complement mode -
- -
-

"ccxray SHALL emit ccxray.reconciliation.token_diff_pct{model} as a gauge that compares ccxray's HTTP-observed token counts against the corresponding values reported by the CLI when both are active."

- 📄 otel-export/spec.md › Requirement: Reconciliation diff metric -
- - -

7. Cardinality budget (animation)

- -

Each metric has a capacity limit for each attribute. Once the number of unique values reaches the limit, subsequent unique values are recorded as the literal string _overflow_ and a sentinel counter is emitted:

- -
-
-
attribute key
-
tool
-
budget=50
-
23 / 50 unique
healthy
-
-
-
attribute key
-
model
-
budget=10
-
4 / 10 unique
healthy
-
-
-
attribute key
-
mcp_server
-
budget=30
-
30 / 30: OVERFLOW
new values recorded as _overflow_
sentinel ++
-
-
- -
-

"Every metric declares an allow-list of attribute keys and a per-key cardinality budget (e.g. tool=50, model=10, mcp_server=30). Attribute values are tracked in a Set per (metric, attribute); when the Set reaches budget size, subsequent unique values are recorded as the literal string _overflow_ and a sentinel counter ccxray.metrics.overflow_total{metric,attribute} increments."

- 📄 design.md › D4. Cardinality budget with overflow fallback -
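The D4 mechanism quoted above, one Set per (metric, attribute) with an `_overflow_` fallback, might look roughly like this. The class and method names are hypothetical, and the `overflowTotal` map merely stands in for the `ccxray.metrics.overflow_total{metric,attribute}` counter; treat it as a sketch under those assumptions.

```javascript
// Hypothetical sketch of the D4 budget tracker.
class CardinalityBudget {
  constructor(budgets) {
    this.budgets = budgets;          // e.g. { tool: 50, model: 10, mcp_server: 30 }
    this.seen = new Map();           // "metric|attribute" -> Set of admitted values
    this.overflowTotal = new Map();  // stand-in for ccxray.metrics.overflow_total
  }

  // Returns the value to emit: the original, "_overflow_", or undefined
  // when the attribute key is not in the allow-list (dropped).
  admit(metric, attribute, value) {
    const budget = this.budgets[attribute];
    if (budget === undefined) return undefined;
    const key = `${metric}|${attribute}`;
    if (!this.seen.has(key)) this.seen.set(key, new Set());
    const values = this.seen.get(key);
    if (values.has(value)) return value;   // already admitted
    if (values.size < budget) {            // room left in the budget
      values.add(value);
      return value;
    }
    // Budget exhausted: count the overflow and emit the literal fallback.
    this.overflowTotal.set(key, (this.overflowTotal.get(key) || 0) + 1);
    return "_overflow_";
  }
}
```

Values admitted before the budget filled keep their real label afterwards; only new unique values collapse into `_overflow_`.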
- -
-

"Attribute keys not in the allow-list are dropped at the View API layer (OTel SDK native enforcement). High-cardinality candidates that look attractive (bash.command_pattern, file_path) are explicitly NOT emitted as metric labels."

- 📄 design.md › D4 -
- -

Three corresponding scenarios (all from the spec)

| Scenario | Spec behavior |
|---|---|
| Tool name is the 3rd unique value against a budget of 50 | Emits tool="Read" normally; the sentinel does not increment |
| The 51st unique tool name appears | Emits tool="_overflow_"; ccxray.metrics.overflow_total{metric="...",attribute="tool"} increments by 1 |
| An attribute outside the allow-list is passed in (e.g. bash_command) | The attribute is dropped by the View API before emission |
-
- 📄 otel-export/spec.md › Requirement: Cardinality budget enforcement › Scenarios -
- - -

8. Failure state machine (animation)

- -

OTel health always has exactly 4 states. The green active box below pulses to show what "currently active" looks like:

- -
-
- disabled - tier 0 or
OTel packages missing
-
-
- active - SDK initialized successfully
exports healthy
-
-
- degraded - SDK init failed
ccxray still runs normally
-
-
- circuit_open - runtime failures
exports paused, periodic half-open trial
-
-
- -

Transitions defined by the spec

- -
-startup [tier 0 / no OTel pkg] → disabled
-startup [tier ≥ 1, SDK init OK] → active
-active [SDK init throws] → degraded
-active [5 consecutive export failures] → circuit_open (cooldown=60s)
-circuit_open [cooldown elapsed] → half_open (trial export)
-half_open [success] → active (cooldown reset to 60s)
-half_open [failure] → circuit_open (cooldown × 2, max 600s)
-
- -
-

"ccxray SHALL maintain an OTel health state machine with exactly four states: disabled, active, degraded, and circuit_open. Transitions SHALL be driven exclusively by the conditions described in the subsequent requirements; no other code path SHALL mutate state."

- 📄 otel-health/spec.md › Requirement: Four-state OTel health machine -
- -
-

"After 5 consecutive export failures, the state SHALL transition to circuit_open and exports SHALL be paused. After an initial cooldown of 60 seconds, the state SHALL transition to half_open and a single export SHALL be attempted. Success SHALL return the state to active. Failure SHALL keep the state at circuit_open and the cooldown SHALL double up to a maximum of 600 seconds."

- 📄 otel-health/spec.md › Requirement: Circuit breaker with exponential backoff -
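The breaker behavior quoted above can be sketched as a small class. The numbers (5 failures, 60 s initial cooldown, 600 s cap) come from the spec; the class name `ExportBreaker`, its method names, and the choice to pass `now` in explicitly (to keep the sketch testable without timers) are assumptions.

```javascript
// Hypothetical sketch of the export circuit breaker.
class ExportBreaker {
  constructor({ threshold = 5, initialCooldownMs = 60000, maxCooldownMs = 600000 } = {}) {
    this.threshold = threshold;
    this.initialCooldownMs = initialCooldownMs;
    this.maxCooldownMs = maxCooldownMs;
    this.state = "active";
    this.failures = 0;
    this.cooldownMs = initialCooldownMs;
    this.openedAt = 0;
  }

  // May an export be attempted at time `now` (milliseconds)?
  canExport(now) {
    if (this.state === "active") return true;
    if (this.state === "circuit_open" && now - this.openedAt >= this.cooldownMs) {
      this.state = "half_open"; // allow exactly one trial export
      return true;
    }
    return false; // still cooling down, or a trial is already in flight
  }

  onSuccess() {
    this.state = "active";
    this.failures = 0;
    this.cooldownMs = this.initialCooldownMs;
  }

  onFailure(now) {
    if (this.state === "half_open") {
      // Trial failed: reopen and double the cooldown, capped at the maximum.
      this.cooldownMs = Math.min(this.cooldownMs * 2, this.maxCooldownMs);
      this.state = "circuit_open";
      this.openedAt = now;
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.state = "circuit_open";
      this.openedAt = now;
    }
  }
}
```

With the defaults, successive failed trials wait 60 s, 120 s, 240 s, 480 s, and then 600 s from there on.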
- -

Queue behavior (also specified by the spec)

- -
-

"The OTel export queue SHALL be bounded by a configurable size (default 2048 entries). When the queue is full and a new export is attempted, the oldest queued entry SHALL be dropped to make room. Each drop SHALL increment ccxray.otel.exports_dropped_total{signal}."

- 📄 otel-health/spec.md › Requirement: Bounded export queue with drop-oldest semantics -
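Drop-oldest semantics as quoted above amount to very little code. In this sketch the class name and the `dropped` field (a stand-in for `ccxray.otel.exports_dropped_total{signal}`) are illustrative assumptions; only the default size of 2048 and the drop-oldest rule come from the spec.

```javascript
// Hypothetical sketch of the bounded export queue.
class BoundedExportQueue {
  constructor(maxSize = 2048) {
    this.maxSize = maxSize;
    this.items = [];
    this.dropped = 0; // stand-in for ccxray.otel.exports_dropped_total
  }

  // Never blocks and never throws: when full, the oldest entry makes room.
  enqueue(entry) {
    if (this.items.length >= this.maxSize) {
      this.items.shift();
      this.dropped += 1;
    }
    this.items.push(entry);
  }

  // Hand the current batch to the exporter and start a fresh queue.
  drain() {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}
```

`Array.prototype.shift` is O(n); at a few thousand entries that is likely acceptable for a sketch, though a ring buffer would avoid the copy.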
- -

How the three failure classes are routed

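| Failure class | Example | Handling defined by the spec |
|---|---|---|
| Config error | JSON syntax error, unresolved ${VAR} | Startup fails with a non-zero exit; the error names the file and line number |
| Init error | Malformed endpoint URL | Enters degraded; ccxray still starts, the proxy works normally, and status shows the state |
| Runtime error | collector unreachable | Absorbed by the circuit breaker (enters circuit_open) |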
-
- 📄 otel-health/spec.md › Requirement: Config errors fail fast, init/runtime errors degrade -
- - -

9. Config loading and secret rejection

- -
-
-flowchart TB
-    A[ccxray starts]
-    B{Found .ccxray.json?<br/>walk from cwd up to git root}
-    C[Read .ccxray.json]
-    D{Found .ccxray.user.json?<br/>cwd or $HOME}
-    E[Read .ccxray.user.json]
-    F[JSON syntax OK?]
-    G[Schema OK?]
-    H["All ${VAR} resolved?"]
-    I[No literal secret pattern?]
-    J["tier_effective = min(project, personal)"]
-    K[Initialize SDK]
-    FAIL[non-zero exit<br/>fail fast]
-    NONE[tier 0<br/>SDK not started]
-
-    A --> B
-    B -- no --> NONE
-    B -- yes --> C
-    C --> F
-    F -- no --> FAIL
-    F -- yes --> G
-    G -- no --> FAIL
-    G -- yes --> H
-    H -- no --> FAIL
-    H -- yes --> I
-    I -- looks like a secret --> FAIL
-    I -- OK --> D
-    D -- no --> J
-    D -- yes --> E
-    E --> F
-    J --> K
-
-    style FAIL fill:#bf616a,stroke:#5e81ac,color:#0f1419
-    style NONE fill:#8b95a5,stroke:#5e81ac,color:#0f1419
-    style K fill:#a3be8c,stroke:#5e81ac,color:#0f1419
-
-
- -
-

"Both files support ${ENV_VAR} interpolation in string values. The schema validator rejects any string that looks like a literal secret (Bearer [A-Za-z0-9]{20,}, sk_live_*, ghp_*, JWT structure) when not wrapped in ${...}. First-time generation auto-amends .gitignore to include .ccxray.user.json."

- 📄 design.md › D6. Config: .ccxray.json + .ccxray.user.json -
- -

Concrete scenarios (listed directly from the spec)

-
    -
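  • Valid interpolation: config contains "Authorization": "Bearer ${OTLP_TOKEN}" and the env var OTLP_TOKEN=abc123 is set → the loaded header value is "Bearer abc123", and the literal value does not appear in any debug log line
  • -
  • Missing env var: config contains "Bearer ${MISSING_VAR}" and MISSING_VAR is unset → ccxray exits non-zero, and the error message includes the file path, line number, and variable name MISSING_VAR
  • -
  • Literal token rejected: config contains the literal "Bearer abc123longtokenvalue..." → startup fails, prompting the user to use ${ENV_VAR} instead
  • -
  • Plain URLs pass: pure URLs and hostnames are allowed without ${...}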
  • -
-
- 📄 otel-config/spec.md › Requirement: Environment variable interpolation, Requirement: Literal-secret rejection -
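The scenarios above can be exercised with a short sketch of interpolation plus literal-secret rejection. The function names are hypothetical, and the regular expressions are one plausible reading of the patterns named in D6 (the JWT check in particular is an approximation: three dot-separated base64url segments starting with `eyJ`), not the validator ccxray ships.

```javascript
// Hypothetical patterns approximating the D6 literal-secret list.
const SECRET_PATTERNS = [
  /Bearer [A-Za-z0-9]{20,}/,
  /sk_live_[A-Za-z0-9]+/,
  /ghp_[A-Za-z0-9]+/,
  /\beyJ[\w-]+\.[\w-]+\.[\w-]+/, // JWT-like structure (approximation)
];

const VAR_REF = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;

// A value is suspicious only if a pattern matches outside any ${...} reference.
function looksLikeLiteralSecret(value) {
  const withoutRefs = value.replace(VAR_REF, "");
  return SECRET_PATTERNS.some((re) => re.test(withoutRefs));
}

// Replace every ${NAME} with env[NAME]; fail fast, naming the variable.
function interpolate(value, env) {
  return value.replace(VAR_REF, (_, name) => {
    if (env[name] === undefined) {
      throw new Error(`unresolved variable ${name}`);
    }
    return env[name];
  });
}

// Reject literal secrets first, then resolve references.
function loadString(value, env) {
  if (looksLikeLiteralSecret(value)) {
    throw new Error("literal secret detected; wrap it in ${ENV_VAR} instead");
  }
  return interpolate(value, env);
}
```

A real loader would also attach the file path and line number to both errors, as the spec scenarios require; the sketch omits that to stay short.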
- - -

10. Parser pipeline and sentinels

- -

The spec requires that tool / MCP / skill / agent-type detection no longer rely on inline strings scattered across system-prompt.js / store.js / helpers.js, and instead move to versioned JSON schemas:

- -
-
-flowchart LR
-    REQ[An entry arrives] --> DISP[server/parsers/index.js dispatch]
-
-    DISP --> S1[anthropic-tools.schema.json]
-    DISP --> S2[anthropic-skills.schema.json]
-    DISP --> S3[anthropic-agent-types.schema.json]
-    DISP --> S4[mcp-tools.schema.json]
-    DISP --> S5[codex-tools.schema.json]
-
-    S1 --> OK[ccxray.tool.invocations_total, ccxray.skill.activations_total, ...]
-    S1 --> UNK["ccxray.parser.unknown_tool_total{provider}"]
-
-    S2 --> OK
-    S2 --> UNK2["ccxray.parser.unknown_skill_marker_total"]
-
-    S4 --> OK
-    S4 --> UNK3["ccxray.parser.unknown_mcp_format_total"]
-
-    OK --> INV[invariants check]
-    INV --> MISMATCH[ccxray.parser.reconciliation_mismatch_total when tool_use block count ≠ extracted count]
-
-    style DISP fill:#88c0d0,stroke:#5e81ac,color:#0f1419
-    style OK fill:#a3be8c,stroke:#5e81ac,color:#0f1419
-    style UNK fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
-    style UNK2 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
-    style UNK3 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419
-    style MISMATCH fill:#bf616a,stroke:#5e81ac,color:#0f1419
-
-
- -
-

「Detection logic for tool / MCP / skill / agent-type SHALL be expressed as JSON schemas under server/parsers/. There SHALL be at minimum one schema per (concern, provider) pair: parsers/anthropic-tools.schema.json, parsers/anthropic-skills.schema.json, parsers/anthropic-agent-types.schema.json, parsers/mcp-tools.schema.json (provider-agnostic MCP naming convention), parsers/codex-tools.schema.json. Each schema SHALL include a version field (semver) and a last_verified_against field (ISO 8601 date).」

- 📄 parser-schemas/spec.md › Requirement: Versioned parser schemas per concern and provider -
- -

What does a sentinel mean?

- -

"Unknown" is not 0; it means "I saw a token that the schema does not recognize." It is an early signal for detecting drift:

- -
-

「When the parser encounters a token, marker, or block that does not match any registered pattern in the relevant schema, it SHALL increment one of: ccxray.parser.unknown_tool_total{provider}, ccxray.parser.unknown_skill_marker_total{provider}, ccxray.parser.unknown_mcp_format_total, ccxray.parser.fallback_used_total{parser,reason}. The unknown event SHALL also be recorded with a short sample to ~/.ccxray/parser-drift.log for later inspection via ccxray parser report.」

- 📄 parser-schemas/spec.md › Requirement: Sentinel counters for unknown tokens -
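The sentinel requirement quoted above can be sketched roughly as follows. This is an illustration only, not the parser under server/parsers/: `classifyTool`, the in-memory `counters` map, and the `driftSamples` array are hypothetical stand-ins for the real metric registry and for ~/.ccxray/parser-drift.log.

```javascript
'use strict';

// Hypothetical sketch: a tool name that the schema does not list is counted
// as an "unknown" sentinel instead of being silently dropped.
function classifyTool(name, provider, knownTools, counters, driftSamples) {
  if (knownTools.has(name)) {
    return { known: true, tool: name };
  }
  const key = `ccxray.parser.unknown_tool_total{provider="${provider}"}`;
  counters.set(key, (counters.get(key) || 0) + 1);
  // Keep only a short sample, standing in for one parser-drift.log line.
  driftSamples.push({ category: 'tool', provider, sample: name.slice(0, 64) });
  return { known: false, tool: name };
}

// Assumed example data; the real tool list would come from a schema file.
const counters = new Map();
const driftSamples = [];
const knownTools = new Set(['Bash', 'Read']);
classifyTool('Bash', 'anthropic', knownTools, counters, driftSamples);
classifyTool('FancyNewTool', 'anthropic', knownTools, counters, driftSamples);
console.log([...counters.entries()], driftSamples.length);
```

The point of the sketch is that an unrecognized name leaves a countable trace plus a sample, so drift shows up as a rising counter rather than as missing data.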
- -

Reconciliation invariants

- -
-

「For every processed entry the parser SHALL verify the following invariants: Number of tool_use blocks in the response equals the number of tool entries extracted by the parser. Sum of input/output token counts attributed by the parser equals the corresponding values in the upstream usage block.」

- 📄 parser-schemas/spec.md › Requirement: Reconciliation invariants -
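A minimal sketch of the two invariants quoted above, assuming an Anthropic-style response whose `content` array holds `tool_use` blocks and whose `usage` block carries `input_tokens` / `output_tokens`. The function name `checkInvariants` and the shape of the `extracted` object are hypothetical.

```javascript
'use strict';

// Hypothetical sketch: compares what the parser extracted against the raw
// upstream response and returns the names of any violated invariants.
function checkInvariants(response, extracted) {
  const mismatches = [];
  const toolUseBlocks = (response.content || []).filter(
    (block) => block.type === 'tool_use'
  ).length;
  if (toolUseBlocks !== extracted.tools.length) {
    mismatches.push('tool_use_count');
  }
  const usage = response.usage || {};
  if (extracted.inputTokens !== usage.input_tokens) {
    mismatches.push('input_tokens');
  }
  if (extracted.outputTokens !== usage.output_tokens) {
    mismatches.push('output_tokens');
  }
  return mismatches;
}

// Assumed example response, trimmed to the fields the check reads.
const response = {
  content: [{ type: 'text', text: 'ok' }, { type: 'tool_use', name: 'Bash' }],
  usage: { input_tokens: 1000, output_tokens: 50 },
};
const clean = checkInvariants(response, { tools: ['Bash'], inputTokens: 1000, outputTokens: 50 });
const broken = checkInvariants(response, { tools: [], inputTokens: 1000, outputTokens: 50 });
console.log(clean, broken);
```

Each returned name would map to one increment of the mismatch counter; an empty array means the entry passed.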
- -

Snapshot fixtures (the minimum set required by the spec)

-

The spec requires every provider to have at least these fixtures:

-
    -
  • Basic tool invocation
  • -
  • Tool invocation with a skill marker active
  • -
  • Subagent invocation (Anthropic Task tool)
  • -
  • MCP server tool invocation
  • -
  • An intentional unknown tool name
  • -
-
- 📄 parser-schemas/spec.md › Requirement: Snapshot fixtures per provider -
- - -

11. CLI command surface

- -

The spec defines 3 new commands + 1 startup banner:

- -
-

ccxray status --otel

-

Shows:

-
    -
  • Effective tier (0/1/2) + which config files contributed
  • -
  • Endpoint URL (${VAR} parts masked)
  • -
  • OTel state (disabled / active / degraded / circuit_open) + the 3 most recent state transitions with timestamps
  • -
  • Remaining circuit breaker cooldown (if applicable)
  • -
  • Cardinality usage per metric, formatted as current / budget (e.g. tool: 23/50)
  • -
  • Exports succeeded / failed / dropped (past 1 hour + 24 hours)
  • -
  • opt_in_acknowledged_at when at tier 2
  • -
  • CLI coexistence indicator: whether CLAUDE_CODE_ENABLE_TELEMETRY was detected
  • -
-
- 📄 otel-introspection/spec.md › Requirement: ccxray status --otel shows effective configuration and health -
-
- -
-

ccxray otel preview

-
-

「The ccxray otel preview command SHALL print the exact JSON body that would be sent to the OTel collector on the next export, including all attribute values and resource attributes, WITHOUT sending any network request. Secrets resolved from ${ENV_VAR} SHALL be masked in the output.」

- 📄 otel-introspection/spec.md › Requirement: ccxray otel preview dry-run -
-
- -
-

ccxray parser report

-
-

「The ccxray parser report command SHALL print the top unknown tokens by frequency from the last 7 days of ~/.ccxray/parser-drift.log, grouped by category (tool / skill / MCP / fallback). The output SHALL include sample tokens and a GitHub issue body template the user can copy to file a drift report.」

- 📄 parser-schemas/spec.md › Requirement: ccxray parser report command -
-
- -
-

Startup banner

-
-

「When ccxray starts at tier ≥ 1, it SHALL print a one-line banner to stderr summarizing: tier value, endpoint (without secret), and complement-mode status (if CLI OTel is active). The banner SHALL NOT print when tier is 0.」

- 📄 otel-introspection/spec.md › Requirement: Startup banner declares active tier and mode -
-

Scenario:

-
    -
  • Tier 1 standalone → stderr contains a line matching ccxray OTel tier: 1 (anonymous) → <endpoint>
  • -
  • Tier 1 with CLI active → that line contains tier: 1 and complement-mode: true
  • -
  • Tier 0 → no OTel banner text at all
  • -
-
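The banner scenarios above could be produced by a helper along these lines. This is a sketch under assumptions: `formatOtelBanner` is a hypothetical name, the `(named)` label for tier 2 and the exact placement of `complement-mode: true` are guesses (the spec text quoted here only fixes the tier 1 standalone shape), and the caller is assumed to pass an endpoint that is already stripped of secrets.

```javascript
'use strict';

// Hypothetical sketch: returns the one-line stderr banner, or null at tier 0
// so that nothing OTel-related is printed.
function formatOtelBanner({ tier, endpoint, cliOtelActive }) {
  if (tier === 0) {
    return null;
  }
  const label = tier === 1 ? 'anonymous' : 'named'; // tier 2 label is assumed
  let line = `ccxray OTel tier: ${tier} (${label}) → ${endpoint}`;
  if (cliOtelActive) {
    line += ' complement-mode: true';
  }
  return line;
}

const banner = formatOtelBanner({
  tier: 1,
  endpoint: 'https://otel.example.com', // placeholder endpoint
  cliOtelActive: false,
});
if (banner !== null) {
  console.error(banner);
}
```

Returning `null` rather than an empty string keeps the tier 0 case explicit at the call site.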
- - -

12. Source index

- -

All spec files (can be opened in VS Code):

- - - - - - - - - - - -
File | Contents
proposal.md | Why / What Changes / Capabilities / Impact
design.md | Context / Goals / 8 decisions D1–D8 / Risks / Migration / Open Questions
specs/otel-config/spec.md | 6 Requirements with scenarios
specs/otel-export/spec.md | 8 Requirements with scenarios
specs/otel-tiers/spec.md | 6 Requirements with scenarios
specs/otel-health/spec.md | 7 Requirements with scenarios
specs/parser-schemas/spec.md | 6 Requirements with scenarios
specs/otel-introspection/spec.md | 4 Requirements with scenarios
tasks.md | 11 task groups, 60+ checkboxes
- -

This visualization is a mental-model tool, not the spec. For any ambiguity, the OpenSpec change files take precedence.

- -
- - - - - - diff --git a/docs/otel-phase1-overview.html b/docs/otel-phase1-overview.html new file mode 100644 index 0000000..f0a8b26 --- /dev/null +++ b/docs/otel-phase1-overview.html @@ -0,0 +1,1138 @@ + + + + +OTel Phase 1 Change — Visual Overview + + + + +
+ +

OTel Phase 1: Visual Overview

+
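+ This page walks through the add-otel-metrics-phase1 OpenSpec change section by section. The "Source:" label beneath each diagram names the basis for that claim; click it to open the corresponding spec file. The content follows proposal / design / specs / tasks strictly, with no speculation. +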
+ + + + +

1. Big picture: data flow

+ +

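ccxray uses a client / hub two-process architecture. OTel initialization and emit both happen on the client side; the hub is purely an HTTP proxy + SSE broadcaster and is not responsible for business metrics. Each client reads its own .ccxray.json + .ccxray.user.json, so different projects under the same hub can have different tiers and endpoints.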

+ +
+ [Figure: data-flow diagram]
+ Nodes: Claude Code (or Codex); ccxray client (forward.js, store.js, otel.js, otel-health.js, config-loader.js); ccxray hub (no business metrics; proxy + SSE only); Anthropic API (/v1/messages); OTLP Collector; Grafana / Datadog / Honeycomb; ~/.ccxray/logs
+ Edge labels: request, forward, response, SSE, metrics (OTLP), local log
+ Legend: request, response (SSE), OTel metric export
+
+ +
+ Source: + specs/otel-export/spec.md § Client-side OTel SDK initialization + + Source: + design.md § D2. Client-side emit, not hub-side + + Source: + proposal.md § Impact (server/hub.js note) + +
+ + +

2. Three-tier opt-in model

+ +

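The tier is the switch for how much information is sent. The default is tier 0 (nothing is sent). The project config is the upper bound and the personal config is the lower bound, so an engineer can always unilaterally downgrade and opt out.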

+ +
+ [Figure: three-tier opt-in model]
+ Tier 0 (disabled): no SDK init, no network egress
+ Tier 1 (project anonymous): project.name, optional team
+ Tier 2 (personal named): adds enduser.id (engineer-chosen)
+ effective_tier = min(project_tier, personal_tier)
+ project = upper bound; personal = lower bound (can only equal or downgrade)
+ project=1, personal=2 → clamps to 1 + warning
+ project=2, personal=0 → effective 0 (unilateral opt-out)
+ Resolution matrix (project | personal | effective); "absent" means that config layer does not exist, and a missing layer is treated as that side being absent:
+ absent | absent | 0
+ 1 | absent | 1
+ 1 | 0 | 0 (opt-out)
+ 1 | 2 | 1 (clamped)
+ 2 | 2 | 2 (with enduser.id)
+ 2 | 0 | 0 (opt-out)
+ +
+Key constraint (spec § Personal config gitignore enforcement): if .ccxray.user.json is tracked by git, ccxray refuses to load the personal identity and suggests git rm --cached. +
+ +
+ Source: + specs/otel-tiers/spec.md § Three discrete tier values + + Source: + specs/otel-tiers/spec.md § Tier resolution rule + + Source: + specs/otel-tiers/spec.md § Engineer unilateral opt-out + + Source: + specs/otel-tiers/spec.md § Personal config gitignore enforcement + +
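The resolution rule can be written down in a few lines. This sketch assumes a missing config layer is passed as `undefined`, that a missing project config means tier 0, and that a missing personal config leaves the project tier in effect; `resolveEffectiveTier` is a hypothetical name, not a function from the change.

```javascript
'use strict';

// Hypothetical sketch of effective_tier = min(project_tier, personal_tier),
// with the project tier as upper bound and the personal tier as lower bound.
function resolveEffectiveTier(projectTier, personalTier) {
  if (projectTier === undefined) {
    return { tier: 0, clamped: false }; // no project config: OTel stays off
  }
  if (personalTier === undefined) {
    return { tier: projectTier, clamped: false };
  }
  return {
    tier: Math.min(projectTier, personalTier),
    // clamped = personal asked for more than the project allows (warning case)
    clamped: personalTier > projectTier,
  };
}

console.log(resolveEffectiveTier(1, 2)); // personal 2 under project 1
console.log(resolveEffectiveTier(2, 0)); // unilateral opt-out
```

Returning a `clamped` flag alongside the tier gives the caller what it needs to print the clamp warning without recomputing the comparison.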
+ + +

3. Config files and env interpolation

+ +

Two files: .ccxray.json (project level, checked into git) + .ccxray.user.json (personal level, gitignored). Every string value supports ${VAR} substitution from process.env. The schema rejects literal values that look like secrets.

+ +
+ [Figure: config loading]
+ .ccxray.json (repo, in git): { "otel": { "tier": 1, "endpoint": "https://...", "headers": { "Authorization": "Bearer ${TOKEN}" } } }
+ .ccxray.user.json (personal, gitignored): { "otel": { "tier": 2, "identity": "alice", "opt_in_acknowledged_at": "..." } }
+ config-loader.js: 1. Parse JSON; 2. Schema validate; 3. Interpolate ${VAR}; 4. Detect literal secrets
+ process.env: TOKEN=abc... (secret stays in env)
+ Loaded config: effective_tier = 1 (personal tier 2 clamped to project tier 1); Authorization: Bearer abc***
+ Startup FAIL: if literal Bearer, missing ${VAR}, or schema error
+ + + + + + + + + + + + + + + + + + + +
Input | Result
"Authorization": "Bearer ${TOKEN}" with TOKEN=abc123 set | ✓ Loaded; actual value Bearer abc123
"Authorization": "Bearer ${MISSING}" with the env var unset | ✗ Startup fails; the message includes file path / line / variable name MISSING
"Authorization": "Bearer abc123longtokenvalue..." | ✗ Schema rejects it and suggests switching to ${ENV_VAR}
JSON syntax error | ✗ Startup fails; the message includes line / column
+ +
+ Source: + specs/otel-config/spec.md § Project and personal config files + + Source: + specs/otel-config/spec.md § Environment variable interpolation + + Source: + specs/otel-config/spec.md § Literal-secret rejection + + Source: + specs/otel-config/spec.md § Config error fails fast at startup + +
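The interpolation and literal-secret rows in the table can be sketched as one function. This is an assumption-laden illustration, not config-loader.js itself: `loadConfigString` is a hypothetical name, the secret patterns follow the ones named in the design quote (Bearer token, sk_live_, ghp_, JWT structure), and the JWT check assumes the usual `eyJ` header prefix so that plain hostnames are not misread as tokens.

```javascript
'use strict';

// Literal-secret patterns, checked only against text outside ${...}.
const SECRET_PATTERNS = [
  /Bearer [A-Za-z0-9]{20,}/,
  /sk_live_[A-Za-z0-9]+/,
  /ghp_[A-Za-z0-9]+/,
  /eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/, // assumed JWT shape
];

const PLACEHOLDER = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;

// Hypothetical sketch: rejects literal secrets, then resolves ${VAR} from env.
function loadConfigString(raw, env) {
  const literalPart = raw.replace(PLACEHOLDER, '');
  if (SECRET_PATTERNS.some((pattern) => pattern.test(literalPart))) {
    throw new Error('literal secret detected; use ${ENV_VAR} instead');
  }
  return raw.replace(PLACEHOLDER, (match, name) => {
    if (env[name] === undefined) {
      throw new Error(`unresolved environment variable: ${name}`);
    }
    return env[name];
  });
}

const header = loadConfigString('Bearer ${OTLP_TOKEN}', { OTLP_TOKEN: 'abc123' });
console.log(header.length); // avoid printing the resolved secret itself
```

A real loader would also attach the file path and line number to both errors, as the scenarios require; the sketch leaves that out to stay short.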
+ + +

4. OTel health state machine

+ +

Four states, with transitions allowed only under the conditions recorded in the spec. Core promise: an OTel failure never blocks the ccxray proxy.

+ +
+ [Figure: OTel health state machine]
+ States: disabled (no SDK, no egress); active (SDK init OK, exporting); degraded (init failed; proxy still OK); circuit_open (exports paused, cooldown); half_open (single trial export)
+ Transition labels, in diagram order: [start]; tier=0 or no OTel pkg; tier≥1, init OK; tier≥1, init fails; 5 consecutive failures; trial OK; cooldown elapsed; trial fails → backoff
+ While active, queue full: drop oldest, exports_dropped_total++ (no state change)
+ Cooldown formula: next = min(previous * 2, 600s), starting from 60s after first trip
+ When degraded: ccxray proxy keeps working; no further OTel attempts; visible in ccxray status --otel until process restart.
+ +

Failure layers

+ + + + + + + + + + + + + + + + + + +
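Failure type | Example | Handling
Config error | JSON syntax error, schema violation, unresolved ${VAR} | Startup fails (exit code != 0)
Init error | Malformed endpoint URL | degraded; ccxray runs normally and status shows the error
Runtime error | Collector unreachable, auth failure, timeout | Handled by the circuit breaker with exponential backoff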
+ +
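The runtime-error row relies on a circuit breaker whose cooldown follows next = min(previous * 2, 600s), starting from 60s. A minimal sketch of that behaviour is below; `ExportBreaker` and its method names are hypothetical, time is passed in explicitly so the logic is testable, and the half_open state here allows more than one trial export, which is a simplification of the "single trial export" in the diagram.

```javascript
'use strict';

// Hypothetical sketch of the export circuit breaker. Callers are assumed to
// check canExport(now) before every export attempt.
class ExportBreaker {
  constructor({ threshold = 5, baseCooldownMs = 60000, maxCooldownMs = 600000 } = {}) {
    this.threshold = threshold;
    this.baseCooldownMs = baseCooldownMs;
    this.maxCooldownMs = maxCooldownMs;
    this.state = 'active';
    this.failures = 0;
    this.cooldownMs = 0;
    this.openUntil = 0;
  }

  // Returns true when an export attempt is allowed at time `now` (ms).
  canExport(now) {
    if (this.state === 'circuit_open' && now >= this.openUntil) {
      this.state = 'half_open'; // cooldown elapsed: allow a trial export
    }
    return this.state !== 'circuit_open';
  }

  recordSuccess() {
    this.state = 'active';
    this.failures = 0;
  }

  recordFailure(now) {
    if (this.state === 'half_open') {
      // Trial failed: double the cooldown, capped at the maximum.
      this.cooldownMs = Math.min(this.cooldownMs * 2, this.maxCooldownMs);
      this.trip(now);
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.cooldownMs = this.baseCooldownMs; // first trip uses the base cooldown
      this.trip(now);
    }
  }

  trip(now) {
    this.state = 'circuit_open';
    this.openUntil = now + this.cooldownMs;
  }
}

const breaker = new ExportBreaker();
for (let i = 0; i < 5; i += 1) {
  breaker.recordFailure(0);
}
console.log(breaker.state, breaker.cooldownMs);
```

Nothing in the breaker throws or blocks, which matches the never-block promise: a failed export only changes whether the next export is attempted.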

Persistence and capacity limits

+ +
    +
  • Append to ~/.ccxray/otel.log, 1 MB rotation, 5-file retention (default)
  • +
  • Export queue defaults to 2048; when full → drop oldest + ccxray.otel.exports_dropped_total{signal} ++
  • +
  • SDK shutdown has a hard 2-second limit; on timeout, force exit
  • +
+ +
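The drop-oldest queue in the list above can be sketched as a small class. `BoundedExportQueue` is a hypothetical name, and the plain `dropped` field stands in for the ccxray.otel.exports_dropped_total counter.

```javascript
'use strict';

// Hypothetical sketch: a bounded FIFO that evicts the oldest item when full,
// so enqueueing never blocks and never grows memory without limit.
class BoundedExportQueue {
  constructor(capacity = 2048) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0; // stand-in for exports_dropped_total
  }

  push(item) {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // drop oldest
      this.dropped += 1;
    }
    this.items.push(item);
  }

  // Hands the current batch to the exporter and empties the queue.
  drain() {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}

const queue = new BoundedExportQueue(3); // tiny capacity for illustration
['a', 'b', 'c', 'd'].forEach((item) => queue.push(item));
console.log(queue.items, queue.dropped);
```

`Array.prototype.shift` is linear in the queue length; at a capacity of 2048 that is unlikely to matter, but a ring buffer would be the obvious refinement if it did.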
+ Source: + specs/otel-health/spec.md § Four-state OTel health machine + + Source: + specs/otel-health/spec.md § Bounded export queue with drop-oldest semantics + + Source: + specs/otel-health/spec.md § Circuit breaker with exponential backoff + + Source: + specs/otel-health/spec.md § Failure log on local disk + + Source: + specs/otel-health/spec.md § Never-block guarantee for the proxy + + Source: + specs/otel-health/spec.md § Config errors fail fast, init/runtime errors degrade + +
+ + +

5. Parser pipeline and sentinels

+ +

Parsing of tool / MCP / skill / agent-type moves from scattered inline strings to versioned JSON schemas. Every entry runs the reconciliation invariants; an unrecognized event does not become 0 but increments a sentinel counter and is written to ~/.ccxray/parser-drift.log.

+ +
+ [Figure: parser pipeline]
+ Response from upstream: tool_use blocks, usage tokens, etc.
+ Parser dispatch (server/parsers/index.js): anthropic-tools.schema.json, anthropic-skills.schema.json, mcp-tools.schema.json, codex-tools.schema.json
+ Recognized → metrics: ccxray.tool.invocations_total{tool}, etc.
+ Unknown → sentinel: ccxray.parser.unknown_*_total ++, append to parser-drift.log
+ Invariant fail → mismatch: tool_use count ≠ extracted count? ccxray.parser.reconciliation_mismatch_total ++
+ Edge label: post-extract check
+ +

Metadata carried by every schema

+ +
+
{
+  "version": "1.0.0",
+  "last_verified_against": "2026-05-10",
+  "patterns": [ ... ],
+  "examples": [ ... ]
+}
+
+ +

Error isolation

+ +

Every parser is wrapped in try/catch. If one throws → ccxray.parser.error_total{parser,error_type} ++, the entry is still written to the local log, and the metric/span for that entry carries ccxray.parser.degraded=true. A parser failure does not affect the proxy path.

+ +
+ Source: + specs/parser-schemas/spec.md § Versioned parser schemas per concern and provider + + Source: + specs/parser-schemas/spec.md § Sentinel counters for unknown tokens + + Source: + specs/parser-schemas/spec.md § Reconciliation invariants + + Source: + specs/parser-schemas/spec.md § Parser error isolation + +
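Error isolation as described in this section amounts to a wrapper like the one below. It is a sketch only: `runParserSafely` and the in-memory `errorCounters` map are hypothetical, and using the thrown error's `name` as the `error_type` label is an assumption.

```javascript
'use strict';

// Hypothetical sketch: runs one parser and converts any exception into a
// counter increment plus a degraded marker, so the caller never sees a throw.
function runParserSafely(parserName, parseFn, entry, errorCounters) {
  try {
    return { result: parseFn(entry), degraded: false };
  } catch (err) {
    const errorType = err instanceof Error ? err.name : 'non_error';
    const key = `ccxray.parser.error_total{parser="${parserName}",error_type="${errorType}"}`;
    errorCounters.set(key, (errorCounters.get(key) || 0) + 1);
    return { result: null, degraded: true };
  }
}

const errorCounters = new Map();
const ok = runParserSafely('tools', (entry) => entry.name, { name: 'Bash' }, errorCounters);
const failed = runParserSafely('tools', (entry) => entry.missing.name, {}, errorCounters);
console.log(ok, failed, [...errorCounters.keys()]);
```

The `degraded` flag is what a caller would translate into the ccxray.parser.degraded=true attribute on the entry's metrics.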
+ + +

6. Coexisting with the CLI's built-in OTel

+ +

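The Claude Code CLI also has built-in OTel. When ccxray detects CLAUDE_CODE_ENABLE_TELEMETRY=1 it enters complement mode and adds the ccxray.cli_otel_active=true attribute to every emit. ccxray never turns off its own emit (because the CLI has no Codex support, ccxray observes the HTTP truth, and the diff between the two sides is itself a valuable signal).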

+ +
+ [Figure: standalone mode vs complement mode]
+ Standalone mode (CLAUDE_CODE_ENABLE_TELEMETRY unset): Claude Code CLI (no OTel); ccxray emits ccxray.*; single source of truth; banner: "ccxray OTel tier: 1 (anonymous)"
+ Complement mode (CLAUDE_CODE_ENABLE_TELEMETRY=1): CLI emits claude_code.*; ccxray emits ccxray.* with cli_otel_active=true; reconciliation: ccxray.reconciliation.token_diff_pct{model}; both flow to the user's collector, distinguishable via resource attribute ccxray.source="ccxray-proxy"
+ +
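+Why ccxray's emit stays on: (1) the CLI's built-in OTel covers Anthropic only, not Codex / Gemini; (2) ccxray sees the HTTP truth, a different vantage point from the CLI; (3) a diff between the two sides is a high-value alert (it means one side is computing pricing wrong). +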
+ +
+ Source: + specs/otel-export/spec.md § CLI OTel coexistence and complement mode + + Source: + specs/otel-export/spec.md § Source resource attribute on every emit + + Source: + specs/otel-export/spec.md § Reconciliation diff metric + + Source: + specs/otel-export/spec.md § `ccxray.*` namespace for all emitted metrics + +
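Complement-mode detection reduces to one environment check. The sketch below assumes the attribute names given in this section; `buildEmitAttributes` and `RESOURCE_ATTRIBUTES` are hypothetical names, and the environment is passed in rather than read from `process.env` so the behaviour is easy to test.

```javascript
'use strict';

// Resource attribute carried by every emit, per this section.
const RESOURCE_ATTRIBUTES = { 'ccxray.source': 'ccxray-proxy' };

// Hypothetical sketch: adds the complement-mode attribute only when the CLI's
// own telemetry is switched on; otherwise the attribute is left unset.
function buildEmitAttributes(env, baseAttributes = {}) {
  const attributes = { ...baseAttributes };
  if (env.CLAUDE_CODE_ENABLE_TELEMETRY === '1') {
    attributes['ccxray.cli_otel_active'] = true;
  }
  return attributes;
}

const standalone = buildEmitAttributes({}, { model: 'example-model' });
const complement = buildEmitAttributes(
  { CLAUDE_CODE_ENABLE_TELEMETRY: '1' },
  { model: 'example-model' }
);
console.log(RESOURCE_ATTRIBUTES, standalone, complement);
```

Leaving the attribute absent in standalone mode, rather than setting it to `false`, follows the spec scenario that says the attribute shall not be set.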
+ + +

7. Cardinality budget

+ +

Each metric declares which attribute keys are allowed and the maximum number of unique values per key. A key outside the allow-list → dropped outright (OTel View API). A value over budget → recorded as _overflow_, and the sentinel counter is incremented.

+ +
+ [Figure: cardinality budget for ccxray.tool.invocations_total, tool budget: 50]
+ 23 used, 27 remaining
+ Incoming attribute tool="Bash": accepted as-is; overflow_total: 0
+ When the 51st unique value arrives, tool="FancyNewToolThatNobodyKnows": recorded as tool="_overflow_"; ccxray.metrics.overflow_total{metric=..., attribute="tool"} ++
+ +

High-cardinality fields that must not be metric labels (per design.md § D4)

+ +
+The spec does not list an explicit blacklist, but design.md § D4 states that bash.command_pattern and file_path are explicitly NOT used as metric labels, to avoid a cardinality explosion. +
+ +
+ Source: + specs/otel-export/spec.md § Cardinality budget enforcement + + Source: + design.md § D4. Cardinality budget with overflow fallback + +
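The value-budget half of this section can be sketched as a per-attribute tracker. `AttributeBudget` is a hypothetical name and `overflowCount` stands in for ccxray.metrics.overflow_total; the key allow-list half is handled by the OTel View API in the design and is not shown here.

```javascript
'use strict';

// Hypothetical sketch: admits up to `limit` unique values for one
// (metric, attribute) pair and maps every further new value to "_overflow_".
class AttributeBudget {
  constructor(limit) {
    this.limit = limit;
    this.seen = new Set();
    this.overflowCount = 0; // stand-in for ccxray.metrics.overflow_total
  }

  admit(value) {
    if (this.seen.has(value)) {
      return value; // already within budget
    }
    if (this.seen.size < this.limit) {
      this.seen.add(value);
      return value;
    }
    this.overflowCount += 1;
    return '_overflow_';
  }
}

const toolBudget = new AttributeBudget(2); // the doc's example budget is 50
const recorded = ['Bash', 'Read', 'FancyNewTool', 'Bash'].map((tool) => toolBudget.admit(tool));
console.log(recorded, toolBudget.overflowCount);
```

Values admitted early keep their slot for the life of the process, so the series count stays bounded no matter how many distinct names arrive later.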
+ + +

8. New files and module relationships

+ +
+ [Figure: new files and module relationships]
+ New modules: server/config-loader.js (schema · ${VAR} · secrets · gitignore); server/otel-health.js (state machine · queue · breaker · log); server/otel.js (SDK init · registry · cardinality · source); server/parsers/ (*.schema.json, index.js); test/fixtures/parser/ (snapshot fixtures); package.json (minimal OTel deps · lazy require)
+ Modified existing files: server/forward.js (emit metrics on request complete); server/store.js (thin shim over parsers); server/system-prompt.js (skill marker via schema); server/hub.js (no business metrics, doc comment); bin/ccxray.js (status --otel · otel preview · parser report)
+ Phase 2 follow-up (NOT in this change): span emit (traces) with ccxray.entry_id, dashboard_url; /entry/:id route (deep-link drill-back UI)
+ Labels: deps; ↑ snapshot tests run against parsers/
+ +
+ Source: + proposal.md § Impact + + Source: + tasks.md § Tasks 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 + +
+ + +

9. Phase 1 scope vs Phase 2

+ +
+ +
+

✓ Phase 1: scope of this change

+
    +
  • Metrics emit (ccxray.* namespace)
  • +
  • Three-tier opt-in (default OFF)
  • +
  • .ccxray.json / .ccxray.user.json + ${VAR} interpolation
  • +
  • OTel health (state machine, queue, breaker)
  • +
  • Parser schemas + sentinels + reconciliation
  • +
  • CLI coexistence detection + reconciliation diff metric
  • +
  • ccxray status --otel / otel preview / parser report
  • +
  • Startup banner, secrets masking
  • +
+
+ +
+

✗ Phase 2 follow-up

+
    +
  • Span emit (traces)
  • +
  • ccxray.entry_id / dashboard_url attributes
  • +
  • /entry/:id deep-link route
  • +
  • ccxray.hub.* operational metrics (open question)
  • +
  • --otel-demo Docker Compose helper (open question)
  • +
+
+ +
+ +
+ Source: + proposal.md § What Changes (last bullet: Out of scope) + + Source: + design.md § Non-Goals + + Source: + design.md § Open Questions + +
+ + +

10. Full metric list

+ +

All metrics live in the ccxray.* namespace, and every emit carries the resource attribute ccxray.source="ccxray-proxy". In complement mode they additionally carry ccxray.cli_otel_active=true.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Family | Metric | Attributes
Cost | ccxray.tokens.input_total | model, provider *
Cost | ccxray.tokens.output_total | model, provider *
Cost | ccxray.tokens.cache_read_total | model, provider *
Cost | ccxray.tokens.cache_creation_total | model, provider *
Cost | ccxray.cost.usd_total | model, provider *
Cost | ccxray.cache.hit_ratio (gauge) | model, provider *
Usage | ccxray.tool.invocations_total | tool, provider
Usage | ccxray.mcp.invocations_total | server, tool
Usage | ccxray.skill.activations_total | skill, provider
Usage | ccxray.sessions_total | provider
Usage | ccxray.agent_type.invocations_total | type
Quality | ccxray.errors_total | type, provider
Quality | ccxray.stop_reason_total | reason
Quality | ccxray.latency_ms (histogram) | model, provider
Quality | ccxray.max_tokens_hit_total | model
Patterns | ccxray.context.utilization_pct (histogram) |
Patterns | ccxray.auto_compact.triggered_total |
Patterns | ccxray.subagent.invocations_total |
Patterns | ccxray.tools_per_turn (histogram) |
Governance | ccxray.permission_mode.usage_total | mode
Governance | ccxray.dangerous_tool.invocations_total | pattern
Governance | ccxray.file_writes_total |
Governance | ccxray.provider.distribution_total | provider
Sentinels | ccxray.metrics.overflow_total | metric, attribute
Sentinels | ccxray.parser.unknown_tool_total | provider
Sentinels | ccxray.parser.unknown_skill_marker_total | provider
Sentinels | ccxray.parser.unknown_mcp_format_total |
Sentinels | ccxray.parser.fallback_used_total | parser, reason
Sentinels | ccxray.parser.reconciliation_mismatch_total | type
Sentinels | ccxray.parser.error_total | parser, error_type
Sentinels | ccxray.otel.exports_dropped_total | signal
Sentinels | ccxray.otel.state (gauge) | state
CLI reconciliation | ccxray.reconciliation.token_diff_pct (gauge) | model
Tier observability | ccxray.otel.tier_distribution | tier
+ +

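+* The attributes for the Cost family are not listed one by one in specs/otel-export/spec.md § Required metric families (the spec only annotates them inline in the Usage / Quality parts). This table pre-fills the dimensions commonly used at implementation time (model, provider); the actual attribute registration list is defined by the metric registry in server/otel.js (Tasks § 4.5). +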

+ +
+ Source: + specs/otel-export/spec.md § Required metric families + + Source: + specs/otel-export/spec.md § Cardinality budget enforcement + + Source: + specs/otel-export/spec.md § Reconciliation diff metric + + Source: + specs/parser-schemas/spec.md § Sentinel counters / Reconciliation invariants / Parser error isolation + + Source: + specs/otel-health/spec.md § Bounded export queue / Health state observable + + Source: + specs/otel-tiers/spec.md § Tier distribution sentinel + +
+ +

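+This file lives at docs/otel-phase1-overview.html, and its content follows the proposal / design / specs / tasks documents in openspec/changes/add-otel-metrics-phase1/ strictly. Every claim links to its source; if you find any visual that disagrees with the spec, treat the spec as authoritative and report it. +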

+ +
+ + + From e840d311a6dc134bb2be972ce7ee6c3fed236a96 Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Wed, 13 May 2026 15:05:51 +0800 Subject: [PATCH 03/10] refactor(server): remove duplicate fs require in startServer error handler fs is already imported at the top of server/index.js (line 5). The local require inside the catch block shadowed it without purpose. Co-Authored-By: Claude Opus 4.7 (1M context) --- server/index.js | 1 - 1 file changed, 1 deletion(-) diff --git a/server/index.js b/server/index.js index c47ce40..cb32371 100755 --- a/server/index.js +++ b/server/index.js @@ -775,7 +775,6 @@ async function startServer() { if (acquired) hub.releaseForkLock(); console.error(`\x1b[31m${err.message}\x1b[0m`); // Show last hub log lines so user doesn't have to open the file - const fs = require('fs'); try { const log = fs.readFileSync(hub.HUB_LOG_PATH, 'utf8'); const lines = log.trim().split('\n'); From feee026e21e9a483e33266b851599280120aa525 Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Wed, 13 May 2026 15:13:05 +0800 Subject: [PATCH 04/10] docs(otel): pivot reconciliation gauge to downstream pattern MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Expert review (Sigelman / Majors / Sridharan) converged on dropping ccxray.reconciliation.token_diff_pct{model} from Phase 1: - Pre-aggregated diff gauge cannot answer "which request diverged" - Legitimate non-zero diffs (SSE chunking, retries, prompt-cache edges) produce alert fatigue - Acquiring CLI counts in-process either couples ccxray to user storage backends or turns ccxray into an OTLP receiver — violates instrumentation neutrality and expands blast radius Phase 1 now emits ccxray-internal invariants only (parser sums, SSE truncation). Cross-source reconciliation moves to docs/otel-recon.md as a downstream pattern (recording rules, sidecar, wide-event join on request_id). 
- tasks.md §4.7 reconciliation task removed - tasks.md §4.8 replaced with internal invariants - tasks.md §9.6 added (docs/otel-recon.md) - specs/otel-export/spec.md requirement rewritten with 3 new scenarios including explicit non-emit of ccxray.reconciliation.* - proposal.md and design.md updated; design.md records the pivot rationale openspec validate --strict passes. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../changes/add-otel-metrics-phase1/design.md | 2 +- .../add-otel-metrics-phase1/proposal.md | 4 ++-- .../specs/otel-export/spec.md | 23 ++++++++++++------- .../changes/add-otel-metrics-phase1/tasks.md | 5 ++-- 4 files changed, 21 insertions(+), 13 deletions(-) diff --git a/openspec/changes/add-otel-metrics-phase1/design.md b/openspec/changes/add-otel-metrics-phase1/design.md index e584e80..0641cdc 100644 --- a/openspec/changes/add-otel-metrics-phase1/design.md +++ b/openspec/changes/add-otel-metrics-phase1/design.md @@ -61,7 +61,7 @@ This means different projects connecting to the same hub can configure different Every metric uses `ccxray..` (`ccxray.tokens.input_total`, `ccxray.tool.invocations_total`, etc.). Every emit carries the resource attribute `ccxray.source="ccxray-proxy"`. When the CLI's `CLAUDE_CODE_ENABLE_TELEMETRY=1` is detected, ccxray enters "complement mode" and adds `ccxray.cli_otel_active=true` to its emits, plus a startup notice explaining how to choose between the two metric families. -A new reconciliation metric `ccxray.reconciliation.token_diff_pct{model}` exposes the percentage difference between ccxray's HTTP-observed token counts and what the CLI reports (when both are running). A persistent non-zero diff indicates a pricing or accounting bug on one side and is itself a high-value signal. +**Cross-source reconciliation: pivoted to downstream.** An earlier version of this design proposed emitting `ccxray.reconciliation.token_diff_pct{model}` as a gauge. 
After expert review (Sigelman / Majors / Sridharan), Phase 1 drops the in-proxy diff gauge for these reasons: (1) the gauge is pre-aggregated and cannot answer "which request diverged"; (2) the diff is rarely zero for legitimate reasons (SSE chunking, retries, prompt-cache edge cases) → alert fatigue; (3) acquiring the CLI's counts in-process requires either querying the user's storage backend (couples ccxray to Prom/OTLP dialects) or embedding an OTLP receiver (turns ccxray from proxy into telemetry product, violating instrumentation neutrality and expanding blast radius). Instead, ccxray emits faithful per-request signals and ccxray-internal invariant metrics (`ccxray.invariants.*`); cross-source reconciliation against the CLI is a downstream concern — see `docs/otel-recon.md` for recording-rule / sidecar / wide-event join recipes. A `--debug-reconcile` ad-hoc flag may be reconsidered in a later phase. **Alternatives considered:** diff --git a/openspec/changes/add-otel-metrics-phase1/proposal.md b/openspec/changes/add-otel-metrics-phase1/proposal.md index f07267f..c490321 100644 --- a/openspec/changes/add-otel-metrics-phase1/proposal.md +++ b/openspec/changes/add-otel-metrics-phase1/proposal.md @@ -9,7 +9,7 @@ This change adds Phase 1: emit ccxray's metrics over OTLP, gated behind a defaul - New optional metric export under `ccxray.*` namespace covering cost, usage (tool / MCP / skill / agent_type / provider), quality (errors, stop_reason, latency, max_tokens_hit_rate), patterns (context_utilization, auto_compact_triggered, subagent_ratio, tools_per_turn) and governance (permission_mode, dangerous_tool, file_writes). - New configuration files: `.ccxray.json` (repo, project-level) and `.ccxray.user.json` (gitignored, personal). `${ENV_VAR}` interpolation. Schema rejects literal-looking secrets. Auto-add `.ccxray.user.json` to `.gitignore` if missing. - Three-tier opt-in model: **tier 0 disabled (default)** / tier 1 anonymous project-level / tier 2 personal named. 
Project config is the upper bound; personal config can only equal or downgrade. Engineers can opt out unilaterally. -- Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and enter "complement mode" with `ccxray.cli_otel_active=true` attribute; every metric carries `ccxray.source="ccxray-proxy"` resource attribute. New reconciliation metric `ccxray.reconciliation.token_diff_pct` cross-checks ccxray vs CLI accounting. +- Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and enter "complement mode" with `ccxray.cli_otel_active=true` attribute; every metric carries `ccxray.source="ccxray-proxy"` resource attribute. ccxray emits ccxray-internal invariant metrics (`ccxray.invariants.*`); cross-source reconciliation against the CLI is documented as a downstream pattern (recording rules / sidecar / wide-event join on `request_id`) in `docs/otel-recon.md`, not as an in-proxy gauge — keeps ccxray as a transparent proxy with bounded blast radius. - Cardinality budget per (metric, attribute) with `_overflow_` fallback and `ccxray.metrics.overflow_total` sentinel; attribute key allow-list enforced via OTel View API. - Parser schema-ization: extract tool / MCP / skill detection into `server/parsers/*.schema.json` with snapshot fixtures, sentinel metrics (`ccxray.parser.unknown_*_total`), and reconciliation invariants (tool_use block count must equal extracted count). - Failure fallback: config errors fail fast at startup; init errors degrade silently (ccxray keeps proxying); runtime errors handled by bounded queue (drop oldest) + circuit breaker (5 failures → open 60s → exponential backoff). OTel failures **never** break the proxy. @@ -23,7 +23,7 @@ This change adds Phase 1: emit ccxray's metrics over OTLP, gated behind a defaul ### New Capabilities - `otel-config`: `.ccxray.json` and `.ccxray.user.json` schema, `${ENV_VAR}` interpolation, literal-secret rejection, `.gitignore` auto-amend, project-upper-bound + personal-lower-bound merging rules. 
-- `otel-export`: OTel SDK initialization (client-side, not hub), metric definitions under `ccxray.*` namespace, `ccxray.source` resource attribute, cardinality budget enforcement with `_overflow_` fallback, CLI coexistence detection and complement-mode signaling, reconciliation diff metric. +- `otel-export`: OTel SDK initialization (client-side, not hub), metric definitions under `ccxray.*` namespace, `ccxray.source` resource attribute, cardinality budget enforcement with `_overflow_` fallback, CLI coexistence detection and complement-mode signaling, ccxray-internal invariant metrics, explicit non-emit of cross-source diff gauge (deferred to downstream). - `otel-tiers`: three-tier opt-in (disabled / project-anonymous / personal-named), tier resolution with project as upper bound and personal as lower bound, `enduser.id` attribute only in tier 2, opt-in acknowledgment timestamp persisted in personal config. - `otel-health`: failure state machine (`disabled / active / degraded / circuit_open`), bounded export queue with drop-oldest semantics, circuit breaker with exponential backoff, local failure log at `~/.ccxray/otel.log` with rotation, never-block guarantee for the proxy path. - `parser-schemas`: extract skill / MCP / tool / agent-type detection into versioned JSON schemas, snapshot fixtures per provider (Anthropic + Codex), sentinel metrics for unknown events, reconciliation invariants run per entry, try/catch isolation so parser failure does not affect ccxray core. 
diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md index 5012246..f4c0112 100644 --- a/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md @@ -70,19 +70,26 @@ ccxray SHALL detect the presence of `CLAUDE_CODE_ENABLE_TELEMETRY=1` in the envi - **WHEN** ccxray starts without `CLAUDE_CODE_ENABLE_TELEMETRY` - **THEN** ccxray SHALL print a notice indicating standalone mode and the attribute `ccxray.cli_otel_active` SHALL NOT be set -### Requirement: Reconciliation diff metric +### Requirement: Internal invariant metrics; cross-source reconciliation is a downstream concern -ccxray SHALL emit `ccxray.reconciliation.token_diff_pct{model}` as a gauge that compares ccxray's HTTP-observed token counts against the corresponding values reported by the CLI when both are active. The metric SHALL be emitted only when both ccxray and CLI OTel signals are available. +ccxray SHALL emit invariant metrics that describe ccxray-internal consistency only. ccxray SHALL NOT emit a cross-source diff metric (e.g. ccxray vs CLI token counts) as part of Phase 1. Cross-source reconciliation SHALL be performed by downstream consumers (recording rules, Grafana panels, sidecar processes) using `request_id` or `session_id` joins on per-request metrics emitted independently by ccxray and the CLI. -#### Scenario: Both active with matching counts +Rationale: A pre-aggregated diff gauge cannot answer "which request diverged" and produces persistent non-zero values for legitimate reasons (SSE chunking boundaries, retries, prompt-caching edge cases), creating alert fatigue. ccxray's correct role is to emit faithful per-request signals; cross-source diff is an analytical task that belongs in the user's observability tier, where it can be expressed as a derived series. 
-- **WHEN** ccxray and CLI emit the same token count for the same request -- **THEN** `ccxray.reconciliation.token_diff_pct` SHALL be 0 for the affected model +#### Scenario: Parser sum invariant -#### Scenario: Mismatch detected +- **WHEN** ccxray's parser extracts a sum of per-tool token attributions that differs from the upstream `usage` block totals for the same response +- **THEN** `ccxray.invariants.parser_mismatch_total{type="token_sum"}` SHALL increment -- **WHEN** ccxray observes input_tokens=1000 and CLI reports input_tokens=1050 for the same request -- **THEN** the metric SHALL emit approximately 5.0 (percent difference) +#### Scenario: SSE stream completeness invariant + +- **WHEN** ccxray observes the upstream SSE stream terminating without a `[DONE]` (Anthropic) or `response.completed` (OpenAI Responses) terminal event +- **THEN** `ccxray.invariants.sse_truncated_total{provider}` SHALL increment + +#### Scenario: No cross-source diff gauge is emitted + +- **WHEN** OTel is enabled at any tier +- **THEN** no metric whose name matches `ccxray.reconciliation.*` SHALL be registered with the SDK in Phase 1 ### Requirement: Required metric families diff --git a/openspec/changes/add-otel-metrics-phase1/tasks.md b/openspec/changes/add-otel-metrics-phase1/tasks.md index ab15634..2a6e1ef 100644 --- a/openspec/changes/add-otel-metrics-phase1/tasks.md +++ b/openspec/changes/add-otel-metrics-phase1/tasks.md @@ -36,8 +36,8 @@ - [ ] 4.4 Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and apply `ccxray.cli_otel_active=true` attribute in complement mode - [ ] 4.5 Register all metric families per `otel-export/spec.md`: cost, usage, quality, patterns, governance - [ ] 4.6 Register sentinel metrics: overflow, parser unknowns, parser mismatches, otel state, reconciliation diff, tier distribution -- [ ] 4.7 Implement reconciliation diff gauge `ccxray.reconciliation.token_diff_pct{model}` computed against detected CLI telemetry -- [ ] 4.8 Implement export-time masking of any value 
resolved from `${ENV_VAR}` for log lines and trace dumps +- [ ] 4.7 Implement export-time masking of any value resolved from `${ENV_VAR}` for log lines and trace dumps +- [ ] 4.8 Implement internal invariant metrics (`ccxray.invariants.parser_mismatch_total{type}`, `ccxray.invariants.sse_truncated_total{provider}`); cross-source diff against the CLI is NOT in Phase 1 and is documented as a downstream pattern instead - [ ] 4.9 Unit tests for namespace lint (no metric name starts with `claude_code.`), source attribute presence, budget enforcement, complement mode attribute, lazy SDK init at tier 0 ## 5. Parser schema-ization (`server/parsers/`) @@ -84,6 +84,7 @@ - [ ] 9.3 Reference `docs/otel-integration.html` (existing) as the design record from README - [ ] 9.4 Update README with a single section: "Optional: send metrics to your observability backend" linking to quickstart and ethics docs - [ ] 9.5 Update `CLAUDE.md` Architecture section to note the new modules and their roles +- [ ] 9.6 Add `docs/otel-recon.md` (bilingual): why cross-source reconciliation is a downstream concern, recording-rule / Grafana-panel / sidecar recipes for diffing ccxray vs CLI counts on `request_id` ## 10. Verification gates From 5dc7251cfda1cdaf98cf016303de5c9baaec1506 Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Wed, 13 May 2026 15:13:29 +0800 Subject: [PATCH 05/10] refactor(server): extract argv parsing into server/cli.js MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pulls the --port / --hub-mode / --allow-upstream-loop / --no-browser flag detection and provider lookup out of server/index.js (793 LOC) into a new server/cli.js (63 LOC) so future CLI subcommands can be added without growing the entry-point file further.
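For readers unfamiliar with the pattern being moved, the boolean flags listed above are handled by a detect-and-splice step on the argv array. The following is a minimal sketch of that idea only: `takeFlag` is a hypothetical helper (server/cli.js inlines the logic per flag), and the sample argv, including the `claude` positional, is made up for illustration.

```javascript
'use strict';

// Hypothetical helper for illustration; server/cli.js inlines this logic
// per flag rather than exporting a function like this.
function takeFlag(argv, flag) {
  const idx = argv.indexOf(flag);
  if (idx === -1) return false;
  argv.splice(idx, 1); // strip the consumed flag in place
  return true;
}

// Made-up argv for the example: node binary, script path, then user arguments.
const argv = ['node', 'ccxray', '--no-browser', 'claude', '--hub-mode'];
const noBrowser = takeFlag(argv, '--no-browser');
const hubMode = takeFlag(argv, '--hub-mode');

// After stripping, the first positional argument sits at argv[2].
console.log(noBrowser, hubMode, argv[2]);
```

Because consumed flags are removed from the array, later code can read the first positional argument at a fixed index regardless of where the flags appeared on the command line.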
Behaviour is preserved: parseArgs still mutates process.argv in place to strip consumed flags (matches existing assumptions), still exits on invalid --port values or unknown providers, and still derives DISPLAY_NAME through providers.getDisplayName. Phase 0a-ii of the add-otel-metrics-phase1 OpenSpec change — clears space for §7 (status --otel / otel preview / parser report) without piling new subcommands onto an already long entry point. Tests: 456 passing, 1 pre-existing Codex E2E failure (unchanged from baseline before this commit). Co-Authored-By: Claude Opus 4.7 (1M context) --- server/cli.js | 63 +++++++++++++++++++++++++++++++++++++++++++++++++ server/index.js | 48 +++++++++++-------------------------- 2 files changed, 77 insertions(+), 34 deletions(-) create mode 100644 server/cli.js diff --git a/server/cli.js b/server/cli.js new file mode 100644 index 0000000..0124a36 --- /dev/null +++ b/server/cli.js @@ -0,0 +1,63 @@ +'use strict'; + +// CLI argv parsing for ccxray. Splits flag detection from server/index.js so +// new subcommands can be added without growing the entry-point file. Mutates +// process.argv in place to strip consumed flags (existing behaviour). 
+ +const providers = require('./providers'); + +function parseArgs(argv = process.argv, env = process.env) { + const portIdx = argv.indexOf('--port'); + let explicitPort = false; + let port = null; + if (portIdx !== -1) { + const portVal = argv[portIdx + 1]; + const parsed = parseInt(portVal, 10); + if (!portVal || isNaN(parsed) || parsed < 1 || parsed > 65535) { + console.error('\x1b[31mError: --port requires a valid port number (1-65535)\x1b[0m'); + process.exit(1); + } + port = parsed; + explicitPort = true; + argv.splice(portIdx, 2); + } + + const hubMode = argv.includes('--hub-mode'); + if (hubMode) argv.splice(argv.indexOf('--hub-mode'), 1); + + const allowUpstreamLoop = argv.includes('--allow-upstream-loop') || env.CCXRAY_ALLOW_UPSTREAM_LOOP === '1'; + if (argv.includes('--allow-upstream-loop')) argv.splice(argv.indexOf('--allow-upstream-loop'), 1); + + const noBrowser = argv.includes('--no-browser'); + if (noBrowser) argv.splice(argv.indexOf('--no-browser'), 1); + + const cliCommand = argv[2]; + const unknownCommand = cliCommand + && cliCommand !== 'status' + && !cliCommand.startsWith('-') + && !providers.isAgentProvider(cliCommand); + if (unknownCommand) { + console.error(`\x1b[31mError: unsupported provider "${cliCommand}". Supported providers: ${providers.supportedProviderList()}\x1b[0m`); + process.exit(1); + } + + const agentCommand = providers.isAgentProvider(cliCommand) ? cliCommand : null; + const agentMode = Boolean(agentCommand); + const agentArgs = agentMode ? 
argv.slice(3) : []; + const displayName = providers.getDisplayName(agentCommand, env); + + return { + port, + explicitPort, + hubMode, + allowUpstreamLoop, + noBrowser, + cliCommand, + agentCommand, + agentMode, + agentArgs, + displayName, + }; +} + +module.exports = { parseArgs }; diff --git a/server/index.js b/server/index.js index cb32371..2abc32a 100755 --- a/server/index.js +++ b/server/index.js @@ -18,40 +18,20 @@ const { authMiddleware } = require('./auth'); const { extractAgentType, extractPromptAgentType, splitB2IntoBlocks } = require('./system-prompt'); const { findSharedPrefix } = require('./delta-helpers'); const providers = require('./providers'); - -// ── CLI: parse flags and detect provider launchers ── -const portIdx = process.argv.indexOf('--port'); -let explicitPort = false; -if (portIdx !== -1) { - const portVal = process.argv[portIdx + 1]; - const parsed = parseInt(portVal, 10); - if (!portVal || isNaN(parsed) || parsed < 1 || parsed > 65535) { - console.error('\x1b[31mError: --port requires a valid port number (1-65535)\x1b[0m'); - process.exit(1); - } - config.PORT = parsed; - explicitPort = true; - process.argv.splice(portIdx, 2); -} -const hubMode = process.argv.includes('--hub-mode'); -if (hubMode) process.argv.splice(process.argv.indexOf('--hub-mode'), 1); -const allowUpstreamLoop = process.argv.includes('--allow-upstream-loop') || process.env.CCXRAY_ALLOW_UPSTREAM_LOOP === '1'; -if (process.argv.includes('--allow-upstream-loop')) process.argv.splice(process.argv.indexOf('--allow-upstream-loop'), 1); -const noBrowser = process.argv.includes('--no-browser'); -if (noBrowser) process.argv.splice(process.argv.indexOf('--no-browser'), 1); -const cliCommand = process.argv[2]; -const unknownCommand = cliCommand - && cliCommand !== 'status' - && !cliCommand.startsWith('-') - && !providers.isAgentProvider(cliCommand); -if (unknownCommand) { - console.error(`\x1b[31mError: unsupported provider "${cliCommand}". 
Supported providers: ${providers.supportedProviderList()}\x1b[0m`); - process.exit(1); -} -const agentCommand = providers.isAgentProvider(cliCommand) ? cliCommand : null; -const agentMode = Boolean(agentCommand); -const agentArgs = agentMode ? process.argv.slice(3) : []; -const DISPLAY_NAME = providers.getDisplayName(agentCommand, process.env); +const { parseArgs } = require('./cli'); + +const { + port: cliPort, + explicitPort, + hubMode, + allowUpstreamLoop, + noBrowser, + agentCommand, + agentMode, + agentArgs, + displayName: DISPLAY_NAME, +} = parseArgs(); +if (cliPort != null) config.PORT = cliPort; // In agent/hub mode, mute startup logs so they don't pollute output. const _origLog = console.log; From 35b458a7aa5420338ef403459a51989473369f43 Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Wed, 13 May 2026 16:17:23 +0800 Subject: [PATCH 06/10] feat(server): add internal event bus for OTel/parser hooks MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add server/emit.js — a minimal on/emit primitive with synchronous dispatch, no subscribers by default (O(1) no-op for tier 0), and try/catch isolation so a buggy subscriber cannot break the proxy path. This is the "drum" for the OTel work (Phase 1 of the add-otel-metrics-phase1 plan): forward.js and store.js will emit events here in a later phase, and the OTel SDK / parser sentinels will subscribe. Wiring callers is intentionally deferred — this commit ships the contract only, keeping the surface review-able. Defined events (payload shapes locked for Phase 1): - entry_completed { entry } - session_started { sessionId, provider, inferred } - parser_unknown { provider, kind, token } - parser_mismatch { type, expected, got, entryId? 
} - parser_error { parser, errorType, message } Refs: openspec/changes/add-otel-metrics-phase1 §6.1–6.3, §5.9–5.11 Co-Authored-By: Claude Opus 4.7 (1M context) --- server/emit.js | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 server/emit.js diff --git a/server/emit.js b/server/emit.js new file mode 100644 index 0000000..050a239 --- /dev/null +++ b/server/emit.js @@ -0,0 +1,40 @@ +'use strict'; + +// Internal event bus for OTel handlers, parser sentinels, and future status hooks. +// +// Phase D (OTel SDK init) registers subscribers; Phase E wires emit() calls in +// forward.js / store.js. With no subscribers, emit() is an O(1) no-op — tier 0 +// pays zero cost. +// +// Handlers run synchronously and MUST NOT throw into the proxy code path; this +// module wraps every dispatch in try/catch so a buggy subscriber cannot break +// request forwarding. +// +// Defined events (payload shape stable across Phase 1): +// entry_completed { entry } +// session_started { sessionId, provider, inferred } +// parser_unknown { provider, kind, token } +// parser_mismatch { type, expected, got, entryId? 
} +// parser_error { parser, errorType, message } + +const subscribers = new Map(); + +function on(event, handler) { + if (typeof handler !== 'function') throw new TypeError('handler must be a function'); + if (!subscribers.has(event)) subscribers.set(event, new Set()); + subscribers.get(event).add(handler); + return () => subscribers.get(event)?.delete(handler); +} + +function emit(event, payload) { + const set = subscribers.get(event); + if (!set || set.size === 0) return; + for (const handler of set) { + try { handler(payload); } + catch (err) { + try { console.error(`[emit] handler "${event}":`, err && err.message); } catch {} + } + } +} + +module.exports = { on, emit }; From 6378e4e42d2f7719da77ded6e97368f649522b51 Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Wed, 13 May 2026 17:02:04 +0800 Subject: [PATCH 07/10] feat(otel): add OTel deps, lazy require helper, and minimal config loader MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 2a of the OpenSpec change add-otel-metrics-phase1 (vertical slice spike, dependency layer): - package.json: pin minimal OTel deps — @opentelemetry/api, @opentelemetry/sdk-metrics, @opentelemetry/exporter-metrics-otlp-http, @opentelemetry/resources. No auto-instrumentation packages, per design decision D-deps in the change proposal. - server/otel-lazy.js: tryRequire()/isAvailable() helpers so ccxray keeps running at tier 0 when OTel packages are absent (e.g. a slimmed install). Whitelist of known package names; unknown names throw to catch typos. - server/config-loader.js: minimum-viable reader for .ccxray.json. Returns a frozen DEFAULT_CONFIG when the file is absent; parses and validates the otel block when present. Env interpolation, secret detection, gitignore amend, and personal-config (.ccxray.user.json) lookup land in later Phase 2 sub-phases per the change tasks. 
- test/config-loader.test.js: 7 unit tests covering defaults, parsed values, malformed JSON, non-integer tier coercion, and the lazy require helper. No existing runtime path imports these new modules yet; the proxy and hub behavior is unchanged. --- package-lock.json | 395 ++++++++++++++++++++++++++++++++++++- package.json | 4 + server/config-loader.js | 64 ++++++ server/otel-lazy.js | 35 ++++ test/config-loader.test.js | 85 ++++++++ 5 files changed, 581 insertions(+), 2 deletions(-) create mode 100644 server/config-loader.js create mode 100644 server/otel-lazy.js create mode 100644 test/config-loader.test.js diff --git a/package-lock.json b/package-lock.json index 3afd051..866b5e2 100644 --- a/package-lock.json +++ b/package-lock.json @@ -1,15 +1,19 @@ { "name": "ccxray", - "version": "1.5.0", + "version": "1.9.2", "lockfileVersion": 3, "requires": true, "packages": { "": { "name": "ccxray", - "version": "1.5.0", + "version": "1.9.2", "license": "MIT", "dependencies": { "@anthropic-ai/tokenizer": "^0.0.4", + "@opentelemetry/api": "^1.9.0", + "@opentelemetry/exporter-metrics-otlp-http": "^0.205.0", + "@opentelemetry/resources": "^2.0.0", + "@opentelemetry/sdk-metrics": "^2.0.0", "ws": "^8.19.0" }, "bin": { @@ -497,6 +501,299 @@ "node": ">= 8" } }, + "node_modules/@opentelemetry/api": { + "version": "1.9.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/api/-/api-1.9.1.tgz", + "integrity": "sha512-gLyJlPHPZYdAk1JENA9LeHejZe1Ti77/pTeFm/nMXmQH/HFZlcS/O2XJB+L8fkbrNSqhdtlvjBVjxwUYanNH5Q==", + "license": "Apache-2.0", + "engines": { + "node": ">=8.0.0" + } + }, + "node_modules/@opentelemetry/api-logs": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/api-logs/-/api-logs-0.205.0.tgz", + "integrity": "sha512-wBlPk1nFB37Hsm+3Qy73yQSobVn28F4isnWIBvKpd5IUH/eat8bwcL02H9yzmHyyPmukeccSl2mbN5sDQZYnPg==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/api": "^1.3.0" + }, + "engines": { + "node": ">=8.0.0" + } + }, + 
"node_modules/@opentelemetry/core": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/core/-/core-2.1.0.tgz", + "integrity": "sha512-RMEtHsxJs/GiHHxYT58IY57UXAQTuUnZVco6ymDEqTNlJKTimM4qPUPVe8InNFyBjhHBEAx4k3Q8LtNayBsbUQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/exporter-metrics-otlp-http": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/exporter-metrics-otlp-http/-/exporter-metrics-otlp-http-0.205.0.tgz", + "integrity": "sha512-fFxNQ/HbbpLmh1pgU6HUVbFD1kNIjrkoluoKJkh88+gnmpFD92kMQ8WFNjPnSbjg2mNVnEkeKXgCYEowNW+p1w==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/otlp-exporter-base": "0.205.0", + "@opentelemetry/otlp-transformer": "0.205.0", + "@opentelemetry/resources": "2.1.0", + "@opentelemetry/sdk-metrics": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": "^1.3.0" + } + }, + "node_modules/@opentelemetry/exporter-metrics-otlp-http/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": "sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/exporter-metrics-otlp-http/node_modules/@opentelemetry/sdk-metrics": { + "version": "2.1.0", + "resolved": 
"https://registry.npmjs.org/@opentelemetry/sdk-metrics/-/sdk-metrics-2.1.0.tgz", + "integrity": "sha512-J9QX459mzqHLL9Y6FZ4wQPRZG4TOpMCyPOh6mkr/humxE1W2S3Bvf4i75yiMW9uyed2Kf5rxmLhTm/UK8vNkAw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.9.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/otlp-exporter-base": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/otlp-exporter-base/-/otlp-exporter-base-0.205.0.tgz", + "integrity": "sha512-2MN0C1IiKyo34M6NZzD6P9Nv9Dfuz3OJ3rkZwzFmF6xzjDfqqCTatc9v1EpNfaP55iDOCLHFyYNCgs61FFgtUQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/otlp-transformer": "0.205.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": "^1.3.0" + } + }, + "node_modules/@opentelemetry/otlp-transformer": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/otlp-transformer/-/otlp-transformer-0.205.0.tgz", + "integrity": "sha512-KmObgqPtk9k/XTlWPJHdMbGCylRAmMJNXIRh6VYJmvlRDMfe+DonH41G7eenG8t4FXn3fxOGh14o/WiMRR6vPg==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/api-logs": "0.205.0", + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0", + "@opentelemetry/sdk-logs": "0.205.0", + "@opentelemetry/sdk-metrics": "2.1.0", + "@opentelemetry/sdk-trace-base": "2.1.0", + "protobufjs": "^7.3.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": "^1.3.0" + } + }, + "node_modules/@opentelemetry/otlp-transformer/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": 
"sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/otlp-transformer/node_modules/@opentelemetry/sdk-metrics": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-metrics/-/sdk-metrics-2.1.0.tgz", + "integrity": "sha512-J9QX459mzqHLL9Y6FZ4wQPRZG4TOpMCyPOh6mkr/humxE1W2S3Bvf4i75yiMW9uyed2Kf5rxmLhTm/UK8vNkAw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.9.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/resources": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.7.1.tgz", + "integrity": "sha512-DeT6KKolmC4e/dRQvMQ/RwlnzhaqeiFOXY5ngoOPJ07GgVVKxZOg9EcrNZb5aTzUn+iCrJldAgOfQm1O/QfPAQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.7.1", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/resources/node_modules/@opentelemetry/core": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/core/-/core-2.7.1.tgz", + "integrity": "sha512-QAqIj32AtK6+pEVNG7EOVxHdE06RP+FM5qpiEJ4RtDcFIqKUZHYhl7/7UY5efhwmwNAg7j8QbJVBLxMerc0+gw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + 
}, + "node_modules/@opentelemetry/sdk-logs": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-logs/-/sdk-logs-0.205.0.tgz", + "integrity": "sha512-nyqhNQ6eEzPWQU60Nc7+A5LIq8fz3UeIzdEVBQYefB4+msJZ2vuVtRuk9KxPMw1uHoHDtYEwkr2Ct0iG29jU8w==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/api-logs": "0.205.0", + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.4.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-logs/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": "sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-metrics": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-metrics/-/sdk-metrics-2.7.1.tgz", + "integrity": "sha512-MpDJdkiFDs3Pm1RHO3KByuZbuBdJEXEAkiC0+yJdsZGVCdf1RpHR6n+LHDcS7ffmfrt5kVCzJSCfm4z2C7v0uQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.7.1", + "@opentelemetry/resources": "2.7.1" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.9.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-metrics/node_modules/@opentelemetry/core": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/core/-/core-2.7.1.tgz", + "integrity": "sha512-QAqIj32AtK6+pEVNG7EOVxHdE06RP+FM5qpiEJ4RtDcFIqKUZHYhl7/7UY5efhwmwNAg7j8QbJVBLxMerc0+gw==", + "license": "Apache-2.0", + 
"dependencies": { + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-trace-base": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-trace-base/-/sdk-trace-base-2.1.0.tgz", + "integrity": "sha512-uTX9FBlVQm4S2gVQO1sb5qyBLq/FPjbp+tmGoxu4tIgtYGmBYB44+KX/725RFDe30yBSaA9Ml9fqphe1hbUyLQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-trace-base/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": "sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/semantic-conventions": { + "version": "1.41.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/semantic-conventions/-/semantic-conventions-1.41.1.tgz", + "integrity": "sha512-/UhIkaZgPutTFmQ7RnIJGgDXZmtEJ7Dvi86xNTFWcnRxVRNk/aotsqDJYeEvDP+FSMB2SdW+pQzNMcWP0rwuNA==", + "license": "Apache-2.0", + "engines": { + "node": ">=14" + } + }, "node_modules/@posthog/core": { "version": "1.10.0", "resolved": "https://registry.npmjs.org/@posthog/core/-/core-1.10.0.tgz", @@ -507,6 +804,70 @@ "cross-spawn": "^7.0.6" } }, + "node_modules/@protobufjs/aspromise": { + "version": "1.1.2", + 
"resolved": "https://registry.npmjs.org/@protobufjs/aspromise/-/aspromise-1.1.2.tgz", + "integrity": "sha512-j+gKExEuLmKwvz3OgROXtrJ2UG2x8Ch2YZUxahh+s1F2HZ+wAceUNLkvy6zKCPVRkU++ZWQrdxsUeQXmcg4uoQ==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/base64": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@protobufjs/base64/-/base64-1.1.2.tgz", + "integrity": "sha512-AZkcAA5vnN/v4PDqKyMR5lx7hZttPDgClv83E//FMNhR2TMcLUhfRUBHCmSl0oi9zMgDDqRUJkSxO3wm85+XLg==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/codegen": { + "version": "2.0.5", + "resolved": "https://registry.npmjs.org/@protobufjs/codegen/-/codegen-2.0.5.tgz", + "integrity": "sha512-zgXFLzW3Ap33e6d0Wlj4MGIm6Ce8O89n/apUaGNB/jx+hw+ruWEp7EwGUshdLKVRCxZW12fp9r40E1mQrf/34g==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/eventemitter": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/@protobufjs/eventemitter/-/eventemitter-1.1.0.tgz", + "integrity": "sha512-j9ednRT81vYJ9OfVuXG6ERSTdEL1xVsNgqpkxMsbIabzSo3goCjDIveeGv5d03om39ML71RdmrGNjG5SReBP/Q==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/fetch": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/@protobufjs/fetch/-/fetch-1.1.0.tgz", + "integrity": "sha512-lljVXpqXebpsijW71PZaCYeIcE5on1w5DlQy5WH6GLbFryLUrBD4932W/E2BSpfRJWseIL4v/KPgBFxDOIdKpQ==", + "license": "BSD-3-Clause", + "dependencies": { + "@protobufjs/aspromise": "^1.1.1", + "@protobufjs/inquire": "^1.1.0" + } + }, + "node_modules/@protobufjs/float": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/@protobufjs/float/-/float-1.0.2.tgz", + "integrity": "sha512-Ddb+kVXlXst9d+R9PfTIxh1EdNkgoRe5tOX6t01f1lYWOvJnSPDBlG241QLzcyPdoNTsblLUdujGSE4RzrTZGQ==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/inquire": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/@protobufjs/inquire/-/inquire-1.1.1.tgz", + "integrity": 
"sha512-mnzgDV26ueAvk7rsbt9L7bE0SuAoqyuys/sMMrmVcN5x9VsxpcG3rqAUSgDyLp0UZlmNfIbQ4fHfCtreVBk8Ew==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/path": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@protobufjs/path/-/path-1.1.2.tgz", + "integrity": "sha512-6JOcJ5Tm08dOHAbdR3GrvP+yUUfkjG5ePsHYczMFLq3ZmMkAD98cDgcT2iA1lJ9NVwFd4tH/iSSoe44YWkltEA==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/pool": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/@protobufjs/pool/-/pool-1.1.0.tgz", + "integrity": "sha512-0kELaGSIDBKvcgS4zkjz1PeddatrjYcmMWOlAuAPwAeccUrPHdUqo/J6LiymHHEiJT5NrF1UVwxY14f+fy4WQw==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/utf8": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/@protobufjs/utf8/-/utf8-1.1.1.tgz", + "integrity": "sha512-oOAWABowe8EAbMyWKM0tYDKi8Yaox52D+HWZhAIJqQXbqe0xI/GV7FhLWqlEKreMkfDjshR5FKgi3mnle0h6Eg==", + "license": "BSD-3-Clause" + }, "node_modules/@puppeteer/browsers": { "version": "2.13.0", "resolved": "https://registry.npmjs.org/@puppeteer/browsers/-/browsers-2.13.0.tgz", @@ -1460,6 +1821,12 @@ "url": "https://github.com/sponsors/sindresorhus" } }, + "node_modules/long": { + "version": "5.3.2", + "resolved": "https://registry.npmjs.org/long/-/long-5.3.2.tgz", + "integrity": "sha512-mNAgZ1GmyNhD7AuqnTG3/VQ26o760+ZYBPKjPvugO8+nLbYfX6TVpJPseBvopbdY+qpZ/lKUnmEc1LeZYS3QAA==", + "license": "Apache-2.0" + }, "node_modules/lru-cache": { "version": "7.18.3", "resolved": "https://registry.npmjs.org/lru-cache/-/lru-cache-7.18.3.tgz", @@ -1771,6 +2138,30 @@ "node": ">=0.4.0" } }, + "node_modules/protobufjs": { + "version": "7.5.8", + "resolved": "https://registry.npmjs.org/protobufjs/-/protobufjs-7.5.8.tgz", + "integrity": "sha512-dvpCIeLPbXZS/Ete7yLaO7RenOdken2NHKykBXbsaGxZT0UTltcarBciw+A78SRQs9iMAAVpsYA+l8b1hTePIA==", + "hasInstallScript": true, + "license": "BSD-3-Clause", + "dependencies": { + "@protobufjs/aspromise": "^1.1.2", + 
"@protobufjs/base64": "^1.1.2", + "@protobufjs/codegen": "^2.0.5", + "@protobufjs/eventemitter": "^1.1.0", + "@protobufjs/fetch": "^1.1.0", + "@protobufjs/float": "^1.0.2", + "@protobufjs/inquire": "^1.1.1", + "@protobufjs/path": "^1.1.2", + "@protobufjs/pool": "^1.1.0", + "@protobufjs/utf8": "^1.1.1", + "@types/node": ">=13.7.0", + "long": "^5.0.0" + }, + "engines": { + "node": ">=12.0.0" + } + }, "node_modules/proxy-agent": { "version": "6.5.0", "resolved": "https://registry.npmjs.org/proxy-agent/-/proxy-agent-6.5.0.tgz", diff --git a/package.json b/package.json index bd5b57c..767851a 100644 --- a/package.json +++ b/package.json @@ -37,6 +37,10 @@ }, "dependencies": { "@anthropic-ai/tokenizer": "^0.0.4", + "@opentelemetry/api": "^1.9.0", + "@opentelemetry/exporter-metrics-otlp-http": "^0.205.0", + "@opentelemetry/resources": "^2.0.0", + "@opentelemetry/sdk-metrics": "^2.0.0", "ws": "^8.19.0" }, "devDependencies": { diff --git a/server/config-loader.js b/server/config-loader.js new file mode 100644 index 0000000..e108ac0 --- /dev/null +++ b/server/config-loader.js @@ -0,0 +1,64 @@ +'use strict'; + +// Minimal config loader for the OTel rollout (Phase 2a slice). +// This intentionally implements only the surface needed for the first +// vertical slice: read .ccxray.json from cwd if present, return a default +// shape otherwise. Env interpolation, literal-secret detection, gitignore +// auto-amend, personal config (.ccxray.user.json), and walk-up-to-git-root +// lookup all land in later Phase 2 sub-phases per the OpenSpec change. 
+ +const fs = require('fs'); +const path = require('path'); + +const DEFAULT_CONFIG = Object.freeze({ + otel: Object.freeze({ + enabled: false, + tier: 0, + endpoint: null, + headers: Object.freeze({}), + resource_attributes: Object.freeze({}), + cardinality_overrides: Object.freeze({}), + }), +}); + +function projectConfigPath(cwd) { + return path.join(cwd || process.cwd(), '.ccxray.json'); +} + +function readProjectConfig(cwd) { + const file = projectConfigPath(cwd); + let raw; + try { + raw = fs.readFileSync(file, 'utf8'); + } catch (err) { + if (err.code === 'ENOENT') return { config: DEFAULT_CONFIG, source: null }; + throw new Error(`config-loader: failed to read ${file}: ${err.message}`); + } + let parsed; + try { + parsed = JSON.parse(raw); + } catch (err) { + throw new Error(`config-loader: ${file} is not valid JSON (${err.message})`); + } + return { config: mergeWithDefaults(parsed), source: file }; +} + +function mergeWithDefaults(input) { + const otel = input && typeof input.otel === 'object' && input.otel ? input.otel : {}; + return { + otel: { + enabled: otel.enabled === true, + tier: Number.isInteger(otel.tier) ? otel.tier : 0, + endpoint: typeof otel.endpoint === 'string' ? otel.endpoint : null, + headers: otel.headers && typeof otel.headers === 'object' ? { ...otel.headers } : {}, + resource_attributes: otel.resource_attributes && typeof otel.resource_attributes === 'object' + ? { ...otel.resource_attributes } + : {}, + cardinality_overrides: otel.cardinality_overrides && typeof otel.cardinality_overrides === 'object' + ? { ...otel.cardinality_overrides } + : {}, + }, + }; +} + +module.exports = { readProjectConfig, projectConfigPath, DEFAULT_CONFIG }; diff --git a/server/otel-lazy.js b/server/otel-lazy.js new file mode 100644 index 0000000..d796e68 --- /dev/null +++ b/server/otel-lazy.js @@ -0,0 +1,35 @@ +'use strict'; + +// Lazy require for OpenTelemetry packages. 
+// Phase 1 of the OTel rollout: ccxray must run at tier 0 even when the +// @opentelemetry/* packages are absent (e.g. user installed via a minimal +// distribution). Callers ask for a package by name; we return null if it +// cannot be resolved instead of throwing. + +const KNOWN_PACKAGES = new Set([ + '@opentelemetry/api', + '@opentelemetry/resources', + '@opentelemetry/sdk-metrics', + '@opentelemetry/exporter-metrics-otlp-http', +]); + +function tryRequire(name) { + if (!KNOWN_PACKAGES.has(name)) { + throw new Error(`otel-lazy: unknown package "${name}"`); + } + try { + return require(name); + } catch (err) { + if (err && err.code === 'MODULE_NOT_FOUND') return null; + throw err; + } +} + +function isAvailable() { + for (const name of KNOWN_PACKAGES) { + if (tryRequire(name) == null) return false; + } + return true; +} + +module.exports = { tryRequire, isAvailable, KNOWN_PACKAGES }; diff --git a/test/config-loader.test.js b/test/config-loader.test.js new file mode 100644 index 0000000..1d74e3a --- /dev/null +++ b/test/config-loader.test.js @@ -0,0 +1,85 @@ +'use strict'; + +const test = require('node:test'); +const assert = require('node:assert/strict'); +const fs = require('fs'); +const os = require('os'); +const path = require('path'); + +const { readProjectConfig, DEFAULT_CONFIG } = require('../server/config-loader'); +const { tryRequire, isAvailable } = require('../server/otel-lazy'); + +function mkTmp() { + return fs.mkdtempSync(path.join(os.tmpdir(), 'ccxray-cfg-')); +} + +test('config-loader: returns default config when .ccxray.json is absent', () => { + const dir = mkTmp(); + try { + const { config, source } = readProjectConfig(dir); + assert.equal(source, null); + assert.deepEqual(config, DEFAULT_CONFIG); + assert.equal(config.otel.enabled, false); + assert.equal(config.otel.tier, 0); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('config-loader: reads otel block from .ccxray.json', () => { + const dir = mkTmp(); + 
try { + fs.writeFileSync(path.join(dir, '.ccxray.json'), JSON.stringify({ + otel: { + enabled: true, + tier: 1, + endpoint: 'http://collector.local:4318', + headers: { 'x-team': 'platform' }, + resource_attributes: { 'service.name': 'ccxray-proxy' }, + }, + })); + const { config, source } = readProjectConfig(dir); + assert.ok(source && source.endsWith('.ccxray.json')); + assert.equal(config.otel.enabled, true); + assert.equal(config.otel.tier, 1); + assert.equal(config.otel.endpoint, 'http://collector.local:4318'); + assert.equal(config.otel.headers['x-team'], 'platform'); + assert.equal(config.otel.resource_attributes['service.name'], 'ccxray-proxy'); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('config-loader: malformed JSON throws with a descriptive error', () => { + const dir = mkTmp(); + try { + fs.writeFileSync(path.join(dir, '.ccxray.json'), '{ not valid json'); + assert.throws(() => readProjectConfig(dir), /not valid JSON/); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('config-loader: tier defaults to 0 when value is non-integer', () => { + const dir = mkTmp(); + try { + fs.writeFileSync(path.join(dir, '.ccxray.json'), JSON.stringify({ otel: { tier: 'one' } })); + const { config } = readProjectConfig(dir); + assert.equal(config.otel.tier, 0); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('otel-lazy: tryRequire returns the package object when installed', () => { + const api = tryRequire('@opentelemetry/api'); + assert.ok(api && typeof api === 'object', 'expected @opentelemetry/api to resolve'); +}); + +test('otel-lazy: tryRequire rejects unknown package names', () => { + assert.throws(() => tryRequire('@opentelemetry/not-real'), /unknown package/); +}); + +test('otel-lazy: isAvailable returns true once all known packages resolve', () => { + assert.equal(isAvailable(), true); +}); From 726feebd8eabc7bbf513d119fee0bfbbbca8eaa7 Mon Sep 17 00:00:00 
2001 From: Justin Lee Date: Wed, 13 May 2026 17:26:38 +0800 Subject: [PATCH 08/10] feat(otel): add health state machine and SDK init shell MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 2b of add-otel-metrics-phase1. Frames the subscriber wiring against the emit.js event bus introduced in Phase 1 without committing to a metric registry shape (that lands in Phase 2c+). - server/otel-health.js: four-state machine (disabled / active / degraded / circuit_open) with validated transitions. Phase 2b ships the shell only; bounded queue, circuit breaker, and log rotation are deferred per tasks.md §3.2–3.4. - server/otel.js: init(config) chooses behavior by tier. tier 0 returns early — no @opentelemetry/* require, no subscribers, zero cost. tier ≥ 1 resolves packages via otel-lazy; absent packages → degraded (proxy keeps running). Available packages → register five no-op subscribers on the emit.js bus (entry_completed, session_started, parser_unknown, parser_mismatch, parser_error) → active. - Both modules accept dependency injection so tests can exercise the "packages missing" branch without uninstalling them. - test/otel-init.test.js: 8 unit tests covering tier 0 no-op, tier ≥ 1 active path, packages-absent degraded path, idempotency, shutdown, and invalid-transition guards on the state machine. No existing runtime path imports these modules yet; proxy and hub behavior unchanged. Phase 2c will require server/otel.js from server/index.js (or from a CLI bootstrap) and wire the first emit() call site. 
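The tier and package-availability decision described above can be sketched in isolation. This is an illustrative reduction, not the code in server/otel.js: `decideState` and the stub objects are hypothetical names, and the default stub assumes the packages are installed. It shows only how an injected `otelLazy` lets a test reach the "packages missing" branch without uninstalling anything.

```javascript
// Sketch only: decideState is a hypothetical name, not an export of
// server/otel.js. It mirrors the tier / availability decision in prose above.
function decideState(config, deps = {}) {
  // Anything that is not an integer tier is treated as tier 0 (disabled).
  const tier = Number.isInteger(config?.otel?.tier) ? config.otel.tier : 0;
  if (tier <= 0) return 'disabled';

  // Assumption: the default stub reports the packages as installed.
  const lazy = deps.otelLazy || { isAvailable: () => true };
  return lazy.isAvailable() ? 'active' : 'degraded';
}

// Injected stand-in for otel-lazy that reports the packages as absent.
const missing = { isAvailable: () => false };

const results = [
  decideState({ otel: { tier: 0 } }),
  decideState({ otel: { tier: 1 } }),
  decideState({ otel: { tier: 1 } }, { otelLazy: missing }),
  decideState({}),
];
console.log(results.join(' '));
```

Passing the dependency as a second argument keeps the production call site unchanged (`init(config)`) while tests supply their own stub.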
--- server/otel-health.js | 60 +++++++++++++++++++++++++++++++ server/otel.js | 71 ++++++++++++++++++++++++++++++++++++ test/otel-init.test.js | 82 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 213 insertions(+) create mode 100644 server/otel-health.js create mode 100644 server/otel.js create mode 100644 test/otel-init.test.js diff --git a/server/otel-health.js b/server/otel-health.js new file mode 100644 index 0000000..f833696 --- /dev/null +++ b/server/otel-health.js @@ -0,0 +1,60 @@ +'use strict'; + +// OTel export health state machine. Phase 2b: state shell only. +// Bounded export queue (3.2), circuit breaker (3.3), log rotation (3.4), +// and shutdown cap (3.5) land in later sub-phases of the OpenSpec change. +// +// States: +// disabled — OTel never initialized (tier 0 or packages missing-and-tolerated) +// active — SDK initialized, exports presumed working +// degraded — SDK init failed or runtime non-recoverable; proxy continues +// circuit_open — runtime export failures tripped the breaker; periodic half-open retry +// +// Only documented APIs may mutate state. Invalid transitions throw so bugs +// surface in tests rather than silently corrupt observability. 
+ +const STATES = Object.freeze(['disabled', 'active', 'degraded', 'circuit_open']); + +const VALID_TRANSITIONS = Object.freeze({ + disabled: new Set(['active', 'degraded']), + active: new Set(['degraded', 'circuit_open', 'disabled']), + degraded: new Set(['active', 'circuit_open', 'disabled']), + circuit_open: new Set(['active', 'degraded', 'disabled']), +}); + +let currentState = 'disabled'; +let lastTransitionAt = Date.now(); +let lastReason = null; + +function getState() { + return currentState; +} + +function getStatus() { + return { + state: currentState, + lastTransitionAt, + reason: lastReason, + }; +} + +function transition(to, { reason } = {}) { + if (!STATES.includes(to)) throw new Error(`otel-health: unknown state "${to}"`); + if (currentState === to) return false; + const allowed = VALID_TRANSITIONS[currentState]; + if (!allowed.has(to)) { + throw new Error(`otel-health: invalid transition ${currentState} → ${to}`); + } + currentState = to; + lastTransitionAt = Date.now(); + lastReason = (to === 'degraded' || to === 'circuit_open') ? (reason || null) : null; + return true; +} + +function _resetForTests() { + currentState = 'disabled'; + lastTransitionAt = Date.now(); + lastReason = null; +} + +module.exports = { STATES, getState, getStatus, transition, _resetForTests }; diff --git a/server/otel.js b/server/otel.js new file mode 100644 index 0000000..8300f5a --- /dev/null +++ b/server/otel.js @@ -0,0 +1,71 @@ +'use strict'; + +// OTel SDK init + emit.js subscribers. +// +// Phase 2b scope: wire the subscriber frame. tier 0 = full no-op (no +// require of @opentelemetry/*, no subscribers, no SDK). tier ≥ 1 +// resolves the lazy packages; if any are missing, transitions to +// degraded and keeps the proxy running. If packages are present, +// registers no-op subscribers and transitions to active. 
Metric +// registry, View API setup, and the actual MeterProvider land in +// Phase 2c+ — keeping this file small until the cardinality budget +// design is wired in. +// +// Never throws into the caller. init() returns the resulting health state. + +const emit = require('./emit'); +const defaultOtelLazy = require('./otel-lazy'); +const health = require('./otel-health'); + +let initialized = false; +let unsubscribers = []; + +function init(config, deps = {}) { + if (initialized) return health.getState(); + initialized = true; + + const tier = (config && config.otel && Number.isInteger(config.otel.tier)) + ? config.otel.tier + : 0; + + if (tier <= 0) { + // tier 0 pays nothing: do not load OTel, do not subscribe. + return health.getState(); + } + + const otelLazy = deps.otelLazy || defaultOtelLazy; + if (!otelLazy.isAvailable()) { + health.transition('degraded', { reason: 'opentelemetry packages not installed' }); + return health.getState(); + } + + // Phase 2b: register stub subscribers so the bus wiring is exercised + // without committing to a metric registry shape. Each handler stays a + // no-op until Phase 2c attaches actual instruments. 
+ unsubscribers.push(emit.on('entry_completed', () => { /* tier ≥ 1 stub */ })); + unsubscribers.push(emit.on('session_started', () => { /* tier ≥ 1 stub */ })); + unsubscribers.push(emit.on('parser_unknown', () => { /* tier ≥ 1 stub */ })); + unsubscribers.push(emit.on('parser_mismatch', () => { /* tier ≥ 1 stub */ })); + unsubscribers.push(emit.on('parser_error', () => { /* tier ≥ 1 stub */ })); + + health.transition('active'); + return health.getState(); +} + +function shutdown() { + for (const off of unsubscribers) { + try { off(); } catch { /* ignore */ } + } + unsubscribers = []; + if (health.getState() !== 'disabled') { + health.transition('disabled'); + } + initialized = false; +} + +function _resetForTests() { + shutdown(); + health._resetForTests(); +} + +module.exports = { init, shutdown, _resetForTests }; diff --git a/test/otel-init.test.js b/test/otel-init.test.js new file mode 100644 index 0000000..7f60877 --- /dev/null +++ b/test/otel-init.test.js @@ -0,0 +1,82 @@ +'use strict'; + +const test = require('node:test'); +const assert = require('node:assert/strict'); + +const emit = require('../server/emit'); +const otel = require('../server/otel'); +const health = require('../server/otel-health'); + +test.beforeEach(() => otel._resetForTests()); +test.afterEach(() => otel._resetForTests()); + +test('otel.init: tier 0 stays disabled and registers no subscribers', () => { + let entryCompletedFired = false; + const off = emit.on('entry_completed', () => { entryCompletedFired = true; }); + try { + const state = otel.init({ otel: { tier: 0 } }); + assert.equal(state, 'disabled'); + + // Only our test subscriber is attached; otel.init must not have added one. 
+ emit.emit('entry_completed', { entry: { id: 'x' } }); + assert.equal(entryCompletedFired, true, 'test subscriber should still fire'); + assert.equal(health.getState(), 'disabled'); + } finally { + off(); + } +}); + +test('otel.init: tier ≥ 1 with packages present → active', () => { + const state = otel.init({ otel: { tier: 1 } }); + assert.equal(state, 'active'); + assert.equal(health.getState(), 'active'); +}); + +test('otel.init: tier ≥ 1 with packages absent → degraded with reason', () => { + const fakeLazy = { isAvailable: () => false, tryRequire: () => null }; + const state = otel.init({ otel: { tier: 1 } }, { otelLazy: fakeLazy }); + assert.equal(state, 'degraded'); + const status = health.getStatus(); + assert.equal(status.state, 'degraded'); + assert.match(status.reason || '', /not installed/i); +}); + +test('otel.init: idempotent — second call returns current state without crashing', () => { + const first = otel.init({ otel: { tier: 1 } }); + const second = otel.init({ otel: { tier: 1 } }); + assert.equal(first, 'active'); + assert.equal(second, 'active'); +}); + +test('otel.shutdown: returns state to disabled and unsubscribes', () => { + otel.init({ otel: { tier: 1 } }); + assert.equal(health.getState(), 'active'); + + // Verify subscribers exist by spying on a known event — when we emit, + // the otel no-op handler fires but does not throw. The handler itself + // is a no-op, so we just confirm shutdown clears state without error. + otel.shutdown(); + assert.equal(health.getState(), 'disabled'); + + // After shutdown, init can run again. 
+ const reinit = otel.init({ otel: { tier: 1 } }); + assert.equal(reinit, 'active'); +}); + +test('otel-health: rejects unknown states', () => { + assert.throws(() => health.transition('flying'), /unknown state/); +}); + +test('otel-health: rejects invalid transitions', () => { + health._resetForTests(); + // disabled → circuit_open is not in the allow-list + assert.throws(() => health.transition('circuit_open'), /invalid transition/); +}); + +test('otel-health: transition clears reason when leaving error states', () => { + health._resetForTests(); + health.transition('degraded', { reason: 'boom' }); + assert.equal(health.getStatus().reason, 'boom'); + health.transition('active'); + assert.equal(health.getStatus().reason, null); +}); From 35c33d729ab0d7b1de989ce6d3b67d4495e08a6e Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Wed, 13 May 2026 18:42:30 +0800 Subject: [PATCH 09/10] fix(forward): handle late socket errors so the proxy survives EPIPE/ECONNRESET Two fixes for upstream socket error handling that previously crashed the proxy process in production: 1. forward.js: register a socket-level catch-all on proxyReq for the default (no HTTPS_PROXY) path. Anthropic occasionally returns 500 and then closes the TCP connection while ccxray still has a pending write to the underlying TLSSocket; the resulting EPIPE is emitted on the socket but not re-emitted on the ClientRequest, so without a listener the entire proxy crashes. Logs the error and lets the request fail gracefully via the existing proxyReq 'error' handler. 2. createTunnelAgent: guard the one-shot tls.connect callback with a "connected" flag. Pre-connect errors continue to flow into the agent callback as before; post-connect late errors now only log and never re-invoke the already-consumed callback. Addresses the HTTPS_PROXY path (Gemini-identified failure mode). node --check passes. Existing test suite unaffected. 
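The "connected" guard from fix 2 can be exercised without a real TLS socket. The sketch below is a hedged illustration under stated assumptions: `connectOnce` and the fake socket are hypothetical, and a plain `EventEmitter` stands in for the `tls.connect` socket. It is not the code in server/forward.js.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper illustrating the guard; not part of forward.js.
// Errors before 'connect' go to the one-shot callback; errors after it are
// only recorded, so the already-consumed callback is never invoked again.
function connectOnce(socket, callback) {
  let connected = false;
  const lateErrors = [];
  socket.once('connect', () => {
    connected = true;
    callback(null, socket);
  });
  socket.on('error', (err) => {
    if (!connected) return callback(err);
    lateErrors.push(err.code || err.message);
  });
  return lateErrors;
}

const calls = [];
const fakeSocket = new EventEmitter();
const lateErrors = connectOnce(fakeSocket, (err) => calls.push(err ? 'error' : 'ok'));

// Simulate a successful connect followed by a late EPIPE on the same socket.
fakeSocket.emit('connect');
fakeSocket.emit('error', Object.assign(new Error('write EPIPE'), { code: 'EPIPE' }));

console.log(calls, lateErrors);
```

Because an 'error' listener is attached, the late error no longer becomes an uncaught exception; it is logged (here, collected) instead.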
--- server/forward.js | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/server/forward.js b/server/forward.js index 71a7360..e1449cb 100644 --- a/server/forward.js +++ b/server/forward.js @@ -80,8 +80,15 @@ function createTunnelAgent(proxyUrl) { } const tlsOpts = { socket, servername: options.servername || options.host }; if (options.rejectUnauthorized !== undefined) tlsOpts.rejectUnauthorized = options.rejectUnauthorized; - const tlsSocket = tls.connect(tlsOpts, () => callback(null, tlsSocket)); - tlsSocket.on('error', callback); + let connected = false; + const tlsSocket = tls.connect(tlsOpts, () => { + connected = true; + callback(null, tlsSocket); + }); + tlsSocket.on('error', (err) => { + if (!connected) return callback(err); + console.error(`\x1b[31m❌ TUNNEL SOCKET ERROR: ${err.code || err.message}\x1b[0m`); + }); }); connectReq.on('error', callback); @@ -377,6 +384,15 @@ function forwardRequest(ctx) { clientRes.end(JSON.stringify({ error: 'proxy_error', message: err.message })); }); + // Late socket errors (EPIPE / ECONNRESET after the response has been received) + // are emitted on the underlying TLS/TCP socket and may not re-emit on the + // ClientRequest. Without a listener they crash the entire proxy process. 
+ proxyReq.on('socket', (socket) => { + socket.on('error', (err) => { + console.error(`\x1b[31m❌ UPSTREAM SOCKET ERROR: ${err.code || err.message}\x1b[0m`); + }); + }); + proxyReq.end(bodyToSend); } From e85db57c259f004464a168a29da19d94745c0ecf Mon Sep 17 00:00:00 2001 From: Justin Lee Date: Fri, 15 May 2026 17:21:14 +0800 Subject: [PATCH 10/10] =?UTF-8?q?feat(otel):=20vertical=20slice=20?= =?UTF-8?q?=E2=80=94=20SDK=20init,=20token=20counters,=20emit=20point?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wire up the first end-to-end OTel chain: emit → MeterProvider → OTLPMetricExporter → collector. Verify the rail works first, then come back and harden it. - server/otel.js: real MeterProvider + OTLP HTTP exporter, resource carries ccxray.source=ccxray-proxy; shutdown has a 2-second hard cap - server/otel.js: register 4 token counters (ccxray.tokens.{input,output,cache_read,cache_creation}_total) with { provider, model } attributes - server/forward.js: all three forward paths (Anthropic SSE, OpenAI SSE, non-SSE) emit('entry_completed', { entry }) - test/otel-vertical.test.js: 4 integration tests, including a mock OTLP collector OpenSpec tasks: §3.5, §4.1 done; §4.5, §6.1, §10.3 partial. Queue routing (§3.2) and cardinality budget (§4.2-4.4) are left for the next slice. Co-Authored-By: Claude Opus 4.7 --- .../changes/add-otel-metrics-phase1/tasks.md | 14 +- server/forward.js | 4 + server/otel.js | 166 +++++++++++++++-- test/otel-vertical.test.js | 171 ++++++++++++++++++ 4 files changed, 329 insertions(+), 26 deletions(-) create mode 100644 test/otel-vertical.test.js diff --git a/openspec/changes/add-otel-metrics-phase1/tasks.md b/openspec/changes/add-otel-metrics-phase1/tasks.md index 2a6e1ef..5c9af0e 100644 --- a/openspec/changes/add-otel-metrics-phase1/tasks.md +++ b/openspec/changes/add-otel-metrics-phase1/tasks.md @@ -1,8 +1,8 @@ ## 1. 
Dependencies and package wiring -- [ ] 1.1 Add `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources` as `dependencies` in `package.json` (no auto-instrumentations) -- [ ] 1.2 Implement lazy require in a helper so ccxray still runs at tier 0 when OTel packages are absent -- [ ] 1.3 Update `package-lock.json` and confirm bundle size delta is within an acceptable bound +- [x] 1.1 Add `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources` as `dependencies` in `package.json` (no auto-instrumentations) +- [x] 1.2 Implement lazy require in a helper so ccxray still runs at tier 0 when OTel packages are absent +- [x] 1.3 Update `package-lock.json` and confirm bundle size delta is within an acceptable bound ## 2. Config loader (`server/config-loader.js`) @@ -20,17 +20,17 @@ ## 3. OTel health module (`server/otel-health.js`) -- [ ] 3.1 Implement state machine with four states: `disabled / active / degraded / circuit_open` and transitions only via documented APIs +- [x] 3.1 Implement state machine with four states: `disabled / active / degraded / circuit_open` and transitions only via documented APIs - [ ] 3.2 Implement bounded export queue with drop-oldest semantics and `ccxray.otel.exports_dropped_total{signal}` increment per drop - [ ] 3.3 Implement circuit breaker: 5 consecutive failures trips, 60s initial cooldown, half-open trial, exponential backoff to 600s max - [ ] 3.4 Implement `~/.ccxray/otel.log` append writer with size-based rotation (default 1 MB, 5 file retention) -- [ ] 3.5 Implement SDK shutdown with 2-second hard cap to never block process exit +- [x] 3.5 Implement SDK shutdown with 2-second hard cap to never block process exit - [ ] 3.6 Surface state and metrics via a status reporter API consumed by the CLI status command - [ ] 3.7 Unit tests with mock collector (200 / 500 / timeout) covering queue overflow, circuit 
transitions, half-open recovery, and exponential backoff ## 4. OTel SDK initialization (`server/otel.js`) -- [ ] 4.1 Implement SDK init for metrics only, with `ccxray.source="ccxray-proxy"` resource attribute +- [x] 4.1 Implement SDK init for metrics only, with `ccxray.source="ccxray-proxy"` resource attribute - [ ] 4.2 Define metric registry with allow-list of attribute keys and cardinality budgets per metric (View API) - [ ] 4.3 Implement cardinality budget tracker with `_overflow_` fallback and `ccxray.metrics.overflow_total{metric,attribute}` sentinel - [ ] 4.4 Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and apply `ccxray.cli_otel_active=true` attribute in complement mode @@ -58,7 +58,7 @@ ## 6. Wire metrics into forward / store paths -- [ ] 6.1 In `server/forward.js`, emit cost / token / latency / error / stop_reason metrics after each completed forward, using the otel-health queue +- [ ] 6.1 In `server/forward.js`, emit cost / token / latency / error / stop_reason metrics after each completed forward, using the otel-health queue _(partial: `emit('entry_completed', { entry })` wired in all 3 forward paths with full entry payload; routing through the otel-health queue is pending §3.2)_ - [ ] 6.2 In `server/store.js`, emit usage / pattern / governance metrics as session/tool/skill/MCP detection runs through the new parsers - [ ] 6.3 Ensure no emit path can throw into the proxy code path; all emits are best-effort - [ ] 6.4 Add a unit test that verifies forward.js continues to function with OTel disabled, init-failed (degraded), and circuit_open states diff --git a/server/forward.js b/server/forward.js index e1449cb..23d3852 100644 --- a/server/forward.js +++ b/server/forward.js @@ -10,6 +10,7 @@ const helpers = require('./helpers'); const { broadcast, broadcastSessionStatus, broadcastSessionTitleUpdate } = require('./sse-broadcast'); const { appendSample, collectRatelimitHeaders } = require('./ratelimit-log'); const hub = require('./hub'); +const emit = 
require('./emit'); // For title-generator subagent responses, extract the clean title from the // JSON payload and (when attribution succeeds) stamp it onto the parent @@ -615,6 +616,7 @@ function handleSSEResponse(ctx, proxyRes, clientRes) { store.trimEntries(); store.propagateLoadedSkills(entry, sessionId); broadcast(entry); + emit.emit('entry_completed', { entry }); // Persist to index (fire-and-forget after broadcast) const indexLine = JSON.stringify({ @@ -746,6 +748,7 @@ function handleOpenAISSE(ctx, proxyRes, clientRes) { store.entries.push(entry); store.trimEntries(); broadcast(entry); + emit.emit('entry_completed', { entry }); const indexLine = JSON.stringify({ id, ts: ctx.ts, sessionId: reqSessionId, @@ -891,6 +894,7 @@ function handleNonSSEResponse(ctx, proxyRes, clientRes) { store.trimEntries(); store.propagateLoadedSkills(entry, sessionId); broadcast(entry); + emit.emit('entry_completed', { entry }); const indexLine = JSON.stringify({ id, ts: ctx.ts, sessionId, diff --git a/server/otel.js b/server/otel.js index 8300f5a..653a787 100644 --- a/server/otel.js +++ b/server/otel.js @@ -2,16 +2,31 @@ // OTel SDK init + emit.js subscribers. // -// Phase 2b scope: wire the subscriber frame. tier 0 = full no-op (no -// require of @opentelemetry/*, no subscribers, no SDK). tier ≥ 1 -// resolves the lazy packages; if any are missing, transitions to -// degraded and keeps the proxy running. If packages are present, -// registers no-op subscribers and transitions to active. Metric -// registry, View API setup, and the actual MeterProvider land in -// Phase 2c+ — keeping this file small until the cardinality budget -// design is wired in. +// Vertical-slice scope (Phase 1, first cut): tier 0 = full no-op. tier ≥ 1 + +// packages present + endpoint configured → real MeterProvider with OTLP HTTP +// exporter and the first metric family (token usage). 
tier ≥ 1 with packages +// present but no endpoint → active state with no exporter (useful for staging +// the wiring before pointing at a collector). // -// Never throws into the caller. init() returns the resulting health state. +// Metrics registered in this slice (aligned with otel-export/spec.md): +// ccxray.tokens.input_total (counter, unit=tokens) +// ccxray.tokens.output_total (counter, unit=tokens) +// ccxray.tokens.cache_read_total (counter, unit=tokens) +// ccxray.tokens.cache_creation_total (counter, unit=tokens) +// Each is recorded with { provider, model } attributes. Cardinality budgets, +// View API allow-lists, sentinel metrics, and the full cost/usage/quality +// families land in later slices (§4.2–§4.9 of the OpenSpec change). +// +// Resource attribute `ccxray.source=ccxray-proxy` is always set so downstream +// consumers can distinguish ccxray-emitted metrics from `claude_code.*` CLI +// metrics that the user may also be exporting. +// +// shutdown() returns synchronously to disabled state (so existing callers +// don't need to await) and fires the SDK provider.shutdown() in the +// background with a 2-second hard cap — never blocks process exit. +// +// init() never throws into the caller; any failure transitions to degraded +// with a reason and the proxy continues running. const emit = require('./emit'); const defaultOtelLazy = require('./otel-lazy'); @@ -19,6 +34,7 @@ const health = require('./otel-health'); let initialized = false; let unsubscribers = []; +let sdkContext = null; // { provider, reader, instruments } | null function init(config, deps = {}) { if (initialized) return health.getState(); @@ -29,7 +45,6 @@ function init(config, deps = {}) { : 0; if (tier <= 0) { - // tier 0 pays nothing: do not load OTel, do not subscribe. 
return health.getState(); } @@ -39,33 +54,146 @@ function init(config, deps = {}) { return health.getState(); } - // Phase 2b: register stub subscribers so the bus wiring is exercised - // without committing to a metric registry shape. Each handler stays a - // no-op until Phase 2c attaches actual instruments. - unsubscribers.push(emit.on('entry_completed', () => { /* tier ≥ 1 stub */ })); + try { + if (config.otel.endpoint) { + sdkContext = initSdk(config, otelLazy); + } + registerHandlers(); + health.transition('active'); + } catch (err) { + sdkContext = null; + health.transition('degraded', { reason: `SDK init failed: ${err && err.message || err}` }); + } + return health.getState(); +} + +function initSdk(config, otelLazy) { + const sdk = otelLazy.tryRequire('@opentelemetry/sdk-metrics'); + const exp = otelLazy.tryRequire('@opentelemetry/exporter-metrics-otlp-http'); + const res = otelLazy.tryRequire('@opentelemetry/resources'); + if (!sdk || !exp || !res) { + throw new Error('required OTel package failed to resolve'); + } + + const exporter = new exp.OTLPMetricExporter({ + url: config.otel.endpoint, + headers: config.otel.headers || {}, + }); + + // Default 60s export interval, overridable for tests via env var. 
+ const intervalMs = Number(process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS) || 60000; + const reader = new sdk.PeriodicExportingMetricReader({ + exporter, + exportIntervalMillis: intervalMs, + }); + + const resource = res.resourceFromAttributes({ + 'ccxray.source': 'ccxray-proxy', + ...(config.otel.resource_attributes || {}), + }); + + const provider = new sdk.MeterProvider({ resource, readers: [reader] }); + + const meter = provider.getMeter('ccxray', '1'); + const instruments = { + inputTokens: meter.createCounter('ccxray.tokens.input_total', { + description: 'Input tokens per completed entry', + unit: 'tokens', + }), + outputTokens: meter.createCounter('ccxray.tokens.output_total', { + description: 'Output tokens per completed entry', + unit: 'tokens', + }), + cacheReadTokens: meter.createCounter('ccxray.tokens.cache_read_total', { + description: 'Cache-read input tokens per completed entry', + unit: 'tokens', + }), + cacheCreationTokens: meter.createCounter('ccxray.tokens.cache_creation_total', { + description: 'Cache-creation input tokens per completed entry', + unit: 'tokens', + }), + }; + + return { provider, reader, instruments }; +} + +function registerHandlers() { + unsubscribers.push(emit.on('entry_completed', onEntryCompleted)); + // Other event types land as later slices wire them up. 
unsubscribers.push(emit.on('session_started', () => { /* tier ≥ 1 stub */ })); unsubscribers.push(emit.on('parser_unknown', () => { /* tier ≥ 1 stub */ })); unsubscribers.push(emit.on('parser_mismatch', () => { /* tier ≥ 1 stub */ })); unsubscribers.push(emit.on('parser_error', () => { /* tier ≥ 1 stub */ })); +} - health.transition('active'); - return health.getState(); +function onEntryCompleted(payload) { + if (!sdkContext) return; + const entry = payload && payload.entry; + const usage = entry && entry.usage; + if (!usage) return; + + const attrs = { + provider: entry.provider || 'unknown', + model: entry.model || 'unknown', + }; + + const input = Number(usage.input_tokens) || 0; + const output = Number(usage.output_tokens) || 0; + const cacheRead = Number(usage.cache_read_input_tokens) || 0; + const cacheCreate = Number(usage.cache_creation_input_tokens) || 0; + + sdkContext.instruments.inputTokens.add(input, attrs); + sdkContext.instruments.outputTokens.add(output, attrs); + sdkContext.instruments.cacheReadTokens.add(cacheRead, attrs); + sdkContext.instruments.cacheCreationTokens.add(cacheCreate, attrs); } -function shutdown() { +// Returns a Promise but is safe to ignore. The synchronous portion (before the +// first await below) is enough to make `health.getState() === 'disabled'` and +// `initialized === false` visible to immediate follow-up calls — existing +// `otel.shutdown()` callers that do not await still see the new state. 
+async function shutdown() { for (const off of unsubscribers) { try { off(); } catch { /* ignore */ } } unsubscribers = []; + + const ctx = sdkContext; + sdkContext = null; + if (health.getState() !== 'disabled') { health.transition('disabled'); } initialized = false; + + if (ctx && ctx.provider && typeof ctx.provider.shutdown === 'function') { + try { + await Promise.race([ + ctx.provider.shutdown(), + new Promise(resolve => setTimeout(resolve, 2000)), + ]); + } catch { /* never block process exit on shutdown errors */ } + } +} + +// Force-flush exists so tests (and a future `ccxray status --otel` command) +// can drain the reader on demand. Returns a Promise that resolves even on +// failure — never throws to the caller. +async function flush() { + if (!sdkContext || !sdkContext.provider) return; + try { + await sdkContext.provider.forceFlush(); + } catch { /* ignore */ } } function _resetForTests() { - shutdown(); + // Sync drop of everything for tests that do not await shutdown. + for (const off of unsubscribers) { try { off(); } catch {} } + unsubscribers = []; + sdkContext = null; + if (health.getState() !== 'disabled') health.transition('disabled'); + initialized = false; health._resetForTests(); } -module.exports = { init, shutdown, _resetForTests }; +module.exports = { init, shutdown, flush, _resetForTests }; diff --git a/test/otel-vertical.test.js b/test/otel-vertical.test.js new file mode 100644 index 0000000..1746dcc --- /dev/null +++ b/test/otel-vertical.test.js @@ -0,0 +1,171 @@ +'use strict'; + +// Vertical-slice integration: a real OTel MeterProvider posts to an in-process +// mock OTLP HTTP collector. Proves the full chain — init → emit → record → +// PeriodicExportingMetricReader → OTLPMetricExporter → HTTP — is wired. +// +// Body content (protobuf) is not decoded here. Asserting (1) at least one POST +// arrived at `/v1/metrics`, (2) content-type is the OTLP HTTP signature, (3) +// the body is non-empty is enough to demo the rail. 
Decoded-content assertions +// land with §10.3 once a protobuf transformer is on the test path. + +const test = require('node:test'); +const assert = require('node:assert/strict'); +const http = require('node:http'); + +const emit = require('../server/emit'); +const otel = require('../server/otel'); +const health = require('../server/otel-health'); + +function startMockCollector() { + const requests = []; + const server = http.createServer((req, res) => { + const chunks = []; + req.on('data', (c) => chunks.push(c)); + req.on('end', () => { + requests.push({ + method: req.method, + url: req.url, + contentType: req.headers['content-type'] || '', + contentLength: Buffer.concat(chunks).length, + }); + res.writeHead(200, { 'Content-Type': 'application/x-protobuf' }); + res.end(); + }); + }); + return new Promise((resolve) => { + server.listen(0, '127.0.0.1', () => { + const { port } = server.address(); + resolve({ + url: `http://127.0.0.1:${port}/v1/metrics`, + requests, + close: () => new Promise((r) => server.close(() => r())), + }); + }); + }); +} + +test.beforeEach(() => otel._resetForTests()); +test.afterEach(async () => { + await otel.shutdown(); +}); + +test('otel vertical slice: tier 1 + endpoint → exporter posts to collector', async () => { + const prevInterval = process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS; + // Long interval — we drain explicitly with flush() to avoid races. 
+ process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = '60000'; + const collector = await startMockCollector(); + + try { + const state = otel.init({ + otel: { + tier: 1, + endpoint: collector.url, + headers: {}, + resource_attributes: { 'service.name': 'ccxray-test' }, + }, + }); + assert.equal(state, 'active'); + assert.equal(health.getState(), 'active'); + + emit.emit('entry_completed', { + entry: { + provider: 'anthropic', + model: 'claude-test-model', + usage: { + input_tokens: 100, + output_tokens: 50, + cache_read_input_tokens: 200, + cache_creation_input_tokens: 25, + }, + }, + }); + + await otel.flush(); + + // forceFlush triggers the exporter synchronously inside the reader. Give + // the HTTP request one tick to actually deliver to our server. + for (let i = 0; i < 50 && collector.requests.length === 0; i++) { + await new Promise((r) => setTimeout(r, 10)); + } + + assert.ok(collector.requests.length > 0, 'collector should have received at least one POST'); + const first = collector.requests[0]; + assert.equal(first.method, 'POST'); + assert.equal(first.url, '/v1/metrics'); + assert.match(first.contentType, /protobuf|json/); + assert.ok(first.contentLength > 0, 'collector POST body must be non-empty'); + } finally { + await collector.close(); + if (prevInterval === undefined) delete process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS; + else process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = prevInterval; + } +}); + +test('otel vertical slice: tier 1 with no endpoint → active but no exporter', async () => { + const state = otel.init({ otel: { tier: 1 } }); + assert.equal(state, 'active'); + + // No collector, no SDK context — emit must not throw, must not record. 
+  emit.emit('entry_completed', {
+    entry: {
+      provider: 'anthropic',
+      model: 'claude-test-model',
+      usage: { input_tokens: 1, output_tokens: 1 },
+    },
+  });
+
+  await otel.flush(); // no-op, must not throw
+  assert.equal(health.getState(), 'active');
+});
+
+test('otel vertical slice: shutdown honors 2-second cap even when provider hangs', async () => {
+  const prevInterval = process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS;
+  process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = '60000';
+
+  // Mock collector that hangs and never responds. Forces provider.shutdown() to
+  // block until the timeout race resolves.
+  const server = http.createServer((_req, _res) => { /* hang */ });
+  await new Promise((r) => server.listen(0, '127.0.0.1', r));
+  const { port } = server.address();
+  const url = `http://127.0.0.1:${port}/v1/metrics`;
+
+  try {
+    otel.init({ otel: { tier: 1, endpoint: url, headers: {} } });
+    emit.emit('entry_completed', {
+      entry: { provider: 'anthropic', model: 'm', usage: { input_tokens: 1, output_tokens: 1 } },
+    });
+
+    const t0 = Date.now();
+    await otel.shutdown();
+    const elapsed = Date.now() - t0;
+
+    // Hard cap is 2000ms; give 500ms scheduler slack.
+    assert.ok(elapsed < 2500, `shutdown took ${elapsed}ms, must respect 2s cap`);
+    assert.equal(health.getState(), 'disabled');
+  } finally {
+    // Forcibly close still-open sockets from the hung exporter request,
+    // otherwise server.close() waits for them to drain (~8s).
+    if (typeof server.closeAllConnections === 'function') server.closeAllConnections();
+    await new Promise((r) => server.close(() => r()));
+    if (prevInterval === undefined) delete process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS;
+    else process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = prevInterval;
+  }
+});
+
+test('otel vertical slice: emit with no usage is a safe no-op', async () => {
+  const collector = await startMockCollector();
+  try {
+    otel.init({ otel: { tier: 1, endpoint: collector.url, headers: {} } });
+
+    // Entries without usage (e.g. proxy errors) must not break the handler.
+    emit.emit('entry_completed', { entry: { provider: 'anthropic', model: 'm' } });
+    emit.emit('entry_completed', { entry: null });
+    emit.emit('entry_completed', {});
+
+    await otel.flush();
+    assert.equal(health.getState(), 'active');
+  } finally {
+    await collector.close();
+  }
+});