diff --git a/docs/otel-integration.html b/docs/otel-integration.html new file mode 100644 index 0000000..761621e --- /dev/null +++ b/docs/otel-integration.html @@ -0,0 +1,1124 @@ + + + + +OTel 整合探索 — ccxray + + + + +
+ +

OTel Integration Exploration

+
What OpenTelemetry is, and a comparison of four options for connecting ccxray to the OTel ecosystem
+ +
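+ Contents
+ 1. What is OTel?
+ 2. Three signals: Traces / Metrics / Logs
+ 3. What does Claude Code's built-in OTel already do?
+ 4. ccxray sits at the HTTP layer: what can and can't it see?
+ 5. Four integration options (A / B / C / D ★)
+ 6. Comparison table of the four options
+ 7. Manager's view: MCP / tool / skill usage statistics
+ 8. Recommended roadmap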
+ 9. Pre-mortem and solutions (11 risks, 10 solutions, all ≥ 9 points) +
+ + +

1. What is OTel?

+ +

OpenTelemetry (OTel below) is not a product but a set of standards for observability data. It defines:

+ +

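The problem it solves: "Every observability backend (Datadog, New Relic, Honeycomb) used to ship its own SDK, so switching backends meant changing code. Now that everyone speaks OTel, you emit once and can send the data anywhere."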

+ +
+
+flowchart LR
+    A[你的應用程式
例如 ccxray] -->|OTel SDK
產生標準資料| B[OTel Collector
選配,中繼站] + B -->|OTLP 協議| C[Honeycomb] + B -->|OTLP 協議| D[Datadog] + B -->|OTLP 協議| E[Grafana / Jaeger] + B -->|OTLP 協議| F[Langfuse] + A -.直接送.-> C + + style A fill:#88c0d0,stroke:#5e81ac,color:#0f1419 + style B fill:#a3be8c,stroke:#5e81ac,color:#0f1419 +
+
+ +
+Remember one thing: your program → OTel SDK → backend. What sits in between is a shared data format. +
+ + +

2. The three signals

+ +

OTel splits observability data into three categories, each independent and individually switchable:

+ +
+
+

Traces

+

The timeline of a single "operation"

+

A tree of spans. Each span has a start time, an end time, and a parent.

+

ccxray example: one Claude turn = one trace, containing 1 HTTP request span + N tool spans

+
+
+

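Metrics

+

Numbers, counts, distributions

+

Counter (cumulative), Gauge (instantaneous value), Histogram (distribution). Cheap and easy to aggregate.

+

ccxray example: cumulative input_tokens, cumulative cost, cache hit rate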

+
+
+

Logs

+

Structured log records

+

Like traditional logs, but structured (JSON) and linkable to a trace and span.

+

ccxray example: full request body, tool execution results

+
+
+ +

How do the three fit together?

+ +
+
+flowchart TB
+    subgraph Trace [Trace: 一次 Claude turn]
+        S1["Span: HTTP POST /v1/messages
200ms"] + S2["Span: tool_use Read
50ms"] + S3["Span: tool_use Bash
1200ms"] + S1 --> S2 + S1 --> S3 + end + + subgraph Metrics [Metrics 同時被記錄] + M1["counter tokens.input += 2500"] + M2["counter cost.usd += 0.0125"] + end + + subgraph Logs [Logs 關聯到 span] + L1["event user_prompt
linked to S1"] + L2["event tool_result
linked to S3"] + end + + style S1 fill:#88c0d0,stroke:#5e81ac,color:#0f1419 + style S2 fill:#a3be8c,stroke:#5e81ac,color:#0f1419 + style S3 fill:#a3be8c,stroke:#5e81ac,color:#0f1419 +
+
+ + +

3. What does Claude Code's built-in OTel already do?

+ +

When you set CLAUDE_CODE_ENABLE_TELEMETRY=1, the Claude Code CLI emits OTel on its own, with no ccxray involved:

+ +
+
+flowchart LR
+    A[Claude Code CLI
內建 OTel] -->|OTLP| B[你的 Collector] + B --> C[Honeycomb / Datadog] + + A2[Claude Code CLI
無 OTel 設定] -->|純 HTTP| X[Anthropic API] + + style A fill:#a3be8c,stroke:#5e81ac,color:#0f1419 + style A2 fill:#8b95a5,stroke:#5e81ac,color:#0f1419 +
+
+ +

Spans the CLI emits by itself:

+ + +
+Note: this exists only in Anthropic's official Claude Code. Codex, Gemini, and other providers have no OTel at all. +
+ + +

4. ccxray sits at the HTTP layer: what can and can't it see?

+ +
+
+flowchart LR
+    CLI[Claude Code / Codex] -->|HTTP request| CCX[ccxray proxy]
+    CCX -->|forward| API[Anthropic / OpenAI API]
+    API -->|response| CCX
+    CCX -->|response| CLI
+
+    CCX -.寫入.-> LOG[(~/.ccxray/logs)]
+    CCX -.SSE.-> UI[Dashboard]
+
+    style CCX fill:#88c0d0,stroke:#5e81ac,color:#0f1419
+
+
+ + + + + + + +
ccxray can see ✅ | ccxray cannot see ❌
+
    +
  • The full payload of every HTTP request / response
  • +
  • model, input/output/cache tokens
  • +
  • cost (computed with LiteLLM pricing)
  • +
  • latency (from request arrival to the end of the response)
  • +
  • tool_use blocks parsed from the response → knows which tools the LLM asked to run
  • +
  • the next request carrying tool_result → knows the tool result
  • +
  • Cross-provider: Codex / Gemini are visible too
  • +
+
+
    +
  • Actual tool execution time (can only be inferred)
  • +
  • Permission prompt wait time
  • +
  • Hook execution
  • +
  • Local file I/O details
  • +
  • The user's prompt-typing actions
  • +
+
+ + +

5. Four integration options

+ +

Option A: Metrics Only (lightweight starting point)

+ +

Emits only numeric metrics: tokens, cost, request count, cache hit rate. No traces.

+ +
+
+flowchart LR
+    REQ[每次 HTTP 完成] --> M["counter tokens.input ++
counter tokens.output ++
counter cost.usd ++
histogram latency ms"] + M -->|OTLP| COL[Collector] + COL --> GRA[Grafana / Datadog
畫圖表] + + style M fill:#a3be8c,stroke:#5e81ac,color:#0f1419 +
+
+ +
+ Files touched + + Pros + + Cons + +
+ + + +

Option B: Metrics + Synthetic Traces (medium integration)

+ +

Adds traces, but the traces are "synthetic": real tool execution time is not visible, so it can only be inferred from HTTP traffic.

+ +
+
+flowchart TB
+    subgraph Trace [合成的 Trace]
+        I["claude_code.interaction
由 session_id 群組"] + L["claude_code.llm_request
真實 HTTP 時間"] + T1["ccxray.tool.synthetic
從 tool_use 推斷"] + T2["ccxray.tool.synthetic
從 tool_use 推斷"] + I --> L + L --> T1 + L --> T2 + end + + style I fill:#88c0d0,stroke:#5e81ac,color:#0f1419 + style L fill:#a3be8c,stroke:#5e81ac,color:#0f1419 + style T1 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419 + style T2 fill:#ebcb8b,stroke:#5e81ac,color:#0f1419 +
+
+ +
+ Files touched + + Pros + + Cons + +
+ + + +

Option C: Complete (Metrics + Traces + Log Events), heavy integration

+ +

Also emits the full payload ccxray sees as log events, so users can run full-text search in the OTel backend.

+ +
+
+flowchart LR
+    REQ[HTTP request] --> M[Metrics]
+    REQ --> T[Traces]
+    REQ --> L["Log Events
完整 request / response JSON"] + + M --> COL[Collector] + T --> COL + L --> COL + + COL --> BE[Honeycomb / Langfuse
可全文搜尋 payload] + + style L fill:#bf616a,stroke:#5e81ac,color:#0f1419 +
+
+ +
+ Files touched + + Pros + + Cons + +
+ + + +

Option D ★ Recommended: cloud tracing + local drill-back (Hybrid)

+ +

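Sends the metadata ccxray sees (model, tokens, cost, tool names, timing) to the cloud and keeps the full payload local. Each span carries a ccxray.entry_id attribute, so after spotting a problem in Grafana you can return to the ccxray dashboard and look up the full conversation.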

+ +
+
+flowchart LR
+    REQ[HTTP request 進來] --> CCX[ccxray proxy]
+    CCX -->|完整 payload
~50KB/turn| LOG[(~/.ccxray/logs
本地)] + CCX -->|metadata + entry_id
~1KB/turn| OTLP[OTLP Collector] + OTLP --> GRA[Grafana / Honeycomb
聚合查詢] + + GRA -.點 entry_id
跳回本地.-> UI[ccxray Dashboard
看完整 payload] + LOG --> UI + + style CCX fill:#88c0d0,stroke:#5e81ac,color:#0f1419 + style LOG fill:#a3be8c,stroke:#5e81ac,color:#0f1419 + style UI fill:#b48ead,stroke:#5e81ac,color:#0f1419 +
+
+ +

The drill-back workflow

+ +
+
+sequenceDiagram
+    autonumber
+    actor U as 工程師
+    participant G as Grafana
+    participant D as ccxray Dashboard
+    participant F as 本地 log 檔
+
+    Note over G: 看到異常 spike
cost 突然爆增 + U->>G: 點開最貴的那個 span + G-->>U: trace 顯示 ccxray.entry_id=
"2026-05-12T09-31-04-227" + U->>D: 開啟 http://localhost:5577/entry/2026-... + D->>F: 讀取本地 _req.json / _res.json + F-->>D: 完整 payload + D-->>U: 顯示完整對話、tool 呼叫、cache 結構 + Note over U: 找到原因:
某個 tool result
把 200KB 文字塞進 context +
+
+ +
+ What an emitted span actually looks like (metadata-only) +
{
+  "name": "ccxray.llm_request",
+  "attributes": {
+    "ccxray.entry_id":        "2026-05-12T09-31-04-227",
+    "ccxray.dashboard_url":   "http://localhost:5577/entry/2026-05-12T09-31-04-227",
+    "ccxray.provider":        "anthropic",
+    "model":                  "claude-opus-4-7",
+    "tokens.input":            45230,
+    "tokens.output":            1820,
+    "tokens.cache_read":       38500,
+    "tokens.cache_creation":    6730,
+    "cost.usd":              0.0825,
+    "latency_ms":              4210,
+    "tools.count":                 3,
+    "tools.names":  ["Read","Bash","Edit"]
+  }
+}
+ Note: it contains no prompt text, tool input, or tool output. +
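To make the metadata-only idea concrete, here is a minimal sketch of how such an attribute set could be derived from a logged entry. The `entry` shape (`id`, `usage`, `costUsd`, `latencyMs`, `toolCalls`) and the function name are assumptions for illustration, not ccxray's actual internals; only the attribute keys come from the sample span above.

```javascript
// Sketch only: the `entry` field names are hypothetical, not ccxray's real schema.
function toSpanAttributes(entry, dashboardBase) {
  return {
    "ccxray.entry_id": entry.id,
    "ccxray.dashboard_url": `${dashboardBase}/entry/${entry.id}`,
    "ccxray.provider": entry.provider,
    "model": entry.model,
    "tokens.input": entry.usage.input,
    "tokens.output": entry.usage.output,
    "cost.usd": entry.costUsd,
    "latency_ms": entry.latencyMs,
    "tools.count": entry.toolCalls.length,
    // Only tool names are copied; inputs and outputs stay in the local log.
    "tools.names": entry.toolCalls.map((call) => call.name),
  };
}

const attrs = toSpanAttributes(
  {
    id: "2026-05-12T09-31-04-227",
    provider: "anthropic",
    model: "claude-opus-4-7",
    usage: { input: 45230, output: 1820 },
    costUsd: 0.0825,
    latencyMs: 4210,
    toolCalls: [{ name: "Read", input: { file_path: "/private/notes.txt" } }],
  },
  "http://localhost:5577"
);
console.log(attrs["tools.names"]);
```

Building the attribute object from an explicit list of keys, rather than deleting fields from the full entry, means a newly added payload field cannot leak by default.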
+ +
+ Files touched + + + Pros + + + Cons / limitations + +
+ +
+Why is this better than both B and C? + +
+ + +

6. Comparison table of the four options

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
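Aspect | Option A: Metrics Only | Option B: + Synthetic Traces | Option C: Full payload | Option D: Hybrid drill-back
Implementation effort | 1–2 days | 3–5 days | 1–2 weeks | 2–3 days
User value | Medium: cost / token trends | High: turn timing analysis | Situational: power users | Highest: aggregation plus detail
Conflict with the CLI's built-in OTel | None | Duplicate spans | Worse duplication | None (uses the ccxray.* namespace)
Codex / Gemini support | Yes (only source) | Yes (only source) | Yes (only source)
Data volume / backend cost | Very low | High, needs sampling | Low (~1KB/turn)
Privacy risk | None (payload never leaves the machine)
Degree of replacing the dashboard | No conflict at all | Partial overlap | Heavy overlap | Complementary
Requires the user to keep the ccxray dashboard open | No | No | No | Local logs must still exist for drill-back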
+ + +

7. Manager's view: what else can be seen?

+ +

Beyond cost / tokens, managers (team leads, platform owners) usually also want to know:

+ + +

All of these are metrics plus attributes (labels). They fall within the scope of Option A and Option D and need neither traces nor full payloads.

+ +

Example counters that can be emitted

+ +
+
# Number of calls per MCP server
+ccxray.mcp.invocations_total {server="filesystem", tool="read_file"} = 1248
+ccxray.mcp.invocations_total {server="github",     tool="create_pr"} =   42
+ccxray.mcp.invocations_total {server="slack",      tool="post_message"} = 89
+
+# MCP failure count
+ccxray.mcp.errors_total {server="github", error_type="timeout"} = 7
+
+# Built-in tool usage count
+ccxray.tool.invocations_total {tool="Bash",   provider="anthropic"} = 5230
+ccxray.tool.invocations_total {tool="Read",   provider="anthropic"} = 8120
+ccxray.tool.invocations_total {tool="Edit",   provider="anthropic"} = 1840
+ccxray.tool.invocations_total {tool="WebSearch", provider="anthropic"} = 92
+
+# Skill activation count (parsed from the system prompt)
+ccxray.skill.activations_total {skill="release",   provider="anthropic"} = 12
+ccxray.skill.activations_total {skill="git-commit", provider="anthropic"} = 87
+
+# Number of sessions per provider
+ccxray.sessions_total {provider="anthropic"} = 234
+ccxray.sessions_total {provider="codex"}     =  41
+
+# Dimensions can be combined: token consumption split by model
+ccxray.tokens.input_total {model="claude-opus-4-7", provider="anthropic"} = 12_500_000
+ccxray.tokens.input_total {model="claude-sonnet-4-6", provider="anthropic"} = 38_200_000
+
+ +

Data sources ccxray already has

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
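Metric to emit | Existing ccxray source | Difficulty
ccxray.tool.invocations_total | tool_use blocks in the response (already parsed)
ccxray.mcp.invocations_total | tool names prefixed with mcp__<server>__<tool> (existing naming convention)
ccxray.skill.activations_total | skill trigger markers in the system prompt (already parsed by system-prompt.js) | Medium (the marker needs confirming)
ccxray.sessions_total | session inference in store.js
ccxray.tokens.* / cost.* | pricing.js + the usage fields of the response
Split by "user / team" | needs new setup guidance for OTEL_RESOURCE_ATTRIBUTES=enduser.id=... | Medium (relies on the user setting an environment variable)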
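As an illustration of the `mcp__<server>__<tool>` naming rule in the table, the sketch below splits a tool name into the `server` and `tool` attributes used by `ccxray.mcp.invocations_total`. The function name and the `kind` field are hypothetical; treating malformed names as `unknown` rather than guessing is an assumption made here.

```javascript
// Sketch: derive metric attributes from a tool name.
// `parseToolName` and the returned shape are illustrative, not ccxray's API.
function parseToolName(name) {
  const prefix = "mcp__";
  if (!name.startsWith(prefix)) {
    return { kind: "builtin", tool: name };
  }
  const rest = name.slice(prefix.length);
  // The first "__" after the prefix separates the server from the tool.
  const sep = rest.indexOf("__");
  if (sep <= 0 || sep + 2 >= rest.length) {
    // Prefixed but malformed: report as unknown instead of guessing.
    return { kind: "unknown", tool: name };
  }
  return { kind: "mcp", server: rest.slice(0, sep), tool: rest.slice(sep + 2) };
}

console.log(parseToolName("mcp__github__create_pr"));
console.log(parseToolName("Bash"));
```

Splitting on the first separator keeps tool names that themselves contain `__` intact, at the cost of assuming server names never do.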
+ +

More metrics managers care about (all within Option A's scope)

+ +

ccxray can derive all of the metrics below from data visible over HTTP, without traces or full payloads:

+ +
+ +
+

📈 Productivity / adoption

+
    +
  • ccxray.users.active_daily
    DAU / WAU, to gauge rollout success
  • +
  • ccxray.sessions.duration_seconds
    histogram of session length
  • +
  • ccxray.turns_per_session
    number of turns per session
  • +
  • ccxray.first_token_latency_ms
    perceived responsiveness
  • +
  • ccxray.agent_type.invocations
    how often general / explore / plan / custom subagents are used
  • +
+
+ +
+

💰 Cost efficiency

+
    +
  • ccxray.cache.hit_ratio
    cache_read / total_input; below 70% suggests a problem with prompt design
  • +
  • ccxray.cost_per_session_usd
    histogram, to find expensive outliers
  • +
  • ccxray.tokens.output_per_input_ratio
    output / input ratio; too low means wasted context
  • +
  • ccxray.quota.burn_rate_pct
    percentage of the 5h / weekly quota consumed
  • +
  • ccxray.retries_total
    retry count, an indirect cost
  • +
+
+ +
+

🚨 Quality / reliability

+
    +
  • ccxray.errors_total{type}
    rate_limit / overloaded / timeout / 500
  • +
  • ccxray.stop_reason{reason}
    end_turn / tool_use / max_tokens / stop_sequence
  • +
  • ccxray.max_tokens_hit_rate
    truncation rate; high values mean poor UX
  • +
  • ccxray.latency_ms{model,p95}
    per-model SLA
  • +
  • ccxray.aborted_total
    share of user ctrl-c / Esc aborts
  • +
+
+ +
+

🧠 Usage patterns

+
    +
  • ccxray.context.utilization_pct
    histogram of how full the context window gets
  • +
  • ccxray.auto_compact.triggered_total
    compaction trigger count, a sign that a larger context is needed
  • +
  • ccxray.subagent.invocations
    ratio of main agent to Task subagents
  • +
  • ccxray.tools_per_turn
    average number of tool calls per turn
  • +
  • ccxray.thinking.token_ratio
    share of output taken by extended thinking
  • +
+
+ +
+

🛠️ Tool / MCP details

+
    +
  • ccxray.tool.latency_ms{tool}
    estimated tool execution time (arrival of the next request − end of the previous response)
  • +
  • ccxray.tool.result_size_bytes{tool}
    tool result size; large results eat context
  • +
  • ccxray.tool.failures_total{tool,reason}
    parsed from is_error in tool_result
  • +
  • ccxray.mcp.unique_servers
    number of MCP servers the user has connected
  • +
  • ccxray.bash.command_pattern{cmd}
    the most common bash commands (first token only; cardinality risk, needs an allow-list)
  • +
+
+ +
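The `ccxray.tool.latency_ms` estimate above can be sketched as a simple subtraction of two proxy-side timestamps. The function and parameter names are hypothetical; returning `null` for a negative gap (for example, overlapping requests) is an assumption, not documented behaviour.

```javascript
// Sketch: infer tool latency from timestamps the proxy already observes.
// Both arguments are epoch milliseconds; names are illustrative only.
function estimateToolLatencyMs(prevResponseEndMs, nextRequestStartMs) {
  const gap = nextRequestStartMs - prevResponseEndMs;
  // A negative gap means the two events cannot be paired; skip the sample.
  return gap >= 0 ? gap : null;
}

console.log(estimateToolLatencyMs(1_000, 2_200)); // gap between response end and next request
```

Because the proxy cannot see permission prompts or hooks (section 4), this gap is an upper bound that also includes any time the user spent approving the tool call.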
+

🔒 Governance / security

+
    +
  • ccxray.permission_mode.usage{mode}
    share of default / acceptEdits / bypassPermissions (yolo)
  • +
  • ccxray.dangerous_tool.invocations
    detection of rm -rf / force-push / drop table
  • +
  • ccxray.file_writes_total
    Edit + Write combined
  • +
  • ccxray.provider.distribution
    share of Anthropic vs Codex vs Gemini
  • +
  • ccxray.system_prompt.version_changes
    how many times the agent system prompt changed (shows who is customising)
  • +
+
+ +
+ +
+Cardinality warning: metrics carrying high-cardinality attributes such as {user}, {cmd}, or {file_path} will blow up the backend. When designing: + +
+ +

Grafana / Datadog reports a manager can build

+ +
+
+flowchart TB
+    subgraph Reports [典型管理報表]
+        R1["📊 每週各團隊 token 消耗 / 成本
(by enduser.id)"] + R2["🔧 Top 10 最常用 tool
(by tool name)"] + R3["🔌 各 MCP server 使用熱度
(by server name)"] + R4["⚙️ Skill 採用率排行
(用了 vs 沒用)"] + R5["💸 哪個 model 燒最多錢
(by model + provider)"] + R6["🚨 MCP 失敗率告警
(error rate > X%)"] + end + + M["ccxray 送出的 metrics
含 attributes:
tool / mcp / skill / model / user"] --> Reports + + style M fill:#88c0d0,stroke:#5e81ac,color:#0f1419 +
+
+ +
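+Key insight: questions of the form "what was used / how many times" need only metrics (the core of Option A), not traces or payloads. With cardinality under control (tool names and MCP server names are finite sets), even the free tier of Grafana / Prometheus can absorb the load. +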
+ + +

8. Recommended roadmap

+ +
+Recommended in two phases: phase one implements Option A (including management metrics), phase two upgrades to Option D. +
+ +

Phase one: Option A, multi-faceted metrics (1–2 weeks)

+ + +

Phase two: upgrade to Option D, adding traces + drill-back (another 2–3 days)

+ + +

Not recommended

+ + + +

9. Pre-mortem and solutions

+ +

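Before building, imagine: "if this has failed six months from now, what caused it?" Each item is scored on a weighted 10-point scale, and only solutions scoring ≥ 9 are accepted. There are 11 items in total (#7 is skipped as an accepted risk; #11–#13 were added by a follow-up scan), and all 10 solutions passed.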

+ + + +
+ +
+

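#1 Cardinality explosion | High impact | 9.4 / 10

+

A user sets enduser.id to an email address or uses bash commands as a label, and the Grafana account gets rate-limited

+
Solution
+
    +
  • Attribute key allow-list (View API)
  • +
  • Per-(metric, attribute) cardinality budget; values beyond it are recorded as _overflow_
  • +
  • ccxray.metrics.overflow_total sentinel + ccxray status --metrics to show usage
  • +
  • Every new metric must register a schema; CI fails if one is missing
  • +
+
Verification
+
    +
  • Implementation: feed 51 unique values, assert the 51st becomes overflow
  • +
  • Production: overflow counter > 0 → automatically surfaced as a warning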
  • +
+
+ +
+

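#2 Nobody uses it and the feature dies | High impact | 9.0 / 10

+

Six months on, fewer than 5% of users have enabled OTel and the maintenance cost becomes a sunk cost

+
Solution
+
    +
  • ccxray --otel-demo starts a local Grafana with one command; data is visible within 30 seconds
  • +
  • README tutorial with screenshots: connect to Grafana in 90 seconds
  • +
  • Local heartbeat to measure usage rate (never sent out)
  • +
  • Three-month sunset clock: cut losses if there are fewer than 10 GitHub mentions
  • +
+
Verification
+
    +
  • Implementation: 3 new users walk through the flow; median time to first data < 5 minutes
  • +
  • Production: an explicit three-month KPI gate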
  • +
+
+ +
+

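#3 Drill-back breaks across machines | High impact | 9.4 / 10

+

A manager sees a trace in Grafana, clicks the localhost link, and it does not open (the link points at the engineer's machine)

+
Solution
+
    +
  • Spans carry entry_id + host + a 50-character summary + local_url + an optional public_url
  • +
  • Dashboard /entry/:id degrades gracefully when the entry is not found, showing an "on host=X" hint
  • +
  • Docs state the drill-back path for individuals, small teams, and large teams
  • +
+
Verification
+
    +
  • Implementation: CI runs two ccxray instances to simulate cross-machine drill-back
  • +
  • Production: deeplink_resolved_total{outcome} tracks the wrong_host ratio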
  • +
+
+ +
+

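#4 CLI OTel conflict → double counting | Medium impact | 9.5 / 10

+

A user enables both CLI and ccxray OTel, tokens are counted twice, and every budget alert is wrong

+
Solution
+
    +
  • Enforce the ccxray.* namespace; do not imitate claude_code.* fields
  • +
  • Detect CLAUDE_CODE_ENABLE_TELEMETRY, enter complement mode, and print a warning
  • +
  • Every emit carries the ccxray.source=ccxray-proxy resource attribute
  • +
  • ccxray.reconciliation.token_diff_pct: the difference when reconciling against the CLI
  • +
+
Verification
+
    +
  • Implementation: fixture test with both enabled; assert the source attribute tells them apart
  • +
  • Production: alert when the reconciliation diff > 5%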
  • +
+
+ +
+

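#5 Managers misuse metrics to monitor individuals | High impact | 9.7 / 10

+

A team lead brings usage counts to a review meeting, and engineers collectively abandon ccxray

+
Solution
+
    +
  • Three tiers: default OFF / project anonymous / personal named
  • +
  • The project setting is the upper bound and the personal setting can only match or lower it; an individual can downgrade and opt out at any time
  • +
  • Personal named identity lives in .ccxray.user.json (gitignored), never in the repo
  • +
  • Startup banner + ccxray status --otel + ccxray otel preview dry-run
  • +
  • Docs state explicitly: do not use these metrics to evaluate individual performance
  • +
+
Verification
+
    +
  • Implementation: test the full matrix of 4 tier upgrade/downgrade combinations
  • +
  • Production: tier_distribution tracks adoption; if tier 2 < 5%, strengthen the docs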
  • +
+
+ +
+

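#6 Parser drift (skill / MCP / tool) | Medium impact | 9.4 / 10

+

Claude Code changes its prompt format, the skill detector reports all zeros, and nobody notices for six months

+
Solution
+
    +
  • Schema-driven parsers (parsers/*.schema.json, versioned)
  • +
  • Snapshot fixtures: one fixed set of request/response pairs per provider
  • +
  • Sentinel metrics: ccxray.parser.unknown_*_total. Unrecognised input is not 0; it is "seen but could not be classified"
  • +
  • Reconciliation invariants: the tool_use block count must match the extracted count
  • +
  • Parsers are wrapped in try/catch; a failure does not affect the ccxray core
  • +
  • ccxray parser report command shows the top 10 unknowns at a glance
  • +
+
Verification
+
    +
  • Implementation: feed an unknown tool → assert sentinel ++, assert no throw
  • +
  • Production: reconciliation_mismatch > 0 = bug; unknown_* persisting for 7 days triggers an automatic suggestion to check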
  • +
+
+ +
+

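#7 Bundle size bloat | Accepted | skipped

+

@opentelemetry/sdk-node + auto-instrumentations grows ccxray from 3MB to 18MB

+
Disposition
+
    +
  • Judged an acceptable risk by the user; formal evaluation skipped
  • +
  • Self-imposed constraint during implementation: import only the required modules (api, sdk-metrics, exporter-otlp-http) and do not pull in auto-instrumentations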
  • +
+
+ +
+

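#8 Hub mode env propagation | Low impact | 9.5 / 10

+

A user changes env and reruns, but the hub is still running in the background with the old settings; they believe the change took effect while data is actually going to the wrong place

+
Solution
+
    +
  • Business OTel runs on the client side, not the hub (the hub only handles proxy + SSE broadcast)
  • +
  • Each client reads its own .ccxray.json + personal config + env
  • +
  • Different tiers / endpoints coexist naturally under the same hub
  • +
  • The hub exposes its own ccxray.hub.* operational metrics (uptime / requests / clients)
  • +
  • ccxray status shows each client's tier and env consistency
  • +
+
Verification
+
    +
  • Implementation: two clients with different configs on the same hub; assert each sends to its own endpoint
  • +
  • Production: env_inconsistency_total tracks the accumulated "changed but not restarted" cases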
  • +
+
+ +
+

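#11 Memory / behaviour when the Collector is down | Medium impact | 9.4 / 10

+

The Collector goes down, the OTel SDK retries forever, the buffer piles up, and ccxray runs out of memory

+
Solution
+
    +
  • Bounded queue (2048); when full, drop oldest
  • +
  • Circuit breaker: 5 consecutive failures → open for 60s → half-open trial → on failure, backoff (60→120→240→480→600s)
  • +
  • State + dropped counts are written to a local log, not sent over the network (the network is already down)
  • +
  • Design choice: losing data is better than dragging ccxray down; the docs say so explicitly
  • +
+
Verification
+
    +
  • Implementation: mock collector returns 500; assert memory does not grow and the drop counter increments
  • +
  • Production: a large accumulated circuit_breaker_open_seconds = a persistent problem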
  • +
+
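The bounded queue with drop-oldest semantics in #11 can be sketched as follows. The class and field names are hypothetical, and the capacity is passed in rather than hard-coded to the 2048 mentioned above; this is an illustration of the policy, not the OTel SDK's own queue.

```javascript
// Sketch: a fixed-capacity export queue that drops the oldest item when full.
class BoundedExportQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.droppedTotal = 0; // would feed a counter such as exports_dropped_total
  }

  push(item) {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // discard the oldest entry to make room
      this.droppedTotal += 1;
    }
    this.items.push(item);
  }

  // Hand the current batch to the exporter and start a fresh buffer.
  drain() {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}

const queue = new BoundedExportQueue(3);
for (const n of [1, 2, 3, 4, 5]) queue.push(n);
console.log(queue.drain(), queue.droppedTotal);
```

Memory stays bounded by `capacity` no matter how long the collector is unreachable, which is the trade-off the card states: lose data rather than take ccxray down.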
+ +
+

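#12 Config secret risk | Medium impact | 9.5 / 10

+

A user writes an Authorization token into .ccxray.json and commits it to git

+
Solution
+
    +
  • ${ENV_VAR} interpolation; tokens may live only in env
  • +
  • The schema rejects literal values that look like secrets (patterns such as Bearer, JWT, ghp_)
  • +
  • A .gitignore reminder is added automatically the first time .ccxray.json is generated
  • +
  • ccxray status scans git-tracked config for plaintext secrets
  • +
+
Verification
+
    +
  • Implementation: feed a literal Bearer value → the schema rejects it and suggests a fix
  • +
  • Implementation: feed ${TOKEN} with env unset → startup fails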
  • +
+
+ +
+

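#13 OTel failure fallback strategy | Medium impact | 9.7 / 10

+

The OTel config is wrong or the collector is down, and ccxray fails to run at all

+
Solution
+
    +
  • Three failure layers: config error (startup fails) / init error (degrade, ccxray keeps running) / runtime error (handled by #11)
  • +
  • State machine: disabled / active / degraded / circuit_open
  • +
  • ~/.ccxray/otel.log keeps the last 100 failures and rotates automatically
  • +
  • Core principle: OTel is an enhancement, not a requirement. Network problems do not block startup; config errors do
  • +
+
Verification
+
    +
  • Implementation: feed a bad endpoint URL → assert ccxray still starts, the proxy still forwards, and status shows degraded
  • +
  • Production: otel.state{state} shows the degraded ratio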
  • +
+
+ +
+ +

Shared infrastructure

+ +

#11–#13 share one failure-handling framework, which lowers the total effort:

+ +
+
server/otel-health.js        # failure-handling framework (shared)
+  ├─ State machine (active / degraded / circuit_open / disabled)
+  ├─ Bounded queue + drop counter
+  ├─ Circuit breaker
+  ├─ Local log writer (~/.ccxray/otel.log)
+  └─ Status reporter (feeds the ccxray status command)
+
+server/config-loader.js      # config loading (shared)
+  ├─ JSON Schema validation
+  ├─ ${ENV_VAR} interpolation
+  ├─ Secret pattern detection
+  └─ .gitignore check
+
+ +

Conclusion

+ +
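+All 10 pre-mortem solutions score ≥ 9, so implementation can begin. Each item's "post-launch monitoring metric" is itself part of what ccxray emits over OTel. The design is self-verifying: the system can keep detecting whether it has broken. +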
+ +

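+This document lives at docs/otel-integration.html. It contains exploratory notes written before the decision; for implementation, the final PR is authoritative. +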

+ +
+ + + + + + diff --git a/docs/otel-phase1-overview.html b/docs/otel-phase1-overview.html new file mode 100644 index 0000000..f0a8b26 --- /dev/null +++ b/docs/otel-phase1-overview.html @@ -0,0 +1,1138 @@ + + + + +OTel Phase 1 Change — Visual Overview + + + + +
+ +

OTel Phase 1: Visual Overview

+
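+ This page walks through the add-otel-metrics-phase1 OpenSpec change section by section. The "Source:" label under each figure gives the basis for that claim; click it to open the corresponding spec file. The content strictly follows proposal / design / specs / tasks and contains no speculation. +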
+ +
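+ Contents
+ 1. Big picture: data flow (animated)
+ 2. Three-tier opt-in model
+ 3. Config files and env interpolation
+ 4. OTel health state machine
+ 5. Parser pipeline and sentinels
+ 6. Coexisting with the CLI's built-in OTel
+ 7. Cardinality budget
+ 8. New files and module relationships
+ 9. Phase 1 scope vs Phase 2
+ 10. Full metric list +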
+ + +

1. Big picture: data flow

+ +

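ccxray uses a two-process client / hub architecture. OTel initialisation and emit both happen on the client side; the hub is purely an HTTP proxy + SSE broadcaster and is not responsible for business metrics. Each client reads its own .ccxray.json + .ccxray.user.json, so different projects under the same hub can have different tiers and endpoints.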

+ +
+ + + + + + + + + + + + + + + + + + + + + + + Claude Code + (or Codex) + + + + ccxray client + forward.js + store.js + otel.js + otel-health.js + config-loader.js + + + + ccxray hub + (no business metrics; proxy + SSE only) + + + + Anthropic API + /v1/messages + + + + OTLP Collector + + + + Grafana / Datadog / Honeycomb + + + + ~/.ccxray/logs + + + + + + + + + + request + forward + response + SSE + metrics (OTLP) + local log + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ request + response (SSE) + OTel metric export +
+
+ +
+ Source: + specs/otel-export/spec.md § Client-side OTel SDK initialization + + Source: + design.md § D2. Client-side emit, not hub-side + + Source: + proposal.md § Impact (server/hub.js note) + +
+ + +

2. Three-tier opt-in model

+ +

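The tier is the switch for "how much information is sent". The default is tier 0 (nothing is sent). The project config sets the upper bound and the personal config can only match or lower it, so an engineer can always opt out unilaterally.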

+ +
+ + + + + Tier 0 — disabled + no SDK init, no network egress + + + Tier 1 — project anonymous + project.name + optional team + + + Tier 2 — personal named + + enduser.id (engineer-chosen) + + + + effective_tier = + min(project_tier, personal_tier) + + project = upper bound + personal = lower bound (can only equal-or-downgrade) + project=1, personal=2 → clamps to 1 + warning + project=2, personal=0 → effective 0 (unilateral opt-out) + + + + Resolution Matrix + + + project + personal + effective + + + + 0 + + 1 + + 1 + + 1 + 0 + 0 (opt-out) + + 1 + 2 + 1 (clamped) + + 2 + 2 + 2 (with enduser.id) + + 2 + 0 + 0 (opt-out) + + 「—」 表示該層 config 不存在 + missing = treat as that side absent + + + + + +
+ +
+Key restriction (spec § Personal config gitignore enforcement): if .ccxray.user.json is tracked by git, ccxray refuses to load the personal identity and suggests git rm --cached. +
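The resolution rule `effective_tier = min(project_tier, personal_tier)` can be sketched in a few lines. The function name and return shape are hypothetical, and the handling of a missing config file (no project config → 0, no personal config → follow the project) is an assumption made for this sketch; the spec's exact behaviour for absent files should be checked against specs/otel-tiers/spec.md.

```javascript
// Sketch of tier resolution. Missing-config handling below is an assumption.
function resolveEffectiveTier(projectTier, personalTier) {
  const project = projectTier ?? 0; // assumed: no project config means disabled
  const personal = personalTier ?? project; // assumed: no personal config follows the project
  return {
    tier: Math.min(project, personal),
    // True when the personal config asked for more than the project allows.
    clamped: personal > project,
  };
}

console.log(resolveEffectiveTier(1, 2)); // personal request above project cap
console.log(resolveEffectiveTier(2, 0)); // unilateral opt-out
```

Because the result is a minimum, no project setting can raise an engineer above the tier they chose, which is what makes the opt-out unilateral.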
+ +
+ Source: + specs/otel-tiers/spec.md § Three discrete tier values + + Source: + specs/otel-tiers/spec.md § Tier resolution rule + + Source: + specs/otel-tiers/spec.md § Engineer unilateral opt-out + + Source: + specs/otel-tiers/spec.md § Personal config gitignore enforcement + +
+ + +

3. Config files and env interpolation

+ +

Two files: .ccxray.json (project level, checked into git) + .ccxray.user.json (personal level, gitignored). Every string value supports ${VAR} substitution from process.env. The schema rejects literal values that look like secrets.

+ +
+ + + + + + + + .ccxray.json + (repo, in git) + { + "otel": { + "tier": 1, + "endpoint": "https://...", + "headers": { + "Authorization": "Bearer ${TOKEN}" + } + } + + + + .ccxray.user.json + (personal, gitignored) + { "otel": { "tier": 2, "identity": "alice", + "opt_in_acknowledged_at": "..." } } + + + + config-loader.js + 1. Parse JSON + 2. Schema validate + 3. Interpolate ${VAR} + 4. Detect literal secrets + + + + process.env + TOKEN=abc... + (secret stays in env) + + + + Loaded config + effective_tier = 2 + Authorization: Bearer abc*** + + + + Startup FAIL + if literal Bearer, missing ${VAR}, + or schema error + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + +
Input | Result
"Authorization": "Bearer ${TOKEN}" + TOKEN=abc123 | ✓ Loaded; actual value Bearer abc123
"Authorization": "Bearer ${MISSING}" + env not set | ✗ Startup fails; the message includes file path / line / variable name MISSING
"Authorization": "Bearer abc123longtokenvalue..." | ✗ Rejected by the schema; suggests using ${ENV_VAR} instead
JSON syntax error | ✗ Startup fails; the message includes line / column
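The first three rows of the table can be reproduced with a small sketch. The regular expressions are illustrative guesses at "looks like a secret" (Bearer, JWT, and ghp_ are the patterns named in the exploration notes); the function name is hypothetical, and the real loader would also report file path and line, which this sketch omits.

```javascript
// Sketch: reject literal secrets, then interpolate ${VAR} from an env object.
// The patterns are assumptions for illustration, not the spec's exact rules.
const SECRET_PATTERNS = [
  /Bearer\s+[A-Za-z0-9._~+\/-]{8,}/, // literal bearer token
  /\bghp_[A-Za-z0-9]{20,}/, // GitHub personal access token prefix
  /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/, // JWT-like value
];

function loadConfigString(raw, env) {
  // Check the raw value first, so "Bearer ${TOKEN}" passes and a pasted token does not.
  for (const pattern of SECRET_PATTERNS) {
    if (pattern.test(raw)) {
      throw new Error("literal secret detected; use ${ENV_VAR} instead");
    }
  }
  return raw.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (match, name) => {
    if (env[name] === undefined) {
      throw new Error(`environment variable ${name} is not set`);
    }
    return env[name];
  });
}

console.log(loadConfigString("Bearer ${TOKEN}", { TOKEN: "abc123" }));
```

Running the secret check before interpolation is deliberate: after substitution every valid header would look like a literal secret.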
+ +
+ Source: + specs/otel-config/spec.md § Project and personal config files + + Source: + specs/otel-config/spec.md § Environment variable interpolation + + Source: + specs/otel-config/spec.md § Literal-secret rejection + + Source: + specs/otel-config/spec.md § Config error fails fast at startup + +
+ + +

4. OTel health state machine

+ +

Four states, with transitions allowed only under the conditions recorded in the spec. Core promise: an OTel failure never blocks the ccxray proxy.

+ +
+ OTel health state machine (diagram)
+ States: disabled (no SDK, no egress); active (SDK init OK, exporting); degraded (init failed; proxy still OK); circuit_open (exports paused, cooldown); half_open (single trial export)
+ [start] → disabled: tier=0 or no OTel pkg
+ [start] → active: tier≥1, init OK
+ [start] → degraded: tier≥1, init fails
+ active → circuit_open: 5 consecutive failures
+ circuit_open → half_open: cooldown elapsed
+ half_open → active: trial OK
+ half_open → circuit_open: trial fails → backoff
+ While active, queue full: drop oldest, exports_dropped_total++ (no state change)
+ Cooldown formula: next = min(previous * 2, 600s), starting from 60s after first trip
+ When degraded: ccxray proxy keeps working; no further OTel attempts; visible in ccxray status --otel until process restart.
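The cooldown formula `next = min(previous * 2, 600s)`, starting from 60s, can be sketched directly. The function name and the convention of passing `undefined` for the first trip are assumptions for illustration.

```javascript
// Sketch of the circuit-breaker cooldown schedule, in seconds.
const FIRST_COOLDOWN_S = 60;
const MAX_COOLDOWN_S = 600;

function nextCooldownSeconds(previous) {
  // First trip: no previous cooldown yet.
  if (previous === undefined) return FIRST_COOLDOWN_S;
  return Math.min(previous * 2, MAX_COOLDOWN_S);
}

// Walk the schedule for six consecutive failed trials.
const schedule = [];
let cooldown;
for (let i = 0; i < 6; i += 1) {
  cooldown = nextCooldownSeconds(cooldown);
  schedule.push(cooldown);
}
console.log(schedule.join(" → "));
```

Doubling with a cap keeps retry traffic low during a long outage while guaranteeing a trial export at least every ten minutes.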
+ +

Failure layers

+ + + + + + + + + + + + + + + + + + +
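Failure type | Example | Handling
Config error | JSON syntax error, schema violation, unresolved ${VAR} | Startup fails (exit code != 0)
Init error | Malformed endpoint URL | → degraded; ccxray runs normally and status shows the error
Runtime error | Collector unreachable, auth failure, timeout | Handled by the circuit breaker with exponential backoff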
+ +

Persistence and capacity limits

+ + + +
+ Source: + specs/otel-health/spec.md § Four-state OTel health machine + + Source: + specs/otel-health/spec.md § Bounded export queue with drop-oldest semantics + + Source: + specs/otel-health/spec.md § Circuit breaker with exponential backoff + + Source: + specs/otel-health/spec.md § Failure log on local disk + + Source: + specs/otel-health/spec.md § Never-block guarantee for the proxy + + Source: + specs/otel-health/spec.md § Config errors fail fast, init/runtime errors degrade + +
+ + +

5. Parser pipeline and sentinels

+ +

Parsing of tool / MCP / skill / agent-type moves from scattered inline strings to versioned JSON schemas. Every entry runs the reconciliation invariants; an unrecognised event does not become 0 but increments a sentinel counter and is written to ~/.ccxray/parser-drift.log.

+ +
+ + + + + + + + + + Response from upstream + tool_use blocks, + usage tokens, etc. + + + + Parser dispatch + server/parsers/index.js + anthropic-tools.schema.json + anthropic-skills.schema.json + mcp-tools.schema.json + codex-tools.schema.json + + + + Recognized → metrics + ccxray.tool.invocations_total{tool}, etc. + + + + Unknown → sentinel + ccxray.parser.unknown_*_total ++ + + append to parser-drift.log + + + + Invariant fail → mismatch + tool_use count ≠ extracted count? + ccxray.parser.reconciliation_mismatch_total ++ + + + + + + + + post-extractcheck + + + + + + + + + + + + + +
+ +

Metadata carried by each schema

+ +
+
{
+  "version": "1.0.0",
+  "last_verified_against": "2026-05-10",
+  "patterns": [ ... ],
+  "examples": [ ... ]
+}
+
+ +

Error isolation

+ +

Every parser is wrapped in try/catch. If it throws → ccxray.parser.error_total{parser,error_type} ++, the entry is still written to the local log, and the metric/span for that entry carries ccxray.parser.degraded=true. A parser failure does not affect the proxy path.
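The three outcomes described in this section (recognised, unknown, parser error) plus the reconciliation invariant can be sketched together. The `KNOWN_TOOLS` set stands in for a versioned schema, and the `counters` object stands in for real OTel instruments; all names here are hypothetical.

```javascript
// Sketch: isolated parsing with sentinel counters. Names are illustrative only.
const KNOWN_TOOLS = new Set(["Bash", "Read", "Edit"]); // stand-in for a schema file

function newCounters() {
  return { invocations: {}, unknownTool: 0, reconciliationMismatch: 0, parserError: 0 };
}

function extractTools(content, counters) {
  let extracted = [];
  try {
    for (const block of content) {
      if (block.type !== "tool_use") continue;
      extracted.push(block.name);
      if (KNOWN_TOOLS.has(block.name)) {
        counters.invocations[block.name] = (counters.invocations[block.name] ?? 0) + 1;
      } else {
        counters.unknownTool += 1; // seen, but could not be classified
      }
    }
  } catch (err) {
    counters.parserError += 1; // the proxy path continues regardless
    return { degraded: true, extracted: [] };
  }
  // Invariant: an independent count of tool_use blocks must equal what was extracted.
  const expected = content.filter((b) => b.type === "tool_use").length;
  if (expected !== extracted.length) counters.reconciliationMismatch += 1;
  return { degraded: false, extracted };
}

const counters = newCounters();
extractTools([{ type: "tool_use", name: "Bash" }, { type: "tool_use", name: "FancyNewTool" }], counters);
console.log(counters);
```

In a parser this small the invariant always holds; it earns its keep once extraction involves pattern matching that can silently skip blocks after an upstream format change.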

+ +
+ Source: + specs/parser-schemas/spec.md § Versioned parser schemas per concern and provider + + Source: + specs/parser-schemas/spec.md § Sentinel counters for unknown tokens + + Source: + specs/parser-schemas/spec.md § Reconciliation invariants + + Source: + specs/parser-schemas/spec.md § Parser error isolation + +
+ + +

6. Coexisting with the CLI's built-in OTel

+ +

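The Claude Code CLI also has built-in OTel. When ccxray detects CLAUDE_CODE_ENABLE_TELEMETRY=1 it enters complement mode and adds the ccxray.cli_otel_active=true attribute to every emit. ccxray never turns off its own emit (the CLI has no Codex support, ccxray observes the HTTP truth, and the diff between the two is itself a valuable signal).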

+ +
+ + + + + Standalone mode + CLAUDE_CODE_ENABLE_TELEMETRY 未設 + + + Claude Code CLI + (no OTel) + + + ccxray emits + ccxray.* + + → Single source of truth + → Banner: "ccxray OTel tier: 1 (anonymous)" + + + + Complement mode + CLAUDE_CODE_ENABLE_TELEMETRY=1 + + + CLI emits + claude_code.* + + + ccxray emits + ccxray.* + cli_otel_active=true + + → Reconciliation: ccxray.reconciliation.token_diff_pct{model} + → Both flow to user's collector, distinguishable via + resource attribute ccxray.source="ccxray-proxy" + + +
+ +
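+Why ccxray emit is not turned off: (1) the CLI's built-in OTel covers Anthropic only, not Codex / Gemini; (2) ccxray sees the HTTP truth, a different vantage point from the CLI; (3) a diff between the two is a high-value alert (it means one side is computing pricing incorrectly). +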
+ +
+ Source: + specs/otel-export/spec.md § CLI OTel coexistence and complement mode + + Source: + specs/otel-export/spec.md § Source resource attribute on every emit + + Source: + specs/otel-export/spec.md § Reconciliation diff metric + + Source: + specs/otel-export/spec.md § `ccxray.*` namespace for all emitted metrics + +
+ + +

7. Cardinality budget

+ +

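Each metric declares "which attribute keys are allowed" and "the maximum number of unique values per key". A key not on the allow-list is dropped outright (OTel View API). A value beyond the budget is recorded as _overflow_ and the sentinel counter is incremented.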

+ +
+ + + + + ccxray.tool.invocations_total — tool budget: 50 + + + + + 23 used + 27 remaining + + + + incoming attribute + tool="Bash" + + + + + accepted as-is + overflow_total: 0 + + + + When 51st unique value arrives + tool="FancyNewToolThatNobodyKnows" + + + + + recorded as tool="_overflow_" + + ccxray.metrics.overflow_total{ + metric=..., attribute="tool" } ++ + + +
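The budget behaviour in the figure (50 values accepted, the 51st recorded as `_overflow_`) can be sketched with a set per (metric, attribute) pair. The class name is hypothetical, and this shows only the value budget; the key allow-list is a separate mechanism handled by the View API.

```javascript
// Sketch: per-attribute cardinality budget with an overflow bucket.
class CardinalityBudget {
  constructor(limit) {
    this.limit = limit;
    this.seen = new Set();
    this.overflowTotal = 0; // would feed ccxray.metrics.overflow_total
  }

  // Returns the attribute value to record for an incoming value.
  admit(value) {
    if (this.seen.has(value)) return value;
    if (this.seen.size < this.limit) {
      this.seen.add(value);
      return value;
    }
    this.overflowTotal += 1;
    return "_overflow_";
  }
}

const toolBudget = new CardinalityBudget(50);
const recorded = [];
for (let i = 1; i <= 51; i += 1) recorded.push(toolBudget.admit(`tool-${i}`));
console.log(recorded[49], recorded[50], toolBudget.overflowTotal);
```

Values that were admitted before the budget filled keep working afterwards, so a burst of novel labels cannot displace the established ones.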
+ +

High-cardinality fields that must not be metric labels (per design.md § D4)

+ +
+The spec does not list an explicit deny-list, but design.md § D4 states that bash.command_pattern and file_path are explicitly NOT used as metric labels, to avoid a cardinality explosion. +
+ +
+ Source: + specs/otel-export/spec.md § Cardinality budget enforcement + + Source: + design.md § D4. Cardinality budget with overflow fallback + +
+ + +

8. New files and module relationships

+ +
+ + + + + + + + + + + + + 新增模組 + + + server/config-loader.js + schema · ${VAR} · secrets · gitignore + + + server/otel-health.js + state machine · queue · breaker · log + + + server/otel.js + SDK init · registry · cardinality · source + + + server/parsers/ + *.schema.json + index.js + + + test/fixtures/parser/ + snapshot fixtures + + + package.json + minimal OTel deps · lazy require + + + + 被修改的既有檔案 + + + server/forward.js + emit metrics on request complete + + + server/store.js + thin shim over parsers + + + server/system-prompt.js + skill marker via schema + + + server/hub.js + no business metrics (doc comment) + + + bin/ccxray.js + status --otel · otel preview · parser report + + + + Phase 2 follow-up + (NOT in this change) + + + span emit (traces) + ccxray.entry_id, + dashboard_url + + + /entry/:id route + deep-link drill-back UI + + + + + + + + + + + + + + + + + + + deps + + + ↑ snapshot tests run against parsers/ + + +
+ +
+ Source: + proposal.md § Impact + + Source: + tasks.md § Tasks 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 + +
+ + +

9. Phase 1 scope vs Phase 2

+ +
+ +
+

✓ Phase 1: scope of this change

+
    +
  • Metrics emit (ccxray.* namespace)
  • +
  • Three-tier opt-in (default OFF)
  • +
  • .ccxray.json / .ccxray.user.json + ${VAR} interpolation
  • +
  • OTel health (state machine, queue, breaker)
  • +
  • Parser schemas + sentinels + reconciliation
  • +
  • CLI coexistence detection + reconciliation diff metric
  • +
  • ccxray status --otel / otel preview / parser report
  • +
  • Startup banner, secrets masking
  • +
+
+ +
+

✗ Phase 2 follow-up

+
    +
  • Span emit(traces)
  • +
  • ccxray.entry_id / dashboard_url attributes
  • +
  • /entry/:id deep-link route
  • +
  • ccxray.hub.* operational metrics (open question)
  • +
  • --otel-demo Docker Compose helper(open question)
  • +
+
+ +
+ +
+ Source: + proposal.md § What Changes (last bullet: Out of scope) + + Source: + design.md § Non-Goals + + Source: + design.md § Open Questions + +
+ + +

10. Full metric list

+ +

All metrics live in the ccxray.* namespace, and every emit carries the resource attribute ccxray.source="ccxray-proxy". In complement mode, ccxray.cli_otel_active=true is added as well.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Family | Metric | Attributes
Cost | ccxray.tokens.input_total | model, provider *
Cost | ccxray.tokens.output_total | model, provider *
Cost | ccxray.tokens.cache_read_total | model, provider *
Cost | ccxray.tokens.cache_creation_total | model, provider *
Cost | ccxray.cost.usd_total | model, provider *
Cost | ccxray.cache.hit_ratio (gauge) | model, provider *
Usage | ccxray.tool.invocations_total | tool, provider
Usage | ccxray.mcp.invocations_total | server, tool
Usage | ccxray.skill.activations_total | skill, provider
Usage | ccxray.sessions_total | provider
Usage | ccxray.agent_type.invocations_total | type
Quality | ccxray.errors_total | type, provider
Quality | ccxray.stop_reason_total | reason
Quality | ccxray.latency_ms (histogram) | model, provider
Quality | ccxray.max_tokens_hit_total | model
Patterns | ccxray.context.utilization_pct (histogram) |
Patterns | ccxray.auto_compact.triggered_total |
Patterns | ccxray.subagent.invocations_total |
Patterns | ccxray.tools_per_turn (histogram) |
Governance | ccxray.permission_mode.usage_total | mode
Governance | ccxray.dangerous_tool.invocations_total | pattern
Governance | ccxray.file_writes_total |
Governance | ccxray.provider.distribution_total | provider
Sentinels | ccxray.metrics.overflow_total | metric, attribute
Sentinels | ccxray.parser.unknown_tool_total | provider
Sentinels | ccxray.parser.unknown_skill_marker_total | provider
Sentinels | ccxray.parser.unknown_mcp_format_total |
Sentinels | ccxray.parser.fallback_used_total | parser, reason
Sentinels | ccxray.parser.reconciliation_mismatch_total | type
Sentinels | ccxray.parser.error_total | parser, error_type
Sentinels | ccxray.otel.exports_dropped_total | signal
Sentinels | ccxray.otel.state (gauge) | state
CLI reconciliation | ccxray.reconciliation.token_diff_pct (gauge) | model
Tier observation | ccxray.otel.tier_distribution | tier
+ +

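+* The attributes of the Cost family are not listed one by one in specs/otel-export/spec.md § Required metric families (the spec annotates them inline only for the Usage / Quality parts). This table pre-fills the dimensions commonly used at implementation time (model, provider); the authoritative attribute registration list is the metric registry in server/otel.js (Tasks § 4.5). +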

+ +
+ Source: + specs/otel-export/spec.md § Required metric families + + Source: + specs/otel-export/spec.md § Cardinality budget enforcement + + Source: + specs/otel-export/spec.md § Reconciliation diff metric + + Source: + specs/parser-schemas/spec.md § Sentinel counters / Reconciliation invariants / Parser error isolation + + Source: + specs/otel-health/spec.md § Bounded export queue / Health state observable + + Source: + specs/otel-tiers/spec.md § Tier distribution sentinel + +
+ +

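+This file lives at docs/otel-phase1-overview.html, and its content strictly follows the proposal / design / specs / tasks documents under openspec/changes/add-otel-metrics-phase1/. Every claim links to its source; if you find any visual that disagrees with the spec, treat the spec as authoritative and report the mismatch. +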

+ +
+ + + diff --git a/openspec/changes/add-otel-metrics-phase1/.openspec.yaml b/openspec/changes/add-otel-metrics-phase1/.openspec.yaml new file mode 100644 index 0000000..40cc12f --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-05-12 diff --git a/openspec/changes/add-otel-metrics-phase1/design.md b/openspec/changes/add-otel-metrics-phase1/design.md new file mode 100644 index 0000000..0641cdc --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/design.md @@ -0,0 +1,166 @@ +## Context + +ccxray currently emits no telemetry to external systems. All observation happens via the local dashboard reading from `~/.ccxray/logs/`. Adding OpenTelemetry export changes ccxray's blast radius — data starts leaving the user's machine — and intersects with three sensitive design surfaces: + +1. **Privacy.** Engineers run ccxray in their own dev environment. Any telemetry that identifies them by default would break that contract. +2. **Trust with managers.** Aggregated metrics are genuinely useful for engineering leaders, but a feature that lets a manager track individual tool usage will trigger a backlash that kills adoption. +3. **Provider neutrality.** Claude Code's CLI has built-in OTel for Anthropic; Codex/Gemini have none. ccxray must coexist with the CLI without double-counting, and must remain the only telemetry source for non-Anthropic providers. + +Before drafting this design, an 11-risk pre-mortem was completed and recorded in `docs/otel-integration.html`. Every accepted solution scored ≥ 9/10 on weighted criteria including verification mechanisms. The design below is the synthesis of those solutions. + +## Goals / Non-Goals + +**Goals:** + +- Provide ccxray-emitted OTel metrics covering cost, usage (tool/MCP/skill), quality (errors/latency/cache), patterns (context/subagent), and governance. +- Default OFF. Zero telemetry until the user explicitly opts in per-project. 
+- Three-tier opt-in (disabled / project-anonymous / personal-named) where the project sets an upper bound and personal config can only equal-or-downgrade. +- Coexist with Claude Code CLI's built-in OTel without overlap, with a reconciliation metric to surface accounting bugs on either side. +- Never let OTel failure break the proxy. Config errors fail at startup, init errors degrade silently, runtime errors are absorbed by a bounded queue + circuit breaker. +- Make parser drift visible. Unknown tools / skills / MCP markers must increment a sentinel counter rather than silently turn into zero. +- Provide introspection: `ccxray status --otel`, `ccxray otel preview` (dry-run), `ccxray parser report`. + +**Non-Goals:** + +- **Traces / spans.** Phase 1 emits metrics only. Spans, `entry_id` deep-link attributes, and `/entry/:id` drill-back UI are Phase 2. +- **Full payload export.** Request/response bodies never leave the machine. If a future user wants this, it belongs in a separate "ccxray log → S3 / self-hosted backend" product, not in the OTel pipeline. +- **Synthetic tool span timing.** Tool execution durations inferred from HTTP cadence would be misleading; the CLI emits accurate timing for Anthropic, and we will not compete with inaccurate data. +- **Central ccxray hub for team-wide aggregation.** Each engineer's ccxray remains local. Cross-machine correlation, if needed, is a Phase 2+ discussion. +- **Auto-instrumentation.** We will not pull in `@opentelemetry/auto-instrumentations-node`. ccxray controls every emit point explicitly to keep the dependency footprint and behavior predictable. + +## Decisions + +### D1. Default OFF with three-tier opt-in + +Three tier values: + +- **tier 0 (disabled)** — no OTel SDK initialization, no network egress. Default behavior when no config file or env override exists. +- **tier 1 (project anonymous)** — metrics emit with project-level attributes (`project.name`, optional `team`) but no individual identity. 
Activated by `.ccxray.json` checked into the repo. +- **tier 2 (personal named)** — adds `enduser.id` (a self-chosen string, not necessarily real name) to allow individual ccxray usage analytics. Activated by `.ccxray.user.json` in the working directory, which is gitignored. + +Resolution rule: `effective_tier = min(project_tier, personal_tier)`. Project config is the upper bound; personal config can only equal or downgrade. An engineer can always set tier 0 in personal config to opt out of project-level emit on their own machine. + +**Alternatives considered:** + +- *Always-on anonymous* — rejected. "Anonymous" telemetry has well-documented re-identification risks; defaulting to ON breaks the implicit trust contract. +- *Cookie-style consent prompt at startup* — rejected. Prompt fatigue leads to blanket yes; one-time `opt_in_acknowledged_at` timestamp in personal config achieves the same intent without nagging. +- *k-anonymity at the backend* — rejected. ccxray does not control the backend; small teams (k < 5) cannot rely on this guarantee. + +### D2. Client-side emit, not hub-side + +OTel SDK initialization and metric emission happen in the client process (the one that ran `ccxray claude`). The hub remains a pure HTTP proxy plus SSE broadcaster. The hub MAY emit its own operational metrics under `ccxray.hub.*` namespace using a separate config (`~/.ccxray/hub-config.json`), but it does NOT emit business metrics on behalf of clients. + +This means different projects connecting to the same hub can configure different tiers, endpoints, and `OTEL_RESOURCE_ATTRIBUTES` without interfering with each other. + +**Alternatives considered:** + +- *Hub-side emit with per-client config fanout* — rejected. Adds a routing/fan-out concern to the hub with no clear value; the hub would need to track which spans belong to which client config. +- *Hub-only emit, ignore per-project differences* — rejected. 
Conflicts with D1 and forces every project on a host to share one OTel destination. + +### D3. `ccxray.*` namespace, never mirror `claude_code.*` + +Every metric uses `ccxray..` (`ccxray.tokens.input_total`, `ccxray.tool.invocations_total`, etc.). Every emit carries the resource attribute `ccxray.source="ccxray-proxy"`. When the CLI's `CLAUDE_CODE_ENABLE_TELEMETRY=1` is detected, ccxray enters "complement mode" and adds `ccxray.cli_otel_active=true` to its emits, plus a startup notice explaining how to choose between the two metric families. + +**Cross-source reconciliation: pivoted to downstream.** An earlier version of this design proposed emitting `ccxray.reconciliation.token_diff_pct{model}` as a gauge. After expert review (Sigelman / Majors / Sridharan), Phase 1 drops the in-proxy diff gauge for these reasons: (1) the gauge is pre-aggregated and cannot answer "which request diverged"; (2) the diff is rarely zero for legitimate reasons (SSE chunking, retries, prompt-cache edge cases) → alert fatigue; (3) acquiring the CLI's counts in-process requires either querying the user's storage backend (couples ccxray to Prom/OTLP dialects) or embedding an OTLP receiver (turns ccxray from proxy into telemetry product, violating instrumentation neutrality and expanding blast radius). Instead, ccxray emits faithful per-request signals and ccxray-internal invariant metrics (`ccxray.invariants.*`); cross-source reconciliation against the CLI is a downstream concern — see `docs/otel-recon.md` for recording-rule / sidecar / wide-event join recipes. A `--debug-reconcile` ad-hoc flag may be reconsidered in a later phase. + +**Alternatives considered:** + +- *Auto-disable ccxray emit when CLI is active* — rejected. Loses the reconciliation signal and forfeits ccxray's Codex/Gemini advantage. +- *Same metric names, different resource* — rejected. Backends commonly aggregate by metric name first; using the same names would force users to filter by resource attribute on every panel. 
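
A minimal sketch of the complement-mode detection described in D3, assuming the check is a plain string comparison on the environment variable. The function name `detectCoexistence` and the returned object shape are hypothetical and only illustrate where each attribute would go.

```javascript
// Hypothetical helper, not part of the ccxray codebase.
// Decides between standalone and complement mode from the environment (D3).
function detectCoexistence(env) {
  const cliOtelActive = env.CLAUDE_CODE_ENABLE_TELEMETRY === '1';
  return {
    mode: cliOtelActive ? 'complement' : 'standalone',
    // Resource attribute carried by every emit, regardless of mode.
    resourceAttributes: { 'ccxray.source': 'ccxray-proxy' },
    // Extra attribute added only while the CLI's own OTel is active.
    metricAttributes: cliOtelActive ? { 'ccxray.cli_otel_active': true } : {},
  };
}

console.log(detectCoexistence({ CLAUDE_CODE_ENABLE_TELEMETRY: '1' }).mode);
```

Keeping the decision in one pure function would let the startup banner and the metric emit path read the same answer.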
+ +### D4. Cardinality budget with overflow fallback + +Every metric declares an allow-list of attribute keys and a per-key cardinality budget (e.g. `tool=50`, `model=10`, `mcp_server=30`). Attribute values are tracked in a `Set` per (metric, attribute); when the Set reaches budget size, subsequent unique values are recorded as the literal string `_overflow_` and a sentinel counter `ccxray.metrics.overflow_total{metric,attribute}` increments. + +Attribute keys not in the allow-list are dropped at the View API layer (OTel SDK native enforcement). High-cardinality candidates that look attractive (`bash.command_pattern`, `file_path`) are explicitly NOT emitted as metric labels. + +**Alternatives considered:** + +- *Trust the backend to handle cardinality* — rejected. Free-tier Grafana Cloud, open-source Prometheus, and many enterprise backends impose hard limits that result in dropped series or account-level throttling. +- *Silent drop on overflow* — rejected. Violates the "no silent failure" principle. + +### D5. Failure isolation via state machine + bounded queue + circuit breaker + +`server/otel-health.js` owns a state machine with four states: + +- `disabled` — OTel never initialized (tier 0 or no config). +- `active` — SDK initialized, exports succeeding. +- `degraded` — SDK init failed; ccxray continues without OTel; status command shows the error. +- `circuit_open` — runtime export failures triggered the circuit breaker; periodic half-open retries. + +The export queue is bounded (default 2048 entries, configurable). On overflow, oldest entries are dropped and `ccxray.otel.exports_dropped_total{signal}` increments locally (network is presumed unreachable when the queue overflows). + +Circuit breaker: 5 consecutive failures → `circuit_open` for 60s → `half_open` test → success returns to `active`, failure backs off (60 → 120 → 240 → 600s max). + +**Alternatives considered:** + +- *Unbounded queue with retries* — rejected. OOMs ccxray when the collector is down. 
+- *Fail-fast on first error* — rejected. Transient errors are common; one timeout should not disable telemetry for the rest of the session. + +### D6. Config: `.ccxray.json` + `.ccxray.user.json` with `${ENV_VAR}` interpolation + +Two-file config: + +- `.ccxray.json` — project root, checked into git, sets tier upper bound and shared settings (endpoint, headers, resource attributes). +- `.ccxray.user.json` — project root or `$HOME`, gitignored, sets personal identity and overrides (only ever equal-or-downgrade vs project config). + +Both files support `${ENV_VAR}` interpolation in string values. The schema validator rejects any string that looks like a literal secret (`Bearer [A-Za-z0-9]{20,}`, `sk_live_*`, `ghp_*`, JWT structure) when not wrapped in `${...}`. First-time generation auto-amends `.gitignore` to include `.ccxray.user.json`. + +Config errors (syntax, schema, unresolved `${VAR}`) fail at startup with a clear error pointing to the offending line. Init errors (bad endpoint format) transition to `degraded`. Runtime errors (collector down) transition to `circuit_open`. + +**Alternatives considered:** + +- *Single file with comments marking secrets* — rejected. JSON has no comments and the convention is too fragile. +- *Pure env-var configuration* — rejected. Loses per-project granularity; same shell environment cannot easily switch contexts when working across multiple repos. + +### D7. Parser schema-ization with sentinel counters + +Tool / MCP / skill / agent-type detection moves from inline strings in `system-prompt.js` / `store.js` / `helpers.js` to versioned JSON schemas under `server/parsers/`. Each schema declares the patterns it recognizes and carries a `last_verified_against` date. + +For every entry processed, parsers emit: + +- The recognized metrics (tool invocations, skill activations, etc.). +- `ccxray.parser.unknown_*_total{provider}` counters when a token/marker is seen but not recognized. 
+- `ccxray.parser.reconciliation_mismatch_total{type}` when invariants fail (e.g. count of `tool_use` blocks in response ≠ count of tools extracted by parser). + +Parsers are wrapped in try/catch; on exception, `ccxray.parser.error_total{parser}` increments and the entry continues to be written to local logs (degraded OTel, never blocked proxy). + +Snapshot fixtures under `test/fixtures/parser/` lock current behavior; changes require committing new snapshots and pass review. + +**Alternatives considered:** + +- *Keep inline parsing* — rejected. Already fragile (silent dependence on Claude Code's evolving prompt format) and cannot detect drift. +- *Server-side parser updates via remote schema fetch* — rejected. Adds a new failure surface and security concern. + +### D8. CLI surface: `status --otel`, `otel preview`, `parser report` + +- `ccxray status --otel` — current tier, endpoint, OTel state, cardinality usage (e.g. `tool: 23/50`), dropped event counters, circuit breaker state. +- `ccxray otel preview` — dry-run printing the next export's content without sending. Lets users see exactly what would be exported before enabling. +- `ccxray parser report` — last 7 days of unknown tool / skill / MCP markers grouped by frequency; generates a GitHub issue template body for drift reports. + +Startup banner declares the active tier and (if applicable) complement-mode coexistence with CLI OTel. + +## Risks / Trade-offs + +- **Risk: Adoption stalls because individual devs do not have an OTel backend.** → Ship `ccxray --otel-demo` that spins up a local Grafana + Prometheus via Docker Compose so a developer can see their own metrics in 30 seconds without joining any external service. Set a 3-month KPI gate: < 10 GitHub references → pause Phase 2 investment. 
+- **Risk: Manager misuse for individual surveillance.** → Default OFF + tier 2 requires personal opt-in by the engineer + explicit `docs/otel-ethics.md` distributed as part of the change ("these metrics are not for individual performance evaluation; the reasons follow…"). Track `ccxray.otel.tier_distribution`: if tier 2 share is < 5%, strengthen the docs. +- **Risk: Cardinality explosion despite budgets.** → Budgets are enforced at the SDK View API layer, with a sentinel counter for overflow visibility. CI lint blocks new metrics that lack a schema entry. `ccxray.metrics.overflow_total > 0` for sustained periods triggers an in-status warning. +- **Risk: Bundle bloat from OTel SDK.** → Import only `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources`. No auto-instrumentations. Optional dependency pattern so the package still resolves when OTel deps are absent (lazy require). +- **Risk: Hub-mode env changes don't propagate.** → Business OTel is client-side (D2); hub env only affects `ccxray.hub.*` operational metrics. `ccxray status` displays per-client tier/endpoint so users can see whether each client has picked up the env they expected. +- **Risk: Parser drift when Anthropic changes the prompt format.** → Sentinel counters (`ccxray.parser.unknown_*_total`) make drift visible within hours instead of months; `last_verified_against` dates trigger quarterly re-verification; `ccxray parser report` makes drift reports easy to file. +- **Risk: OTel semantic conventions evolve and our attribute names become out of date.** → All metric names live in the schema registry under `server/otel.js`; a future migration is a search-and-replace plus a deprecation period. +- **Trade-off: We do not compete with the CLI on Anthropic tool span timing.** → Acceptable. Our value is the HTTP-layer truth, Codex/Gemini coverage, the per-request signals that make downstream reconciliation against the CLI possible (D3), and the future Phase 2 drill-back. 
+ +## Migration Plan + +- **Forward.** Phase 1 ships behind opt-in defaults; existing ccxray users see no behavior change. Adopters add a `.ccxray.json`, set an endpoint, and confirm with `ccxray otel preview` before traffic flows. The `--otel-demo` subcommand provides a zero-config local Grafana for evaluation. +- **Rollback.** Each `ccxray.*` metric is a contract; once shipped, names cannot be renamed without a deprecation cycle. The schema registry tracks every metric with its introduction version. +- **Phase 2 prerequisites.** Shared modules introduced here (`otel-health.js`, `config-loader.js`, parser schemas, sentinel framework, status surface) are designed to host Phase 2's span emit and `/entry/:id` route without rework. + +## Open Questions + +- Should `.ccxray.json` lookup walk up from cwd to the nearest enclosing dir (monorepo-friendly), or only check cwd? Recommendation: walk up to nearest git root, take the first match. +- Should we ship `--otel-demo` Docker Compose files in this PR or as a follow-up doc? Recommendation: follow-up, to keep Phase 1 scope tight. +- Should `ccxray.hub.*` operational metrics ship in Phase 1 or be deferred? Recommendation: defer to keep this change focused on the client side. +- For the auto-update of `.gitignore`, should the user be prompted or should it be automatic? Recommendation: prompt the first time, with a `--yes` flag for automation. +- Should `ccxray --otel-demo` be a documented dev tool only, or a supported feature? Recommendation: dev tool only (clearly labeled experimental). 
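
The circuit-breaker schedule from D5 (5 consecutive failures, 60 s open, then 60 → 120 → 240 → 600 s backoff) can be sketched as a small state holder. This is an illustration under stated assumptions, not the contents of `server/otel-health.js`: the class name, method names, and the choice to pass the clock in as seconds are all hypothetical.

```javascript
// Hypothetical sketch of the D5 circuit breaker; names are illustrative.
const FAILURE_THRESHOLD = 5;
const BACKOFF_SECONDS = [60, 120, 240, 600];

class CircuitBreaker {
  constructor() {
    this.state = 'active';
    this.consecutiveFailures = 0;
    this.backoffIndex = 0;
    this.retryAt = 0;
  }

  // Returns true when an export attempt is allowed at time `now` (seconds).
  canAttempt(now) {
    if (this.state !== 'circuit_open') return true;
    if (now >= this.retryAt) {
      this.state = 'half_open'; // allow a single probe export
      return true;
    }
    return false;
  }

  recordSuccess() {
    this.state = 'active';
    this.consecutiveFailures = 0;
    this.backoffIndex = 0;
  }

  recordFailure(now) {
    if (this.state === 'half_open') {
      // A failed probe lengthens the open window, capped at the last entry.
      this.backoffIndex = Math.min(this.backoffIndex + 1, BACKOFF_SECONDS.length - 1);
      this.open(now);
      return;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= FAILURE_THRESHOLD) this.open(now);
  }

  open(now) {
    this.state = 'circuit_open';
    this.retryAt = now + BACKOFF_SECONDS[this.backoffIndex];
  }
}
```

Passing `now` explicitly, rather than reading the system clock inside the class, keeps the schedule easy to exercise in failure-mode tests.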
diff --git a/openspec/changes/add-otel-metrics-phase1/proposal.md b/openspec/changes/add-otel-metrics-phase1/proposal.md new file mode 100644 index 0000000..c490321 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/proposal.md @@ -0,0 +1,47 @@ +## Why + +ccxray captures everything an agent does at the HTTP layer — full request/response, token counts, cost, tool calls, MCP server activity, skill activations — but the data lives only in the local dashboard. Teams that already operate Grafana / Datadog / Honeycomb cannot aggregate ccxray's signals into their existing observability pipeline. Claude Code's CLI has built-in OTel for Anthropic only and does not expose the HTTP-layer truth ccxray sees; Codex, Gemini, and future providers have no OTel at all. The full design rationale, pre-mortem (11 risks scored ≥ 9/10) and alternative options live at `docs/otel-integration.html`. + +This change adds Phase 1: emit ccxray's metrics over OTLP, gated behind a default-off tiered opt-in, with a failure model that never degrades the proxy. Phase 2 (metadata-only traces with `entry_id` drill-back) is a follow-up. + +## What Changes + +- New optional metric export under `ccxray.*` namespace covering cost, usage (tool / MCP / skill / agent_type / provider), quality (errors, stop_reason, latency, max_tokens_hit_rate), patterns (context_utilization, auto_compact_triggered, subagent_ratio, tools_per_turn) and governance (permission_mode, dangerous_tool, file_writes). +- New configuration files: `.ccxray.json` (repo, project-level) and `.ccxray.user.json` (gitignored, personal). `${ENV_VAR}` interpolation. Schema rejects literal-looking secrets. Auto-add `.ccxray.user.json` to `.gitignore` if missing. +- Three-tier opt-in model: **tier 0 disabled (default)** / tier 1 anonymous project-level / tier 2 personal named. Project config is the upper bound; personal config can only equal or downgrade. Engineers can opt out unilaterally. 
+- Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and enter "complement mode" with `ccxray.cli_otel_active=true` attribute; every metric carries `ccxray.source="ccxray-proxy"` resource attribute. ccxray emits ccxray-internal invariant metrics (`ccxray.invariants.*`); cross-source reconciliation against the CLI is documented as a downstream pattern (recording rules / sidecar / wide-event join on `request_id`) in `docs/otel-recon.md`, not as an in-proxy gauge — keeps ccxray as a transparent proxy with bounded blast radius. +- Cardinality budget per (metric, attribute) with `_overflow_` fallback and `ccxray.metrics.overflow_total` sentinel; attribute key allow-list enforced via OTel View API. +- Parser schema-ization: extract tool / MCP / skill detection into `server/parsers/*.schema.json` with snapshot fixtures, sentinel metrics (`ccxray.parser.unknown_*_total`), and reconciliation invariants (tool_use block count must equal extracted count). +- Failure fallback: config errors fail fast at startup; init errors degrade silently (ccxray keeps proxying); runtime errors handled by bounded queue (drop oldest) + circuit breaker (5 failures → open 60s → exponential backoff). OTel failures **never** break the proxy. +- New shared modules: `server/otel-health.js` (state machine, circuit breaker, bounded queue, local log writer) and `server/config-loader.js` (JSON schema validation, env interpolation, secret detection, gitignore check). +- OTel emit lives in the **client** process, not the hub. Each project's tier/endpoint coexists on the same hub. Hub gains its own operational metrics under `ccxray.hub.*` namespace. +- New CLI commands: `ccxray status --otel` (current tier, endpoint, health, cardinality usage), `ccxray otel preview` (dry-run printing the next export's content), `ccxray parser report` (recent unknown events for drift detection). +- Out of scope (Phase 2 follow-up): span emit (traces), `/entry/:id` deep-link route, `ccxray.entry_id` / `dashboard_url` attributes. 
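
As a rough illustration of the "bounded queue (drop oldest)" item in the list above, the sketch below keeps at most `capacity` pending entries and counts what it discards. The class name and fields are hypothetical; the real queue is expected to live in `server/otel-health.js`, and the `dropped` count would feed something like `ccxray.otel.exports_dropped_total`.

```javascript
// Hypothetical sketch of a drop-oldest export queue; names are illustrative.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0; // local count of discarded entries
  }

  push(item) {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // discard the oldest entry to make room
      this.dropped += 1;
    }
    this.items.push(item);
  }

  // Hands the pending batch to the exporter and empties the queue.
  drain() {
    const batch = this.items;
    this.items = [];
    return batch;
  }
}
```

Because `push` never throws and never waits, the proxy path that enqueues a metric cannot be blocked by a slow or unreachable collector.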
+ +## Capabilities + +### New Capabilities + +- `otel-config`: `.ccxray.json` and `.ccxray.user.json` schema, `${ENV_VAR}` interpolation, literal-secret rejection, `.gitignore` auto-amend, project-upper-bound + personal-lower-bound merging rules. +- `otel-export`: OTel SDK initialization (client-side, not hub), metric definitions under `ccxray.*` namespace, `ccxray.source` resource attribute, cardinality budget enforcement with `_overflow_` fallback, CLI coexistence detection and complement-mode signaling, ccxray-internal invariant metrics, explicit non-emit of cross-source diff gauge (deferred to downstream). +- `otel-tiers`: three-tier opt-in (disabled / project-anonymous / personal-named), tier resolution with project as upper bound and personal as lower bound, `enduser.id` attribute only in tier 2, opt-in acknowledgment timestamp persisted in personal config. +- `otel-health`: failure state machine (`disabled / active / degraded / circuit_open`), bounded export queue with drop-oldest semantics, circuit breaker with exponential backoff, local failure log at `~/.ccxray/otel.log` with rotation, never-block guarantee for the proxy path. +- `parser-schemas`: extract skill / MCP / tool / agent-type detection into versioned JSON schemas, snapshot fixtures per provider (Anthropic + Codex), sentinel metrics for unknown events, reconciliation invariants run per entry, try/catch isolation so parser failure does not affect ccxray core. +- `otel-introspection`: `ccxray status --otel` view (tier, endpoint, health, cardinality, dropped counts), `ccxray otel preview` dry-run, `ccxray parser report` for drift inspection, startup banner declaring active tier and CLI coexistence mode. + +### Modified Capabilities + +(None — Phase 1 is additive. Existing capabilities are not changed.) + +## Impact + +- New `server/otel.js`, `server/otel-health.js`, `server/config-loader.js`, `server/parsers/` directory tree (schemas + fixtures + unknown-handler). 
+- `server/forward.js` — emit metric on request completion (counters + histograms) via the otel-health-guarded queue; no behavior change when OTel is disabled. +- `server/store.js` — session / tool / skill / MCP / agent_type detection becomes a thin shim over `server/parsers/*`; reconciliation invariants run per entry; sentinel counters incremented on unknown. +- `server/system-prompt.js` — agent-type and skill marker detection moves into `parsers/anthropic-skills.schema.json`; existing parsing behavior preserved. +- `server/hub.js` — hub gains optional `ccxray.hub.*` operational metrics (uptime, request rate, connected clients) under its own config in `~/.ccxray/hub-config.json`. Hub does NOT emit business metrics; those stay client-side. +- `server/routes/api.js` — no new HTTP routes in Phase 1 (deep-link route is Phase 2). +- `bin/ccxray.js` or equivalent CLI entry — new subcommands: `status --otel`, `otel preview`, `parser report`. Existing commands unaffected when OTel is disabled. +- `package.json` — add minimal OTel dependencies (`@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources`). No auto-instrumentations. Optional dependency pattern so the package still works if OTel is not installed. +- New docs: `docs/otel-integration.html` (already exists, decision record), `docs/otel-ethics.md` (why these metrics are not for individual performance evaluation), `docs/otel-quickstart.md` (90-second Grafana onboarding). +- Tests: parser snapshot fixtures, cardinality budget enforcement tests, tier resolution matrix tests, failure-mode tests (collector down, bad endpoint, bad auth, malformed config). 
diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md new file mode 100644 index 0000000..b30283a --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md @@ -0,0 +1,90 @@ +## ADDED Requirements + +### Requirement: Project and personal config files + +ccxray SHALL read two optional configuration files at startup: `.ccxray.json` (project-level, repo-checked-in) and `.ccxray.user.json` (personal-level, gitignored). Both files use JSON. Missing files SHALL be treated as tier 0 (disabled). + +#### Scenario: No config present + +- **WHEN** ccxray starts in a directory with neither `.ccxray.json` nor `.ccxray.user.json` +- **THEN** OTel SDK SHALL NOT initialize and no network egress SHALL occur + +#### Scenario: Project config present, no personal config + +- **WHEN** ccxray starts in a directory with `.ccxray.json` that enables tier 1 +- **THEN** OTel SDK SHALL initialize at tier 1 with project-level attributes only + +#### Scenario: Both project and personal config present + +- **WHEN** project config sets tier 2 and personal config sets tier 2 with `enduser.id` +- **THEN** the effective tier SHALL be tier 2 and `enduser.id` SHALL be attached to emitted metrics + +### Requirement: Tier resolution as upper bound and lower bound + +The effective tier SHALL be `min(project_tier, personal_tier)` so that the project config is an upper bound and personal config can only equal-or-downgrade. An engineer SHALL be able to unilaterally opt out by setting tier 0 in personal config. 
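
The `min(project_tier, personal_tier)` rule can be sketched in a few lines. The function name `resolveTier` and its return shape are hypothetical. The sketch also assumes that an absent personal config means "no override" rather than tier 0, which is one reading of the "project config present, no personal config" scenario and is not spelled out by the requirement text.

```javascript
// Hypothetical sketch of tier resolution; not the actual config-loader code.
function resolveTier(projectTier, personalTier) {
  // No project config at all: OTel stays disabled.
  const project = projectTier ?? 0;
  // Assumption: a missing personal config applies no override.
  const personal = personalTier ?? project;
  return {
    effective: Math.min(project, personal),
    // True when the personal config asked for more than the project allows,
    // which is the case where a warning would be printed.
    clamped: personal > project,
  };
}

console.log(resolveTier(1, 0)); // personal opt-out wins
console.log(resolveTier(1, 2)); // clamped to the project tier
```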
+ +#### Scenario: Personal config downgrades from project + +- **WHEN** project config enables tier 1 and personal config explicitly sets tier 0 +- **THEN** no OTel emission SHALL occur for this engineer + +#### Scenario: Personal config cannot exceed project + +- **WHEN** project config enables tier 1 and personal config sets tier 2 +- **THEN** the effective tier SHALL be tier 2 only if the project explicitly authorizes tier 2; otherwise tier resolution SHALL clamp to tier 1 and emit a warning + +### Requirement: Environment variable interpolation + +All string values in config files SHALL support `${VAR}` interpolation, resolved at load time from `process.env`. Unresolved variables SHALL cause startup failure with a clear error message naming the missing variable. + +#### Scenario: Header value uses env var + +- **WHEN** config contains `"Authorization": "Bearer ${OTLP_TOKEN}"` and `OTLP_TOKEN=abc123` is set in the environment +- **THEN** the loaded header value SHALL be `"Bearer abc123"` and the literal string SHALL NOT appear in any debug log line + +#### Scenario: Missing env var + +- **WHEN** config contains `"Authorization": "Bearer ${MISSING_VAR}"` and `MISSING_VAR` is not set +- **THEN** ccxray SHALL exit non-zero with an error message that includes the file path, line, and the variable name `MISSING_VAR` + +### Requirement: Literal-secret rejection + +The schema validator SHALL reject any string value that matches a literal-secret pattern (`Bearer [A-Za-z0-9]{20,}`, `sk_live_*`, `sk_test_*`, `ghp_*`, JWT three-segment structure) unless the value is wrapped in `${...}`. Pure URLs and hostnames SHALL be allowed. 
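
The interpolation and literal-secret rules above can be illustrated together. The sketch below is an assumption-laden approximation: the function names are hypothetical, the regular expressions are one possible reading of the patterns named in the requirement (the JWT check in particular is a loose "three dot-separated base64url segments" heuristic), and real validation would also need file and line information for error messages.

```javascript
// Hypothetical sketch; the patterns approximate the ones listed in the spec.
const PLACEHOLDER = /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g;

const SECRET_PATTERNS = [
  /Bearer [A-Za-z0-9]{20,}/,
  /sk_live_[A-Za-z0-9]+/,
  /sk_test_[A-Za-z0-9]+/,
  /ghp_[A-Za-z0-9]+/,
  // Loose JWT shape: three long base64url segments separated by dots.
  /\b[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b/,
];

function looksLikeLiteralSecret(value) {
  // Strip ${VAR} placeholders first so interpolated values never match.
  const literalPart = value.replace(PLACEHOLDER, '');
  return SECRET_PATTERNS.some((pattern) => pattern.test(literalPart));
}

function interpolate(value, env) {
  return value.replace(PLACEHOLDER, (match, name) => {
    if (env[name] === undefined) {
      throw new Error(`Unresolved config variable: ${name}`);
    }
    return env[name];
  });
}

console.log(interpolate('Bearer ${OTLP_TOKEN}', { OTLP_TOKEN: 'abc123' }));
```

Running the secret check on the raw config value, before interpolation, is what lets `Bearer ${TOKEN}` pass while a pasted token is rejected.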
+ +#### Scenario: Literal bearer token rejected + +- **WHEN** config contains `"Authorization": "Bearer abc123longtokenvalue..."` +- **THEN** ccxray SHALL exit at startup with an error suggesting the user switch to `${ENV_VAR}` interpolation + +#### Scenario: Interpolated bearer token accepted + +- **WHEN** config contains `"Authorization": "Bearer ${TOKEN}"` and `TOKEN` is set +- **THEN** ccxray SHALL load successfully and use the resolved value + +### Requirement: Gitignore auto-amend on first generation + +When ccxray writes a new `.ccxray.user.json` for the first time, it SHALL check whether the file is covered by the project's `.gitignore`. If not, ccxray SHALL prompt the user (or apply automatically when `--yes` is passed) to append `.ccxray.user.json` to `.gitignore`. + +#### Scenario: Gitignore missing entry + +- **WHEN** ccxray creates `.ccxray.user.json` in a repo whose `.gitignore` does not list it +- **THEN** ccxray SHALL prompt for permission to append `.ccxray.user.json` and reflect the choice in the next run + +#### Scenario: Gitignore already covers the file + +- **WHEN** ccxray creates `.ccxray.user.json` and `.gitignore` already contains an entry matching the file +- **THEN** no prompt SHALL appear and the file SHALL be written silently + +### Requirement: Config error fails fast at startup + +Config syntax errors, schema violations, unresolved `${VAR}` references, and literal-secret matches SHALL cause ccxray to exit non-zero at startup with an actionable error message. ccxray SHALL NOT silently continue with a partial config. 
+ +#### Scenario: Invalid JSON + +- **WHEN** `.ccxray.json` contains malformed JSON +- **THEN** ccxray SHALL print a parse error citing the file path and the offending line/column, and SHALL exit non-zero + +#### Scenario: Schema violation + +- **WHEN** `.ccxray.json` sets `otel.tier` to an unknown value +- **THEN** ccxray SHALL print a schema error naming the field and listing valid values, and SHALL exit non-zero diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md new file mode 100644 index 0000000..f4c0112 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-export/spec.md @@ -0,0 +1,128 @@ +## ADDED Requirements + +### Requirement: Client-side OTel SDK initialization + +OTel SDK initialization SHALL occur in the client process (the one running `ccxray claude` or similar) and SHALL NOT occur in the hub process. The hub SHALL remain a pure HTTP proxy and SSE broadcaster. + +#### Scenario: Client initializes OTel + +- **WHEN** a ccxray client process starts with tier ≥ 1 +- **THEN** the OTel SDK SHALL initialize within the client process and emit metrics tagged with that client's resource attributes + +#### Scenario: Hub does not emit business metrics + +- **WHEN** the ccxray hub forwards an HTTP request between a client and an upstream provider +- **THEN** the hub SHALL NOT emit any business metric on behalf of the client, regardless of the client's tier setting + +### Requirement: `ccxray.*` namespace for all emitted metrics + +Every metric SHALL be named under the `ccxray..` pattern. No metric SHALL be named identically to a Claude Code CLI metric or any other upstream OTel convention that would overlap. 
+ +#### Scenario: Metric naming + +- **WHEN** an OTel metric is registered +- **THEN** its name SHALL start with the literal prefix `ccxray.` + +#### Scenario: Namespace collision prevention + +- **WHEN** code attempts to register a metric whose name matches a `claude_code.*` pattern +- **THEN** registration SHALL fail and tests SHALL flag it + +### Requirement: Source resource attribute on every emit + +Every metric SHALL carry the resource attribute `ccxray.source="ccxray-proxy"` so that backends can filter ccxray-emitted data from data emitted by other OTel sources running on the same host. + +#### Scenario: Source attribute present + +- **WHEN** any metric is exported by ccxray +- **THEN** its resource attributes SHALL include `ccxray.source="ccxray-proxy"` + +### Requirement: Cardinality budget enforcement + +Each metric SHALL declare its allowed attribute keys and a numeric cardinality budget per key. Attribute keys not in the allow-list SHALL be dropped via OTel View API. When the count of unique values for an allow-listed key reaches its budget, subsequent unique values SHALL be replaced with the literal string `_overflow_` and the sentinel counter `ccxray.metrics.overflow_total{metric,attribute}` SHALL increment. 
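
The budget behaviour described above can be sketched with one `Set` per (metric, attribute) pair. Everything here is illustrative: the class and method names are hypothetical, the overflow tally stands in for `ccxray.metrics.overflow_total`, and counting every overflowed recording (rather than every new unique value) is an assumption the requirement does not settle.

```javascript
// Hypothetical sketch of cardinality budgeting; names are illustrative.
const OVERFLOW_VALUE = '_overflow_';

class CardinalityBudget {
  constructor(budgets) {
    this.budgets = budgets; // e.g. { tool: 50, model: 10 }
    this.seen = new Map(); // "metric|attribute" -> Set of admitted values
    this.overflowCounts = new Map(); // "metric|attribute" -> overflow tally
  }

  // Returns the value to emit, or undefined when the key is not allow-listed.
  admit(metric, attribute, value) {
    const budget = this.budgets[attribute];
    if (budget === undefined) return undefined; // caller drops the attribute

    const key = `${metric}|${attribute}`;
    let values = this.seen.get(key);
    if (!values) {
      values = new Set();
      this.seen.set(key, values);
    }
    if (values.has(value)) return value;
    if (values.size < budget) {
      values.add(value);
      return value;
    }
    // Budget exhausted: record the overflow and emit the literal fallback.
    this.overflowCounts.set(key, (this.overflowCounts.get(key) ?? 0) + 1);
    return OVERFLOW_VALUE;
  }
}
```

The `seen` sets are also what a status view could read to show usage such as `tool: 23/50`.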
+ +#### Scenario: Allowed attribute within budget + +- **WHEN** `ccxray.tool.invocations_total` receives an attribute `tool="Read"` and `Read` is the 3rd of 50 budgeted tool names +- **THEN** the metric SHALL emit with `tool="Read"` and `ccxray.metrics.overflow_total` SHALL NOT increment + +#### Scenario: Budget exhausted + +- **WHEN** the cardinality budget for `tool` is 50 and a 51st unique tool name arrives +- **THEN** the metric SHALL emit with `tool="_overflow_"` and `ccxray.metrics.overflow_total{metric="ccxray.tool.invocations_total",attribute="tool"}` SHALL increment by 1 + +#### Scenario: Unallowed attribute key + +- **WHEN** code attempts to record `ccxray.tool.invocations_total` with attribute `bash_command="rm -rf /tmp/foo"` while `bash_command` is not in the allow-list +- **THEN** the `bash_command` attribute SHALL be dropped before emission + +### Requirement: CLI OTel coexistence and complement mode + +ccxray SHALL detect the presence of `CLAUDE_CODE_ENABLE_TELEMETRY=1` in the environment and, when detected, SHALL emit all metrics with an additional attribute `ccxray.cli_otel_active=true`. ccxray SHALL print a startup notice explaining how to choose between ccxray and CLI metrics when both are active. ccxray SHALL NOT disable any of its own metrics based on CLI coexistence. + +#### Scenario: CLI OTel detected + +- **WHEN** ccxray starts with `CLAUDE_CODE_ENABLE_TELEMETRY=1` set +- **THEN** ccxray SHALL print a startup notice indicating complement mode and SHALL add `ccxray.cli_otel_active=true` to all emitted metrics + +#### Scenario: CLI OTel not detected + +- **WHEN** ccxray starts without `CLAUDE_CODE_ENABLE_TELEMETRY` +- **THEN** ccxray SHALL print a notice indicating standalone mode and the attribute `ccxray.cli_otel_active` SHALL NOT be set + +### Requirement: Internal invariant metrics; cross-source reconciliation is a downstream concern + +ccxray SHALL emit invariant metrics that describe ccxray-internal consistency only. 
ccxray SHALL NOT emit a cross-source diff metric (e.g. ccxray vs CLI token counts) as part of Phase 1. Cross-source reconciliation SHALL be performed by downstream consumers (recording rules, Grafana panels, sidecar processes) using `request_id` or `session_id` joins on per-request metrics emitted independently by ccxray and the CLI. + +Rationale: A pre-aggregated diff gauge cannot answer "which request diverged" and produces persistent non-zero values for legitimate reasons (SSE chunking boundaries, retries, prompt-caching edge cases), creating alert fatigue. ccxray's correct role is to emit faithful per-request signals; cross-source diff is an analytical task that belongs in the user's observability tier, where it can be expressed as a derived series. + +#### Scenario: Parser sum invariant + +- **WHEN** ccxray's parser extracts a sum of per-tool token attributions that differs from the upstream `usage` block totals for the same response +- **THEN** `ccxray.invariants.parser_mismatch_total{type="token_sum"}` SHALL increment + +#### Scenario: SSE stream completeness invariant + +- **WHEN** ccxray observes the upstream SSE stream terminating without a `message_stop` (Anthropic) or `response.completed` (OpenAI Responses) terminal event +- **THEN** `ccxray.invariants.sse_truncated_total{provider}` SHALL increment + +#### Scenario: No cross-source diff gauge is emitted + +- **WHEN** OTel is enabled at any tier +- **THEN** no metric whose name matches `ccxray.reconciliation.*` SHALL be registered with the SDK in Phase 1 + +### Requirement: Required metric families + +ccxray SHALL emit the following metric families when OTel is enabled: + +- **Cost**: `ccxray.tokens.input_total`, `ccxray.tokens.output_total`, `ccxray.tokens.cache_read_total`, `ccxray.tokens.cache_creation_total`, `ccxray.cost.usd_total`, `ccxray.cache.hit_ratio` (gauge). 
+- **Usage**: `ccxray.tool.invocations_total{tool,provider}`, `ccxray.mcp.invocations_total{server,tool}`, `ccxray.skill.activations_total{skill,provider}`, `ccxray.sessions_total{provider}`, `ccxray.agent_type.invocations_total{type}`. +- **Quality**: `ccxray.errors_total{type,provider}`, `ccxray.stop_reason_total{reason}`, `ccxray.latency_ms` (histogram, attributes: `model`,`provider`), `ccxray.max_tokens_hit_total{model}`. +- **Patterns**: `ccxray.context.utilization_pct` (histogram), `ccxray.auto_compact.triggered_total`, `ccxray.subagent.invocations_total`, `ccxray.tools_per_turn` (histogram). +- **Governance**: `ccxray.permission_mode.usage_total{mode}`, `ccxray.dangerous_tool.invocations_total{pattern}`, `ccxray.file_writes_total`, `ccxray.provider.distribution_total{provider}`. + +Each metric SHALL be registered with its allow-list of attribute keys and cardinality budget at SDK initialization. + +#### Scenario: Cost metric emission after a turn + +- **WHEN** ccxray completes forwarding a request and receives a usage block from the upstream provider +- **THEN** `ccxray.tokens.input_total`, `ccxray.tokens.output_total`, and `ccxray.cost.usd_total` SHALL each increment by the corresponding value + +#### Scenario: Tool invocation metric + +- **WHEN** ccxray detects a `tool_use` block named `Bash` in a response +- **THEN** `ccxray.tool.invocations_total` SHALL increment by 1 with attribute `tool="Bash"` + +### Requirement: Minimal optional dependencies + +The OTel-related Node.js dependencies SHALL be limited to `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, and `@opentelemetry/resources`. Auto-instrumentation packages SHALL NOT be included. Dependencies SHALL be resolved lazily so that ccxray remains functional even when OTel packages are absent (tier 0 only). 
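One way to read the lazy-resolution rule above is a loader that never touches an OTel package at tier 0 and reports every missing package at once at tier ≥ 1. The sketch below is hedged accordingly: `loadOtelPackages` and its injectable `load` parameter are hypothetical names introduced for illustration, and turning the thrown error into a non-zero exit is left to the caller.

```javascript
// The four packages named in the requirement above.
const OTEL_PACKAGES = [
  '@opentelemetry/api',
  '@opentelemetry/sdk-metrics',
  '@opentelemetry/exporter-metrics-otlp-http',
  '@opentelemetry/resources',
];

// Hypothetical helper. `load` defaults to Node's `require` and is injectable
// so the behavior can be exercised without the packages being installed.
function loadOtelPackages(effectiveTier, load = require) {
  if (effectiveTier === 0) return null; // tier 0: reference no OTel package at all
  const loaded = {};
  const missing = [];
  for (const name of OTEL_PACKAGES) {
    try {
      loaded[name] = load(name);
    } catch (err) {
      if (err && err.code === 'MODULE_NOT_FOUND') missing.push(name);
      else throw err; // unrelated load failures are not swallowed
    }
  }
  if (missing.length > 0) {
    const error = new Error(
      `OTel tier ${effectiveTier} needs these packages: npm install ${missing.join(' ')}`
    );
    error.missing = missing; // lets the caller print an actionable message
    throw error;
  }
  return loaded;
}
```

Collecting all missing names before throwing means the engineer sees one complete install command instead of fixing packages one at a time.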
+ +#### Scenario: OTel packages absent and tier 0 + +- **WHEN** OTel packages are not installed and effective tier is 0 +- **THEN** ccxray SHALL start normally without referencing any OTel package + +#### Scenario: OTel packages absent and tier ≥ 1 + +- **WHEN** OTel packages are not installed and effective tier is ≥ 1 +- **THEN** ccxray SHALL emit a clear error explaining which packages to install and SHALL exit non-zero diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-health/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-health/spec.md new file mode 100644 index 0000000..42ad9c2 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-health/spec.md @@ -0,0 +1,99 @@ +## ADDED Requirements + +### Requirement: Four-state OTel health machine + +ccxray SHALL maintain an OTel health state machine with exactly four states: `disabled`, `active`, `degraded`, and `circuit_open`. Transitions SHALL be driven exclusively by the conditions described in the subsequent requirements; no other code path SHALL mutate state. + +#### Scenario: Disabled at startup + +- **WHEN** effective tier is 0 or OTel packages are absent +- **THEN** the state SHALL be `disabled` and `ccxray.otel.state` SHALL emit only its disabled gauge (where possible) and otherwise stay silent + +#### Scenario: Active after successful init + +- **WHEN** effective tier is ≥ 1 and SDK initialization completes +- **THEN** the state SHALL be `active` + +### Requirement: Bounded export queue with drop-oldest semantics + +The OTel export queue SHALL be bounded by a configurable size (default 2048 entries). When the queue is full and a new export is attempted, the oldest queued entry SHALL be dropped to make room. Each drop SHALL increment `ccxray.otel.exports_dropped_total{signal}`. 
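The drop-oldest behavior above can be sketched with a plain array. This is a minimal illustration, not the real module: the class name `BoundedExportQueue` and the `onDrop` callback (where `ccxray.otel.exports_dropped_total{signal}` would be incremented) are assumptions for this example.

```javascript
// Illustrative sketch: names are assumptions, not ccxray's actual API.
class BoundedExportQueue {
  // maxSize defaults to the 2048 entries stated in the requirement.
  constructor(maxSize = 2048, onDrop = () => {}) {
    this.maxSize = maxSize;
    this.onDrop = onDrop; // invoked once per dropped entry
    this.items = [];
  }

  // Enqueues synchronously; never waits on export completion.
  push(entry) {
    if (this.items.length >= this.maxSize) {
      const oldest = this.items.shift(); // drop the oldest to make room
      this.onDrop(oldest);
    }
    this.items.push(entry);
  }

  // Hands the current batch to the exporter and leaves the queue empty.
  drain() {
    const batch = this.items;
    this.items = [];
    return batch;
  }

  get size() {
    return this.items.length;
  }
}
```

`Array.prototype.shift` is linear in the queue length; at 2048 entries that is unlikely to matter, but a ring buffer would be the natural replacement if the default grew substantially.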
+ +#### Scenario: Queue under limit + +- **WHEN** the queue holds fewer than its configured maximum entries and a new export arrives +- **THEN** the new entry SHALL be appended and no drop SHALL occur + +#### Scenario: Queue at limit + +- **WHEN** the queue is at its configured maximum and a new export arrives +- **THEN** the oldest entry SHALL be removed, the new entry SHALL be appended, and `ccxray.otel.exports_dropped_total{signal=""}` SHALL increment by 1 + +### Requirement: Circuit breaker with exponential backoff + +After 5 consecutive export failures, the state SHALL transition to `circuit_open` and exports SHALL be paused. After an initial cooldown of 60 seconds, the state SHALL transition to `half_open` and a single export SHALL be attempted. Success SHALL return the state to `active`. Failure SHALL keep the state at `circuit_open` and the cooldown SHALL double up to a maximum of 600 seconds. + +#### Scenario: Trip on 5 consecutive failures + +- **WHEN** 5 consecutive export attempts return errors +- **THEN** the state SHALL transition to `circuit_open` and no further exports SHALL be attempted until the cooldown elapses + +#### Scenario: Half-open success returns to active + +- **WHEN** the cooldown elapses, the state moves to `half_open`, and the trial export succeeds +- **THEN** the state SHALL transition back to `active` and the cooldown SHALL reset to 60 seconds + +#### Scenario: Half-open failure increases cooldown + +- **WHEN** the trial export in `half_open` fails +- **THEN** the state SHALL remain `circuit_open` and the next cooldown SHALL be `min(previous_cooldown * 2, 600)` seconds + +### Requirement: Failure log on local disk + +Failed export attempts and state transitions SHALL be written to `~/.ccxray/otel.log` in append mode. The file SHALL be rotated once it exceeds a configurable size (default 1 MB). Rotated files SHALL be retained up to a configurable count (default 5). 
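The rotation rule above (shift existing rotations, delete what falls outside the retention count, start a fresh file) can be expressed as a pure planning step that is easy to test separately from the filesystem. The sketch below is an assumption-laden illustration: `planRotation` and its operation objects are hypothetical, and it assumes a retention count of at least 1.

```javascript
// Hypothetical planner: returns the ordered filesystem operations for one rotation.
// `existing` is the list of file names currently present in the log directory.
function planRotation(existing, baseName = 'otel.log', retain = 5) {
  const indices = [];
  for (const name of existing) {
    if (!name.startsWith(baseName + '.')) continue;
    const suffix = name.slice(baseName.length + 1);
    if (/^[1-9]\d*$/.test(suffix)) indices.push(Number(suffix));
  }
  indices.sort((a, b) => b - a); // highest index first so no rename overwrites a file

  const ops = [];
  for (const i of indices) {
    const from = `${baseName}.${i}`;
    if (i >= retain) {
      ops.push({ op: 'delete', path: from }); // would exceed retention after shifting
    } else {
      ops.push({ op: 'rename', from, to: `${baseName}.${i + 1}` });
    }
  }
  ops.push({ op: 'rename', from: baseName, to: `${baseName}.1` });
  ops.push({ op: 'create', path: baseName }); // fresh, empty active log
  return ops;
}
```

Executing the returned operations in order with `fs.renameSync`, `fs.unlinkSync`, and `fs.writeFileSync` would be the remaining step.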
+ +#### Scenario: Export error recorded + +- **WHEN** an export attempt fails with a network error +- **THEN** a single line SHALL be appended to `~/.ccxray/otel.log` containing the timestamp, the error class, and the queue depth at time of failure + +#### Scenario: File rotated at size limit + +- **WHEN** `~/.ccxray/otel.log` exceeds 1 MB +- **THEN** it SHALL be renamed to `otel.log.1` (with existing rotations shifted), a fresh `otel.log` SHALL be created, and files beyond the retention count SHALL be deleted + +### Requirement: Never-block guarantee for the proxy + +OTel export operations SHALL NOT block the HTTP proxy path. All emit operations SHALL enqueue without awaiting export completion. SDK shutdown during process exit SHALL be capped at 2 seconds and SHALL NOT prevent clean exit on timeout. + +#### Scenario: Collector unreachable + +- **WHEN** the OTLP endpoint is unreachable for the duration of a proxy request +- **THEN** the proxy SHALL forward the request and return the response with no additional latency from OTel + +#### Scenario: SDK shutdown timeout + +- **WHEN** the process is exiting and OTel SDK flush is in progress +- **THEN** the shutdown SHALL be aborted after 2 seconds and the process SHALL exit cleanly + +### Requirement: Config errors fail fast, init/runtime errors degrade + +Config parsing or schema errors SHALL cause non-zero process exit at startup with an actionable message. SDK initialization errors (e.g. invalid endpoint URL format) SHALL transition the state to `degraded` and SHALL NOT block ccxray startup. Runtime export errors SHALL be handled by the circuit breaker without affecting other ccxray behavior. 
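The circuit breaker that handles runtime export errors (trip after 5 consecutive failures, 60-second initial cooldown, doubling up to 600 seconds) can be sketched as below. Two things are assumptions of this example rather than statements from the spec: the names (`CircuitBreaker`, `canExport`, `recordSuccess`, `recordFailure`), and the choice to model the half-open trial as a flag inside `circuit_open` instead of a separate state. The sketch only covers the `active` and `circuit_open` states.

```javascript
// Illustrative sketch: names and the half-open modeling are assumptions.
class CircuitBreaker {
  constructor({ threshold = 5, initialCooldownMs = 60_000, maxCooldownMs = 600_000 } = {}) {
    this.threshold = threshold;
    this.initialCooldownMs = initialCooldownMs;
    this.maxCooldownMs = maxCooldownMs;
    this.state = 'active';
    this.failures = 0; // consecutive failures while active
    this.cooldownMs = initialCooldownMs;
    this.openedAt = null; // timestamp (ms) when the current cooldown started
    this.trialInFlight = false; // true while the single half-open trial is pending
  }

  // Returns true when an export may be attempted at time `now` (ms).
  canExport(now) {
    if (this.state !== 'circuit_open') return true;
    if (this.trialInFlight) return false; // only one trial export at a time
    if (now - this.openedAt >= this.cooldownMs) {
      this.trialInFlight = true;
      return true;
    }
    return false;
  }

  recordSuccess() {
    this.state = 'active';
    this.failures = 0;
    this.trialInFlight = false;
    this.cooldownMs = this.initialCooldownMs; // reset to the initial cooldown
  }

  recordFailure(now) {
    if (this.trialInFlight) {
      // Trial failed: stay open and double the cooldown, capped at the maximum.
      this.trialInFlight = false;
      this.cooldownMs = Math.min(this.cooldownMs * 2, this.maxCooldownMs);
      this.openedAt = now;
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.state = 'circuit_open';
      this.openedAt = now;
    }
  }
}
```

Passing `now` explicitly instead of reading the clock inside the class keeps the cooldown arithmetic deterministic under test.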
+ +#### Scenario: Bad endpoint URL + +- **WHEN** `.ccxray.json` sets `otel.endpoint` to a string that is not a valid URL +- **THEN** ccxray SHALL continue to start, the state SHALL be `degraded`, the dashboard and proxy SHALL function normally, and `ccxray status --otel` SHALL display the error + +#### Scenario: Missing required field + +- **WHEN** `.ccxray.json` enables tier 1 but omits `otel.endpoint` +- **THEN** ccxray SHALL exit non-zero at startup with an error pointing to the missing field + +### Requirement: Health state observable via metric and status command + +The current health state SHALL be observable through (a) a gauge `ccxray.otel.state{state}` (where possible — emitted only when state is `active` or `degraded`), and (b) the `ccxray status --otel` output regardless of state. + +#### Scenario: State visible in status command + +- **WHEN** an engineer runs `ccxray status --otel` +- **THEN** the output SHALL include the current state, the last 3 state transitions with timestamps, and the current circuit breaker cooldown remaining (if applicable) diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-introspection/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-introspection/spec.md new file mode 100644 index 0000000..53f1589 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-introspection/spec.md @@ -0,0 +1,66 @@ +## ADDED Requirements + +### Requirement: `ccxray status --otel` shows effective configuration and health + +The `ccxray status --otel` command SHALL print: + +- The current effective tier (0/1/2) and which config files contributed. +- The endpoint URL with any `${VAR}` masked. +- The OTel health state (`disabled / active / degraded / circuit_open`) and last 3 state transitions with timestamps. +- The circuit breaker cooldown remaining (when applicable). +- Per-metric cardinality usage in `current / budget` format (e.g. `tool: 23/50`). 
+- Total counts: exports succeeded, exports failed, exports dropped (last hour and last 24 hours). +- The `opt_in_acknowledged_at` timestamp for tier 2 (when applicable). +- CLI coexistence indicator: whether `CLAUDE_CODE_ENABLE_TELEMETRY` is detected. + +#### Scenario: Status at tier 1 + +- **WHEN** ccxray is running at tier 1 with a healthy collector +- **THEN** `ccxray status --otel` SHALL show `tier=1`, `state=active`, the endpoint, cardinality usage rows for each registered metric, and the export success/failure counts + +#### Scenario: Status at tier 0 + +- **WHEN** ccxray is running at tier 0 +- **THEN** `ccxray status --otel` SHALL show `tier=0`, `state=disabled`, and SHALL NOT attempt to read OTel runtime state + +### Requirement: `ccxray otel preview` dry-run + +The `ccxray otel preview` command SHALL print the exact JSON body that would be sent to the OTel collector on the next export, including all attribute values and resource attributes, WITHOUT sending any network request. Secrets resolved from `${ENV_VAR}` SHALL be masked in the output. + +#### Scenario: Preview before enabling + +- **WHEN** an engineer runs `ccxray otel preview` after setting up `.ccxray.json` +- **THEN** the command SHALL print a single JSON object representing the next export, with `Authorization` and similar header values shown as `Bearer ***` rather than the resolved token + +#### Scenario: Preview with no recent metrics + +- **WHEN** ccxray has no queued metrics to export +- **THEN** the command SHALL print a notice that no metrics are pending and SHALL exit zero + +### Requirement: Startup banner declares active tier and mode + +When ccxray starts at tier ≥ 1, it SHALL print a one-line banner to stderr summarizing: tier value, endpoint (without secret), and complement-mode status (if CLI OTel is active). The banner SHALL NOT print when tier is 0. 
+ +#### Scenario: Banner at tier 1 standalone + +- **WHEN** ccxray starts at tier 1 without CLI OTel +- **THEN** stderr SHALL contain a single line matching the pattern `ccxray OTel tier: 1 (anonymous) → ` followed by no further banner output for that launch + +#### Scenario: Banner at tier 1 complement + +- **WHEN** ccxray starts at tier 1 with `CLAUDE_CODE_ENABLE_TELEMETRY=1` +- **THEN** stderr SHALL contain a line indicating `tier: 1` and `complement-mode: true` + +#### Scenario: No banner at tier 0 + +- **WHEN** ccxray starts at tier 0 +- **THEN** stderr SHALL NOT contain any OTel-related banner line + +### Requirement: Secrets masking in all introspection output + +`ccxray status --otel` and `ccxray otel preview` SHALL mask any value resolved from a `${VAR}` interpolation. Masked values SHALL display as the prefix (up to 4 characters) followed by `***`. The full unmasked value SHALL never be printed by any introspection command. + +#### Scenario: Auth header masked + +- **WHEN** the resolved auth header is `Bearer abc123longtokenvalue` +- **THEN** introspection output SHALL display `Bearer abc1***` and SHALL NOT print the remainder of the token diff --git a/openspec/changes/add-otel-metrics-phase1/specs/otel-tiers/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/otel-tiers/spec.md new file mode 100644 index 0000000..e10c2c9 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/otel-tiers/spec.md @@ -0,0 +1,79 @@ +## ADDED Requirements + +### Requirement: Three discrete tier values + +ccxray SHALL support exactly three tier values for OTel export: + +- **0 — disabled**: No SDK initialization, no network egress. +- **1 — project anonymous**: Emit with project-level resource attributes (`project.name`, optional `team`) but no individual identity. +- **2 — personal named**: Emit with `enduser.id` attached (a self-chosen string set by the engineer). 
+ +#### Scenario: Tier 0 produces no egress + +- **WHEN** the effective tier resolves to 0 +- **THEN** no OTel package SHALL be loaded and no network connection SHALL be opened for telemetry + +#### Scenario: Tier 1 omits identity + +- **WHEN** the effective tier resolves to 1 and a request completes +- **THEN** emitted metrics SHALL include `project.name` (if configured) but SHALL NOT include any `enduser.id` attribute + +#### Scenario: Tier 2 includes identity + +- **WHEN** the effective tier resolves to 2 and personal config provides `identity: "alice"` +- **THEN** emitted metrics SHALL include `enduser.id="alice"` as a resource attribute + +### Requirement: Tier resolution rule + +The effective tier SHALL be `min(project_tier, personal_tier)`. If either side is absent, the present side SHALL be used. The minimum SHALL clamp downward; personal config SHALL NOT exceed project config. + +#### Scenario: Personal lower than project + +- **WHEN** project tier is 1 and personal tier is 0 +- **THEN** the effective tier SHALL be 0 + +#### Scenario: Project lower than personal + +- **WHEN** project tier is 1 and personal tier is 2 without project authorization for tier 2 +- **THEN** the effective tier SHALL be 1 and ccxray SHALL emit a warning that personal tier is clamped + +#### Scenario: Equal tiers + +- **WHEN** project tier is 1 and personal tier is 1 +- **THEN** the effective tier SHALL be 1 + +### Requirement: Engineer unilateral opt-out + +Any engineer SHALL be able to opt out of OTel emission for their own machine by setting `tier: 0` in `.ccxray.user.json`, regardless of the project config. This opt-out SHALL take effect on the next ccxray launch. 
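Both the resolution rule and the unilateral opt-out above reduce to taking the minimum of the tiers that are present. A minimal sketch follows; `resolveTier` and its return shape are hypothetical, and the behavior when both configs are absent (falling back to tier 0) is an assumption of this example, since the text only covers the case where one side is missing.

```javascript
// Hypothetical helper: computes the effective tier from the two config sources.
// A tier of `undefined` or `null` means that config side is absent.
function resolveTier(projectTier, personalTier) {
  const isPresent = (t) => t !== undefined && t !== null;
  for (const t of [projectTier, personalTier]) {
    if (isPresent(t) && ![0, 1, 2].includes(t)) {
      throw new Error(`unknown tier value: ${t} (valid values: 0, 1, 2)`);
    }
  }
  const present = [projectTier, personalTier].filter(isPresent);
  if (present.length === 0) {
    return { effective: 0, clamped: false }; // assumption: nothing configured means disabled
  }
  const effective = Math.min(...present);
  // The personal tier is clamped when the project tier forces it lower.
  const clamped = isPresent(projectTier) && isPresent(personalTier) && personalTier > projectTier;
  return { effective, clamped };
}
```

The `clamped` flag is what a caller would use to decide whether to print the clamp warning; a personal tier of 0 always wins because 0 is the smallest possible value.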
+ +#### Scenario: Opt-out overrides project tier + +- **WHEN** project config sets tier 2 and personal config sets tier 0 +- **THEN** the engineer's ccxray client SHALL emit no telemetry until personal config is changed + +### Requirement: Personal config gitignore enforcement + +The personal config file `.ccxray.user.json` SHALL be excluded from version control. ccxray SHALL refuse to load personal-tier identity from a file that is currently tracked by git and SHALL emit a warning explaining the risk. + +#### Scenario: Personal config tracked by git + +- **WHEN** `.ccxray.user.json` exists in the repo and is tracked by git +- **THEN** ccxray SHALL print a warning recommending `git rm --cached` and SHALL refuse to apply the personal identity until the file is untracked or moved to `$HOME` + +### Requirement: Opt-in acknowledgment timestamp + +When personal config sets tier 2 for the first time, the file SHALL record an `opt_in_acknowledged_at` ISO 8601 timestamp. This timestamp SHALL be displayed in `ccxray status --otel` so the engineer can confirm when they last opted in. + +#### Scenario: First-time tier 2 opt-in + +- **WHEN** a user creates `.ccxray.user.json` with tier 2 for the first time +- **THEN** ccxray SHALL write the current time into the file as `opt_in_acknowledged_at` and SHALL include it in subsequent `status --otel` output + +### Requirement: Tier distribution sentinel + +ccxray SHALL emit `ccxray.otel.tier_distribution{tier}` as a counter incremented once per process launch that initializes OTel, labeled with the effective tier value. This metric is meant to inform documentation strengthening decisions (e.g. low tier 2 share suggests trust concerns). 
+ +#### Scenario: Counter increments on launch + +- **WHEN** ccxray client process initializes at tier 1 +- **THEN** `ccxray.otel.tier_distribution{tier="1"}` SHALL increment by 1 diff --git a/openspec/changes/add-otel-metrics-phase1/specs/parser-schemas/spec.md b/openspec/changes/add-otel-metrics-phase1/specs/parser-schemas/spec.md new file mode 100644 index 0000000..36e6b23 --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/specs/parser-schemas/spec.md @@ -0,0 +1,83 @@ +## ADDED Requirements + +### Requirement: Versioned parser schemas per concern and provider + +Detection logic for tool / MCP / skill / agent-type SHALL be expressed as JSON schemas under `server/parsers/`. There SHALL be at minimum one schema per (concern, provider) pair: + +- `parsers/anthropic-tools.schema.json` +- `parsers/anthropic-skills.schema.json` +- `parsers/anthropic-agent-types.schema.json` +- `parsers/mcp-tools.schema.json` (provider-agnostic MCP naming convention) +- `parsers/codex-tools.schema.json` + +Each schema SHALL include a `version` field (semver) and a `last_verified_against` field (ISO 8601 date). Inline string matching in `server/system-prompt.js`, `server/store.js`, or other code paths SHALL be removed in favor of the schema-driven parser. 
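To make the schema-driven approach concrete, here is a minimal sketch of how a tool name might be classified from a schema object instead of a hardcoded list. Everything below is hypothetical: the shape of `patterns` (an array of `{ category, regex }`), the sample schema values, and the helper names `compileSchema` and `classify` are assumptions for illustration, and the real schema files may look different.

```javascript
// Hypothetical schema content; the real parsers/anthropic-tools.schema.json may differ.
const exampleSchema = {
  version: '1.0.0', // semver, as the requirement states
  last_verified_against: '2025-01-01', // placeholder ISO 8601 date, not a real verification date
  patterns: [{ category: 'builtin', regex: '^(Read|Bash|Task)$' }],
};

// Validates the required fields and compiles each pattern once up front.
function compileSchema(schema) {
  for (const field of ['version', 'last_verified_against', 'patterns']) {
    if (!(field in schema)) throw new Error(`schema is missing required field: ${field}`);
  }
  return schema.patterns.map((p) => ({ category: p.category, regex: new RegExp(p.regex) }));
}

// Classifies one tool name; an unknown name is reported rather than thrown,
// so the caller can increment an unknown-tool sentinel counter.
function classify(compiled, toolName) {
  const hit = compiled.find((p) => p.regex.test(toolName));
  return hit ? { known: true, category: hit.category } : { known: false, category: null };
}
```

Returning `{ known: false }` instead of throwing keeps unknown names on the normal code path, which matches the intent of counting drift rather than failing on it.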
+ +#### Scenario: Schema referenced at runtime + +- **WHEN** ccxray processes an Anthropic response containing a `tool_use` block +- **THEN** the tool name SHALL be classified using `parsers/anthropic-tools.schema.json` and SHALL NOT be matched against any hardcoded list embedded in other files + +### Requirement: Snapshot fixtures per provider + +Test fixtures under `test/fixtures/parser/` SHALL cover at minimum the following cases per provider: + +- Basic tool invocation +- Tool invocation with a skill marker active +- Subagent invocation (Anthropic Task tool) +- MCP server tool invocation +- An intentional unknown tool name + +Each fixture SHALL pair an input (request or response JSON) with an expected parser output snapshot. Parser changes SHALL require committing new snapshots and SHALL pass review before merge. + +#### Scenario: Snapshot drift fails CI + +- **WHEN** parser code is changed in a way that alters fixture output +- **THEN** the test suite SHALL fail with a diff between old and new snapshot until the snapshot is updated and reviewed + +### Requirement: Sentinel counters for unknown tokens + +When the parser encounters a token, marker, or block that does not match any registered pattern in the relevant schema, it SHALL increment one of: + +- `ccxray.parser.unknown_tool_total{provider}` +- `ccxray.parser.unknown_skill_marker_total{provider}` +- `ccxray.parser.unknown_mcp_format_total` +- `ccxray.parser.fallback_used_total{parser,reason}` + +The unknown event SHALL also be recorded with a short sample to `~/.ccxray/parser-drift.log` for later inspection via `ccxray parser report`. 
+ +#### Scenario: Unknown tool name observed + +- **WHEN** ccxray sees a `tool_use` block whose `name` does not match any pattern in `parsers/anthropic-tools.schema.json` +- **THEN** `ccxray.parser.unknown_tool_total{provider="anthropic"}` SHALL increment by 1 and a sample SHALL be appended to `~/.ccxray/parser-drift.log` + +### Requirement: Reconciliation invariants + +For every processed entry the parser SHALL verify the following invariants: + +- Number of `tool_use` blocks in the response equals the number of tool entries extracted by the parser. +- Sum of input/output token counts attributed by the parser equals the corresponding values in the upstream usage block. + +When an invariant fails, `ccxray.parser.reconciliation_mismatch_total{type}` SHALL increment by 1 and the entry ID SHALL be appended to `~/.ccxray/parser-drift.log`. The mismatch SHALL NOT alter the entry's local log content. + +#### Scenario: Tool count mismatch + +- **WHEN** a response contains 3 `tool_use` blocks but the parser extracts only 2 tool entries +- **THEN** `ccxray.parser.reconciliation_mismatch_total{type="tool_count"}` SHALL increment and the entry ID SHALL be recorded in the drift log + +### Requirement: Parser error isolation + +Parser code SHALL be wrapped in try/catch boundaries. On exception, `ccxray.parser.error_total{parser,error_type}` SHALL increment and the originating entry SHALL still be written to local logs. The OTel span/metric for the affected entry SHALL be tagged `ccxray.parser.degraded=true`. Parser failure SHALL NOT propagate to the proxy path or terminate ccxray. 
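The isolation boundary described above can be sketched as a small wrapper around each parser call. This is an illustration under assumptions: `runParserSafely` is a hypothetical name, and the `counters` map stands in for `ccxray.parser.error_total{parser,error_type}`; writing the entry to local logs and tagging the emitted metric remain the caller's job.

```javascript
// Hypothetical wrapper: runs one parser and never lets its exception escape.
// `counters` is a Map standing in for the error_total metric,
// keyed here as "<parser>:<error_type>".
function runParserSafely(parserName, parse, entry, counters) {
  try {
    return { result: parse(entry), degraded: false };
  } catch (err) {
    const errorType = err && err.name ? err.name : 'UnknownError';
    const key = `${parserName}:${errorType}`;
    counters.set(key, (counters.get(key) || 0) + 1);
    // degraded: true tells the caller to tag the entry ccxray.parser.degraded=true
    return { result: null, degraded: true };
  }
}
```

Because the wrapper always returns a value, the proxy path can continue forwarding regardless of what the parser did with a given entry.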
+ +#### Scenario: Parser throws + +- **WHEN** the skill marker parser throws a runtime exception while processing a response +- **THEN** ccxray SHALL log the exception locally, increment `ccxray.parser.error_total{parser="anthropic-skills",error_type=""}`, write the entry to disk as usual, and continue forwarding subsequent requests + +### Requirement: `ccxray parser report` command + +The `ccxray parser report` command SHALL print the top unknown tokens by frequency from the last 7 days of `~/.ccxray/parser-drift.log`, grouped by category (tool / skill / MCP / fallback). The output SHALL include sample tokens and a GitHub issue body template the user can copy to file a drift report. + +#### Scenario: Reporting after seeing unknown markers + +- **WHEN** the engineer has accumulated unknown markers and runs `ccxray parser report` +- **THEN** the command SHALL print a categorized summary, the most recent 5 unique samples per category, and a formatted GitHub issue body diff --git a/openspec/changes/add-otel-metrics-phase1/tasks.md b/openspec/changes/add-otel-metrics-phase1/tasks.md new file mode 100644 index 0000000..5c9af0e --- /dev/null +++ b/openspec/changes/add-otel-metrics-phase1/tasks.md @@ -0,0 +1,105 @@ +## 1. Dependencies and package wiring + +- [x] 1.1 Add `@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources` as `dependencies` in `package.json` (no auto-instrumentations) +- [x] 1.2 Implement lazy require in a helper so ccxray still runs at tier 0 when OTel packages are absent +- [x] 1.3 Update `package-lock.json` and confirm bundle size delta is within an acceptable bound + +## 2. 
Config loader (`server/config-loader.js`) + +- [ ] 2.1 Define JSON schema for `.ccxray.json` (project) and `.ccxray.user.json` (personal) covering: `otel.enabled`, `otel.tier`, `otel.endpoint`, `otel.headers`, `otel.resource_attributes`, `otel.cardinality_overrides` +- [ ] 2.2 Implement schema validation with line/column error reporting +- [ ] 2.3 Implement `${ENV_VAR}` interpolation across all string values; fail fast with named variable on unresolved +- [ ] 2.4 Implement literal-secret detector (Bearer/JWT/`sk_*`/`ghp_*`) that rejects values not wrapped in `${...}` +- [ ] 2.5 Implement project config lookup walking up from cwd to git root, taking the first `.ccxray.json` match +- [ ] 2.6 Implement personal config lookup: cwd first, then `$HOME` fallback +- [ ] 2.7 Implement tier resolution `effective = min(project_tier, personal_tier)` with downward clamp warning +- [ ] 2.8 Implement `.gitignore` check and auto-amend with `--yes` flag for `.ccxray.user.json` +- [ ] 2.9 Reject personal config that is currently tracked by git, with explanatory error +- [ ] 2.10 Persist `opt_in_acknowledged_at` ISO 8601 timestamp on first tier 2 enable +- [ ] 2.11 Unit tests covering all error paths, interpolation, secret rejection, tier resolution matrix + +## 3. 
OTel health module (`server/otel-health.js`) + +- [x] 3.1 Implement state machine with four states: `disabled / active / degraded / circuit_open` and transitions only via documented APIs +- [ ] 3.2 Implement bounded export queue with drop-oldest semantics and `ccxray.otel.exports_dropped_total{signal}` increment per drop +- [ ] 3.3 Implement circuit breaker: trip after 5 consecutive failures, 60s initial cooldown, half-open trial, exponential backoff to 600s max +- [ ] 3.4 Implement `~/.ccxray/otel.log` append writer with size-based rotation (default 1 MB, 5 file retention) +- [x] 3.5 Implement SDK shutdown with 2-second hard cap to never block process exit +- [ ] 3.6 Surface state and metrics via a status reporter API consumed by the CLI status command +- [ ] 3.7 Unit tests with mock collector (200 / 500 / timeout) covering queue overflow, circuit transitions, half-open recovery, and exponential backoff + +## 4. OTel SDK initialization (`server/otel.js`) + +- [x] 4.1 Implement SDK init for metrics only, with `ccxray.source="ccxray-proxy"` resource attribute +- [ ] 4.2 Define metric registry with allow-list of attribute keys and cardinality budgets per metric (View API) +- [ ] 4.3 Implement cardinality budget tracker with `_overflow_` fallback and `ccxray.metrics.overflow_total{metric,attribute}` sentinel +- [ ] 4.4 Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and apply `ccxray.cli_otel_active=true` attribute in complement mode +- [ ] 4.5 Register all metric families per `otel-export/spec.md`: cost, usage, quality, patterns, governance +- [ ] 4.6 Register sentinel metrics: overflow, parser unknowns, parser mismatches, otel state, tier distribution (no `ccxray.reconciliation.*` diff metric in Phase 1, per `otel-export/spec.md`) +- [ ] 4.7 Implement export-time masking of any value resolved from `${ENV_VAR}` for log lines and trace dumps +- [ ] 4.8 Implement internal invariant metrics (`ccxray.invariants.parser_mismatch_total{type}`, `ccxray.invariants.sse_truncated_total{provider}`). Cross-source diff against CLI is NOT in Phase 1; documented 
as downstream pattern instead +- [ ] 4.9 Unit tests for namespace lint (no metric name starts with `claude_code.`), source attribute presence, budget enforcement, complement mode attribute, lazy SDK init at tier 0 + +## 5. Parser schema-ization (`server/parsers/`) + +- [ ] 5.1 Define the JSON schema format (fields: `version`, `last_verified_against`, `patterns`, `examples`) +- [ ] 5.2 Author `parsers/anthropic-tools.schema.json` covering current internal tool names +- [ ] 5.3 Author `parsers/anthropic-skills.schema.json` covering known skill marker formats from `system-prompt.js` +- [ ] 5.4 Author `parsers/anthropic-agent-types.schema.json` for general / explore / plan / known subagent types +- [ ] 5.5 Author `parsers/mcp-tools.schema.json` for `mcp____` naming +- [ ] 5.6 Author `parsers/codex-tools.schema.json` for OpenAI Responses tool patterns +- [ ] 5.7 Implement parser dispatch in `server/parsers/index.js` consuming the schemas +- [ ] 5.8 Replace inline string matching in `server/system-prompt.js`, `server/store.js`, and `server/helpers.js` with schema dispatch calls +- [ ] 5.9 Implement sentinel emission for unknown tools / skills / MCP markers and `~/.ccxray/parser-drift.log` append writer +- [ ] 5.10 Implement reconciliation invariants: tool_use block count equals extracted count; token attribution sums equal usage block values +- [ ] 5.11 Wrap parser calls in try/catch with `ccxray.parser.error_total{parser,error_type}` increment and `ccxray.parser.degraded=true` attribute on the affected entry +- [ ] 5.12 Author snapshot fixtures under `test/fixtures/parser/` for every (provider, scenario) pair listed in `parser-schemas/spec.md` +- [ ] 5.13 Wire snapshot tests into `npm test` + +## 6. 
Wire metrics into forward / store paths + +- [ ] 6.1 In `server/forward.js`, emit cost / token / latency / error / stop_reason metrics after each completed forward, using the otel-health queue _(partial: `emit('entry_completed', { entry })` wired in all 3 forward paths with full entry payload; routing through the otel-health queue is pending §3.2)_ +- [ ] 6.2 In `server/store.js`, emit usage / pattern / governance metrics as session/tool/skill/MCP detection runs through the new parsers +- [ ] 6.3 Ensure no emit path can throw into the proxy code path; all emits are best-effort +- [ ] 6.4 Add a unit test that verifies forward.js continues to function with OTel disabled, init-failed (degraded), and circuit_open states + +## 7. CLI introspection commands + +- [ ] 7.1 Implement `ccxray status --otel` per `otel-introspection/spec.md`: tier, endpoint (masked), state, transitions, cooldown, cardinality usage rows, success/failure/dropped counts, opt_in_acknowledged_at, CLI coexistence flag +- [ ] 7.2 Implement `ccxray otel preview` dry-run printing next-export JSON with secrets masked +- [ ] 7.3 Implement `ccxray parser report` command summarizing top unknown tokens and generating a GitHub issue body template +- [ ] 7.4 Add startup banner declaring tier and complement-mode status when tier ≥ 1 +- [ ] 7.5 Unit tests for each command and banner output + +## 8. Hub-side coexistence (minimal Phase 1 changes) + +- [ ] 8.1 Confirm the hub does NOT initialize OTel SDK for business metrics; document this explicitly in the hub module header comment +- [ ] 8.2 Make `ccxray status` aware of per-client OTel state via hub's existing client registration channel (so cross-client visibility works) +- [ ] 8.3 Defer `ccxray.hub.*` operational metrics to a follow-up change (per Open Questions in design.md) + +## 9. 
Documentation + +- [ ] 9.1 Add `docs/otel-ethics.md` (bilingual): why these metrics are not for individual performance evaluation; what acceptable uses look like +- [ ] 9.2 Add `docs/otel-quickstart.md` (bilingual): 90-second Grafana onboarding with screenshots +- [ ] 9.3 Reference `docs/otel-integration.html` (existing) as the design record from README +- [ ] 9.4 Update README with a single section: "Optional: send metrics to your observability backend" linking to quickstart and ethics docs +- [ ] 9.5 Update `CLAUDE.md` Architecture section to note the new modules and their roles +- [ ] 9.6 Add `docs/otel-recon.md` (bilingual): why cross-source reconciliation is a downstream concern, recording-rule / Grafana-panel / sidecar recipes for diffing ccxray vs CLI counts on `request_id` + +## 10. Verification gates + +- [ ] 10.1 CI lint: every emitted metric name MUST exist in `server/otel.js` schema registry; new metrics without registry entries fail build +- [ ] 10.2 CI lint: no metric name SHALL start with `claude_code.`; assertion runs across all `server/**/*.js` +- [ ] 10.3 Integration test: spin a local OTLP collector (docker), run a synthetic ccxray session, assert collector received the expected metric families with correct attributes +- [ ] 10.4 Integration test: simulate collector returning 500 → assert circuit opens, queue drops oldest, ccxray continues forwarding +- [ ] 10.5 Integration test: simulate `CLAUDE_CODE_ENABLE_TELEMETRY=1` → assert `cli_otel_active` attribute appears on emitted metrics +- [ ] 10.6 Manual usability test: 3 new engineers walk README + quickstart, target median time-to-first-metric < 5 minutes +- [ ] 10.7 Set 3-month KPI gate in repo: track GitHub references to "otel" / "OTEL_EXPORTER"; if < 10 within 3 months of release, pause Phase 2 work and revisit + +## 11. 
Release prep + +- [ ] 11.1 Update CHANGELOG with new dependencies, default-off behavior, three-tier model, and link to design doc +- [ ] 11.2 Confirm npm publish package size delta and document in PR description +- [ ] 11.3 Open follow-up issue for Phase 2 (span emit + `/entry/:id` drill-back) +- [ ] 11.4 Open follow-up issue for `--otel-demo` Docker Compose helper +- [ ] 11.5 Open follow-up issue for `ccxray.hub.*` operational metrics diff --git a/package-lock.json b/package-lock.json index 3afd051..866b5e2 100644 --- a/package-lock.json +++ b/package-lock.json @@ -1,15 +1,19 @@ { "name": "ccxray", - "version": "1.5.0", + "version": "1.9.2", "lockfileVersion": 3, "requires": true, "packages": { "": { "name": "ccxray", - "version": "1.5.0", + "version": "1.9.2", "license": "MIT", "dependencies": { "@anthropic-ai/tokenizer": "^0.0.4", + "@opentelemetry/api": "^1.9.0", + "@opentelemetry/exporter-metrics-otlp-http": "^0.205.0", + "@opentelemetry/resources": "^2.0.0", + "@opentelemetry/sdk-metrics": "^2.0.0", "ws": "^8.19.0" }, "bin": { @@ -497,6 +501,299 @@ "node": ">= 8" } }, + "node_modules/@opentelemetry/api": { + "version": "1.9.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/api/-/api-1.9.1.tgz", + "integrity": "sha512-gLyJlPHPZYdAk1JENA9LeHejZe1Ti77/pTeFm/nMXmQH/HFZlcS/O2XJB+L8fkbrNSqhdtlvjBVjxwUYanNH5Q==", + "license": "Apache-2.0", + "engines": { + "node": ">=8.0.0" + } + }, + "node_modules/@opentelemetry/api-logs": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/api-logs/-/api-logs-0.205.0.tgz", + "integrity": "sha512-wBlPk1nFB37Hsm+3Qy73yQSobVn28F4isnWIBvKpd5IUH/eat8bwcL02H9yzmHyyPmukeccSl2mbN5sDQZYnPg==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/api": "^1.3.0" + }, + "engines": { + "node": ">=8.0.0" + } + }, + "node_modules/@opentelemetry/core": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/core/-/core-2.1.0.tgz", + "integrity": 
"sha512-RMEtHsxJs/GiHHxYT58IY57UXAQTuUnZVco6ymDEqTNlJKTimM4qPUPVe8InNFyBjhHBEAx4k3Q8LtNayBsbUQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/exporter-metrics-otlp-http": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/exporter-metrics-otlp-http/-/exporter-metrics-otlp-http-0.205.0.tgz", + "integrity": "sha512-fFxNQ/HbbpLmh1pgU6HUVbFD1kNIjrkoluoKJkh88+gnmpFD92kMQ8WFNjPnSbjg2mNVnEkeKXgCYEowNW+p1w==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/otlp-exporter-base": "0.205.0", + "@opentelemetry/otlp-transformer": "0.205.0", + "@opentelemetry/resources": "2.1.0", + "@opentelemetry/sdk-metrics": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": "^1.3.0" + } + }, + "node_modules/@opentelemetry/exporter-metrics-otlp-http/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": "sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/exporter-metrics-otlp-http/node_modules/@opentelemetry/sdk-metrics": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-metrics/-/sdk-metrics-2.1.0.tgz", + "integrity": "sha512-J9QX459mzqHLL9Y6FZ4wQPRZG4TOpMCyPOh6mkr/humxE1W2S3Bvf4i75yiMW9uyed2Kf5rxmLhTm/UK8vNkAw==", + "license": "Apache-2.0", + 
"dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.9.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/otlp-exporter-base": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/otlp-exporter-base/-/otlp-exporter-base-0.205.0.tgz", + "integrity": "sha512-2MN0C1IiKyo34M6NZzD6P9Nv9Dfuz3OJ3rkZwzFmF6xzjDfqqCTatc9v1EpNfaP55iDOCLHFyYNCgs61FFgtUQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/otlp-transformer": "0.205.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": "^1.3.0" + } + }, + "node_modules/@opentelemetry/otlp-transformer": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/otlp-transformer/-/otlp-transformer-0.205.0.tgz", + "integrity": "sha512-KmObgqPtk9k/XTlWPJHdMbGCylRAmMJNXIRh6VYJmvlRDMfe+DonH41G7eenG8t4FXn3fxOGh14o/WiMRR6vPg==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/api-logs": "0.205.0", + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0", + "@opentelemetry/sdk-logs": "0.205.0", + "@opentelemetry/sdk-metrics": "2.1.0", + "@opentelemetry/sdk-trace-base": "2.1.0", + "protobufjs": "^7.3.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": "^1.3.0" + } + }, + "node_modules/@opentelemetry/otlp-transformer/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": "sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 
|| >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/otlp-transformer/node_modules/@opentelemetry/sdk-metrics": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-metrics/-/sdk-metrics-2.1.0.tgz", + "integrity": "sha512-J9QX459mzqHLL9Y6FZ4wQPRZG4TOpMCyPOh6mkr/humxE1W2S3Bvf4i75yiMW9uyed2Kf5rxmLhTm/UK8vNkAw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.9.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/resources": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.7.1.tgz", + "integrity": "sha512-DeT6KKolmC4e/dRQvMQ/RwlnzhaqeiFOXY5ngoOPJ07GgVVKxZOg9EcrNZb5aTzUn+iCrJldAgOfQm1O/QfPAQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.7.1", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/resources/node_modules/@opentelemetry/core": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/core/-/core-2.7.1.tgz", + "integrity": "sha512-QAqIj32AtK6+pEVNG7EOVxHdE06RP+FM5qpiEJ4RtDcFIqKUZHYhl7/7UY5efhwmwNAg7j8QbJVBLxMerc0+gw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-logs": { + "version": "0.205.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-logs/-/sdk-logs-0.205.0.tgz", + "integrity": 
"sha512-nyqhNQ6eEzPWQU60Nc7+A5LIq8fz3UeIzdEVBQYefB4+msJZ2vuVtRuk9KxPMw1uHoHDtYEwkr2Ct0iG29jU8w==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/api-logs": "0.205.0", + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.4.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-logs/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": "sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-metrics": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-metrics/-/sdk-metrics-2.7.1.tgz", + "integrity": "sha512-MpDJdkiFDs3Pm1RHO3KByuZbuBdJEXEAkiC0+yJdsZGVCdf1RpHR6n+LHDcS7ffmfrt5kVCzJSCfm4z2C7v0uQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.7.1", + "@opentelemetry/resources": "2.7.1" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.9.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-metrics/node_modules/@opentelemetry/core": { + "version": "2.7.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/core/-/core-2.7.1.tgz", + "integrity": "sha512-QAqIj32AtK6+pEVNG7EOVxHdE06RP+FM5qpiEJ4RtDcFIqKUZHYhl7/7UY5efhwmwNAg7j8QbJVBLxMerc0+gw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": 
">=1.0.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-trace-base": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/sdk-trace-base/-/sdk-trace-base-2.1.0.tgz", + "integrity": "sha512-uTX9FBlVQm4S2gVQO1sb5qyBLq/FPjbp+tmGoxu4tIgtYGmBYB44+KX/725RFDe30yBSaA9Ml9fqphe1hbUyLQ==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/resources": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/sdk-trace-base/node_modules/@opentelemetry/resources": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@opentelemetry/resources/-/resources-2.1.0.tgz", + "integrity": "sha512-1CJjf3LCvoefUOgegxi8h6r4B/wLSzInyhGP2UmIBYNlo4Qk5CZ73e1eEyWmfXvFtm1ybkmfb2DqWvspsYLrWw==", + "license": "Apache-2.0", + "dependencies": { + "@opentelemetry/core": "2.1.0", + "@opentelemetry/semantic-conventions": "^1.29.0" + }, + "engines": { + "node": "^18.19.0 || >=20.6.0" + }, + "peerDependencies": { + "@opentelemetry/api": ">=1.3.0 <1.10.0" + } + }, + "node_modules/@opentelemetry/semantic-conventions": { + "version": "1.41.1", + "resolved": "https://registry.npmjs.org/@opentelemetry/semantic-conventions/-/semantic-conventions-1.41.1.tgz", + "integrity": "sha512-/UhIkaZgPutTFmQ7RnIJGgDXZmtEJ7Dvi86xNTFWcnRxVRNk/aotsqDJYeEvDP+FSMB2SdW+pQzNMcWP0rwuNA==", + "license": "Apache-2.0", + "engines": { + "node": ">=14" + } + }, "node_modules/@posthog/core": { "version": "1.10.0", "resolved": "https://registry.npmjs.org/@posthog/core/-/core-1.10.0.tgz", @@ -507,6 +804,70 @@ "cross-spawn": "^7.0.6" } }, + "node_modules/@protobufjs/aspromise": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@protobufjs/aspromise/-/aspromise-1.1.2.tgz", + "integrity": 
"sha512-j+gKExEuLmKwvz3OgROXtrJ2UG2x8Ch2YZUxahh+s1F2HZ+wAceUNLkvy6zKCPVRkU++ZWQrdxsUeQXmcg4uoQ==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/base64": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@protobufjs/base64/-/base64-1.1.2.tgz", + "integrity": "sha512-AZkcAA5vnN/v4PDqKyMR5lx7hZttPDgClv83E//FMNhR2TMcLUhfRUBHCmSl0oi9zMgDDqRUJkSxO3wm85+XLg==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/codegen": { + "version": "2.0.5", + "resolved": "https://registry.npmjs.org/@protobufjs/codegen/-/codegen-2.0.5.tgz", + "integrity": "sha512-zgXFLzW3Ap33e6d0Wlj4MGIm6Ce8O89n/apUaGNB/jx+hw+ruWEp7EwGUshdLKVRCxZW12fp9r40E1mQrf/34g==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/eventemitter": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/@protobufjs/eventemitter/-/eventemitter-1.1.0.tgz", + "integrity": "sha512-j9ednRT81vYJ9OfVuXG6ERSTdEL1xVsNgqpkxMsbIabzSo3goCjDIveeGv5d03om39ML71RdmrGNjG5SReBP/Q==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/fetch": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/@protobufjs/fetch/-/fetch-1.1.0.tgz", + "integrity": "sha512-lljVXpqXebpsijW71PZaCYeIcE5on1w5DlQy5WH6GLbFryLUrBD4932W/E2BSpfRJWseIL4v/KPgBFxDOIdKpQ==", + "license": "BSD-3-Clause", + "dependencies": { + "@protobufjs/aspromise": "^1.1.1", + "@protobufjs/inquire": "^1.1.0" + } + }, + "node_modules/@protobufjs/float": { + "version": "1.0.2", + "resolved": "https://registry.npmjs.org/@protobufjs/float/-/float-1.0.2.tgz", + "integrity": "sha512-Ddb+kVXlXst9d+R9PfTIxh1EdNkgoRe5tOX6t01f1lYWOvJnSPDBlG241QLzcyPdoNTsblLUdujGSE4RzrTZGQ==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/inquire": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/@protobufjs/inquire/-/inquire-1.1.1.tgz", + "integrity": "sha512-mnzgDV26ueAvk7rsbt9L7bE0SuAoqyuys/sMMrmVcN5x9VsxpcG3rqAUSgDyLp0UZlmNfIbQ4fHfCtreVBk8Ew==", + "license": "BSD-3-Clause" + }, + 
"node_modules/@protobufjs/path": { + "version": "1.1.2", + "resolved": "https://registry.npmjs.org/@protobufjs/path/-/path-1.1.2.tgz", + "integrity": "sha512-6JOcJ5Tm08dOHAbdR3GrvP+yUUfkjG5ePsHYczMFLq3ZmMkAD98cDgcT2iA1lJ9NVwFd4tH/iSSoe44YWkltEA==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/pool": { + "version": "1.1.0", + "resolved": "https://registry.npmjs.org/@protobufjs/pool/-/pool-1.1.0.tgz", + "integrity": "sha512-0kELaGSIDBKvcgS4zkjz1PeddatrjYcmMWOlAuAPwAeccUrPHdUqo/J6LiymHHEiJT5NrF1UVwxY14f+fy4WQw==", + "license": "BSD-3-Clause" + }, + "node_modules/@protobufjs/utf8": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/@protobufjs/utf8/-/utf8-1.1.1.tgz", + "integrity": "sha512-oOAWABowe8EAbMyWKM0tYDKi8Yaox52D+HWZhAIJqQXbqe0xI/GV7FhLWqlEKreMkfDjshR5FKgi3mnle0h6Eg==", + "license": "BSD-3-Clause" + }, "node_modules/@puppeteer/browsers": { "version": "2.13.0", "resolved": "https://registry.npmjs.org/@puppeteer/browsers/-/browsers-2.13.0.tgz", @@ -1460,6 +1821,12 @@ "url": "https://github.com/sponsors/sindresorhus" } }, + "node_modules/long": { + "version": "5.3.2", + "resolved": "https://registry.npmjs.org/long/-/long-5.3.2.tgz", + "integrity": "sha512-mNAgZ1GmyNhD7AuqnTG3/VQ26o760+ZYBPKjPvugO8+nLbYfX6TVpJPseBvopbdY+qpZ/lKUnmEc1LeZYS3QAA==", + "license": "Apache-2.0" + }, "node_modules/lru-cache": { "version": "7.18.3", "resolved": "https://registry.npmjs.org/lru-cache/-/lru-cache-7.18.3.tgz", @@ -1771,6 +2138,30 @@ "node": ">=0.4.0" } }, + "node_modules/protobufjs": { + "version": "7.5.8", + "resolved": "https://registry.npmjs.org/protobufjs/-/protobufjs-7.5.8.tgz", + "integrity": "sha512-dvpCIeLPbXZS/Ete7yLaO7RenOdken2NHKykBXbsaGxZT0UTltcarBciw+A78SRQs9iMAAVpsYA+l8b1hTePIA==", + "hasInstallScript": true, + "license": "BSD-3-Clause", + "dependencies": { + "@protobufjs/aspromise": "^1.1.2", + "@protobufjs/base64": "^1.1.2", + "@protobufjs/codegen": "^2.0.5", + "@protobufjs/eventemitter": "^1.1.0", + "@protobufjs/fetch": "^1.1.0", 
+ "@protobufjs/float": "^1.0.2", + "@protobufjs/inquire": "^1.1.1", + "@protobufjs/path": "^1.1.2", + "@protobufjs/pool": "^1.1.0", + "@protobufjs/utf8": "^1.1.1", + "@types/node": ">=13.7.0", + "long": "^5.0.0" + }, + "engines": { + "node": ">=12.0.0" + } + }, "node_modules/proxy-agent": { "version": "6.5.0", "resolved": "https://registry.npmjs.org/proxy-agent/-/proxy-agent-6.5.0.tgz", diff --git a/package.json b/package.json index bd5b57c..767851a 100644 --- a/package.json +++ b/package.json @@ -37,6 +37,10 @@ }, "dependencies": { "@anthropic-ai/tokenizer": "^0.0.4", + "@opentelemetry/api": "^1.9.0", + "@opentelemetry/exporter-metrics-otlp-http": "^0.205.0", + "@opentelemetry/resources": "^2.0.0", + "@opentelemetry/sdk-metrics": "^2.0.0", "ws": "^8.19.0" }, "devDependencies": { diff --git a/server/cli.js b/server/cli.js new file mode 100644 index 0000000..0124a36 --- /dev/null +++ b/server/cli.js @@ -0,0 +1,63 @@ +'use strict'; + +// CLI argv parsing for ccxray. Splits flag detection from server/index.js so +// new subcommands can be added without growing the entry-point file. Mutates +// process.argv in place to strip consumed flags (existing behaviour). 
+ +const providers = require('./providers'); + +function parseArgs(argv = process.argv, env = process.env) { + const portIdx = argv.indexOf('--port'); + let explicitPort = false; + let port = null; + if (portIdx !== -1) { + const portVal = argv[portIdx + 1]; + const parsed = parseInt(portVal, 10); + if (!portVal || isNaN(parsed) || parsed < 1 || parsed > 65535) { + console.error('\x1b[31mError: --port requires a valid port number (1-65535)\x1b[0m'); + process.exit(1); + } + port = parsed; + explicitPort = true; + argv.splice(portIdx, 2); + } + + const hubMode = argv.includes('--hub-mode'); + if (hubMode) argv.splice(argv.indexOf('--hub-mode'), 1); + + const allowUpstreamLoop = argv.includes('--allow-upstream-loop') || env.CCXRAY_ALLOW_UPSTREAM_LOOP === '1'; + if (argv.includes('--allow-upstream-loop')) argv.splice(argv.indexOf('--allow-upstream-loop'), 1); + + const noBrowser = argv.includes('--no-browser'); + if (noBrowser) argv.splice(argv.indexOf('--no-browser'), 1); + + const cliCommand = argv[2]; + const unknownCommand = cliCommand + && cliCommand !== 'status' + && !cliCommand.startsWith('-') + && !providers.isAgentProvider(cliCommand); + if (unknownCommand) { + console.error(`\x1b[31mError: unsupported provider "${cliCommand}". Supported providers: ${providers.supportedProviderList()}\x1b[0m`); + process.exit(1); + } + + const agentCommand = providers.isAgentProvider(cliCommand) ? cliCommand : null; + const agentMode = Boolean(agentCommand); + const agentArgs = agentMode ? 
argv.slice(3) : []; + const displayName = providers.getDisplayName(agentCommand, env); + + return { + port, + explicitPort, + hubMode, + allowUpstreamLoop, + noBrowser, + cliCommand, + agentCommand, + agentMode, + agentArgs, + displayName, + }; +} + +module.exports = { parseArgs }; diff --git a/server/config-loader.js b/server/config-loader.js new file mode 100644 index 0000000..e108ac0 --- /dev/null +++ b/server/config-loader.js @@ -0,0 +1,64 @@ +'use strict'; + +// Minimal config loader for the OTel rollout (Phase 2a slice). +// This intentionally implements only the surface needed for the first +// vertical slice: read .ccxray.json from cwd if present, return a default +// shape otherwise. Env interpolation, literal-secret detection, gitignore +// auto-amend, personal config (.ccxray.user.json), and walk-up-to-git-root +// lookup all land in later Phase 2 sub-phases per the OpenSpec change. + +const fs = require('fs'); +const path = require('path'); + +const DEFAULT_CONFIG = Object.freeze({ + otel: Object.freeze({ + enabled: false, + tier: 0, + endpoint: null, + headers: Object.freeze({}), + resource_attributes: Object.freeze({}), + cardinality_overrides: Object.freeze({}), + }), +}); + +function projectConfigPath(cwd) { + return path.join(cwd || process.cwd(), '.ccxray.json'); +} + +function readProjectConfig(cwd) { + const file = projectConfigPath(cwd); + let raw; + try { + raw = fs.readFileSync(file, 'utf8'); + } catch (err) { + if (err.code === 'ENOENT') return { config: DEFAULT_CONFIG, source: null }; + throw new Error(`config-loader: failed to read ${file}: ${err.message}`); + } + let parsed; + try { + parsed = JSON.parse(raw); + } catch (err) { + throw new Error(`config-loader: ${file} is not valid JSON (${err.message})`); + } + return { config: mergeWithDefaults(parsed), source: file }; +} + +function mergeWithDefaults(input) { + const otel = input && typeof input.otel === 'object' && input.otel ? 
input.otel : {}; + return { + otel: { + enabled: otel.enabled === true, + tier: Number.isInteger(otel.tier) ? otel.tier : 0, + endpoint: typeof otel.endpoint === 'string' ? otel.endpoint : null, + headers: otel.headers && typeof otel.headers === 'object' ? { ...otel.headers } : {}, + resource_attributes: otel.resource_attributes && typeof otel.resource_attributes === 'object' + ? { ...otel.resource_attributes } + : {}, + cardinality_overrides: otel.cardinality_overrides && typeof otel.cardinality_overrides === 'object' + ? { ...otel.cardinality_overrides } + : {}, + }, + }; +} + +module.exports = { readProjectConfig, projectConfigPath, DEFAULT_CONFIG }; diff --git a/server/emit.js b/server/emit.js new file mode 100644 index 0000000..050a239 --- /dev/null +++ b/server/emit.js @@ -0,0 +1,40 @@ +'use strict'; + +// Internal event bus for OTel handlers, parser sentinels, and future status hooks. +// +// Phase D (OTel SDK init) registers subscribers; Phase E wires emit() calls in +// forward.js / store.js. With no subscribers, emit() is an O(1) no-op — tier 0 +// pays zero cost. +// +// Handlers run synchronously and MUST NOT throw into the proxy code path; this +// module wraps every dispatch in try/catch so a buggy subscriber cannot break +// request forwarding. +// +// Defined events (payload shape stable across Phase 1): +// entry_completed { entry } +// session_started { sessionId, provider, inferred } +// parser_unknown { provider, kind, token } +// parser_mismatch { type, expected, got, entryId? 
} +// parser_error { parser, errorType, message } + +const subscribers = new Map(); + +function on(event, handler) { + if (typeof handler !== 'function') throw new TypeError('handler must be a function'); + if (!subscribers.has(event)) subscribers.set(event, new Set()); + subscribers.get(event).add(handler); + return () => subscribers.get(event)?.delete(handler); +} + +function emit(event, payload) { + const set = subscribers.get(event); + if (!set || set.size === 0) return; + for (const handler of set) { + try { handler(payload); } + catch (err) { + try { console.error(`[emit] handler "${event}":`, err && err.message); } catch {} + } + } +} + +module.exports = { on, emit }; diff --git a/server/forward.js b/server/forward.js index 71a7360..23d3852 100644 --- a/server/forward.js +++ b/server/forward.js @@ -10,6 +10,7 @@ const helpers = require('./helpers'); const { broadcast, broadcastSessionStatus, broadcastSessionTitleUpdate } = require('./sse-broadcast'); const { appendSample, collectRatelimitHeaders } = require('./ratelimit-log'); const hub = require('./hub'); +const emit = require('./emit'); // For title-generator subagent responses, extract the clean title from the // JSON payload and (when attribution succeeds) stamp it onto the parent @@ -80,8 +81,15 @@ function createTunnelAgent(proxyUrl) { } const tlsOpts = { socket, servername: options.servername || options.host }; if (options.rejectUnauthorized !== undefined) tlsOpts.rejectUnauthorized = options.rejectUnauthorized; - const tlsSocket = tls.connect(tlsOpts, () => callback(null, tlsSocket)); - tlsSocket.on('error', callback); + let connected = false; + const tlsSocket = tls.connect(tlsOpts, () => { + connected = true; + callback(null, tlsSocket); + }); + tlsSocket.on('error', (err) => { + if (!connected) return callback(err); + console.error(`\x1b[31m❌ TUNNEL SOCKET ERROR: ${err.code || err.message}\x1b[0m`); + }); }); connectReq.on('error', callback); @@ -377,6 +385,15 @@ function forwardRequest(ctx) { 
clientRes.end(JSON.stringify({ error: 'proxy_error', message: err.message })); }); + // Late socket errors (EPIPE / ECONNRESET after the response has been received) + // are emitted on the underlying TLS/TCP socket and may not re-emit on the + // ClientRequest. Without a listener they crash the entire proxy process. + proxyReq.on('socket', (socket) => { + socket.on('error', (err) => { + console.error(`\x1b[31m❌ UPSTREAM SOCKET ERROR: ${err.code || err.message}\x1b[0m`); + }); + }); + proxyReq.end(bodyToSend); } @@ -599,6 +616,7 @@ function handleSSEResponse(ctx, proxyRes, clientRes) { store.trimEntries(); store.propagateLoadedSkills(entry, sessionId); broadcast(entry); + emit.emit('entry_completed', { entry }); // Persist to index (fire-and-forget after broadcast) const indexLine = JSON.stringify({ @@ -730,6 +748,7 @@ function handleOpenAISSE(ctx, proxyRes, clientRes) { store.entries.push(entry); store.trimEntries(); broadcast(entry); + emit.emit('entry_completed', { entry }); const indexLine = JSON.stringify({ id, ts: ctx.ts, sessionId: reqSessionId, @@ -875,6 +894,7 @@ function handleNonSSEResponse(ctx, proxyRes, clientRes) { store.trimEntries(); store.propagateLoadedSkills(entry, sessionId); broadcast(entry); + emit.emit('entry_completed', { entry }); const indexLine = JSON.stringify({ id, ts: ctx.ts, sessionId, diff --git a/server/index.js b/server/index.js index c47ce40..2abc32a 100755 --- a/server/index.js +++ b/server/index.js @@ -18,40 +18,20 @@ const { authMiddleware } = require('./auth'); const { extractAgentType, extractPromptAgentType, splitB2IntoBlocks } = require('./system-prompt'); const { findSharedPrefix } = require('./delta-helpers'); const providers = require('./providers'); - -// ── CLI: parse flags and detect provider launchers ── -const portIdx = process.argv.indexOf('--port'); -let explicitPort = false; -if (portIdx !== -1) { - const portVal = process.argv[portIdx + 1]; - const parsed = parseInt(portVal, 10); - if (!portVal || isNaN(parsed) 
|| parsed < 1 || parsed > 65535) { - console.error('\x1b[31mError: --port requires a valid port number (1-65535)\x1b[0m'); - process.exit(1); - } - config.PORT = parsed; - explicitPort = true; - process.argv.splice(portIdx, 2); -} -const hubMode = process.argv.includes('--hub-mode'); -if (hubMode) process.argv.splice(process.argv.indexOf('--hub-mode'), 1); -const allowUpstreamLoop = process.argv.includes('--allow-upstream-loop') || process.env.CCXRAY_ALLOW_UPSTREAM_LOOP === '1'; -if (process.argv.includes('--allow-upstream-loop')) process.argv.splice(process.argv.indexOf('--allow-upstream-loop'), 1); -const noBrowser = process.argv.includes('--no-browser'); -if (noBrowser) process.argv.splice(process.argv.indexOf('--no-browser'), 1); -const cliCommand = process.argv[2]; -const unknownCommand = cliCommand - && cliCommand !== 'status' - && !cliCommand.startsWith('-') - && !providers.isAgentProvider(cliCommand); -if (unknownCommand) { - console.error(`\x1b[31mError: unsupported provider "${cliCommand}". Supported providers: ${providers.supportedProviderList()}\x1b[0m`); - process.exit(1); -} -const agentCommand = providers.isAgentProvider(cliCommand) ? cliCommand : null; -const agentMode = Boolean(agentCommand); -const agentArgs = agentMode ? process.argv.slice(3) : []; -const DISPLAY_NAME = providers.getDisplayName(agentCommand, process.env); +const { parseArgs } = require('./cli'); + +const { + port: cliPort, + explicitPort, + hubMode, + allowUpstreamLoop, + noBrowser, + agentCommand, + agentMode, + agentArgs, + displayName: DISPLAY_NAME, +} = parseArgs(); +if (cliPort != null) config.PORT = cliPort; // In agent/hub mode, mute startup logs so they don't pollute output. 
const _origLog = console.log; @@ -775,7 +755,6 @@ async function startServer() { if (acquired) hub.releaseForkLock(); console.error(`\x1b[31m${err.message}\x1b[0m`); // Show last hub log lines so user doesn't have to open the file - const fs = require('fs'); try { const log = fs.readFileSync(hub.HUB_LOG_PATH, 'utf8'); const lines = log.trim().split('\n'); diff --git a/server/otel-health.js b/server/otel-health.js new file mode 100644 index 0000000..f833696 --- /dev/null +++ b/server/otel-health.js @@ -0,0 +1,60 @@ +'use strict'; + +// OTel export health state machine. Phase 2b: state shell only. +// Bounded export queue (3.2), circuit breaker (3.3), log rotation (3.4), +// and shutdown cap (3.5) land in later sub-phases of the OpenSpec change. +// +// States: +// disabled — OTel never initialized (tier 0 or packages missing-and-tolerated) +// active — SDK initialized, exports presumed working +// degraded — SDK init failed or runtime non-recoverable; proxy continues +// circuit_open — runtime export failures tripped the breaker; periodic half-open retry +// +// Only documented APIs may mutate state. Invalid transitions throw so bugs +// surface in tests rather than silently corrupt observability. 
+ +const STATES = Object.freeze(['disabled', 'active', 'degraded', 'circuit_open']); + +const VALID_TRANSITIONS = Object.freeze({ + disabled: new Set(['active', 'degraded']), + active: new Set(['degraded', 'circuit_open', 'disabled']), + degraded: new Set(['active', 'circuit_open', 'disabled']), + circuit_open: new Set(['active', 'degraded', 'disabled']), +}); + +let currentState = 'disabled'; +let lastTransitionAt = Date.now(); +let lastReason = null; + +function getState() { + return currentState; +} + +function getStatus() { + return { + state: currentState, + lastTransitionAt, + reason: lastReason, + }; +} + +function transition(to, { reason } = {}) { + if (!STATES.includes(to)) throw new Error(`otel-health: unknown state "${to}"`); + if (currentState === to) return false; + const allowed = VALID_TRANSITIONS[currentState]; + if (!allowed.has(to)) { + throw new Error(`otel-health: invalid transition ${currentState} → ${to}`); + } + currentState = to; + lastTransitionAt = Date.now(); + lastReason = (to === 'degraded' || to === 'circuit_open') ? (reason || null) : null; + return true; +} + +function _resetForTests() { + currentState = 'disabled'; + lastTransitionAt = Date.now(); + lastReason = null; +} + +module.exports = { STATES, getState, getStatus, transition, _resetForTests }; diff --git a/server/otel-lazy.js b/server/otel-lazy.js new file mode 100644 index 0000000..d796e68 --- /dev/null +++ b/server/otel-lazy.js @@ -0,0 +1,35 @@ +'use strict'; + +// Lazy require for OpenTelemetry packages. +// Phase 1 of the OTel rollout: ccxray must run at tier 0 even when the +// @opentelemetry/* packages are absent (e.g. user installed via a minimal +// distribution). Callers ask for a package by name; we return null if it +// cannot be resolved instead of throwing. 
+ +const KNOWN_PACKAGES = new Set([ + '@opentelemetry/api', + '@opentelemetry/resources', + '@opentelemetry/sdk-metrics', + '@opentelemetry/exporter-metrics-otlp-http', +]); + +function tryRequire(name) { + if (!KNOWN_PACKAGES.has(name)) { + throw new Error(`otel-lazy: unknown package "${name}"`); + } + try { + return require(name); + } catch (err) { + if (err && err.code === 'MODULE_NOT_FOUND') return null; + throw err; + } +} + +function isAvailable() { + for (const name of KNOWN_PACKAGES) { + if (tryRequire(name) == null) return false; + } + return true; +} + +module.exports = { tryRequire, isAvailable, KNOWN_PACKAGES }; diff --git a/server/otel.js b/server/otel.js new file mode 100644 index 0000000..653a787 --- /dev/null +++ b/server/otel.js @@ -0,0 +1,199 @@ +'use strict'; + +// OTel SDK init + emit.js subscribers. +// +// Vertical-slice scope (Phase 1, first cut): tier 0 = full no-op. tier ≥ 1 + +// packages present + endpoint configured → real MeterProvider with OTLP HTTP +// exporter and the first metric family (token usage). tier ≥ 1 with packages +// present but no endpoint → active state with no exporter (useful for staging +// the wiring before pointing at a collector). +// +// Metrics registered in this slice (aligned with otel-export/spec.md): +// ccxray.tokens.input_total (counter, unit=tokens) +// ccxray.tokens.output_total (counter, unit=tokens) +// ccxray.tokens.cache_read_total (counter, unit=tokens) +// ccxray.tokens.cache_creation_total (counter, unit=tokens) +// Each is recorded with { provider, model } attributes. Cardinality budgets, +// View API allow-lists, sentinel metrics, and the full cost/usage/quality +// families land in later slices (§4.2–§4.9 of the OpenSpec change). +// +// Resource attribute `ccxray.source=ccxray-proxy` is always set so downstream +// consumers can distinguish ccxray-emitted metrics from `claude_code.*` CLI +// metrics that the user may also be exporting. 
+// +// shutdown() returns synchronously to disabled state (so existing callers +// don't need to await) and fires the SDK provider.shutdown() in the +// background with a 2-second hard cap — never blocks process exit. +// +// init() never throws into the caller; any failure transitions to degraded +// with a reason and the proxy continues running. + +const emit = require('./emit'); +const defaultOtelLazy = require('./otel-lazy'); +const health = require('./otel-health'); + +let initialized = false; +let unsubscribers = []; +let sdkContext = null; // { provider, reader, instruments } | null + +function init(config, deps = {}) { + if (initialized) return health.getState(); + initialized = true; + + const tier = (config && config.otel && Number.isInteger(config.otel.tier)) + ? config.otel.tier + : 0; + + if (tier <= 0) { + return health.getState(); + } + + const otelLazy = deps.otelLazy || defaultOtelLazy; + if (!otelLazy.isAvailable()) { + health.transition('degraded', { reason: 'opentelemetry packages not installed' }); + return health.getState(); + } + + try { + if (config.otel.endpoint) { + sdkContext = initSdk(config, otelLazy); + } + registerHandlers(); + health.transition('active'); + } catch (err) { + sdkContext = null; + health.transition('degraded', { reason: `SDK init failed: ${err && err.message || err}` }); + } + return health.getState(); +} + +function initSdk(config, otelLazy) { + const sdk = otelLazy.tryRequire('@opentelemetry/sdk-metrics'); + const exp = otelLazy.tryRequire('@opentelemetry/exporter-metrics-otlp-http'); + const res = otelLazy.tryRequire('@opentelemetry/resources'); + if (!sdk || !exp || !res) { + throw new Error('required OTel package failed to resolve'); + } + + const exporter = new exp.OTLPMetricExporter({ + url: config.otel.endpoint, + headers: config.otel.headers || {}, + }); + + // Default 60s export interval, overridable for tests via env var. 
+ const intervalMs = Number(process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS) || 60000; + const reader = new sdk.PeriodicExportingMetricReader({ + exporter, + exportIntervalMillis: intervalMs, + }); + + const resource = res.resourceFromAttributes({ + 'ccxray.source': 'ccxray-proxy', + ...(config.otel.resource_attributes || {}), + }); + + const provider = new sdk.MeterProvider({ resource, readers: [reader] }); + + const meter = provider.getMeter('ccxray', '1'); + const instruments = { + inputTokens: meter.createCounter('ccxray.tokens.input_total', { + description: 'Input tokens per completed entry', + unit: 'tokens', + }), + outputTokens: meter.createCounter('ccxray.tokens.output_total', { + description: 'Output tokens per completed entry', + unit: 'tokens', + }), + cacheReadTokens: meter.createCounter('ccxray.tokens.cache_read_total', { + description: 'Cache-read input tokens per completed entry', + unit: 'tokens', + }), + cacheCreationTokens: meter.createCounter('ccxray.tokens.cache_creation_total', { + description: 'Cache-creation input tokens per completed entry', + unit: 'tokens', + }), + }; + + return { provider, reader, instruments }; +} + +function registerHandlers() { + unsubscribers.push(emit.on('entry_completed', onEntryCompleted)); + // Other event types land as later slices wire them up. 
+ unsubscribers.push(emit.on('session_started', () => { /* tier ≥ 1 stub */ })); + unsubscribers.push(emit.on('parser_unknown', () => { /* tier ≥ 1 stub */ })); + unsubscribers.push(emit.on('parser_mismatch', () => { /* tier ≥ 1 stub */ })); + unsubscribers.push(emit.on('parser_error', () => { /* tier ≥ 1 stub */ })); +} + +function onEntryCompleted(payload) { + if (!sdkContext) return; + const entry = payload && payload.entry; + const usage = entry && entry.usage; + if (!usage) return; + + const attrs = { + provider: entry.provider || 'unknown', + model: entry.model || 'unknown', + }; + + const input = Number(usage.input_tokens) || 0; + const output = Number(usage.output_tokens) || 0; + const cacheRead = Number(usage.cache_read_input_tokens) || 0; + const cacheCreate = Number(usage.cache_creation_input_tokens) || 0; + + sdkContext.instruments.inputTokens.add(input, attrs); + sdkContext.instruments.outputTokens.add(output, attrs); + sdkContext.instruments.cacheReadTokens.add(cacheRead, attrs); + sdkContext.instruments.cacheCreationTokens.add(cacheCreate, attrs); +} + +// Returns a Promise but is safe to ignore. The synchronous portion (before the +// first await below) is enough to make `health.getState() === 'disabled'` and +// `initialized === false` visible to immediate follow-up calls — existing +// `otel.shutdown()` callers that do not await still see the new state. 
+async function shutdown() { + for (const off of unsubscribers) { + try { off(); } catch { /* ignore */ } + } + unsubscribers = []; + + const ctx = sdkContext; + sdkContext = null; + + if (health.getState() !== 'disabled') { + health.transition('disabled'); + } + initialized = false; + + if (ctx && ctx.provider && typeof ctx.provider.shutdown === 'function') { + try { + await Promise.race([ + ctx.provider.shutdown(), + new Promise(resolve => { setTimeout(resolve, 2000).unref(); }), // unref: the cap timer must not itself delay process exit + ]); + } catch { /* never block process exit on shutdown errors */ } + } +} + +// Force-flush exists so tests (and a future `ccxray status --otel` command) +// can drain the reader on demand. Returns a Promise that resolves even on +// failure; it never throws to the caller. +async function flush() { + if (!sdkContext || !sdkContext.provider) return; + try { + await sdkContext.provider.forceFlush(); + } catch { /* ignore */ } +} + +function _resetForTests() { + // Sync drop of everything for tests that do not await shutdown. 
+ for (const off of unsubscribers) { try { off(); } catch {} } + unsubscribers = []; + sdkContext = null; + if (health.getState() !== 'disabled') health.transition('disabled'); + initialized = false; + health._resetForTests(); +} + +module.exports = { init, shutdown, flush, _resetForTests }; diff --git a/test/config-loader.test.js b/test/config-loader.test.js new file mode 100644 index 0000000..1d74e3a --- /dev/null +++ b/test/config-loader.test.js @@ -0,0 +1,85 @@ +'use strict'; + +const test = require('node:test'); +const assert = require('node:assert/strict'); +const fs = require('fs'); +const os = require('os'); +const path = require('path'); + +const { readProjectConfig, DEFAULT_CONFIG } = require('../server/config-loader'); +const { tryRequire, isAvailable } = require('../server/otel-lazy'); + +function mkTmp() { + return fs.mkdtempSync(path.join(os.tmpdir(), 'ccxray-cfg-')); +} + +test('config-loader: returns default config when .ccxray.json is absent', () => { + const dir = mkTmp(); + try { + const { config, source } = readProjectConfig(dir); + assert.equal(source, null); + assert.deepEqual(config, DEFAULT_CONFIG); + assert.equal(config.otel.enabled, false); + assert.equal(config.otel.tier, 0); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('config-loader: reads otel block from .ccxray.json', () => { + const dir = mkTmp(); + try { + fs.writeFileSync(path.join(dir, '.ccxray.json'), JSON.stringify({ + otel: { + enabled: true, + tier: 1, + endpoint: 'http://collector.local:4318', + headers: { 'x-team': 'platform' }, + resource_attributes: { 'service.name': 'ccxray-proxy' }, + }, + })); + const { config, source } = readProjectConfig(dir); + assert.ok(source && source.endsWith('.ccxray.json')); + assert.equal(config.otel.enabled, true); + assert.equal(config.otel.tier, 1); + assert.equal(config.otel.endpoint, 'http://collector.local:4318'); + assert.equal(config.otel.headers['x-team'], 'platform'); + 
assert.equal(config.otel.resource_attributes['service.name'], 'ccxray-proxy'); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('config-loader: malformed JSON throws with a descriptive error', () => { + const dir = mkTmp(); + try { + fs.writeFileSync(path.join(dir, '.ccxray.json'), '{ not valid json'); + assert.throws(() => readProjectConfig(dir), /not valid JSON/); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('config-loader: tier defaults to 0 when value is non-integer', () => { + const dir = mkTmp(); + try { + fs.writeFileSync(path.join(dir, '.ccxray.json'), JSON.stringify({ otel: { tier: 'one' } })); + const { config } = readProjectConfig(dir); + assert.equal(config.otel.tier, 0); + } finally { + fs.rmSync(dir, { recursive: true, force: true }); + } +}); + +test('otel-lazy: tryRequire returns the package object when installed', () => { + const api = tryRequire('@opentelemetry/api'); + assert.ok(api && typeof api === 'object', 'expected @opentelemetry/api to resolve'); +}); + +test('otel-lazy: tryRequire rejects unknown package names', () => { + assert.throws(() => tryRequire('@opentelemetry/not-real'), /unknown package/); +}); + +test('otel-lazy: isAvailable returns true once all known packages resolve', () => { + assert.equal(isAvailable(), true); +}); diff --git a/test/otel-init.test.js b/test/otel-init.test.js new file mode 100644 index 0000000..7f60877 --- /dev/null +++ b/test/otel-init.test.js @@ -0,0 +1,82 @@ +'use strict'; + +const test = require('node:test'); +const assert = require('node:assert/strict'); + +const emit = require('../server/emit'); +const otel = require('../server/otel'); +const health = require('../server/otel-health'); + +test.beforeEach(() => otel._resetForTests()); +test.afterEach(() => otel._resetForTests()); + +test('otel.init: tier 0 stays disabled and registers no subscribers', () => { + let entryCompletedFired = false; + const off = emit.on('entry_completed', 
() => { entryCompletedFired = true; }); + try { + const state = otel.init({ otel: { tier: 0 } }); + assert.equal(state, 'disabled'); + + // Only our test subscriber is attached; otel.init must not have added one. + emit.emit('entry_completed', { entry: { id: 'x' } }); + assert.equal(entryCompletedFired, true, 'test subscriber should still fire'); + assert.equal(health.getState(), 'disabled'); + } finally { + off(); + } +}); + +test('otel.init: tier ≥ 1 with packages present → active', () => { + const state = otel.init({ otel: { tier: 1 } }); + assert.equal(state, 'active'); + assert.equal(health.getState(), 'active'); +}); + +test('otel.init: tier ≥ 1 with packages absent → degraded with reason', () => { + const fakeLazy = { isAvailable: () => false, tryRequire: () => null }; + const state = otel.init({ otel: { tier: 1 } }, { otelLazy: fakeLazy }); + assert.equal(state, 'degraded'); + const status = health.getStatus(); + assert.equal(status.state, 'degraded'); + assert.match(status.reason || '', /not installed/i); +}); + +test('otel.init: idempotent, second call returns current state without crashing', () => { + const first = otel.init({ otel: { tier: 1 } }); + const second = otel.init({ otel: { tier: 1 } }); + assert.equal(first, 'active'); + assert.equal(second, 'active'); +}); + +test('otel.shutdown: returns state to disabled and unsubscribes', () => { + otel.init({ otel: { tier: 1 } }); + assert.equal(health.getState(), 'active'); + + // shutdown() is deliberately not awaited here: its synchronous portion + // unsubscribes the handlers and resets health before the first await, so + // the state is already 'disabled' when the next assertion runs. + otel.shutdown(); + assert.equal(health.getState(), 'disabled'); + + // After shutdown, init can run again. 
+ const reinit = otel.init({ otel: { tier: 1 } }); + assert.equal(reinit, 'active'); +}); + +test('otel-health: rejects unknown states', () => { + assert.throws(() => health.transition('flying'), /unknown state/); +}); + +test('otel-health: rejects invalid transitions', () => { + health._resetForTests(); + // disabled → circuit_open is not in the allow-list + assert.throws(() => health.transition('circuit_open'), /invalid transition/); +}); + +test('otel-health: transition clears reason when leaving error states', () => { + health._resetForTests(); + health.transition('degraded', { reason: 'boom' }); + assert.equal(health.getStatus().reason, 'boom'); + health.transition('active'); + assert.equal(health.getStatus().reason, null); +}); diff --git a/test/otel-vertical.test.js b/test/otel-vertical.test.js new file mode 100644 index 0000000..1746dcc --- /dev/null +++ b/test/otel-vertical.test.js @@ -0,0 +1,171 @@ +'use strict'; + +// Vertical-slice integration: a real OTel MeterProvider posts to an in-process +// mock OTLP HTTP collector. Proves the full chain — init → emit → record → +// PeriodicExportingMetricReader → OTLPMetricExporter → HTTP — is wired. +// +// Body content (protobuf) is not decoded here. Asserting (1) at least one POST +// arrived at `/v1/metrics`, (2) content-type is the OTLP HTTP signature, (3) +// the body is non-empty is enough to demo the rail. Decoded-content assertions +// land with §10.3 once a protobuf transformer is on the test path. 
+ +const test = require('node:test'); +const assert = require('node:assert/strict'); +const http = require('node:http'); + +const emit = require('../server/emit'); +const otel = require('../server/otel'); +const health = require('../server/otel-health'); + +function startMockCollector() { + const requests = []; + const server = http.createServer((req, res) => { + const chunks = []; + req.on('data', (c) => chunks.push(c)); + req.on('end', () => { + requests.push({ + method: req.method, + url: req.url, + contentType: req.headers['content-type'] || '', + contentLength: Buffer.concat(chunks).length, + }); + res.writeHead(200, { 'Content-Type': 'application/x-protobuf' }); + res.end(); + }); + }); + return new Promise((resolve) => { + server.listen(0, '127.0.0.1', () => { + const { port } = server.address(); + resolve({ + url: `http://127.0.0.1:${port}/v1/metrics`, + requests, + close: () => new Promise((r) => server.close(() => r())), + }); + }); + }); +} + +test.beforeEach(() => otel._resetForTests()); +test.afterEach(async () => { + await otel.shutdown(); +}); + +test('otel vertical slice: tier 1 + endpoint → exporter posts to collector', async () => { + const prevInterval = process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS; + // Long interval — we drain explicitly with flush() to avoid races. + process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = '60000'; + const collector = await startMockCollector(); + + try { + const state = otel.init({ + otel: { + tier: 1, + endpoint: collector.url, + headers: {}, + resource_attributes: { 'service.name': 'ccxray-test' }, + }, + }); + assert.equal(state, 'active'); + assert.equal(health.getState(), 'active'); + + emit.emit('entry_completed', { + entry: { + provider: 'anthropic', + model: 'claude-test-model', + usage: { + input_tokens: 100, + output_tokens: 50, + cache_read_input_tokens: 200, + cache_creation_input_tokens: 25, + }, + }, + }); + + await otel.flush(); + + // forceFlush triggers the exporter synchronously inside the reader. 
Then + // poll for up to 500 ms (50 × 10 ms) until the request reaches our server. + for (let i = 0; i < 50 && collector.requests.length === 0; i++) { + await new Promise((r) => setTimeout(r, 10)); + } + + assert.ok(collector.requests.length > 0, 'collector should have received at least one POST'); + const first = collector.requests[0]; + assert.equal(first.method, 'POST'); + assert.equal(first.url, '/v1/metrics'); + assert.match(first.contentType, /protobuf|json/); + assert.ok(first.contentLength > 0, 'collector POST body must be non-empty'); + } finally { + await collector.close(); + if (prevInterval === undefined) delete process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS; + else process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = prevInterval; + } +}); + +test('otel vertical slice: tier 1 with no endpoint → active but no exporter', async () => { + const state = otel.init({ otel: { tier: 1 } }); + assert.equal(state, 'active'); + + // No collector and no SDK context: emit must not throw and must not record. + emit.emit('entry_completed', { + entry: { + provider: 'anthropic', + model: 'claude-test-model', + usage: { input_tokens: 1, output_tokens: 1 }, + }, + }); + + await otel.flush(); // no-op, must not throw + assert.equal(health.getState(), 'active'); +}); + +test('otel vertical slice: shutdown honors 2-second cap even when provider hangs', async () => { + const prevInterval = process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS; + process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = '60000'; + + // Mock collector that hangs and never responds. Forces provider.shutdown() to + // block until the timeout race resolves. 
+ const server = http.createServer((_req, _res) => { /* hang */ }); + await new Promise((r) => server.listen(0, '127.0.0.1', r)); + const { port } = server.address(); + const url = `http://127.0.0.1:${port}/v1/metrics`; + + try { + otel.init({ otel: { tier: 1, endpoint: url, headers: {} } }); + emit.emit('entry_completed', { + entry: { provider: 'anthropic', model: 'm', usage: { input_tokens: 1, output_tokens: 1 } }, + }); + + const t0 = Date.now(); + await otel.shutdown(); + const elapsed = Date.now() - t0; + + // Hard cap is 2000ms; give 500ms scheduler slack. + assert.ok(elapsed < 2500, `shutdown took ${elapsed}ms, must respect 2s cap`); + assert.equal(health.getState(), 'disabled'); + } finally { + // Forcibly close still-open sockets from the hung exporter request, + // otherwise server.close() waits for them to drain (~8s). + if (typeof server.closeAllConnections === 'function') server.closeAllConnections(); + await new Promise((r) => server.close(() => r())); + if (prevInterval === undefined) delete process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS; + else process.env.CCXRAY_OTEL_EXPORT_INTERVAL_MS = prevInterval; + } +}); + +test('otel vertical slice: emit with no usage is a safe no-op', async () => { + const collector = await startMockCollector(); + try { + otel.init({ otel: { tier: 1, endpoint: collector.url, headers: {} } }); + + // Entries without usage (e.g. proxy errors) must not break the handler. + emit.emit('entry_completed', { entry: { provider: 'anthropic', model: 'm' } }); + emit.emit('entry_completed', { entry: null }); + emit.emit('entry_completed', {}); + + await otel.flush(); + assert.equal(health.getState(), 'active'); + } finally { + await collector.close(); + } +});
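
Reviewer sketch (not part of the patch): the 2-second shutdown cap in `server/otel.js` is a `Promise.race` between `provider.shutdown()` and a timer. The snippet below isolates that pattern so it can be reasoned about without the OTel SDK. `raceWithCap`, `hungProvider`, the 50 ms cap, and the `'timed_out'` fallback value are all hypothetical names and numbers chosen for illustration; the patch itself inlines the race and resolves the timer branch with no value.

```javascript
'use strict';

// Hypothetical helper, assumed for illustration only. Resolves with
// `fallback` if `promise` has not settled within `capMs`. The timer is
// cleared once the race settles, so a fast result leaves no pending timeout.
function raceWithCap(promise, capMs, fallback) {
  let timer;
  const cap = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), capMs);
  });
  return Promise.race([promise, cap]).finally(() => clearTimeout(timer));
}

// Stand-in for a provider whose shutdown() never settles, similar in spirit
// to the hung collector used in test/otel-vertical.test.js.
const hungProvider = { shutdown: () => new Promise(() => {}) };

const startedAt = Date.now();
const capped = raceWithCap(hungProvider.shutdown(), 50, 'timed_out')
  .then((outcome) => ({ outcome, elapsedMs: Date.now() - startedAt }));

capped.then(({ outcome, elapsedMs }) => {
  console.log(`shutdown outcome: ${outcome} after ${elapsedMs}ms`);
});
```

A rejection from the wrapped promise still propagates through `raceWithCap`, so a caller that must never throw (as `shutdown()` promises) would still need its own `try`/`catch` around the `await`. Clearing the timer in `finally` is one way to avoid a leftover timeout; calling `unref()` on the timer is another, with the difference that an unref'd timer will not keep the process alive while the race is still pending.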