1,124 changes: 1,124 additions & 0 deletions docs/otel-integration.html

Large diffs are not rendered by default.

1,138 changes: 1,138 additions & 0 deletions docs/otel-phase1-overview.html

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions openspec/changes/add-otel-metrics-phase1/.openspec.yaml
@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-05-12
166 changes: 166 additions & 0 deletions openspec/changes/add-otel-metrics-phase1/design.md

Large diffs are not rendered by default.

47 changes: 47 additions & 0 deletions openspec/changes/add-otel-metrics-phase1/proposal.md
@@ -0,0 +1,47 @@
## Why

ccxray captures everything an agent does at the HTTP layer — full request/response, token counts, cost, tool calls, MCP server activity, skill activations — but the data lives only in the local dashboard. Teams that already operate Grafana / Datadog / Honeycomb cannot aggregate ccxray's signals into their existing observability pipeline. Claude Code's CLI has built-in OTel for Anthropic only and does not expose the HTTP-layer truth ccxray sees; Codex, Gemini, and future providers have no OTel at all. The full design rationale, pre-mortem (11 risks scored ≥ 9/10) and alternative options live at `docs/otel-integration.html`.

This change adds Phase 1: emit ccxray's metrics over OTLP, gated behind a default-off tiered opt-in, with a failure model that never degrades the proxy. Phase 2 (metadata-only traces with `entry_id` drill-back) is a follow-up.

## What Changes

- New optional metric export under `ccxray.*` namespace covering cost, usage (tool / MCP / skill / agent_type / provider), quality (errors, stop_reason, latency, max_tokens_hit_rate), patterns (context_utilization, auto_compact_triggered, subagent_ratio, tools_per_turn) and governance (permission_mode, dangerous_tool, file_writes).
- New configuration files: `.ccxray.json` (repo, project-level) and `.ccxray.user.json` (gitignored, personal). `${ENV_VAR}` interpolation. Schema rejects literal-looking secrets. Prompt to add `.ccxray.user.json` to `.gitignore` if missing (applied automatically with `--yes`).
- Three-tier opt-in model: **tier 0 disabled (default)** / tier 1 anonymous project-level / tier 2 personal named. Project config is the upper bound; personal config can only equal or downgrade. Engineers can opt out unilaterally.
- Detect `CLAUDE_CODE_ENABLE_TELEMETRY=1` and enter "complement mode" with `ccxray.cli_otel_active=true` attribute; every metric carries `ccxray.source="ccxray-proxy"` resource attribute. ccxray emits ccxray-internal invariant metrics (`ccxray.invariants.*`); cross-source reconciliation against the CLI is documented as a downstream pattern (recording rules / sidecar / wide-event join on `request_id`) in `docs/otel-recon.md`, not as an in-proxy gauge — keeps ccxray as a transparent proxy with bounded blast radius.
- Cardinality budget per (metric, attribute) with `_overflow_` fallback and `ccxray.metrics.overflow_total` sentinel; attribute key allow-list enforced via OTel View API.
- Parser schema-ization: extract tool / MCP / skill detection into `server/parsers/*.schema.json` with snapshot fixtures, sentinel metrics (`ccxray.parser.unknown_*_total`), and reconciliation invariants (tool_use block count must equal extracted count).
- Failure fallback: config errors fail fast at startup; init errors degrade silently (ccxray keeps proxying); runtime errors handled by bounded queue (drop oldest) + circuit breaker (5 failures → open 60s → exponential backoff). OTel failures **never** break the proxy.
- New shared modules: `server/otel-health.js` (state machine, circuit breaker, bounded queue, local log writer) and `server/config-loader.js` (JSON schema validation, env interpolation, secret detection, gitignore check).
- OTel emit lives in the **client** process, not the hub. Each project's tier/endpoint coexists on the same hub. Hub gains its own operational metrics under `ccxray.hub.*` namespace.
- New CLI commands: `ccxray status --otel` (current tier, endpoint, health, cardinality usage), `ccxray otel preview` (dry-run printing the next export's content), `ccxray parser report` (recent unknown events for drift detection).
- Out of scope (Phase 2 follow-up): span emit (traces), `/entry/:id` deep-link route, `ccxray.entry_id` / `dashboard_url` attributes.
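
The cardinality budget with `_overflow_` fallback described above could look roughly like the following. This is a minimal sketch, not the proposed implementation: the `CardinalityBudget` class, its `admit` method, and the metric name used in the usage lines are hypothetical, and the `overflowTotal` field only stands in for the `ccxray.metrics.overflow_total` sentinel.

```javascript
// Sketch of a per-(metric, attribute) cardinality budget.
// Class and method names are illustrative assumptions, not ccxray's API.
const OVERFLOW = '_overflow_';

class CardinalityBudget {
  constructor(limit) {
    this.limit = limit;      // max distinct values per (metric, attribute)
    this.seen = new Map();   // "metric|attribute" -> Set of admitted values
    this.overflowTotal = 0;  // stands in for ccxray.metrics.overflow_total
  }

  // Returns the value to record: the original value while the budget
  // allows it, otherwise the "_overflow_" placeholder.
  admit(metric, attribute, value) {
    const key = `${metric}|${attribute}`;
    let values = this.seen.get(key);
    if (!values) {
      values = new Set();
      this.seen.set(key, values);
    }
    if (values.has(value)) return value; // already admitted
    if (values.size < this.limit) {
      values.add(value);
      return value;
    }
    this.overflowTotal += 1;
    return OVERFLOW;
  }
}

// Usage with a hypothetical metric name and a budget of 2 values.
const budget = new CardinalityBudget(2);
budget.admit('ccxray.tool.calls', 'tool', 'Read');  // admitted
budget.admit('ccxray.tool.calls', 'tool', 'Write'); // admitted
budget.admit('ccxray.tool.calls', 'tool', 'Bash');  // replaced by "_overflow_"
```

Values that were admitted before the budget filled keep their own series, so existing dashboards stay stable while new values collapse into one bucket.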

## Capabilities

### New Capabilities

- `otel-config`: `.ccxray.json` and `.ccxray.user.json` schema, `${ENV_VAR}` interpolation, literal-secret rejection, `.gitignore` auto-amend, merging rules in which the project config is the upper bound and the personal config can only match or downgrade it.
- `otel-export`: OTel SDK initialization (client-side, not hub), metric definitions under `ccxray.*` namespace, `ccxray.source` resource attribute, cardinality budget enforcement with `_overflow_` fallback, CLI coexistence detection and complement-mode signaling, ccxray-internal invariant metrics, explicit non-emit of cross-source diff gauge (deferred to downstream).
- `otel-tiers`: three-tier opt-in (disabled / project-anonymous / personal-named), tier resolution with the project config as the upper bound and the personal config able only to match or downgrade it, `enduser.id` attribute only in tier 2, opt-in acknowledgment timestamp persisted in personal config.
- `otel-health`: failure state machine (`disabled / active / degraded / circuit_open`), bounded export queue with drop-oldest semantics, circuit breaker with exponential backoff, local failure log at `~/.ccxray/otel.log` with rotation, never-block guarantee for the proxy path.
- `parser-schemas`: extract skill / MCP / tool / agent-type detection into versioned JSON schemas, snapshot fixtures per provider (Anthropic + Codex), sentinel metrics for unknown events, reconciliation invariants run per entry, try/catch isolation so parser failure does not affect ccxray core.
- `otel-introspection`: `ccxray status --otel` view (tier, endpoint, health, cardinality, dropped counts), `ccxray otel preview` dry-run, `ccxray parser report` for drift inspection, startup banner declaring active tier and CLI coexistence mode.
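
To make the `otel-health` failure model concrete, here is a hedged sketch of a drop-oldest bounded queue and a circuit breaker that opens after 5 consecutive failures for 60 seconds and doubles the open window on each further failure. The class names, the injectable `now` clock, and the doubling factor are assumptions for illustration; the proposal only states "exponential backoff" without fixing a multiplier.

```javascript
// Bounded export queue: when full, the oldest item is dropped so that
// enqueueing never blocks the proxy path.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0;
  }

  push(item) {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // drop oldest
      this.dropped += 1;
    }
    this.items.push(item);
  }
}

// Circuit breaker: opens after `threshold` consecutive failures.
// The open window starts at `baseOpenMs` and doubles on each further
// failure until a success resets it (doubling is an assumed policy).
class CircuitBreaker {
  constructor({ threshold = 5, baseOpenMs = 60000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.baseOpenMs = baseOpenMs;
    this.now = now;
    this.failures = 0;
    this.trips = 0;
    this.openUntil = 0;
  }

  // True when an export attempt is allowed.
  canExport() {
    return this.now() >= this.openUntil;
  }

  recordSuccess() {
    this.failures = 0;
    this.trips = 0;
    this.openUntil = 0;
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openUntil = this.now() + this.baseOpenMs * 2 ** this.trips;
      this.trips += 1;
    }
  }
}
```

An exporter loop would call `canExport()` before each attempt and simply skip the export while the breaker is open, leaving the request path untouched.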

### Modified Capabilities

(None — Phase 1 is additive. Existing capabilities are not changed.)

## Impact

- New `server/otel.js`, `server/otel-health.js`, `server/config-loader.js`, `server/parsers/` directory tree (schemas + fixtures + unknown-handler).
- `server/forward.js` — emit metric on request completion (counters + histograms) via the otel-health-guarded queue; no behavior change when OTel is disabled.
- `server/store.js` — session / tool / skill / MCP / agent_type detection becomes a thin shim over `server/parsers/*`; reconciliation invariants run per entry; sentinel counters incremented on unknown.
- `server/system-prompt.js` — agent-type and skill marker detection moves into `parsers/anthropic-skills.schema.json`; existing parsing behavior preserved.
- `server/hub.js` — hub gains optional `ccxray.hub.*` operational metrics (uptime, request rate, connected clients) under its own config in `~/.ccxray/hub-config.json`. Hub does NOT emit business metrics; those stay client-side.
- `server/routes/api.js` — no new HTTP routes in Phase 1 (deep-link route is Phase 2).
- `bin/ccxray.js` or equivalent CLI entry — new subcommands: `status --otel`, `otel preview`, `parser report`. Existing commands unaffected when OTel is disabled.
- `package.json` — add minimal OTel dependencies (`@opentelemetry/api`, `@opentelemetry/sdk-metrics`, `@opentelemetry/exporter-metrics-otlp-http`, `@opentelemetry/resources`). No auto-instrumentations. Optional dependency pattern so the package still works if OTel is not installed.
- New docs: `docs/otel-integration.html` (already exists, decision record), `docs/otel-ethics.md` (why these metrics are not for individual performance evaluation), `docs/otel-quickstart.md` (90-second Grafana onboarding).
- Tests: parser snapshot fixtures, cardinality budget enforcement tests, tier resolution matrix tests, failure-mode tests (collector down, bad endpoint, bad auth, malformed config).
90 changes: 90 additions & 0 deletions openspec/changes/add-otel-metrics-phase1/specs/otel-config/spec.md
@@ -0,0 +1,90 @@
## ADDED Requirements

### Requirement: Project and personal config files

ccxray SHALL read two optional configuration files at startup: `.ccxray.json` (project-level, repo-checked-in) and `.ccxray.user.json` (personal-level, gitignored). Both files use JSON. A missing project config SHALL be treated as tier 0 (disabled); a missing personal config SHALL leave the project tier unchanged.

#### Scenario: No config present

- **WHEN** ccxray starts in a directory with neither `.ccxray.json` nor `.ccxray.user.json`
- **THEN** OTel SDK SHALL NOT initialize and no network egress SHALL occur

#### Scenario: Project config present, no personal config

- **WHEN** ccxray starts in a directory with `.ccxray.json` that enables tier 1
- **THEN** OTel SDK SHALL initialize at tier 1 with project-level attributes only

#### Scenario: Both project and personal config present

- **WHEN** project config authorizes tier 2 and personal config sets tier 2 with `enduser.id`
- **THEN** the effective tier SHALL be tier 2 and `enduser.id` SHALL be attached to emitted metrics

### Requirement: Tier resolution as upper bound and lower bound

The effective tier SHALL be `min(project_tier, personal_tier)` so that the project config is an upper bound and personal config can only equal-or-downgrade. An engineer SHALL be able to unilaterally opt out by setting tier 0 in personal config.

#### Scenario: Personal config downgrades from project

- **WHEN** project config enables tier 1 and personal config explicitly sets tier 0
- **THEN** no OTel emission SHALL occur for this engineer

#### Scenario: Personal config cannot exceed project

- **WHEN** project config enables tier 1 and personal config sets tier 2
- **THEN** tier resolution SHALL clamp the effective tier to tier 1 and emit a warning; tier 2 SHALL take effect only when the project config explicitly authorizes tier 2
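
The `min(project_tier, personal_tier)` rule can be sketched as below. The function name `resolveTier` and the return shape are hypothetical. The handling of absent files is an assumed reading of the scenarios: no project config means tier 0, and no personal config inherits the project tier.

```javascript
// Sketch of tier resolution. Each argument is 0, 1, 2, or undefined
// when the corresponding config file is absent (assumed convention).
function resolveTier(projectTier, personalTier) {
  const project = projectTier ?? 0;         // no project config: disabled
  const personal = personalTier ?? project; // no personal config: inherit
  const effective = Math.min(project, personal);
  const warning = personal > project
    ? `personal tier ${personal} exceeds project tier ${project}; clamped to ${project}`
    : null;
  return { effective, warning };
}

// Usage: the personal config can downgrade but never exceed the project.
resolveTier(1, 0); // effective tier 0, no warning (unilateral opt-out)
resolveTier(1, 2); // effective tier 1, with a clamp warning
resolveTier(2, 2); // effective tier 2, no warning
```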

### Requirement: Environment variable interpolation

All string values in config files SHALL support `${VAR}` interpolation, resolved at load time from `process.env`. Unresolved variables SHALL cause startup failure with a clear error message naming the missing variable.

#### Scenario: Header value uses env var

- **WHEN** config contains `"Authorization": "Bearer ${OTLP_TOKEN}"` and `OTLP_TOKEN=abc123` is set in the environment
- **THEN** the loaded header value SHALL be `"Bearer abc123"` and the resolved value SHALL NOT appear in any debug log line

#### Scenario: Missing env var

- **WHEN** config contains `"Authorization": "Bearer ${MISSING_VAR}"` and `MISSING_VAR` is not set
- **THEN** ccxray SHALL exit non-zero with an error message that includes the file path, line, and the variable name `MISSING_VAR`
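
A minimal sketch of `${VAR}` interpolation follows, assuming a hypothetical `interpolate` helper that receives the environment as a plain object. It throws on the first unresolved variable; attaching the file path and line to the error, as the scenario requires, is left to the caller and is not shown.

```javascript
// Replaces every ${NAME} in `value` with env[NAME].
// Throws if a referenced variable is not set, naming the variable.
function interpolate(value, env) {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (match, name) => {
    if (env[name] === undefined) {
      throw new Error(`unresolved environment variable: ${name}`);
    }
    return env[name];
  });
}

// Usage: in ccxray this would presumably be called with process.env.
const header = interpolate('Bearer ${OTLP_TOKEN}', { OTLP_TOKEN: 'abc123' });
// header is now 'Bearer abc123'
```

Passing `env` as a parameter instead of reading `process.env` directly keeps the helper easy to test with a fixed environment.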

### Requirement: Literal-secret rejection

The schema validator SHALL reject any string value that matches a literal-secret pattern (`Bearer [A-Za-z0-9]{20,}`, `sk_live_*`, `sk_test_*`, `ghp_*`, JWT three-segment structure) unless the value is wrapped in `${...}`. Pure URLs and hostnames SHALL be allowed.

#### Scenario: Literal bearer token rejected

- **WHEN** config contains `"Authorization": "Bearer abc123longtokenvalue..."`
- **THEN** ccxray SHALL exit at startup with an error suggesting the user switch to `${ENV_VAR}` interpolation

#### Scenario: Interpolated bearer token accepted

- **WHEN** config contains `"Authorization": "Bearer ${TOKEN}"` and `TOKEN` is set
- **THEN** ccxray SHALL load successfully and use the resolved value
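
The literal-secret check could be approximated as follows. This is a sketch under stated assumptions: the helper name `looksLikeLiteralSecret` is hypothetical, the `*` wildcards are read as "one or more word characters", and the JWT pattern relies on the common heuristic that base64url-encoded JSON segments begin with `eyJ`, which keeps dotted hostnames from matching.

```javascript
// Patterns taken from the requirement text; the exact regex forms are
// an assumed interpretation, not the project's schema.
const SECRET_PATTERNS = [
  /Bearer [A-Za-z0-9]{20,}/,
  /sk_live_\w+/,
  /sk_test_\w+/,
  /ghp_\w+/,
  // JWT heuristic: header and payload are base64url JSON, so both
  // segments start with "eyJ".
  /eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/,
];

// Returns true when `value` contains something that looks like a
// literal secret. ${...} placeholders are removed first, so
// interpolated values pass.
function looksLikeLiteralSecret(value) {
  const stripped = value.replace(/\$\{[^}]*\}/g, '');
  return SECRET_PATTERNS.some((pattern) => pattern.test(stripped));
}

// Usage
looksLikeLiteralSecret('Bearer ${TOKEN}');                 // false
looksLikeLiteralSecret('https://otel.example.com/v1/metrics'); // false
```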

### Requirement: Gitignore auto-amend on first generation

When ccxray writes a new `.ccxray.user.json` for the first time, it SHALL check whether the file is covered by the project's `.gitignore`. If not, ccxray SHALL prompt the user (or apply automatically when `--yes` is passed) to append `.ccxray.user.json` to `.gitignore`.

#### Scenario: Gitignore missing entry

- **WHEN** ccxray creates `.ccxray.user.json` in a repo whose `.gitignore` does not list it
- **THEN** ccxray SHALL prompt for permission to append `.ccxray.user.json` and reflect the choice in the next run

#### Scenario: Gitignore already covers the file

- **WHEN** ccxray creates `.ccxray.user.json` and `.gitignore` already contains an entry matching the file
- **THEN** no prompt SHALL appear and the file SHALL be written silently
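
The gitignore coverage check could be sketched as a pure function over the `.gitignore` text. The helper names are hypothetical, and the matching is deliberately simplified: it only recognizes an exact entry (with or without a leading `/`) and ignores globs and negations. A real implementation would more likely defer to `git check-ignore` for authoritative matching.

```javascript
const USER_CONFIG = '.ccxray.user.json';

// True when the .gitignore text lists the user config file explicitly.
// Simplification: glob patterns and "!" negations are not evaluated.
function isCoveredByGitignore(gitignoreText) {
  return gitignoreText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#'))
    .some((line) => line === USER_CONFIG || line === `/${USER_CONFIG}`);
}

// Returns the .gitignore text with the entry appended when missing.
// The text is returned unchanged when the file is already covered.
function appendEntry(gitignoreText) {
  if (isCoveredByGitignore(gitignoreText)) return gitignoreText;
  const separator =
    gitignoreText === '' || gitignoreText.endsWith('\n') ? '' : '\n';
  return `${gitignoreText}${separator}${USER_CONFIG}\n`;
}
```

Keeping both helpers free of file I/O means the prompt, the `--yes` path, and the actual write can be layered on top without changing the matching logic.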

### Requirement: Config error fails fast at startup

Config syntax errors, schema violations, unresolved `${VAR}` references, and literal-secret matches SHALL cause ccxray to exit non-zero at startup with an actionable error message. ccxray SHALL NOT silently continue with a partial config.

#### Scenario: Invalid JSON

- **WHEN** `.ccxray.json` contains malformed JSON
- **THEN** ccxray SHALL print a parse error citing the file path and the offending line/column, and SHALL exit non-zero

#### Scenario: Schema violation

- **WHEN** `.ccxray.json` sets `otel.tier` to an unknown value
- **THEN** ccxray SHALL print a schema error naming the field and listing valid values, and SHALL exit non-zero