Session content search and secret scanning: storage, API, CLI#534

Open: wesm wants to merge 2 commits into main from feat/session-content-search

wesm (Member) commented May 22, 2026

Adds search across session content and a secret-detection and findings
pipeline. This PR covers the engine, storage, HTTP API, and CLI. The web UI
for both features is deferred to a follow-up PR.

Content search

  • SearchContent queries message text, tool input, and tool result content in
    three modes: substring, regex (with a literal prefilter), and FTS5.
  • Results return snippets snapped to rune boundaries; secrets detected in a
    snippet are masked by default.
  • One-shot/automated sessions are excluded from content search by default.
  • CLI: agentsview session search <pattern> with --regex, --fts,
    --in messages,tool_input,tool_result, and date/project/agent filters.
  • HTTP: GET /api/v1/search/content.
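
The rune-boundary snapping mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `snapToRunes` is a hypothetical name, and widening the window outward (rather than shrinking it) is an assumption.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// snapToRunes widens the byte window [start, end) of s until both edges sit
// on UTF-8 rune boundaries, so a snippet never splits a multi-byte character.
// Hypothetical helper; snapping outward is an assumption.
func snapToRunes(s string, start, end int) (int, int) {
	if start < 0 {
		start = 0
	}
	if end > len(s) {
		end = len(s)
	}
	if start > end {
		start = end
	}
	// Move the left edge back to the first byte of the rune it landed in.
	for start > 0 && start < len(s) && !utf8.RuneStart(s[start]) {
		start--
	}
	// Move the right edge forward past any continuation bytes.
	for end < len(s) && !utf8.RuneStart(s[end]) {
		end++
	}
	return start, end
}

func main() {
	s := "naïve café"
	lo, hi := snapToRunes(s, 3, 11) // both edges start inside a 2-byte rune
	fmt.Printf("%q\n", s[lo:hi])
}
```

Slicing a Go string at arbitrary byte offsets can produce invalid UTF-8, which is why the edges are adjusted before the snippet is cut.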

Secret detection (internal/secrets)

A pure detector (no DB, no IO) with two confidence tiers:

  • Definite (well-anchored vendor formats): AWS, Anthropic, GitHub, Slack,
    Stripe, and Google API keys, plus PEM private-key blocks.
  • Candidate (false-positive-prone heuristics): basic-auth URLs, JWTs, and
    high-entropy assignments.

Scan reports findings, Redact masks them in arbitrary text, and Verify
re-checks a finding's coordinates against the current source.
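
To make the Scan / Redact / Verify split concrete, here is a minimal sketch of a pure detector with a single rule. The `Finding` fields, the function signatures, and the rule name are assumptions for illustration only; the one rule shown uses the widely documented AWS access key ID shape (`AKIA` followed by 16 uppercase alphanumerics).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Finding records where a rule matched. Field names are illustrative.
type Finding struct {
	Rule       string
	Start, End int // byte offsets into the scanned text
}

// awsAccessKeyID matches the AKIA + 16 uppercase alphanumerics shape.
var awsAccessKeyID = regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)

// Scan reports every match in text. It touches no database and does no IO.
func Scan(text string) []Finding {
	var out []Finding
	for _, loc := range awsAccessKeyID.FindAllStringIndex(text, -1) {
		out = append(out, Finding{Rule: "aws-access-key", Start: loc[0], End: loc[1]})
	}
	return out
}

// Redact masks each finding in text with asterisks of the same length.
func Redact(text string) string {
	var b strings.Builder
	prev := 0
	for _, f := range Scan(text) {
		b.WriteString(text[prev:f.Start])
		b.WriteString(strings.Repeat("*", f.End-f.Start))
		prev = f.End
	}
	b.WriteString(text[prev:])
	return b.String()
}

// Verify re-checks that the finding's coordinates still cover a full match
// in the current source text.
func Verify(text string, f Finding) bool {
	if f.Start < 0 || f.End > len(text) || f.Start >= f.End {
		return false
	}
	sub := text[f.Start:f.End]
	loc := awsAccessKeyID.FindStringIndex(sub)
	return loc != nil && loc[0] == 0 && loc[1] == len(sub)
}

func main() {
	text := "key=AKIA7QHWN2DKR4FYPLJM ok"
	fs := Scan(text)
	fmt.Println(len(fs), Redact(text), Verify(text, fs[0]))
}
```

Keeping the detector free of storage concerns means the same functions can serve the sync path, the backfill scan, and snippet masking.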

Secret findings: storage and scanning

  • New secret_findings table plus secret_leak_count and
    secrets_rules_version columns on sessions. secret_leak_count counts
    definite findings only.
  • Findings are scanned and persisted on each sync write and carried forward
    across a full-resync orphan copy (the database is a persistent archive).
  • agentsview secrets scan [--backfill] runs a resumable scan over the
    archive; agentsview secrets list lists findings, redacted by default.
  • --reveal prints unredacted values only when the server is bound to
    localhost, warns on stderr, never logs or stores revealed values, and
    re-reads the source by stored coordinates and re-validates before printing.
  • agentsview session list --has-secret filters to sessions with definite
    findings; secret_leak_count is surfaced in session get and health output.
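
One way the localhost gate for --reveal could be checked is sketched below. `isLoopbackBind` is a hypothetical helper, and treating the literal host name `localhost` as loopback without resolving it is an assumption; the PR does not describe how the check is implemented.

```go
package main

import (
	"fmt"
	"net"
)

// isLoopbackBind reports whether a listen address such as "127.0.0.1:8080"
// or "[::1]:8080" binds a loopback interface only. Hypothetical helper.
func isLoopbackBind(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	if host == "localhost" {
		return true // assumption: trust the name without resolving it
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	for _, addr := range []string{"127.0.0.1:8080", "[::1]:8080", "0.0.0.0:8080", ":8080"} {
		fmt.Printf("%-16s reveal allowed: %v\n", addr, isLoopbackBind(addr))
	}
}
```

An empty host (`:8080`) listens on all interfaces, so the sketch refuses it along with any non-loopback address.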

Inline sync scans the definite rules only and stamps a definite-only ruleset
version, keeping the candidate rules (which dominate both CPU and false
positives) out of the sync hot path. An explicit secrets scan runs the full
ruleset and adds candidate findings; secrets scan --backfill re-scans
sessions that received only the inline scan. secrets list defaults to
definite findings, with candidates opt-in via --confidence.

PostgreSQL

The optional read-only PostgreSQL mirror reaches parity with SQLite for these
features: schema for findings and session columns, push of secret_leak_count
and findings, the has-secret filter, findings list and source lookup, and
content search (substring and regex).

Migration

Schema changes are additive (new table, new columns); existing rows are
preserved. The data-version bump triggers a non-destructive full resync that
carries findings forward.

Not in this PR

The Svelte frontend for content search and secret findings is deferred to a
follow-up PR. This change is storage, API, and CLI only.

roborev-ci Bot commented May 22, 2026

roborev: Combined Review (8bd10e5)

Review blocked: one reviewer could not access the diff, and no substantive code findings were reported.

High

  • /tmp/roborev-snapshot-2214367073/roborev-snapshot-content.diff
    • The diff file was outside the allowed workspace, preventing review.
    • Move the diff into the repository workspace, such as /home/roborev/.roborev/clones/kenn-io/agentsview, and rerun the review.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci Bot commented May 22, 2026

roborev: Combined Review (8bd10e5)

Unable to complete review: one agent reported the diff was inaccessible.

High

  • /tmp/roborev-snapshot-2214367073/roborev-snapshot-content.diff
    • The diff file could not be reviewed because it was outside the allowed workspace path.
    • Move the diff into the project workspace, such as /home/roborev/.roborev/clones/kenn-io/agentsview, and rerun the review.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm force-pushed the feat/session-content-search branch from 8bd10e5 to 117f1bd on May 22, 2026 at 21:10

roborev-ci Bot commented May 22, 2026

roborev: Combined Review (117f1bd)

Summary verdict: No Medium, High, or Critical findings were reported.

All available review outputs are clean or empty; no actionable findings to include.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci Bot commented May 22, 2026

roborev: Combined Review (117f1bd)

No Medium, High, or Critical findings were reported.

All review outputs are clean or empty; no actionable findings to include.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci Bot commented May 22, 2026

roborev: Combined Review (1fbe2a7)

No Medium, High, or Critical findings were reported; the reviewed code is clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci Bot commented May 22, 2026

roborev: Combined Review (1fbe2a7)

No Medium, High, or Critical findings were reported.

All reviewed agents found the code clean or reported no actionable issues.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm force-pushed the feat/session-content-search branch from 1fbe2a7 to 9335ed6 on May 23, 2026 at 01:52

roborev-ci Bot commented May 23, 2026

roborev: Combined Review (9335ed6)

No Medium, High, or Critical findings were reported.

All reviewers found the change clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci Bot commented May 23, 2026

roborev: Combined Review (9335ed6)

No Medium, High, or Critical issues found.

All review agents reported the code as clean.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm force-pushed the feat/session-content-search branch from 9335ed6 to 8c36ae3 on May 23, 2026 at 13:52

roborev-ci Bot commented May 23, 2026

roborev: Combined Review (8c36ae3)

Search functionality looks mostly clean, with one medium backend performance issue to address before merge.

Medium

  • internal/postgres/schema.go
    PostgreSQL search uses ILIKE '%query%' on messages.content, but the schema defines no trigram GIN index on that column. As session history grows, each search requires a full table scan and slows down accordingly.

    Fix: Add trigram support and an index during schema setup or migration:

    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX IF NOT EXISTS idx_messages_content_trgm
      ON messages USING gin (content gin_trgm_ops);

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci Bot commented May 23, 2026

roborev: Combined Review (1a90e15)

Summary verdict: No Medium, High, or Critical findings were reported.

All reviewed agents found no actionable issues in scope.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Adds full-text and substring search across stored session content, plus a
secret-scanning subsystem that detects and redacts vendor-format API keys
and PEM blocks both inline during sync and via an explicit backfill scan.

Highlights:

- New /api/v1/search/content endpoint and CLI command supporting
  substring, regex, and FTS5 modes against messages, tool inputs, and
  canonical tool output, with date-range and project/agent/machine
  filters, pagination, and `--reveal` for localhost-only unredacted
  output.
- Definite-confidence secret rules (AWS, Anthropic, GitHub, Slack,
  Stripe, Google API, private-key blocks) scanned in the inline sync
  hot path; FP-prone candidate rules (high-entropy assignments, JWTs,
  basic-auth URLs) reserved for the explicit full scan.
- Stored findings carry rule, confidence, location, message ordinal,
  call/event index, byte span, and a redacted preview, indexed by
  session and (rule, confidence) for fast listing.
- Placeholder filters reject AWS-docs canonical IDs (IOSFODNN/EXAMPL),
  Anthropic keys with low-entropy suffixes, Slack tokens ending in
  `0123` or 4-rune repeats, and PEM blocks with bodies under 150
  non-whitespace bytes; suppresses the dominant false-positive class
  on real corpora (711 -> 29 definite findings on a 79K-session DB).
- RedactWindow masks secrets that straddle the snippet window,
  including grouped-rule spans (basic-auth password, high-entropy
  assignment value) whose anchoring context lies outside the slice.
- PostgreSQL backend mirrors the SQLite content-search and
  secret-listing surface (ILIKE substring, regex, fts->substring
  fallback) and adds a gin/pg_trgm index on messages.content for
  substring throughput.
- Hidden `--cpuprofile`, `--memprofile`, and `--trace` flags on
  `agentsview sync` plus per-phase wall-clock counters
  (prep/scan/write) for profiling the resync hot path.
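
The straddling case that RedactWindow handles can be illustrated with a small sketch. It assumes findings were detected on the full text and are passed in as byte spans; `Span` and `redactWindow` are hypothetical names, and masking with `*` is an assumption.

```go
package main

import "fmt"

// Span is a secret's byte range in the full text. Illustrative type.
type Span struct{ Start, End int }

// redactWindow returns text[ws:we] with every byte that overlaps a secret
// span masked, even when the span starts before or ends after the window.
// Assumes spans cover ASCII secrets, so byte-wise masking is safe.
func redactWindow(text string, ws, we int, secrets []Span) string {
	if ws < 0 {
		ws = 0
	}
	if we > len(text) {
		we = len(text)
	}
	if ws > we {
		ws = we
	}
	out := []byte(text[ws:we])
	for _, sp := range secrets {
		lo, hi := sp.Start, sp.End
		if lo < ws {
			lo = ws
		}
		if hi > we {
			hi = we
		}
		for i := lo; i < hi; i++ {
			out[i-ws] = '*'
		}
	}
	return string(out)
}

func main() {
	text := "token AKIA7QHWN2DKR4FYPLJM end"
	secret := []Span{{Start: 6, End: 26}}
	// The window cuts the key in half; the visible half is still masked.
	fmt.Printf("%q\n", redactWindow(text, 0, 12, secret))
}
```

Scanning only the snippet would miss this key, because the truncated fragment no longer matches the rule's full shape; that is the reason the mask is derived from full-text coordinates.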

Schema additions: secret_findings table with composite indices, plus
sessions.secret_leak_count and sessions.secrets_rules_version columns.
Bumps dataVersion to 29 to trigger a one-shot full resync that scans
existing sessions.
@wesm force-pushed the feat/session-content-search branch from 1a90e15 to eef1f42 on May 23, 2026 at 19:11

roborev-ci Bot commented May 23, 2026

roborev: Combined Review (eef1f42)

No Medium, High, or Critical issues were reported.

All reviewed agents found the code clean within their scopes.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The validators wired in v2 still let through the dominant placeholder
shapes that flood agent transcripts: repeating short blocks (ghp_aaaa…,
A1b2A1b2…, a1B2a1B2…, aB3_xaB3_x…), monotone alphabet/digit runs
(abcdefABCDEF, 1234567890, ZYXWVU), and markdown diffs that quote PEM
BEGIN/END markers around prose. Backfilling against a 33k-session DB
showed 29 definite findings with 7 distinct placeholder shapes still
making it through; the same set drops to 22 with this change, and the
remaining values are all high-entropy with no structural patterns.

Add three structural helpers to rules.go:

* hasRepeatingBlock: a single byte covering ≥75% of s (block size 1),
  or the leading block at any phase covering ≥75% (sizes 2..6). Catches
  every "ghp_<seed>" style placeholder.
* hasMonotoneRun: a run of ≥6 bytes stepping monotonically by ±1 in
  ASCII. Catches "abcdefABCDEF012345" tails and the like.
* bodyLooksRandom: requires Shannon entropy ≥3.5 and rejects any body that
  trips hasRepeatingBlock or hasMonotoneRun. Each rule's validator strips the
  well-known vendor prefix before calling so the fixed prefix doesn't anchor
  a phase.
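
A rough sketch of how the three helpers might fit together, reconstructed from the descriptions above rather than taken from rules.go. The reading of "leading block at any phase", the two-repetition minimum, and measuring entropy in bits per byte are all assumptions.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// shannonEntropy returns the entropy of s in bits per byte.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	h, n := 0.0, float64(len(s))
	for _, c := range counts {
		if c > 0 {
			p := float64(c) / n
			h -= p * math.Log2(p)
		}
	}
	return h
}

// hasRepeatingBlock reports whether one byte, or one block of 2..6 bytes
// repeated on a fixed stride from some phase, covers at least 75% of s.
func hasRepeatingBlock(s string) bool {
	n := len(s)
	var counts [256]int
	for i := 0; i < n; i++ {
		counts[s[i]]++
		if counts[s[i]]*4 >= n*3 {
			return true
		}
	}
	for k := 2; k <= 6; k++ {
		for p := 0; p < k && p+k <= n; p++ {
			block, covered := s[p:p+k], 0
			for i := p; i+k <= n; i += k {
				if s[i:i+k] == block {
					covered += k
				}
			}
			// Assumption: demand at least two repetitions of the block.
			if covered >= 2*k && covered*4 >= n*3 {
				return true
			}
		}
	}
	return false
}

// hasMonotoneRun reports whether s holds a run of at least 6 bytes whose
// ASCII values step by +1 each time, or by -1 each time.
func hasMonotoneRun(s string) bool {
	run, dir := 1, 0
	for i := 1; i < len(s); i++ {
		d := int(s[i]) - int(s[i-1])
		switch {
		case (d == 1 || d == -1) && d == dir:
			run++
		case d == 1 || d == -1:
			run, dir = 2, d
		default:
			run, dir = 1, 0
		}
		if run >= 6 {
			return true
		}
	}
	return false
}

// bodyLooksRandom combines the three checks on a prefix-stripped body.
func bodyLooksRandom(body string) bool {
	return shannonEntropy(body) >= 3.5 && !hasRepeatingBlock(body) && !hasMonotoneRun(body)
}

func main() {
	for _, body := range []string{
		strings.Repeat("a", 36),
		strings.Repeat("A1b2", 20),
		"1234567890ABCDEF",
		"7QHWN2DKR4FYPLJM",
	} {
		fmt.Printf("%-20.20s random=%v\n", body, bodyLooksRandom(body))
	}
}
```

The bodies in `main` are the fixture shapes named in this commit message with the `AKIA` prefix removed, which mirrors the prefix stripping each validator performs before the call.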

Wire validators into every definite rule (github-pat, stripe-secret,
google-api-key previously had none) and extend the AWS/Anthropic/Slack
validators to also call bodyLooksRandom on the body after the prefix.
Tighten the PEM gate to require ≥90% base64 alphabet in the body — a
markdown diff with pipes, hyphens, and prose between BEGIN/END markers
fails the purity gate while a real ≥256-char base64 body passes.
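
The purity gate could look something like the sketch below. The function names are hypothetical, and two details are assumptions: only line breaks are excluded from the count (PEM bodies are line-wrapped), while spaces and tabs count as non-base64 bytes.

```go
package main

import "fmt"

// isBase64Byte reports whether c belongs to the standard base64 alphabet,
// including the '=' padding character.
func isBase64Byte(c byte) bool {
	switch {
	case c >= 'A' && c <= 'Z', c >= 'a' && c <= 'z', c >= '0' && c <= '9':
		return true
	}
	return c == '+' || c == '/' || c == '='
}

// base64Purity returns the fraction of non-newline bytes in body that are
// base64 alphabet characters. Assumption: spaces and tabs count against it.
func base64Purity(body string) float64 {
	total, ok := 0, 0
	for i := 0; i < len(body); i++ {
		c := body[i]
		if c == '\n' || c == '\r' {
			continue
		}
		total++
		if isBase64Byte(c) {
			ok++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(ok) / float64(total)
}

// pemBodyPassesPurity applies the 90% threshold described above.
func pemBodyPassesPurity(body string) bool {
	return base64Purity(body) >= 0.9
}

func main() {
	table := "| key | value |\n|-----|-------|\n"
	encoded := "QUJDREVGR0hJSktMTU5PUA==\nQUJDREVGR0hJSktMTU5PUA==\n"
	fmt.Println(pemBodyPassesPurity(table), pemBodyPassesPurity(encoded))
}
```

A quoted markdown table is dominated by pipes, hyphens, and spaces, so it falls well below the threshold even though its letters are valid base64 characters.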

Refactor the existing notLowEntropySuffix into a generic
notTrailingRunRepeat helper used by both the Anthropic and Slack
validators. Bump rulesAlgorithmVersion to 3 so backfill re-scans
stored findings.

Test fixtures across the codebase needed updating: every
AKIA1234567890ABCDEF / AKIAZYXWVUTSRQPONMLK / xoxb-…c8Jp /
xoxs-…xYz9 / rep("a", 36) / rep("A1b2", 20) style fixture was
designed as a structurally obvious placeholder and would now fail the
validator. Replace them with hand-picked random-looking fixtures
(AKIA7QHWN2DKR4FYPLJM etc.) verified to have no monotone runs of ≥6 and
no dominant byte. The placeholder patterns move into a new
TestRejectsRepeatingBlockPlaceholders / TestRejectsSequentialRunPlaceholders
suite that asserts they are now rejected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

roborev-ci Bot commented May 23, 2026

roborev: Combined Review (5a5760d)

No Medium, High, or Critical findings were reported.

Security review found no exploitable issues. Other completed reviews did not report findings; one review was skipped due to quota.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Note: gemini review skipped (agent quota exhausted)
