Session content search and secret scanning: storage, API, CLI #534
Adds full-text and substring search across stored session content, plus a secret-scanning subsystem that detects and redacts vendor-format API keys and PEM blocks both inline during sync and via an explicit backfill scan.

Highlights:
- New /api/v1/search/content endpoint and CLI command supporting substring, regex, and FTS5 modes against messages, tool inputs, and canonical tool output, with date-range and project/agent/machine filters, pagination, and `--reveal` for localhost-only unredacted output.
- Definite-confidence secret rules (AWS, Anthropic, GitHub, Slack, Stripe, Google API, private-key blocks) scanned in the inline sync hot path; FP-prone candidate rules (high-entropy assignments, JWTs, basic-auth URLs) reserved for the explicit full scan.
- Stored findings carry rule, confidence, location, message ordinal, call/event index, byte span, and a redacted preview, indexed by session and (rule, confidence) for fast listing.
- Placeholder filters reject AWS-docs canonical IDs (IOSFODNN/EXAMPL), Anthropic keys with low-entropy suffixes, Slack tokens ending in `0123` or 4-rune repeats, and PEM blocks with bodies under 150 non-whitespace bytes; suppresses the dominant false-positive class on real corpora (711 -> 29 definite findings on a 79K-session DB).
- RedactWindow masks secrets that straddle the snippet window, including grouped-rule spans (basic-auth password, high-entropy assignment value) whose anchoring context lies outside the slice.
- PostgreSQL backend mirrors the SQLite content-search and secret-listing surface (ILIKE substring, regex, fts->substring fallback) and adds a gin/pg_trgm index on messages.content for substring throughput.
- Hidden `--cpuprofile`, `--memprofile`, and `--trace` flags on `agentsview sync` plus per-phase wall-clock counters (prep/scan/write) for profiling the resync hot path.

Schema additions: secret_findings table with composite indices, plus sessions.secret_leak_count and sessions.secrets_rules_version columns.
Bumps dataVersion to 29 to trigger a one-shot full resync that scans existing sessions.
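
The window masking attributed to RedactWindow above can be pictured with a minimal sketch. It assumes secret spans are already known as half-open byte ranges in full-text coordinates; the `Span` type and the `redactWindow` function are hypothetical names for illustration, not the code in this PR, and grouped-rule handling is not modeled.

```go
package main

import "fmt"

// Span is a half-open byte range [Start, End) in the full text.
// Hypothetical type, for illustration only.
type Span struct{ Start, End int }

// redactWindow returns text[winStart:winEnd] with every byte that falls
// inside a secret span replaced by '*'. Spans use full-text coordinates,
// so a secret that begins before the window or ends after it is still
// masked for the part that is visible.
func redactWindow(text string, spans []Span, winStart, winEnd int) string {
	if winStart < 0 {
		winStart = 0
	}
	if winEnd > len(text) {
		winEnd = len(text)
	}
	if winStart >= winEnd {
		return ""
	}
	out := []byte(text[winStart:winEnd])
	for _, s := range spans {
		// Clamp the span to the window before masking.
		lo, hi := s.Start, s.End
		if lo < winStart {
			lo = winStart
		}
		if hi > winEnd {
			hi = winEnd
		}
		for i := lo; i < hi; i++ {
			out[i-winStart] = '*'
		}
	}
	return string(out)
}

func main() {
	text := "token=sk-SECRETVALUE end"
	secret := []Span{{Start: 6, End: 20}}
	fmt.Println(redactWindow(text, secret, 0, 10))  // window ends inside the secret
	fmt.Println(redactWindow(text, secret, 15, 24)) // window starts inside the secret
}
```

Keeping spans in full-text coordinates is the point of the sketch: scanning only the slice would miss a secret whose recognizable prefix lies outside it.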
The validators wired in v2 still let through the dominant placeholder
shapes that flood agent transcripts: repeating short blocks (ghp_aaaa…,
A1b2A1b2…, a1B2a1B2…, aB3_xaB3_x…), monotone alphabet/digit runs
(abcdefABCDEF, 1234567890, ZYXWVU), and markdown diffs that quote PEM
BEGIN/END markers around prose. Backfilling against a 33k-session DB
showed 29 definite findings with 7 distinct placeholder shapes still
making it through; the same set drops to 22 with this change, and the
remaining values are all high-entropy with no structural patterns.
Add three structural helpers to rules.go:
* hasRepeatingBlock: a single byte covering ≥75% of s (block size 1),
  or the leading block at any phase covering ≥75% (sizes 2..6). Catches
  every "ghp_<seed>" style placeholder.
* hasMonotoneRun: a run of ≥6 bytes stepping monotonically by ±1 in
  ASCII. Catches "abcdefABCDEF012345" tails and the like.
* bodyLooksRandom: requires Shannon entropy ≥3.5 and that neither
  hasRepeatingBlock nor hasMonotoneRun fires. Each rule's validator
  strips the well-known vendor prefix before calling so the fixed
  prefix doesn't anchor a phase.
Wire validators into every definite rule (github-pat, stripe-secret,
google-api-key previously had none) and extend the AWS/Anthropic/Slack
validators to also call bodyLooksRandom on the body after the prefix.
Tighten the PEM gate to require ≥90% base64 alphabet in the body — a
markdown diff with pipes, hyphens, and prose between BEGIN/END markers
fails the purity gate while a real ≥256-char base64 body passes.
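
The purity gate can be sketched as a simple byte count. This is a minimal illustration, not the PR's implementation: the `pemBodyIsPure` name is hypothetical, and the choice to ignore line breaks while counting spaces and tabs against purity is an assumption made so that quoted prose fails the gate.

```go
package main

import (
	"fmt"
	"strings"
)

// isBase64Byte reports whether b belongs to the standard base64
// alphabet, including '=' padding.
func isBase64Byte(b byte) bool {
	switch {
	case b >= 'A' && b <= 'Z', b >= 'a' && b <= 'z', b >= '0' && b <= '9':
		return true
	case b == '+', b == '/', b == '=':
		return true
	}
	return false
}

// pemBodyIsPure reports whether at least 90% of the body bytes are
// base64 characters. Line breaks are skipped; spaces and tabs count
// against purity (an assumption of this sketch).
func pemBodyIsPure(body string) bool {
	total, b64 := 0, 0
	for i := 0; i < len(body); i++ {
		c := body[i]
		if c == '\n' || c == '\r' {
			continue
		}
		total++
		if isBase64Byte(c) {
			b64++
		}
	}
	return total > 0 && b64*10 >= total*9
}

func main() {
	// 16 lines of 16 base64 characters: a 256-character body.
	keyLike := strings.Repeat("QUJDREVGR0hJSktM\n", 16)
	quoted := "| - some prose quoting the markers |\n| + more prose in a diff |\n"
	fmt.Println(pemBodyIsPure(keyLike), pemBodyIsPure(quoted))
}
```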
Refactor the existing notLowEntropySuffix into a generic
notTrailingRunRepeat helper used by both the Anthropic and Slack
validators. Bump rulesAlgorithmVersion to 3 so backfill re-scans
stored findings.
Test fixtures across the codebase needed updating: every
AKIA1234567890ABCDEF / AKIAZYXWVUTSRQPONMLK / xoxb-…c8Jp /
xoxs-…xYz9 / rep("a", 36) / rep("A1b2", 20) style fixture was
designed as a structurally obvious placeholder and would now fail the
validator. Replace them with hand-picked random-looking fixtures
(AKIA7QHWN2DKR4FYPLJM etc.) verified to have no monotone runs of ≥6 and
no dominant byte. The placeholder patterns move into a new
TestRejectsRepeatingBlockPlaceholders / TestRejectsSequentialRunPlaceholders
suite that asserts they are now rejected.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds search across session content and a secret-detection and findings
pipeline. This PR covers the engine, storage, HTTP API, and CLI. The web UI
for both features is deferred to a follow-up PR.
Content search
`SearchContent` queries message text, tool input, and tool result content in three modes: substring, regex (with a literal prefilter), and FTS5.
Secrets in each returned snippet are masked by default.
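
One way a literal prefilter for regex mode could work is sketched below. The PR does not say how its prefilter is derived, so this assumes the simplest option, the standard library's `Regexp.LiteralPrefix`; the `matcher` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matcher pairs a compiled regex with a literal that every match must
// begin with. Hypothetical type, for illustration only.
type matcher struct {
	re      *regexp.Regexp
	literal string // empty when the pattern has no usable literal prefix
}

func newMatcher(pattern string) (*matcher, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	// LiteralPrefix returns a string that must begin any match, so a
	// text that lacks it cannot match and the regex can be skipped.
	prefix, _ := re.LiteralPrefix()
	return &matcher{re: re, literal: prefix}, nil
}

// match runs the cheap substring check first and the regex second.
func (m *matcher) match(text string) bool {
	if m.literal != "" && !strings.Contains(text, m.literal) {
		return false
	}
	return m.re.MatchString(text)
}

func main() {
	m, err := newMatcher(`session-[0-9]+ failed`)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.match("job session-42 failed at noon"))
	fmt.Println(m.match("all good"))
}
```

The prefilter never changes results; it only avoids running the regex engine on rows that cannot match.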
- `agentsview session search <pattern>` with `--regex`, `--fts`, `--in messages,tool_input,tool_result`, and date/project/agent filters.
- `GET /api/v1/search/content`.

Secret detection (`internal/secrets`)
A pure detector (no DB, no IO) with two confidence tiers:
- definite: AWS, Anthropic, GitHub, Slack, Stripe, and Google API keys, plus PEM private-key blocks.
- candidate: JWTs, basic-auth URLs, and high-entropy assignments.
`Scan` reports findings, `Redact` masks them in arbitrary text, and `Verify` re-checks a finding's coordinates against the current source.
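
The coordinate re-check can be illustrated with a small sketch. It assumes a finding stores a rule name and a half-open byte span; the `Finding` fields, the `rules` map, the `aws-access-key` rule name, and the `verify` function are hypothetical and much simpler than what the PR stores.

```go
package main

import (
	"fmt"
	"regexp"
)

// Finding records where a rule matched. Field names are hypothetical;
// the stored findings in this PR carry more (location, ordinal, preview).
type Finding struct {
	Rule       string
	Start, End int // half-open byte span in the source
}

// rules maps a rule name to a pattern anchored to the whole span.
// Only one illustrative rule is included.
var rules = map[string]*regexp.Regexp{
	"aws-access-key": regexp.MustCompile(`^AKIA[0-9A-Z]{16}$`),
}

// verify reports whether the bytes at the stored span still form a
// complete match for the stored rule in the current source.
func verify(source string, f Finding) bool {
	if f.Start < 0 || f.End > len(source) || f.Start >= f.End {
		return false // span no longer fits the source
	}
	re, ok := rules[f.Rule]
	if !ok {
		return false // unknown or retired rule
	}
	return re.MatchString(source[f.Start:f.End])
}

func main() {
	f := Finding{Rule: "aws-access-key", Start: 4, End: 24}
	fmt.Println(verify("key=AKIA7QHWN2DKR4FYPLJM", f))    // source unchanged
	fmt.Println(verify("my key=AKIA7QHWN2DKR4FYPLJM", f)) // source shifted
}
```

Re-validating at read time means a stale span yields "not verified" instead of printing whatever bytes now occupy those offsets.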
Secret findings: storage and scanning
- `secret_findings` table plus `secret_leak_count` and `secrets_rules_version` columns on sessions. `secret_leak_count` counts definite findings only.
- Findings are carried forward across a full-resync orphan copy (the database is a persistent archive).
- `agentsview secrets scan [--backfill]` runs a resumable scan over the archive; `agentsview secrets list` lists findings, redacted by default.
- `--reveal` prints unredacted values only when the server is bound to localhost, warns on stderr, never logs or stores revealed values, and re-reads the source by stored coordinates and re-validates before printing.
- `agentsview session list --has-secret` filters to sessions with definite findings; `secret_leak_count` is surfaced in `session get` and health output.

Inline sync scans the definite rules only and stamps a definite-only ruleset version, keeping the candidate rules (which dominate both CPU and false positives) out of the sync hot path. An explicit `secrets scan` runs the full ruleset and adds candidate findings; `secrets scan --backfill` re-scans sessions that received only the inline scan. `secrets list` defaults to definite findings, with candidates opt-in via `--confidence`.

PostgreSQL
The optional read-only PostgreSQL mirror reaches parity with SQLite for these
features: schema for findings and session columns, push of
`secret_leak_count` and findings, the has-secret filter, findings list and source lookup, and
content search (substring and regex).
Migration
Schema changes are additive (new table, new columns); existing rows are
preserved. The data-version bump triggers a non-destructive full resync that
carries findings forward.
Not in this PR
The Svelte frontend for content search and secret findings is deferred to a
follow-up PR. This change is storage, API, and CLI only.