Accounts, Identities, Collections, and Deduplication#304
Conversation
roborev: Combined Review (
|
…dings Fixes the medium and low findings from the roborev combined review on PR wesm#304. The HIGH finding ("cross-account collection dedup can hide legitimate messages") is a misunderstanding of the spec, not a bug; addressed in the PR comment with a citation to the design alignment document. Mediums: - deduplicate: --dry-run combined with --undo silently performed the undo because the undo path runs before the dry-run gate is consulted. Mark the two flags mutually exclusive at the cobra layer so the user gets a clear error rather than an unintended write. - internal/dedup content-hash fallback: excludeIDs now contains only losers (messages already selected for pruning), not survivors. A Message-ID-less orphan can now link to the survivor of an ID-bearing group via content hash. To prevent demoting a Message-ID survivor (which has already absorbed labels from its losers) into a content-hash loser, post-process content-hash groups: if a Message-ID survivor appears in a content-hash group as a non- survivor, force it back to survivor position. - internal/store/migrate_legacy_identity: the migration no longer marks itself applied when the user has [identity] addresses configured but zero sources exist. Marking would permanently drop the addresses on the floor: the next account the user adds would never receive them. The next command run after a source exists now picks the migration up. Lows: - list-identities --confirmed without --account or --collection now defaults to the union across every source instead of returning an empty result. GetIdentitiesForScope's documented "explicit empty scope means no match" semantics still hold; the CLI explicitly expands "no scope" to "all sources" before calling. - internal/query MergeFilterIntoQuery now forwards filter.SourceIDs (multi-source) into merged.AccountIDs, preferring the multi-source field when both are set. Latent today (no caller sets SourceIDs on MessageFilter before this call), but defensive: future callers of SearchFast/SearchFastCount that pass a multi-source filter will not silently lose the scope. - dedup-purge in remote mode now rejects upfront with a clear message ("dedup-purge is local-only; not supported in remote mode") rather than the generic "must specify --batch or --all-hidden" hint.
|
Addressed in HIGH — pushing back
Respectfully disagree — re-reading the design alignment document, cross-account dedup under
The same-source-only constraint is on remote deletion staging, not on local dedup grouping. The contract is preserved at The default unscoped If we want a regression test that demonstrates the contract: two accounts, same Mediums — fixed
Lows from job 378 — also fixed
|
|
@wesm i will mark ready once I've done my own acceptance tests. however i wanted to get you this once it was ready for your reviews |
… semantics The roborev review on PR wesm#304 raised concern that --collection silently crosses source boundaries. The behavior is correct per the design alignment spec — cross-account local-hide under --collection is explicit user opt-in, and remote deletion stays same-source-only — but the CLI help did not make that contract visible. This documents it. - Long help: gains a Scope section that explicitly states each flag's source-boundary semantics. --account never crosses sources. --collection is the only way to compare across sources, is an explicit opt-in, hides losers locally (reversible via --undo), and --delete-dups-from-source-server stays same-source-only. - --account and --collection short flag descriptions clarified to match: --account scopes to one account, --collection opts into cross-source comparison. - Runtime banner: when a collection-scoped run spans more than one account, prints a one-line note that cross-source dedup is reversible and that remote deletion is same-source-only, plus a reminder that --dry-run previews without writing. No engine behavior change; doc-only clarification of the existing contract.
|
Documentation follow-up to the H1 cross-account question — The behavior is unchanged (cross-account local-hide under
No engine change. The reviewer's concern was real on the user-education axis even if the underlying behavior matches the spec. |
roborev: Combined Review (
|
|
Confirmed that this is working as designed. However, I am not sure about right syntax for the new commands. For instance, collection vs collections. However, this works, builds and reviews clean so fine for testing. |
0ea4332 to
04dc1c7
Compare
roborev: Combined Review (
|
This is helpful feedback and useful for future improvements, but the proposed solution breaks the current functionality and is out of scope. |
04dc1c7 to
2aa2f09
Compare
roborev: Combined Review (
|
|
While it was something originally discussed in the project scope - I believe it is best to gate the gmail remote delete gated. We can also remove it. Basically everything else we are doing right now (local hard-delete, dedup, even import) is reversible or local. Remote-delete touches the user's authoritative Gmail mailbox and I just don't feel comfortable with that going out in this at all. Very open to feedback on this and everything else, however the two commits above put it behind an ENV flag. |
roborev: Combined Review (
|
f0ab152 to
5188612
Compare
roborev: Combined Review (
|
|
Addressed in eb7aaaf. Finding 2 (collection methodology) — fixed. Added Finding 1 (live-message filtering) — partial. The prescribed fix would have changed product behavior.
The accurate observation was that — Claude (Opus 4.7) |
|
Pushed
Verified with the built binary: The CSV-fallback I/O note (also |
roborev: Combined Review (
|
|
Addressed the medium on Medium ( The reviewer is right that Scan calls
What was actually missing is consent legibility: the prompt said "hide N duplicates (reversible with --undo)" without telling the user that scan already wrote N′ rfc822 values that stay regardless of their answer.
No behavior change. Low (redundant — Claude Code (Opus 4.7) and Primeradiant Superpowers |
roborev: Combined Review (
|
|
This was, in the end, a lot more work than I was expecting. |
|
For anyone wanting to test #304 alongside the Facebook Messenger importer (#281): I've rebased fbmessenger on top of this PR at |
…tions (#306) ## Motivation (Deliberately not affected by & mergable with #304) This PR enhances #263 by sharpening the SQL logger so slow and failing queries actually tell you what happened. Three things change. First, streaming queries now record their duration at `rows.Close` instead of when `db.Query` returns, because the work happens while you read rows, not when you get the handle. Second, slow and errored queries log their bound args, and the statement truncation preserves the `WHERE` clause, so you can see what ran with what values without the diagnostic part getting cut off. Third, a few small correctness fixes: `loggedRows.Close` ordering, UTF-8-safe arg truncation, and `logStmt` inlined where it belongs. ## Summary - Log streaming-query duration at `rows.Close`, not at `db.Query` return - Log args on slow/error queries; preserve the `WHERE` clause when truncating long statements - Fix `loggedRows.Close` ordering, make arg truncation UTF-8-safe, inline `logStmt` Co-authored-by: Jesse Robbins <jesserobbins@users.noreply.github.com>
roborev: Combined Review (
|
|
Pushed What landed:
Pushback list at convergence (9 items): documented in private notes, not re-fixed. Highlights:
The loop ran the full review-fix-rereview discipline with predictions before each round and grades after. Two of my own consolidations regressed adjacent code and were caught in the next round when both agents independently flagged the same finding — strong signal both times. Specifically: (1) routing the trash-path confirm through the shared "irreversible" helper changed the user-visible wording for an action that's actually reversible, and (2) the source-type allowlist for the legacy identity migration didn't extend its corresponding defer logic, so an upgrade with only non-email sources would have permanently dropped the legacy addresses. Both are now closed. Final HEAD: Authored by Claude Code (Opus 4.7) in a roborev refine loop, with feedback from @jesserobbins mid-loop that reset the discipline (scope to branch-touched code only, smallest changes to land, Medium+ filter, ledger discipline). Tooling: roborev (multi-agent code review daemon orchestrating claude-code and codex agents in parallel), Claude Code (Opus 4.7), Codex, and Primeradiant Superpowers. |
roborev: Combined Review (
|
roborev: Combined Review (
|
|
Verified the queryless scoped-search Medium and it doesn't repro. The claim is that cobra would reject Empirical confirmation against the local build at
For context: this is the third asserted-but-unverifiable claim from the multi-agent review pipeline against
Each was raised after multiple iterations of review without surfacing earlier, then dropped on the next round once it was rebutted. Pattern looks like agent hallucination on this specific file rather than a real defect class. Calling it out so this doesn't get re-raised next pass. Verified by Claude Code (Opus 4.7) against local build at |
|
@wesm I told my agents to be nice and also push back citing evidence. They seems a little testy but I would be too if I had to run 25 iterations. I built a small harness that triggers combined roborev runs, including branch reviews using Claude and codex and then vectors toward convergence. Having watched and engaged a bunch I think Claude feels codex is not as cool or fun :-) I think this release is ready. |
Resolves conflict between wesm#304 branch and upstream main on the delete-staged flag shape. Upstream introduced --trash (default = permanent); wesm#304 introduced --permanent (default = trash). Keep --permanent / default = trash per maintainer agreement: every other rung of the deletion progression in msgvault is locally reversible, so the remote rung should be too unless the user explicitly opts out. - cmd/msgvault/cmd/deletions.go: keep deletePermanent variable and --permanent flag; fold in upstream's service-account scope-check refactor (wrap the scope-escalation prompt in !isServiceAccount and pass saScopes through buildAPIClient). - cmd/msgvault/cmd/addaccount_test.go: keep all wesm#304 identity tests and append upstream's two new service-account error tests (Headless / Force).
roborev: Combined Review (
|
|
Will spend some time on this today to see if there's anything that needs to be addressed prior to merge |
Resolves conflict between wesm#304 branch and upstream main on the delete-staged flag shape. Upstream introduced --trash (default = permanent); wesm#304 introduced --permanent (default = trash). Keep --permanent / default = trash per maintainer agreement: every other rung of the deletion progression in msgvault is locally reversible, so the remote rung should be too unless the user explicitly opts out. - cmd/msgvault/cmd/deletions.go: keep deletePermanent variable and --permanent flag; fold in upstream's service-account scope-check refactor (wrap the scope-escalation prompt in !isServiceAccount and pass saScopes through buildAPIClient). - cmd/msgvault/cmd/addaccount_test.go: keep all wesm#304 identity tests and append upstream's two new service-account error tests (Headless / Force).
9aee6b6 to
b7a9a09
Compare
restored move-to-trash by default convention per discussion. |
roborev: Combined Review (
|
…urce filter - dedup-purge: store dedup operations and local hard-delete rung - account_identities: multi-signal set semantics, case-preserving identifiers - identity: list, show, add, remove subcommands - add-account: auto-confirm account-identifier identity at source creation - store_resolver: document deferred legacy identity migration - dedup: correctness and safety fixes across remote-delete, cache, undo, and migration - dedup: scope-aware methodology and narrow live-message contract - fix: undo flag exclusivity, E.164 collection identifiers, identity rebind, deferred-notice count - deduplicate: surface rfc822 backfill in confirm prompt and Scan godoc - collection: drop redundant len(p) > 0 guard before digit check - dedup: align terminology with one-source = one account; drop --apply hint - search: resolve --account/--collection before mode dispatch - store: route leaking read paths through LiveMessagesWhere - tui: invalidate parquet cache on dedup hides since last sync - delete-staged: default to trash; --permanent opts into batch delete - dedup: log Scan/Execute/Undo entry and exit at info level - search: reject empty query for --mode=vector|hybrid - store/dedup: DeleteAllDeduped covers batchless rows; flip defer rollback; document BackfillRFC822IDs failed-row semantics - account_identities: case-insensitive remove for email-shaped identifiers - collection: reject explicit mutation of auto-managed All collection - migrate_legacy_identity: distinguish deferred from applied in logs - tui+store: keep cache reasons disjoint; pre-compute live-messages predicate - store: ALTER TABLE messages ADD deleted_at on schema upgrade - delete-deduped: align precount with broadened DeleteAllDeduped predicate - account_identities: add path matches remove path on email case-folding - config+dedup: backup paths route through DatabasePath helper - config: DatabasePath decodes file: URI percent-encoding via net/url - account_identities: tighten email detection beyond bare '@' contains - store/migrations: drop dead errors.Is(sql.ErrNoRows) check in MarkMigrationApplied - query+store: deep-copy AccountIDs in MergeFilterIntoQuery; document All self-heal - store: add identifierMatch helper for email-vs-other comparison - store: route AddAccountIdentity through identifierMatch helper - store: route RemoveAccountIdentity through identifierMatch helper - store: route legacy identity migration through identifierMatch helper - identity: route CLI prior-row lookup through EqualIdentifier; surface ListAccountIdentities errors - openStoreAndInit: drop missing-database guard so 15 callers can create on first use - search: reuse scope-resolution store in runLocalSearch instead of double-init - identity: filter empty parts in splitSignalSet to mirror mergeSignalSet - config: PathUnescape relative file: URI in DatabasePath - dedup: route per-source identity case-folding through NormalizeIdentifierForCompare - search: reject vector/hybrid filter-only queries up front - deduplicate: surface scan side effects on no-duplicate per-source iterations - RemoveAccountIdentity: return rows-affected count; surface multi-row legacy cleanup in CLI - collection: reject CreateCollection name=All with ErrCollectionImmutable - ingest: re-run startup migrations after first source creation - ingest: skip pre-source migration when no source exists - import-mbox: drop redundant in-loop post-source-create migration call - store: case-aware legacy migration dedupe; empty SourceIDs match-nothing - dedup+collection: preserve From-case in dedup; numeric-account fall-through in collection lookup - stats+dedup: handle empty-collection stats; FromEmail comment and parallel test - store+dedup: GetSourceByID not-found sentinel; single-member collection methodology - ingest: confirmDefaultIdentity must run before legacy migration - deduplicate: clarify undo warning; surface undo hint on Execute error - import-emlx, import-mbox: comment why migration error returns early - query+store: centralize live-message predicate via parameterized LiveMessagesWhere - search: run schema init on unscoped vector/hybrid dispatch - store: stream raw MIME in BackfillRFC822IDs to remove per-row SELECTs - docs: add accounts-identities-collections-dedup README and diagrams - docs: add accounts/identities/collections/dedup specification - collection: reconcile ErrCollectionImmutable text with spec - deletion: persist cancelled manifests in their own status directory - deletion: allow MoveManifest to target StatusCancelled - deletion: CancelManifest moves to cancelled/ instead of os.Remove - deletion: add ListCancelled - list-deletions: show cancelled batches alongside other statuses - cmd: extract confirmDestructive helper for testable safety prompts - delete-deduped: --all-hidden cannot be bypassed by --yes - delete-staged: --permanent and --yes are mutually exclusive - delete-staged: --permanent requires typing literal "delete" to confirm - cmd: silence errcheck on fmt.Fprint* in confirm helper and list-deletions - dedup: skip content-hash groups with multiple Message-ID survivors - delete-deduped: --batch path uses cancel-on-EOF, not contract error - delete-staged: route non-permanent confirm through cmd.OutOrStdout - spec: expand footer with author credit and toolchain attribution - dedup: skip content-hash groups colliding MID survivors with sent orphans - confirmDefaultIdentity: route confirmation print through cmd writer - ingest: drop pre-source legacy migration to preserve confirmDefaultIdentity - delete-deduped: route all output through cmd writer - deletion: CancelManifest renames first, rewrites inline status second - delete-staged: route trash-path confirm through confirmDestructive helper - deletion: document Manifest.String() stale-status window in CancelManifest - serve: confirmDefaultIdentity on first source creation - build_cache_test: mirror production messages columns in CSV fallback - serve: defer legacy identity migration to first scheduled sync - delete-staged: trash-path prompt states reversibility - migrate_legacy_identity: skip non-email source types - migrate_legacy_identity: defer when no eligible sources; add apple-mail to allowlist - Merge upstream/main: keep --permanent flag, default to trash - cmd: silence unchecked fmt.Fprint* writes; fix ST1005 in delete-staged gate - query: route getMessageRawShared through LiveMessagesWhere
b7a9a09 to
99af280
Compare
roborev: Combined Review (
|
The contract is "permanently remove rows the dedup pipeline soft-hid", so the DELETE should be keyed on the positive marker delete_batch_id IS NOT NULL in addition to deleted_at IS NOT NULL. Earlier the gate was deleted_at alone, which was only safe by convention — any future soft-delete writer (a "trash" view, a per-message user hide, etc.) would have its rows silently destroyed by the dedup hard-delete rung. Tighten the predicate, retire the FUTURE REFINEMENT comment that flagged the trap, align the CLI --all-hidden row count so the preview matches what the delete actually removes, and add a regression test that confirms a bare deleted_at row (no batch ID) survives DeleteAllDeduped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default-All bootstrap ran on every InitSchema and used a SELECT- then-INSERT shape: two processes hitting init at the same time (a CLI command and serve, or two CLI commands sharing the data dir) could both pass the SELECT, both attempt the INSERT, and the loser would fail with a UNIQUE-constraint error surfaced as "create default collection: ...". SQLite's single-writer prevented data corruption but the loser's command died on a confusing error that had nothing to do with what the user typed. Switch to INSERT OR IGNORE (dialect-rewritten for PostgreSQL via dialect.InsertOrIgnore) followed by an unconditional SELECT id. Both racers succeed; the second insert is ignored; both selects return the same row id. As a bonus this drops the LastInsertId path, which was only correct because SQLite happened to return the auto-generated id on a successful INSERT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… only) The post-run hint and Long help both said "vector and parquet caches" could be rebuilt with 'build-cache --full-rebuild', but build-cache only rebuilds the parquet analytics layer — the vector index lives in vectors.db and rebuilds via 'build-embeddings --full-rebuild'. Mention both commands so the user has a complete recipe after a large purge. (Vector pruning of orphan embeddings after delete-deduped is a separate gap; this commit is documentation only.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Three post-review design fixes pushed to address concerns from a careful re-read of the squashed branch. None change user-facing semantics; they tighten invariants that were "safe by convention" into invariants enforced by the code.
Considered but deferredFor visibility to future readers: a careful re-read also surfaced four items that look like real design notes rather than blockers, and one that's a real invariant mismatch best handled as a separate change.
|
|
Filed the deferred follow-ups as issues so they don't get lost:
Each issue has the full context + proposed approach so they're actionable in isolation. |
|
thank you! |
Implements and resolves #278 (supersedes #286). Specification:
docs/accounts-identities-collections-dedup/spec.md.What this branch adds
A reusable account/collection scope, per-account identity storage, and
a deduplication pipeline staged into a four-rung safety progression
(scan → hide → local hard delete → remote delete). Every read surface
in the archive routes through one canonical visibility predicate.
New CLI surface
msgvault deduplicate(aliasesdedup,dedupe)--content-hashfallback. Scoped via--accountor--collection.--dry-run,--undo <batch-id>(repeatable),--delete-dups-from-source-server.msgvault delete-deduped--batch <id>(repeatable) or--all-hidden. VACUUM INTO backup before deletion; interactive confirmation by default.msgvault identity {list,show,add,remove}listaccepts--account/--collection/--json. Replaces the removedlist-identitiescommand.msgvault collection {list,create,add,remove,delete,show}Allcollection is auto-managed and rejects explicit mutation withErrCollectionImmutable.msgvault search --collection <name>--accountand--collectionare mutually exclusive.msgvault stats --account/--collectionmsgvault delete-staged --permanent--yes;--permanentrequires typingdeleteto confirm. Execution is gated byMSGVAULT_ENABLE_REMOTE_DELETE=1for the v1 release.Key semantics
--accountresolves an account/source identifier; collection names supplied to--accountare rejected with a hint to use--collection.--collectionresolves only collections; account names supplied to--collectionare rejected symmetrically.--collection. Outside collection scope, dedup never crosses source boundaries — protects sent-message provenance (alice's Sent and bob's Inbox share an RFC822 Message-ID and must both survive).gmail,imap,mbox,emlx,hey) with tiebreakers on raw MIME presence, label count, archive time, and id. Sent-copy eligibility filter runs before survivor selection.Identity model
account_identities (source_id, address, source_signal, confirmed_at)with multi-signal set semantics (signals stored as comma-delimited sorted set).identifierMatch/NormalizeIdentifierForComparewhich case-folds email-shaped values only.[identity].addressesconfig block into per-account records. If no eligible source exists at startup, the migration is deferred until first source creation. Apple-Mail and other non-email source types are filtered.add-account,add-imap,add-o365,import-mbox,import-emlx,import-whatsapp, andimport-gvoice.import-imessageis exempted (iMessage contacts are not self-identifying).confirmDefaultIdentityis the shared helper; the prompt is routed throughcmd.OutOrStdout()so test harnesses can capture it.Live-message contract
internal/store.LiveMessagesWhere(alias, hideDeletedFromSource)is thecanonical SQL fragment. It always filters
deleted_at IS NULL(deduplosers are hidden everywhere) and gates
deleted_from_source_at IS NULLbehind the boolean (archive views may show source-deleted rows by
design). All read paths route through it:
internal/storeAPI/list/search/summary,
internal/querySQLite + DuckDB engines (includingtext search and the raw-MIME helper),
internal/vector/sqlitevecbackend and fused-search filter, dedup engine, and stats.
Schema changes
collections,collection_sources,account_identities,applied_migrations, plus dedup operation tracking.messages.deleted_at(added viaALTER TABLEmigration on schema upgrade).internal/store/migrations.gointroduces a forward-only migration runner withMarkMigrationApplied.Safety guardrails
delete-stageddefaults to trash;--permanentis opt-in, mutually exclusive with--yes, and requires typing the literal worddelete.delete-deduped --all-hiddenalways prompts even when--yesis set.delete-stagedexecution is gated byMSGVAULT_ENABLE_REMOTE_DELETE=1. Read-only modes (--list,--dry-run,show-deletion) work without the gate.deduplicate --dry-runand--undoare mutually exclusive (undo writes; dry-run promises no writes).cancelled/directory rather than removed, so audit history survives.--no-backup).TUI cache invalidation
tui.gocache-staleness check counts dedup hides (deleted_at >= LastSyncAt AND deleted_from_source_at IS NULL) as a third signalalongside new messages and source-deletions, forcing a full Parquet
rebuild when present. Counts are kept disjoint from source-deletion
counts to avoid double-attribution.
Documentation
docs/accounts-identities-collections-dedup/spec.md— full specification (scope semantics, identity model, dedup detection/survivor selection, safety progression, CLI surface).docs/accounts-identities-collections-dedup/README.md— narrative overview with concept diagrams.Notable behavior changes
list-identitiesremoved; useidentity list.[identity].addressesinconfig.tomlbecomes legacy input after the one-time migration — write per-account identities instead.gmailonly. The manifest format (GmailIDs) and executor (gmail.API) are Gmail-specific; IMAP is gated until a manifest format that records source type and an IMAP executor exist.delete-dedupedrun, vector and Parquet caches may hold stale entries — the command recommendsbuild-cache --full-rebuild.