Skip to content

Accounts, Identities, Collections, and Deduplication#304

Merged
wesm merged 4 commits intowesm:mainfrom
jesserobbins:jesse/identities-collections-dedup
May 6, 2026
Merged

Accounts, Identities, Collections, and Deduplication#304
wesm merged 4 commits intowesm:mainfrom
jesserobbins:jesse/identities-collections-dedup

Conversation

@jesserobbins
Copy link
Copy Markdown
Contributor

@jesserobbins jesserobbins commented Apr 30, 2026

Implements and resolves #278 (supersedes #286). Specification:
docs/accounts-identities-collections-dedup/spec.md.

What this branch adds

A reusable account/collection scope, per-account identity storage, and
a deduplication pipeline staged into a four-rung safety progression
(scan → hide → local hard delete → remote delete). Every read surface
in the archive routes through one canonical visibility predicate.

New CLI surface

Command Purpose
msgvault deduplicate (aliases dedup, dedupe) Find and merge duplicates by RFC822 Message-ID with optional --content-hash fallback. Scoped via --account or --collection. --dry-run, --undo <batch-id> (repeatable), --delete-dups-from-source-server.
msgvault delete-deduped Local hard-delete rung. --batch <id> (repeatable) or --all-hidden. VACUUM INTO backup before deletion; interactive confirmation by default.
msgvault identity {list,show,add,remove} Per-account identity management. list accepts --account/--collection/--json. Replaces the removed list-identities command.
msgvault collection {list,create,add,remove,delete,show} User-defined groupings of accounts. The All collection is auto-managed and rejects explicit mutation with ErrCollectionImmutable.
msgvault search --collection <name> Collection scope on FTS, vector, and hybrid search modes. --account and --collection are mutually exclusive.
msgvault stats --account/--collection Per-scope stats.
msgvault delete-staged --permanent Opt into batch permanent deletion. Default is Gmail trash (~30-day recovery). Mutually exclusive with --yes; --permanent requires typing delete to confirm. Execution is gated by MSGVAULT_ENABLE_REMOTE_DELETE=1 for the v1 release.

Key semantics

  • One account = one source. --account resolves an account/source identifier; collection names supplied to --account are rejected with a hint to use --collection. --collection resolves only collections; account names supplied to --collection are rejected symmetrically.
  • Cross-source dedup is opt-in via --collection. Outside collection scope, dedup never crosses source boundaries — protects sent-message provenance (alice's Sent and bob's Inbox share an RFC822 Message-ID and must both survive).
  • Remote deletion is same-source-only, even under collection scope. Cross-account dedup hides locally; it never stages a remote delete that crosses sources.
  • Survivor selection has source-type preference (default gmail,imap,mbox,emlx,hey) with tiebreakers on raw MIME presence, label count, archive time, and id. Sent-copy eligibility filter runs before survivor selection.
  • Content-hash fallback is supplementary, not transitive. Groups with multiple Message-ID survivors are skipped; groups containing both an MID survivor and a sent-copy orphan are skipped (per spec § Detection / Survivor selection).

Identity model

  • Confirmed identities live in account_identities (source_id, address, source_signal, confirmed_at) with multi-signal set semantics (signals stored as comma-delimited sorted set).
  • Case is preserved on insert because the column accommodates phone E.164 and synthetic handles; comparison uses identifierMatch / NormalizeIdentifierForCompare which case-folds email-shaped values only.
  • A one-time legacy migration copies any [identity].addresses config block into per-account records. If no eligible source exists at startup, the migration is deferred until first source creation. Apple-Mail and other non-email source types are filtered.
  • Auto-confirm runs at source creation for add-account, add-imap, add-o365, import-mbox, import-emlx, import-whatsapp, and import-gvoice. import-imessage is exempted (iMessage contacts are not self-identifying).
  • confirmDefaultIdentity is the shared helper; the prompt is routed through cmd.OutOrStdout() so test harnesses can capture it.

Live-message contract

internal/store.LiveMessagesWhere(alias, hideDeletedFromSource) is the
canonical SQL fragment. It always filters deleted_at IS NULL (dedup
losers are hidden everywhere) and gates deleted_from_source_at IS NULL
behind the boolean (archive views may show source-deleted rows by
design). All read paths route through it: internal/store API/list/
search/summary, internal/query SQLite + DuckDB engines (including
text search and the raw-MIME helper), internal/vector/sqlitevec
backend and fused-search filter, dedup engine, and stats.

Schema changes

  • New tables: collections, collection_sources, account_identities, applied_migrations, plus dedup operation tracking.
  • New column: messages.deleted_at (added via ALTER TABLE migration on schema upgrade).
  • internal/store/migrations.go introduces a forward-only migration runner with MarkMigrationApplied.

Safety guardrails

  • delete-staged defaults to trash; --permanent is opt-in, mutually exclusive with --yes, and requires typing the literal word delete.
  • delete-deduped --all-hidden always prompts even when --yes is set.
  • delete-staged execution is gated by MSGVAULT_ENABLE_REMOTE_DELETE=1. Read-only modes (--list, --dry-run, show-deletion) work without the gate.
  • deduplicate --dry-run and --undo are mutually exclusive (undo writes; dry-run promises no writes).
  • Cancelled deletion manifests are persisted in their own cancelled/ directory rather than removed, so audit history survives.
  • Database backup before any local hard delete or merge (skippable with --no-backup).

TUI cache invalidation

tui.go cache-staleness check counts dedup hides (deleted_at >= LastSyncAt AND deleted_from_source_at IS NULL) as a third signal
alongside new messages and source-deletions, forcing a full Parquet
rebuild when present. Counts are kept disjoint from source-deletion
counts to avoid double-attribution.

Documentation

Notable behavior changes

  • list-identities removed; use identity list.
  • [identity].addresses in config.toml becomes legacy input after the one-time migration — write per-account identities instead.
  • Remote-source eligibility for staged deletion is currently gmail only. The manifest format (GmailIDs) and executor (gmail.API) are Gmail-specific; IMAP is gated until a manifest format that records source type and an IMAP executor exist.
  • Collection identity is derived as the union of member-account confirmed identities; not configured separately.
  • After a large delete-deduped run, vector and Parquet caches may hold stale entries — the command recommends build-cache --full-rebuild.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (a73c525)

Verdict: Changes need fixes before merge due to cross-account dedup safety and several medium correctness issues.

High

  • Cross-account collection dedup can hide legitimate messages

    • cmd/msgvault/cmd/deduplicate.go:100
    • cmd/msgvault/cmd/deduplicate.go:103
    • internal/store/dedup.go:66

    --collection resolves all collection member source IDs into one dedup scan. For mixed-account collections such as the default All, duplicate groups can span different accounts, despite the dedup contract that cross-account messages must never be merged. A matching or guessed Message-ID could cause a legitimate message in another account to be marked deleted and later purged.

    Fix: Keep dedup account-bound under collection scope. Partition collection members by logical account/source identity before scanning, or reject collections that contain multiple account identities. Add a regression test where two accounts contain messages with the same RFC822 Message-ID and both remain live.

Medium

  • Content-hash fallback misses matches against Message-ID groups

    • internal/dedup/dedup.go:254

    The content-hash fallback excludes every message already present in a Message-ID duplicate group. That prevents it from catching a message missing Message-ID but matching the content of an ID-bearing group.

    Fix: Exclude only messages already selected for pruning, or allow content-hash groups to include the RFC822 survivor while preventing duplicate pruning.

  • Dry-run undo still performs the undo

    • cmd/msgvault/cmd/deduplicate.go:168

    msgvault deduplicate --dry-run --undo <batch> still executes the undo path because it ignores dedupDryRun.

    Fix: Reject --dry-run with --undo, or implement a true dry-run undo preview.

  • Legacy identity migration can mark itself applied too early

    • internal/store/migrate_legacy_identity.go:58

    The migration marks itself applied when configured identity addresses exist but there are zero sources, which can happen during init-db. Later accounts will then never receive the migrated identities.

    Fix: Do not mark the migration applied when identities exist but no sources are present, or defer migration until at least one source exists.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

jesserobbins added a commit to jesserobbins/msgvault-1 that referenced this pull request Apr 30, 2026
…dings

Fixes the medium and low findings from the roborev combined review on
PR wesm#304. The HIGH finding ("cross-account collection dedup can hide
legitimate messages") is a misunderstanding of the spec, not a bug;
addressed in the PR comment with a citation to the design alignment
document.

Mediums:

- deduplicate: --dry-run combined with --undo silently performed the
  undo because the undo path runs before the dry-run gate is
  consulted. Mark the two flags mutually exclusive at the cobra layer
  so the user gets a clear error rather than an unintended write.
- internal/dedup content-hash fallback: excludeIDs now contains only
  losers (messages already selected for pruning), not survivors. A
  Message-ID-less orphan can now link to the survivor of an
  ID-bearing group via content hash. To prevent demoting a Message-ID
  survivor (which has already absorbed labels from its losers) into a
  content-hash loser, post-process content-hash groups: if a
  Message-ID survivor appears in a content-hash group as a non-
  survivor, force it back to survivor position.
- internal/store/migrate_legacy_identity: the migration no longer
  marks itself applied when the user has [identity] addresses
  configured but zero sources exist. Marking would permanently drop
  the addresses on the floor: the next account the user adds would
  never receive them. The next command run after a source exists now
  picks the migration up.

Lows:

- list-identities --confirmed without --account or --collection now
  defaults to the union across every source instead of returning an
  empty result. GetIdentitiesForScope's documented "explicit empty
  scope means no match" semantics still hold; the CLI explicitly
  expands "no scope" to "all sources" before calling.
- internal/query MergeFilterIntoQuery now forwards filter.SourceIDs
  (multi-source) into merged.AccountIDs, preferring the multi-source
  field when both are set. Latent today (no caller sets SourceIDs on
  MessageFilter before this call), but defensive: future callers of
  SearchFast/SearchFastCount that pass a multi-source filter will
  not silently lose the scope.
- dedup-purge in remote mode now rejects upfront with a clear
  message ("dedup-purge is local-only; not supported in remote
  mode") rather than the generic "must specify --batch or
  --all-hidden" hint.
@jesserobbins
Copy link
Copy Markdown
Contributor Author

Addressed in d3cdb9c. Pushing back on the HIGH and confirming each Medium / Low.

HIGH — pushing back

Cross-account collection dedup can hide legitimate messages.
--collection resolves all collection member source IDs into one dedup scan. For mixed-account collections such as the default All, duplicate groups can span different accounts, despite the dedup contract that cross-account messages must never be merged.

Respectfully disagree — re-reading the design alignment document, cross-account dedup under --collection is explicit user opt-in, not a contract violation. From the design alignment spec (and this PR's commit messages):

  • deduplicate --collection <name>: "Compare messages across member accounts in that collection."
  • "A user who genuinely wants to dedup across every account can still write --collection All."

The same-source-only constraint is on remote deletion staging, not on local dedup grouping. The contract is preserved at internal/dedup/dedup.go:738: a remote-deletion entry is only staged when m.SourceID == survivor.SourceID. Cross-account losers under --collection get soft-deleted locally (reversible via --undo), but never staged for remote deletion.

The default unscoped deduplicate (no flags) iterates per-account and never crosses source boundaries. Users have to explicitly write --collection to opt into cross-account dedup; the spec covers this trade-off as the user's deliberate choice. If the safety language wants strengthening, that's a doc question, not a code change.

If we want a regression test that demonstrates the contract: two accounts, same Message-ID, run deduplicate (no flags). Both messages remain live. Run deduplicate --collection All. One is locally hidden. --undo restores it. No remote-deletion entry is staged. I can add that test if it helps.

Mediums — fixed

  • Content-hash fallback misses matches against Message-ID groups. excludeIDs now contains only losers, not survivors. A Message-ID-less orphan can now link to a Message-ID survivor via content hash. To prevent demoting a Message-ID survivor (which has already absorbed labels from its losers) into a content-hash loser, the content-hash pass post-processes its groups: any Message-ID survivor that appears as a non-survivor in a content-hash group is forced back to survivor position.
  • Dry-run undo still performs the undo. --dry-run and --undo are now MarkFlagsMutuallyExclusive at the cobra layer.
  • Legacy identity migration can mark itself applied too early. The migration no longer marks itself applied when [identity].addresses is set but zero sources exist. Marking in that state would permanently drop the addresses; instead the migration runs on the next command after a source is added.

Lows from job 378 — also fixed

  • list-identities --confirmed with no scope now defaults to the union across every source. GetIdentitiesForScope's "explicit empty scope means no match" contract is intact; the CLI explicitly expands "no flag" to "all sources."
  • MergeFilterIntoQuery ignored filter.SourceIDs. Now forwards multi-source into merged.AccountIDs, preferring SourceIDs when both are set. Latent (no current caller sets it), but defensive against future callers.
  • dedup-purge in remote mode now rejects upfront with a clearer message rather than the generic "must specify --batch or --all-hidden."

go test ./... clean.

@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented Apr 30, 2026

@wesm i will mark ready once I've done my own acceptance tests. however i wanted to get you this once it was ready for your reviews

jesserobbins added a commit to jesserobbins/msgvault-1 that referenced this pull request Apr 30, 2026
… semantics

The roborev review on PR wesm#304 raised concern that --collection silently
crosses source boundaries. The behavior is correct per the design
alignment spec — cross-account local-hide under --collection is
explicit user opt-in, and remote deletion stays same-source-only — but
the CLI help did not make that contract visible. This documents it.

- Long help: gains a Scope section that explicitly states each flag's
  source-boundary semantics. --account never crosses sources.
  --collection is the only way to compare across sources, is an
  explicit opt-in, hides losers locally (reversible via --undo), and
  --delete-dups-from-source-server stays same-source-only.
- --account and --collection short flag descriptions clarified to
  match: --account scopes to one account, --collection opts into
  cross-source comparison.
- Runtime banner: when a collection-scoped run spans more than one
  account, prints a one-line note that cross-source dedup is
  reversible and that remote deletion is same-source-only, plus a
  reminder that --dry-run previews without writing.

No engine behavior change; doc-only clarification of the existing
contract.
@jesserobbins
Copy link
Copy Markdown
Contributor Author

Documentation follow-up to the H1 cross-account question — 0ea4332.

The behavior is unchanged (cross-account local-hide under --collection is the design contract) but the CLI help didn't make that visible to the user. Now it does:

  • deduplicate Long help has a new Scope section that explicitly states each flag's source-boundary semantics. --account never crosses sources. --collection is the only way to compare across sources, is an explicit opt-in, hides losers locally (reversible via --undo), and --delete-dups-from-source-server stays same-source-only.
  • --account and --collection short descriptions clarified to match — "scope to one account; never crosses source boundaries" vs "opts into cross-source comparison."
  • Runtime banner: when a collection-scoped run spans more than one account, prints a one-line note that cross-source dedup is reversible and remote deletion stays same-source-only, plus a reminder that --dry-run previews without writing.

No engine change. The reviewer's concern was real on the user-education axis even if the underlying behavior matches the spec.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (0ea4332)

No Medium, High, or Critical findings were reported.

All review agents found the code clean, including the security review.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Confirmed that this is working as designed. However, I am not sure about right syntax for the new commands. For instance, collection vs collections. However, this works, builds and reviews clean so fine for testing.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (04dc1c7)

Potentially unsafe PR: one Medium security finding needs attention before merge.

Medium

  • internal/dedup/dedup.go:~100, ~720: remoteSourceTypes includes "imap": true, and Execute stages m.SourceMessageID into deletion manifests for any remote source type. Current delete-staged execution uses the Gmail client and manifest GmailIDs, keyed only by Filters.Account, so IMAP duplicates staged via deduplicate --delete-dups-from-source-server can later be executed against Gmail as if their IMAP message IDs were Gmail IDs.

    Exploit path: a malicious or compromised IMAP server could control IMAP source_message_id values for an account that also has Gmail OAuth configured. If the user runs dedup with remote deletion enabled and then delete-staged, crafted IMAP IDs matching Gmail message IDs could cause unintended Gmail trash/delete operations.

    Remediation: remove "imap" from remoteSourceTypes until IMAP deletion execution exists, or extend manifests with a required source_type and make delete-staged dispatch by source type, rejecting non-Gmail manifests in the current Gmail-only executor.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins jesserobbins marked this pull request as ready for review April 30, 2026 20:59
@jesserobbins
Copy link
Copy Markdown
Contributor Author

roborev: Combined Review (04dc1c7)

Potentially unsafe PR: one Medium security finding needs attention before merge.

Medium

  • internal/dedup/dedup.go:~100, ~720: remoteSourceTypes includes "imap": true, and Execute stages m.SourceMessageID into deletion manifests for any remote source type. Current delete-staged execution uses the Gmail client and manifest GmailIDs, keyed only by Filters.Account, so IMAP duplicates staged via deduplicate --delete-dups-from-source-server can later be executed against Gmail as if their IMAP message IDs were Gmail IDs.
    Exploit path: a malicious or compromised IMAP server could control IMAP source_message_id values for an account that also has Gmail OAuth configured. If the user runs dedup with remote deletion enabled and then delete-staged, crafted IMAP IDs matching Gmail message IDs could cause unintended Gmail trash/delete operations.
    Remediation: remove "imap" from remoteSourceTypes until IMAP deletion execution exists, or extend manifests with a required source_type and make delete-staged dispatch by source type, rejecting non-Gmail manifests in the current Gmail-only executor.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

This is helpful feedback and useful for future improvements, but the proposed solution breaks the current functionality and is out of scope.

@jesserobbins jesserobbins force-pushed the jesse/identities-collections-dedup branch from 04dc1c7 to 2aa2f09 Compare April 30, 2026 23:52
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 30, 2026

roborev: Combined Review (2aa2f09)

Summary verdict: Changes need fixes before merge due to one High severity remote deletion routing bug and two Medium behavior regressions.

High

  • internal/dedup/dedup.go:90
    remoteSourceTypes marks imap as eligible for staged remote deletion, but the staged deletion pipeline and manifest format are Gmail-specific (GmailIDs, Gmail OAuth/client executor). An IMAP dedup run with --delete-dups-from-source-server can create manifests that delete-staged will try to execute through Gmail instead of IMAP.
    Fix: Remove imap from remoteSourceTypes until manifests persist source type and delete-staged can route to an IMAP deletion executor.

Medium

  • cmd/msgvault/cmd/build_cache.go:684
    The deleted_at IS NULL filter was added only to the CSV fallback query in setupSQLiteSource. On the normal sqlite-scanner path, sqlite_db.messages is the raw attached SQLite table, and the main cache export query is unchanged, so a full cache rebuild can still export dedup-hidden rows on Linux/macOS.
    Fix: Add m.deleted_at IS NULL to the main COPY ... FROM sqlite_db.messages export query, or make both sqlite-scanner and CSV paths expose the same filtered messages relation.

  • cmd/msgvault/cmd/deduplicate.go:92
    --undo still resolves --account / --collection before reaching the undo branch, even though the flag help says scope flags are ignored. A stale or missing account/collection can block undoing a valid batch.
    Fix: Handle dedupUndo before scope resolution, or explicitly reject combining --undo with scope flags.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented May 1, 2026

While it was something originally discussed in the project scope - I believe it is best to gate the gmail remote delete gated. We can also remove it. Basically everything else we are doing right now (local hard-delete, dedup, even import) is reversible or local. Remote-delete touches the user's authoritative Gmail mailbox and I just don't feel comfortable with that going out in this at all. Very open to feedback on this and everything else, however the two commits above put it behind an ENV flag.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 1, 2026

roborev: Combined Review (f0ab152)

Summary verdict: Medium-risk edge cases remain around collection stats, numeric account identifiers, and default identity rebinding.

Medium

  • cmd/msgvault/cmd/stats.go:56
    stats --collection passes an empty collection’s SourceIDs() directly to GetStatsForScope; an empty/nil scope means “global stats,” so an empty collection reports whole-database counts.
    Fix: Check len(sourceIDs) == 0 after resolving scope and return a “collection has no member accounts” error before calling GetStatsForScope.

  • cmd/msgvault/cmd/collection.go:202
    resolveAccountList parses numeric-looking account tokens as source IDs before trying account resolution. E.164 phone identifiers like +15551234567 parse as integers, so WhatsApp/Google Voice accounts cannot be added by identifier.
    Fix: Resolve account identifiers/display names first, or avoid treating leading + values as numeric source IDs; use an explicit source-ID syntax if needed.

  • cmd/msgvault/cmd/addaccount.go:167
    Auto-default identity is written whenever an existing account is revalidated/rebound, not only when a source is created. This can re-add an identity the user deliberately removed, affecting dedup sent-copy detection.
    Fix: Gate confirmDefaultIdentity on actual new source creation across add/import paths, for example by pre-checking existence or returning a created flag from source creation.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins jesserobbins force-pushed the jesse/identities-collections-dedup branch from f0ab152 to 5188612 Compare May 1, 2026 01:11
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 1, 2026

roborev: Combined Review (5188612)

Summary verdict: Medium-risk issues remain around live-message filtering consistency and collection dedup confirmation text.

Medium

  • internal/query/sqlite.go:1143: Default SQLite/DuckDB read paths only filter deleted_at IS NULL, but the live-message contract requires excluding both deleted_at and deleted_from_source_at. This can surface source-deleted messages in search/list/aggregate and collection-scoped reads while other paths exclude them.

    Fix: Apply the full live predicate on default read paths, e.g. add deleted_from_source_at IS NULL alongside deleted_at IS NULL, or centralize query engines on the equivalent of store.LiveMessagesWhere.

  • internal/dedup/dedup.go:1053: FormatMethodology says dedup “NEVER merges messages across different accounts,” but --collection intentionally dedups across all member sources/accounts. In collection runs, this confirmation text is false and may lead users to approve hiding legitimate cross-account copies.

    Fix: Make the methodology scope-aware, or explicitly distinguish single-account/per-source mode from collection mode.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Addressed in eb7aaaf.

Finding 2 (collection methodology) — fixed. Added Config.ScopeIsCollection, plumbed from the --collection branch in cmd/msgvault/cmd/deduplicate.go. FormatMethodology now branches: collection mode prints that cross-account merging is intentional and reversible via --undo; account/per-source mode keeps the original "NEVER merges" wording.

Finding 1 (live-message filtering) — partial. The prescribed fix would have changed product behavior. HideDeletedFromSource is opt-in by design:

  • TUI exposes hideDeletedFromSource as a user-toggleable filter, default off (internal/tui/keys.go:1008, internal/tui/view.go:1301).
  • internal/api/handlers.go:1035,1094,1362 explicitly set it to true on paths that want hiding — those would be dead code if the engine already filtered.
  • msgvault is an archive whose value prop is keeping source-deleted rows visible.

The accurate observation was that LiveMessagesWhere's docstring (introduced in 5188612) overclaimed a contract the query layer never honored — the internal/query package never imports the helper. Narrowed the comment in internal/store/live_messages.go to describe the predicate for store/vector callers and explicitly note that internal/query intentionally applies only deleted_at by default and gates deleted_from_source_at behind HideDeletedFromSource.

— Claude (Opus 4.7)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Pushed 8cb9c60 for the four findings still open from the f0ab152 and 5188612 combined reviews. Bundled into one commit because they're small and independent:

  1. --undo mutual exclusion (job #410 medium). Cobra's existing exclusivity covered --account / --collection / --dry-run but not the destructive/scan flags. Added MarkFlagsMutuallyExclusive("undo", X) for --delete-dups-from-source-server, --prefer, --content-hash, --no-backup, --yes so msgvault deduplicate --undo X --delete-dups-from-source-server now errors at parse time instead of silently ignoring the destructive flag.

  2. resolveAccountList E.164 (f0ab152 medium). strconv.ParseInt accepts a leading +, so +15551234567 parsed as the integer 15551234567 and was treated as a source ID. Gated the numeric branch on p[0] being a decimal digit — signed inputs fall through to identifier resolution, restoring the path for WhatsApp / Google Voice accounts.

  3. confirmDefaultIdentity rebind guard (f0ab152 medium). Pushed the guard down into confirmDefaultIdentity itself rather than threading a created flag through 9 ingest call sites: skip the write when the source already has any identity rows. Preserves the documented "freshly created source" intent. Degrades gracefully if the user has removed every identity — in which case the default is restored, which seems right on a clean slate.

  4. Deferred-migration notice count (5188612 low). Notice now reports the post-normalization count instead of len(rawAddresses), so a config like ["", " ", "alice@x"] no longer announces "3 address(es) parked" when only one is. Test updated.

Verified with the built binary: --undo + --delete-dups-from-source-server and --undo + --content-hash both reject. Full test suite green locally.

The CSV-fallback I/O note (also 5188612 low) is intentionally left alone — reverting the inner deleted_at IS NULL filter would re-introduce the schema drift between sqlite_scanner and CSV-view paths that broke Windows in f0ab152. Rationale is in that commit body.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 1, 2026

roborev: Combined Review (8cb9c60)

Medium issue found: deduplication can mutate the database before user confirmation or backup.

Medium

  • Location: cmd/msgvault/cmd/deduplicate.go:343
  • Problem: engine.Scan() runs before confirmation and backup, but Scan() can mutate the database by backfilling rfc822_message_id when not in dry-run mode. If the user declines the prompt, or if backup later fails, the database has still been changed without a pre-change backup.
  • Fix: Make scan read-only, or move backup/confirmation before any non-dry-run backfill. Dry-run should also preview backfill-derived duplicate groups without mutating.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

Addressed the medium on 8cb9c60 in c90fb73, plus the low cosmetic on 8cb9c60 in 82606f4.

Medium (engine.Scan mutates pre-confirmation/pre-backup) — addressed by surfacing, not by restructuring.

The reviewer is right that Scan calls BackfillRFC822IDs outside dry-run mode and that this happens before the user prompt and before the backup. I think the underlying behavior is correct and shouldn't change:

  • Backfill is idempotent metadata derivation — it fills a previously-NULL rfc822_message_id from MIME data already on the row. It never overwrites a non-NULL value, never changes message content, and is recoverable from message_bodies if anyone ever wanted to recompute.
  • The dedup-batch backup-before-merge contract is sized for the merge (which soft-deletes losers and transfers labels), not for metadata catch-up.
  • Restructuring Scan to be read-only would force one of: (a) double-pass backfill (waste), (b) a report that under-reports duplicates the backfill would have surfaced (worse than the status quo for the user), or (c) a non-trivial restructure of the Scan/Execute split. None of those buys safety against a meaningful failure mode.

What was actually missing is consent legibility: the prompt said "hide N duplicates (reversible with --undo)" without telling the user that scan already wrote N′ rfc822 values that stay regardless of their answer. FormatReport already prints the backfill count, so the user sees it in the scan output, but the prompt didn't scope itself.

c90fb73 adds:

  • A one-line note in both confirm prompts (single-source runDeduplicateOnce and per-source runDeduplicatePerSource) when BackfilledCount > 0, calling out that scan already performed the backfill and it's kept regardless of the answer.
  • An expanded godoc on dedup.Engine.Scan documenting the side effect, why it lives in scan rather than execute, and why the backup contract is unchanged.

No behavior change.

Low (redundant len(p) > 0 in collection.go) — fixed in 82606f4. The surrounding loop already trims and continues on empty tokens, so p[0] indexing is safe.

— Claude Code (Opus 4.7) and Primeradiant Superpowers

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 1, 2026

roborev: Combined Review (82606f4)

No Medium, High, or Critical issues were found.

All review outputs are clean or contain no reportable findings.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

This was, in the end, a lot more work than I was expecting.

@jesserobbins
Copy link
Copy Markdown
Contributor Author

For anyone wanting to test #304 alongside the Facebook Messenger importer (#281): I've rebased fbmessenger on top of this PR at jesserobbins/msgvault:jesse/fbmessenger-post-304. It includes the import-messengerconfirmDefaultIdentity wiring (signal account-identifier) so messenger sources participate in the new identity model the same way the other importers do.

wesm pushed a commit that referenced this pull request May 2, 2026
…tions (#306)

## Motivation

(Deliberately not affected by & mergable with #304) This PR enhances #263 by sharpening the SQL logger so slow and failing queries actually tell you what happened. Three things change. First, streaming queries now record their duration at `rows.Close` instead of when `db.Query` returns, because the work happens while you read rows, not when you get the handle. Second, slow and errored queries log their bound args, and the statement truncation preserves the `WHERE` clause, so you can see what ran with what values without the diagnostic part getting cut off. Third, a few small correctness fixes: `loggedRows.Close` ordering, UTF-8-safe arg truncation, and `logStmt` inlined where it belongs.

## Summary

- Log streaming-query duration at `rows.Close`, not at `db.Query` return
- Log args on slow/error queries; preserve the `WHERE` clause when truncating long statements
- Fix `loggedRows.Close` ordering, make arg truncation UTF-8-safe, inline `logStmt`

Co-authored-by: Jesse Robbins <jesserobbins@users.noreply.github.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 3, 2026

roborev: Combined Review (a0b4346)

High-severity compile break plus two medium correctness issues remain; low-severity findings omitted.

High

  • cmd/msgvault/cmd/search_test.go:327
    The new collection-scope search test uses sql.NullString, but the test import block does not add database/sql, so the package will fail to compile.
    Fix: Add database/sql to the imports.

Medium

  • cmd/msgvault/cmd/addaccount.go:84
    Ingest commands run runStartupMigrationsForIngest before confirmDefaultIdentity whenever any source already exists. On first post-upgrade rebind/rerun for an existing source, the legacy identity migration can populate account_identities first, causing confirmDefaultIdentity to skip the source’s own account identifier.
    Fix: Defer the legacy identity data migration until after the per-source default identity write on ingest paths, or split pre-ingest schema work from legacy identity migration.

  • internal/dedup/dedup.go:792
    Undo treats an unknown or mistyped batch ID as success when UndoDedup restores zero rows, so users get “Restored 0 messages” instead of a batch-not-found error.
    Fix: Record/lookup dedup batches or check for matching rows/manifests and return a clear error when the batch cannot be found.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented May 4, 2026

Pushed 271ba4c..ca2afac (12 commits) addressing a fresh roborev refine pass against this branch — eight rounds with both claude-code and codex agents reviewing in parallel, plus the upstream roborev-ci combined-review feedback. Each round: review → fix the in-scope findings → re-review → repeat until no new substantive findings remained. After the first three rounds I tightened the filter to Medium-severity-and-up only to keep the loop focused on shipping #304 rather than chasing cosmetics.

What landed:

  • confirmDefaultIdentity now routes its confirmation through cmd.OutOrStdout() across all 9 ingest call sites (271ba4c).
  • Pre-source legacy [identity] migration removed from ingest paths to preserve confirmDefaultIdentity ordering on rebind (134db1c).
  • delete-deduped routes all output through the cobra writer (6c9a135).
  • CancelManifest reorders rename-then-rewrite-status so a crash mid-cancel never leaves a manifest in pending/ with Status=cancelled (5a2e269).
  • serve's scheduled-sync path calls confirmDefaultIdentity on first source creation (ab04842); daemon startup defers the legacy migration to first scheduled sync to avoid the same ordering hole as ingest (eca38eb).
  • Windows CSV fallback test in build_cache_test.go mirrors production schema (0a197e5).
  • delete-staged trash-path prompt accurately states reversibility — fixes a regression introduced earlier in the loop by routing it through the shared confirmDestructive(ConfirmModeYesNo) helper whose wording assumed irreversibility (0d01f08).
  • Legacy [identity] migration now skips non-email source types (whatsapp, imessage, sms, google_voice*) so phone-keyed sources don't get email identities written to them (bcdde74); also defers when no eligible sources exist instead of dropping the addresses on the floor (ca2afac).

Pushback list at convergence (9 items): documented in private notes, not re-fixed. Highlights:

  • BackfillRFC822IDs 'zlib'-only special case — flagged as a Medium three times by Codex; matches the established GetMessageRaw pattern, both INSERT paths hard-code 'zlib', no current write path produces a non-zlib value. Pushback.
  • Undo returns "Restored 0 messages" rather than "batch not found" on a typo — there is no batch ledger separate from row state, so "0 restored" is observably indistinguishable from "already restored" without adding a new catalog. Pushback.
  • account_identities PK case-sensitivity vs Go-side LOWER() comparison — flagged six times across the prior session and this one; tracked separately as a deferred typed-kinds proposal. Pushback.
  • A handful of daemon-edge consistency windows (default-collection membership, Manifest.String() stale inline status during a rename window) that self-heal on the next process launch.

The loop ran the full review-fix-rereview discipline with predictions before each round and grades after. Two of my own consolidations regressed adjacent code and were caught in the next round when both agents independently flagged the same finding — strong signal both times. Specifically: (1) routing the trash-path confirm through the shared "irreversible" helper changed the user-visible wording for an action that's actually reversible, and (2) the source-type allowlist for the legacy identity migration didn't extend its corresponding defer logic, so an upgrade with only non-email sources would have permanently dropped the legacy addresses. Both are now closed.

Final HEAD: ca2afac. Ready for your review.


Authored by Claude Code (Opus 4.7) in a roborev refine loop, with feedback from @jesserobbins mid-loop that reset the discipline (scope to branch-touched code only, smallest changes to land, Medium+ filter, ledger discipline). Tooling: roborev (multi-agent code review daemon orchestrating claude-code and codex agents in parallel), Claude Code (Opus 4.7), Codex, and Primeradiant Superpowers.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 4, 2026

roborev: Combined Review (ca2afac)

Medium issue found; no Critical or High findings reported.

Medium

  • cmd/msgvault/cmd/search.go:55 - search still appears to require at least one positional argument via Cobra, but the new RunE logic supports queryless scoped searches like msgvault search --collection alice-only. Cobra will reject those invocations before RunE, making the new --account / --collection queryless path unreachable.

    Fix: Change the command args validator to allow zero args, and keep the existing RunE validation that requires either query text or a scope flag.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 4, 2026

roborev: Combined Review (ca2afac)

Summary verdict: One medium issue found; queryless scoped search appears unreachable.

Medium

  • cmd/msgvault/cmd/search.go:55 - search still appears to require at least one positional argument via Cobra, but the new RunE logic supports queryless scoped searches like msgvault search --collection alice-only. Cobra will reject those invocations before RunE, so the new --account/--collection queryless path is unreachable.

    Fix: Change the command args validator to allow zero args, and keep the existing RunE validation that requires either query text or a scope flag.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented May 4, 2026

Verified the queryless scoped-search Medium and it doesn't repro. The claim is that cobra would reject msgvault search --collection alice-only (no positional arg) before reaching RunE. It doesn't: cmd/msgvault/cmd/search.go:56 declares Args: cobra.ArbitraryArgs, which accepts zero or more positional args. The RunE at line 61 already does the right validation: queryStr == "" && searchAccount == "" && searchCollection == "" → error.

Empirical confirmation against the local build at ca2afac:

$ ./msgvault search --collection bogus-collection-doesnt-exist
... command="msgvault search" argc=0 ...
Error: no collection named "bogus-collection-doesnt-exist" (try 'msgvault collection list')

argc=0 reaches RunE; the error is from the collection lookup, not from cobra. The queryless --account / --collection path is reachable. No code change needed.

For context: this is the third asserted-but-unverifiable claim from the multi-agent review pipeline against cmd/msgvault/cmd/search.go and its test file in this PR's history, all of which the build trivially disproves:

  1. Codex once asserted search.go was missing its store import and "won't compile" — search.go:13 imports it; go build ./... passes.
  2. roborev-ci (combined-review bot) once asserted search_test.go:327 used sql.NullString without importing database/sql — line 4 of the test file imports it.
  3. Today's claim that cobra rejects queryless invocations — Args: cobra.ArbitraryArgs and the empirical run above.

Each was raised after multiple iterations of review without surfacing earlier, then dropped on the next round once it was rebutted. Pattern looks like agent hallucination on this specific file rather than a real defect class. Calling it out so this doesn't get re-raised next pass.


Verified by Claude Code (Opus 4.7) against local build at ca2afac. Tooling: roborev (multi-agent code review daemon orchestrating claude-code and codex agents in parallel), Claude Code (Opus 4.7), Codex, and Primeradiant Superpowers.

@jesserobbins
Copy link
Copy Markdown
Contributor Author

jesserobbins commented May 4, 2026

@wesm I told my agents to be nice and also push back citing evidence. They seems a little testy but I would be too if I had to run 25 iterations. I built a small harness that triggers combined roborev runs, including branch reviews using Claude and codex and then vectors toward convergence. Having watched and engaged a bunch I think Claude feels codex is not as cool or fun :-)

I think this release is ready.

jesserobbins added a commit to jesserobbins/msgvault-1 that referenced this pull request May 5, 2026
Resolves conflict between wesm#304 branch and upstream main on the
delete-staged flag shape. Upstream introduced --trash (default =
permanent); wesm#304 introduced --permanent (default = trash). Keep
--permanent / default = trash per maintainer agreement: every other
rung of the deletion progression in msgvault is locally reversible,
so the remote rung should be too unless the user explicitly opts out.

- cmd/msgvault/cmd/deletions.go: keep deletePermanent variable and
  --permanent flag; fold in upstream's service-account scope-check
  refactor (wrap the scope-escalation prompt in !isServiceAccount and
  pass saScopes through buildAPIClient).
- cmd/msgvault/cmd/addaccount_test.go: keep all wesm#304 identity tests
  and append upstream's two new service-account error tests
  (Headless / Force).
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (9aee6b6)

Summary verdict: Fix two medium regressions before merge; no medium-or-higher security findings were reported.

Medium

  • cmd/msgvault/cmd/account_identity.go:51: confirmDefaultIdentity skips writing the account identifier whenever any identity row already exists. On upgraded databases, a non-ingest command can run the legacy identity migration first and create config_migration rows; later sync/add-account skips the source’s own account-identifier, breaking sent-copy detection for that primary address.

    • Fix: Distinguish “identifier already present” from “some identity exists”, or seed the account identifier before/inside the legacy migration for existing sources without unintentionally re-adding user-removed identities.
  • internal/query/duckdb.go:1041: DuckDBEngine.SubAggregate only applies opts.SourceID and ignores the new opts.SourceIDs collection scope. Collection-scoped drill-down/sub-aggregate queries can leak counts from sources outside the selected collection.

    • Fix: Apply the same multi-source filtering used elsewhere, such as routing opts.SourceID/opts.SourceIDs through appendSourceFilter in SubAggregate.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm
Copy link
Copy Markdown
Owner

wesm commented May 5, 2026

Will spend some time on this today to see if there's anything that needs to be addressed prior to merge

jesserobbins added a commit to jesserobbins/msgvault-1 that referenced this pull request May 5, 2026
Resolves conflict between wesm#304 branch and upstream main on the
delete-staged flag shape. Upstream introduced --trash (default =
permanent); wesm#304 introduced --permanent (default = trash). Keep
--permanent / default = trash per maintainer agreement: every other
rung of the deletion progression in msgvault is locally reversible,
so the remote rung should be too unless the user explicitly opts out.

- cmd/msgvault/cmd/deletions.go: keep deletePermanent variable and
  --permanent flag; fold in upstream's service-account scope-check
  refactor (wrap the scope-escalation prompt in !isServiceAccount and
  pass saScopes through buildAPIClient).
- cmd/msgvault/cmd/addaccount_test.go: keep all wesm#304 identity tests
  and append upstream's two new service-account error tests
  (Headless / Force).
@jesserobbins jesserobbins force-pushed the jesse/identities-collections-dedup branch from 9aee6b6 to b7a9a09 Compare May 5, 2026 13:49
@jesserobbins
Copy link
Copy Markdown
Contributor Author

Will spend some time on this today to see if there's anything that needs to be addressed prior to merge

restored move-to-trash by default convention per discussion.

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 5, 2026

roborev: Combined Review (b7a9a09)

No Medium, High, or Critical findings were provided by the review agents.

All review outputs only reported SEVERITY_THRESHOLD_MET and did not include actionable file/line findings to deduplicate or preserve.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

…urce filter

- dedup-purge: store dedup operations and local hard-delete rung
- account_identities: multi-signal set semantics, case-preserving identifiers
- identity: list, show, add, remove subcommands
- add-account: auto-confirm account-identifier identity at source creation
- store_resolver: document deferred legacy identity migration
- dedup: correctness and safety fixes across remote-delete, cache, undo, and migration
- dedup: scope-aware methodology and narrow live-message contract
- fix: undo flag exclusivity, E.164 collection identifiers, identity rebind, deferred-notice count
- deduplicate: surface rfc822 backfill in confirm prompt and Scan godoc
- collection: drop redundant len(p) > 0 guard before digit check
- dedup: align terminology with one-source = one account; drop --apply hint
- search: resolve --account/--collection before mode dispatch
- store: route leaking read paths through LiveMessagesWhere
- tui: invalidate parquet cache on dedup hides since last sync
- delete-staged: default to trash; --permanent opts into batch delete
- dedup: log Scan/Execute/Undo entry and exit at info level
- search: reject empty query for --mode=vector|hybrid
- store/dedup: DeleteAllDeduped covers batchless rows; flip defer rollback; document BackfillRFC822IDs failed-row semantics
- account_identities: case-insensitive remove for email-shaped identifiers
- collection: reject explicit mutation of auto-managed All collection
- migrate_legacy_identity: distinguish deferred from applied in logs
- tui+store: keep cache reasons disjoint; pre-compute live-messages predicate
- store: ALTER TABLE messages ADD deleted_at on schema upgrade
- delete-deduped: align precount with broadened DeleteAllDeduped predicate
- account_identities: add path matches remove path on email case-folding
- config+dedup: backup paths route through DatabasePath helper
- config: DatabasePath decodes file: URI percent-encoding via net/url
- account_identities: tighten email detection beyond bare '@' contains
- store/migrations: drop dead errors.Is(sql.ErrNoRows) check in MarkMigrationApplied
- query+store: deep-copy AccountIDs in MergeFilterIntoQuery; document All self-heal
- store: add identifierMatch helper for email-vs-other comparison
- store: route AddAccountIdentity through identifierMatch helper
- store: route RemoveAccountIdentity through identifierMatch helper
- store: route legacy identity migration through identifierMatch helper
- identity: route CLI prior-row lookup through EqualIdentifier; surface ListAccountIdentities errors
- openStoreAndInit: drop missing-database guard so 15 callers can create on first use
- search: reuse scope-resolution store in runLocalSearch instead of double-init
- identity: filter empty parts in splitSignalSet to mirror mergeSignalSet
- config: PathUnescape relative file: URI in DatabasePath
- dedup: route per-source identity case-folding through NormalizeIdentifierForCompare
- search: reject vector/hybrid filter-only queries up front
- deduplicate: surface scan side effects on no-duplicate per-source iterations
- RemoveAccountIdentity: return rows-affected count; surface multi-row legacy cleanup in CLI
- collection: reject CreateCollection name=All with ErrCollectionImmutable
- ingest: re-run startup migrations after first source creation
- ingest: skip pre-source migration when no source exists
- import-mbox: drop redundant in-loop post-source-create migration call
- store: case-aware legacy migration dedupe; empty SourceIDs match-nothing
- dedup+collection: preserve From-case in dedup; numeric-account fall-through in collection lookup
- stats+dedup: handle empty-collection stats; FromEmail comment and parallel test
- store+dedup: GetSourceByID not-found sentinel; single-member collection methodology
- ingest: confirmDefaultIdentity must run before legacy migration
- deduplicate: clarify undo warning; surface undo hint on Execute error
- import-emlx, import-mbox: comment why migration error returns early
- query+store: centralize live-message predicate via parameterized LiveMessagesWhere
- search: run schema init on unscoped vector/hybrid dispatch
- store: stream raw MIME in BackfillRFC822IDs to remove per-row SELECTs
- docs: add accounts-identities-collections-dedup README and diagrams
- docs: add accounts/identities/collections/dedup specification
- collection: reconcile ErrCollectionImmutable text with spec
- deletion: persist cancelled manifests in their own status directory
- deletion: allow MoveManifest to target StatusCancelled
- deletion: CancelManifest moves to cancelled/ instead of os.Remove
- deletion: add ListCancelled
- list-deletions: show cancelled batches alongside other statuses
- cmd: extract confirmDestructive helper for testable safety prompts
- delete-deduped: --all-hidden cannot be bypassed by --yes
- delete-staged: --permanent and --yes are mutually exclusive
- delete-staged: --permanent requires typing literal "delete" to confirm
- cmd: silence errcheck on fmt.Fprint* in confirm helper and list-deletions
- dedup: skip content-hash groups with multiple Message-ID survivors
- delete-deduped: --batch path uses cancel-on-EOF, not contract error
- delete-staged: route non-permanent confirm through cmd.OutOrStdout
- spec: expand footer with author credit and toolchain attribution
- dedup: skip content-hash groups colliding MID survivors with sent orphans
- confirmDefaultIdentity: route confirmation print through cmd writer
- ingest: drop pre-source legacy migration to preserve confirmDefaultIdentity
- delete-deduped: route all output through cmd writer
- deletion: CancelManifest renames first, rewrites inline status second
- delete-staged: route trash-path confirm through confirmDestructive helper
- deletion: document Manifest.String() stale-status window in CancelManifest
- serve: confirmDefaultIdentity on first source creation
- build_cache_test: mirror production messages columns in CSV fallback
- serve: defer legacy identity migration to first scheduled sync
- delete-staged: trash-path prompt states reversibility
- migrate_legacy_identity: skip non-email source types
- migrate_legacy_identity: defer when no eligible sources; add apple-mail to allowlist
- Merge upstream/main: keep --permanent flag, default to trash
- cmd: silence unchecked fmt.Fprint* writes; fix ST1005 in delete-staged gate
- query: route getMessageRawShared through LiveMessagesWhere
@wesm wesm force-pushed the jesse/identities-collections-dedup branch from b7a9a09 to 99af280 Compare May 6, 2026 00:05
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 6, 2026

roborev: Combined Review (99af280)

The PR has three medium-severity issues to address before merge.

Medium

  • internal/store/migrate_legacy_identity.go:244
    sourceTypeUsesEmailIdentity omits the emlx source type, so existing Apple Mail/Emlx accounts will not receive migrated legacy [identity] addresses. This weakens sent-copy detection for dedup on those accounts.
    Fix: Add emlx to the eligible email-source list and add a migration test for an EMLX source.

  • internal/deletion/manifest.go:378
    CancelManifest moves in_progress manifests to cancelled, but an executor may already be deleting remotely and continue using the old in-progress path. The final checkpoint/move can fail, leaving a cancelled manifest for work that actually completed.
    Fix: Only cancel pending manifests and return a clear “currently executing cannot be cancelled” error for in_progress, or implement cooperative cancellation in the executor before moving the manifest.

  • cmd/msgvault/cmd/build_cache.go:136
    buildCache still skips when maxID <= lastMessageID, so running msgvault build-cache after dedup hides existing rows can leave stale Parquet rows because incremental export cannot remove previously exported messages.
    Fix: Reuse the dedup-hidden staleness check from cacheNeedsBuild in buildCache and force a full rebuild before the skip path.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm and others added 3 commits May 5, 2026 19:35
The contract is "permanently remove rows the dedup pipeline soft-hid",
so the DELETE should be keyed on the positive marker delete_batch_id
IS NOT NULL in addition to deleted_at IS NOT NULL. Earlier the gate
was deleted_at alone, which was only safe by convention — any future
soft-delete writer (a "trash" view, a per-message user hide, etc.)
would have its rows silently destroyed by the dedup hard-delete rung.

Tighten the predicate, retire the FUTURE REFINEMENT comment that
flagged the trap, align the CLI --all-hidden row count so the
preview matches what the delete actually removes, and add a
regression test that confirms a bare deleted_at row (no batch ID)
survives DeleteAllDeduped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default-All bootstrap ran on every InitSchema and used a SELECT-
then-INSERT shape: two processes hitting init at the same time (a CLI
command and serve, or two CLI commands sharing the data dir) could
both pass the SELECT, both attempt the INSERT, and the loser would
fail with a UNIQUE-constraint error surfaced as "create default
collection: ...". SQLite's single-writer prevented data corruption
but the loser's command died on a confusing error that had nothing
to do with what the user typed.

Switch to INSERT OR IGNORE (dialect-rewritten for PostgreSQL via
dialect.InsertOrIgnore) followed by an unconditional SELECT id. Both
racers succeed; the second insert is ignored; both selects return the
same row id. As a bonus this drops the LastInsertId path, which was
only correct because SQLite happened to return the auto-generated id
on a successful INSERT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… only)

The post-run hint and Long help both said "vector and parquet caches"
could be rebuilt with 'build-cache --full-rebuild', but build-cache
only rebuilds the parquet analytics layer — the vector index lives in
vectors.db and rebuilds via 'build-embeddings --full-rebuild'. Mention
both commands so the user has a complete recipe after a large purge.

(Vector pruning of orphan embeddings after delete-deduped is a
separate gap; this commit is documentation only.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wesm
Copy link
Copy Markdown
Owner

wesm commented May 6, 2026

Three post-review design fixes pushed to address concerns from a careful re-read of the squashed branch. None change user-facing semantics; they tighten invariants that were "safe by convention" into invariants enforced by the code.

7991bdcDeleteAllDeduped gated on delete_batch_id, not just deleted_at.
The contract is "permanently remove rows the dedup pipeline soft-hid", but the predicate was WHERE deleted_at IS NOT NULL alone. Safe today because nothing else writes that column, but a future feature adding any other soft-delete semantics (trash view, per-message hide) would have its rows silently destroyed by the dedup hard-delete rung. Added the positive delete_batch_id IS NOT NULL gate, aligned the CLI --all-hidden preview count, retired the FUTURE REFINEMENT comment that flagged the trap, and added TestDeleteAllDeduped_PreservesBatchlessSoftDelete — verified to fail against the old code, pass against the fix.

209c595 — race-free EnsureDefaultCollection via INSERT OR IGNORE.
The default-All bootstrap ran on every InitSchema and used SELECT … WHERE name = 'All' then INSERT on miss. Two processes hitting init at the same time (a CLI command and serve, or two CLIs sharing the data dir) could both pass the SELECT and both attempt the INSERT; the loser surfaced a UNIQUE constraint error that read like a real failure. Switched to dialect-aware INSERT OR IGNORE + unconditional SELECT id so both racers succeed on the same row id, and the LastInsertId path is gone.

f48e8f1delete-deduped vector-cache hint correction.
Three sites in delete_deduped.go (Long help, inline comment, post-run summary) said "vector and parquet caches" could be rebuilt with build-cache --full-rebuild. build-cache rebuilds parquet only; vectors live in vectors.db and rebuild via build-embeddings --full-rebuild. All three now name both commands. (Vector pruning of orphan embeddings after delete-deduped is a separate gap and not addressed here.)

Considered but deferred

For visibility to future readers: a careful re-read also surfaced four items that look like real design notes rather than blockers, and one that's a real invariant mismatch best handled as a separate change.

  • Identity case-sensitivity invariant mismatch. Schema has PRIMARY KEY (source_id, address) with raw bytes, but identifierMatch treats Alice@Example.com and alice@example.com as the same logical identity for emails (and intentionally case-sensitive for non-emails). The clean fix is a normalized comparison key column or a partial expression index — bigger change, deferred. SQLite single-writer prevents in-process divergence today; the gap matters more once Postgres/cross-process writers are part of the supported deployment.
  • Backfill timing in Scan(non-dry-run). Writes rfc822_message_id to disk before the user confirms the merge; not covered by the merge backup. Idempotent metadata derivation, but the safety story is one rung weaker than the doc claims. Cleaner shape: backfill inside the merge transaction.
  • MSGVAULT_ENABLE_REMOTE_DELETE gate removal plan. The gate is the right shape; what's missing is an explicit removal trigger (version commitment, observability requirement, config-file alternative for headless deployments).
  • is_from_me/looksLikeEmail invariant. Compare must always go through NormalizeIdentifierForCompare; not currently a bug because is_from_me is read off the row, but worth a constructor or one-line invariant comment.
  • applied_migrations has no per-migration version. INSERT OR IGNORE keyed on name means a future re-shape of the same logical migration would need a new name. Fine for one migration today; will grow.

@wesm
Copy link
Copy Markdown
Owner

wesm commented May 6, 2026

Filed the deferred follow-ups as issues so they don't get lost:

Each issue has the full context + proposed approach so they're actionable in isolation.

@wesm wesm merged commit 2af34c2 into wesm:main May 6, 2026
5 of 6 checks passed
@wesm
Copy link
Copy Markdown
Owner

wesm commented May 6, 2026

thank you!

@jesserobbins jesserobbins deleted the jesse/identities-collections-dedup branch May 7, 2026 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Accounts, Identities, Collections, and Deduplication

2 participants