Our communications are scattered across decades of email accounts, chat apps, phone backups, and meeting recordings. msgvault finally lets us consolidate them into a single, safe, searchable place: Gmail API syncs, old mbox exports, Apple Mail folders from a retired laptop, IMAP backups, Facebook Messenger dumps, SMS exports, WhatsApp histories, meeting recordings, and call logs.
If you are like me, you spend a weekend pulling all the distributed archives together and importing them. Then you start searching in the TUI, with the MCP, or directly. The sense of joy turns to something else as you see the same email appear three times because it arrived via Gmail sync, an old mbox export, and an Apple Mail import. Message counts are wrong. Search sometimes returns dozens of copies. There's no way to tell what was sent vs received in an old mbox dump that lost its metadata. And every import sits in its own silo, with no unified view of "all my communications."
This new feature is a pretty big swing at expanding on this great foundation.
Updated 2026-04-29 with the aligned design. This proposal was first opened with a draft model and a working branch (PR #286). After @wesm's design review, I rewrote the proposal end-to-end rather than patching it. The full aligned design is below — this issue is now the canonical statement of the model. PR #286 will be reshaped against it. I realize this is more of an Epic but these kinds of things often are. ;-)
Introducing: Accounts, Identities, Collections, and Deduplication
Accounts, identities, and collections
This feature introduces three related concepts that make the archive usable without erasing provenance or changing the underlying data structure.
Accounts
An account is one imported message source/archive. A Gmail sync is an account. An mbox import is an account. An Apple Mail import is an account.
Even if two accounts represent the same real-world mailbox, msgvault keeps them separate until the user groups them. That preserves the boundary between independent archives and avoids guessing that two sources should be merged just because their data overlaps.
Identities
An identity identifies the owner of an account: the set of addresses, phone numbers, or provider identifiers that mean "me" inside that account. An account has one identity, which may contain many identifiers.
Identities are stored per account because the meaning of an address depends on the source. A collection's identity is derived from its member accounts. This lets msgvault recover sent/received meaning in old imports and make safer deduplication choices without applying one global identity list to every archive.
Collections
A collection is the user's explicit grouping of accounts. All is the default collection containing every account, and users can create collections such as work, personal, or old laptop imports.
Collections are how msgvault offers a unified view across sources while keeping source provenance intact. Any operation that crosses account boundaries does so because the user selected a collection.
The name "collection" matters. An earlier prototype used "merged account," which only described the moment of assembly, not the thing the user actually has afterward. Users don't think "I merged accounts." They think "this is all my work email" or "this is everything." A collection describes what the user has, not how they assembled it. It works at every scale — All is a collection, work is a collection — and it doesn't overload the word "account," which already means one ingest source.
A future revision may introduce other identity or collection types (for example, device identities or saved-query collections). If that happens, renaming these objects to account_identity and account_collection would disambiguate. For now, identity and collection are unambiguous.
Speaking of unambiguous... here's a diagram that hopefully makes this clear. (this is a real thing that I spent time on, not slop. I rejected a bunch of helpful attempts at mermaid charts. :-)
Deduplication
Deduplication works inside those boundaries. deduplicate --account cleans up repeated rows within one source. deduplicate --collection compares messages across the accounts the user deliberately grouped. deduplicate with no scope processes each account independently — never as one global pool.
Applying dedup hides redundant local copies from normal search, browse, stats, API, MCP, and vector/hybrid retrieval paths. It does not destroy data: hidden copies stay on disk and --undo <batch-id> restores them. Hard-deleting locally and deleting from a source server are separate, explicit steps the user must take on top of dedup; the system never escalates from one to the next on its own.
The result is a model that supports both safety and usefulness: users can search and analyze a unified communication archive, but msgvault still knows which source each message came from, which identities apply to which account, and when an operation is crossing account boundaries.
Proposed Resolutions from Design Review
| Category | Item | Proposed Resolution |
| --- | --- | --- |
| Account meaning | Wes (PR #286 review): "Account: one ingest source/archive." | Define an account as one source/archive created by one ingest path; never as a logical mailbox spanning sources. |
| Collection meaning | Issue #278: "A collection is a named grouping of accounts." Wes: "Collection: a named grouping of accounts/sources." | Define a collection as a named grouping of account/source IDs, and use it as the only cross-account grouping primitive. |
| Default collection | Issue #278: "create an 'All' collection ... automatically includes every account." Wes: "All: the default collection containing every account/source." | Seed `All` by default with every account/source and treat it as a collection, not an account. |
| Dedup safety boundary | Issue #278: "Dedup only operates within the boundary the user specifies." Wes: "Dedup across accounts should require an explicit collection boundary." | Confine within-account dedup to that source, and require an explicit collection scope for any cross-account dedup. |
| Unscoped dedup | Issue #278: "msgvault deduplicate (no flags) — scans each account independently." Wes: "msgvault deduplicate: each account/source independently." | Iterate per-account when no scope is given; never treat unscoped dedup as "dedup everything together." |
| CLI scope clarity | Wes: "keep `--account` restricted to one account/source, and add an explicit `--collection` flag." | Encode the boundary in the flag itself: `--account` resolves one source, `--collection` resolves a named group. |
| Name shadowing | Wes: "`--account work` may target a collection named work, not an account/source." | Reject collection names passed through `--account`, and emit an explicit error with the correct flag hint. |
| Nested collections | Wes: collection creation can "effectively allow nested collection references." | Restrict collection membership to accounts/sources only; do not support nested collections in this model. |
| Identity scope | Issue #278: "Identities are tied to accounts." Wes: global `[identity]` config is "a different model." | Store identities as per-account records, and treat any global identity list as legacy input rather than an active scope. |
| Collection identity | Issue #278: "A collection's identity is the union of its accounts' identities." | Derive collection identity from member accounts; do not configure a collection's identity set separately. |
| Query hiding contract | Issue #278: "Pruned copies are soft-deleted and hidden from all query paths." Wes: vector/hybrid paths can still "surface pruned duplicates." | Treat dedup-hidden rows as hidden on every normal read surface, as a product contract rather than a query optimization. |
| Live-message predicate | Wes: "Centralize live-message filtering" across storage and retrieval surfaces. | Define one live-message rule and apply it consistently across SQLite, DuckDB, FTS, vector, API, MCP, TUI, and stats. |
| Collection query scope | Wes: decide whether collections are "only a dedup/admin concept" or "first-class query scopes." | Promote collections to first-class user scopes for search, browse, stats, and dedup; reject any partial implementation that scopes only dedup. |
| Cache/index policy | Wes: "Cache/index invalidation needs a clearer policy." | Anchor correctness to live-message filtering; treat rebuilds as storage/performance hygiene, never as a prerequisite for hiding pruned duplicates. |
| Schema ownership | Wes: permanent collections "probably belong in the canonical schema/migration path." | Place accounts, collections, identities, and dedup metadata in canonical schema and migrations as core data model concepts. |
| Undo semantics | Wes: "Undo is not a full rollback." | Restore local visibility and cancel pending deletion intent where possible; do not guarantee an exact pre-run database state. |
| Remote deletion scope | Wes: manifest naming and reporting need to remain "source/account-specific rather than collection-specific." | Keep remote deletion source-scoped even when duplicate detection used a collection boundary. |
The Core model
Account
An account is one ingest source/archive. It is the smallest durable provenance unit in msgvault.
Examples:
- one Gmail sync source
- one IMAP source
- one mbox import
- one Apple Mail import
- one iMessage import
- one SMS import
- one Facebook Messenger import
- one meeting transcript import source
The same real-world mailbox imported through Gmail sync and later through an old mbox export creates two accounts. They may represent the same human mailbox, but they are distinct archives with distinct provenance and source-specific deletion semantics.
This keeps the data model honest: msgvault does not infer that two imports belong together just because an email address, display name, or message content overlaps.
Collection
A collection is a named grouping of accounts. It is the user's explicit statement that multiple sources should be viewed or operated on together.
Examples:
- `All`
- work
- personal
- old laptop imports
- gmail plus exports
- family messages
Collections are many-to-many:
- An account can belong to multiple collections.
- A collection can contain multiple accounts.
- A collection contains account/source IDs, not other collections.
Collections are the boundary for cross-account features. If two independent archives should be searched, counted, deduplicated, or exported together, the user expresses that by putting them in a collection.
All
All is the default collection containing every account/source.
All gives users a natural unified view without collapsing account provenance. It is still a collection. Operations against All are collection-scoped operations and should be displayed that way.
Scope semantics
The user-facing scope vocabulary is deliberately small:
| Scope | Meaning |
| --- | --- |
| Account scope | One source/archive. |
| Collection scope | All member accounts of one collection. |
| All scope | The default collection containing every account/source. |
CLI flags should expose those boundaries directly:
| Command shape | Meaning |
| --- | --- |
| `--account <account>` | Resolve exactly one account/source. |
| `--collection <collection>` | Resolve exactly one collection. |
| no flag where supported | Use the command's documented default, such as per-account iteration for dedup or `All` for search/browse. |
--account and --collection are mutually exclusive. A generic internal Scope type is useful, but generic user-facing flags are not. Users should be able to tell when they are crossing source boundaries by reading the command.
Name conflicts should fail clearly:
- If `work` is a collection, `--account work` fails and suggests `--collection work`.
- If `alice@example.com` is an account, `--collection alice@example.com` fails and suggests `--account alice@example.com`.
- If the same string exists as both an account display name and a collection name, the user must choose the correct flag and may need a more specific account identifier.
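The flag rules and name-conflict errors above can be sketched in Go. This is a minimal illustration, not the branch's implementation: `Scope`, `Registry`, and `ResolveScope` are hypothetical names, and the error wording is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ScopeKind names the two user-facing boundaries.
type ScopeKind int

const (
	AccountScope ScopeKind = iota
	CollectionScope
)

// Scope is a hypothetical internal type; msgvault's real type may differ.
type Scope struct {
	Kind ScopeKind
	Name string
}

// Registry stands in for whatever lookup the store provides (assumption).
type Registry struct {
	Accounts    map[string]bool
	Collections map[string]bool
}

// ResolveScope validates the --account and --collection flag values.
// ok is false with a nil error when neither flag was given, so the
// caller can apply the command's documented default.
func (r Registry) ResolveScope(account, collection string) (s Scope, ok bool, err error) {
	switch {
	case account != "" && collection != "":
		return Scope{}, false, errors.New("--account and --collection are mutually exclusive")
	case account != "":
		if r.Accounts[account] {
			return Scope{Kind: AccountScope, Name: account}, true, nil
		}
		if r.Collections[account] {
			// Name shadowing: fail and point at the correct flag.
			return Scope{}, false, fmt.Errorf("%q is a collection; use --collection %s", account, account)
		}
		return Scope{}, false, fmt.Errorf("no account named %q", account)
	case collection != "":
		if r.Collections[collection] {
			return Scope{Kind: CollectionScope, Name: collection}, true, nil
		}
		if r.Accounts[collection] {
			return Scope{}, false, fmt.Errorf("%q is an account; use --account %s", collection, collection)
		}
		return Scope{}, false, fmt.Errorf("no collection named %q", collection)
	}
	return Scope{}, false, nil
}
```

When a name exists as both an account and a collection, this sketch lets the flag decide, which matches the third rule above.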
Identity model
Identity answers: "Who am I in this source?"
An identity belongs to an account/source. It can include email addresses, phone numbers, or other protocol-specific identifiers. A confirmed identity means that messages from that address or identifier can be treated as "from me" within that account's context.
Identity is account-scoped for two reasons:
- The same address may appear in multiple imports, and that is expected.
- An address that is safe to treat as "me" in one account may be misleading in another account or shared archive.
A collection's identity is derived from its member accounts. It is the union of confirmed identities from those accounts, used only within the collection's scope.
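As a rough sketch of the derived union, assuming confirmed identifiers are available as a per-account map (the function name and data shape here are hypothetical, and identifiers are compared as exact strings):

```go
package main

import "sort"

// CollectionIdentity returns the sorted union of confirmed identifiers
// across the given member accounts. Nothing is stored for the collection
// itself; the result is recomputed from its members.
func CollectionIdentity(members []string, confirmed map[string][]string) []string {
	seen := make(map[string]struct{})
	for _, acct := range members {
		for _, id := range confirmed[acct] {
			seen[id] = struct{}{}
		}
	}
	out := make([]string, 0, len(seen))
	for id := range seen {
		out = append(out, id)
	}
	sort.Strings(out) // deterministic order for display and tests
	return out
}
```

Identifiers from accounts outside the collection never enter the result, which keeps the union confined to the collection's scope.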
Identity discovery should be evidence-based and reviewable. Candidate signals include:
- `is_from_me` metadata from ingest
- sent-folder or sent-label evidence
- account/source identifier match
- OAuth or provider account metadata
- user confirmation
Global identity configuration is not part of the target model.
If an [identity] block exists in config.toml from an older msgvault version, the first startup after upgrade migrates its addresses into per-account confirmed identities for every existing account, logs a warning naming the migration, and prints a one-time CLI notice asking the user to review per-account identities (msgvault identity list). The global config block is then no longer read. This preserves the old behavior across the upgrade while moving every account onto the per-account model.
After migration, identity is stored per account. The global config block is legacy input only.
Identity management CLI
The first release ships a small, explicit command surface for managing the identifiers attached to an account's identity. Discovery happens at ingest from the signals listed above, plus the legacy migration; everything else the user does manually through these commands.
| Command | Meaning |
| --- | --- |
| `msgvault identity list [--account <a> \| --collection <c>]` | List identifiers. With no scope, lists every account's identity. `--account` shows one account's identifiers; `--collection` shows the union of member accounts. |
| `msgvault identity show <account>` | Show one account's identity: every confirmed identifier with its source signal. |
| `msgvault identity add <account> <identifier>` | Add an identifier (email address, phone number, provider ID) to the account's identity. Idempotent on (account, identifier). |
| `msgvault identity remove <account> <identifier>` | Remove an identifier from the account's identity. |
Naming follows the rest of the CLI: singular namespace (identity), one identity per account, identifiers added or removed from it.
Auto-default-identity at source creation
When an account is created through an ingest command that has a usable identifier, msgvault writes one confirmed identifier to the new account's identity automatically. The user gets functional dedup sent-copy safety with no extra step. Each affected ingest command registers --no-default-identity to suppress the write when the user prefers to manage identity manually.
| Command | Identifier written | Signal recorded |
| --- | --- | --- |
| `add-account <email>` | the `<email>` argument | account-identifier |
| `add-imap` | configured address | account-identifier |
| `add-o365 <email>` | the `<email>` argument | account-identifier |
| `import-mbox <identifier> <file>` | `<identifier>` arg | account-identifier |
| `import-emlx` | resolved account email per discovered account | account-identifier |
| `import-whatsapp --phone <e164>` | the `--phone` value | phone-e164 |
| `import-gvoice` | user's GV phone (E.164) when the takeout exposes it | phone-e164 |
import-imessage does not auto-write an identity at source creation. The source is created with the literal identifier "local" — there is no per-account identifier known at source-creation time. Apple IDs and phone numbers in the handle table belong to message participants, not to the user's identity; promoting them to confirmed identifiers requires a discovery pass that is out of scope for this release. iMessage users add identifiers manually with identity add.
Out of scope for this initial release
The following are valuable but explicitly deferred so this feature can land coherently. They do not block the model and can be added later without changing the user-facing semantics defined above:
- Interactive identity discovery beyond what ingest already records (no candidate-ranking command, no "scan the archive for likely-me addresses" workflow).
- Identity confirmation UX that promotes a discovered candidate to a confirmed identifier (the manual `identity add` is the only confirmation path in this release).
- Identity-derived inbound/outbound classification across historical imports.
- Rich identity review UI, including TUI views and bulk confirm/reject flows.
- Provider-specific identifier types beyond email addresses, phone numbers, and free-form strings (e.g. richer schemas for OAuth subject IDs, device IDs, account-level metadata).
- Automatic identity propagation when accounts are added to a collection.
Anything in this list is welcome as follow-up work. None of it is required to satisfy the core model.
Collection behavior
Collections are a primary user concept, not just a dedup helper.
Required behavior:
- `All` is created and maintained automatically.
- Users can create named collections from accounts.
- Users can add and remove accounts from collections.
- Collection membership accepts only accounts/sources.
- Collection views preserve account provenance.
Out of scope for the core model:
- Nested collections.
- Implicit collection creation based on matching email addresses.
- Treating a collection as an account.
Collection names and account identifiers can share human-friendly names, so the CLI and UI must preserve the distinction visually and behaviorally.
Deduplication model
Deduplication removes redundant local copies from normal user-facing results without destroying the underlying archive by default.
Valid dedup scopes
| Invocation | Boundary |
| --- | --- |
| `deduplicate --account <account>` | Compare messages only within that account/source. |
| `deduplicate --collection <collection>` | Compare messages across member accounts in that collection. |
| `deduplicate` | Process each account independently. |
The unscoped form is a convenience for per-account cleanup. It must not compare all messages across all accounts as one global set.
The unscoped default is per-account iteration rather than --collection All because cross-account dedup is the higher-risk operation: it can collapse duplicates between independent archives whose provenance the user may want to preserve. Cross-account dedup should require explicit opt-in through --collection. A user who genuinely wants to dedup across every account can still write --collection All.
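One way to picture the three invocations is as a choice of comparison pools, where messages are only ever compared within a pool. The sketch below is illustrative; `Pool` and `DedupPools` are hypothetical names and the inputs are simplified to account names.

```go
package main

// Pool is a set of accounts whose messages are compared with each other.
type Pool []string

// DedupPools maps a dedup invocation to its comparison pools.
// Flag validation (mutual exclusion, unknown names) is assumed to have
// happened before this call.
func DedupPools(account, collection string, allAccounts []string, members map[string][]string) []Pool {
	switch {
	case account != "":
		// --account: one pool holding one source.
		return []Pool{{account}}
	case collection != "":
		// --collection: one pool spanning every member account.
		return []Pool{append(Pool(nil), members[collection]...)}
	default:
		// Unscoped: one pool per account, never a single global pool.
		pools := make([]Pool, 0, len(allAccounts))
		for _, a := range allAccounts {
			pools = append(pools, Pool{a})
		}
		return pools
	}
}
```

With three accounts, the unscoped form yields three single-account pools, while `--collection All` yields one pool of three. That difference is the explicit opt-in described above.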
Detection
Duplicate detection can use multiple signals:
- RFC822 `Message-ID`
- normalized raw MIME or body content hash
- provider/source message IDs where appropriate
- attachment content hashes where relevant
Detection signals should be merged into duplicate groups carefully. A content-hash match can connect messages that do not share a Message-ID, and a Message-ID match can connect messages with slightly different stored bodies. The grouping model should allow transitive duplicate sets rather than treating each signal as an isolated pass.
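Transitive grouping is a natural fit for a union-find (disjoint-set) structure. The following is a minimal sketch under simplifying assumptions: only two signals are used, the `Msg` fields are hypothetical, and all messages passed in are assumed to already belong to one dedup scope.

```go
package main

// Msg carries only the fields needed for grouping; names are illustrative.
type Msg struct {
	ID          int64
	MessageID   string // RFC822 Message-ID; empty when unknown
	ContentHash string // normalized content hash; empty when unknown
}

// GroupDuplicates links messages that share any signal and returns the
// resulting groups that have two or more members.
func GroupDuplicates(msgs []Msg) [][]int64 {
	parent := make([]int, len(msgs))
	for i := range parent {
		parent[i] = i
	}
	find := func(i int) int {
		for parent[i] != i {
			parent[i] = parent[parent[i]] // path halving
			i = parent[i]
		}
		return i
	}
	union := func(a, b int) {
		ra, rb := find(a), find(b)
		if ra < rb {
			parent[rb] = ra
		} else if rb < ra {
			parent[ra] = rb
		}
	}
	first := make(map[string]int) // signal key -> first index seen
	link := func(key string, i int) {
		if j, ok := first[key]; ok {
			union(i, j)
		} else {
			first[key] = i
		}
	}
	for i, m := range msgs {
		if m.MessageID != "" {
			link("mid:"+m.MessageID, i)
		}
		if m.ContentHash != "" {
			link("hash:"+m.ContentHash, i)
		}
	}
	byRoot := make(map[int][]int64)
	var order []int // roots in first-seen order, for stable output
	for i, m := range msgs {
		r := find(i)
		if _, ok := byRoot[r]; !ok {
			order = append(order, r)
		}
		byRoot[r] = append(byRoot[r], m.ID)
	}
	var groups [][]int64
	for _, r := range order {
		if len(byRoot[r]) > 1 {
			groups = append(groups, byRoot[r])
		}
	}
	return groups
}
```

Because both signals feed the same structure, a message that shares only a hash with one copy and only a Message-ID with another pulls all three into one group.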
Survivor selection
Survivor selection should be deterministic and explainable. The policy prefers the copy that is most useful as the durable representative, evaluated in this priority order:
- source preference when configured
- has raw MIME or complete original payload
- source metadata quality
- richer label or folder metadata
- earlier archived timestamp when meaningful
- stable row ID as the final tie-breaker
Earlier rules win outright; later rules only apply when all earlier ones tie. The exact policy should be documented and visible in dry-run output, so a user can read why one copy survived and another was hidden.
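The priority order can be read as a lexicographic comparison. The sketch below assumes each criterion has already been reduced to a comparable field. The `Candidate` fields, the numeric scores, and the "lower row ID wins" tie-break direction are illustrative assumptions, and sent-copy eligibility (next section) is assumed to have been applied before this runs.

```go
package main

import "sort"

// Candidate holds the survivor-selection inputs for one copy.
// Field names and scoring are hypothetical.
type Candidate struct {
	RowID           int64
	SourceRank      int   // lower is preferred; equal when no preference is configured
	HasRawMIME      bool  // raw MIME or complete original payload present
	MetadataQuality int   // higher is better
	LabelCount      int   // richer label or folder metadata
	ArchivedAt      int64 // unix seconds; assumed comparable across copies
}

// better reports whether a should survive over b. Each rule only
// applies when every earlier rule ties.
func better(a, b Candidate) bool {
	if a.SourceRank != b.SourceRank {
		return a.SourceRank < b.SourceRank
	}
	if a.HasRawMIME != b.HasRawMIME {
		return a.HasRawMIME
	}
	if a.MetadataQuality != b.MetadataQuality {
		return a.MetadataQuality > b.MetadataQuality
	}
	if a.LabelCount != b.LabelCount {
		return a.LabelCount > b.LabelCount
	}
	if a.ArchivedAt != b.ArchivedAt {
		return a.ArchivedAt < b.ArchivedAt
	}
	return a.RowID < b.RowID
}

// PickSurvivor returns the preferred copy and the remaining copies in
// preference order, without mutating the input slice.
func PickSurvivor(group []Candidate) (Candidate, []Candidate) {
	if len(group) == 0 {
		return Candidate{}, nil
	}
	sorted := append([]Candidate(nil), group...)
	sort.SliceStable(sorted, func(i, j int) bool { return better(sorted[i], sorted[j]) })
	return sorted[0], sorted[1:]
}
```

Since the row ID is unique, the comparison is a total order, so the same group always yields the same survivor regardless of input order.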
Sent-message safety rule
Sent-copy safety is an eligibility filter, not a tie-breaker. When any message in a duplicate group looks like a sent copy, only sent copies are eligible to survive. Received-copy candidates are removed from the group before the priority list above runs. Losing the sent signal silently changes user interpretation of the archive — "I sent this" is harder to recover than "I received this."
A message looks like a sent copy when any of these signals fires (OR):
- a Gmail `SENT` label on the message
- an `is_from_me` flag on the message from ingest metadata
- the `From` address matches a confirmed identity for the message's account
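A small sketch of the eligibility filter, assuming confirmed identities are available as a per-account set and that addresses are compared as exact strings. The `Copy` type and function names are hypothetical.

```go
package main

// Copy is one message in a duplicate group; field names are illustrative.
type Copy struct {
	RowID    int64
	Labels   []string
	IsFromMe bool
	From     string
	Account  string
}

// LooksSent applies the three signals as a logical OR. The identity
// check uses only the identity of the copy's own account.
func LooksSent(c Copy, confirmed map[string]map[string]bool) bool {
	for _, l := range c.Labels {
		if l == "SENT" {
			return true
		}
	}
	if c.IsFromMe {
		return true
	}
	return confirmed[c.Account][c.From]
}

// EligibleSurvivors keeps only sent copies when at least one exists;
// otherwise every copy stays eligible. It runs before the priority list.
func EligibleSurvivors(group []Copy, confirmed map[string]map[string]bool) []Copy {
	var sent []Copy
	for _, c := range group {
		if LooksSent(c, confirmed) {
			sent = append(sent, c)
		}
	}
	if len(sent) > 0 {
		return sent
	}
	return group
}
```

Note that the same `From` address can count as sent in one account and not in another, because the lookup is keyed by account.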
Effects
Applying dedup should:
- choose one survivor per duplicate group
- hide redundant local rows from normal query paths
- preserve enough metadata to explain what happened
- write a batch ID for audit and undo
- avoid remote deletion unless explicitly requested
Dedup should not silently escalate from local hiding to local hard deletion or remote deletion.
Safety progression
Dedup is a ladder, not a single switch. Each rung is a separate, explicit user action. The system never escalates from one rung to the next on its own.
- Scan. Detect duplicates and report what would change. No data is touched. Dry-run is the default.
- Hide. Apply dedup. Pruned copies are soft-deleted: hidden from normal reads but kept on disk. `--undo <batch-id>` restores visibility.
- Local hard delete. A separate, opt-in action that permanently removes hidden rows from the local archive. Dedup itself never does this; the user runs it explicitly after a hide step they're confident in.
- Remote delete. Deleting from the source server (Gmail, IMAP, another service) is a further separate decision. The default is trash-with-recovery (Gmail's ~30-day trash). Permanent remote deletion requires explicit opt-in and interactive confirmation.
The user controls every rung. "Apply dedup" never implies hard delete. "Hard delete locally" never implies remote delete. "Remote delete" never implies permanent remote delete.
Attachment dedup is independent of message dedup: attachments are stored in a content-addressed pool, so identical files are stored once regardless of how many messages reference them. Hiding or hard-deleting a duplicate message does not delete the underlying attachment blob unless no remaining message references it.
Live-message contract
A live message is a message that has not been locally hidden by dedup and has not been recorded as deleted from the source server. The term is internal vocabulary for this contract and shows up in implementation slices and code.
Normal user-facing reads should return live messages only.
This contract applies to:
- message search
- vector and hybrid search
- TUI browsing
- stats and aggregates
- API responses
- MCP responses
- exports that claim to represent the visible archive
Indexes and caches may lag behind SQLite state, but normal retrieval must still filter hidden rows. Rebuilding derived surfaces is valuable for size and performance; it should not be the only thing preventing hidden duplicates from appearing.
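To illustrate why filtering, not rebuilding, carries correctness: a retrieval path can post-filter hits from a possibly stale index against canonical row state. This is a sketch only; the `Row` fields and the in-memory lookup are assumptions standing in for the real storage layer.

```go
package main

// Row is the canonical state of one message; field names are hypothetical.
type Row struct {
	ID                int64
	HiddenByBatch     string // non-empty when hidden by a dedup batch
	DeletedFromSource bool   // recorded as deleted from the source server
}

// IsLive is the single predicate every normal read path would share.
func IsLive(r Row) bool {
	return r.HiddenByBatch == "" && !r.DeletedFromSource
}

// FilterLive drops index hits whose canonical row is hidden, deleted,
// or missing, preserving the ranking order of the remaining hits.
func FilterLive(hits []int64, canonical map[int64]Row) []int64 {
	live := make([]int64, 0, len(hits))
	for _, id := range hits {
		if row, ok := canonical[id]; ok && IsLive(row) {
			live = append(live, id)
		}
	}
	return live
}
```

A vector index that still contains a pruned duplicate would return its ID, and the filter would remove it before the result reaches the user.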
Query scope
Collections should be first-class query scopes.
If users can create work or personal, they should be able to search, browse, count, and inspect those collections without learning which source IDs are inside. That applies across local search, vector/hybrid search, TUI, API, MCP, and stats.
The scope model should produce the same result set regardless of retrieval backend:
- account scope maps to one source ID
- collection scope maps to many source IDs
- `All` maps to every source ID
Backend differences are acceptable for ranking or performance, but not for scope membership or live-message visibility.
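The mapping from scope to source IDs could be resolved once and handed to every backend, so membership cannot drift between them. A minimal sketch follows, with a hypothetical in-memory `Store`. Computing `All` on the fly from the account list is an assumption made here for brevity, not a statement about how the branch maintains it.

```go
package main

import (
	"fmt"
	"sort"
)

// Store is a hypothetical stand-in for the archive's metadata.
type Store struct {
	Accounts map[string]int64    // account name -> source ID
	Members  map[string][]string // collection name -> member account names
}

// AccountSourceIDs resolves account scope to exactly one source ID.
func (s Store) AccountSourceIDs(name string) ([]int64, error) {
	id, ok := s.Accounts[name]
	if !ok {
		return nil, fmt.Errorf("no account named %q", name)
	}
	return []int64{id}, nil
}

// CollectionSourceIDs resolves collection scope to its members' source
// IDs, sorted so every backend receives the same list.
func (s Store) CollectionSourceIDs(name string) ([]int64, error) {
	set := make(map[int64]struct{})
	if name == "All" {
		for _, id := range s.Accounts {
			set[id] = struct{}{}
		}
	} else {
		members, ok := s.Members[name]
		if !ok {
			return nil, fmt.Errorf("no collection named %q", name)
		}
		for _, acct := range members {
			if id, ok := s.Accounts[acct]; ok {
				set[id] = struct{}{}
			}
		}
	}
	ids := make([]int64, 0, len(set))
	for id := range set {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids, nil
}
```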
Cache and index policy
The product contract is:
- Dedup changes the canonical archive state.
- Normal reads hide rows that are no longer live.
- Derived indexes may be rebuilt, updated, or marked stale as an operational concern.
Recommended policy:
- Filtering is mandatory for correctness.
- Best-effort derived index cleanup is allowed.
- Manual rebuild commands remain available.
- Any known stale derived surface should be visible in command output or logs.
This avoids coupling dedup correctness to every cache and index implementation.
Undo model
Undo is not full time travel.
Undo should restore local visibility for rows hidden by a dedup batch and cancel pending remote deletion manifests when they have not executed. It should not promise to reverse every side effect of dedup, such as survivor label unioning, raw-MIME enrichment, index cleanup, or remote deletion already performed against a source service.
Canonical user-facing language:
--undo <batch-id> restores rows hidden by that dedup batch and cancels the batch's pending remote-deletion manifest where possible. It does not restore an exact pre-run database state.
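The undo contract can be sketched as two narrow state changes and nothing else. The types and status strings below are hypothetical; the point is only that executed manifests and other side effects are left alone.

```go
package main

// HiddenRow tracks which dedup batch, if any, hid a row.
type HiddenRow struct {
	ID      int64
	BatchID string // empty when the row is visible
}

// Manifest is a staged remote deletion tied to a dedup batch.
type Manifest struct {
	BatchID string
	Status  string // "pending", "executed", or "cancelled" (illustrative values)
}

// Undo restores visibility for rows hidden by batchID and cancels that
// batch's manifests that are still pending. It modifies the slices in
// place and reports how many rows and manifests changed.
func Undo(batchID string, rows []HiddenRow, manifests []Manifest) (restored, cancelled int) {
	if batchID == "" {
		return 0, 0 // an empty ID would otherwise match visible rows
	}
	for i := range rows {
		if rows[i].BatchID == batchID {
			rows[i].BatchID = ""
			restored++
		}
	}
	for i := range manifests {
		if manifests[i].BatchID == batchID && manifests[i].Status == "pending" {
			manifests[i].Status = "cancelled"
			cancelled++
		}
	}
	return restored, cancelled
}
```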
Remote deletion model
Remote deletion is a separate operation from local dedup.
Even when duplicate detection runs across a collection, remote deletion decisions remain source-specific. It is only valid to stage remote deletion when the survivor and loser belong to the same source and that source supports the requested remote-deletion behavior.
Rules:
- Same-source constraint. A remote-deletion entry is only staged when the loser and the survivor share a `source_id`. Cross-source duplicate groups produce no remote-deletion entries even when the dedup scope is a collection that spans those sources.
- Source-scoped manifests. Remote-deletion manifests, manifest filenames, and reporting labels reflect the source, never the collection name, even when dedup was invoked under `--collection`.
- Trash by default. Where the source supports a trash or recoverable state (e.g. Gmail's ~30-day trash), the default remote-deletion behavior moves messages there rather than removing them outright.
- Permanent deletion is opt-in. Permanent remote deletion requires an explicit flag and interactive confirmation. It is never the default, never inferred from dedup, and never applied in batch without the user acknowledging the source and scope at the moment of the action.
This preserves the distinction between "hide this redundant local row," "hard-delete it from the local archive," and "delete something from Gmail / IMAP / another source service."
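The same-source constraint and source-scoped manifests can be expressed as a staging step that never sees the collection name. A minimal sketch, assuming loser/survivor pairs and a per-source capability map (all names hypothetical):

```go
package main

// Pair is one loser/survivor pairing from a duplicate group.
type Pair struct {
	SurvivorSource int64
	LoserSource    int64
	LoserRemoteID  string // provider-side ID of the loser
}

// StageRemoteDeletions builds one manifest per source, keyed by source
// ID. Cross-source pairs and sources without remote-deletion support
// produce no entries; those losers are only hidden locally.
func StageRemoteDeletions(pairs []Pair, supportsRemoteDelete map[int64]bool) map[int64][]string {
	manifests := make(map[int64][]string)
	for _, p := range pairs {
		if p.LoserSource != p.SurvivorSource {
			continue // same-source constraint
		}
		if !supportsRemoteDelete[p.LoserSource] {
			continue
		}
		manifests[p.LoserSource] = append(manifests[p.LoserSource], p.LoserRemoteID)
	}
	return manifests
}
```

Staging only records intent. Executing a manifest, and choosing trash versus permanent deletion, remain separate user actions.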
Schema and persistence
Accounts, collections, identities, dedup batches, and deletion manifests are core domain concepts. Their durable state belongs in canonical schema and migrations, with dialect-aware ownership where msgvault supports multiple database engines.
The target model needs durable storage for:
- collection definitions
- collection membership
- account-scoped identity records
- dedup batches
- hidden duplicate row metadata
- remote deletion manifests or manifest references
Ad hoc lazy table creation is acceptable only as a development bridge, not as the settled architecture for these concepts.
Product scope
Core scope
These concepts belong together and should be designed as one coherent model:
- Account/source as one ingest unit.
- Collection as explicit grouping.
- Default `All` collection.
- Account-scoped identities, with one-time migration from any legacy global identity config.
- Collection identity as derived union.
- Account-scoped dedup.
- Collection-scoped dedup.
- Sent-message safety as a survivor eligibility filter, not a tie-breaker.
- Live-message filtering across normal reads.
- Safety progression of scan → hide → local hard delete → remote delete, with no automatic escalation between rungs.
- Undo as local visibility restore, not full rollback.
- Remote deletion as explicit source-scoped follow-up, same-source-only, trash-by-default with permanent deletion behind interactive confirmation.
Implementation slices
The implementation does not have to land all at once. Reasonable slices are:
- Model and CLI scope: vocabulary, `--account`/`--collection`, `All`, no nested collections.
- Read scope and visibility: live-message predicate, collection query scope, backend consistency.
- Dedup application: account and collection dedup, survivor policy, batch audit, undo language.
- Identity persistence: per-account identity records and the `identity {list,show,add,remove}` command surface, with derived collection identity used at read time. Advanced discovery and review UX are out of scope for this release (see Identity model § Out of scope for this initial release).
- Remote deletion: source-scoped manifests and collection-scope safety tests.
These slices should preserve the model even if delivered separately. A partial slice should not introduce user-facing semantics that contradict the target design.
Future product work
These are valuable but do not need to define the first aligned implementation:
- Import-time dedup with `--into`.
- Automatic dedup when creating or adding to collections.
- Exporting a deduplicated collection into a clean account. Once a collection has been deduplicated, its survivors form a coherent unified view across the member sources. A future operation should be able to export those survivors into a single new account that becomes the canonical archive going forward, while the original member accounts remain intact for provenance. This gives users a path from "many overlapping imports" to "one clean source of truth" without forcing them to throw away the originals.
- Identity-derived inbound/outbound classification across historical imports.
- Rich identity review UI.
- Policy controls for source preference and survivor scoring.
Mapping to PR #304
The original draft branch sat on PR #286. After the design review on that PR (linked above), the model was rewritten and reimplemented on the jesse/identities-collections-dedup branch, which is now in PR #304. Most of the aligned model is already on that branch. Recording the mapping here so the gap between the target model and the shipped code is visible.
CLI surface as shipped on the branch:
- `msgvault deduplicate`: canonical command; accepts `dedup` and `dedupe` as aliases so users can type whichever feels natural.
- `msgvault collection {create,list,show,add,remove,delete}`: singular namespace for managing collections.
- `msgvault delete-deduped`: the local hard-delete rung; permanently removes rows already hidden by a prior `deduplicate` run. Sibling to `delete-staged` (remote-deletion executor); the two destruction verbs are deliberately separate.
- `msgvault identity {list,show,add,remove}`: per-account identity management.
Status:
- Already on the branch: account-as-source vocabulary, `--account` and `--collection` flags on dedup, default `All` collection bootstrap, per-account dedup, collection-scope dedup, sent-message safety in survivor selection, undo as local-visibility restore, source-scoped remote-deletion manifests with same-source-only staging, the explicit local hard-delete rung (`delete-deduped`), the one-time legacy `[identity]` config migration on first startup after upgrade, `msgvault identity {list,show,add,remove}` command surface, auto-default-identity at source creation across all ingest commands with usable identifiers (with `--no-default-identity` opt-out), multi-signal `source_signal` accumulation (sorted comma-separated set; JSON exposes `signals: []`), and case-preserving identifier storage.
- Partial: live-message filtering (applied to SQLite, DuckDB, FTS, and the SQLite vector backend including the fused vector+keyword path; MCP response audit still pending), name-collision errors between accounts and collections (basic guards in place; full ambiguity-suggestion UX is not), and collections as first-class query scopes outside dedup.
- Not yet on the branch: identity discovery beyond ingest metadata, identity confirmation UX, derived collection identity used at read time, and policy controls for survivor scoring.
The implementation slices above can be applied to the existing branch incrementally rather than as a single reshape.
Things to consider prior to implementation
Use this checklist before translating the design back into implementation tasks:
- Does "account" always mean one ingest source/archive?
- Is every cross-account operation expressed through a collection?
- Can users tell from the command or UI when they are crossing account/source boundaries?
- Are identities account-scoped rather than global, with a defined migration from any legacy global config?
- Is All modeled as a collection?
- Are collections first-class query scopes?
- Are hidden duplicates excluded from every normal read path by contract?
- Does dedup honor sent-message eligibility before falling back to the survivor priority list?
- Does dedup keep scan / hide / local hard delete / remote delete as four separate user actions, with no automatic escalation between them?
- Does remote deletion stay same-source-only, trash-by-default, and require explicit confirmation for permanent removal?
- Does undo avoid promising exact rollback?
- Are implementation slices allowed only when they preserve these semantics?
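One checklist item is easy to get subtly wrong: sent-message eligibility has to run before the survivor priority list, not as one more tie-breaker inside it. A rough sketch of that ordering follows, in illustrative Python rather than the branch's code. The Message fields, the looks_sent and pick_survivor names, the case-insensitive address comparison, and the placeholder priority key (prefer raw MIME, then lowest id) are all assumptions made for the example; only the three sent signals (SENT label, is_from_me, From matching a confirmed identity of the message's account) come from the design.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    id: int
    account: str
    from_addr: str
    labels: set = field(default_factory=set)
    is_from_me: bool = False
    has_raw_mime: bool = False  # hypothetical input to the placeholder priority


def looks_sent(msg, identities):
    """True if any sent signal fires (OR).

    identities maps account -> set of confirmed identifiers. Comparing
    addresses case-insensitively is an assumption of this sketch.
    """
    confirmed = {a.lower() for a in identities.get(msg.account, set())}
    return (
        "SENT" in msg.labels
        or msg.is_from_me
        or msg.from_addr.lower() in confirmed
    )


def pick_survivor(group, identities):
    """Apply the eligibility filter first, then a deterministic priority."""
    sent = [m for m in group if looks_sent(m, identities)]
    # If any copy looks sent, received copies are dropped before ranking.
    eligible = sent if sent else list(group)
    # Placeholder priority, NOT the real policy: raw MIME first, then lowest id.
    return min(eligible, key=lambda m: (not m.has_raw_mime, m.id))


group = [
    Message(1, "mbox-2015", "me@example.com", has_raw_mime=True),
    Message(2, "gmail", "me@example.com", labels={"SENT"}),
]
survivor = pick_survivor(group, {"gmail": {"me@example.com"}})
print(survivor.id)
```

In this example message 1 would win on the placeholder priority alone, but message 2 survives because it is the only copy that looks sent. Note that message 1 shares the same From address yet does not count as sent, since identities are checked per account.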