Feature: Accounts, Identities, Collections, and Deduplication

Our communications are scattered across decades of email accounts, chat apps, phone backups, and meeting recordings. msgvault finally lets us consolidate all of that into a single, safe, searchable place: Gmail API syncs, old mbox exports, Apple Mail folders from a retired laptop, IMAP backups, Facebook Messenger dumps, SMS exports, WhatsApp histories, meeting recordings, and call logs.

If you are like me, you spend a weekend pulling all the distributed archives together and importing them. Then you start searching in the TUI, with the MCP, or directly. The sense of joy turns to something else as you see the same email appear three times because it arrived via Gmail sync, an old mbox export, and an Apple Mail import. Message counts are wrong. Search sometimes returns dozens of copies. There's no way to tell what was sent vs received in an old mbox dump that lost its metadata. And every import sits in its own silo, with no unified view of "all my communications."

This new feature is a pretty big swing at expanding on this great foundation. 

> **Updated 2026-04-29 with the aligned design.** This proposal was first opened with a draft model and a working branch (PR #286). After @wesm's [design review](https://github.com/wesm/msgvault/pull/286#issuecomment-4320039075), I rewrote the proposal end-to-end rather than patching it. The full aligned design is below — this issue is now the canonical statement of the model. PR #286 will be reshaped against it. I realize this is more of an Epic but these kinds of things often are. ;-)

# Introducing: Accounts, Identities, Collections, and Deduplication

## Accounts, identities, and collections

This feature introduces three related concepts that make the archive usable without erasing provenance or changing the underlying data structure. 

### Accounts

An **account** is one imported message source/archive. A Gmail sync is an account. An mbox import is an account. An Apple Mail import is an account.

Even if two accounts represent the same real-world mailbox, msgvault keeps them separate until the user groups them. That preserves the boundary between independent archives and avoids guessing that two sources should be merged just because their data overlaps.

### Identities

An **identity** identifies the owner of an account: the set of addresses, phone numbers, or provider identifiers that mean "me" inside that account. An account has one identity, which may contain many identifiers.

Identities are stored per account because the meaning of an address depends on the source. A collection's identity is derived from its member accounts. This lets msgvault recover sent/received meaning in old imports and make safer deduplication choices without applying one global identity list to every archive.

### Collections

A **collection** is the user's explicit grouping of accounts. `All` is the default collection containing every account, and users can create collections such as `work`, `personal`, or `old laptop imports`.

Collections are how msgvault offers a unified view across sources while keeping source provenance intact. Any operation that crosses account boundaries does so because the user selected a collection.

The name "collection" matters. An earlier prototype used "merged account," which only described the moment of assembly, not the thing the user actually has afterward. Users don't think "I merged accounts." They think "this is all my work email" or "this is everything." A collection describes what the user has, not how they assembled it. It works at every scale — `All` is a collection, `work` is a collection — and it doesn't overload the word "account," which already means one ingest source.

A future revision may introduce other identity or collection types (for example, device identities or saved-query collections). If that happens, renaming these objects to `account_identity` and `account_collection` would disambiguate. For now, `identity` and `collection` are unambiguous.

Speaking of unambiguous... here's a diagram that hopefully makes this clear. _(this is a real thing that I spent time on, not slop. I rejected a bunch of helpful attempts at mermaid charts. :-)_ 

<img width="1600" height="1170" alt="Image" src="https://github.com/user-attachments/assets/70223c58-78e7-41ea-a17b-e7a6b44f621b" />

## Deduplication

**Deduplication** works inside those boundaries. `deduplicate --account` cleans up repeated rows within one source. `deduplicate --collection` compares messages across the accounts the user deliberately grouped. `deduplicate` with no scope processes each account independently — never as one global pool.

Applying dedup hides redundant local copies from normal search, browse, stats, API, MCP, and vector/hybrid retrieval paths. It does not destroy data: hidden copies stay on disk and `--undo <batch-id>` restores them. Hard-deleting locally and deleting from a source server are separate, explicit steps the user must take on top of dedup; the system never escalates from one to the next on its own.

The result is a model that supports both safety and usefulness: users can search and analyze a unified communication archive, but msgvault still knows which source each message came from, which identities apply to which account, and when an operation is crossing account boundaries.

<img width="1600" height="1100" alt="Image" src="https://github.com/user-attachments/assets/1da69753-74c0-48d8-aa67-220173c337ad" />

## Proposed Resolutions from Design Review


| Category               | Item                                                                                                                                                       | Proposed Resolution                                                                                                                                |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| Account meaning        | Wes (PR #286 review): "**Account**: one ingest source/archive."                                                                                            | Define an account as one source/archive created by one ingest path; never as a logical mailbox spanning sources.                                   |
| Collection meaning     | Issue #278: "A collection is a named grouping of accounts." Wes: "**Collection**: a named grouping of accounts/sources."                                   | Define a collection as a named grouping of account/source IDs, and use it as the only cross-account grouping primitive.                            |
| Default collection     | Issue #278: "create an 'All' collection ... automatically includes every account." Wes: "**All**: the default collection containing every account/source." | Seed `All` by default with every account/source and treat it as a collection, not an account.                                                      |
| Dedup safety boundary  | Issue #278: "Dedup only operates within the boundary the user specifies." Wes: "Dedup across accounts should require an explicit collection boundary."     | Confine within-account dedup to that source, and require an explicit collection scope for any cross-account dedup.                                 |
| Unscoped dedup         | Issue #278: "`msgvault deduplicate` (no flags) — scans each account independently." Wes: "`msgvault deduplicate`: each account/source independently."     | Iterate per-account when no scope is given; never treat unscoped dedup as "dedup everything together."                                             |
| CLI scope clarity      | Wes: "keep `--account` restricted to one account/source, and add an explicit `--collection` flag."                                                         | Encode the boundary in the flag itself: `--account` resolves one source, `--collection` resolves a named group.                                    |
| Name shadowing         | Wes: "`--account work` may target a collection named `work`, not an account/source."                                                                       | Reject collection names passed through `--account`, and emit an explicit error with the correct flag hint.                                         |
| Nested collections     | Wes: collection creation can "effectively allow nested collection references."                                                                             | Restrict collection membership to accounts/sources only; do not support nested collections in this model.                                          |
| Identity scope         | Issue #278: "Identities are tied to accounts." Wes: global `[identity]` config is "a different model."                                                     | Store identities as per-account records, and treat any global identity list as legacy input rather than an active scope.                           |
| Collection identity    | Issue #278: "A collection's identity is the union of its accounts' identities."                                                                            | Derive collection identity from member accounts; do not configure a collection's identity set separately.                                          |
| Query hiding contract  | Issue #278: "Pruned copies are soft-deleted and hidden from all query paths." Wes: vector/hybrid paths can still "surface pruned duplicates."              | Treat dedup-hidden rows as hidden on every normal read surface, as a product contract rather than a query optimization.                            |
| Live-message predicate | Wes: "Centralize live-message filtering" across storage and retrieval surfaces.                                                                            | Define one live-message rule and apply it consistently across SQLite, DuckDB, FTS, vector, API, MCP, TUI, and stats.                               |
| Collection query scope | Wes: decide whether collections are "only a dedup/admin concept" or "first-class query scopes."                                                            | Promote collections to first-class user scopes for search, browse, stats, and dedup; reject any partial implementation that scopes only dedup.     |
| Cache/index policy     | Wes: "Cache/index invalidation needs a clearer policy."                                                                                                    | Anchor correctness to live-message filtering; treat rebuilds as storage/performance hygiene, never as a prerequisite for hiding pruned duplicates. |
| Schema ownership       | Wes: permanent collections "probably belong in the canonical schema/migration path."                                                                       | Place accounts, collections, identities, and dedup metadata in canonical schema and migrations as core data model concepts.                        |
| Undo semantics         | Wes: "Undo is not a full rollback."                                                                                                                        | Restore local visibility and cancel pending deletion intent where possible; do not guarantee an exact pre-run database state.                      |
| Remote deletion scope  | Wes: manifest naming and reporting need to remain "source/account-specific rather than collection-specific."                                               | Keep remote deletion source-scoped even when duplicate detection used a collection boundary.                                                       |

# The core model

### Account

An account is one ingest source/archive. It is the smallest durable provenance unit in msgvault.

Examples:

- one Gmail sync source
- one IMAP source
- one mbox import
- one Apple Mail import
- one iMessage import
- one SMS import
- one Facebook Messenger import
- one meeting transcript import source

The same real-world mailbox imported through Gmail sync and later through an old mbox export creates two accounts. They may represent the same human mailbox, but they are distinct archives with distinct provenance and source-specific deletion semantics.

This keeps the data model honest: msgvault does not infer that two imports belong together just because an email address, display name, or message content overlaps.

### Collection

A collection is a named grouping of accounts. It is the user's explicit statement that multiple sources should be viewed or operated on together.

Examples:

- `All`
- `work`
- `personal`
- `old laptop imports`
- `gmail plus exports`
- `family messages`

Collections are many-to-many:

- An account can belong to multiple collections.
- A collection can contain multiple accounts.
- A collection contains account/source IDs, not other collections.

Collections are the boundary for cross-account features. If two independent archives should be searched, counted, deduplicated, or exported together, the user expresses that by putting them in a collection.
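
To make the many-to-many shape concrete, here is a rough sketch. The `Account` and `Collection` classes and their fields are hypothetical names for illustration, not msgvault's actual types. The point is that membership stores account source IDs only, so nesting cannot be expressed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Account:
    """One ingest source/archive (hypothetical shape)."""
    source_id: int
    identifier: str


@dataclass
class Collection:
    """A named grouping of account source IDs; never other collections."""
    name: str
    source_ids: set = field(default_factory=set)

    def add(self, member):
        # Membership accepts accounts only, so a collection cannot be nested.
        if not isinstance(member, Account):
            raise TypeError("collection members must be accounts")
        self.source_ids.add(member.source_id)


gmail = Account(1, "alice@example.com")
mbox = Account(2, "old-export.mbox")

# The same account can sit in several collections at once.
work = Collection("work")
work.add(gmail)
everything = Collection("All")
for acct in (gmail, mbox):
    everything.add(acct)
```

Here `gmail` belongs to both `work` and `All`, while the accounts themselves stay separate records.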

### All

`All` is the default collection containing every account/source.

`All` gives users a natural unified view without collapsing account provenance. It is still a collection. Operations against `All` are collection-scoped operations and should be displayed that way.

## Scope semantics

The user-facing scope vocabulary is deliberately small:


| Scope            | Meaning                                                 |
| ------------------ | --------------------------------------------------------- |
| Account scope    | One source/archive.                                     |
| Collection scope | All member accounts of one collection.                  |
| All scope        | The default collection containing every account/source. |

CLI flags should expose those boundaries directly:


| Command shape               | Meaning                                                                                                  |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------- |
| `--account <account>`       | Resolve exactly one account/source.                                                                      |
| `--collection <collection>` | Resolve exactly one collection.                                                                          |
| no flag where supported     | Use the command's documented default, such as per-account iteration for dedup or `All` for search/browse. |

`--account` and `--collection` are mutually exclusive. A generic internal `Scope` type is useful, but generic user-facing flags are not. Users should be able to tell when they are crossing source boundaries by reading the command.

Name conflicts should fail clearly:

- If `work` is a collection, `--account work` fails and suggests `--collection work`.
- If `alice@example.com` is an account, `--collection alice@example.com` fails and suggests `--account alice@example.com`.
- If the same string exists as both an account display name and a collection name, the user must choose the correct flag and may need a more specific account identifier.
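
The conflict rules above could be sketched roughly like this. `resolve_scope`, `ScopeError`, and the error wording are all hypothetical; the real CLI may phrase its messages differently.

```python
class ScopeError(ValueError):
    """Raised when a scope flag cannot be resolved (hypothetical type)."""


def resolve_scope(accounts, collections, account=None, collection=None):
    """Return ("account", name), ("collection", name), or None if unscoped.

    `accounts` and `collections` are sets of known names.
    """
    if account is not None and collection is not None:
        raise ScopeError("--account and --collection are mutually exclusive")
    if account is not None:
        if account in accounts:
            return ("account", account)
        if account in collections:
            # Wrong flag for a known collection: fail and point at the fix.
            raise ScopeError(
                f"{account!r} is a collection; use --collection {account}"
            )
        raise ScopeError(f"no account named {account!r}")
    if collection is not None:
        if collection in collections:
            return ("collection", collection)
        if collection in accounts:
            raise ScopeError(
                f"{collection!r} is an account; use --account {collection}"
            )
        raise ScopeError(f"no collection named {collection!r}")
    return None
```

Because each flag only ever searches its own namespace, a name that exists as both an account and a collection resolves according to the flag the user typed.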

## Identity model

Identity answers: "Who am I in this source?"

An identity belongs to an account/source. It can include email addresses, phone numbers, or other protocol-specific identifiers. A confirmed identity means that messages from that address or identifier can be treated as "from me" within that account's context.

Identity is account-scoped for two reasons:

- The same address may appear in multiple imports, and that is expected.
- An address that is safe to treat as "me" in one account may be misleading in another account or shared archive.

A collection's identity is derived from its member accounts. It is the union of confirmed identities from those accounts, used only within the collection's scope.
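
As a minimal sketch of that derivation, assuming confirmed identifiers are held in a mapping from account to a set of strings (a hypothetical in-memory shape, not the stored format):

```python
def collection_identity(confirmed_by_account, member_accounts):
    """Union of confirmed identifiers across a collection's member accounts."""
    identifiers = set()
    for account in member_accounts:
        identifiers |= confirmed_by_account.get(account, set())
    return identifiers


confirmed = {
    "gmail": {"alice@example.com"},
    "old-mbox": {"alice@old-isp.example", "alice@example.com"},
    "family-sms": {"+15550100"},
}
work_identity = collection_identity(confirmed, ["gmail", "old-mbox"])
```

Nothing is stored for the collection itself; the union is recomputed from its members, so removing an account from the collection also removes its identifiers from the derived set.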

Identity discovery should be evidence-based and reviewable. Candidate signals include:

- `is_from_me` metadata from ingest
- sent-folder or sent-label evidence
- account/source identifier match
- OAuth or provider account metadata
- user confirmation

Global identity configuration is not part of the target model.

If an `[identity]` block exists in `config.toml` from an older msgvault version, the first startup after upgrade migrates its addresses into per-account confirmed identities for every existing account, logs a warning naming the migration, and prints a one-time CLI notice asking the user to review per-account identities (`msgvault identity list`). The global config block is then no longer read. This preserves the old behavior across the upgrade while moving every account onto the per-account model.

After migration, identity is stored per account. The global config block is legacy input only.

### Identity management CLI

The first release ships a small, explicit command surface for managing the identifiers attached to an account's identity. Discovery happens at ingest from the signals listed above, plus the legacy migration; everything else the user does manually through these commands.

| Command | Meaning |
| --- | --- |
| `msgvault identity list [--account <a> \| --collection <c>]` | List identifiers. With no scope, lists every account's identity. `--account` shows one account's identifiers; `--collection` shows the union of member accounts. |
| `msgvault identity show <account>` | Show one account's identity: every confirmed identifier with its source signal. |
| `msgvault identity add <account> <identifier>` | Add an identifier (email address, phone number, provider ID) to the account's identity. Idempotent on `(account, identifier)`. |
| `msgvault identity remove <account> <identifier>` | Remove an identifier from the account's identity. |

Naming follows the rest of the CLI: singular namespace (`identity`), one identity per account, identifiers added or removed from it.

### Auto-default-identity at source creation

When an account is created through an ingest command that has a usable identifier, msgvault writes one confirmed identifier to the new account's identity automatically. The user gets functional dedup sent-copy safety with no extra step. Each affected ingest command registers `--no-default-identity` to suppress the write when the user prefers to manage identity manually.

| Command | Identifier written | Signal recorded |
| --- | --- | --- |
| `add-account <email>` | the `<email>` argument | `account-identifier` |
| `add-imap` | configured address | `account-identifier` |
| `add-o365 <email>` | the `<email>` argument | `account-identifier` |
| `import-mbox <identifier> <file>` | the `<identifier>` argument | `account-identifier` |
| `import-emlx` | resolved account email per discovered account | `account-identifier` |
| `import-whatsapp --phone <e164>` | the `--phone` value | `phone-e164` |
| `import-gvoice` | user's GV phone (E.164) when the takeout exposes it | `phone-e164` |

`import-imessage` does not auto-write an identity at source creation. The source is created with the literal identifier `"local"` — there is no per-account identifier known at source-creation time. Apple IDs and phone numbers in the `handle` table belong to message participants, not to the user's identity; promoting them to confirmed identifiers requires a discovery pass that is out of scope for this release. iMessage users add identifiers manually with `identity add`.

### Out of scope for this initial release

The following are valuable but explicitly deferred so this feature can land coherently. They do not block the model and can be added later without changing the user-facing semantics defined above:

- Interactive identity discovery beyond what ingest already records (no candidate-ranking command, no "scan the archive for likely-me addresses" workflow).
- Identity confirmation UX that promotes a discovered candidate to a confirmed identifier (the manual `identity add` is the only confirmation path in this release).
- Identity-derived inbound/outbound classification across historical imports.
- Rich identity review UI, including TUI views and bulk confirm/reject flows.
- Provider-specific identifier types beyond email addresses, phone numbers, and free-form strings (e.g. richer schemas for OAuth subject IDs, device IDs, account-level metadata).
- Automatic identity propagation when accounts are added to a collection.

Anything in this list is welcome as follow-up work. None of it is required to satisfy the core model.

## Collection behavior

Collections are a primary user concept, not just a dedup helper.

Required behavior:

- `All` is created and maintained automatically.
- Users can create named collections from accounts.
- Users can add and remove accounts from collections.
- Collection membership accepts only accounts/sources.
- Collection views preserve account provenance.

Out of scope for the core model:

- Nested collections.
- Implicit collection creation based on matching email addresses.
- Treating a collection as an account.

Collection names and account identifiers can share human-friendly names, so the CLI and UI must preserve the distinction visually and behaviorally.

## Deduplication model

Deduplication removes redundant local copies from normal user-facing results without destroying the underlying archive by default.

### Valid dedup scopes


| Invocation                              | Boundary                                                    |
| ----------------------------------------- | ------------------------------------------------------------- |
| `deduplicate --account <account>`       | Compare messages only within that account/source.           |
| `deduplicate --collection <collection>` | Compare messages across member accounts in that collection. |
| `deduplicate`                           | Process each account independently.                         |

The unscoped form is a convenience for per-account cleanup. It must not compare all messages across all accounts as one global set.

The unscoped default is per-account iteration rather than `--collection All` because cross-account dedup is the higher-risk operation: it can collapse duplicates between independent archives whose provenance the user may want to preserve. Cross-account dedup should require explicit opt-in through `--collection`. A user who genuinely wants to dedup across every account can still write `--collection All`.
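
One way to picture the three boundaries is as a list of comparison pools, where messages are only ever compared inside a single pool. The function and parameter names below are hypothetical.

```python
def dedup_pools(all_source_ids, members_by_collection, account=None, collection=None):
    """Return the comparison pools for one dedup invocation.

    Each pool is a set of source IDs; duplicates are only detected within a pool.
    """
    if account is not None and collection is not None:
        raise ValueError("--account and --collection are mutually exclusive")
    if account is not None:
        return [{account}]
    if collection is not None:
        return [set(members_by_collection[collection])]
    # Unscoped: one pool per account, never one global pool.
    return [{source_id} for source_id in sorted(all_source_ids)]


sources = {1, 2, 3}
collections = {"All": {1, 2, 3}, "work": {1, 2}}
```

With these inputs, the unscoped form yields three single-account pools, while `collection="All"` yields one pool spanning all three sources. That difference is exactly the opt-in the paragraph above describes.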

### Detection

Duplicate detection can use multiple signals:

- RFC822 `Message-ID`
- normalized raw MIME or body content hash
- provider/source message IDs where appropriate
- attachment content hashes where relevant

Detection signals should be merged into duplicate groups carefully. A content-hash match can connect messages that do not share a `Message-ID`, and a `Message-ID` match can connect messages with slightly different stored bodies. The grouping model should allow transitive duplicate sets rather than treating each signal as an isolated pass.
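
A union-find pass is one way to get transitive groups. This is a sketch under assumptions: the `message_id` and `content_hash` field names are hypothetical, only two signals are shown, and values are assumed to be normalized already.

```python
from collections import defaultdict


def duplicate_groups(messages):
    """Group message rows that share any signal, transitively.

    `messages` maps row ID -> dict with optional "message_id" and
    "content_hash" keys. Returns sorted groups of two or more row IDs.
    """
    parent = {row_id: row_id for row_id in messages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # (signal, value) -> first row ID carrying that value
    for row_id, msg in messages.items():
        for signal in ("message_id", "content_hash"):
            value = msg.get(signal)
            if not value:
                continue  # a missing signal never links rows together
            key = (signal, value)
            if key in first_seen:
                union(row_id, first_seen[key])
            else:
                first_seen[key] = row_id

    groups = defaultdict(list)
    for row_id in messages:
        groups[find(row_id)].append(row_id)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


rows = {
    1: {"message_id": "<a@x>", "content_hash": "h1"},
    2: {"message_id": "<a@x>", "content_hash": "h2"},
    3: {"message_id": None, "content_hash": "h2"},
    4: {"message_id": "<b@x>", "content_hash": "h9"},
}
```

Rows 1 and 3 share no signal directly, but both link to row 2, so all three land in one group. Running each signal as an isolated pass would have produced two separate pairs.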

### Survivor selection

Survivor selection should be deterministic and explainable. The policy prefers the copy that is most useful as the durable representative, evaluated in this priority order:

1. source preference when configured
2. has raw MIME or complete original payload
3. source metadata quality
4. richer label or folder metadata
5. earlier archived timestamp when meaningful
6. stable row ID as the final tie-breaker

Earlier rules win outright; later rules only apply when all earlier ones tie. The exact policy should be documented and visible in dry-run output, so a user can read why one copy survived and another was hidden.
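
A sort key built as a tuple gives the "earlier rules win outright" behavior for free. This is only a sketch: the field names are hypothetical, metadata quality and label richness are reduced to plain numbers, and the archived timestamp is assumed to be always present and comparable.

```python
def survivor_key(copy, source_preference=()):
    """Sort key for one copy; the smallest key wins.

    Tuples compare left to right, so a later element only matters when
    every earlier element ties.
    """
    source = copy["source_id"]
    if source in source_preference:
        preference_rank = source_preference.index(source)
    else:
        preference_rank = len(source_preference)
    return (
        preference_rank,            # 1. source preference when configured
        not copy["has_raw_mime"],   # 2. raw MIME or complete payload first
        -copy["metadata_quality"],  # 3. higher source metadata quality
        -copy["label_count"],       # 4. richer label or folder metadata
        copy["archived_at"],        # 5. earlier archived timestamp
        copy["row_id"],             # 6. stable row ID as final tie-breaker
    )


def choose_survivor(copies, source_preference=()):
    """Pick the copy with the smallest survivor key."""
    return min(copies, key=lambda c: survivor_key(c, source_preference))
```

Printing the key tuple for each copy in dry-run output would be one cheap way to show why a particular copy survived.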

#### Sent-message safety rule

Sent-copy safety is an **eligibility filter**, not a tie-breaker. When any message in a duplicate group looks like a sent copy, only sent copies are eligible to survive. Received-copy candidates are removed from the group before the priority list above runs. Losing the sent signal silently changes user interpretation of the archive — "I sent this" is harder to recover than "I received this."

A message looks like a sent copy when any of these signals fires (OR):

- a Gmail `SENT` label on the message
- an `is_from_me` flag on the message from ingest metadata
- the `From` address matches a confirmed identity for the message's account
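
The eligibility filter might look like the sketch below. Field names such as `labels`, `is_from_me`, and `from_address` are hypothetical, and confirmed identifiers are assumed to be stored lowercased.

```python
def looks_sent(msg, confirmed_identities):
    """True when any sent-copy signal fires (OR)."""
    from_address = (msg.get("from_address") or "").lower()
    return (
        "SENT" in msg.get("labels", ())
        or bool(msg.get("is_from_me"))
        or from_address in confirmed_identities
    )


def eligible_survivors(group, identities_by_account):
    """Drop received copies when the group contains at least one sent copy.

    Each message is checked against the identity of its own account only.
    """
    sent = [
        m for m in group
        if looks_sent(m, identities_by_account.get(m["source_id"], set()))
    ]
    return sent if sent else list(group)
```

Only the returned list would then be ranked by the survivor priority order, so a received copy can never outrank a sent one on metadata quality alone.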

### Effects

Applying dedup should:

- choose one survivor per duplicate group
- hide redundant local rows from normal query paths
- preserve enough metadata to explain what happened
- write a batch ID for audit and undo
- avoid remote deletion unless explicitly requested

Dedup should not silently escalate from local hiding to local hard deletion or remote deletion.

## Safety progression

Dedup is a ladder, not a single switch. Each rung is a separate, explicit user action. The system never escalates from one rung to the next on its own.

1. **Scan.** Detect duplicates and report what would change. No data is touched. Dry-run is the default.
2. **Hide.** Apply dedup. Pruned copies are soft-deleted: hidden from normal reads but kept on disk. `--undo <batch-id>` restores visibility.
3. **Local hard delete.** A separate, opt-in action that permanently removes hidden rows from the local archive. Dedup itself never does this; the user runs it explicitly after a hide step they're confident in.
4. **Remote delete.** Deleting from the source server (Gmail, IMAP, another service) is a further separate decision. The default is trash-with-recovery (Gmail's ~30-day trash). Permanent remote deletion requires explicit opt-in and interactive confirmation.

The user controls every rung. "Apply dedup" never implies hard delete. "Hard delete locally" never implies remote delete. "Remote delete" never implies permanent remote delete.

Attachment dedup is independent of message dedup: attachments are stored in a content-addressed pool, so identical files are stored once regardless of how many messages reference them. Hiding or hard-deleting a duplicate message does not delete the underlying attachment blob unless no remaining message references it.
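
A toy version of that reference rule, assuming SHA-256 as the content address and an in-memory pool (both are assumptions for illustration; `AttachmentPool` is a hypothetical name):

```python
import hashlib


class AttachmentPool:
    """Content-addressed blobs with per-message references."""

    def __init__(self):
        self.blobs = {}       # content hash -> bytes
        self.references = {}  # content hash -> set of message row IDs

    def attach(self, message_id, data):
        """Store `data` once per distinct content and record the reference."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        self.references.setdefault(digest, set()).add(message_id)
        return digest

    def hard_delete_message(self, message_id):
        """Drop a message's references; remove blobs nobody references."""
        for digest in list(self.references):
            refs = self.references[digest]
            refs.discard(message_id)
            if not refs:
                del self.references[digest]
                del self.blobs[digest]
```

Hiding a message would not touch the pool at all in this sketch; only a hard delete removes references, and a blob disappears only with its last reference.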

## Live-message contract

A **live message** is a message that has not been locally hidden by dedup and has not been recorded as deleted from the source server. The term is internal vocabulary for this contract and shows up in implementation slices and code.

Normal user-facing reads should return live messages only.

This contract applies to:

- message search
- vector and hybrid search
- TUI browsing
- stats and aggregates
- API responses
- MCP responses
- exports that claim to represent the visible archive

Indexes and caches may lag behind SQLite state, but normal retrieval must still filter hidden rows. Rebuilding derived surfaces is valuable for size and performance; it should not be the only thing preventing hidden duplicates from appearing.
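
One way to keep the rule in a single place is a shared SQL predicate that every read path appends. The table and column names here (`messages`, `dedup_hidden_at`, `deleted_from_source_at`) are hypothetical stand-ins, and the example uses an in-memory SQLite database.

```python
import sqlite3

# Hypothetical column names; one shared predicate for every read path.
LIVE_MESSAGE = "(dedup_hidden_at IS NULL AND deleted_from_source_at IS NULL)"


def count_live(conn, source_ids):
    """Count live messages within the given source IDs."""
    placeholders = ",".join("?" for _ in source_ids)
    sql = (
        "SELECT COUNT(*) FROM messages "
        f"WHERE source_id IN ({placeholders}) AND {LIVE_MESSAGE}"
    )
    return conn.execute(sql, list(source_ids)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, source_id INTEGER, "
    "dedup_hidden_at TEXT, deleted_from_source_at TEXT)"
)
conn.executemany(
    "INSERT INTO messages (source_id, dedup_hidden_at, deleted_from_source_at) "
    "VALUES (?, ?, ?)",
    [
        (1, None, None),          # live
        (1, "2026-04-29", None),  # hidden by dedup
        (2, None, "2026-04-01"),  # deleted from the source server
        (2, None, None),          # live
    ],
)
```

Results from a lagging vector or FTS index could be passed through the same predicate by row ID before they are returned, which keeps correctness independent of index freshness.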

## Query scope

Collections should be first-class query scopes.

If users can create `work` or `personal`, they should be able to search, browse, count, and inspect those collections without learning which source IDs are inside. That applies across local search, vector/hybrid search, TUI, API, MCP, and stats.

The scope model should produce the same result set regardless of retrieval backend:

- account scope maps to one source ID
- collection scope maps to many source IDs
- `All` maps to every source ID

Backend differences are acceptable for ranking or performance, but not for scope membership or live-message visibility.
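
A sketch of that mapping, with a hypothetical `scope` tuple of either `("account", source_id)` or `("collection", name)`:

```python
def scope_source_ids(scope, all_source_ids, members_by_collection):
    """Resolve a scope to the set of source IDs every backend filters on."""
    kind, value = scope
    if kind == "account":
        if value not in all_source_ids:
            raise KeyError(f"unknown account: {value!r}")
        return {value}
    if kind == "collection":
        if value == "All":
            # `All` always tracks every source, including newly added ones.
            return set(all_source_ids)
        return set(members_by_collection[value])
    raise ValueError(f"unknown scope kind: {kind!r}")
```

If every backend receives the same resolved set, scope membership cannot drift between SQLite, DuckDB, FTS, and vector retrieval even when their ranking differs.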

## Cache and index policy

The product contract is:

- Dedup changes the canonical archive state.
- Normal reads hide rows that are no longer live.
- Derived indexes may be rebuilt, updated, or marked stale as an operational concern.

Recommended policy:

- Filtering is mandatory for correctness.
- Best-effort derived index cleanup is allowed.
- Manual rebuild commands remain available.
- Any known stale derived surface should be visible in command output or logs.

This avoids coupling dedup correctness to every cache and index implementation.

## Undo model

Undo is not full time travel.

Undo should restore local visibility for rows hidden by a dedup batch and cancel pending remote deletion manifests when they have not executed. It should not promise to reverse every side effect of dedup, such as survivor label unioning, raw-MIME enrichment, index cleanup, or remote deletion already performed against a source service.

Canonical user-facing language:

> `--undo <batch-id>` restores rows hidden by that dedup batch and cancels the batch's pending remote-deletion manifest where possible. It does not restore an exact pre-run database state.
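
In code terms, undo touches two things and nothing else. The sketch below uses plain dicts; `hidden_by_batch`, `status`, and the status values are hypothetical field names.

```python
def undo_batch(rows, manifests, batch_id):
    """Restore rows hidden by `batch_id` and cancel its unexecuted manifests.

    Returns (rows_restored, manifests_cancelled). Survivor enrichment and
    already-executed remote deletions are deliberately left alone.
    """
    restored = 0
    for row in rows:
        if row.get("hidden_by_batch") == batch_id:
            row["hidden_by_batch"] = None
            restored += 1
    cancelled = 0
    for manifest in manifests:
        if manifest["batch_id"] == batch_id and manifest["status"] == "pending":
            manifest["status"] = "cancelled"
            cancelled += 1
    return restored, cancelled
```

A manifest that has already executed keeps its status, which is why the user-facing wording says "where possible."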

## Remote deletion model

Remote deletion is a separate operation from local dedup.

Even when duplicate detection runs across a collection, remote deletion decisions remain source-specific. It is only valid to stage remote deletion when the survivor and loser belong to the **same source** and that source supports the requested remote-deletion behavior.

Rules:

- **Same-source constraint.** A remote-deletion entry is only staged when the loser and the survivor share a `source_id`. Cross-source duplicate groups produce no remote-deletion entries even when the dedup scope is a collection that spans those sources.
- **Source-scoped manifests.** Remote-deletion manifests, manifest filenames, and reporting labels reflect the source, never the collection name, even when dedup was invoked under `--collection`.
- **Trash by default.** Where the source supports a trash or recoverable state (e.g. Gmail's ~30-day trash), the default remote-deletion behavior moves messages there rather than removing them outright.
- **Permanent deletion is opt-in.** Permanent remote deletion requires an explicit flag and interactive confirmation. It is never the default, never inferred from dedup, and never applied in batch without the user acknowledging the source and scope at the moment of the action.

This preserves the distinction between "hide this redundant local row," "hard-delete it from the local archive," and "delete something from Gmail / IMAP / another source service."
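
The same-source constraint is small enough to sketch directly. Function and field names are hypothetical, and `deletable_sources` stands in for whatever capability check the source provides.

```python
from collections import defaultdict


def stage_remote_deletions(survivor, losers, deletable_sources):
    """Return {source_id: [row_id, ...]} of losers eligible for remote deletion.

    A loser is staged only when it shares the survivor's source and that
    source supports remote deletion. Manifests are keyed by source, never
    by collection.
    """
    manifests = defaultdict(list)
    for loser in losers:
        same_source = loser["source_id"] == survivor["source_id"]
        if same_source and loser["source_id"] in deletable_sources:
            manifests[loser["source_id"]].append(loser["row_id"])
    return dict(manifests)
```

A duplicate group that spans two sources can still hide the loser locally; it simply produces no remote-deletion entry for the cross-source copy.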

## Schema and persistence

Accounts, collections, identities, dedup batches, and deletion manifests are core domain concepts. Their durable state belongs in canonical schema and migrations, with dialect-aware ownership where msgvault supports multiple database engines.

The target model needs durable storage for:

- collection definitions
- collection membership
- account-scoped identity records
- dedup batches
- hidden duplicate row metadata
- remote deletion manifests or manifest references

Ad hoc lazy table creation is acceptable only as a development bridge, not as the settled architecture for these concepts.
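
For a sense of scale, here is a partial schema sketch run against in-memory SQLite. Every table and column name is hypothetical and may not match msgvault's actual schema; dedup batches and deletion manifests are omitted for brevity.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sources (
    id INTEGER PRIMARY KEY,
    identifier TEXT NOT NULL
);
CREATE TABLE collections (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Membership points at sources only, so nesting cannot be represented.
CREATE TABLE collection_sources (
    collection_id INTEGER NOT NULL REFERENCES collections(id),
    source_id INTEGER NOT NULL REFERENCES sources(id),
    PRIMARY KEY (collection_id, source_id)
);
-- One row per confirmed identifier, scoped to a single account.
CREATE TABLE account_identities (
    source_id INTEGER NOT NULL REFERENCES sources(id),
    identifier TEXT NOT NULL,
    signal TEXT NOT NULL,
    PRIMARY KEY (source_id, identifier)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)


def identity_add(conn, source_id, identifier, signal):
    """Idempotent on (source_id, identifier) thanks to the primary key."""
    conn.execute(
        "INSERT OR IGNORE INTO account_identities VALUES (?, ?, ?)",
        (source_id, identifier, signal),
    )
```

The composite primary keys carry two of the model's rules at the storage layer: an account appears at most once per collection, and adding the same identifier twice is a no-op.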

## Product scope

### Core scope

These concepts belong together and should be designed as one coherent model:

- Account/source as one ingest unit.
- Collection as explicit grouping.
- Default `All` collection.
- Account-scoped identities, with one-time migration from any legacy global identity config.
- Collection identity as derived union.
- Account-scoped dedup.
- Collection-scoped dedup.
- Sent-message safety as a survivor eligibility filter, not a tie-breaker.
- Live-message filtering across normal reads.
- Safety progression of scan → hide → local hard delete → remote delete, with no automatic escalation between rungs.
- Undo as local visibility restore, not full rollback.
- Remote deletion as explicit source-scoped follow-up, same-source-only, trash-by-default with permanent deletion behind interactive confirmation.

### Implementation slices

The implementation does not have to land all at once. Reasonable slices are:

- **Model and CLI scope:** vocabulary, `--account`/`--collection`, `All`, no nested collections.
- **Read scope and visibility:** live-message predicate, collection query scope, backend consistency.
- **Dedup application:** account and collection dedup, survivor policy, batch audit, undo language.
- **Identity persistence:** per-account identity records and the `identity {list,show,add,remove}` command surface, with derived collection identity used at read time. Advanced discovery and review UX are out of scope for this release (see Identity model § Out of scope for this initial release).
- **Remote deletion:** source-scoped manifests and collection-scope safety tests.

These slices should preserve the model even if delivered separately. A partial slice should not introduce user-facing semantics that contradict the target design.

### Future product work

These are valuable but do not need to define the first aligned implementation:

- Import-time dedup with `--into`.
- Automatic dedup when creating or adding to collections.
- Exporting a deduplicated collection into a clean account. Once a collection has been deduplicated, its survivors form a coherent unified view across the member sources. A future operation should be able to export those survivors into a single new account that becomes the canonical archive going forward, while the original member accounts remain intact for provenance. This gives users a path from "many overlapping imports" to "one clean source of truth" without forcing them to throw away the originals.
- Identity-derived inbound/outbound classification across historical imports.
- Rich identity review UI.
- Policy controls for source preference and survivor scoring.

## Mapping to PR #304

The original draft branch sat on PR #286. After the design review on that PR (linked above), the model was rewritten and reimplemented on the [`jesse/identities-collections-dedup`](https://github.com/wesm/msgvault/tree/jesse/identities-collections-dedup) branch, which is now in PR #304. Most of the aligned model is already on that branch. The mapping is recorded here so that the gap between the target model and the shipped code stays visible.

CLI surface as shipped on the branch:

- `msgvault deduplicate` — canonical command; accepts `dedup` and `dedupe` as aliases so users can type whichever feels natural.
- `msgvault collection {create,list,show,add,remove,delete}` — singular namespace for managing collections.
- `msgvault delete-deduped` — the local hard-delete rung; permanently removes rows already hidden by a prior `deduplicate` run. Sibling to `delete-staged` (remote-deletion executor); the two destruction verbs are deliberately separate.
- `msgvault identity {list,show,add,remove}` — per-account identity management.

Status:

- **Already on the branch:**
  - account-as-source vocabulary
  - `--account` and `--collection` flags on dedup
  - default `All` collection bootstrap
  - per-account dedup
  - collection-scope dedup
  - sent-message safety in survivor selection
  - undo as local-visibility restore
  - source-scoped remote-deletion manifests with same-source-only staging
  - the explicit local hard-delete rung (`delete-deduped`)
  - the one-time legacy `[identity]` config migration on first startup after upgrade
  - `msgvault identity {list,show,add,remove}` command surface
  - auto-default-identity at source creation across all ingest commands with usable identifiers (with `--no-default-identity` opt-out)
  - multi-signal `source_signal` accumulation (sorted comma-separated set; JSON exposes `signals: []`)
  - case-preserving identifier storage
- **Partial:** live-message filtering (applied to SQLite, DuckDB, FTS, and the SQLite vector backend including the fused vector+keyword path; MCP response audit still pending), name-collision errors between accounts and collections (basic guards in place; full ambiguity-suggestion UX is not), and collections as first-class query scopes outside dedup.
- **Not yet on the branch:** identity discovery beyond ingest metadata, identity confirmation UX, derived collection identity used at read time, and policy controls for survivor scoring.

The implementation slices above can be applied to the existing branch incrementally rather than as a single reshape.

## Things to consider prior to implementation

Use this checklist before translating the design back into implementation tasks:

- Does "account" always mean one ingest source/archive?
- Is every cross-account operation expressed through a collection?
- Can users tell from the command or UI when they are crossing account/source boundaries?
- Are identities account-scoped rather than global, with a defined migration from any legacy global config?
- Is `All` modeled as a collection?
- Are collections first-class query scopes?
- Are hidden duplicates excluded from every normal read path by contract?
- Does dedup honor sent-message eligibility before falling back to the survivor priority list?
- Does dedup keep scan / hide / local hard delete / remote delete as four separate user actions, with no automatic escalation between them?
- Does remote deletion stay same-source-only, trash-by-default, and require explicit confirmation for permanent removal?
- Does undo avoid promising exact rollback?
- Are implementation slices allowed only when they preserve these semantics?




| Command | Meaning |
| --- | --- |
| `msgvault identity list [--account <a> \| --collection <c>]` | List identifiers. With no scope, lists every account's identity. `--account` shows one account's identifiers; `--collection` shows the union of member accounts. |
| `msgvault identity show <account>` | Show one account's identity: every confirmed identifier with its source signal. |
| `msgvault identity add <account> <identifier>` | Add an identifier (email address, phone number, provider ID) to the account's identity. Idempotent on `(account, identifier)`. |
| `msgvault identity remove <account> <identifier>` | Remove an identifier from the account's identity. |
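
The semantics in this table (idempotent add, per-account storage, collection identity as a derived union) can be sketched with an in-memory stand-in. `IdentityStore` and its methods are hypothetical names for illustration, and comparing identifiers by exact string match is an assumption; the proposal only states that storage is case-preserving.

```python
from collections import defaultdict


class IdentityStore:
    """In-memory stand-in for per-account identity records (hypothetical)."""

    def __init__(self) -> None:
        self._by_account: dict[str, set[str]] = defaultdict(set)

    def add(self, account: str, identifier: str) -> bool:
        """Add an identifier; return False if it was already present."""
        before = len(self._by_account[account])
        self._by_account[account].add(identifier)  # stored as given, case preserved
        return len(self._by_account[account]) > before

    def remove(self, account: str, identifier: str) -> None:
        """Remove an identifier; removing a missing one is a no-op."""
        self._by_account[account].discard(identifier)

    def show(self, account: str) -> list[str]:
        """List one account's identifiers in sorted order."""
        return sorted(self._by_account.get(account, set()))

    def collection_identity(self, member_accounts: list[str]) -> list[str]:
        """Derive a collection's identity as the union of its members' identities."""
        union: set[str] = set()
        for account in member_accounts:
            union |= self._by_account.get(account, set())
        return sorted(union)
```

Note that nothing is ever stored for the collection itself: its identity is recomputed from member accounts, so adding or removing a member changes the result without any separate configuration step.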

| Command | Identifier written | Signal recorded |
| --- | --- | --- |
| `add-account <email>` | the `<email>` argument | `account-identifier` |
| `add-imap` | configured address | `account-identifier` |
| `add-o365 <email>` | the `<email>` argument | `account-identifier` |
| `import-mbox <identifier> <file>` | `<identifier>` arg | `account-identifier` |
| `import-emlx` | resolved account email per discovered account | `account-identifier` |
| `import-whatsapp --phone <e164>` | the `--phone` value | `phone-e164` |
| `import-gvoice` | user's GV phone (E.164) when the takeout exposes it | `phone-e164` |
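
When more than one signal applies to the same identifier, the status notes in this proposal describe `source_signal` as a sorted comma-separated set that JSON output exposes as `signals: []`. A minimal sketch of that accumulation follows; the helper names `merge_signal` and `signals_json` are hypothetical.

```python
def merge_signal(existing: str, new_signal: str) -> str:
    """Fold one more signal into a sorted, comma-separated, duplicate-free set."""
    signals = {s for s in existing.split(",") if s}
    signals.add(new_signal)
    return ",".join(sorted(signals))


def signals_json(source_signal: str) -> dict:
    """Expose the stored string as a JSON-ready list under the `signals` key."""
    return {"signals": [s for s in source_signal.split(",") if s]}
```

Sorting makes the stored value independent of the order in which ingest commands ran, so re-importing the same source does not change the record.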

| Category | Item | Proposed Resolution |
| --- | --- | --- |
| Account meaning | Wes (PR #286 review): "Account: one ingest source/archive." | Define an account as one source/archive created by one ingest path; never as a logical mailbox spanning sources. |
| Collection meaning | Issue #278: "A collection is a named grouping of accounts." Wes: "Collection: a named grouping of accounts/sources." | Define a collection as a named grouping of account/source IDs, and use it as the only cross-account grouping primitive. |
| Default collection | Issue #278: "create an 'All' collection ... automatically includes every account." Wes: "All: the default collection containing every account/source." | Seed `All` by default with every account/source and treat it as a collection, not an account. |
| Dedup safety boundary | Issue #278: "Dedup only operates within the boundary the user specifies." Wes: "Dedup across accounts should require an explicit collection boundary." | Confine within-account dedup to that source, and require an explicit collection scope for any cross-account dedup. |
| Unscoped dedup | Issue #278: "`msgvault deduplicate` (no flags) — scans each account independently." Wes: "`msgvault deduplicate`: each account/source independently." | Iterate per-account when no scope is given; never treat unscoped dedup as "dedup everything together." |
| CLI scope clarity | Wes: "keep `--account` restricted to one account/source, and add an explicit `--collection` flag." | Encode the boundary in the flag itself: `--account` resolves one source, `--collection` resolves a named group. |
| Name shadowing | Wes: "`--account work` may target a collection named `work`, not an account/source." | Reject collection names passed through `--account`, and emit an explicit error with the correct flag hint. |
| Nested collections | Wes: collection creation can "effectively allow nested collection references." | Restrict collection membership to accounts/sources only; do not support nested collections in this model. |
| Identity scope | Issue #278: "Identities are tied to accounts." Wes: global `[identity]` config is "a different model." | Store identities as per-account records, and treat any global identity list as legacy input rather than an active scope. |
| Collection identity | Issue #278: "A collection's identity is the union of its accounts' identities." | Derive collection identity from member accounts; do not configure a collection's identity set separately. |
| Query hiding contract | Issue #278: "Pruned copies are soft-deleted and hidden from all query paths." Wes: vector/hybrid paths can still "surface pruned duplicates." | Treat dedup-hidden rows as hidden on every normal read surface, as a product contract rather than a query optimization. |
| Live-message predicate | Wes: "Centralize live-message filtering" across storage and retrieval surfaces. | Define one live-message rule and apply it consistently across SQLite, DuckDB, FTS, vector, API, MCP, TUI, and stats. |
| Collection query scope | Wes: decide whether collections are "only a dedup/admin concept" or "first-class query scopes." | Promote collections to first-class user scopes for search, browse, stats, and dedup; reject any partial implementation that scopes only dedup. |
| Cache/index policy | Wes: "Cache/index invalidation needs a clearer policy." | Anchor correctness to live-message filtering; treat rebuilds as storage/performance hygiene, never as a prerequisite for hiding pruned duplicates. |
| Schema ownership | Wes: permanent collections "probably belong in the canonical schema/migration path." | Place accounts, collections, identities, and dedup metadata in canonical schema and migrations as core data model concepts. |
| Undo semantics | Wes: "Undo is not a full rollback." | Restore local visibility and cancel pending deletion intent where possible; do not guarantee an exact pre-run database state. |
| Remote deletion scope | Wes: manifest naming and reporting need to remain "source/account-specific rather than collection-specific." | Keep remote deletion source-scoped even when duplicate detection used a collection boundary. |
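
The "one live-message rule" and "undo restores visibility" resolutions can be illustrated together. This is a sketch, not the branch's code: the `messages` and `hidden_duplicates` tables, their columns, and the helpers `live`, `count_live`, and `undo_batch` are all assumed names, and only the SQLite surface is shown.

```python
import sqlite3


def live(alias: str = "m") -> str:
    """Return the single live-message predicate for a given table alias.

    Assumption: a message is hidden when a row in the hypothetical
    hidden_duplicates table points at it.
    """
    return f"NOT EXISTS (SELECT 1 FROM hidden_duplicates h WHERE h.message_id = {alias}.id)"


def count_live(conn: sqlite3.Connection) -> int:
    """Example read path: every normal query reuses the same predicate."""
    row = conn.execute(f"SELECT COUNT(*) FROM messages m WHERE {live('m')}").fetchone()
    return row[0]


def undo_batch(conn: sqlite3.Connection, batch_id: int) -> None:
    """Restore local visibility for one dedup batch.

    This only makes hidden rows visible again; it does not promise the
    exact pre-run database state.
    """
    conn.execute("DELETE FROM hidden_duplicates WHERE batch_id = ?", (batch_id,))
```

Because hiding is decided by the predicate at read time, a stale cache or index can cost performance but cannot by itself make a hidden duplicate reappear in a query that applies the rule.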

| Scope | Meaning |
| --- | --- |
| Account scope | One source/archive. |
| Collection scope | All member accounts of one collection. |
| All scope | The default collection containing every account/source. |

| Command shape | Meaning |
| --- | --- |
| `--account <account>` | Resolve exactly one account/source. |
| `--collection <collection>` | Resolve exactly one collection. |
| no flag where supported | Use the command's documented default, such as per-account iteration for dedup or `All` for search/browse. |

| Invocation | Boundary |
| --- | --- |
| `deduplicate --account <account>` | Compare messages only within that account/source. |
| `deduplicate --collection <collection>` | Compare messages across member accounts in that collection. |
| `deduplicate` | Process each account independently. |
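
The three invocations map to three different comparison boundaries, which can be sketched as a function that returns the groups of accounts whose messages may be compared with each other. `dedup_boundaries` and its parameters are hypothetical, the error wording is invented for illustration, and rejecting both flags at once is an assumption.

```python
from typing import Optional


def dedup_boundaries(
    accounts: list[str],
    collections: dict[str, list[str]],
    account: Optional[str] = None,
    collection: Optional[str] = None,
) -> list[list[str]]:
    """Return groups of accounts; dedup compares messages only within a group."""
    if account is not None and collection is not None:
        # Assumption: the two scope flags are mutually exclusive.
        raise ValueError("use either --account or --collection, not both")
    if account is not None:
        if account not in accounts:
            if account in collections:
                # Name shadowing: never silently widen to a collection.
                raise ValueError(f"{account!r} is a collection; use --collection {account}")
            raise ValueError(f"unknown account {account!r}")
        return [[account]]
    if collection is not None:
        if collection not in collections:
            raise ValueError(f"unknown collection {collection!r}")
        return [list(collections[collection])]
    # No flag: each account is its own boundary, never one combined group.
    return [[name] for name in accounts]
```

For example, with accounts `gmail`, `mbox-2014`, and `apple-mail` and a collection `work` containing the first two, the unscoped call yields three single-account groups, while `collection="work"` yields one group of two.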
