Skip to content

dedup: Scan(non-dry-run) writes rfc822_message_id before user confirmation, outside backup scope #312

@wesm

Description

@wesm

Context

Surfaced during the design re-read of PR #304. Not a blocker — the safety story is functionally fine — but the backup contract claims something the code doesn't quite uphold.

What happens

internal/dedup/dedup.go:243-263 documents that Scan(non-dry-run) writes rfc822_message_id to disk before the user confirms the merge:

Side effect (non-dry-run only): if the scoped sources contain messages with no rfc822_message_id but with stored MIME, Scan calls store.BackfillRFC822IDs to derive the column from the stored headers before grouping. The backfill is idempotent metadata derivation — it fills a previously-NULL column from data already on the row, never overwrites a non-NULL value, and changes no message content.

The author calls it "idempotent metadata derivation," and commit e9182d20 added a confirm-prompt note about it. The semantics are defensible on their own.

The gap is in the backup story. The doc at dedup.go:241 explicitly says:

The dedup-batch backup-before-merge contract still holds: the backup is sized for the merge (which soft-deletes losers and transfers labels), not for this metadata catch-up.

So if the user runs deduplicate --account alice, sees the report, and aborts at the confirm prompt, the rfc822 writes have already happened and are not covered by any backup. Probably harmless because the writes are derived from already-stored MIME, but the safety claim "every dedup operation is reversible via --undo or covered by a pre-merge backup" is one rung weaker than it reads.

There's also a UX angle: most CLIs train users to expect a non---dry-run invocation to be a "preview before deciding" up to the confirmation prompt. Writing during the preview phase violates that intuition.

Why it matters

Low blast radius today (the writes are derived data, no message content changes). Worth fixing if/when:

  • Backfill semantics ever get extended to write something less "purely derived"
  • A future user-facing safety doc claims "no writes before the merge confirmation"

Proposed approach

Move the backfill into Execute, inside the same transaction as the merge, after the user confirms. The reason the author cited (the report needs the backfilled IDs to surface duplicate groups) is real but solvable:

  • Run a read-only "would backfill N messages" probe during Scan.
  • Surface the count in the report as "would backfill N then find M groups (estimate; final group count after backfill may differ)."
  • Realize the writes during Execute, after confirmation, inside the merge transaction. The pre-merge backup then genuinely covers everything the merge transaction touches.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions