Skip to content

account_identities: schema PK is raw-bytes but compares are case-folded for emails #311

@wesm

Description

@wesm

Context

Surfaced during the design re-read of PR #304. Wes and Claude discussed and agreed this is a real invariant mismatch but not a blocker for #304 because SQLite's single-writer model prevents in-process divergence today. Filing as a follow-up because the gap matters more once Postgres or cross-process writers are part of the supported deployment.

The mismatch

The schema declares account_identities with raw-bytes equality:

-- internal/store/schema.sql:398
CREATE TABLE IF NOT EXISTS account_identities (
    source_id    INTEGER NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
    address      TEXT NOT NULL,             -- case-preserved
    source_signal TEXT NOT NULL DEFAULT '',
    confirmed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (source_id, address)
);

The code (internal/store/identifier_match.go) treats email-shaped identifiers case-insensitively via LOWER(address) = LOWER(?) while preserving original case on storage. Non-email identifiers (Matrix MXIDs, phone E.164, chat handles) are intentionally case-sensitive.

The PK enforces raw-bytes uniqueness but the application logic enforces case-folded uniqueness for emails. Two paths can produce divergent rows for the same logical identity:

  1. Row exists for Alice@Example.com. Concurrent process inserts alice@example.com. Both AddAccountIdentity and MigrateLegacyIdentityConfig do SELECT … WHERE LOWER(address) = LOWER(?), find no row, then INSERT. PK is on raw bytes — both inserts succeed.
  2. Result: two account_identities rows for the same logical identity. mergeSignalSet semantics break (signals attached to one row are invisible from the other). RemoveAccountIdentity finds both via the case-fold predicate, but only by accident.

The single-writer comment at account_identities.go:37-39 acknowledges this is only safe in-process. serve mode plus a parallel CLI command — the documented multi-process deployment — is exactly where divergence becomes possible.

Why it matters

  • Postgres support (PR Pr2 branch postgresql dialect #298 scaffolding) explicitly removes the SQLite single-writer guarantee.
  • The cleanup commit 28853fe (case-insensitive remove for email-shaped identifiers) handles the symptom on remove but doesn't prevent re-accumulation.
  • Long-running serve deployments running alongside CLI dedup jobs are exactly the workflow most likely to trip this.

Proposed approach

Two reasonable shapes; both are bigger than a one-line patch:

  1. Normalized comparison key column. Add address_key TEXT NOT NULL storing NormalizeIdentifierForCompare(address), make the PK (source_id, address_key), keep address as the display value. Backfill migration on schema upgrade. Cleanest invariant.
  2. Partial expression index. SQLite supports CREATE UNIQUE INDEX … WHERE looksLikeEmail(address) but looksLikeEmail is Go, not SQL. Either reproduce the predicate in SQL (address LIKE '%@%.%' is close but not identical) or split rows into two tables (email vs. other). Less invasive than (1) but the SQL predicate divergence is its own future trap.

Recommend (1). Do not slap UNIQUE(source_id, lower(address)) on all rows — non-email identifiers must keep case sensitivity.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions