account_identities: schema PK is raw-bytes but compares are case-folded for emails

## Context

Surfaced during the design re-read of PR #304. Wes and Claude discussed and agreed this is a real invariant mismatch but not a blocker for #304 because SQLite's single-writer model prevents in-process divergence today. Filing as a follow-up because the gap matters more once Postgres or cross-process writers are part of the supported deployment.

## The mismatch

The schema declares `account_identities` with raw-bytes equality:

```sql
-- internal/store/schema.sql:398
CREATE TABLE IF NOT EXISTS account_identities (
    source_id    INTEGER NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
    address      TEXT NOT NULL,             -- case-preserved
    source_signal TEXT NOT NULL DEFAULT '',
    confirmed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (source_id, address)
);
```

The code (`internal/store/identifier_match.go`) treats email-shaped identifiers case-insensitively via `LOWER(address) = LOWER(?)` while preserving original case on storage. Non-email identifiers (Matrix MXIDs, phone E.164, chat handles) are intentionally case-sensitive.

The PK enforces raw-bytes uniqueness but the application logic enforces case-folded uniqueness for emails. Two paths can produce divergent rows for the same logical identity:

1. Row exists for `Alice@Example.com`. Concurrent process inserts `alice@example.com`. Both `AddAccountIdentity` and `MigrateLegacyIdentityConfig` do `SELECT … WHERE LOWER(address) = LOWER(?)`, find no row, then `INSERT`. PK is on raw bytes — both inserts succeed.
2. Result: two `account_identities` rows for the same logical identity. `mergeSignalSet` semantics break (signals attached to one row are invisible from the other). `RemoveAccountIdentity` finds both via the case-fold predicate, but only by accident.

The single-writer comment at `account_identities.go:37-39` acknowledges this is only safe in-process. `serve` mode plus a parallel CLI command — the documented multi-process deployment — is exactly where divergence becomes possible.

## Why it matters

- Postgres support (PR #298 scaffolding) explicitly removes the SQLite single-writer guarantee.
- The cleanup commit `28853fe` (case-insensitive remove for email-shaped identifiers) handles the symptom on remove but doesn't prevent re-accumulation.
- Long-running `serve` deployments running alongside CLI dedup jobs are exactly the workflow most likely to trip this.

## Proposed approach

Two reasonable shapes; both are bigger than a one-line patch:

1. **Normalized comparison key column.** Add `address_key TEXT NOT NULL` storing `NormalizeIdentifierForCompare(address)`, make the PK `(source_id, address_key)`, keep `address` as the display value. Backfill migration on schema upgrade. Cleanest invariant.
2. **Partial expression index.** SQLite supports `CREATE UNIQUE INDEX … WHERE looksLikeEmail(address)` but `looksLikeEmail` is Go, not SQL. Either reproduce the predicate in SQL (`address LIKE '%@%.%'` is close but not identical) or split rows into two tables (email vs. other). Less invasive than (1) but the SQL predicate divergence is its own future trap.

Recommend (1). Do not slap `UNIQUE(source_id, lower(address))` on all rows — non-email identifiers must keep case sensitivity.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

account_identities: schema PK is raw-bytes but compares are case-folded for emails #311

Context

The mismatch

Why it matters

Proposed approach

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

account_identities: schema PK is raw-bytes but compares are case-folded for emails #311

Description

Context

The mismatch

Why it matters

Proposed approach

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions