vectors: prune orphan embeddings after delete-deduped (cheaper than build-embeddings --full-rebuild) #313

@wesm

Context

Surfaced during the design re-read of PR #304. The post-run hint correction landed in commit f48e8f1, but the underlying gap (orphan embeddings accumulating in vectors.db after delete-deduped purges rows from messages) is real and was deferred.

What happens

The vector backend's design contract is documented at internal/vector/sqlitevec/backend.go:300-306:

Dedup Execute does not remove vector-store rows by design: if a message is embedded then later soft-deleted, the embedding stays in the vector store and query-time live filtering (dropDeletedFromSource, filteredMessageIDs) enforces the live-message contract.

This is correct for soft-delete (deleted_at), where the message row still exists and the join still works. After delete-deduped permanently removes message rows, the vector-store rows whose message_id no longer joins are orphaned:

  • They consume disk space in vectors.db.
  • They eat into the deletedOverfetchFactor = 2 over-fetch pad in dropDeletedFromSource (backend.go:797), which assumes the fraction of orphans stays constant.
  • They never get pruned. Over months of dedup + purge cycles, the orphan count grows unbounded relative to the live corpus.
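
To make the over-fetch concern concrete, here is a rough model, not the actual backend code. It assumes a query fetches k * factor candidates and then drops the ones whose message row is gone, with orphans spread evenly through the candidate list. The liveHits function and its percentage parameter are hypothetical names for illustration; the real dropDeletedFromSource may behave differently.

```go
package main

import "fmt"

// liveHits is a hypothetical model of over-fetch followed by live filtering.
// It fetches k*factor candidates, assumes orphanPct percent of them are
// orphans (evenly spread, which is a simplification), and returns how many
// live results remain, capped at k.
func liveHits(k, factor, orphanPct int) int {
	fetched := k * factor
	live := fetched * (100 - orphanPct) / 100
	if live > k {
		return k
	}
	return live
}

func main() {
	// With factor 2, a query for 10 results is full only while at most
	// half of the fetched candidates are orphans.
	for _, pct := range []int{10, 50, 75} {
		fmt.Printf("orphans %d%%: %d of 10 results\n", pct, liveHits(10, 2, pct))
	}
}
```

Under this model a factor of 2 tolerates an orphan share of up to 50% among the fetched candidates; beyond that the query returns fewer than k live results.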

The post-run hint in delete-deduped (now corrected by f48e8f1) tells the user to run build-embeddings --full-rebuild, which recreates the vector index from scratch. That is a heavy operation that re-pays the embedding-API cost for the entire corpus, so it is a workaround, not a maintenance command.

Why it matters

  • build-embeddings --full-rebuild is expensive: it re-runs every embedding through the configured endpoint. Users running large archives will avoid it.
  • The over-fetch factor was tuned for a low orphan ratio. As orphans accumulate, ANN recall degrades because the live subset of the top-K shrinks.
  • Long-running daemonized deployments (serve) compound the problem.

Proposed approach

Add a lightweight vectors prune (or build-embeddings --prune-orphans) command that:

  1. Reads message IDs from the vector backend.
  2. Anti-joins against main.messages.id.
  3. Deletes vector-store rows whose message_id has no live message row.

This is much cheaper than a full rebuild: it makes no embedding API calls and reduces to a query shaped like DELETE FROM vec_chunks WHERE message_id NOT IN (SELECT id FROM messages).
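
The three steps above can be sketched as an in-memory anti-join. This is illustrative only: findOrphans and the sample IDs are hypothetical, and a real implementation would read the IDs from vectors.db and the messages table and then issue the delete, rather than work on slices.

```go
package main

import "fmt"

// findOrphans returns the vector-store message IDs that have no matching
// row in messages. Soft-deleted messages still have a row, so they are
// not reported as orphans here.
func findOrphans(vectorIDs, messageIDs []int64) []int64 {
	// Index the message IDs so each lookup is O(1).
	present := make(map[int64]struct{}, len(messageIDs))
	for _, id := range messageIDs {
		present[id] = struct{}{}
	}
	var orphans []int64
	for _, id := range vectorIDs {
		if _, ok := present[id]; !ok {
			orphans = append(orphans, id)
		}
	}
	return orphans
}

func main() {
	vectorIDs := []int64{1, 2, 3, 4, 5} // step 1: IDs held by the vector backend
	messageIDs := []int64{1, 3, 5}      // step 2: IDs still present in messages
	// Step 3 would delete the vector-store rows for these IDs.
	fmt.Println(findOrphans(vectorIDs, messageIDs)) // prints [2 4]
}
```

Computing the orphan set first also gives the command a natural dry-run mode: it can report the count before deleting anything.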

Optionally hook it into delete-deduped as a post-step, gated by a flag, so that users who want in-line cleanup get it and users who batch their vector maintenance separately can skip it.
