[bug] lore appraise: semantic arm gated off by stale meta counters even when vectors are healthy #76

@kunallanjewar

Summary

lore appraise gates its RRF semantic arm on meta.vector_coverage_num / meta.vector_coverage_den. These counters are bumped incrementally on every inscribe / seal / restore, and over time they drift from the actual lore_vectors row count. When the drift drops the reported coverage below the 0.90 RRF gate, the semantic arm is silently disabled even though the vectors on disk are perfectly healthy. Result: retrieval falls back to BM25-only and the user has no surface signal that the corpus is degraded.

Reproducer (observed in the wild)

On a live lore.db with 481 vector rows and 506 active entries (true coverage ≈ 95.1%):

sqlite> SELECT key, value FROM meta WHERE key LIKE 'vector_coverage%';
vector_coverage_num|333
vector_coverage_den|543

sqlite> SELECT COUNT(*) FROM lore_vectors;
481

sqlite> SELECT COUNT(*) FROM entries WHERE status NOT IN ('archived','parked');
506

guild lore health reports coverage 333/543 (61%) and tags the embedder as backfilling, even though:

  • embed_error_count = 0
  • embedder_state = enabled
  • embedder_state_reason = ok
  • The encode loop never runs because the auto-backfill trigger uses a different coverage source (live COUNT(*) against lore_vectors and entries) and correctly sees 95%, so it exits as a no-op.

The two sources of truth for the same fact are drifting independently.
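To make the effect of the drift concrete, here is a minimal sketch of a coverage-gated reciprocal rank fusion. It is generic RRF, not lore's actual implementation: the function names, the smoothing constant of 60, and the sample document IDs are all assumptions for illustration. Only the 0.90 threshold and the two coverage readings come from this issue.

```go
package main

import "sort"

// rrfK is the conventional RRF smoothing constant (assumption: the value
// lore uses is not stated in this issue).
const rrfK = 60.0

// gateThreshold is the 0.90 coverage gate described in the summary.
const gateThreshold = 0.90

// fuse merges best-first ID lists by reciprocal rank fusion:
// each document scores the sum of 1/(rrfK+rank) over the arms it appears in.
func fuse(arms ...[]string) []string {
	scores := map[string]float64{}
	for _, arm := range arms {
		for i, id := range arm {
			scores[id] += 1.0 / (rrfK + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break
	})
	return ids
}

// appraise (hypothetical) drops the semantic arm when the coverage reading
// num/den falls below the gate, leaving BM25 as the only arm.
func appraise(bm25, semantic []string, num, den int64) []string {
	if den == 0 || float64(num)/float64(den) < gateThreshold {
		return fuse(bm25)
	}
	return fuse(bm25, semantic)
}
```

With bm25 = [a b c] and semantic = [d c a], the stale reading appraise(bm25, semantic, 333, 543) returns [a b c], since 333/543 ≈ 0.61 fails the gate. The live reading appraise(bm25, semantic, 481, 506) returns [a c d b], because 481/506 ≈ 0.95 clears it and the semantic arm contributes.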

Why this is a P1

  • Silent: nothing surfaces the degradation. Health reports state=enabled, reason=ok. Appraise still returns results, just from BM25 only.
  • Load-bearing: appraise is the single entry point for "should I research this" across every session. Losing the semantic arm is a measurable retrieval-quality regression on paraphrased queries: unique-vocabulary queries are still hit by BM25, but paraphrased queries return recency-bleed noise.
  • Self-healing path is gone: the auto-backfill loop's live-SQL gate sees the corpus as healthy, so it never fires. Only embed-rebuild (destructive) can reset the meta counters indirectly.

Probe evidence

Two appraise queries against the same corpus:

  • Unique-vocabulary query (harness lifecycle hooks autoinject SessionStart compact priming): 5/5 on-target — BM25 alone is enough when the query reuses the entry's literal terms.
  • Paraphrased query (stable secondary sort tie-breaker for newest-first reads): 1/5 on-target, four results are recency-bleed false positives — exactly the failure mode the semantic arm is supposed to fix.

Proposed fix

1. Efficient live-SQL gate in appraise (epoch-cached)

The naive fix — two COUNT(*) queries on every appraise — degrades as the corpus grows. After months or years of usage at 100k+ entries that is roughly 20-50ms per appraise just for the gate, on every retrieval call.

meta.vector_epoch already exists and is bumped atomically on every successful vector write (internal/lore/embed/hot.go) and at every backfill cycle end (internal/lore/embed/backfill.go). It is a free monotonic invalidation key.

Pattern (process-local, atomic.Pointer[gateState]):

type gateState struct{ epoch, num, den int64 }

epoch := readEpoch(ctx, db)               // tiny PK lookup, sub-ms
if cached := gate.Load(); cached != nil && cached.epoch == epoch {
    return cached.num, cached.den, nil    // O(1) hot path
}
// miss:
//   num := COUNT(*) FROM lore_vectors
//   den := COUNT(*) FROM entries WHERE status NOT IN ('archived','parked')
// store new gateState, return

  • Hot-path cost: one PK-indexed meta lookup + atomic pointer compare.
  • The two COUNT(*) queries run only when the corpus has actually changed since the last appraise. Bursty writes amortize to near-zero.
  • A fresh server pays one cold COUNT(*) on its first appraise, which is rounding error against the asset extraction + embedder probe that already runs at startup.
  • Predicates mirror the auto-backfill assessCorpus predicates exactly. No schema changes.
  • Cache is per-process; two servers reading the same epoch make the same decision, so cross-process coherence is automatic for a given epoch.
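The pattern above could be fleshed out roughly as follows. This is a sketch under stated assumptions, not the proposed patch: the source interface, coverageGate, and fakeSource are hypothetical names, and the interface stands in for the real SQL reads so the caching logic can be exercised without a database. It requires Go 1.19 or later for atomic.Pointer.

```go
package main

import "sync/atomic"

// gateState is an immutable coverage snapshot tagged with the epoch at
// which it was computed.
type gateState struct{ epoch, num, den int64 }

// source is a hypothetical seam over the three reads the gate needs. In the
// real code these would be the meta.vector_epoch lookup and the two
// COUNT(*) queries using the assessCorpus predicates.
type source interface {
	Epoch() (int64, error)
	CountVectors() (int64, error) // COUNT(*) FROM lore_vectors
	CountActive() (int64, error)  // COUNT(*) FROM entries WHERE status NOT IN ('archived','parked')
}

// coverageGate caches live counts per process, invalidated by epoch change.
type coverageGate struct {
	src   source
	state atomic.Pointer[gateState]
}

// Coverage returns live (num, den), recounting only when the epoch moved.
func (g *coverageGate) Coverage() (num, den int64, err error) {
	epoch, err := g.src.Epoch()
	if err != nil {
		return 0, 0, err
	}
	if c := g.state.Load(); c != nil && c.epoch == epoch {
		return c.num, c.den, nil // hot path: no COUNT(*)
	}
	if num, err = g.src.CountVectors(); err != nil {
		return 0, 0, err
	}
	if den, err = g.src.CountActive(); err != nil {
		return 0, 0, err
	}
	// Concurrent misses may both recount and the last Store wins; if a
	// snapshot with an older epoch lands last, the next call recounts.
	g.state.Store(&gateState{epoch: epoch, num: num, den: den})
	return num, den, nil
}

// fakeSource is an in-memory stand-in that records how often counts run.
// It is not safe for concurrent use; it exists only for tests.
type fakeSource struct {
	epoch, vectors, active int64
	countCalls             int
}

func (f *fakeSource) Epoch() (int64, error)        { return f.epoch, nil }
func (f *fakeSource) CountVectors() (int64, error) { f.countCalls++; return f.vectors, nil }
func (f *fakeSource) CountActive() (int64, error)  { f.countCalls++; return f.active, nil }
```

A counting fake like fakeSource is one way to check that repeated Coverage calls at an unchanged epoch leave countCalls at 2, and that bumping the epoch triggers exactly one more pair of counts.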

2. Reconcile both counters

guild lore coverage-reconcile currently only resets vector_coverage_den; extend it to also reset vector_coverage_num from COUNT(*) FROM lore_vectors so meta stays honest for other readers (health line, dashboards) and so operators have a one-shot manual fix. This path is operator-invoked, so its cost does not matter.
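As a rough illustration of the intended behavior, the sketch below overwrites both stored counters with live counts and reports the before and after values. The counters type, the reconcile function, and the output format are hypothetical. It works on an in-memory struct rather than the meta table, and a real implementation would presumably read the live counts and write both keys in a single transaction.

```go
package main

import "fmt"

// counters holds the two meta values named in this issue
// (vector_coverage_num and vector_coverage_den).
type counters struct{ num, den int64 }

// reconcile (hypothetical) replaces both stored counters with the live
// counts and returns a one-line report of the before and after values.
func reconcile(stored *counters, liveNum, liveDen int64) string {
	before := *stored
	*stored = counters{num: liveNum, den: liveDen}
	return fmt.Sprintf("vector_coverage_num %d -> %d, vector_coverage_den %d -> %d",
		before.num, stored.num, before.den, stored.den)
}
```

Applied to the reproducer's values, reconcile(&counters{333, 543}, 481, 506) leaves the counters at 481/506 and returns "vector_coverage_num 333 -> 481, vector_coverage_den 543 -> 506".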

Out of scope (follow-ups)

  • Auditing every bump/decrement site (inscribe, seal, restore, invalidate) to find the missing-bump path that produced the original drift. Worth a follow-up, but the live-SQL gate makes the load-bearing reader drift-immune regardless.
  • Migrating every meta.vector_coverage_* reader to live SQL. The session line in lore health is acceptable as a snapshot view; appraise is the gate that actually changes behavior.

Acceptance

  • Appraise gate computes coverage from live COUNT(*), cached on meta.vector_epoch.
  • Regression tests:
    • Seed drift (preset meta.vector_coverage_num below the gate while leaving lore_vectors populated), assert appraise still engages the semantic arm.
    • Assert hot-path appraise reads do not run COUNT(*) when epoch is unchanged (hook a counter or use a query log).
  • coverage-reconcile resets both num and den, with both before/after values in the output.
  • make check clean; no schema changes.

Metadata

Labels

  • area: lore (Lore knowledge archive)
  • bug (Something isn't working)
  • needs-triage (Awaiting initial review)
  • priority: P1 (High priority)
  • size: S (< 50 lines)
