vector: chunk long messages instead of truncating #323
Adds four opt-out (default-on) transforms to Preprocess() so noise-laden
emails (inline base64 images, HTML residue, tracking-tagged links,
HTML→text whitespace bloat) tokenize back down inside the embedder's
context window:
  strip_html           strip <style>/<script> blocks, generic <tags>,
                       decode HTML entities
  strip_base64         strip data:...;base64,... URIs and bare base64
                       runs >=200 chars (excluding '/' so URL paths
                       survive)
  strip_url_tracking   drop utm_*, fbclid, gclid, etc. query params
  collapse_whitespace  normalize CRLF -> LF, trim per-line trailing
                       whitespace, collapse runs of >=3 newlines, runs
                       of >=2 horizontal spaces
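As a rough illustration of the strip_url_tracking transform, here is a minimal sketch for a single URL. The helper names (`isTrackingParam`, `stripTracking`) and the exact key list are assumptions for illustration; the PR's real transform works on message text and may match a different set of parameters.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isTrackingParam reports whether a query key looks like a campaign tag.
// The key list is illustrative only, not the PR's actual list.
func isTrackingParam(key string) bool {
	k := strings.ToLower(key)
	return strings.HasPrefix(k, "utm_") || k == "fbclid" || k == "gclid"
}

// stripTracking drops tracking params from one URL. Input that does not
// parse, or that has no query string, is returned unchanged.
func stripTracking(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.RawQuery == "" {
		return raw
	}
	q := u.Query()
	for key := range q {
		if isTrackingParam(key) {
			q.Del(key) // deleting during range is safe for Go maps
		}
	}
	u.RawQuery = q.Encode() // note: Encode sorts the surviving keys
	return u.String()
}

func main() {
	fmt.Println(stripTracking("https://example.com/a?utm_source=x&id=7&fbclid=abc"))
}
```

Because `url.Values.Encode` sorts keys, the surviving parameters may come back in a different order than they appeared; that is harmless for embedding input but worth knowing.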
Motivation: while building embeddings for a 2.2M-message corpus on
nomic-embed-text (8192-token window), ~1.7% of messages tripped the
endpoint's context-length check even after the 6000-char rune cap. The
offenders were almost always polluted with one of the four patterns
above: a 30KB inline image, leaked <table style="..."> markup, or
campaign-tagged URLs repeating across newsletters. Stripping these
shrinks dense input to clean prose without semantic loss, eliminates
the downshift-to-batch-size=1 sawtooth that capped real throughput at
~10 msg/s, and improves vector quality by not averaging the embedding
over CSS gibberish.
Config follows the existing PreprocessConfig pattern: *bool in the TOML
tier (nil = "default true", explicit `false` preserved verbatim), plain
bool in the runtime tier, helpers like StripHTMLEnabled() bridge the
two. Both call sites (build-embeddings + the live worker spawned by
`serve`) are wired symmetrically.
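The tristate described above can be sketched as follows. The struct name `tomlPreprocess` is hypothetical; only the `*bool` field, the nil-means-true rule, and the `StripHTMLEnabled()` helper name come from the description.

```go
package main

import "fmt"

// tomlPreprocess stands in for the TOML tier (hypothetical name).
// A nil pointer means the key was absent from the config file.
type tomlPreprocess struct {
	StripHTML *bool
}

// StripHTMLEnabled resolves the tristate to the runtime bool:
// nil -> true (default on), otherwise the explicit value.
func (c tomlPreprocess) StripHTMLEnabled() bool {
	if c.StripHTML == nil {
		return true
	}
	return *c.StripHTML
}

func main() {
	off := false
	fmt.Println(tomlPreprocess{}.StripHTMLEnabled())                // key absent
	fmt.Println(tomlPreprocess{StripHTML: &off}.StripHTMLEnabled()) // explicit false
}
```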
Pipeline order matters and is deliberate:
1. CRLF normalization (line-oriented regexes assume LF)
2. base64 / data: URI strip (runs before HTML so an oversized
<img src="data:..."> -- longer than reHTMLTag's 500-char ceiling
-- has its payload removed first, leaving a small enough tag for
the subsequent HTML pass to sweep)
3. HTML strip + entity decode
4. URL tracking-param strip
5. existing quote/signature strip
6. whitespace collapse
7. TrimSpace + Subject prefix + rune-bounded truncation
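To see why step 2 has to precede step 3, the sketch below runs both orders over an oversized inline image. The data-URI pattern and the function names are assumptions for illustration; the tag pattern is the 500-char-capped `<[^>]{0,500}>` mentioned above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Assumed shape of a data: URI matcher; the PR's real pattern may differ.
	reDataURI = regexp.MustCompile(`data:[^;,\s]+;base64,[A-Za-z0-9+/=]+`)
	// Tag matcher with the 500-char ceiling.
	reHTMLTag = regexp.MustCompile(`<[^>]{0,500}>`)
)

func stripBase64(s string) string { return reDataURI.ReplaceAllString(s, "") }
func stripHTML(s string) string   { return reHTMLTag.ReplaceAllString(s, "") }

// base64ThenHTML is the deliberate order: payload first, then tags.
func base64ThenHTML(s string) string { return stripHTML(stripBase64(s)) }

// htmlThenBase64 is the reversed order, shown only for contrast.
func htmlThenBase64(s string) string { return stripBase64(stripHTML(s)) }

// demoInput builds an <img> tag far longer than the 500-char ceiling.
func demoInput() string {
	return `Hi <img src="data:image/png;base64,` + strings.Repeat("A", 2000) + `"> there`
}

func main() {
	fmt.Printf("%q\n", base64ThenHTML(demoInput()))
	fmt.Printf("%q\n", htmlThenBase64(demoInput()))
}
```

In the reversed order the tag is too long to match, so the HTML pass leaves it alone and the later base64 pass strands an empty `<img src="">` in the text.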
Tests cover each transform in isolation, three regression cases
(URL-paths-look-base64, oversized-img-tag, CRLF normalization), and a
full-pipeline end-to-end. Config-tier tests verify the new toggles
honour the same nil/true/false tristate semantics as the existing pair.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the original `<[^>]{0,500}>` with a stricter tag-name
pattern `</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>` so the
stripper no longer eats text that merely contains angle brackets:
John <john@example.com> kept verbatim (@ rejects tag-name)
See <https://example.com>. kept verbatim (: rejects tag-name)
x < 3 and y > 4 kept verbatim (space-then-digit rejects)
<Aug 6, 2026> kept verbatim (space rejects)
Real HTML tags (<p>, <br/>, <a href="...">, </div>, <table style="...">)
continue to match. The {0,400} attribute-body cap is moved inside an
optional non-capturing group that only fires when a whitespace-then-
attributes section actually follows the tag name, so the stripper
treats `<p>` and `<a href="...">` symmetrically.
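A quick way to exercise the stricter pattern is to compile it exactly as quoted and run it over a few of the cases listed. The wrapper name `stripTags` is an assumption; the pattern itself is copied from the text above.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern copied verbatim from the commit message.
var reHTMLTag = regexp.MustCompile(`</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>`)

// stripTags removes every match of the tag pattern.
func stripTags(s string) string { return reHTMLTag.ReplaceAllString(s, "") }

func main() {
	for _, s := range []string{
		"John <john@example.com>",
		"See <https://example.com>.",
		"x < 3 and y > 4",
		`<p>hi</p><br/><a href="x">link</a>`,
	} {
		fmt.Printf("%q -> %q\n", s, stripTags(s))
	}
}
```

One caveat: applied literally, the quoted pattern also matches `<Aug 6, 2026>`, because `Aug` is a valid letter-led name and the rest satisfies the whitespace-led attribute body. The "kept verbatim" result for that case presumably depends on handling that is not shown in this message.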
Caught by roborev on PR kenn-io#322.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces "one embedding per message" with "one embedding per chunk",
keyed by (generation_id, message_id, chunk_index). A long email is
split via a sliding window (size = MaxInputChars runes, overlap ≈ 3%
floored at zero for small windows, soft breaks at
paragraph→sentence→word in the back quarter of each window), each
window is embedded, and the search path collapses results by
message_id keeping the best chunk score.
Motivation: with nomic-embed-text (8,192-token window) and
max_input_chars=6000, ~1.7% of messages in a 2.2M-message corpus still tripped
the embedder's context limit even after sanitization. The downshift
recovery sawtoothed throughput from 42 msg/s to ~11. With chunking,
every input fits by construction and the failure mode disappears; long
emails also keep their tail content instead of losing it to runtime
truncation.
Schema changes (single transactional migration, idempotent):
embeddings:
+ embedding_id INTEGER PRIMARY KEY AUTOINCREMENT -- synthetic
-- rowid for vec0
+ chunk_index INTEGER NOT NULL DEFAULT 0
+ chunk_char_start INTEGER NOT NULL DEFAULT 0 -- debug-only
+ chunk_char_end INTEGER NOT NULL DEFAULT 0 -- debug-only
PRIMARY KEY → UNIQUE (generation_id, message_id, chunk_index)
vectors_vec_dN (vec0):
PARTITION KEY generation_id
PRIMARY KEY embedding_id (was message_id)
embedding FLOAT[N]
Legacy migration preserves embedding_id == message_id for every
already-embedded row so existing vec0 rowids are reusable; the vec0
table is rebuilt in-place (drop + recreate + re-insert) inside a
single transaction. The AUTOINCREMENT counter is bumped past every
legacy rowid so new chunk inserts cannot collide.
Worker change (internal/vector/embed/worker.go):
Preprocess now runs with maxChars=0 (no truncation); ChunkText splits
the full preprocessed text into windows; each window becomes one
embedder input. Multiple chunks per message embed in the same batch
call. The pending_embeddings queue stays per-message: a multi-chunk
message completes when all its chunks are upserted.
Search changes (sqlitevec/backend.go, sqlitevec/fused.go):
Both the filtered and empty-filter Search paths JOIN vec0 through
embeddings and GROUP BY message_id with MIN(distance), keeping the
best-scoring chunk per message. A chunkOverfetchFactor (4×) is
multiplied into the requested k so the GROUP BY has enough chunks
to recover k distinct messages even when several messages each
contribute multiple chunks to the top-k. The fused-search ANN CTE
becomes ann_chunks → ann (GROUP BY message_id) under the same
factor.
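The GROUP BY message_id / MIN(distance) collapse can be mirrored in plain Go, which makes it easier to see why overfetching is needed. This is a sketch of the idea only; the `chunkHit` type and `collapseByMessage` function are hypothetical, and the real work happens in SQL.

```go
package main

import (
	"fmt"
	"sort"
)

// chunkHit is one ANN result row: a chunk and its distance to the query.
type chunkHit struct {
	MessageID int64
	Distance  float64
}

// collapseByMessage keeps the smallest distance per message, sorts by
// distance (ties broken by message ID), and returns at most k messages.
func collapseByMessage(hits []chunkHit, k int) []chunkHit {
	best := make(map[int64]float64)
	for _, h := range hits {
		if d, ok := best[h.MessageID]; !ok || h.Distance < d {
			best[h.MessageID] = h.Distance
		}
	}
	out := make([]chunkHit, 0, len(best))
	for id, d := range best {
		out = append(out, chunkHit{MessageID: id, Distance: d})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Distance != out[j].Distance {
			return out[i].Distance < out[j].Distance
		}
		return out[i].MessageID < out[j].MessageID
	})
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	hits := []chunkHit{{1, 0.30}, {1, 0.10}, {2, 0.20}, {3, 0.50}, {2, 0.05}}
	fmt.Println(collapseByMessage(hits, 2))
}
```

Five chunk rows collapse to three messages here, which is the shrinkage the overfetch factor is meant to compensate for.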
Stats / LoadVector:
Stats.EmbeddingCount counts DISTINCT message_id so the progress-bar
invariant (Done / Total in messages) survives chunking.
LoadVector returns chunk_index=0 (the head of the message) — the
one consumer (find_similar) wants a representative vector, not all
chunks.
ChunkText (internal/vector/embed/chunk.go): pure function, sliding
window with overlap, soft-break preference paragraph→sentence→word
→ hard cut, clamps overlap to ≤ maxRunes/2 to prevent infinite loops,
returns rune-space CharStart/CharEnd offsets for each chunk so
backends can recover the source substring later.
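A much-simplified sketch of the windowing idea follows. It is not the PR's ChunkText: it collapses the paragraph/sentence/word preference into a single "last whitespace in the back quarter" rule, and the names `span` and `chunkRunes` are hypothetical. It does keep the overlap clamp and the rune-space offsets.

```go
package main

import (
	"fmt"
	"unicode"
)

// span holds rune offsets into the source text; End is exclusive.
type span struct{ Start, End int }

// chunkRunes slides a window of maxRunes over text. It prefers to cut at
// whitespace within the back quarter of the window and hard-cuts otherwise.
// Overlap is clamped to maxRunes/2 so every step makes forward progress.
func chunkRunes(text string, maxRunes, overlap int) []span {
	r := []rune(text)
	if len(r) == 0 || maxRunes <= 0 {
		return nil
	}
	if overlap > maxRunes/2 {
		overlap = maxRunes / 2
	}
	if overlap < 0 {
		overlap = 0
	}
	var spans []span
	for start := 0; start < len(r); {
		end := start + maxRunes
		if end >= len(r) {
			spans = append(spans, span{start, len(r)})
			break
		}
		// Soft break: scan backwards through the back quarter only.
		for i, floor := end, end-maxRunes/4; i > floor; i-- {
			if unicode.IsSpace(r[i-1]) {
				end = i
				break
			}
		}
		spans = append(spans, span{start, end})
		start = end - overlap
	}
	return spans
}

func main() {
	text := "aaaa bbbb cccc dddd"
	r := []rune(text)
	for _, s := range chunkRunes(text, 8, 0) {
		fmt.Printf("[%d,%d) %q\n", s.Start, s.End, string(r[s.Start:s.End]))
	}
}
```

The clamp matters because each window is at least three quarters of maxRunes long after a soft break; with overlap at most half of maxRunes, the next start always lands past the previous one.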
Tests cover: ChunkText edge cases (empty, single-span, paragraph cut,
sentence cut, word cut, hard cut, overlap clamp, UTF-8 mixed scripts,
end-to-end coverage); multi-chunk Upsert + ReplaceFewerChunks
idempotency; the legacy-schema → chunked-schema migration with
embedding_id and rowid preservation; the worker fanning out a single
long pending message into multiple chunks and draining the queue in
one shot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a maxSpans parameter to ChunkText and uses it from the worker via the maxSpansPerMessage=64 constant. The cap protects the embed batch from system-generated dumps (10+ MB stack traces, error-flow forwards) that would otherwise produce thousands of chunks for a single message, flatten into a single embed call, and trip the API timeout.

Discovered in a real-world build against a 2.2M-message corpus: seven Salesforce automation-error notifications carried 5–15 MB of body text each, chunking into 600–4,170 spans, which mixed into batch_size=128 batches and pushed the resulting embed call past the 60s timeout ceiling.

With the cap at 64 spans (256 KB of content per message at a 4 KB window), every legitimate long-form email survives intact while pathological inputs lose only their tail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three roborev kenn-io#323 follow-ups, all real, two seen in production:

1. Migration: legacy `embedding_id := message_id` shortcut collided on the new UNIQUE(generation_id, message_id, chunk_index) constraint whenever the same message_id appeared under multiple generations (which the pre-chunking PK (generation_id, message_id) allowed, typically an active gen + a building gen overlapping during a rebuild). The fix drops embedding_id from the INSERT so AUTOINCREMENT allocates a fresh, globally-unique rowid per legacy row, and the vec0 rebuild now looks up the new embedding_id per (generation_id, message_id) via a mapping built from the just-migrated embeddings table rather than carrying message_id as the new rowid. Adds TestMigrate_LegacyToChunked_MultiGenerationCollision to pin the case.

2. Worker: a 128-message batch with chunking could fan out to 64 × 128 = 8,192 embedder inputs in a single Embed call, exceeding Ollama's ~250-input request limit and tripping the 60s API timeout. This is the same failure we hit live yesterday at 50 msg/s → 1 msg/s. The fix splits the flattened input slice into sub-batches of at most BatchSize, runs each sub-batch as its own Embed call, and concatenates results before assembling chunks. The pending queue stays per-message: a message completes only once every one of its chunks has been embedded and upserted in the same RunOnce iteration. Adds TestWorker_SplitsChunkInputsAcrossSubBatches.

3. Filtered Search: the empty-filter path widens the chunk overfetch via a doubling loop until enough distinct messages survive the GROUP BY collapse; the filtered path used a fixed k * 4. With long messages contributing many top-k chunks the filtered path could short-return below k even when enough filtered matches existed. The fix lifts the same doubling loop into the filtered path, bounded by COUNT(*) FROM embeddings WHERE generation_id = ? so the loop always terminates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
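The sub-batching fix in item 2 can be sketched like this. The `embedFunc` type and the function names are assumptions; the real worker's Embed signature is not shown in this thread.

```go
package main

import "fmt"

// embedFunc stands in for one embedder API call (hypothetical signature).
type embedFunc func(batch []string) ([][]float32, error)

// subBatches splits inputs into consecutive slices of at most batchSize.
func subBatches(inputs []string, batchSize int) [][]string {
	if batchSize <= 0 {
		batchSize = 1
	}
	var out [][]string
	for start := 0; start < len(inputs); start += batchSize {
		end := start + batchSize
		if end > len(inputs) {
			end = len(inputs)
		}
		out = append(out, inputs[start:end])
	}
	return out
}

// embedAll runs one call per sub-batch and concatenates the vectors in
// input order, so result i always corresponds to input i.
func embedAll(inputs []string, batchSize int, embed embedFunc) ([][]float32, error) {
	vecs := make([][]float32, 0, len(inputs))
	for _, b := range subBatches(inputs, batchSize) {
		got, err := embed(b)
		if err != nil {
			return nil, err
		}
		if len(got) != len(b) {
			return nil, fmt.Errorf("embedder returned %d vectors for %d inputs", len(got), len(b))
		}
		vecs = append(vecs, got...)
	}
	return vecs, nil
}

func main() {
	fake := func(b []string) ([][]float32, error) {
		out := make([][]float32, len(b))
		for i := range b {
			out[i] = []float32{float32(len(b[i]))}
		}
		return out, nil
	}
	vecs, err := embedAll([]string{"a", "bb", "ccc", "dddd", "eeeee"}, 2, fake)
	fmt.Println(len(vecs), err)
}
```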
Mirrors the doubling loop already in backend.Search but for the FusedSearch CTE: extracts the SQL into a buildQuery closure parameterised on chunkK, then re-issues the query with a growing chunk fetch when the first pass returns fewer than KPerSignal+1 distinct ANN messages. Bounded by COUNT(*) FROM embeddings WHERE generation_id = ? so the loop terminates even when the corpus genuinely has fewer matches than the requested K.

Without this, a query whose top chunks all pile up onto a few long messages collapses to far fewer than KPerSignal distinct ANN candidates, and the fused result loses messages that would have ranked further down the chunk-distance order. Same shape of bug as the empty-filter and filtered Search paths, caught by roborev on the previous push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
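The shape of the bounded doubling loop, stripped of SQL, might look like the sketch below. `widenUntil`, `distinctPrefix`, and the `fetch` callback are hypothetical stand-ins for re-issuing the query with a larger chunkK; `total` plays the role of the COUNT(*) bound.

```go
package main

import "fmt"

// distinctPrefix returns the distinct message IDs among the first n chunk
// rows, in first-seen order. It simulates "fetch n chunks, then GROUP BY".
func distinctPrefix(chunks []int64, n int) []int64 {
	if n > len(chunks) {
		n = len(chunks)
	}
	seen := make(map[int64]bool)
	var ids []int64
	for _, id := range chunks[:n] {
		if !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids
}

// widenUntil doubles the chunk fetch until k distinct messages come back
// or the fetch already covers every chunk row, whichever happens first.
func widenUntil(k, factor, total int, fetch func(chunkK int) []int64) []int64 {
	chunkK := k * factor
	if chunkK < 1 {
		chunkK = 1 // guard so doubling always grows
	}
	for {
		if chunkK > total {
			chunkK = total
		}
		ids := fetch(chunkK)
		if len(ids) >= k || chunkK >= total {
			if len(ids) > k {
				ids = ids[:k]
			}
			return ids
		}
		chunkK *= 2
	}
}

func main() {
	// One long message (ID 1) owns the eight nearest chunks.
	chunks := []int64{1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3}
	fetch := func(n int) []int64 { return distinctPrefix(chunks, n) }
	fmt.Println(widenUntil(2, 4, len(chunks), fetch))
}
```

With k=2 and a 4× factor the first pass sees only message 1; the second pass, clamped to the row count, recovers message 2 as well.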
ChunkText used to allocate a rune-byte offset table for the entire input before maxSpans had any effect. For a 15 MB body (a Salesforce flow-error forward, an mbox attachment that leaked into body_text), that's a ~120 MB allocation per call even though only the first 64 chunks worth of content can ever be emitted. A burst of synced oversized messages could OOM the embedding worker and stall vector indexing.

The chunker only ever reads ahead within a window of maxRunes runes, so any content past maxSpans*maxRunes will be dropped on the floor by the in-loop maxSpans guard anyway. Truncating the byte slice up front to that bound is lossless for the spans it would otherwise emit, and turns the worst-case allocation from O(body_size) into O(maxSpans * maxRunes).

Adds TestChunkText/MaxSpansCapsInputBytesProcessed: a 10M-rune input that completes in the same wall-time as the 1K-rune cases. Without the cap, this test would dominate the suite and the alloc profile would balloon to ~80 MB; with the cap both costs are flat.

Caught by roborev on PR kenn-io#323 (e83967b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
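One way to truncate up front without building an offset table for the whole body is to walk the string only as far as the cap. This is a sketch under that assumption; `capRunes` is a hypothetical name and the PR's actual truncation code is not shown here.

```go
package main

import "fmt"

// capRunes returns at most maxRunes runes of s, cut on a rune boundary,
// plus a flag reporting whether anything was dropped. It allocates nothing
// and stops scanning as soon as the cap is reached.
func capRunes(s string, maxRunes int) (string, bool) {
	if maxRunes <= 0 {
		return "", len(s) > 0
	}
	n := 0
	for i := range s { // i is the byte offset where each rune starts
		if n == maxRunes {
			return s[:i], true
		}
		n++
	}
	return s, false
}

func main() {
	const maxSpans, maxRunes = 64, 4096 // values taken from the text above
	body := "héllo wörld"
	head, dropped := capRunes(body, 4)
	fmt.Printf("%q %v (budget would be %d runes)\n", head, dropped, maxSpans*maxRunes)
}
```

The returned flag is the kind of signal a caller could use to record that tail content was lost.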
Two roborev kenn-io#323 (717ac4c) follow-ups, both real:

1. Preprocess input cap. The chunker's input cap (added previously) protects ChunkText from walking the whole body, but the regex passes inside Preprocess (StripHTML, StripBase64, StripURLTracking, whitespace collapse) still run on the full input: O(body_len) CPU and similar-size scratch allocations per transform. A 100 MB body would burn seconds before the chunker drops the tail anyway. Cap the raw body at MaxInputChars * maxSpansPerMessage * rawBodyMultiplier (with rawBodyMultiplier=16 to leave room for sanitize to strip noise) before any sanitize transform sees it. Adds TestWorker_CapsRawBodyBeforePreprocess.

2. RunResult.Truncated double-counting. truncated incremented per chunk, but Succeeded counts messages, so a single long message with N hard-cut chunks reported as N truncations against 1 success, making the "what fraction was truncated" metric nonsensical. Track distinct message_ids with at least one truncated chunk and increment the counter once per message. Adds TestWorker_TruncatedCountedPerMessageNotPerChunk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
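The per-message accounting in item 2 amounts to counting distinct message IDs rather than chunks. A minimal sketch, with hypothetical type and function names:

```go
package main

import "fmt"

// chunkResult is one embedded chunk and whether it was hard-cut.
type chunkResult struct {
	MessageID int64
	Truncated bool
}

// countTruncatedMessages counts messages with at least one truncated
// chunk, so the result is comparable with a per-message success count.
func countTruncatedMessages(chunks []chunkResult) int {
	seen := make(map[int64]struct{})
	for _, c := range chunks {
		if c.Truncated {
			seen[c.MessageID] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	chunks := []chunkResult{
		{1, true}, {1, true}, {1, true}, // one long message, three hard cuts
		{2, false},
		{3, true},
	}
	fmt.Println(countTruncatedMessages(chunks))
}
```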
Move MaxBodyRunes from a worker-side raw-input cap to a stage inside Preprocess that runs *after* CRLF normalization and StripBase64 but *before* StripHTML and the rest.

The previous cap fired on raw input, so a body whose first megabyte was an inline base64 image got chopped to "just the blob": the prose tail past the cap never reached sanitize, and the resulting embedding was empty. By doing the cheap pollution removal first, the same body lands at the cap with only ~2 KB left where the blob used to be, and the prose tail survives intact. Heavy regex passes (StripHTML, StripURLTracking, whitespace collapse) still operate on a bounded input, so the resource ceiling that motivated the earlier cap is preserved.

The bool returned by Preprocess now also signals "body cap fired", propagated through msgText.BodyTruncated to every chunk's Truncated flag so per-message accounting downstream picks up cap-induced truncation alongside hard-cut chunks.

Caught by roborev on PR kenn-io#323 (2d8f45d). Adds TestWorker_PrefixBase64DoesNotHidePoseTail: 2 MB of base64 ahead of a sentinel prose tail; the sentinel must appear in the embedder inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
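The difference between capping before and after the base64 strip shows up clearly on a toy body. The bare-run pattern and all function names below are assumptions for illustration, and the sizes are scaled far down from the 2 MB case in the test.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumed matcher for bare base64 runs of >=200 chars ('/' excluded).
var reBase64Run = regexp.MustCompile(`[A-Za-z0-9+]{200,}={0,2}`)

// truncateRunes keeps at most max runes of s.
func truncateRunes(s string, max int) string {
	r := []rune(s)
	if len(r) > max {
		return string(r[:max])
	}
	return s
}

// capThenStrip is the old order: the cap sees the raw blob first.
func capThenStrip(body string, max int) string {
	return reBase64Run.ReplaceAllString(truncateRunes(body, max), "")
}

// stripThenCap is the new order: the blob is gone before the cap applies.
func stripThenCap(body string, max int) string {
	return truncateRunes(reBase64Run.ReplaceAllString(body, ""), max)
}

// demoBody is 4,000 chars of base64-looking text ahead of a prose tail.
func demoBody() string {
	return strings.Repeat("QUJD", 1000) + "\n\nSENTINEL prose tail"
}

func main() {
	fmt.Printf("%q\n", capThenStrip(demoBody(), 400))
	fmt.Printf("%q\n", stripThenCap(demoBody(), 400))
}
```

With the old order the capped text is nothing but blob, which the strip then empties; with the new order the sentinel reaches the output untouched.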
…nt as truncations

Caught by roborev on PR kenn-io#323 (7d8d80a): when ChunkText hits its maxSpans cap but the last emitted chunk happens to land on a clean soft break, the per-chunk hard-cut Trunc flag stays false for every emitted chunk. So a message that lost everything past chunk 64 would report as fully embedded; RunResult.Truncated and embeddings.truncated would both miss it.

ChunkText now returns (spans, tailDropped). tailDropped is true whenever content was dropped past the last emitted span, either from the input pre-cap or from the in-loop maxSpans guard, and the worker ORs it with the body-truncation flag onto every chunk's Trunc, so per-message truncation accounting picks up cap-induced loss regardless of where the last chunk happened to cut.

Adds two regression tests:
- TailDroppedFlagsCapWhenLastChunkLandsOnSoftBreak: 5-chunks-worth of prose ending in sentence terminators, capped at maxSpans=2. The last emitted chunk lands cleanly, so the old per-chunk flag would have missed the truncation; tailDropped must surface it.
- TailDroppedFalseWhenAllContentEmitted: counter-test confirming a short input flagged tailDropped=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thank you for working on this! I will review and merge when I can
Summary
Replaces "one embedding per message" with "one embedding per chunk", keyed by `(generation_id, message_id, chunk_index)`. Long emails are split via a sliding window (overlap ≈ 3% of window, soft breaks at paragraph → sentence → word → hard cut); each window is embedded independently; the search path joins back through `embedding_id`, groups by `message_id`, and keeps the best chunk score per message.

Motivation
While building embeddings against a 2.2M-message corpus on `nomic-embed-text` (8,192-token window) with `max_input_chars = 6000`, ~1.7% of messages still tripped Ollama's context-length check. Most were polluted (inline base64, leaked HTML/CSS; addressed by #322), but the residual long tail (CJK threads, legitimate 50K-char prose) can never fit in a single 8K window. Truncation lost the tail; the worker's downshift recovery flattened throughput from 42 msg/s → ~11.

With chunking every embedder input fits by construction (the failure mode disappears), and long messages get their entire content into the index instead of losing tail prose.

Schema changes (single transactional migration)
Migration preserves `embedding_id = message_id` for legacy rows so existing vec0 rowids stay valid; the vec0 table is rebuilt in-place (drop + recreate + re-insert) inside a single transaction. The AUTOINCREMENT counter is bumped past every legacy rowid so new chunk inserts cannot collide.

Read / write path
- Worker: `Preprocess` now runs with `maxChars=0` (no truncation); `ChunkText` slices the full preprocessed text into windows; each window becomes one embedder input. `pending_embeddings` stays per-message: a multi-chunk message completes when all its chunks are upserted in the same batch.
- Search: `JOIN vec0 ON v.embedding_id = e.embedding_id`, `GROUP BY e.message_id`, `MIN(distance)`. A new `chunkOverfetchFactor` (4×) is applied to the requested `k` so the GROUP BY has enough chunks to recover `k` distinct messages even when several messages contribute multiple top-k chunks.
- Fused search: the `ann` CTE becomes `ann_chunks → ann (GROUP BY message_id, MIN(distance), ROW_NUMBER)` under the same factor.
- Stats: `EmbeddingCount` is `COUNT(DISTINCT message_id)` so the progress-bar invariant (Done/Total in messages) survives the layout change.
- LoadVector: returns `chunk_index = 0`; the only consumer (`find_similar`) wants a representative vector, not the full chunk fan-out.

Tests
- … `chunk_index`, joins back to vec0 through `embedding_id`, and `message_count` stays a per-message count.
- `TestBackend_Upsert_ReplaceFewerChunks` pins the contract that re-upserting with fewer chunks vacates the stale rows, which is critical because chunk fan-out can change between upserts when preprocessing rules evolve.
- `TestMigrate_LegacyToChunked` hand-builds the pre-chunking schema, runs `Migrate`, and asserts every legacy row survives as `chunk_index=0` with `embedding_id == legacy message_id`; the vec0 join is verifiable; `sqlite_sequence` is bumped past every legacy rowid; a second `Migrate` is a no-op.
- `TestWorker_FansOutLongMessageIntoMultipleChunks` drives a single long message through the worker, asserts ≥2 chunk rows with consecutive `chunk_index`, `message_count = 1`, and that the queue drains in one Complete.

Notes
- `chunk_char_start` / `chunk_char_end` are stored but not yet surfaced in search results. They are cheap (8 bytes/chunk extra) and enable future per-paragraph highlighting without another migration.
- Overlap is `maxRunes/30` (≈ 3%), floored to 0 for `maxRunes < 200`. Not currently exposed as config; rationale is documented inline.

Co-Authored-By: Claude Opus 4.7 (1M context)