Skip to content

vector: chunk long messages instead of truncating#323

Open
hansn74 wants to merge 10 commits into
kenn-io:mainfrom
hansn74:feat/embed-chunking
Open

vector: chunk long messages instead of truncating#323
hansn74 wants to merge 10 commits into
kenn-io:mainfrom
hansn74:feat/embed-chunking

Conversation

@hansn74
Copy link
Copy Markdown
Contributor

@hansn74 hansn74 commented May 13, 2026

Summary

Replaces "one embedding per message" with "one embedding per chunk", keyed by (generation_id, message_id, chunk_index). Long emails are split via a sliding window (overlap ≈ 3% of window, soft breaks at paragraph → sentence → word → hard cut); each window is embedded independently; the search path joins back through embedding_id, groups by message_id, and keeps the best chunk score per message.

Motivation

While building embeddings against a 2.2M-message corpus on nomic-embed-text (8,192-token window) with max_input_chars = 6000, ~1.7% of messages still tripped Ollama's context-length check. Most were polluted (inline base64, leaked HTML/CSS — addressed by #322), but the residual long-tail (CJK threads, legitimate 50K-char prose) can never fit in a single 8K window. Truncation lost the tail; the worker's downshift recovery flattened throughput from 42 msg/s → ~11.

With chunking every embedder input fits by construction (the failure mode disappears), and long messages get their entire content into the index instead of losing tail prose.

Schema changes (single transactional migration)

embeddings:
  + embedding_id     INTEGER PRIMARY KEY AUTOINCREMENT  -- synthetic rowid joined to vec0
  + chunk_index      INTEGER NOT NULL DEFAULT 0
  + chunk_char_start INTEGER NOT NULL DEFAULT 0          -- debug-only metadata
  + chunk_char_end   INTEGER NOT NULL DEFAULT 0
  PRIMARY KEY → UNIQUE (generation_id, message_id, chunk_index)

vectors_vec_dN (vec0):
  generation_id INTEGER PARTITION KEY
  embedding_id  INTEGER PRIMARY KEY                     -- was message_id
  embedding     FLOAT[N]

Migration preserves embedding_id = message_id for legacy rows so existing vec0 rowids stay valid; the vec0 table is rebuilt in-place (drop + recreate + re-insert) inside a single transaction. The AUTOINCREMENT counter is bumped past every legacy rowid so new chunk inserts cannot collide.

Read / write path

  • Worker: Preprocess now runs with maxChars=0 (no truncation); ChunkText slices the full preprocessed text into windows; each window becomes one embedder input. pending_embeddings stays per-message — a multi-chunk message completes when all its chunks are upserted in the same batch.
  • Search: both filtered and empty-filter paths JOIN vec0 ON v.embedding_id = e.embedding_id, GROUP BY e.message_id, MIN(distance). A new chunkOverfetchFactor (4×) is applied to the requested k so the GROUP BY has enough chunks to recover k distinct messages even when several messages contribute multiple top-k chunks.
  • FusedSearch: the ann CTE becomes ann_chunks → ann (GROUP BY message_id, MIN(distance), ROW_NUMBER) under the same factor.
  • Stats: EmbeddingCount is COUNT(DISTINCT message_id) so the progress-bar invariant (Done/Total in messages) survives the layout change.
  • LoadVector: returns chunk_index = 0 — the only consumer (find_similar) wants a representative vector, not the full chunk fan-out.

Tests

  • ChunkText: empty, single-span, paragraph/sentence/word/hard cuts, overlap-clamp infinite-loop guard, UTF-8 mixed-script integrity, end-to-end coverage check.
  • Upsert: multi-chunk fan-out lays down N rows with distinct chunk_index, joins back to vec0 through embedding_id, and message_count stays a per-message count.
  • Upsert idempotency: TestBackend_Upsert_ReplaceFewerChunks pins the contract that re-upserting with fewer chunks vacates the stale rows — critical because chunk fan-out can change between upserts when preprocessing rules evolve.
  • Migration: TestMigrate_LegacyToChunked hand-builds the pre-chunking schema, runs Migrate, and asserts every legacy row survives as chunk_index=0 with embedding_id == legacy message_id; the vec0 join is verifiable; sqlite_sequence is bumped past every legacy rowid; a second Migrate is a no-op.
  • Worker: TestWorker_FansOutLongMessageIntoMultipleChunks drives a single long message through the worker, asserts ≥2 chunk rows with consecutive chunk_index, message_count = 1, and that the queue drains in one Complete.

Notes

  • chunk_char_start/chunk_char_end are stored but not yet surfaced in search results. Cheap (8 bytes/chunk extra) and enable future per-paragraph highlighting without another migration.
  • Overlap is hardcoded at maxRunes/30 (≈ 3%), floored to 0 for maxRunes < 200. Not currently exposed as config; rationale is documented inline.
  • Builds on embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip #322 (sanitize) but doesn't depend on it mechanically — these can land in either order.

Co-Authored-By: Claude Opus 4.7 (1M context)

Adds four opt-in (default-on) transforms to Preprocess() so noise-laden
emails (inline base64 images, HTML residue, tracking-tagged links,
HTML→text whitespace bloat) tokenize back down inside the embedder's
context window:

  strip_html           strip <style>/<script> blocks, generic <tags>,
                       decode HTML entities
  strip_base64         strip data:...;base64,... URIs and bare base64
                       runs >=200 chars (excluding '/' so URL paths
                       survive)
  strip_url_tracking   drop utm_*, fbclid, gclid, etc. query params
  collapse_whitespace  normalize CRLF -> LF, trim per-line trailing
                       whitespace, collapse runs of >=3 newlines, runs
                       of >=2 horizontal spaces

Motivation: while building embeddings for a 2.2M-message corpus on
nomic-embed-text (8192-token window), ~1.7% of messages tripped the
endpoint's context-length check even after the 6000-char rune cap. The
offenders were almost always polluted with one of the four patterns
above: a 30KB inline image, leaked <table style="..."> markup, or
campaign-tagged URLs repeating across newsletters. Stripping these
shrinks dense input to clean prose without semantic loss, eliminates
the downshift-to-batch-size=1 sawtooth that capped real throughput at
~10 msg/s, and improves vector quality by not averaging the embedding
over CSS gibberish.

Config follows the existing PreprocessConfig pattern: *bool in the TOML
tier (nil = "default true", explicit `false` preserved verbatim), plain
bool in the runtime tier, helpers like StripHTMLEnabled() bridge the
two. Both call sites (build-embeddings + the live worker spawned by
`serve`) are wired symmetrically.

Pipeline order matters and is deliberate:
  1. CRLF normalization (line-oriented regexes assume LF)
  2. base64 / data: URI strip (runs before HTML so an oversized
     <img src="data:..."> -- longer than reHTMLTag's 500-char ceiling
     -- has its payload removed first, leaving a small enough tag for
     the subsequent HTML pass to sweep)
  3. HTML strip + entity decode
  4. URL tracking-param strip
  5. existing quote/signature strip
  6. whitespace collapse
  7. TrimSpace + Subject prefix + rune-bounded truncation

Tests cover each transform in isolation, three regression cases
(URL-paths-look-base64, oversized-img-tag, CRLF normalization), and a
full-pipeline end-to-end. Config-tier tests verify the new toggles
honour the same nil/true/false tristate semantics as the existing pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 13, 2026

roborev: Combined Review (d203a66)

Summary verdict: Changes need fixes before merge due to one high-risk migration failure and two medium runtime/search issues.

High

  • internal/vector/sqlitevec/migrate.go:157
    Legacy migration sets embedding_id = message_id, but the old schema allowed the same message_id in multiple generations via PRIMARY KEY (generation_id, message_id). Existing DBs with active/building or retained generations for the same message can hit duplicate embedding_id values and fail migration.
    Fix: Allocate a unique embedding_id per legacy embedding row and carry a (generation_id, message_id) -> embedding_id mapping into the vec0 table rebuild.

Medium

  • internal/vector/embed/worker.go:488, internal/vector/embed/worker.go:521
    Preprocess(subject, body, 0, ...) disables the previous MaxInputChars truncation, then chunks the entire preprocessed message with no per-message or per-batch chunk cap. A very large email body can fan out into many embedding requests/rows, consuming CPU, memory, disk, and paid embedding quota.
    Fix: Keep a hard maximum on total preprocessed runes or chunks per message, mark the final accepted chunk as truncated, and consider enforcing BatchSize over chunks rather than only messages.

  • internal/vector/sqlitevec/backend.go:809
    Filtered vector search uses a fixed k * chunkOverfetchFactor before grouping chunks by message. A long message with many matching chunks can consume that pool, returning fewer than k distinct messages even when more filtered matches exist. The same fixed-overfetch pattern appears in fused ANN search.
    Fix: Use an adaptive overfetch loop for filtered/fused ANN paths, growing up to the matching chunk ceiling until enough distinct messages are recovered.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 14, 2026

roborev: Combined Review (0fe5a3f)

High-risk migration issue found; several medium-risk batching and retrieval-quality regressions should be addressed before merge.

High

  • internal/vector/sqlitevec/migrate.go:153: The legacy migration sets embedding_id = message_id, but the legacy embeddings primary key is (generation_id, message_id). The same message_id can exist across multiple generations, causing duplicate embedding_id values and migration failure.
    • Fix: Generate unique embedding_id values during migration and rebuild vec rows through a (generation_id, message_id) -> embedding_id mapping.

Medium

  • internal/vector/embed/worker.go:531: BatchSize still limits claimed messages, but each message can fan out to up to 64 embedder inputs. A single embed request can become BatchSize * chunks, exceeding the configured or provider-safe embedding batch size.

    • Fix: Batch inputs by the configured embed batch size, collect all vectors, and only complete each message after all chunks have been embedded and upserted.
  • internal/vector/sqlitevec/backend.go:815; internal/vector/sqlitevec/fused.go:182: Filtered vector search and fused ANN search fetch only k * chunkOverfetchFactor chunks before grouping by message. Since one message can contribute up to 64 chunks, top chunks may collapse to far fewer than k messages even when enough distinct matches exist.

    • Fix: Use a widening loop like the empty-filter path, or overfetch up to a bound based on maxSpansPerMessage or available chunk count before grouping.
  • internal/vector/embed/preprocess.go:47: The default-on HTML stripper treats any <...> text as a tag, so plain-text email content like <https://example.com>, <alice@example.com>, or comparisons such as 2 < 3 and 5 > 4 can be removed from embeddings.

    • Fix: Use an HTML tokenizer or narrow the tag matcher to valid tag syntax, preserving non-tag angle-bracketed prose.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

hansn74 and others added 4 commits May 15, 2026 13:42
Replaces the original `<[^>]{0,500}>` with a stricter tag-name
pattern `</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>` so the
stripper no longer eats text that merely contains angle brackets:

  John <john@example.com>      kept verbatim (@ rejects tag-name)
  See <https://example.com>.   kept verbatim (: rejects tag-name)
  x < 3 and y > 4              kept verbatim (space-then-digit rejects)
  <Aug 6, 2026>                kept verbatim (space rejects)

Real HTML tags (<p>, <br/>, <a href="...">, </div>, <table style="...">)
continue to match. The {0,400} attribute-body cap is moved inside an
optional non-capturing group that only fires when a whitespace-then-
attributes section actually follows the tag name, so the stripper
treats `<p>` and `<a href="...">` symmetrically.

Caught by roborev on PR kenn-io#322.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces "one embedding per message" with "one embedding per chunk",
keyed by (generation_id, message_id, chunk_index). A long email is
split via a sliding window (size = MaxInputChars runes, overlap ≈ 3%
floored at zero for small windows, soft breaks at
paragraph→sentence→word in the back quarter of each window), each
window is embedded, and the search path collapses results by
message_id keeping the best chunk score.

Motivation: with nomic-embed-text (8,192-token window) and
max_input_chars=6000, ~1.7% of messages in a 2.2M-corpus still tripped
the embedder's context limit even after sanitization. The downshift
recovery sawtoothed throughput from 42 msg/s to ~11. With chunking,
every input fits by construction and the failure mode disappears; long
emails also keep their tail content instead of losing it to runtime
truncation.

Schema changes (single transactional migration, idempotent):

  embeddings:
    + embedding_id     INTEGER PRIMARY KEY AUTOINCREMENT  -- synthetic
                                                          -- rowid for vec0
    + chunk_index      INTEGER NOT NULL DEFAULT 0
    + chunk_char_start INTEGER NOT NULL DEFAULT 0          -- debug-only
    + chunk_char_end   INTEGER NOT NULL DEFAULT 0          -- debug-only
    PRIMARY KEY → UNIQUE (generation_id, message_id, chunk_index)

  vectors_vec_dN (vec0):
    PARTITION KEY    generation_id
    PRIMARY KEY      embedding_id (was message_id)
    embedding        FLOAT[N]

Legacy migration preserves embedding_id == message_id for every
already-embedded row so existing vec0 rowids are reusable; the vec0
table is rebuilt in-place (drop + recreate + re-insert) inside a
single transaction. The AUTOINCREMENT counter is bumped past every
legacy rowid so new chunk inserts cannot collide.

Worker change (internal/vector/embed/worker.go):
  Preprocess now runs with maxChars=0 (no truncation); ChunkText splits
  the full preprocessed text into windows; each window becomes one
  embedder input. Multiple chunks per message embed in the same batch
  call. The pending_embeddings queue stays per-message: a multi-chunk
  message completes when all its chunks are upserted.

Search changes (sqlitevec/backend.go, sqlitevec/fused.go):
  Both the filtered and empty-filter Search paths JOIN vec0 through
  embeddings and GROUP BY message_id with MIN(distance), keeping the
  best-scoring chunk per message. A chunkOverfetchFactor (4×) is
  multiplied into the requested k so the GROUP BY has enough chunks
  to recover k distinct messages even when several messages each
  contribute multiple chunks to the top-k. The fused-search ANN CTE
  becomes ann_chunks → ann (GROUP BY message_id) under the same
  factor.

Stats / LoadVector:
  Stats.EmbeddingCount counts DISTINCT message_id so the progress-bar
  invariant (Done / Total in messages) survives chunking.
  LoadVector returns chunk_index=0 (the head of the message) — the
  one consumer (find_similar) wants a representative vector, not all
  chunks.

ChunkText (internal/vector/embed/chunk.go): pure function, sliding
window with overlap, soft-break preference paragraph→sentence→word
→ hard cut, clamps overlap to ≤ maxRunes/2 to prevent infinite loops,
returns runes-space CharStart/CharEnd offsets for each chunk so
backends can recover the source substring later.

Tests cover: ChunkText edge cases (empty, single-span, paragraph cut,
sentence cut, word cut, hard cut, overlap clamp, UTF-8 mixed scripts,
end-to-end coverage); multi-chunk Upsert + ReplaceFewerChunks
idempotency; the legacy-schema → chunked-schema migration with
embedding_id and rowid preservation; the worker fanning out a single
long pending message into multiple chunks and draining the queue in
one shot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a maxSpans parameter to ChunkText and uses it from the worker via
the maxSpansPerMessage=64 constant. The cap protects the embed batch
from system-generated dumps (10+ MB stack traces, error-flow forwards)
that would otherwise produce thousands of chunks for a single message,
flatten into a single embed call, and trip the API timeout.

Discovered in a real-world build against a 2.2M-message corpus: seven
Salesforce automation-error notifications carried 5–15 MB of body text
each, chunking into 600–4,170 spans, which mixed into batch_size=128
batches and pushed the resulting embed call past the 60s timeout
ceiling. With the cap at 64 spans (256 KB of content per message at a
4 KB window), every legitimate long-form email survives intact while
pathological inputs lose only their tail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three roborev kenn-io#323 follow-ups, all real, two seen in production:

1. Migration: legacy `embedding_id := message_id` shortcut collided on
   the new UNIQUE(generation_id, message_id, chunk_index) constraint
   whenever the same message_id appeared under multiple generations
   (which the pre-chunking PK (generation_id, message_id) allowed —
   typically an active gen + a building gen overlapping during a
   rebuild). The fix drops embedding_id from the INSERT so
   AUTOINCREMENT allocates a fresh, globally-unique rowid per legacy
   row, and the vec0 rebuild now looks up the new embedding_id per
   (generation_id, message_id) via a mapping built from the just-
   migrated embeddings table rather than carrying message_id as the
   new rowid. Adds TestMigrate_LegacyToChunked_MultiGenerationCollision
   to pin the case.

2. Worker: a 128-message batch with chunking could fan out to 64 × 128
   = 8,192 embedder inputs in a single Embed call, exceeding Ollama's
   ~250-input request limit and tripping the 60s API timeout. This is
   the same failure we hit live yesterday at 50 msg/s → 1 msg/s. The
   fix splits the flattened input slice into sub-batches of at most
   BatchSize, runs each sub-batch as its own Embed call, concatenates
   results before assembling chunks. The pending queue stays per-
   message — a message completes only once every one of its chunks
   has been embedded and upserted in the same RunOnce iteration.
   Adds TestWorker_SplitsChunkInputsAcrossSubBatches.

3. Filtered Search: the empty-filter path widens the chunk overfetch
   via a doubling loop until enough distinct messages survive the
   GROUP BY collapse; the filtered path used a fixed k * 4. With long
   messages contributing many top-k chunks the filtered path could
   short-return below k even when enough filtered matches existed.
   The fix lifts the same doubling loop into the filtered path,
   bounded by COUNT(*) FROM embeddings WHERE generation_id = ? so the
   loop always terminates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hansn74 hansn74 force-pushed the feat/embed-chunking branch from 0fe5a3f to cacea26 Compare May 15, 2026 12:32
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (cacea26)

Medium: fused vector search can under-return distinct message candidates because the chunk overfetch pool is fixed.

Medium

  • internal/vector/sqlitevec/fused.go:136
    Fused search applies a fixed chunkOverfetchFactor before grouping chunks by message. A long message with many high-scoring chunks can consume the ANN pool, leaving fewer than KPerSignal distinct vector candidates even when additional matching messages exist.

    Suggested fix: Use the same widening loop as Search, or fetch enough chunks to guarantee KPerSignal + 1 distinct messages under the per-message chunk cap.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Mirrors the doubling loop already in backend.Search but for the
FusedSearch CTE: extracts the SQL into a buildQuery closure
parameterised on chunkK, then re-issues the query with a growing
chunk fetch when the first pass returns fewer than KPerSignal+1
distinct ANN messages. Bounded by COUNT(*) FROM embeddings WHERE
generation_id = ? so the loop terminates even when the corpus
genuinely has fewer matches than the requested K.

Without this, a query whose top chunks all pile up onto a few
long messages collapses to far fewer than KPerSignal distinct
ANN candidates, and the fused result loses messages that would
have ranked further down the chunk-distance order. Same shape of
bug as the empty-filter and filtered Search paths, caught by
roborev on the previous push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (e83967b)

Indexing changes are mostly sound, but there is one medium-severity resource exhaustion issue to fix before merge.

Medium

  • Location: internal/vector/embed/worker.go:489, internal/vector/embed/chunk.go:62
    Problem: A remote sender can trigger local resource exhaustion with a very large email body. The worker calls Preprocess(..., 0, ...), disabling the prior embedding input truncation, then passes the full preprocessed message to ChunkText. ChunkText builds offsets for the entire string before enforcing maxSpans, so huge synced messages can consume excessive CPU and memory even though only 64 chunks are embedded. This can OOM or stall the embedding worker and block vector indexing.
    Fix: Enforce an absolute pre-chunk rune/byte cap before full offset construction, or make ChunkText scan incrementally and stop once maxSpans chunks are produced. Derive the cap from maxSpans, maxRunes, and overlap so valid retained chunks are preserved without processing unbounded tail content.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

ChunkText used to allocate a rune-byte offset table for the entire
input before maxSpans had any effect. For a 15 MB body (a Salesforce
flow-error forward, an mbox attachment that leaked into body_text),
that's a ~120 MB allocation per call even though only the first 64
chunks worth of content can ever be emitted. A burst of synced
oversized messages could OOM the embedding worker and stall vector
indexing.

The chunker only ever reads ahead within a window of maxRunes runes,
so any content past maxSpans*maxRunes will be dropped on the floor
by the in-loop maxSpans guard anyway. Truncating the byte slice up
front to that bound is lossless for the spans it would otherwise
emit, and turns the worst-case allocation from O(body_size) into
O(maxSpans * maxRunes).

Adds TestChunkText/MaxSpansCapsInputBytesProcessed: a 10M-rune input
that completes in the same wall-time as the 1K-rune cases. Without
the cap, this test would dominate the suite and the alloc profile
would balloon to ~80 MB; with the cap both costs are flat.

Caught by roborev on PR kenn-io#323 (e83967b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (717ac4c)

The PR has Medium-severity issues that should be addressed before merge.

Medium

  • internal/vector/embed/worker.go:491
    Preprocess(subject, body, 0, ...) disables the pre-embedding size cap before HTML/base64/URL/whitespace transforms run. A very large HTML/base64 or URL-heavy message can force expensive full-string processing before ChunkText applies its cap, creating a CPU/memory exhaustion risk for the embedding worker.
    Fix: Apply a raw/preprocessed input ceiling before expensive transforms, or make preprocessing bounded/streaming. The cap should account for the maximum emitted chunk text including overlap.

  • internal/vector/embed/worker.go:602
    truncated is incremented per truncated chunk while Succeeded remains per message. A single long message can therefore record multiple truncations, making run/progress metrics inconsistent.
    Fix: Track message IDs with any truncated or capped chunk and increment truncated once per message.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Two roborev kenn-io#323 (717ac4c) follow-ups, both real:

1. Preprocess input cap. The chunker's input cap (added previously)
   protects ChunkText from walking the whole body, but the regex
   passes inside Preprocess (StripHTML, StripBase64,
   StripURLTracking, whitespace collapse) still run on the full
   input — O(body_len) CPU and similar-size scratch allocations
   per transform. A 100 MB body would burn seconds before the
   chunker drops the tail anyway. Cap the raw body at
   MaxInputChars * maxSpansPerMessage * rawBodyMultiplier (with
   rawBodyMultiplier=16 to leave room for sanitize to strip
   noise) before any sanitize transform sees it. Adds
   TestWorker_CapsRawBodyBeforePreprocess.

2. RunResult.Truncated double-counting. truncated incremented per
   chunk, but Succeeded counts messages — so a single long message
   with N hard-cut chunks reported as N truncations against 1
   success, making the "what fraction was truncated" metric
   nonsensical. Track distinct message_ids with at least one
   truncated chunk and increment the counter once per message.
   Adds TestWorker_TruncatedCountedPerMessageNotPerChunk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (2d8f45d)

Medium finding remains after deduplication; no High or Critical issues reported.

Medium

  • internal/vector/embed/worker.go:498
    The worker truncates the raw body before Preprocess, so messages with a large removable prefix, such as inline base64 data or an HTML blob, can lose meaningful text that appears later before sanitization has a chance to strip the noise. This can regress embedding quality for newsletter/MIME-style bodies and may drop tail content without an accurate truncation signal.

    Suggested fix: Apply the cap after cheap pollution removal, or make the cap preprocessing-aware so post-sanitized content is preserved. Also propagate a truncation flag when either the raw cap or chunk cap drops message content.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Move MaxBodyRunes from a worker-side raw-input cap to a stage inside
Preprocess that runs *after* CRLF normalization and StripBase64 but
*before* StripHTML and the rest. The previous cap fired on raw
input, so a body whose first megabyte was an inline base64 image
got chopped to "just the blob" — the prose tail past the cap never
reached sanitize, and the resulting embedding was empty.

By doing the cheap pollution removal first, the same body lands at
the cap with only ~2 KB left where the blob used to be, and the
prose tail survives intact. Heavy regex passes (StripHTML,
StripURLTracking, whitespace collapse) still operate on a bounded
input, so the resource ceiling that motivated the earlier cap is
preserved.

The bool returned by Preprocess now also signals "body cap fired",
propagated through msgText.BodyTruncated to every chunk's Truncated
flag so per-message accounting downstream picks up cap-induced
truncation alongside hard-cut chunks.

Caught by roborev on PR kenn-io#323 (2d8f45d). Adds
TestWorker_PrefixBase64DoesNotHidePoseTail: 2 MB of base64 ahead
of a sentinel prose tail; the sentinel must appear in the
embedder inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (7d8d80a)

Medium issue found: truncation reporting can miss dropped message content.

Medium

  • internal/vector/embed/worker.go:570 - Messages truncated by ChunkText’s maxSpans cap are not reliably marked as truncated. When a long message is capped after 64 chunks but those chunks end on soft breaks, p.Trunc can remain false for every emitted chunk, so RunResult.Truncated and embeddings.truncated report the message as fully embedded even though tail content was dropped.
    • Fix: Have ChunkText return whether it dropped tail content, or compare the original rune count to the final emitted coverage, then propagate that per-message truncation flag to emitted chunks.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

…nt as truncations

Caught by roborev on PR kenn-io#323 (7d8d80a): when ChunkText hits its
maxSpans cap but the last emitted chunk happens to land on a clean
soft break, the per-chunk hard-cut Trunc flag stays false for every
emitted chunk. So a message that lost everything past chunk 64
would report as fully embedded — RunResult.Truncated and
embeddings.truncated would both miss it.

ChunkText now returns (spans, tailDropped). tailDropped is true
whenever content was dropped past the last emitted span, either
from the input pre-cap or from the in-loop maxSpans guard, and
the worker ORs it with the body-truncation flag onto every chunk's
Trunc — so per-message truncation accounting picks up cap-induced
loss regardless of where the last chunk happened to cut.

Adds two regression tests:
- TailDroppedFlagsCapWhenLastChunkLandsOnSoftBreak: 5-chunks-worth
  of prose ending in sentence terminators, capped at maxSpans=2.
  The last emitted chunk lands cleanly, so the old per-chunk flag
  would have missed the truncation; tailDropped must surface it.
- TailDroppedFalseWhenAllContentEmitted: counter-test confirming
  a short input flagged tailDropped=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 16, 2026

roborev: Combined Review (ccb226f)

No Medium, High, or Critical issues found.

All reported findings are below the requested severity threshold.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm
Copy link
Copy Markdown
Member

wesm commented May 16, 2026

Thank you for working on this! I will review and merge when I can

@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented May 21, 2026

roborev: Combined Review (ccb226f)

Summary verdict: No Medium, High, or Critical findings were reported.

All review agents either reported no issues or provided no findings.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants