vector: chunk long messages instead of truncating by hansn74 · Pull Request #323 · kenn-io/msgvault

hansn74 · 2026-05-13T19:50:14Z

Summary

Replaces "one embedding per message" with "one embedding per chunk", keyed by (generation_id, message_id, chunk_index). Long emails are split via a sliding window (overlap ≈ 3% of window, soft breaks at paragraph → sentence → word → hard cut); each window is embedded independently; the search path joins back through embedding_id, groups by message_id, and keeps the best chunk score per message.

Motivation

While building embeddings against a 2.2M-message corpus on nomic-embed-text (8,192-token window) with max_input_chars = 6000, ~1.7% of messages still tripped Ollama's context-length check. Most were polluted (inline base64, leaked HTML/CSS — addressed by #322), but the residual long-tail (CJK threads, legitimate 50K-char prose) can never fit in a single 8K window. Truncation lost the tail; the worker's downshift recovery flattened throughput from 42 msg/s → ~11.

With chunking every embedder input fits by construction (the failure mode disappears), and long messages get their entire content into the index instead of losing tail prose.

Schema changes (single transactional migration)

embeddings:
  + embedding_id     INTEGER PRIMARY KEY AUTOINCREMENT  -- synthetic rowid joined to vec0
  + chunk_index      INTEGER NOT NULL DEFAULT 0
  + chunk_char_start INTEGER NOT NULL DEFAULT 0          -- debug-only metadata
  + chunk_char_end   INTEGER NOT NULL DEFAULT 0
  PRIMARY KEY → UNIQUE (generation_id, message_id, chunk_index)

vectors_vec_dN (vec0):
  generation_id INTEGER PARTITION KEY
  embedding_id  INTEGER PRIMARY KEY                     -- was message_id
  embedding     FLOAT[N]

Migration preserves embedding_id = message_id for legacy rows so existing vec0 rowids stay valid; the vec0 table is rebuilt in-place (drop + recreate + re-insert) inside a single transaction. The AUTOINCREMENT counter is bumped past every legacy rowid so new chunk inserts cannot collide.

Read / write path

Worker: Preprocess now runs with maxChars=0 (no truncation); ChunkText slices the full preprocessed text into windows; each window becomes one embedder input. pending_embeddings stays per-message — a multi-chunk message completes when all its chunks are upserted in the same batch.
Search: both filtered and empty-filter paths JOIN vec0 ON v.embedding_id = e.embedding_id, GROUP BY e.message_id, MIN(distance). A new chunkOverfetchFactor (4×) is applied to the requested k so the GROUP BY has enough chunks to recover k distinct messages even when several messages contribute multiple top-k chunks.
FusedSearch: the ann CTE becomes ann_chunks → ann (GROUP BY message_id, MIN(distance), ROW_NUMBER) under the same factor.
Stats: EmbeddingCount is COUNT(DISTINCT message_id) so the progress-bar invariant (Done/Total in messages) survives the layout change.
LoadVector: returns chunk_index = 0 — the only consumer (find_similar) wants a representative vector, not the full chunk fan-out.

Tests

ChunkText: empty, single-span, paragraph/sentence/word/hard cuts, overlap-clamp infinite-loop guard, UTF-8 mixed-script integrity, end-to-end coverage check.
Upsert: multi-chunk fan-out lays down N rows with distinct chunk_index, joins back to vec0 through embedding_id, and message_count stays a per-message count.
Upsert idempotency: TestBackend_Upsert_ReplaceFewerChunks pins the contract that re-upserting with fewer chunks vacates the stale rows — critical because chunk fan-out can change between upserts when preprocessing rules evolve.
Migration: TestMigrate_LegacyToChunked hand-builds the pre-chunking schema, runs Migrate, and asserts every legacy row survives as chunk_index=0 with embedding_id == legacy message_id; the vec0 join is verifiable; sqlite_sequence is bumped past every legacy rowid; a second Migrate is a no-op.
Worker: TestWorker_FansOutLongMessageIntoMultipleChunks drives a single long message through the worker, asserts ≥2 chunk rows with consecutive chunk_index, message_count = 1, and that the queue drains in one Complete.

Notes

chunk_char_start/chunk_char_end are stored but not yet surfaced in search results. Cheap (8 bytes/chunk extra) and enable future per-paragraph highlighting without another migration.
Overlap is hardcoded at maxRunes/30 (≈ 3%), floored to 0 for maxRunes < 200. Not currently exposed as config; rationale is documented inline.
Builds on embed: extend Preprocess with HTML / base64 / URL-tracking / whitespace strip #322 (sanitize) but doesn't depend on it mechanically — these can land in either order.

Co-Authored-By: Claude Opus 4.7 (1M context)

Adds four opt-in (default-on) transforms to Preprocess() so noise-laden emails (inline base64 images, HTML residue, tracking-tagged links, HTML→text whitespace bloat) tokenize back down inside the embedder's context window: strip_html strip <style>/<script> blocks, generic <tags>, decode HTML entities strip_base64 strip data:...;base64,... URIs and bare base64 runs >=200 chars (excluding '/' so URL paths survive) strip_url_tracking drop utm_*, fbclid, gclid, etc. query params collapse_whitespace normalize CRLF -> LF, trim per-line trailing whitespace, collapse runs of >=3 newlines, runs of >=2 horizontal spaces Motivation: while building embeddings for a 2.2M-message corpus on nomic-embed-text (8192-token window), ~1.7% of messages tripped the endpoint's context-length check even after the 6000-char rune cap. The offenders were almost always polluted with one of the four patterns above: a 30KB inline image, leaked <table style="..."> markup, or campaign-tagged URLs repeating across newsletters. Stripping these shrinks dense input to clean prose without semantic loss, eliminates the downshift-to-batch-size=1 sawtooth that capped real throughput at ~10 msg/s, and improves vector quality by not averaging the embedding over CSS gibberish. Config follows the existing PreprocessConfig pattern: *bool in the TOML tier (nil = "default true", explicit `false` preserved verbatim), plain bool in the runtime tier, helpers like StripHTMLEnabled() bridge the two. Both call sites (build-embeddings + the live worker spawned by `serve`) are wired symmetrically. Pipeline order matters and is deliberate: 1. CRLF normalization (line-oriented regexes assume LF) 2. base64 / data: URI strip (runs before HTML so an oversized <img src="data:..."> -- longer than reHTMLTag's 500-char ceiling -- has its payload removed first, leaving a small enough tag for the subsequent HTML pass to sweep) 3. HTML strip + entity decode 4. URL tracking-param strip 5. existing quote/signature strip 6. whitespace collapse 7. TrimSpace + Subject prefix + rune-bounded truncation Tests cover each transform in isolation, three regression cases (URL-paths-look-base64, oversized-img-tag, CRLF normalization), and a full-pipeline end-to-end. Config-tier tests verify the new toggles honour the same nil/true/false tristate semantics as the existing pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-13T19:54:19Z

roborev: Combined Review (`d203a66`)

Summary verdict: Changes need fixes before merge due to one high-risk migration failure and two medium runtime/search issues.

High

internal/vector/sqlitevec/migrate.go:157
Legacy migration sets embedding_id = message_id, but the old schema allowed the same message_id in multiple generations via PRIMARY KEY (generation_id, message_id). Existing DBs with active/building or retained generations for the same message can hit duplicate embedding_id values and fail migration.
Fix: Allocate a unique embedding_id per legacy embedding row and carry a (generation_id, message_id) -> embedding_id mapping into the vec0 table rebuild.

Medium

internal/vector/embed/worker.go:488, internal/vector/embed/worker.go:521
Preprocess(subject, body, 0, ...) disables the previous MaxInputChars truncation, then chunks the entire preprocessed message with no per-message or per-batch chunk cap. A very large email body can fan out into many embedding requests/rows, consuming CPU, memory, disk, and paid embedding quota.
Fix: Keep a hard maximum on total preprocessed runes or chunks per message, mark the final accepted chunk as truncated, and consider enforcing BatchSize over chunks rather than only messages.
internal/vector/sqlitevec/backend.go:809
Filtered vector search uses a fixed k * chunkOverfetchFactor before grouping chunks by message. A long message with many matching chunks can consume that pool, returning fewer than k distinct messages even when more filtered matches exist. The same fixed-overfetch pattern appears in fused ANN search.
Fix: Use an adaptive overfetch loop for filtered/fused ANN paths, growing up to the matching chunk ceiling until enough distinct messages are recovered.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci · 2026-05-14T14:40:26Z

roborev: Combined Review (`0fe5a3f`)

High-risk migration issue found; several medium-risk batching and retrieval-quality regressions should be addressed before merge.

High

internal/vector/sqlitevec/migrate.go:153: The legacy migration sets embedding_id = message_id, but the legacy embeddings primary key is (generation_id, message_id). The same message_id can exist across multiple generations, causing duplicate embedding_id values and migration failure.
- Fix: Generate unique embedding_id values during migration and rebuild vec rows through a (generation_id, message_id) -> embedding_id mapping.

Medium

internal/vector/embed/worker.go:531: BatchSize still limits claimed messages, but each message can fan out to up to 64 embedder inputs. A single embed request can become BatchSize * chunks, exceeding the configured or provider-safe embedding batch size.
- Fix: Batch inputs by the configured embed batch size, collect all vectors, and only complete each message after all chunks have been embedded and upserted.
internal/vector/sqlitevec/backend.go:815; internal/vector/sqlitevec/fused.go:182: Filtered vector search and fused ANN search fetch only k * chunkOverfetchFactor chunks before grouping by message. Since one message can contribute up to 64 chunks, top chunks may collapse to far fewer than k messages even when enough distinct matches exist.
- Fix: Use a widening loop like the empty-filter path, or overfetch up to a bound based on maxSpansPerMessage or available chunk count before grouping.
internal/vector/embed/preprocess.go:47: The default-on HTML stripper treats any <...> text as a tag, so plain-text email content like <https://example.com>, <alice@example.com>, or comparisons such as 2 < 3 and 5 > 4 can be removed from embeddings.
- Fix: Use an HTML tokenizer or narrow the tag matcher to valid tag syntax, preserving non-tag angle-bracketed prose.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Replaces the original `<[^>]{0,500}>` with a stricter tag-name pattern `</?[a-zA-Z][a-zA-Z0-9-]*(?:\s[^>]{0,400})?\s*/?>` so the stripper no longer eats text that merely contains angle brackets: John <john@example.com> kept verbatim (@ rejects tag-name) See <https://example.com>. kept verbatim (: rejects tag-name) x < 3 and y > 4 kept verbatim (space-then-digit rejects) <Aug 6, 2026> kept verbatim (space rejects) Real HTML tags (<p>, <br/>, <a href="...">, </div>, <table style="...">) continue to match. The {0,400} attribute-body cap is moved inside an optional non-capturing group that only fires when a whitespace-then- attributes section actually follows the tag name, so the stripper treats `<p>` and `<a href="...">` symmetrically. Caught by roborev on PR kenn-io#322. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces "one embedding per message" with "one embedding per chunk", keyed by (generation_id, message_id, chunk_index). A long email is split via a sliding window (size = MaxInputChars runes, overlap ≈ 3% floored at zero for small windows, soft breaks at paragraph→sentence→word in the back quarter of each window), each window is embedded, and the search path collapses results by message_id keeping the best chunk score. Motivation: with nomic-embed-text (8,192-token window) and max_input_chars=6000, ~1.7% of messages in a 2.2M-corpus still tripped the embedder's context limit even after sanitization. The downshift recovery sawtoothed throughput from 42 msg/s to ~11. With chunking, every input fits by construction and the failure mode disappears; long emails also keep their tail content instead of losing it to runtime truncation. Schema changes (single transactional migration, idempotent): embeddings: + embedding_id INTEGER PRIMARY KEY AUTOINCREMENT -- synthetic -- rowid for vec0 + chunk_index INTEGER NOT NULL DEFAULT 0 + chunk_char_start INTEGER NOT NULL DEFAULT 0 -- debug-only + chunk_char_end INTEGER NOT NULL DEFAULT 0 -- debug-only PRIMARY KEY → UNIQUE (generation_id, message_id, chunk_index) vectors_vec_dN (vec0): PARTITION KEY generation_id PRIMARY KEY embedding_id (was message_id) embedding FLOAT[N] Legacy migration preserves embedding_id == message_id for every already-embedded row so existing vec0 rowids are reusable; the vec0 table is rebuilt in-place (drop + recreate + re-insert) inside a single transaction. The AUTOINCREMENT counter is bumped past every legacy rowid so new chunk inserts cannot collide. Worker change (internal/vector/embed/worker.go): Preprocess now runs with maxChars=0 (no truncation); ChunkText splits the full preprocessed text into windows; each window becomes one embedder input. Multiple chunks per message embed in the same batch call. The pending_embeddings queue stays per-message: a multi-chunk message completes when all its chunks are upserted. Search changes (sqlitevec/backend.go, sqlitevec/fused.go): Both the filtered and empty-filter Search paths JOIN vec0 through embeddings and GROUP BY message_id with MIN(distance), keeping the best-scoring chunk per message. A chunkOverfetchFactor (4×) is multiplied into the requested k so the GROUP BY has enough chunks to recover k distinct messages even when several messages each contribute multiple chunks to the top-k. The fused-search ANN CTE becomes ann_chunks → ann (GROUP BY message_id) under the same factor. Stats / LoadVector: Stats.EmbeddingCount counts DISTINCT message_id so the progress-bar invariant (Done / Total in messages) survives chunking. LoadVector returns chunk_index=0 (the head of the message) — the one consumer (find_similar) wants a representative vector, not all chunks. ChunkText (internal/vector/embed/chunk.go): pure function, sliding window with overlap, soft-break preference paragraph→sentence→word → hard cut, clamps overlap to ≤ maxRunes/2 to prevent infinite loops, returns runes-space CharStart/CharEnd offsets for each chunk so backends can recover the source substring later. Tests cover: ChunkText edge cases (empty, single-span, paragraph cut, sentence cut, word cut, hard cut, overlap clamp, UTF-8 mixed scripts, end-to-end coverage); multi-chunk Upsert + ReplaceFewerChunks idempotency; the legacy-schema → chunked-schema migration with embedding_id and rowid preservation; the worker fanning out a single long pending message into multiple chunks and draining the queue in one shot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a maxSpans parameter to ChunkText and uses it from the worker via the maxSpansPerMessage=64 constant. The cap protects the embed batch from system-generated dumps (10+ MB stack traces, error-flow forwards) that would otherwise produce thousands of chunks for a single message, flatten into a single embed call, and trip the API timeout. Discovered in a real-world build against a 2.2M-message corpus: seven Salesforce automation-error notifications carried 5–15 MB of body text each, chunking into 600–4,170 spans, which mixed into batch_size=128 batches and pushed the resulting embed call past the 60s timeout ceiling. With the cap at 64 spans (256 KB of content per message at a 4 KB window), every legitimate long-form email survives intact while pathological inputs lose only their tail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three roborev kenn-io#323 follow-ups, all real, two seen in production: 1. Migration: legacy `embedding_id := message_id` shortcut collided on the new UNIQUE(generation_id, message_id, chunk_index) constraint whenever the same message_id appeared under multiple generations (which the pre-chunking PK (generation_id, message_id) allowed — typically an active gen + a building gen overlapping during a rebuild). The fix drops embedding_id from the INSERT so AUTOINCREMENT allocates a fresh, globally-unique rowid per legacy row, and the vec0 rebuild now looks up the new embedding_id per (generation_id, message_id) via a mapping built from the just- migrated embeddings table rather than carrying message_id as the new rowid. Adds TestMigrate_LegacyToChunked_MultiGenerationCollision to pin the case. 2. Worker: a 128-message batch with chunking could fan out to 64 × 128 = 8,192 embedder inputs in a single Embed call, exceeding Ollama's ~250-input request limit and tripping the 60s API timeout. This is the same failure we hit live yesterday at 50 msg/s → 1 msg/s. The fix splits the flattened input slice into sub-batches of at most BatchSize, runs each sub-batch as its own Embed call, concatenates results before assembling chunks. The pending queue stays per- message — a message completes only once every one of its chunks has been embedded and upserted in the same RunOnce iteration. Adds TestWorker_SplitsChunkInputsAcrossSubBatches. 3. Filtered Search: the empty-filter path widens the chunk overfetch via a doubling loop until enough distinct messages survive the GROUP BY collapse; the filtered path used a fixed k * 4. With long messages contributing many top-k chunks the filtered path could short-return below k even when enough filtered matches existed. The fix lifts the same doubling loop into the filtered path, bounded by COUNT(*) FROM embeddings WHERE generation_id = ? so the loop always terminates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-15T12:35:38Z

roborev: Combined Review (`cacea26`)

Medium: fused vector search can under-return distinct message candidates because the chunk overfetch pool is fixed.

Medium

internal/vector/sqlitevec/fused.go:136
Fused search applies a fixed chunkOverfetchFactor before grouping chunks by message. A long message with many high-scoring chunks can consume the ANN pool, leaving fewer than KPerSignal distinct vector candidates even when additional matching messages exist.

Suggested fix: Use the same widening loop as Search, or fetch enough chunks to guarantee KPerSignal + 1 distinct messages under the per-message chunk cap.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Mirrors the doubling loop already in backend.Search but for the FusedSearch CTE: extracts the SQL into a buildQuery closure parameterised on chunkK, then re-issues the query with a growing chunk fetch when the first pass returns fewer than KPerSignal+1 distinct ANN messages. Bounded by COUNT(*) FROM embeddings WHERE generation_id = ? so the loop terminates even when the corpus genuinely has fewer matches than the requested K. Without this, a query whose top chunks all pile up onto a few long messages collapses to far fewer than KPerSignal distinct ANN candidates, and the fused result loses messages that would have ranked further down the chunk-distance order. Same shape of bug as the empty-filter and filtered Search paths, caught by roborev on the previous push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-15T13:37:05Z

roborev: Combined Review (`e83967b`)

Indexing changes are mostly sound, but there is one medium-severity resource exhaustion issue to fix before merge.

Medium

Location: internal/vector/embed/worker.go:489, internal/vector/embed/chunk.go:62
Problem: A remote sender can trigger local resource exhaustion with a very large email body. The worker calls Preprocess(..., 0, ...), disabling the prior embedding input truncation, then passes the full preprocessed message to ChunkText. ChunkText builds offsets for the entire string before enforcing maxSpans, so huge synced messages can consume excessive CPU and memory even though only 64 chunks are embedded. This can OOM or stall the embedding worker and block vector indexing.
Fix: Enforce an absolute pre-chunk rune/byte cap before full offset construction, or make ChunkText scan incrementally and stop once maxSpans chunks are produced. Derive the cap from maxSpans, maxRunes, and overlap so valid retained chunks are preserved without processing unbounded tail content.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

ChunkText used to allocate a rune-byte offset table for the entire input before maxSpans had any effect. For a 15 MB body (a Salesforce flow-error forward, an mbox attachment that leaked into body_text), that's a ~120 MB allocation per call even though only the first 64 chunks worth of content can ever be emitted. A burst of synced oversized messages could OOM the embedding worker and stall vector indexing. The chunker only ever reads ahead within a window of maxRunes runes, so any content past maxSpans*maxRunes will be dropped on the floor by the in-loop maxSpans guard anyway. Truncating the byte slice up front to that bound is lossless for the spans it would otherwise emit, and turns the worst-case allocation from O(body_size) into O(maxSpans * maxRunes). Adds TestChunkText/MaxSpansCapsInputBytesProcessed: a 10M-rune input that completes in the same wall-time as the 1K-rune cases. Without the cap, this test would dominate the suite and the alloc profile would balloon to ~80 MB; with the cap both costs are flat. Caught by roborev on PR kenn-io#323 (e83967b). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-15T14:38:41Z

roborev: Combined Review (`717ac4c`)

The PR has Medium-severity issues that should be addressed before merge.

Medium

internal/vector/embed/worker.go:491
Preprocess(subject, body, 0, ...) disables the pre-embedding size cap before HTML/base64/URL/whitespace transforms run. A very large HTML/base64 or URL-heavy message can force expensive full-string processing before ChunkText applies its cap, creating a CPU/memory exhaustion risk for the embedding worker.
Fix: Apply a raw/preprocessed input ceiling before expensive transforms, or make preprocessing bounded/streaming. The cap should account for the maximum emitted chunk text including overlap.
internal/vector/embed/worker.go:602
truncated is incremented per truncated chunk while Succeeded remains per message. A single long message can therefore record multiple truncations, making run/progress metrics inconsistent.
Fix: Track message IDs with any truncated or capped chunk and increment truncated once per message.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Two roborev kenn-io#323 (717ac4c) follow-ups, both real: 1. Preprocess input cap. The chunker's input cap (added previously) protects ChunkText from walking the whole body, but the regex passes inside Preprocess (StripHTML, StripBase64, StripURLTracking, whitespace collapse) still run on the full input — O(body_len) CPU and similar-size scratch allocations per transform. A 100 MB body would burn seconds before the chunker drops the tail anyway. Cap the raw body at MaxInputChars * maxSpansPerMessage * rawBodyMultiplier (with rawBodyMultiplier=16 to leave room for sanitize to strip noise) before any sanitize transform sees it. Adds TestWorker_CapsRawBodyBeforePreprocess. 2. RunResult.Truncated double-counting. truncated incremented per chunk, but Succeeded counts messages — so a single long message with N hard-cut chunks reported as N truncations against 1 success, making the "what fraction was truncated" metric nonsensical. Track distinct message_ids with at least one truncated chunk and increment the counter once per message. Adds TestWorker_TruncatedCountedPerMessageNotPerChunk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-15T15:40:18Z

roborev: Combined Review (`2d8f45d`)

Medium finding remains after deduplication; no High or Critical issues reported.

Medium

internal/vector/embed/worker.go:498
The worker truncates the raw body before Preprocess, so messages with a large removable prefix, such as inline base64 data or an HTML blob, can lose meaningful text that appears later before sanitization has a chance to strip the noise. This can regress embedding quality for newsletter/MIME-style bodies and may drop tail content without an accurate truncation signal.

Suggested fix: Apply the cap after cheap pollution removal, or make the cap preprocessing-aware so post-sanitized content is preserved. Also propagate a truncation flag when either the raw cap or chunk cap drops message content.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Move MaxBodyRunes from a worker-side raw-input cap to a stage inside Preprocess that runs *after* CRLF normalization and StripBase64 but *before* StripHTML and the rest. The previous cap fired on raw input, so a body whose first megabyte was an inline base64 image got chopped to "just the blob" — the prose tail past the cap never reached sanitize, and the resulting embedding was empty. By doing the cheap pollution removal first, the same body lands at the cap with only ~2 KB left where the blob used to be, and the prose tail survives intact. Heavy regex passes (StripHTML, StripURLTracking, whitespace collapse) still operate on a bounded input, so the resource ceiling that motivated the earlier cap is preserved. The bool returned by Preprocess now also signals "body cap fired", propagated through msgText.BodyTruncated to every chunk's Truncated flag so per-message accounting downstream picks up cap-induced truncation alongside hard-cut chunks. Caught by roborev on PR kenn-io#323 (2d8f45d). Adds TestWorker_PrefixBase64DoesNotHidePoseTail: 2 MB of base64 ahead of a sentinel prose tail; the sentinel must appear in the embedder inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-15T20:01:31Z

roborev: Combined Review (`7d8d80a`)

Medium issue found: truncation reporting can miss dropped message content.

Medium

internal/vector/embed/worker.go:570 - Messages truncated by ChunkText’s maxSpans cap are not reliably marked as truncated. When a long message is capped after 64 chunks but those chunks end on soft breaks, p.Trunc can remain false for every emitted chunk, so RunResult.Truncated and embeddings.truncated report the message as fully embedded even though tail content was dropped.
- Fix: Have ChunkText return whether it dropped tail content, or compare the original rune count to the final emitted coverage, then propagate that per-message truncation flag to emitted chunks.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

…nt as truncations Caught by roborev on PR kenn-io#323 (7d8d80a): when ChunkText hits its maxSpans cap but the last emitted chunk happens to land on a clean soft break, the per-chunk hard-cut Trunc flag stays false for every emitted chunk. So a message that lost everything past chunk 64 would report as fully embedded — RunResult.Truncated and embeddings.truncated would both miss it. ChunkText now returns (spans, tailDropped). tailDropped is true whenever content was dropped past the last emitted span, either from the input pre-cap or from the in-loop maxSpans guard, and the worker ORs it with the body-truncation flag onto every chunk's Trunc — so per-message truncation accounting picks up cap-induced loss regardless of where the last chunk happened to cut. Adds two regression tests: - TailDroppedFlagsCapWhenLastChunkLandsOnSoftBreak: 5-chunks-worth of prose ending in sentence terminators, capped at maxSpans=2. The last emitted chunk lands cleanly, so the old per-chunk flag would have missed the truncation; tailDropped must surface it. - TailDroppedFalseWhenAllContentEmitted: counter-test confirming a short input flagged tailDropped=false. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-05-16T12:15:17Z

roborev: Combined Review (`ccb226f`)

No Medium, High, or Critical issues found.

All reported findings are below the requested severity threshold.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm · 2026-05-16T18:23:15Z

Thank you for working on this! I will review and merge when I can

roborev-ci · 2026-05-21T21:39:52Z

roborev: Combined Review (`ccb226f`)

Summary verdict: No Medium, High, or Critical findings were reported.

All review agents either reported no issues or provided no findings.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

hansn74 and others added 4 commits May 15, 2026 13:42

hansn74 force-pushed the feat/embed-chunking branch from 0fe5a3f to cacea26 Compare May 15, 2026 12:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vector: chunk long messages instead of truncating#323

vector: chunk long messages instead of truncating#323
hansn74 wants to merge 10 commits into
kenn-io:mainfrom
hansn74:feat/embed-chunking

hansn74 commented May 13, 2026

Uh oh!

roborev-ci Bot commented May 13, 2026

Uh oh!

roborev-ci Bot commented May 14, 2026

Uh oh!

roborev-ci Bot commented May 15, 2026

Uh oh!

roborev-ci Bot commented May 15, 2026

Uh oh!

roborev-ci Bot commented May 15, 2026

Uh oh!

roborev-ci Bot commented May 15, 2026

Uh oh!

roborev-ci Bot commented May 15, 2026

Uh oh!

roborev-ci Bot commented May 16, 2026

Uh oh!

wesm commented May 16, 2026

Uh oh!

roborev-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hansn74 commented May 13, 2026

Summary

Motivation

Schema changes (single transactional migration)

Read / write path

Tests

Notes

Uh oh!

roborev-ci Bot commented May 13, 2026

roborev: Combined Review (d203a66)

High

Medium

Uh oh!

roborev-ci Bot commented May 14, 2026

roborev: Combined Review (0fe5a3f)

High

Medium

Uh oh!

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (cacea26)

Medium

Uh oh!

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (e83967b)

Medium

Uh oh!

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (717ac4c)

Medium

Uh oh!

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (2d8f45d)

Medium

Uh oh!

roborev-ci Bot commented May 15, 2026

roborev: Combined Review (7d8d80a)

Medium

Uh oh!

roborev-ci Bot commented May 16, 2026

roborev: Combined Review (ccb226f)

Uh oh!

wesm commented May 16, 2026

Uh oh!

roborev-ci Bot commented May 21, 2026

roborev: Combined Review (ccb226f)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

roborev: Combined Review (`d203a66`)

roborev: Combined Review (`0fe5a3f`)

roborev: Combined Review (`cacea26`)

roborev: Combined Review (`e83967b`)

roborev: Combined Review (`717ac4c`)

roborev: Combined Review (`2d8f45d`)

roborev: Combined Review (`7d8d80a`)

roborev: Combined Review (`ccb226f`)

roborev: Combined Review (`ccb226f`)