feat: add local_bm25 sparse provider for hybrid retrieval by rocke2020 · Pull Request #1857 · volcengine/OpenViking

rocke2020 · 2026-05-05T08:09:01Z

Summary

Adds local_bm25 as a new sparse embedding provider, enabling local BM25 lexical retrieval without cloud APIs or heavyweight ML dependencies
Users can combine any dense embedding (e.g., Ollama qwen3-embedding:0.6b) with local BM25 for hybrid search via the existing CompositeHybridEmbedder path
Uses Milvus-inspired TF/IDF split architecture: documents store length-normalized TF at insert time, queries compute IDF-weighted vectors from live corpus stats at search time

Design

Follows the existing SparseEmbedderBase interface. The C++ sparse engine only supports dot product, so full BM25 scoring is pre-baked into the vectors:

Document vector: tf / (tf + k1 * (1 - b + b * doclen/avgdl)) per term
Query vector: idf(t) * (k1 + 1) per term
dot_product(query, doc) = BM25 score
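
To make the split concrete, here is a minimal sketch (not the PR's implementation) that checks the dot product of the two vectors against a direct BM25 computation. The `K1`/`B` values and the smoothed IDF variant are assumptions, since the PR description does not state the defaults or the exact IDF formula, and all function names are illustrative.

```python
import math
from collections import Counter

K1 = 1.2   # assumed value; the PR's DEFAULT_K1 is not shown
B = 0.75   # assumed value; the PR's DEFAULT_B is not shown


def doc_vector(tokens, avgdl, k1=K1, b=B):
    """Length-normalized TF per term: tf / (tf + k1 * (1 - b + b * doclen/avgdl))."""
    norm = k1 * (1 - b + b * len(tokens) / avgdl)
    return {t: f / (f + norm) for t, f in Counter(tokens).items()}


def idf(term, docs):
    """A common smoothed BM25 IDF (assumption: the PR may use another variant)."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


def query_vector(tokens, docs, k1=K1):
    """IDF-weighted query terms: idf(t) * (k1 + 1)."""
    return {t: idf(t, docs) * (k1 + 1) for t in set(tokens)}


def dot(q, d):
    """Sparse dot product over the terms the two vectors share."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def bm25_reference(query_tokens, doc_tokens, docs, avgdl, k1=K1, b=B):
    """Textbook BM25 computed in one pass, for comparison."""
    tf = Counter(doc_tokens)
    norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)
    return sum(
        idf(t, docs) * tf[t] * (k1 + 1) / (tf[t] + norm)
        for t in set(query_tokens)
    )


docs = [
    ["openviking", "hybrid", "search"],
    ["dense", "embedding", "search"],
    ["bm25", "lexical", "search", "search"],
]
avgdl = sum(len(d) for d in docs) / len(docs)
query = ["openviking", "search"]
q_vec = query_vector(query, docs)
scores = [dot(q_vec, doc_vector(d, avgdl)) for d in docs]
print(scores)
```

Because the document side depends on `avgdl` and the query side on `df(t)` and the document count, both vectors are only comparable when they are built from the same corpus statistics.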

Config example:

embedding:
  dense:
    provider: ollama
    model: qwen3-embedding:0.6b
    dimension: 1024
  sparse:
    provider: local_bm25

Files changed

File	Change
`openviking/models/embedder/local_bm25_embedder.py`	NEW: BM25 embedder (tokenizer, CRC32 hashing, corpus stats, scoring)
`openviking/models/embedder/__init__.py`	Export `LocalBM25Embedder`
`openviking_cli/utils/config/embedding_config.py`	Add `local_bm25` to provider validation + factory registry
`tests/unit/test_local_bm25_embedder.py`	28 unit tests

Test plan

Unit test tokenization and CRC32 hashing
Unit test BM25 stats persistence (save/load)
Unit test IDF: rare terms get higher weight than common terms
Unit test dot-product ranking: query "openviking" ranks doc containing "openviking" higher
Config validation: local_bm25 provider accepted, model defaults to "bm25"
Regression: dense-only configs still work unchanged
Integration: EmbeddingConfig with dense+sparse creates CompositeHybridEmbedder

🤖 Generated with Claude Code

CLAassistant · 2026-05-05T08:09:09Z

All committers have signed the CLA.

github-actions · 2026-05-05T08:10:15Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Performance Concern

Saving BM25 stats to disk after every document embed can cause excessive I/O and slow down bulk insertions. Consider adding a configurable flush interval or an explicit flush method instead.

self.stats.add_document(token_hashes, doc_len)
if self._stats_path:
    self.stats.save(self._stats_path)

github-actions · 2026-05-05T08:11:26Z

PR Code Suggestions ✨

No code suggestions found for the PR.

MaojiaSheng · 2026-05-14T12:15:20Z

Thanks. Here are some suggestions:

Tokenization Limitations

DEFAULT_TOKEN_PATTERN = r"\w+"

The regex \w+ doesn't handle CJK (Chinese/Japanese/Korean) text well: with no spaces between words, it treats each unbroken run of characters (often a whole sentence) as a single token
Consider adding language-specific tokenization or using a more sophisticated tokenizer for multilingual support
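
A quick way to see the problem, plus one lightweight fallback. The `CJK_AWARE_PATTERN` below is a hypothetical alternative (one token per CJK ideograph in the basic U+4E00 to U+9FFF block), not something from the PR, and a real segmenter would give better terms than single characters.

```python
import re

DEFAULT_TOKEN_PATTERN = r"\w+"

# Hypothetical alternative: each CJK ideograph becomes its own token,
# while runs of other word characters stay together.
CJK_AWARE_PATTERN = r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+"


def tokenize(text, pattern=DEFAULT_TOKEN_PATTERN):
    """Lowercase the text and return all regex matches as tokens."""
    return re.findall(pattern, text.lower())


text = "我喜欢开源软件 OpenViking"
print(tokenize(text))                     # the Chinese clause comes back as one token
print(tokenize(text, CJK_AWARE_PATTERN))  # one token per ideograph, plus "openviking"
```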

Configuration Flexibility

local_bm25_embedder.py

def __init__(
    self,
    model_name: str = "bm25",
    k1: float = DEFAULT_K1,
    b: float = DEFAULT_B,
    token_pattern: str = DEFAULT_TOKEN_PATTERN,
    stats_path: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
):

The custom parameters (k1, b, token_pattern, stats_path) are accepted but not exposed in the config factory
Consider adding these to EmbeddingModelConfig so users can tune BM25 parameters via yaml config
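
One possible shape for this, as a sketch only: `LocalBM25Params` and `from_config` are hypothetical names, the default values are assumptions (the PR's `DEFAULT_K1` and `DEFAULT_B` are not shown), and the real EmbeddingModelConfig may be structured differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LocalBM25Params:
    """Hypothetical container for BM25 options read from the yaml config."""
    k1: float = 1.2                   # assumed default
    b: float = 0.75                   # assumed default
    token_pattern: str = r"\w+"
    stats_path: Optional[str] = None

    @classmethod
    def from_config(cls, cfg: Dict[str, Any]) -> "LocalBM25Params":
        # Reject unknown keys early so typos in the yaml do not pass silently.
        unknown = set(cfg) - set(cls.__dataclass_fields__)
        if unknown:
            raise ValueError(f"unknown local_bm25 options: {sorted(unknown)}")
        params = cls(**cfg)
        if params.k1 < 0:
            raise ValueError("k1 must be non-negative")
        if not 0.0 <= params.b <= 1.0:
            raise ValueError("b must be between 0 and 1")
        return params


params = LocalBM25Params.from_config({"k1": 1.5, "stats_path": "bm25_stats.json"})
print(params)
```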

Stats Persistence Granularity

def _embed_document(self, token_hashes: List[int]) -> EmbedResult:
    # ...
    self.stats.add_document(token_hashes, doc_len)
    if self._stats_path:
        self.stats.save(self._stats_path)  # Saves after EVERY document

Saving stats after every document insertion could cause I/O bottlenecks for bulk inserts
Consider adding a batch mode or periodic autosave instead
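
A minimal sketch of what a batched save could look like. `BufferedStats`, `flush_every`, and `flush()` are hypothetical names, not the PR's API; the class only tracks two counters to keep the example short.

```python
import json
import os
import tempfile


class BufferedStats:
    """Hypothetical stats holder that writes to disk every `flush_every` documents."""

    def __init__(self, path, flush_every=100):
        self.path = path
        self.flush_every = flush_every
        self.doc_count = 0
        self.total_tokens = 0
        self.saves = 0      # number of disk writes, for illustration
        self._dirty = 0     # documents added since the last write

    def add_document(self, doc_len):
        self.doc_count += 1
        self.total_tokens += doc_len
        self._dirty += 1
        if self._dirty >= self.flush_every:
            self.flush()

    def flush(self):
        """Write pending changes; call explicitly at the end of a bulk insert."""
        if self._dirty == 0:
            return
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump({"doc_count": self.doc_count, "total_tokens": self.total_tokens}, f)
        os.replace(tmp_path, self.path)  # replace the old file in one step
        self._dirty = 0
        self.saves += 1


with tempfile.TemporaryDirectory() as tmp:
    stats = BufferedStats(os.path.join(tmp, "bm25_stats.json"), flush_every=100)
    for _ in range(250):
        stats.add_document(doc_len=12)
    stats.flush()
    print(stats.saves)  # 3 disk writes instead of 250
```

The trade-off is that a crash can lose up to `flush_every - 1` documents' worth of stats, so a caller that needs durability should still call `flush()` at safe points.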

ByteDanceLiuYang · 2026-05-14T12:40:53Z

+                self.total_tokens = raw.get("total_tokens", 0)
+                self.term_doc_freq = {int(k): v for k, v in raw.get("term_doc_freq", {}).items()}
+        except (json.JSONDecodeError, ValueError, OSError) as e:
+            logger.warning("bm25: failed to load stats from %s: %s", path, e)


For serious errors such as a corrupted file, it would be better to raise an exception.
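
A sketch of how the load path could separate "no stats yet" from "stats file is damaged". `BM25StatsError` and `load_stats` are hypothetical names, and the `doc_count` key is an assumption; only `total_tokens` and `term_doc_freq` appear in the diff above.

```python
import json
import os
import tempfile


class BM25StatsError(RuntimeError):
    """Hypothetical error type for an unreadable or corrupted stats file."""


def load_stats(path):
    """Return stats from `path`; a missing file is fine, a damaged one is not."""
    try:
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
    except FileNotFoundError:
        # First run: start from empty stats.
        return {"doc_count": 0, "total_tokens": 0, "term_doc_freq": {}}
    except (ValueError, OSError) as e:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        raise BM25StatsError(f"bm25: cannot read stats from {path}: {e}") from e
    try:
        return {
            "doc_count": int(raw.get("doc_count", 0)),
            "total_tokens": int(raw.get("total_tokens", 0)),
            "term_doc_freq": {int(k): int(v) for k, v in raw.get("term_doc_freq", {}).items()},
        }
    except (AttributeError, TypeError, ValueError) as e:
        raise BM25StatsError(f"bm25: malformed stats in {path}: {e}") from e


with tempfile.TemporaryDirectory() as tmp:
    bad_path = os.path.join(tmp, "bm25_stats.json")
    with open(bad_path, "w", encoding="utf-8") as f:
        f.write("{not valid json")
    try:
        load_stats(bad_path)
    except BM25StatsError as e:
        print("raised:", e)
```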

ByteDanceLiuYang · 2026-05-14T12:43:18Z

+        if is_query:
+            return self._embed_query(token_hashes)
+        return self._embed_document(token_hashes)
+


TODO: consider implementing embed_batch.

ByteDanceLiuYang · 2026-05-14T12:50:08Z

+
+def _hash_token(token: str) -> int:
+    """CRC32 hash of token, matching Milvus approach."""
+    return zlib.crc32(token.encode("utf-8")[:128]) & 0xFFFFFFFF


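There is no need to copy Milvus's approach here. CRC32 has a limited hash space, so the probability of hash collisions is high. Since this is a Python project, you could switch to something like xxh64.

Thanks, it has been changed to xxhash.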
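
For a rough sense of scale, the birthday-bound approximation below estimates the chance that at least two distinct terms share a hash. It assumes an ideal uniform hash, which neither CRC32 nor xxh64 exactly is, and the vocabulary sizes are illustrative.

```python
import math


def collision_probability(n_terms, hash_bits):
    """Birthday-bound approximation: P ~= 1 - exp(-n * (n - 1) / (2 * 2**bits))."""
    space = 2.0 ** hash_bits
    # expm1 keeps precision when the probability is tiny (the 64-bit case).
    return -math.expm1(-n_terms * (n_terms - 1) / (2.0 * space))


for n in (10_000, 100_000, 1_000_000):
    p32 = collision_probability(n, 32)
    p64 = collision_probability(n, 64)
    print(f"{n:>9} terms: 32-bit {p32:.3g}, 64-bit {p64:.3g}")
```

Under that assumption, a vocabulary of 100,000 distinct terms has roughly a two-in-three chance of at least one collision in a 32-bit space, while the 64-bit figure stays negligible even at a million terms.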

ByteDanceLiuYang · 2026-05-14T12:54:13Z

+            logger.warning("bm25: failed to load stats from %s: %s", path, e)
+
+
+def _tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> List[str]:


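The tokenizer could be pulled out into a configuration option. The default could at least be something like jieba, whose performance is not too bad. The current tokenizer effectively performs no word segmentation for Chinese, so the results are probably poor.

Thanks! The default is now jieba.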

rocke2020 · 2026-05-16T12:48:43Z

@MaojiaSheng Thanks for your advice. I have implemented these suggestions.

ByteDanceLiuYang · 2026-05-18T03:26:36Z

@rocke2020 LGTM
cc @MaojiaSheng @zhoujh01

…ud dependencies Enables local BM25 lexical retrieval as a sparse provider, allowing users to combine any dense embedding (e.g., Ollama qwen3-embedding) with local BM25 for hybrid search. Uses Milvus-inspired TF/IDF split: documents store length-normalized TF, queries use live IDF from corpus stats. Dot product of the two produces correct BM25 ranking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rocke2020 · 2026-05-31T23:27:07Z

Core fix: local_bm25 is now rebuild-only to keep BM25 scores consistent.
Previously, document sparse vectors were generated one document at a time while corpus stats were being mutated. That meant older documents, newer documents, and query vectors could be based on different doc_count, avgdl, and df(t) values, making BM25 dot-product scores inconsistent and upload-order dependent.

Now the flow is:

document batch: rebuild BM25 corpus stats from the full batch first, then generate all document sparse vectors using the same stats

query: generate the query vector from the current rebuilt stats; do not mutate corpus stats

This keeps document vectors and query vectors aligned to the same BM25 statistics, avoiding stale stats and preserving score consistency.
This rebuild-based design is appropriate when each corpus update can afford a full BM25 stats/vector rebuild.
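
The contract described above can be sketched roughly as follows. `RebuildOnlyBM25` is a hypothetical class, not the PR's code: it works on plain tokens rather than hashes, and its `k1`/`b` defaults and IDF variant are assumptions.

```python
import math
from collections import Counter


class RebuildOnlyBM25:
    """Hypothetical sketch: stats change only in rebuild(), never at query time."""

    def __init__(self, k1=1.2, b=0.75):  # assumed defaults
        self.k1, self.b = k1, b
        self.doc_count = 0
        self.avgdl = 0.0
        self.df = {}

    def rebuild(self, docs):
        """Recompute stats from the full batch, then embed every doc with them."""
        self.doc_count = len(docs)
        self.avgdl = sum(len(d) for d in docs) / len(docs) if docs else 0.0
        self.df = dict(Counter(t for d in docs for t in set(d)))
        return [self._embed_document(d) for d in docs]

    def _embed_document(self, tokens):
        norm = self.k1 * (1 - self.b + self.b * len(tokens) / (self.avgdl or 1.0))
        return {t: f / (f + norm) for t, f in Counter(tokens).items()}

    def embed_query(self, tokens):
        """Read-only: uses the current stats and skips terms the corpus never saw."""
        vec = {}
        for t in set(tokens):
            df = self.df.get(t, 0)
            if df:
                idf = math.log(1 + (self.doc_count - df + 0.5) / (df + 0.5))
                vec[t] = idf * (self.k1 + 1)
        return vec


bm25 = RebuildOnlyBM25()
doc_vectors = bm25.rebuild([
    ["openviking", "hybrid", "search"],
    ["dense", "embedding", "search"],
])
q = bm25.embed_query(["openviking", "unseen"])
scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in doc_vectors]
print(scores)
```

Adding a document under this contract means calling `rebuild()` again with the whole corpus, which is exactly the cost the following paragraph warns about.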

For continuously growing corpora with frequent uploads, the more robust long-term design is to use either search-time BM25 in the retrieval index, or an external sparse/BM25 retrieval provider that owns corpus statistics and scoring.

rocke2020 · 2026-06-01T22:45:07Z

@ByteDanceLiuYang Thanks for your guidance. I have polished local_bm25 to keep BM25 scores consistent, and I think the feature is now feasible. Could you review it? Thanks in advance.

ByteDanceLiuYang · 2026-06-02T04:14:34Z

    "fastapi>=0.128.0",
    "uvicorn>=0.39.0",
    "xxhash>=3.0.0",
+    "jieba>=0.42.1",


Maybe we can move jieba into an optional dependency group and lazy import it, because it is quite large and most non-BM25 users won’t need it.
Example: pip install openviking[local-bm25] ?
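
A sketch of the lazy-import pattern being suggested. `lazy_import` and the `use_jieba` switch are hypothetical; the extra name `local-bm25` is taken from the example above, and `jieba.lcut` is jieba's list-returning segmentation call.

```python
import importlib
import re


def lazy_import(module_name, extra):
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install 'openviking[{extra}]'"
        ) from e


def tokenize(text, use_jieba=False):
    """Regex tokenization by default; jieba is loaded only when requested."""
    if use_jieba:
        jieba = lazy_import("jieba", "local-bm25")
        return [t for t in jieba.lcut(text) if t.strip()]
    return re.findall(r"\w+", text.lower())


print(tokenize("Hybrid search with local BM25"))
```

With this shape, users who never select the jieba tokenizer never import it, so the package can move out of the core dependency list.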

ByteDanceLiuYang · 2026-06-02T06:29:58Z

@rocke2020 Overall LGTM ~ Two minor points left:

jieba should be an optional dependency as mentioned earlier.
The current BM25 needs a full reindex; incremental updates lead to skewed scores. Precise Milvus‑2.5‑style sparse BM25 would require refactoring our C++ engine, which is not necessary: for large-scale full-text search with accurate BM25 scores, we recommend using VolcanoEngine VikingDB’s SearchByKeywords API instead, which will soon be supported in OpenViking. Since this feature targets cost reduction for sparse LLM workloads, the score defect is tolerable. The current lightweight solution is viable, but we must highlight this sparse limitation clearly in the docs to avoid improper usage.
cc @MaojiaSheng @zhoujh01

rocke2020 · 2026-06-02T13:10:02Z

Thanks for your guidance, which really helped me understand this repo. With Codex, I re-read the Milvus-style logic: Milvus calculates BM25 at the retrieval stage and does not need to rebuild the whole corpus when the corpus is updated.

I have now made BM25 accurate by running a full reindex on each document update, asynchronously and under a write lock. I also made jieba optional; it is useful for Chinese but not for English.
For a small corpus, maybe 10k documents, this solution may be OK. I tested 10k documents locally with batch size 512: it took 2.2 s, not counting DB storage I/O cost.

From my real usage of Milvus, I feel it is better to offer a free local BM25 solution for users whose corpora are not large and who want something local and fast.

By the way, it would also be OK to "boil the lake" with agent coding and refactor the C++ engine to implement the same Milvus approach, which does not need a full reindex but slightly increases retrieval time.

ByteDanceLiuYang · 2026-06-03T06:31:40Z

@rocke2020 Thanks for your iteration — it is a creative idea. But there're some concerns about commit 0f972467:

Heavy write amplification. If I'm reading it right, every add_resource schedules a full corpus scan plus a re-upsert of all sparse vectors. Even though it is async with in-flight coalescing, streaming N inserts still costs O(N²) writes to the sparse index, plus latency spikes on unrelated queries. A size-driven trigger might amortize this: after a rebuild at corpus size N₀, trigger the next rebuild only at N₀ × 1.5. Each doc then gets rebuilt O(log N) times, and the amortized cost per insert drops to O(1).
Stepping back: maybe not worth it. Even with a size-driven trigger, this only covers the add-resource path. We would still have to handle rm, mv, and so on. For a "cheap local lexical signal in hybrid retrieval" provider, that's a lot of surface area.
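
The write-amplification argument can be checked with a small simulation. This is only a counting sketch under simplifying assumptions: one insert at a time, every rebuild re-upserts the whole corpus, and the function names are illustrative.

```python
import math


def writes_rebuild_every_insert(n_docs):
    """Each insert re-upserts the whole corpus: 1 + 2 + ... + N writes."""
    return sum(range(1, n_docs + 1))


def writes_size_driven(n_docs, growth=1.5):
    """Rebuild only once the corpus has grown by `growth` since the last rebuild."""
    total = 0
    next_trigger = 1
    for size in range(1, n_docs + 1):
        if size >= next_trigger:
            total += size  # full re-upsert at this corpus size
            next_trigger = max(size + 1, math.ceil(size * growth))
    return total


n = 10_000
print(writes_rebuild_every_insert(n))  # grows quadratically with n
print(writes_size_driven(n))           # stays within a small constant times n
```

The price of the size-driven variant is staleness: documents inserted after the last trigger keep vectors built from older statistics until the next rebuild fires.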

My suggestion: maybe it's better to revert to the version before 0f97246 — keep the explicit "rebuild-only, caller-driven" contract, and document it clearly. Curious to hear your thoughts.

github-project-automation Bot added this to OpenViking project May 5, 2026

github-project-automation Bot moved this to Backlog in OpenViking project May 5, 2026

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 33ff666 to 1f4db25 Compare May 5, 2026 13:28

qin-ctx requested a review from zhoujh01 May 6, 2026 03:13

qin-ctx assigned zhoujh01 May 6, 2026

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 1f4db25 to cc811fc Compare May 12, 2026 22:44

ByteDanceLiuYang reviewed May 14, 2026

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch 6 times, most recently from 1487897 to 06c6edd Compare May 16, 2026 12:36

rocke2020 and others added 2 commits June 1, 2026 07:14

fix: make local bm25 rebuild-only

10a3bfd

rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from eb0a88d to 10a3bfd Compare May 31, 2026 23:18

chore: sync uv lock after rebase

be275d0

ByteDanceLiuYang reviewed Jun 2, 2026

rocke2020 added 4 commits June 2, 2026 19:10

fix: make local bm25 jieba optional

9d80aee

Fix local BM25 corpus rebuild

0f97246

Fix local BM25 rebuild scheduling race

55fec83

docs: update the configuration.md, local bm25 is full reindex, accura…

b83dcf8

…te bm25
