Skip to content

feat: add local_bm25 sparse provider for hybrid retrieval#1857

Open
rocke2020 wants to merge 7 commits into
volcengine:mainfrom
rocke2020:feat/local-bm25-sparse-provider
Open

feat: add local_bm25 sparse provider for hybrid retrieval#1857
rocke2020 wants to merge 7 commits into
volcengine:mainfrom
rocke2020:feat/local-bm25-sparse-provider

Conversation

@rocke2020
Copy link
Copy Markdown

Summary

  • Adds local_bm25 as a new sparse embedding provider, enabling local BM25 lexical retrieval without cloud APIs or heavyweight ML dependencies
  • Users can combine any dense embedding (e.g., Ollama qwen3-embedding:0.6b) with local BM25 for hybrid search via the existing CompositeHybridEmbedder path
  • Uses Milvus-inspired TF/IDF split architecture: documents store length-normalized TF at insert time, queries compute IDF-weighted vectors from live corpus stats at search time

Design

Follows the existing SparseEmbedderBase interface. The C++ sparse engine only supports dot product, so full BM25 scoring is pre-baked into the vectors:

  • Document vector: tf / (tf + k1 * (1 - b + b * doclen/avgdl)) per term
  • Query vector: idf(t) * (k1 + 1) per term
  • dot_product(query, doc) = BM25 score

Config example:

embedding:
  dense:
    provider: ollama
    model: qwen3-embedding:0.6b
    dimension: 1024
  sparse:
    provider: local_bm25

Files changed

File Change
openviking/models/embedder/local_bm25_embedder.py NEW: BM25 embedder (tokenizer, CRC32 hashing, corpus stats, scoring)
openviking/models/embedder/__init__.py Export LocalBM25Embedder
openviking_cli/utils/config/embedding_config.py Add local_bm25 to provider validation + factory registry
tests/unit/test_local_bm25_embedder.py 28 unit tests

Test plan

  • Unit test tokenization and CRC32 hashing
  • Unit test BM25 stats persistence (save/load)
  • Unit test IDF: rare terms get higher weight than common terms
  • Unit test dot-product ranking: query "openviking" ranks doc containing "openviking" higher
  • Config validation: local_bm25 provider accepted, model defaults to "bm25"
  • Regression: dense-only configs still work unchanged
  • Integration: EmbeddingConfig with dense+sparse creates CompositeHybridEmbedder

🤖 Generated with Claude Code

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 5, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Performance Concern

Saving BM25 stats to disk after every document embed can cause excessive I/O and slow down bulk insertions. Consider adding a configurable flush interval or an explicit flush method instead.

self.stats.add_document(token_hashes, doc_len)
if self._stats_path:
    self.stats.save(self._stats_path)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 33ff666 to 1f4db25 Compare May 5, 2026 13:28
@qin-ctx qin-ctx requested a review from zhoujh01 May 6, 2026 03:13
@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from 1f4db25 to cc811fc Compare May 12, 2026 22:44
@MaojiaSheng
Copy link
Copy Markdown
Collaborator

Thanks. and there are some suggestions:

  1. Tokenization Limitations

DEFAULT_TOKEN_PATTERN = r"\w+"

  • The regex \w+ doesn't handle CJK (Chinese/Japanese/Korean) text well - it will treat entire sentences as single tokens
  • Consider adding language-specific tokenization or using a more sophisticated tokenizer for multilingual support
  1. Configuration Flexibility

local_bm25_embedder.py

def init(
self,
model_name: str = "bm25",
k1: float = DEFAULT_K1,
b: float = DEFAULT_B,
token_pattern: str = DEFAULT_TOKEN_PATTERN,
stats_path: Optional[str] = None,
config: Optional[Dict[str, Any]] = None,
):

  • The custom parameters (k1, b, token_pattern, stats_path) are accepted but not exposed in the config factory
  • Consider adding these to EmbeddingModelConfig so users can tune BM25 parameters via yaml config
  1. Stats Persistence Granularity

def _embed_document(self, token_hashes: List[int]) -> EmbedResult:
# ...
self.stats.add_document(token_hashes, doc_len)
if self._stats_path:
self.stats.save(self._stats_path) # Saves after EVERY document

  • Saving stats after every document insertion could cause I/O bottlenecks for bulk inserts
  • Consider adding a batch mode or periodic autosave instead

self.total_tokens = raw.get("total_tokens", 0)
self.term_doc_freq = {int(k): v for k, v in raw.get("term_doc_freq", {}).items()}
except (json.JSONDecodeError, ValueError, OSError) as e:
logger.warning("bm25: failed to load stats from %s: %s", path, e)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果有文件损坏等严重错误,最好是抛出异常

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

if is_query:
return self._embed_query(token_hashes)
return self._embed_document(token_hashes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: 可以考虑实现下embed_batch

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes


def _hash_token(token: str) -> int:
"""CRC32 hash of token, matching Milvus approach."""
return zlib.crc32(token.encode("utf-8")[:128]) & 0xFFFFFFFF
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可以不用照抄milvus的做法,crc32的空间有限,哈希碰撞概率大。这里python项目可以换用例如xxh64

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

谢谢,已经修改为 xxhash

logger.warning("bm25: failed to load stats from %s: %s", path, e)


def _tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> List[str]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

分词器可以拎出来一个配置项,默认至少可以用jieba之类,性能不会太差。当前这个分词器相当于对中文没分词,效果应该不太好。

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

谢谢!已经默认改成 jieba

@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch 6 times, most recently from 1487897 to 06c6edd Compare May 16, 2026 12:36
@rocke2020
Copy link
Copy Markdown
Author

Thanks. and there are some suggestions:

  1. Tokenization Limitations

DEFAULT_TOKEN_PATTERN = r"\w+"

  • The regex \w+ doesn't handle CJK (Chinese/Japanese/Korean) text well - it will treat entire sentences as single tokens
  • Consider adding language-specific tokenization or using a more sophisticated tokenizer for multilingual support
  1. Configuration Flexibility

local_bm25_embedder.py

def init( self, model_name: str = "bm25", k1: float = DEFAULT_K1, b: float = DEFAULT_B, token_pattern: str = DEFAULT_TOKEN_PATTERN, stats_path: Optional[str] = None, config: Optional[Dict[str, Any]] = None, ):

  • The custom parameters (k1, b, token_pattern, stats_path) are accepted but not exposed in the config factory
  • Consider adding these to EmbeddingModelConfig so users can tune BM25 parameters via yaml config
  1. Stats Persistence Granularity

def _embed_document(self, token_hashes: List[int]) -> EmbedResult: # ... self.stats.add_document(token_hashes, doc_len) if self._stats_path: self.stats.save(self._stats_path) # Saves after EVERY document

  • Saving stats after every document insertion could cause I/O bottlenecks for bulk inserts
  • Consider adding a batch mode or periodic autosave instead

thanks for your advice. I realized them.

@ByteDanceLiuYang
Copy link
Copy Markdown
Contributor

@rocke2020 LGTM
cc @MaojiaSheng @zhoujh01

rocke2020 and others added 2 commits June 1, 2026 07:14
…ud dependencies

Enables local BM25 lexical retrieval as a sparse provider, allowing users to combine
any dense embedding (e.g., Ollama qwen3-embedding) with local BM25 for hybrid search.
Uses Milvus-inspired TF/IDF split: documents store length-normalized TF, queries use
live IDF from corpus stats. Dot product of the two produces correct BM25 ranking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rocke2020 rocke2020 force-pushed the feat/local-bm25-sparse-provider branch from eb0a88d to 10a3bfd Compare May 31, 2026 23:18
@rocke2020
Copy link
Copy Markdown
Author

Core fix: local_bm25 is now rebuild-only to keep BM25 scores consistent.
Previously, document sparse vectors were generated one document at a time while corpus stats were being mutated. That meant older documents, newer documents, and query vectors could be based on different doc_count, avgdl, and df(t) values, making BM25 dot-product scores inconsistent and upload-order dependent.

Now the flow is:

document batch:
rebuild BM25 corpus stats from the full batch first then generate all document sparse vectors using the same stats

query:
generate the query vector from the current rebuilt stats, do not mutate corpus stats

This keeps document vectors and query vectors aligned to the same BM25 statistics, avoiding stale stats and preserving score consistency.
This rebuild-based design is appropriate when each corpus update can afford a full BM25 stats/vector rebuild.

For continuously growing corpora with frequent uploads, the more robust long-term design is search-time BM25 in the retrieval index. prefer either search-time BM25 in the retrieval index, or an external sparse/BM25 retrieval provider that owns corpus statistics and scoring.

@rocke2020
Copy link
Copy Markdown
Author

rocke2020 commented Jun 1, 2026

@ByteDanceLiuYang thanks for your instructions, now i polish local_bm25 to keep BM25 scores consistent. I personally think now, this feature is really a feasible feature, could you review? thanks in advance

Comment thread pyproject.toml Outdated
"fastapi>=0.128.0",
"uvicorn>=0.39.0",
"xxhash>=3.0.0",
"jieba>=0.42.1",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can move jieba into an optional dependency group and lazy import it, because it is quite large and most non-BM25 users won’t need it.
Example: pip install openviking[local-bm25] ?

@ByteDanceLiuYang
Copy link
Copy Markdown
Contributor

@ByteDanceLiuYang thanks for your instructions, now i polish local_bm25 to keep BM25 scores consistent. I personally think now, this feature is really a feasible feature, could you review? thanks in advance

@rocke2020 Overall LGTM ~ Two minor points left:

  1. jieba should be an optional dependency as mentioned earlier.
  2. Current BM25 needs full reindex; incremental updates lead to skewed scores. Precise Milvus‑2.5‑style sparse BM25 requires refactoring our C++ engine but not necessary: For large-scale full-text search with accurate BM25 scores, we recommend to use VolcanoEngine VikingDB’s SearchByKeywords API instead, and it will be soon supported in OpenViking. Since this feature targets cost reduction for sparse LLM workloads, the score defect is tolerable. The current lightweight solution is viable, but we must highlight this sparse limitation clearly in docs to avoid improper usage.
    cc @MaojiaSheng @zhoujh01

@rocke2020
Copy link
Copy Markdown
Author

thanks for your instructions which indeed help me much to understand this repo. With codex, I re-read milvus-style logic, milvus calculate bm25 at the retrieval stage, and not need to rebuild the whole corpus when update the corpus.

Now, I made an accurate bm25 to full reindex for each doucument updates, in an async and write-lock way. Yes, now, it is accurate bm25 application. I also made jieba as optional, useful for chinese, not useful for english.
For small corpus, maybe 10k documents, this solution may be ok. I have tested 10k documents locally with batch size 512, 2.2s without consider DB storate I/O cost.

From my real usage with milvus, I feel it is better to supply a free local bm25 solution for users whose corpus are not large and who want a local and fast solution.

btw, it is also ok to boil the lake with agent coding, to refactor the c++ engine to realize the same milvus solution which not need full reindex, but slightly increase retrieval times.

@ByteDanceLiuYang
Copy link
Copy Markdown
Contributor

thanks for your instructions which indeed help me much to understand this repo. With codex, I re-read milvus-style logic, milvus calculate bm25 at the retrieval stage, and not need to rebuild the whole corpus when update the corpus.

Now, I made an accurate bm25 to full reindex for each doucument updates, in an async and write-lock way. Yes, now, it is accurate bm25 application. I also made jieba as optional, useful for chinese, not useful for english. For small corpus, maybe 10k documents, this solution may be ok. I have tested 10k documents locally with batch size 512, 2.2s without consider DB storate I/O cost.

From my real usage with milvus, I feel it is better to supply a free local bm25 solution for users whose corpus are not large and who want a local and fast solution.

btw, it is also ok to boil the lake with agent coding, to refactor the c++ engine to realize the same milvus solution which not need full reindex, but slightly increase retrieval times.

@rocke2020 Thanks for your iteration — it is a creative idea. But there're some concerns about commit 0f972467:

  1. Heavy write amplification. If I'm reading it right, every add_resource schedules a full corpus scan + re-upsert of all sparse vectors. Even though it's async with in-flight coalescing, streaming N inserts still costs O(N²) writes to the sparse index, plus latency spikes on unrelated queries. For this, I think maybe a size-driven trigger would amortize this: After a rebuild at corpus size N₀, only trigger the next rebuild at N₀ × 1.5. Each doc gets rebuilt O(log N) times, amortized cost per insert drops to O(1).

  2. Stepping back — maybe not worth it. Even if fixed with "size-driven trigger", this only covers add-resource opt. We still should handle rm and mv, and so on. For a "cheap local lexical signal in hybrid retrieval" provider, that's a lot of surface area.

My suggestion: maybe it's better to revert to the version before 0f97246 — keep the explicit "rebuild-only, caller-driven" contract, and document it clearly. Curious to hear your thoughts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

5 participants