feat: add local_bm25 sparse provider for hybrid retrieval#1857
feat: add local_bm25 sparse provider for hybrid retrieval#1857rocke2020 wants to merge 7 commits into
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
33ff666 to
1f4db25
Compare
1f4db25 to
cc811fc
Compare
|
Thanks. and there are some suggestions:
DEFAULT_TOKEN_PATTERN = r"\w+"
local_bm25_embedder.pydef init(
def _embed_document(self, token_hashes: List[int]) -> EmbedResult:
|
| self.total_tokens = raw.get("total_tokens", 0) | ||
| self.term_doc_freq = {int(k): v for k, v in raw.get("term_doc_freq", {}).items()} | ||
| except (json.JSONDecodeError, ValueError, OSError) as e: | ||
| logger.warning("bm25: failed to load stats from %s: %s", path, e) |
There was a problem hiding this comment.
如果有文件损坏等严重错误,最好是抛出异常
| if is_query: | ||
| return self._embed_query(token_hashes) | ||
| return self._embed_document(token_hashes) | ||
|
|
There was a problem hiding this comment.
todo: 可以考虑实现下embed_batch
|
|
||
| def _hash_token(token: str) -> int: | ||
| """CRC32 hash of token, matching Milvus approach.""" | ||
| return zlib.crc32(token.encode("utf-8")[:128]) & 0xFFFFFFFF |
There was a problem hiding this comment.
可以不用照抄milvus的做法,crc32的空间有限,哈希碰撞概率大。这里python项目可以换用例如xxh64
| logger.warning("bm25: failed to load stats from %s: %s", path, e) | ||
|
|
||
|
|
||
| def _tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> List[str]: |
There was a problem hiding this comment.
分词器可以拎出来一个配置项,默认至少可以用jieba之类,性能不会太差。当前这个分词器相当于对中文没分词,效果应该不太好。
1487897 to
06c6edd
Compare
thanks for your advice. I realized them. |
|
@rocke2020 LGTM |
…ud dependencies Enables local BM25 lexical retrieval as a sparse provider, allowing users to combine any dense embedding (e.g., Ollama qwen3-embedding) with local BM25 for hybrid search. Uses Milvus-inspired TF/IDF split: documents store length-normalized TF, queries use live IDF from corpus stats. Dot product of the two produces correct BM25 ranking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
eb0a88d to
10a3bfd
Compare
|
Core fix: local_bm25 is now rebuild-only to keep BM25 scores consistent. Now the flow is: document batch: query: This keeps document vectors and query vectors aligned to the same BM25 statistics, avoiding stale stats and preserving score consistency. For continuously growing corpora with frequent uploads, the more robust long-term design is search-time BM25 in the retrieval index. prefer either search-time BM25 in the retrieval index, or an external sparse/BM25 retrieval provider that owns corpus statistics and scoring. |
|
@ByteDanceLiuYang thanks for your instructions, now i polish local_bm25 to keep BM25 scores consistent. I personally think now, this feature is really a feasible feature, could you review? thanks in advance |
| "fastapi>=0.128.0", | ||
| "uvicorn>=0.39.0", | ||
| "xxhash>=3.0.0", | ||
| "jieba>=0.42.1", |
There was a problem hiding this comment.
Maybe we can move jieba into an optional dependency group and lazy import it, because it is quite large and most non-BM25 users won’t need it.
Example: pip install openviking[local-bm25] ?
@rocke2020 Overall LGTM ~ Two minor points left:
|
|
thanks for your instructions which indeed help me much to understand this repo. With codex, I re-read milvus-style logic, milvus calculate bm25 at the retrieval stage, and not need to rebuild the whole corpus when update the corpus. Now, I made an accurate bm25 to full reindex for each doucument updates, in an async and write-lock way. Yes, now, it is accurate bm25 application. I also made jieba as optional, useful for chinese, not useful for english. From my real usage with milvus, I feel it is better to supply a free local bm25 solution for users whose corpus are not large and who want a local and fast solution. btw, it is also ok to boil the lake with agent coding, to refactor the c++ engine to realize the same milvus solution which not need full reindex, but slightly increase retrieval times. |
@rocke2020 Thanks for your iteration — it is a creative idea. But there're some concerns about commit 0f972467:
My suggestion: maybe it's better to revert to the version before 0f97246 — keep the explicit "rebuild-only, caller-driven" contract, and document it clearly. Curious to hear your thoughts. |
Summary
local_bm25as a new sparse embedding provider, enabling local BM25 lexical retrieval without cloud APIs or heavyweight ML dependenciesqwen3-embedding:0.6b) with local BM25 for hybrid search via the existingCompositeHybridEmbedderpathDesign
Follows the existing
SparseEmbedderBaseinterface. The C++ sparse engine only supports dot product, so full BM25 scoring is pre-baked into the vectors:tf / (tf + k1 * (1 - b + b * doclen/avgdl))per termidf(t) * (k1 + 1)per termConfig example:
Files changed
openviking/models/embedder/local_bm25_embedder.pyopenviking/models/embedder/__init__.pyLocalBM25Embedderopenviking_cli/utils/config/embedding_config.pylocal_bm25to provider validation + factory registrytests/unit/test_local_bm25_embedder.pyTest plan
local_bm25provider accepted, model defaults to "bm25"EmbeddingConfigwith dense+sparse createsCompositeHybridEmbedder🤖 Generated with Claude Code