
Conversation

Contributor

@zhongdaor-nv zhongdaor-nv commented Nov 25, 2025

Overview:

Add multimodal (MM) hash support to the KV router, ensuring blocks with identical tokens but different multimodal objects produce different hashes. Also adds a standalone KV router example for TRT-LLM with MM support.

Details:

  • Add multimodal metadata structures: BlockExtraInfo, RequestExtraInfo, BlockMmObjectInfo, RequestMmObjectInfo in kv_router/protocols.rs
  • Update compute_block_hash_for_seq to incorporate MM hashes into block hash computation
  • Extend Python bindings in kv.rs to accept optional block_mm_infos parameter
  • Add new standalone TRT-LLM router example under examples/deployments/router_standalone_trtllm/
  • Add unit tests for multimodal KV router functionality in test_mm_kv_router.py
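The core idea — identical token blocks must hash differently when they carry different multimodal objects — can be illustrated with a toy model in plain Python. This is a sketch only: the real `compute_block_hash_for_seq` lives in `lib/llm/src/kv_router/indexer.rs`, is written in Rust, and uses its own hash function; the `hashlib`-based folding below just demonstrates the behavior.

```python
import hashlib

def toy_block_hashes(tokens, block_size, block_mm_infos=None):
    """Toy model: hash each full block of tokens, folding in any per-block mm hashes."""
    hashes = []
    num_blocks = len(tokens) // block_size
    for i in range(num_blocks):
        h = hashlib.blake2b(digest_size=8)
        for t in tokens[i * block_size : (i + 1) * block_size]:
            h.update(t.to_bytes(4, "little"))
        info = block_mm_infos[i] if block_mm_infos else None
        if info:
            # Same tokens + different image hash => different block hash.
            for obj in info["mm_objects"]:
                h.update(obj["mm_hash"].to_bytes(8, "little"))
        hashes.append(int.from_bytes(h.digest(), "little"))
    return hashes

tokens = [1, 2, 3, 4] * 8  # 32 tokens = 1 block
text_only = toy_block_hashes(tokens, 32)
with_image = toy_block_hashes(tokens, 32, [{"mm_objects": [{"mm_hash": 0xDEADBEEF}]}])
assert text_only != with_image
```

The non-MM path (`block_mm_infos=None`) hashes exactly the token bytes, which mirrors the backward-compatibility requirement: existing text-only cache hits must be unaffected.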

Where should the reviewer start?

  • lib/llm/src/kv_router/protocols.rs - new multimodal protocol structures
  • lib/llm/src/kv_router/indexer.rs - updated hash computation with MM support
  • lib/bindings/python/tests/test_mm_kv_router.py - tests demonstrating the new functionality
  • examples/deployments/router_standalone_trtllm/ - new standalone example

Related Issues:

Relates to DIS-916

Summary by CodeRabbit

  • New Features

    • Added standalone router implementation with optimized KV cache routing and load-based worker selection
    • Extended chat completion API to support multimodal inputs (image URLs in messages)
    • Introduced KV cache metadata tracking and block-level caching optimization
  • Documentation

    • Added comprehensive documentation, example scripts, and test suite for router standalone setup with TensorRT-LLM integration
    • Added performance benchmarking and API testing utilities


Signed-off-by: zhongdaor <[email protected]>
@zhongdaor-nv zhongdaor-nv marked this pull request as ready for review December 11, 2025 05:48
@zhongdaor-nv zhongdaor-nv requested review from a team as code owners December 11, 2025 05:48
@zhongdaor-nv zhongdaor-nv changed the title add mm extra info feat: add multimodal support to KV router with standalone trtllm example Dec 11, 2025
@github-actions github-actions bot added the feat label Dec 11, 2025
Contributor

coderabbitai bot commented Dec 11, 2025

Walkthrough

This PR introduces multimodal (MM) support to the KV router system by adding MM metadata structures and hash computation logic across bindings and core routing logic. Concurrently, it adds a complete standalone TensorRT-LLM router deployment example with API, worker, routing, and testing components.

Changes

Cohort / File(s) Summary
Router Standalone TensorRT-LLM Deployment Example
examples/deployments/router_standalone_trtllm/README.md, __init__.py, api.py, worker.py, router.py, test_router.py, perf.sh, ping.sh
New standalone router deployment with FastAPI service (api.py) handling chat completions and multimodal inputs, TRT-LLM worker wrapper (worker.py) with KV cache event publishing, KV router (router.py) using RadixTree-based matching and load metrics, comprehensive test suite (test_router.py) for text and MM routing scenarios, performance benchmarking and ping scripts, and detailed documentation.
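The router described here scores workers by combining radix-tree prefix overlap with load metrics. A toy scoring policy sketches the decision; the weighting, function names, and metric shapes below are illustrative assumptions, not the actual code in `router.py`.

```python
def select_worker(overlap_blocks, loads, overlap_weight=2.0):
    """Toy policy: prefer workers with more cached prefix blocks, penalize load.

    overlap_blocks: worker_id -> number of matched KV blocks (from the radix tree)
    loads: worker_id -> load metric in [0, 1] (e.g. KV cache usage)
    """
    def score(worker):
        return overlap_weight * overlap_blocks.get(worker, 0) - loads.get(worker, 0.0)

    return max(overlap_blocks.keys() | loads.keys(), key=score)

# Worker 1 has cached prefix blocks; worker 2 is idle but has no overlap.
best = select_worker({1: 3, 2: 0}, {1: 0.4, 2: 0.1})
```

With MM-aware hashing in place, an MM request only overlaps blocks published with the same `mm_hash`, so this policy naturally steers it to the worker that already holds the matching image-conditioned KV blocks.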
Python Bindings for Multimodal KV Hashing
lib/bindings/python/rust/llm/kv.rs, lib/bindings/python/src/dynamo/_core.pyi
Extended compute_block_hash_for_seq_py and KvEventPublisher::publish_stored to accept optional block_mm_infos parameter; propagates MM metadata through Python-Rust boundary. Updated type stubs with MM info documentation and usage examples.
C Bindings for Multimodal Support
lib/bindings/c/src/lib.rs
Updated KV cache stored block construction to initialize new mm_extra_info field and pass MM parameter to hash computation.
Multimodal KV Router Core Logic
lib/llm/src/kv_router/protocols.rs
Added new MM metadata types: BlockMmObjectInfo, BlockExtraInfo, RequestMmObjectInfo, RequestExtraInfo with to_block_level() conversion. Extended RouterRequest::New and KvCacheStoredBlockData to carry MM info.
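A plausible reading of `to_block_level()` is that it maps request-level token-offset ranges onto the blocks they overlap. The sketch below encodes that assumed semantics in Python; the real Rust logic in `protocols.rs` may differ in details such as offset conventions or partial-block handling.

```python
def to_block_level(mm_objects, num_blocks, block_size):
    """Toy model: attach each request-level mm object to every block whose
    token range overlaps one of the object's [start, end) token offsets."""
    blocks = [None] * num_blocks
    for obj in mm_objects:
        for start, end in obj["offsets"]:
            first = start // block_size
            last = min((end - 1) // block_size, num_blocks - 1)
            for b in range(first, last + 1):
                if blocks[b] is None:
                    blocks[b] = {"mm_objects": []}
                blocks[b]["mm_objects"].append({"mm_hash": obj["mm_hash"]})
    return blocks

# One image spanning tokens [40, 100) with block_size=32 touches blocks 1..3.
blocks = to_block_level([{"mm_hash": 0xABCD, "offsets": [(40, 100)]}], 4, 32)
```

Blocks outside the image's token range stay `None`, so their hashes (and cache hits) are identical to the text-only case.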
KV Router Indexer and Hash Computation
lib/llm/src/kv_router/indexer.rs
Extended compute_block_hash_for_seq() signature to accept optional block_mm_infos parameter; incorporates MM hashes from BlockExtraInfo into block hash computation to differentiate blocks by both tokens and multimodal content.
KV Cache Event Publishing
lib/llm/src/kv_router/publisher.rs
Updated create_stored_block_from_parts() and create_stored_blocks() to accept and propagate block_mm_infos; extended RawKvEvent deserialization to parse MM metadata in stored block events.
Router Request Handling
lib/llm/src/kv_router.rs
Updated compute_block_hash_for_seq call sites to pass MM parameter; extended pattern matching for RouterRequest::New to accommodate request_extra_info field.
Mocker KV Manager
lib/llm/src/mocker/kv_manager.rs
Added mm_extra_info: None initialization in KV cache stored block construction.
Preprocessor
lib/llm/src/protocols/common/preprocessor.rs
Added optional request_extra_info: Option<RequestExtraInfo> field to PreprocessedRequest.
Multimodal KV Router Tests
lib/bindings/python/tests/test_mm_kv_router.py
New comprehensive test suite validating MM-aware hash computation, block storage, per-worker removal, and end-to-end routing with MM metadata across multiple workers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

  • Multimodal metadata propagation: MM info flows through multiple layers (protocols → indexer → publisher → bindings); verify correctness of hash incorporation and serialization across all paths
  • New public APIs: Multiple new exported classes and functions (ServingParams, ParsedRequest, ProcessedInput, RouterRequest, BlockExtraInfo, RequestExtraInfo); ensure consistency with existing patterns
  • Comprehensive new module (worker.py, api.py, router.py): Large volumes of new business logic with streaming, async tasks, and ZMQ publishers; verify event handling, error paths, and lifecycle management
  • Hash computation logic: Integration of MM hashes into block hashing requires careful verification of ordering, collision handling, and backward compatibility
  • Interconnected changes: MM support spans from Rust protocols through Python bindings to the example deployment; any inconsistency propagates broadly

Areas requiring extra attention:

  • Verification that compute_block_hash_for_seq() correctly incorporates MM hashes without breaking existing cache hits for non-MM cases
  • Review of RequestExtraInfo::to_block_level() logic for correct block-wise MM info aggregation and offset computation
  • Validation of worker.py KV event publishing, particularly block parsing and MM info extraction
  • Confirmation that MM metadata serialization/deserialization is symmetric across all event types
  • Testing coverage for edge cases: partial blocks, multiple MM objects, None MM info, and cache eviction scenarios
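The serialization-symmetry concern above is mechanically checkable with a round-trip test. A minimal JSON sketch (field names taken from this PR's structures, but the payload shape is illustrative):

```python
import json

stored_block = {
    "tokens_hash": 123,
    "block_hash": 456,
    "mm_extra_info": {"mm_objects": [{"mm_hash": 3735928559, "offsets": [[0, 32]]}]},
}

# Symmetric serde: deserialize(serialize(x)) == x, for MM and non-MM blocks alike.
assert json.loads(json.dumps(stored_block)) == stored_block
no_mm = {**stored_block, "mm_extra_info": None}
assert json.loads(json.dumps(no_mm)) == no_mm
```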

🐰 A KV cache blooms with MM delight,
Routers match tokens and images just right,
RadixTrees and hashes align,
Workers stream chunks so fine,
Multimodal dreams take flight!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

  • Description check — ⚠️ Warning: The PR description contains only placeholder headings with no actual content; all sections (Overview, Details, Where should the reviewer start, Related Issues) are unfilled. Resolution: Complete all required sections: provide an overview of the multimodal metadata additions, describe the key changes across files, specify which files reviewers should prioritize, and reference the actual GitHub issue number.
  • Title check — ❓ Inconclusive: The PR title 'add mm extra info' is vague and does not clearly convey the main changes; it uses a generic abbreviation ('mm') without context. Resolution: Revise the title to be more specific and descriptive, such as 'Add multimodal metadata support to KV cache router' or similar to clarify the scope.

✅ Passed checks (1 passed)

  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/bindings/python/src/dynamo/_core.pyi (1)

235-281: Remove duplicate docstring content.

The docstring contains duplicate documentation. Lines 247-271 provide the new detailed Args/Returns/Example documentation, but lines 272-279 repeat the older, shorter version of the same documentation. This redundancy should be cleaned up.

     Example:
         >>> tokens = [1, 2, 3, 4] * 8  # 32 tokens = 1 block
         >>> mm_info = {
         ...     "mm_objects": [{
         ...         "mm_hash": 0xDEADBEEF,
         ...     }]
         ... }
         >>> hashes = compute_block_hash_for_seq_py(tokens, 32, [mm_info])
-    
-    Compute block hashes for a sequence of tokens
-
-    Args:
-        tokens: List of token IDs
-        kv_block_size: Size of each KV cache block
-
-    Returns:
-        List of block hashes as integers
     """

     ...
🧹 Nitpick comments (17)
lib/llm/src/kv_router/protocols.rs (1)

450-486: LGTM!

The test correctly initializes the new mm_extra_info field. Consider adding a separate test case that validates serialization/deserialization with non-None multimodal metadata to ensure the new structures work end-to-end.

lib/llm/src/kv_router.rs (1)

485-493: Consider extracting the duplicate block_hashes computation.

compute_block_hash_for_seq is called twice with the same inputs—once at line 487 and again at line 491. While this is conditional (the second call is inside then), when router_track_active_blocks is true, both calls execute. Consider computing block_hashes once and reusing it.

 pub async fn get_potential_loads(&self, tokens: &[u32]) -> Result<Vec<PotentialLoad>> {
     let isl_tokens = tokens.len();
     let block_hashes = compute_block_hash_for_seq(tokens, self.block_size, None);
-    let overlap_scores = self.indexer.find_matches(block_hashes).await?;
+    let overlap_scores = self.indexer.find_matches(block_hashes.clone()).await?;

     let maybe_seq_hashes = self.kv_router_config.router_track_active_blocks.then(|| {
-        let block_hashes = compute_block_hash_for_seq(tokens, self.block_size, None);
         compute_seq_hash_for_block(&block_hashes)
     });
lib/bindings/python/rust/llm/kv.rs (1)

327-372: DRY up block_mm_infos depythonize logic across bindings

Both compute_block_hash_for_seq_py and KvEventPublisher::publish_stored implement identical block_mm_infos → Option<Vec<Option<BlockExtraInfo>>> conversion and error mapping. This is fine functionally, but it’s an easy place for future drift.

A small internal helper (e.g., fn depythonize_block_mm_infos(obj: Option<Bound<PyAny>>) -> PyResult<Option<Vec<Option<BlockExtraInfo>>>>) would centralize the behavior and keep Python↔Rust MM metadata semantics consistent.

lib/llm/src/kv_router/indexer.rs (1)

1110-1151: MM extra info on synthetic routing events is always None

In the TTL/pruning routing path, synthetic KvCacheEventData::Stored events are built with:

blocks: hashes.map(|(local_hash, sequence_hash)| KvCacheStoredBlockData {
    tokens_hash: *local_hash,
    block_hash: ExternalSequenceBlockHash(*sequence_hash),
    mm_extra_info: None,
})

This keeps the existing behavior (no MM metadata on these synthetic entries), which is fine as long as MM distinctions are encoded solely into tokens_hash/sequence_hash. If future features rely on mm_extra_info for anything beyond hashing (e.g., inspection or filtering), you may eventually want to carry through real MM info here as well.

Not an immediate issue, but worth keeping in mind as MM use‑cases expand.

lib/llm/src/kv_router/publisher.rs (2)

446-485: Consider a small test that exercises non‑None block_mm_infos

create_stored_blocks now accepts block_mm_infos: Option<&[Option<BlockExtraInfo>]> and threads per‑block entries into both tokens_hash and mm_extra_info, but the unit tests only cover the None case:

  • test_create_stored_blocks_ok
  • test_create_stored_blocks_wrong_size_triggers_warning

A small additional test that passes a Some(&[Some(BlockExtraInfo { … })]) and asserts:

  • blocks[i].mm_extra_info is Some(...), and
  • tokens_hash matches a direct call to compute_block_hash_for_seq with the same MM info

would close that gap and guard this path against regressions.


1003-1017: Existing convert_event_block_stored test only covers the None MM case

test_convert_event_block_stored still uses block_mm_infos: None, which is good for backward compatibility but doesn’t validate that non‑None block_mm_infos survive deserialization and reach create_stored_blocks.

Once you’re happy with the MM plumbing, consider extending this test (or adding a new one) that supplies a simple block_mm_infos payload and asserts that the resulting KvCacheStoreData::blocks entries have mm_extra_info populated as expected.

examples/deployments/router_standalone_trtllm/router.py (1)

288-296: Mark unused app argument to satisfy linters

lifespan(self, app: FastAPI) doesn’t use app, which is intentional for FastAPI’s lifecycle signature but shows up in Ruff (ARG002). Renaming it to _app (or adding a del app inside) makes that intent explicit and keeps linters quiet:

-async def lifespan(self, app: FastAPI):
+async def lifespan(self, _app: FastAPI):
examples/deployments/router_standalone_trtllm/test_router.py (2)

72-92: Consider catching more specific exceptions.

While broad exception handling is acceptable in test utilities, logging the exception would improve debuggability.

 def send_request(client: httpx.Client, url: str, payload: dict) -> bool:
     """Send a chat completion request and consume the stream."""
     try:
         resp = client.post(f"{url}/v1/chat/completions", json=payload)
         if resp.status_code != 200:
             return False
         for _ in resp.iter_lines():
             pass
         return True
-    except Exception:
+    except Exception as e:
+        print(f"Request failed: {e}")
         return False


 def get_tree_info(client: httpx.Client, url: str) -> dict:
     """Get radix tree debug info."""
     try:
         resp = client.get(f"{url}/debug/tree_info")
         return resp.json()
-    except Exception:
+    except Exception as e:
+        print(f"Failed to get tree info: {e}")
         return {"num_blocks": -1, "events": []}

159-175: Server connectivity check only verifies router, not API.

_check_servers returns True without actually verifying the API server is reachable. Consider adding an API health check.

     def _check_servers(self) -> bool:
         """Verify both API and Router servers are reachable."""
         print("\nChecking server connectivity...")
         try:
             # Check router
             resp = self.client.get(f"{self.config.router_url}/debug/tree_info")
             if resp.status_code != 200:
                 print(f"  Router not responding: {resp.status_code}")
                 return False
             print(f"  Router OK (blocks in tree: {resp.json().get('num_blocks', '?')})")

-            # Check API - just verify it's up
-            # A simple request to verify the endpoint exists
+            # Check API - verify it's up with a minimal request
+            try:
+                # Just check the server responds (will fail with 4xx but confirms connectivity)
+                self.client.get(f"{self.config.api_url}/health", timeout=5)
+            except httpx.HTTPStatusError:
+                pass  # Expected if no health endpoint, but connection works
+            print("  API OK (connected)")
             return True
         except Exception as e:
             print(f"  Connection error: {e}")
             return False
examples/deployments/router_standalone_trtllm/worker.py (4)

23-28: Debug file path uses fixed /tmp location.

This is acceptable for debug-only code guarded by DEBUG_ENABLED, but be aware it may not be accessible in containerized environments or could conflict with other instances.

Consider using tempfile module or making the path configurable:

+import tempfile
+
 # Debug flag: set DYNAMO_DEBUG=1 to enable debug file dumps
 DEBUG_ENABLED = os.environ.get("DYNAMO_DEBUG", "0") == "1"
-DEBUG_WORKER_KV_FILE = "/tmp/debug_worker_kv.txt"
+DEBUG_WORKER_KV_FILE = os.environ.get("DYNAMO_DEBUG_FILE", "/tmp/debug_worker_kv.txt")

148-166: extract_mm_info only extracts the first MM object.

The function returns after finding the first mm_key, ignoring any additional multimodal objects in the request.

If multiple images can exist per request, consider accumulating all mm_objects:

 def extract_mm_info(blocks_data: list[dict], all_token_ids: list[int]) -> dict | None:
     """Extract multimodal hash info from TRTLLM block data."""
+    mm_objects = []
     for block in blocks_data:
         mm_keys = block.get("mm_keys", [])
         for mm_key in mm_keys:
             if mm_key.get("type") != "mm_key":
                 continue

             hash_hex = mm_key.get("hash", "")
             if not hash_hex:
                 continue

             mm_hash = int(hash_hex[:16], 16)
             offsets = find_image_token_range(all_token_ids)

             if offsets:
-                return {"mm_objects": [{"mm_hash": mm_hash, "offsets": [offsets]}]}
+                mm_objects.append({"mm_hash": mm_hash, "offsets": [offsets]})

-    return None
+    return {"mm_objects": mm_objects} if mm_objects else None

282-307: Use logger.exception instead of logger.error for exception logging.

As flagged by static analysis, logger.exception automatically includes the traceback.

     async def _metrics_loop(self):
         """Continuously publish worker metrics."""
         await asyncio.sleep(1)

         try:
             async for stat in self.llm.get_stats_async(timeout=5):
                 if not isinstance(stat, dict):
                     continue

                 num_waiting = (
                     stat["numQueuedRequests"]
                     + stat["inflightBatchingStats"]["numPausedRequests"]
                 )
                 kv_stats = stat["kvCacheStats"]
                 usage = (
                     kv_stats["allocTotalBlocks"] / kv_stats["maxNumBlocks"]
                     if kv_stats["maxNumBlocks"] > 0
                     else 0.0
                 )

                 self.metrics_publisher.publish(num_waiting, usage)

         except asyncio.CancelledError:
             pass
         except Exception as e:
-            logger.error(f"Worker {self.worker_id} metrics error: {e}")
+            logger.exception(f"Worker {self.worker_id} metrics error")

320-330: Use logger.exception for KV events errors as well.

         except RuntimeError as e:
             if "IterationResult is not properly instantiated" in str(e):
                 logger.warning(f"Worker {self.worker_id}: KV events not available")
             else:
-                logger.error(f"Worker {self.worker_id} KV events error: {e}")
-        except Exception as e:
-            logger.error(f"Worker {self.worker_id} KV events error: {e}")
+                logger.exception(f"Worker {self.worker_id} KV events error")
+        except Exception:
+            logger.exception(f"Worker {self.worker_id} KV events error")
examples/deployments/router_standalone_trtllm/api.py (4)

247-254: Silently falling back to text-only may hide issues.

When MM processing fails, the fallback to text-only mode may produce incorrect results for multimodal requests without clear indication to the caller.

Consider returning an error for MM requests that fail processing, or at least including a warning in the response:

         except Exception as e:
-            logger.warning(f"MM processing failed: {e}, falling back to text-only")
+            logger.exception("MM processing failed, falling back to text-only")
             return ProcessedInput(
                 tokens=self.tokenizer.encode(prompt),
                 mm_input=None,
                 mm_hash=None,
                 image_offsets=None,
             )

288-296: Simplify dict access as suggested by linter.

     def _compute_mm_hash(self, multi_modal_data: dict | None) -> int | None:
         """Compute mm_hash from multimodal data."""
         if not multi_modal_data:
             return None

         mm_hashes_dict = apply_mm_hashes(multi_modal_data)
-        if "image" in mm_hashes_dict and mm_hashes_dict["image"]:
-            return int(mm_hashes_dict["image"][0][:16], 16)
+        image_hashes = mm_hashes_dict.get("image")
+        if image_hashes:
+            return int(image_hashes[0][:16], 16)
         return None

315-330: Use logger.exception for routing errors.

     async def _route_request(self, local_hashes: list[int], num_tokens: int) -> int | ErrorResponse:
         """Query router for best worker ID."""
         try:
             router_request = RouterRequest(local_hashes=local_hashes, num_tokens=num_tokens)
             response = await self.http_client.post(
                 f"http://localhost:{self.init_params.router_port}/find_best_worker",
                 json=router_request.model_dump(),
                 timeout=1,
             )
             response.raise_for_status()
             return RouterResponse.model_validate(response.json()).worker_id
         except (httpx.RequestError, httpx.HTTPStatusError) as e:
-            logger.error(f"Router request failed: {e}")
+            logger.exception("Router request failed")
             return ErrorResponse(
                 error=make_error("Router service unavailable", "service_unavailable", 503)
             )

573-589: KeyboardInterrupt won't be caught inside asyncio.run.

The inner except KeyboardInterrupt at line 579 will never trigger because asyncio.run propagates it. The outer handler at line 588 is correct.

     async def run_with_shutdown():
         try:
             router_task = asyncio.create_task(router_api.start())
             await asyncio.sleep(0.5)
             api_task = asyncio.create_task(api.start())
             await asyncio.gather(router_task, api_task)
-        except KeyboardInterrupt:
-            logger.info("Shutting down services...")
+        except asyncio.CancelledError:
+            logger.info("Tasks cancelled, shutting down services...")
         except Exception as e:
-            logger.exception(f"Unhandled exception: {e}")
+            logger.exception("Unhandled exception")
         finally:
             await api.shutdown()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5250303 and 7d4e95c.

📒 Files selected for processing (18)
  • examples/deployments/router_standalone_trtllm/README.md (1 hunks)
  • examples/deployments/router_standalone_trtllm/__init__.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/api.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/perf.sh (1 hunks)
  • examples/deployments/router_standalone_trtllm/ping.sh (1 hunks)
  • examples/deployments/router_standalone_trtllm/router.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/test_router.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/worker.py (1 hunks)
  • lib/bindings/c/src/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm/kv.rs (4 hunks)
  • lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
  • lib/bindings/python/tests/test_mm_kv_router.py (1 hunks)
  • lib/llm/src/kv_router.rs (5 hunks)
  • lib/llm/src/kv_router/indexer.rs (12 hunks)
  • lib/llm/src/kv_router/protocols.rs (3 hunks)
  • lib/llm/src/kv_router/publisher.rs (16 hunks)
  • lib/llm/src/mocker/kv_manager.rs (1 hunks)
  • lib/llm/src/protocols/common/preprocessor.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

  • lib/llm/src/kv_router/protocols.rs
  • lib/bindings/c/src/lib.rs
  • lib/llm/src/kv_router.rs
  • lib/llm/src/mocker/kv_manager.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/bindings/python/rust/llm/kv.rs
  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/kv_router/protocols.rs
  • lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

Applied to files:

  • lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-08-29T10:08:18.434Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-10-14T00:58:05.744Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.

Applied to files:

  • lib/llm/src/kv_router.rs
  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-05-30T06:34:12.785Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/_core.pyi (1)
lib/bindings/python/rust/llm/kv.rs (1)
  • compute_block_hash_for_seq_py (29-91)
lib/llm/src/kv_router.rs (1)
lib/llm/src/kv_router/indexer.rs (1)
  • compute_block_hash_for_seq (135-237)
examples/deployments/router_standalone_trtllm/test_router.py (3)
lib/bindings/python/rust/llm/kv.rs (3)
  • compute_block_hash_for_seq_py (29-91)
  • block_size (756-758)
  • block_size (846-848)
lib/bindings/python/src/dynamo/_core.pyi (4)
  • compute_block_hash_for_seq_py (235-282)
  • get (1663-1664)
  • block_size (670-674)
  • block_size (724-731)
examples/deployments/router_standalone_trtllm/router.py (1)
  • get_tree_info (310-313)
lib/llm/src/kv_router/publisher.rs (1)
lib/llm/src/kv_router/indexer.rs (4)
  • compute_block_hash_for_seq (135-237)
  • kv_block_size (2488-2488)
  • kv_block_size (2493-2493)
  • kv_block_size (2498-2498)
examples/deployments/router_standalone_trtllm/api.py (2)
examples/deployments/router_standalone_trtllm/router.py (6)
  • RouterAPI (260-339)
  • RouterRequest (42-44)
  • RouterResponse (47-50)
  • _setup_routes (297-333)
  • start (335-339)
  • shutdown (239-252)
examples/deployments/router_standalone_trtllm/worker.py (5)
  • TrtllmWorkers (471-515)
  • direct (504-509)
  • start_all (499-502)
  • shutdown (451-463)
  • shutdown_all (511-515)
lib/bindings/python/tests/test_mm_kv_router.py (2)
lib/bindings/python/src/dynamo/_core.pyi (11)
  • RadixTree (565-636)
  • compute_block_hash_for_seq_py (235-282)
  • apply_event (598-609)
  • dump_tree_as_events (629-636)
  • find_matches (583-596)
  • find_matches (650-660)
  • scores (545-552)
  • remove_worker (611-618)
  • clear_all_blocks (620-627)
  • block_size (670-674)
  • block_size (724-731)
lib/bindings/python/rust/llm/kv.rs (10)
  • compute_block_hash_for_seq_py (29-91)
  • apply_event (529-555)
  • dump_tree_as_events (601-633)
  • find_matches (492-527)
  • find_matches (760-776)
  • scores (410-417)
  • remove_worker (557-577)
  • clear_all_blocks (579-599)
  • block_size (756-758)
  • block_size (846-848)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.
examples/deployments/router_standalone_trtllm/README.md

[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.

lib/bindings/python/src/dynamo/_core.pyi

[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.

examples/deployments/router_standalone_trtllm/ping.sh

[error] 1-1: pre-commit: check-executables-have-shebangs failed. ping.sh is marked executable but has no valid shebang.

examples/deployments/router_standalone_trtllm/perf.sh

[error] 1-1: pre-commit: check-executables-have-shebangs failed. perf.sh is marked executable but has no valid shebang.

examples/deployments/router_standalone_trtllm/test_router.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.


[error] 782-782: Ruff: Local variable status is assigned to but never used. (F841)

examples/deployments/router_standalone_trtllm/worker.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

examples/deployments/router_standalone_trtllm/router.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

examples/deployments/router_standalone_trtllm/api.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

examples/deployments/router_standalone_trtllm/__init__.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

lib/bindings/python/tests/test_mm_kv_router.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.


[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.

🪛 markdownlint-cli2 (0.18.1)
examples/deployments/router_standalone_trtllm/README.md

164-164: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.8)
examples/deployments/router_standalone_trtllm/test_router.py

80-80: Consider moving this statement to an else block

(TRY300)


81-81: Do not catch blind exception: Exception

(BLE001)


90-90: Do not catch blind exception: Exception

(BLE001)


172-172: Consider moving this statement to an else block

(TRY300)


173-173: Do not catch blind exception: Exception

(BLE001)


628-628: f-string without any placeholders

Remove extraneous f prefix

(F541)


701-701: Local variable status is assigned to but never used

Remove assignment to unused variable status

(F841)

examples/deployments/router_standalone_trtllm/worker.py

25-25: Probable insecure usage of temporary file or directory: "/tmp/debug_worker_kv.txt"

(S108)


130-130: Do not catch blind exception: Exception

(BLE001)


131-131: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


306-306: Do not catch blind exception: Exception

(BLE001)


307-307: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


326-326: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


327-327: Do not catch blind exception: Exception

(BLE001)


328-328: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

examples/deployments/router_standalone_trtllm/router.py

30-30: Probable insecure usage of temporary file or directory: "/tmp/debug_kv_events.txt"

(S108)


131-131: Store a reference to the return value of asyncio.create_task

(RUF006)


143-143: Do not catch blind exception: Exception

(BLE001)


150-150: Store a reference to the return value of asyncio.create_task

(RUF006)


163-163: Do not catch blind exception: Exception

(BLE001)


180-180: Avoid specifying long messages outside the exception class

(TRY003)


289-289: Unused method argument: app

(ARG002)


307-307: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


338-338: Possible binding to all interfaces

(S104)

examples/deployments/router_standalone_trtllm/api.py

36-36: Probable insecure usage of temporary file or directory: "/tmp/debug_api_hashes.txt"

(S108)


203-203: Do not catch blind exception: Exception

(BLE001)


247-247: Do not catch blind exception: Exception

(BLE001)


294-294: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)


327-327: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


361-361: Do not catch blind exception: Exception

(BLE001)


362-362: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


457-457: Do not catch blind exception: Exception

(BLE001)


458-458: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


488-488: Do not catch blind exception: Exception

(BLE001)


501-501: Possible binding to all interfaces

(S104)


582-582: Redundant exception object included in logging.exception call

(TRY401)

lib/bindings/python/tests/test_mm_kv_router.py

352-352: Do not assert blind exception: Exception

(B017)

🪛 Shellcheck (0.11.0)
examples/deployments/router_standalone_trtllm/ping.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

examples/deployments/router_standalone_trtllm/perf.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/bindings/python)
🔇 Additional comments (37)
lib/llm/src/kv_router/protocols.rs (4)

38-48: LGTM!

The request_extra_info field is correctly added with appropriate serde attributes for backward compatibility and efficient serialization.


50-57: LGTM!

The default implementation correctly initializes the new request_extra_info field to None.


269-366: Well-structured multimodal metadata types and conversion logic.

The to_block_level() method correctly handles:

  • Ceiling division for block count calculation
  • Offset intersection with block boundaries
  • Deduplication of mm_hash entries within blocks
  • Edge cases like empty ranges via the local_start < local_end guard

The use of saturating_sub(1) on line 330 correctly handles exclusive end offsets for block boundary calculation.
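The conversion described above can be sketched in Python (a rough model of the Rust `to_block_level()` logic; input shape and names are assumptions based on this description, not the actual struct layout):

```python
def to_block_level(mm_objects, num_tokens, block_size):
    """Split request-level MM objects into per-block MM info.

    mm_objects: list of (mm_hash, [(start, end), ...]) token ranges
    with exclusive ends. Returns one entry per block: a deduplicated
    list of mm_hashes whose ranges intersect that block, or None when
    the block contains no multimodal tokens.
    """
    num_blocks = (num_tokens + block_size - 1) // block_size  # ceiling division
    blocks = []
    for i in range(num_blocks):
        block_start = i * block_size
        block_end = min(block_start + block_size, num_tokens)
        hashes = []
        for mm_hash, offsets in mm_objects:
            for start, end in offsets:
                # intersect the MM range with the block boundaries
                local_start = max(start, block_start)
                local_end = min(end, block_end)
                if local_start < local_end and mm_hash not in hashes:
                    hashes.append(mm_hash)  # dedup within a block
        blocks.append(hashes or None)
    return blocks
```

For example, one image spanning tokens [0, 20) with `block_size=16` and `num_tokens=48` touches blocks 0 and 1 but not block 2.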


368-378: LGTM!

The mm_extra_info field follows the established pattern with appropriate serde attributes for backward compatibility and efficient serialization.

lib/bindings/c/src/lib.rs (1)

163-180: LGTM!

The C bindings correctly pass None for the new multimodal metadata parameters, maintaining backward compatibility for non-multimodal use cases.

examples/deployments/router_standalone_trtllm/ping.sh (1)

17-25: LGTM!

The curl command correctly tests the streaming chat completion endpoint with appropriate headers and payload.

lib/llm/src/mocker/kv_manager.rs (1)

133-144: LGTM!

The mm_extra_info: None initialization correctly aligns with the updated KvCacheStoredBlockData structure. The mocker doesn't need to track multimodal metadata.

lib/llm/src/protocols/common/preprocessor.rs (2)

8-8: LGTM!

The import correctly combines RouterConfigOverride and the new RequestExtraInfo from the same module path.


104-108: LGTM!

The new request_extra_info field follows established patterns in this struct—using #[builder(default)] and #[serde(default, skip_serializing_if = "Option::is_none")] consistent with other optional fields like extra_args and extra_fields. The documentation clearly explains its purpose for carrying multimodal metadata.

lib/llm/src/kv_router.rs (4)

398-398: LGTM!

Correct adaptation to the updated compute_block_hash_for_seq signature, passing None for the new block_mm_infos parameter.


452-455: LGTM!

Consistent update to pass None for the multimodal info parameter.


519-522: Verify: request_extra_info is intentionally unused for now.

The request_extra_info field is captured but explicitly ignored (_). Based on the PR context, this appears to be groundwork for future multimodal routing—the field is added to the protocol but not yet propagated to hash computation. Please confirm this is the intended incremental approach.


604-606: LGTM!

Consistent update in KvPushRouter to pass None for the new multimodal parameter.

lib/bindings/python/tests/test_mm_kv_router.py (1)

214-345: MM block‑hash test coverage looks strong

The block‑hash tests exercise core scenarios well (no‑MM vs MM, determinism, multi‑block, partial blocks, None MM info, offsets ignored, multiple MM objects). This should give good confidence that the Rust hashing behavior is correctly surfaced into Python.

lib/llm/src/kv_router/indexer.rs (1)

2485-2501: Unit test updates confirm MM‑agnostic behavior is preserved when block_mm_infos is None

The updated test_compute_block_hash_for_seq now calls compute_block_hash_for_seq(..., None) and asserts the same block‑count behavior as before (1 hash for ≤1.5 blocks, 2 for >2.5, etc.). This is a good regression guard ensuring the MM extension didn’t change the base (no‑MM) semantics.

lib/llm/src/kv_router/publisher.rs (2)

419-444: MM metadata correctly influences tokens_hash and is stored per block

The changes to create_stored_block_from_parts look consistent with the new MM hashing contract:

  • For each block, you build a one‑element block_mm_infos slice from mm_extra_info and call compute_block_hash_for_seq on the block’s tokens, so tokens_hash is now a function of tokens + that block’s mm_hash set.
  • The same mm_extra_info is then stored on the resulting KvCacheStoredBlockData, so downstream consumers can still inspect MM metadata without recomputing it.

This ensures events produced by this publisher match the MM‑aware hashing logic the indexer and Python bindings expect.
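The contract can be illustrated with a toy hash (the real Rust implementation uses a different hash function; this sketch only demonstrates the property that identical tokens with different mm_hash sets yield different block hashes, order-independently and deduplicated):

```python
import hashlib
import struct

def block_hash(tokens, mm_hashes=None):
    """Toy MM-aware block hash: tokens plus the block's mm_hash set."""
    h = hashlib.blake2b(digest_size=8)
    for t in tokens:
        h.update(struct.pack("<I", t))
    # fold in the deduplicated, sorted mm_hash set so ordering and
    # duplicates do not change the result
    for m in sorted(set(mm_hashes or [])):
        h.update(struct.pack("<Q", m))
    return int.from_bytes(h.digest(), "little")
```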


533-701: RawKvEvent deserialization for block_mm_infos is robust and backwards‑compatible

The custom Deserialize impl handles block_mm_infos in both tagged‑map and tagged‑sequence encodings, defaulting to None when the field is absent, which keeps older producers compatible. That matches the intended Option<Vec<Option<BlockExtraInfo>>> semantics and should interoperate cleanly with both legacy and MM‑aware senders.
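In Python terms, the tolerant decoding behaves roughly like the sketch below (the trailing-slot position in the sequence encoding is assumed for illustration, not taken from the actual serde code):

```python
def decode_block_mm_infos(event):
    """Accept tagged-map (dict) or tagged-sequence (list) event encodings.

    Returns the block_mm_infos payload, defaulting to None when absent,
    so events from older, non-MM producers still decode cleanly.
    """
    if isinstance(event, dict):
        return event.get("block_mm_infos")  # absent -> None
    if isinstance(event, (list, tuple)):
        # sequence encoding: mm infos, when present, occupy a trailing slot
        # (slot index is a hypothetical choice for this sketch)
        return event[3] if len(event) > 3 else None
    raise TypeError(f"unsupported event encoding: {type(event).__name__}")
```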

examples/deployments/router_standalone_trtllm/router.py (3)

171-201: Routing logic is coherent and aligns with RadixTree API

get_best_worker and its helpers correctly:

  • Use RadixTree.find_matches(local_hashes) and mask scores down to per‑worker counts assuming dp_rank=0.
  • Translate matched block counts into overlap ratios via matched_blocks * block_size / num_tokens.
  • Combine overlap, current KV usage, and normalized waiting count into a simple logit and pick the max with random tie‑breaking.

This is a reasonable first‑pass policy for the standalone example and matches the OverlapScores API shape from the bindings.
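Under assumed equal weights (the example's actual coefficients may differ), the selection policy reads roughly as:

```python
import random

def pick_worker(workers, block_size, num_tokens):
    """Toy version of the routing policy described above.

    workers: dict worker_id -> {"matched_blocks": int,
                                "kv_usage": float in [0, 1],
                                "num_waiting": int}
    Higher prefix overlap is better; higher KV usage and deeper
    queues are worse.
    """
    max_waiting = max([w["num_waiting"] for w in workers.values()] + [1])
    logits = {}
    for wid, w in workers.items():
        overlap = w["matched_blocks"] * block_size / num_tokens  # cache-hit ratio
        waiting_norm = w["num_waiting"] / max_waiting
        logits[wid] = overlap - w["kv_usage"] - waiting_norm  # assumed weights
    best = max(logits.values())
    # random tie-breaking among the best-scoring workers
    return random.choice([wid for wid, v in logits.items() if v == best])
```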


297-334: Debug inject endpoint shape matches KV event expectations

/debug/inject_event builds a minimal KvCacheEvent JSON with a single stored block (including optional mm_extra_info) and feeds it directly to RadixTree.apply_event(...). That’s a useful hook for local testing and for verifying MM metadata propagation through the router.

Just be aware that this bypasses whatever additional fields the real producer includes (e.g., dp_rank), so it should remain clearly labeled as a debug‑only tool, which you’ve already done via the /debug/... namespace.


343-361: Remember to run Black and commit formatting changes

CI reports that Black reformatted this file and that trailing whitespace was auto‑fixed by pre‑commit. Please run the project’s formatting hooks locally (e.g., black examples/deployments/router_standalone_trtllm/router.py and any configured pre‑commit hooks) and commit the resulting changes to clear the pipeline failure.

examples/deployments/router_standalone_trtllm/test_router.py (5)

1-28: LGTM! Well-structured test module setup.

The file header, imports, and test constants are properly organized. The sample test images from COCO dataset are appropriate for multimodal testing.


30-44: LGTM! Clean dataclass definitions.

TestConfig and TestResult are well-designed with sensible defaults. The kv_settle_time parameter is a good practice for async event propagation tests.


304-349: LGTM! Comprehensive multimodal hash computation tests.

The _test_mm_hash_computation test correctly validates that different mm_hash values produce different block hashes. Good coverage of the core hash differentiation logic.


464-516: LGTM! Thorough block boundary test.

_test_mm_block_boundary correctly validates that MM info only affects the intended blocks, checking all three blocks independently.


723-764: LGTM! Clean CLI structure with proper resource cleanup.

The main() function properly uses try/finally to ensure tests.cleanup() is called, and the argument parsing is well-organized.

examples/deployments/router_standalone_trtllm/worker.py (6)

46-51: LGTM! Correct unsigned 64-bit conversion.

The to_unsigned_u64 function correctly handles Python's arbitrary-precision integers and converts negative values using two's complement for Rust/msgpack compatibility.
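The conversion amounts to masking to 64 bits, which is equivalent to two's complement for negative inputs:

```python
def to_unsigned_u64(value: int) -> int:
    """Map a (possibly negative) Python int into the unsigned 64-bit range."""
    return value & 0xFFFFFFFFFFFFFFFF
```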


59-76: LGTM! Clean ZMQ publisher implementation.

MetricsPublisher is well-implemented with proper context and socket management. The close() method correctly terminates resources.


125-136: LGTM! Well-structured message serialization.

The _send method correctly uses msgpack with sequence numbers for reliable ordering, and handles serialization errors gracefully.


181-213: LGTM! Robust block parsing with proper partial block handling.

parse_stored_blocks correctly tracks partial blocks separately and validates block sizes. The early break on partial/oversized blocks is appropriate.
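The partial-block handling can be sketched as follows (input shape and names are assumptions; the real function parses TRT-LLM event payloads):

```python
def parse_stored_blocks(block_hashes, block_token_counts, block_size):
    """Keep only full blocks; stop at the first partial or oversized one.

    Returns (full_blocks, partial_block) where partial_block is the first
    hash whose token count differs from block_size, or None.
    """
    full, partial = [], None
    for h, n in zip(block_hashes, block_token_counts):
        if n != block_size:
            partial = h
            break  # blocks after a partial one are unusable for prefix reuse
        full.append(h)
    return full, partial
```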


451-463: LGTM! Proper shutdown with resource cleanup.

The shutdown method correctly cancels background tasks and closes all resources in the right order.


471-515: LGTM! Clean worker manager implementation.

TrtllmWorkers provides a clean interface for managing multiple workers with proper port assignment and lifecycle management.

examples/deployments/router_standalone_trtllm/api.py (6)

1-31: LGTM! Proper module setup with necessary imports.

The protobuf environment workaround is documented, and imports are well-organized. Good use of both local and external dependencies.


70-103: LGTM! Clean API model definitions.

The Pydantic models follow OpenAI API conventions correctly. Using str | list[ContentPart] for content properly handles both text-only and multimodal messages.


197-215: LGTM! Good fallback pattern for chat template.

The _build_prompt method gracefully falls back to simple formatting when the tokenizer's chat template fails.


382-459: LGTM! Well-structured request handler.

The chat_completions endpoint properly validates service readiness, parses requests, handles both text and multimodal paths, computes routing hashes, and streams responses. Good error handling at each stage.


495-504: Binding to 0.0.0.0 is intentional for containerized deployments.

The linter flags binding to all interfaces, but this is appropriate for a service that needs to be accessible from outside the container.


519-549: LGTM! Comprehensive CLI with sensible defaults.

The argument parser provides good defaults and clear help text for all configuration options.

Comment on lines 3 to 8







⚠️ Potential issue | 🟡 Minor

Remove excessive trailing blank lines to fix Black formatting failure.

The pipeline failed because Black formatting detected issues with this file. Remove the trailing blank lines after the license header.

Apply this diff:

 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-
-
-
-
-
-
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/__init__.py around lines 3 to
8, remove the excessive trailing blank lines after the license header so the
file ends with a single newline; update the file to eliminate blank lines beyond
the header (leave exactly one newline at EOF) to satisfy Black formatting.

Comment on lines 156 to 176
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
"""Parse and validate the incoming request."""
max_tokens = request.max_completion_tokens or request.max_tokens
if max_tokens is None:
return ErrorResponse(
error=make_error(
"Either max_tokens or max_completion_tokens must be specified",
"invalid_request_error", 400
)
)

messages_dict, image_urls = self._extract_messages(request.messages)

return ParsedRequest(
messages_dict=messages_dict,
image_urls=image_urls,
max_tokens=max_tokens,
temperature=request.temperature,
top_p=request.top_p,
model=request.model,
)

⚠️ Potential issue | 🟡 Minor

Default values for temperature and top_p may be None.

When accessing request.temperature and request.top_p, they could be None if explicitly passed as null in the request, despite the default values.

     def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
         """Parse and validate the incoming request."""
         max_tokens = request.max_completion_tokens or request.max_tokens
         if max_tokens is None:
             return ErrorResponse(
                 error=make_error(
                     "Either max_tokens or max_completion_tokens must be specified",
                     "invalid_request_error", 400
                 )
             )

         messages_dict, image_urls = self._extract_messages(request.messages)

         return ParsedRequest(
             messages_dict=messages_dict,
             image_urls=image_urls,
             max_tokens=max_tokens,
-            temperature=request.temperature,
-            top_p=request.top_p,
+            temperature=request.temperature if request.temperature is not None else 1.0,
+            top_p=request.top_p if request.top_p is not None else 1.0,
             model=request.model,
         )
📝 Committable suggestion


Suggested change
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
"""Parse and validate the incoming request."""
max_tokens = request.max_completion_tokens or request.max_tokens
if max_tokens is None:
return ErrorResponse(
error=make_error(
"Either max_tokens or max_completion_tokens must be specified",
"invalid_request_error", 400
)
)
messages_dict, image_urls = self._extract_messages(request.messages)
return ParsedRequest(
messages_dict=messages_dict,
image_urls=image_urls,
max_tokens=max_tokens,
temperature=request.temperature,
top_p=request.top_p,
model=request.model,
)
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
"""Parse and validate the incoming request."""
max_tokens = request.max_completion_tokens or request.max_tokens
if max_tokens is None:
return ErrorResponse(
error=make_error(
"Either max_tokens or max_completion_tokens must be specified",
"invalid_request_error", 400
)
)
messages_dict, image_urls = self._extract_messages(request.messages)
return ParsedRequest(
messages_dict=messages_dict,
image_urls=image_urls,
max_tokens=max_tokens,
temperature=request.temperature if request.temperature is not None else 1.0,
top_p=request.top_p if request.top_p is not None else 1.0,
model=request.model,
)
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/api.py around lines 156–176,
request.temperature and request.top_p can be explicitly null so they may be None
here; update the parsing to coalesce those fields to safe defaults before
building ParsedRequest (e.g., temperature = request.temperature if
request.temperature is not None else 1.0 and top_p = request.top_p if
request.top_p is not None else 1.0, or use any existing DEFAULT_* constants),
ensure the values are numeric (cast/validate) and pass those fallback values
into ParsedRequest when returning.

Comment on lines 302 to 313
def _build_block_mm_infos(
self, num_tokens: int, mm_hash: int | None, image_offsets: list[int] | None
) -> list[dict] | None:
"""Build block_mm_infos for routing hash computation."""
if mm_hash is None or image_offsets is None:
return None

num_blocks = (num_tokens + self.init_params.block_size - 1) // self.init_params.block_size
return [
{"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
for _ in range(num_blocks)
]

⚠️ Potential issue | 🟠 Major

_build_block_mm_infos applies same MM info to all blocks.

Similar to the issue in worker.py, this broadcasts identical mm_info to every block. The image_offsets should determine which blocks actually contain image content.

     def _build_block_mm_infos(
         self, num_tokens: int, mm_hash: int | None, image_offsets: list[int] | None
     ) -> list[dict] | None:
         """Build block_mm_infos for routing hash computation."""
         if mm_hash is None or image_offsets is None:
             return None

+        block_size = self.init_params.block_size
         num_blocks = (num_tokens + self.init_params.block_size - 1) // self.init_params.block_size
-        return [
-            {"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
-            for _ in range(num_blocks)
-        ]
+        block_mm_infos = []
+        mm_info = {"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
+        
+        for block_idx in range(num_blocks):
+            block_start = block_idx * block_size
+            block_end = min(block_start + block_size, num_tokens)
+            # Check if this block overlaps with image token range
+            if image_offsets[0] < block_end and image_offsets[1] > block_start:
+                block_mm_infos.append(mm_info)
+            else:
+                block_mm_infos.append(None)
+        
+        return block_mm_infos
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/api.py around lines 302–313,
the current implementation attaches the same mm_object offsets to every block;
instead compute per-block offsets from image_offsets so only blocks that contain
image tokens get mm_objects. For each block index i compute block_start = i *
block_size and block_end = block_start + block_size, filter image_offsets to
offsets within [block_start, block_end), convert them to block-relative offsets
(subtract block_start), and for that block include an mm_objects entry only if
the resulting offsets list is non-empty; return the list of per-block dicts (or
None when mm_hash or image_offsets is None) so image mm info is not broadcast to
all blocks.

@@ -0,0 +1,52 @@
#/bin/bash

⚠️ Potential issue | 🔴 Critical

Fix the shebang: missing ! character.

The shebang line #/bin/bash is invalid; it must be #!/bin/bash. As written, the script fails the pre-commit check and will not reliably execute as a bash script.

-#/bin/bash
+#!/bin/bash
📝 Committable suggestion


Suggested change
#/bin/bash
#!/bin/bash
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.

[error] 1-1: pre-commit: check-executables-have-shebangs failed. perf.sh is marked executable but has no valid shebang.

🪛 Shellcheck (0.11.0)

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/perf.sh around line 1 the
shebang is incorrect ("#/bin/bash"); replace it with a valid bash shebang by
inserting the missing "!" so the first line reads "#!/bin/bash" to ensure the
script is executable and passes pre-commit checks.

@@ -0,0 +1,25 @@
#/bin/bash

⚠️ Potential issue | 🟡 Minor

Fix malformed shebang to resolve pipeline failure.

The shebang is missing the ! character. This is causing the pre-commit check to fail.

Apply this diff:

-#/bin/bash
+#!/bin/bash
📝 Committable suggestion


Suggested change
#/bin/bash
#!/bin/bash
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.

[error] 1-1: pre-commit: check-executables-have-shebangs failed. ping.sh is marked executable but has no valid shebang.

🪛 Shellcheck (0.11.0)

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/ping.sh around line 1 the
shebang is malformed ("#/bin/bash" missing the exclamation mark); replace it
with a proper shebang (#!/bin/bash) at the top of the file so the script is
recognized and the pre-commit pipeline passes.

Comment on lines +104 to +110
if parent_hash is not None:
event["parent_block_hash"] = to_unsigned_u64(parent_hash)

if mm_extra_info is not None:
event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)

self._send([event])

⚠️ Potential issue | 🟡 Minor

Clarify multimodal info semantics: is mm_extra_info operation-wide or per-block?

The field name block_mm_infos (array) conflicts with the implementation that broadcasts a single mm_extra_info to all blocks. This indicates either incomplete implementation (should filter by offsets/block content) or incorrect field structure (should be single mm_extra_info if operation-wide).

Consider either:

  1. Filter blocks based on which ones contain image tokens/offsets before applying mm_extra_info
  2. Rename to single field and document that it applies operation-wide
        if mm_extra_info is not None:
-            event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)
+            # TODO: Clarify if mm_extra_info applies to all blocks or only those with multimodal content
+            event["mm_extra_info"] = mm_extra_info
📝 Committable suggestion


Suggested change
if parent_hash is not None:
event["parent_block_hash"] = to_unsigned_u64(parent_hash)
if mm_extra_info is not None:
event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)
self._send([event])
if parent_hash is not None:
event["parent_block_hash"] = to_unsigned_u64(parent_hash)
if mm_extra_info is not None:
# TODO: Clarify if mm_extra_info applies to all blocks or only those with multimodal content
event["mm_extra_info"] = mm_extra_info
self._send([event])
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/worker.py around lines 104-110
the code always duplicates a single mm_extra_info across all blocks into
event["block_mm_infos"], which conflicts with the field name and review comment;
either (A) make block_mm_infos truly per-block by building a list of the same
length as block_hashes where you place mm_extra_info only for blocks that
contain multimodal tokens (detect those blocks from offsets/metadata or the
incoming request and otherwise insert None/empty), or (B) change the outgoing
payload to use a single operation-scoped field name (e.g., operation_mm_info or
mm_info) and document that it applies to the whole operation; implement one of
these fixes, update the field name/structure accordingly in the event emission,
and adjust any related docs/tests to reflect the chosen semantics.

Comment on lines +247 to +263
def _initialize(self, kv_events_port: int, metrics_port: int):
"""Initialize TensorRT-LLM engine and publishers."""
os.environ["CUDA_VISIBLE_DEVICES"] = str(self.worker_id)

logger.info(f"Worker {self.worker_id}: Initializing on GPU {self.worker_id}")

self.llm = LLM(
model=self.model,
kv_cache_config=KvCacheConfig(
enable_block_reuse=True,
event_buffer_max_size=DEFAULT_KV_EVENT_BUFFER_MAX_SIZE,
),
)

self.metrics_publisher = MetricsPublisher(metrics_port)
self.kv_events_publisher = KvEventsPublisher(kv_events_port, self.block_size)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check TrtllmWorkers instantiation patterns
ast-grep --pattern 'TrtllmWorkers($$$)'

Repository: ai-dynamo/dynamo

Length of output: 790


🏁 Script executed:

#!/bin/bash
# Also search for the TrtllmWorker class definition and its usage
rg 'class TrtllmWorker' -A 20

Repository: ai-dynamo/dynamo

Length of output: 3664


🏁 Script executed:

#!/bin/bash
# Check the file structure of the worker.py file
wc -l examples/deployments/router_standalone_trtllm/worker.py
head -50 examples/deployments/router_standalone_trtllm/worker.py

Repository: ai-dynamo/dynamo

Length of output: 1636


🏁 Script executed:

#!/bin/bash
# Look for initialization patterns and how workers are spawned/managed
rg 'TrtllmWorker|_initialize' examples/deployments/router_standalone_trtllm/ -B 2 -A 2

Repository: ai-dynamo/dynamo

Length of output: 4132


🏁 Script executed:

#!/bin/bash
# Get the complete TrtllmWorker __init__ and _initialize methods
sed -n '/^class TrtllmWorker:/,/^class [A-Z]/p' examples/deployments/router_standalone_trtllm/worker.py | head -100

Repository: ai-dynamo/dynamo

Length of output: 3622


🏁 Script executed:

#!/bin/bash
# Check TrtllmWorkers __init__ more completely
sed -n '/^class TrtllmWorkers:/,/^class [A-Z]/p' examples/deployments/router_standalone_trtllm/worker.py | head -80

Repository: ai-dynamo/dynamo

Length of output: 1524


🏁 Script executed:

#!/bin/bash
# Check if workers are spawned in separate processes (multiprocessing)
rg 'Process|multiprocessing|spawn|Fork' examples/deployments/router_standalone_trtllm/

Repository: ai-dynamo/dynamo

Length of output: 1475


Setting CUDA_VISIBLE_DEVICES in _initialize causes critical issues when multiple workers initialize sequentially in the same process.

All workers are initialized in TrtllmWorkers.__init__() within a single process loop (lines 289-298). When worker N initializes, it overwrites the environment variable set by worker N-1, affecting GPU visibility for all previously initialized workers. This breaks multi-worker deployments where num_workers > 1.

🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/worker.py around lines 247-263,
setting CUDA_VISIBLE_DEVICES inside _initialize overwrites GPU visibility for
previously initialized workers when multiple workers are created in the same
process (see TrtllmWorkers.__init__ loop at ~289-298); remove the
os.environ["CUDA_VISIBLE_DEVICES"] assignment here and instead ensure each
worker selects its GPU explicitly without changing process-wide env: either set
the device at the CUDA runtime level (e.g., use the CUDA API / framework call to
set the current device for this worker) or modify the LLM initialization to
accept and bind to a specific device index passed from the caller, or move
per-worker CUDA_VISIBLE_DEVICES setup to separate worker processes before
initialization; update callers to pass worker_id/device index into LLM so each
worker uses the correct GPU without mutating environment variables.
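One way to keep per-worker GPU selection out of the shared process environment is to set `CUDA_VISIBLE_DEVICES` inside a dedicated child process before the engine initializes. A minimal sketch of that pattern (the function names and the queue-based reporting are illustrative, not from the PR; the engine init itself is omitted):

```python
import multiprocessing as mp
import os


def _worker_entry(worker_id: int, queue) -> None:
    # Runs in a child process, so this assignment cannot clobber the
    # GPU visibility of any other worker or of the parent process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_id)
    # ... initialize the TRT-LLM engine here (omitted in this sketch) ...
    queue.put((worker_id, os.environ["CUDA_VISIBLE_DEVICES"]))


def spawn_workers(num_workers: int) -> dict:
    """Start one child process per worker and report each child's GPU env."""
    ctx = mp.get_context("fork")  # fork keeps the sketch self-contained on Linux
    queue = ctx.Queue()
    procs = [ctx.Process(target=_worker_entry, args=(i, queue)) for i in range(num_workers)]
    for p in procs:
        p.start()
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results
```

Each worker then sees only its own GPU as device 0. Alternatively, selecting the device via the CUDA runtime (e.g., `torch.cuda.set_device`) or threading a device index into the `LLM` constructor, if it supports one, avoids env mutation entirely.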

Comment on lines +28 to 90
#[pyo3(signature = (tokens, kv_block_size, block_mm_infos=None))]
pub fn compute_block_hash_for_seq_py(
_py: Python,
tokens: Vec<u32>,
kv_block_size: usize,
block_mm_infos: Option<Bound<PyAny>>,
) -> PyResult<Vec<u64>> {
use std::fs::OpenOptions;
use std::io::Write;

if kv_block_size == 0 {
return Err(to_pyerr(anyhow::anyhow!("kv_block_size cannot be 0")));
}

let hashes = compute_block_hash_for_seq(&tokens, kv_block_size as u32);
// Convert Python block_mm_infos to Rust Vec<Option<BlockExtraInfo>>
let mm_infos_rust: Option<Vec<Option<BlockExtraInfo>>> = block_mm_infos
.as_ref()
.map(|infos_py| {
depythonize::<Vec<Option<BlockExtraInfo>>>(infos_py).map_err(|e| {
PyErr::new::<pyo3::exceptions::PyValueError, _>(format!(
"Failed to convert block_mm_infos: {}",
e
))
})
})
.transpose()?;

// Log parameters to file
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_rust_hash_params.txt")
{
let _ = writeln!(
file,
"\n============================================================"
);
let _ = writeln!(file, "=== compute_block_hash_for_seq_py PARAMETERS ===");
let _ = writeln!(file, "kv_block_size: {}", kv_block_size);
let _ = writeln!(file, "num_tokens: {}", tokens.len());
let _ = writeln!(file, "tokens: {:?}", tokens);
let _ = writeln!(file, "mm_infos_rust: {:?}", mm_infos_rust);
}

let hashes =
compute_block_hash_for_seq(&tokens, kv_block_size as u32, mm_infos_rust.as_deref());

// Log result
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_rust_hash_params.txt")
{
let hash_values: Vec<u64> = hashes.iter().map(|h| h.0).collect();
let _ = writeln!(file, "=== RESULT ===");
let _ = writeln!(file, "hashes ({}): {:?}", hash_values.len(), hash_values);
let _ = writeln!(
file,
"============================================================"
);
}

Ok(hashes.into_iter().map(|h| h.0).collect())

⚠️ Potential issue | 🟠 Major

Remove or gate persistent debug file logging in compute_block_hash_for_seq_py

This binding currently opens and appends to /tmp/debug_rust_hash_params.txt on every call, logging full token arrays and MM metadata as well as results. In a real deployment this function can be called extremely frequently, so:

  • Performance: repeated synchronous file IO in a tight path will add noticeable overhead and contention.
  • Reliability: running in environments where /tmp is slow, small, or unwritable can introduce sporadic failures.
  • Privacy: tokens and MM metadata can include user data, so dumping them to a world‑readable temp file is a potential leakage vector.

Suggest either removing this logging before merge or guarding it behind a very explicit debug flag (e.g., an env var check or a compile‑time feature) and trimming payloads to what you strictly need for troubleshooting.
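One possible shape for the opt-in gate (the `DYNAMO_HASH_DEBUG` variable name and the helper functions are illustrative, not part of the PR): the hot path pays only a cheap env check, and file IO happens only when explicitly enabled.

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Opt-in gate: logging runs only when this (hypothetical) env var is set.
fn hash_debug_enabled() -> bool {
    std::env::var_os("DYNAMO_HASH_DEBUG").is_some()
}

/// Append one line to the debug file; IO errors are swallowed,
/// as in the original code.
fn append_debug_line(path: &str, msg: &str) {
    if let Ok(mut file) = OpenOptions::new().create(true).append(true).open(path) {
        let _ = writeln!(file, "{}", msg);
    }
}

/// How the hot path would use the gate: log sizes, not full token payloads.
fn log_hash_params(kv_block_size: u32, num_tokens: usize) {
    if hash_debug_enabled() {
        append_debug_line(
            "/tmp/debug_rust_hash_params.txt",
            &format!("kv_block_size: {kv_block_size}, num_tokens: {num_tokens}"),
        );
    }
}
```

Logging only counts (block size, token count) rather than the tokens themselves also addresses the privacy concern above.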

Comment on lines +348 to +353
def test_mm_block_hash_error_zero_block_size():
"""Test that zero block size raises an error."""
tokens = [100] * 32

with pytest.raises(Exception):
compute_block_hash_for_seq_py(tokens, 0)

⚠️ Potential issue | 🟡 Minor

Consider asserting a more specific exception than Exception

pytest.raises(Exception) will catch anything, including unrelated errors. Since compute_block_hash_for_seq_py explicitly rejects kv_block_size == 0, it would be more robust to assert the concrete Python exception type (e.g., the specific PyErr mapping used by to_pyerr) or at least a narrower base such as RuntimeError, and optionally validate the error message. This would also resolve Ruff’s B017 warning.


🧰 Tools
🪛 Ruff (0.14.8)

352-352: Do not assert blind exception: Exception

(B017)

🤖 Prompt for AI Agents
In lib/bindings/python/tests/test_mm_kv_router.py around lines 348 to 353, the
test uses pytest.raises(Exception) which is too broad; change it to assert the
concrete exception type raised for an invalid kv_block_size (use ValueError or
the actual exception your to_pyerr maps to, e.g., RuntimeError) and optionally
verify the error message via pytest.raises(<SpecificException>, match="expected
message"). Update the with pytest.raises(...) block to use the specific
exception and add a match string if you want to assert the error text.
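A sketch of the narrower assertion, using a stand-in for the binding so the shape is clear (the real test would call compute_block_hash_for_seq_py and match whatever exception to_pyerr actually produces; ValueError below is an assumption):

```python
import pytest


def compute_block_hash_stub(tokens, kv_block_size):
    # Stand-in mirroring the binding's validation path; the real binding
    # raises from Rust via to_pyerr.
    if kv_block_size == 0:
        raise ValueError("kv_block_size cannot be 0")
    return []


def test_mm_block_hash_error_zero_block_size():
    """Zero block size raises the specific validation error."""
    tokens = [100] * 32
    # Narrow type plus a match string: unrelated failures no longer pass.
    with pytest.raises(ValueError, match="kv_block_size cannot be 0"):
        compute_block_hash_stub(tokens, 0)
```

If to_pyerr maps to a different type in practice, only the exception class and match string need to change.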

Comment on lines +120 to 237
/// Compute the hash for a sequence of tokens, optionally including multimodal metadata.
///
/// When multimodal extra info is provided, the mm_hashes are included in the hash computation
/// to ensure that blocks with identical tokens but different multimodal objects produce
/// different hashes.
///
/// ### Arguments
///
/// * `tokens` - A vector of `u32` tokens.
/// * `kv_block_size` - The size of each block in tokens.
/// * `block_mm_infos` - Optional per-block multimodal metadata.
///
/// ### Returns
///
/// A vector of `LocalBlockHash` representing the computed hashes for each chunk of tokens.
pub fn compute_block_hash_for_seq(tokens: &[u32], kv_block_size: u32) -> Vec<LocalBlockHash> {
tokens
.chunks_exact(kv_block_size as usize) // Split into chunks of kv_block_size elements
.map(|chunk| {
let bytes: Vec<u8> = chunk
.iter()
.flat_map(|&num| num.to_le_bytes()) // Convert each i32 to its little-endian bytes
.collect();
pub fn compute_block_hash_for_seq(
tokens: &[u32],
kv_block_size: u32,
block_mm_infos: Option<&[Option<BlockExtraInfo>]>,
) -> Vec<LocalBlockHash> {
use std::fs::OpenOptions;
use std::io::Write;

// Log input parameters
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let _ = writeln!(
file,
"\n============================================================"
);
let _ = writeln!(file, "=== compute_block_hash_for_seq INPUT ===");
let _ = writeln!(file, "kv_block_size: {}", kv_block_size);
let _ = writeln!(file, "num_tokens: {}", tokens.len());
let _ = writeln!(file, "tokens: {:?}", tokens);
let _ = writeln!(file, "block_mm_infos: {:?}", block_mm_infos);
}

let result: Vec<LocalBlockHash> = tokens
.chunks_exact(kv_block_size as usize)
.enumerate()
.map(|(block_idx, chunk)| {
let mut bytes: Vec<u8> = chunk.iter().flat_map(|&num| num.to_le_bytes()).collect();

// Include MM hashes in the block hash computation if present
if let Some(mm_infos) = block_mm_infos {
if let Some(Some(block_mm_info)) = mm_infos.get(block_idx) {
// Sort mm_hashes for consistent ordering
let mut mm_hashes: Vec<u64> = block_mm_info
.mm_objects
.iter()
.map(|obj| obj.mm_hash)
.collect();
mm_hashes.sort_unstable();

// Log MM hash inclusion
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let _ = writeln!(
file,
"Block {}: Including mm_hashes {:?} in hash computation",
block_idx, mm_hashes
);
}

// Append sorted mm_hashes to the byte array
for mm_hash in mm_hashes {
bytes.extend_from_slice(&mm_hash.to_le_bytes());
}
}
}

compute_block_hash(&Bytes::from(bytes)) // Convert the byte Vec to Bytes
let hash = compute_block_hash(&Bytes::from(bytes));

// Log per-block result
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let _ = writeln!(
file,
"Block {}: chunk tokens {:?} -> hash {}",
block_idx, chunk, hash.0
);
}

hash
})
.collect()
.collect();

// Log output
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let hash_values: Vec<u64> = result.iter().map(|h| h.0).collect();
let _ = writeln!(file, "=== OUTPUT ===");
let _ = writeln!(
file,
"result_hashes ({}): {:?}",
hash_values.len(),
hash_values
);
let _ = writeln!(
file,
"============================================================"
);
}

result
}

⚠️ Potential issue | 🟠 Major

MM‑aware hashing semantics look correct; logging and allocation deserve tightening

The new compute_block_hash_for_seq behavior—chunking tokens, optionally incorporating per‑block mm_hash values (sorted for determinism), and skipping partial trailing blocks—aligns with the Python tests and the intended “same tokens + different MM ⇒ different hash” contract. That’s good.

Two follow‑ups are worth addressing:

  1. Unbounded debug logging to /tmp/debug_compute_block_hash.txt

    The function logs every call’s inputs (including full token sequences and MM metadata) and every block’s hash to a fixed temp file. In production this will:

    • Add synchronous file IO on a very hot path.
    • Risk filling or slowing /tmp, and misbehavior when it’s not writable.
    • Persist user‑level token content to disk, which is a privacy concern.

    As with the Python binding, this should either be removed before release or put behind a clearly opt‑in debug mechanism (env flag / feature gate) with reduced verbosity.

  2. Avoid extra Bytes allocation per block

    let hash = compute_block_hash(&Bytes::from(bytes));

    compute_block_hash only needs &[u8], so you can pass the existing Vec<u8> slice:

    Before:

        let hash = compute_block_hash(&Bytes::from(bytes));

    After:

        let hash = compute_block_hash(&bytes);

    This keeps the hashing loop allocation-free beyond the initial `Vec<u8>` build.

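The contract the tests pin down can be sketched with a stand-in hasher (DefaultHasher here replaces the real compute_block_hash; only the byte layout and the sorting mirror the PR):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one block's tokens, mixing in sorted multimodal hashes when present.
fn block_hash(chunk: &[u32], mm_hashes: &[u64]) -> u64 {
    // Token bytes first, little-endian, as in the PR.
    let mut bytes: Vec<u8> = chunk.iter().flat_map(|&t| t.to_le_bytes()).collect();
    // Sort MM hashes so the result is deterministic regardless of object order.
    let mut sorted: Vec<u64> = mm_hashes.to_vec();
    sorted.sort_unstable();
    for h in sorted {
        bytes.extend_from_slice(&h.to_le_bytes());
    }
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}
```

Identical tokens with different MM objects diverge, while reordering the same MM objects does not change the hash, which is exactly what the sorting buys.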
@PeaBrane PeaBrane self-requested a review December 11, 2025 23:25
@PeaBrane
Contributor

@zhongdaor-nv thanks! Can you check the rabbit's comments and CIs when you get the chance? I also noticed some bulky logs in Rust, can we remove those?

