
Conversation

Contributor

@zhongdaor-nv zhongdaor-nv commented Nov 25, 2025

Overview:

Add multimodal (MM) hash support to the KV router, ensuring blocks with identical tokens but different multimodal objects produce different hashes. Also adds a standalone KV router example for TRT-LLM with MM support.

Details:

  • Add multimodal metadata structures: BlockExtraInfo, RequestExtraInfo, BlockMmObjectInfo, RequestMmObjectInfo in kv_router/protocols.rs
  • Update compute_block_hash_for_seq to incorporate MM hashes into block hash computation
  • Extend Python bindings in kv.rs to accept optional block_mm_infos parameter
  • Add new standalone TRT-LLM router example under examples/deployments/router_standalone_trtllm/
  • Add unit tests for multimodal KV router functionality in test_mm_kv_router.py
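The core idea — identical token blocks must hash differently when they carry different multimodal objects — can be illustrated with a toy model in plain Python. This is a sketch only: the real `compute_block_hash_for_seq` lives in `lib/llm/src/kv_router/indexer.rs`, is written in Rust, and uses its own hash function; the `hashlib`-based folding below just demonstrates the behavior.

```python
import hashlib

def toy_block_hashes(tokens, block_size, block_mm_infos=None):
    """Toy model: hash each full block of tokens, folding in any per-block mm hashes."""
    hashes = []
    num_blocks = len(tokens) // block_size
    for i in range(num_blocks):
        h = hashlib.blake2b(digest_size=8)
        for t in tokens[i * block_size : (i + 1) * block_size]:
            h.update(t.to_bytes(4, "little"))
        info = block_mm_infos[i] if block_mm_infos else None
        if info:
            # Same tokens + different image hash => different block hash.
            for obj in info["mm_objects"]:
                h.update(obj["mm_hash"].to_bytes(8, "little"))
        hashes.append(int.from_bytes(h.digest(), "little"))
    return hashes

tokens = [1, 2, 3, 4] * 8  # 32 tokens = 1 block
text_only = toy_block_hashes(tokens, 32)
with_image = toy_block_hashes(tokens, 32, [{"mm_objects": [{"mm_hash": 0xDEADBEEF}]}])
assert text_only != with_image
```

The non-MM path (`block_mm_infos=None`) hashes exactly the token bytes, which mirrors the backward-compatibility requirement: existing text-only cache hits must be unaffected.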

Where should the reviewer start?

  • lib/llm/src/kv_router/protocols.rs - new multimodal protocol structures
  • lib/llm/src/kv_router/indexer.rs - updated hash computation with MM support
  • lib/bindings/python/tests/test_mm_kv_router.py - tests demonstrating the new functionality
  • examples/deployments/router_standalone_trtllm/ - new standalone example

Related Issues:

Relates to DIS-916

Summary by CodeRabbit

  • New Features

    • Added standalone router implementation with optimized KV cache routing and load-based worker selection
    • Extended chat completion API to support multimodal inputs (image URLs in messages)
    • Introduced KV cache metadata tracking and block-level caching optimization
  • Documentation

    • Added comprehensive documentation, example scripts, and test suite for router standalone setup with TensorRT-LLM integration
    • Added performance benchmarking and API testing utilities


Signed-off-by: zhongdaor <[email protected]>
@zhongdaor-nv zhongdaor-nv marked this pull request as ready for review December 11, 2025 05:48
@zhongdaor-nv zhongdaor-nv requested review from a team as code owners December 11, 2025 05:48
@zhongdaor-nv zhongdaor-nv changed the title add mm extra info feat: add multimodal support to KV router with standalone trtllm example Dec 11, 2025
@github-actions github-actions bot added the feat label Dec 11, 2025
Contributor

coderabbitai bot commented Dec 11, 2025

Walkthrough

This PR introduces multimodal (MM) support to the KV router system by adding MM metadata structures and hash computation logic across bindings and core routing logic. Concurrently, it adds a complete standalone TensorRT-LLM router deployment example with API, worker, routing, and testing components.

Changes

Cohort / File(s) Summary
Router Standalone TensorRT-LLM Deployment Example
examples/deployments/router_standalone_trtllm/README.md, __init__.py, api.py, worker.py, router.py, test_router.py, perf.sh, ping.sh
New standalone router deployment with FastAPI service (api.py) handling chat completions and multimodal inputs, TRT-LLM worker wrapper (worker.py) with KV cache event publishing, KV router (router.py) using RadixTree-based matching and load metrics, comprehensive test suite (test_router.py) for text and MM routing scenarios, performance benchmarking and ping scripts, and detailed documentation.
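The router described here scores workers by combining radix-tree prefix overlap with load metrics. A toy scoring policy sketches the decision; the weighting, function names, and metric shapes below are illustrative assumptions, not the actual code in `router.py`.

```python
def select_worker(overlap_blocks, loads, overlap_weight=2.0):
    """Toy policy: prefer workers with more cached prefix blocks, penalize load.

    overlap_blocks: worker_id -> number of matched KV blocks (from the radix tree)
    loads: worker_id -> load metric in [0, 1] (e.g. KV cache usage)
    """
    def score(worker):
        return overlap_weight * overlap_blocks.get(worker, 0) - loads.get(worker, 0.0)

    return max(overlap_blocks.keys() | loads.keys(), key=score)

# Worker 1 has cached prefix blocks; worker 2 is idle but has no overlap.
best = select_worker({1: 3, 2: 0}, {1: 0.4, 2: 0.1})
```

With MM-aware hashing in place, an MM request only overlaps blocks published with the same `mm_hash`, so this policy naturally steers it to the worker that already holds the matching image-conditioned KV blocks.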
Python Bindings for Multimodal KV Hashing
lib/bindings/python/rust/llm/kv.rs, lib/bindings/python/src/dynamo/_core.pyi
Extended compute_block_hash_for_seq_py and KvEventPublisher::publish_stored to accept optional block_mm_infos parameter; propagates MM metadata through Python-Rust boundary. Updated type stubs with MM info documentation and usage examples.
C Bindings for Multimodal Support
lib/bindings/c/src/lib.rs
Updated KV cache stored block construction to initialize new mm_extra_info field and pass MM parameter to hash computation.
Multimodal KV Router Core Logic
lib/llm/src/kv_router/protocols.rs
Added new MM metadata types: BlockMmObjectInfo, BlockExtraInfo, RequestMmObjectInfo, RequestExtraInfo with to_block_level() conversion. Extended RouterRequest::New and KvCacheStoredBlockData to carry MM info.
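A plausible reading of `to_block_level()` is that it maps request-level token-offset ranges onto the blocks they overlap. The sketch below encodes that assumed semantics in Python; the real Rust logic in `protocols.rs` may differ in details such as offset conventions or partial-block handling.

```python
def to_block_level(mm_objects, num_blocks, block_size):
    """Toy model: attach each request-level mm object to every block whose
    token range overlaps one of the object's [start, end) token offsets."""
    blocks = [None] * num_blocks
    for obj in mm_objects:
        for start, end in obj["offsets"]:
            first = start // block_size
            last = min((end - 1) // block_size, num_blocks - 1)
            for b in range(first, last + 1):
                if blocks[b] is None:
                    blocks[b] = {"mm_objects": []}
                blocks[b]["mm_objects"].append({"mm_hash": obj["mm_hash"]})
    return blocks

# One image spanning tokens [40, 100) with block_size=32 touches blocks 1..3.
blocks = to_block_level([{"mm_hash": 0xABCD, "offsets": [(40, 100)]}], 4, 32)
```

Blocks outside the image's token range stay `None`, so their hashes (and cache hits) are identical to the text-only case.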
KV Router Indexer and Hash Computation
lib/llm/src/kv_router/indexer.rs
Extended compute_block_hash_for_seq() signature to accept optional block_mm_infos parameter; incorporates MM hashes from BlockExtraInfo into block hash computation to differentiate blocks by both tokens and multimodal content.
KV Cache Event Publishing
lib/llm/src/kv_router/publisher.rs
Updated create_stored_block_from_parts() and create_stored_blocks() to accept and propagate block_mm_infos; extended RawKvEvent deserialization to parse MM metadata in stored block events.
Router Request Handling
lib/llm/src/kv_router.rs
Updated compute_block_hash_for_seq call sites to pass MM parameter; extended pattern matching for RouterRequest::New to accommodate request_extra_info field.
Mocker KV Manager
lib/llm/src/mocker/kv_manager.rs
Added mm_extra_info: None initialization in KV cache stored block construction.
Preprocessor
lib/llm/src/protocols/common/preprocessor.rs
Added optional request_extra_info: Option<RequestExtraInfo> field to PreprocessedRequest.
Multimodal KV Router Tests
lib/bindings/python/tests/test_mm_kv_router.py
New comprehensive test suite validating MM-aware hash computation, block storage, per-worker removal, and end-to-end routing with MM metadata across multiple workers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

  • Multimodal metadata propagation: MM info flows through multiple layers (protocols → indexer → publisher → bindings); verify correctness of hash incorporation and serialization across all paths
  • New public APIs: Multiple new exported classes and functions (ServingParams, ParsedRequest, ProcessedInput, RouterRequest, BlockExtraInfo, RequestExtraInfo); ensure consistency with existing patterns
  • Comprehensive new module (worker.py, api.py, router.py): Large volumes of new business logic with streaming, async tasks, and ZMQ publishers; verify event handling, error paths, and lifecycle management
  • Hash computation logic: Integration of MM hashes into block hashing requires careful verification of ordering, collision handling, and backward compatibility
  • Interconnected changes: MM support spans from Rust protocols through Python bindings to the example deployment; any inconsistency propagates broadly

Areas requiring extra attention:

  • Verification that compute_block_hash_for_seq() correctly incorporates MM hashes without breaking existing cache hits for non-MM cases
  • Review of RequestExtraInfo::to_block_level() logic for correct block-wise MM info aggregation and offset computation
  • Validation of worker.py KV event publishing, particularly block parsing and MM info extraction
  • Confirmation that MM metadata serialization/deserialization is symmetric across all event types
  • Testing coverage for edge cases: partial blocks, multiple MM objects, None MM info, and cache eviction scenarios
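The serialization-symmetry concern above is mechanically checkable with a round-trip test. A minimal JSON sketch (field names taken from this PR's structures, but the payload shape is illustrative):

```python
import json

stored_block = {
    "tokens_hash": 123,
    "block_hash": 456,
    "mm_extra_info": {"mm_objects": [{"mm_hash": 3735928559, "offsets": [[0, 32]]}]},
}

# Symmetric serde: deserialize(serialize(x)) == x, for MM and non-MM blocks alike.
assert json.loads(json.dumps(stored_block)) == stored_block
no_mm = {**stored_block, "mm_extra_info": None}
assert json.loads(json.dumps(no_mm)) == no_mm
```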

🐰 A KV cache blooms with MM delight,
Routers match tokens and images just right,
RadixTrees and hashes align,
Workers stream chunks so fine,
Multimodal dreams take flight!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

  • Description check — ⚠️ Warning: The PR description contains only placeholder headings with no actual content; all sections (Overview, Details, Where should the reviewer start, Related Issues) are unfilled. Resolution: Complete all required sections: provide an overview of the multimodal metadata additions, describe the key changes across files, specify which files reviewers should prioritize, and reference the actual GitHub issue number.
  • Title check — ❓ Inconclusive: The PR title 'add mm extra info' is vague and does not clearly convey the main changes; it uses a generic abbreviation ('mm') without context. Resolution: Revise the title to be more specific and descriptive, such as 'Add multimodal metadata support to KV cache router' or similar to clarify the scope.

✅ Passed checks (1 passed)

  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/bindings/python/src/dynamo/_core.pyi (1)

235-281: Remove duplicate docstring content.

The docstring contains duplicate documentation. Lines 247-271 provide the new detailed Args/Returns/Example documentation, but lines 272-279 repeat the older, shorter version of the same documentation. This redundancy should be cleaned up.

     Example:
         >>> tokens = [1, 2, 3, 4] * 8  # 32 tokens = 1 block
         >>> mm_info = {
         ...     "mm_objects": [{
         ...         "mm_hash": 0xDEADBEEF,
         ...     }]
         ... }
         >>> hashes = compute_block_hash_for_seq_py(tokens, 32, [mm_info])
-    
-    Compute block hashes for a sequence of tokens
-
-    Args:
-        tokens: List of token IDs
-        kv_block_size: Size of each KV cache block
-
-    Returns:
-        List of block hashes as integers
     """

     ...
🧹 Nitpick comments (17)
lib/llm/src/kv_router/protocols.rs (1)

450-486: LGTM!

The test correctly initializes the new mm_extra_info field. Consider adding a separate test case that validates serialization/deserialization with non-None multimodal metadata to ensure the new structures work end-to-end.

lib/llm/src/kv_router.rs (1)

485-493: Consider extracting the duplicate block_hashes computation.

compute_block_hash_for_seq is called twice with the same inputs—once at line 487 and again at line 491. While this is conditional (the second call is inside then), when router_track_active_blocks is true, both calls execute. Consider computing block_hashes once and reusing it.

 pub async fn get_potential_loads(&self, tokens: &[u32]) -> Result<Vec<PotentialLoad>> {
     let isl_tokens = tokens.len();
     let block_hashes = compute_block_hash_for_seq(tokens, self.block_size, None);
-    let overlap_scores = self.indexer.find_matches(block_hashes).await?;
+    let overlap_scores = self.indexer.find_matches(block_hashes.clone()).await?;

     let maybe_seq_hashes = self.kv_router_config.router_track_active_blocks.then(|| {
-        let block_hashes = compute_block_hash_for_seq(tokens, self.block_size, None);
         compute_seq_hash_for_block(&block_hashes)
     });
lib/bindings/python/rust/llm/kv.rs (1)

327-372: DRY up block_mm_infos depythonize logic across bindings

Both compute_block_hash_for_seq_py and KvEventPublisher::publish_stored implement identical block_mm_infos → Option<Vec<Option<BlockExtraInfo>>> conversion and error mapping. This is fine functionally, but it’s an easy place for future drift.

A small internal helper (e.g., fn depythonize_block_mm_infos(obj: Option<Bound<PyAny>>) -> PyResult<Option<Vec<Option<BlockExtraInfo>>>>) would centralize the behavior and keep Python↔Rust MM metadata semantics consistent.

lib/llm/src/kv_router/indexer.rs (1)

1110-1151: MM extra info on synthetic routing events is always None

In the TTL/pruning routing path, synthetic KvCacheEventData::Stored events are built with:

blocks: hashes.map(|(local_hash, sequence_hash)| KvCacheStoredBlockData {
    tokens_hash: *local_hash,
    block_hash: ExternalSequenceBlockHash(*sequence_hash),
    mm_extra_info: None,
})

This keeps the existing behavior (no MM metadata on these synthetic entries), which is fine as long as MM distinctions are encoded solely into tokens_hash/sequence_hash. If future features rely on mm_extra_info for anything beyond hashing (e.g., inspection or filtering), you may eventually want to carry through real MM info here as well.

Not an immediate issue, but worth keeping in mind as MM use‑cases expand.

lib/llm/src/kv_router/publisher.rs (2)

446-485: Consider a small test that exercises non‑None block_mm_infos

create_stored_blocks now accepts block_mm_infos: Option<&[Option<BlockExtraInfo>]> and threads per‑block entries into both tokens_hash and mm_extra_info, but the unit tests only cover the None case:

  • test_create_stored_blocks_ok
  • test_create_stored_blocks_wrong_size_triggers_warning

A small additional test that passes a Some(&[Some(BlockExtraInfo { … })]) and asserts:

  • blocks[i].mm_extra_info is Some(...), and
  • tokens_hash matches a direct call to compute_block_hash_for_seq with the same MM info

would close that gap and guard this path against regressions.


1003-1017: Existing convert_event_block_stored test only covers the None MM case

test_convert_event_block_stored still uses block_mm_infos: None, which is good for backward compatibility but doesn’t validate that non‑None block_mm_infos survive deserialization and reach create_stored_blocks.

Once you’re happy with the MM plumbing, consider extending this test (or adding a new one) that supplies a simple block_mm_infos payload and asserts that the resulting KvCacheStoreData::blocks entries have mm_extra_info populated as expected.

examples/deployments/router_standalone_trtllm/router.py (1)

288-296: Mark unused app argument to satisfy linters

lifespan(self, app: FastAPI) doesn’t use app, which is intentional for FastAPI’s lifecycle signature but shows up in Ruff (ARG002). Renaming it to _app (or adding a del app inside) makes that intent explicit and keeps linters quiet:

-async def lifespan(self, app: FastAPI):
+async def lifespan(self, _app: FastAPI):
examples/deployments/router_standalone_trtllm/test_router.py (2)

72-92: Consider catching more specific exceptions.

While broad exception handling is acceptable in test utilities, logging the exception would improve debuggability.

 def send_request(client: httpx.Client, url: str, payload: dict) -> bool:
     """Send a chat completion request and consume the stream."""
     try:
         resp = client.post(f"{url}/v1/chat/completions", json=payload)
         if resp.status_code != 200:
             return False
         for _ in resp.iter_lines():
             pass
         return True
-    except Exception:
+    except Exception as e:
+        print(f"Request failed: {e}")
         return False


 def get_tree_info(client: httpx.Client, url: str) -> dict:
     """Get radix tree debug info."""
     try:
         resp = client.get(f"{url}/debug/tree_info")
         return resp.json()
-    except Exception:
+    except Exception as e:
+        print(f"Failed to get tree info: {e}")
         return {"num_blocks": -1, "events": []}

159-175: Server connectivity check only verifies router, not API.

_check_servers returns True without actually verifying the API server is reachable. Consider adding an API health check.

     def _check_servers(self) -> bool:
         """Verify both API and Router servers are reachable."""
         print("\nChecking server connectivity...")
         try:
             # Check router
             resp = self.client.get(f"{self.config.router_url}/debug/tree_info")
             if resp.status_code != 200:
                 print(f"  Router not responding: {resp.status_code}")
                 return False
             print(f"  Router OK (blocks in tree: {resp.json().get('num_blocks', '?')})")

-            # Check API - just verify it's up
-            # A simple request to verify the endpoint exists
+            # Check API - verify it's up with a minimal request
+            try:
+                # Just check the server responds (will fail with 4xx but confirms connectivity)
+                self.client.get(f"{self.config.api_url}/health", timeout=5)
+            except httpx.HTTPStatusError:
+                pass  # Expected if no health endpoint, but connection works
+            print("  API OK (connected)")
             return True
         except Exception as e:
             print(f"  Connection error: {e}")
             return False
examples/deployments/router_standalone_trtllm/worker.py (4)

23-28: Debug file path uses fixed /tmp location.

This is acceptable for debug-only code guarded by DEBUG_ENABLED, but be aware it may not be accessible in containerized environments or could conflict with other instances.

Consider using tempfile module or making the path configurable:

+import tempfile
+
 # Debug flag: set DYNAMO_DEBUG=1 to enable debug file dumps
 DEBUG_ENABLED = os.environ.get("DYNAMO_DEBUG", "0") == "1"
-DEBUG_WORKER_KV_FILE = "/tmp/debug_worker_kv.txt"
+DEBUG_WORKER_KV_FILE = os.environ.get("DYNAMO_DEBUG_FILE", "/tmp/debug_worker_kv.txt")

148-166: extract_mm_info only extracts the first MM object.

The function returns after finding the first mm_key, ignoring any additional multimodal objects in the request.

If multiple images can exist per request, consider accumulating all mm_objects:

 def extract_mm_info(blocks_data: list[dict], all_token_ids: list[int]) -> dict | None:
     """Extract multimodal hash info from TRTLLM block data."""
+    mm_objects = []
     for block in blocks_data:
         mm_keys = block.get("mm_keys", [])
         for mm_key in mm_keys:
             if mm_key.get("type") != "mm_key":
                 continue

             hash_hex = mm_key.get("hash", "")
             if not hash_hex:
                 continue

             mm_hash = int(hash_hex[:16], 16)
             offsets = find_image_token_range(all_token_ids)

             if offsets:
-                return {"mm_objects": [{"mm_hash": mm_hash, "offsets": [offsets]}]}
+                mm_objects.append({"mm_hash": mm_hash, "offsets": [offsets]})

-    return None
+    return {"mm_objects": mm_objects} if mm_objects else None

282-307: Use logger.exception instead of logger.error for exception logging.

As flagged by static analysis, logger.exception automatically includes the traceback.

     async def _metrics_loop(self):
         """Continuously publish worker metrics."""
         await asyncio.sleep(1)

         try:
             async for stat in self.llm.get_stats_async(timeout=5):
                 if not isinstance(stat, dict):
                     continue

                 num_waiting = (
                     stat["numQueuedRequests"]
                     + stat["inflightBatchingStats"]["numPausedRequests"]
                 )
                 kv_stats = stat["kvCacheStats"]
                 usage = (
                     kv_stats["allocTotalBlocks"] / kv_stats["maxNumBlocks"]
                     if kv_stats["maxNumBlocks"] > 0
                     else 0.0
                 )

                 self.metrics_publisher.publish(num_waiting, usage)

         except asyncio.CancelledError:
             pass
         except Exception as e:
-            logger.error(f"Worker {self.worker_id} metrics error: {e}")
+            logger.exception(f"Worker {self.worker_id} metrics error")

320-330: Use logger.exception for KV events errors as well.

         except RuntimeError as e:
             if "IterationResult is not properly instantiated" in str(e):
                 logger.warning(f"Worker {self.worker_id}: KV events not available")
             else:
-                logger.error(f"Worker {self.worker_id} KV events error: {e}")
-        except Exception as e:
-            logger.error(f"Worker {self.worker_id} KV events error: {e}")
+                logger.exception(f"Worker {self.worker_id} KV events error")
+        except Exception:
+            logger.exception(f"Worker {self.worker_id} KV events error")
examples/deployments/router_standalone_trtllm/api.py (4)

247-254: Silently falling back to text-only may hide issues.

When MM processing fails, the fallback to text-only mode may produce incorrect results for multimodal requests without clear indication to the caller.

Consider returning an error for MM requests that fail processing, or at least including a warning in the response:

         except Exception as e:
-            logger.warning(f"MM processing failed: {e}, falling back to text-only")
+            logger.exception("MM processing failed, falling back to text-only")
             return ProcessedInput(
                 tokens=self.tokenizer.encode(prompt),
                 mm_input=None,
                 mm_hash=None,
                 image_offsets=None,
             )

288-296: Simplify dict access as suggested by linter.

     def _compute_mm_hash(self, multi_modal_data: dict | None) -> int | None:
         """Compute mm_hash from multimodal data."""
         if not multi_modal_data:
             return None

         mm_hashes_dict = apply_mm_hashes(multi_modal_data)
-        if "image" in mm_hashes_dict and mm_hashes_dict["image"]:
-            return int(mm_hashes_dict["image"][0][:16], 16)
+        image_hashes = mm_hashes_dict.get("image")
+        if image_hashes:
+            return int(image_hashes[0][:16], 16)
         return None

315-330: Use logger.exception for routing errors.

     async def _route_request(self, local_hashes: list[int], num_tokens: int) -> int | ErrorResponse:
         """Query router for best worker ID."""
         try:
             router_request = RouterRequest(local_hashes=local_hashes, num_tokens=num_tokens)
             response = await self.http_client.post(
                 f"http://localhost:{self.init_params.router_port}/find_best_worker",
                 json=router_request.model_dump(),
                 timeout=1,
             )
             response.raise_for_status()
             return RouterResponse.model_validate(response.json()).worker_id
         except (httpx.RequestError, httpx.HTTPStatusError) as e:
-            logger.error(f"Router request failed: {e}")
+            logger.exception("Router request failed")
             return ErrorResponse(
                 error=make_error("Router service unavailable", "service_unavailable", 503)
             )

573-589: KeyboardInterrupt won't be caught inside asyncio.run.

The inner except KeyboardInterrupt at line 579 will never trigger because asyncio.run propagates it. The outer handler at line 588 is correct.

     async def run_with_shutdown():
         try:
             router_task = asyncio.create_task(router_api.start())
             await asyncio.sleep(0.5)
             api_task = asyncio.create_task(api.start())
             await asyncio.gather(router_task, api_task)
-        except KeyboardInterrupt:
-            logger.info("Shutting down services...")
+        except asyncio.CancelledError:
+            logger.info("Tasks cancelled, shutting down services...")
         except Exception as e:
-            logger.exception(f"Unhandled exception: {e}")
+            logger.exception("Unhandled exception")
         finally:
             await api.shutdown()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5250303 and 7d4e95c.

📒 Files selected for processing (18)
  • examples/deployments/router_standalone_trtllm/README.md (1 hunks)
  • examples/deployments/router_standalone_trtllm/__init__.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/api.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/perf.sh (1 hunks)
  • examples/deployments/router_standalone_trtllm/ping.sh (1 hunks)
  • examples/deployments/router_standalone_trtllm/router.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/test_router.py (1 hunks)
  • examples/deployments/router_standalone_trtllm/worker.py (1 hunks)
  • lib/bindings/c/src/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm/kv.rs (4 hunks)
  • lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
  • lib/bindings/python/tests/test_mm_kv_router.py (1 hunks)
  • lib/llm/src/kv_router.rs (5 hunks)
  • lib/llm/src/kv_router/indexer.rs (12 hunks)
  • lib/llm/src/kv_router/protocols.rs (3 hunks)
  • lib/llm/src/kv_router/publisher.rs (16 hunks)
  • lib/llm/src/mocker/kv_manager.rs (1 hunks)
  • lib/llm/src/protocols/common/preprocessor.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.

Applied to files:

  • lib/llm/src/kv_router/protocols.rs
  • lib/bindings/c/src/lib.rs
  • lib/llm/src/kv_router.rs
  • lib/llm/src/mocker/kv_manager.rs
  • lib/llm/src/kv_router/publisher.rs
  • lib/bindings/python/rust/llm/kv.rs
  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/kv_router/protocols.rs
  • lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.

Applied to files:

  • lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-08-29T10:08:18.434Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.

Applied to files:

  • lib/llm/src/kv_router.rs
📚 Learning: 2025-10-14T00:58:05.744Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.

Applied to files:

  • lib/llm/src/kv_router.rs
  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-05-30T06:34:12.785Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.

Applied to files:

  • lib/llm/src/kv_router/indexer.rs
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/_core.pyi (1)
lib/bindings/python/rust/llm/kv.rs (1)
  • compute_block_hash_for_seq_py (29-91)
lib/llm/src/kv_router.rs (1)
lib/llm/src/kv_router/indexer.rs (1)
  • compute_block_hash_for_seq (135-237)
examples/deployments/router_standalone_trtllm/test_router.py (3)
lib/bindings/python/rust/llm/kv.rs (3)
  • compute_block_hash_for_seq_py (29-91)
  • block_size (756-758)
  • block_size (846-848)
lib/bindings/python/src/dynamo/_core.pyi (4)
  • compute_block_hash_for_seq_py (235-282)
  • get (1663-1664)
  • block_size (670-674)
  • block_size (724-731)
examples/deployments/router_standalone_trtllm/router.py (1)
  • get_tree_info (310-313)
lib/llm/src/kv_router/publisher.rs (1)
lib/llm/src/kv_router/indexer.rs (4)
  • compute_block_hash_for_seq (135-237)
  • kv_block_size (2488-2488)
  • kv_block_size (2493-2493)
  • kv_block_size (2498-2498)
examples/deployments/router_standalone_trtllm/api.py (2)
examples/deployments/router_standalone_trtllm/router.py (6)
  • RouterAPI (260-339)
  • RouterRequest (42-44)
  • RouterResponse (47-50)
  • _setup_routes (297-333)
  • start (335-339)
  • shutdown (239-252)
examples/deployments/router_standalone_trtllm/worker.py (5)
  • TrtllmWorkers (471-515)
  • direct (504-509)
  • start_all (499-502)
  • shutdown (451-463)
  • shutdown_all (511-515)
lib/bindings/python/tests/test_mm_kv_router.py (2)
lib/bindings/python/src/dynamo/_core.pyi (11)
  • RadixTree (565-636)
  • compute_block_hash_for_seq_py (235-282)
  • apply_event (598-609)
  • dump_tree_as_events (629-636)
  • find_matches (583-596)
  • find_matches (650-660)
  • scores (545-552)
  • remove_worker (611-618)
  • clear_all_blocks (620-627)
  • block_size (670-674)
  • block_size (724-731)
lib/bindings/python/rust/llm/kv.rs (10)
  • compute_block_hash_for_seq_py (29-91)
  • apply_event (529-555)
  • dump_tree_as_events (601-633)
  • find_matches (492-527)
  • find_matches (760-776)
  • scores (410-417)
  • remove_worker (557-577)
  • clear_all_blocks (579-599)
  • block_size (756-758)
  • block_size (846-848)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.
examples/deployments/router_standalone_trtllm/README.md

[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.

lib/bindings/python/src/dynamo/_core.pyi

[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.

examples/deployments/router_standalone_trtllm/ping.sh

[error] 1-1: pre-commit: check-executables-have-shebangs failed. ping.sh is marked executable but has no valid shebang.

examples/deployments/router_standalone_trtllm/perf.sh

[error] 1-1: pre-commit: check-executables-have-shebangs failed. perf.sh is marked executable but has no valid shebang.

examples/deployments/router_standalone_trtllm/test_router.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.


[error] 782-782: Ruff: Local variable status is assigned to but never used. (F841)

examples/deployments/router_standalone_trtllm/worker.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

examples/deployments/router_standalone_trtllm/router.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

examples/deployments/router_standalone_trtllm/api.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

examples/deployments/router_standalone_trtllm/__init__.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.

lib/bindings/python/tests/test_mm_kv_router.py

[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.


[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.

🪛 markdownlint-cli2 (0.18.1)
examples/deployments/router_standalone_trtllm/README.md

164-164: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.8)
examples/deployments/router_standalone_trtllm/test_router.py

80-80: Consider moving this statement to an else block

(TRY300)


81-81: Do not catch blind exception: Exception

(BLE001)


90-90: Do not catch blind exception: Exception

(BLE001)


172-172: Consider moving this statement to an else block

(TRY300)


173-173: Do not catch blind exception: Exception

(BLE001)


628-628: f-string without any placeholders

Remove extraneous f prefix

(F541)


701-701: Local variable status is assigned to but never used

Remove assignment to unused variable status

(F841)

examples/deployments/router_standalone_trtllm/worker.py

25-25: Probable insecure usage of temporary file or directory: "/tmp/debug_worker_kv.txt"

(S108)


130-130: Do not catch blind exception: Exception

(BLE001)


131-131: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


306-306: Do not catch blind exception: Exception

(BLE001)


307-307: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


326-326: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


327-327: Do not catch blind exception: Exception

(BLE001)


328-328: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

examples/deployments/router_standalone_trtllm/router.py

30-30: Probable insecure usage of temporary file or directory: "/tmp/debug_kv_events.txt"

(S108)


131-131: Store a reference to the return value of asyncio.create_task

(RUF006)


143-143: Do not catch blind exception: Exception

(BLE001)


150-150: Store a reference to the return value of asyncio.create_task

(RUF006)


163-163: Do not catch blind exception: Exception

(BLE001)


180-180: Avoid specifying long messages outside the exception class

(TRY003)


289-289: Unused method argument: app

(ARG002)


307-307: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


338-338: Possible binding to all interfaces

(S104)

examples/deployments/router_standalone_trtllm/api.py

36-36: Probable insecure usage of temporary file or directory: "/tmp/debug_api_hashes.txt"

(S108)


203-203: Do not catch blind exception: Exception

(BLE001)


247-247: Do not catch blind exception: Exception

(BLE001)


294-294: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)


327-327: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


361-361: Do not catch blind exception: Exception

(BLE001)


362-362: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


457-457: Do not catch blind exception: Exception

(BLE001)


458-458: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


488-488: Do not catch blind exception: Exception

(BLE001)


501-501: Possible binding to all interfaces

(S104)


582-582: Redundant exception object included in logging.exception call

(TRY401)

lib/bindings/python/tests/test_mm_kv_router.py

352-352: Do not assert blind exception: Exception

(B017)

🪛 Shellcheck (0.11.0)
examples/deployments/router_standalone_trtllm/ping.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

examples/deployments/router_standalone_trtllm/perf.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (.)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/bindings/python)
🔇 Additional comments (37)
lib/llm/src/kv_router/protocols.rs (4)

38-48: LGTM!

The request_extra_info field is correctly added with appropriate serde attributes for backward compatibility and efficient serialization.


50-57: LGTM!

The default implementation correctly initializes the new request_extra_info field to None.


269-366: Well-structured multimodal metadata types and conversion logic.

The to_block_level() method correctly handles:

  • Ceiling division for block count calculation
  • Offset intersection with block boundaries
  • Deduplication of mm_hash entries within blocks
  • Edge cases like empty ranges via the local_start < local_end guard

The use of saturating_sub(1) on line 330 correctly handles exclusive end offsets for block boundary calculation.
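The conversion described above can be sketched in Python (a rough model of the Rust `to_block_level()` logic; input shape and names are assumptions based on this description, not the actual struct layout):

```python
def to_block_level(mm_objects, num_tokens, block_size):
    """Split request-level MM objects into per-block MM info.

    mm_objects: list of (mm_hash, [(start, end), ...]) token ranges
    with exclusive ends. Returns one entry per block: a deduplicated
    list of mm_hashes whose ranges intersect that block, or None when
    the block contains no multimodal tokens.
    """
    num_blocks = (num_tokens + block_size - 1) // block_size  # ceiling division
    blocks = []
    for i in range(num_blocks):
        block_start = i * block_size
        block_end = min(block_start + block_size, num_tokens)
        hashes = []
        for mm_hash, offsets in mm_objects:
            for start, end in offsets:
                # intersect the MM range with the block boundaries
                local_start = max(start, block_start)
                local_end = min(end, block_end)
                if local_start < local_end and mm_hash not in hashes:
                    hashes.append(mm_hash)  # dedup within a block
        blocks.append(hashes or None)
    return blocks
```

For example, one image spanning tokens [0, 20) with `block_size=16` and `num_tokens=48` touches blocks 0 and 1 but not block 2.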


368-378: LGTM!

The mm_extra_info field follows the established pattern with appropriate serde attributes for backward compatibility and efficient serialization.

lib/bindings/c/src/lib.rs (1)

163-180: LGTM!

The C bindings correctly pass None for the new multimodal metadata parameters, maintaining backward compatibility for non-multimodal use cases.

examples/deployments/router_standalone_trtllm/ping.sh (1)

17-25: LGTM!

The curl command correctly tests the streaming chat completion endpoint with appropriate headers and payload.

lib/llm/src/mocker/kv_manager.rs (1)

133-144: LGTM!

The mm_extra_info: None initialization correctly aligns with the updated KvCacheStoredBlockData structure. The mocker doesn't need to track multimodal metadata.

lib/llm/src/protocols/common/preprocessor.rs (2)

8-8: LGTM!

The import correctly combines RouterConfigOverride and the new RequestExtraInfo from the same module path.


104-108: LGTM!

The new request_extra_info field follows established patterns in this struct—using #[builder(default)] and #[serde(default, skip_serializing_if = "Option::is_none")] consistent with other optional fields like extra_args and extra_fields. The documentation clearly explains its purpose for carrying multimodal metadata.

lib/llm/src/kv_router.rs (4)

398-398: LGTM!

Correct adaptation to the updated compute_block_hash_for_seq signature, passing None for the new block_mm_infos parameter.


452-455: LGTM!

Consistent update to pass None for the multimodal info parameter.


519-522: Verify: request_extra_info is intentionally unused for now.

The request_extra_info field is captured but explicitly ignored (_). Based on the PR context, this appears to be groundwork for future multimodal routing—the field is added to the protocol but not yet propagated to hash computation. Please confirm this is the intended incremental approach.


604-606: LGTM!

Consistent update in KvPushRouter to pass None for the new multimodal parameter.

lib/bindings/python/tests/test_mm_kv_router.py (1)

214-345: MM block‑hash test coverage looks strong

The block‑hash tests exercise core scenarios well (no‑MM vs MM, determinism, multi‑block, partial blocks, None MM info, offsets ignored, multiple MM objects). This should give good confidence that the Rust hashing behavior is correctly surfaced into Python.

lib/llm/src/kv_router/indexer.rs (1)

2485-2501: Unit test updates confirm MM‑agnostic behavior is preserved when block_mm_infos is None

The updated test_compute_block_hash_for_seq now calls compute_block_hash_for_seq(..., None) and asserts the same block‑count behavior as before (1 hash for ≤1.5 blocks, 2 for >2.5, etc.). This is a good regression guard ensuring the MM extension didn’t change the base (no‑MM) semantics.

lib/llm/src/kv_router/publisher.rs (2)

419-444: MM metadata correctly influences tokens_hash and is stored per block

The changes to create_stored_block_from_parts look consistent with the new MM hashing contract:

  • For each block, you build a one‑element block_mm_infos slice from mm_extra_info and call compute_block_hash_for_seq on the block’s tokens, so tokens_hash is now a function of tokens + that block’s mm_hash set.
  • The same mm_extra_info is then stored on the resulting KvCacheStoredBlockData, so downstream consumers can still inspect MM metadata without recomputing it.

This ensures events produced by this publisher match the MM‑aware hashing logic the indexer and Python bindings expect.
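The contract can be illustrated with a toy hash (the real Rust implementation uses a different hash function; this sketch only demonstrates the property that identical tokens with different mm_hash sets yield different block hashes, order-independently and deduplicated):

```python
import hashlib
import struct

def block_hash(tokens, mm_hashes=None):
    """Toy MM-aware block hash: tokens plus the block's mm_hash set."""
    h = hashlib.blake2b(digest_size=8)
    for t in tokens:
        h.update(struct.pack("<I", t))
    # fold in the deduplicated, sorted mm_hash set so ordering and
    # duplicates do not change the result
    for m in sorted(set(mm_hashes or [])):
        h.update(struct.pack("<Q", m))
    return int.from_bytes(h.digest(), "little")
```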


533-701: RawKvEvent deserialization for block_mm_infos is robust and backwards‑compatible

The custom Deserialize impl handles block_mm_infos in both tagged‑map and tagged‑sequence encodings, defaulting to None when the field is absent, which keeps older producers compatible. That matches the intended Option<Vec<Option<BlockExtraInfo>>> semantics and should interoperate cleanly with both legacy and MM‑aware senders.
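In Python terms, the tolerant decoding behaves roughly like the sketch below (the trailing-slot position in the sequence encoding is assumed for illustration, not taken from the actual serde code):

```python
def decode_block_mm_infos(event):
    """Accept tagged-map (dict) or tagged-sequence (list) event encodings.

    Returns the block_mm_infos payload, defaulting to None when absent,
    so events from older, non-MM producers still decode cleanly.
    """
    if isinstance(event, dict):
        return event.get("block_mm_infos")  # absent -> None
    if isinstance(event, (list, tuple)):
        # sequence encoding: mm infos, when present, occupy a trailing slot
        # (slot index is a hypothetical choice for this sketch)
        return event[3] if len(event) > 3 else None
    raise TypeError(f"unsupported event encoding: {type(event).__name__}")
```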

examples/deployments/router_standalone_trtllm/router.py (3)

171-201: Routing logic is coherent and aligns with RadixTree API

get_best_worker and its helpers correctly:

  • Use RadixTree.find_matches(local_hashes) and mask scores down to per‑worker counts assuming dp_rank=0.
  • Translate matched block counts into overlap ratios via matched_blocks * block_size / num_tokens.
  • Combine overlap, current KV usage, and normalized waiting count into a simple logit and pick the max with random tie‑breaking.

This is a reasonable first‑pass policy for the standalone example and matches the OverlapScores API shape from the bindings.
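Under assumed equal weights (the example's actual coefficients may differ), the selection policy reads roughly as:

```python
import random

def pick_worker(workers, block_size, num_tokens):
    """Toy version of the routing policy described above.

    workers: dict worker_id -> {"matched_blocks": int,
                                "kv_usage": float in [0, 1],
                                "num_waiting": int}
    Higher prefix overlap is better; higher KV usage and deeper
    queues are worse.
    """
    max_waiting = max([w["num_waiting"] for w in workers.values()] + [1])
    logits = {}
    for wid, w in workers.items():
        overlap = w["matched_blocks"] * block_size / num_tokens  # cache-hit ratio
        waiting_norm = w["num_waiting"] / max_waiting
        logits[wid] = overlap - w["kv_usage"] - waiting_norm  # assumed weights
    best = max(logits.values())
    # random tie-breaking among the best-scoring workers
    return random.choice([wid for wid, v in logits.items() if v == best])
```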


297-334: Debug inject endpoint shape matches KV event expectations

/debug/inject_event builds a minimal KvCacheEvent JSON with a single stored block (including optional mm_extra_info) and feeds it directly to RadixTree.apply_event(...). That’s a useful hook for local testing and for verifying MM metadata propagation through the router.

Just be aware that this bypasses whatever additional fields the real producer includes (e.g., dp_rank), so it should remain clearly labeled as a debug‑only tool, which you’ve already done via the /debug/... namespace.


343-361: Remember to run Black and commit formatting changes

CI reports that Black reformatted this file and that trailing whitespace was auto‑fixed by pre‑commit. Please run the project’s formatting hooks locally (e.g., black examples/deployments/router_standalone_trtllm/router.py and any configured pre‑commit hooks) and commit the resulting changes to clear the pipeline failure.

examples/deployments/router_standalone_trtllm/test_router.py (5)

1-28: LGTM! Well-structured test module setup.

The file header, imports, and test constants are properly organized. The sample test images from COCO dataset are appropriate for multimodal testing.


30-44: LGTM! Clean dataclass definitions.

TestConfig and TestResult are well-designed with sensible defaults. The kv_settle_time parameter is a good practice for async event propagation tests.


304-349: LGTM! Comprehensive multimodal hash computation tests.

The _test_mm_hash_computation test correctly validates that different mm_hash values produce different block hashes. Good coverage of the core hash differentiation logic.


464-516: LGTM! Thorough block boundary test.

_test_mm_block_boundary correctly validates that MM info only affects the intended blocks, checking all three blocks independently.


723-764: LGTM! Clean CLI structure with proper resource cleanup.

The main() function properly uses try/finally to ensure tests.cleanup() is called, and the argument parsing is well-organized.

examples/deployments/router_standalone_trtllm/worker.py (6)

46-51: LGTM! Correct unsigned 64-bit conversion.

The to_unsigned_u64 function correctly handles Python's arbitrary-precision integers and converts negative values using two's complement for Rust/msgpack compatibility.
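The conversion amounts to masking to 64 bits, which is equivalent to two's complement for negative inputs:

```python
def to_unsigned_u64(value: int) -> int:
    """Map a (possibly negative) Python int into the unsigned 64-bit range."""
    return value & 0xFFFFFFFFFFFFFFFF
```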


59-76: LGTM! Clean ZMQ publisher implementation.

MetricsPublisher is well-implemented with proper context and socket management. The close() method correctly terminates resources.


125-136: LGTM! Well-structured message serialization.

The _send method correctly uses msgpack with sequence numbers for reliable ordering, and handles serialization errors gracefully.


181-213: LGTM! Robust block parsing with proper partial block handling.

parse_stored_blocks correctly tracks partial blocks separately and validates block sizes. The early break on partial/oversized blocks is appropriate.
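The partial-block handling can be sketched as follows (input shape and names are assumptions; the real function parses TRT-LLM event payloads):

```python
def parse_stored_blocks(block_hashes, block_token_counts, block_size):
    """Keep only full blocks; stop at the first partial or oversized one.

    Returns (full_blocks, partial_block) where partial_block is the first
    hash whose token count differs from block_size, or None.
    """
    full, partial = [], None
    for h, n in zip(block_hashes, block_token_counts):
        if n != block_size:
            partial = h
            break  # blocks after a partial one are unusable for prefix reuse
        full.append(h)
    return full, partial
```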


451-463: LGTM! Proper shutdown with resource cleanup.

The shutdown method correctly cancels background tasks and closes all resources in the right order.


471-515: LGTM! Clean worker manager implementation.

TrtllmWorkers provides a clean interface for managing multiple workers with proper port assignment and lifecycle management.

examples/deployments/router_standalone_trtllm/api.py (6)

1-31: LGTM! Proper module setup with necessary imports.

The protobuf environment workaround is documented, and imports are well-organized. Good use of both local and external dependencies.


70-103: LGTM! Clean API model definitions.

The Pydantic models follow OpenAI API conventions correctly. Using str | list[ContentPart] for content properly handles both text-only and multimodal messages.


197-215: LGTM! Good fallback pattern for chat template.

The _build_prompt method gracefully falls back to simple formatting when the tokenizer's chat template fails.


382-459: LGTM! Well-structured request handler.

The chat_completions endpoint properly validates service readiness, parses requests, handles both text and multimodal paths, computes routing hashes, and streams responses. Good error handling at each stage.


495-504: Binding to 0.0.0.0 is intentional for containerized deployments.

The linter flags binding to all interfaces, but this is appropriate for a service that needs to be accessible from outside the container.


519-549: LGTM! Comprehensive CLI with sensible defaults.

The argument parser provides good defaults and clear help text for all configuration options.

Comment on lines 3 to 8







⚠️ Potential issue | 🟡 Minor

Remove excessive trailing blank lines to fix Black formatting failure.

The pipeline failed because Black formatting detected issues with this file. Remove the trailing blank lines after the license header.

Apply this diff:

 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-
-
-
-
-
-
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/__init__.py around lines 3 to
8, remove the excessive trailing blank lines after the license header so the
file ends with a single newline; update the file to eliminate blank lines beyond
the header (leave exactly one newline at EOF) to satisfy Black formatting.

Comment on lines 156 to 176
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
"""Parse and validate the incoming request."""
max_tokens = request.max_completion_tokens or request.max_tokens
if max_tokens is None:
return ErrorResponse(
error=make_error(
"Either max_tokens or max_completion_tokens must be specified",
"invalid_request_error", 400
)
)

messages_dict, image_urls = self._extract_messages(request.messages)

return ParsedRequest(
messages_dict=messages_dict,
image_urls=image_urls,
max_tokens=max_tokens,
temperature=request.temperature,
top_p=request.top_p,
model=request.model,
)

⚠️ Potential issue | 🟡 Minor

Default values for temperature and top_p may be None.

When accessing request.temperature and request.top_p, they could be None if explicitly passed as null in the request, despite the default values.

     def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
         """Parse and validate the incoming request."""
         max_tokens = request.max_completion_tokens or request.max_tokens
         if max_tokens is None:
             return ErrorResponse(
                 error=make_error(
                     "Either max_tokens or max_completion_tokens must be specified",
                     "invalid_request_error", 400
                 )
             )

         messages_dict, image_urls = self._extract_messages(request.messages)

         return ParsedRequest(
             messages_dict=messages_dict,
             image_urls=image_urls,
             max_tokens=max_tokens,
-            temperature=request.temperature,
-            top_p=request.top_p,
+            temperature=request.temperature if request.temperature is not None else 1.0,
+            top_p=request.top_p if request.top_p is not None else 1.0,
             model=request.model,
         )
📝 Committable suggestion


Suggested change
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
"""Parse and validate the incoming request."""
max_tokens = request.max_completion_tokens or request.max_tokens
if max_tokens is None:
return ErrorResponse(
error=make_error(
"Either max_tokens or max_completion_tokens must be specified",
"invalid_request_error", 400
)
)
messages_dict, image_urls = self._extract_messages(request.messages)
return ParsedRequest(
messages_dict=messages_dict,
image_urls=image_urls,
max_tokens=max_tokens,
temperature=request.temperature,
top_p=request.top_p,
model=request.model,
)
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
"""Parse and validate the incoming request."""
max_tokens = request.max_completion_tokens or request.max_tokens
if max_tokens is None:
return ErrorResponse(
error=make_error(
"Either max_tokens or max_completion_tokens must be specified",
"invalid_request_error", 400
)
)
messages_dict, image_urls = self._extract_messages(request.messages)
return ParsedRequest(
messages_dict=messages_dict,
image_urls=image_urls,
max_tokens=max_tokens,
temperature=request.temperature if request.temperature is not None else 1.0,
top_p=request.top_p if request.top_p is not None else 1.0,
model=request.model,
)
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/api.py around lines 156–176,
request.temperature and request.top_p can be explicitly null so they may be None
here; update the parsing to coalesce those fields to safe defaults before
building ParsedRequest (e.g., temperature = request.temperature if
request.temperature is not None else 1.0 and top_p = request.top_p if
request.top_p is not None else 1.0, or use any existing DEFAULT_* constants),
ensure the values are numeric (cast/validate) and pass those fallback values
into ParsedRequest when returning.

Comment on lines 302 to 313
def _build_block_mm_infos(
self, num_tokens: int, mm_hash: int | None, image_offsets: list[int] | None
) -> list[dict] | None:
"""Build block_mm_infos for routing hash computation."""
if mm_hash is None or image_offsets is None:
return None

num_blocks = (num_tokens + self.init_params.block_size - 1) // self.init_params.block_size
return [
{"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
for _ in range(num_blocks)
]

⚠️ Potential issue | 🟠 Major

_build_block_mm_infos applies same MM info to all blocks.

Similar to the issue in worker.py, this broadcasts identical mm_info to every block. The image_offsets should determine which blocks actually contain image content.

     def _build_block_mm_infos(
         self, num_tokens: int, mm_hash: int | None, image_offsets: list[int] | None
     ) -> list[dict] | None:
         """Build block_mm_infos for routing hash computation."""
         if mm_hash is None or image_offsets is None:
             return None

+        block_size = self.init_params.block_size
         num_blocks = (num_tokens + self.init_params.block_size - 1) // self.init_params.block_size
-        return [
-            {"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
-            for _ in range(num_blocks)
-        ]
+        block_mm_infos = []
+        mm_info = {"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
+        
+        for block_idx in range(num_blocks):
+            block_start = block_idx * block_size
+            block_end = min(block_start + block_size, num_tokens)
+            # Check if this block overlaps with image token range
+            if image_offsets[0] < block_end and image_offsets[1] > block_start:
+                block_mm_infos.append(mm_info)
+            else:
+                block_mm_infos.append(None)
+        
+        return block_mm_infos
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/api.py around lines 302–313,
the current implementation attaches the same mm_object offsets to every block;
instead compute per-block offsets from image_offsets so only blocks that contain
image tokens get mm_objects. For each block index i compute block_start = i *
block_size and block_end = block_start + block_size, filter image_offsets to
offsets within [block_start, block_end), convert them to block-relative offsets
(subtract block_start), and for that block include an mm_objects entry only if
the resulting offsets list is non-empty; return the list of per-block dicts (or
None when mm_hash or image_offsets is None) so image mm info is not broadcast to
all blocks.

@@ -0,0 +1,52 @@
#/bin/bash

⚠️ Potential issue | 🔴 Critical

Fix the shebang: missing ! character.

The shebang line #/bin/bash is invalid; it must be #!/bin/bash. As written, the script fails the pre-commit check and will not reliably execute as a bash script.

-#/bin/bash
+#!/bin/bash
📝 Committable suggestion


Suggested change
#/bin/bash
#!/bin/bash
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.

[error] 1-1: pre-commit: check-executables-have-shebangs failed. perf.sh is marked executable but has no valid shebang.

🪛 Shellcheck (0.11.0)

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/perf.sh around line 1 the
shebang is incorrect ("#/bin/bash"); replace it with a valid bash shebang by
inserting the missing "!" so the first line reads "#!/bin/bash" to ensure the
script is executable and passes pre-commit checks.

@@ -0,0 +1,25 @@
#/bin/bash

⚠️ Potential issue | 🟡 Minor

Fix malformed shebang to resolve pipeline failure.

The shebang is missing the ! character. This is causing the pre-commit check to fail.

Apply this diff:

-#/bin/bash
+#!/bin/bash
📝 Committable suggestion


Suggested change
#/bin/bash
#!/bin/bash
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.

[error] 1-1: pre-commit: check-executables-have-shebangs failed. ping.sh is marked executable but has no valid shebang.

🪛 Shellcheck (0.11.0)

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/ping.sh around line 1 the
shebang is malformed ("#/bin/bash" missing the exclamation mark); replace it
with a proper shebang (#!/bin/bash) at the top of the file so the script is
recognized and the pre-commit pipeline passes.

Comment on lines +104 to +110
if parent_hash is not None:
event["parent_block_hash"] = to_unsigned_u64(parent_hash)

if mm_extra_info is not None:
event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)

self._send([event])

⚠️ Potential issue | 🟡 Minor

Clarify multimodal info semantics: is mm_extra_info operation-wide or per-block?

The field name block_mm_infos (array) conflicts with the implementation that broadcasts a single mm_extra_info to all blocks. This indicates either incomplete implementation (should filter by offsets/block content) or incorrect field structure (should be single mm_extra_info if operation-wide).

Consider either:

  1. Filter blocks based on which ones contain image tokens/offsets before applying mm_extra_info
  2. Rename to single field and document that it applies operation-wide
        if mm_extra_info is not None:
-            event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)
+            # TODO: Clarify if mm_extra_info applies to all blocks or only those with multimodal content
+            event["mm_extra_info"] = mm_extra_info
📝 Committable suggestion


Suggested change
if parent_hash is not None:
event["parent_block_hash"] = to_unsigned_u64(parent_hash)
if mm_extra_info is not None:
event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)
self._send([event])
if parent_hash is not None:
event["parent_block_hash"] = to_unsigned_u64(parent_hash)
if mm_extra_info is not None:
# TODO: Clarify if mm_extra_info applies to all blocks or only those with multimodal content
event["mm_extra_info"] = mm_extra_info
self._send([event])
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/worker.py around lines 104-110
the code always duplicates a single mm_extra_info across all blocks into
event["block_mm_infos"], which conflicts with the field name and review comment;
either (A) make block_mm_infos truly per-block by building a list of the same
length as block_hashes where you place mm_extra_info only for blocks that
contain multimodal tokens (detect those blocks from offsets/metadata or the
incoming request and otherwise insert None/empty), or (B) change the outgoing
payload to use a single operation-scoped field name (e.g., operation_mm_info or
mm_info) and document that it applies to the whole operation; implement one of
these fixes, update the field name/structure accordingly in the event emission,
and adjust any related docs/tests to reflect the chosen semantics.

Comment on lines +247 to +263
def _initialize(self, kv_events_port: int, metrics_port: int):
"""Initialize TensorRT-LLM engine and publishers."""
os.environ["CUDA_VISIBLE_DEVICES"] = str(self.worker_id)

logger.info(f"Worker {self.worker_id}: Initializing on GPU {self.worker_id}")

self.llm = LLM(
model=self.model,
kv_cache_config=KvCacheConfig(
enable_block_reuse=True,
event_buffer_max_size=DEFAULT_KV_EVENT_BUFFER_MAX_SIZE,
),
)

self.metrics_publisher = MetricsPublisher(metrics_port)
self.kv_events_publisher = KvEventsPublisher(kv_events_port, self.block_size)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check TrtllmWorkers instantiation patterns
ast-grep --pattern 'TrtllmWorkers($$$)'

Repository: ai-dynamo/dynamo

Length of output: 790


🏁 Script executed:

#!/bin/bash
# Also search for the TrtllmWorker class definition and its usage
rg 'class TrtllmWorker' -A 20

Repository: ai-dynamo/dynamo

Length of output: 3664


🏁 Script executed:

#!/bin/bash
# Check the file structure of the worker.py file
wc -l examples/deployments/router_standalone_trtllm/worker.py
head -50 examples/deployments/router_standalone_trtllm/worker.py

Repository: ai-dynamo/dynamo

Length of output: 1636


🏁 Script executed:

#!/bin/bash
# Look for initialization patterns and how workers are spawned/managed
rg 'TrtllmWorker|_initialize' examples/deployments/router_standalone_trtllm/ -B 2 -A 2

Repository: ai-dynamo/dynamo

Length of output: 4132


🏁 Script executed:

#!/bin/bash
# Get the complete TrtllmWorker __init__ and _initialize methods
sed -n '/^class TrtllmWorker:/,/^class [A-Z]/p' examples/deployments/router_standalone_trtllm/worker.py | head -100

Repository: ai-dynamo/dynamo

Length of output: 3622


🏁 Script executed:

#!/bin/bash
# Check TrtllmWorkers __init__ more completely
sed -n '/^class TrtllmWorkers:/,/^class [A-Z]/p' examples/deployments/router_standalone_trtllm/worker.py | head -80

Repository: ai-dynamo/dynamo

Length of output: 1524


🏁 Script executed:

#!/bin/bash
# Check if workers are spawned in separate processes (multiprocessing)
rg 'Process|multiprocessing|spawn|Fork' examples/deployments/router_standalone_trtllm/

Repository: ai-dynamo/dynamo

Length of output: 1475


Setting CUDA_VISIBLE_DEVICES in _initialize causes critical issues when multiple workers initialize sequentially in the same process.

All workers are initialized in TrtllmWorkers.__init__() within a single process loop (lines 289-298). When worker N initializes, it overwrites the environment variable set by worker N-1, affecting GPU visibility for all previously initialized workers. This breaks multi-worker deployments where num_workers > 1.

🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/worker.py around lines 247-263,
setting CUDA_VISIBLE_DEVICES inside _initialize overwrites GPU visibility for
previously initialized workers when multiple workers are created in the same
process (see TrtllmWorkers.__init__ loop at ~289-298); remove the
os.environ["CUDA_VISIBLE_DEVICES"] assignment here and instead ensure each
worker selects its GPU explicitly without changing process-wide env: either set
the device at the CUDA runtime level (e.g., use the CUDA API / framework call to
set the current device for this worker) or modify the LLM initialization to
accept and bind to a specific device index passed from the caller, or move
per-worker CUDA_VISIBLE_DEVICES setup to separate worker processes before
initialization; update callers to pass worker_id/device index into LLM so each
worker uses the correct GPU without mutating environment variables.
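One way to keep per-worker GPU selection out of the shared process environment is to set `CUDA_VISIBLE_DEVICES` inside a dedicated child process before the engine initializes. A minimal sketch of that pattern (the function names and the queue-based reporting are illustrative, not from the PR; the engine init itself is omitted):

```python
import multiprocessing as mp
import os


def _worker_entry(worker_id: int, queue) -> None:
    # Runs in a child process, so this assignment cannot clobber the
    # GPU visibility of any other worker or of the parent process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_id)
    # ... initialize the TRT-LLM engine here (omitted in this sketch) ...
    queue.put((worker_id, os.environ["CUDA_VISIBLE_DEVICES"]))


def spawn_workers(num_workers: int) -> dict:
    """Start one child process per worker and report each child's GPU env."""
    ctx = mp.get_context("fork")  # fork keeps the sketch self-contained on Linux
    queue = ctx.Queue()
    procs = [ctx.Process(target=_worker_entry, args=(i, queue)) for i in range(num_workers)]
    for p in procs:
        p.start()
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results
```

Each worker then sees only its own GPU as device 0. Alternatively, selecting the device via the CUDA runtime (e.g., `torch.cuda.set_device`) or threading a device index into the `LLM` constructor, if it supports one, avoids env mutation entirely.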

Comment on lines +28 to 90
#[pyo3(signature = (tokens, kv_block_size, block_mm_infos=None))]
pub fn compute_block_hash_for_seq_py(
_py: Python,
tokens: Vec<u32>,
kv_block_size: usize,
block_mm_infos: Option<Bound<PyAny>>,
) -> PyResult<Vec<u64>> {
use std::fs::OpenOptions;
use std::io::Write;

if kv_block_size == 0 {
return Err(to_pyerr(anyhow::anyhow!("kv_block_size cannot be 0")));
}

let hashes = compute_block_hash_for_seq(&tokens, kv_block_size as u32);
// Convert Python block_mm_infos to Rust Vec<Option<BlockExtraInfo>>
let mm_infos_rust: Option<Vec<Option<BlockExtraInfo>>> = block_mm_infos
.as_ref()
.map(|infos_py| {
depythonize::<Vec<Option<BlockExtraInfo>>>(infos_py).map_err(|e| {
PyErr::new::<pyo3::exceptions::PyValueError, _>(format!(
"Failed to convert block_mm_infos: {}",
e
))
})
})
.transpose()?;

// Log parameters to file
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_rust_hash_params.txt")
{
let _ = writeln!(
file,
"\n============================================================"
);
let _ = writeln!(file, "=== compute_block_hash_for_seq_py PARAMETERS ===");
let _ = writeln!(file, "kv_block_size: {}", kv_block_size);
let _ = writeln!(file, "num_tokens: {}", tokens.len());
let _ = writeln!(file, "tokens: {:?}", tokens);
let _ = writeln!(file, "mm_infos_rust: {:?}", mm_infos_rust);
}

let hashes =
compute_block_hash_for_seq(&tokens, kv_block_size as u32, mm_infos_rust.as_deref());

// Log result
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_rust_hash_params.txt")
{
let hash_values: Vec<u64> = hashes.iter().map(|h| h.0).collect();
let _ = writeln!(file, "=== RESULT ===");
let _ = writeln!(file, "hashes ({}): {:?}", hash_values.len(), hash_values);
let _ = writeln!(
file,
"============================================================"
);
}

Ok(hashes.into_iter().map(|h| h.0).collect())

⚠️ Potential issue | 🟠 Major

Remove or gate persistent debug file logging in compute_block_hash_for_seq_py

This binding currently opens and appends to /tmp/debug_rust_hash_params.txt on every call, logging full token arrays and MM metadata as well as results. In a real deployment this function can be called extremely frequently, so:

  • Performance: repeated synchronous file IO in a tight path will add noticeable overhead and contention.
  • Reliability: running in environments where /tmp is slow, small, or unwritable can introduce sporadic failures.
  • Privacy: tokens and MM metadata can include user data, so dumping them to a world‑readable temp file is a potential leakage vector.

Suggest either removing this logging before merge or guarding it behind a very explicit debug flag (e.g., an env var check or a compile‑time feature) and trimming payloads to what you strictly need for troubleshooting.
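One possible shape for the opt-in gate (the `DYNAMO_HASH_DEBUG` variable name and the helper functions are illustrative, not part of the PR): the hot path pays only a cheap env check, and file IO happens only when explicitly enabled.

```rust
use std::fs::OpenOptions;
use std::io::Write;

/// Opt-in gate: logging runs only when this (hypothetical) env var is set.
fn hash_debug_enabled() -> bool {
    std::env::var_os("DYNAMO_HASH_DEBUG").is_some()
}

/// Append one line to the debug file; IO errors are swallowed,
/// as in the original code.
fn append_debug_line(path: &str, msg: &str) {
    if let Ok(mut file) = OpenOptions::new().create(true).append(true).open(path) {
        let _ = writeln!(file, "{}", msg);
    }
}

/// How the hot path would use the gate: log sizes, not full token payloads.
fn log_hash_params(kv_block_size: u32, num_tokens: usize) {
    if hash_debug_enabled() {
        append_debug_line(
            "/tmp/debug_rust_hash_params.txt",
            &format!("kv_block_size: {kv_block_size}, num_tokens: {num_tokens}"),
        );
    }
}
```

Logging only counts (block size, token count) rather than the tokens themselves also addresses the privacy concern above.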

Comment on lines +348 to +353
def test_mm_block_hash_error_zero_block_size():
"""Test that zero block size raises an error."""
tokens = [100] * 32

with pytest.raises(Exception):
compute_block_hash_for_seq_py(tokens, 0)

⚠️ Potential issue | 🟡 Minor

Consider asserting a more specific exception than Exception

pytest.raises(Exception) will catch anything, including unrelated errors. Since compute_block_hash_for_seq_py explicitly rejects kv_block_size == 0, it would be more robust to assert the concrete Python exception type (e.g., the specific PyErr mapping used by to_pyerr) or at least a narrower base such as RuntimeError, and optionally validate the error message. This would also resolve Ruff’s B017 warning.


🧰 Tools
🪛 Ruff (0.14.8)

352-352: Do not assert blind exception: Exception

(B017)

🤖 Prompt for AI Agents
In lib/bindings/python/tests/test_mm_kv_router.py around lines 348 to 353, the
test uses pytest.raises(Exception) which is too broad; change it to assert the
concrete exception type raised for an invalid kv_block_size (use ValueError or
the actual exception your to_pyerr maps to, e.g., RuntimeError) and optionally
verify the error message via pytest.raises(<SpecificException>, match="expected
message"). Update the with pytest.raises(...) block to use the specific
exception and add a match string if you want to assert the error text.
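A sketch of the narrower assertion, using a stand-in for the binding so the shape is clear (the real test would call compute_block_hash_for_seq_py and match whatever exception to_pyerr actually produces; ValueError below is an assumption):

```python
import pytest


def compute_block_hash_stub(tokens, kv_block_size):
    # Stand-in mirroring the binding's validation path; the real binding
    # raises from Rust via to_pyerr.
    if kv_block_size == 0:
        raise ValueError("kv_block_size cannot be 0")
    return []


def test_mm_block_hash_error_zero_block_size():
    """Zero block size raises the specific validation error."""
    tokens = [100] * 32
    # Narrow type plus a match string: unrelated failures no longer pass.
    with pytest.raises(ValueError, match="kv_block_size cannot be 0"):
        compute_block_hash_stub(tokens, 0)
```

If to_pyerr maps to a different type in practice, only the exception class and match string need to change.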

Comment on lines +120 to 237
/// Compute the hash for a sequence of tokens, optionally including multimodal metadata.
///
/// When multimodal extra info is provided, the mm_hashes are included in the hash computation
/// to ensure that blocks with identical tokens but different multimodal objects produce
/// different hashes.
///
/// ### Arguments
///
/// * `tokens` - A vector of `u32` tokens.
/// * `kv_block_size` - The size of each block in tokens.
/// * `block_mm_infos` - Optional per-block multimodal metadata.
///
/// ### Returns
///
/// A vector of `LocalBlockHash` representing the computed hashes for each chunk of tokens.
pub fn compute_block_hash_for_seq(tokens: &[u32], kv_block_size: u32) -> Vec<LocalBlockHash> {
tokens
.chunks_exact(kv_block_size as usize) // Split into chunks of kv_block_size elements
.map(|chunk| {
let bytes: Vec<u8> = chunk
.iter()
.flat_map(|&num| num.to_le_bytes()) // Convert each i32 to its little-endian bytes
.collect();
pub fn compute_block_hash_for_seq(
tokens: &[u32],
kv_block_size: u32,
block_mm_infos: Option<&[Option<BlockExtraInfo>]>,
) -> Vec<LocalBlockHash> {
use std::fs::OpenOptions;
use std::io::Write;

// Log input parameters
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let _ = writeln!(
file,
"\n============================================================"
);
let _ = writeln!(file, "=== compute_block_hash_for_seq INPUT ===");
let _ = writeln!(file, "kv_block_size: {}", kv_block_size);
let _ = writeln!(file, "num_tokens: {}", tokens.len());
let _ = writeln!(file, "tokens: {:?}", tokens);
let _ = writeln!(file, "block_mm_infos: {:?}", block_mm_infos);
}

let result: Vec<LocalBlockHash> = tokens
.chunks_exact(kv_block_size as usize)
.enumerate()
.map(|(block_idx, chunk)| {
let mut bytes: Vec<u8> = chunk.iter().flat_map(|&num| num.to_le_bytes()).collect();

// Include MM hashes in the block hash computation if present
if let Some(mm_infos) = block_mm_infos {
if let Some(Some(block_mm_info)) = mm_infos.get(block_idx) {
// Sort mm_hashes for consistent ordering
let mut mm_hashes: Vec<u64> = block_mm_info
.mm_objects
.iter()
.map(|obj| obj.mm_hash)
.collect();
mm_hashes.sort_unstable();

// Log MM hash inclusion
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let _ = writeln!(
file,
"Block {}: Including mm_hashes {:?} in hash computation",
block_idx, mm_hashes
);
}

// Append sorted mm_hashes to the byte array
for mm_hash in mm_hashes {
bytes.extend_from_slice(&mm_hash.to_le_bytes());
}
}
}

compute_block_hash(&Bytes::from(bytes)) // Convert the byte Vec to Bytes
let hash = compute_block_hash(&Bytes::from(bytes));

// Log per-block result
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let _ = writeln!(
file,
"Block {}: chunk tokens {:?} -> hash {}",
block_idx, chunk, hash.0
);
}

hash
})
.collect()
.collect();

// Log output
if let Ok(mut file) = OpenOptions::new()
.create(true)
.append(true)
.open("/tmp/debug_compute_block_hash.txt")
{
let hash_values: Vec<u64> = result.iter().map(|h| h.0).collect();
let _ = writeln!(file, "=== OUTPUT ===");
let _ = writeln!(
file,
"result_hashes ({}): {:?}",
hash_values.len(),
hash_values
);
let _ = writeln!(
file,
"============================================================"
);
}

result
}

⚠️ Potential issue | 🟠 Major

MM‑aware hashing semantics look correct; logging and allocation deserve tightening

The new compute_block_hash_for_seq behavior—chunking tokens, optionally incorporating per‑block mm_hash values (sorted for determinism), and skipping partial trailing blocks—aligns with the Python tests and the intended “same tokens + different MM ⇒ different hash” contract. That’s good.

Two follow‑ups are worth addressing:

  1. Unbounded debug logging to /tmp/debug_compute_block_hash.txt

    The function logs every call’s inputs (including full token sequences and MM metadata) and every block’s hash to a fixed temp file. In production this will:

    • Add synchronous file IO on a very hot path.
    • Risk filling or slowing /tmp, and misbehavior when it’s not writable.
    • Persist user‑level token content to disk, which is a privacy concern.

    As with the Python binding, this should either be removed before release or put behind a clearly opt‑in debug mechanism (env flag / feature gate) with reduced verbosity.

  2. Avoid extra Bytes allocation per block

    let hash = compute_block_hash(&Bytes::from(bytes));

    compute_block_hash only needs &[u8], so you can pass the existing Vec<u8> slice:

    Before:

        let hash = compute_block_hash(&Bytes::from(bytes));

    After:

        let hash = compute_block_hash(&bytes);

    This keeps the hashing loop allocation-free beyond the initial `Vec<u8>` build.

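The contract the tests pin down can be sketched with a stand-in hasher (DefaultHasher here replaces the real compute_block_hash; only the byte layout and the sorting mirror the PR):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one block's tokens, mixing in sorted multimodal hashes when present.
fn block_hash(chunk: &[u32], mm_hashes: &[u64]) -> u64 {
    // Token bytes first, little-endian, as in the PR.
    let mut bytes: Vec<u8> = chunk.iter().flat_map(|&t| t.to_le_bytes()).collect();
    // Sort MM hashes so the result is deterministic regardless of object order.
    let mut sorted: Vec<u64> = mm_hashes.to_vec();
    sorted.sort_unstable();
    for h in sorted {
        bytes.extend_from_slice(&h.to_le_bytes());
    }
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}
```

Identical tokens with different MM objects diverge, while reordering the same MM objects does not change the hash, which is exactly what the sorting buys.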
@PeaBrane PeaBrane self-requested a review December 11, 2025 23:25
@PeaBrane
Contributor

@zhongdaor-nv thanks! Can you check the rabbit's comments and CIs when you get the chance? I also noticed some bulky logs in Rust, can we remove those?

