feat: add multimodal support to KV router with standalone trtllm example #4577
Signed-off-by: zhongdaor <[email protected]>
Signed-off-by: zhongdaor-nv <[email protected]>
Walkthrough

This PR introduces multimodal (MM) support to the KV router system by adding MM metadata structures and hash computation logic across bindings and core routing logic. Concurrently, it adds a complete standalone TensorRT-LLM router deployment example with API, worker, routing, and testing components.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Areas requiring extra attention:
Pre-merge checks: ❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/bindings/python/src/dynamo/_core.pyi (1)
235-281: Remove duplicate docstring content. The docstring contains duplicate documentation: lines 247-271 provide the new detailed Args/Returns/Example documentation, but lines 272-279 repeat the older, shorter version of the same documentation. This redundancy should be cleaned up.

```python
Example:
    >>> tokens = [1, 2, 3, 4] * 8  # 32 tokens = 1 block
    >>> mm_info = {
    ...     "mm_objects": [{
    ...         "mm_hash": 0xDEADBEEF,
    ...     }]
    ... }
    >>> hashes = compute_block_hash_for_seq_py(tokens, 32, [mm_info])
```

```diff
-    Compute block hashes for a sequence of tokens
-
-    Args:
-        tokens: List of token IDs
-        kv_block_size: Size of each KV cache block
-
-    Returns:
-        List of block hashes as integers
     """
     ...
```
🧹 Nitpick comments (17)
lib/llm/src/kv_router/protocols.rs (1)
450-486: LGTM! The test correctly initializes the new `mm_extra_info` field. Consider adding a separate test case that validates serialization/deserialization with non-`None` multimodal metadata to ensure the new structures work end-to-end.

lib/llm/src/kv_router.rs (1)
485-493: Consider extracting the duplicate `block_hashes` computation.

`compute_block_hash_for_seq` is called twice with the same inputs: once at line 487 and again at line 491. While the second call is conditional (inside `then`), when `router_track_active_blocks` is true both calls execute. Consider computing `block_hashes` once and reusing it.

```diff
 pub async fn get_potential_loads(&self, tokens: &[u32]) -> Result<Vec<PotentialLoad>> {
     let isl_tokens = tokens.len();
     let block_hashes = compute_block_hash_for_seq(tokens, self.block_size, None);
-    let overlap_scores = self.indexer.find_matches(block_hashes).await?;
+    let overlap_scores = self.indexer.find_matches(block_hashes.clone()).await?;
     let maybe_seq_hashes = self.kv_router_config.router_track_active_blocks.then(|| {
-        let block_hashes = compute_block_hash_for_seq(tokens, self.block_size, None);
         compute_seq_hash_for_block(&block_hashes)
     });
```

lib/bindings/python/rust/llm/kv.rs (1)
327-372: DRY up `block_mm_infos` depythonize logic across bindings

Both `compute_block_hash_for_seq_py` and `KvEventPublisher::publish_stored` implement identical `block_mm_infos` → `Option<Vec<Option<BlockExtraInfo>>>` conversion and error mapping. This is fine functionally, but it's an easy place for future drift.

A small internal helper (e.g., `fn depythonize_block_mm_infos(obj: Option<Bound<PyAny>>) -> PyResult<Option<Vec<Option<BlockExtraInfo>>>>`) would centralize the behavior and keep Python↔Rust MM metadata semantics consistent.

lib/llm/src/kv_router/indexer.rs (1)
1110-1151: MM extra info on synthetic routing events is always `None`

In the TTL/pruning routing path, synthetic `KvCacheEventData::Stored` events are built with:

```rust
blocks: hashes.map(|(local_hash, sequence_hash)| KvCacheStoredBlockData {
    tokens_hash: *local_hash,
    block_hash: ExternalSequenceBlockHash(*sequence_hash),
    mm_extra_info: None,
})
```

This keeps the existing behavior (no MM metadata on these synthetic entries), which is fine as long as MM distinctions are encoded solely into `tokens_hash`/`sequence_hash`. If future features rely on `mm_extra_info` for anything beyond hashing (e.g., inspection or filtering), you may eventually want to carry through real MM info here as well.

Not an immediate issue, but worth keeping in mind as MM use-cases expand.
lib/llm/src/kv_router/publisher.rs (2)
446-485: Consider a small test that exercises non-`None` `block_mm_infos`

`create_stored_blocks` now accepts `block_mm_infos: Option<&[Option<BlockExtraInfo>]>` and threads per-block entries into both `tokens_hash` and `mm_extra_info`, but the unit tests only cover the `None` case:

- `test_create_stored_blocks_ok`
- `test_create_stored_blocks_wrong_size_triggers_warning`

A small additional test that passes a `Some(&[Some(BlockExtraInfo { … })])` and asserts:

- `blocks[i].mm_extra_info` is `Some(...)`, and
- `tokens_hash` matches a direct call to `compute_block_hash_for_seq` with the same MM info

would close that gap and guard this path against regressions.
1003-1017: Existing `convert_event_block_stored` test only covers the `None` MM case

`test_convert_event_block_stored` still uses `block_mm_infos: None`, which is good for backward compatibility but doesn't validate that non-`None` `block_mm_infos` survive deserialization and reach `create_stored_blocks`.

Once you're happy with the MM plumbing, consider extending this test (or adding a new one) that supplies a simple `block_mm_infos` payload and asserts that the resulting `KvCacheStoreData::blocks` entries have `mm_extra_info` populated as expected.

examples/deployments/router_standalone_trtllm/router.py (1)
288-296: Mark unused `app` argument to satisfy linters

`lifespan(self, app: FastAPI)` doesn't use `app`, which is intentional for FastAPI's lifecycle signature but shows up in Ruff (ARG002). Renaming it to `_app` (or adding a `del app` inside) makes that intent explicit and keeps linters quiet:

```diff
-async def lifespan(self, app: FastAPI):
+async def lifespan(self, _app: FastAPI):
```

examples/deployments/router_standalone_trtllm/test_router.py (2)
72-92: Consider catching more specific exceptions. While broad exception handling is acceptable in test utilities, logging the exception would improve debuggability.

```diff
 def send_request(client: httpx.Client, url: str, payload: dict) -> bool:
     """Send a chat completion request and consume the stream."""
     try:
         resp = client.post(f"{url}/v1/chat/completions", json=payload)
         if resp.status_code != 200:
             return False
         for _ in resp.iter_lines():
             pass
         return True
-    except Exception:
+    except Exception as e:
+        print(f"Request failed: {e}")
         return False

 def get_tree_info(client: httpx.Client, url: str) -> dict:
     """Get radix tree debug info."""
     try:
         resp = client.get(f"{url}/debug/tree_info")
         return resp.json()
-    except Exception:
+    except Exception as e:
+        print(f"Failed to get tree info: {e}")
         return {"num_blocks": -1, "events": []}
```
159-175: Server connectivity check only verifies router, not API.

`_check_servers` returns `True` without actually verifying the API server is reachable. Consider adding an API health check.

```diff
 def _check_servers(self) -> bool:
     """Verify both API and Router servers are reachable."""
     print("\nChecking server connectivity...")
     try:
         # Check router
         resp = self.client.get(f"{self.config.router_url}/debug/tree_info")
         if resp.status_code != 200:
             print(f"  Router not responding: {resp.status_code}")
             return False
         print(f"  Router OK (blocks in tree: {resp.json().get('num_blocks', '?')})")
-        # Check API - just verify it's up
-        # A simple request to verify the endpoint exists
+        # Check API - verify it's up with a minimal request
+        try:
+            # Just check the server responds (will fail with 4xx but confirms connectivity)
+            self.client.get(f"{self.config.api_url}/health", timeout=5)
+        except httpx.HTTPStatusError:
+            pass  # Expected if no health endpoint, but connection works
+        print("  API OK (connected)")
         return True
     except Exception as e:
         print(f"  Connection error: {e}")
         return False
```

examples/deployments/router_standalone_trtllm/worker.py (4)
23-28: Debug file path uses a fixed /tmp location.

This is acceptable for debug-only code guarded by `DEBUG_ENABLED`, but be aware it may not be accessible in containerized environments or could conflict with other instances.

Consider using the `tempfile` module or making the path configurable:

```diff
+import tempfile
+
 # Debug flag: set DYNAMO_DEBUG=1 to enable debug file dumps
 DEBUG_ENABLED = os.environ.get("DYNAMO_DEBUG", "0") == "1"
-DEBUG_WORKER_KV_FILE = "/tmp/debug_worker_kv.txt"
+DEBUG_WORKER_KV_FILE = os.environ.get("DYNAMO_DEBUG_FILE", "/tmp/debug_worker_kv.txt")
```
148-166: `extract_mm_info` only extracts the first MM object.

The function returns after finding the first mm_key, ignoring any additional multimodal objects in the request.

If multiple images can exist per request, consider accumulating all mm_objects:

```diff
 def extract_mm_info(blocks_data: list[dict], all_token_ids: list[int]) -> dict | None:
     """Extract multimodal hash info from TRTLLM block data."""
+    mm_objects = []
     for block in blocks_data:
         mm_keys = block.get("mm_keys", [])
         for mm_key in mm_keys:
             if mm_key.get("type") != "mm_key":
                 continue
             hash_hex = mm_key.get("hash", "")
             if not hash_hex:
                 continue
             mm_hash = int(hash_hex[:16], 16)
             offsets = find_image_token_range(all_token_ids)
             if offsets:
-                return {"mm_objects": [{"mm_hash": mm_hash, "offsets": [offsets]}]}
+                mm_objects.append({"mm_hash": mm_hash, "offsets": [offsets]})
-    return None
+    return {"mm_objects": mm_objects} if mm_objects else None
```
282-307: Use `logger.exception` instead of `logger.error` for exception logging.

As flagged by static analysis, `logger.exception` automatically includes the traceback.

```diff
 async def _metrics_loop(self):
     """Continuously publish worker metrics."""
     await asyncio.sleep(1)
     try:
         async for stat in self.llm.get_stats_async(timeout=5):
             if not isinstance(stat, dict):
                 continue
             num_waiting = (
                 stat["numQueuedRequests"]
                 + stat["inflightBatchingStats"]["numPausedRequests"]
             )
             kv_stats = stat["kvCacheStats"]
             usage = (
                 kv_stats["allocTotalBlocks"] / kv_stats["maxNumBlocks"]
                 if kv_stats["maxNumBlocks"] > 0
                 else 0.0
             )
             self.metrics_publisher.publish(num_waiting, usage)
     except asyncio.CancelledError:
         pass
     except Exception as e:
-        logger.error(f"Worker {self.worker_id} metrics error: {e}")
+        logger.exception(f"Worker {self.worker_id} metrics error")
```
320-330: Use `logger.exception` for KV events errors as well.

```diff
 except RuntimeError as e:
     if "IterationResult is not properly instantiated" in str(e):
         logger.warning(f"Worker {self.worker_id}: KV events not available")
     else:
-        logger.error(f"Worker {self.worker_id} KV events error: {e}")
-except Exception as e:
-    logger.error(f"Worker {self.worker_id} KV events error: {e}")
+        logger.exception(f"Worker {self.worker_id} KV events error")
+except Exception:
+    logger.exception(f"Worker {self.worker_id} KV events error")
```

examples/deployments/router_standalone_trtllm/api.py (4)
247-254: Silently falling back to text-only may hide issues.

When MM processing fails, the fallback to text-only mode may produce incorrect results for multimodal requests without clear indication to the caller.

Consider returning an error for MM requests that fail processing, or at least including a warning in the response:

```diff
 except Exception as e:
-    logger.warning(f"MM processing failed: {e}, falling back to text-only")
+    logger.exception("MM processing failed, falling back to text-only")
     return ProcessedInput(
         tokens=self.tokenizer.encode(prompt),
         mm_input=None,
         mm_hash=None,
         image_offsets=None,
     )
```
288-296: Simplify dict access as suggested by the linter.

```diff
 def _compute_mm_hash(self, multi_modal_data: dict | None) -> int | None:
     """Compute mm_hash from multimodal data."""
     if not multi_modal_data:
         return None
     mm_hashes_dict = apply_mm_hashes(multi_modal_data)
-    if "image" in mm_hashes_dict and mm_hashes_dict["image"]:
-        return int(mm_hashes_dict["image"][0][:16], 16)
+    image_hashes = mm_hashes_dict.get("image")
+    if image_hashes:
+        return int(image_hashes[0][:16], 16)
     return None
```
315-330: Use `logger.exception` for routing errors.

```diff
 async def _route_request(self, local_hashes: list[int], num_tokens: int) -> int | ErrorResponse:
     """Query router for best worker ID."""
     try:
         router_request = RouterRequest(local_hashes=local_hashes, num_tokens=num_tokens)
         response = await self.http_client.post(
             f"http://localhost:{self.init_params.router_port}/find_best_worker",
             json=router_request.model_dump(),
             timeout=1,
         )
         response.raise_for_status()
         return RouterResponse.model_validate(response.json()).worker_id
     except (httpx.RequestError, httpx.HTTPStatusError) as e:
-        logger.error(f"Router request failed: {e}")
+        logger.exception("Router request failed")
         return ErrorResponse(
             error=make_error("Router service unavailable", "service_unavailable", 503)
         )
```
573-589: `KeyboardInterrupt` won't be caught inside `asyncio.run`.

The inner `except KeyboardInterrupt` at line 579 will never trigger because `asyncio.run` propagates it. The outer handler at line 588 is correct.

```diff
 async def run_with_shutdown():
     try:
         router_task = asyncio.create_task(router_api.start())
         await asyncio.sleep(0.5)
         api_task = asyncio.create_task(api.start())
         await asyncio.gather(router_task, api_task)
-    except KeyboardInterrupt:
-        logger.info("Shutting down services...")
+    except asyncio.CancelledError:
+        logger.info("Tasks cancelled, shutting down services...")
     except Exception as e:
-        logger.exception(f"Unhandled exception: {e}")
+        logger.exception("Unhandled exception")
     finally:
         await api.shutdown()
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- examples/deployments/router_standalone_trtllm/README.md (1 hunks)
- examples/deployments/router_standalone_trtllm/__init__.py (1 hunks)
- examples/deployments/router_standalone_trtllm/api.py (1 hunks)
- examples/deployments/router_standalone_trtllm/perf.sh (1 hunks)
- examples/deployments/router_standalone_trtllm/ping.sh (1 hunks)
- examples/deployments/router_standalone_trtllm/router.py (1 hunks)
- examples/deployments/router_standalone_trtllm/test_router.py (1 hunks)
- examples/deployments/router_standalone_trtllm/worker.py (1 hunks)
- lib/bindings/c/src/lib.rs (1 hunks)
- lib/bindings/python/rust/llm/kv.rs (4 hunks)
- lib/bindings/python/src/dynamo/_core.pyi (1 hunks)
- lib/bindings/python/tests/test_mm_kv_router.py (1 hunks)
- lib/llm/src/kv_router.rs (5 hunks)
- lib/llm/src/kv_router/indexer.rs (12 hunks)
- lib/llm/src/kv_router/protocols.rs (3 hunks)
- lib/llm/src/kv_router/publisher.rs (16 hunks)
- lib/llm/src/mocker/kv_manager.rs (1 hunks)
- lib/llm/src/protocols/common/preprocessor.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Applied to files:
- lib/llm/src/kv_router/protocols.rs
- lib/bindings/c/src/lib.rs
- lib/llm/src/kv_router.rs
- lib/llm/src/mocker/kv_manager.rs
- lib/llm/src/kv_router/publisher.rs
- lib/bindings/python/rust/llm/kv.rs
- lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
- lib/llm/src/kv_router/protocols.rs
- lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-06-24T20:59:35.725Z
Learnt from: ishandhanani
Repo: ai-dynamo/dynamo PR: 1626
File: lib/llm/src/preprocessor.rs:238-239
Timestamp: 2025-06-24T20:59:35.725Z
Learning: In lib/llm/src/preprocessor.rs, the `sampling_options` call in the `preprocess_request` method is placed in the common section after the match statement on `request.prompt_input_type()`, meaning it applies to both `PromptInput::Tokens` and `PromptInput::Text` request types.
Applied to files:
lib/llm/src/protocols/common/preprocessor.rs
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-08-29T10:08:18.434Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-10-14T00:58:05.744Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3597
File: lib/llm/src/kv_router/indexer.rs:437-441
Timestamp: 2025-10-14T00:58:05.744Z
Learning: In lib/llm/src/kv_router/indexer.rs, when a KvCacheEventData::Cleared event is received, the system intentionally clears all dp_ranks for the given worker_id by calling clear_all_blocks(worker.worker_id). This is the desired behavior and should not be scoped to individual dp_ranks.
Applied to files:
- lib/llm/src/kv_router.rs
- lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Applied to files:
lib/llm/src/kv_router/indexer.rs
📚 Learning: 2025-05-30T06:34:12.785Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scheduler.rs:260-266
Timestamp: 2025-05-30T06:34:12.785Z
Learning: In the KV router scheduler code, PeaBrane prefers fail-fast behavior over silent failure handling. When accessing worker metrics data that could be out-of-bounds (like dp_rank indexing), explicit panics are preferred over graceful degradation with continue statements to ensure data integrity issues are caught early.
Applied to files:
lib/llm/src/kv_router/indexer.rs
🧬 Code graph analysis (6)
lib/bindings/python/src/dynamo/_core.pyi (1)
lib/bindings/python/rust/llm/kv.rs (1)
compute_block_hash_for_seq_py(29-91)
lib/llm/src/kv_router.rs (1)
lib/llm/src/kv_router/indexer.rs (1)
compute_block_hash_for_seq(135-237)
examples/deployments/router_standalone_trtllm/test_router.py (3)
lib/bindings/python/rust/llm/kv.rs (3)
- compute_block_hash_for_seq_py (29-91)
- block_size (756-758)
- block_size (846-848)

lib/bindings/python/src/dynamo/_core.pyi (4)

- compute_block_hash_for_seq_py (235-282)
- get (1663-1664)
- block_size (670-674)
- block_size (724-731)

examples/deployments/router_standalone_trtllm/router.py (1)
get_tree_info(310-313)
lib/llm/src/kv_router/publisher.rs (1)
lib/llm/src/kv_router/indexer.rs (4)
- compute_block_hash_for_seq (135-237)
- kv_block_size (2488-2488)
- kv_block_size (2493-2493)
- kv_block_size (2498-2498)
examples/deployments/router_standalone_trtllm/api.py (2)
examples/deployments/router_standalone_trtllm/router.py (6)
- RouterAPI (260-339)
- RouterRequest (42-44)
- RouterResponse (47-50)
- _setup_routes (297-333)
- start (335-339)
- shutdown (239-252)

examples/deployments/router_standalone_trtllm/worker.py (5)
- TrtllmWorkers (471-515)
- direct (504-509)
- start_all (499-502)
- shutdown (451-463)
- shutdown_all (511-515)
lib/bindings/python/tests/test_mm_kv_router.py (2)
lib/bindings/python/src/dynamo/_core.pyi (11)
- RadixTree (565-636)
- compute_block_hash_for_seq_py (235-282)
- apply_event (598-609)
- dump_tree_as_events (629-636)
- find_matches (583-596)
- find_matches (650-660)
- scores (545-552)
- remove_worker (611-618)
- clear_all_blocks (620-627)
- block_size (670-674)
- block_size (724-731)

lib/bindings/python/rust/llm/kv.rs (10)

- compute_block_hash_for_seq_py (29-91)
- apply_event (529-555)
- dump_tree_as_events (601-633)
- find_matches (492-527)
- find_matches (760-776)
- scores (410-417)
- remove_worker (557-577)
- clear_all_blocks (579-599)
- block_size (756-758)
- block_size (846-848)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.
examples/deployments/router_standalone_trtllm/README.md
[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.
lib/bindings/python/src/dynamo/_core.pyi
[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.
examples/deployments/router_standalone_trtllm/ping.sh
[error] 1-1: pre-commit: check-executables-have-shebangs failed. ping.sh is marked executable but has no valid shebang.
examples/deployments/router_standalone_trtllm/perf.sh
[error] 1-1: pre-commit: check-executables-have-shebangs failed. perf.sh is marked executable but has no valid shebang.
examples/deployments/router_standalone_trtllm/test_router.py
[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.
[error] 782-782: Ruff: Local variable status is assigned to but never used. (F841)
examples/deployments/router_standalone_trtllm/worker.py
[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.
examples/deployments/router_standalone_trtllm/router.py
[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.
examples/deployments/router_standalone_trtllm/api.py
[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.
examples/deployments/router_standalone_trtllm/__init__.py
[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.
lib/bindings/python/tests/test_mm_kv_router.py
[error] 1-1: Black formatting changed file. Run 'black' to reformat the code locally.
[error] 1-1: Trailing whitespace detected in file; fixed by pre-commit. Please review and commit the changes.
🪛 markdownlint-cli2 (0.18.1)
examples/deployments/router_standalone_trtllm/README.md
164-164: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.8)
examples/deployments/router_standalone_trtllm/test_router.py
80-80: Consider moving this statement to an else block
(TRY300)
81-81: Do not catch blind exception: Exception
(BLE001)
90-90: Do not catch blind exception: Exception
(BLE001)
172-172: Consider moving this statement to an else block
(TRY300)
173-173: Do not catch blind exception: Exception
(BLE001)
628-628: f-string without any placeholders
Remove extraneous f prefix
(F541)
701-701: Local variable status is assigned to but never used
Remove assignment to unused variable status
(F841)
examples/deployments/router_standalone_trtllm/worker.py
25-25: Probable insecure usage of temporary file or directory: "/tmp/debug_worker_kv.txt"
(S108)
130-130: Do not catch blind exception: Exception
(BLE001)
131-131: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
306-306: Do not catch blind exception: Exception
(BLE001)
307-307: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
326-326: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
327-327: Do not catch blind exception: Exception
(BLE001)
328-328: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
examples/deployments/router_standalone_trtllm/router.py
30-30: Probable insecure usage of temporary file or directory: "/tmp/debug_kv_events.txt"
(S108)
131-131: Store a reference to the return value of asyncio.create_task
(RUF006)
143-143: Do not catch blind exception: Exception
(BLE001)
150-150: Store a reference to the return value of asyncio.create_task
(RUF006)
163-163: Do not catch blind exception: Exception
(BLE001)
180-180: Avoid specifying long messages outside the exception class
(TRY003)
289-289: Unused method argument: app
(ARG002)
307-307: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
338-338: Possible binding to all interfaces
(S104)
examples/deployments/router_standalone_trtllm/api.py
36-36: Probable insecure usage of temporary file or directory: "/tmp/debug_api_hashes.txt"
(S108)
203-203: Do not catch blind exception: Exception
(BLE001)
247-247: Do not catch blind exception: Exception
(BLE001)
294-294: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
327-327: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
361-361: Do not catch blind exception: Exception
(BLE001)
362-362: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
457-457: Do not catch blind exception: Exception
(BLE001)
458-458: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
488-488: Do not catch blind exception: Exception
(BLE001)
501-501: Possible binding to all interfaces
(S104)
582-582: Redundant exception object included in logging.exception call
(TRY401)
lib/bindings/python/tests/test_mm_kv_router.py
352-352: Do not assert blind exception: Exception
(B017)
🪛 Shellcheck (0.11.0)
examples/deployments/router_standalone_trtllm/ping.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
examples/deployments/router_standalone_trtllm/perf.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: operator (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (.)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (lib/runtime/examples)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/bindings/python)
🔇 Additional comments (37)
lib/llm/src/kv_router/protocols.rs (4)
38-48: LGTM! The `request_extra_info` field is correctly added with appropriate serde attributes for backward compatibility and efficient serialization.
50-57: LGTM! The default implementation correctly initializes the new `request_extra_info` field to `None`.
269-366: Well-structured multimodal metadata types and conversion logic.

The `to_block_level()` method correctly handles:

- Ceiling division for block count calculation
- Offset intersection with block boundaries
- Deduplication of mm_hash entries within blocks
- Edge cases like empty ranges via the `local_start < local_end` guard

The use of `saturating_sub(1)` on line 330 correctly handles exclusive end offsets for block boundary calculation.
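The block-level conversion described above can be illustrated with a small pure-Python sketch. This is a toy model of the described behavior, not the actual Rust `to_block_level()` implementation; the dict shape (`mm_hash`, `offsets`) follows the examples in this PR, and the ceiling division, offset/block intersection, and exclusive-end handling mirror the bullet points:

```python
def to_block_level(num_tokens: int, block_size: int, mm_objects: list[dict]) -> list[set[int]]:
    """Toy sketch: for each block, collect the set of mm_hashes whose offsets touch it."""
    num_blocks = (num_tokens + block_size - 1) // block_size  # ceiling division
    blocks: list[set[int]] = [set() for _ in range(num_blocks)]
    for obj in mm_objects:
        for start, end in obj["offsets"]:  # end is exclusive
            if start >= end:
                continue  # empty-range guard (local_start < local_end)
            first_block = start // block_size
            last_block = (end - 1) // block_size  # analogue of saturating_sub(1) on the exclusive end
            for b in range(first_block, last_block + 1):
                blocks[b].add(obj["mm_hash"])  # a set dedups repeated mm_hash entries per block
    return blocks

# One image spanning tokens [30, 70) in a 100-token request with 32-token blocks
# touches blocks 0, 1, and 2, but not the final partial block 3.
print(to_block_level(100, 32, [{"mm_hash": 0xDEADBEEF, "offsets": [(30, 70)]}]))
```

This makes the boundary behavior easy to check by hand: block 2 covers tokens [64, 96), so the exclusive end 70 still lands inside it, while block 3 ([96, 100)) stays MM-free.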
368-378: LGTM! The `mm_extra_info` field follows the established pattern with appropriate serde attributes for backward compatibility and efficient serialization.

lib/bindings/c/src/lib.rs (1)
163-180: LGTM! The C bindings correctly pass `None` for the new multimodal metadata parameters, maintaining backward compatibility for non-multimodal use cases.

examples/deployments/router_standalone_trtllm/ping.sh (1)
17-25: LGTM! The curl command correctly tests the streaming chat completion endpoint with appropriate headers and payload.
lib/llm/src/mocker/kv_manager.rs (1)
133-144: LGTM! The `mm_extra_info: None` initialization correctly aligns with the updated `KvCacheStoredBlockData` structure. The mocker doesn't need to track multimodal metadata.

lib/llm/src/protocols/common/preprocessor.rs (2)
8-8: LGTM! The import correctly combines `RouterConfigOverride` and the new `RequestExtraInfo` from the same module path.
104-108: LGTM! The new `request_extra_info` field follows established patterns in this struct, using `#[builder(default)]` and `#[serde(default, skip_serializing_if = "Option::is_none")]` consistent with other optional fields like `extra_args` and `extra_fields`. The documentation clearly explains its purpose for carrying multimodal metadata.

lib/llm/src/kv_router.rs (4)
398-398: LGTM! Correct adaptation to the updated `compute_block_hash_for_seq` signature, passing `None` for the new `block_mm_infos` parameter.
452-455: LGTM! Consistent update to pass `None` for the multimodal info parameter.
519-522: Verify: `request_extra_info` is intentionally unused for now.

The `request_extra_info` field is captured but explicitly ignored (`_`). Based on the PR context, this appears to be groundwork for future multimodal routing: the field is added to the protocol but not yet propagated to hash computation. Please confirm this is the intended incremental approach.
604-606: LGTM! Consistent update in `KvPushRouter` to pass `None` for the new multimodal parameter.

lib/bindings/python/tests/test_mm_kv_router.py (1)
214-345: MM block-hash test coverage looks strong

The block-hash tests exercise core scenarios well (no-MM vs MM, determinism, multi-block, partial blocks, `None` MM info, offsets ignored, multiple MM objects). This should give good confidence that the Rust hashing behavior is correctly surfaced into Python.

lib/llm/src/kv_router/indexer.rs (1)
2485-2501: Unit test updates confirm MM-agnostic behavior is preserved when `block_mm_infos` is `None`

The updated `test_compute_block_hash_for_seq` now calls `compute_block_hash_for_seq(..., None)` and asserts the same block-count behavior as before (1 hash for ≤1.5 blocks, 2 for >2.5, etc.). This is a good regression guard ensuring the MM extension didn't change the base (no-MM) semantics.

lib/llm/src/kv_router/publisher.rs (2)
419-444: MM metadata correctly influences `tokens_hash` and is stored per block
The changes to `create_stored_block_from_parts` look consistent with the new MM hashing contract:
- For each block, you build a one-element `block_mm_infos` slice from `mm_extra_info` and call `compute_block_hash_for_seq` on the block's tokens, so `tokens_hash` is now a function of the tokens plus that block's `mm_hash` set.
- The same `mm_extra_info` is then stored on the resulting `KvCacheStoredBlockData`, so downstream consumers can still inspect MM metadata without recomputing it.
This ensures events produced by this publisher match the MM-aware hashing logic the indexer and Python bindings expect.
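That contract can be illustrated with a standalone sketch. This is not the project's implementation: the real code hashes on the Rust side with a different hash function, and `block_hashes_with_mm` and its per-block dict argument are hypothetical stand-ins for the shapes described above.

```python
import hashlib
import struct

def block_hashes_with_mm(tokens, block_size, block_mm_hashes=None):
    """Hash full token blocks; fold each block's sorted mm_hashes into its digest."""
    hashes = []
    num_full_blocks = len(tokens) // block_size  # partial trailing blocks are skipped
    for idx in range(num_full_blocks):
        chunk = tokens[idx * block_size:(idx + 1) * block_size]
        data = b"".join(struct.pack("<I", t) for t in chunk)  # u32 little-endian
        mm = (block_mm_hashes or {}).get(idx)
        if mm:
            for mm_hash in sorted(mm):  # sort so ordering of MM objects never matters
                data += struct.pack("<Q", mm_hash)  # u64 little-endian
        digest = hashlib.blake2b(data, digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "little"))
    return hashes
```

The key property is that identical tokens with different (or absent) MM metadata produce different hashes for the affected block only, while blocks without MM info are untouched.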
533-701: RawKvEvent deserialization for `block_mm_infos` is robust and backwards-compatible
The custom `Deserialize` impl handles `block_mm_infos` in both tagged-map and tagged-sequence encodings, defaulting to `None` when the field is absent, which keeps older producers compatible. That matches the intended `Option<Vec<Option<BlockExtraInfo>>>` semantics and should interoperate cleanly with both legacy and MM-aware senders.

examples/deployments/router_standalone_trtllm/router.py (3)
171-201: Routing logic is coherent and aligns with the RadixTree API
`get_best_worker` and its helpers correctly:
- Use `RadixTree.find_matches(local_hashes)` and mask scores down to per-worker counts, assuming `dp_rank=0`.
- Translate matched block counts into overlap ratios via `matched_blocks * block_size / num_tokens`.
- Combine overlap, current KV usage, and normalized waiting count into a simple logit and pick the max with random tie-breaking.
This is a reasonable first-pass policy for the standalone example and matches the OverlapScores API shape from the bindings.
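The steps above can be sketched as a self-contained scoring function. The weight on overlap and the exact normalization here are illustrative assumptions, not the example's actual coefficients:

```python
import random

def pick_worker(overlaps, usages, waitings, block_size, num_tokens):
    """Score each worker: prefix-cache overlap minus load terms, argmax with random tie-break."""
    max_waiting = max(waitings) or 1  # avoid division by zero when no one is waiting
    best_ids, best_logit = [], float("-inf")
    for wid, matched_blocks in enumerate(overlaps):
        overlap = matched_blocks * block_size / num_tokens  # fraction of prompt already cached
        logit = 2.0 * overlap - usages[wid] - waitings[wid] / max_waiting
        if logit > best_logit:
            best_logit, best_ids = logit, [wid]
        elif logit == best_logit:
            best_ids.append(wid)
    return random.choice(best_ids)
```

With this shape, a worker holding the whole prefix wins unless its KV usage or queue depth outweighs the overlap bonus.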
297-334: Debug inject endpoint shape matches KV event expectations
`/debug/inject_event` builds a minimal `KvCacheEvent` JSON with a single stored block (including optional `mm_extra_info`) and feeds it directly to `RadixTree.apply_event(...)`. That's a useful hook for local testing and for verifying MM metadata propagation through the router.
Just be aware that this bypasses whatever additional fields the real producer includes (e.g., `dp_rank`), so it should remain clearly labeled as a debug-only tool, which you've already done via the `/debug/...` namespace.
343-361: Remember to run Black and commit formatting changes
CI reports that Black reformatted this file and that trailing whitespace was auto-fixed by pre-commit. Please run the project's formatting hooks locally (e.g., `black examples/deployments/router_standalone_trtllm/router.py` and any configured pre-commit hooks) and commit the resulting changes to clear the pipeline failure.

examples/deployments/router_standalone_trtllm/test_router.py (5)
1-28: LGTM! Well-structured test module setup.The file header, imports, and test constants are properly organized. The sample test images from COCO dataset are appropriate for multimodal testing.
30-44: LGTM! Clean dataclass definitions.
`TestConfig` and `TestResult` are well-designed with sensible defaults. The `kv_settle_time` parameter is a good practice for async event propagation tests.
304-349: LGTM! Comprehensive multimodal hash computation tests.
The `_test_mm_hash_computation` test correctly validates that different mm_hash values produce different block hashes. Good coverage of the core hash differentiation logic.
464-516: LGTM! Thorough block boundary test.
`_test_mm_block_boundary` correctly validates that MM info only affects the intended blocks, checking all three blocks independently.
723-764: LGTM! Clean CLI structure with proper resource cleanup.
The `main()` function properly uses `try/finally` to ensure `tests.cleanup()` is called, and the argument parsing is well-organized.

examples/deployments/router_standalone_trtllm/worker.py (6)
46-51: LGTM! Correct unsigned 64-bit conversion.
The `to_unsigned_u64` function correctly handles Python's arbitrary-precision integers and converts negative values using two's complement for Rust/msgpack compatibility.
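The conversion described here amounts to masking with 2^64 - 1; a minimal sketch of the idea (not necessarily the example's exact implementation):

```python
def to_unsigned_u64(value: int) -> int:
    """Map a signed 64-bit value (e.g. a negative hash) to its two's-complement u64."""
    return value & 0xFFFF_FFFF_FFFF_FFFF
```

Because Python integers are arbitrary precision, the mask is what pins the result into the `[0, 2**64)` range that a Rust `u64` (and msgpack's unsigned encoding) expects.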
59-76: LGTM! Clean ZMQ publisher implementation.
`MetricsPublisher` is well-implemented with proper context and socket management. The `close()` method correctly terminates resources.
125-136: LGTM! Well-structured message serialization.
The `_send` method correctly uses msgpack with sequence numbers for reliable ordering, and handles serialization errors gracefully.
181-213: LGTM! Robust block parsing with proper partial block handling.
`parse_stored_blocks` correctly tracks partial blocks separately and validates block sizes. The early break on partial/oversized blocks is appropriate.
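The partial-block bookkeeping can be sketched independently of the TRT-LLM event format (the function name and return shape here are assumptions for illustration):

```python
def split_blocks(token_ids, block_size):
    """Return (full_blocks, partial_tail); only full blocks are hashable/publishable."""
    full, partial = [], []
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start:start + block_size]
        if len(chunk) == block_size:
            full.append(chunk)
        else:
            partial = chunk
            break  # a short chunk can only be the trailing partial block
    return full, partial
```

Only the full blocks feed into hash computation and KV event publication; the partial tail is carried forward until it fills up.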
451-463: LGTM! Proper shutdown with resource cleanup.
The `shutdown` method correctly cancels background tasks and closes all resources in the right order.
471-515: LGTM! Clean worker manager implementation.
`TrtllmWorkers` provides a clean interface for managing multiple workers with proper port assignment and lifecycle management.

examples/deployments/router_standalone_trtllm/api.py (6)
1-31: LGTM! Proper module setup with necessary imports.The protobuf environment workaround is documented, and imports are well-organized. Good use of both local and external dependencies.
70-103: LGTM! Clean API model definitions.
The Pydantic models follow OpenAI API conventions correctly. Using `str | list[ContentPart]` for `content` properly handles both text-only and multimodal messages.
197-215: LGTM! Good fallback pattern for chat template.
The `_build_prompt` method gracefully falls back to simple formatting when the tokenizer's chat template fails.
382-459: LGTM! Well-structured request handler.
The `chat_completions` endpoint properly validates service readiness, parses requests, handles both text and multimodal paths, computes routing hashes, and streams responses. Good error handling at each stage.
495-504: Binding to `0.0.0.0` is intentional for containerized deployments.
The linter flags binding to all interfaces, but this is appropriate for a service that needs to be accessible from outside the container.
519-549: LGTM! Comprehensive CLI with sensible defaults.
The argument parser provides good defaults and clear help text for all configuration options.
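The general shape of such a parser can be sketched with `argparse`; the flag names and defaults below are invented for illustration and are not the example's actual CLI:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative CLI shape for a standalone router example (hypothetical flags)."""
    p = argparse.ArgumentParser(description="Standalone TRT-LLM router example")
    p.add_argument("--model", default=None, help="model name or path")
    p.add_argument("--num-workers", type=int, default=2, help="number of engine workers")
    p.add_argument("--block-size", type=int, default=32, help="KV cache block size in tokens")
    p.add_argument("--port", type=int, default=8000, help="HTTP port for the API server")
    return p
```

Typed defaults on every flag keep downstream code free of `None` checks for numeric settings.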
Remove excessive trailing blank lines to fix Black formatting failure.
The pipeline failed because Black formatting detected issues with this file. Remove the trailing blank lines after the license header.
Apply this diff:
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
-
-
-
-
-
-

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/__init__.py around lines 3 to
8, remove the excessive trailing blank lines after the license header so the
file ends with a single newline; update the file to eliminate blank lines beyond
the header (leave exactly one newline at EOF) to satisfy Black formatting.
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
    """Parse and validate the incoming request."""
    max_tokens = request.max_completion_tokens or request.max_tokens
    if max_tokens is None:
        return ErrorResponse(
            error=make_error(
                "Either max_tokens or max_completion_tokens must be specified",
                "invalid_request_error", 400
            )
        )

    messages_dict, image_urls = self._extract_messages(request.messages)

    return ParsedRequest(
        messages_dict=messages_dict,
        image_urls=image_urls,
        max_tokens=max_tokens,
        temperature=request.temperature,
        top_p=request.top_p,
        model=request.model,
    )
Default values for temperature and top_p may be None.
When accessing request.temperature and request.top_p, they could be None if explicitly passed as null in the request, despite the default values.
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
"""Parse and validate the incoming request."""
max_tokens = request.max_completion_tokens or request.max_tokens
if max_tokens is None:
return ErrorResponse(
error=make_error(
"Either max_tokens or max_completion_tokens must be specified",
"invalid_request_error", 400
)
)
messages_dict, image_urls = self._extract_messages(request.messages)
return ParsedRequest(
messages_dict=messages_dict,
image_urls=image_urls,
max_tokens=max_tokens,
- temperature=request.temperature,
- top_p=request.top_p,
+ temperature=request.temperature if request.temperature is not None else 1.0,
+ top_p=request.top_p if request.top_p is not None else 1.0,
model=request.model,
)📝 Committable suggestion
def _parse_request(self, request: ChatCompletionRequest) -> ParsedRequest | ErrorResponse:
    """Parse and validate the incoming request."""
    max_tokens = request.max_completion_tokens or request.max_tokens
    if max_tokens is None:
        return ErrorResponse(
            error=make_error(
                "Either max_tokens or max_completion_tokens must be specified",
                "invalid_request_error", 400
            )
        )
    messages_dict, image_urls = self._extract_messages(request.messages)
    return ParsedRequest(
        messages_dict=messages_dict,
        image_urls=image_urls,
        max_tokens=max_tokens,
        temperature=request.temperature if request.temperature is not None else 1.0,
        top_p=request.top_p if request.top_p is not None else 1.0,
        model=request.model,
    )
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/api.py around lines 156–176,
request.temperature and request.top_p can be explicitly null so they may be None
here; update the parsing to coalesce those fields to safe defaults before
building ParsedRequest (e.g., temperature = request.temperature if
request.temperature is not None else 1.0 and top_p = request.top_p if
request.top_p is not None else 1.0, or use any existing DEFAULT_* constants),
ensure the values are numeric (cast/validate) and pass those fallback values
into ParsedRequest when returning.
def _build_block_mm_infos(
    self, num_tokens: int, mm_hash: int | None, image_offsets: list[int] | None
) -> list[dict] | None:
    """Build block_mm_infos for routing hash computation."""
    if mm_hash is None or image_offsets is None:
        return None

    num_blocks = (num_tokens + self.init_params.block_size - 1) // self.init_params.block_size
    return [
        {"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
        for _ in range(num_blocks)
    ]
_build_block_mm_infos applies same MM info to all blocks.
Similar to the issue in worker.py, this broadcasts identical mm_info to every block. The image_offsets should determine which blocks actually contain image content.
def _build_block_mm_infos(
self, num_tokens: int, mm_hash: int | None, image_offsets: list[int] | None
) -> list[dict] | None:
"""Build block_mm_infos for routing hash computation."""
if mm_hash is None or image_offsets is None:
return None
+ block_size = self.init_params.block_size
num_blocks = (num_tokens + self.init_params.block_size - 1) // self.init_params.block_size
- return [
- {"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
- for _ in range(num_blocks)
- ]
+ block_mm_infos = []
+ mm_info = {"mm_objects": [{"mm_hash": mm_hash, "offsets": [image_offsets]}]}
+
+ for block_idx in range(num_blocks):
+ block_start = block_idx * block_size
+ block_end = min(block_start + block_size, num_tokens)
+ # Check if this block overlaps with image token range
+ if image_offsets[0] < block_end and image_offsets[1] > block_start:
+ block_mm_infos.append(mm_info)
+ else:
+ block_mm_infos.append(None)
+
+ return block_mm_infos🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/api.py around lines 302–313,
the current implementation attaches the same mm_object offsets to every block;
instead compute per-block offsets from image_offsets so only blocks that contain
image tokens get mm_objects. For each block index i compute block_start = i *
block_size and block_end = block_start + block_size, filter image_offsets to
offsets within [block_start, block_end), convert them to block-relative offsets
(subtract block_start), and for that block include an mm_objects entry only if
the resulting offsets list is non-empty; return the list of per-block dicts (or
None when mm_hash or image_offsets is None) so image mm info is not broadcast to
all blocks.
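A standalone sketch of the overlap-based fix the suggestion describes (the function name and the single `[start, end)` span argument are assumptions; a real request may carry several image spans):

```python
def build_block_mm_infos(num_tokens, block_size, mm_hash, span):
    """Attach mm info only to blocks overlapping the [start, end) image-token span."""
    if mm_hash is None or span is None:
        return None
    start, end = span
    num_blocks = (num_tokens + block_size - 1) // block_size
    infos = []
    for idx in range(num_blocks):
        b_start, b_end = idx * block_size, min((idx + 1) * block_size, num_tokens)
        if start < b_end and end > b_start:  # half-open interval overlap test
            infos.append({"mm_objects": [{"mm_hash": mm_hash, "offsets": [[start, end]]}]})
        else:
            infos.append(None)
    return infos
```

Blocks outside the image span get `None`, so their hashes stay identical to the text-only case and prefix reuse is preserved.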
@@ -0,0 +1,52 @@
#/bin/bash
Fix the shebang: missing ! character.
The shebang line #/bin/bash is invalid—it must be #!/bin/bash. This causes the script to fail the pre-commit check and will not execute properly as a bash script.
-#/bin/bash
+#!/bin/bash

📝 Committable suggestion
#!/bin/bash
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.
[error] 1-1: pre-commit: check-executables-have-shebangs failed. perf.sh is marked executable but has no valid shebang.
🪛 Shellcheck (0.11.0)
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/perf.sh around line 1 the
shebang is incorrect ("#/bin/bash"); replace it with a valid bash shebang by
inserting the missing "!" so the first line reads "#!/bin/bash" to ensure the
script is executable and passes pre-commit checks.
@@ -0,0 +1,25 @@
#/bin/bash
Fix malformed shebang to resolve pipeline failure.
The shebang is missing the ! character. This is causing the pre-commit check to fail.
Apply this diff:
-#/bin/bash
+#!/bin/bash

📝 Committable suggestion
#!/bin/bash
🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4577/merge) by zhongdaor-nv.
[error] 1-1: pre-commit: check-executables-have-shebangs failed. ping.sh is marked executable but has no valid shebang.
🪛 Shellcheck (0.11.0)
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/ping.sh around line 1 the
shebang is malformed ("#/bin/bash" missing the exclamation mark); replace it
with a proper shebang (#!/bin/bash) at the top of the file so the script is
recognized and the pre-commit pipeline passes.
if parent_hash is not None:
    event["parent_block_hash"] = to_unsigned_u64(parent_hash)

if mm_extra_info is not None:
    event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)

self._send([event])
Clarify multimodal info semantics: is mm_extra_info operation-wide or per-block?
The field name block_mm_infos (array) conflicts with the implementation that broadcasts a single mm_extra_info to all blocks. This indicates either incomplete implementation (should filter by offsets/block content) or incorrect field structure (should be single mm_extra_info if operation-wide).
Consider either:
- Filter blocks based on which ones contain image tokens/offsets before applying `mm_extra_info`
- Rename to a single field and document that it applies operation-wide
if mm_extra_info is not None:
- event["block_mm_infos"] = [mm_extra_info] * len(block_hashes)
+ # TODO: Clarify if mm_extra_info applies to all blocks or only those with multimodal content
+ event["mm_extra_info"] = mm_extra_info

📝 Committable suggestion
if parent_hash is not None:
    event["parent_block_hash"] = to_unsigned_u64(parent_hash)
if mm_extra_info is not None:
    # TODO: Clarify if mm_extra_info applies to all blocks or only those with multimodal content
    event["mm_extra_info"] = mm_extra_info
self._send([event])
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/worker.py around lines 104-110
the code always duplicates a single mm_extra_info across all blocks into
event["block_mm_infos"], which conflicts with the field name and review comment;
either (A) make block_mm_infos truly per-block by building a list of the same
length as block_hashes where you place mm_extra_info only for blocks that
contain multimodal tokens (detect those blocks from offsets/metadata or the
incoming request and otherwise insert None/empty), or (B) change the outgoing
payload to use a single operation-scoped field name (e.g., operation_mm_info or
mm_info) and document that it applies to the whole operation; implement one of
these fixes, update the field name/structure accordingly in the event emission,
and adjust any related docs/tests to reflect the chosen semantics.
def _initialize(self, kv_events_port: int, metrics_port: int):
    """Initialize TensorRT-LLM engine and publishers."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(self.worker_id)

    logger.info(f"Worker {self.worker_id}: Initializing on GPU {self.worker_id}")

    self.llm = LLM(
        model=self.model,
        kv_cache_config=KvCacheConfig(
            enable_block_reuse=True,
            event_buffer_max_size=DEFAULT_KV_EVENT_BUFFER_MAX_SIZE,
        ),
    )

    self.metrics_publisher = MetricsPublisher(metrics_port)
    self.kv_events_publisher = KvEventsPublisher(kv_events_port, self.block_size)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check TrtllmWorkers instantiation patterns
ast-grep --pattern 'TrtllmWorkers($$$)'

Repository: ai-dynamo/dynamo
Length of output: 790
🏁 Script executed:
#!/bin/bash
# Also search for the TrtllmWorker class definition and its usage
rg 'class TrtllmWorker' -A 20

Repository: ai-dynamo/dynamo
Length of output: 3664
🏁 Script executed:
#!/bin/bash
# Check the file structure of the worker.py file
wc -l examples/deployments/router_standalone_trtllm/worker.py
head -50 examples/deployments/router_standalone_trtllm/worker.py

Repository: ai-dynamo/dynamo
Length of output: 1636
🏁 Script executed:
#!/bin/bash
# Look for initialization patterns and how workers are spawned/managed
rg 'TrtllmWorker|_initialize' examples/deployments/router_standalone_trtllm/ -B 2 -A 2

Repository: ai-dynamo/dynamo
Length of output: 4132
🏁 Script executed:
#!/bin/bash
# Get the complete TrtllmWorker __init__ and _initialize methods
sed -n '/^class TrtllmWorker:/,/^class [A-Z]/p' examples/deployments/router_standalone_trtllm/worker.py | head -100

Repository: ai-dynamo/dynamo
Length of output: 3622
🏁 Script executed:
#!/bin/bash
# Check TrtllmWorkers __init__ more completely
sed -n '/^class TrtllmWorkers:/,/^class [A-Z]/p' examples/deployments/router_standalone_trtllm/worker.py | head -80

Repository: ai-dynamo/dynamo
Length of output: 1524
🏁 Script executed:
#!/bin/bash
# Check if workers are spawned in separate processes (multiprocessing)
rg 'Process|multiprocessing|spawn|Fork' examples/deployments/router_standalone_trtllm/

Repository: ai-dynamo/dynamo
Length of output: 1475
Setting CUDA_VISIBLE_DEVICES in _initialize causes critical issues when multiple workers initialize sequentially in the same process.
All workers are initialized in TrtllmWorkers.__init__() within a single process loop (lines 289-298). When worker N initializes, it overwrites the environment variable set by worker N-1, affecting GPU visibility for all previously initialized workers. This breaks multi-worker deployments where num_workers > 1.
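The last-writer-wins behavior is easy to demonstrate with a stand-in variable (using a hypothetical `DEMO_CUDA_VISIBLE_DEVICES` name rather than the real one so the sketch has no side effects on actual GPU visibility):

```python
import os

def init_workers_in_process(worker_ids):
    """Anti-pattern: each worker's init overwrites the process-wide variable."""
    seen = []
    for wid in worker_ids:
        # Mirrors the flagged os.environ assignment; it affects ALL prior workers too.
        os.environ["DEMO_CUDA_VISIBLE_DEVICES"] = str(wid)
        seen.append(os.environ["DEMO_CUDA_VISIBLE_DEVICES"])
    return os.environ["DEMO_CUDA_VISIBLE_DEVICES"], seen
```

After the loop, the variable holds only the last worker's value, which is why per-worker device selection should happen in separate processes (set the env var before spawn) or via an explicit device index passed into engine construction.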
🤖 Prompt for AI Agents
In examples/deployments/router_standalone_trtllm/worker.py around lines 247-263,
setting CUDA_VISIBLE_DEVICES inside _initialize overwrites GPU visibility for
previously initialized workers when multiple workers are created in the same
process (see TrtllmWorkers.__init__ loop at ~289-298); remove the
os.environ["CUDA_VISIBLE_DEVICES"] assignment here and instead ensure each
worker selects its GPU explicitly without changing process-wide env: either set
the device at the CUDA runtime level (e.g., use the CUDA API / framework call to
set the current device for this worker) or modify the LLM initialization to
accept and bind to a specific device index passed from the caller, or move
per-worker CUDA_VISIBLE_DEVICES setup to separate worker processes before
initialization; update callers to pass worker_id/device index into LLM so each
worker uses the correct GPU without mutating environment variables.
#[pyo3(signature = (tokens, kv_block_size, block_mm_infos=None))]
pub fn compute_block_hash_for_seq_py(
    _py: Python,
    tokens: Vec<u32>,
    kv_block_size: usize,
    block_mm_infos: Option<Bound<PyAny>>,
) -> PyResult<Vec<u64>> {
    use std::fs::OpenOptions;
    use std::io::Write;

    if kv_block_size == 0 {
        return Err(to_pyerr(anyhow::anyhow!("kv_block_size cannot be 0")));
    }

    // Convert Python block_mm_infos to Rust Vec<Option<BlockExtraInfo>>
    let mm_infos_rust: Option<Vec<Option<BlockExtraInfo>>> = block_mm_infos
        .as_ref()
        .map(|infos_py| {
            depythonize::<Vec<Option<BlockExtraInfo>>>(infos_py).map_err(|e| {
                PyErr::new::<pyo3::exceptions::PyValueError, _>(format!(
                    "Failed to convert block_mm_infos: {}",
                    e
                ))
            })
        })
        .transpose()?;

    // Log parameters to file
    if let Ok(mut file) = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/tmp/debug_rust_hash_params.txt")
    {
        let _ = writeln!(
            file,
            "\n============================================================"
        );
        let _ = writeln!(file, "=== compute_block_hash_for_seq_py PARAMETERS ===");
        let _ = writeln!(file, "kv_block_size: {}", kv_block_size);
        let _ = writeln!(file, "num_tokens: {}", tokens.len());
        let _ = writeln!(file, "tokens: {:?}", tokens);
        let _ = writeln!(file, "mm_infos_rust: {:?}", mm_infos_rust);
    }

    let hashes =
        compute_block_hash_for_seq(&tokens, kv_block_size as u32, mm_infos_rust.as_deref());

    // Log result
    if let Ok(mut file) = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/tmp/debug_rust_hash_params.txt")
    {
        let hash_values: Vec<u64> = hashes.iter().map(|h| h.0).collect();
        let _ = writeln!(file, "=== RESULT ===");
        let _ = writeln!(file, "hashes ({}): {:?}", hash_values.len(), hash_values);
        let _ = writeln!(
            file,
            "============================================================"
        );
    }

    Ok(hashes.into_iter().map(|h| h.0).collect())
Remove or gate persistent debug file logging in compute_block_hash_for_seq_py
This binding currently opens and appends to /tmp/debug_rust_hash_params.txt on every call, logging full token arrays and MM metadata as well as results. In a real deployment this function can be called extremely frequently, so:
- Performance: repeated synchronous file IO in a tight path will add noticeable overhead and contention.
- Reliability: running in environments where `/tmp` is slow, small, or unwritable can introduce sporadic failures.
- Privacy: tokens and MM metadata can include user data, so dumping them to a world-readable temp file is a potential leakage vector.
Suggest either removing this logging before merge or guarding it behind a very explicit debug flag (e.g., an env var check or a compile‑time feature) and trimming payloads to what you strictly need for troubleshooting.
def test_mm_block_hash_error_zero_block_size():
    """Test that zero block size raises an error."""
    tokens = [100] * 32

    with pytest.raises(Exception):
        compute_block_hash_for_seq_py(tokens, 0)
Consider asserting a more specific exception than Exception
pytest.raises(Exception) will catch anything, including unrelated errors. Since compute_block_hash_for_seq_py explicitly rejects kv_block_size == 0, it would be more robust to assert the concrete Python exception type (e.g., the specific PyErr mapping used by to_pyerr) or at least a narrower base such as RuntimeError, and optionally validate the error message. This also aligns with Ruff’s B017 warning.
🧰 Tools
🪛 Ruff (0.14.8)
352-352: Do not assert blind exception: Exception
(B017)
🤖 Prompt for AI Agents
In lib/bindings/python/tests/test_mm_kv_router.py around lines 348 to 353, the
test uses pytest.raises(Exception) which is too broad; change it to assert the
concrete exception type raised for an invalid kv_block_size (use ValueError or
the actual exception your to_pyerr maps to, e.g., RuntimeError) and optionally
verify the error message via pytest.raises(<SpecificException>, match="expected
message"). Update the with pytest.raises(...) block to use the specific
exception and add a match string if you want to assert the error text.
/// Compute the hash for a sequence of tokens, optionally including multimodal metadata.
///
/// When multimodal extra info is provided, the mm_hashes are included in the hash computation
/// to ensure that blocks with identical tokens but different multimodal objects produce
/// different hashes.
///
/// ### Arguments
///
/// * `tokens` - A vector of `u32` tokens.
/// * `kv_block_size` - The size of each block in tokens.
/// * `block_mm_infos` - Optional per-block multimodal metadata.
///
/// ### Returns
///
/// A vector of `LocalBlockHash` representing the computed hashes for each chunk of tokens.
pub fn compute_block_hash_for_seq(
    tokens: &[u32],
    kv_block_size: u32,
    block_mm_infos: Option<&[Option<BlockExtraInfo>]>,
) -> Vec<LocalBlockHash> {
    use std::fs::OpenOptions;
    use std::io::Write;

    // Log input parameters
    if let Ok(mut file) = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/tmp/debug_compute_block_hash.txt")
    {
        let _ = writeln!(
            file,
            "\n============================================================"
        );
        let _ = writeln!(file, "=== compute_block_hash_for_seq INPUT ===");
        let _ = writeln!(file, "kv_block_size: {}", kv_block_size);
        let _ = writeln!(file, "num_tokens: {}", tokens.len());
        let _ = writeln!(file, "tokens: {:?}", tokens);
        let _ = writeln!(file, "block_mm_infos: {:?}", block_mm_infos);
    }

    let result: Vec<LocalBlockHash> = tokens
        .chunks_exact(kv_block_size as usize)
        .enumerate()
        .map(|(block_idx, chunk)| {
            let mut bytes: Vec<u8> = chunk.iter().flat_map(|&num| num.to_le_bytes()).collect();

            // Include MM hashes in the block hash computation if present
            if let Some(mm_infos) = block_mm_infos {
                if let Some(Some(block_mm_info)) = mm_infos.get(block_idx) {
                    // Sort mm_hashes for consistent ordering
                    let mut mm_hashes: Vec<u64> = block_mm_info
                        .mm_objects
                        .iter()
                        .map(|obj| obj.mm_hash)
                        .collect();
                    mm_hashes.sort_unstable();

                    // Log MM hash inclusion
                    if let Ok(mut file) = OpenOptions::new()
                        .create(true)
                        .append(true)
                        .open("/tmp/debug_compute_block_hash.txt")
                    {
                        let _ = writeln!(
                            file,
                            "Block {}: Including mm_hashes {:?} in hash computation",
                            block_idx, mm_hashes
                        );
                    }

                    // Append sorted mm_hashes to the byte array
                    for mm_hash in mm_hashes {
                        bytes.extend_from_slice(&mm_hash.to_le_bytes());
                    }
                }
            }

            let hash = compute_block_hash(&Bytes::from(bytes));

            // Log per-block result
            if let Ok(mut file) = OpenOptions::new()
                .create(true)
                .append(true)
                .open("/tmp/debug_compute_block_hash.txt")
            {
                let _ = writeln!(
                    file,
                    "Block {}: chunk tokens {:?} -> hash {}",
                    block_idx, chunk, hash.0
                );
            }

            hash
        })
        .collect();

    // Log output
    if let Ok(mut file) = OpenOptions::new()
        .create(true)
        .append(true)
        .open("/tmp/debug_compute_block_hash.txt")
    {
        let hash_values: Vec<u64> = result.iter().map(|h| h.0).collect();
        let _ = writeln!(file, "=== OUTPUT ===");
        let _ = writeln!(
            file,
            "result_hashes ({}): {:?}",
            hash_values.len(),
            hash_values
        );
        let _ = writeln!(
            file,
            "============================================================"
        );
    }

    result
}
MM‑aware hashing semantics look correct; logging and allocation deserve tightening
The new compute_block_hash_for_seq behavior—chunking tokens, optionally incorporating per‑block mm_hash values (sorted for determinism), and skipping partial trailing blocks—aligns with the Python tests and the intended “same tokens + different MM ⇒ different hash” contract. That’s good.
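The contract can be sanity-checked with a toy version of the scheme — a stand-in hasher and simplified types, not the crate's actual `compute_block_hash` or `BlockMmObjectInfo`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for the real block hasher: tokens are serialized little-endian,
// sorted MM hashes are appended, and the resulting byte string is hashed.
fn block_hash(chunk: &[u32], mut mm_hashes: Vec<u64>) -> u64 {
    let mut bytes: Vec<u8> = chunk.iter().flat_map(|&t| t.to_le_bytes()).collect();
    mm_hashes.sort_unstable(); // result is independent of MM object order
    for h in mm_hashes {
        bytes.extend_from_slice(&h.to_le_bytes());
    }
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let tokens = [1u32, 2, 3, 4];
    // Same tokens + different MM metadata => different block hash.
    assert_ne!(block_hash(&tokens, vec![]), block_hash(&tokens, vec![0xBEEF]));
    // MM hash ordering does not affect the result.
    assert_eq!(block_hash(&tokens, vec![7, 3]), block_hash(&tokens, vec![3, 7]));
    // chunks_exact drops a partial trailing block: 6 tokens at block size 4 => 1 block.
    let seq = [1u32, 2, 3, 4, 5, 6];
    assert_eq!(seq.chunks_exact(4).count(), 1);
    println!("ok");
}
```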
Two follow‑ups are worth addressing:
- **Unbounded debug logging to `/tmp/debug_compute_block_hash.txt`**

  The function logs every call's inputs (including full token sequences and MM metadata) and every block's hash to a fixed temp file. In production this will:

  - Add synchronous file IO on a very hot path.
  - Risk filling or slowing `/tmp`, and misbehave when it is not writable.
  - Persist user-level token content to disk, which is a privacy concern.

  As with the Python binding, this should either be removed before release or put behind a clearly opt-in debug mechanism (env flag / feature gate) with reduced verbosity.
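One possible opt-in shape, as a hedged sketch — the env-var name and helper are illustrative, not part of this codebase:

```rust
use std::env;
use std::fs::OpenOptions;
use std::io::Write;

// Illustrative guard: only touches the filesystem when the operator
// explicitly sets DYN_KV_HASH_DEBUG_FILE (a hypothetical variable name).
fn debug_log(msg: &str) {
    let Ok(path) = env::var("DYN_KV_HASH_DEBUG_FILE") else { return };
    if let Ok(mut f) = OpenOptions::new().create(true).append(true).open(path) {
        let _ = writeln!(f, "{msg}");
    }
}

fn main() {
    // With the variable unset this is a cheap early return on the hot path.
    debug_log("=== compute_block_hash_for_seq INPUT ===");
    println!("done");
}
```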
- **Avoid the extra `Bytes` allocation per block**

  ```rust
  let hash = compute_block_hash(&Bytes::from(bytes));
  ```

  `compute_block_hash` only needs `&[u8]`, so you can pass the existing `Vec<u8>` slice:

  ```diff
  - let hash = compute_block_hash(&Bytes::from(bytes));
  + let hash = compute_block_hash(&bytes);
  ```

  This keeps the hashing loop allocation-free beyond the initial `Vec<u8>` build.
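A runnable sketch of the borrow-based form, with an FNV-style placeholder standing in for the real `compute_block_hash` — only the `&[u8]` parameter shape is the point here:

```rust
// Placeholder hasher over a byte slice (FNV-1a); the real function's
// internals differ — what matters is that it borrows &[u8] directly.
fn compute_block_hash(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x100000001b3)
    })
}

fn main() {
    let bytes: Vec<u8> = vec![1, 2, 3, 4];
    // Borrowing the Vec directly: no intermediate Bytes handle is built.
    let hash = compute_block_hash(&bytes);
    assert_eq!(hash, compute_block_hash(bytes.as_slice()));
    println!("{hash:#x}");
}
```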
@zhongdaor-nv thanks! Can you check the rabbit's comments and CIs when you get the chance? I also noticed some bulky logs in Rust, can we remove those?
Overview:
Add multimodal (MM) hash support to the KV router, ensuring blocks with identical tokens but different multimodal objects produce different hashes. Also add a standalone KV router example for TRT-LLM with MM support.
Details:
- Add `BlockExtraInfo`, `RequestExtraInfo`, `BlockMmObjectInfo`, `RequestMmObjectInfo` in `kv_router/protocols.rs`
- Update `compute_block_hash_for_seq` to incorporate MM hashes into block hash computation
- Update `kv.rs` to accept an optional `block_mm_infos` parameter
- Add `examples/deployments/router_standalone_trtllm/test_mm_kv_router.py`

Where should the reviewer start?
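The shape of the new metadata types can be inferred from the diff above; the sketch below is a hedged reconstruction — only the `mm_objects` and `mm_hash` fields are confirmed by the reviewed code, everything else is illustrative:

```rust
// Sketch of the per-block MM metadata (reconstruction, not the actual
// definitions in kv_router/protocols.rs).
#[derive(Debug, Clone)]
pub struct BlockMmObjectInfo {
    pub mm_hash: u64, // content hash of one multimodal object touching this block
}

#[derive(Debug, Clone, Default)]
pub struct BlockExtraInfo {
    pub mm_objects: Vec<BlockMmObjectInfo>, // MM objects overlapping this block
}

fn main() {
    let block = BlockExtraInfo {
        mm_objects: vec![BlockMmObjectInfo { mm_hash: 42 }],
    };
    // A block carrying MM objects contributes their hashes to its block hash.
    assert_eq!(block.mm_objects.len(), 1);
    println!("{block:?}");
}
```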
- `lib/llm/src/kv_router/protocols.rs` - new multimodal protocol structures
- `lib/llm/src/kv_router/indexer.rs` - updated hash computation with MM support
- `lib/bindings/python/tests/test_mm_kv_router.py` - tests demonstrating the new functionality
- `examples/deployments/router_standalone_trtllm/` - new standalone example

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Relates to DIS-916
Summary by CodeRabbit
New Features
Documentation