
Conversation

@kthui
Contributor

@kthui kthui commented Dec 2, 2025

Overview:

Fix TRT-LLM cache transceiver setting issue since 1.2.0rc3

Details:

max_tokens_in_buffer needs to match --max-seq-len and --max-num-tokens (a configuration sketch follows the list below).

Other changes:

  • Allocate KV cache exactly for the max sequence length, instead of a percentage of free memory.
  • Update test timeout markers, accounting for the reduced KV cache space allocation.
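For illustration, here is a minimal sketch of the matching limits described above, using the 16384 value from this PR's test changes. The MAX_TOKENS constant and variable names are hypothetical, not the repo's actual code:

# Sketch only: MAX_TOKENS is a hypothetical constant; the test itself
# hard-codes 16384. Since 1.2.0rc3, all three limits must agree.
MAX_TOKENS = 16384

worker_args = [
    "--max-seq-len", str(MAX_TOKENS),
    "--max-num-tokens", str(MAX_TOKENS),
    # --free-gpu-memory-fraction is no longer passed: the KV cache is sized
    # exactly for the max sequence length instead of a fraction of free memory
]

engine_config_yaml = (
    "cache_transceiver_config:\n"
    "  backend: DEFAULT\n"
    f"  max_tokens_in_buffer: {MAX_TOKENS}\n"  # must match the flags above
    "disable_overlap_scheduler: true\n"
    "kv_cache_config:\n"
    f"  max_tokens: {MAX_TOKENS}\n"  # exact KV cache allocation
)

with open("test_request_cancellation_trtllm_config.yaml", "w") as f:
    f.write(engine_config_yaml)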

Where should the reviewer start?

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

  • Tests

    • Adjusted fault‑tolerance cancellation tests with shorter per‑case and aggregated durations and reduced timeouts for faster, more consistent runs.
    • Simplified expected‑failure handling for KV‑transfer cancellation to a lighter xfail (non‑strict).
    • Updated timing comments/docs to reflect recent runs.
  • Chores

    • Replaced a GPU memory startup argument with token/buffer and KV cache limits for improved buffering behavior.


@github-actions github-actions bot added the fix label Dec 2, 2025
@kthui kthui self-assigned this Dec 2, 2025
@kthui kthui marked this pull request as ready for review December 2, 2025 20:43
@kthui kthui requested review from a team as code owners December 2, 2025 20:43
@kthui kthui enabled auto-merge (squash) December 2, 2025 20:43
@coderabbitai
Contributor

coderabbitai bot commented Dec 2, 2025

Walkthrough

Updated TensorRT-LLM cancellation test timings, reduced several timeouts, removed free-gpu-memory-fraction, added max_tokens_in_buffer and KV-cache max_tokens for non-prefill_and_decode modes, simplified XFAIL for kv_transfer test, and adjusted test markers and doc comments.

Changes

  • TensorRT-LLM cancellation tests (tests/fault_tolerance/cancellation/test_trtllm.py): Reduced per-test durations and aggregated timeouts (e.g., per-case ~115s → ~65s; aggregated 140s → 135s; several 350s → 195s). Removed the free-gpu-memory-fraction arg. Added max_tokens_in_buffer: 16384 and kv_cache_config: max_tokens: 16384 when mode ≠ prefill_and_decode. Simplified XFAIL for test_request_cancellation_trtllm_kv_transfer_cancel. Updated the pytest mark and timing comments.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify timeout reductions are safe and not flaky in CI.
  • Confirm new config keys (max_tokens_in_buffer, kv_cache_config.max_tokens) are applied only when mode ≠ prefill_and_decode.
  • Search for any remaining free-gpu-memory-fraction references.
  • Check test_request_cancellation_trtllm_kv_transfer_cancel expectations after XFAIL simplification.
  • Ensure pytest mark change aligns with CI policies.

Poem

🐇 I hopped through tests at dawn’s first light,

Trimming waits, setting buffers tight,
No heavy X now slows the run,
KV caches ready, tokens one-six-three-eight-four, fun! 🎉

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title accurately summarizes the main fix: addressing a TRT-LLM cache transceiver configuration issue introduced in version 1.2.0rc3, which aligns with the PR's primary objective.
  • Description check — ✅ Passed: The description includes all required template sections (Overview, Details, Where to start, Related Issues) with substantive content explaining the fix and supporting changes, though the "Where should the reviewer start" section is marked N/A.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)

39-65: Keep TRT‑LLM token limits in sync via a single constant

You now have 16384 hard‑coded three times (--max-seq-len, --max-num-tokens, and max_tokens_in_buffer in the YAML). To avoid future mismatches that would violate the “must match” requirement, consider centralizing this as a single constant and reusing it for both the CLI args and the YAML write:

-        command = [
+        max_tokens = "16384"
+
+        command = [
           "python3",
@@
-            "--max-seq-len",
-            "16384",
-            "--max-num-tokens",
-            "16384",
+            "--max-seq-len",
+            max_tokens,
+            "--max-num-tokens",
+            max_tokens,
@@
-        if mode != "prefill_and_decode":
+        if mode != "prefill_and_decode":
             with open("test_request_cancellation_trtllm_config.yaml", "w") as f:
-                f.write(
-                    "cache_transceiver_config:\n  backend: DEFAULT\n  max_tokens_in_buffer: 16384\n"
-                )
+                f.write(
+                    f"cache_transceiver_config:\n"
+                    f"  backend: DEFAULT\n"
+                    f"  max_tokens_in_buffer: {max_tokens}\n"
+                )

This keeps the test logic aligned with the TRT‑LLM requirement that all three limits match while making future changes less error‑prone.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44e8600 and 8c37f8e.

📒 Files selected for processing (1)
  • tests/fault_tolerance/cancellation/test_trtllm.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.
📚 Learning: 2025-10-03T01:53:15.023Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-09-25T00:49:16.914Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py:106-147
Timestamp: 2025-09-25T00:49:16.914Z
Learning: In TensorRT-LLM cancellation implementation, there are two distinct request IDs: the external/user-facing request_id from the incoming request, and the internal_request_id from TRT-LLM's generation_result that's needed for executor.abort_request() calls.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)

374-462: Confirm KV‑transfer cancellation test stability after xfail removal

Now that test_request_cancellation_trtllm_kv_transfer_cancel runs as a normal test, please double‑check it’s stable given the strict 500 ms default timeout in poll_for_pattern for cancellation propagation. Inter‑process log polling plus the new cache‑transceiver config could still surface rare timing issues, so it’s worth running this test repeatedly (and perhaps under load) to ensure it doesn’t exhibit intermittent failures. Based on learnings, this timeout is intentionally tight for TRT‑LLM cancellation.
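A hedged way to probe for such intermittent failures is to re-run the test in fresh pytest processes (a sketch; the 20-iteration count is arbitrary, and the node ID may need a parameter suffix if the test is parametrized):

# Sketch: re-run the formerly-xfailed test in fresh pytest processes to
# surface rare timing failures under the tight 500 ms propagation window.
import subprocess
import sys

TEST = (
    "tests/fault_tolerance/cancellation/test_trtllm.py"
    "::test_request_cancellation_trtllm_kv_transfer_cancel"
)

for i in range(20):
    result = subprocess.run([sys.executable, "-m", "pytest", "-x", TEST])
    if result.returncode != 0:
        print(f"flaked on iteration {i + 1}")
        break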

@kthui kthui changed the title fix: Set max_tokens_in_buffer on TRT-LLM cache transceiver fix: TRT-LLM cache transceiver setting issue since 1.2.0rc3 Dec 2, 2025
@kthui kthui force-pushed the jacky-ft-cancel-kv-transfer-trtllm-config branch from 1c1c87c to a37800c on December 13, 2025 00:41
@kthui
Contributor Author

kthui commented Dec 13, 2025

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Dec 13, 2025

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)

33-40: Update the marker comment: it contradicts the new pre_merge marker.
Line 38 says “post_merge to pinpoint failure commit” but the code is pytest.mark.pre_merge; this is confusing and (given marker enforcement) should be accurate. Based on learnings, marker correctness is intentionally enforced.

🧹 Nitpick comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)

67-93: Centralize 16384 constant and use unique config filenames to prevent drift and parallel-execution collisions.

The code contains four hardcoded instances of 16384 that must stay in sync (--max-seq-len, --max-num-tokens, max_tokens_in_buffer, kv_cache_config.max_tokens), creating maintenance risk. Additionally, the fixed filename test_request_cancellation_trtllm_config.yaml will collide if tests run in parallel with pytest-xdist.

Recommended approach:

  • Define MAX_TOKENS = 16384 at module level and reference it everywhere.
  • Generate unique config filenames using test identity (e.g., with request.node.name and mode) to avoid collisions under parallel execution; a sketch follows.
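A minimal sketch of that recommendation, assuming a pytest request fixture is available in the test (MAX_TOKENS and write_engine_config are hypothetical names, not the actual test code):

# Hypothetical sketch of the suggestion above.
MAX_TOKENS = 16384  # single source of truth for all four token limits

def write_engine_config(request, mode: str) -> str:
    """Write a per-test engine config file and return its path."""
    # request.node.name is unique per parametrized test case, so parallel
    # pytest-xdist workers will not clobber each other's config files.
    config_path = f"{request.node.name}_{mode}_trtllm_config.yaml"
    with open(config_path, "w") as f:
        f.write(
            "cache_transceiver_config:\n"
            "  backend: DEFAULT\n"
            f"  max_tokens_in_buffer: {MAX_TOKENS}\n"
            "disable_overlap_scheduler: true\n"
            "kv_cache_config:\n"
            f"  max_tokens: {MAX_TOKENS}\n"
        )
    return config_path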
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45e881d and a37800c.

📒 Files selected for processing (1)
  • tests/fault_tolerance/cancellation/test_trtllm.py (7 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.
Learnt from: bhuvan002
Repo: ai-dynamo/dynamo PR: 2702
File: components/backends/trtllm/src/dynamo/trtllm/logits_processing/adapter.py:7-13
Timestamp: 2025-08-28T20:43:54.701Z
Learning: CI import failures for optional dependencies like tensorrt_llm can be resolved by removing __init__.py files and restructuring imports, rather than wrapping imports in try/except blocks.
📚 Learning: 2025-10-03T01:53:15.023Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-09-25T00:54:01.369Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-12-11T21:55:13.944Z
Learnt from: pvijayakrish
Repo: ai-dynamo/dynamo PR: 4812
File: scripts/report_pytest_markers.py:412-438
Timestamp: 2025-12-11T21:55:13.944Z
Learning: In scripts/report_pytest_markers.py, the unconditional failure when report.total_missing > 0 (tests missing required marker categories) is intentional behavior designed to catch missing markers at PR review and merge time, not a bug requiring phased rollout options.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-09-25T00:49:16.914Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py:106-147
Timestamp: 2025-09-25T00:49:16.914Z
Learning: In TensorRT-LLM cancellation implementation, there are two distinct request IDs: the external/user-facing request_id from the incoming request, and the internal_request_id from TRT-LLM's generation_result that's needed for executor.abort_request() calls.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: sglang (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)

168-168: Timeout reductions: please sanity-check against worst-case engine init + pre_merge load.
The reduced timeouts match the updated “3x average” intent, but since these are now pre_merge, verify they still pass under busy/shared runners and both request planes.

Also applies to: 255-255, 339-339, 431-431

@kthui kthui force-pushed the jacky-ft-cancel-kv-transfer-trtllm-config branch from 1abc47f to e9707a4 on December 15, 2025 23:16
@kthui kthui force-pushed the jacky-ft-cancel-kv-transfer-trtllm-config branch from e9707a4 to c57ef93 on December 16, 2025 00:29
@kthui kthui force-pushed the jacky-ft-cancel-kv-transfer-trtllm-config branch from c57ef93 to 090f987 on December 16, 2025 19:26
@kthui kthui force-pushed the jacky-ft-cancel-kv-transfer-trtllm-config branch from 090f987 to 85621a2 on December 16, 2025 21:22
@kthui
Copy link
Contributor Author

kthui commented Dec 16, 2025

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Dec 16, 2025

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)

84-88: Cache configuration correctly implements PR objectives.

The addition of max_tokens_in_buffer: 16384 and kv_cache_config.max_tokens: 16384 properly aligns with the command-line flags --max-seq-len and --max-num-tokens (lines 76-78), which matches the PR's goal of ensuring cache transceiver settings are consistent with CLI parameters.

Consider extracting the hard-coded 16384 value to a module-level constant to improve maintainability:

+# TRT-LLM test configuration
+TEST_MAX_TOKENS = 16384
+
 class DynamoWorkerProcess(ManagedProcess):
     ...
         command = [
             ...
             "--max-seq-len",
-            "16384",
+            str(TEST_MAX_TOKENS),
             "--max-num-tokens",
-            "16384",
+            str(TEST_MAX_TOKENS),
             ...
         ]
         if mode != "prefill_and_decode":
             with open("test_request_cancellation_trtllm_config.yaml", "w") as f:
                 f.write(
-                    "cache_transceiver_config:\n  backend: DEFAULT\n  max_tokens_in_buffer: 16384\n"
+                    f"cache_transceiver_config:\n  backend: DEFAULT\n  max_tokens_in_buffer: {TEST_MAX_TOKENS}\n"
                 )
                 f.write("disable_overlap_scheduler: true\n")
-                f.write("kv_cache_config:\n  max_tokens: 16384\n")
+                f.write(f"kv_cache_config:\n  max_tokens: {TEST_MAX_TOKENS}\n")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 59e6873 and 85621a2.

📒 Files selected for processing (1)
  • tests/fault_tolerance/cancellation/test_trtllm.py (6 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.
📚 Learning: 2025-10-03T01:53:15.023Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-09-25T00:54:01.369Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-09-25T00:49:16.914Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: components/backends/trtllm/src/dynamo/trtllm/request_handlers/handler_base.py:106-147
Timestamp: 2025-09-25T00:49:16.914Z
Learning: In TensorRT-LLM cancellation implementation, there are two distinct request IDs: the external/user-facing request_id from the incoming request, and the internal_request_id from TRT-LLM's generation_result that's needed for executor.abort_request() calls.

Applied to files:

  • tests/fault_tolerance/cancellation/test_trtllm.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
tests/fault_tolerance/cancellation/test_trtllm.py (1)

168-168: LGTM! Timeout reductions align with deterministic KV cache allocation.

The timeout adjustments (135s for aggregated, 195s for disaggregated tests) appropriately reflect the performance improvements from switching to exact KV cache sizing. The 3x safety margin over documented average runtimes is reasonable.

Also applies to: 255-255, 339-339, 432-432

@kthui kthui merged commit 32eaecb into main Dec 16, 2025
29 checks passed
@kthui kthui deleted the jacky-ft-cancel-kv-transfer-trtllm-config branch December 16, 2025 22:02
