feat: KVBM V2 optimized bounce buffer transfer + benchmark #3947
Conversation
Signed-off-by: jthomson04 <[email protected]>
Walkthrough

A new benchmarking tool is introduced for measuring remote memory transfer performance between disk and device. Configuration support is added to the build manifest, along with a new binary that executes transfer benchmarks and computes bandwidth metrics. The internal transfer execution logic is refactored to support batched buffered transfers with improved block grouping and conditional routing based on bounce buffer size.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Pre-merge checks
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
🧹 Nitpick comments (4)
lib/llm/bin/bench_local_transfer_v2.rs (1)
115-115: Make required backends configurable (fallback-friendly)

Requiring both "POSIX" and "GDS_MT" can fail on machines without GDS. Consider a CLI flag to pick backends and a graceful fallback to the subset available.
Example:
```diff
-    let agent = NixlAgent::require_backends("test_agent", &["POSIX", "GDS_MT"])?;
+    let backends: Vec<&str> = std::env::var("KVBM_BACKENDS")
+        .ok()
+        .map(|s| s.split(',').collect())
+        .unwrap_or_else(|| vec!["POSIX","GDS_MT"]);
+    let agent = NixlAgent::require_backends("bench_agent", &backends)?;
```

If you prefer Clap, add a --backends "POSIX,GDS_MT" option.
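For the Clap route, this is a minimal sketch of what such a flag could look like; the Args struct, field name, and default are illustrative, and the commented-out require_backends call is the existing constructor referenced above:

```rust
use clap::Parser;

// Sketch of the suggested flag; struct and field names are illustrative,
// not the benchmark's existing CLI.
#[derive(Parser)]
struct Args {
    /// Comma-separated NIXL backends to require, e.g. "POSIX,GDS_MT".
    #[arg(long, default_value = "POSIX,GDS_MT")]
    backends: String,
}

fn main() {
    let args = Args::parse();
    let backends: Vec<&str> = args.backends.split(',').map(str::trim).collect();
    // The constructor referenced above would then take the parsed list:
    // let agent = NixlAgent::require_backends("bench_agent", &backends)?;
    println!("requested backends: {backends:?}");
}
```

Parsing into a plain String and splitting manually keeps the sketch free of extra clap features; a value_delimiter on the arg would also work.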
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (3)
182-182: Prefer a descriptive type over (Vec<usize>, bool)

TransferGroup = (Vec<usize>, bool) obscures the meaning of the boolean (which bounce-buffer half). Use an enum or a small struct for clarity and type safety.
Example:
```diff
-type TransferGroup = (Vec<usize>, bool);
+enum BounceHalf { A, B }
+struct TransferGroup {
+    src_ids: Vec<usize>,
+    half: BounceHalf,
+}
```
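For illustration, a call site can then match on the half by name instead of branching on an anonymous bool; this sketch only demonstrates the shape, and the offset helper and values are hypothetical:

```rust
// Sketch only: shows how the named bounce-buffer half reads at a call site.
enum BounceHalf {
    A,
    B,
}

struct TransferGroup {
    src_ids: Vec<usize>,
    half: BounceHalf,
}

// Hypothetical helper: where a group's blocks start inside the bounce buffer.
fn bounce_offset(group: &TransferGroup, half_len: usize) -> usize {
    match group.half {
        BounceHalf::A => 0,
        BounceHalf::B => half_len,
    }
}

fn main() {
    let first = TransferGroup { src_ids: vec![0, 1], half: BounceHalf::A };
    let second = TransferGroup { src_ids: vec![2, 3], half: BounceHalf::B };
    // With a half length of 4 blocks, the two groups target disjoint regions.
    assert_eq!(bounce_offset(&first, 4), 0);
    assert_eq!(bounce_offset(&second, 4), 4);
    assert_eq!(first.src_ids.len() + second.src_ids.len(), 4);
}
```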
184-269: Buffered double-buffering is sound; consider cancellation and governance
- The A/B split prevents aliasing and enables pipelining. Good.
- try_join_all awaits both stages but doesn’t cancel in-flight ops on first error; upstream operations may continue writing to the bounce buffer.
Suggestions:
- Use a JoinSet or FuturesUnordered and cancel the sibling future on error (see the sketch after the diff below).
- Integrate with existing TransferBatcher/LocalTransferManager to respect global concurrency limits instead of ad-hoc try_join_all. Based on learnings.
If you keep local scheduling, at least short-circuit:
```diff
-    futures::future::try_join_all(futures).await?;
+    if let Err(e) = futures::future::try_join_all(futures).await {
+        // TODO: plumb cancellation to direct transfers if supported.
+        return Err(e);
+    }
```

Based on learnings.
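As a self-contained illustration of the JoinSet suggestion: dropping a tokio JoinSet aborts any tasks still running, so returning on the first error both short-circuits and cancels the sibling stage. The stage futures and String error type here are placeholders, not the executor's real types:

```rust
use std::future::Future;
use std::pin::Pin;

use tokio::task::JoinSet;

type Stage = Pin<Box<dyn Future<Output = Result<(), String>> + Send>>;

// Sketch: run the per-half stage futures in a JoinSet so the first error
// aborts the remaining in-flight tasks instead of letting them keep
// writing into the bounce buffer.
async fn run_stages(stages: Vec<Stage>) -> Result<(), String> {
    let mut set = JoinSet::new();
    for stage in stages {
        set.spawn(stage);
    }
    while let Some(joined) = set.join_next().await {
        // Propagate the first failure; returning early drops `set`,
        // which aborts any tasks that are still running.
        joined.map_err(|e| e.to_string())??;
    }
    Ok(())
}

#[tokio::main]
async fn main() {
    let stages: Vec<Stage> = vec![
        Box::pin(async { Ok::<(), String>(()) }),
        Box::pin(async { Err::<(), String>("stage failed".to_string()) }),
    ];
    assert!(run_stages(stages).await.is_err());
}
```

Whether aborting the task actually stops an already-submitted NIXL/CUDA operation depends on the underlying transfer API, so cancellation may still need plumbing into the direct-transfer path.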
315-336: Single-bounce-block path is strictly per-block; batch to reduce overhead

Looping one block at a time increases launch/dispatch overhead. When num_bounce_blocks == 1 you can still batch by chunking src/dst ids.
Example:
```diff
-        for (src_block_id, dst_block_id) in src_block_ids.iter().zip(dst_block_ids.iter()) {
+        for (src_chunk, dst_chunk) in src_block_ids.chunks(1).zip(dst_block_ids.chunks(1)) {
             if let Err(e) = execute_two_hop_transfer_chunk(
                 &src_clone,
                 bounce_buffer_spec.layout(),
                 &dst_clone,
-                &[*src_block_id],
+                src_chunk,
                 &[bounce_block],
-                &[*dst_block_id],
+                dst_chunk,
```

This keeps semantics but is future-proof if you later grow the chunk size.
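A standalone illustration of that chunking pattern (the block IDs and helper are placeholders, not the executor's real call):

```rust
// Sketch: pair source and destination block IDs in fixed-size chunks so the
// per-chunk transfer call can later batch more than one block at a time.
fn chunk_pairs(src: &[usize], dst: &[usize], chunk_size: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    src.chunks(chunk_size)
        .zip(dst.chunks(chunk_size))
        .map(|(s, d)| (s.to_vec(), d.to_vec()))
        .collect()
}

fn main() {
    let src = [10, 11, 12, 13, 14];
    let dst = [20, 21, 22, 23, 24];
    // chunk_size = 1 reproduces the current per-block behaviour; larger
    // values batch blocks without changing the pairing.
    assert_eq!(chunk_pairs(&src, &dst, 1).len(), 5);
    assert_eq!(chunk_pairs(&src, &dst, 2).len(), 3);
}
```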
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
lib/llm/Cargo.toml (3 hunks)
lib/llm/bin/bench_local_transfer_v2.rs (1 hunks)
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-18T21:41:02.263Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#2989
File: lib/llm/src/block_manager/distributed/transfer.rs:59-60
Timestamp: 2025-09-18T21:41:02.263Z
Learning: The codebase has a robust two-layer transfer management system: TransferBatcher (offload/pending.rs:400-485) handles batching large transfers into MAX_TRANSFER_BATCH_SIZE chunks, and LocalTransferManager (offload/pending.rs:280-286) limits concurrency to MAX_CONCURRENT_TRANSFERS using FuturesUnordered.
Applied to files:
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs
📚 Learning: 2025-09-18T21:41:02.263Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#2989
File: lib/llm/src/block_manager/distributed/transfer.rs:59-60
Timestamp: 2025-09-18T21:41:02.263Z
Learning: ConnectorTransferBatcher in distributed/transfer.rs duplicates existing batching logic and uses try_join_all(), which spawns unlimited concurrent transfers and bypasses the existing concurrency control systems. The proper solution is to integrate distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline rather than implementing separate batching logic.
Applied to files:
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs
🧬 Code graph analysis (2)
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (3)
lib/llm/src/block_manager/v2/physical/manager/mod.rs (1)
context (209-211)
lib/llm/src/block_manager/v2/physical/transfer/executor/nixl.rs (2)
src (70-84)
layer_range (178-181)
lib/llm/src/block_manager/v2/physical/transfer/tests/local_transfers.rs (1)
layer_range (81-87)
lib/llm/bin/bench_local_transfer_v2.rs (2)
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (2)
execute_transfer (34-75)
new (373-377)
lib/llm/src/block_manager/v2/physical/transfer/nixl_agent/mod.rs (1)
require_backends(114-155)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: clippy (lib/runtime/examples)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
lib/llm/Cargo.toml (3)
25-25: Feature wiring looks good

block-manager-bench composes existing features and correctly marks clap/indicatif as optional. No issues.
104-107: Versions verified; disabling clap defaults remains optional

Both crate versions exist: clap 4.5.49 is the second-latest stable release (4.5.50 is current), and indicatif 0.18.0 is published. If binary size matters, you may still consider disabling clap's default features, but this is optional.
179-182: All concerns verified as correct; no action needed

The workspace tokio configuration already enables the "full" feature set, which includes both "rt-multi-thread" and "macros" required for the #[tokio::main] macro. The bin target is properly gated, and all dependencies are correctly configured.

lib/llm/bin/bench_local_transfer_v2.rs (1)
106-112: Review comment is incorrect regarding unit conversion; the bandwidth formula is already correct

The current implementation is mathematically sound for GB/s. The calculation bytes / mean.as_nanos() correctly yields GB/s because:
- as_nanos() returns the elapsed time as a nanosecond count
- 1 byte per nanosecond is 10⁹ bytes per second, i.e. 1 GB/s, so dividing bytes by nanoseconds gives the bandwidth directly in GB/s

The suggested fix to divide by 1e9 would produce B/s, not GB/s, making the return value inconsistent with the function name and output label.

However, there is a valid concern: the total_bytes calculation hardcodes * 2, but the LayoutConfig specifies both outer_dim (2) and dtype_width_bytes (2). It's unclear whether the formula should multiply by 4 (both factors) or by 2 (one factor), and the calculation doesn't reference the actual config values. This warrants clarification to ensure total_bytes correctly represents the actual transferred data size.

Likely an incorrect or invalid review comment.
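A tiny sketch of the unit argument above; the function and numbers are illustrative, not taken from the benchmark:

```rust
// Sketch: bytes divided by nanoseconds is numerically equal to GB/s,
// since 1 byte/ns = 1e9 bytes/s = 1 GB/s (decimal GB).
fn bandwidth_gb_per_s(total_bytes: u64, elapsed_nanos: u128) -> f64 {
    total_bytes as f64 / elapsed_nanos as f64
}

fn main() {
    // Roughly 4 GiB transferred in one second (1e9 ns) is ~4.29 GB/s.
    let bytes = 4u64 * 1024 * 1024 * 1024;
    let gbps = bandwidth_gb_per_s(bytes, 1_000_000_000);
    assert!((gbps - 4.294967296).abs() < 1e-9);
    println!("{gbps:.2} GB/s");
}
```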