feat: KVBM V2 optimized bounce buffer transfer + benchmark #3947
Conversation
Signed-off-by: jthomson04 <[email protected]>
Walkthrough

A new benchmarking tool is introduced for measuring remote memory transfer performance between disk and device. Configuration support is added to the build manifest, along with a new binary that executes transfer benchmarks and computes bandwidth metrics. The internal transfer execution logic is refactored to support batched buffered transfers with improved block grouping and conditional routing based on bounce buffer size.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
Pre-merge checks
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
🧹 Nitpick comments (4)
lib/llm/bin/bench_local_transfer_v2.rs (1)
115-115: Make required backends configurable (fallback-friendly)

Requiring both "POSIX" and "GDS_MT" can fail on machines without GDS. Consider a CLI flag to pick backends and a graceful fallback to the subset available.
Example:
```diff
-    let agent = NixlAgent::require_backends("test_agent", &["POSIX", "GDS_MT"])?;
+    let backends: Vec<&str> = std::env::var("KVBM_BACKENDS")
+        .ok()
+        .map(|s| s.split(',').collect())
+        .unwrap_or_else(|| vec!["POSIX","GDS_MT"]);
+    let agent = NixlAgent::require_backends("bench_agent", &backends)?;
```

If you prefer Clap, add a --backends "POSIX,GDS_MT" option.
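For the Clap route, this is a minimal sketch of what such a flag could look like; the Args struct, field name, and default are illustrative, and the commented-out require_backends call is the existing constructor referenced above:

```rust
use clap::Parser;

// Sketch of the suggested flag; struct and field names are illustrative,
// not the benchmark's existing CLI.
#[derive(Parser)]
struct Args {
    /// Comma-separated NIXL backends to require, e.g. "POSIX,GDS_MT".
    #[arg(long, default_value = "POSIX,GDS_MT")]
    backends: String,
}

fn main() {
    let args = Args::parse();
    let backends: Vec<&str> = args.backends.split(',').map(str::trim).collect();
    // The constructor referenced above would then take the parsed list:
    // let agent = NixlAgent::require_backends("bench_agent", &backends)?;
    println!("requested backends: {backends:?}");
}
```

Parsing into a plain String and splitting manually keeps the sketch free of extra clap features; a value_delimiter on the arg would also work.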
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (3)
182-182: Prefer a descriptive type over (Vec<usize>, bool)

TransferGroup = (Vec<usize>, bool) obscures the meaning of the boolean (which bounce-buffer half). Use an enum or a small struct for clarity and type safety.
Example:
```diff
-type TransferGroup = (Vec<usize>, bool);
+enum BounceHalf { A, B }
+struct TransferGroup {
+    src_ids: Vec<usize>,
+    half: BounceHalf,
+}
```
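For illustration, a call site can then match on the half by name instead of branching on an anonymous bool; this sketch only demonstrates the shape, and the offset helper and values are hypothetical:

```rust
// Sketch only: shows how the named bounce-buffer half reads at a call site.
enum BounceHalf {
    A,
    B,
}

struct TransferGroup {
    src_ids: Vec<usize>,
    half: BounceHalf,
}

// Hypothetical helper: where a group's blocks start inside the bounce buffer.
fn bounce_offset(group: &TransferGroup, half_len: usize) -> usize {
    match group.half {
        BounceHalf::A => 0,
        BounceHalf::B => half_len,
    }
}

fn main() {
    let first = TransferGroup { src_ids: vec![0, 1], half: BounceHalf::A };
    let second = TransferGroup { src_ids: vec![2, 3], half: BounceHalf::B };
    // With a half length of 4 blocks, the two groups target disjoint regions.
    assert_eq!(bounce_offset(&first, 4), 0);
    assert_eq!(bounce_offset(&second, 4), 4);
    assert_eq!(first.src_ids.len() + second.src_ids.len(), 4);
}
```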
184-269: Buffered double-buffering is sound; consider cancellation and governance
- The A/B split prevents aliasing and enables pipelining. Good.
- try_join_all awaits both stages but doesn’t cancel in-flight ops on first error; upstream operations may continue writing to the bounce buffer.
Suggestions:
- Use a JoinSet or FuturesUnordered and cancel the sibling future on error (see the sketch after the diff below).
- Integrate with existing TransferBatcher/LocalTransferManager to respect global concurrency limits instead of ad-hoc try_join_all. Based on learnings.
If you keep local scheduling, at least short-circuit:
```diff
-    futures::future::try_join_all(futures).await?;
+    if let Err(e) = futures::future::try_join_all(futures).await {
+        // TODO: plumb cancellation to direct transfers if supported.
+        return Err(e);
+    }
```

Based on learnings.
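As a self-contained illustration of the JoinSet suggestion: dropping a tokio JoinSet aborts any tasks still running, so returning on the first error both short-circuits and cancels the sibling stage. The stage futures and String error type here are placeholders, not the executor's real types:

```rust
use std::future::Future;
use std::pin::Pin;

use tokio::task::JoinSet;

type Stage = Pin<Box<dyn Future<Output = Result<(), String>> + Send>>;

// Sketch: run the per-half stage futures in a JoinSet so the first error
// aborts the remaining in-flight tasks instead of letting them keep
// writing into the bounce buffer.
async fn run_stages(stages: Vec<Stage>) -> Result<(), String> {
    let mut set = JoinSet::new();
    for stage in stages {
        set.spawn(stage);
    }
    while let Some(joined) = set.join_next().await {
        // Propagate the first failure; returning early drops `set`,
        // which aborts any tasks that are still running.
        joined.map_err(|e| e.to_string())??;
    }
    Ok(())
}

#[tokio::main]
async fn main() {
    let stages: Vec<Stage> = vec![
        Box::pin(async { Ok::<(), String>(()) }),
        Box::pin(async { Err::<(), String>("stage failed".to_string()) }),
    ];
    assert!(run_stages(stages).await.is_err());
}
```

Whether aborting the task actually stops an already-submitted NIXL/CUDA operation depends on the underlying transfer API, so cancellation may still need plumbing into the direct-transfer path.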
315-336: Single-bounce-block path is strictly per-block; batch to reduce overhead

Looping one block at a time increases launch/dispatch overhead. When num_bounce_blocks == 1 you can still batch by chunking src/dst ids.
Example:
```diff
-        for (src_block_id, dst_block_id) in src_block_ids.iter().zip(dst_block_ids.iter()) {
+        for (src_chunk, dst_chunk) in src_block_ids.chunks(1).zip(dst_block_ids.chunks(1)) {
             if let Err(e) = execute_two_hop_transfer_chunk(
                 &src_clone,
                 bounce_buffer_spec.layout(),
                 &dst_clone,
-                &[*src_block_id],
+                src_chunk,
                 &[bounce_block],
-                &[*dst_block_id],
+                dst_chunk,
```

This keeps semantics but is future-proof if you later grow the chunk size.
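A standalone illustration of that chunking pattern (the block IDs and helper are placeholders, not the executor's real call):

```rust
// Sketch: pair source and destination block IDs in fixed-size chunks so the
// per-chunk transfer call can later batch more than one block at a time.
fn chunk_pairs(src: &[usize], dst: &[usize], chunk_size: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    src.chunks(chunk_size)
        .zip(dst.chunks(chunk_size))
        .map(|(s, d)| (s.to_vec(), d.to_vec()))
        .collect()
}

fn main() {
    let src = [10, 11, 12, 13, 14];
    let dst = [20, 21, 22, 23, 24];
    // chunk_size = 1 reproduces the current per-block behaviour; larger
    // values batch blocks without changing the pairing.
    assert_eq!(chunk_pairs(&src, &dst, 1).len(), 5);
    assert_eq!(chunk_pairs(&src, &dst, 2).len(), 3);
}
```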
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
lib/llm/Cargo.toml (3 hunks)
lib/llm/bin/bench_local_transfer_v2.rs (1 hunks)
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-18T21:41:02.263Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#2989
File: lib/llm/src/block_manager/distributed/transfer.rs:59-60
Timestamp: 2025-09-18T21:41:02.263Z
Learning: The codebase has a robust two-layer transfer management system: TransferBatcher (offload/pending.rs:400-485) handles batching large transfers into MAX_TRANSFER_BATCH_SIZE chunks, and LocalTransferManager (offload/pending.rs:280-286) limits concurrency to MAX_CONCURRENT_TRANSFERS using FuturesUnordered.
Applied to files:
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs
📚 Learning: 2025-09-18T21:41:02.263Z
Learnt from: oandreeva-nv
PR: ai-dynamo/dynamo#2989
File: lib/llm/src/block_manager/distributed/transfer.rs:59-60
Timestamp: 2025-09-18T21:41:02.263Z
Learning: ConnectorTransferBatcher in distributed/transfer.rs duplicates existing batching logic and uses try_join_all(), which spawns unlimited concurrent transfers and bypasses the existing concurrency control systems. The proper solution is to integrate distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline rather than implementing separate batching logic.
Applied to files:
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs
🧬 Code graph analysis (2)
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (3)
lib/llm/src/block_manager/v2/physical/manager/mod.rs (1)
context (209-211)
lib/llm/src/block_manager/v2/physical/transfer/executor/nixl.rs (2)
src (70-84)
layer_range (178-181)
lib/llm/src/block_manager/v2/physical/transfer/tests/local_transfers.rs (1)
layer_range (81-87)
lib/llm/bin/bench_local_transfer_v2.rs (2)
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (2)
execute_transfer (34-75)
new (373-377)
lib/llm/src/block_manager/v2/physical/transfer/nixl_agent/mod.rs (1)
require_backends(114-155)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: clippy (lib/runtime/examples)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
lib/llm/Cargo.toml (3)
25-25: Feature wiring looks good

block-manager-bench composes existing features and correctly marks clap/indicatif as optional. No issues.
104-107: Versions verified; disabling clap defaults remains optional

Both crate versions exist: clap 4.5.49 is the second-latest stable release (4.5.50 is current), and indicatif 0.18.0 is published. If binary size matters, you may still consider disabling clap's default features, but this is optional.
179-182: All concerns verified as correct; no action needed

The workspace tokio configuration already enables the "full" feature set, which includes both "rt-multi-thread" and "macros" required for the #[tokio::main] macro. The bin target is properly gated, and all dependencies are correctly configured.

lib/llm/bin/bench_local_transfer_v2.rs (1)
106-112: Review comment is incorrect regarding unit conversion; the bandwidth formula is already correct

The current implementation is mathematically sound for GB/s. The calculation bytes / mean.as_nanos() correctly yields GB/s because:
- as_nanos() returns the elapsed time as a nanosecond count
- 1 byte per nanosecond is 10⁹ bytes per second, i.e. 1 GB/s, so dividing bytes by nanoseconds gives the bandwidth directly in GB/s

The suggested fix to divide by 1e9 would produce B/s, not GB/s, making the return value inconsistent with the function name and output label.

However, there is a valid concern: the total_bytes calculation hardcodes * 2, but the LayoutConfig specifies both outer_dim (2) and dtype_width_bytes (2). It's unclear whether the formula should multiply by 4 (both factors) or by 2 (one factor), and the calculation doesn't reference the actual config values. This warrants clarification to ensure total_bytes correctly represents the actual transferred data size.

Likely an incorrect or invalid review comment.
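A tiny sketch of the unit argument above; the function and numbers are illustrative, not taken from the benchmark:

```rust
// Sketch: bytes divided by nanoseconds is numerically equal to GB/s,
// since 1 byte/ns = 1e9 bytes/s = 1 GB/s (decimal GB).
fn bandwidth_gb_per_s(total_bytes: u64, elapsed_nanos: u128) -> f64 {
    total_bytes as f64 / elapsed_nanos as f64
}

fn main() {
    // Roughly 4 GiB transferred in one second (1e9 ns) is ~4.29 GB/s.
    let bytes = 4u64 * 1024 * 1024 * 1024;
    let gbps = bandwidth_gb_per_s(bytes, 1_000_000_000);
    assert!((gbps - 4.294967296).abs() < 1e-9);
    println!("{gbps:.2} GB/s");
}
```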