
Conversation

@oandreeva-nv (Contributor) commented Dec 3, 2025

Overview:

Resurrection of #4356 with fixes:

cudaMemcpyBatchAsync needs to be called differently depending on the CUDA version

Details:

The files changed/introduced in this PR can be divided into several groups; the main ones are:

Group 1: New KVBM Kernels Library

The core of this PR - a new standalone CUDA kernels library:

lib/kvbm-kernels/                                                                          
 ├── Cargo.toml                                    # [A] Package definition
 ├── build.rs                                      # [A] Build script (nvcc/prebuilt)
 ├── README.md                                     # [A] Documentation          
 ├── src/                                                                                   
 │   ├── lib.rs                                    # [A] Rust module
 │   └── tensor_kernels.rs                         # [A] Rust FFI wrappers     
 └── cuda/                                                                                  
     ├── tensor_kernels.cu                         # [A] CUDA layout conversions        
     ├── vectorized_copy.cu                        # [A] CUDA optimized memcpy
     └── prebuilt/                                                                          
         ├── tensor_kernels.fatbin                 # [A] Precompiled GPU binary         
         ├── libtensor_kernels.a                   # [A] Static library (host + device)
         ├── tensor_kernels.md5                    # [A] Checksum (.cu + .fatbin)
         ├── vectorized_copy.fatbin                # [R] Moved from lib/llm
         └── vectorized_copy.md5                   # [A] Checksum (.cu + .fatbin)  

Since static libraries (.a) are architecture-dependent, I only provide them for x86_64.

Current Behavior (which should be what you want); see the build.rs sketch after the two cases below:

When building from source on ARM with nvcc:

  1. cc::Build compiles all .cu files and creates a library in OUT_DIR
  2. This library gets linked into your Rust binary for the current build
  3. You CAN build and run on ARM just fine
  4. We skip generate_prebuilt_artifacts for tensor_kernels, so we don't save the ARM .a file to cuda/prebuilt/

When using prebuilt mode on ARM (no nvcc):

  • Fails with a clear error, because the shipped cuda/prebuilt/libtensor_kernels.a is x86_64-only
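
A rough sketch of how build.rs makes this choice (simplified and illustrative, not the actual script; the env var, feature name, and file names match the ones used elsewhere in this PR):

// Simplified sketch of the build-mode selection in build.rs; illustrative only.
use std::process::Command;

fn is_nvcc_available() -> bool {
    Command::new("nvcc").arg("--version").output().is_ok()
}

fn main() {
    // Precedence (per the review notes): feature flag, then env var, then nvcc auto-detection.
    let prebuilt_requested = std::env::var("CARGO_FEATURE_PREBUILT_KERNELS").is_ok()
        || std::env::var("DYNAMO_USE_PREBUILT_KERNELS").is_ok();

    if !prebuilt_requested && is_nvcc_available() {
        // Source build: nvcc compiles the .cu files into a library in OUT_DIR,
        // which is linked into the crate; this works on any architecture with nvcc.
        cc::Build::new()
            .cuda(true)
            .file("cuda/tensor_kernels.cu")
            .file("cuda/vectorized_copy.cu")
            .compile("tensor_kernels");
    } else {
        // Prebuilt mode: validate cuda/prebuilt/libtensor_kernels.a against its .md5,
        // copy it into OUT_DIR, and emit link directives (x86_64-only; other
        // architectures without nvcc fail here with a clear error).
    }
    println!("cargo:rerun-if-changed=cuda");
}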

Group 2: Python/PyTorch Bindings

Expose CUDA kernels to Python for testing:

lib/bindings/kvbm/                                                                                                                                                                       
├── Cargo.toml                                    # [M] Added kernels dependency                                                                                                         
├── Cargo.lock                                    # [M] Updated deps                                                                                                                     
├── python/kvbm/__init__.py                       # [M] Export kernels module                                                                                                            
├── src/                                                                                                                                                                                 
│   ├── lib.rs                                    # [M] Register kernels submodule                                                                                                       
│   └── kernels.rs                                # [A] PyO3 bindings (NEW)
└── tests/
    ├── test_tensor_kernels.py                    # [A] PyTorch validation tests (NEW)
    └── README.md                                 # [A] Testing architecture docs (NEW - we just added this!)

Purpose: PyO3 bindings to test kernels through PyTorch
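
Roughly, the wiring looks like this on the Rust side (a sketch assuming PyO3's current Bound-module API; argument lists and error handling are elided, so this is not a copy of the actual kernels.rs/lib.rs):

// Sketch of how the kernels submodule is exposed (illustrative only).
use pyo3::prelude::*;

#[pyfunction]
fn block_to_universal(/* tensors, layout, device index, ... */) -> PyResult<()> {
    // validate tensors, marshal device pointers, call the FFI wrapper, synchronize
    Ok(())
}

// kernels.rs: register the four pyfunctions on a submodule.
pub fn add_to_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(block_to_universal, m)?)?;
    // ... universal_to_block, block_to_operational, operational_to_block ...
    Ok(())
}

// lib.rs (behind the block-manager feature): attach the submodule to _core so
// Python can do `from kvbm._core import kernels`, re-exported as kvbm.kernels.
fn register_kernels(py: Python<'_>, core: &Bound<'_, PyModule>) -> PyResult<()> {
    let kernels = PyModule::new(py, "kernels")?;
    add_to_module(&kernels)?;
    core.add_submodule(&kernels)?;
    Ok(())
}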

Group 3: Integration with LLM Engine

Connect open-sourced vectorized copy to existing Dynamo code:

  lib/llm/
  ├── build.rs                                      # [M] Update default location for vectorized copy
  └── src/block_manager/block/transfer/cuda.rs      # [M] Updated debug prints for better debugging

Purpose: Replace old implementation with new kernel library

Group 4: Build & CI Infrastructure

Enable building and testing:

  .github/workflows/pre-merge-rust.yml              # [M] Add kvbm-kernels to CI matrix
  container/Dockerfile.vllm                          # [M] Add CUDA dev tools (nvcc, nvlink, etc.)

Purpose: CI/CD and container support for kernel compilation in dev containers

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added CUDA-accelerated tensor kernel operations for KV block manager conversions between block, universal, and operational representations
    • Support for multiple data types (float16, bfloat16, float32, float64) and block layouts with configurable backends
  • Chores

    • Extended CI workflows for container validation and code formatting verification
    • Added kvbm-kernels as workspace component; updated cudarc dependency
  • Tests

    • Added comprehensive tensor kernel regression test suite validating round-trip conversions and error handling


@oandreeva-nv oandreeva-nv requested review from a team as code owners December 3, 2025 22:45
copy-pr-bot bot commented Dec 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@oandreeva-nv (Contributor, Author) commented:

/ok to test f2353f3


coderabbitai bot commented Dec 3, 2025

Walkthrough

This PR introduces a comprehensive KVBM (KV Block Manager) kernels subsystem featuring CUDA tensor operations for converting between block-structured, universal, and operational tensor layouts. A new lib/kvbm-kernels crate provides CUDA kernels, Rust wrappers, and Python bindings. Workflows are updated to include the crate in build and validation pipelines.

Changes

• Workflow Configuration
  Files: .github/workflows/container-validation-dynamo.yml, .github/workflows/pre-merge-rust.yml
  Summary: Added --enable-kvbm flag to container build step; expanded Rust matrix to include lib/kvbm-kernels; added rustfmt toolchain and code formatting verification step.

• Workspace & Bindings Manifests
  Files: Cargo.toml, lib/bindings/kvbm/Cargo.toml
  Summary: Registered lib/kvbm-kernels as workspace member; added optional kvbm_kernels dependency to kvbm bindings with block-manager feature wiring; updated cudarc from 0.16.2 to 0.17.1.

• KVBM Kernels Crate Core
  Files: lib/kvbm-kernels/Cargo.toml, lib/kvbm-kernels/build.rs, lib/kvbm-kernels/src/lib.rs, lib/kvbm-kernels/src/tensor_kernels.rs
  Summary: Established new crate with dual rlib/cdylib output; build script handles prebuilt kernel validation via MD5 hashing and CUDA compilation; Rust module exposes public enums (TensorDataType, BlockLayout, OperationalCopyDirection, OperationalCopyBackend) and unsafe FFI wrappers (universal_from_block, block_from_universal, operational_copy); a rough sketch of this surface follows the list.

• CUDA Tensor Kernels
  Files: lib/kvbm-kernels/cuda/tensor_kernels.cu, lib/kvbm-kernels/cuda/prebuilt/tensor_kernels.md5
  Summary: Implements three kernels (block_to_universal_kernel, universal_to_block_kernel, operational_pack/unpack_kernel) with multi-type support (F16, BF16, F32, F64); exposes C-callable entry points with CUDA 12.9+ memcpy batch support; adds prebuilt MD5 hash file for artifact validation.

• Python Bindings & Integration
  Files: lib/bindings/kvbm/python/kvbm/__init__.py, lib/bindings/kvbm/src/kernels.rs
  Summary: Exports kernels submodule from PyO3 bindings; implements four pyfunctions (block_to_universal, universal_to_block, block_to_operational, operational_to_block) with CUDA context management, tensor validation, pointer marshalling, and error handling.

• Tests & Documentation
  Files: lib/bindings/kvbm/tests/test_tensor_kernels.py, lib/kvbm-kernels/README.md, lib/kvbm-kernels/Dockerfile
  Summary: Comprehensive PyTorch/CUDA regression tests covering block↔universal/operational round-trips, dtype/layout combinations, backend variations, and error conditions; documentation covering layouts, API usage, build/test workflows, and troubleshooting; Docker setup for reproducible CUDA builds.
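
For orientation, the surface described under "KVBM Kernels Crate Core" might look roughly like this; the names come from the summary above, while the discriminant values, variant spellings, and wrapper signature are assumptions:

// Sketch only: not the crate's actual API.
#[repr(i32)]
pub enum TensorDataType { F16 = 0, BF16 = 1, F32 = 2, F64 = 3 }

#[repr(i32)]
pub enum BlockLayout { Nhd = 0, Hnd = 1 }

// There is also an OperationalCopyDirection enum for the pack/unpack direction.
#[repr(i32)]
pub enum OperationalCopyBackend { Auto = 0, KernelOnly = 1, MemcpyAsync = 2, MemcpyBatch = 3 }

// Unsafe wrapper: the caller guarantees the device pointer arrays and stream are valid.
pub unsafe fn universal_from_block(/* dims, dtype, layout, pointer arrays, stream */) -> Result<(), i32> {
    // converts the enums to i32 flags and calls launch_universal_from_block over FFI
    todo!()
}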

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

  • build.rs: Prebuilt kernel validation logic with MD5 hash computation and comparison; conditional CUDA compilation path requires verification of arch flag handling and linking directives.
  • lib/kvbm-kernels/cuda/tensor_kernels.cu: Multi-layout index calculation kernels and backend dispatch logic (KernelOnly, MemcpyAsync, MemcpyBatch) with CUDA version-specific branching; validate correctness of grid/block dimensions and pointer offset logic.
  • lib/bindings/kvbm/src/kernels.rs: Extensive tensor metadata extraction and validation; pointer buffer preparation; ensure proper CUDA synchronization and exception mapping.
  • lib/kvbm-kernels/src/tensor_kernels.rs: FFI wrapper correctness and enum-to-i32 flag conversions; internal test coverage for round-trip data integrity.

Poem

🐰 CUDA kernels hop and play,
Blocks transform in every way!
From universal to ops so sleek,
With Rust and Python—it's unique!
🚀 Tensors now have wings to fly,
Building faster, reaching sky! ✨

Pre-merge checks

✅ Passed checks (3 passed)
• Title check: ✅ Passed. The title clearly and specifically describes the main change: re-introducing KVBM kernels after a previous revert, capturing the primary objective of the PR.
• Docstring Coverage: ✅ Passed. Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
• Description check: ✅ Passed. The PR description provides comprehensive details across all required sections: a clear overview of the resurrection of a previous PR with CUDA version fixes, detailed breakdown of four change groups (KVBM Kernels Library, Python/PyTorch Bindings, Integration with LLM Engine, and Build/CI Infrastructure), file structures with purpose annotations, and guidance for reviewers on where to start.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (12)
lib/kvbm-kernels/Dockerfile (2)

21-21: Consider pinning Rust toolchain version for reproducibility.

Using --default-toolchain stable may lead to different Rust versions across builds. Consider pinning to a specific version (e.g., 1.83.0) for reproducible builds.


27-30: Consider copying Cargo.lock for reproducible dependency versions.

If this crate has a Cargo.lock file, copying it ensures consistent dependency versions across builds.

 COPY Cargo.toml ./
+COPY Cargo.lock ./
 COPY src ./src
 COPY build.rs ./build.rs
 COPY cuda ./cuda
lib/kvbm-kernels/cuda/prebuilt/tensor_kernels.md5 (1)

1-3: Consider documenting what each hash validates.

The file contains three MD5 hashes without context. Consider using a format like <hash> <filename> (standard md5sum output format) or adding a header comment to indicate which files these hashes correspond to.

+# MD5 checksums for prebuilt kernel artifacts
+# Format: <hash>  <filename>
-7aac008ed704fe198ef5056a4e502069
-a7d1649c148ee0366de6d19896a80116
-4234cfd2ef4b283592a37a9ceed94666
+7aac008ed704fe198ef5056a4e502069  tensor_kernels.fatbin
+a7d1649c148ee0366de6d19896a80116  <corresponding_file>
+4234cfd2ef4b283592a37a9ceed94666  <corresponding_file>
lib/bindings/kvbm/python/kvbm/__init__.py (1)

14-17: Consider guarding kernels import for non‑block‑manager builds

Exposing kvbm.kernels via from kvbm._core import kernels as kernels is the right hook for the new submodule. If there’s any scenario where _core is built without the block-manager feature, this import will raise at import time; you could optionally make it defensive, e.g.:

try:
    from kvbm._core import kernels as kernels
except ImportError:
    kernels = None  # or omit export / re-raise with a clearer message

Not required if block-manager is guaranteed to be enabled for all builds, but worth considering.

.github/workflows/pre-merge-rust.yml (1)

69-73: cargo fmt -- --check is now run twice per directory

Both the tests and clippy jobs install rustfmt and run cargo fmt -- --check for each matrix.dir. Functionally fine, but it doubles formatting work for the overlapping dirs (., lib/bindings/python, etc.). If CI time becomes a concern, you could centralize the formatting check in just one of the jobs (or a tiny dedicated job) and keep clippy/tests focused on their respective tasks.

Also applies to: 120-123

lib/bindings/kvbm/tests/test_tensor_kernels.py (2)

30-50: Unused unpacking variables in _make_blocks.

The variables nh, nt, and hd are extracted but never used in this function. Consider using underscores to indicate intentional non-use, which also satisfies the Ruff linter.

 def _make_blocks(universal: torch.Tensor, layout: str) -> List[torch.Tensor]:
-    nh, nl, no, nt, hd = universal.shape
+    _nh, nl, no, _nt, _hd = universal.shape
     blocks = []
     for layer in range(nl):

104-118: Add strict=True to zip() calls for safer iteration.

Multiple zip() calls lack the strict= parameter. Adding strict=True would catch length mismatches early rather than silently truncating, improving test reliability.

-    for produced, expected in zip(outputs, universals):
+    for produced, expected in zip(outputs, universals, strict=True):
         assert torch.allclose(produced, expected, atol=atol, rtol=rtol)
-    for produced_set, expected_set in zip(blocks, expected_blocks):
-        for produced, expected in zip(produced_set, expected_set):
+    for produced_set, expected_set in zip(blocks, expected_blocks, strict=True):
+        for produced, expected in zip(produced_set, expected_set, strict=True):
             assert torch.allclose(produced, expected, atol=atol, rtol=rtol)

Similar changes should be applied at lines 157, 171-172.

lib/kvbm-kernels/build.rs (1)

68-70: Consider more robust nvcc detection.

The current check only verifies that nvcc --version can be spawned. It doesn't distinguish between nvcc not found vs. nvcc crashing. Consider checking exit status for more precise detection.

 fn is_nvcc_available() -> bool {
-    Command::new("nvcc").arg("--version").output().is_ok()
+    Command::new("nvcc")
+        .arg("--version")
+        .output()
+        .map(|o| o.status.success())
+        .unwrap_or(false)
 }
lib/bindings/kvbm/src/kernels.rs (4)

4-4: Consider scoping the unsafe_op_in_unsafe_fn allow more narrowly.

The file-level #![allow(unsafe_op_in_unsafe_fn)] suppresses warnings for the entire file. Consider applying it only to the specific functions that need it, or adding // SAFETY: comments to document the unsafe invariants.


226-362: Consider extracting shared validation logic.

block_to_universal and universal_to_block share nearly identical validation and setup code (lines 232-301 vs 375-444). Consider extracting the common pattern into a helper to reduce duplication and maintenance burden.

A helper like prepare_universal_transfer(py, universals, blocks, layout) could return validated infos and prepared pointer buffers, with the direction-specific kernel call remaining in each function.

Also applies to: 369-505


516-660: Similar duplication in operational functions.

block_to_operational and operational_to_block also share significant code. The same refactoring suggestion applies — extract common validation and pointer preparation into a shared helper.

Also applies to: 670-814


608-616: Host pointers created but may not be needed for all backends.

block_ptrs_host and operational_ptrs_host are always allocated, but they may only be needed for specific backends (e.g., MemcpyBatch). Consider lazy allocation based on the selected backend to reduce unnecessary allocations.

Also applies to: 762-770

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96f8008 and f2353f3.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/kvbm/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (16)
  • .github/workflows/container-validation-dynamo.yml (1 hunks)
  • .github/workflows/pre-merge-rust.yml (2 hunks)
  • Cargo.toml (1 hunks)
  • lib/bindings/kvbm/Cargo.toml (2 hunks)
  • lib/bindings/kvbm/python/kvbm/__init__.py (1 hunks)
  • lib/bindings/kvbm/src/kernels.rs (1 hunks)
  • lib/bindings/kvbm/src/lib.rs (2 hunks)
  • lib/bindings/kvbm/tests/test_tensor_kernels.py (1 hunks)
  • lib/kvbm-kernels/Cargo.toml (1 hunks)
  • lib/kvbm-kernels/Dockerfile (1 hunks)
  • lib/kvbm-kernels/README.md (1 hunks)
  • lib/kvbm-kernels/build.rs (1 hunks)
  • lib/kvbm-kernels/cuda/prebuilt/tensor_kernels.md5 (1 hunks)
  • lib/kvbm-kernels/cuda/tensor_kernels.cu (1 hunks)
  • lib/kvbm-kernels/src/lib.rs (1 hunks)
  • lib/kvbm-kernels/src/tensor_kernels.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (11)
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

  • .github/workflows/pre-merge-rust.yml
  • .github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.

Applied to files:

  • .github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • .github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-12-03T01:14:42.094Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4657
File: container/Dockerfile.vllm:264-265
Timestamp: 2025-12-03T01:14:42.094Z
Learning: In container/Dockerfile.vllm, the recursive chmod -R g+w operation during early user setup (at lines 264-265, when creating the dynamo user and initializing /workspace, /home/dynamo, /opt/dynamo) is an intentional exception to the pattern of avoiding recursive operations, as it handles pre-existing paths and dotfiles created by useradd -m before bulk content is copied.

Applied to files:

  • .github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-09-16T17:16:07.820Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.vllm.j2:221-233
Timestamp: 2025-09-16T17:16:07.820Z
Learning: In the dynamo project, when converting Dockerfiles to Jinja2 templates, the primary goal is backward compatibility - generated Dockerfiles must be identical to the originals. Security improvements or other enhancements that would change the generated output are out of scope for templating PRs and should be addressed separately.

Applied to files:

  • .github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.

Applied to files:

  • .github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-11-15T06:06:37.067Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 4356
File: lib/kvbm-kernels/src/python.rs:23-34
Timestamp: 2025-11-15T06:06:37.067Z
Learning: In lib/kvbm-kernels/src/python.rs, the CUDA context is hard-coded to device 0 in get_context(). This is a known limitation that should be addressed when Python bindings are re-enabled by either: (1) enforcing device_index == 0 validation with clear error messages, or (2) implementing per-device context management using a map/vec of OnceCell<Arc<CudaContext>> keyed by device_index.

Applied to files:

  • lib/kvbm-kernels/README.md
  • lib/bindings/kvbm/src/lib.rs
  • lib/kvbm-kernels/Cargo.toml
  • lib/bindings/kvbm/src/kernels.rs
📚 Learning: 2025-08-18T20:51:51.324Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/pipeline/network/egress/push_router.rs:0-0
Timestamp: 2025-08-18T20:51:51.324Z
Learning: The runtime crate cannot depend on the llm crate due to architectural dependency constraints, preventing imports from lib/llm into lib/runtime.

Applied to files:

  • lib/bindings/kvbm/src/lib.rs
📚 Learning: 2025-10-17T22:57:25.021Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 3700
File: lib/llm/src/block_manager/numa_allocator.rs:46-64
Timestamp: 2025-10-17T22:57:25.021Z
Learning: The dynamo library (specifically the llm/block_manager module) is developed to run on Linux only, so Linux-specific system calls and APIs can be used without platform guards.

Applied to files:

  • lib/bindings/kvbm/src/lib.rs
  • lib/bindings/kvbm/Cargo.toml
📚 Learning: 2025-09-19T18:21:03.693Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2746
File: lib/llm/src/grpc/service/tensor.rs:269-447
Timestamp: 2025-09-19T18:21:03.693Z
Learning: In lib/llm/src/grpc/service/tensor.rs, GuanLuo prefers using unsafe Vec::from_raw_parts operations to avoid data copying for performance reasons when converting raw tensor data from Vec<u8> to typed vectors like Vec<u32>, Vec<f32>, etc., even when it involves safety risks.

Applied to files:

  • lib/kvbm-kernels/src/lib.rs
  • lib/kvbm-kernels/src/tensor_kernels.rs
📚 Learning: 2025-09-10T22:47:29.325Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/block/transfer/cuda.rs:528-539
Timestamp: 2025-09-10T22:47:29.325Z
Learning: CUDA kernel symbol names in cuModuleGetFunction calls must exactly match the function name in the kernel source code, even if it uses different spelling conventions (like British vs American English) compared to other parts of the codebase.

Applied to files:

  • lib/kvbm-kernels/cuda/tensor_kernels.cu
🧬 Code graph analysis (4)
lib/bindings/kvbm/src/lib.rs (1)
lib/bindings/kvbm/src/kernels.rs (1)
  • add_to_module (816-822)
lib/bindings/kvbm/python/kvbm/__init__.py (1)
lib/bindings/kvbm/src/lib.rs (1)
  • _core (25-53)
lib/kvbm-kernels/src/lib.rs (1)
lib/kvbm-kernels/src/tensor_kernels.rs (3)
  • block_from_universal (148-176)
  • operational_copy (186-218)
  • universal_from_block (116-144)
lib/kvbm-kernels/src/tensor_kernels.rs (1)
lib/kvbm-kernels/cuda/tensor_kernels.cu (6)
  • launch_universal_from_block (391-415)
  • launch_universal_from_block (392-394)
  • launch_block_from_universal (417-441)
  • launch_block_from_universal (418-420)
  • launch_operational_copy (450-615)
  • launch_operational_copy (451-454)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4742/merge) by oandreeva-nv.
lib/kvbm-kernels/cuda/tensor_kernels.cu

[error] 1-1: clang-format hook failed in pre-commit: files were modified by this hook. Re-run 'pre-commit run --all-files' to format, then commit the updated changes. Command: 'pre-commit run --show-diff-on-failure --color=always --all-files'.

🪛 LanguageTool
lib/kvbm-kernels/README.md

[grammar] ~144-~144: Ensure spelling is correct
Context: ... tensors must be CUDA accessible by the specificed device and match the expected shapes an...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.18.1)
lib/kvbm-kernels/README.md

46-46: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


177-177: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🪛 Ruff (0.14.7)
lib/bindings/kvbm/tests/test_tensor_kernels.py

38-38: Unpacked variable nh is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


38-38: Unpacked variable nt is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


38-38: Unpacked variable hd is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


48-48: Avoid specifying long messages outside the exception class

(TRY003)


104-104: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


116-116: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


117-117: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


157-157: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


171-171: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


172-172: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: tests (.)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (22)
lib/kvbm-kernels/cuda/tensor_kernels.cu (2)

491-494: Silent success when block_ptrs_device is null may mask errors.

Returning cudaSuccess when block_ptrs_device is null silently skips the kernel work. If this is intentional for a "host-only" path, consider adding a comment. Otherwise, this could mask caller errors.


139-176: LGTM!

The index decomposition correctly maps the linear residual to the 5D [nh, nl, no, nt, hd] coordinate space in row-major order, and the block_inner_offset correctly handles both NHD and HND layouts.
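
For reference, the row-major decomposition described here corresponds to the following host-side sketch (the kernel performs the equivalent arithmetic per thread on the device):

// Reference-only sketch of decomposing a linear index into [nh, nl, no, nt, hd]
// coordinates in row-major order (last dimension varies fastest).
fn decompose(mut idx: usize, dims: [usize; 5]) -> [usize; 5] {
    let [_nh, nl, no, nt, hd] = dims;
    let mut coord = [0usize; 5];
    coord[4] = idx % hd; idx /= hd;
    coord[3] = idx % nt; idx /= nt;
    coord[2] = idx % no; idx /= no;
    coord[1] = idx % nl; idx /= nl;
    coord[0] = idx; // whatever remains is the head index
    coord
}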

lib/kvbm-kernels/src/tensor_kernels.rs (3)

22-60: LGTM!

The enum values correctly match their CUDA counterparts, and #[repr(i32)] ensures proper FFI compatibility. Based on learnings, exact matching of enum values across the FFI boundary is critical for correct kernel dispatch.


62-106: LGTM!

The FFI function signatures correctly translate the C pointer types. void* const* maps to *const *mut c_void and const void* const* maps to *const *const c_void.


220-531: Test coverage is good for the basic path.

The roundtrip test validates Block↔Universal↔Operational conversions with data integrity checks. Consider adding coverage for other data types (F16, BF16) and HND layout in follow-up work.

Cargo.toml (1)

19-19: LGTM!

The lib/kvbm-kernels crate is correctly added to workspace members but excluded from default-members, which is appropriate since it requires CUDA toolchain to build.

.github/workflows/container-validation-dynamo.yml (1)

46-46: Enabling kvbm in the container build is appropriate

Wiring --enable-kvbm into the validation image build aligns with the new kernels crate and bindings so tests can see the functionality. As long as container/build.sh cleanly handles this flag (no-op when disabled / unsupported), this looks good.

lib/bindings/kvbm/src/lib.rs (1)

16-19: kernels submodule wiring into _core looks correct

The new mod kernels plus the PyModule::new("kernels") / add_submodule block under the same #[cfg(feature = "block-manager")] gate is consistent with the rest of the bindings and with the kvbm.__init__ export. This gives a clean kvbm._core.kernels surface for the Python package to re-export.

Also applies to: 45-50

.github/workflows/pre-merge-rust.yml (1)

95-96: Adding lib/kvbm-kernels to the clippy matrix is a good call

Including lib/kvbm-kernels in the clippy matrix.dir ensures the new kernels crate is linted explicitly, in addition to any workspace-level checks from ..

lib/kvbm-kernels/src/lib.rs (1)

4-9: Top‑level re‑exports from tensor_kernels give a clean API

Re-exporting BlockLayout, OperationalCopy*, TensorDataType, and the three core functions at the crate root while still exposing tensor_kernels as a module provides both convenience and flexibility for users. This layout looks solid.

lib/kvbm-kernels/Cargo.toml (1)

4-31: Crate manifest for dynamo-kvbm-kernels looks well-structured

The manifest cleanly separates features (testing-cuda, prebuilt-kernels), uses workspace metadata, and exposes both rlib and cdylib for reuse from Rust and bindings. Build-time deps (dynamo-config, cc, md5) are appropriate for the NVCC-based build.rs flow.

lib/bindings/kvbm/Cargo.toml (1)

22-25: Cudarc version mismatch: bindings override workspace default with older 0.17.1

The feature wiring for kvbm_kernels is correct, but the cudarc version consistency check reveals a problem. The workspace defines cudarc = "0.17.8" at the root level, but lib/bindings/kvbm/Cargo.toml (line 72) and lib/bindings/python/Cargo.toml (line 72) both explicitly pin cudarc = "0.17.1". This creates a version mismatch that could lead to duplicate or conflicting CUDA runtime dependencies in Cargo.lock. Either align both binding crates to use { workspace = true } or update the workspace default to 0.17.1 if there's a reason to stay on the older version.

⛔ Skipped due to learnings
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 4356
File: lib/kvbm-kernels/src/python.rs:23-34
Timestamp: 2025-11-15T06:06:37.067Z
Learning: In lib/kvbm-kernels/src/python.rs, the CUDA context is hard-coded to device 0 in get_context(). This is a known limitation that should be addressed when Python bindings are re-enabled by either: (1) enforcing device_index == 0 validation with clear error messages, or (2) implementing per-device context management using a map/vec of OnceCell<Arc<CudaContext>> keyed by device_index.
lib/bindings/kvbm/tests/test_tensor_kernels.py (3)

18-27: LGTM!

The _tolerances helper correctly provides dtype-aware tolerances. The relaxed tolerances for fp16/bf16 are appropriate given the reduced precision of these formats.


53-66: LGTM!

The _call_with_backend helper correctly handles unsupported backends by translating cudaErrorNotSupported into pytest skips rather than test failures, which is appropriate for optional hardware features.


257-264: LGTM!

The empty batch test correctly verifies that the kernels handle edge cases gracefully without launching CUDA operations. Testing None return values confirms the expected no-op behavior.

lib/kvbm-kernels/build.rs (3)

44-66: LGTM!

The build mode determination logic is well-structured with clear precedence: feature flag → environment variable → nvcc auto-detection. The warning messages provide helpful diagnostics.


220-275: Generating artifacts into source tree during build.

generate_prebuilt_artifacts writes files to cuda/prebuilt/ within the source tree. This is unconventional for build scripts which typically only write to OUT_DIR. This behavior could:

  1. Fail in read-only source directories
  2. Cause unexpected git changes

If intentional for caching prebuilt artifacts, consider documenting this behavior.


277-286: LGTM!

The compute_file_hash function is straightforward and provides clear error messages on failure. MD5 is acceptable here since this is for integrity checking of build artifacts, not security.
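
As a rough picture of that flow (an assumed shape using the md5 crate's compute API, not the actual helper):

// Minimal sketch: hash a file and compare it against a recorded checksum line.
fn file_md5(path: &std::path::Path) -> std::io::Result<String> {
    let bytes = std::fs::read(path)?;
    Ok(format!("{:x}", md5::compute(bytes)))
}

fn matches_recorded(path: &std::path::Path, recorded: &str) -> std::io::Result<bool> {
    Ok(file_md5(path)? == recorded.trim())
}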

lib/bindings/kvbm/src/kernels.rs (4)

65-113: LGTM!

The tensor_info function provides thorough validation of tensor properties including CUDA residency, contiguity, device type, and dtype. The error messages are clear and actionable.


116-173: LGTM!

The collect_block_pointers function provides comprehensive validation with clear error messages for length, dtype, device, shape, and numel mismatches. The internal error check for missing expected parameters is a good defensive measure.


816-822: LGTM!

The add_to_module function correctly registers all four kernel functions for the Python module.


333-336: The pointer casting is correct and matches the FFI contract exactly. The universal_from_block function expects universal_device_ptrs: *const *mut c_void and block_device_ptrs: *const *const c_void, which lines 334-336 provide. Similarly, block_from_universal expects the opposite mutability, which lines 477-479 correctly provide. The casting pattern through usize is idiomatic Rust for FFI, and the guards (_block_guard, _univ_guard) ensure pointer validity during the FFI call. Device pointers from cudarc are guaranteed to be valid, and the memory layout is handled by the underlying CUDA runtime.

Likely an incorrect or invalid review comment.

Comment on lines +602 to +609
case OperationalCopyBackend::Auto:
default:
status = launch_kernel();
if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
status = launch_memcpy_batch();
}
if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
status = launch_memcpy_async();
}
break;

⚠️ Potential issue | 🟠 Major

Fallback on cudaErrorInvalidValue may hide dtype/parameter errors.

Auto mode falls back to memcpy paths when the kernel returns cudaErrorInvalidValue. However, if the error is due to an invalid dtype, the memcpy paths will copy raw bytes without type awareness, potentially masking the real issue. Consider only falling back on cudaErrorNotSupported.

     case OperationalCopyBackend::Auto:
     default:
       status = launch_kernel();
-      if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
+      if (status == cudaErrorNotSupported) {
         status = launch_memcpy_batch();
       }
-      if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
+      if (status == cudaErrorNotSupported) {
         status = launch_memcpy_async();
       }
       break;
🤖 Prompt for AI Agents
In lib/kvbm-kernels/cuda/tensor_kernels.cu around lines 602 to 611, the fallback
logic currently treats cudaErrorInvalidValue the same as cudaErrorNotSupported
and will try memcpy paths, which can mask real dtype/parameter errors; change
the checks so only cudaErrorNotSupported triggers falling back to
launch_memcpy_batch() and launch_memcpy_async(), and allow cudaErrorInvalidValue
to be returned (or propagated) immediately instead of attempting memcpy
fallbacks; ensure any surrounding control flow and return value handling still
correctly propagates the final cuda error.

@oandreeva-nv (Contributor, Author) commented:

/ok to test f2353f3

@oandreeva-nv oandreeva-nv force-pushed the oandreeva/kvbm-kernels branch from f2353f3 to a412159 Compare December 4, 2025 20:12
@oandreeva-nv oandreeva-nv requested a review from a team as a code owner December 4, 2025 23:16
@oandreeva-nv (Contributor, Author) commented:

/ok to test 74e9eed

ryanolson and others added 11 commits December 8, 2025 16:29
@oandreeva-nv (Contributor, Author) commented:

/ok to test 6a7fe7f

  Problem:
  When using prebuilt kernels (DYNAMO_USE_PREBUILT_KERNELS=1), the build
  would fail at link time with "undefined symbol" errors for the three
  extern "C" functions: launch_universal_from_block,
  launch_block_from_universal, and launch_operational_copy.

  Root Cause:
  The prebuilt path only copied .fatbin files to OUT_DIR. FATBIN files
  contain GPU device code but NOT the host functions (extern "C" functions
  that run on CPU). The linker could not resolve these host function symbols,
  causing undefined reference errors.

  Solution:
  Generate static libraries (.a files) alongside .fatbin files when building
  from source. These .a files contain both host and device code in a linkable
  format. In prebuilt mode, ship and link against these .a files instead of
  relying on .fatbin alone.

  Changes:
  - build.rs: Modified generate_prebuilt_artifacts() to:
    * Compile .cu files to object files (.o) with nvcc -c
    * Create static libraries (.a) from object files using ar
    * Update .md5 hashes to include .a file validation (3 lines instead of 2)

  - build.rs: Modified build_with_prebuilt_kernels() to:
    * Validate and copy .a files to OUT_DIR
    * Link against static libraries with cargo:rustc-link-lib=static
    * Add CUDA runtime (cudart) and C++ stdlib (stdc++) linking
Result:
  Prebuilt kernel mode now works identically to building from source.
  Tests pass with DYNAMO_USE_PREBUILT_KERNELS=1. No CUDA code changes needed.

  Prebuilt artifacts now include:
  - *.fatbin (GPU code for potential runtime loading)
  - lib*.a (linkable static libraries with host+device code)
  - *.md5 (validation hashes for .cu, .fatbin, and .a files)
  Prevents CI from regenerating non-reproducible .a files unnecessarily.
  Only regenerates when .cu source changes or artifacts are missing.
  Delete .md5 files to force regeneration after changing CUDA_ARCHS.
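
For reference, the linking side of the change described above boils down to build.rs emitting directives along these lines (a sketch consistent with the commit message; exact paths and names may differ):

// Sketch of the prebuilt-mode link setup described above (illustrative).
fn link_prebuilt(out_dir: &str) {
    // libtensor_kernels.a was validated against its .md5 and copied into OUT_DIR.
    println!("cargo:rustc-link-search=native={}", out_dir);
    println!("cargo:rustc-link-lib=static=tensor_kernels");
    // Host code in the .a needs the CUDA runtime and the C++ standard library;
    // in practice a search path for the CUDA toolkit's lib directory is also emitted.
    println!("cargo:rustc-link-lib=dylib=cudart");
    println!("cargo:rustc-link-lib=dylib=stdc++");
}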
ryanolson added a commit that referenced this pull request Dec 10, 2025
Merged PR #4742 (oandreeva/kvbm-kernels) which re-introduces the KVBM kernels
library after it was reverted from main.

Key changes:
- Added lib/kvbm-kernels with CUDA tensor kernel operations
- Integrated kernels module into kvbm bindings (kept alongside v2 module)
- Updated workspace to include kvbm-kernels
- Added prebuilt CUDA binaries and static libraries for x86_64
- Updated vLLM Dockerfile with CUDA dev tools
- Moved vectorized_copy.fatbin to kvbm-kernels library

Conflicts resolved:
- Cargo.toml: Added kvbm-kernels to workspace members
- lib/bindings/kvbm/Cargo.toml: Added kvbm_kernels dependency to block-manager feature
- lib/bindings/kvbm/src/lib.rs: Added kernels module alongside existing v2 module
- lib/bindings/kvbm/python/kvbm/__init__.py: Exported both kernels and v2 modules
- Cargo.lock files: Accepted PR version and will regenerate

Signed-off-by: Ryan Olson <[email protected]>
@oandreeva-nv oandreeva-nv marked this pull request as draft December 11, 2025 23:56
