feat: re-introducing kvbm kernels after revert #4742
base: main
Conversation
/ok to test f2353f3
Walkthrough

This PR introduces a comprehensive KVBM (KV Block Manager) kernels subsystem featuring CUDA tensor operations for converting between block-structured, universal, and operational tensor layouts.
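The layout conversions described in the walkthrough ultimately reduce to row-major index arithmetic. A hypothetical Python sketch of decomposing a linear element index into `[nh, nl, no, nt, hd]` coordinates (dimension names follow the review discussion; sizes are illustrative, and the real kernels do this on-device in CUDA):

```python
def decompose(linear, shape):
    # peel dimensions from the innermost outwards (row-major order)
    coords = []
    for dim in reversed(shape):
        coords.append(linear % dim)
        linear //= dim
    return tuple(reversed(coords))

shape = (2, 4, 2, 16, 8)  # (nh, nl, no, nt, hd) -- illustrative sizes
print(decompose(0, shape))  # (0, 0, 0, 0, 0)
print(decompose(9, shape))  # (0, 0, 0, 1, 1)
```

The inverse mapping is the usual row-major dot product of coordinates with strides, which is what makes the block and universal views interchangeable.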
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 7
🧹 Nitpick comments (12)
lib/kvbm-kernels/Dockerfile (2)
21-21: Consider pinning the Rust toolchain version for reproducibility. Using `--default-toolchain stable` may lead to different Rust versions across builds. Consider pinning to a specific version (e.g., `1.83.0`) for reproducible builds.
27-30: Consider copying Cargo.lock for reproducible dependency versions. If this crate has a `Cargo.lock` file, copying it ensures consistent dependency versions across builds.

```diff
 COPY Cargo.toml ./
+COPY Cargo.lock ./
 COPY src ./src
 COPY build.rs ./build.rs
 COPY cuda ./cuda
```

lib/kvbm-kernels/cuda/prebuilt/tensor_kernels.md5 (1)
1-3: Consider documenting what each hash validates. The file contains three MD5 hashes without context. Consider using a format like `<hash> <filename>` (standard md5sum output format) or adding a header comment to indicate which files these hashes correspond to.

```diff
+# MD5 checksums for prebuilt kernel artifacts
+# Format: <hash> <filename>
-7aac008ed704fe198ef5056a4e502069
-a7d1649c148ee0366de6d19896a80116
-4234cfd2ef4b283592a37a9ceed94666
+7aac008ed704fe198ef5056a4e502069 tensor_kernels.fatbin
+a7d1649c148ee0366de6d19896a80116 <corresponding_file>
+4234cfd2ef4b283592a37a9ceed94666 <corresponding_file>
```

lib/bindings/kvbm/python/kvbm/__init__.py (1)
14-17: Consider guarding the `kernels` import for non-block-manager builds. Exposing `kvbm.kernels` via `from kvbm._core import kernels as kernels` is the right hook for the new submodule. If there's any scenario where `_core` is built without the `block-manager` feature, this import will raise at import time; you could optionally make it defensive, e.g.:

```python
try:
    from kvbm._core import kernels as kernels
except ImportError:
    kernels = None  # or omit export / re-raise with a clearer message
```

Not required if `block-manager` is guaranteed to be enabled for all builds, but worth considering.

.github/workflows/pre-merge-rust.yml (1)
69-73: `cargo fmt -- --check` is now run twice per directory. Both the `tests` and `clippy` jobs install rustfmt and run `cargo fmt -- --check` for each `matrix.dir`. Functionally fine, but it doubles formatting work for the overlapping dirs (`.`, `lib/bindings/python`, etc.). If CI time becomes a concern, you could centralize the formatting check in just one of the jobs (or a tiny dedicated job) and keep clippy/tests focused on their respective tasks.

Also applies to: 120-123
lib/bindings/kvbm/tests/test_tensor_kernels.py (2)
30-50: Unused unpacking variables in `_make_blocks`. The variables `nh`, `nt`, and `hd` are extracted but never used in this function. Consider using underscores to indicate intentional non-use, which also satisfies the Ruff linter.

```diff
 def _make_blocks(universal: torch.Tensor, layout: str) -> List[torch.Tensor]:
-    nh, nl, no, nt, hd = universal.shape
+    _nh, nl, no, _nt, _hd = universal.shape
     blocks = []
     for layer in range(nl):
```
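The underscore convention the linter asks for is easy to show in isolation; a tiny sketch (`layer_count` is a made-up name, not from the PR):

```python
def layer_count(universal_shape):
    # prefix intentionally unused unpacked dims with "_" to satisfy Ruff's RUF059
    _nh, nl, _no, _nt, _hd = universal_shape
    return nl

print(layer_count((2, 4, 2, 16, 8)))  # 4
```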
104-118: Add `strict=True` to `zip()` calls for safer iteration. Multiple `zip()` calls lack the `strict=` parameter. Adding `strict=True` would catch length mismatches early rather than silently truncating, improving test reliability.

```diff
-    for produced, expected in zip(outputs, universals):
+    for produced, expected in zip(outputs, universals, strict=True):
         assert torch.allclose(produced, expected, atol=atol, rtol=rtol)
```

```diff
-    for produced_set, expected_set in zip(blocks, expected_blocks):
-        for produced, expected in zip(produced_set, expected_set):
+    for produced_set, expected_set in zip(blocks, expected_blocks, strict=True):
+        for produced, expected in zip(produced_set, expected_set, strict=True):
             assert torch.allclose(produced, expected, atol=atol, rtol=rtol)
```

Similar changes should be applied at lines 157, 171-172.
lib/kvbm-kernels/build.rs (1)
68-70: Consider more robust nvcc detection. The current check only verifies that `nvcc --version` can be spawned. It doesn't distinguish between nvcc not found vs. nvcc crashing. Consider checking exit status for more precise detection.

```diff
 fn is_nvcc_available() -> bool {
-    Command::new("nvcc").arg("--version").output().is_ok()
+    Command::new("nvcc")
+        .arg("--version")
+        .output()
+        .map(|o| o.status.success())
+        .unwrap_or(false)
 }
```

lib/bindings/kvbm/src/kernels.rs (4)
4-4: Consider scoping the `unsafe_op_in_unsafe_fn` allow more narrowly. The file-level `#![allow(unsafe_op_in_unsafe_fn)]` suppresses warnings for the entire file. Consider applying it only to the specific functions that need it, or adding `// SAFETY:` comments to document the unsafe invariants.
226-362: Consider extracting shared validation logic. `block_to_universal` and `universal_to_block` share nearly identical validation and setup code (lines 232-301 vs 375-444). Consider extracting the common pattern into a helper to reduce duplication and maintenance burden. A helper like `prepare_universal_transfer(py, universals, blocks, layout)` could return validated infos and prepared pointer buffers, with the direction-specific kernel call remaining in each function.

Also applies to: 369-505
516-660: Similar duplication in operational functions. `block_to_operational` and `operational_to_block` also share significant code. The same refactoring suggestion applies: extract common validation and pointer preparation into a shared helper.

Also applies to: 670-814
608-616: Host pointers created but may not be needed for all backends. `block_ptrs_host` and `operational_ptrs_host` are always allocated, but they may only be needed for specific backends (e.g., `MemcpyBatch`). Consider lazy allocation based on the selected backend to reduce unnecessary allocations.

Also applies to: 762-770
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `Cargo.lock` is excluded by `!**/*.lock`
- `lib/bindings/kvbm/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (16)
- `.github/workflows/container-validation-dynamo.yml` (1 hunks)
- `.github/workflows/pre-merge-rust.yml` (2 hunks)
- `Cargo.toml` (1 hunks)
- `lib/bindings/kvbm/Cargo.toml` (2 hunks)
- `lib/bindings/kvbm/python/kvbm/__init__.py` (1 hunks)
- `lib/bindings/kvbm/src/kernels.rs` (1 hunks)
- `lib/bindings/kvbm/src/lib.rs` (2 hunks)
- `lib/bindings/kvbm/tests/test_tensor_kernels.py` (1 hunks)
- `lib/kvbm-kernels/Cargo.toml` (1 hunks)
- `lib/kvbm-kernels/Dockerfile` (1 hunks)
- `lib/kvbm-kernels/README.md` (1 hunks)
- `lib/kvbm-kernels/build.rs` (1 hunks)
- `lib/kvbm-kernels/cuda/prebuilt/tensor_kernels.md5` (1 hunks)
- `lib/kvbm-kernels/cuda/tensor_kernels.cu` (1 hunks)
- `lib/kvbm-kernels/src/lib.rs` (1 hunks)
- `lib/kvbm-kernels/src/tensor_kernels.rs` (1 hunks)
🧰 Additional context used
🧠 Learnings (11)
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Applied to files:
- .github/workflows/pre-merge-rust.yml
- .github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-08-30T20:43:49.632Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: container/Dockerfile:437-449
Timestamp: 2025-08-30T20:43:49.632Z
Learning: In the dynamo project's devcontainer setup, the team prioritizes consistency across framework-specific Dockerfiles (like container/Dockerfile, container/Dockerfile.vllm, etc.) by mirroring their structure, even when individual optimizations might be possible, to maintain uniformity in the development environment setup.
Applied to files:
.github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project, devcontainer.json files use templated container names (like "dynamo-vllm-devcontainer") that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
.github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-12-03T01:14:42.094Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4657
File: container/Dockerfile.vllm:264-265
Timestamp: 2025-12-03T01:14:42.094Z
Learning: In container/Dockerfile.vllm, the recursive chmod -R g+w operation during early user setup (at lines 264-265, when creating the dynamo user and initializing /workspace, /home/dynamo, /opt/dynamo) is an intentional exception to the pattern of avoiding recursive operations, as it handles pre-existing paths and dotfiles created by useradd -m before bulk content is copied.
Applied to files:
.github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-09-16T17:16:07.820Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.vllm.j2:221-233
Timestamp: 2025-09-16T17:16:07.820Z
Learning: In the dynamo project, when converting Dockerfiles to Jinja2 templates, the primary goal is backward compatibility - generated Dockerfiles must be identical to the originals. Security improvements or other enhancements that would change the generated output are out of scope for templating PRs and should be addressed separately.
Applied to files:
.github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-08-30T20:43:10.091Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 2797
File: .devcontainer/devcontainer.json:12-12
Timestamp: 2025-08-30T20:43:10.091Z
Learning: In the dynamo project's devcontainer setup, hard-coded container names in devcontainer.json files serve as templates that are automatically processed by the copy_devcontainer.sh script to generate framework-specific configurations with unique names, preventing container name collisions.
Applied to files:
.github/workflows/container-validation-dynamo.yml
📚 Learning: 2025-11-15T06:06:37.067Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 4356
File: lib/kvbm-kernels/src/python.rs:23-34
Timestamp: 2025-11-15T06:06:37.067Z
Learning: In lib/kvbm-kernels/src/python.rs, the CUDA context is hard-coded to device 0 in get_context(). This is a known limitation that should be addressed when Python bindings are re-enabled by either: (1) enforcing device_index == 0 validation with clear error messages, or (2) implementing per-device context management using a map/vec of OnceCell<Arc<CudaContext>> keyed by device_index.
Applied to files:
- lib/kvbm-kernels/README.md
- lib/bindings/kvbm/src/lib.rs
- lib/kvbm-kernels/Cargo.toml
- lib/bindings/kvbm/src/kernels.rs
📚 Learning: 2025-08-18T20:51:51.324Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/pipeline/network/egress/push_router.rs:0-0
Timestamp: 2025-08-18T20:51:51.324Z
Learning: The runtime crate cannot depend on the llm crate due to architectural dependency constraints, preventing imports from lib/llm into lib/runtime.
Applied to files:
lib/bindings/kvbm/src/lib.rs
📚 Learning: 2025-10-17T22:57:25.021Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 3700
File: lib/llm/src/block_manager/numa_allocator.rs:46-64
Timestamp: 2025-10-17T22:57:25.021Z
Learning: The dynamo library (specifically the llm/block_manager module) is developed to run on Linux only, so Linux-specific system calls and APIs can be used without platform guards.
Applied to files:
- lib/bindings/kvbm/src/lib.rs
- lib/bindings/kvbm/Cargo.toml
📚 Learning: 2025-09-19T18:21:03.693Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2746
File: lib/llm/src/grpc/service/tensor.rs:269-447
Timestamp: 2025-09-19T18:21:03.693Z
Learning: In lib/llm/src/grpc/service/tensor.rs, GuanLuo prefers using unsafe Vec::from_raw_parts operations to avoid data copying for performance reasons when converting raw tensor data from Vec<u8> to typed vectors like Vec<u32>, Vec<f32>, etc., even when it involves safety risks.
Applied to files:
- lib/kvbm-kernels/src/lib.rs
- lib/kvbm-kernels/src/tensor_kernels.rs
📚 Learning: 2025-09-10T22:47:29.325Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/block/transfer/cuda.rs:528-539
Timestamp: 2025-09-10T22:47:29.325Z
Learning: CUDA kernel symbol names in cuModuleGetFunction calls must exactly match the function name in the kernel source code, even if it uses different spelling conventions (like British vs American English) compared to other parts of the codebase.
Applied to files:
lib/kvbm-kernels/cuda/tensor_kernels.cu
🧬 Code graph analysis (4)
lib/bindings/kvbm/src/lib.rs (1)
lib/bindings/kvbm/src/kernels.rs (1)
add_to_module(816-822)
lib/bindings/kvbm/python/kvbm/__init__.py (1)
lib/bindings/kvbm/src/lib.rs (1)
_core(25-53)
lib/kvbm-kernels/src/lib.rs (1)
lib/kvbm-kernels/src/tensor_kernels.rs (3)
- `block_from_universal` (148-176)
- `operational_copy` (186-218)
- `universal_from_block` (116-144)
lib/kvbm-kernels/src/tensor_kernels.rs (1)
lib/kvbm-kernels/cuda/tensor_kernels.cu (6)
- `launch_universal_from_block` (391-415), (392-394)
- `launch_block_from_universal` (417-441), (418-420)
- `launch_operational_copy` (450-615), (451-454)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4742/merge) by oandreeva-nv.
lib/kvbm-kernels/cuda/tensor_kernels.cu
[error] 1-1: clang-format hook failed in pre-commit: files were modified by this hook. Re-run 'pre-commit run --all-files' to format, then commit the updated changes. Command: 'pre-commit run --show-diff-on-failure --color=always --all-files'.
🪛 LanguageTool
lib/kvbm-kernels/README.md
[grammar] ~144-~144: Ensure spelling is correct
Context: ... tensors must be CUDA accessible by the specificed device and match the expected shapes an...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 markdownlint-cli2 (0.18.1)
lib/kvbm-kernels/README.md
46-46: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
177-177: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🪛 Ruff (0.14.7)
lib/bindings/kvbm/tests/test_tensor_kernels.py
38-38: Unpacked variable nh is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
38-38: Unpacked variable nt is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
38-38: Unpacked variable hd is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
48-48: Avoid specifying long messages outside the exception class
(TRY003)
104-104: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
116-116: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
117-117: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
157-157: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
171-171: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
172-172: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: tests (.)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (22)
lib/kvbm-kernels/cuda/tensor_kernels.cu (2)
491-494: Silent success when `block_ptrs_device` is null may mask errors. Returning `cudaSuccess` when `block_ptrs_device` is null silently skips the kernel work. If this is intentional for a "host-only" path, consider adding a comment. Otherwise, this could mask caller errors.
139-176: LGTM! The index decomposition correctly maps the linear `residual` to the 5D `[nh, nl, no, nt, hd]` coordinate space in row-major order, and the `block_inner_offset` correctly handles both NHD and HND layouts.

lib/kvbm-kernels/src/tensor_kernels.rs (3)
22-60: LGTM! The enum values correctly match their CUDA counterparts, and `#[repr(i32)]` ensures proper FFI compatibility. Based on learnings, exact matching of enum values across the FFI boundary is critical for correct kernel dispatch.
62-106: LGTM! The FFI function signatures correctly translate the C pointer types. `void* const*` maps to `*const *mut c_void` and `const void* const*` maps to `*const *const c_void`.
220-531: Test coverage is good for the basic path. The roundtrip test validates Block↔Universal↔Operational conversions with data integrity checks. Consider adding coverage for other data types (F16, BF16) and HND layout in follow-up work.
Cargo.toml (1)
19-19: LGTM! The `lib/kvbm-kernels` crate is correctly added to workspace members but excluded from `default-members`, which is appropriate since it requires the CUDA toolchain to build.

.github/workflows/container-validation-dynamo.yml (1)
46-46: Enabling kvbm in the container build is appropriate. Wiring `--enable-kvbm` into the validation image build aligns with the new kernels crate and bindings so tests can see the functionality. As long as `container/build.sh` cleanly handles this flag (no-op when disabled / unsupported), this looks good.

lib/bindings/kvbm/src/lib.rs (1)
16-19: `kernels` submodule wiring into `_core` looks correct. The new `mod kernels` plus the `PyModule::new("kernels")` / `add_submodule` block under the same `#[cfg(feature = "block-manager")]` gate is consistent with the rest of the bindings and with the `kvbm.__init__` export. This gives a clean `kvbm._core.kernels` surface for the Python package to re-export.

Also applies to: 45-50
.github/workflows/pre-merge-rust.yml (1)
95-96: Adding `lib/kvbm-kernels` to the clippy matrix is a good call. Including `lib/kvbm-kernels` in the clippy `matrix.dir` ensures the new kernels crate is linted explicitly, in addition to any workspace-level checks from `.`.

lib/kvbm-kernels/src/lib.rs (1)
4-9: Top-level re-exports from `tensor_kernels` give a clean API. Re-exporting `BlockLayout`, `OperationalCopy*`, `TensorDataType`, and the three core functions at the crate root while still exposing `tensor_kernels` as a module provides both convenience and flexibility for users. This layout looks solid.

lib/kvbm-kernels/Cargo.toml (1)
4-31: Crate manifest for `dynamo-kvbm-kernels` looks well-structured. The manifest cleanly separates features (`testing-cuda`, `prebuilt-kernels`), uses workspace metadata, and exposes both `rlib` and `cdylib` for reuse from Rust and bindings. Build-time deps (`dynamo-config`, `cc`, `md5`) are appropriate for the NVCC-based build.rs flow.

lib/bindings/kvbm/Cargo.toml (1)
22-25: Cudarc version mismatch: bindings override workspace default with older 0.17.1. The feature wiring for `kvbm_kernels` is correct, but the cudarc version consistency check reveals a problem. The workspace defines `cudarc = "0.17.8"` at the root level, but `lib/bindings/kvbm/Cargo.toml` (line 72) and `lib/bindings/python/Cargo.toml` (line 72) both explicitly pin `cudarc = "0.17.1"`. This creates a version mismatch that could lead to duplicate or conflicting CUDA runtime dependencies in `Cargo.lock`. Either align both binding crates to use `{ workspace = true }` or update the workspace default to `0.17.1` if there's a reason to stay on the older version.

⛔ Skipped due to learnings

Learnt from: ryanolson. Repo: ai-dynamo/dynamo PR: 4356. File: lib/kvbm-kernels/src/python.rs:23-34. Timestamp: 2025-11-15T06:06:37.067Z. Learning: In lib/kvbm-kernels/src/python.rs, the CUDA context is hard-coded to device 0 in get_context(). This is a known limitation that should be addressed when Python bindings are re-enabled by either: (1) enforcing device_index == 0 validation with clear error messages, or (2) implementing per-device context management using a map/vec of OnceCell<Arc<CudaContext>> keyed by device_index.

lib/bindings/kvbm/tests/test_tensor_kernels.py (3)
18-27: LGTM! The `_tolerances` helper correctly provides dtype-aware tolerances. The relaxed tolerances for fp16/bf16 are appropriate given the reduced precision of these formats.
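A minimal sketch of such a dtype-keyed tolerance table (values are illustrative, not the test's actual numbers):

```python
def tolerances(dtype):
    # reduced-precision formats get looser (atol, rtol); defaults suit fp32
    table = {
        "float16": (1e-3, 1e-3),
        "bfloat16": (1e-2, 1e-2),
    }
    return table.get(dtype, (1e-6, 1e-6))

print(tolerances("bfloat16"))
```

Keeping the table in one helper means every roundtrip assertion automatically relaxes for fp16/bf16 without scattering magic numbers through the tests.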
53-66: LGTM! The `_call_with_backend` helper correctly handles unsupported backends by translating `cudaErrorNotSupported` into pytest skips rather than test failures, which is appropriate for optional hardware features.
257-264: LGTM! The empty batch test correctly verifies that the kernels handle edge cases gracefully without launching CUDA operations. Testing `None` return values confirms the expected no-op behavior.

lib/kvbm-kernels/build.rs (3)
44-66: LGTM! The build mode determination logic is well-structured with clear precedence: feature flag → environment variable → nvcc auto-detection. The warning messages provide helpful diagnostics.
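The precedence can be sketched in Python (the `DYNAMO_USE_PREBUILT_KERNELS` variable name comes from this PR's commit messages; the function shape is a hypothetical model, not build.rs itself):

```python
import os
import subprocess

def nvcc_available():
    # robust detection: the command must both spawn AND exit successfully
    try:
        return subprocess.run(
            ["nvcc", "--version"], capture_output=True
        ).returncode == 0
    except OSError:
        return False

def determine_build_mode(prebuilt_feature, env=None, probe=nvcc_available):
    env = env if env is not None else os.environ
    if prebuilt_feature:                                  # 1. cargo feature flag
        return "prebuilt"
    if env.get("DYNAMO_USE_PREBUILT_KERNELS") == "1":     # 2. environment override
        return "prebuilt"
    return "source" if probe() else "prebuilt"            # 3. nvcc auto-detection
```

Injecting `env` and `probe` keeps the precedence logic testable without a CUDA toolchain on the test machine.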
220-275: Generating artifacts into source tree during build.
`generate_prebuilt_artifacts` writes files to `cuda/prebuilt/` within the source tree. This is unconventional for build scripts, which typically only write to `OUT_DIR`. This behavior could:
- Fail in read-only source directories
- Cause unexpected git changes
If intentional for caching prebuilt artifacts, consider documenting this behavior.
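For reference, a streaming artifact-hash helper in the spirit of the build script (a Python sketch; build.rs does the equivalent in Rust, and the chunked read is so large fatbins never need to fit in memory):

```python
import hashlib
import tempfile

def compute_file_hash(path, chunk_size=1 << 16):
    digest = hashlib.md5()  # integrity check only, not a security boundary
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
print(compute_file_hash(f.name))  # 5d41402abc4b2a76b9719d911017c592
```

Emitting `f"{digest}  {filename}"` lines would also give the md5sum-compatible format suggested earlier for the `.md5` manifest.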
277-286: LGTM! The `compute_file_hash` function is straightforward and provides clear error messages on failure. MD5 is acceptable here since this is for integrity checking of build artifacts, not security.

lib/bindings/kvbm/src/kernels.rs (4)
65-113: LGTM! The `tensor_info` function provides thorough validation of tensor properties including CUDA residency, contiguity, device type, and dtype. The error messages are clear and actionable.
116-173: LGTM! The `collect_block_pointers` function provides comprehensive validation with clear error messages for length, dtype, device, shape, and numel mismatches. The internal error check for missing expected parameters is a good defensive measure.
816-822: LGTM! The `add_to_module` function correctly registers all four kernel functions for the Python module.
333-336: The pointer casting is correct and matches the FFI contract exactly. The `universal_from_block` function expects `universal_device_ptrs: *const *mut c_void` and `block_device_ptrs: *const *const c_void`, which lines 334-336 provide. Similarly, `block_from_universal` expects the opposite mutability, which lines 477-479 correctly provide. The casting pattern through `usize` is idiomatic Rust for FFI, and the guards (`_block_guard`, `_univ_guard`) ensure pointer validity during the FFI call. Device pointers from `cudarc` are guaranteed to be valid, and the memory layout is handled by the underlying CUDA runtime.

Likely an incorrect or invalid review comment.
```cpp
case OperationalCopyBackend::Auto:
default:
  status = launch_kernel();
  if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
    status = launch_memcpy_batch();
  }
  if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
    status = launch_memcpy_async();
  }
  break;
```
Fallback on `cudaErrorInvalidValue` may hide dtype/parameter errors.

Auto mode falls back to memcpy paths when the kernel returns `cudaErrorInvalidValue`. However, if the error is due to an invalid dtype, the memcpy paths will copy raw bytes without type awareness, potentially masking the real issue. Consider only falling back on `cudaErrorNotSupported`.
```diff
 case OperationalCopyBackend::Auto:
 default:
   status = launch_kernel();
-  if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
+  if (status == cudaErrorNotSupported) {
     status = launch_memcpy_batch();
   }
-  if (status == cudaErrorNotSupported || status == cudaErrorInvalidValue) {
+  if (status == cudaErrorNotSupported) {
     status = launch_memcpy_async();
   }
   break;
```

🤖 Prompt for AI Agents
In lib/kvbm-kernels/cuda/tensor_kernels.cu around lines 602 to 611, the fallback
logic currently treats cudaErrorInvalidValue the same as cudaErrorNotSupported
and will try memcpy paths, which can mask real dtype/parameter errors; change
the checks so only cudaErrorNotSupported triggers falling back to
launch_memcpy_batch() and launch_memcpy_async(), and allow cudaErrorInvalidValue
to be returned (or propagated) immediately instead of attempting memcpy
fallbacks; ensure any surrounding control flow and return value handling still
correctly propagates the final cuda error.
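The suggested fallback policy can be modeled in a few lines (a Python sketch with string stand-ins for `cudaError_t` values; the real code is the CUDA C++ shown in the diff above the agent prompt):

```python
NOT_SUPPORTED = "cudaErrorNotSupported"
INVALID_VALUE = "cudaErrorInvalidValue"
SUCCESS = "cudaSuccess"

def auto_dispatch(kernel, memcpy_batch, memcpy_async):
    # fall back only when a backend is genuinely unsupported;
    # invalid-value errors propagate so bad dtypes surface immediately
    status = kernel()
    if status == NOT_SUPPORTED:
        status = memcpy_batch()
    if status == NOT_SUPPORTED:
        status = memcpy_async()
    return status
```

With this policy a dtype mistake fails loudly at the first backend instead of being papered over by a raw-byte memcpy.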
/ok to test f2353f3
Force-pushed from f2353f3 to a412159 (Compare)
/ok to test 74e9eed
Force-pushed from 64a3eeb to 46dc573 (Compare)
Signed-off-by: Ryan Olson <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
…o happen in cuda and nixl enabled container Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
Signed-off-by: Ryan Olson <[email protected]>
/ok to test 6a7fe7f
Force-pushed from 9983f2f to 98e7976 (Compare)
Problem:
When using prebuilt kernels (DYNAMO_USE_PREBUILT_KERNELS=1), the build
would fail at link time with "undefined symbol" errors for the three
extern "C" functions: launch_universal_from_block,
launch_block_from_universal, and launch_operational_copy.
Root Cause:
The prebuilt path only copied .fatbin files to OUT_DIR. FATBIN files
contain GPU device code but NOT the host functions (extern "C" functions
that run on CPU). The linker could not resolve these host function symbols,
causing undefined reference errors.
Solution:
Generate static libraries (.a files) alongside .fatbin files when building
from source. These .a files contain both host and device code in a linkable
format. In prebuilt mode, ship and link against these .a files instead of
relying on .fatbin alone.
Changes:
- build.rs: Modified generate_prebuilt_artifacts() to:
* Compile .cu files to object files (.o) with nvcc -c
* Create static libraries (.a) from object files using ar
* Update .md5 hashes to include .a file validation (3 lines instead of 2)
- build.rs: Modified build_with_prebuilt_kernels() to:
* Validate and copy .a files to OUT_DIR
* Link against static libraries with cargo:rustc-link-lib=static
* Add CUDA runtime (cudart) and C++ stdlib (stdc++) linking
Result:
Prebuilt kernel mode now works identically to building from source.
Tests pass with DYNAMO_USE_PREBUILT_KERNELS=1. No CUDA code changes needed.
Prebuilt artifacts now include:
- *.fatbin (GPU code for potential runtime loading)
- lib*.a (linkable static libraries with host+device code)
- *.md5 (validation hashes for .cu, .fatbin, and .a files)
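A dry-run sketch of that compile-and-archive pipeline as command lists (Python; the `-arch` flag is illustrative, since real builds derive architectures from `CUDA_ARCHS`):

```python
def prebuilt_commands(cu_file, arch="sm_90"):
    # mirror the described flow: nvcc -c produces an object file with
    # host + device code, then ar packs it into a linkable static library
    stem = cu_file.rsplit(".", 1)[0]
    obj = f"{stem}.o"
    return [
        ["nvcc", "-c", cu_file, "-o", obj, f"-arch={arch}"],
        ["ar", "rcs", f"lib{stem}.a", obj],
    ]

for cmd in prebuilt_commands("tensor_kernels.cu"):
    print(" ".join(cmd))
```

Building the command lists separately from executing them makes it easy to unit-test the artifact naming without a CUDA toolchain present.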
Prevents CI from regenerating non-reproducible .a files unnecessarily. Only regenerates when .cu source changes or artifacts are missing. Delete .md5 files to force regeneration after changing CUDA_ARCHS.
Force-pushed from ff52b4d to 5140331 (Compare)
Merged PR #4742 (oandreeva/kvbm-kernels), which re-introduces the KVBM kernels library after it was reverted from main.

Key changes:
- Added lib/kvbm-kernels with CUDA tensor kernel operations
- Integrated kernels module into kvbm bindings (kept alongside v2 module)
- Updated workspace to include kvbm-kernels
- Added prebuilt CUDA binaries and static libraries for x86_64
- Updated vLLM Dockerfile with CUDA dev tools
- Moved vectorized_copy.fatbin to kvbm-kernels library

Conflicts resolved:
- Cargo.toml: Added kvbm-kernels to workspace members
- lib/bindings/kvbm/Cargo.toml: Added kvbm_kernels dependency to block-manager feature
- lib/bindings/kvbm/src/lib.rs: Added kernels module alongside existing v2 module
- lib/bindings/kvbm/python/kvbm/__init__.py: Exported both kernels and v2 modules
- Cargo.lock files: Accepted PR version and will regenerate

Signed-off-by: Ryan Olson <[email protected]>
Overview:
Resurrection of #4356 with fixes:
- `cudaMemcpyBatchAsync` needs to be called differently based on CUDA version

Details:
The files changed/introduced in this pr can be divided into multiple subgroups. Main ones are:
Group 1: New KVBM Kernels Library
The core of this PR - a new standalone CUDA kernels library:
Since static libraries (`.a`) are arch-dependent, I only provide those for `x86`.
When building from source on ARM with nvcc:
- `cc::Build` compiles all .cu files and creates a library in `OUT_DIR`
- `generate_prebuilt_artifacts` is not run for `tensor_kernels`, so we don't save the ARM `.a` file to `cuda/prebuilt/`

When using prebuilt mode on ARM (no nvcc):
- `cuda/prebuilt/libtensor_kernels.a` is `x86_64`-only

Group 2: Python/PyTorch Bindings
Expose CUDA kernels to Python for testing:
Purpose: PyO3 bindings to test kernels through PyTorch
Group 3: Integration with LLM Engine
Connect open-sourced vectorized copy to existing Dynamo code:
Purpose: Replace old implementation with new kernel library
Group 4: Build & CI Infrastructure
Enable building and testing:
Purpose: CI/CD and container support for kernel compilation in dev containers
Where should the reviewer start?
Related Issues:
Summary by CodeRabbit
New Features
Chores
Tests