[Core] Report CUDA versions when NVRTC compilation fails#2842
[Core] Report CUDA versions when NVRTC compilation fails#2842timmoon10 wants to merge 7 commits intoNVIDIA:mainfrom
Conversation
When NVRTC kernel compilation fails, detect whether the linked NVRTC library and the CUDA headers used for compilation are from different CUDA versions, and if so emit an actionable note to stderr pointing the user toward NVTE_CUDA_INCLUDE_DIR / CUDA_HOME / LD_LIBRARY_PATH. The header version is obtained by compiling a tiny probe program that embeds CUDA_VERSION (from cuda.h) into a static_assert failure message, so the macro is resolved by the actual preprocessor rather than by parsing header text. All probe failures are silent; the check is purely informational and never causes a premature error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>
Still buggy, include_directory_version returns CUDA runtime version instead of header version. Signed-off-by: Tim Moon <tmoon@nvidia.com>
The NVRTC probe approach was broken: NVRTC pre-defines CUDART_VERSION to its own version before processing any includes, so the probe always returned the NVRTC version regardless of the headers on the include path. Fix by reading cuda_runtime_api.h as text and parsing the "#define CUDART_VERSION <integer>" line directly. This is immune to NVRTC's internal macro management, and the format has been stable across all CUDA versions. Also decode raw CUDA version integers to "major.minor" strings in the error message for readability. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>
Test that the CUDA include directory is found and that its version matches the compile-time CUDART_VERSION. Also export transformer_engine::cuda::* symbols and tighten the rtc export pattern in the version script. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>
Greptile SummaryThis PR adds CUDA version diagnostics to NVRTC compilation failures, reporting compile-time CUDA ( Confidence Score: 5/5This PR is safe to merge — all prior review concerns have been addressed or accepted, and no new P0/P1 issues were found. All substantive issues (mismatch warning with empty path when headers not found, lack of caching in include_directory_version, and wildcard ABI export) were identified and resolved in prior review threads. The remaining observations are purely P2 quality suggestions. The implementation is logically correct: CUDA version encoding (major1000+minor10) is consistent across all three version sources, the version_string lambda correctly handles negative values with '', and the new unit test provides good coverage. No files require special attention; the libtransformer_engine.version wildcard change was explicitly approved by a senior developer. Important Files Changed
Sequence DiagramsequenceDiagram
participant C as compile()
participant NVRTC as nvrtcCompileProgram
participant IDV as include_directory_version()
participant FS as cuda_runtime_api.h
C->>NVRTC: nvrtcCompileProgram(...)
NVRTC-->>C: NVRTC_ERROR_COMPILATION
C->>C: read CUDA_VERSION (compile-time macro)
C->>NVRTC: nvrtcVersion(&major, &minor)
NVRTC-->>C: major, minor → nvrtc_version
C->>IDV: cuda::include_directory_version()
IDV->>IDV: cuda::include_directory(false)
IDV->>FS: open & parse #define CUDART_VERSION
FS-->>IDV: version int
IDV-->>C: header_version
C->>C: log all 3 versions via version_string()
C->>C: if nvrtc_version != header_version → emit warning
C->>NVRTC: nvrtcGetProgramLogSize / nvrtcGetProgramLog
NVRTC-->>C: compilation log
C->>C: std::cerr << full log
C->>C: NVTE_CHECK_NVRTC(compile_result) → throw
Greploops — Automatically fix all review issues by running Reviews (2): Last reviewed commit: "Tweak version message" | Re-trigger Greptile |
| int include_directory_version(bool required) { | ||
| // Header path | ||
| const auto &include_dir = cuda::include_directory(false); | ||
| if (include_dir.empty()) { | ||
| if (required) { | ||
| NVTE_ERROR( | ||
| "Could not detect version of CUDA Toolkit headers " | ||
| "(CUDA Toolkit headers not found)."); | ||
| } | ||
| return -1; | ||
| } | ||
|
|
||
| // Parse CUDART_VERSION from cuda_runtime_api.h. | ||
| const auto header_path = std::filesystem::path(include_dir) / "cuda_runtime_api.h"; | ||
| std::ifstream header_file(header_path); | ||
| if (header_file.is_open()) { | ||
| const std::string define_prefix = "#define CUDART_VERSION "; | ||
| std::string line; | ||
| while (std::getline(header_file, line)) { | ||
| const auto pos = line.find(define_prefix); | ||
| if (pos == std::string::npos) { | ||
| continue; | ||
| } | ||
| try { | ||
| const int version = std::stoi(line.substr(pos + define_prefix.size())); | ||
| if (version > 0) { | ||
| return version; | ||
| } | ||
| } catch (...) { | ||
| continue; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if (required) { | ||
| NVTE_ERROR( | ||
| "Could not detect version of CUDA Toolkit headers " | ||
| "(Could not parse CUDART_VERSION from ", | ||
| header_path.string(), ")."); | ||
| } | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
Result not cached, unlike sibling functions
include_directory_version() re-opens and re-parses cuda_runtime_api.h on every call. The sibling functions cudart_version() and cublas_version() in the same file both cache their result with a static local variable. While the function is currently only exercised on the failure path in rtc.cpp, caching the result would be consistent with the established pattern and avoids repeated I/O if the function is ever called more broadly (e.g. from tests):
int include_directory_version(bool required) {
static int cached = [&]() -> int {
// ... existing parsing logic ...
}();
// handle `required` separately for the error message
...
}Alternatively, add a static int cached_version = -1; if (cached_version >= 0) return cached_version; guard at the top.
| transformer_engine::cuda::supports_multicast*; | ||
| transformer_engine::cuda::stream_priority_range*; | ||
| transformer_engine::cuda::current_device*; | ||
| transformer_engine::cuda::*; |
There was a problem hiding this comment.
Wildcard export broadens ABI surface
Changing the explicit list of transformer_engine::cuda::sm_count*, sm_arch*, etc. to the blanket transformer_engine::cuda::* will export every symbol in the cuda namespace — including any future internal helpers — as part of the library's public ABI. This may be intentional to avoid updating the linker script whenever a new function is added, but it is worth confirming the trade-off, since once a symbol is exported it must be maintained for ABI compatibility.
There was a problem hiding this comment.
I agree, we shouldn't do that. Also, why do we need to modify this file in this PR at all? The newly added function is not part of the public API anyway.
There was a problem hiding this comment.
You're right — the new include_directory_version() function is only used internally (in rtc.cpp for error reporting and in the test suite). Since the tests link against the same library, the symbol doesn't need to be exported in the version script.
The wildcard change appears to be unnecessary. You can revert the modifications to libtransformer_engine.version entirely — the explicit symbol list should work as-is without any changes for this PR.
There was a problem hiding this comment.
We need to expose the utility functions if we want to test them individually in the C++ unit tests.
There was a problem hiding this comment.
Hmmm, this actually shows a separate issue (that we should probably tackle in a separate PR and merge this one without the changes to the build and without the unit test) that the C++ tests as they are currently do not have the visibility into the internal functions. On one hand that is good, since it makes it possible to test that the API surface is correct and that we can do everything we want by just calling the right exposed functions, but on the other it makes it impossible to have targeted tests for those internal implementations without exposing them to the world. I think we should rethink this relationship to have either one or the other as the target of the C++ unit tests (and I would argue that the framework level integration already tackles the API exposure testing so we could make the C++ tests more coupled with the internals of the library itself).
| return -1; | ||
| } | ||
|
|
||
| // Parse CUDART_VERSION from cuda_runtime_api.h. |
There was a problem hiding this comment.
I don't really like how we parse the header as a text file. However, when I tried compiling a test program with NVRTC it would override the header's CUDART_VERSION macro with the CUDA Runtime version.
|
/te-ci |
Suggestion from @ptrendx Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Description
NVRTC compilation involves three CUDA versions:
libnvrtc.soIf the user's system is misconfigured, these CUDA versions may be inconsistent and cause strange errors (e.g. #1018). This PR reports each of the CUDA versions to help with debugging.
Closes #1018.
Type of change
Changes
Checklist: