Enforce minimum NCCL version for cuBLASMp#2857
Enforce minimum NCCL version for cuBLASMp#2857vcherepanov-nv wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Greptile SummaryAdds a CMake Confidence Score: 5/5Safe to merge; all remaining findings are P2 style/robustness suggestions that do not block correctness in standard single-NCCL environments. The core version-check logic is correct: NCCL version is parsed deterministically from the header, compared with CMake's VERSION_LESS, and raises a FATAL_ERROR on failure. The two P2 flags (no HINTS in find_path, cached path variable) affect edge-case environments and do not introduce build failures for the common case. transformer_engine/common/CMakeLists.txt – find_path hints and cache hygiene in find_nccl_version.
|
| Filename | Overview |
|---|---|
| transformer_engine/common/CMakeLists.txt | Adds find_nccl_version helper and enforces NCCL >= 2.29.0 when NVTE_WITH_CUBLASMP is enabled; find_path lacks HINTS and uses a cached result that could mismatch the linked NCCL in multi-install environments. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[cmake configure] --> B{NVTE_WITH_CUBLASMP?}
B -- No --> Z[skip cuBLASMp setup]
B -- Yes --> C[find_nccl_version\nfind_path nccl.h]
C --> D[parse NCCL_MAJOR\nNCCL_MINOR\nNCCL_PATCH]
D --> E{version fields\nparsed OK?}
E -- No --> F[FATAL_ERROR\nFailed to parse NCCL version]
E -- Yes --> G[NCCL_VERSION = MAJOR.MINOR.PATCH]
G --> H{NCCL_VERSION\nVERSION_LESS 2.29.0?}
H -- Yes --> I[FATAL_ERROR\nNCCL >= 2.29.0 required]
H -- No --> J[find_library CUBLASMP_LIB\nfind_library NCCL_LIB]
J --> K[target_link_libraries\nNCCL + cuBLASMp]
K --> L[STATUS messages\ncuBLASMp path + NCCL version]
Reviews (1): Last reviewed commit: "Enforce minimum NCCL version for cuBLASM..." | Re-trigger Greptile
| find_path(_nvte_nccl_include_dir | ||
| NAMES nccl.h | ||
| PATH_SUFFIXES include | ||
| REQUIRED) |
There was a problem hiding this comment.
find_path may locate the wrong nccl.h
find_path here has no PATHS or HINTS, so it searches CMake's default system paths rather than the NCCL that will actually be linked. On systems where both a system NCCL and a CUBLASMP_DIR-bundled NCCL exist, the header discovered could belong to a different version than the NCCL_LIB found by the later find_library call, making the version check unreliable.
Consider extracting a hint from CUBLASMP_DIR, e.g.:
find_path(_nvte_nccl_include_dir
NAMES nccl.h
HINTS ${CUBLASMP_DIR}
PATH_SUFFIXES include
REQUIRED)There was a problem hiding this comment.
Is this suggestion from Greptile sensible? My understanding was that cuBLASMp does not come pre-packaged with it's own NCCL so I don't think this makes sense.
There was a problem hiding this comment.
You're correct—my suggestion doesn't make sense. If cuBLASMp doesn't bundle NCCL, then ${CUBLASMP_DIR} won't contain nccl.h.
The potential issue I was flagging (that find_path and find_library could locate different NCCL installations on systems with multiple versions) is real in principle, but the fix would be to accept a user-provided NCCL_DIR or derive the include path from the library location found by find_library, not to look in CUBLASMP_DIR.
For most practical scenarios where NCCL is installed in a single standard location, the current approach should work fine—both find_path and find_library will discover the same installation through CMake's default search paths.
Tip: You can customize Greptile's behavior for this repo with .greptile/rules and .greptile/config.
There was a problem hiding this comment.
@vcherepanov-nv I think Greptile's latest reply here does have a point here that a user's NCCL dir may differ from the system installation, for example, if they are using NCCL from a pip package. I'm not sure if NCCL from pip is supported with NVTE_WITH_CUBLASMP, but if it is we should probably allow the user to provide a NCCL_DIR hint (unless we have a better way of choosing between pip packages and system installations in CMake)
There was a problem hiding this comment.
I don't think existing find_library approach would find pip-installed NCCL. So whatever find_path finds, it's likely whatever find_library finds as well. Until now no one asked for a custom path to NCCL, so I'd like to avoid adding any extra configurable paths, unless absolutely necessary.
There was a problem hiding this comment.
Meaning - less options to configure, less options to misconfigure.
| find_path(_nvte_nccl_include_dir | ||
| NAMES nccl.h | ||
| PATH_SUFFIXES include | ||
| REQUIRED) |
There was a problem hiding this comment.
Cached
find_path result may become stale on reconfigure
find_path stores _nvte_nccl_include_dir in the CMake cache by default. If NCCL is upgraded and cmake is re-run without clearing the build directory, the old cached path is reused and the version check reads the stale nccl.h, potentially passing or failing incorrectly. Use NO_CACHE (available since CMake 3.21) or prepend unset(_nvte_nccl_include_dir CACHE) to force a fresh search every configure run.
find_path(_nvte_nccl_include_dir
NAMES nccl.h
HINTS ${CUBLASMP_DIR}
PATH_SUFFIXES include
NO_CACHE
REQUIRED)|
LGTM, thanks Vlad! |
|
/te-ci |
Description
cuBLASMp uses API, introduced in NCCL 2.29. This change adds a check to CMake to enforce this minimum version and avoid linker errors.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: