Conversation
Force-pushed from 8d16d92 to 3c31332 (Compare)
Greptile Summary

This PR fixes three distinct bugs in the comm_gemm test.

Confidence Score: 5/5. Safe to merge: all remaining findings are P2 style/documentation suggestions that do not affect correctness. The three substantive bugs (wrong bias dimensions, erroneous ×k tolerance scaling, mismatched split-accumulator setting) are all fixed correctly. The only open items are a missing explanatory comment on use_split_accumulator=true and a question about the empirical basis for the BFloat16 GemmRs tolerance of 6e-1; both are P2 and neither blocks correctness. No files require special attention beyond the two P2 notes on test_comm_gemm.cu.
| Filename | Overview |
|---|---|
| tests/cpp_distributed/test_comm_gemm.cu | Fixes three independent bugs: (1) bias buffer size m*n→m and shape m×n→m×1 to reflect that bias is a 1-D vector; (2) removes an erroneous ×k multiplier in the per-element tolerance, making comparisons ~256× stricter; (3) aligns the cuBLAS reference with comm_gemm by enabling use_split_accumulator. Tolerance values were recalibrated; the BFloat16 GemmRs tolerance of 6e-1 is notably loose, and the split-accumulator change lacks a documentation comment. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Run as Run() [per rank]
    participant CommGemm as CommGemm<br/>(ag/rs/ar variant)
    participant CuBLAS as nvte_cublas_gemm<br/>(golden reference)
    participant Compare as Element-wise<br/>EXPECT_NEAR
    Run->>CommGemm: local A shard, local B shard,<br/>bias[d_rows_start:d_rows_start+rows, 0:1]
    CommGemm-->>Run: local D output (d_rows_num x d_cols_num)
    Run->>CuBLAS: global A, global B,<br/>full bias[0:m, 0:1],<br/>use_split_accumulator=true
    CuBLAS-->>Run: global D golden (m x n)
    Run->>Compare: out[i] vs out_golden[i]<br/>tolerance = tol (absolute, no xk)
```
Reviews (2): Last reviewed commit: "Adjust comm_gemm test for accurate compa..."
tests/cpp_distributed/CMakeLists.txt
Outdated
```cmake
    ${MPI_CXX_INCLUDE_PATH}
    $ENV{CUBLASMP_HOME}/include
    $ENV{NCCL_HOME}/include
    $ENV{CUDNN_PATH}/include)
```
Unexplained cuDNN include path
$ENV{CUDNN_PATH}/include is appended to the test target's include directories without any comment explaining why the comm_gemm test requires cuDNN headers. More importantly, if CUDNN_PATH is not set in the environment, CMake expands this to /include, silently adding a root-level system path. Consider guarding the addition or documenting the reason.
Suggested change:

```diff
-    $ENV{CUDNN_PATH}/include)
+    $<$<BOOL:$ENV{CUDNN_PATH}>:$ENV{CUDNN_PATH}/include>)
```
```diff
   Params{DType::kFloat8E5M2, DType::kFloat8E4M3,
-         DType::kFloat16, true, false, 64, 128, 256, 5e-2}),
+         DType::kFloat16, true, false, 64, 128, 256, 7e-2}),
   &ParamSuffix);
```
BFloat16 GemmAr tolerance is tighter than Float16 despite lower precision
The BFloat16 GemmAr case uses tol = 1e-3, while the Float16 case uses 7e-2. BFloat16 has fewer mantissa bits (7 vs 10) and generally accumulates more rounding error in reductions, so a tighter tolerance risks flaky failures on different hardware or driver versions. Please confirm this value was empirically validated across all target GPUs.
timmoon10 left a comment:
LGTM, pending CI and @vcherepanov-nv's approval.
/te-ci core L1
setup.py
Outdated
```python
nccl_dir = os.getenv("NCCL_HOME") or metadata.distribution(
    f"nvidia-nccl-cu{cuda_version()[0]}"
).locate_file("nvidia/nccl")
```
The build is failing because it isn't properly detecting NCCL.
I've removed the two commits with the build system changes.
Force-pushed from f2f482f to 09a4352 (Compare)
- Allocate bias as 1D vector of length m (not m*n matrix)
- Distribute bias as a row-slice matching local D rows
- Change tolerance from tol*k to tol since tol values now represent the actual absolute tolerance

Signed-off-by: Almog Segal <asegal@nvidia.com>
- Use split accumulator (disable fast FP8 accumulation) in the reference GEMM to match cuBLASMp's accumulation precision
- Set per-test tolerances based on observed max errors: AG: 1e-3, RS FP16: 7e-2, RS BF16: 6e-1, RS FP8: 7e-2 to 1e-1, AR FP16: 7e-2, AR BF16: 1e-3, AR FP8: 1.5e-1

Signed-off-by: Almog Segal <asegal@nvidia.com>
/te-ci core L1
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
- Change tolerance from `tol * k` to `tol`.

Checklist: