
comm_gemm_test fixes #2839

Open

almogsegal wants to merge 2 commits into NVIDIA:main from almogsegal:comm-gemm-test-fixes

Conversation

Contributor

@almogsegal almogsegal commented Apr 6, 2026

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change tolerance value from tol * k to tol.
  • Use split accumulators to avoid fast accum.
  • Fix bias buffer size.
  • Adjust tolerance values.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@almogsegal almogsegal force-pushed the comm-gemm-test-fixes branch 2 times, most recently from 8d16d92 to 3c31332 on April 6, 2026 20:47
Contributor

greptile-apps bot commented Apr 6, 2026

Greptile Summary

This PR fixes three distinct bugs in the comm_gemm test harness: (1) the bias allocation was m×n instead of the correct m×1 column vector, causing wrong data to be passed to every test case; (2) the per-element tolerance check incorrectly scaled by k (up to ×256), making the test far too permissive; and (3) the cuBLAS golden reference was not using use_split_accumulator, creating a mismatched numerical path vs. the comm_gemm kernels. Tolerance values across GemmRs and GemmAr suites were recalibrated to reflect actual hardware error bounds after the other fixes.

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style/documentation suggestions that do not affect correctness.

The three substantive bugs (wrong bias dimensions, erroneous ×k tolerance scaling, mismatched split-accumulator setting) are all fixed correctly. The only open items are a missing explanatory comment on use_split_accumulator=true and a question about the empirical basis for the BFloat16 GemmRs tolerance of 6e-1 — both P2 and neither blocks correctness.

No files require special attention beyond the two P2 notes on test_comm_gemm.cu.

Vulnerabilities

No security concerns identified. Changes are limited to a CUDA test file adjusting buffer sizes, tensor shapes, and numerical tolerances.

Important Files Changed

Filename Overview
tests/cpp_distributed/test_comm_gemm.cu Fixes three independent bugs: (1) bias buffer size m*n→m and shape m×n→m×1 to reflect that bias is a 1-D vector; (2) removes an erroneous ×k multiplier in the per-element tolerance, making comparisons ~256× stricter; (3) aligns the cuBLAS reference with comm_gemm by enabling use_split_accumulator. Tolerance values were recalibrated — BFloat16 GemmRs at 6e-1 is notably loose and lacks a documentation comment on the split-accumulator change.

Sequence Diagram

sequenceDiagram
    participant Run as Run() [per rank]
    participant CommGemm as CommGemm<br/>(ag/rs/ar variant)
    participant CuBLAS as nvte_cublas_gemm<br/>(golden reference)
    participant Compare as Element-wise<br/>EXPECT_NEAR

    Run->>CommGemm: local A shard, local B shard,<br/>bias[d_rows_start:d_rows_start+rows, 0:1]
    CommGemm-->>Run: local D output (d_rows_num x d_cols_num)

    Run->>CuBLAS: global A, global B,<br/>full bias[0:m, 0:1],<br/>use_split_accumulator=true
    CuBLAS-->>Run: global D golden (m x n)

    Run->>Compare: out[i] vs out_golden[i]<br/>tolerance = tol (absolute, no ×k)


${MPI_CXX_INCLUDE_PATH}
$ENV{CUBLASMP_HOME}/include
$ENV{NCCL_HOME}/include
$ENV{CUDNN_PATH}/include)
Contributor

P2 Unexplained cuDNN include path

$ENV{CUDNN_PATH}/include is appended to the test target's include directories without any comment explaining why the comm_gemm test requires cuDNN headers. More importantly, if CUDNN_PATH is not set in the environment, CMake expands this to /include, silently adding a root-level system path. Consider guarding the addition or documenting the reason.

Suggested change
$ENV{CUDNN_PATH}/include)
$<$<BOOL:$ENV{CUDNN_PATH}>:$ENV{CUDNN_PATH}/include>)

Comment on lines 445 to 448
Params{DType::kFloat8E5M2, DType::kFloat8E4M3,
DType::kFloat16, true, false, 64, 128, 256, 5e-2}),
DType::kFloat16, true, false, 64, 128, 256, 7e-2}),
&ParamSuffix);

Contributor

P2 BFloat16 GemmAr tolerance is tighter than Float16 despite lower precision

The BFloat16 GemmAr case uses tol = 1e-3, while the Float16 case uses 7e-2. BFloat16 has fewer mantissa bits (7 vs 10) and generally accumulates more rounding error in reductions, so a tighter tolerance risks flaky failures on different hardware or driver versions. Please confirm this value was empirically validated across all target GPUs.

@timmoon10 timmoon10 requested a review from vcherepanov-nv April 7, 2026 02:06
@timmoon10 (Collaborator) left a comment
LGTM, pending CI and @vcherepanov-nv's approval.

@timmoon10 (Collaborator) commented:

/te-ci core L1

@timmoon10 timmoon10 self-requested a review April 7, 2026 02:20
setup.py Outdated
Comment on lines +80 to +82
nccl_dir = os.getenv("NCCL_HOME") or metadata.distribution(
f"nvidia-nccl-cu{cuda_version()[0]}"
).locate_file("nvidia/nccl")
Collaborator

The build is failing because it isn't properly detecting NCCL.

Contributor Author

I've removed the two commits with the build system changes.

@almogsegal almogsegal force-pushed the comm-gemm-test-fixes branch 2 times, most recently from f2f482f to 09a4352 on April 9, 2026 05:40
- Allocate bias as 1D vector of length m (not m*n matrix)
- Distribute bias as a row-slice matching local D rows
- Change tolerance from tol*k to tol since tol values now represent
  the actual absolute tolerance

Signed-off-by: Almog Segal <asegal@nvidia.com>
- Use split accumulator (disable fast FP8 accumulation) in the
  reference GEMM to match cuBLASMp's accumulation precision
- Set per-test tolerances based on observed max errors:
  AG: 1e-3, RS FP16: 7e-2, RS BF16: 6e-1, RS FP8: 7e-2 to 1e-1,
  AR FP16: 7e-2, AR BF16: 1e-3, AR FP8: 1.5e-1

Signed-off-by: Almog Segal <asegal@nvidia.com>
greptile-apps bot commented Apr 9, 2026


@timmoon10 (Collaborator) commented:

/te-ci core L1
