
comm_gemm_test fixes #2839

Open

almogsegal wants to merge 2 commits into NVIDIA:main from almogsegal:comm-gemm-test-fixes

Conversation

Contributor

@almogsegal almogsegal commented Apr 6, 2026

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change tolerance value from tol * k to tol.
  • Use split accumulators to avoid fast accum.
  • Fix bias buffer size.
  • Adjust tolerance values.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@almogsegal almogsegal force-pushed the comm-gemm-test-fixes branch 2 times, most recently from 8d16d92 to 3c31332 on April 6, 2026 20:47
Contributor

greptile-apps bot commented Apr 6, 2026

Greptile Summary

This PR fixes three distinct bugs in the comm_gemm test harness: (1) the bias allocation was m×n instead of the correct m×1 column vector, causing wrong data to be passed to every test case; (2) the per-element tolerance check incorrectly scaled by k (up to ×256), making the test far too permissive; and (3) the cuBLAS golden reference was not using use_split_accumulator, creating a mismatched numerical path vs. the comm_gemm kernels. Tolerance values across GemmRs and GemmAr suites were recalibrated to reflect actual hardware error bounds after the other fixes.

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style/documentation suggestions that do not affect correctness.

The three substantive bugs (wrong bias dimensions, erroneous ×k tolerance scaling, mismatched split-accumulator setting) are all fixed correctly. The only open items are a missing explanatory comment on use_split_accumulator=true and a question about the empirical basis for the BFloat16 GemmRs tolerance of 6e-1 — both P2 and neither blocks correctness.

No files require special attention beyond the two P2 notes on test_comm_gemm.cu.

Vulnerabilities

No security concerns identified. Changes are limited to a CUDA test file adjusting buffer sizes, tensor shapes, and numerical tolerances.

Important Files Changed

Filename Overview
tests/cpp_distributed/test_comm_gemm.cu Fixes three independent bugs: (1) bias buffer size m*n→m and shape m×n→m×1 to reflect that bias is a 1-D vector; (2) removes an erroneous ×k multiplier in the per-element tolerance, making comparisons ~256× stricter; (3) aligns the cuBLAS reference with comm_gemm by enabling use_split_accumulator. Tolerance values were recalibrated — BFloat16 GemmRs at 6e-1 is notably loose and lacks a documentation comment on the split-accumulator change.

Sequence Diagram

sequenceDiagram
    participant Run as Run() [per rank]
    participant CommGemm as CommGemm<br/>(ag/rs/ar variant)
    participant CuBLAS as nvte_cublas_gemm<br/>(golden reference)
    participant Compare as Element-wise<br/>EXPECT_NEAR

    Run->>CommGemm: local A shard, local B shard,<br/>bias[d_rows_start:d_rows_start+rows, 0:1]
    CommGemm-->>Run: local D output (d_rows_num x d_cols_num)

    Run->>CuBLAS: global A, global B,<br/>full bias[0:m, 0:1],<br/>use_split_accumulator=true
    CuBLAS-->>Run: global D golden (m x n)

    Run->>Compare: out[i] vs out_golden[i]<br/>tolerance = tol (absolute, no ×k)


${MPI_CXX_INCLUDE_PATH}
$ENV{CUBLASMP_HOME}/include
$ENV{NCCL_HOME}/include
$ENV{CUDNN_PATH}/include)
Contributor

P2 Unexplained cuDNN include path

$ENV{CUDNN_PATH}/include is appended to the test target's include directories without any comment explaining why the comm_gemm test requires cuDNN headers. More importantly, if CUDNN_PATH is not set in the environment, CMake expands this to /include, silently adding a root-level system path. Consider guarding the addition or documenting the reason.

Suggested change
$ENV{CUDNN_PATH}/include)
$<$<BOOL:$ENV{CUDNN_PATH}>:$ENV{CUDNN_PATH}/include>)

Comment on lines 445 to 448
Params{DType::kFloat8E5M2, DType::kFloat8E4M3,
DType::kFloat16, true, false, 64, 128, 256, 5e-2}),
DType::kFloat16, true, false, 64, 128, 256, 7e-2}),
&ParamSuffix);

Contributor

P2 BFloat16 GemmAr tolerance is tighter than Float16 despite lower precision

The BFloat16 GemmAr case uses tol = 1e-3, while the Float16 case uses 7e-2. BFloat16 has fewer mantissa bits (7 vs 10) and generally accumulates more rounding error in reductions, so a tighter tolerance risks flaky failures on different hardware or driver versions. Please confirm this value was empirically validated across all target GPUs.

@timmoon10 timmoon10 requested a review from vcherepanov-nv April 7, 2026 02:06
@timmoon10 (Collaborator) left a comment
LGTM, pending CI and @vcherepanov-nv's approval.

@timmoon10 (Collaborator) commented:

/te-ci core L1

@timmoon10 timmoon10 self-requested a review April 7, 2026 02:20
setup.py Outdated
Comment on lines +80 to +82
nccl_dir = os.getenv("NCCL_HOME") or metadata.distribution(
f"nvidia-nccl-cu{cuda_version()[0]}"
).locate_file("nvidia/nccl")
Collaborator

The build is failing because it isn't properly detecting NCCL.

Contributor Author

I've removed the two commits with the build system changes.

@almogsegal almogsegal force-pushed the comm-gemm-test-fixes branch 2 times, most recently from f2f482f to 09a4352 on April 9, 2026 05:40
- Allocate bias as 1D vector of length m (not m*n matrix)
- Distribute bias as a row-slice matching local D rows
- Change tolerance from tol*k to tol since tol values now represent
  the actual absolute tolerance

Signed-off-by: Almog Segal <asegal@nvidia.com>
- Use split accumulator (disable fast FP8 accumulation) in the
  reference GEMM to match cuBLASMp's accumulation precision
- Set per-test tolerances based on observed max errors:
  AG: 1e-3, RS FP16: 7e-2, RS BF16: 6e-1, RS FP8: 7e-2 to 1e-1,
  AR FP16: 7e-2, AR BF16: 1e-3, AR FP8: 1.5e-1

Signed-off-by: Almog Segal <asegal@nvidia.com>
greptile-apps bot commented Apr 9, 2026


@timmoon10 (Collaborator) commented:

/te-ci core L1
