* MXFP8 grouped GEMM + tensor-scaled FP8 fixes
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Change version to 13.3
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* Random padding condition shouldn't be applied for MXFP8
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* Remove incorrect comment
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* cuBLAS > 13.2 is enough
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* cuBLAS version needed for MXFP8 indeed seems to be 13.3
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* Restore accidentally removed line; also need a change to trigger CI
Add documentation for scaling factors in common.h
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* Update cuBLAS version requirement for MXFP8 support
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* grouped gemm: address code review comments
- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with moving pointer pattern
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>