
Conversation

@sunxxuns
Collaborator

…tibility

Fixes "RuntimeError: This GEMM is not supported!" errors when running FP8 quantized models (e.g., GLM-4.5-Air-FP8) with tp_size=2 on AMD GPUs using the AITer backend.

Root cause: AITer's Composable Kernel GEMM operations require K dimension alignment to multiples of 128/256. After tensor parallelism splits dimensions, the K dimension may not meet these requirements, causing kernel selection to fail during both CUDA graph capture and inference.

Changes:

  • fp8_utils.py: Pad only K dimension in apply_fp8_linear when needed

    • Only applies to AITer backend on HIP (_use_aiter path)
    • K: pad to multiple of 128 (or 256 if already aligned to 128)
    • M, N: no padding needed (minimal overhead approach)
    • Apply padding for both CUDA graph capture and regular inference
    • No output unpadding needed since output shape is already (M, N)
    • Non-AITer and non-HIP code paths are unaffected
  • compressed_tensors_moe.py: Pad MoE layer weights (w13_weight, w2_weight) and scales based on inter_dim alignment (128/256) to fix MoE stage2 kernel errors

  • fp8.py: Add same MoE padding logic for Fp8LinearMethod

  • test_gsm8k_eval_amd.py: Add --max-running-requests and --mem-fraction-static flags to reduce memory pressure from padding overhead

The padding is applied transparently, only when the K dimension requires alignment and only for the AITer+HIP backend, ensuring kernel compatibility with minimal memory overhead and no impact on other backends.
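A minimal sketch of the K-dimension padding described above, assuming the pad is applied to the activation before FP8 quantization; the helper names are illustrative, not the actual functions in fp8_utils.py:

  import torch
  import torch.nn.functional as F

  def _aligned_k(k: int) -> int:
      # One reading of the rule above: align K to a multiple of 128, or 256 if
      # K is already a multiple of 128.
      align = 256 if k % 128 == 0 else 128
      return ((k + align - 1) // align) * align

  def _pad_k(x: torch.Tensor) -> torch.Tensor:
      # Zero-pad the last (K) dimension of an (M, K) activation. Since the
      # weight's K dimension is zero-padded the same way, the GEMM result is
      # unchanged and the output shape stays (M, N), so no unpadding is needed.
      k = x.shape[-1]
      k_padded = _aligned_k(k)
      if k_padded == k:
          return x
      return F.pad(x, (0, k_padded - k))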

Related: nightly-test-amd workflow failures

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

github-actions bot added the amd and dependencies labels on Nov 28, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @sunxxuns, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical runtime error affecting FP8 quantized models on AMD GPUs with the AITer backend, particularly when tensor parallelism is active. The core solution involves introducing a transparent padding mechanism for the K-dimension in general matrix multiplication (GEMM) operations and for the intermediate dimension (inter_dim) in Mixture-of-Experts (MoE) layers. This ensures that the dimensions are appropriately aligned for AITer's Composable Kernel, preventing kernel selection failures during both CUDA graph capture and standard inference. The changes are carefully scoped to the AITer+HIP backend to minimize impact on other configurations, and supporting infrastructure like dependency management and testing has been updated accordingly.

Highlights

  • FP8 GEMM/MoE Kernel Compatibility: Fixed 'RuntimeError: This GEMM is not supported!' errors encountered when running FP8 quantized models (e.g., GLM-4.5-Air-FP8) with tensor parallelism (tp_size=2) on AMD GPUs using the AITer backend. This issue stemmed from AITer's Composable Kernel GEMM operations requiring K-dimension alignment to multiples of 128/256, which tensor parallelism could disrupt.
  • K-Dimension Padding in FP8 Linear Layers: Implemented K-dimension padding within fp8_utils.py for apply_fp8_linear operations. This padding is applied only when necessary for the AITer backend on HIP, aligning the K-dimension to multiples of 128 (or 256 if already aligned to 128). The padding is transparent, affecting both CUDA graph capture and regular inference, and does not require output unpadding.
  • MoE Layer Weight and Scale Padding: Added padding logic to compressed_tensors_moe.py and fp8.py for MoE layer weights (w13_weight, w2_weight) and their corresponding scales. This padding ensures the inter_dim is aligned to 128/256, resolving MoE stage2 kernel errors on AMD GPUs.
  • Dependency Management and Test Updates: Refactored pyproject.toml to streamline dependencies, removing several specific CUDA/FlashInfer/Torch related packages from runtime_common and introducing new optional dependency groups for HIP, NPU, and HPU. The AMD nightly evaluation test (test_gsm8k_eval_amd.py) was updated to include memory optimization flags (--max-running-requests, --mem-fraction-static) and to focus specifically on the GLM-4.5-Air-FP8 model.

sunxxuns force-pushed the amd-nightly-oom-models branch from c3395e2 to d00bf4a on November 28, 2025 at 23:44
@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request introduces padding for the K-dimension in FP8 GEMM/MoE kernels to ensure compatibility with the AITer backend on AMD GPUs, which require specific dimension alignments. The changes correctly address the runtime errors by padding weights and scales in fp8_utils.py, compressed_tensors_moe.py, and fp8.py.

My review focuses on code quality and maintainability. While the core logic for padding appears correct, I've identified significant code duplication and numerous debug statements that should be addressed before merging. Specifically, the padding logic is duplicated in fp8.py, which can lead to maintenance issues and bugs. Additionally, extensive debug prints have been added across multiple files, which should be removed.

Please see my detailed comments for suggestions on refactoring and cleanup.

github-actions bot added the documentation and sgl-kernel labels on Nov 29, 2025
sunxxuns force-pushed the amd-nightly-oom-models branch 2 times, most recently from 5a79a5a to 3bce29c on November 29, 2025 at 00:20
sunxxuns force-pushed the amd-nightly-oom-models branch 2 times, most recently from dbcc07c to bc1d422 on November 29, 2025 at 05:57
BBuf and others added 6 commits November 29, 2025 13:57
Only certain AITer GEMM kernels require K dimension padding to 128.
Add SGLANG_AITER_PAD_K environment variable to enable when needed.

Usage:
  export SGLANG_AITER_PAD_K=1

This is opt-in to avoid unnecessary memory overhead when not required.
Future work: Auto-detect kernel requirements.
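As a rough illustration (the real flag plumbing in sglang may live elsewhere), the opt-in gate could be as simple as:

  import os

  # Opt-in: K padding is applied only when the user sets SGLANG_AITER_PAD_K=1,
  # so default behavior and memory use are unchanged.
  AITER_PAD_K_ENABLED = os.environ.get("SGLANG_AITER_PAD_K", "0") == "1"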
The MoE stage2 kernel also fails with unsupported GEMM when inter_dim
is not aligned to 128. Add padding for MoE weights (w13_weight, w2_weight)
and scales under the same SGLANG_AITER_PAD_K environment variable.

This keeps it simple - one flag controls all AITer dimension padding
(both K in regular GEMM and inter_dim in MoE).
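A hedged sketch of the MoE inter_dim padding this commit describes, assuming w13_weight is laid out as (num_experts, 2 * inter_dim, hidden_size) and w2_weight as (num_experts, hidden_size, inter_dim); scale padding and the exact layouts in compressed_tensors_moe.py are omitted here and may differ:

  import torch
  import torch.nn.functional as F

  def _round_up(x: int, align: int) -> int:
      return ((x + align - 1) // align) * align

  def pad_moe_inter_dim(w13: torch.Tensor, w2: torch.Tensor, align: int = 128):
      # Zero-pad the intermediate dimension so the MoE stage2 GEMM sees an
      # aligned inter_dim. Assumed layouts (may differ in sglang):
      #   w13: (num_experts, 2 * inter_dim, hidden_size)
      #   w2:  (num_experts, hidden_size, inter_dim)
      inter_dim = w2.shape[-1]
      pad = _round_up(inter_dim, align) - inter_dim
      if pad == 0:
          return w13, w2
      # w13 stacks the gate and up projections along dim 1, so each half is
      # padded separately to keep the two blocks aligned.
      w1, w3 = w13.chunk(2, dim=1)
      w1 = F.pad(w1, (0, 0, 0, pad))  # pad dim 1 (inter_dim) of the gate half
      w3 = F.pad(w3, (0, 0, 0, pad))  # pad dim 1 (inter_dim) of the up half
      w13 = torch.cat([w1, w3], dim=1)
      w2 = F.pad(w2, (0, pad))        # pad the last dim (inter_dim)
      return w13, w2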
- Change SGLANG_AITER_PAD_K to accept alignment value (e.g., 128, 256)
  instead of boolean, giving users control over alignment
- Replace repetitive padding code with torch.nn.functional.pad for cleaner,
  more efficient padding
- Add _pad_moe_weight helper function to eliminate code duplication in MoE
  padding logic
- Remove unnecessary variable reassignments that affected non-HIP paths
- Prepare for better TP sharding compatibility by making alignment explicit

This makes the code more maintainable, flexible, and less error-prone.
root added 13 commits November 29, 2025 06:26
…dding

- Move all weight padding to process_weights_after_loading (one-time, during model load)
- Linear weights: pad in compressed_tensors_w8a8_fp8.py before shuffle_weight
- MoE weights: pad in compressed_tensors_moe.py before shuffle_weight
- Forward pass: only pad input tensors to match pre-padded weights, not weights themselves
- This eliminates double padding and is more efficient (pad once vs every forward pass)
shuffle_weight requires BOTH dimensions to be divisible by 32, not just K.
With TP sharding, the N (output) dimension can also be unaligned (e.g., 2736 % 32 = 16).

Changes:
- Weight loading: pad both K and N dimensions before shuffle_weight
- Forward pass: unpad output N dimension to restore original size
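A rough sketch of the two-sided weight padding and the output unpad this commit describes, assuming weights are stored as (N, K) = (out_features, in_features); the actual shuffle_weight call sites in the AITer path may differ:

  import torch
  import torch.nn.functional as F

  def _round_up(x: int, align: int) -> int:
      return ((x + align - 1) // align) * align

  def pad_weight_for_shuffle(w: torch.Tensor, align_n: int = 32, align_k: int = 32) -> torch.Tensor:
      # shuffle_weight needs both dimensions divisible by 32, so zero-pad
      # N (rows) and K (columns) before shuffling.
      n, k = w.shape
      return F.pad(w, (0, _round_up(k, align_k) - k, 0, _round_up(n, align_n) - n))

  # In the forward pass the GEMM then produces (M, N_padded); slicing restores
  # the original output width for downstream layers:
  #   out = out[..., :orig_n]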
…gnment)

K dimension: pad to user-specified alignment (128/256) for GEMM requirements
N dimension: pad to 32 only (minimum for shuffle_weight)

This avoids over-padding N when user sets SGLANG_AITER_PAD_K=256.
Example: N=2736 → 2752 (32-align) instead of 2816 (256-align)
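The arithmetic in that example, spelled out (round_up is just an illustrative helper):

  def round_up(x: int, align: int) -> int:
      return ((x + align - 1) // align) * align

  # With SGLANG_AITER_PAD_K=256, K uses the 256 alignment but N only needs 32:
  assert round_up(2736, 32) == 2752    # N: minimal shuffle_weight alignment
  assert round_up(2736, 256) == 2816   # what over-padding N to 256 would cost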
- Remove all debug print statements from weight padding code
- Remove unused _pad_moe_weight helper function
- N dimension padding is always 32 (hard requirement for shuffle_weight)
- K dimension padding is user-controlled via SGLANG_AITER_PAD_K
- N padding (32): always applied (required by shuffle_weight)
- K padding (128/256): only when SGLANG_AITER_PAD_K is set (for GEMM kernels)
- Without flag: minimal 32-alignment for both (shuffle_weight only)
- With flag: K gets user alignment, N stays at 32

This avoids unnecessary padding when GEMM kernels don't have special requirements.
Without SGLANG_AITER_PAD_K flag:
- No padding at all (same as current main)
- Tests pass as they do today

With SGLANG_AITER_PAD_K=128/256:
- K padded to specified alignment (for GEMM)
- N padded to 32 (for shuffle_weight)
- Enables problematic models/TP configs to work

This ensures backward compatibility with existing tests and deployments.
- SGLANG_AITER_PAD_K: controls K dimension padding (for GEMM kernels)
- SGLANG_AITER_PAD_N: controls N dimension padding (for shuffle_weight)

Each flag is independent and clear in purpose:
- Set SGLANG_AITER_PAD_K=128 for GEMM requirements
- Set SGLANG_AITER_PAD_N=32 for shuffle_weight requirements
- Set both if needed for specific model/TP configurations

Example usage:
  export SGLANG_AITER_PAD_K=128 SGLANG_AITER_PAD_N=32
Add SGLANG_AITER_PAD_K and SGLANG_AITER_PAD_N flags to nightly test for:
- neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8
- zai-org/GLM-4.5-Air-FP8

These models with TP=2 require padding to avoid GEMM/shuffle_weight errors.

Set SGLANG_AITER_PAD_K=128 for GEMM kernel compatibility
Set SGLANG_AITER_PAD_N=32 for shuffle_weight compatibility

This fixes the exit code -9 (OOM/crash) errors in nightly CI.
…fter sharding

When padding N dimension for RowParallelLinear layers, we must account
for TP sharding. The N dimension is divided by tp_size, so we need to pad
to (alignment * tp_size) to ensure post-shard dimensions remain aligned.

Example: N=5472 with tp_size=2, pad_n=32
- Before fix: pad to 5472 (already %32==0), after TP: 2736 (%32==16) ❌
- After fix: pad to 5504 (%64==0), after TP: 2752 (%32==0) ✓

This fixes 'GEMM not supported' errors for GLM-4.5-Air-FP8 and similar
models when using TP>1.
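The same arithmetic as a small illustrative helper (hypothetical name, not the code in the PR):

  def tp_aware_round_up(n: int, pad_n: int, tp_size: int) -> int:
      # Pad N to a multiple of (pad_n * tp_size) so each TP shard of size
      # n // tp_size is still a multiple of pad_n after the split.
      align = pad_n * tp_size
      return ((n + align - 1) // align) * align

  # N = 5472, tp_size = 2, pad_n = 32: plain 32-alignment keeps 5472, but each
  # shard is 2736 (2736 % 32 == 16); TP-aware alignment pads to 5504, so each
  # shard is 2752 (2752 % 32 == 0).
  assert tp_aware_round_up(5472, 32, 2) == 5504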
sunxxuns force-pushed the amd-nightly-oom-models branch from bc1d422 to 0e5437a on November 29, 2025 at 06:26
root added 2 commits November 29, 2025 06:56
The previous commit applied TP-aware padding to all layers, which breaks
TP=1 and non-RowParallel layers. Now we:
- Check if layer is RowParallelLinear before applying TP-aware N padding
- Only apply when tp_size > 1
- For other layers (ColumnParallel, TP=1), use regular alignment

This fixes TP=1 while keeping TP=2+ working correctly.
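A sketch of the guard described in this commit; checking the class by name is illustrative (the real code presumably imports RowParallelLinear and uses isinstance):

  def n_pad_alignment(layer, pad_n: int, tp_size: int) -> int:
      # Per the commit above, only RowParallelLinear layers have the padded
      # dimension divided by tp_size, and only when tp_size > 1, so the
      # stronger pad_n * tp_size alignment is limited to that case.
      if tp_size > 1 and type(layer).__name__ == "RowParallelLinear":
          return pad_n * tp_size
      return pad_n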
TP=1 doesn't need any padding because:
- No dimension sharding occurs
- Original dimensions are already compatible with shuffle_weight
- Padding only needed for TP>1 to handle post-sharding alignment

This simplifies the logic and ensures TP=1 works without any modifications.
…le unpacking

The AITer backend returns a ForwardMetadata dataclass, not a tuple.
The old code tried to unpack it as a 7-element tuple, causing:
  TypeError: cannot unpack non-iterable ForwardMetadata object

Now we access the dataclass fields directly:
- attn_logits (optional, may not exist in AITer backend)
- kv_indptr
- kv_indices

This fixes DeepSeek-V2 models with TP>1 using AITer backend.
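A minimal sketch of the fix, reading fields off the dataclass instead of tuple-unpacking it; the function name is hypothetical and the field names follow the commit message:

  def unpack_forward_metadata(metadata):
      # attn_logits may be absent on the AITer backend, so fall back to None;
      # kv_indptr and kv_indices are read directly off the dataclass.
      attn_logits = getattr(metadata, "attn_logits", None)
      return attn_logits, metadata.kv_indptr, metadata.kv_indices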
root added 2 commits November 29, 2025 07:16
AITer backend uses separate cos_cache and sin_cache attributes,
while other backends use a combined cos_sin_cache attribute.

The code now:
1. Tries to get cos_sin_cache first (for non-AITer backends)
2. Falls back to creating tuple (cos_cache, sin_cache) for AITer

This fixes AttributeError: 'DeepseekScalingRotaryEmbedding' object
has no attribute 'cos_sin_cache'
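A short sketch of the fallback described here; attribute names come from the commit message and the helper name is illustrative:

  def get_cos_sin_cache(rotary_emb):
      # Non-AITer backends expose a combined cos_sin_cache; the AITer backend
      # keeps cos_cache and sin_cache separately, so return those as a pair.
      cache = getattr(rotary_emb, "cos_sin_cache", None)
      if cache is not None:
          return cache
      return (rotary_emb.cos_cache, rotary_emb.sin_cache)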
…ckends

AITer backend has num_kv_splits attribute.
Triton backend has max_kv_splits attribute.

Use getattr with fallback to support both backends.

Fixes: AttributeError: 'TritonAttnBackend' object has no attribute 'num_kv_splits'
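The same getattr-with-fallback pattern, sketched (helper name is illustrative):

  def get_kv_splits(attn_backend):
      # AITer exposes num_kv_splits, Triton exposes max_kv_splits; use
      # whichever attribute the backend actually has.
      return getattr(attn_backend, "num_kv_splits",
                     getattr(attn_backend, "max_kv_splits", None))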
sunxxuns removed the run-ci label on Nov 29, 2025