
Conversation

@sunxxuns
Collaborator

…tibility

Fixes "RuntimeError: This GEMM is not supported!" errors when running FP8 quantized models (e.g., GLM-4.5-Air-FP8) with tp_size=2 on AMD GPUs using the AITer backend.

Root cause: AITer's Composable Kernel GEMM operations require K dimension alignment to multiples of 128/256. After tensor parallelism splits dimensions, the K dimension may not meet these requirements, causing kernel selection to fail during both CUDA graph capture and inference.

Changes:

  • fp8_utils.py: Pad only K dimension in apply_fp8_linear when needed

    • Only applies to AITer backend on HIP (_use_aiter path)
    • K: pad to multiple of 128 (or 256 if already aligned to 128)
    • M, N: no padding needed (minimal overhead approach)
    • Apply padding for both CUDA graph capture and regular inference
    • No output unpadding needed since output shape is already (M, N)
    • Non-AITer and non-HIP code paths are unaffected
  • compressed_tensors_moe.py: Pad MoE layer weights (w13_weight, w2_weight) and scales based on inter_dim alignment (128/256) to fix MoE stage2 kernel errors

  • fp8.py: Add same MoE padding logic for Fp8LinearMethod

  • test_gsm8k_eval_amd.py: Add --max-running-requests and --mem-fraction-static flags to reduce memory pressure from padding overhead

The padding is applied transparently, only when the K dimension requires alignment and only for the AITer+HIP backend, ensuring kernel compatibility with minimal memory overhead and no impact on other backends.
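A minimal sketch of the K-dimension padding described above, assuming the pad is applied to the activation before FP8 quantization; the helper names are illustrative, not the actual functions in fp8_utils.py:

  import torch
  import torch.nn.functional as F

  def _aligned_k(k: int) -> int:
      # One reading of the rule above: align K to a multiple of 128, or 256 if
      # K is already a multiple of 128.
      align = 256 if k % 128 == 0 else 128
      return ((k + align - 1) // align) * align

  def _pad_k(x: torch.Tensor) -> torch.Tensor:
      # Zero-pad the last (K) dimension of an (M, K) activation. Since the
      # weight's K dimension is zero-padded the same way, the GEMM result is
      # unchanged and the output shape stays (M, N), so no unpadding is needed.
      k = x.shape[-1]
      k_padded = _aligned_k(k)
      if k_padded == k:
          return x
      return F.pad(x, (0, k_padded - k))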

Related: nightly-test-amd workflow failures

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

github-actions bot added the amd and dependencies labels on Nov 28, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @sunxxuns, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical runtime error affecting FP8 quantized models on AMD GPUs with the AITer backend, particularly when tensor parallelism is active. The core solution involves introducing a transparent padding mechanism for the K-dimension in general matrix multiplication (GEMM) operations and for the intermediate dimension (inter_dim) in Mixture-of-Experts (MoE) layers. This ensures that the dimensions are appropriately aligned for AITer's Composable Kernel, preventing kernel selection failures during both CUDA graph capture and standard inference. The changes are carefully scoped to the AITer+HIP backend to minimize impact on other configurations, and supporting infrastructure like dependency management and testing has been updated accordingly.

Highlights

  • FP8 GEMM/MoE Kernel Compatibility: Fixed 'RuntimeError: This GEMM is not supported!' errors encountered when running FP8 quantized models (e.g., GLM-4.5-Air-FP8) with tensor parallelism (tp_size=2) on AMD GPUs using the AITer backend. This issue stemmed from AITer's Composable Kernel GEMM operations requiring K-dimension alignment to multiples of 128/256, which tensor parallelism could disrupt.
  • K-Dimension Padding in FP8 Linear Layers: Implemented K-dimension padding within fp8_utils.py for apply_fp8_linear operations. This padding is applied only when necessary for the AITer backend on HIP, aligning the K-dimension to multiples of 128 (or 256 if already aligned to 128). The padding is transparent, affecting both CUDA graph capture and regular inference, and does not require output unpadding.
  • MoE Layer Weight and Scale Padding: Added padding logic to compressed_tensors_moe.py and fp8.py for MoE layer weights (w13_weight, w2_weight) and their corresponding scales. This padding ensures the inter_dim is aligned to 128/256, resolving MoE stage2 kernel errors on AMD GPUs.
  • Dependency Management and Test Updates: Refactored pyproject.toml to streamline dependencies, removing several specific CUDA/FlashInfer/Torch related packages from runtime_common and introducing new optional dependency groups for HIP, NPU, and HPU. The AMD nightly evaluation test (test_gsm8k_eval_amd.py) was updated to include memory optimization flags (--max-running-requests, --mem-fraction-static) and to focus specifically on the GLM-4.5-Air-FP8 model.

sunxxuns force-pushed the amd-nightly-oom-models branch from c3395e2 to d00bf4a on November 28, 2025 at 23:44
@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request introduces padding for the K-dimension in FP8 GEMM/MoE kernels to ensure compatibility with the AITer backend on AMD GPUs, which require specific dimension alignments. The changes correctly address the runtime errors by padding weights and scales in fp8_utils.py, compressed_tensors_moe.py, and fp8.py.

My review focuses on code quality and maintainability. While the core logic for padding appears correct, I've identified significant code duplication and numerous debug statements that should be addressed before merging. Specifically, the padding logic is duplicated in fp8.py, which can lead to maintenance issues and bugs. Additionally, extensive debug prints have been added across multiple files, which should be removed.

Please see my detailed comments for suggestions on refactoring and cleanup.

github-actions bot added the documentation and sgl-kernel labels on Nov 29, 2025
sunxxuns force-pushed the amd-nightly-oom-models branch 2 times, most recently from 5a79a5a to 3bce29c on November 29, 2025 at 00:20
sunxxuns force-pushed the amd-nightly-oom-models branch 2 times, most recently from dbcc07c to bc1d422 on November 29, 2025 at 05:57
BBuf and others added 6 commits November 29, 2025 13:57
Only certain AITer GEMM kernels require K dimension padding to 128.
Add SGLANG_AITER_PAD_K environment variable to enable when needed.

Usage:
  export SGLANG_AITER_PAD_K=1

This is opt-in to avoid unnecessary memory overhead when not required.
Future work: Auto-detect kernel requirements.
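As a rough illustration (the real flag plumbing in sglang may live elsewhere), the opt-in gate could be as simple as:

  import os

  # Opt-in: K padding is applied only when the user sets SGLANG_AITER_PAD_K=1,
  # so default behavior and memory use are unchanged.
  AITER_PAD_K_ENABLED = os.environ.get("SGLANG_AITER_PAD_K", "0") == "1"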
The MoE stage2 kernel also fails with unsupported GEMM when inter_dim
is not aligned to 128. Add padding for MoE weights (w13_weight, w2_weight)
and scales under the same SGLANG_AITER_PAD_K environment variable.

This keeps it simple - one flag controls all AITer dimension padding
(both K in regular GEMM and inter_dim in MoE).
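A hedged sketch of the MoE inter_dim padding this commit describes, assuming w13_weight is laid out as (num_experts, 2 * inter_dim, hidden_size) and w2_weight as (num_experts, hidden_size, inter_dim); scale padding and the exact layouts in compressed_tensors_moe.py are omitted here and may differ:

  import torch
  import torch.nn.functional as F

  def _round_up(x: int, align: int) -> int:
      return ((x + align - 1) // align) * align

  def pad_moe_inter_dim(w13: torch.Tensor, w2: torch.Tensor, align: int = 128):
      # Zero-pad the intermediate dimension so the MoE stage2 GEMM sees an
      # aligned inter_dim. Assumed layouts (may differ in sglang):
      #   w13: (num_experts, 2 * inter_dim, hidden_size)
      #   w2:  (num_experts, hidden_size, inter_dim)
      inter_dim = w2.shape[-1]
      pad = _round_up(inter_dim, align) - inter_dim
      if pad == 0:
          return w13, w2
      # w13 stacks the gate and up projections along dim 1, so each half is
      # padded separately to keep the two blocks aligned.
      w1, w3 = w13.chunk(2, dim=1)
      w1 = F.pad(w1, (0, 0, 0, pad))  # pad dim 1 (inter_dim) of the gate half
      w3 = F.pad(w3, (0, 0, 0, pad))  # pad dim 1 (inter_dim) of the up half
      w13 = torch.cat([w1, w3], dim=1)
      w2 = F.pad(w2, (0, pad))        # pad the last dim (inter_dim)
      return w13, w2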
- Change SGLANG_AITER_PAD_K to accept alignment value (e.g., 128, 256)
  instead of boolean, giving users control over alignment
- Replace repetitive padding code with torch.nn.functional.pad for cleaner,
  more efficient padding
- Add _pad_moe_weight helper function to eliminate code duplication in MoE
  padding logic
- Remove unnecessary variable reassignments that affected non-HIP paths
- Prepare for better TP sharding compatibility by making alignment explicit

This makes the code more maintainable, flexible, and less error-prone.
root added 13 commits November 29, 2025 06:26
…dding

- Move all weight padding to process_weights_after_loading (one-time, during model load)
- Linear weights: pad in compressed_tensors_w8a8_fp8.py before shuffle_weight
- MoE weights: pad in compressed_tensors_moe.py before shuffle_weight
- Forward pass: only pad input tensors to match pre-padded weights, not weights themselves
- This eliminates double padding and is more efficient (pad once vs every forward pass)
shuffle_weight requires BOTH dimensions to be divisible by 32, not just K.
With TP sharding, the N (output) dimension can also be unaligned (e.g., 2736 % 32 = 16).

Changes:
- Weight loading: pad both K and N dimensions before shuffle_weight
- Forward pass: unpad output N dimension to restore original size
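A rough sketch of the two-sided weight padding and the output unpad this commit describes, assuming weights are stored as (N, K) = (out_features, in_features); the actual shuffle_weight call sites in the AITer path may differ:

  import torch
  import torch.nn.functional as F

  def _round_up(x: int, align: int) -> int:
      return ((x + align - 1) // align) * align

  def pad_weight_for_shuffle(w: torch.Tensor, align_n: int = 32, align_k: int = 32) -> torch.Tensor:
      # shuffle_weight needs both dimensions divisible by 32, so zero-pad
      # N (rows) and K (columns) before shuffling.
      n, k = w.shape
      return F.pad(w, (0, _round_up(k, align_k) - k, 0, _round_up(n, align_n) - n))

  # In the forward pass the GEMM then produces (M, N_padded); slicing restores
  # the original output width for downstream layers:
  #   out = out[..., :orig_n]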
…gnment)

K dimension: pad to user-specified alignment (128/256) for GEMM requirements
N dimension: pad to 32 only (minimum for shuffle_weight)

This avoids over-padding N when user sets SGLANG_AITER_PAD_K=256.
Example: N=2736 → 2752 (32-align) instead of 2816 (256-align)
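The arithmetic in that example, spelled out (round_up is just an illustrative helper):

  def round_up(x: int, align: int) -> int:
      return ((x + align - 1) // align) * align

  # With SGLANG_AITER_PAD_K=256, K uses the 256 alignment but N only needs 32:
  assert round_up(2736, 32) == 2752    # N: minimal shuffle_weight alignment
  assert round_up(2736, 256) == 2816   # what over-padding N to 256 would cost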
- Remove all debug print statements from weight padding code
- Remove unused _pad_moe_weight helper function
- N dimension padding is always 32 (hard requirement for shuffle_weight)
- K dimension padding is user-controlled via SGLANG_AITER_PAD_K
- N padding (32): always applied (required by shuffle_weight)
- K padding (128/256): only when SGLANG_AITER_PAD_K is set (for GEMM kernels)
- Without flag: minimal 32-alignment for both (shuffle_weight only)
- With flag: K gets user alignment, N stays at 32

This avoids unnecessary padding when GEMM kernels don't have special requirements.
Without SGLANG_AITER_PAD_K flag:
- No padding at all (same as current main)
- Tests pass as they do today

With SGLANG_AITER_PAD_K=128/256:
- K padded to specified alignment (for GEMM)
- N padded to 32 (for shuffle_weight)
- Enables problematic models/TP configs to work

This ensures backward compatibility with existing tests and deployments.
- SGLANG_AITER_PAD_K: controls K dimension padding (for GEMM kernels)
- SGLANG_AITER_PAD_N: controls N dimension padding (for shuffle_weight)

Each flag is independent and clear in purpose:
- Set SGLANG_AITER_PAD_K=128 for GEMM requirements
- Set SGLANG_AITER_PAD_N=32 for shuffle_weight requirements
- Set both if needed for specific model/TP configurations

Example usage:
  export SGLANG_AITER_PAD_K=128 SGLANG_AITER_PAD_N=32
Add SGLANG_AITER_PAD_K and SGLANG_AITER_PAD_N flags to nightly test for:
- neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8
- zai-org/GLM-4.5-Air-FP8

These models with TP=2 require padding to avoid GEMM/shuffle_weight errors.

Set SGLANG_AITER_PAD_K=128 for GEMM kernel compatibility
Set SGLANG_AITER_PAD_N=32 for shuffle_weight compatibility

This fixes the exit code -9 (OOM/crash) errors in nightly CI.
…fter sharding

When padding N dimension for RowParallelLinear layers, we must account
for TP sharding. The N dimension is divided by tp_size, so we need to pad
to (alignment * tp_size) to ensure post-shard dimensions remain aligned.

Example: N=5472 with tp_size=2, pad_n=32
- Before fix: pad to 5472 (already %32==0), after TP: 2736 (%32==16) ❌
- After fix: pad to 5504 (%64==0), after TP: 2752 (%32==0) ✓

This fixes 'GEMM not supported' errors for GLM-4.5-Air-FP8 and similar
models when using TP>1.
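The same arithmetic as a small illustrative helper (hypothetical name, not the code in the PR):

  def tp_aware_round_up(n: int, pad_n: int, tp_size: int) -> int:
      # Pad N to a multiple of (pad_n * tp_size) so each TP shard of size
      # n // tp_size is still a multiple of pad_n after the split.
      align = pad_n * tp_size
      return ((n + align - 1) // align) * align

  # N = 5472, tp_size = 2, pad_n = 32: plain 32-alignment keeps 5472, but each
  # shard is 2736 (2736 % 32 == 16); TP-aware alignment pads to 5504, so each
  # shard is 2752 (2752 % 32 == 0).
  assert tp_aware_round_up(5472, 32, 2) == 5504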
sunxxuns force-pushed the amd-nightly-oom-models branch from bc1d422 to 0e5437a on November 29, 2025 at 06:26
root added 2 commits November 29, 2025 06:56
The previous commit applied TP-aware padding to all layers, which breaks
TP=1 and non-RowParallel layers. Now we:
- Check if layer is RowParallelLinear before applying TP-aware N padding
- Only apply when tp_size > 1
- For other layers (ColumnParallel, TP=1), use regular alignment

This fixes TP=1 while keeping TP=2+ working correctly.
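A sketch of the guard described in this commit; checking the class by name is illustrative (the real code presumably imports RowParallelLinear and uses isinstance):

  def n_pad_alignment(layer, pad_n: int, tp_size: int) -> int:
      # Per the commit above, only RowParallelLinear layers have the padded
      # dimension divided by tp_size, and only when tp_size > 1, so the
      # stronger pad_n * tp_size alignment is limited to that case.
      if tp_size > 1 and type(layer).__name__ == "RowParallelLinear":
          return pad_n * tp_size
      return pad_n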
TP=1 doesn't need any padding because:
- No dimension sharding occurs
- Original dimensions are already compatible with shuffle_weight
- Padding only needed for TP>1 to handle post-sharding alignment

This simplifies the logic and ensures TP=1 works without any modifications.
…le unpacking

The AITer backend returns a ForwardMetadata dataclass, not a tuple.
The old code tried to unpack it as a 7-element tuple, causing:
  TypeError: cannot unpack non-iterable ForwardMetadata object

Now we access the dataclass fields directly:
- attn_logits (optional, may not exist in AITer backend)
- kv_indptr
- kv_indices

This fixes DeepSeek-V2 models with TP>1 using AITer backend.
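A minimal sketch of the fix, reading fields off the dataclass instead of tuple-unpacking it; the function name is hypothetical and the field names follow the commit message:

  def unpack_forward_metadata(metadata):
      # attn_logits may be absent on the AITer backend, so fall back to None;
      # kv_indptr and kv_indices are read directly off the dataclass.
      attn_logits = getattr(metadata, "attn_logits", None)
      return attn_logits, metadata.kv_indptr, metadata.kv_indices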
root added 2 commits November 29, 2025 07:16
AITer backend uses separate cos_cache and sin_cache attributes,
while other backends use a combined cos_sin_cache attribute.

The code now:
1. Tries to get cos_sin_cache first (for non-AITer backends)
2. Falls back to creating tuple (cos_cache, sin_cache) for AITer

This fixes AttributeError: 'DeepseekScalingRotaryEmbedding' object
has no attribute 'cos_sin_cache'
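A short sketch of the fallback described here; attribute names come from the commit message and the helper name is illustrative:

  def get_cos_sin_cache(rotary_emb):
      # Non-AITer backends expose a combined cos_sin_cache; the AITer backend
      # keeps cos_cache and sin_cache separately, so return those as a pair.
      cache = getattr(rotary_emb, "cos_sin_cache", None)
      if cache is not None:
          return cache
      return (rotary_emb.cos_cache, rotary_emb.sin_cache)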
…ckends

AITer backend has num_kv_splits attribute.
Triton backend has max_kv_splits attribute.

Use getattr with fallback to support both backends.

Fixes: AttributeError: 'TritonAttnBackend' object has no attribute 'num_kv_splits'
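The same getattr-with-fallback pattern, sketched (helper name is illustrative):

  def get_kv_splits(attn_backend):
      # AITer exposes num_kv_splits, Triton exposes max_kv_splits; use
      # whichever attribute the backend actually has.
      return getattr(attn_backend, "num_kv_splits",
                     getattr(attn_backend, "max_kv_splits", None))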
sunxxuns removed the run-ci label on Nov 29, 2025