fix(amd): Add K-dimension padding for AITer FP8 GEMM/MoE kernel compatibility #14114
Conversation
Summary of Changes
Hello @sunxxuns, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical runtime error affecting FP8 quantized models on AMD GPUs with the AITer backend, particularly when tensor parallelism is active. The core solution introduces a transparent padding mechanism for the K dimension in general matrix multiplication (GEMM) operations and for the intermediate dimension (inter_dim) in MoE layers.
Code Review
This pull request introduces padding for the K-dimension in FP8 GEMM/MoE kernels to ensure compatibility with the AITer backend on AMD GPUs, which require specific dimension alignments. The changes correctly address the runtime errors by padding weights and scales in fp8_utils.py, compressed_tensors_moe.py, and fp8.py.
My review focuses on code quality and maintainability. While the core logic for padding appears correct, I've identified significant code duplication and numerous debug statements that should be addressed before merging. Specifically, the padding logic is duplicated in fp8.py, which can lead to maintenance issues and bugs. Additionally, extensive debug prints have been added across multiple files, which should be removed.
Please see my detailed comments for suggestions on refactoring and cleanup.
This reverts commit 6f48bbd.
Only certain AITer GEMM kernels require K-dimension padding to 128. Add the SGLANG_AITER_PAD_K environment variable to enable it when needed.

Usage: export SGLANG_AITER_PAD_K=1

This is opt-in to avoid unnecessary memory overhead when not required. Future work: auto-detect kernel requirements.
The MoE stage2 kernel also fails with unsupported GEMM when inter_dim is not aligned to 128. Add padding for MoE weights (w13_weight, w2_weight) and scales under the same SGLANG_AITER_PAD_K environment variable. This keeps it simple - one flag controls all AITer dimension padding (both K in regular GEMM and inter_dim in MoE).
- Change SGLANG_AITER_PAD_K to accept an alignment value (e.g., 128, 256) instead of a boolean, giving users control over the alignment
- Replace repetitive padding code with torch.nn.functional.pad for cleaner, more efficient padding
- Add a _pad_moe_weight helper function to eliminate code duplication in the MoE padding logic
- Remove unnecessary variable reassignments that affected non-HIP paths
- Prepare for better TP sharding compatibility by making alignment explicit

This makes the code more maintainable, flexible, and less error-prone.
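A minimal sketch of the F.pad-based K padding described in the commits above, assuming a 2-D [N, K] weight; the helper name is hypothetical, and only the SGLANG_AITER_PAD_K flag semantics (an alignment value, unset means no padding) come from the commit messages.

```python
import os

import torch
import torch.nn.functional as F


def pad_last_dim_to_env_alignment(weight: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: read the alignment from SGLANG_AITER_PAD_K
    # (e.g. 128 or 256; unset or 0 means "no padding") and zero-pad the
    # trailing K dimension of an [N, K] weight using F.pad.
    align = int(os.environ.get("SGLANG_AITER_PAD_K", "0") or 0)
    if align <= 0:
        return weight
    pad = (-weight.shape[-1]) % align  # columns needed to reach the next multiple
    if pad == 0:
        return weight
    # For the last dimension, F.pad takes (left, right) padding amounts.
    return F.pad(weight, (0, pad), value=0)
```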
…dding

- Move all weight padding to process_weights_after_loading (one-time, during model load)
- Linear weights: pad in compressed_tensors_w8a8_fp8.py before shuffle_weight
- MoE weights: pad in compressed_tensors_moe.py before shuffle_weight
- Forward pass: only pad input tensors to match the pre-padded weights, not the weights themselves

This eliminates double padding and is more efficient (pad once vs. every forward pass).
shuffle_weight requires BOTH dimensions to be divisible by 32, not just K. With TP sharding, the N (output) dimension can also be unaligned (e.g., 2736 % 32 = 16).

Changes:
- Weight loading: pad both K and N dimensions before shuffle_weight
- Forward pass: unpad the output N dimension to restore the original size
…gnment)

- K dimension: pad to the user-specified alignment (128/256) for GEMM requirements
- N dimension: pad to 32 only (the minimum for shuffle_weight)

This avoids over-padding N when the user sets SGLANG_AITER_PAD_K=256. Example: N=2736 → 2752 (32-aligned) instead of 2816 (256-aligned).
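A hedged sketch of this split alignment (function names are hypothetical, assuming an [N, K] weight layout): K is padded to the user-chosen GEMM alignment, N only to the 32 elements shuffle_weight needs, and the forward pass slices the output back to the original N.

```python
import torch
import torch.nn.functional as F


def pad_weight_for_shuffle(weight: torch.Tensor, k_align: int, n_align: int = 32):
    # Hypothetical sketch for an [N, K] weight: K follows the user-chosen GEMM
    # alignment (128/256), N only needs the 32-element alignment that
    # shuffle_weight requires.
    n, k = weight.shape
    pad_k = (-k) % k_align if k_align > 0 else 0
    pad_n = (-n) % n_align
    padded = F.pad(weight, (0, pad_k, 0, pad_n), value=0)
    # Return the original N so the forward pass can slice the output back.
    return padded, n


def unpad_output(out: torch.Tensor, orig_n: int) -> torch.Tensor:
    # Forward-pass side: drop the padded output columns, e.g. 2752 -> 2736.
    return out[..., :orig_n]
```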
- Remove all debug print statements from the weight padding code
- Remove the unused _pad_moe_weight helper function
- N-dimension padding is always 32 (a hard requirement for shuffle_weight)
- K-dimension padding is user-controlled via SGLANG_AITER_PAD_K
- N padding (32): always applied (required by shuffle_weight)
- K padding (128/256): only when SGLANG_AITER_PAD_K is set (for GEMM kernels)
- Without the flag: minimal 32-alignment for both (shuffle_weight only)
- With the flag: K gets the user alignment, N stays at 32

This avoids unnecessary padding when the GEMM kernels don't have special requirements.
Without the SGLANG_AITER_PAD_K flag:
- No padding at all (same as current main)
- Tests pass as they do today

With SGLANG_AITER_PAD_K=128/256:
- K padded to the specified alignment (for GEMM)
- N padded to 32 (for shuffle_weight)
- Enables problematic models/TP configs to work

This ensures backward compatibility with existing tests and deployments.
- SGLANG_AITER_PAD_K: controls K-dimension padding (for GEMM kernels)
- SGLANG_AITER_PAD_N: controls N-dimension padding (for shuffle_weight)

Each flag is independent and clear in purpose:
- Set SGLANG_AITER_PAD_K=128 for GEMM requirements
- Set SGLANG_AITER_PAD_N=32 for shuffle_weight requirements
- Set both if needed for specific model/TP configurations

Example usage: export SGLANG_AITER_PAD_K=128 SGLANG_AITER_PAD_N=32
Add the SGLANG_AITER_PAD_K and SGLANG_AITER_PAD_N flags to the nightly test for:
- neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8
- zai-org/GLM-4.5-Air-FP8

These models with TP=2 require padding to avoid GEMM/shuffle_weight errors. Set SGLANG_AITER_PAD_K=128 for GEMM kernel compatibility and SGLANG_AITER_PAD_N=32 for shuffle_weight compatibility. This fixes the exit code -9 (OOM/crash) errors in nightly CI.
…fter sharding

When padding the N dimension for RowParallelLinear layers, we must account for TP sharding. The N dimension is divided by tp_size, so we need to pad to (alignment * tp_size) to ensure post-shard dimensions remain aligned.

Example: N=5472 with tp_size=2, pad_n=32
- Before fix: pad to 5472 (already %32==0), after TP: 2736 (%32==16) ❌
- After fix: pad to 5504 (%64==0), after TP: 2752 (%32==0) ✓

This fixes 'GEMM not supported' errors for GLM-4.5-Air-FP8 and similar models when using TP>1.
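The target-size arithmetic from this commit as a small sketch (hypothetical function name); the assertions reproduce the worked example above.

```python
def tp_aware_padded_n(n: int, n_align: int, tp_size: int) -> int:
    # Sketch of the rule above: pad the full (unsharded) N to a multiple of
    # n_align * tp_size so every per-rank shard stays n_align-aligned.
    effective = n_align * max(tp_size, 1)
    return n + ((-n) % effective)


# Worked example from the commit message:
assert tp_aware_padded_n(5472, 32, 1) == 5472  # already 32-aligned
assert tp_aware_padded_n(5472, 32, 2) == 5504  # 5504 / 2 = 2752, and 2752 % 32 == 0
```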
The previous commit applied TP-aware padding to all layers, which breaks TP=1 and non-RowParallel layers. Now we:
- Check if the layer is RowParallelLinear before applying TP-aware N padding
- Only apply it when tp_size > 1
- For other layers (ColumnParallel, TP=1), use regular alignment

This fixes TP=1 while keeping TP=2+ working correctly.
TP=1 doesn't need any padding because:
- No dimension sharding occurs
- Original dimensions are already compatible with shuffle_weight
- Padding is only needed for TP>1 to handle post-sharding alignment

This simplifies the logic and ensures TP=1 works without any modifications.
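A sketch of the gating logic from these two commits, with a hypothetical helper name and an assumed import path for RowParallelLinear; only the conditions themselves come from the commit messages.

```python
def n_padding_alignment(layer, base_align: int, tp_size: int) -> int:
    # Hypothetical gate mirroring the two commits above. Returning 0 here
    # means "skip padding entirely".
    from sglang.srt.layers.linear import RowParallelLinear  # assumed import path

    if tp_size <= 1:
        return 0  # TP=1: no sharding, dimensions already work with shuffle_weight
    if isinstance(layer, RowParallelLinear):
        return base_align * tp_size  # TP-aware padding for the sharded N dimension
    return base_align  # ColumnParallel and other layers keep the regular alignment
```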
…le unpacking

The AITer backend returns a ForwardMetadata dataclass, not a tuple. The old code tried to unpack it as a 7-element tuple, causing: TypeError: cannot unpack non-iterable ForwardMetadata object

Now we access the dataclass fields directly:
- attn_logits (optional, may not exist in the AITer backend)
- kv_indptr
- kv_indices

This fixes DeepSeek-V2 models with TP>1 using the AITer backend.
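A self-contained sketch of the access pattern this commit describes. The field names come from the commit message; the toy dataclass and the reader function are illustrative stand-ins, not the backend's actual types.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ForwardMetadata:
    # Field names taken from the commit message; the real dataclass lives in
    # the AITer attention backend and may carry additional fields.
    kv_indptr: torch.Tensor
    kv_indices: torch.Tensor
    attn_logits: Optional[torch.Tensor] = None


def read_forward_metadata(metadata: ForwardMetadata):
    # Access the fields directly instead of tuple-unpacking; attn_logits may be
    # absent on some backends, so fall back to None via getattr.
    kv_indptr = metadata.kv_indptr
    kv_indices = metadata.kv_indices
    attn_logits = getattr(metadata, "attn_logits", None)
    return kv_indptr, kv_indices, attn_logits
```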
The AITer backend uses separate cos_cache and sin_cache attributes, while other backends use a combined cos_sin_cache attribute. The code now:
1. Tries to get cos_sin_cache first (for non-AITer backends)
2. Falls back to creating the tuple (cos_cache, sin_cache) for AITer

This fixes AttributeError: 'DeepseekScalingRotaryEmbedding' object has no attribute 'cos_sin_cache'
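A hedged sketch of this fallback order; the attribute names come from the commit message, and the wrapper function is hypothetical.

```python
def get_cos_sin_cache(rotary_emb):
    # Sketch of the fallback described above: prefer the combined cache used by
    # most backends, else build a (cos, sin) tuple from the AITer-style
    # separate caches.
    cos_sin = getattr(rotary_emb, "cos_sin_cache", None)
    if cos_sin is not None:
        return cos_sin
    return (rotary_emb.cos_cache, rotary_emb.sin_cache)
```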
…ckends

The AITer backend has a num_kv_splits attribute; the Triton backend has a max_kv_splits attribute. Use getattr with a fallback to support both backends.

Fixes: AttributeError: 'TritonAttnBackend' object has no attribute 'num_kv_splits'
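A small sketch of the getattr fallback this commit describes; the wrapper name is hypothetical, the attribute names are from the commit message.

```python
def get_kv_splits(attn_backend, default=None):
    # Sketch of the fallback above: the AITer backend exposes num_kv_splits,
    # while the Triton backend exposes max_kv_splits.
    return getattr(
        attn_backend,
        "num_kv_splits",
        getattr(attn_backend, "max_kv_splits", default),
    )
```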
fix(amd): Add K-dimension padding for AITer FP8 GEMM/MoE kernel compatibility
Fixes "RuntimeError: This GEMM is not supported!" errors when running FP8 quantized models (e.g., GLM-4.5-Air-FP8) with tp_size=2 on AMD GPUs using AITer backend.
Root cause: AITer's Composable Kernel GEMM operations require K dimension alignment to multiples of 128/256. After tensor parallelism splits dimensions, the K dimension may not meet these requirements, causing kernel selection to fail during both CUDA graph capture and inference.
Changes:
fp8_utils.py: Pad only K dimension in apply_fp8_linear when needed
compressed_tensors_moe.py: Pad MoE layer weights (w13_weight, w2_weight) and scales based on inter_dim alignment (128/256) to fix MoE stage2 kernel errors
fp8.py: Add same MoE padding logic for Fp8LinearMethod
test_gsm8k_eval_amd.py: Add --max-running-requests and --mem-fraction-static flags to reduce memory pressure from padding overhead
The padding is applied transparently, only when the K dimension requires alignment and only for the AITer+HIP backend, ensuring kernel compatibility with minimal memory overhead and no impact on other backends.
Related: nightly-test-amd workflow failures
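To make the "transparent" forward-pass side of the changes above concrete, here is a minimal sketch (hypothetical helper, not the actual code in fp8_utils.py) of padding an activation's K dimension to match a weight that was already padded at load time.

```python
import torch
import torch.nn.functional as F


def pad_input_to_weight_k(x: torch.Tensor, weight_k: int) -> torch.Tensor:
    # Hypothetical helper for the forward-pass side: zero-pad the activation's
    # trailing K dimension to match an already padded weight. The extra zero
    # columns contribute nothing to the dot product, so results for the
    # original K columns are unchanged.
    pad = weight_k - x.shape[-1]
    return F.pad(x, (0, pad), value=0) if pad > 0 else x
```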
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist