[Ascend] qwen optimization #12078
Conversation
Summary of Changes
Hello @Liwansi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a series of targeted optimizations for Qwen models, specifically for NPU (Ascend) hardware. The changes focus on enhancing the performance and memory efficiency of critical components such as paged attention, Mixture-of-Experts (MoE) gating, rotary position embeddings, and KV cache management by integrating specialized NPU kernels. Additionally, it includes a memory optimization for quantized weights and dynamic input handling for NPU graphs.
Code Review
This pull request introduces several optimizations for Qwen models on Ascend NPUs, including the use of new, specialized kernels for attention, MoE gating, and rotary embeddings. It also contains a memory optimization for weight loading. While most of the changes appear correct and beneficial, I've identified a critical issue in python/sglang/srt/mem_cache/memory_pool.py where an incorrect kernel is used for MLA KV cache updates, which will lead to runtime errors. Additionally, there is some commented-out code in python/sglang/srt/layers/moe/topk.py that should be removed for better code clarity. My review provides specific suggestions to address these points.
python/sglang/srt/mem_cache/memory_pool.py

import torch_npu

if loc.dtype != torch.int32:
    loc = loc.to(torch.int32)

torch_npu._npu_reshape_and_cache(
    key=cache_k,
    value=cache_v,
    key_cache=self.k_buffer[layer_id].view(
        -1, self.page_size, self.head_num, self.head_dim
    ),
    value_cache=self.v_buffer[layer_id].view(
        -1, self.page_size, self.head_num, self.head_dim
    ),
    slot_indices=loc,
)
This change introduces a bug. The class AscendMLAPagedTokenToKVPool does not have self.head_num or self.head_dim attributes, which will cause an AttributeError. This class is for MLA (Multi-head Latent Attention) and should use self.kv_lora_rank and self.qk_rope_head_dim.
Furthermore, the _npu_reshape_and_cache kernel seems designed for MHA (Multi-Head Attention) and may not be suitable for MLA, especially since MLA can have different dimensions for key and value components. The previous implementation using torch_npu.npu_scatter_nd_update_ was likely correct for MLA.
Also, the buffer indexing self.k_buffer[layer_id] is incorrect. It should be self.k_buffer[layer_id - self.start_layer] to account for pipeline parallelism stages.
Finally, the import torch_npu inside the method is against style guidelines and is redundant as it's already imported at the top of the file.
I recommend reverting this block to the previous implementation.
torch_npu.npu_scatter_nd_update_(
self.k_buffer[layer_id - self.start_layer].view(-1, 1, self.kv_lora_rank),
loc.view(-1, 1),
cache_k.view(-1, 1, self.kv_lora_rank),
)
torch_npu.npu_scatter_nd_update_(
self.v_buffer[layer_id - self.start_layer].view(
-1, 1, self.qk_rope_head_dim
),
loc.view(-1, 1),
cache_v.view(-1, 1, self.qk_rope_head_dim),
)
python/sglang/srt/layers/moe/topk.py
Outdated
# assert (
#     num_token_non_padded is None
# ), "num_token_non_padded is not yet supported in fused_topk_native"
# assert expert_location_dispatch_info is None
# assert not apply_routed_scaling_factor_on_output, "Not implemented"
Remove this unused commented-out code.
python/sglang/srt/layers/moe/topk.py
Outdated
    )
    topk_weights = topk_weights.to(torch.float)
else:
    topk_weights, topk_ids = fused_topk_native(
Style: Put the common branch (NV GPU) as the first branch in the code whenever possible.
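For illustration, a minimal self-contained sketch of that ordering (plain PyTorch with hypothetical names, not the actual sglang call sites): the common NV GPU path sits in the first branch and the NPU-specific fused kernel is deferred to the else branch.

import torch

def gating_topk(gating_output: torch.Tensor, topk: int, is_npu: bool = False):
    # Common (NV GPU) branch first, per the review suggestion.
    if not is_npu:
        scores = torch.softmax(gating_output, dim=-1)
        topk_weights, topk_ids = torch.topk(scores, k=topk, dim=-1)
        return topk_weights.float(), topk_ids
    # NPU-specific branch second; the real code would call
    # torch_npu.npu_moe_gating_top_k_softmax here instead.
    raise NotImplementedError("fused NPU gating kernel goes here")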
torch_npu.npu_scatter_nd_update_(
    self.v_buffer[layer_id - self.start_layer].view(
        -1, 1, self.qk_rope_head_dim
import torch_npu
Move the whole class into a new file, memory_pool_ascend.py.
)


def wait_cmo_stream():
Move all NPU-specific functions into a separate file, python/sglang/srt/utils/npu_common.py.
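A rough sketch of what such a split could look like (the module path follows the comment; the helper bodies here are placeholders, not the PR's actual implementations):

# python/sglang/srt/utils/npu_common.py  (sketch only)
import importlib.util


def is_npu_available() -> bool:
    # Illustrative check for the torch_npu extension.
    return importlib.util.find_spec("torch_npu") is not None


def wait_cmo_stream() -> None:
    # Placeholder for the CMO-stream sync helper shown in the diff;
    # the real body lives in the PR, not here.
    raise NotImplementedError

Call sites would then import from the new module, e.g. from sglang.srt.utils.npu_common import wait_cmo_stream.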
python/sglang/srt/layers/moe/topk.py
Outdated
        correction_bias=correction_bias,
    )
else:
    topk_weights, topk_ids, _ = torch_npu.npu_moe_gating_top_k_softmax(
This kernel has already been introduced; please pull the latest code.
python/sglang/srt/layers/moe/topk.py
Outdated
_is_cpu_amx_available = cpu_has_amx_support()
_is_npu = is_npu()
_use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip
_use_gating_topk_fused = get_bool_env_var("SGLANG_USE_GATING_TOPK_FUSED") and _is_npu
We don't need this env var; use this kernel by default if it's robust enough.
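If the env var is dropped, the gate collapses to the platform check alone, something like this (a sketch against the snippet shown above):

_is_cpu_amx_available = cpu_has_amx_support()
_is_npu = is_npu()
_use_aiter = get_bool_env_var("SGLANG_USE_AITER") and _is_hip
# Enable the fused gating top-k kernel by default on NPU instead of
# gating it behind SGLANG_USE_GATING_TOPK_FUSED.
_use_gating_topk_fused = _is_npu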
        cache = torch.cat((cos, sin), dim=-1)
        return cache

    def _get_cos_sin_with_position(self, positions, layer_id):
Give this a more general implementation that benefits GPU as well.
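One possible device-agnostic shape for such a helper (a standalone sketch only; the PR's method, its cache layout, and the layer_id handling are specific to the model code):

import torch


def get_cos_sin_with_position(cos_sin_cache: torch.Tensor, positions: torch.Tensor):
    # cos_sin_cache is assumed to be [max_position, 2 * rotary_dim], built as
    # torch.cat((cos, sin), dim=-1) like the cache above. Plain tensor indexing
    # like this works on GPU and NPU alike.
    cos_sin = cos_sin_cache.index_select(0, positions.flatten())
    cos, sin = cos_sin.chunk(2, dim=-1)
    return cos, sin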
revert this change
        and self.compatible_with_fused_kv_buffer
        else None
    ),
    layer_id=self.layer_id if _is_npu else None,
revert this file
python/sglang/srt/utils/__init__.py
Outdated
@@ -1,2 +1,3 @@
# Temporarily do this to avoid changing all imports in the repo
from sglang.srt.utils.common import *
from sglang.srt.utils.npu_common import *
No need to change this in this PR.
python/sglang/srt/utils/common.py
Outdated
|
|
from sglang.srt.environ import envs
from sglang.srt.metrics.func_timer import enable_func_timer
from sglang.srt.utils.npu_common import get_npu_compiler_config, get_npu_memory_capacity
Revert this file; there is no need to change it in this PR.
raise ValueError(f"Not Supported DeepEP format {dispatch_output.format}")


class NpuFuseEPMoE(DeepEPMoE):
move to separate files
will be solved in #13359
)


def forward_prepare(
def forward_prepare_npu(
Please try to avoid this model change and create a fusion pass for this kernel instead. We have supported a fusion pass manager since PR #11104.
/tag-and-rerun-ci
Force-pushed from 96a5f1c to ba1e92f.
# .contiguous() introduces additional memory overhead and needs to be released using resize_(0)
origin_weight = weight.data.transpose(1, 2)
new_weight = origin_weight.contiguous()
origin_weight.untyped_storage().resize_(0)
I don't really see anything here that truly fixes the issue; using flatten is just another workaround from my point of view.
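For context, a standalone illustration of the pattern under discussion (generic tensor names; the real code operates on layer.w13_weight / layer.w2_weight):

import torch

weight = torch.randn(8, 16, 32)  # stand-in for a quantized MoE weight

# transpose() returns a view that shares storage with `weight`; contiguous()
# then materializes a second, reordered copy, so both layouts briefly coexist.
origin_weight = weight.data.transpose(1, 2)
new_weight = origin_weight.contiguous()

# Shrinking the original storage releases the first copy so only the
# contiguous tensor keeps memory; `weight` must not be used after this.
origin_weight.untyped_storage().resize_(0)

As the reviewer notes, flatten would be a similar workaround rather than a structural fix.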
Force-pushed from f5c8a7e to 8418724.
/tag-and-rerun-ci
Motivation
related to #10337
Modifications
- Bugfix:
1. Memory bugfix (w8a8_int8.py): in the previous code, both layer.w13_weight and layer.w2_weight occupied double memory; this is now fixed.
2. Cache Management Operation (CMO) bugfix (common.py): in some circumstances (BS=1, 2), deadlocks could occur due to issues with stream sync and waiting; this is now fixed.
3. EPLB bugfix: the EPLB index operator is also introduced when the map path is not set.
- Optimization:
1. Use the high-performance Triton fused op split_qkv_rmsnorm_rope, which involves modifications to both rotary_embedding.py and qwen3_moe.py.
2. Use the high-performance Triton op l1_norm, which involves modifications to moe/topk.py (a reference sketch of its semantics follows this list).
3. Support PA (paged attention) in NPUGraph once the relevant software packages have been released, which involves modifications to npu_graph_runner.py and ascend_backend.py; see [Ascend] optimize Qwen3 on Ascend #10574.
4. Add a MoE A2A backend named ascend_fuseep, which only supports decode under PD disaggregation and integrates the dispatch, gmm1, swiglu, gmm2, and combine ops. Performance improved by 10% on Qwen3 235B.
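For reference, the l1_norm step in optimization item 2 presumably corresponds to renormalizing the top-k routing weights so each row sums to 1; a plain-PyTorch equivalent of that semantics (not the Triton kernel itself) is:

import torch


def l1_normalize_topk_weights(topk_weights: torch.Tensor, eps: float = 1e-20) -> torch.Tensor:
    # Divide each row of top-k routing weights by its L1 norm.
    return topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + eps)


# Example: routing weights for 4 tokens with top-2 experts each.
w = torch.rand(4, 2)
print(l1_normalize_topk_weights(w).sum(dim=-1))  # each entry is ~1.0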
Accuracy Tests
Benchmarking and Profiling
Checklist