NPU Graph Compilation support and PassManager with AddRmsNorm & Quantize fuse. TorchAir compiler backend support. #11104
Open: eshoguli wants to merge 46 commits into sgl-project:main from eshoguli:eshogulin/pass_manager
+2,276
−79
Conversation
This was referenced Sep 30, 2025
ssshinigami (Contributor) approved these changes on Nov 20, 2025:

> LGTM
VDV1985 reviewed these changes on Nov 20, 2025.
Motivation
- Model compilation support (`--enable-torch-compile`)
- Piecewise graph support (`--enable-piecewise-npu-graph-decode`)
- `PassManager` for current and future fuses in Python via `torch.fx.replace_pattern`. Fuses can be easily developed by external contributors.
- Fusing the `AddRmsNorm` and `AscendQuantV2` kernels into the `AddRmsNormQuant` kernel.

Original comment: [feat] npu support enable_torch_compile #12371
TorchAir (Torch Ascend Intermediate Representation) is an extension library that provides graph-mode capabilities for torch_npu. It enables users to perform graph-mode inference on NPU using PyTorch and torch_npu. TorchAir offers a torch.compile backend for NPU that interfaces with torch._dynamo; through the features below, it optimizes the performance and extends the capabilities of the torch fx graph.
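To make the backend mechanism concrete, here is a minimal, hedged sketch of how a torch.compile backend plugs into torch._dynamo. TorchAir registers such a backend for NPU; the `toy_backend` below is a CPU-only stand-in (not TorchAir's API) that receives the traced `torch.fx.GraphModule` — the place where fx passes such as operator fusion would run — and returns a callable:

```python
import torch

def toy_backend(gm: torch.fx.GraphModule, example_inputs):
    # A real backend (e.g. TorchAir) would optimize gm.graph here
    # (fuse operators, compile static kernels, capture an NPU graph).
    print(f"captured {len(list(gm.graph.nodes))} fx nodes")
    return gm.forward  # fall back to running the fx graph as-is

@torch.compile(backend=toy_backend)
def f(x):
    return torch.relu(x) + 1

out = f(torch.tensor([-1.0, 2.0]))  # dynamo traces f, then calls toy_backend
```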
TorchAir Main Features:
How to enable compilation and fuses for `NPUGraph` decode:
How to enable piecewise graph and fuses for decode:
How to enable TorchAir for decode:
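The original launch commands were lost in the page export. The following is a hedged reconstruction using only the flags named in this PR; the model path is a placeholder and exact flag combinations may differ (the TorchAir-specific flag is not recoverable from this page and is omitted):

```shell
# Compilation and fuses for NPUGraph decode:
python -m sglang.launch_server --model-path <model> \
    --enable-torch-compile --torch-compile-max-bs 16

# Piecewise graph and fuses for decode:
python -m sglang.launch_server --model-path <model> \
    --enable-piecewise-npu-graph-decode \
    --compilation-config '{"splitting_ops": ["atb._npu_paged_attention"]}'
```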
CANN version: 8.2
Torch NPU version: `torch-npu 2.6.0.post3`

Modifications
1. Model compilation support by `torch.compile`. Use `--enable-torch-compile` to enable compilation and the optional `--torch-compile-max-bs` argument to limit the max batch size for compilation.
2. `NpuGraphCompilerBackend` compilation backend for NPU Graph capturing. Implemented in `python/sglang/srt/model_executor/compilation/npu_graph_compiler_backend.py`.
3. `PiecewiseNpuGraphCompilerBackend` compilation backend for piecewise graph and partial NPU Graph capturing. Inherited from `NpuGraphCompilerBackend` to reuse fusing passes. Implemented in `python/sglang/srt/model_executor/compilation/piecewise_npu_graph_compiler_backend.py`. You can use `--enable-piecewise-npu-graph-decode` to enable the piecewise graph.
4. Optional command-line arguments: `--compilation-config '{"splitting_ops": ["atb._npu_paged_attention"]}'` to configure the compilation backend, `--cuda-graph-bs` to specify batch sizes, and `--cuda-graph-max-bs` to limit the max batch size.
5. `PassManager` passes manager and passes in `python/sglang/srt/model_executor/compilation/passes/w8a8_int8` to optimize the model during compilation.
6. The `RotaryEmbedding` layer uses an NPU kernel in `forward` instead of the native implementation.
7. `python/sglang/srt/layers/attention/ascend_backend.py`:
   1. Rewrite the capture function;
   2. Encapsulate the kvcache input (the input needs the whole kvcache);
   3. Pad the block table to the max length;
   4. TorchAir input preparation.
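The fuse passes managed by the `PassManager` rely on `torch.fx.replace_pattern`, which swaps every occurrence of a "pattern" subgraph for a "replacement" subgraph. Below is a hedged, self-contained sketch of that mechanism using a toy add+relu fusion; `fused_add_relu` is a hypothetical stand-in, while the PR fuses `AddRmsNorm` + `AscendQuantV2` into the real `AddRmsNormQuant` NPU kernel the same way:

```python
import torch
import torch.fx as fx

# Subgraph to search for in the traced model.
def pattern(x, y):
    return torch.relu(x + y)

# Stand-in for a fused kernel (NOT a real NPU op).
def fused_add_relu(x, y):
    return torch.clamp_min(x + y, 0.0)

# Subgraph to substitute wherever the pattern matches.
def replacement(x, y):
    return fused_add_relu(x, y)

class M(torch.nn.Module):
    def forward(self, x, y):
        return torch.relu(x + y) * 2

gm = fx.symbolic_trace(M())
fx.replace_pattern(gm, pattern, replacement)  # rewrites add+relu in place
gm.recompile()
out = gm(torch.tensor([-1.0, 3.0]), torch.tensor([2.0, -5.0]))
```

A `PassManager` then simply runs an ordered list of such rewrites over the fx graph handed to the compiler backend, which is why new fuses can be contributed as plain Python pattern/replacement pairs.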
The calling process is as follows.

Class Diagram

```mermaid
classDiagram
    class PiecewiseNpuGraphRunnerDecode
    class NPUCompileModelRunner
    class NPUGraphRunner
    class CudaGraphRunner
    class NpuGraphCompiler
    class NpuGraphCompilerBackend
    class PiecewiseNpuGraphCompiler
    class PiecewiseNpuGraphCompilerBackend
    NPUGraphRunner --|> CudaGraphRunner
    NPUGraphRunner --> NpuGraphCompiler
    NpuGraphCompiler --> NpuGraphCompilerBackend
    NPUCompileModelRunner --> CudaGraphRunner
    PiecewiseNpuGraphRunnerDecode --> CudaGraphRunner
    PiecewiseNpuGraphRunnerDecode --> PiecewiseNpuGraphCompiler
    PiecewiseNpuGraphCompiler --> PiecewiseNpuGraphCompilerBackend
    PiecewiseNpuGraphCompilerBackend --|> NpuGraphCompilerBackend
```

Accuracy Tests
Collected on the gsm8k dataset for static quantized `Qwen3-32B` (TorchAir):

Collected on the MMMU dataset for `Qwen3-VL-30B-A3B-Instruct`:

Benchmarking and Profiling (910B3)
Reference
Compilation
Piecewise Graph
Future roadmaps

In the `torch_npu` 7.2.0 version, the reduce-overhead mode of the torchair backend will support `torch.compile(model, dynamic=True)`. This mode will be set as the default in `get_compile_backend()`, enabling support for methods wrapped by the `@torch.compile()` decorator.

In the `torch_npu` 7.3.0 version, the capture and replay of `NPUGraph` currently integrated in the torchair backend will be changed to optional execution. The torchair backend will only perform optimizations such as fx pass optimization and static kernel compilation, while the capture and replay of `NPUGraph` will be implemented independently. This design is closer to the implementation of `CudaGraphRunner`, decoupling fx graph optimization from graph offloading.

Checklist