@eshoguli eshoguli commented Sep 30, 2025

Motivation

  1. Performance gain (Ascend 910B3, batch size = 128):

| Branch | Median ITL (ms) | Performance gain |
| --- | --- | --- |
| Reference | 74.27 | — |
| Compilation (`--enable-torch-compile`) | 71.64 | 3.5% |
| Piecewise Graph (`--enable-piecewise-npu-graph-decode`) | 73.20 | 1.4% |

  2. Support model compilation on NPU and a PassManager for current and future fusions, written in Python via torch.fx.replace_pattern. Fusions can easily be developed by external contributors.
  3. Improve performance by fusing the AddRmsNorm and AscendQuantV2 kernels into the AddRmsNormQuant kernel.
  4. Increase performance of the compiled model via NPU kernels and by avoiding torch guards.
  5. Piecewise graph execution approach.
  6. TorchAir compilation backend support.
    Original comment: [feat] npu support enable_torch_compile #12371
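The fusion mechanism above is built on torch.fx pattern rewriting. Since the NPU kernels (AddRmsNorm, AscendQuantV2, AddRmsNormQuant) are only available on Ascend devices, here is a minimal sketch of the same `torch.fx.replace_pattern` technique using a hypothetical stand-in fusion (`fused_add_relu`, `pattern`, `replacement`, and `TinyModel` are illustrative names, not from the PR):

```python
import torch
import torch.fx as fx


# Hypothetical stand-in for a fused NPU kernel (e.g. AddRmsNormQuant);
# fx.wrap keeps it a single call_function node during symbolic tracing.
@fx.wrap
def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.relu(x + y)


def pattern(x, y):
    # Subgraph to search for: elementwise add followed by relu.
    return torch.relu(x + y)


def replacement(x, y):
    # Single fused node that replaces the matched subgraph.
    return fused_add_relu(x, y)


class TinyModel(torch.nn.Module):
    def forward(self, x, y):
        return torch.relu(x + y) * 2


gm = fx.symbolic_trace(TinyModel())
matches = fx.replace_pattern(gm, pattern, replacement)
print(len(matches))  # expect 1 matched subgraph
```

An external contributor can add a new fusion by writing only a `pattern`/`replacement` pair; the graph rewrite and recompilation are handled by torch.fx.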

TorchAir (Torch Ascend Intermediate Representation) is an extension library that provides graph mode capabilities for torch_npu. It enables users to perform graph-mode inference on NPU using PyTorch and torch_npu. TorchAir externally offers a torch.compile backend for NPU, which interfaces with torch._dynamo. Through the following features, performance optimization and capability enhancement of the torch fx graph can be achieved.

[Image: torchair1]

TorchAir Main Features:

  1. Basic features:
  • Enable NPU kernels that depend on host-value tiling operators (e.g., FIA) to support npugraph
  • Graph input copy optimization
  • Memory reuse across multiple graphs
  2. FX passes:
  • In-place optimization
  • Redundant operator elimination
  • NPU fused operator passes
  3. Advanced features:
  • Static shape kernel compilation
  • Multi-stream within a single graph
  • Compilation caching

How to enable compilation and fusions for NPU Graph decode:

--enable-torch-compile

How to enable the piecewise graph and fusions for decode:

--enable-piecewise-npu-graph-decode

How to enable TorchAir for decode:

--enable-torch-compile --disable-cuda-graph
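Putting the three modes together, a server launch might look like the following sketch (the model path is a placeholder; the flags are the ones introduced by this PR plus the standard sglang launcher):

```shell
# Compilation + fusions for NPU Graph decode:
python3 -m sglang.launch_server \
    --model-path /path/to/Qwen3-32B \
    --enable-torch-compile

# Piecewise graph + fusions for decode:
python3 -m sglang.launch_server \
    --model-path /path/to/Qwen3-32B \
    --enable-piecewise-npu-graph-decode

# TorchAir for decode:
python3 -m sglang.launch_server \
    --model-path /path/to/Qwen3-32B \
    --enable-torch-compile --disable-cuda-graph
```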

CANN version: 8.2
Torch NPU version: torch-npu 2.6.0.post3

Modifications

  1. Model compilation support via torch.compile.
    Use --enable-torch-compile to enable compilation and the optional --torch-compile-max-bs argument to limit the maximum batch size for compilation.

  2. NpuGraphCompilerBackend: a compilation backend for NPU Graph capturing. Implemented in python/sglang/srt/model_executor/compilation/npu_graph_compiler_backend.py. Usage:

```python
self.compiled_callable = torch.compile(
    model, fullgraph=True, dynamic=False, backend=NpuGraphCompilerBackend()
)
```

  3. PiecewiseNpuGraphCompilerBackend: a compilation backend for piecewise graphs and partial NPU Graph capturing. Inherits from NpuGraphCompilerBackend to reuse the fusion passes. Implemented in python/sglang/srt/model_executor/compilation/piecewise_npu_graph_compiler_backend.py. Usage:

```python
self.compiled_callable = torch.compile(
    model, fullgraph=True, dynamic=False, backend=PiecewiseNpuGraphCompilerBackend()
)
```

Use --enable-piecewise-npu-graph-decode to enable the piecewise graph.
Optional command line arguments:

  • --compilation-config '{"splitting_ops": ["atb._npu_paged_attention"]}' to configure the compilation backend,
  • --cuda-graph-bs to specify the batch size,
  • --cuda-graph-max-bs to limit the maximum batch size.

  4. PassManager and its passes (python/sglang/srt/model_executor/compilation/passes/w8a8_int8) to optimize the model during compilation. Usage:

```python
import torch

from sglang.srt.compilation.npu.pass_manager import PassManager
from sglang.srt.compilation.npu.passes.w8a8_int8 import (
    DivFuse,
    EraseCopy,
    NpuAddRmsNormQuantFuse,
    NpuAddRmsNormDynamicQuantFuse,
)

def apply_passes(graph_module: torch.fx.GraphModule):
    pass_manager = PassManager(graph_module)
    pass_manager.add(NpuAddRmsNormQuantFuse)
    pass_manager.add(NpuAddRmsNormDynamicQuantFuse)
    pass_manager.add(DivFuse)
    pass_manager.add(EraseCopy)
    pass_manager.apply()
    graph_module.recompile()
```

  5. The RotaryEmbedding layer uses an NPU kernel in forward instead of the native implementation.
  6. torch.compile guards are ignored to improve forward performance.
  7. Ascend paged attention is used to enable compilation without custom ops: python/sglang/srt/layers/attention/ascend_backend.py.
  8. TorchAir:
    8.1. Rewrite the capture function;
    8.2. Encapsulate the kvcache input (the input needs the full kvcache);
    8.3. Pad the block table to the max length;
    8.4. Prepare TorchAir inputs.
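The PR shows how the PassManager is called but not its internals. A minimal, dependency-free sketch of the add/apply contract the usage above implies might look like this (the `Pass` base class, the toy list-of-ops "graph", and the simplified `EraseCopy` are hypothetical illustrations, not the real implementation):

```python
from typing import List


class Pass:
    """Base class for a rewrite pass over a graph-like object (hypothetical)."""

    def __init__(self, graph_module):
        self.graph_module = graph_module

    def run(self) -> None:
        raise NotImplementedError


class PassManager:
    """Minimal sketch: registers pass classes, instantiates and runs them in order."""

    def __init__(self, graph_module):
        self.graph_module = graph_module
        self._passes: List[type] = []

    def add(self, pass_cls: type) -> None:
        self._passes.append(pass_cls)

    def apply(self) -> None:
        for pass_cls in self._passes:
            pass_cls(self.graph_module).run()


# Toy "graph": a list of op names; a pass that erases redundant copy ops.
class EraseCopy(Pass):
    def run(self) -> None:
        self.graph_module[:] = [op for op in self.graph_module if op != "copy_"]


graph = ["add", "copy_", "rms_norm", "copy_", "quant"]
pm = PassManager(graph)
pm.add(EraseCopy)
pm.apply()
print(graph)  # -> ['add', 'rms_norm', 'quant']
```

Because passes are registered as classes and run in registration order, new fusions can be appended without modifying the manager itself.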

The calling process is as follows.
[Image: torchair2]

Class Diagram

```mermaid
classDiagram
    class PiecewiseNpuGraphRunnerDecode
    class NPUCompileModelRunner
    class NPUGraphRunner
    class CudaGraphRunner
    class NpuGraphCompiler
    class NpuGraphCompilerBackend
    class PiecewiseNpuGraphCompiler
    class PiecewiseNpuGraphCompilerBackend

    NPUGraphRunner --|> CudaGraphRunner
    NPUGraphRunner --> NpuGraphCompiler
    NpuGraphCompiler --> NpuGraphCompilerBackend
    NPUCompileModelRunner --> CudaGraphRunner
    PiecewiseNpuGraphRunnerDecode --> CudaGraphRunner
    PiecewiseNpuGraphRunnerDecode --> PiecewiseNpuGraphCompiler
    PiecewiseNpuGraphCompiler --> PiecewiseNpuGraphCompilerBackend
    PiecewiseNpuGraphCompilerBackend --|> NpuGraphCompilerBackend
```

Accuracy Tests

Collected on gsm8k dataset for static quantized Qwen3-32B:

| Version | Accuracy |
| --- | --- |
| Reference | 85.7% |
| Compilation | 85.6% |
| Piecewise Graph | 85.7% |
| TorchAir | 85.1% |

TorchAir

python3 few_shot_gsm8k.py --data-path "/path/to/model/test.jsonl.txt" --parallel 32 --num-questions 200

Accuracy: 0.865
Invalid: 0.000
Latency: 43.077 s
Output throughput: 795.877 token/s

Collected on MMMU dataset for Qwen3-VL-30B-A3B-Instruct:

| Version | Overall accuracy |
| --- | --- |
| Reference | 0.592 |
| Compilation | 0.597 |
| Piecewise Graph | 0.591 |

Benchmarking and Profiling (910B3)

Reference

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 128
Successful requests:                     128
Benchmark duration (s):                  119.11
Total input tokens:                      131072
Total generated tokens:                  131072
Total generated tokens (retokenized):    131061
Request throughput (req/s):              1.07
Input token throughput (tok/s):          1100.41
Output token throughput (tok/s):         1100.41
Total token throughput (tok/s):          2200.82
Concurrency:                             109.65
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   102033.93
Median E2E Latency (ms):                 100067.18
---------------Time to First Token----------------
Mean TTFT (ms):                          13474.46
Median TTFT (ms):                        13730.29
P99 TTFT (ms):                           24113.16
---------------Inter-Token Latency----------------
Mean ITL (ms):                           86.57
Median ITL (ms):                         74.27
P95 ITL (ms):                            79.96
P99 ITL (ms):                            80.59
Max ITL (ms):                            25360.72
==================================================

Compilation

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 128
Successful requests:                     128
Benchmark duration (s):                  117.06
Total input tokens:                      131072
Total input text tokens:                 131072
Total input vision tokens:               0
Total generated tokens:                  131072
Total generated tokens (retokenized):    131064
Request throughput (req/s):              1.09
Input token throughput (tok/s):          1119.68
Output token throughput (tok/s):         1119.68
Total token throughput (tok/s):          2239.35
Concurrency:                             108.96
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   99646.08
Median E2E Latency (ms):                 97652.90
---------------Time to First Token----------------
Mean TTFT (ms):                          13575.07
Median TTFT (ms):                        13454.43
P99 TTFT (ms):                           24318.40
---------------Inter-Token Latency----------------
Mean ITL (ms):                           84.14
Median ITL (ms):                         71.64
P95 ITL (ms):                            76.49
P99 ITL (ms):                            78.27
Max ITL (ms):                            24386.78
==================================================

Piecewise Graph

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 128
Successful requests:                     128
Benchmark duration (s):                  125.24
Total input tokens:                      131072
Total generated tokens:                  131072
Total generated tokens (retokenized):    131067
Request throughput (req/s):              1.02
Input token throughput (tok/s):          1046.58
Output token throughput (tok/s):         1046.58
Total token throughput (tok/s):          2093.17
Concurrency:                             103.59
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   101352.11
Median E2E Latency (ms):                 98694.90
---------------Time to First Token----------------
Mean TTFT (ms):                          13580.41
Median TTFT (ms):                        14449.29
P99 TTFT (ms):                           24292.08
---------------Inter-Token Latency----------------
Mean ITL (ms):                           85.80
Median ITL (ms):                         73.20
P95 ITL (ms):                            78.72
P99 ITL (ms):                            79.48
Max ITL (ms):                            25003.23
==================================================

Future roadmap

In torch_npu 7.2.0, the reduce-overhead mode of the torchair backend will support torch.compile(model, dynamic=True). This mode will be set as the default in get_compile_backend(), enabling support for methods wrapped by the @torch.compile() decorator.

In torch_npu 7.3.0, the capture and replay of NPUGraph, currently integrated in the torchair backend, will become optional. The torchair backend will then only perform optimizations such as FX pass optimization and static kernel compilation, while NPUGraph capture and replay will be implemented independently. This design is closer to the CudaGraphRunner implementation, decoupling FX graph optimization from graph offloading.
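The decorator usage the 7.2.0 roadmap targets can be sketched as follows. Since the torchair backend is NPU-only, `backend="eager"` is used here as a stand-in, and `scaled_residual` is a made-up example function:

```python
import torch


# Sketch of decorator-wrapped compilation with dynamic shapes; backend="eager"
# stands in for the torchair reduce-overhead backend, which requires an NPU.
@torch.compile(dynamic=True, backend="eager")
def scaled_residual(x: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    return (x + residual) * 0.5


out = scaled_residual(torch.ones(4), torch.ones(4))
print(out)  # tensor of ones: (1 + 1) * 0.5
```

With dynamic=True, the same compiled artifact serves varying input shapes, which is what makes a decorator on a method with changing batch sizes practical.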

Checklist

@eshoguli eshoguli changed the title [WIP] NPU Graph Compilation & PassManager NPU Graph Compilation support and PassManager with AddRmsNorm & Quantize fuse Oct 30, 2025
@eshoguli eshoguli force-pushed the eshogulin/pass_manager branch 9 times, most recently from 508e483 to d77e709 Compare October 30, 2025 22:40
@eshoguli eshoguli force-pushed the eshogulin/pass_manager branch from c958827 to b974460 Compare October 31, 2025 15:38
@eshoguli eshoguli force-pushed the eshogulin/pass_manager branch from 8150d72 to 11074d9 Compare November 20, 2025 08:07
@ssshinigami (Contributor) left a comment:

LGTM

@ping1jing2 ping1jing2 self-assigned this Nov 20, 2025
@eshoguli eshoguli force-pushed the eshogulin/pass_manager branch from e6942bc to e06675b Compare November 21, 2025 09:18
@yuan-luo yuan-luo self-requested a review November 27, 2025 02:36
@eshoguli eshoguli requested a review from hebiao064 as a code owner November 27, 2025 13:39