Conversation

@cscyuge commented Nov 13, 2025

Motivation

For RMSNorm with head_dim <= 128, the QK-norm kernel from lmdeploy outperforms the current flashinfer RMSNorm kernel.
In a benchmark on H20 (head_dim = 128, head_num = 48, token_num = 4096), latency drops from 269 µs to 69 µs.
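For context, RMSNorm normalizes each row by its root-mean-square and applies a per-channel weight. A minimal pure-Python reference of the math both kernels compute (not the kernel code itself):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i, computed per row
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

# With head_dim = 128, each (token, head) row is one such 128-element reduction,
# which is small enough for a single warp-level pass on the GPU.
print(rms_norm([3.0, 4.0], [1.0, 1.0]))  # ≈ [0.8485, 1.1314]
```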

Modifications

  • Added a minimal implementation of turbomind_rms_norm.
  • Updated RMSNorm’s forward_cuda and forward_xpu to use turbomind_rms_norm when hidden_size <= 128.
  • Added unit tests and benchmarks for turbomind_rms_norm.
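The dispatch added to forward_cuda / forward_xpu can be sketched as follows (function and constant names are illustrative, not the exact PR code):

```python
TURBOMIND_MAX_HIDDEN_SIZE = 128  # the turbomind kernel targets head_dim <= 128

def pick_rms_norm_kernel(hidden_size: int) -> str:
    """Sketch of the new branch in RMSNorm.forward_cuda / forward_xpu."""
    if hidden_size <= TURBOMIND_MAX_HIDDEN_SIZE:
        return "turbomind_rms_norm"  # new minimal kernel from lmdeploy
    return "rmsnorm"  # existing flashinfer-based kernel

print(pick_rms_norm_kernel(128))   # small head_dim: turbomind path
print(pick_rms_norm_kernel(4096))  # large hidden size: existing path
```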

Accuracy Tests

Unit tests in sgl-kernel/tests/test_norm.py pass.

On H20 with model Qwen/Qwen3-8B-FP8: python3 -m sglang.test.few_shot_gsm8k --num-questions 1000
before:

Accuracy: 0.907
Invalid: 0.000
Latency: 40.152 s
Output throughput: 2997.586 token/s

after:

Accuracy: 0.909
Invalid: 0.000
Latency: 39.234 s
Output throughput: 3067.587 token/s

Benchmarking and Profiling

kernel benchmark (on H20): python /sgl-workspace/sglang/sgl-kernel/benchmark/bench_turbomind_rmsnorm.py
results:

rmsnorm-performance(head_dim=128):
    head_num  token_num      SGLang  Turbomind
0       16.0        1.0    2.455256   2.400793
1       16.0        2.0    2.902393   2.554233
2       16.0        4.0    2.936181   2.553624
3       16.0        8.0    2.900588   2.606192
4       16.0       16.0    3.081954   2.623438
5       16.0       32.0    3.331822   2.670131
6       16.0       64.0    3.920498   2.767891
7       16.0      128.0    5.263943   2.877911
8       16.0      256.0    7.914170   3.171015
9       16.0      512.0   13.160407   4.080311
10      16.0     1024.0   23.700711   5.538099
11      16.0     2048.0   44.501812   8.726369
12      16.0     4096.0   90.898067  15.671547
13      32.0        1.0    2.907672   2.552743
14      32.0        2.0    2.936283   2.553682
15      32.0        4.0    2.900070   2.608614
16      32.0        8.0    3.081353   2.623271
17      32.0       16.0    3.332297   2.670330
18      32.0       32.0    3.920480   2.767877
19      32.0       64.0    5.264174   2.861957
20      32.0      128.0    7.914641   3.171004
21      32.0      256.0   13.178706   4.079834
22      32.0      512.0   23.686496   5.537208
23      32.0     1024.0   44.558687   8.731087
24      32.0     2048.0   90.815924  15.645469
25      32.0     4096.0  180.847845  44.130245
26      48.0        1.0    2.920537   2.548013
27      48.0        2.0    2.955938   2.590711
28      48.0        4.0    3.074632   2.611818
29      48.0        8.0    3.220033   2.651170
30      48.0       16.0    3.599001   2.723399
31      48.0       32.0    4.594748   2.812742
32      48.0       64.0    6.809815   3.288367
33      48.0      128.0   10.555641   3.487169
34      48.0      256.0   18.607111   4.872875
35      48.0      512.0   34.088300   7.114875
36      48.0     1024.0   66.516190  11.957884
37      48.0     2048.0  136.611495  28.148620
38      48.0     4096.0  269.632875  69.378489

e2e benchmark (on H20, model: Qwen/Qwen3-0.6B-FP8):

python3 -m sglang.bench_serving --backend sglang \
        --model $MODEL_PATH \
        --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 32 \
        --random-range-ratio 1 \
        --request-rate 64 \
        --max-concurrency 64 \
        --num-prompts 256 \
        --host $SERVER_IP --port $SERVER_PORT

results (baseline: current rmsnorm):

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     256
Benchmark duration (s):                  13.76
Total input tokens:                      1048576
Total generated tokens:                  8192
Total generated tokens (retokenized):    8190
Request throughput (req/s):              18.60
Input token throughput (tok/s):          76194.82
Output token throughput (tok/s):         595.27
Total token throughput (tok/s):          76790.10
Concurrency:                             61.76
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3319.85
Median E2E Latency (ms):                 3432.59
---------------Time to First Token----------------
Mean TTFT (ms):                          1375.97
Median TTFT (ms):                        1393.31
P99 TTFT (ms):                           2432.91
--------------Time Per Output Token---------------
Mean TPOT (ms):                          62.73
Median TPOT (ms):                        61.61
P99 TPOT (ms):                           103.67
---------------Inter-Token Latency----------------
Mean ITL (ms):                           62.71
Median ITL (ms):                         31.86
P95 ITL (ms):                            58.45
P99 ITL (ms):                            1374.35
Max ITL (ms):                            2338.90
==================================================

results (turbomind_rms_norm):

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     256
Benchmark duration (s):                  13.15
Total input tokens:                      1048576
Total generated tokens:                  8192
Total generated tokens (retokenized):    8190
Request throughput (req/s):              19.47
Input token throughput (tok/s):          79747.11
Output token throughput (tok/s):         623.02
Total token throughput (tok/s):          80370.13
Concurrency:                             61.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3167.85
Median E2E Latency (ms):                 3282.13
---------------Time to First Token----------------
Mean TTFT (ms):                          1276.11
Median TTFT (ms):                        1334.07
P99 TTFT (ms):                           2270.03
--------------Time Per Output Token---------------
Mean TPOT (ms):                          61.04
Median TPOT (ms):                        57.76
P99 TPOT (ms):                           101.03
---------------Inter-Token Latency----------------
Mean ITL (ms):                           61.02
Median ITL (ms):                         32.03
P95 ITL (ms):                            56.19
P99 ITL (ms):                            1324.25
Max ITL (ms):                            2259.54
==================================================

@FlamingoPg (Collaborator) commented:

Looks good. Have you tested it on H100/B200? If not, I can help you add some H100/B200 tests.

@cscyuge (Author) commented Nov 13, 2025

> Looks good. Have you tested it on H100/B200? If not, I can help you add some H100/B200 tests.

Thanks for reviewing! I don't have access to H100/B200, so I couldn't test on those.
If you could help run the tests, that would be great. Happy to update anything if needed.

@FlamingoPg (Collaborator) commented Nov 13, 2025

H100 results; looks pretty good:

rmsnorm-performance(head_dim=128):
head_num  token_num      SGLang  Turbomind
0       16.0        1.0    2.690041   2.643974
1       16.0        2.0    3.132268   2.748590
2       16.0        4.0    3.181508   2.758456
3       16.0        8.0    3.231008   2.785226
4       16.0       16.0    3.379054   2.902990
5       16.0       32.0    3.557637   3.018197
6       16.0       64.0    4.157426   3.067163
7       16.0      128.0    5.527277   3.304978
8       16.0      256.0    8.192670   3.480653
9       16.0      512.0   13.416918   4.106387
10      16.0     1024.0   23.871537   5.459238
11      16.0     2048.0   45.091851   8.214365
12      16.0     4096.0   92.935882  16.025131
13      32.0        1.0    3.148964   2.759750
14      32.0        2.0    3.171174   2.736667
15      32.0        4.0    3.230939   2.813160
16      32.0        8.0    3.362471   2.902649
17      32.0       16.0    3.553130   3.020820
18      32.0       32.0    4.143711   3.062767
19      32.0       64.0    5.547237   3.285565
20      32.0      128.0    8.196443   3.475744
21      32.0      256.0   13.435698   4.111618
22      32.0      512.0   23.900416   5.468766
23      32.0     1024.0   45.054145   8.206142
24      32.0     2048.0   92.890488  16.015399
25      32.0     4096.0  181.953198  46.956430
26      48.0        1.0    3.100425   2.732236
27      48.0        2.0    3.175247   2.762727
28      48.0        4.0    3.273962   2.830548
29      48.0        8.0    3.426947   2.940473
30      48.0       16.0    3.788936   2.997664
31      48.0       32.0    4.828032   3.143320
32      48.0       64.0    6.853608   3.355092
33      48.0      128.0   10.787238   3.841854
34      48.0      256.0   18.648622   4.792144
35      48.0      512.0   34.520374   6.965173
36      48.0     1024.0   69.091418  11.216033
37      48.0     2048.0  137.166762  29.693131
38      48.0     4096.0  271.742450  73.303484
