Conversation

@cscyuge commented Nov 13, 2025

Motivation

For RMSNorm with head_dim <= 128, the QK-norm kernel from lmdeploy outperforms the current flashinfer RMSNorm kernel.
In a benchmark on H20 (head_dim = 128, head_num = 48, token_num = 4096), latency drops from 269 µs to 69 µs.
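For context, RMSNorm normalizes each row by its root-mean-square and applies a per-channel weight. A minimal pure-Python reference of the math both kernels compute (not the kernel code itself):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i, computed per row
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

# With head_dim = 128, each (token, head) row is one such 128-element reduction,
# which is small enough for a single warp-level pass on the GPU.
print(rms_norm([3.0, 4.0], [1.0, 1.0]))  # ≈ [0.8485, 1.1314]
```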

Modifications

  • Added a minimal implementation of turbomind_rms_norm.
  • Updated RMSNorm’s forward_cuda and forward_xpu to use turbomind_rms_norm when hidden_size <= 128.
  • Added unit tests and benchmarks for turbomind_rms_norm.
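The dispatch added to forward_cuda / forward_xpu can be sketched as follows (function and constant names are illustrative, not the exact PR code):

```python
TURBOMIND_MAX_HIDDEN_SIZE = 128  # the turbomind kernel targets head_dim <= 128

def pick_rms_norm_kernel(hidden_size: int) -> str:
    """Sketch of the new branch in RMSNorm.forward_cuda / forward_xpu."""
    if hidden_size <= TURBOMIND_MAX_HIDDEN_SIZE:
        return "turbomind_rms_norm"  # new minimal kernel from lmdeploy
    return "rmsnorm"  # existing flashinfer-based kernel

print(pick_rms_norm_kernel(128))   # small head_dim: turbomind path
print(pick_rms_norm_kernel(4096))  # large hidden size: existing path
```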

Accuracy Tests

Unit tests in sgl-kernel/tests/test_norm.py pass.

On H20 with model Qwen/Qwen3-8B-FP8: python3 -m sglang.test.few_shot_gsm8k --num-questions 1000
before:

Accuracy: 0.907
Invalid: 0.000
Latency: 40.152 s
Output throughput: 2997.586 token/s

after:

Accuracy: 0.909
Invalid: 0.000
Latency: 39.234 s
Output throughput: 3067.587 token/s

Benchmarking and Profiling

kernel benchmark (on H20): python /sgl-workspace/sglang/sgl-kernel/benchmark/bench_turbomind_rmsnorm.py
results:

rmsnorm-performance(head_dim=128):
    head_num  token_num      SGLang  Turbomind
0       16.0        1.0    2.455256   2.400793
1       16.0        2.0    2.902393   2.554233
2       16.0        4.0    2.936181   2.553624
3       16.0        8.0    2.900588   2.606192
4       16.0       16.0    3.081954   2.623438
5       16.0       32.0    3.331822   2.670131
6       16.0       64.0    3.920498   2.767891
7       16.0      128.0    5.263943   2.877911
8       16.0      256.0    7.914170   3.171015
9       16.0      512.0   13.160407   4.080311
10      16.0     1024.0   23.700711   5.538099
11      16.0     2048.0   44.501812   8.726369
12      16.0     4096.0   90.898067  15.671547
13      32.0        1.0    2.907672   2.552743
14      32.0        2.0    2.936283   2.553682
15      32.0        4.0    2.900070   2.608614
16      32.0        8.0    3.081353   2.623271
17      32.0       16.0    3.332297   2.670330
18      32.0       32.0    3.920480   2.767877
19      32.0       64.0    5.264174   2.861957
20      32.0      128.0    7.914641   3.171004
21      32.0      256.0   13.178706   4.079834
22      32.0      512.0   23.686496   5.537208
23      32.0     1024.0   44.558687   8.731087
24      32.0     2048.0   90.815924  15.645469
25      32.0     4096.0  180.847845  44.130245
26      48.0        1.0    2.920537   2.548013
27      48.0        2.0    2.955938   2.590711
28      48.0        4.0    3.074632   2.611818
29      48.0        8.0    3.220033   2.651170
30      48.0       16.0    3.599001   2.723399
31      48.0       32.0    4.594748   2.812742
32      48.0       64.0    6.809815   3.288367
33      48.0      128.0   10.555641   3.487169
34      48.0      256.0   18.607111   4.872875
35      48.0      512.0   34.088300   7.114875
36      48.0     1024.0   66.516190  11.957884
37      48.0     2048.0  136.611495  28.148620
38      48.0     4096.0  269.632875  69.378489

e2e benchmark (on H20, model: Qwen/Qwen3-0.6B-FP8):

python3 -m sglang.bench_serving --backend sglang \
        --model $MODEL_PATH \
        --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 32 \
        --random-range-ratio 1 \
        --request-rate 64 \
        --max-concurrency 64 \
        --num-prompts 256 \
        --host $SERVER_IP --port $SERVER_PORT

results (baseline: current rmsnorm):

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     256
Benchmark duration (s):                  13.76
Total input tokens:                      1048576
Total generated tokens:                  8192
Total generated tokens (retokenized):    8190
Request throughput (req/s):              18.60
Input token throughput (tok/s):          76194.82
Output token throughput (tok/s):         595.27
Total token throughput (tok/s):          76790.10
Concurrency:                             61.76
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3319.85
Median E2E Latency (ms):                 3432.59
---------------Time to First Token----------------
Mean TTFT (ms):                          1375.97
Median TTFT (ms):                        1393.31
P99 TTFT (ms):                           2432.91
--------------Time Per Output Token---------------
Mean TPOT (ms):                          62.73
Median TPOT (ms):                        61.61
P99 TPOT (ms):                           103.67
---------------Inter-Token Latency----------------
Mean ITL (ms):                           62.71
Median ITL (ms):                         31.86
P95 ITL (ms):                            58.45
P99 ITL (ms):                            1374.35
Max ITL (ms):                            2338.90
==================================================

results (turbomind_rms_norm):

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    64.0
Max request concurrency:                 64
Successful requests:                     256
Benchmark duration (s):                  13.15
Total input tokens:                      1048576
Total generated tokens:                  8192
Total generated tokens (retokenized):    8190
Request throughput (req/s):              19.47
Input token throughput (tok/s):          79747.11
Output token throughput (tok/s):         623.02
Total token throughput (tok/s):          80370.13
Concurrency:                             61.68
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   3167.85
Median E2E Latency (ms):                 3282.13
---------------Time to First Token----------------
Mean TTFT (ms):                          1276.11
Median TTFT (ms):                        1334.07
P99 TTFT (ms):                           2270.03
--------------Time Per Output Token---------------
Mean TPOT (ms):                          61.04
Median TPOT (ms):                        57.76
P99 TPOT (ms):                           101.03
---------------Inter-Token Latency----------------
Mean ITL (ms):                           61.02
Median ITL (ms):                         32.03
P95 ITL (ms):                            56.19
P99 ITL (ms):                            1324.25
Max ITL (ms):                            2259.54
==================================================

@FlamingoPg (Collaborator) commented:

Looks good. Have you tested it on H100/B200? If not, I can help you add some H100/B200 tests.

@cscyuge (Author) commented Nov 13, 2025

> Looks good. Have you tested it on H100/B200? If not, I can help you add some H100/B200 tests.

Thanks for reviewing! I don't have access to H100/B200, so I couldn't test on those.
If you could help run the tests, that would be great. Happy to update anything if needed.

@FlamingoPg (Collaborator) commented Nov 13, 2025

H100 results; looks pretty good:

rmsnorm-performance(head_dim=128):
head_num  token_num      SGLang  Turbomind
0       16.0        1.0    2.690041   2.643974
1       16.0        2.0    3.132268   2.748590
2       16.0        4.0    3.181508   2.758456
3       16.0        8.0    3.231008   2.785226
4       16.0       16.0    3.379054   2.902990
5       16.0       32.0    3.557637   3.018197
6       16.0       64.0    4.157426   3.067163
7       16.0      128.0    5.527277   3.304978
8       16.0      256.0    8.192670   3.480653
9       16.0      512.0   13.416918   4.106387
10      16.0     1024.0   23.871537   5.459238
11      16.0     2048.0   45.091851   8.214365
12      16.0     4096.0   92.935882  16.025131
13      32.0        1.0    3.148964   2.759750
14      32.0        2.0    3.171174   2.736667
15      32.0        4.0    3.230939   2.813160
16      32.0        8.0    3.362471   2.902649
17      32.0       16.0    3.553130   3.020820
18      32.0       32.0    4.143711   3.062767
19      32.0       64.0    5.547237   3.285565
20      32.0      128.0    8.196443   3.475744
21      32.0      256.0   13.435698   4.111618
22      32.0      512.0   23.900416   5.468766
23      32.0     1024.0   45.054145   8.206142
24      32.0     2048.0   92.890488  16.015399
25      32.0     4096.0  181.953198  46.956430
26      48.0        1.0    3.100425   2.732236
27      48.0        2.0    3.175247   2.762727
28      48.0        4.0    3.273962   2.830548
29      48.0        8.0    3.426947   2.940473
30      48.0       16.0    3.788936   2.997664
31      48.0       32.0    4.828032   3.143320
32      48.0       64.0    6.853608   3.355092
33      48.0      128.0   10.787238   3.841854
34      48.0      256.0   18.648622   4.792144
35      48.0      512.0   34.520374   6.965173
36      48.0     1024.0   69.091418  11.216033
37      48.0     2048.0  137.166762  29.693131
38      48.0     4096.0  271.742450  73.303484
