RMSnorm kernel and benchmark #973

huy209vn · 2025-10-21T07:10:32Z

Implements RMS normalization with support for bias, a configurable epsilon, and runtime vectorization.
Includes a verified CUDA benchmark and backend tests across CPU, WGPU, and HIP.
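
For readers unfamiliar with the operation, here is a minimal CPU sketch of RMS normalization over the last axis. It is not the PR's code: the function name, the per-feature `weight` vector, and the f64 accumulation are assumptions made for illustration. Only the optional bias and the configurable epsilon come from the description above.

```rust
/// Reference RMS normalization over the last axis of a row-major buffer.
/// For each row x of length `dim`:
///   y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i] + bias[i]
/// Hypothetical helper for illustration; not taken from the PR.
fn rms_norm_reference(
    input: &[f32],
    weight: &[f32],
    bias: Option<&[f32]>,
    dim: usize,
    eps: f32,
) -> Vec<f32> {
    assert!(dim > 0 && input.len() % dim == 0, "input must hold whole rows");
    assert_eq!(weight.len(), dim);
    if let Some(b) = bias {
        assert_eq!(b.len(), dim);
    }

    let mut out = Vec::with_capacity(input.len());
    for row in input.chunks_exact(dim) {
        // Mean of squares, accumulated in f64 so the reference stays accurate.
        let mean_sq = row.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>() / dim as f64;
        let inv_rms = (1.0 / (mean_sq + eps as f64).sqrt()) as f32;
        for (i, &x) in row.iter().enumerate() {
            // A missing bias behaves like a zero bias.
            let b = bias.map_or(0.0, |b| b[i]);
            out.push(x * inv_rms * weight[i] + b);
        }
    }
    out
}
```

A reference like this is the kind of baseline a cross-backend test could compare GPU output against, with a looser tolerance for F16 and BF16 than for F32.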

Status

✅ Kernel: finalized, verified across dtypes (F32, F16, BF16)
✅ Benchmarks: refactored, stable, true GPU-time measurement
✅ Tests: cross-backend validation vs CPU reference

Kernel Notes

- Two variants:
  - streaming: minimal register pressure (default)
  - smem: kept for future optimization, currently disabled
- Vectorization: vec=8 (F32), vec=4 (F16/BF16)
- Enforces ≥512 threads for occupancy
- Avoids register spills via MAX_LINES_PER_THREAD=4
- Uses fast reciprocal sqrt when supported
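
The notes above do not say how the vector width is chosen at runtime. One common approach, sketched here as an assumption rather than as the PR's logic, is to start from the per-dtype preferred width and halve it until it divides the normalized axis. `Dtype`, `preferred_vec`, and `pick_vec` are hypothetical names.

```rust
/// Element types covered by the kernel notes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Dtype {
    F32,
    F16,
    BF16,
}

/// Preferred widths as listed in the kernel notes: 8 for F32, 4 for F16/BF16.
fn preferred_vec(dtype: Dtype) -> usize {
    match dtype {
        Dtype::F32 => 8,
        Dtype::F16 | Dtype::BF16 => 4,
    }
}

/// Assumed fallback rule (not confirmed by the PR): halve the width
/// until it evenly divides the length of the normalized axis.
fn pick_vec(dtype: Dtype, axis_len: usize) -> usize {
    let mut vec = preferred_vec(dtype);
    while vec > 1 && axis_len % vec != 0 {
        vec /= 2;
    }
    vec
}
```

With the benchmark's axis length of 4096, this rule keeps the preferred width for every dtype; it only falls back for axis lengths that are not a multiple of the preferred width.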

Benchmark Results (RTX 3090, CUDA)

Shape: [8, 1024, 4096] (33.55M elements)

Type	Time (ms)	Throughput (Gelem/s)
F32	0.419	80.04
F16	0.219	153.54
BF16	0.219	153.41

Shape: [16, 1024, 4096] (67.11M elements)

Type	Time (ms)	Throughput (Gelem/s)
F32	0.838	80.05
F16	0.430	155.97
BF16	0.425	157.85

Shape: [32, 2048, 4096] (268.44M elements)

Type	Time (ms)	Throughput (Gelem/s)
F32	3.402	78.90
F16	1.688	159.05
BF16	1.658	161.94

Shape: [64, 4096, 4096] (1073.74M elements)

Type	Time (ms)	Throughput (Gelem/s)
F32	13.309	80.68
F16	6.438	166.78
BF16	6.479	165.73

✅ No regressions.
✅ All dtypes saturate memory bandwidth as expected.
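
To relate the Gelem/s figures to memory bandwidth, the tables can be converted to bytes per second. The sketch below assumes each element is read once and written once, and ignores the weight and bias traffic, which is small relative to the input. Both function names are illustrative.

```rust
/// Throughput in Gelem/s, the unit used in the tables.
fn throughput_gelems(elements: u64, time_ms: f64) -> f64 {
    elements as f64 / (time_ms * 1e-3) / 1e9
}

/// Effective memory traffic in GB/s.
/// Assumption: one read plus one write per element, nothing else.
fn effective_bandwidth_gbs(elements: u64, bytes_per_element: u64, time_ms: f64) -> f64 {
    let bytes_moved = 2 * elements * bytes_per_element;
    bytes_moved as f64 / (time_ms * 1e-3) / 1e9
}
```

For the first F32 row (33,554,432 elements in 0.419 ms, 4 bytes each), this works out to roughly 641 GB/s under the stated assumption. Halving the element size at about half the runtime gives a similar figure for F16 and BF16, which is consistent with a memory-bound kernel.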

Commands


cargo bench -p cubecl --bench rms_norm --features "cuda,random" -- --nocapture

Env Overrides


CUBECL_RMS_LOG=1             # verbose logging
CUBECL_RMS_VEC=4|8|16        # vectorization override
CUBECL_RMS_VARIANT=stream    # force variant
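
As an illustration of how overrides like these could be read, here is a small sketch using only the Rust standard library. The struct, its fields, and the validation rules are assumptions; the PR's actual parsing is not shown on this page. The variant is kept as a raw string because `stream` is the only value documented above.

```rust
use std::env;

/// Hypothetical container for the three overrides listed above.
#[derive(Debug, Default, PartialEq, Eq)]
struct RmsOverrides {
    log: bool,
    vec: Option<u8>,
    variant: Option<String>,
}

/// Parses overrides through a lookup function so the logic can be
/// exercised without touching the real process environment.
fn parse_overrides(get: impl Fn(&str) -> Option<String>) -> RmsOverrides {
    RmsOverrides {
        // Assumption: only the exact value "1" enables logging.
        log: get("CUBECL_RMS_LOG").as_deref() == Some("1"),
        // Accept only the widths listed above (4, 8, 16); ignore anything else.
        vec: get("CUBECL_RMS_VEC")
            .and_then(|v| v.trim().parse::<u8>().ok())
            .filter(|v| [4u8, 8, 16].contains(v)),
        // Passed through unvalidated; "stream" is the only documented value.
        variant: get("CUBECL_RMS_VARIANT").map(|v| v.trim().to_string()),
    }
}

/// Reads the overrides from the process environment.
fn overrides_from_env() -> RmsOverrides {
    parse_overrides(|key| env::var(key).ok())
}
```

Routing the lookup through a closure is a design choice for testability: unit tests can pass a fixed mapping instead of mutating global environment state.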


huy209vn added 4 commits October 21, 2025 13:33

Merge pull request #3 from huy209vn/codex/implement-rmsnorm-in-cubecl…

c38a544

…-em1880 Restore cubecl-attention lint allows

gate test with feature

e014037

Merge branch 'tracel-ai:main' into feat/rmsnorm

5b4c838

Merge branch 'tracel-ai:main' into feat/rmsnorm

6192e8e

huy209vn marked this pull request as draft October 24, 2025 12:51

huy209vn added 5 commits October 25, 2025 10:52

refactor RMSnorm using streaming kernel and smem kernel

0067ec8

Merge branch 'feat/rmsnorm' of https://github.com/huy209vn/cubecl int…

20a1e28

…o feat/rmsnorm

Merge branch 'tracel-ai:main' into feat/rmsnorm

3923e94

fmt

a48e67d

Update rms_norm.rs

6957d10

huy209vn mentioned this pull request Oct 25, 2025

[CUDA] Severe Performance Regressions - F16 30× Slower + Small Shapes <5 GB/s #984

Closed

huy209vn added 2 commits October 31, 2025 21:51

Update rms_norm.rs

25fe0af

Merge branch 'tracel-ai:main' into feat/rmsnorm

ad7ff1c

huy209vn marked this pull request as ready for review October 31, 2025 15:13

huy209vn added 4 commits October 31, 2025 22:28

Merge branch 'main' into feat/rmsnorm

f251e68

Fix plane shuffle operations for FP16/BF16 types

0964321

mistake commit

25d4db0

Revert "Fix plane shuffle operations for FP16/BF16 types"

21c2765

This reverts commit 0964321.
