Hi,
I’m trying to implement a high-performance matmul kernel for large matrices (4096x4096x4096) on an RTX 4080 using ThunderKittens. I’ve studied kernels/matmul/educational/level_04.cu, but the kernel I wrote from it reaches only about 1/10th of the GPU’s peak throughput, far behind cuBLAS and torch.matmul.
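For context, the structure of my kernel is roughly the following (a simplified sketch in plain CUDA wmma rather than the ThunderKittens API, so it is easy to discuss; names and tile sizes are illustrative, not my exact code):

```cuda
// One warp computes one 16x16 tile of C = A @ B, with A/B in bf16 and a
// fp32 accumulator. Launched with e.g. grid (N/16, M/16) and 32 threads.
#include <mma.h>
#include <cuda_bf16.h>
using namespace nvcuda;

__global__ void naive_wmma_matmul(const __nv_bfloat16* A, const __nv_bfloat16* B,
                                  float* C, int M, int N, int K) {
    const int tile_m = blockIdx.y * 16;  // row offset of this warp's C tile
    const int tile_n = blockIdx.x * 16;  // col offset of this warp's C tile

    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::row_major> b;
        // Fragments are loaded straight from global memory -- no shared-memory
        // staging and no pipelining, which I suspect is most of the gap.
        wmma::load_matrix_sync(a, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + tile_n, N);
        wmma::mma_sync(acc, a, b, acc);
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, acc, N, wmma::mem_row_major);
}
```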
I noticed that the other matmul examples rely on Hopper-specific features such as TMA and WGMMA, which aren’t available on Ada (sm_89).
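My understanding is that the closest pre-Hopper analogue of TMA is cp.async staging into shared memory with double buffering, roughly this pattern (a minimal sketch using the CUDA pipeline intrinsics; the tile sizes, kernel name, and A-only staging are hypothetical simplifications):

```cuda
#include <cuda_pipeline.h>
#include <cuda_bf16.h>

constexpr int TILE_M = 64;  // hypothetical output-tile height
constexpr int TILE_K = 64;  // hypothetical K-tile depth

__global__ void pipelined_a_loads(const __nv_bfloat16* A, int K) {
    // Two stages so the copy of tile k+1 overlaps compute on tile k.
    // (Staging B would follow the same pattern; omitted for brevity.)
    __shared__ __align__(16) __nv_bfloat16 stage[2][TILE_M * TILE_K];

    const int row = blockIdx.y * TILE_M;

    // Prefetch the first K-tile. Each thread issues 16-byte (8 x bf16) copies.
    for (int i = threadIdx.x * 8; i < TILE_M * TILE_K; i += blockDim.x * 8)
        __pipeline_memcpy_async(&stage[0][i],
                                &A[(row + i / TILE_K) * K + i % TILE_K], 16);
    __pipeline_commit();

    for (int k = TILE_K, buf = 0; k < K; k += TILE_K, buf ^= 1) {
        // Kick off the async copy of the next K-tile into the other buffer.
        for (int i = threadIdx.x * 8; i < TILE_M * TILE_K; i += blockDim.x * 8)
            __pipeline_memcpy_async(&stage[buf ^ 1][i],
                                    &A[(row + i / TILE_K) * K + k + i % TILE_K], 16);
        __pipeline_commit();
        __pipeline_wait_prior(1);  // current buffer done; next still in flight
        __syncthreads();
        // ... tensor-core mma on stage[buf] would go here ...
        __syncthreads();           // done reading stage[buf] before it is reused
    }
    __pipeline_wait_prior(0);
    __syncthreads();
    // ... mma on the final staged tile ...
}
```

Is this the right direction, or does ThunderKittens already wrap this for pre-Hopper targets?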
What are the main optimizations or directions you would recommend for approaching cuBLAS-level performance on Ada GPUs like the 4080?
Also, could you add more demos and kernel examples that specifically target pre-Hopper architectures?
Thanks!