Hi,
I’m trying to implement a high-performance matmul kernel for large matrices (4096x4096x4096) on an RTX 4080 using ThunderKittens. I’ve studied kernels/matmul/educational/level_04.cu, but the kernel I wrote from it reaches only about 1/10th of the GPU’s peak throughput, far behind cuBLAS and torch.matmul.
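For context, the structure of my kernel is roughly the following (a simplified sketch in plain CUDA wmma rather than the ThunderKittens API, so it is easy to discuss; names and tile sizes are illustrative, not my exact code):

```cuda
// One warp computes one 16x16 tile of C = A @ B, with A/B in bf16 and a
// fp32 accumulator. Launched with e.g. grid (N/16, M/16) and 32 threads.
#include <mma.h>
#include <cuda_bf16.h>
using namespace nvcuda;

__global__ void naive_wmma_matmul(const __nv_bfloat16* A, const __nv_bfloat16* B,
                                  float* C, int M, int N, int K) {
    const int tile_m = blockIdx.y * 16;  // row offset of this warp's C tile
    const int tile_n = blockIdx.x * 16;  // col offset of this warp's C tile

    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::row_major> b;
        // Fragments are loaded straight from global memory -- no shared-memory
        // staging and no pipelining, which I suspect is most of the gap.
        wmma::load_matrix_sync(a, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + tile_n, N);
        wmma::mma_sync(acc, a, b, acc);
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, acc, N, wmma::mem_row_major);
}
```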
I noticed that the other matmul examples rely on Hopper-specific features such as TMA and WGMMA, which aren’t available on Ada (sm_89).
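My understanding is that the closest pre-Hopper analogue of TMA is cp.async staging into shared memory with double buffering, roughly this pattern (a minimal sketch using the CUDA pipeline intrinsics; the tile sizes, kernel name, and A-only staging are hypothetical simplifications):

```cuda
#include <cuda_pipeline.h>
#include <cuda_bf16.h>

constexpr int TILE_M = 64;  // hypothetical output-tile height
constexpr int TILE_K = 64;  // hypothetical K-tile depth

__global__ void pipelined_a_loads(const __nv_bfloat16* A, int K) {
    // Two stages so the copy of tile k+1 overlaps compute on tile k.
    // (Staging B would follow the same pattern; omitted for brevity.)
    __shared__ __align__(16) __nv_bfloat16 stage[2][TILE_M * TILE_K];

    const int row = blockIdx.y * TILE_M;

    // Prefetch the first K-tile. Each thread issues 16-byte (8 x bf16) copies.
    for (int i = threadIdx.x * 8; i < TILE_M * TILE_K; i += blockDim.x * 8)
        __pipeline_memcpy_async(&stage[0][i],
                                &A[(row + i / TILE_K) * K + i % TILE_K], 16);
    __pipeline_commit();

    for (int k = TILE_K, buf = 0; k < K; k += TILE_K, buf ^= 1) {
        // Kick off the async copy of the next K-tile into the other buffer.
        for (int i = threadIdx.x * 8; i < TILE_M * TILE_K; i += blockDim.x * 8)
            __pipeline_memcpy_async(&stage[buf ^ 1][i],
                                    &A[(row + i / TILE_K) * K + k + i % TILE_K], 16);
        __pipeline_commit();
        __pipeline_wait_prior(1);  // current buffer done; next still in flight
        __syncthreads();
        // ... tensor-core mma on stage[buf] would go here ...
        __syncthreads();           // done reading stage[buf] before it is reused
    }
    __pipeline_wait_prior(0);
    __syncthreads();
    // ... mma on the final staged tile ...
}
```

Is this the right direction, or does ThunderKittens already wrap this for pre-Hopper targets?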
What are the main optimizations or directions you would recommend for approaching cuBLAS-level performance on Ada GPUs like the 4080?
Also, could you add more demos and kernel examples that specifically target pre-Hopper architectures?
Thanks!