Hi, thanks for the great work!
I'd like to ask if there are plans to add FlashAttention-3–style Hopper kernels (WGMMA + TMA) to the block-sparse attention backend.
FA3 gives a large speedup on H100/H200, but current block-sparse kernels seem closer to FA2, so there’s still a noticeable gap on Hopper even at high sparsity.
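To make the gap easier to reproduce, here's a rough timing sketch (my own, not from this repo) comparing dense FA2 vs. FA3 on an H100. It assumes the `flash_attn` (FA2) and Hopper `flash_attn_interface` (FA3) packages are installed, and it deliberately doesn't call the block-sparse backend since I'm not sure of its exact entry points:

```python
# Rough dense FA2 vs. FA3 timing sketch on Hopper (assumed package names below).
import torch
from flash_attn import flash_attn_func as fa2_func             # FA2 kernels
from flash_attn_interface import flash_attn_func as fa3_func   # FA3 Hopper kernels (assumed import path)

def bench(fn, q, k, v, iters=50):
    # Warm up, then time with CUDA events.
    for _ in range(5):
        fn(q, k, v, causal=True)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(q, k, v, causal=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

# Shapes chosen arbitrarily for illustration: (batch, seqlen, nheads, headdim).
batch, seqlen, nheads, headdim = 4, 8192, 16, 128
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.bfloat16) for _ in range(3))

print(f"FA2 dense: {bench(fa2_func, q, k, v):.3f} ms")
print(f"FA3 dense: {bench(fa3_func, q, k, v):.3f} ms")
```

In my understanding, the FA3 number above is roughly the ceiling a Hopper-native block-sparse kernel could approach at high sparsity, which is why the current FA2-style block-sparse path leaves performance on the table even when most blocks are skipped.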
Thanks!