HipKittens

HipKittens is a repository in the ThunderKittens cinematic universe! This work provides minimal, opinionated programming primitives embedded in C++ to help you write speedy AMD AI kernels. HipKittens is built from the hardware up: we do what the silicon tells us.

[HipKittens logo]

HipKittens surfing the ~wave~ (not warp).


Links: Paper (arXiv) | Blog: Fast and Furious AMD Kernels | Blog: AMD GPUs go brrr

AI has largely run on a single hardware vendor, so how do we enable a multi-silicon future? Toward the dream of a single software framework that translates across hardware platforms, we explore whether the primitives used in prior DSLs (like ThunderKittens, or TK) suffice for AMD, or whether entirely new primitives are needed.

We find that the core tile and bulk-compute interfaces carry over from TK to HK, but decisions around memory access patterns, compute/memory scheduling, and the ordering of thread blocks within the chiplet architecture differ. HipKittens features the following types of primitives.

  1. Tile primitives: sized to the tensor core units. Tile memory ops are coalesced, bank-conflict-free, and eagerly use tensor core layouts. We focus on minimizing address computation costs.
  2. Python-inspired functions: bulk compute functions that operate over tiles. These are lightweight wrappers around assembly and HIP.
  3. Asynchronous loads/stores: hide latency and address generation using direct buffer loads to shared memory.
  4. Scheduling and overlapping: we show two core patterns for overlapping compute and memory, 8-wave ping-pong and 4-wave interleave, that recur across kernels (see the toy sketch after this list).
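
To make the overlap idea in item 4 concrete, below is a toy, host-side C++ sketch of the double buffering behind the ping-pong pattern: while compute consumes one buffer, the next tile is loaded into the other. This is an illustration of the pattern only, not HipKittens code; in a real kernel the load would be an asynchronous direct-to-shared buffer load, and the compute and memory phases would run on different wave groups.

// Toy sketch of ping-pong (double-buffered) overlap. Not the HipKittens API:
// on the GPU the "load" is an async buffer load into shared memory, and the
// compute/memory phases are split across wave groups.
#include <array>
#include <cstdio>
#include <vector>

int main() {
    constexpr int kTileElems = 4, kNumTiles = 3;
    std::array<std::vector<float>, 2> buf = {
        std::vector<float>(kTileElems), std::vector<float>(kTileElems)};

    // Stand-in for an asynchronous tile load.
    auto load_tile = [&](int tile, int slot) {
        for (int i = 0; i < kTileElems; ++i)
            buf[slot][i] = static_cast<float>(tile * kTileElems + i);
    };

    float acc = 0.0f;
    load_tile(0, 0);  // prologue: fill the first buffer
    for (int t = 0; t < kNumTiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < kNumTiles)
            load_tile(t + 1, nxt);  // issue the next load into the idle buffer
        for (float v : buf[cur])    // compute on the tile loaded previously
            acc += v;
    }
    std::printf("sum = %.1f\n", acc);  // 0 + 1 + ... + 11 = 66.0
}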

We support CDNA3 and CDNA4.

Setup

# clone the repo (via SSH)
git clone git@github.com:HazyResearch/HipKittens.git
# or via HTTPS
git clone https://github.com/HazyResearch/HipKittens.git

# For MI350X and MI355X (gfx950 arch):
# obtain the AMD docker image using docker pull or podman pull
podman pull docker.io/rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35x_beta

# enter the docker
podman run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device=/dev/kfd \
    --device=/dev/dri \
    -v $(pwd):/workdir/ \
    -e USE_FASTSAFETENSOR=1 \
    -e SAFETENSORS_FAST_GPU=1 \
    rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35x_beta \
    bash

# For MI300X/MI325X (gfx942 arch), use the docker below:
podman pull rocm/7.0-preview:rocm7.0_rel_30_ubuntu22.04_py3.10_pytorch_release_2.8.0

# enter the docker
podman run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --cap-add=CAP_SYS_ADMIN \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device=/dev/kfd \
    --device=/dev/dri \
    -v $(pwd):/workdir/ \
    -e USE_FASTSAFETENSOR=1 \
    -e SAFETENSORS_FAST_GPU=1 \
    rocm/7.0-preview:rocm7.0_rel_30_ubuntu22.04_py3.10_pytorch_release_2.8.0 \
    bash

# set the environment variables
cd HipKittens/
source env.src
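
# optional sanity check: confirm the GPU architecture visible in the container
# (expect gfx950 on MI350X/MI355X, gfx942 on MI300X/MI325X)
rocminfo | grep gfx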

# install aiter (baseline kernels)
git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
python3 setup.py develop
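
# optional: verify aiter imports from the source install
python3 -c "import aiter; print(aiter.__file__)"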

Unit tests

We provide unit tests so you can optionally verify the correctness of library functions.

cd HipKittens/tests/unit
make -j64

Quick start: running kernels

We assume you will run the following on an MI350X or MI355X unless otherwise specified. To run on the MI300X or MI325X, use the cdna3 branch of HK.

  1. BF16 GEMM
# Defaults to 8192x8192x8192
# This will compare to AITER and PyTorch automatically.
cd kernels/gemm/bf16fp32/mi350x/
make clean && make
python test_python.py

# On the MI300X or MI325X run:
git checkout cdna3 # not the main branch!
cd kernels/gemm/bf16fp32/mi325x/8192_256_256_64_16/
make clean && make
python test_python.py
  2. Attention forwards (MHA, GQA, Causal, Non-causal, Head dim 128 / 64)
# GQA, Non-causal, D=128, N=2048, H=64, H_KV=8, B=16:
# This will compare to AITER automatically. 
cd kernels/attn/gqa/
make clean && make
python test_python.py
  • To try other settings, modify the sequence length ATTN_N (e.g., 1024, 2048, 4096, 8192), the number of query heads ATTN_H and key/value heads ATTN_H_KV (e.g., 16 and 16 for MHA), and the head dimension ATTN_D (64 or 128) in both the Makefile and test_python.py, as sketched below.
  • Use the same process for gqa_causal.
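
For example, an MHA configuration at a longer sequence length might look like the following in the Makefile (variable names are from the bullet above; the exact Makefile layout may differ, and the same values must be mirrored in test_python.py):

# hypothetical Makefile excerpt
ATTN_N    = 4096   # sequence length
ATTN_H    = 16     # query heads
ATTN_H_KV = 16     # key/value heads (equal to ATTN_H for MHA)
ATTN_D    = 128    # head dimension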
  3. Attention backwards (MHA, GQA, Causal, Non-causal, Head dim 128 / 64)
# GQA, Non-causal, D=128, N=8192, H=64, H_KV=8, B=16:
# This will compare to AITER automatically. 
cd kernels/attn/gqa_backwards/
make clean && make
python test_python.py 
  4. Memory bound
# Rotary (default B=16, H=16, D=128, N=2048)
# This will compare to AITER, PyTorch, PyTorch compiled automatically.
cd kernels/rotary/
make clean && make
python test_python.py
# Layernorm fused (default B=16, H=16, D=128, N=4096)
# This will compare to PyTorch, PyTorch compiled automatically.
cd kernels/layernorm/
make clean && make
python test_python.py

Potential issues:

  • If test_python.py complains that AITER is not building, install AITER from source (see the Setup section of this README). Luckily, it is very quick! You can also comment AITER out of test_python.py if you only need the HK kernel.
  • If you see an error that bin/hipcc/ is not found, edit the Makefile to replace ROCM_BUILD_DIR with /opt/rocm/bin/hipcc.

Benchmarking

Under HipKittens/analysis we provide scripts and instructions to benchmark all the HK kernels from our paper. These sweep over different dimensions and settings, and we provide plotting scripts.

Note: We also provide instructions to reproduce our baselines (Triton, CK, hipBLASLt, Mojo, etc.) in HipKittens/analysis/baselines! As these frameworks are constantly evolving, note that our results were collected in November 2025.

Training

Under HipKittens/training we provide instructions to train either BERT or Llama models using HipKittens attention kernels, AITER kernels, or PyTorch kernels. These are lightweight. Run them within the AMD Docker.

Resources

In HipKittens/docs we provide resources on profiling kernels, dockers, and HipKittens itself. Contributions to our onboarding documents are welcome.

Get in touch!

Contact: William Hu [email protected] and Simran Arora [email protected]. Join us on Discord to get involved: use the GPU Mode invite, then join the TK channel! We welcome community contributions.

If you use or build on this work, please consider citing:

@misc{hu2025hipkittensfastfuriousamd,
      title={HipKittens: Fast and Furious AMD Kernels}, 
      author={William Hu and Drew Wadsworth and Sean Siddens and Stanley Winata and Daniel Y. Fu and Ryann Swann and Muhammad Osama and Christopher Ré and Simran Arora},
      year={2025},
      eprint={2511.08083},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.08083}, 
}
