qtip-eval

A self-contained benchmark for QTIP (NeurIPS 2024 Spotlight) that quantizes a single random 4096×4096 FP16 weight matrix at K=2/3/4 bits and measures both quantization quality and inference throughput, including hardware-level profiling via Nsight Compute.

No LLM weights needed — everything runs on synthetic data.

See results/README.md for the full benchmark report and findings.

Scripts

Script	What it does	Output
`prepare_qtip.py`	Quantize the matrix, record per-step timing	`data/quant_timings.csv`, `data/qtip_K{2,3,4}.pt`
`benchmark_dequant.py`	Dequantize → full FP16 weight → cuBLAS gemv	`results/{gpu}/dequant/results.csv`
`benchmark_fused.py`	Fused decode+matvec kernel, full step-by-step timing breakdown	`results/{gpu}/fused/results.csv`
`plot_results.py`	Generate all plots from CSVs	`results/{gpu}/**/*.png`
`profile_kernel.py`	Minimal single-kernel launch script for Nsight Compute	(used by `profile_ncu.py`)
`profile_ncu.py`	Run `ncu` via sudo, parse stall metrics, plot and save	`results/{gpu}/fused/ncu_stall_K{k}.{png,csv}`

Results are placed in results/{gpu}/ where {gpu} is the GPU name (e.g. RTX_5090).
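
How `{gpu}` is derived is not spelled out here. A minimal sketch, assuming the scripts start from the device name reported by the driver (for example via `torch.cuda.get_device_name()`), drop vendor prefixes, and replace spaces with underscores. The helper `gpu_dir_name` and its prefix list are hypothetical:

```python
import re

# Hypothetical vendor/brand prefixes to drop; the real scripts may differ.
_PREFIXES = ("NVIDIA ", "GeForce ")

def gpu_dir_name(device_name: str) -> str:
    """Turn a device name into a filesystem-friendly directory name."""
    name = device_name.strip()
    for prefix in _PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")

print(gpu_dir_name("NVIDIA GeForce RTX 5090"))  # RTX_5090
```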

Setup

1. Clone with the QTIP submodule

git clone --recurse-submodules <this-repo>
cd qtip-eval

If you already cloned without submodules:

git submodule update --init --recursive

2. Create the conda environment

conda create -n qtip python=3.11 -y
conda activate qtip

3. Install dependencies

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install git+https://github.com/Dao-AILab/fast-hadamard-transform.git --no-build-isolation
pip install matplotlib
pip install -r requirements.txt --no-build-isolation

4. Build the QTIP inference kernels (required for fused benchmark and profiling)

cd qtip/qtip-kernels
python setup.py install
cd ../..

The kernels are written and tuned for Ampere/Ada GPUs. They compile on other architectures but performance may differ.

Running

Core benchmark

Run in order:

# Step 1 — quantize (always re-runs, overwrites existing artefacts)
python prepare_qtip.py

# Step 2 — benchmark both inference paths
python benchmark_dequant.py
python benchmark_fused.py

# Step 3 — generate all plots
python plot_results.py

The two benchmarks in Step 2 are independent of each other, so either can be run alone; plot_results.py skips any plot whose source CSV is missing. Results go to results/{gpu}/.
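
The skip-if-missing behaviour can be pictured roughly as follows. This is a sketch rather than the code in plot_results.py; the function name `available_csvs` is hypothetical, and only the two CSV paths come from the Scripts table:

```python
from pathlib import Path

def available_csvs(results_dir: Path) -> dict[str, Path]:
    """Return the benchmark CSVs that exist; report and skip the rest."""
    candidates = {
        "dequant": results_dir / "dequant" / "results.csv",
        "fused": results_dir / "fused" / "results.csv",
    }
    found = {}
    for name, path in candidates.items():
        if path.is_file():
            found[name] = path
        else:
            print(f"skipping {name} plots: {path} not found")
    return found
```

A caller would then draw only the plots whose inputs are present, for example `available_csvs(Path("results/RTX_5090"))`.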

Nsight Compute profiling

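By default the NVIDIA driver restricts GPU hardware performance counters to admin users, so Nsight Compute needs either root privileges (sudo) or a one-time driver setting change.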

Option A — Enable counters permanently (recommended for a personal workstation):

echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvidia-profiling.conf
# Then reboot, or reload the driver:
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0

Option B — Run with sudo each time:

Use profile_ncu.py, which handles the sudo invocation automatically:

# Profile K=2 (default)
python profile_ncu.py

# Profile a specific bitrate
python profile_ncu.py --k 4

# Profile all three bitrates in sequence
python profile_ncu.py --all-k

The script finds ncu and python via which, passes them to sudo -E, captures the output, parses the warp stall metrics, and saves:

results/{gpu}/fused/ncu_stall_K{k}.png   — stall breakdown bar chart
results/{gpu}/fused/ncu_stall_K{k}.csv   — raw metric values
results/{gpu}/fused/ncu_stall_summary.txt — annotated bottleneck summary
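
The invocation step described above might look like the sketch below. It is not the code in profile_ncu.py: `build_ncu_command` is a hypothetical helper, and the flags are copied from the manual command documented in this section. Absolute paths are resolved first because sudo typically does not search the calling user's PATH:

```python
import shutil

def build_ncu_command(ncu_path, python_path, script="profile_kernel.py",
                      kernel="kernel_decompress_matvec"):
    """Assemble the sudo + ncu argument list (this does not run it)."""
    return [
        "sudo", "-E", ncu_path,
        "--kernel-name", kernel,
        "--launch-skip", "5", "--launch-count", "1",
        "--set", "full",
        python_path, script,
    ]

if __name__ == "__main__":
    # Resolve absolute paths in the caller's environment, as `which` would.
    ncu = shutil.which("ncu")
    python = shutil.which("python")
    if ncu is None or python is None:
        print("ncu or python not found on PATH")
    else:
        print(" ".join(build_ncu_command(ncu, python)))
```

The returned list can be passed to `subprocess.run(..., capture_output=True, text=True)` to collect the profiler output for parsing.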

To run the ncu command manually (e.g. to collect a full .ncu-rep for the GUI):

sudo -E $(which ncu) \
    --kernel-name kernel_decompress_matvec \
    --launch-skip 5 --launch-count 1 \
    --set full -o profile_out \
    $(which python) profile_kernel.py

# View on command line (import the saved file)
ncu --import profile_out.ncu-rep

# Query specific stall metrics from a saved file
ncu --import profile_out.ncu-rep --csv \
    --metrics "regex:smsp__warp_issue_stalled.*per_warp_active.pct"
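
Once the stall metrics are available as CSV text, pulling out a per-reason percentage is a small parsing job. The sketch below assumes a long-format table with "Metric Name" and "Metric Value" columns, possibly preceded by banner lines; the exact columns depend on the ncu version and page, so treat the layout as an assumption. The sample rows are placeholders to exercise the parser, not measurements:

```python
import csv
import io

PREFIX = "smsp__warp_issue_stalled_"
SUFFIX = "_per_warp_active.pct"

def parse_stalls(csv_text: str) -> dict[str, float]:
    """Map stall reason -> percentage from long-format CSV text."""
    lines = csv_text.splitlines()
    # Skip any banner lines that precede the CSV header row.
    start = next(i for i, line in enumerate(lines) if '"Metric Name"' in line)
    reader = csv.DictReader(io.StringIO("\n".join(lines[start:])))
    stalls = {}
    for row in reader:
        name = row["Metric Name"]
        if name.startswith(PREFIX) and name.endswith(SUFFIX):
            reason = name[len(PREFIX):-len(SUFFIX)]
            stalls[reason] = float(row["Metric Value"].replace(",", ""))
    return stalls

# Placeholder input: simplified columns and made-up values.
sample = '''==PROF== banner line
"Kernel Name","Metric Name","Metric Unit","Metric Value"
"kernel_decompress_matvec","smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct","%","1.0"
"kernel_decompress_matvec","smsp__warp_issue_stalled_barrier_per_warp_active.pct","%","2.0"
'''
stalls = parse_stalls(sample)
# Largest stall reason first, as a bar chart would show it.
print(sorted(stalls.items(), key=lambda kv: kv[1], reverse=True))
```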

Configuration

All parameters are constants at the top of each script. The defaults match the QTIP paper's HYB code configuration:

Parameter	Value	Description
`M, N`	4096, 4096	Weight matrix dimensions
`L`	16	Trellis shift-register bits (2^16 = 65536 states)
`K`	2, 3, 4	Bits per weight
`V`	1	VQ dimension
`tlut_bits`	9	Tunable LUT bits (512-entry codebook)
`decode_mode`	`quantlut_sym`	HYB code variant
`td_x, td_y`	16, 16	Trellis tile dimensions
`scale_override`	0.9	Scale factor
`sigma_reg`	0.01	Hessian regularisation
`REPEATS`	200	Timing repetitions per measurement
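
For a sense of scale, the table's values imply the following sizes. This is back-of-the-envelope arithmetic, assuming exactly K bits are stored per weight and ignoring the codebook, scales, and sign vectors:

```python
M, N = 4096, 4096
FP16_BITS = 16

def packed_bytes(k_bits: int) -> int:
    """Bytes for the K-bit codes alone (no codebook or scale overhead)."""
    return M * N * k_bits // 8

fp16_bytes = M * N * FP16_BITS // 8  # 33,554,432 bytes = 32 MiB

for k in (2, 3, 4):
    size = packed_bytes(k)
    print(f"K={k}: {size / 2**20:.0f} MiB, {fp16_bytes / size:.2f}x smaller than FP16")

# L = 16 shift-register bits and tlut_bits = 9, as in the table above.
print(f"trellis states: {2**16}, tunable LUT entries: {2**9}")
```

Under these assumptions the packed weights take 4, 6, and 8 MiB at K=2, 3, and 4, against 32 MiB for FP16.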

Outputs

Per-GPU results in results/{gpu}/:

File	Description
`quant_time.png`	Stacked quantization time breakdown + compression ratio
`quality_vs_k.png`	Weight MSE and output MSE vs K bits
`latency_comparison.png`	Inference latency: dequant vs fused vs FP16
`bandwidth_comparison.png`	Effective memory bandwidth per path
`dequant/latency_bar.png`	Dequant latency + output MSE subplot
`dequant/bandwidth_bar.png`	Dequant effective bandwidth
`fused/latency_bar.png`	Fused latency with 5-step breakdown (input_rot / packed_read / decode_overhead / output_rot / output_scale)
`fused/bandwidth_bar.png`	Fused effective bandwidth
`fused/kernel_breakdown.png`	Kernel-internal phase breakdown (LUT load / decode loop / reduction), scaled to wall-clock
`fused/ncu_stall_K{k}.png`	Nsight Compute warp stall breakdown by category
`fused/ncu_stall_K{k}.csv`	Raw stall metric values
`fused/ncu_stall_summary.txt`	Annotated bottleneck explanation
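
The two quality metrics plotted in `quality_vs_k.png` can be illustrated on a toy problem. The definitions used here are assumptions: weight MSE as the mean squared difference between W and its reconstruction, and output MSE as the same quantity on the matvec results. The quantizer is a plain round-to-grid stand-in, not QTIP's trellis coder:

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def matvec(w, x, n):
    """Row-major (len(w)//n by n) matrix times a length-n vector."""
    return [sum(w[r * n + c] * x[c] for c in range(n))
            for r in range(len(w) // n)]

def round_to_grid(w, k_bits, lo=-3.0, hi=3.0):
    """Stand-in quantizer: nearest of 2**k_bits evenly spaced levels."""
    step = (hi - lo) / (2 ** k_bits - 1)
    return [lo + step * round((min(max(v, lo), hi) - lo) / step) for v in w]

random.seed(0)
n = 32  # toy size; the benchmark uses 4096
w = [random.gauss(0.0, 1.0) for _ in range(n * n)]
x = [random.gauss(0.0, 1.0) for _ in range(n)]

for k in (2, 3, 4):
    w_hat = round_to_grid(w, k)
    weight_mse = mse(w, w_hat)
    output_mse = mse(matvec(w, x, n), matvec(w_hat, x, n))
    print(f"K={k}: weight MSE={weight_mse:.4f}, output MSE={output_mse:.4f}")
```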

Fused CSV columns (results/{gpu}/fused/results.csv):

Column	Description
`ms`	End-to-end fused matvec latency
`input_rot_ms`	(x · SU) → HadUt / scale
`fused_decode_ms`	decompress_matvec_qtip kernel (wall-clock)
`packed_read_ms`	HBM read proxy: `packed.sum()`
`decode_overhead_ms`	fused_decode − packed_read (compute above HBM floor)
`output_rot_ms`	HadU(out)
`output_scale_ms`	out · (SV · Wscale · scale)
`kernel_codebook_ms`	LUT load phase (globaltimer, proportionally scaled)
`kernel_loop_ms`	Decode loop phase (globaltimer, proportionally scaled)
`kernel_reduce_ms`	Reduction phase (globaltimer, proportionally scaled)

Note on kernel_* columns: The CUDA %globaltimer register on Blackwell counts only active SM execution cycles, not memory stall cycles. Raw values are ~200× smaller than wall-clock time. The benchmark scales them proportionally to fused_decode_ms so they sum to the correct total, giving accurate phase fractions even though absolute values would be misleading.
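
The proportional scaling amounts to the following. This is a sketch, not the benchmark's code; `scale_phases` is a hypothetical name and the inputs are placeholders, not measurements:

```python
def scale_phases(raw: dict[str, float], total_ms: float) -> dict[str, float]:
    """Rescale raw phase timings so they sum to total_ms, keeping ratios."""
    raw_sum = sum(raw.values())
    if raw_sum <= 0:
        raise ValueError("raw phase timings must sum to a positive value")
    return {name: total_ms * value / raw_sum for name, value in raw.items()}

# Placeholder raw timer readings and wall-clock total.
raw = {"kernel_codebook_ms": 1.0, "kernel_loop_ms": 8.0, "kernel_reduce_ms": 1.0}
scaled = scale_phases(raw, total_ms=0.5)
print(scaled)
```

Only the ratios between phases survive this step, which is why the columns are best read as fractions of fused_decode_ms.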

Repository Structure

qtip-eval/
  prepare_qtip.py        — quantization + timing
  benchmark_dequant.py   — dequant+gemv benchmark
  benchmark_fused.py     — fused matvec benchmark with step breakdown
  plot_results.py        — all plots
  profile_kernel.py      — minimal kernel launch for ncu
  profile_ncu.py         — ncu stall profiling + plotting
  qtip/                  — QTIP source (git submodule)
  data/                  — generated weight artefacts (gitignored)
  results/               — benchmark CSVs and plots (gitignored)
    README.md            — full benchmark report
    {gpu}/               — per-GPU results
      dequant/
      fused/

Reference

QTIP: Quantization with Trellises and Incoherence Processing
Albert Tseng, Qingyao Sun, David Hou, Christopher De Sa
NeurIPS 2024 Spotlight
https://arxiv.org/abs/2406.11235
