Results
- 2k  (d=16): CPU=0.163ms, CUDA=0.080ms, Shared=0.081ms, Thrust=0.726ms
- 16k (d=24): CPU=1.725ms, CUDA=0.151ms, Shared=0.152ms, Thrust=1.080ms
- 64k (d=32): CPU=9.122ms, CUDA=0.681ms, Shared=0.694ms, Thrust=1.594ms
- Speedup (CUDA basic vs CPU): 2.04x, 11.4x, 13.4x
- Iterations: 18, 15, 14 (all implementations converge identically)
Shared memory SLOWER than basic?? Unexpected
- 2k: 1.7% slower
- 16k: 0.2% slower
- 64k: 1.9% slower
- Centers array is only 2-4KB, already fits in L2 cache
- __syncthreads() overhead kills any benefit
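For reference, a minimal sketch of what the shared-memory variant does (names and exact structure are hypothetical, not the actual implementation): each block stages the centroid array into shared memory before the distance loop. Since that array is only a few KB and is already served from L2 in the basic kernel, the staging loop and barrier are pure overhead here.

```cuda
#include <cfloat>  // FLT_MAX

// Hypothetical sketch of a shared-memory assignment kernel. Each block
// copies the k*d centroid array into shared memory, then every thread
// finds the nearest centroid for its point. Launch with dynamic shared
// memory: assign_shared<<<grid, block, k * d * sizeof(float)>>>(...).
__global__ void assign_shared(const float* __restrict__ points,
                              const float* __restrict__ centroids,
                              int* __restrict__ labels,
                              int n, int k, int d) {
    extern __shared__ float s_centroids[];            // k*d floats
    for (int i = threadIdx.x; i < k * d; i += blockDim.x)
        s_centroids[i] = centroids[i];
    __syncthreads();                                  // the overhead in question

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;

    float best = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[p * d + j] - s_centroids[c * d + j];
            dist += diff * diff;                      // squared L2 distance
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    labels[p] = best_c;
}
```

The basic kernel is identical except it reads `centroids` directly; with the whole array resident in L2, those reads are already cheap, which is consistent with shared memory losing by 1-2%.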
Thrust is garbage on small data
- 2k: 4.5x SLOWER than CPU (!!!)
- Only decent at 64k (5.7x vs CPU, still 2.3x worse than CUDA)
- sort_by_key is O(n log n), too much overhead
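A sketch of the sort-based update pattern (hypothetical structure, shown for a single dimension for brevity): regroup points by cluster label, then do a segmented sum. The full O(n log n) sort every iteration is what sinks it on small inputs.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Hypothetical sketch of a Thrust-style centroid update: sort values so
// equal labels are adjacent, then reduce_by_key to sum each cluster.
void update_centroids_thrust(thrust::device_vector<int>& labels,
                             thrust::device_vector<float>& values,
                             thrust::device_vector<int>& out_labels,
                             thrust::device_vector<float>& out_sums) {
    // Reorder values by cluster label -- the O(n log n) step.
    thrust::sort_by_key(labels.begin(), labels.end(), values.begin());
    // Segmented per-label sum -- O(n).
    thrust::reduce_by_key(labels.begin(), labels.end(), values.begin(),
                          out_labels.begin(), out_sums.begin());
}
```

At 2k points the sort's fixed costs dwarf the actual work, which lines up with Thrust being 4.5x slower than even the CPU there.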
Memory bound workload
- Compute efficiency: 0.29% (basically doing nothing)
- Memory efficiency: 63.8% (this is the bottleneck)
- Arithmetic intensity: 0.185 FLOP/byte
- 736 GB/s bandwidth is the limit, not 52 TFLOPs compute
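The roofline check makes the bottleneck explicit: attainable throughput is the minimum of peak compute and arithmetic intensity times bandwidth,

```latex
P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\; I \cdot B\right)
                      = \min\!\left(52\ \text{TFLOP/s},\; 0.185\ \tfrac{\text{FLOP}}{\text{byte}} \times 736\ \tfrac{\text{GB}}{\text{s}}\right)
                      \approx 136\ \text{GFLOP/s}
```

which is about 0.26% of the 52 TFLOP/s peak, matching the ~0.29% measured compute efficiency.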
atomicAdd not as bad as expected
- 65k threads updating 512 locations
- Still got 13.4x speedup
- RTX 4080 atomic ops are fast
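The accumulation pattern behind those atomic numbers looks roughly like this (hypothetical names, not the actual implementation): every thread adds its point into its cluster's running sum and count, so 65k threads contend over only a few hundred locations.

```cuda
// Hypothetical sketch of the atomicAdd accumulation step. sums (k*d) and
// counts (k) must be zeroed before launch; a later step divides sums by
// counts to get new centroids. Ada-generation atomics resolve the
// contention cheaply enough that the kernel stays memory bound rather
// than atomic bound.
__global__ void accumulate(const float* __restrict__ points,
                           const int* __restrict__ labels,
                           float* __restrict__ sums,
                           int* __restrict__ counts,
                           int n, int d) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;
    int c = labels[p];
    for (int j = 0; j < d; ++j)
        atomicAdd(&sums[c * d + j], points[p * d + j]);
    atomicAdd(&counts[c], 1);
}
```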
Data transfer is negligible
- 0.04% of runtime
- Only moving centroids (k×d), not points
- Points stay on GPU entire time
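The per-iteration traffic pattern, sketched with hypothetical names (exact loop structure assumed): points go up once before the loop, and only the k×d centroid array crosses the bus after that.

```cuda
// Hypothetical sketch of the transfer pattern. The n*d point array is
// copied once; each iteration moves only the few-KB centroid array,
// which is why transfer amounts to ~0.04% of runtime.
cudaMemcpy(d_points, h_points, n * d * sizeof(float),
           cudaMemcpyHostToDevice);                   // once, before the loop
for (int it = 0; it < max_iters; ++it) {
    // ... assignment + accumulation kernels run entirely on-device ...
    cudaMemcpy(h_centroids, d_centroids, k * d * sizeof(float),
               cudaMemcpyDeviceToHost);               // a few KB per iteration
}
```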
CPU actually decent on 2k
- Only 2x slower
- 500KB fits in i9-12900K L3 (30MB)
- No kernel launch overhead
Scaling plateaus at 64k
- 2k→16k: 2x→11.4x (huge jump)
- 16k→64k: 11.4x→13.4x (diminishing returns)
- Hitting memory bandwidth wall