
Observations

2k  (d=16): CPU=0.163ms, CUDA=0.080ms, Shared=0.081ms, Thrust=0.726ms
16k (d=24): CPU=1.725ms, CUDA=0.151ms, Shared=0.152ms, Thrust=1.080ms
64k (d=32): CPU=9.122ms, CUDA=0.681ms, Shared=0.694ms, Thrust=1.594ms

Speedup vs CPU (CUDA basic): 2.04x, 11.4x, 13.4x
Iterations: 18, 15, 14 (all implementations converge identically)
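The speedups follow directly from the timings above:

```latex
\text{speedup} = \frac{t_{\text{CPU}}}{t_{\text{CUDA}}}:\qquad
\frac{0.163}{0.080} \approx 2.04,\qquad
\frac{1.725}{0.151} \approx 11.4,\qquad
\frac{9.122}{0.681} \approx 13.4
```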

Quick Notes

Shared memory SLOWER than basic?!

  • 2k: 1.7% slower
  • 16k: 0.2% slower
  • 64k: 1.9% slower
  • Centers array is only 2-4KB, fits in L2 cache already
  • __syncthreads() overhead kills any benefit
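A minimal sketch of what the shared-memory assignment kernel presumably looks like (kernel and variable names are assumptions, not from the repo). Since the centers block is only 2-4KB and already L2-resident, the staging loop and barrier below add cost without saving traffic:

```cuda
// Sketch: assign each point to its nearest center, with centers staged
// in shared memory. Names (assign_shared, etc.) are illustrative.
__global__ void assign_shared(const float* points, const float* centers,
                              int* labels, int n, int k, int d) {
    extern __shared__ float s_centers[];          // k*d floats, 2-4KB here
    // Cooperatively copy centers into shared memory.
    for (int i = threadIdx.x; i < k * d; i += blockDim.x)
        s_centers[i] = centers[i];
    __syncthreads();                              // the overhead noted above

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;

    float best = INFINITY;
    int best_c = 0;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[p * d + j] - s_centers[c * d + j];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    labels[p] = best_c;
}
```

The basic variant is identical minus the staging loop and `__syncthreads()`, reading `centers` straight from global memory (served by L2).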

Thrust is a bad fit for small data

  • 2k: 4.5x SLOWER than CPU (!!!)
  • Only decent at 64k (5.7x vs CPU, still 2.3x worse than CUDA)
  • sort_by_key is O(n log n), too much overhead
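The Thrust path presumably looks something like this (structure and names are assumptions): sort point indices by label, then count per-cluster segments with a keyed reduction. The sort is the O(n log n) step that dominates at small n:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// Sketch: group points by cluster via sort_by_key, then count each segment.
void count_clusters(thrust::device_vector<int>& labels,   // size n, mutated
                    thrust::device_vector<int>& idx,      // size n
                    thrust::device_vector<int>& cluster_ids,  // size k
                    thrust::device_vector<int>& counts) {     // size k
    thrust::sequence(idx.begin(), idx.end());
    // O(n log n): reorder point indices so equal labels are contiguous.
    thrust::sort_by_key(labels.begin(), labels.end(), idx.begin());
    // One reduction segment per cluster; constant 1s yield per-cluster counts.
    thrust::reduce_by_key(labels.begin(), labels.end(),
                          thrust::make_constant_iterator(1),
                          cluster_ids.begin(), counts.begin());
}
```

Against the atomicAdd approach (O(n) with no reordering), the sort never pays for itself at these sizes.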

Memory bound workload

  • Compute efficiency: 0.29% (basically doing nothing)
  • Memory efficiency: 63.8% (this is the bottleneck)
  • Arithmetic intensity: 0.185 FLOP/byte
  • 736 GB/s bandwidth is the limit, not 52 TFLOPs compute
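The roofline arithmetic backs this up: at this intensity, attainable throughput is capped by bandwidth, not peak compute,

```latex
\min\!\left(P_{\text{peak}},\; I \cdot BW\right)
= \min\!\left(52\,\text{TFLOP/s},\; 0.185\,\tfrac{\text{FLOP}}{\text{byte}} \times 736\,\tfrac{\text{GB}}{\text{s}}\right)
\approx 136\,\text{GFLOP/s}
```

and 136 GFLOP/s is about 0.26% of the 52 TFLOP/s peak, consistent with the measured 0.29% compute efficiency.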

atomicAdd not as bad as expected

  • 65k threads updating 512 locations
  • Still got 13.4x speedup
  • RTX 4080 atomic ops are fast
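A sketch of the atomic accumulation step in question (names assumed): every thread folds its point into its cluster's running sum, so up to 65k threads contend on k*d float slots plus k counters:

```cuda
// Sketch: accumulate per-cluster sums and counts with atomics.
// sums has k*d floats (~512 here), counts has k ints; names are illustrative.
__global__ void accumulate(const float* points, const int* labels,
                           float* sums, int* counts, int n, int d) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;
    int c = labels[p];
    for (int j = 0; j < d; ++j)
        atomicAdd(&sums[c * d + j], points[p * d + j]);  // contended, still fast
    atomicAdd(&counts[c], 1);
}
```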

Data transfer is nothing

  • 0.04% of runtime
  • Only moving centroids (k×d), not points
  • Points stay on GPU entire time
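In code terms the per-iteration traffic reduces to something like the following (variable names assumed; `d_points` is uploaded once before the loop and never copied back):

```cuda
// Per iteration, only the k*d centroid block crosses PCIe.
// At k*d = 512 floats that is 2 KB, hence ~0.04% of runtime.
cudaMemcpy(h_centers, d_centers, (size_t)k * d * sizeof(float),
           cudaMemcpyDeviceToHost);
```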

CPU actually decent on 2k

  • Only 2x slower
  • 500KB fits in i9-12900K L3 (30MB)
  • No kernel launch overhead

Scaling plateaus at 64k

  • 2k→16k: 2x→11.4x (huge jump)
  • 16k→64k: 11.4x→13.4x (diminishing returns)
  • Hitting memory bandwidth wall