Results
- 2k  (d=16): CPU=0.163ms, CUDA=0.080ms, Shared=0.081ms, Thrust=0.726ms
- 16k (d=24): CPU=1.725ms, CUDA=0.151ms, Shared=0.152ms, Thrust=1.080ms
- 64k (d=32): CPU=9.122ms, CUDA=0.681ms, Shared=0.694ms, Thrust=1.594ms
- Speedup (CUDA basic vs CPU): 2.04x, 11.4x, 13.4x
- Iterations: 18, 15, 14 (all implementations converge identically)
Shared memory SLOWER than basic?? Unexpected
- 2k: 1.7% slower
- 16k: 0.2% slower
- 64k: 1.9% slower
- Centers array is only 2-4KB, already fits in L2 cache
- __syncthreads() overhead kills any benefit
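For reference, a minimal sketch of what the shared-memory variant does (names and exact structure are hypothetical, not the actual implementation): each block stages the centroid array into shared memory before the distance loop. Since that array is only a few KB and is already served from L2 in the basic kernel, the staging loop and barrier are pure overhead here.

```cuda
#include <cfloat>  // FLT_MAX

// Hypothetical sketch of a shared-memory assignment kernel. Each block
// copies the k*d centroid array into shared memory, then every thread
// finds the nearest centroid for its point. Launch with dynamic shared
// memory: assign_shared<<<grid, block, k * d * sizeof(float)>>>(...).
__global__ void assign_shared(const float* __restrict__ points,
                              const float* __restrict__ centroids,
                              int* __restrict__ labels,
                              int n, int k, int d) {
    extern __shared__ float s_centroids[];            // k*d floats
    for (int i = threadIdx.x; i < k * d; i += blockDim.x)
        s_centroids[i] = centroids[i];
    __syncthreads();                                  // the overhead in question

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;

    float best = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int j = 0; j < d; ++j) {
            float diff = points[p * d + j] - s_centroids[c * d + j];
            dist += diff * diff;                      // squared L2 distance
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    labels[p] = best_c;
}
```

The basic kernel is identical except it reads `centroids` directly; with the whole array resident in L2, those reads are already cheap, which is consistent with shared memory losing by 1-2%.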
Thrust is garbage on small data
- 2k: 4.5x SLOWER than CPU (!!!)
- Only decent at 64k (5.7x vs CPU, still 2.3x worse than CUDA)
- sort_by_key is O(n log n), too much overhead
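A sketch of the sort-based update pattern (hypothetical structure, shown for a single dimension for brevity): regroup points by cluster label, then do a segmented sum. The full O(n log n) sort every iteration is what sinks it on small inputs.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Hypothetical sketch of a Thrust-style centroid update: sort values so
// equal labels are adjacent, then reduce_by_key to sum each cluster.
void update_centroids_thrust(thrust::device_vector<int>& labels,
                             thrust::device_vector<float>& values,
                             thrust::device_vector<int>& out_labels,
                             thrust::device_vector<float>& out_sums) {
    // Reorder values by cluster label -- the O(n log n) step.
    thrust::sort_by_key(labels.begin(), labels.end(), values.begin());
    // Segmented per-label sum -- O(n).
    thrust::reduce_by_key(labels.begin(), labels.end(), values.begin(),
                          out_labels.begin(), out_sums.begin());
}
```

At 2k points the sort's fixed costs dwarf the actual work, which lines up with Thrust being 4.5x slower than even the CPU there.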
Memory bound workload
- Compute efficiency: 0.29% (basically doing nothing)
- Memory efficiency: 63.8% (this is the bottleneck)
- Arithmetic intensity: 0.185 FLOP/byte
- 736 GB/s bandwidth is the limit, not 52 TFLOPs compute
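The roofline check makes the bottleneck explicit: attainable throughput is the minimum of peak compute and arithmetic intensity times bandwidth,

```latex
P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\; I \cdot B\right)
                      = \min\!\left(52\ \text{TFLOP/s},\; 0.185\ \tfrac{\text{FLOP}}{\text{byte}} \times 736\ \tfrac{\text{GB}}{\text{s}}\right)
                      \approx 136\ \text{GFLOP/s}
```

which is about 0.26% of the 52 TFLOP/s peak, matching the ~0.29% measured compute efficiency.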
atomicAdd not as bad as expected
- 65k threads updating 512 locations
- Still got 13.4x speedup
- RTX 4080 atomic ops are fast
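The accumulation pattern behind those atomic numbers looks roughly like this (hypothetical names, not the actual implementation): every thread adds its point into its cluster's running sum and count, so 65k threads contend over only a few hundred locations.

```cuda
// Hypothetical sketch of the atomicAdd accumulation step. sums (k*d) and
// counts (k) must be zeroed before launch; a later step divides sums by
// counts to get new centroids. Ada-generation atomics resolve the
// contention cheaply enough that the kernel stays memory bound rather
// than atomic bound.
__global__ void accumulate(const float* __restrict__ points,
                           const int* __restrict__ labels,
                           float* __restrict__ sums,
                           int* __restrict__ counts,
                           int n, int d) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;
    int c = labels[p];
    for (int j = 0; j < d; ++j)
        atomicAdd(&sums[c * d + j], points[p * d + j]);
    atomicAdd(&counts[c], 1);
}
```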
Data transfer is negligible
- 0.04% of runtime
- Only moving centroids (k×d), not points
- Points stay on GPU entire time
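The per-iteration traffic pattern, sketched with hypothetical names (exact loop structure assumed): points go up once before the loop, and only the k×d centroid array crosses the bus after that.

```cuda
// Hypothetical sketch of the transfer pattern. The n*d point array is
// copied once; each iteration moves only the few-KB centroid array,
// which is why transfer amounts to ~0.04% of runtime.
cudaMemcpy(d_points, h_points, n * d * sizeof(float),
           cudaMemcpyHostToDevice);                   // once, before the loop
for (int it = 0; it < max_iters; ++it) {
    // ... assignment + accumulation kernels run entirely on-device ...
    cudaMemcpy(h_centroids, d_centroids, k * d * sizeof(float),
               cudaMemcpyDeviceToHost);               // a few KB per iteration
}
```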
CPU actually decent on 2k
- Only 2x slower
- 500KB fits in i9-12900K L3 (30MB)
- No kernel launch overhead
Scaling plateaus at 64k
- 2k→16k: 2x→11.4x (huge jump)
- 16k→64k: 11.4x→13.4x (diminishing returns)
- Hitting memory bandwidth wall