math/cuda: speed up large-input BLS12-381 MSM#1205
Conversation
Greptile SummaryThis PR replaces the previously racy single-pass CUDA bucket-accumulation kernel with a correct, fully-parallel six-stage pipeline: GPU counting → CPU prefix scan + task splitting → GPU scatter (atomic) → GPU partial-segment accumulation → GPU finalize-segment → GPU two-phase chunked bucket reduction. The public
Confidence Score: 4/5The core algorithmic change is sound — the new scatter+accumulate pipeline correctly eliminates the prior race condition, overflow checks are in place throughout, and the added tests cover bucket collisions, negative signed digits, and deterministic medium-size inputs. Two dead kernel registrations and one stale comment are the only findings; no correctness, safety, or data-integrity issues were identified in the new pipeline. crates/math/src/msm/cuda.rs — the kernel registration list and the stale test comment both benefit from a small cleanup before merge.
|
| Filename | Overview |
|---|---|
| crates/math/src/gpu/cuda/shaders/msm/msm.cuh | Replaces the racy single-pass accumulation kernel with a five-stage pipeline (count, scatter, partial-accumulate, finalize-accumulate, chunked-reduce) using atomic writes for correctness; two kernels (segmented_bucket_accumulation and bucket_reduction) remain defined but are no longer called from the Rust side. |
| crates/math/src/gpu/cuda/shaders/msm/bls12_381_msm.cu | Entry-point wrappers for all new pipeline kernels added; old bucket_accumulation_bls12_381 removed; segmented_bucket_accumulation_bls12_381 and bucket_reduction_bls12_381 still exported in the .cu file but are never invoked by the Rust host code. |
| crates/math/src/msm/cuda.rs | Orchestrates the full pipeline; adds CPU-side prefix scan and task-splitting helpers with overflow guards; registers two kernels that are never called in the compute path; one test retains a stale comment about the now-fixed race condition. |
Sequence Diagram
sequenceDiagram
participant CPU as CPU (Rust)
participant GPU as GPU (CUDA Kernels)
CPU->>GPU: htod: scalars, points, config
CPU->>GPU: alloc_zeros: counts_buf
GPU->>GPU: count_bucket_entries_bls12_381
GPU->>CPU: sync_reclaim counts_buf
CPU->>CPU: exclusive_scan_counts → offsets
CPU->>CPU: build_segment_tasks → task_offsets, task_starts, task_lengths
CPU->>GPU: htod: offsets_buf, task_offsets_buf, task_starts_buf, task_lengths_buf
CPU->>GPU: alloc_zeros: write_counts_buf, sorted_descriptors_buf, buckets_buf, partial_buckets_buf
GPU->>GPU: scatter_bucket_descriptors_bls12_381
GPU->>GPU: partial_segment_accumulation_bls12_381
GPU->>GPU: finalize_segment_accumulation_bls12_381
GPU->>GPU: partial_bucket_reduction_bls12_381
GPU->>GPU: finalize_bucket_reduction_bls12_381
GPU->>CPU: sync_reclaim window_sums_buf
CPU->>CPU: Horner window combination → MSM result
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
crates/math/src/msm/cuda.rs:155-164
Two of the eight registered kernels — `segmented_bucket_accumulation_bls12_381` and `bucket_reduction_bls12_381` — are registered with `load_ptx` but never fetched via `get_func` in the compute path. The new pipeline always goes through `partial_segment_accumulation` + `finalize_segment_accumulation` and `partial_bucket_reduction` + `finalize_bucket_reduction`. Keeping dead entries in the registration list forces cudarc to materialise those kernel stubs from the PTX on every device init and keeps the names alive unnecessarily.
```suggestion
&[
"count_bucket_entries_bls12_381",
"scatter_bucket_descriptors_bls12_381",
"partial_segment_accumulation_bls12_381",
"finalize_segment_accumulation_bls12_381",
"partial_bucket_reduction_bls12_381",
"finalize_bucket_reduction_bls12_381",
],
```
### Issue 2 of 2
crates/math/src/msm/cuda.rs:1159-1160
This comment was accurate when the test deliberately used single-point inputs to sidestep the old race condition, but the race condition was eliminated by this PR. The test now exercises MSM with single points as a correctness check, not as a workaround, so the comment is misleading.
```suggestion
let g = BLS12381Curve::generator();
```
Reviews (1): Last reviewed commit: "Merge branch 'main' into issue/lambdawor..." | Re-trigger Greptile
|
Addressed in f8fdd29.
|
|
Small follow-up: this PR is still limited to the large-input CUDA MSM correctness-plus-speedup path already documented in the thread. If you want the diff cut down further, I can narrow it to the smallest mergeable slice. |
This speeds up the opt-in CUDA BLS12-381 MSM path for large inputs without changing
CudaMSM::computeor default CPU MSM selection.Summary
Replace the previous CUDA bucket path with a fixed-key pipeline: count, prefix-scan, scatter, task-list accumulation, and two-stage bucket reduction.
Keep the public API unchanged and leave CPU MSM selection untouched.
Validation
CUDA output was checked against CPU
pippenger::msm.Checked cases include mixed inputs, deliberate bucket collisions, small-window collisions, negative signed digits, zero-heavy inputs, and a deterministic 257-point input.
All checked outputs matched CPU affine coordinates.
Benchmark
Windows, local RTX 5090-class run:
65,536 points: CPU 225.135 ms, CUDA 74.398 ms, 3.0x faster.
1,048,576 points: CPU 3655.832 ms, CUDA 187.013 ms, 19.5x faster.
WSL on the same machine:
65,536 points: CPU 221.848 ms, CUDA 82.839 ms, 2.7x faster.
1,048,576 points: CPU 2582.938 ms, CUDA 233.516 ms, 11.1x faster.
Every benchmark row was correctness-checked against CPU output.
Caveats
Small inputs are still dominated by CUDA launch and host/device synchronization overhead, so this should stay an opt-in CUDA path improvement rather than a default-path switch.
Repo-local Windows tests are still blocked before MSM tests by the existing unconditional
pprof v0.13.0dev-dependency issue on this target.Repo-local WSL tests are still blocked before MSM tests by an unrelated
koalabear/rustc 1.95.0type-checking ICE path.Because of those repo-local blockers, validation used an external harness depending only on local
lambdaworks-mathwithdefault-features = falseandfeatures = ["std", "cuda"].Screenshot
Part of #730