Skip to content

math/cuda: speed up large-input BLS12-381 MSM#1205

Open
peter941221 wants to merge 3 commits into
lambdaclass:mainfrom
peter941221:issue/lambdaworks-730-gpu-msm
Open

math/cuda: speed up large-input BLS12-381 MSM#1205
peter941221 wants to merge 3 commits into
lambdaclass:mainfrom
peter941221:issue/lambdaworks-730-gpu-msm

Conversation

@peter941221
Copy link
Copy Markdown

@peter941221 peter941221 commented May 19, 2026

This speeds up the opt-in CUDA BLS12-381 MSM path for large inputs without changing CudaMSM::compute or default CPU MSM selection.

Summary

  1. Replace the previous CUDA bucket path with a fixed-key pipeline: count, prefix-scan, scatter, task-list accumulation, and two-stage bucket reduction.

  2. Keep the public API unchanged and leave CPU MSM selection untouched.

Validation

  1. CUDA output was checked against CPU pippenger::msm.

  2. Checked cases include mixed inputs, deliberate bucket collisions, small-window collisions, negative signed digits, zero-heavy inputs, and a deterministic 257-point input.

  3. All checked outputs matched CPU affine coordinates.

Benchmark

  1. Windows, local RTX 5090-class run:
    65,536 points: CPU 225.135 ms, CUDA 74.398 ms, 3.0x faster.
    1,048,576 points: CPU 3655.832 ms, CUDA 187.013 ms, 19.5x faster.

  2. WSL on the same machine:
    65,536 points: CPU 221.848 ms, CUDA 82.839 ms, 2.7x faster.
    1,048,576 points: CPU 2582.938 ms, CUDA 233.516 ms, 11.1x faster.

  3. Every benchmark row was correctness-checked against CPU output.

Caveats

  1. Small inputs are still dominated by CUDA launch and host/device synchronization overhead, so this should stay an opt-in CUDA path improvement rather than a default-path switch.

  2. Repo-local Windows tests are still blocked before MSM tests by the existing unconditional pprof v0.13.0 dev-dependency issue on this target.

  3. Repo-local WSL tests are still blocked before MSM tests by an unrelated koalabear / rustc 1.95.0 type-checking ICE path.

  4. Because of those repo-local blockers, validation used an external harness depending only on local lambdaworks-math with default-features = false and features = ["std", "cuda"].

Screenshot

CUDA MSM benchmark screenshot

Part of #730

@peter941221 peter941221 requested a review from a team as a code owner May 19, 2026 08:56
@peter941221 peter941221 mentioned this pull request May 19, 2026
2 tasks
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR replaces the previously racy single-pass CUDA bucket-accumulation kernel with a correct, fully-parallel six-stage pipeline: GPU counting → CPU prefix scan + task splitting → GPU scatter (atomic) → GPU partial-segment accumulation → GPU finalize-segment → GPU two-phase chunked bucket reduction. The public CudaMSM::compute API is unchanged, and the guard that blocked multi-point MSM is removed now that the correctness problem is solved.

  • The mathematical correctness of the new chunked bucket reduction was verified: the carry term higher_chunks_sum * chunk_len correctly accounts for cross-chunk weight adjustments, producing ∑(i+1)·B[i] across all windows.
  • Overflow is consistently guarded on the Rust side with checked_mul / checked_u32_len before any value reaches the GPU; atomic atomicAdd in the scatter kernel guarantees unique write slots with no data races.
  • Two kernels (segmented_bucket_accumulation_bls12_381 and bucket_reduction_bls12_381) are registered in load_ptx but never invoked in the compute path; one test retains a stale "race condition" comment.

Confidence Score: 4/5

The core algorithmic change is sound — the new scatter+accumulate pipeline correctly eliminates the prior race condition, overflow checks are in place throughout, and the added tests cover bucket collisions, negative signed digits, and deterministic medium-size inputs.

Two dead kernel registrations and one stale comment are the only findings; no correctness, safety, or data-integrity issues were identified in the new pipeline.

crates/math/src/msm/cuda.rs — the kernel registration list and the stale test comment both benefit from a small cleanup before merge.

Important Files Changed

Filename Overview
crates/math/src/gpu/cuda/shaders/msm/msm.cuh Replaces the racy single-pass accumulation kernel with a five-stage pipeline (count, scatter, partial-accumulate, finalize-accumulate, chunked-reduce) using atomic writes for correctness; two kernels (segmented_bucket_accumulation and bucket_reduction) remain defined but are no longer called from the Rust side.
crates/math/src/gpu/cuda/shaders/msm/bls12_381_msm.cu Entry-point wrappers for all new pipeline kernels added; old bucket_accumulation_bls12_381 removed; segmented_bucket_accumulation_bls12_381 and bucket_reduction_bls12_381 still exported in the .cu file but are never invoked by the Rust host code.
crates/math/src/msm/cuda.rs Orchestrates the full pipeline; adds CPU-side prefix scan and task-splitting helpers with overflow guards; registers two kernels that are never called in the compute path; one test retains a stale comment about the now-fixed race condition.

Sequence Diagram

sequenceDiagram
    participant CPU as CPU (Rust)
    participant GPU as GPU (CUDA Kernels)

    CPU->>GPU: htod: scalars, points, config
    CPU->>GPU: alloc_zeros: counts_buf

    GPU->>GPU: count_bucket_entries_bls12_381

    GPU->>CPU: sync_reclaim counts_buf

    CPU->>CPU: exclusive_scan_counts → offsets
    CPU->>CPU: build_segment_tasks → task_offsets, task_starts, task_lengths

    CPU->>GPU: htod: offsets_buf, task_offsets_buf, task_starts_buf, task_lengths_buf
    CPU->>GPU: alloc_zeros: write_counts_buf, sorted_descriptors_buf, buckets_buf, partial_buckets_buf

    GPU->>GPU: scatter_bucket_descriptors_bls12_381

    GPU->>GPU: partial_segment_accumulation_bls12_381

    GPU->>GPU: finalize_segment_accumulation_bls12_381

    GPU->>GPU: partial_bucket_reduction_bls12_381

    GPU->>GPU: finalize_bucket_reduction_bls12_381

    GPU->>CPU: sync_reclaim window_sums_buf

    CPU->>CPU: Horner window combination → MSM result
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
crates/math/src/msm/cuda.rs:155-164
Two of the eight registered kernels — `segmented_bucket_accumulation_bls12_381` and `bucket_reduction_bls12_381` — are registered with `load_ptx` but never fetched via `get_func` in the compute path. The new pipeline always goes through `partial_segment_accumulation` + `finalize_segment_accumulation` and `partial_bucket_reduction` + `finalize_bucket_reduction`. Keeping dead entries in the registration list forces cudarc to materialise those kernel stubs from the PTX on every device init and keeps the names alive unnecessarily.

```suggestion
                &[
                    "count_bucket_entries_bls12_381",
                    "scatter_bucket_descriptors_bls12_381",
                    "partial_segment_accumulation_bls12_381",
                    "finalize_segment_accumulation_bls12_381",
                    "partial_bucket_reduction_bls12_381",
                    "finalize_bucket_reduction_bls12_381",
                ],
```

### Issue 2 of 2
crates/math/src/msm/cuda.rs:1159-1160
This comment was accurate when the test deliberately used single-point inputs to sidestep the old race condition, but the race condition was eliminated by this PR. The test now exercises MSM with single points as a correctness check, not as a workaround, so the comment is misleading.

```suggestion
        let g = BLS12381Curve::generator();
```

Reviews (1): Last reviewed commit: "Merge branch 'main' into issue/lambdawor..." | Re-trigger Greptile

Comment thread crates/math/src/msm/cuda.rs
Comment thread crates/math/src/msm/cuda.rs Outdated
@peter941221
Copy link
Copy Markdown
Author

Addressed in f8fdd29.

  1. Removed the two unused kernel registrations from load_ptx.

  2. Dropped the stale single-point race-condition comment in test_cuda_msm_small.

@peter941221
Copy link
Copy Markdown
Author

Small follow-up: this PR is still limited to the large-input CUDA MSM correctness-plus-speedup path already documented in the thread. If you want the diff cut down further, I can narrow it to the smallest mergeable slice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant