math/cuda: speed up large-input BLS12-381 MSM by peter941221 · Pull Request #1205 · lambdaclass/lambdaworks

peter941221 · 2026-05-19T08:56:35Z

This speeds up the opt-in CUDA BLS12-381 MSM path for large inputs without changing CudaMSM::compute or default CPU MSM selection.

Summary

Replace the previous CUDA bucket path with a fixed-key pipeline: count, prefix-scan, scatter, task-list accumulation, and two-stage bucket reduction.
Keep the public API unchanged and leave CPU MSM selection untouched.

Validation

CUDA output was checked against CPU pippenger::msm.
Checked cases include mixed inputs, deliberate bucket collisions, small-window collisions, negative signed digits, zero-heavy inputs, and a deterministic 257-point input.
All checked outputs matched CPU affine coordinates.

Benchmark

Windows, local RTX 5090-class run:
65,536 points: CPU 225.135 ms, CUDA 74.398 ms, 3.0x faster.
1,048,576 points: CPU 3655.832 ms, CUDA 187.013 ms, 19.5x faster.
WSL on the same machine:
65,536 points: CPU 221.848 ms, CUDA 82.839 ms, 2.7x faster.
1,048,576 points: CPU 2582.938 ms, CUDA 233.516 ms, 11.1x faster.
Every benchmark row was correctness-checked against CPU output.

Caveats

Small inputs are still dominated by CUDA launch and host/device synchronization overhead, so this should stay an opt-in CUDA path improvement rather than a default-path switch.
Repo-local Windows tests are still blocked before MSM tests by the existing unconditional pprof v0.13.0 dev-dependency issue on this target.
Repo-local WSL tests are still blocked before MSM tests by an unrelated koalabear / rustc 1.95.0 type-checking ICE path.
Because of those repo-local blockers, validation used an external harness depending only on local lambdaworks-math with default-features = false and features = ["std", "cuda"].

Screenshot

Part of #730

greptile-apps · 2026-05-19T09:02:03Z

Greptile Summary

This PR replaces the previously racy single-pass CUDA bucket-accumulation kernel with a correct, fully-parallel six-stage pipeline: GPU counting → CPU prefix scan + task splitting → GPU scatter (atomic) → GPU partial-segment accumulation → GPU finalize-segment → GPU two-phase chunked bucket reduction. The public CudaMSM::compute API is unchanged, and the guard that blocked multi-point MSM is removed now that the correctness problem is solved.

The mathematical correctness of the new chunked bucket reduction was verified: the carry term higher_chunks_sum * chunk_len correctly accounts for cross-chunk weight adjustments, producing ∑(i+1)·B[i] across all windows.
Overflow is consistently guarded on the Rust side with checked_mul / checked_u32_len before any value reaches the GPU; atomic atomicAdd in the scatter kernel guarantees unique write slots with no data races.
Two kernels (segmented_bucket_accumulation_bls12_381 and bucket_reduction_bls12_381) are registered in load_ptx but never invoked in the compute path; one test retains a stale "race condition" comment.

Confidence Score: 4/5

The core algorithmic change is sound — the new scatter+accumulate pipeline correctly eliminates the prior race condition, overflow checks are in place throughout, and the added tests cover bucket collisions, negative signed digits, and deterministic medium-size inputs.

Two dead kernel registrations and one stale comment are the only findings; no correctness, safety, or data-integrity issues were identified in the new pipeline.

crates/math/src/msm/cuda.rs — the kernel registration list and the stale test comment both benefit from a small cleanup before merge.

Important Files Changed

Filename	Overview
crates/math/src/gpu/cuda/shaders/msm/msm.cuh	Replaces the racy single-pass accumulation kernel with a five-stage pipeline (count, scatter, partial-accumulate, finalize-accumulate, chunked-reduce) using atomic writes for correctness; two kernels (segmented_bucket_accumulation and bucket_reduction) remain defined but are no longer called from the Rust side.
crates/math/src/gpu/cuda/shaders/msm/bls12_381_msm.cu	Entry-point wrappers for all new pipeline kernels added; old bucket_accumulation_bls12_381 removed; segmented_bucket_accumulation_bls12_381 and bucket_reduction_bls12_381 still exported in the .cu file but are never invoked by the Rust host code.
crates/math/src/msm/cuda.rs	Orchestrates the full pipeline; adds CPU-side prefix scan and task-splitting helpers with overflow guards; registers two kernels that are never called in the compute path; one test retains a stale comment about the now-fixed race condition.

Sequence Diagram

sequenceDiagram
    participant CPU as CPU (Rust)
    participant GPU as GPU (CUDA Kernels)

    CPU->>GPU: htod: scalars, points, config
    CPU->>GPU: alloc_zeros: counts_buf

    GPU->>GPU: count_bucket_entries_bls12_381

    GPU->>CPU: sync_reclaim counts_buf

    CPU->>CPU: exclusive_scan_counts → offsets
    CPU->>CPU: build_segment_tasks → task_offsets, task_starts, task_lengths

    CPU->>GPU: htod: offsets_buf, task_offsets_buf, task_starts_buf, task_lengths_buf
    CPU->>GPU: alloc_zeros: write_counts_buf, sorted_descriptors_buf, buckets_buf, partial_buckets_buf

    GPU->>GPU: scatter_bucket_descriptors_bls12_381

    GPU->>GPU: partial_segment_accumulation_bls12_381

    GPU->>GPU: finalize_segment_accumulation_bls12_381

    GPU->>GPU: partial_bucket_reduction_bls12_381

    GPU->>GPU: finalize_bucket_reduction_bls12_381

    GPU->>CPU: sync_reclaim window_sums_buf

    CPU->>CPU: Horner window combination → MSM result

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
crates/math/src/msm/cuda.rs:155-164
Two of the eight registered kernels — `segmented_bucket_accumulation_bls12_381` and `bucket_reduction_bls12_381` — are registered with `load_ptx` but never fetched via `get_func` in the compute path. The new pipeline always goes through `partial_segment_accumulation` + `finalize_segment_accumulation` and `partial_bucket_reduction` + `finalize_bucket_reduction`. Keeping dead entries in the registration list forces cudarc to materialise those kernel stubs from the PTX on every device init and keeps the names alive unnecessarily.

```suggestion
                &[
                    "count_bucket_entries_bls12_381",
                    "scatter_bucket_descriptors_bls12_381",
                    "partial_segment_accumulation_bls12_381",
                    "finalize_segment_accumulation_bls12_381",
                    "partial_bucket_reduction_bls12_381",
                    "finalize_bucket_reduction_bls12_381",
                ],
```

### Issue 2 of 2
crates/math/src/msm/cuda.rs:1159-1160
This comment was accurate when the test deliberately used single-point inputs to sidestep the old race condition, but the race condition was eliminated by this PR. The test now exercises MSM with single points as a correctness check, not as a workaround, so the comment is misleading.

```suggestion
        let g = BLS12381Curve::generator();
```

_{Reviews (1): Last reviewed commit: "Merge branch 'main' into issue/lambdawor..." | Re-trigger Greptile}

peter941221 · 2026-05-19T09:10:35Z

Addressed in f8fdd29.

Removed the two unused kernel registrations from load_ptx.
Dropped the stale single-point race-condition comment in test_cuda_msm_small.

peter941221 · 2026-05-26T00:24:07Z

Small follow-up: this PR is still limited to the large-input CUDA MSM correctness-plus-speedup path already documented in the thread. If you want the diff cut down further, I can narrow it to the smallest mergeable slice.

Speed up large-input CUDA MSM for BLS12-381

93d97ed

peter941221 requested a review from a team as a code owner May 19, 2026 08:56

peter941221 mentioned this pull request May 19, 2026

MSM Optimizations #730

Open

2 tasks

Merge branch 'main' into issue/lambdaworks-730-gpu-msm

58f4d0d

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

Comment thread crates/math/src/msm/cuda.rs

Comment thread crates/math/src/msm/cuda.rs Outdated

Clean up CUDA MSM PR follow-up

f8fdd29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

math/cuda: speed up large-input BLS12-381 MSM#1205

math/cuda: speed up large-input BLS12-381 MSM#1205
peter941221 wants to merge 3 commits into
lambdaclass:mainfrom
peter941221:issue/lambdaworks-730-gpu-msm

peter941221 commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 19, 2026

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

peter941221 commented May 19, 2026

Uh oh!

peter941221 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

peter941221 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented May 19, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

peter941221 commented May 19, 2026

Uh oh!

peter941221 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

peter941221 commented May 19, 2026 •

edited

Loading