# Performance: close the gap between OpenImpala's potential and what Colab users see

## Context

Full-run analysis from `notebooks/profiling_and_tuning.ipynb` (commit `668a1bb` on
`claude/upbeat-mccarthy-f1mNN`) ruled out the suspected culprits and identified
the real ones:

| Suspect | Verdict | Evidence |
|---|---|---|
| Python bindings slow | ❌ Not it | cProfile overhead 0.004 s; amortization ratio 1.01× |
| Per-call fixed overhead | ❌ Not it | Fit intercept went negative — no detectable per-call cost |
| Setup/rebuild tax | ❌ Negligible | Naive vs. amortized: within noise |
| **HYPRE PCG solve itself** | ✅ **Dominant** | 85–95 % of wall time in §3 stage breakdown |
| **Super-linear scaling** | ✅ **Confirmed** | `t(N) = b·N^1.76` — PCG iteration count grows with N |
| **CPU-only build on GPU host** | ✅ **Hidden killer** | `OMP initialized with 2 threads`, no CUDA banner, T4 idle |

The notebook's solver-comparison chart (§9) also shows that every solver
currently exposed by the wheel is a Krylov method; no multigrid
preconditioner is available to break the `O(N^p)` curve.
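For reference, an exponent like the `1.76` above can be recovered from (size, wall time) pairs with a log-log least-squares fit. The following is a minimal sketch, not the notebook's actual fitting code: `fit_power_law` is a hypothetical helper, `N` is assumed to be the total cell count, and the timings are synthetic values generated from the fitted law rather than measurements.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of t = b * N**p in log-log space; returns (b, p)."""
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    p = sxy / sxx                        # slope of the log-log line
    b = math.exp(y_mean - p * x_mean)    # prefactor from the intercept
    return b, p


# Synthetic timings that follow t = 1e-8 * N**1.76 exactly (NOT measured data).
cells = [32**3, 64**3, 128**3]
synthetic_times = [1e-8 * n**1.76 for n in cells]
b, p = fit_power_law(cells, synthetic_times)
print(f"b = {b:.3e}, p = {p:.2f}")
```

With only three or four sizes, the fitted exponent is sensitive to noise at the smallest grid, so the fit should use the median of repeated runs.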

## Hypothesis

Two independent fixes compound:

1. **GPU build (~10–50×)**: Colab users get the pip wheel, which is
   CPU+OpenMP. Every solve leaves the T4/A100 idle. AMReX and HYPRE both
   support CUDA; this is an infrastructure/packaging fix, not an algorithmic
   one.
2. **Multigrid preconditioner (asymptotic)**: without multigrid, the
   condition number of the system PCG sees grows with `N`, so the iteration
   count grows too. A geometric multigrid preconditioner (HYPRE
   `SMG`/`PFMG`) or matrix-free MLMG should restore `O(N)` scaling. The
   expected payoff grows with problem size, with the largest win at 256³
   and above.

Together, the "slow on Colab" complaint should disappear.
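The scaling argument behind fix 2 can be sketched with the textbook conjugate-gradient bound. This is an idealized model, not a result from the notebook. It assumes `N` is the total number of unknowns on an `n × n × n` grid with spacing `h ∝ 1/n`, a second-order elliptic operator, and a preconditioner that does not change the asymptotic conditioning:

```latex
\kappa(A) = O(h^{-2}) = O(n^{2}) = O(N^{2/3}), \qquad
k_{\mathrm{PCG}} = O\!\left(\sqrt{\kappa(A)}\right) = O(N^{1/3}), \qquad
T_{\mathrm{PCG}} = O(N)\cdot k_{\mathrm{PCG}} = O(N^{4/3}).

\text{With a multigrid preconditioner } M:\quad
\kappa(M^{-1}A) = O(1) \;\Rightarrow\; k = O(1) \;\Rightarrow\; T = O(N).
```

The measured exponent of 1.76 is above the idealized 4/3, and this model does not explain the difference. It only shows why the iteration count grows without multigrid and stops growing with it.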

## Work items

### A. GPU build path (biggest single win)

- [ ] Add a CUDA build config for AMReX: `-DAMReX_GPU_BACKEND=CUDA`
      (`make` equivalent: `USE_CUDA=TRUE`).
- [ ] Verify HYPRE is built with CUDA as well (`--with-cuda`); mismatched
      device memory spaces will silently fall back to CPU solves.
- [ ] Add a CUDA CI job (can gate on self-hosted runner or manual dispatch)
      to prevent regressions.
- [ ] Publish a CUDA wheel alongside the CPU one, or document building from
      source. If wheel publishing is infeasible, ship an Apptainer image
      with CUDA.
- [ ] Update `notebooks/profiling_and_tuning.ipynb` §1a: once a GPU build
      exists, the warning becomes an instruction ("install the CUDA wheel
      with `pip install openimpala[cuda]`").
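Until a CUDA wheel exists, the §1a check amounts to scanning the captured startup output for backend banners. A minimal sketch, assuming the startup text has already been captured as a string: the OpenMP banner is the one quoted in the Context table, while the `CUDA initialized` pattern is an assumption about AMReX's wording and should be checked against a real GPU build. Both function names are hypothetical.

```python
import re

# CPU banner as quoted in the Context table of this issue.
OMP_BANNER = re.compile(r"OMP initialized with (\d+) threads?")
# ASSUMPTION: a CUDA build prints a line containing "CUDA initialized".
CUDA_BANNER = re.compile(r"CUDA initialized", re.IGNORECASE)


def classify_backend(startup_log):
    """Return (backend, omp_threads) parsed from captured startup text."""
    omp = OMP_BANNER.search(startup_log)
    threads = int(omp.group(1)) if omp else None
    backend = "CUDA" if CUDA_BANNER.search(startup_log) else "CPU"
    return backend, threads


def gpu_idle_warning(startup_log, gpu_present):
    """Return a warning string when a GPU host runs a CPU-only build."""
    backend, threads = classify_backend(startup_log)
    if gpu_present and backend == "CPU":
        return (f"GPU detected but the build is CPU-only "
                f"(OMP threads: {threads}); the GPU will sit idle.")
    return None


print(gpu_idle_warning("OMP initialized with 2 threads\n", gpu_present=True))
```

Whether a GPU is present can be decided separately, for example by checking that `nvidia-smi` is on the `PATH`. Keeping that check out of the parser makes the parser easy to test.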

### B. Multigrid preconditioner (asymptotic scaling fix)

- [ ] Expose HYPRE's `SMG` and `PFMG` as solver choices in
      `TortuosityHypre` (they are already in the `SolverType` enum and only
      need to be plumbed through). Both are structured-grid multigrid
      solvers, which suits OpenImpala's regular voxel layout.
- [ ] Optionally use `PFMG` as a *preconditioner* for PCG/GMRES rather
      than as a standalone solver; this typically combines the robustness
      of a Krylov method with multigrid's size-independent convergence.
- [ ] Evaluate MLMG (AMReX-native matrix-free multigrid) as an
      alternative: it removes the HYPRE matrix-assembly cost entirely.
      Relevant when the solve takes less time than the matrix fill.
- [ ] Add a benchmark at 256³ and 512³ comparing PCG vs PFMG vs
      PCG+PFMG vs MLMG, recording wall time and iteration count. Extend the
      existing three regression benchmarks (uniform, series, parallel
      layers); all have exact analytical solutions, so correctness is
      easy to check.
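The correctness side of that benchmark can lean on the standard layered-media results: a thickness-weighted harmonic mean for layers in series and an arithmetic mean for layers in parallel. The sketch below is a minimal illustration under that assumption. The issue does not spell out how the existing regression benchmarks define their layers, and the function names are hypothetical, not part of OpenImpala.

```python
def _validate(fractions, diffusivities):
    if not fractions or len(fractions) != len(diffusivities):
        raise ValueError("need one diffusivity per layer")
    if abs(sum(fractions) - 1.0) > 1e-12:
        raise ValueError("layer fractions must sum to 1")


def series_effective(fractions, diffusivities):
    """Layers stacked along the flux direction: weighted harmonic mean."""
    _validate(fractions, diffusivities)
    if any(d == 0.0 for d in diffusivities):
        return 0.0  # a blocking layer stops all flux
    return 1.0 / sum(f / d for f, d in zip(fractions, diffusivities))


def parallel_effective(fractions, diffusivities):
    """Layers aligned with the flux direction: weighted arithmetic mean."""
    _validate(fractions, diffusivities)
    return sum(f * d for f, d in zip(fractions, diffusivities))


def relative_error(computed, exact):
    """Relative deviation of a solver result from the analytic value."""
    if exact == 0.0:
        raise ValueError("exact value is zero; compare absolute error instead")
    return abs(computed - exact) / abs(exact)


# Two equal-thickness layers with diffusivities 1 and 3 (illustrative values).
d_series = series_effective([0.5, 0.5], [1.0, 3.0])
d_parallel = parallel_effective([0.5, 0.5], [1.0, 3.0])
print(d_series, d_parallel)
```

Each solver variant (PCG, PFMG, PCG+PFMG, MLMG) can then be held to the same `relative_error` tolerance, so a faster solver cannot pass the benchmark with a wrong answer.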

### C. Build-flag hygiene (cheap, unblocks future profiling)

- [ ] Rebuild the default wheel with `-DAMReX_TINY_PROFILE=ON`
      (`TINY_PROFILE=TRUE` for `make`). Without it, §7 of the notebook can
      only explain why the TinyProfiler table is missing, and users cannot
      self-diagnose C++ hotspots.
- [ ] Consider `BL_PROFILE` for finer-grained regions if TinyProfiler
      proves insufficient for the multigrid work above.

## Validation / acceptance

The profiling notebook is the regression harness. After each change, the
output of §12 ("DIAGNOSIS") should show:

- Build backend: `CUDA` (not `CPU`) when run on a GPU host.
- Compute scaling: exponent `p < 1.2` (not `1.76`) at 128³ → 256³.
- No "super-linear" warning.
- TinyProfiler table populated in §7.
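These criteria can be turned into an automated check. A minimal sketch, assuming `N` is the total cell count (so 128³ → 256³ is a factor of 8) and that the §12 values are available as plain Python values; the function and argument names are hypothetical.

```python
import math


def local_exponent(n_small, t_small, n_large, t_large):
    """Two-point estimate of p in t = b * N**p between two grid sizes."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)


def acceptance_failures(backend, gpu_host, exponent, tiny_profiler_rows,
                        p_max=1.2):
    """Return a list of human-readable failures; empty means all checks pass."""
    failures = []
    if gpu_host and backend != "CUDA":
        failures.append(f"build backend is {backend}, expected CUDA on a GPU host")
    if exponent >= p_max:
        failures.append(f"scaling exponent {exponent:.2f} is not below {p_max}")
    if tiny_profiler_rows == 0:
        failures.append("TinyProfiler table is empty")
    return failures


# Illustrative timings only: the 256^3 value is constructed to give p = 1.1.
p_est = local_exponent(128**3, 1.0, 256**3, 8**1.1)
print(acceptance_failures("CUDA", True, p_est, tiny_profiler_rows=5))
```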

Rough targets on a Colab T4 vs. the current state (64³ solve, PCG,
`max_grid_size=32`):

| State | Wall time | Scaling | Expected gain |
|---|---|---|---|
| Current (CPU wheel, PCG) | ~2–3 s at 64³ | `O(N^1.76)` | baseline |
| + GPU build | ~0.1–0.3 s at 64³ | still `O(N^1.76)` | 10–30× at 64³ |
| + PFMG preconditioner | not estimated | `O(N^~1.1)` | bigger win at 128³+ |
| Both | sub-second for full pipeline on reasonable grids | n/a | n/a |

## References

- Notebook: `notebooks/profiling_and_tuning.ipynb` (commit `668a1bb`)
- Full-run output: see §12 DIAGNOSIS for the exact numbers this issue is
  based on.
- AMReX CUDA docs: https://amrex-codes.github.io/amrex/docs_html/GPU.html
- HYPRE SMG/PFMG docs: https://hypre.readthedocs.io/en/latest/solvers-smg-pfmg.html
- HYPRE CUDA build: https://hypre.readthedocs.io/en/latest/ch-misc.html#gpu-use

## Out of scope

- Rewriting the C++ physics layer (current finite-difference formulation
  is correct; the bottleneck is purely in the linear solve).
- Python API changes — cProfile and the amortization test both confirmed
  the bindings add nothing measurable.

