Performance: close the gap between OpenImpala's potential and what Colab users see
Context
Full-run analysis from notebooks/profiling_and_tuning.ipynb (commit 668a1bb on
claude/upbeat-mccarthy-f1mNN) definitively ruled OUT the suspected culprits
and identified the real ones:
| Suspect | Verdict | Evidence |
|---|---|---|
| Python bindings slow | ❌ Not it | cProfile overhead 0.004 s; amortization ratio 1.01× |
| Per-call fixed overhead | ❌ Not it | Fit intercept went negative; no detectable per-call cost |
| Setup/rebuild tax | ❌ Negligible | Naive vs. amortized: within noise |
| HYPRE PCG solve itself | ✅ Dominant | 85–95 % of wall time in §3 stage breakdown |
| Super-linear scaling | ✅ Confirmed | t(N) = b·N^1.76; PCG iteration count grows with N |
| CPU-only build on GPU host | ✅ Hidden killer | OMP initialized with 2 threads, no CUDA banner, T4 idle |
The notebook solver-comparison chart (§9) also shows that currently all
competing solvers on the wheel are Krylov methods — there is no multigrid
preconditioner present to break the O(N^p) curve.
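The exponent in t(N) = b·N^p is the slope of a straight line in log-log space. A minimal sketch of such a fit follows, assuming timings are available as (N, seconds) pairs. The function name `fit_power_law` is hypothetical, and the sample timings are synthetic values generated from the reported exponent, not the notebook's measurements.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of log t = log b + p * log N; returns (b, p)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    p = sxy / sxx                       # slope = scaling exponent
    b = math.exp(y_mean - p * x_mean)   # intercept = prefactor
    return b, p

# Synthetic timings (assumption): generated from t = 8e-10 * N^1.76,
# purely to exercise the fit. Replace with measured wall times.
sizes = [32**3, 64**3, 128**3]
times = [8e-10 * n ** 1.76 for n in sizes]

b, p = fit_power_law(sizes, times)
print(f"t(N) = {b:.3g} * N^{p:.2f}")
```

With real measurements the points will not lie exactly on a line, so it is worth fitting at least three sizes and checking the residuals before trusting the exponent.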
Hypothesis
Two independent fixes compound:
- GPU build (~10–50×) — Colab users get the pip wheel, which is
CPU+OpenMP. Every solve leaves the T4/A100 idle. AMReX + HYPRE both
support CUDA; this is an infrastructure/packaging fix, not an algorithmic
one.
- Multigrid preconditioner (asymptotic): the condition number of the
discretized system grows with N, so PCG without a multigrid
preconditioner needs more iterations as the grid is refined. A
geometric multigrid preconditioner (HYPRE SMG/PFMG) or matrix-free
MLMG should restore roughly O(N) scaling. The expected payoff grows with
problem size, with the largest win at 256³ and above.
Together, the "slow on Colab" complaint should disappear.
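The premise of the first fix, that a GPU is present but unused, can be checked on any host before running a solve. The sketch below only asks the NVIDIA driver which devices are visible via `nvidia-smi -L`; it does not query OpenImpala's build backend, for which no API is assumed here. The helper name `visible_gpus` is hypothetical.

```python
import shutil
import subprocess

def visible_gpus():
    """Return the GPU lines reported by `nvidia-smi -L`, or [] if none are visible."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        out = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return []
    if out.returncode != 0:
        return []
    # Each device is listed on a line such as "GPU 0: Tesla T4 (UUID: ...)".
    return [line.strip() for line in out.stdout.splitlines() if line.startswith("GPU ")]

gpus = visible_gpus()
if gpus:
    print("GPU host; a CPU-only wheel would leave these idle:", gpus)
else:
    print("No NVIDIA GPU visible; the CPU+OpenMP wheel is the expected backend.")
```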
Work items
A. GPU build path (biggest single win)
B. Multigrid preconditioner (asymptotic scaling fix)
C. Build-flag hygiene (cheap, unblocks future profiling)
Validation / acceptance
The profiling notebook is the regression harness. After each change, the
output of §12 ("DIAGNOSIS") should show:
- Build backend: CUDA (not CPU) when run on a GPU host.
- Compute scaling: exponent p < 1.2 (not 1.76) at 128³ → 256³.
- No "super-linear" warning.
- TinyProfiler table populated in §7.
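The first two criteria can be expressed as a small check. This is a sketch under stated assumptions: the function names, the backend strings "CUDA"/"CPU", and the example timings are all hypothetical, and the "super-linear" warning and TinyProfiler criteria are not covered. The exponent is the two-point estimate p = log(t₂/t₁) / log(N₂/N₁), where going from 128³ to 256³ gives N₂/N₁ = 8.

```python
import math

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Two-point estimate of p in t(N) = b * N^p."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

def check_acceptance(backend, gpu_host, t_128, t_256, p_max=1.2):
    """Return a list of failed criteria; an empty list means both checks pass."""
    failures = []
    if gpu_host and backend != "CUDA":
        failures.append(f"backend is {backend}, expected CUDA on a GPU host")
    p = scaling_exponent(128**3, t_128, 256**3, t_256)
    if p >= p_max:
        failures.append(f"scaling exponent p = {p:.2f}, expected < {p_max}")
    return failures

# Hypothetical timings in seconds, chosen to give p = 1.1 and p = 1.76.
passing = check_acceptance("CUDA", True, t_128=1.0, t_256=8 ** 1.1)
failing = check_acceptance("CPU", True, t_128=1.0, t_256=8 ** 1.76)
print("passing run:", passing)
print("failing run:", failing)
```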
Rough targets on a Colab T4 vs. the current state (64³ solve, PCG,
max_grid_size=32):
| State | Wall time | Speedup / scaling |
|---|---|---|
| Current (CPU wheel, PCG) | ~2–3 s at 64³, O(N^1.76) | baseline |
| + GPU build | ~0.1–0.3 s at 64³, still O(N^1.76) | 10–30× at 64³ |
| + PFMG preconditioner | bigger win at 128³+ | O(N^~1.1) |
| Both | sub-second for full pipeline on reasonable grids | n/a |
References
Out of scope
- Rewriting the C++ physics layer (current finite-difference formulation
is correct; the bottleneck is purely in the linear solve).
- Python API changes — cProfile and the amortization test both confirmed
the bindings add nothing measurable.
Work items
A. GPU build path (biggest single win)
- … -DAMReX_GPU_BACKEND=CUDA (make equivalent: USE_CUDA=TRUE).
- … --with-cuda); mismatched device memory spaces will silently fall back to CPU solves.
- … to prevent regressions.
- … source. If wheel publishing is infeasible, ship an Apptainer image with CUDA.
- … notebooks/profiling_and_tuning.ipynb §1a: once a GPU build exists, the warning becomes an instruction ("install the CUDA wheel with pip install openimpala[cuda]").
B. Multigrid preconditioner (asymptotic scaling fix)
- … SMG and PFMG as solver choices in TortuosityHypre (they're already in the SolverType enum; plumb them through). They are structured-grid multigrid solvers that match OpenImpala's voxel-regular layout perfectly.
- … PFMG as a preconditioner for PCG/GMRES rather than as a standalone solver, often the best of both worlds.
- … alternative: removes the HYPRE matrix-assembly cost entirely. Relevant when the solve itself is smaller than the matrix fill.
- … PCG+PFMG vs MLMG: wall time and iteration count. Extend the existing three regression benchmarks (uniform, series, parallel layers); all have exact analytical solutions, so correctness is easy to check.
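The effect a multigrid preconditioner is expected to have on iteration counts can be seen on a toy problem. The sketch below is not HYPRE PFMG and not OpenImpala's operator: it is a 1D Poisson model problem in pure Python, with a hand-written geometric V(2,2) cycle (damped Jacobi smoothing, full-weighting restriction, linear interpolation) used as the preconditioner inside conjugate gradients. All names are illustrative.

```python
import math

def apply_A(u, h):
    """Matrix-free 1D Poisson operator (-u'') with zero Dirichlet boundaries."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Damped Jacobi smoothing; the diagonal of A is 2 / h^2."""
    for _ in range(sweeps):
        Au = apply_A(u, h)
        u = [ui + omega * (h * h / 2.0) * (fi - ai) for ui, fi, ai in zip(u, f, Au)]
    return u

def v_cycle(f, h):
    """One V(2,2) cycle from a zero guess; requires len(f) = 2^k - 1."""
    n = len(f)
    if n == 1:
        return [f[0] * h * h / 2.0]                 # exact solve on the coarsest grid
    u = jacobi([0.0] * n, f, h, sweeps=2)           # pre-smoothing
    r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
    nc = (n - 1) // 2
    rc = [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2] for j in range(nc)]
    ec = v_cycle(rc, 2.0 * h)                       # coarse-grid correction
    for j in range(nc):                             # coarse points map to odd fine indices
        u[2 * j + 1] += ec[j]
    for j in range(nc + 1):                         # even fine indices: average neighbours
        left = ec[j - 1] if j > 0 else 0.0
        right = ec[j] if j < nc else 0.0
        u[2 * j] += 0.5 * (left + right)
    return jacobi(u, f, h, sweeps=2)                # post-smoothing keeps the cycle symmetric

def identity(r, h):
    """No preconditioning: plain CG."""
    return r[:]

def pcg_iterations(n, precond, tol=1e-8, max_iter=2000):
    """Solve A x = 1 with preconditioned CG; return the iteration count."""
    h = 1.0 / (n + 1)
    b = [1.0] * n
    x = [0.0] * n
    r = b[:]
    z = precond(r, h)
    p = z[:]
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for k in range(1, max_iter + 1):
        Ap = apply_A(p, h)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return k
        z = precond(r, h)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return max_iter

sizes = (63, 127, 255)
plain = [pcg_iterations(n, identity) for n in sizes]
multigrid = [pcg_iterations(n, v_cycle) for n in sizes]
for n, k_plain, k_mg in zip(sizes, plain, multigrid):
    print(f"n={n:4d}  plain CG: {k_plain:4d} iterations  MG-preconditioned CG: {k_mg:3d}")
```

On this model problem the plain CG iteration count grows with the grid size while the preconditioned count stays roughly flat, which is the behaviour the benchmarks above are meant to confirm (or refute) for the real 3D solver.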
C. Build-flag hygiene (cheap, unblocks future profiling)
- … -DAMReX_TINY_PROFILE=ON (TINY_PROFILE=TRUE for make). Currently §7 of the notebook has to explain why there is no TinyProfiler table; users can't self-diagnose C++ hotspots without it.
- … BL_PROFILE for finer-grained regions if TinyProfiler proves insufficient for the multigrid work above.
References
- notebooks/profiling_and_tuning.ipynb (commit 668a1bb)
- … based on.