Performance: close the gap between OpenImpala's potential and what Colab users see #255

@jameslehoux

Context

Full-run analysis from notebooks/profiling_and_tuning.ipynb (commit 668a1bb on
claude/upbeat-mccarthy-f1mNN) definitively ruled OUT the suspected culprits
and identified the real ones:

| Suspect | Verdict | Evidence |
| --- | --- | --- |
| Python bindings slow | ❌ Not it | cProfile overhead 0.004 s; amortization ratio 1.01× |
| Per-call fixed overhead | ❌ Not it | Fit intercept went negative: no detectable per-call cost |
| Setup/rebuild tax | ❌ Negligible | Naive vs. amortized: within noise |
| HYPRE PCG solve itself | Dominant | 85–95 % of wall time in §3 stage breakdown |
| Super-linear scaling | Confirmed | t(N) = b·N^1.76; PCG iteration count grows with N |
| CPU-only build on GPU host | Hidden killer | OMP initialized with 2 threads, no CUDA banner, T4 idle |

The notebook's solver-comparison chart (§9) also shows that every solver
currently available in the wheel is a Krylov method; no multigrid
preconditioner is present to break the O(N^p) curve.
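
The exponent in t(N) = b·N^p can be estimated with a least-squares fit in log-log space. The sketch below is illustrative only: `fit_power_law` is a hypothetical helper, N is assumed to be the total cell count, and the timings are synthetic values generated from an exact power law, not output from the notebook.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of t = b * N**p in log-log space; returns (b, p)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    k = len(xs)
    x_mean = sum(xs) / k
    y_mean = sum(ys) / k
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    p = sxy / sxx                       # slope of the log-log line
    b = math.exp(y_mean - p * x_mean)   # prefactor from the intercept
    return b, p

# Synthetic timings from an exact power law (NOT measured data).
cells = [32**3, 48**3, 64**3, 96**3]
synthetic_times = [1e-8 * n**1.76 for n in cells]

b, p = fit_power_law(cells, synthetic_times)
print(f"fitted exponent p = {p:.2f}")
```

With real measurements the points will scatter around the line, so the fit is most useful when it covers at least three or four grid sizes.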

Hypothesis

Two independent fixes compound:

  1. GPU build (~10–50×) — Colab users get the pip wheel, which is
    CPU+OpenMP. Every solve leaves the T4/A100 idle. AMReX + HYPRE both
    support CUDA; this is an infrastructure/packaging fix, not an algorithmic
    one.
  2. Multigrid preconditioner (asymptotic): without a multigrid
    preconditioner, the condition number of the linear system grows with
    N, so the PCG iteration count grows too. A geometric multigrid
    preconditioner (HYPRE SMG/PFMG) or matrix-free MLMG should restore
    O(N) scaling. The expected payoff grows with problem size, with the
    largest win at 256³ and above.

Together, the "slow on Colab" complaint should disappear.
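
The iteration-count growth behind item 2 can be reproduced on a toy problem. The sketch below runs unpreconditioned conjugate gradients on a 1D Poisson operator and counts iterations for increasing grid sizes. It is not OpenImpala's solver or HYPRE's PCG; every function name here is hypothetical, and the 1D model problem stands in for the real 3D system purely for illustration.

```python
def apply_laplacian_1d(v):
    """Matrix-free 1D Poisson operator (stencil -1, 2, -1), Dirichlet ends."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * v[i] - left - right)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cg_iterations(n, rel_tol=1e-8, max_iter=10_000):
    """Run unpreconditioned CG on A x = 1 and return the iteration count."""
    b = [1.0] * n
    x = [0.0] * n
    r = b[:]                 # residual for the zero initial guess
    p = r[:]                 # first search direction
    rr = dot(r, r)
    target = (rel_tol ** 2) * rr
    for it in range(1, max_iter + 1):
        ap = apply_laplacian_1d(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        if rr_new <= target:
            return it
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return max_iter

counts = {n: cg_iterations(n) for n in (16, 32, 64)}
print(counts)
```

The count rises with the number of unknowns even though the work per iteration is only linear in n. A multigrid preconditioner is meant to remove that growth.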

Work items

A. GPU build path (biggest single win)

  • Add a CUDA build config for AMReX: -DAMReX_GPU_BACKEND=CUDA
    (make equivalent: USE_CUDA=TRUE).
  • Verify HYPRE is built with CUDA as well (--with-cuda); mismatched
    device memory spaces will silently fall back to CPU solves.
  • Add a CUDA CI job (can gate on self-hosted runner or manual dispatch)
    to prevent regressions.
  • Publish a CUDA wheel alongside the CPU one, or document building from
    source. If wheel publishing is infeasible, ship an Apptainer image
    with CUDA.
  • Update notebooks/profiling_and_tuning.ipynb §1a: once a GPU build
    exists, the warning becomes an instruction ("install the CUDA wheel
    with pip install openimpala[cuda]").
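
The "CPU-only build on a GPU host" condition that §1a warns about could be detected along these lines. This is a hedged sketch: `host_has_nvidia_gpu` and `diagnose_backend` are hypothetical helpers, the presence of `nvidia-smi` on PATH is only a heuristic, and the `"CPU"`/`"CUDA"` strings are placeholders, since how the OpenImpala build reports its backend is not specified in this issue.

```python
import shutil

def host_has_nvidia_gpu() -> bool:
    """Heuristic only: assume a GPU host if the nvidia-smi tool is on PATH."""
    return shutil.which("nvidia-smi") is not None

def diagnose_backend(build_backend: str, gpu_present: bool) -> str:
    """Compare the build backend (placeholder string) with the host hardware."""
    backend = build_backend.strip().upper()
    if backend == "CUDA" and gpu_present:
        return "OK: CUDA build on a GPU host"
    if backend == "CUDA":
        return "WARNING: CUDA build but no GPU detected"
    if gpu_present:
        return "WARNING: CPU build on a GPU host; the GPU will sit idle"
    return "OK: CPU build on a CPU-only host"

# "CPU" is a stand-in for whatever the installed build actually reports.
print(diagnose_backend("CPU", host_has_nvidia_gpu()))
```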

B. Multigrid preconditioner (asymptotic scaling fix)

  • Expose HYPRE's SMG and PFMG as solver choices in
    TortuosityHypre (they're already in the SolverType enum — plumb
    them through). They are structured-grid multigrid solvers that
    match OpenImpala's voxel-regular layout perfectly.
  • Optionally use PFMG as a preconditioner for PCG/GMRES rather
    than as a standalone solver — often the best of both worlds.
  • Evaluate MLMG (AMReX-native matrix-free multigrid) as an
    alternative: it removes the HYPRE matrix-assembly cost entirely,
    which matters when the matrix fill costs more than the solve itself.
  • Add a benchmark at 256³ and 512³ comparing PCG vs PFMG vs
    PCG+PFMG vs MLMG — wall time and iteration count. Extend the
    existing three regression benchmarks (uniform, series, parallel
    layers) — all have exact analytical solutions so correctness is
    easy to check.
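
For the layered benchmarks, the exact reference values follow from the standard series and parallel mixing rules. The sketch below assumes each layer has a positive diffusivity and a known volume fraction; the function names are hypothetical, and the actual phase properties used by the existing regression benchmarks are not stated in this issue.

```python
def series_effective_diffusivity(fractions, diffusivities):
    """Layers stacked along the transport direction: weighted harmonic mean."""
    return 1.0 / sum(f / d for f, d in zip(fractions, diffusivities))

def parallel_effective_diffusivity(fractions, diffusivities):
    """Layers aligned with the transport direction: weighted arithmetic mean."""
    return sum(f * d for f, d in zip(fractions, diffusivities))

def relative_error(computed, exact):
    """Relative deviation of a solver result from the analytical value."""
    return abs(computed - exact) / abs(exact)

# Illustrative two-layer case with equal volume fractions (made-up values).
fractions = [0.5, 0.5]
diffusivities = [1.0, 3.0]
d_series = series_effective_diffusivity(fractions, diffusivities)
d_parallel = parallel_effective_diffusivity(fractions, diffusivities)
print(d_series, d_parallel)
```

A benchmark run for any of PCG, PFMG, PCG+PFMG, or MLMG could then pass its computed effective diffusivity through `relative_error` and compare the result with a chosen tolerance.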

C. Build-flag hygiene (cheap, unblocks future profiling)

  • Rebuild the default wheel with -DAMReX_TINY_PROFILE=ON
    (TINY_PROFILE=TRUE for make). Currently §7 of the notebook has
    to explain why there is no TinyProfiler table — users can't
    self-diagnose C++ hotspots without it.
  • Consider BL_PROFILE for finer-grained regions if TinyProfiler
    proves insufficient for the multigrid work above.

Validation / acceptance

The profiling notebook is the regression harness. After each change, the
output of §12 ("DIAGNOSIS") should show:

  • Build backend: CUDA (not CPU) when run on a GPU host.
  • Compute scaling: exponent p < 1.2 (not 1.76) at 128³ → 256³.
  • No "super-linear" warning.
  • TinyProfiler table populated in §7.

Rough targets on a Colab T4 vs. the current state (64³ solve, PCG,
max_grid_size=32):

| State | Wall time | Scaling |
| --- | --- | --- |
| Current (CPU wheel, PCG) | ~2–3 s at 64³, O(N^1.76) | baseline |
| + GPU build | ~0.1–0.3 s at 64³, still O(N^1.76) | 10–30× at 64³ |
| + PFMG preconditioner | bigger win at 128³+ | O(N^~1.1) |
| Both | sub-second for full pipeline on reasonable grids | |
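
To see why the preconditioner matters more as grids grow, the power law can be extrapolated from the 64³ estimate. This is a rough sketch under stated assumptions: N is taken to be the total cell count, the exponents 1.76 and 1.1 are the ones quoted above, both curves are assumed to start from the same 64³ wall time, and `projected_time` is a hypothetical helper. The printed numbers are extrapolations, not measurements.

```python
def projected_time(t_ref, n_ref, n_target, exponent):
    """Extrapolate wall time along t ~ N**exponent from a reference point."""
    return t_ref * (n_target / n_ref) ** exponent

t64 = 2.5            # midpoint of the ~2–3 s estimate at 64³ (assumed baseline)
n64 = 64**3

for side in (128, 256):
    cells = side**3
    t_krylov = projected_time(t64, n64, cells, 1.76)   # current PCG exponent
    t_mg = projected_time(t64, n64, cells, 1.1)        # target exponent with PFMG
    print(f"{side}^3: PCG ~{t_krylov:.0f} s, PFMG-preconditioned ~{t_mg:.0f} s, "
          f"ratio ~{t_krylov / t_mg:.1f}x")
```

The ratio between the two curves widens with each doubling of the grid side, which matches the expectation that the multigrid payoff is largest on the biggest grids.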

Out of scope

  • Rewriting the C++ physics layer (current finite-difference formulation
    is correct; the bottleneck is purely in the linear solve).
  • Python API changes — cProfile and the amortization test both confirmed
    the bindings add nothing measurable.
