
Parallel ensemble I/O for State and Increment #1243

Draft: travissluka wants to merge 1 commit into develop from feature/ensemble-parallel-io



@travissluka travissluka commented May 14, 2026

Description

Warning

Make sure you check out the matching oops branch! This soca branch will still compile and run, but it won't go any faster without the oops branch.

Replaces the per-member serial ensemble read/write path with bulk routines that drive concurrent I/O across rotated reader/writer PEs, with optional MPI-async batching for the gather/scatter collectives. Previously the LETKF was doing all ensemble file writes on PE0... ouch. Ensemble member reads were using "strided" read mode (where each PE reads its own section of a netCDF file): not the worst, but probably not the best.

This PR uses the oops::StateSet::readEnsemble / writeEnsemble methods, which let us bypass the per-member loop and do ensemble reads/writes truly in parallel.

YAML configuration

The default settings should now do parallel ensemble I/O in what is hopefully the most efficient way on an HPC system, but there are some knobs to turn if you need them. All knobs live under geometry.io.

| key | values | default | what it does |
|---|---|---|---|
| ensemble write | parallel or sequential | parallel | parallel = M concurrent writers via rotated root_pe; sequential = old gather-to-rank-0 path. |
| ensemble read | scatter or strided | scatter | scatter = M reader_pes pull whole-globals and MPI_Scatterv to compute groups; strided = each PE reads its compute-domain tile directly. |
| async mpi | true or false | true | Batch Igatherv / Iscatterv across all (reader/writer, var) pairs so collectives with different roots overlap. Applies to both read and write. |
| single state read | strided or scatter | strided | Per-PE direct compute-tile reads vs single-reader-then-scatter for the single-state (non-ensemble) path. Strided is the right default on parallel filesystems. |
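To make "rotated root_pe" concrete, here is a minimal Python sketch of one way member roots could be spread across the PEs. The rotation rule (an even stride of n_pes // n_members) and the function names are assumptions for illustration only; the actual soca_io_ensemble_root_pe assignment in this PR may differ.

```python
def ensemble_root_pe(member: int, n_members: int, n_pes: int) -> int:
    """Hypothetical rotation rule: pick the PE that gathers/writes `member`.

    Members are spaced n_pes // n_members apart so that, when there are at
    least as many PEs as members, every member gets its own writer PE.
    """
    if n_members <= 0 or n_pes <= 0:
        raise ValueError("n_members and n_pes must be positive")
    if not 0 <= member < n_members:
        raise ValueError("member index out of range")
    stride = max(n_pes // n_members, 1)
    return (member * stride) % n_pes


def writer_pes(n_members: int, n_pes: int) -> list:
    """Root PE for every ensemble member, in member order."""
    return [ensemble_root_pe(m, n_members, n_pes) for m in range(n_members)]


if __name__ == "__main__":
    # 4 members on 16 PEs: four distinct writers instead of PE0 doing it all.
    print(writer_pes(4, 16))  # [0, 4, 8, 12]
    # More members than PEs: roots wrap around, so some PEs write twice.
    print(writer_pes(6, 4))   # [0, 1, 2, 3, 0, 1]
```

With `ensemble write: sequential` the equivalent list would be all zeros, which is the gather-to-rank-0 bottleneck described above.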

If you want to A/B against the old behavior on the same checkout, set under geometry.io:

geometry:
  io:
    ensemble write: sequential
    ensemble read: strided
    async mpi: false
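For anyone scripting the A/B runs, the following Python sketch shows how the geometry.io keys could resolve against the defaults listed in the table. It is not the Fortran soca_io_config_from_yaml routine; the function name, the validation, and the error behavior are assumptions, and the input is a plain dict standing in for the parsed YAML.

```python
# Defaults and allowed values, copied from the table in this description.
DEFAULTS = {
    "ensemble write": "parallel",
    "ensemble read": "scatter",
    "async mpi": True,
    "single state read": "strided",
}
ALLOWED = {
    "ensemble write": {"parallel", "sequential"},
    "ensemble read": {"scatter", "strided"},
    "single state read": {"strided", "scatter"},
}


def resolve_io_config(geometry: dict) -> dict:
    """Hypothetical resolver: merge geometry['io'] over the defaults."""
    user = geometry.get("io") or {}
    unknown = set(user) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown geometry.io keys: {sorted(unknown)}")
    resolved = {**DEFAULTS, **user}
    for key, value in resolved.items():
        if key == "async mpi":
            ok = isinstance(value, bool)
        else:
            ok = value in ALLOWED[key]
        if not ok:
            raise ValueError(f"invalid value {value!r} for '{key}'")
    return resolved


if __name__ == "__main__":
    # The legacy-behavior override from the YAML snippet above.
    legacy = resolve_io_config({"io": {
        "ensemble write": "sequential",
        "ensemble read": "strided",
        "async mpi": False,
    }})
    print(legacy)
```

Note that the legacy override leaves `single state read` at its default of strided, so only the three ensemble-related knobs change between the two runs.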

Expected performance impact

Based on tests with a synthetic memory/file-system delay emulating HPC performance... my assistant forecasts an improvement of 1.5–3× on reads, 10–30× on writes... let's see how good that forecast is!

  • I'm 100% confident ensemble write: parallel is faster, no question there
  • I'm fairly confident that ensemble read: scatter is faster when the number of PEs/nodes is large. But if reads turn out to be slower, it might be worth switching this back to strided, which is what the existing code does
  • async mpi: true in theory should go faster when doing ensemble I/O, but the flag is here in case we want to verify
  • for single state read I'm honestly not sure which would be faster; in theory strided should be faster if the number of PEs >> the number of Lustre OSTs.
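For context on what the scatter read mode costs on the reader side, here is a small Python sketch of the send counts and displacements that an MPI_Scatterv call needs, plus a pure-Python emulation of the resulting split. It assumes a flattened global field divided into contiguous, near-equal 1-D chunks; the real code scatters 2-D compute-domain tiles, so its packing will differ, and both function names are made up for illustration.

```python
def scatterv_layout(n_global: int, n_pes: int):
    """Counts and displacements for a near-even 1-D split (assumed layout)."""
    base, extra = divmod(n_global, n_pes)
    # The first `extra` PEs take one additional element each.
    counts = [base + (1 if pe < extra else 0) for pe in range(n_pes)]
    displs = [0] * n_pes
    for pe in range(1, n_pes):
        displs[pe] = displs[pe - 1] + counts[pe - 1]
    return counts, displs


def emulate_scatterv(global_field, counts, displs):
    """What each destination PE would receive, without any MPI involved."""
    return [global_field[d:d + c] for c, d in zip(counts, displs)]


if __name__ == "__main__":
    counts, displs = scatterv_layout(10, 4)
    print(counts, displs)  # [3, 3, 2, 2] [0, 3, 6, 8]
    print(emulate_scatterv(list(range(10)), counts, displs))
```

In scatter mode only the reader PE touches the file and the network does the distribution; in strided mode every PE opens the file and reads its own chunk, which is the trade-off the bullets above are weighing.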

Issue(s) addressed

Dependencies

This PR depends on:

Impact

Behavior is identical to the legacy path. Expected wall-clock speedup on HPC; no expected impact on downstream science code.

Checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have run the unit tests before creating the PR

Note

My initial testing says this SHOULD go a lot faster on an HPC system, but I'm waiting for confirmation that it actually does before cleaning up and reviewing the code myself.
Also, this PR includes the changes for #1242, so don't bother looking at the actual code here until THAT PR is merged first.

@travissluka travissluka self-assigned this May 14, 2026
Rebased onto develop after the direct-netCDF soca_io_mod PR (#1242) landed.
End-to-end ensemble-parallel I/O on top of the stateless direct-netCDF reader.
Opt-in via the new oops StateSet/IncrementSet bulk-readEnsemble/writeEnsemble
dispatch path; soca provides the model-level implementations.

State/Increment:
  - State::{readEnsemble,writeEnsemble} and Increment::{readEnsemble,
    writeEnsemble} (Fields + interface Fortran bindings to match)
  - writeEnsemble: gather each member to a strided writer PE
    (soca_io_ensemble_root_pe assignment), then phase-2 per-member netCDF
    writes happen concurrently across writer PEs
  - readEnsemble: per-member loop today, honoring single-state read mode
    (strided vs scatter); parallel-across-members reads via
    soca_io_readers_commit_ensemble are wired for a follow-up

soca_io_mod:
  - writer staging split into define/gather/write/close phases plus
    soca_io_writers_commit_ensemble; reader staging mirrored with
    read/distribute/close and soca_io_readers_commit_ensemble. Reader
    opens+closes ncid inside stage_read (no global file-handle cache,
    matching develop's stateless model post-2d7ab3cf)
  - new public knobs: soca_io_ensemble_write_parallel,
    soca_io_ensemble_read_scatter, soca_io_single_state_read_scatter,
    soca_io_async_mpi, soca_io_ensemble_root_pe
  - mpi_pelist_and_comm helper for the writer/reader PE subsets

Geometry:
  - soca_io_config_from_yaml called from soca_geom_init resolves the
    parallel/sequential, scatter/strided, async-mpi knobs from geometry.io
    once; values persist module-level for the run

Companion to JCSDA-internal/oops setID fix in StateSet::buildFromConfigs
(needed so PseudoModel and HTLM dispatch see correct per-member IDs when
StateSet takes the HasReadEnsemble path).
@travissluka travissluka force-pushed the feature/ensemble-parallel-io branch from 0c72b73 to 3b0674f Compare May 23, 2026 00:34

Development

Successfully merging this pull request may close these issues.

Completely redo model I/O to parallelize it
