Parallel ensemble I/O for State and Increment by travissluka · Pull Request #1243 · JCSDA-internal/soca

travissluka · 2026-05-14T19:22:12Z

Description

Warning

Make sure you check out the matching oops branch! This soca branch will still compile and run, but it won't go any faster without the oops branch.

Replaces the per-member-serial ensemble read/write path with bulk routines that drive concurrent I/O across rotated reader/writer PEs, with optional MPI-async batching for the gather/scatter collectives. Previously the LETKF was doing all ensemble file writes on PE0... ouch. Ensemble member reads were using "strided" read mode (each PE reads its own section of a netCDF file), which is not the worst, but probably not the best.

This PR uses the new oops::StateSet::readEnsemble / writeEnsemble methods, which let us bypass the per-member loop and do ensemble reads and writes in true parallel.
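To make the "rotated reader/writer PEs" idea concrete, here is a minimal sketch of how each ensemble member could be given its own root PE. The function name `rotated_root_pes` and the even-stride rule are assumptions for illustration only; the actual assignment lives in `soca_io_ensemble_root_pe` and may differ.

```python
def rotated_root_pes(n_members: int, n_pes: int) -> list[int]:
    """Hypothetical helper: pick one root PE per ensemble member.

    Roots are spread evenly over the available PEs so that members do not
    all funnel through PE 0. When there are more members than PEs, the
    roots wrap around and some PEs serve several members.
    """
    if n_members <= 0 or n_pes <= 0:
        raise ValueError("n_members and n_pes must be positive")
    # Distance between consecutive roots; at least 1 so the root still
    # advances when members outnumber PEs.
    stride = max(1, n_pes // n_members)
    return [(member * stride) % n_pes for member in range(n_members)]


if __name__ == "__main__":
    # 4 members on 16 PEs: roots land on PEs 0, 4, 8, 12.
    print(rotated_root_pes(4, 16))
```

With distinct roots, the per-member gathers and file writes can proceed on different PEs at the same time instead of queueing on one rank.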

YAML configuration

The default settings should now do parallel ensemble I/O in what is hopefully the most efficient way on an HPC system. There are some knobs to turn if you need them; all of them live under geometry.io.

| key | values | default | what it does |
|---|---|---|---|
| `ensemble write` | `parallel` \| `sequential` | `parallel` | Parallel = M concurrent writers via rotated `root_pe`; sequential = old gather-to-rank-0 path. |
| `ensemble read` | `scatter` \| `strided` | `scatter` | Scatter = M reader_pes pull whole-globals and `MPI_Scatterv` to compute groups; strided = each PE reads its compute-domain tile directly. |
| `async mpi` | `true` \| `false` | `true` | Batch `Igatherv` / `Iscatterv` across all (reader/writer, var) pairs so collectives with different roots overlap. Applies to both read and write. |
| `single state read` | `strided` \| `scatter` | `strided` | Per-PE direct compute-tile reads vs single-reader-then-scatter for the single-state (non-ensemble) path. Strided is the right default on parallel filesystems. |
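For readers unfamiliar with the scatter read mode, the sketch below shows the counts/displacements bookkeeping that an `MPI_Scatterv`-style distribution needs. It is a pure-Python, 1-D simplification with no MPI calls; the names `scatterv_layout` and `scatter` are hypothetical, and the real code distributes 2-D compute-domain tiles rather than contiguous 1-D chunks.

```python
def scatterv_layout(n_global: int, n_pes: int) -> tuple[list[int], list[int]]:
    """Return (counts, displacements) for splitting n_global points over n_pes."""
    base, extra = divmod(n_global, n_pes)
    # The first `extra` PEs take one additional point so every point is assigned.
    counts = [base + (1 if pe < extra else 0) for pe in range(n_pes)]
    # Each displacement is the offset of that PE's chunk in the global buffer.
    displs = [sum(counts[:pe]) for pe in range(n_pes)]
    return counts, displs


def scatter(global_field: list, n_pes: int) -> list[list]:
    """Emulate what each PE would receive after the reader scatters a field."""
    counts, displs = scatterv_layout(len(global_field), n_pes)
    return [global_field[d:d + c] for c, d in zip(counts, displs)]


if __name__ == "__main__":
    # One reader PE holds the whole global field and hands out three chunks.
    print(scatter(list(range(10)), 3))
```

In strided mode there is no such distribution step: each PE reads its own slice from the file directly.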

If you want to A/B against the old behavior on the same checkout, set under geometry.io:

geometry:
  io:
    ensemble write: sequential
    ensemble read: strided
    async mpi: false
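As a rough illustration of how the four knobs and their defaults combine, the sketch below resolves a parsed `geometry` mapping into a full set of settings. This is not the Fortran `soca_io_config_from_yaml` routine; `resolve_io_config`, `DEFAULTS`, and `ALLOWED` are hypothetical names, and the input is assumed to be a plain dict already parsed from YAML.

```python
# Defaults and allowed values as listed in the table in this description.
DEFAULTS = {
    "ensemble write": "parallel",
    "ensemble read": "scatter",
    "async mpi": True,
    "single state read": "strided",
}
ALLOWED = {
    "ensemble write": ("parallel", "sequential"),
    "ensemble read": ("scatter", "strided"),
    "async mpi": (True, False),
    "single state read": ("strided", "scatter"),
}


def resolve_io_config(geometry: dict) -> dict:
    """Merge user settings under geometry['io'] with the defaults."""
    io = geometry.get("io") or {}
    unknown = set(io) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown geometry.io keys: {sorted(unknown)}")
    resolved = {**DEFAULTS, **io}
    for key, value in resolved.items():
        if value not in ALLOWED[key]:
            raise ValueError(f"{key!r}: {value!r} not in {ALLOWED[key]}")
    return resolved


if __name__ == "__main__":
    # The A/B configuration from above: legacy behavior on the same checkout.
    legacy = {"io": {"ensemble write": "sequential",
                     "ensemble read": "strided",
                     "async mpi": False}}
    print(resolve_io_config(legacy))
```

Note that the legacy A/B configuration leaves `single state read` unset, so it falls back to its default of `strided`.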

Expected performance impact

Based on tests with a synthetic memory/file-system delay emulating HPC performance... my assistant forecasts an improvement of 1.5–3× on reads, 10–30× on writes... let's see how good that forecast is!

- I'm 100% confident `ensemble write: parallel` is faster, no question there.
- I'm fairly confident that `ensemble read: scatter` is faster when the number of PEs/nodes is large. But if read times turn out slower, it might be worth testing with this set back to `strided`, which is the method in use before this PR.
- `async mpi: true` should in theory go faster when doing ensemble I/O, but the flag is here in case we want to verify.
- For `single state read` I'm honestly not sure which would be faster; in theory `strided` should be faster if the number of PEs >> the number of Lustre OSTs.
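A back-of-the-envelope way to see why parallel writes should win: if every member file takes about the same time to write, the old path does M writes back to back on one PE, while W concurrent writers need only ceil(M / W) rounds. The toy model below encodes just that arithmetic. It is an idealized upper bound under stated assumptions (equal write times, no filesystem contention, no gather cost), not a measurement, and `ideal_write_speedup` is a hypothetical name.

```python
import math


def ideal_write_speedup(n_members: int, n_writers: int) -> float:
    """Upper bound on write speedup versus a single-writer path.

    Assumes equal per-member write time and ignores gather cost and
    filesystem contention, so real speedups will be lower.
    """
    if n_members <= 0 or n_writers <= 0:
        raise ValueError("n_members and n_writers must be positive")
    # Number of back-to-back write rounds when n_writers work concurrently.
    rounds = math.ceil(n_members / n_writers)
    return n_members / rounds


if __name__ == "__main__":
    for writers in (1, 10, 30):
        print(writers, ideal_write_speedup(30, writers))
```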

Issue(s) addressed

Resolves #1125 (Completely redo model I/O to parallelize it)

Dependencies

This PR depends on:

https://github.com/JCSDA-internal/oops/tree/feature/ensemble-parallel-io (branch, not yet a PR — adds the State::readEnsemble/writeEnsemble dispatch hooks)

Impact

Behavior is identical to the legacy path. Expected wall-clock speedup on HPC; no expected impact on downstream science code.

Checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have run the unit tests before creating the PR

Note

My initial testing says this SHOULD go a lot faster on an HPC, but I'm waiting for confirmation that it indeed does before actually cleaning up and reviewing the code myself.
Also, this PR includes the changes for #1242, so don't bother looking at the actual code here until THAT PR is merged.

Rebased onto develop after the direct-netCDF soca_io_mod PR (#1242) landed.

End-to-end ensemble-parallel I/O on top of the stateless direct-netCDF reader. Opt-in via the new oops StateSet/IncrementSet bulk-readEnsemble/writeEnsemble dispatch path; soca provides the model-level implementations.

State/Increment:
- State::{readEnsemble,writeEnsemble} and Increment::{readEnsemble, writeEnsemble} (Fields + interface Fortran bindings to match)
- writeEnsemble: gather each member to a strided writer PE (soca_io_ensemble_root_pe assignment), then phase-2 per-member netCDF writes happen concurrently across writer PEs
- readEnsemble: per-member loop today, honoring single-state read mode (strided vs scatter); parallel-across-members reads via soca_io_readers_commit_ensemble are wired for a follow-up

soca_io_mod:
- writer staging split into define/gather/write/close phases plus soca_io_writers_commit_ensemble; reader staging mirrored with read/distribute/close and soca_io_readers_commit_ensemble. Reader opens+closes ncid inside stage_read (no global file-handle cache, matching develop's stateless model post-2d7ab3cf)
- new public knobs: soca_io_ensemble_write_parallel, soca_io_ensemble_read_scatter, soca_io_single_state_read_scatter, soca_io_async_mpi, soca_io_ensemble_root_pe
- mpi_pelist_and_comm helper for the writer/reader PE subsets

Geometry:
- soca_io_config_from_yaml called from soca_geom_init resolves the parallel/sequential, scatter/strided, async-mpi knobs from geometry.io once; values persist module-level for the run

Companion to JCSDA-internal/oops setID fix in StateSet::buildFromConfigs (needed so PseudoModel and HTLM dispatch see correct per-member IDs when StateSet takes the HasReadEnsemble path).

travissluka self-assigned this May 14, 2026

travissluka requested review from DavidNew-NOAA and shlyaeva May 14, 2026 19:33

travissluka force-pushed the feature/ensemble-parallel-io branch from 0c72b73 to 3b0674f Compare May 23, 2026 00:34

travissluka wants to merge 1 commit into develop from feature/ensemble-parallel-io
