Parallel ensemble I/O for State and Increment #1243
Draft
travissluka wants to merge 1 commit into
Rebased onto develop after the direct-netCDF soca_io_mod PR (#1242) landed.

End-to-end ensemble-parallel I/O on top of the stateless direct-netCDF reader. Opt-in via the new oops StateSet/IncrementSet bulk-readEnsemble/writeEnsemble dispatch path; soca provides the model-level implementations.

State/Increment:
- State::{readEnsemble,writeEnsemble} and Increment::{readEnsemble,writeEnsemble} (Fields + interface Fortran bindings to match)
- writeEnsemble: gather each member to a strided writer PE (soca_io_ensemble_root_pe assignment), then phase-2 per-member netCDF writes happen concurrently across writer PEs
- readEnsemble: per-member loop today, honoring single-state read mode (strided vs scatter); parallel-across-members reads via soca_io_readers_commit_ensemble are wired for a follow-up

soca_io_mod:
- writer staging split into define/gather/write/close phases plus soca_io_writers_commit_ensemble; reader staging mirrored with read/distribute/close and soca_io_readers_commit_ensemble. Reader opens+closes ncid inside stage_read (no global file-handle cache, matching develop's stateless model post-2d7ab3cf)
- new public knobs: soca_io_ensemble_write_parallel, soca_io_ensemble_read_scatter, soca_io_single_state_read_scatter, soca_io_async_mpi, soca_io_ensemble_root_pe
- mpi_pelist_and_comm helper for the writer/reader PE subsets

Geometry:
- soca_io_config_from_yaml called from soca_geom_init resolves the parallel/sequential, scatter/strided, async-mpi knobs from geometry.io once; values persist module-level for the run

Companion to JCSDA-internal/oops setID fix in StateSet::buildFromConfigs (needed so PseudoModel and HTLM dispatch see correct per-member IDs when StateSet takes the HasReadEnsemble path).
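To make the "strided writer PE" idea in the commit message concrete, here is a minimal Python sketch of one way members could be spread over writer PEs. The function names and the stride rule are assumptions for illustration only; the actual soca_io_ensemble_root_pe assignment lives in the Fortran and may well differ.

```python
def ensemble_root_pes(n_members, n_pes):
    """Assign each ensemble member a writer PE, spread evenly over all PEs.

    Hypothetical stand-in for soca_io_ensemble_root_pe; the real rule may differ.
    """
    if n_members < 1 or n_pes < 1:
        raise ValueError("need at least one member and one PE")
    # Distance between consecutive writers. Falls back to 1 (with wrap-around)
    # when there are more members than PEs.
    stride = max(1, n_pes // n_members)
    return [(member * stride) % n_pes for member in range(n_members)]


def members_by_writer(root_pes):
    """Group member indices by the PE that ends up writing them."""
    groups = {}
    for member, pe in enumerate(root_pes):
        groups.setdefault(pe, []).append(member)
    return groups


roots = ensemble_root_pes(n_members=4, n_pes=16)
print(roots)                     # [0, 4, 8, 12]
print(members_by_writer(roots))  # one member per writer PE
```

With four members on 16 PEs, each member gets its own writer, so the per-member netCDF writes can proceed concurrently instead of queueing on PE0. When members outnumber PEs, the sketch wraps around and some PEs write more than one member.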
Description
Warning
Make sure you check out the matching oops branch! This soca branch will still compile and run, but it won't go any faster without the oops branch.
Replaces the per-member-serial ensemble read/write path with bulk routines that drive concurrent I/O across rotated reader/writer PEs, with optional MPI-async batching for the gather/scatter collectives. Previously the LETKF was doing all ensemble file writes on PE0... ouch. Ensemble member reads were using "strided" read mode (whereby each PE reads its own section of a netCDF file): not the worst, but probably not the best.
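As a rough illustration of the two read modes mentioned above, the Python sketch below uses a 1-D toy decomposition. All names here are hypothetical, and soca's real domain decomposition is not 1-D; the point is only that both modes hand each PE the same slab, and differ in who touches the file.

```python
def block_counts_displs(n_points, n_pes):
    """Split n_points into n_pes contiguous blocks, as evenly as possible.

    Returns (counts, displs), the per-PE sizes and offsets that
    MPI_Scatterv / MPI_Gatherv take as arguments.
    """
    base, extra = divmod(n_points, n_pes)
    # The first `extra` PEs take one additional point each.
    counts = [base + (1 if pe < extra else 0) for pe in range(n_pes)]
    displs = [sum(counts[:pe]) for pe in range(n_pes)]
    return counts, displs


def strided_read(field, rank, counts, displs):
    """'strided' mode: every PE reads only its own slab from the file."""
    return field[displs[rank]:displs[rank] + counts[rank]]


def scatter_read(field, counts, displs):
    """'scatter' mode: one reader holds the whole field and hands out slabs
    (the job MPI_Scatterv does in a real MPI program)."""
    return [field[d:d + c] for c, d in zip(counts, displs)]


field = list(range(10))  # stand-in for one variable in a netCDF file
counts, displs = block_counts_displs(len(field), n_pes=4)
print(counts, displs)    # [3, 3, 2, 2] [0, 3, 6, 8]
```

In strided mode every PE opens the file, so the file system sees as many readers as there are PEs. In scatter mode only the reader PEs touch the file and the rest of the traffic moves over MPI.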
This PR uses an oops::StateSet::readEnsemble/writeEnsemble set of methods that let us bypass the per-member loop, doing ensemble read/write in true parallel.
YAML configuration
The default settings should now do parallel ensemble I/O in what is hopefully the most efficient way on an HPC system, but there are some knobs to turn if you need them. All knobs live under geometry.io. Each entry below gives the key, its allowed values, its default, and a description:
- ensemble write: parallel|sequential, default parallel. […] root_pe; sequential = old gather-to-rank-0 path.
- ensemble read: scatter|strided, default scatter. […] MPI_Scatterv to compute groups; strided = each PE reads its compute-domain tile directly.
- async mpi: true|false, default true. […] Igatherv/Iscatterv across all (reader/writer, var) pairs so collectives with different roots overlap. Applies to both read and write.
- single state read: strided|scatter, default strided.
If you want to A/B against the old behavior on the same checkout, set under geometry.io:
Expected performance impact
Based on tests with a synthetic memory/file-system delay emulating HPC performance... my assistant forecasts an improvement of 1.5–3× on reads, 10–30× on writes... let's see how good that forecast is!
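One way to check that forecast is an A/B run against the legacy path. The Python sketch below resolves the geometry.io knobs from their documented defaults and prints a YAML fragment for an assumed legacy configuration. The key names, values, and defaults come from the YAML configuration section; the helper names are hypothetical, and treating `async mpi: false` as part of the old behavior is my assumption, not something the PR states.

```python
# Knob names, allowed values, and defaults as listed under geometry.io.
DEFAULTS = {
    "ensemble write": "parallel",
    "ensemble read": "scatter",
    "async mpi": True,
    "single state read": "strided",
}
ALLOWED = {
    "ensemble write": {"parallel", "sequential"},
    "ensemble read": {"scatter", "strided"},
    "async mpi": {True, False},
    "single state read": {"strided", "scatter"},
}

# Assumed approximation of the pre-PR behavior: gather-to-rank-0 writes,
# strided ensemble reads, and (assumption) no async batching.
LEGACY = {
    "ensemble write": "sequential",
    "ensemble read": "strided",
    "async mpi": False,
}


def resolve_io_config(overrides=None):
    """Start from the defaults and apply validated user overrides."""
    cfg = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in ALLOWED:
            raise KeyError(f"unknown geometry.io key: {key!r}")
        if value not in ALLOWED[key]:
            raise ValueError(f"invalid value {value!r} for {key!r}")
        cfg[key] = value
    return cfg


def to_yaml(cfg):
    """Render the resolved knobs as a geometry.io YAML fragment."""
    lines = ["geometry:", "  io:"]
    for key, value in cfg.items():
        text = str(value).lower() if isinstance(value, bool) else value
        lines.append(f"    {key}: {text}")
    return "\n".join(lines)


print(to_yaml(resolve_io_config(LEGACY)))
```

Running the same experiment once with this fragment and once with no overrides at all (the new defaults) would give the two timings to compare.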
- ensemble write: parallel is faster, no question there.
- ensemble read: scatter is faster when the number of PEs/nodes is large. But if read times are slower, it might be worth testing putting this back to strided, which is the currently implemented method.
- async mpi: true should in theory go faster when doing ensemble I/O, but the flag is here in case we want to verify.
- single state read: I'm honestly not sure which would be faster. In theory strided should be faster if the number of PEs >> the number of Lustre OSTs.
Issue(s) addressed
Dependencies
This PR depends on:
- […] (State::readEnsemble/writeEnsemble dispatch hooks)
Impact
Behavior is identical to the legacy path. Expected wall-clock speedup on HPC; no expected impact on downstream science code.
Checklist
Note
My initial testing says this SHOULD go a lot faster on an HPC, but I'm waiting for confirmation that it indeed does before actually cleaning up and reviewing the code myself.
Also, this PR includes the changes for #1242, so don't bother looking at the actual code here until THAT PR is first merged.