Skip to content

Refactored soca -> cice#1246

Open
shlyaeva wants to merge 52 commits into
developfrom
feature/new_soca2cice
Open

Refactored soca -> cice#1246
shlyaeva wants to merge 52 commits into
developfrom
feature/new_soca2cice

Conversation

@shlyaeva
Copy link
Copy Markdown
Collaborator

@shlyaeva shlyaeva commented May 27, 2026

Description

Sea-ice analysis postprocessing in C++/atlas (PostProcessIce): replaces the Fortran Soca2Cice variable change with a C++/atlas postprocessor that projects an aggregate SOCA ice analysis onto a per-category CICE restart. The new algorithm is largely based on @DmitryDukhovskoy's Python scripts to insert ice concentration, thickness and snow depth (https://github.com/DmitryDukhovskoy/RTOFS_utilities/blob/gaea_pub/prepare_cice6/insert_iconc_ithkn_cice6_restart.py and https://github.com/DmitryDukhovskoy/RTOFS_utilities/blob/gaea_pub/prepare_cice6/insert_hsnow_cice6_restart.py). Validated on 7 production GDAS cycles spanning different seasons.

What's new

  • soca::PostProcessIce in src/soca/PostProcess/,
    with a single public entry point:

    soca::State PostProcessIce::postprocess(const soca::State & analysis) const;

    Reads the CICE background restart, applies the per-cell pass, writes the postprocessed restart in update mode, returns an aggregate-ice State {aice, hice, hsno} matching what was written. Used by the standalone soca_postproc.x app, by AnalysisPostproc.h's ensemble loop, and by gdasapp's increment handler.

  • Per-cell pass (see README for details): case dispatch (LAND / ICE2NOICE / NOICE2ICE / ICE2ICE), ITD rebin (on by default), aicen-weighted snow distribution, optional freeboard enforcement, mass-conserving min-vice cleanup. Pure column-physics helpers (BL99 enthalpy, BZ99 salinity profile, thickness-category solver, freeboard) live in IcePhysics.h/.cc with their own unit tests in TestIcePhysics.cc.

  • New-ice thermo seeding: cells that transitioned from bg_aicen=0 to new_aicen>0 get Tsfcn from the area-weighted mean of a KDTree donor with any ice (global lat/lon tree, donor data gathered once via a sparse halo exchange); sub-surface qice/sice synthesized from CICE physics at the ocean freezing point.

  • soca_postproc.x standalone application
    (src/mains/Postproc.h): takes background + increment, forms analysis = bg + incr, runs PostProcessIce. Also accepts an explicit analysis block.

  • Dedicated CICE restart writer on the soca::State side:
    soca::State::writeCice(cfg) calls a new Fortran entry
    soca_fields::write_cice that wraps the update-mode CICE writer (byte-copy input restart → output, overwrite only modelled variables, ~40 unmodelled CICE vars pass through).

  • fields_metadata.yml: new <LEVEL> placeholder for templated per-layer entries (CICE per-layer restart vars qice00N, sice00N, qsno00N); combines with <CATEGORY> to expand ncat × ice_lev entries automatically. Per-cat ice io name updated to restart naming (aice<N>, vice<N>, vsno<N> - dropping the _h suffix which is the CICE history convention; restarts don't carry it).

  • AnalysisPostproc.h: ensemble loop migrated to call PostProcessIce::postprocess per member. Soca2Cice variable change no longer used here.

Tests

Four end-to-end tests on the 72×35 grid (one per major code path) plusa pure unit test on the column physics:

  • test_soca_soca2cice_new - baseline (rebin on)
  • test_soca_soca2cice_new_freeboard - freeboard enforcement
  • test_soca_soca2cice_new_seed - new-ice seeding + thermo
  • test_soca_soca2cice_new_bgfallback - analysis variables resolver
    (only aice analysed, hice/hsno fall back to background)
  • test_soca_icephysics - pure C++ unit test of the column-physics helpers
  • test_soca_postprocice_vs_soca2cice - Fortran-vs-C++ regression
    (Soca2Cice output compared bit-wise to PostProcessIce output;
    budget = 2 differing cells at 1e-10 tolerance)
  • test_soca_ensanpproc - ensemble postproc end-to-end

All pass. The legacy Fortran Soca2Cice path is still available through soca_convertstate.x for the regression test; it can be deprecated in a follow-up PR.

Configuration

A minimal yaml stanza:

postprocess ice:
  ncat: 5
  ice_lev: 7
  sno_lev: 1
  cice restart:
    input:  /path/to/cice_background.res.nc   # also the update-mode template
    output: /path/to/cice_postprocessed.res.nc

Full schema with defaults in src/soca/PostProcess/README.md.

Checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have run the unit tests before creating the PR

shlyaeva and others added 30 commits April 27, 2026 10:47
Introduces src/soca/IO/soca_io_mod.F90 with soca_io_file_reader /
soca_io_file_writer types (init/enqueue/commit/close), and routes all
state, geometry, and balance-coefficient I/O through it.

PE 0 calls nf90_get_var / nf90_put_var; mpp_broadcast and mpp_gather
distribute / collect data on the geometry's pelist. Removes all
fms_io_mod usages (register_restart_field, restore_state, save_restart,
read_data, file_exist, field_exist, fms_io_init, fms_io_exit) from
soca, including the global_soca_geom_counter shim. fms_init / fms_end
are retained (subsystem init only, no I/O).

Resolves the GDAS Tsnz_h shape mismatch by building nf90_get_var count
from the file's actual dim sizes, not the caller's buffer rank.

Per-domain state writes (ocn/sfc/ice/wav/bio) and reads now use the
new module; the FMS code paths are removed in the same change rather
than gated.

Refs #1125. Obsoletes PR #1241.
Under FMS, the writer would implicitly add the .nc suffix, so test
YAMLs and downstream readers reference 'ocn.<exp>.<typ>.<date>.nc'.
The direct-netCDF writer in soca_io_mod is faithful to whatever
string it's given, so files were landing without the extension and
no consumer could find them.
Six soca_add_test() calls listed test_socs_parameters_diffusion as a
dependency (should be test_soca_). Because the named test does not
exist, ctest treated the dependency as void and ran the variational
tests concurrently with parameters_diffusion under -j>1, causing
races on the diffusion calibration outputs.
Drops the SOCA_IO_SERIAL / SOCA_IO_PARALLEL mode bifurcation (and the
method= arg + io.method config knob) and rewrites the reader to do
per-PE strided nf90_get_var, mirroring FMS mpp_io's domain-decomposed
read pattern. Each PE opens the file once via NF90_NOWRITE and reads
its own compute-domain tile via nf90_get_var(start, count) -- the same
shape as fms_io's READ_RECORD_ for fileset.NE.MPP_SINGLE.

Adds a module-level read cache mirroring fms_io's get_file_unit +
files_read(i)%var(j) tables:
 - one nf90_open per (PE, filename) for the whole run
 - cached (varid, file_ndims, middle_dim sizes) per (file, var)
 - soca_io_close_all() flushes the cache; wired into soca_geom_end

Caller updates: drop method= from reader/writer init() and the
soca_io_method_from_config helper from soca_fields_mod / soca_geom_mod
/ soca_balance_mod.

1-deg 20-mem LETKF read phase (LocalEnsembleDA before solver ctor):
  baseline (FMS):                  3.91 s
  before this commit (no cache):  13.92 s
  with this commit:                3.97 s
Outputs are bit-identical.
…cache)

Adds a second reader_commit implementation -- PE 0 nf90_get_var the
global field + mpp_broadcast -- alongside the existing per-PE strided
path. Both paths share the file-handle + var-metadata cache. Select via
SOCA_IO_READ_MODE=broadcast|strided (default broadcast); the env var is
latched on first reader_commit.

Why broadcast as default: in DA cycling the page cache is always cold,
since each cycle reads files that the previous cycle / model run just
wrote. On cold cache the broadcast path (1 sequential read on PE 0,
kernel readahead happy) beats strided (8 concurrent strided readers
thrash the prefetcher and pay 8x the open-syscall cost).

1-deg 20-mem LETKF, rancor, taskset -c 0-7, cold cache (sample 1):
  broadcast: 13.5 s read
  strided:   30.5 s read
Warm cache reverses the result (strided 3.9 s vs broadcast 13.3 s) but
cycling never sees warm cache. On a parallel filesystem (Lustre/GPFS)
or multi-node setup the strided path may win again -- toggle the env
var to test.
Drops SOCA_IO_READ_MODE in favor of a yaml block on the geometry
config:
    geometry:
      io:
        read mode: broadcast | strided
If absent the module-level default (broadcast) is kept.

soca_geom_init calls soca_io_read_mode_from_config(f_conf) once at
startup; the choice latches for every subsequent reader_commit.
Unrecognized values abort so typos surface immediately.
Annotates testinput/letkf.yml's geometry block with the two read-mode
options so a contributor reading the canonical letkf example sees both
choices without having to hunt for the soca_io_mod docstring.
The write path emits FMS-style auto-numbered dim names (xaxis_N /
yaxis_N / zaxis_N / Time); the file-header docstring incorrectly
claimed xh / yh / Time. Correct the comment -- no code change.
Writer:
- enqueue holds pointers to caller buffers instead of allocating and
  copying (mirrors FMS register_restart_field contract). Compute-slice
  extraction moves from enqueue to commit, so peak per-writer memory
  drops from sum(var_bytes) to one (nx_c x ny_c) tile.
- commit uses the 3D mpp_gather overload for 3D vars: one collective
  per var instead of nlevels, and no per-level gbuf3d(:,:,k) = gbuf2d
  memcpy on root.
- Reuse gbuf3d / tile3 across 3D vars when nlevels matches.

Reader:
- commit_reader_strided hoists tile2 out of the per-var loop and
  reuses tile3 / tile4 when trailing dims match.
- Dead tile3/tile4 zero-inits removed (read_var_strided fully fills
  the buffer).

read_var_strided: fixed-size stack arrays for st / ct instead of
per-call heap allocation.

put_axis_coord_data: reuse idxbuf across axes.

Cleanup:
- Remove dead 'use mpi' and 'use fckit_configuration_module'.
- Remove dead cartesian_axis parameter on writer_enqueue_1d (it was
  stored but never written to the netcdf output).
- Drop unused nprocs out-param of mpi_pelist.
- soca_io_close_all: use ncc on nf90_close instead of silently
  discarding the status.
- commit_reader_scatter: matching dead-init cleanup; comment updated
  to flag it as pending parallel-ensemble I/O exercise.
- Trim verbose / redundant comments.
Drop dangling references to the removed soca_io_read_mode_from_config /
'io.read mode' YAML key in soca_geom_mod and the letkf test input. The
selector was already gone from soca_io_mod; without these the PR fails
to compile.
- soca_io_mod: allocate gbuf2d / gbuf3d as 1x1 dummies on non-root so the
  actuals passed to mpp_gather are always allocated (assumed-shape dummy
  requires it). Pattern matches commit_reader_scatter.

- soca_io_mod: refresh stale 'Data is copied in' comment to reflect the
  pointer-based enqueue semantics (caller buffer kept alive/unmutated
  through commit; actual must satisfy TARGET association rules).

- soca_geom_mod: TARGET on the 'self' dummy in soca_geom_init,
  soca_geom_init_fieldset, and soca_geom_write so the allocatable
  components (self%lonh, self%lat, ..., self%mask2d*) are valid pointer
  targets for the soca_io reader/writer enqueue. TARGET on the local
  fieldData / fieldDataVars too.

- soca_fields_mod: TARGET on h_common and on the local vars(:) wrapper
  array so vars(n)%data sections used in enqueue are valid pointer
  targets.

- soca_balance_mod: TARGET on local kct for the same reason.
…e checks

Three related fixes in the direct-netCDF reader path:

1. read_var_strided previously hardcoded "last Fortran dim is trailing
   time" and set ct(file_ndims)=1 unconditionally. For a file with
   spatial-only layout (Temp(z,y,x) with no leading Time), this read
   only level 1 of z into a multi-level destination, leaving the rest
   uninitialized -- showing up as garbage scale values in the vertical
   diffusion calibration. Now discriminate file_ndims == dst_rank (no
   time, fill every dim from the file) vs file_ndims > dst_rank
   (trailing time + middle squeeze, e.g. CICE Tsnz_h's nksnow=1). A
   total-element-count check catches silent partial-fills and dim-size
   mismatches (e.g. file z=75 vs destination z=25).

2. Drop the read_cache / cached_open / soca_io_close_all machinery.
   Each reader_commit nf90_opens, reads all enqueued vars (with inline
   inq_varid + inquire_variable + inquire_dimension; microseconds), and
   nf90_closes. Holding NetCDF4 handles open across commits was bloating
   LETKF per-task memory by ~MB-to-GB scale (HDF5 metadata + chunk
   caches per open file). Restores per-task max to match develop.

3. Add check_buf_1d / check_buf_2d helpers called from every
   reader/writer enqueue. Catches unallocated and wrong-sized caller
   buffers at the enqueue site instead of silently storing the pointer
   for a later get_var/put_var to mishandle.
# Conflicts:
#	src/soca/Fields/soca_fields_mod.F90
shlyaeva and others added 16 commits May 26, 2026 15:41
Small, no-behavior-change cleanups before the bigger refactors:

- Q5: delete unused 5-arg adjustThicknessCategories back-compat overload.
  Update the four TestIcePhysics calls to the 9-arg form; rewrite the
  _InfeasibleTarget test since legacy semantics no longer apply.
- Q9: remove rho ice/snow/ocean from FreeboardParameters. They were
  always CICE's defaults; making them configurable would let users write
  a restart inconsistent with CICE's physics. Use icephysics::Constants
  directly. Strip the now-unused rho lines from 3 test yamls.
- Q10: rewrite seed new ice docstring to match the 2026-05-20 change
  (qice from Tfrz - Tice_seed_offset, only Tsfcn from donor).
- Q12: move applyThermoStage and CatRecord to the private section.
- Q13: replace the 12-line per-cat view loop with 6 ppiCatViews calls.
- Q16: delete snowEnthalpyToTsfc (only used by the removed Tsfcn-from-qsno
  path), the unused 'sum' helper in IcePhysics.cc, and the (void)sum
  warning suppression. Replace round-trip unit test with a known-value test.
- Q7/Q8: docstring fixes for 'min thickness gap' (numerical robustness
  knob) and 'max snow thickness' (explicitly per-ice-area, m).

All 10 postprocice/soca2cice/icephysics/ensanpproc tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The rebin pass is what makes per-cat thicknesses sit inside their bins;
running without it leaves the restart in a state CICE will quickly
rebin on its own, and is only useful as a legacy diagnostic. Production
users get correct behavior automatically; the legacy no-rebin path
stays available via 'itd.rebin: false' in yaml.

soca2cice_new.yml and soca2cice_new_bgfallback.yml now run with rebin
on (no yaml change needed; they don't set itd: explicitly). Testrefs
are empty so no refresh needed. All 10 postprocice tests pass,
including the Fortran-vs-C++ regression with its 2-cell tolerance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Six soca2cice_new* yamls collapsed to four:

- soca2cice_new (old non-rebin baseline) deleted; redundant now that
  rebin is the default. soca2cice_new_rebin renamed to take its slot
  with the itd: block dropped (all defaults match).
- soca2cice_new_thermo deleted; its scope (thermo: {update snow thermo,
  reset ponds}) is a strict subset of soca2cice_new_seed.

Kept: soca2cice_new (new production baseline, rebin on), _freeboard
(only freeboard coverage), _seed (full Stage C donor-halo seed),
_bgfallback (analysis variables resolver).

Also added soca2cice_new_freeboard.yml/_seed.yml + their .test files
to the explicit soca_test_input/soca_test_ref lists so they survive a
cmake reconfigure (previously these were only registered by
soca_add_test, which doesn't add them to the configure-time symlink
sweep).

All 8 remaining postprocice tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- 'min aice' was an input clamp on analysis aice, defaulted to 0.0
  (a no-op), and never set by any caller. Its role is fully covered
  by 'min aice output' (which clamps the *output* and is set to 1e-4
  by default to drop near-zero analysis aice cells). The input clamp
  now just enforces nonneg; remove the parameter.

- 'min vice' renamed to 'min cat ice volume' to disambiguate from
  'min new ice thickness' (per-cell new-ice fallback). Docstring
  rewritten to explain it's the per-cat volume floor that drives the
  mass-conserving sweep after Stage A/B. Default unchanged at 1e-5.

The legacy Fortran Soca2Cice path keeps its own 'min aice' / 'min vice'
keys; those are separate parameters.

All 8 postprocice tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The only consumer of ocean T/S inside postProcess was the SST-on-ice2noice
adjustment: when the analysis removes ice from a previously-iced cell,
the surface ocean was optionally warmed toward Tfrz so the post-DA SST
was consistent with no ice. The salinity field was copied in but never
read.

Both are removed:
- SstUpdateParameters block and its 'sst update:' yaml key.
- copyOceanFields helper (static); the ocean T/S copy at the top of
  postProcess; the new_tocn writable view.
- The Case-1 ICE2NOICE branch's SST warming code.

Rationale: surface-ocean adjustment on ice removal is a DA-increment
concern, not a CICE-restart concern. Folding it back into the MOM6 IAU
increment (gdasapp/utils/soca/gdas_incr_handler.h, removed in a parallel
commit) was working around the wrong layer doing the work. Future
revisions, if needed, should adjust SST upstream where the ocean
increment is generated.

readRestart's State now carries only sea-ice fields; PostProcessIce
neither reads nor writes ocean variables (ocean T/S on the analysis
is unused). All 8 postprocice tests pass, including the Fortran
regression test with its 1e-10 tolerance / 2-cell budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The generic soca_fields_write_rst no longer scans the config for a
'cice template' key. Instead:

- new Fortran subroutine soca_fields_write_cice that handles the
  config -> soca_fields_write_cice_rst dispatch;
- new C entry soca_state_write_cice_f90 / soca::State::writeCice;
- PostProcessIce::writeRestart now calls pproc.writeCice(cfg) instead
  of pproc.write(cfg) with the magic 'cice template' key.

write_rst goes back to being a pure domain-by-domain restart writer
that knows nothing about CICE update mode. Callers that want CICE
update mode go through writeCice explicitly.

All 8 postprocice tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The three call sites (soca's Postproc.h, AnalysisPostproc.h, and
gdasapp's gdas_incr_handler.h) all did the same triple:

  State restart = ppIce.readRestart(geom, analysis.validTime());
  State pproc(geom, restart);
  ppIce.postProcess(pproc, restart, analysis);
  ppIce.writeRestart(pproc, analysis.validTime());

Replaced with a single public method:

  State PostProcessIce::postprocess(const State & analysis) const;

which opens the CICE restart, applies Stage A/B/C in update mode, writes
the postprocessed restart, and returns an aggregate-ice State on
analysis's geometry carrying {sea_ice_area_fraction, sea_ice_thickness,
sea_ice_snow_thickness} matching what was written.

readRestart, writeRestart, ciceRestartVariables, and the case-dispatch
loop (now runPostprocess) are private. The aggregate-ice return value
is what the gdasapp pipeline / scripted callers can consume without
having to know the per-cat CICE schema.

Postproc.h and AnalysisPostproc.h migrated. The gdasapp side migrates in
a parallel commit. All 8 postprocice tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
soca_coding_norms (include-what-you-use linter) flagged this on line 57.
The header has been using std::make_unique since the file was introduced
but pulled <memory> transitively; the linter wants it explicit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Self-review pass on the PR-as-a-single-change:

- Remove unused includes from PostProcessIce.cc (<limits>, <numeric>,
  atlas/mesh/Mesh.h, atlas/mesh/Connectivity.h,
  atlas/mesh/actions/BuildCellCentres.h).

- Drop project-internal labels from comments and docstrings:
  - 'Stage A / Stage B / Stage C+' inside the per-cell loop → use what
    the code actually does ('Per-cell pass', 'ITD rebin', 'Snow
    distribution', 'Freeboard enforcement').
  - 'Phases A+B+C' in the donor-halo gather header → 'Sparse halo
    exchange'.
  - 'user decision 2026-05-11', '+2.6 K warm bias on the production
    grid', 'With the shuffle path removed', 'rebin shuffle', 'dd'
    shorthand, external Python script filenames ('mc6util.adjust_*',
    'insert_iconc_ithkn_cice6_restart.py').
  - Stage references in PostProcessIce.h ('Parameters for the Stage C
    thermo / pond pass', 'after Stages A/B', etc).

- Fix the log message at the end of runPostprocess: was printing
   (the background, unchanged) instead of  (the
  postprocessed state). Renamed all 3 info-log lines to use the
  consistent 'PostProcessIce:' prefix.

- Group all parameter pickups at one place rather than 'additional
  knobs picked up later in the function'; move seedK with the rest.

- Rewrite stale parameter docstrings ('Parameters for adding soca
  increment to CICE restart files' → 'Top-level parameters'; 'Stage C
  thermo' → 'Per-layer thermo / pond reset', etc.) and drop the
  'matches the Python reference scripts' / 'matches dd' qualifiers
  in favor of the actual physical rationale.

- Update Tice_seed_offset / Tsno_clip_max comments in IcePhysics.h to
  state what they are, not what they 'match' in an external script.

- Add src/soca/PostProcess/README.md documenting the workflow, per-cell
  pass, thermo pass, new-ice seeding, full yaml configuration schema
  with defaults, and the implementation notes (per-cat field naming,
  KDTree halo, IcePhysics helpers). Style matches the existing
  Soca2Cice README.

All 8 postprocice tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
B-list (questions on the comprehensive PR diff):

- B4: validate `itd.category bounds` length (ncat+1) at PostProcessIce
  construction, not only when rebin runs. Default is the CICE5 ncat=5
  layout; ncat != 5 without bounds override now fails fast regardless of
  rebin/freeboard.
- B5: drop the redundant `itd:` block from ensanpproc.yml. With rebin
  defaulted on and the only knobs in that block being the CICE5 defaults,
  the block was pure noise.
- B6: rewrite the AnalysisPostproc.h comment around seekAndReplace to
  state where %mem% actually fires (cice restart: input/output paths).
- B1: move tools/postproc_compare/ out of the soca submodule. Per the
  README it's a "study tool, not a CI test", and once Soca2Cice is
  deprecated this becomes a one-shot deprecation artifact. The
  directory lives at jedi-bundle/tools/ now (untracked, alongside the
  user's other local helper scripts).

D-list (cleanup found in the self-review):

- D1: drop the double `pproc = restart;` in runPostprocess. `postprocess`
  already did `State pproc(geom_, restart);` (the copy-constructor).
- D2: stop adding aggregate ice diagnostic atlas Fields onto
  pproc.fieldSet(). The aggregates were quiet violations of the State
  invariant (extras bypassing soca::Fields's tracked-fields list) and
  only existed to be read back by the postprocess() wrapper. Instead,
  runPostprocess now constructs the aggregate-ice result State on the
  analysis geometry and writes the aggregates directly into its
  fieldset; it returns the State. pproc keeps the per-cat schema the
  CICE-restart writer expects.
- D3: drop the empty `print(std::ostream &)` override and the
  util::Printable inheritance. PostProcessIce is never streamed
  anywhere; the inheritance was dead weight.
- D4: bg_aice atlas Field replaced with std::vector<double> for
  consistency with a_hice_vec / a_hsno_vec. Same memory cost, no
  pseudo-field on the working FieldSet.

All 8 postprocice tests + coding norms pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The aggregate diagnostics are no longer added onto pproc and read back;
runPostprocess builds the result State directly. Reword the workflow
description accordingly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PostProcessIce uses the CICE restart input both as the update-mode template
and (configurably) as the output path. When users configure
'cice restart: { input: foo.nc, output: foo.nc }' for in-place update,
the byte-copy is wasted work: the file already contains its own template.

writer_commit_update now checks for equal template/filename and skips the
copy when they match, opening the existing file directly NF90_WRITE.
Non-equal paths still go through the byte-copy as before, preserving the
safe "keep an unmodified backup" semantics for users who want them.

Path comparison is a literal string equality; cosmetic differences like
'./foo.nc' vs 'foo.nc' would still trigger the (harmless) self-copy. Good
enough for the common gdasapp case where the two yaml fields hold the
exact same string.

Verified manually: running soca_postproc.x with equal input/output paths
mutates the bg restart in place (different md5 after the run); all 9
existing postprocice tests still pass (they configure distinct paths).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shlyaeva
Copy link
Copy Markdown
Collaborator Author

PostProcessIce: production-grid validation across 7 dates

Side-by-side comparison of the new C++/atlas PostProcessIce path against the legacy Fortran Soca2Cice path and @DmitryDukhovskoy's Python prepare_cice6 scripts (dd here), on retro + realtime GFSv17 data (0.25deg grid).

Dates (one per season + multi-year coverage):
20240715, 20250115, 20250330, 20250715, 20250920, 20251221, 20260322.

For each cycle we hold the same DA analysis (from GFSv17 with artificially increased ice thickness and snow depth to test insertion of those variables) and the same CICE background fixed, and compare the resulting CICE restart written by each of the three pipelines.

Aggregate cell-level RMSE vs analysis

Mask: cells where analysis.aice > 1e-4. Numbers are RMSE averaged over the 7 dates; units are ice concentration (0-1) for aice and per-cell volume (m³/m²-cell) for v_ice / v_sno. Lower is better.

field new legacy dd background
aice 0.00331 0.00507 0.00331 0.0426
v_ice 0.0885 0.258 3.91* 0.360
v_sno 4.4e-5 0.0741 0.00209 0.0887

* dd's high v_ice RMSE is a known unit-convention mismatch; should not be considered a regression in this PR.

Per-category ICE2ICE perturbation vs background

Mask: per-cat slots where both bg.aicen > 1e-4 and new.aicen > 1e-4. This measures how much the postprocessing perturbs the background per-cat shape on cells that have ice in both bg and analysis - lower is better (less perturbation = more CICE-stable restart). Averaged over the 7 dates.

field new legacy dd
aicen 0.0184 0.190 0.0185
vicen 0.094 0.248 0.113
Tsfcn (°C) 0.32 8.44 1.85
sice001 0 2.90 0.74
  • aicen: new ties dd, both beat legacy by ~10×.
  • vicen: new is the lowest perturbation.
  • Tsfcn (per-cat surface temperature): new's perturbation is bounded by the −1 °C clamp (0.13 - 0.64 °C range across dates); legacy runs hot at 4.5 - 11.5 °C; dd is in between at 1.2 - 2.6 °C.
  • sice001: new preserves bg ice salinity exactly (per-cat alpha-rescale doesn't touch it); both legacy and dd perturb it by O(1) ppt.

Plots

Three production-grid comparisons for 20240715 (arctic summer / antarctic winter). Each figure has 3 rows (new / legacy / dd) and 4-5 columns (background, analysis, updated, one or two updated − analysis diff
panels at different colour ranges):

  • Ice thickness in the Arctic:
hice_arctic

Clearest contrast: new's diff panel is white almost everywhere (close-to-analysis ice thickness); legacy has a large blue interior blob (under-thick ice across the entire pack); dd has a dense red rim that should be ignored (not a regression).

  • Snow thickness in the Arctic:
hsno_arctic

new and dd diffs essentially zero (snow tracks analysis exactly); legacy shows a large interior blue patch (snow under-prediction).

  • Ice thickness in the Antarctic:
hice_antarctic

Same pattern in the SH: new clean diff, legacy interior bias, dd rim (ignore).

All 7 dates exhibit the same qualitative behaviour - the new path is consistently close to the analysis on ice thickness and snow (significantly better than legacy).

Bottom line

No degradation compared to Dmitry's scripts, significant improvement for ice thickness and snow depth compared to the legacy soca2cice.

@shlyaeva
Copy link
Copy Markdown
Collaborator Author

I'm going to run a few cycles with the ensembles using this postprocessing and report back here. In the meantime, this is mostly ready for review, I don't anticipate major changes. If it looks painful to review, just know that I've been working on this on and off for half a year, and I'm with you 😆 .

@shlyaeva shlyaeva requested a review from DmitryDukhovskoy May 27, 2026 21:09
shlyaeva and others added 6 commits May 27, 2026 18:52
CICE6's linear_itd aborts when per-cat thickness `hi = vicen/aicen` falls
outside `[hicat[k], hicat[k+1]]`. Diagnostic scan of the existing
production-grid restarts (datatoanalyze/, 7 dates) found ~5% of ice cells
in every restart had at least one populated cat with out-of-bin hi --
~5,000-16,000 cells per date. CICE6 was crashing on these.

Two bugs at play:

1. **Default `category bounds` was CICE5 layout** with bounded top
   (`{0, 0.6445, 1.391, 2.470, 4.567, 9.334}`) whereas CICE6 GFSv17
   uses `{0, 0.64, 1.39, 2.47, 4.57, 1000}` (open top). Silent
   mismatch produced restarts CICE refused to read. Make
   `category bounds` a RequiredParameter -- users must spell out the
   layout matching their CICE version.

2. **The standalone min-vice cleanup block** in PostProcessIce.cc
   (now removed) redistributed mass from dropped cats to surviving
   cats proportional to surviving aicen, without respect to bins.
   This shifted survivors' per-cat hi outside their bins, producing
   the CICE6 abort cases. The dd Python reference (mod_cice6_utils.py
   adjust_thkncats_aice) doesn't have a separate min-vice cleanup --
   its scipy.optimize.minimize call uses the bin edges as variable
   bounds, so survivors are guaranteed in-bin. We rely on the rebin's
   existing `ainMin = 1e-8` clip + Step 7 renormalisation (which
   preserves per-cat hi), and add a final per-cat bin clip as
   defence in depth:

       const double hClipped = std::min(std::max(h, hLo), hHi);
       if (hClipped != h) vicen[k] = aicen[k] * hClipped;

   Aicen is left alone (Σaicen preserved); a small amount of mass is
   lost to the clip, much smaller than the test failures the bug
   produced.

Drops the now-unused `min cat ice volume` yaml parameter.

Test fallout: the 4 soca2cice_new* testrefs shift slightly (min-vice
cleanup is gone). The freeboard test shifts hardest (hice 0.4153 ->
0.4044, because the cleanup was undoing some of freeboard's
ice-volume growth). Test yamls updated to set `category bounds`
explicitly. 9/9 postprocice tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Log::info() writes only from task 0 by oops convention, so all the
counter / max-magnitude diagnostics in runPostprocess were silently
reporting only rank 0's piece of the grid -- misleading on multi-task
runs.

allReduceInPlace each counter (sum) and each magnitude (max) on
geom_.getComm() before the log lines, so what gets printed is the
global total / global max. Covers:

  - rebin_visited, rebin_aicen_mutations  (sum)
  - rebin_max_delta_aicen_all              (max)
  - rebin_failures, freeboard_failures     (sum)
  - bin_clip_slots                         (sum)
  - bin_clip_max_abs_dh                    (max)
  - seedNewIce fallbacks                   (sum, separately because
    seedNewIce runs after the main loop)

Also adds the bin-clip diagnostic itself (was uncounted): per-cat slots
clipped + max |dh| in metres. Only printed when at least one slot was
clipped.

9/9 postprocice tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Log::warning fans out to every MPI task, so the per-rank warnings were
printing N identical copies of the global-summed counts. Switch the
ITD-rebin-failure, freeboard-failure and noice->ice Tfrz-fallback
messages to Log::info (rank-0-only). These are diagnostic counters
about non-fatal events the code already handles, so info is the right
level.

Mention the bin clip in the rebin-failure message so the log explains
how out-of-bin cells get repaired.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Replace the 'min-vice cleanup' step in the per-cell pass with the
  new 'per-cat bin clip' step (the cleanup is gone; the clip is the
  guarantee that the restart is CICE-readable).
- Drop the 'min cat ice volume' configuration entry (parameter no
  longer exists).
- Mark 'itd: category bounds' as required in both the schema block's
  inline comment and the closing 'required fields' bullet. Switch the
  example value to the CICE6 GFSv17 layout
  ([0, 0.64, 1.39, 2.47, 4.57, 1000]) and document the CICE5
  alternative; explain that the previous silent default was a footgun.
- Tidy two surrounding references to 'min-vice cleanup' that pointed
  at the dropped step.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants