Gerd Heber edited this page May 7, 2026 · 38 revisions

Meeting Notes of 2026

💻 Zoom link: https://us06web.zoom.us/j/89601195963

📆 Meeting calendar invite. (First Thursday of each month, 10:00 a.m. central time)

Note

🎥 Please note that by joining and participating in these Working Group meetings, you acknowledge that your name will be visible to other attendees in the Zoom session, and this participation will be considered a public record. Furthermore, your verbal or written contributions may be included in the publicly accessible meeting notes and summary.

Please provide time estimates for each agenda item.

**Agenda items must be added at least 48 hours before the meeting, and they should be deleted or moved to the next meeting no later than 10 hours before the scheduled start time.**


2026-05-07

  • Facilitator/time-keeper: Gerd Heber
  • Note-taker/Editor: AI/Lori

Agenda

Minutes

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •

2026-04-02 - ❌ Cancelled ❌

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •

2026-03-19

  • Facilitator/time-keeper: Gerd Heber
  • Note-taker/Editor: AI/Lori

Agenda

  • Debugging infinite loops / H5close (Elena, 2 min)
  • Review threadsafe locking protocol (Quincey, 10 min)
  • Discuss ideas for reviewing threadsafe branch(es) (Quincey, 20 min)
  • Strategy, phases, boundaries, go/no-go criteria (Gerd, 20 min)

Minutes

Review threadsafe locking protocol

Discuss ideas for reviewing threadsafe branch(es)

Debugging suggestions

Strategy, phases, boundaries, go/no-go criteria

None of this is new, but before we get lost in technical details, can we restate a few things I believe (mistakenly?) we agree on?

Three different (but related) goals

  • Multiple application threads can call HDF5 safely
  • Multiple calls can make progress simultaneously inside HDF5
  • One HDF5 call can fan out into internal worker threads

I think we agree we want all three eventually, but we need to agree on how we get there.

Strategy

Is there agreement on these elements?

  1. Architectural first move: make the VOL boundary and its support packages concurrency-ready enough to move the serialization boundary down.
  2. Immediate implementation move: use the native VOL as the first consumer of that work and the first place to ship narrow wins.
  3. Do not try to finish either side in isolation.
  4. Treat internal native-VOL fan-out as a tactical accelerator, not the base model.

Phases

What's wrong with the phases outlined below? A lot, probably, but we need something like this before proceeding.

Committing resources would be madness otherwise.

Phase 0: Contract, instrumentation, and build policy

  • H5TS only, API entry/exit macro path, build/test infrastructure, and docs.
  • Lock-rank annotations, contention telemetry, and a single component taxonomy: safe, serialized, unsupported
    • Add rank discipline to the existing H5TS substrate
    • Do not attempt a fine-grained locking redesign (yet)
  • Exit criteria: a published concurrency matrix (see here), a mandatory fallback story for every unsupported stack, sanitizer-backed stress tests, and lifecycle/shutdown rules that are tighter than the current “join all HDF5-using threads before H5close” warning
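
To make "rank discipline" concrete, here is a minimal sketch of a rank-checked mutex wrapper. It is not the H5TS API: the rank names, `ranked_lock_t`, and the three functions are hypothetical, and the rule assumed here is that a thread may only acquire locks in strictly increasing rank and must release them in LIFO order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical ranks; a thread may only take locks in increasing rank. */
typedef enum { RANK_API = 1, RANK_ID = 2, RANK_FILE = 3 } lock_rank_t;

typedef struct {
    pthread_mutex_t mtx;
    lock_rank_t     rank;
} ranked_lock_t;

/* Per-thread record of the ranks currently held, innermost last. */
#define MAX_HELD 8
static _Thread_local lock_rank_t held[MAX_HELD];
static _Thread_local int         nheld = 0;

static void ranked_init(ranked_lock_t *l, lock_rank_t rank)
{
    pthread_mutex_init(&l->mtx, NULL);
    l->rank = rank;
}

/* Returns false, without locking, if the acquisition would break rank order. */
static bool ranked_lock(ranked_lock_t *l)
{
    if (nheld == MAX_HELD)
        return false;
    if (nheld > 0 && held[nheld - 1] >= l->rank)
        return false;
    pthread_mutex_lock(&l->mtx);
    held[nheld++] = l->rank;
    return true;
}

static void ranked_unlock(ranked_lock_t *l)
{
    /* This sketch insists on LIFO release of the innermost lock. */
    assert(nheld > 0 && held[nheld - 1] == l->rank);
    nheld--;
    pthread_mutex_unlock(&l->mtx);
}
```

A debug build could turn the `false` return into an abort with a report of the offending lock pair; the same per-thread bookkeeping is also a natural place to hang contention counters.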

Phase 1: Thread-local service packages

  • H5E and H5CX
  • The lock boundary does not move yet; the goal is to make these packages concurrency-clean under the existing top lock.
  • Exit criteria: concurrent error-stack tests, nested callback re-entry tests, white-box H5CX push/pop tests, and zero races with no dependence on lower native/VOL code.
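
As a rough illustration of what "thread-local service package" could mean for the error stack, the sketch below keeps one stack per thread using C11 `_Thread_local`. The type and function names are hypothetical and are not the H5E interface; a fixed-depth stack of string pointers is assumed for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_STACK_MAX 16

typedef struct {
    const char *msgs[ERR_STACK_MAX];
    size_t      depth;
} err_stack_t;

/* One instance per thread: no lock is needed to push or inspect it. */
static _Thread_local err_stack_t tls_err_stack;

/* Push a message onto the calling thread's stack; -1 if the stack is full. */
static int err_push(const char *msg)
{
    assert(msg != NULL);
    if (tls_err_stack.depth == ERR_STACK_MAX)
        return -1;
    tls_err_stack.msgs[tls_err_stack.depth++] = msg;
    return 0;
}

/* Depth of the calling thread's stack only. */
static size_t err_depth(void)
{
    return tls_err_stack.depth;
}

/* Most recently pushed message, or NULL if the stack is empty. */
static const char *err_top(void)
{
    return tls_err_stack.depth ? tls_err_stack.msgs[tls_err_stack.depth - 1] : NULL;
}

static void err_clear(void)
{
    tls_err_stack.depth = 0;
}
```

Because every function touches only the caller's own instance, a concurrent error-stack test reduces to checking that pushes on one thread never change `err_depth()` on another.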

Phase 2: Shared identity and configuration semantics

  • H5I and H5P
  • Identifier lifetime/refcount rules
  • Property-list sharing rules, including what “read-only sharing” versus “concurrent mutation” means
  • Resolve how H5CX handles non-default DXPL/LAPL objects: it uses them directly, does not copy them, and can pass those references down to callbacks
  • Exit criteria: documented object-sharing semantics, stress-tested create/copy/close/lookup races, and no refcount/ABA failures under sanitizers and fault injection
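
One possible shape for an identifier lifetime rule is "a reference can only be taken while the count is nonzero". The sketch below shows that rule with C11 atomics. It is an assumption for discussion, not how H5I works today, and `id_entry_t` and its functions are hypothetical names. It does not by itself address ABA on reused identifier slots.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint refs; /* 0 means the object is dead and must not be revived */
} id_entry_t;

/* A new entry starts with one reference owned by its creator. */
static void id_init(id_entry_t *e)
{
    atomic_init(&e->refs, 1);
}

/* Take a reference only if the object is still alive. */
static bool id_try_ref(id_entry_t *e)
{
    unsigned cur = atomic_load(&e->refs);
    while (cur != 0) {
        if (atomic_compare_exchange_weak(&e->refs, &cur, cur + 1))
            return true;
        /* A failed exchange reloads cur; the loop re-checks it for zero. */
    }
    return false;
}

/* Drop a reference; returns true if the caller released the last one
 * and is therefore responsible for freeing the object. */
static bool id_unref(id_entry_t *e)
{
    unsigned prev = atomic_fetch_sub(&e->refs, 1);
    assert(prev > 0);
    return prev == 1;
}
```

The compare-exchange loop is what prevents a lookup racing with a close from resurrecting an object whose last reference has just been dropped.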

Phase 3: VOL control plane

  • H5VL
  • Lock boundary moves from “above VOL” to “around unsafe connector stacks”
  • Make VOL dispatch safe, but admit connector stacks only by capability: safe stacks run concurrently, unsafe stacks get a serialization wrapper, and unsupported stacks are rejected
  • Use coarse-grained capability flags, not per-API flags
  • Exit criteria: native VOL plus at least one pass-through stack working, deterministic fallback for unsafe stacks, and no core locks held across user callbacks into connectors
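
The admission rule (safe runs concurrently, unsafe gets a serialization wrapper, unsupported is rejected) can be sketched as a single dispatch function. Everything here is hypothetical: the `stack_cap_t` flag, the `op_fn` callback type, and the single global wrapper lock are assumptions chosen to keep the example short, not part of H5VL.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Coarse-grained capability of a whole connector stack (hypothetical). */
typedef enum { STACK_SAFE, STACK_SERIALIZED, STACK_UNSUPPORTED } stack_cap_t;

typedef int (*op_fn)(void *arg);

/* Wrapper lock used only for stacks that are not thread-safe themselves. */
static pthread_mutex_t serial_wrapper = PTHREAD_MUTEX_INITIALIZER;

static int dispatch(stack_cap_t cap, op_fn op, void *arg)
{
    assert(op != NULL);
    switch (cap) {
    case STACK_SAFE:
        /* No library lock is held while the connector runs. */
        return op(arg);
    case STACK_SERIALIZED: {
        /* Deterministic fallback: one call at a time into this stack. */
        pthread_mutex_lock(&serial_wrapper);
        int rc = op(arg);
        pthread_mutex_unlock(&serial_wrapper);
        return rc;
    }
    default:
        /* Unsupported stacks are rejected before any connector code runs. */
        return -1;
    }
}

/* Example operation used to exercise the dispatcher. */
static int bump(void *arg)
{
    ++*(int *)arg;
    return 0;
}
```

A real design would more likely keep one wrapper lock per connector stack rather than a global one, so that two unrelated unsafe stacks do not serialize each other.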

Phase 4: H5FD and a minimal VFD set

  • Start with sec2 and core; everything else begins life as serialized or unsupported
  • Lock boundary should become per-file/VFD-instance for open/close/flush state, with raw positioned I/O happening outside file-metadata locks.
  • Exit criteria: same-file and different-file read/read correctness on the supported VFDs, plus a clean safe/serialized/unsupported story for plugin-loaded VFDs
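
The split between per-instance state and raw positioned I/O might look like the following POSIX sketch. The `vfd_file_t` struct and both functions are hypothetical and not the H5FD interface; the example assumes the only shared state is an end-of-allocation value, and it treats any short or failed `pread` (including `EINTR`) as an error for simplicity.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    int             fd;
    off_t           eoa;        /* end of allocation; guarded by state_lock */
    pthread_mutex_t state_lock; /* per-instance lock for open/close/flush state */
} vfd_file_t;

static int vfd_open(vfd_file_t *f, const char *path, off_t eoa)
{
    f->fd = open(path, O_RDONLY);
    if (f->fd < 0)
        return -1;
    f->eoa = eoa;
    return pthread_mutex_init(&f->state_lock, NULL) == 0 ? 0 : -1;
}

/* Read len bytes at offset off; returns len on success, -1 otherwise. */
static ssize_t vfd_read(vfd_file_t *f, void *buf, size_t len, off_t off)
{
    assert(buf != NULL);

    /* Snapshot shared state under the lock, then drop it before any I/O. */
    pthread_mutex_lock(&f->state_lock);
    off_t eoa = f->eoa;
    pthread_mutex_unlock(&f->state_lock);

    if (off < 0 || off > eoa || (off_t)len > eoa - off)
        return -1;

    /* pread carries its own offset, so there is no shared file position. */
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(f->fd, (char *)buf + done, len - done, off + (off_t)done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Two threads calling `vfd_read` on the same instance contend only for the brief snapshot, which is the property the same-file read/read exit criterion is probing.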

Phase 5: Native VOL vertical slice #1: read/read concurrency

  • Scope is narrow around the native open/read/close path for H5F/H5O/H5G/H5A/H5D, with only the H5S/H5T support needed for simple hyperslabs and fixed-size/no-conversion reads.
  • The lock boundary should be: metadata/object traversal and chunk discovery under a file/metadata lock, then raw I/O, filter application, and memory copies outside that lock, operating only on pre-materialized work units and private buffers.
  • Keep the metadata cache serialized and bypass the chunk cache/page buffer on this path initially.
  • Exit criteria: supported different-file and same-file read/read concurrency with automatic fallback to the serial path for anything outside that envelope
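
The proposed lock boundary amounts to a two-step read: build a plan under the metadata lock, then execute it with no lock held. The sketch below models that with an in-memory buffer standing in for the file. All names, the fixed chunk size, and the flat chunk-address table are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NCHUNKS     4
#define CHUNK_BYTES 4

/* A pre-materialized work unit: everything the I/O step needs, by value. */
typedef struct {
    size_t file_off;
    size_t mem_off;
    size_t len;
} work_unit_t;

typedef struct {
    pthread_mutex_t      meta_lock;
    size_t               chunk_addr[NCHUNKS]; /* chunk index; guarded by meta_lock */
    const unsigned char *raw;                 /* stand-in for the file's raw data */
} dataset_t;

/* Step 1: chunk discovery under the metadata lock. Returns the unit count. */
static size_t plan_read(dataset_t *d, size_t first, size_t count, work_unit_t *plan)
{
    assert(plan != NULL);
    size_t n = 0;
    pthread_mutex_lock(&d->meta_lock);
    for (size_t i = first; i < first + count && i < NCHUNKS; i++, n++)
        plan[n] = (work_unit_t){ d->chunk_addr[i], n * CHUNK_BYTES, CHUNK_BYTES };
    pthread_mutex_unlock(&d->meta_lock);
    return n;
}

/* Step 2: raw copies into a private buffer, with no lock held. */
static void execute_read(const dataset_t *d, const work_unit_t *plan, size_t n,
                         unsigned char *out)
{
    for (size_t i = 0; i < n; i++)
        memcpy(out + plan[i].mem_off, d->raw + plan[i].file_off, plan[i].len);
}
```

Filter application would slot into the second step, operating on the private buffer. Anything the planner cannot express as plain work units is a candidate for the serial fallback.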

Phase 6: Broaden the read surface

  • Expand H5S and H5T: point/irregular selections, datatype conversion/xform, and only then selected reference/vlen cases.
  • Selection-iteration and conversion state must be per-operation rather than shared global state, and user conversion/filter code must never run under file/cache locks.
  • Exit criteria: point-selection and conversion stress tests, an explicit allow-list for thread-safe filters/plugins, and no unacceptable single-thread regressions; everything not on the allow-list still falls back to the serial path
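
The allow-list rule could be as simple as the check below: a read takes the concurrent path only if every filter in its pipeline is listed, and otherwise falls back to the serial path. The filter ids 101 and 102 are placeholders, not real registered filter ids, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { PATH_CONCURRENT, PATH_SERIAL } read_path_t;

/* Placeholder ids for filters that have been vetted as thread-safe. */
static const int thread_safe_filters[] = { 101, 102 };

static bool filter_allowed(int id)
{
    size_t n = sizeof thread_safe_filters / sizeof thread_safe_filters[0];
    for (size_t i = 0; i < n; i++)
        if (thread_safe_filters[i] == id)
            return true;
    return false;
}

/* One unlisted filter anywhere in the pipeline forces the serial path. */
static read_path_t choose_read_path(const int *pipeline, size_t nfilters)
{
    assert(pipeline != NULL || nfilters == 0);
    for (size_t i = 0; i < nfilters; i++)
        if (!filter_allowed(pipeline[i]))
            return PATH_SERIAL;
    return PATH_CONCURRENT;
}
```

Defaulting to the serial path keeps unknown plugins correct without requiring their authors to do anything.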

Phase 7: Caches and metadata rearchitecture

  • Requires significant re-architecture(?)
  • Split the monolithic file lock into lock domains—file handle/namespace, cache manager, cache entry, raw I/O
  • Forbid callbacks or cross-cache re-entry while cache-entry locks are held; if an operation can re-enter, make it restartable
  • Exit criteria: a stable documented lock order, deadlock-free eviction/flush/fault-injection runs, sanitizer-clean overnight stress, and clear evidence that the cache-enabled concurrent path is better than the earlier bypass path

Phase 8: Write concurrency

  • Begin with different files, then different datasets in one file when metadata side effects are isolated.
  • Keep same-dataset writes serialized until there is a compelling design for disjoint-chunk writes that does not turn the metadata path into a maintenance nightmare.
  • Exit criteria: explicit visibility rules for readers, crash/fuzz coverage for concurrent mutation, and shutdown/cancellation semantics that are stronger than today’s documented requirement that applications join all HDF5-using threads before H5close or exit

Phase 9: Optional internal fan-out, wrappers, and MPI

  • TBD
  • Exit criteria: no hidden oversubscription disasters, wrapper layers that pass the same stress suite as the C core, and a decision to treat MPI+threads as its own program rather than a side effect of serial-library concurrency work

Release milestones

  • First public concurrency release after Phase 5
  • First serious performance release after Phase 7
  • First polished mixed model after Phase 9

This sequence is maintainable because it separates three risks instead of compounding them: boundary semantics first, narrow end-to-end value second, and the hard cache/mutation rearchitecture third.

Next steps

  • Correct/improve the plan (phases, matrices, etc.)
  • Turn Phases 0–5(?) into a milestone sheet that includes package owners, lock-rank definitions, and exact fallback behavior for each connector/VFD/filter class.

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •

2026-03-05 - ❌ Cancelled ❌

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •

2026-02-19 - ❌ Cancelled ❌

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •

2026-02-05 - ❌ Cancelled ❌

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •

2026-01-22 - ❌ Cancelled ❌

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •

2026-01-08 - ❌ Cancelled ❌

  • Facilitator/time-keeper: Scot Breitenfeld
  • Note-taker/Editor: AI/Scot Breitenfeld

Agenda

  • Add items here

Old Action Items

Minutes

Quick recap

Summary

Action Items

None

• —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– • ·· • —– ٠ ✤ ٠ —–· · • —– ٠ ✤ ٠ —– •
