
Memory optimization PYNXTOOLS validate#752

Open
mkuehbach wants to merge 33 commits into master from mem_optimization_validation

Conversation

@mkuehbach

@mkuehbach mkuehbach commented Mar 19, 2026

Planned for v0.14.0

The other aspect of #737:

For HDF5-file-based validation we wish to reduce the number of cases in which large h5py.Dataset payloads need to be loaded into main memory, especially when a chunked storage layout would allow iterating over chunks.

The motivation was running the standalone HDF5 validator on a 32 GiB file from CENEM, which contains essentially only one dataset (a large image stack with approx. 32 GiB of uncompressed payload), on a 128 GiB RAM workstation. Although the image stack is stored chunked, all data were eventually loaded, generating multiple copies in main memory along the way, which caused the validation process to terminate after trying to allocate more than all the memory available on the workstation.

  • Adds a VS Code launch configuration to debug the validation
  • Applies two key optimizations: i) chunk-by-chunk reading while validating datasets and enums, and ii) a reorganization of the execution flow while validating enums, removing the unnecessary and inefficient loading and unpacking of entire datasets into main memory.
  • Drafts a further optimization for streamed reading in the positivity test for cases where reading a dataset in full would otherwise be required, such as a contiguous storage layout. Since contiguous storage can still take advantage of hyperslab-based reading, reads can be streamed in approximately I/O-sized blocks. A simple heuristic sets the block_size; it could profit from further tuning for specific hardware and I/O patterns.
  • Evaluates cases where the assumption breaks that there is always either exactly one concept with the highest score or none. A score_board is now used, the old implementation is carried along, and the user is warned if best_namefit fits multiple concepts with the same highest score. Inspecting the matter revealed an imprecision in the constraining of the NXem_* base classes, which was fixed in a separate feature branch of nexus_definitions, pulled in here for testing purposes and to show the full picture. Be aware that, left unfixed, this was observed to lead to different execution paths for the same file and code (inspected via Hunt nondeterminism #768).
  • Tightens Iterator to Sequence in validate
  • Fixes tests
  • Adds new tests evaluating NX_POSINT and other elementary NeXus datatypes for contiguous, chunked uncompressed, and chunked compressed storage
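The chunk-by-chunk reading and the hyperslab fallback for contiguous layouts can be sketched roughly as follows. This is a minimal sketch, not the pynxtools implementation; the function name and the 64 MiB block heuristic are made up for illustration:

```python
import h5py
import numpy as np

def is_positive_chunkwise(dset: h5py.Dataset) -> bool:
    """Check an NX_POSINT-style positivity constraint without loading the
    full payload into memory at once."""
    if dset.shape == ():  # scalar dataset: nothing to stream
        return bool(dset[()] > 0)
    if dset.chunks is not None:
        # Chunked layout: h5py yields one tuple of slices per stored chunk,
        # so peak memory stays at roughly one chunk.
        for chunk_slices in dset.iter_chunks():
            if np.any(dset[chunk_slices] <= 0):
                return False
        return True
    # Contiguous layout: stream fixed-size hyperslabs along axis 0.
    # The 64 MiB target block size is an arbitrary heuristic, not a tuned value.
    row_bytes = dset.dtype.itemsize * int(np.prod(dset.shape[1:]))
    block = max(1, (64 * 1024 * 1024) // max(1, row_bytes))
    for start in range(0, dset.shape[0], block):
        if np.any(dset[start : start + block] <= 0):
            return False
    return True
```

Either path keeps the memory baseline flat instead of materializing the whole dataset as one np.array.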

Key benefits:

  • Validation is substantially faster, as unnecessary I/O time and the time for prepping payloads as np.array are removed
  • Validation has a substantially smaller and smoother memory footprint for arbitrarily large HDF5 files, provided a chunked storage layout is used.

@mkuehbach mkuehbach changed the title mem_optimization_validation Memory optimization PYNXTOOLS validate Mar 20, 2026
This was referenced Mar 30, 2026
…ersion of is_valid_enum which defers the loading of data to occur only when required; most h5py.Datasets do not qualify as an enum. This change could in a future version also be added to the template-based validation, though the performance improvement there is not expected to be as large: in template-based validation prior to NeXus file writing, all data are already in main memory. For HDF5, however, they first need to be loaded from disk; so far clean_str_attr(dataset[()]) did that ALWAYS, for all datasets irrespective of their size, just to discard most of these data two lines later once the code figures out the dataset is not an instance of an enum concept
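The reordering this commit describes can be illustrated as follows. This is a hypothetical sketch; `is_valid_enum_deferred` and `concept.enum_values` are invented names standing in for the actual pynxtools internals:

```python
def is_valid_enum_deferred(dataset, concept) -> bool:
    """Decide first whether the concept is enumerated at all, and only
    then touch the dataset payload."""
    enum_values = getattr(concept, "enum_values", None)
    if not enum_values:
        # Most datasets are not enums: return without any disk I/O,
        # instead of loading dataset[()] only to discard it.
        return True
    value = dataset[()]  # load only when the concept really is an enum
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value in enum_values
```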
…initial round of tests with float32; validation is now blazing fast for all three cases (contiguous, chunked uncompressed, chunked compressed). The substantial speedup is not a signature of failure or of bypassing certain function calls but expected: in the contiguous float32 test case, the concept entry1/measurement/event1/image1/real is float32 despite its 10 GB of uncompressed payload. Keep in mind that h5py and HDF5 always store datasets of elementary datatypes as type-homogeneous arrays, so testing whether the field is valid reduces to a simple check of the dtype attribute against the accepted ones; there is no need to load the data at all, unless the datatype is NX_POSINT. A similar speedup is expected compared to is_valid_enum in v0.13.2: in that version the data were always fully loaded first, just to be, in most cases, immediately discarded inside the function call because the node did not encode an enum-type concept. The feature branch and commit tested here check FIRST whether the node is an enum, so the loading can be skipped completely, which essentially yields the speedup. We should test again with a concept of type NX_POSINT and our three storage schemes (contiguous, chunked compressed, chunked uncompressed); there we would expect a costly load of all data when contiguous and a chunk-by-chunk integer positivity test during the is_valid_field HDF evaluation, but the memory baseline should remain flat. Next up: i) implement this test with NX_POSINT and run again, ii) fix pytest from pynxtools, iii) move all testing code to proper tests and mark this feature branch as ready for review, i.e. move from draft to ready
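The dtype-only fast path mentioned here can be sketched like this. The `NX_TYPE_KINDS` mapping and `dtype_matches` are illustrative assumptions, not the pynxtools API:

```python
import h5py
import numpy as np

# Simplified mapping from NeXus number types to accepted numpy dtype kinds.
NX_TYPE_KINDS = {
    "NX_FLOAT": ("f",),
    "NX_INT": ("i",),
    "NX_UINT": ("u",),
    "NX_NUMBER": ("f", "i", "u"),
    "NX_POSINT": ("i", "u"),  # dtype alone is not enough; values must also be > 0
}

def dtype_matches(dset: h5py.Dataset, nx_type: str) -> bool:
    """HDF5 datasets are type-homogeneous, so for most NeXus number types
    validity reduces to inspecting dset.dtype with no payload I/O at all.
    Only NX_POSINT additionally needs a (streamed) value check."""
    return dset.dtype.kind in NX_TYPE_KINDS.get(nx_type, ())
```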
…sed in the code to work on datasets and non-string payloads, in which cases the type annotations were imprecise
@mkuehbach mkuehbach marked this pull request as ready for review April 1, 2026 19:00
@mkuehbach mkuehbach requested review from sanbrock and removed request for sanbrock April 1, 2026 19:20
Comment thread .vscode/launch.json Outdated
@mkuehbach mkuehbach requested review from lukaspie and rettigl April 4, 2026 01:20

@rettigl rettigl left a comment

I appreciate the effort, but for me this PR includes too many changes to properly review. The connection of many of these changes to the issue at hand (large memory consumption) is not clear to me. I suggest splitting it up into several PRs.
Also, it changes the behavior of the validator by removing the acceptance of ints as floats, which we explicitly decided to allow at some point, as far as I remember.

Comment thread src/pynxtools/dataconverter/helpers.py
Comment thread src/pynxtools/dataconverter/helpers.py Outdated
Comment thread src/pynxtools/dataconverter/helpers.py
Comment thread src/pynxtools/dataconverter/helpers.py
Comment thread src/pynxtools/dataconverter/validate_file.py Outdated
Comment thread tests/dataconverter/test_validation.py
Comment thread tests/dataconverter/test_validation.py
Comment thread tests/dataconverter/test_validation_opt.py Outdated
Comment thread tests/dataconverter/test_validation_opt.py Outdated
Comment thread tests/dataconverter/test_writer.py Outdated
@mkuehbach

mkuehbach commented Apr 7, 2026

I appreciate the effort, but for me this PR includes too many changes to properly review. The connection of many of these changes to the issue at hand (large memory consumption) is not clear to me. I suggest splitting it up into several PRs. Also, it changes the behavior of the validator by removing the acceptance of ints as floats, which we explicitly decided to allow at some point, as far as I remember.

Thank you @rettigl

Wrt the last point: there is a good reason why NeXus distinguishes NX_NUMBER, NX_(U,POS)INT, and NX_FLOAT. I do not agree that this past decision was a good one. IMHO a validator should check and report where there are inconsistencies; it should not, for convenience purposes, convert types. That should be the responsibility of the pynxtools readers, which should feed pynxtools correctly typed data straightaway. I can see that people may be fine with just using the default precision of data types, like Python's built-in float and int, but I disagree with cross-accepting what are different elementary data types, not least because our current implementation would then also need to check whether every int value is really exactly mappable onto a float of a given precision.
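A concrete instance of the precision concern, in plain Python (where int/float comparison is exact): IEEE-754 float64 has a 53-bit significand, so not every int64 value survives a round trip through float.

```python
v_exact = 2**53      # exactly representable as float64
v_lossy = 2**53 + 1  # rounds to 2**53 when cast to float64

assert int(float(v_exact)) == v_exact
assert int(float(v_lossy)) != v_lossy   # silent precision loss on the round trip
assert int(float(v_lossy)) == v_exact
```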


@lukaspie lukaspie left a comment

As it stands, I would not merge this PR.

  • I agree that we need separate functions for validity checks (HDF5 vs. template), since we don't always want to load full HDF5 datasets.
  • Probably, we should split up the HDF5 and template validation in separate modules eventually. helpers.py is already rather overloaded. But let's do this in a separate PR.
  • Regarding namefitting, I left a longer comment. I don't agree with the idea that sorting by optionality is really better than what we had before.
  • Changes to the definitions do not belong here.
  • vscode settings are too specific: as long as the NeXus files that are mentioned there are not part of the repo, we cannot use the launch.json. I am not suggesting to add these files, rather the settings should be kept simpler.
  • Regarding "validation, not conversion" -> on a technical design level, I would agree with this. However, my rationale for allowing simple conversion (or other workarounds, like adding the target attribute automatically) was always to allow for easier reader development. Maybe one compromise could be that we allow such conversions in the template validation, whereas for the HDF5 validation (where non-pynxtools-generated files can also be used), we are more strict. Curious about your opinion @rettigl.

Comment thread src/pynxtools/definitions
Comment thread src/pynxtools/dataconverter/helpers.py Outdated
Comment thread src/pynxtools/dataconverter/validation.py
Comment thread src/pynxtools/dataconverter/validation.py
Comment thread src/pynxtools/dataconverter/validation.py
Comment thread tests/dataconverter/test_validation.py
Comment thread tests/dataconverter/test_validation_opt.py Outdated
Comment thread .pre-commit-config.yaml
Comment thread .vscode/launch.json Outdated
Comment thread src/pynxtools/dataconverter/helpers.py
@mkuehbach

As it stands, I would not merge this PR.

* I agree that we need separate functions for validity checks (HDF5 vs. template), since we don't always want to load full HDF5 datasets. ([x] DONE)

* Probably, we should split up the HDF5 and template validation in separate modules eventually. `helpers.py` is already rather overloaded. But let's do this in a separate PR. ([x] FUTURE PR)

* Regarding namefitting, I left a longer comment. I don't agree with the idea that sorting by optionality is really better than what we had before. ([x] COMPROMISE, original solution recovered, monitoring of multiple solutions via logger.debug)

* Changes to the definitions do not belong here. ([x] REMOVED)

* vscode settings are too specific: as long as the NeXus files that are mentioned there are not part of the repo, we cannot use the `launch.json`. I am not suggesting to add these files, rather the settings should be kept simpler. ([x] REMOVED)

* Regarding "validation, not conversion" -> on a technical design level, I would agree with this. However, my rationale for allowing simple conversion (or other workarounds, like adding the target attribute automatically) was always to allow for easier reader development. Maybe one compromise could be that we allow such conversions in the template validation, whereas for the HDF5 validation (where non-pynxtools-generated files can also be used), we are more strict. Curious about your opinion @rettigl. ([x] FOLLOWING exactly the compromise stated here; added technical arguments why I think this is the best approach)
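The namefitting compromise (original single-winner behavior plus logger.debug monitoring of ties) could look roughly like this. `best_namefit` is a simplified stand-in, and `namefit` is a hypothetical scoring callable, not the pynxtools signature:

```python
import logging

logger = logging.getLogger(__name__)

def best_namefit(name, concepts, namefit):
    """Pick the concept with the highest namefit score, keeping the
    original single-winner behavior, but log when several concepts tie."""
    score_board = {}  # score -> list of concepts reaching it
    for concept in concepts:
        score = namefit(name, concept)
        if score is not None:  # None means the concept does not fit at all
            score_board.setdefault(score, []).append(concept)
    if not score_board:
        return None
    best = max(score_board)
    winners = score_board[best]
    if len(winners) > 1:
        # Monitoring only: the ambiguity is reported, not resolved differently.
        logger.debug("name %r fits multiple concepts with score %s: %s",
                     name, best, winners)
    return winners[0]  # original behavior: a single winner
```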

Comment thread src/pynxtools/dataconverter/validation.py Outdated


def clean_str_attr(attr: str | bytes | None, encoding: str = "utf-8") -> str | None:
def decode_if_bytes(payload: Any, encoding: str = "utf-8") -> Any:
@RubelMozumder note how this PR has another decoding step, similar to what you have in #778. We should align this in one central function.
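One possible shape for such a central function, sketched under the assumption that payloads are plain bytes, strings, or shallow containers thereof; the real alignment with #778 may differ:

```python
from typing import Any

def decode_if_bytes(payload: Any, encoding: str = "utf-8") -> Any:
    """Recursively decode bytes payloads (as produced by h5py for HDF5
    strings) into str, leaving all other payloads untouched.
    Note: numpy object arrays of bytes would need extra handling."""
    if isinstance(payload, bytes):
        return payload.decode(encoding)
    if isinstance(payload, (list, tuple)):
        return type(payload)(decode_if_bytes(item, encoding) for item in payload)
    return payload
```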

Comment thread src/pynxtools/dataconverter/validation.py Outdated
Comment thread tests/dataconverter/test_validation.py
Comment thread tests/dataconverter/test_validation.py
Comment thread tests/dataconverter/test_validation.py
Comment thread tests/dataconverter/test_validation.py
Comment thread tests/dataconverter/test_validation.py
Comment thread src/pynxtools/dataconverter/helpers.py
Comment thread src/pynxtools/dataconverter/validation.py

@lukaspie lukaspie left a comment

Probably okay, but I will have to test this in action to understand its full implications.

