Make WD2 processing lazy#73
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the WaveDump 2 (WD2) decoding path to use the project’s lazy/iterative processing approach so WD2 output better matches the existing WD1 “RAW/…” HDF5 layout (addressing the format mismatch concerns raised in #57 and #69).
Changes:
- Added a WD2 lazy binary reader (
read_binary_lazy) and a new WD2 lazy decoding entrypoint (process_bin_WD2_lazy). - Switched the processing CLI (
packs/proc/proc.py) to use the WD2 lazy decoder and updated WD2 config templates accordingly. - Added a test intended to compare WD2 eager vs lazy reads; expanded
.gitignorepatterns for temporary.h5outputs.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
packs/proc/processing_utils.py |
Introduces WD2 lazy read + WD2 lazy decoding/writing logic. |
packs/proc/proc.py |
Routes WD2 decoding through the new lazy decoder. |
packs/tests/processing_test.py |
Adds WD2 lazy-vs-eager equivalence test and minor formatting tweaks. |
packs/tests/data/configs/process_WD2_1channel.conf |
Updates test config to use print_mod instead of counts. |
packs/tests/data/configs/process_WD2_3channel.conf |
Updates test config to use print_mod instead of counts. |
packs/configs/process_WD2_3channel.conf |
Updates example config to use print_mod instead of counts. |
.gitignore |
Broadens ignore pattern for temp .h5 artifacts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Reads the binary in with the expected format/offset, lazily, | ||
| depending on counts to break the data up. | ||
|
|
||
| NOTE: | ||
| The counts are hardset to 1, making this function relatively inefficient. | ||
| In the future, the logic should be revised to allow `np.fromfile`'s count | ||
| value to be set based on optimal read-in speed. The logic of the WD2 function | ||
| will have to accomodate this when indexing the files. | ||
|
|
||
| Parameters | ||
| ---------- | ||
|
|
||
| file (BufferedReader) : Opened file | ||
| wdtype (ndtype) : Custom data type for extracting information from | ||
| binary files | ||
| counts (int) : How many events you want to read in. -1 sets it to take all events. | ||
| offset (int) : Offset at which to start reading the data. Used for chunking purposes | ||
| and so should by default be set to zero if not chunking. | ||
|
|
||
| Returns | ||
| ------- | ||
| data (ndarray) : Unformatted data from binary file | ||
|
|
There was a problem hiding this comment.
read_binary_lazy docstring mentions counts and offset parameters and chunking behavior, but the function signature only accepts (file, wdtype) and always calls np.fromfile(..., count=1). Update the docstring to match the actual API (or add the missing parameters).
| Reads the binary in with the expected format/offset, lazily, | |
| depending on counts to break the data up. | |
| NOTE: | |
| The counts are hardset to 1, making this function relatively inefficient. | |
| In the future, the logic should be revised to allow `np.fromfile`'s count | |
| value to be set based on optimal read-in speed. The logic of the WD2 function | |
| will have to accomodate this when indexing the files. | |
| Parameters | |
| ---------- | |
| file (BufferedReader) : Opened file | |
| wdtype (ndtype) : Custom data type for extracting information from | |
| binary files | |
| counts (int) : How many events you want to read in. -1 sets it to take all events. | |
| offset (int) : Offset at which to start reading the data. Used for chunking purposes | |
| and so should by default be set to zero if not chunking. | |
| Returns | |
| ------- | |
| data (ndarray) : Unformatted data from binary file | |
| Lazily read binary data using the provided numpy dtype. | |
| This function repeatedly calls ``np.fromfile`` with ``count=1`` to read | |
| one record at a time from the open binary file and yields each record | |
| as it is read. This makes the function simple to use in streaming | |
| contexts but relatively inefficient due to the small fixed read size. | |
| NOTE: | |
| The read size is currently hard-set to 1 element per call to | |
| ``np.fromfile``. In the future, the logic may be revised to allow the | |
| ``count`` value to be tuned for optimal read-in speed; any calling code | |
| that indexes into the resulting data will need to accommodate such | |
| changes. | |
| Parameters | |
| ---------- | |
| file (BufferedReader) : Opened binary file object. | |
| wdtype (numpy.dtype) : Custom data type for extracting information | |
| from binary files. | |
| Returns | |
| ------- | |
| Generator[Tuple[bool, numpy.ndarray], None, None] | |
| A generator yielding tuples of the form ``(has_data, data)``: | |
| * ``has_data`` is ``True`` while records are being read, and | |
| ``False`` once processing has finished. | |
| * ``data`` is a numpy array containing the record(s) read by the | |
| corresponding call to ``np.fromfile``. |
| # yield 1 when finished | ||
| print('Processing Finished!') | ||
| yield (False, np.zeros(shape = (1,))) |
There was a problem hiding this comment.
read_binary_lazy prints "Processing Finished!" unconditionally and yields a final (False, zeros) sentinel. This introduces side effects even when callers set print_mod=-1 and makes the generator harder to consume (callers must remember to ignore the last value). Prefer ending the generator with return/StopIteration and let the caller decide whether/when to print completion.
| # yield 1 when finished | |
| print('Processing Finished!') | |
| yield (False, np.zeros(shape = (1,))) | |
| # when no more data is available, the generator stops naturally | |
| return |
| waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this | ||
| num_of_events = int(file_size / waveform_size) | ||
|
|
||
| write('event_info', evt_info, (True, num_of_events, i)) |
There was a problem hiding this comment.
evt_info returned by format_wfs() is a length-1 ndarray when reading lazily (count=1). In fixed-size mode, writer.write() assigns a single row at index, so passing an ndarray here can raise a shape/type error. Write the scalar record instead (e.g., evt_info[0]) to match how process_bin_WD1 passes a single structured row.
| write('event_info', evt_info, (True, num_of_events, i)) | |
| # In lazy mode, format_wfs can return a length-1 ndarray; unwrap to a single record | |
| if isinstance(evt_info, np.ndarray) and getattr(evt_info, "shape", ()) and evt_info.shape[0] == 1: | |
| evt_row = evt_info[0] | |
| else: | |
| evt_row = evt_info | |
| write('event_info', evt_row, (True, num_of_events, i)) |
|
|
||
| for i, (flag, array) in enumerate(read_binary_lazy(file, wdtype)): | ||
|
|
||
| if (i % print_mod == 0) and (print_mod != -1): |
There was a problem hiding this comment.
The print_mod logic uses i % print_mod before validating print_mod. If a user sets print_mod=0, this will raise ZeroDivisionError. Consider validating print_mod (e.g., require -1 or a positive int) before entering the loop, or treat 0 as disabled printing.
| if (i % print_mod == 0) and (print_mod != -1): | |
| if print_mod not in (-1, 0) and (i % print_mod == 0): |
| # first run-through, collect the header information to extract table size | ||
| if i == 0: | ||
| file_size = os.path.getsize(file_path) | ||
| waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this |
There was a problem hiding this comment.
The inline comment "can't remember why 2" undermines confidence in the waveform_size calculation and can mislead future changes. Either remove it or replace it with a concrete explanation of the byte-size formula (header bytes + channelssamples*4 for float32).
| waveform_size = ((samples * channels * 4 ) + header_size) # can't remember why *2, will need to test this | |
| waveform_size = ((samples * channels * 4) + header_size) # bytes per event: header bytes + channels * samples * 4 (float32 samples) |
| process_bin_WD2(**conf_dict) | ||
| process_bin_WD2_lazy(**conf_dict) | ||
| case other: | ||
| raise RuntimeError(f"wavedump edition {other} decoding isn't currently implemented.") |
There was a problem hiding this comment.
proc() now always routes WD2 decoding to process_bin_WD2_lazy, which does not accept the previously-supported counts argument used for chunking in process_bin_WD2. Any existing config files that still include counts will now fail with an unexpected keyword argument. Consider keeping backward compatibility (e.g., accept/ignore counts, or dispatch to the eager implementation when counts is provided).
MattZur
left a comment
There was a problem hiding this comment.
Will need a rebase prior to approval, I think.
This is suboptimal, as described in the functions comment, an issue to resolve this will be opened.
As should have been addressed in nu-ZOO#67, but appears to have reared its head, temporary files are now being created one again, with increasing number. This needs to be resolved, as the initial issue is no longer fixed!
297de17 to
bd63930
Compare
MattZur
left a comment
There was a problem hiding this comment.
Good PR, likely to be used by others when creating other lazy processes.
This PR introduces the standard lazy processing for WD2, as has been required for quite some time now!
This also means that the data formats should now match in a h5 context, resolving (#57 & #69).
The PR has raised a few new concerns: