TL;DR:
Data formats (TIFF, HDF5, Zarr) often pad edge chunks to full size on disk. When these are read through virtual Zarr stores, the codec pipeline receives more bytes than the logical chunk shape expects, causing decode failures. We need a TruncateCodec (bytes-to-bytes) that trims oversized buffers before downstream codecs see them.
Problem Statement
When reading archival file formats through virtual Zarr stores, the physical storage units (TIFF strips, HDF5 chunks, etc.) may contain more bytes than the logical array shape implies. This causes array-to-bytes codecs (e.g., BytesCodec) to fail with reshape errors, because the buffer size does not match the expected chunk shape.
There is currently no Zarr codec for truncating oversized byte buffers to the expected chunk size before decoding.
Background
Many archival formats store data in fixed-size physical units that may extend beyond the logical array boundary:
- TIFF strips: Defined by RowsPerStrip and ImageLength. When ImageLength is not evenly divisible by RowsPerStrip, some writers pad the last strip to the full RowsPerStrip size on disk, while others store only the remaining rows. Both behaviors are valid per the TIFF spec.
- HDF5 chunks: Edge chunks may be stored at full chunk size with padding, depending on the dataset's allocation and fill behavior.
- Zarr chunks: Zarr's own RegularChunkGrid stores partial edge chunks padded to the full chunk shape with the fill value.
- Other formats: Any format where the physical storage granularity doesn't perfectly align with the logical data extent.
Concrete example
A 150-row TIFF image with RowsPerStrip=12 has 13 strips. The last strip logically covers 6 rows, but may be stored as 12 rows (padded). The StripByteCounts for such a file shows all 13 strips at the same byte count.
When this TIFF is exposed as a virtual Zarr array via virtual-tiff, the chunk grid must reflect the logical array shape. A RectilinearChunkGrid with y-axis chunks (12, 12, ..., 12, 6) correctly describes the array, but the codec pipeline receives 12 rows of data for a chunk that expects 6 — causing a reshape failure:
ValueError: cannot reshape array of size 7776 into shape (4,162,6)
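The arithmetic behind this example can be checked directly. The image width (162) and band count (4) below are inferred from the reshape error above, and the dtype is assumed to be uint8:

```python
import math

# Strip layout for the example: 150 rows, RowsPerStrip=12.
rows, rows_per_strip = 150, 12
width, bands, itemsize = 162, 4, 1  # width/bands inferred from the error message

num_strips = math.ceil(rows / rows_per_strip)               # 13 strips
last_strip_rows = rows - (num_strips - 1) * rows_per_strip  # last strip covers 6 rows

logical_bytes = bands * width * last_strip_rows * itemsize  # 3888: what the chunk shape implies
padded_bytes = bands * width * rows_per_strip * itemsize    # 7776: what is actually on disk

print(num_strips, last_strip_rows, logical_bytes, padded_bytes)  # 13 6 3888 7776
```

The 7776-byte buffer arriving for a 3888-byte chunk is exactly the mismatch in the traceback.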
Impact
Single-file reads
This affects any virtual reference to an archival format where the physical storage unit is larger than the logical chunk, regardless of band interleaving, number of bands, or compression.
Virtual concatenation
When virtually concatenating multiple arrays (e.g., via VirtualiZarr), each source file's padded edge chunks carry into the concatenated manifest. The byte ranges still point at the original files, so oversized buffers appear at every source array boundary — not just at the end. A single concatenation of N files can produce N padded chunks instead of one.
This does not affect native Zarr-to-Zarr concatenation, since a Zarr edge chunk is stored at exactly the size its chunk grid declares, so the decoded buffer always matches the expected shape.
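A small sketch of why concatenation multiplies the problem, using three hypothetical 150-row TIFFs (RowsPerStrip=12) concatenated along y. Each source file contributes a short edge chunk, so the combined rectilinear grid has a padded chunk at every file boundary:

```python
rows, rows_per_strip, n_files = 150, 12, 3

# Per-file y-axis chunk sizes: [12]*12 + [6]
per_file_chunks = [rows_per_strip] * (rows // rows_per_strip)
if rows % rows_per_strip:
    per_file_chunks.append(rows % rows_per_strip)

# Concatenated grid: the short chunk recurs at every file boundary.
y_chunks = per_file_chunks * n_files
short_chunks = [c for c in y_chunks if c < rows_per_strip]
print(len(short_chunks))  # 3 padded chunks, not 1
```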
Request
A TruncateCodec registered as a bytes-to-bytes Zarr v3 codec extension that truncates the input buffer to the size expected by chunk_spec before downstream codecs process it.
The codec would:
- Sit at the end of the codec pipeline, so it runs first in the decode chain and trims the buffer before any array-to-bytes codec sees it
- Compute the expected byte count from chunk_spec.shape and chunk_spec.dtype
- If the buffer is larger than expected, truncate to the expected size
- If the buffer is equal or smaller, pass through unchanged
- Be a no-op on encode (write-side padding is a writer concern, not a codec concern)
Example pipeline
codecs: [TransposeCodec(order=(0, 2, 1)), BytesCodec(endian='little'), TruncateCodec()]
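In Zarr v3 array metadata this chain would be declared in encode order, with bytes-to-bytes codecs last (so they run first on decode). The "truncate" identifier below is an assumption; the extension would choose its own registered name:

```python
# Hypothetical Zarr v3 metadata fragment for the pipeline above.
# "truncate" is an assumed registration name, not an existing codec.
codecs = [
    {"name": "transpose", "configuration": {"order": [0, 2, 1]}},
    {"name": "bytes", "configuration": {"endian": "little"}},
    {"name": "truncate"},
]
# Decode traverses the list in reverse, so "truncate" sees the raw bytes first.
print([c["name"] for c in codecs])
```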
Minimal implementation sketch
```python
import numpy as np
from zarr.abc.codec import BytesBytesCodec
from zarr.core.array_spec import ArraySpec
from zarr.core.buffer import Buffer


class TruncateCodec(BytesBytesCodec):
    async def _decode_single(self, chunk_bytes: Buffer, chunk_spec: ArraySpec) -> Buffer:
        # Byte count implied by the logical chunk shape and dtype.
        expected_size = int(np.prod(chunk_spec.shape)) * chunk_spec.dtype.item_size
        if len(chunk_bytes) > expected_size:
            # Trim trailing padding so downstream codecs see the logical size.
            return chunk_bytes[:expected_size]
        return chunk_bytes

    async def _encode_single(self, chunk_bytes: Buffer, chunk_spec: ArraySpec) -> Buffer:
        # No-op on encode: write-side padding is a writer concern, not a codec concern.
        return chunk_bytes
```
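The truncation semantics can be demonstrated with plain numpy, assuming the padding rows sit at the end of the strip's byte stream (row-major layout with the padded axis slowest-varying, which is the case simple truncation handles):

```python
import numpy as np

width, rows_per_strip, logical_rows = 162, 12, 6

# Simulate what is on disk: a full 12-row strip, only 6 rows valid.
full = np.zeros((rows_per_strip, width), dtype=np.uint8)
full[:logical_rows] = 7           # valid data
padded_bytes = full.tobytes()     # 1944 bytes as stored

# What a TruncateCodec would do on decode: trim to the logical size.
expected_size = logical_rows * width  # 972 bytes
trimmed = padded_bytes[:expected_size]

# The trimmed buffer now reshapes cleanly to the logical chunk shape.
chunk = np.frombuffer(trimmed, dtype=np.uint8).reshape(logical_rows, width)
print(chunk.shape)  # (6, 162)
```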
Related
- RectilinearChunkGrid PR

cc @mkitti @d-v-b in case you know of existing solutions, and @jhamman @keewis @TomNicholas as an FYI