
Request: Truncate codec #44

@maxrjones

Description


TL;DR:

Data formats (TIFF, HDF5, Zarr) often pad edge chunks to full size on disk. When these are read through virtual Zarr stores, the codec pipeline receives more bytes than the logical chunk shape expects, causing decode failures. We need a TruncateCodec (bytes-to-bytes) that trims oversized buffers before downstream codecs see them.

Problem Statement

When reading archival file formats through virtual Zarr stores, the physical storage units (TIFF strips, HDF5 chunks, etc.) may contain more bytes than the logical array shape implies. This causes array-to-bytes codecs (e.g., BytesCodec) to fail with reshape errors, because the buffer size does not match the expected chunk shape.

There is currently no Zarr codec for truncating oversized byte buffers to the expected chunk size before decoding.

Background

Many archival formats store data in fixed-size physical units that may extend beyond the logical array boundary:

  • TIFF strips: Defined by RowsPerStrip and ImageLength. When ImageLength is not evenly divisible by RowsPerStrip, some writers pad the last strip to full RowsPerStrip size on disk. Both behaviors are valid per the TIFF spec.
  • HDF5 chunks: Edge chunks may be stored at full chunk size with padding, depending on the dataset's allocation and fill behavior.
  • Zarr chunks: Zarr's RegularChunkGrid pads partial edge chunks to the full chunk shape.
  • Other formats: Any format where the physical storage granularity doesn't perfectly align with the logical data extent.

Concrete example

A 150-row TIFF image with RowsPerStrip=12 has 13 strips. The last strip logically covers 6 rows, but may be stored as 12 rows (padded). The StripByteCounts for such a file shows all 13 strips at the same byte count.
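The strip arithmetic above can be checked directly (values taken from the example; nothing here is TIFF-specific):

```python
import math

image_length = 150    # total rows in the image (ImageLength)
rows_per_strip = 12   # RowsPerStrip

# Number of strips needed to cover the image.
n_strips = math.ceil(image_length / rows_per_strip)
# Rows the final strip logically covers (before any on-disk padding).
last_strip_rows = image_length - (n_strips - 1) * rows_per_strip

print(n_strips, last_strip_rows)  # 13 6
```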

When this TIFF is exposed as a virtual Zarr array via virtual-tiff, the chunk grid must reflect the logical array shape. A RectilinearChunkGrid with y-axis chunks (12, 12, ..., 12, 6) correctly describes the array, but the codec pipeline receives 12 rows of data for a chunk that expects 6 — causing a reshape failure:

ValueError: cannot reshape array of size 7776 into shape (4,162,6)
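The failure can be reproduced with plain NumPy. The element counts match the error above (7776 elements delivered, 3888 expected for a (4, 162, 6) chunk); the uint8 dtype is illustrative:

```python
import numpy as np

# Padded strip: 12 rows stored where the logical chunk expects 6.
padded = np.zeros(4 * 162 * 12, dtype=np.uint8)   # 7776 elements on disk
logical_shape = (4, 162, 6)                       # 3888 elements expected

try:
    padded.reshape(logical_shape)
except ValueError as e:
    print(e)  # cannot reshape array of size 7776 into shape (4,162,6)

# Truncating to the expected element count first makes the reshape valid.
expected = int(np.prod(logical_shape))
trimmed = padded[:expected].reshape(logical_shape)
print(trimmed.shape)  # (4, 162, 6)
```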

Impact

Single-file reads

This affects any virtual reference to an archival format where the physical storage unit is larger than the logical chunk, regardless of band interleaving, number of bands, or compression.

Virtual concatenation

When virtually concatenating multiple arrays (e.g., via VirtualiZarr), each source file's padded edge chunks carry into the concatenated manifest. The byte ranges still point at the original files, so oversized buffers appear at every source array boundary — not just at the end. A single concatenation of N files can produce N padded chunks instead of one.

This does not affect native Zarr-to-Zarr concatenation, since Zarr stores only valid data in edge chunks.

Request

A TruncateCodec registered as a bytes-to-bytes Zarr v3 codec extension that truncates the input buffer to the size expected by chunk_spec before downstream codecs process it.

The codec would:

  • Sit at the end of the codec pipeline (first in the decode chain), before any array-to-bytes codec
  • Compute the expected byte count from chunk_spec.shape and chunk_spec.dtype
  • If the buffer is larger than expected, truncate to the expected size
  • If the buffer is the expected size or smaller, pass it through unchanged
  • Be a no-op on encode (write-side padding is a writer concern, not a codec concern)
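The decode rules above amount to a few lines of pure Python. A standalone sketch of the truncation logic, with no zarr dependency (the function name and signature are hypothetical):

```python
import numpy as np

def truncate_decode(buf: bytes, shape: tuple[int, ...], dtype: str) -> bytes:
    """Trim buf to the byte count implied by shape and dtype;
    pass exact-size or smaller buffers through unchanged."""
    expected = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return buf[:expected] if len(buf) > expected else buf

# Oversized buffer (padded edge chunk) is trimmed to 4 * 6 * 2 = 48 bytes...
assert len(truncate_decode(b"\x00" * 100, (4, 6), "uint16")) == 48
# ...while exact-size and undersized buffers are untouched.
assert truncate_decode(b"\x00" * 48, (4, 6), "uint16") == b"\x00" * 48
assert truncate_decode(b"\x00" * 10, (4, 6), "uint16") == b"\x00" * 10
```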

Example pipeline

codecs: [TruncateCodec(), TransposeCodec(order=(0, 2, 1)), BytesCodec(endian='little')]

Minimal implementation sketch

import numpy as np

from zarr.abc.codec import BytesBytesCodec
from zarr.core.array_spec import ArraySpec
from zarr.core.buffer import Buffer


class TruncateCodec(BytesBytesCodec):
    async def _decode_single(self, chunk_bytes: Buffer, chunk_spec: ArraySpec) -> Buffer:
        # Expected byte count for the logical chunk. `item_size` assumes
        # zarr-python's dtype wrapper; a raw numpy dtype exposes `itemsize`.
        expected_size = int(np.prod(chunk_spec.shape)) * chunk_spec.dtype.item_size
        if len(chunk_bytes) > expected_size:
            return chunk_bytes[:expected_size]
        return chunk_bytes

    async def _encode_single(self, chunk_bytes: Buffer, chunk_spec: ArraySpec) -> Buffer:
        # No-op on encode: write-side padding is a writer concern.
        return chunk_bytes

Related

cc @mkitti @d-v-b in case you know of existing solutions, and @jhamman @keewis @TomNicholas as an FYI
