
Request: Truncate codec #44

@maxrjones

Description


TL;DR:

Data formats (TIFF, HDF5, Zarr) often pad edge chunks to full size on disk. When these are read through virtual Zarr stores, the codec pipeline receives more bytes than the logical chunk shape expects, causing decode failures. We need a TruncateCodec (bytes-to-bytes) that trims oversized buffers before downstream codecs see them.

Problem Statement

When reading archival file formats through virtual Zarr stores, the physical storage units (TIFF strips, HDF5 chunks, etc.) may contain more bytes than the logical array shape implies. This causes array-to-bytes codecs (e.g., BytesCodec) to fail with reshape errors, because the buffer size does not match the expected chunk shape.

There is currently no Zarr codec for truncating oversized byte buffers to the expected chunk size before decoding.

Background

Many archival formats store data in fixed-size physical units that may extend beyond the logical array boundary:

  • TIFF strips: Defined by RowsPerStrip and ImageLength. When ImageLength is not evenly divisible by RowsPerStrip, some writers pad the last strip to full RowsPerStrip size on disk. Both behaviors are valid per the TIFF spec.
  • HDF5 chunks: Edge chunks may be stored at full chunk size with padding, depending on the dataset's allocation and fill behavior.
  • Zarr chunks: Zarr's RegularChunkGrid pads partial edge chunks to the full chunk shape.
  • Other formats: Any format where the physical storage granularity doesn't perfectly align with the logical data extent.

Concrete example

A 150-row TIFF image with RowsPerStrip=12 has 13 strips. The last strip logically covers 6 rows, but may be stored as 12 rows (padded). The StripByteCounts for such a file shows all 13 strips at the same byte count.
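The strip arithmetic above can be checked directly (values taken from the example; nothing here is TIFF-specific):

```python
import math

image_length = 150    # total rows in the image (ImageLength)
rows_per_strip = 12   # RowsPerStrip

# Number of strips needed to cover the image.
n_strips = math.ceil(image_length / rows_per_strip)
# Rows the final strip logically covers (before any on-disk padding).
last_strip_rows = image_length - (n_strips - 1) * rows_per_strip

print(n_strips, last_strip_rows)  # 13 6
```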

When this TIFF is exposed as a virtual Zarr array via virtual-tiff, the chunk grid must reflect the logical array shape. A RectilinearChunkGrid with y-axis chunks (12, 12, ..., 12, 6) correctly describes the array, but the codec pipeline receives 12 rows of data for a chunk that expects 6 — causing a reshape failure:

ValueError: cannot reshape array of size 7776 into shape (4,162,6)
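The failure can be reproduced with plain NumPy. The element counts match the error above (7776 elements delivered, 3888 expected for a (4, 162, 6) chunk); the uint8 dtype is illustrative:

```python
import numpy as np

# Padded strip: 12 rows stored where the logical chunk expects 6.
padded = np.zeros(4 * 162 * 12, dtype=np.uint8)   # 7776 elements on disk
logical_shape = (4, 162, 6)                       # 3888 elements expected

try:
    padded.reshape(logical_shape)
except ValueError as e:
    print(e)  # cannot reshape array of size 7776 into shape (4,162,6)

# Truncating to the expected element count first makes the reshape valid.
expected = int(np.prod(logical_shape))
trimmed = padded[:expected].reshape(logical_shape)
print(trimmed.shape)  # (4, 162, 6)
```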

Impact

Single-file reads

This affects any virtual reference to an archival format where the physical storage unit is larger than the logical chunk, regardless of band interleaving, number of bands, or compression.

Virtual concatenation

When virtually concatenating multiple arrays (e.g., via VirtualiZarr), each source file's padded edge chunks carry into the concatenated manifest. The byte ranges still point at the original files, so oversized buffers appear at every source array boundary — not just at the end. A single concatenation of N files can produce N padded chunks instead of one.

This does not affect native Zarr-to-Zarr concatenation, since Zarr stores only valid data in edge chunks.

Request

A TruncateCodec registered as a bytes-to-bytes Zarr v3 codec extension that truncates the input buffer to the size expected by chunk_spec before downstream codecs process it.

The codec would:

  • Sit at the end of the codec pipeline (first in the decode chain), before any array-to-bytes codec
  • Compute the expected byte count from chunk_spec.shape and chunk_spec.dtype
  • If the buffer is larger than expected, truncate to the expected size
  • If the buffer is the expected size or smaller, pass it through unchanged
  • Be a no-op on encode (write-side padding is a writer concern, not a codec concern)
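The decode rules above amount to a few lines of pure Python. A standalone sketch of the truncation logic, with no zarr dependency (the function name and signature are hypothetical):

```python
import numpy as np

def truncate_decode(buf: bytes, shape: tuple[int, ...], dtype: str) -> bytes:
    """Trim buf to the byte count implied by shape and dtype;
    pass exact-size or smaller buffers through unchanged."""
    expected = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return buf[:expected] if len(buf) > expected else buf

# Oversized buffer (padded edge chunk) is trimmed to 4 * 6 * 2 = 48 bytes...
assert len(truncate_decode(b"\x00" * 100, (4, 6), "uint16")) == 48
# ...while exact-size and undersized buffers are untouched.
assert truncate_decode(b"\x00" * 48, (4, 6), "uint16") == b"\x00" * 48
assert truncate_decode(b"\x00" * 10, (4, 6), "uint16") == b"\x00" * 10
```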

Example pipeline

codecs: [TruncateCodec(), TransposeCodec(order=(0, 2, 1)), BytesCodec(endian='little')]

Minimal implementation sketch

import numpy as np

from zarr.abc.codec import BytesBytesCodec
from zarr.core.array_spec import ArraySpec
from zarr.core.buffer import Buffer


class TruncateCodec(BytesBytesCodec):
    async def _decode_single(self, chunk_bytes: Buffer, chunk_spec: ArraySpec) -> Buffer:
        # Expected byte count for the logical chunk. `item_size` assumes
        # zarr-python's dtype wrapper; a raw numpy dtype exposes `itemsize`.
        expected_size = int(np.prod(chunk_spec.shape)) * chunk_spec.dtype.item_size
        if len(chunk_bytes) > expected_size:
            return chunk_bytes[:expected_size]
        return chunk_bytes

    async def _encode_single(self, chunk_bytes: Buffer, chunk_spec: ArraySpec) -> Buffer:
        # No-op on encode: write-side padding is a writer concern.
        return chunk_bytes

Related

cc @mkitti @d-v-b in case you know of existing solutions, and @jhamman @keewis @TomNicholas as an FYI
