Add support for `ragged` arrays by cjboyle · Pull Request #1104 · bluesky/tiled

cjboyle · 2025-08-23T00:02:55Z

This adds client and backend support for reading/writing irregular arrays using the ragged package. As ragged is more or less a wrapper around awkward, ~~this PR reuses, or adds similar implementations from that structure family (e.g. serialization)~~.

Implements #801.

danielballan · 2025-08-25T11:56:09Z

Awesome!

It looks like you found all the modules that need to be touched to add this.

The aspect that will need the most careful thought is the structure description and the HTTP APIs. These are designed to be used not only from the built-in Python client, but also from curl with tools like jq, browser-based applications, maybe Julia or Rust someday....

The Awkward form is quite complex. I suspect that only Python and C++ based clients, with access to awkward / awkward-cpp libraries, will be able to parse the form and engage with Tiled's Awkward structures in detail. (Unless, that is, IRIS-HEP builds Awkward libraries in other languages.) Clients without knowledge of Awkward can still get the data—exporting it to JSON, for example—but they probably cannot introspect or slice it in sophisticated ways.

If we were willing to similarly restrict ragged to clients with access to an awkward implementation, we wouldn't even really need to add a new structure family. We could implement it fully client-side, as a wrapper of the awkward client. But I see advantages in using the comparative simplicity of ragged to make it more accessible to simple clients.

This form construct is more flexible than ragged requires:

{'class': 'ListOffsetArray',
 'offsets': 'i64',
 'content': {'class': 'NumpyArray',
  'primitive': 'int64',
  'inner_shape': [],
  'parameters': {},
  'form_key': 'node1'},
 'parameters': {},
 'form_key': 'node0'}

A ragged form is always composed of one numpy "content" array and some number of "offset" arrays—full stop. It can be described thus (from #801):

class RaggedStructure(ArrayStructure):
    shape: Tuple[None | int, ...]  # override base class which has this as Tuple[int, ...]
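To make the "one content array plus offset arrays" idea concrete, here is a minimal NumPy-only sketch of a two-dimensional ragged array. It is not code from this PR: the sample data and the `get_row` helper are made up for illustration, and the `(3, None)` shape simply follows the convention proposed above, where `None` marks a variable dimension.

```python
import numpy as np

# Hypothetical example data: three rows of lengths 3, 0, and 2.
rows = [[1, 2, 3], [], [4, 5]]

# One flat "content" array holding every element in order.
content = np.array([x for row in rows for x in row], dtype=np.int64)

# One "offsets" array: row i spans content[offsets[i]:offsets[i + 1]].
lengths = [len(row) for row in rows]
offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int64)

# Fixed dimensions are ints; variable dimensions are None.
shape = (len(rows), None)


def get_row(i):
    """Recover row i from the flat representation (illustrative helper)."""
    return content[offsets[i]:offsets[i + 1]]
```

A client that understands only this pair of arrays can slice rows without parsing an Awkward form.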

I'm not sure whether ragged always puts offset arrays in int64 dtype. If other integer types (such as unsigned ones) may be needed, then we will need a supplemental offset_datatype, similar to the supplemental coord_data_type in sparse structures.

tiled/tiled/structures/sparse.py

Lines 19 to 23 in f6a9509

    
    coord_data_type: Optional[BuiltinDtype] = field(
        default_factory=lambda: BuiltinDtype(
            Endianness("little"), Kind("u"), 8
        )  # numpy 'uint' dtype
    )

Although reusing the awkward form keeps things simple, assuming your client already consumes awkward, I think a custom, much more constrained structure JSON is worthwhile, to make ragged arrays a more portable and accessible concept.

danielballan

I leave it to @genematx to do a detailed review. I have just a couple comments on the structure, which I think it's important to get as right as we can from the start.

danielballan · 2026-04-28T16:39:48Z

+    and any variable dimensions are represented by None."""
+    size: int
+    """The total number of elements in the array."""
+    partitions: tuple[int, ...]


Following convention, we use "chunks" to describe N-dimensional array chunks. The term "partitions" is applied to tables, where the chunking is inherently 1-dimensional, along the rows of the table.

Oh wait, this might be the offsets? In that case I would suggest aligning with Awkward terminology and calling it offsets.

No your first comment was correct (chunks vs partitions). I was essentially following the terminology that dask-awkward uses for partitioning, because it is only being computed over the first dimension.

If I understand correctly, partitions here define the boundaries of splits along the left-most dimension; each partition is stored in its own parquet file (via awkward.to_parquet), so primarily they are needed to keep track of Tiled's assets (if there are several files). I wonder if this can be done by awkward itself?

That is correct. I don't believe this can be done in awkward proper. dask-awkward does provide some limited partitioning functionality, though see Dan's previous comment.

That works for me. And just for confirmation, this would change from a list of bounding indices to a list of shapes?

# old
partitions = [0, 10, 50, 60, 75, ...]
# new
chunks = [(10, None, ...), (40, None, ...), (10, None, ...), (15, None, ...), ...]

yeah, that's almost what I was thinking, more specifically
chunks = [[10, 40, 10, 15], None, None, ...]
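For illustration only, the conversion from the old boundary list to the chunk layout described above could look like the sketch below. The function name `partitions_to_chunks` and its `ndim` parameter are hypothetical and are not taken from the PR; the only assumption is that chunking happens along the first dimension and the remaining (variable) dimensions are left as `None`.

```python
import numpy as np


def partitions_to_chunks(partitions, ndim):
    """Turn bounding indices along axis 0 into per-chunk sizes.

    Hypothetical helper: only the first dimension is chunked, so every
    other dimension is reported as None.
    """
    sizes = np.diff(partitions).tolist()  # differences between boundaries
    return [sizes] + [None] * (ndim - 1)


chunks = partitions_to_chunks([0, 10, 50, 60, 75], ndim=3)
# chunks == [[10, 40, 10, 15], None, None]
```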

Ah right. I can get started on that.

I've refactored this in a way that I think is sane (I've added into the structure its own version of shape_from_chunks that helps with this). This also updates the parquet file output to be in line with the sparse implementation (e.g. block-5.0.0.parquet), bearing in mind that this may soon be switched to DirectoryContainer.

Thank you very much, Connor! I'll have a look at this asap; likely some time next week -- we have a really busy next couple of days here.

danielballan · 2026-04-28T16:41:10Z

+    partitions: tuple[int, ...]
+    """Defines the boundaries for partitioning the array.
+    Note that the final value is the row count from `shape[0]`."""
+    nbytes: int


This can be derived from the size and the data_type. I think that storing it separately over-determines the structure. Should it become a property?

That is how I originally had it (until yesterday 🙂). The difference is that the underlying size from Awkward includes the size of the np.int64 offsets data.

A property seems like a better fit for this, but do we even need it at all?

I don't think it is strictly needed, no.
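A small NumPy sketch of the discrepancy discussed in this thread, under the assumption that a ragged array is held as one flat content array plus an int64 offsets array. The arrays and variable names here are invented for illustration; the point is only that `size * itemsize` covers the content, while a byte count that also includes the offsets is larger.

```python
import numpy as np

# Hypothetical ragged array: rows of lengths 3, 0, and 2.
content = np.arange(5, dtype=np.float32)
offsets = np.array([0, 3, 3, 5], dtype=np.int64)

size = content.size          # total number of elements
data_type = content.dtype

# What a derived property would report from size and data_type alone.
derived_nbytes = size * data_type.itemsize

# What a count that also includes the offsets buffer would report.
nbytes_with_offsets = derived_nbytes + offsets.nbytes
```

With these values the two numbers differ by the 32 bytes of the four int64 offsets, which is why storing `nbytes` alongside `size` and `data_type` was ambiguous.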

genematx · 2026-04-28T17:51:31Z

@cjboyle I'm still going through my review and making some small changes, but I thought I'd push what I have so far, so we don't diverge in case you're working on this too. Here are my suggestions: cjboyle#1 (mostly around tests and the alembic migration; I've also updated it with main, so it may appear that there are more unrelated changes).

genematx · 2026-04-29T00:38:41Z

+            dtype=self.dtype,
+        )
+
+    def read_block(self, block: Any, slice: Any | None = None) -> ragged.array:


We probably don't need a separate .read_block method; it currently exists for the ArrayClient, but it likely will become deprecated in the near future (just the simple read + slice covers all necessary cases, especially since dask is not yet supported).

Is the plan to deprecate this from just the clients, or from the adapters and router endpoints as well?

It is not totally thought through yet, but I think we could support this on the client and in HTTP for back-compat, while dropping the need for it on the Adapters (implementing support just in terms of the read method).

genematx · 2026-04-29T00:43:51Z

+    nbytes: int
+    # Optional tuple of dimension names, e.g. ("time", "x"), or None for unnamed dimensions
+    dims: tuple[str, ...] | None = None
+    resizable: bool | tuple[bool, ...] = False


I'm not sure we need resizable in the structure. I don't think we use it, even for the usual arrays. @danielballan

genematx · 2026-04-29T12:42:35Z

+    from tiled.type_aliases import JSON
+
+
+class RaggedParquetAdapter(Adapter[RaggedStructure]):


I've been thinking about the backend storage for this. While parquet is simple and is working, I wonder if it is the optimal choice here.

First, it raises the question: why don't we use parquet for the usual AwkwardArrays, but instead store them in a lower-level byte-array representation? Both methods have their pros and cons (mainly around the standardization of the interface vs storage size and performance), but it probably would make sense to keep them consistent, since ragged is a derivative of awkward (I think in a way it would be nice if we could read ragged storage with AwkwardAdapter, even though we probably would never do this). I have made some refactoring changes to AwkwardAdapters here to make it easier to extend and adopt it (@cjboyle somehow I couldn't add you as a reviewer, but your comments are very welcome!).

On the other hand, the main advantage of ragged structures, as I see it, is the ability to append variably-sized data dynamically. (I believe this is how they are intended to be used in one of our applications at NSLS2.) I think we should consider designing appendable storage, e.g. similar to what we have for appendable tables, and this could be another difference from the base awkward class.

Yeah, I was going both ways on this. In the first revision the adapter was storing the flattened data as .npy files for efficiency (with offsets stored in the structure object). I'm okay with reusing the Awkward storage implementation for less overhead.

I definitely had "implement append" on a to-do list somewhere... though this already works with the SQL storage for lists of 1D arrays. Yep, this could be done with either ragged.concat or awkward.concatenate with axis=0, or if you have a better idea using the file-backed containers directly.
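As a rough sketch of what an axis-0 append means at the storage level, assuming the flat content-plus-offsets layout discussed earlier in this PR: the new content is concatenated, and the new offsets are shifted by the length of the existing content. This uses plain NumPy rather than ragged or awkward, and `append_ragged` is a hypothetical name, not a function from either library or from Tiled.

```python
import numpy as np


def append_ragged(content_a, offsets_a, content_b, offsets_b):
    """Append ragged array b after ragged array a along axis 0.

    Hypothetical helper operating on (content, offsets) pairs.
    """
    content = np.concatenate([content_a, content_b])
    # Skip b's leading 0 and shift its offsets past the end of a's content.
    shifted = offsets_b[1:] + offsets_a[-1]
    offsets = np.concatenate([offsets_a, shifted])
    return content, offsets


# a has rows [1, 2] and [3]; b has rows [4] and [5, 6, 7].
content, offsets = append_ragged(
    np.array([1, 2, 3]), np.array([0, 2, 3]),
    np.array([4, 5, 6, 7]), np.array([0, 1, 4]),
)
```

Because existing content and offsets are never rewritten, only extended, this layout lends itself to append-only storage.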

genematx · 2026-04-29T16:01:03Z

+    from tiled.type_aliases import JSON
+
+
+class RaggedAdapter(Adapter[RaggedStructure]):


I'm also tempted to combine the in-memory and storage-backed adapters. Will work on this.

cjboyle added 16 commits July 10, 2025 16:39

check for array of arrays and convert to ndarray

810045a

Merge remote-tracking branch 'upstream/main'

8714ab1

Merge remote-tracking branch 'upstream/main'

875af0b

Add ragged dependency

7b48ee9

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

76e13ab

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

dc7c042

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

016c03e

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

95cc27f

From SQLAdapter, test Array-, Ragged-, then AwkwardAdapter

4543b73

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

81ea973

Test returned adapters, without nullable data types

62ff00e

remove normalize_chunks from ragged adapter

4235a68

Add schema tests for irregular arrays

fb4d71f

No need to test every datatype, already done elsewhere

fc242e1

write + read full ragged arrays

548b0f3

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

23cba73

cjboyle added 13 commits August 25, 2025 09:18

fix lack of read()

18cfd67

add more complexity to tests

bebdc68

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

3684801

test simple to complex arrays

f3ba2ea

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

ec7b92f

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

07fbe8e

Update structure to store offsets

596a104

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

1c9a9ec

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

47ce353

fix exit clause logic

b477814

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

b9b4fd7

fix parameter order

7ddbe04

test ragged structure and utilities

94aed41

cjboyle and others added 13 commits April 9, 2026 12:43

Merge remote-tracking branch 'upstream/main'

2a8fd09

moved CHANGELOG entry to "unreleased"

09167d2

Merge branch 'main' into support-ragged-arrays

7765c36

Merge remote-tracking branch 'origin/main' into support-ragged-arrays

99d22ec

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

bbbfed8

Merge remote-tracking branch 'upstream/main' into support-ragged-arrays

d383a3b

provide correct nbytes size from Awkward

274a9a5

Co-authored-by: Copilot <copilot@github.com>

Merge branch 'main' into support-ragged-arrays-eugene

68657f7

ENH: add alembic migration

5279cf7

ENH: add alembic migration

e32c076

TST: revert changes to existing tests

acad1e0

TST: test sql arrays passing

f2b4edc

TST: use module-scoped fixtures

9a54f84

danielballan reviewed Apr 28, 2026

genematx added 3 commits April 28, 2026 13:36

TST: add comment

89e2ec9

MNT: comments

fe3040d

MNT: remove noqa: SLF001

8dbfeb5

genematx and others added 3 commits April 28, 2026 15:00

Update tests/test_ragged.py

9021763

Co-authored-by: Connor Boyle <connor@cjboyle.ca>

import make_ragged_array in tests

5695f15

Merge pull request #1 from genematx/support-ragged-arrays-eugene

d8a0387

WIP: Support Ragged Arrays

genematx reviewed Apr 29, 2026

genematx mentioned this pull request Apr 29, 2026

Refactoring AwkwardAdapter #1362

Open

2 tasks

genematx reviewed Apr 29, 2026

cjboyle added 4 commits April 29, 2026 14:39

remove ambiguous nbytes altogether, as it isn't used anywhere.

e74fd03

wip: refactor 'partitions' to 'chunks'

4a7724e

cleanup determining shape of chunks/blocks

6e82d71

fixed block slice expansion

e67d140

