Skip to content

Add support for ragged arrays#1104

Open
cjboyle wants to merge 108 commits into
bluesky:mainfrom
cjboyle:support-ragged-arrays
Open

Add support for ragged arrays#1104
cjboyle wants to merge 108 commits into
bluesky:mainfrom
cjboyle:support-ragged-arrays

Conversation

@cjboyle
Copy link
Copy Markdown
Contributor

@cjboyle cjboyle commented Aug 23, 2025

This adds client and backend support for reading/writing irregular arrays using the the ragged package. As ragged is more or less a wrapper around awkward, this PR reuses, or adds similar implementations from that structure family (e.g. serialization).

Implements #801.

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section
  • Writing ragged data from client (file storage)
  • Reading ragged data to client (file storage)
    • in full
    • sliced
    • from block/chunk
    • from block/chunk, sliced
  • Writing [None] shaped data from Bluesky/TiledWriter into SQL storage
  • Reading data from RaggedAdapter returned by SQLAdapter (SQL storage)
  • Reading ragged data to client (SQL storage)
    • in full
    • sliced
    • from block/chunk
    • from block/chunk, sliced
  • Serialization
    • JSON
    • Arrow
    • Parquet
    • w/ slicing (except scalar results)
    • others?

@danielballan
Copy link
Copy Markdown
Member

Awesome!

It looks like you found all the modules that need to be touched to add this.

The aspect that will need the most careful thought is the structure description and the HTTP APIs. These are designed to be used not only from the built-in Python client, but also from curl with tools like jq, browser-based applications, maybe Julia or Rust someday....

The Awkward form is quite complex. I suspect that only Python and C++ based clients, with access to awkward / awkward-cpp libraries, will be able to parse the form and engage with Tiled's Awkward structures in detail. (Unless, that is, IRIS-HEP builds Awkward libraries in other languages.) Clients without knowledge of Awkward can still get the data—exporting it to JSON, for example—but they probably cannot introspect or slice it in sophisticated ways.

If we were willing to similarly restrict ragged to clients with access to an awkward implementation, we wouldn't even really need to add a new structure family. We could implement it fully client-side, as a wrapper of the awkward client. But I see advantages in using the comparative simplicity of ragged to make it more accessible to simple clients.

This form construct is more flexible than ragged requires:

{'class': 'ListOffsetArray',
 'offsets': 'i64',
 'content': {'class': 'NumpyArray',
  'primitive': 'int64',
  'inner_shape': [],
  'parameters': {},
  'form_key': 'node1'},
 'parameters': {},
 'form_key': 'node0'}

A ragged form is always composed of one numpy "content" array and some number of "offset" arrays—full stop. It can be described thus (from #801):

class RaggedStructure(ArrayStructure):
    shape: Tuple[None | int, ...]  # override base class which has this as Tuple[int, ...]

I'm not sure whether ragged always puts offset arrays in int64 dtype. If other uint types may be needed, then we will need a supplemental offset_datatype, similar to the supplemental coord_datatype in sparse structures.

coord_data_type: Optional[BuiltinDtype] = field(
default_factory=lambda: BuiltinDtype(
Endianness("little"), Kind("u"), 8
) # numpy 'uint' dtype
)

Although reusing the awkward form keeps things simple assuming your client already consumes awkward I think having a custom, much more constrained structure JSON, is worthwhile, to make ragged arrays a more portable and accessible concept.

Copy link
Copy Markdown
Member

@danielballan danielballan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave it to @genematx to do a detailed review. I have just a couple comments on the structure, which I think it's important to get as right as we can from the start.

Comment thread tiled/structures/ragged.py Outdated
and any variable dimensions are represented by None."""
size: int
"""The total number of elements in the array."""
partitions: tuple[int, ...]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following convention, we use "chunks" to describe N-dimensional array chunks. The term "partitions" is applied to tables, where the chunking is inherently 1-dimensional, along the rows of the table.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, this might be the offsets? In that case I would suggest aligning with Awkward terminology and calling it offsets.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No your first comment was correct (chunks vs partitions). I was essentially following the terminology that dask-awkward uses for partitioning, because it is only being computed over the first dimension.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I understand correctly, partitions here define the boundaries of splits along the left-most dimension; each partition is stored in its own parquet file (via awkward.to_parquet) -- so primarily they are needed to keep track of tiled's assets (if there are several files). I wonder if this can be done by awkward itself?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. I don't believe this can be done in awkward proper. dask-awkward does provide some limited partitioning functionality, though see Dan's previous comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me. And just for confirmation, this would change from a list of bounding indices to a list of shapes?

# old
partitions = [
    0,
    10,
    50,
    60,
    75,
    ...
]

# new
chunks = [
    (10, None, ...),
    (40, None, ...),
    (10, None, ...),
    (15, None, ...),
    ...
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's almost what I was thinking, more specifically
chunks = [[10, 40, 10, 15], None, None, ...]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. I can get started on that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored this in a way that I think is sane (I've added into the structure its own version of shape_from_chunks that helps with this). This also updates the parquet file output to be in line with the sparse implementation (e.g. block-5.0.0.parquet), bearing in mind that this may soon be switched to DirectoryContainer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much, Connor! I'll have a look at this asap; likely some time next week -- we have a really busy next couple of days here.

Comment thread tiled/structures/ragged.py Outdated
partitions: tuple[int, ...]
"""Defines the boundaries for partitioning the array.
Note that the final value is the row count from `shape[0]`."""
nbytes: int
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be derived from the size and the data_type. I think that storing it separately over-determines the structure. Should it become a property?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is how I originally had it (until yesterday 🙂). The difference is that the underlying size from Awkward includes the size of the np.int64 offsets data.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A property seems like a better fit for this, but do we even need it at all?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is strictly needed, no.

@genematx
Copy link
Copy Markdown
Contributor

genematx commented Apr 28, 2026

@cjboyle I'm still going through my review and making some small changes, but I thought I'd push what I have so far, so we don't diverge in case you're working on this too. Here're my suggestions: cjboyle#1 (mostly around tests and alembic migration; I've also updated it with main, so it may appear there are more unrelated changes)

genematx and others added 3 commits April 28, 2026 15:00
Comment thread tiled/client/ragged.py
dtype=self.dtype,
)

def read_block(self, block: Any, slice: Any | None = None) -> ragged.array:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need a separate .read_block method; it currently exists for the ArrayClient, but it likely will become deprecated in the near future (just the simple read + slice covers all necessary cases, especially since dask is not yet supported).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the plan to deprecate this from just the clients, or from the adapters and router endpoints as well?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not totally thought through yet, but I think we could support this on the client and in HTTP for back-compat, while dropping the need for it on the Adapters (implementing support just in terms of the read method).

nbytes: int
# Optional tuple of dimension names, e.g. ("time", "x"), or None for unnamed dimensions
dims: tuple[str, ...] | None = None
resizable: bool | tuple[bool, ...] = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure if need resizable in the structure -- I don't think we use it, even for the usual arrays @danielballan

@genematx genematx mentioned this pull request Apr 29, 2026
2 tasks
from tiled.type_aliases import JSON


class RaggedParquetAdapter(Adapter[RaggedStructure]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about the backend storage for this. While parquet is simple and is working, I wonder if it is the optimal choice here.

First, it raises the question, why don't we use parquet for the usual AwkwardArrays but instead store them in lower-level byte-arrays representation? Both methods have their pros and cons (mainly around the standardization of the interface vs storage size and performance), but it probably would make sense to keep them consistent, since ragged is a derivative of awkward (I think in a way it would be nice if we could read ragged storage with AwkwardAdapter -- even thought we probably would never do this). I have made some refactoring changes to AwkwardAdapters here to make it easier to extend and adopt it (@cjboyle somehow I couldn't add you as a reviewer, but your comments are very welcome!).

On the other hand, the main advantage of ragged structures, as I see it, is the ability to append variably-sized data dynamically. (I believe this is how they are intended to be used at one of our applications at NSLS2.) I think we should consider a possibility for designing an appendable storage -- e.g. similar to what we have for appendable tables -- and this could be another difference from the base awkward class.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was going both ways on this. In the first revision the adapter was storing the flattened data as .npy files for efficiency (with offsets stored in the structure object). I'm okay with reusing the Awkward storage implementation for less overhead.

I definitely had "implement append" on a to-do list somewhere... though this already works with the SQL storage for lists of 1D arrays. Yep, this could be done with either ragged.concat or awkward.concatenate with axis=0, or if you have a better idea using the file-backed containers directly.

Comment thread tiled/adapters/ragged.py
from tiled.type_aliases import JSON


class RaggedAdapter(Adapter[RaggedStructure]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also tempted to combine the in-memory and storage-backed adapters. Will work on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants