Add support for ragged arrays#1104
Conversation
|
Awesome! It looks like you found all the modules that need to be touched to add this. The aspect that will need the most careful thought is the structure description and the HTTP APIs. These are designed to be used not only from the built-in Python client, but also from The Awkward If we were willing to similarly restrict This A class RaggedStructure(ArrayStructure):
shape: Tuple[None | int, ...] # override base class which has this as Tuple[int, ...]I'm not sure whether tiled/tiled/structures/sparse.py Lines 19 to 23 in f6a9509 Although reusing the awkward form keeps things simple assuming your client already consumes awkward I think having a custom, much more constrained structure JSON, is worthwhile, to make ragged arrays a more portable and accessible concept. |
Co-authored-by: Copilot <copilot@github.com>
danielballan
left a comment
There was a problem hiding this comment.
I leave it to @genematx to do a detailed review. I have just a couple comments on the structure, which I think it's important to get as right as we can from the start.
| and any variable dimensions are represented by None.""" | ||
| size: int | ||
| """The total number of elements in the array.""" | ||
| partitions: tuple[int, ...] |
There was a problem hiding this comment.
Following convention, we use "chunks" to describe N-dimensional array chunks. The term "partitions" is applied to tables, where the chunking is inherently 1-dimensional, along the rows of the table.
There was a problem hiding this comment.
Oh wait, this might be the offsets? In that case I would suggest aligning with Awkward terminology and calling it offsets.
There was a problem hiding this comment.
No your first comment was correct (chunks vs partitions). I was essentially following the terminology that dask-awkward uses for partitioning, because it is only being computed over the first dimension.
There was a problem hiding this comment.
if I understand correctly, partitions here define the boundaries of splits along the left-most dimension; each partition is stored in its own parquet file (via awkward.to_parquet) -- so primarily they are needed to keep track of tiled's assets (if there are several files). I wonder if this can be done by awkward itself?
There was a problem hiding this comment.
That is correct. I don't believe this can be done in awkward proper. dask-awkward does provide some limited partitioning functionality, though see Dan's previous comment.
There was a problem hiding this comment.
That works for me. And just for confirmation, this would change from a list of bounding indices to a list of shapes?
# old
partitions = [
0,
10,
50,
60,
75,
...
]
# new
chunks = [
(10, None, ...),
(40, None, ...),
(10, None, ...),
(15, None, ...),
...
]There was a problem hiding this comment.
yeah, that's almost what I was thinking, more specifically
chunks = [[10, 40, 10, 15], None, None, ...]
There was a problem hiding this comment.
Ah right. I can get started on that.
There was a problem hiding this comment.
I've refactored this in a way that I think is sane (I've added into the structure its own version of shape_from_chunks that helps with this). This also updates the parquet file output to be in line with the sparse implementation (e.g. block-5.0.0.parquet), bearing in mind that this may soon be switched to DirectoryContainer.
There was a problem hiding this comment.
Thank you very much, Connor! I'll have a look at this asap; likely some time next week -- we have a really busy next couple of days here.
| partitions: tuple[int, ...] | ||
| """Defines the boundaries for partitioning the array. | ||
| Note that the final value is the row count from `shape[0]`.""" | ||
| nbytes: int |
There was a problem hiding this comment.
This can be derived from the size and the data_type. I think that storing it separately over-determines the structure. Should it become a property?
There was a problem hiding this comment.
That is how I originally had it (until yesterday 🙂). The difference is that the underlying size from Awkward includes the size of the np.int64 offsets data.
There was a problem hiding this comment.
A property seems like a better fit for this, but do we even need it at all?
There was a problem hiding this comment.
I don't think it is strictly needed, no.
|
@cjboyle I'm still going through my review and making some small changes, but I thought I'd push what I have so far, so we don't diverge in case you're working on this too. Here're my suggestions: cjboyle#1 (mostly around tests and alembic migration; I've also updated it with main, so it may appear there are more unrelated changes) |
Co-authored-by: Connor Boyle <connor@cjboyle.ca>
WIP: Support Ragged Arrays
| dtype=self.dtype, | ||
| ) | ||
|
|
||
| def read_block(self, block: Any, slice: Any | None = None) -> ragged.array: |
There was a problem hiding this comment.
We probably don't need a separate .read_block method; it currently exists for the ArrayClient, but it likely will become deprecated in the near future (just the simple read + slice covers all necessary cases, especially since dask is not yet supported).
There was a problem hiding this comment.
Is the plan to deprecate this from just the clients, or from the adapters and router endpoints as well?
There was a problem hiding this comment.
It is not totally thought through yet, but I think we could support this on the client and in HTTP for back-compat, while dropping the need for it on the Adapters (implementing support just in terms of the read method).
| nbytes: int | ||
| # Optional tuple of dimension names, e.g. ("time", "x"), or None for unnamed dimensions | ||
| dims: tuple[str, ...] | None = None | ||
| resizable: bool | tuple[bool, ...] = False |
There was a problem hiding this comment.
i'm not sure if need resizable in the structure -- I don't think we use it, even for the usual arrays @danielballan
| from tiled.type_aliases import JSON | ||
|
|
||
|
|
||
| class RaggedParquetAdapter(Adapter[RaggedStructure]): |
There was a problem hiding this comment.
I've been thinking about the backend storage for this. While parquet is simple and is working, I wonder if it is the optimal choice here.
First, it raises the question, why don't we use parquet for the usual AwkwardArrays but instead store them in lower-level byte-arrays representation? Both methods have their pros and cons (mainly around the standardization of the interface vs storage size and performance), but it probably would make sense to keep them consistent, since ragged is a derivative of awkward (I think in a way it would be nice if we could read ragged storage with AwkwardAdapter -- even thought we probably would never do this). I have made some refactoring changes to AwkwardAdapters here to make it easier to extend and adopt it (@cjboyle somehow I couldn't add you as a reviewer, but your comments are very welcome!).
On the other hand, the main advantage of ragged structures, as I see it, is the ability to append variably-sized data dynamically. (I believe this is how they are intended to be used at one of our applications at NSLS2.) I think we should consider a possibility for designing an appendable storage -- e.g. similar to what we have for appendable tables -- and this could be another difference from the base awkward class.
There was a problem hiding this comment.
Yeah, I was going both ways on this. In the first revision the adapter was storing the flattened data as .npy files for efficiency (with offsets stored in the structure object). I'm okay with reusing the Awkward storage implementation for less overhead.
I definitely had "implement append" on a to-do list somewhere... though this already works with the SQL storage for lists of 1D arrays. Yep, this could be done with either ragged.concat or awkward.concatenate with axis=0, or if you have a better idea using the file-backed containers directly.
| from tiled.type_aliases import JSON | ||
|
|
||
|
|
||
| class RaggedAdapter(Adapter[RaggedStructure]): |
There was a problem hiding this comment.
I'm also tempted to combine the in-memory and storage-backed adapters. Will work on this.
This adds client and backend support for reading/writing irregular arrays using the the
raggedpackage. Asraggedis more or less a wrapper aroundawkward,this PR reuses, or adds similar implementations from that structure family (e.g. serialization).Implements #801.
Checklist
[None]shaped data from Bluesky/TiledWriter into SQL storageRaggedAdapterreturned bySQLAdapter(SQL storage)