Replies: 3 comments 6 replies
Also worth noting: @BorisTheBrave has proposed the same approach in xarray PR #11171, using zarr's This reinforces the case for making
Another (non-exclusive) idea worth considering is forking Zarr's Jupyter-safe event loop logic into a separate library, which could be used by other projects like Xarray. I think we would still need a way to invoke Zarr's event loop from inside Xarray for async Zarr IO, though.
Do you need to interface with existing async xarray routines, or do you just need to synchronously invoke something that should be done asynchronously under the hood? If it's the second option, shouldn't we add an API to zarr-python that does this? Like, you say you need to open multiple groups concurrently. That means reading N metadata documents concurrently. That means using the async

More broadly, we don't have any public synchronous API for our store classes, which IMO is a major oversight, because this would evidently be useful to people. I tried to move in this direction with this PR: #3638, but the result was private methods. IMO we should consider a public, synchronous store API, so people can do useful concurrency-backed store operations without needing to touch our event loop.
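To make the suggestion concrete, here is a rough sketch of what a synchronous store facade with internal concurrency could look like. FakeAsyncStore, SyncStore, and get_many are hypothetical names made up for illustration, not zarr-python APIs, and asyncio.run is used only to keep the example short:

```python
import asyncio
from typing import Optional


class FakeAsyncStore:
    """Hypothetical in-memory async store; stands in for a real store class."""

    def __init__(self, docs: dict[str, bytes]) -> None:
        self._docs = docs

    async def get(self, key: str) -> Optional[bytes]:
        await asyncio.sleep(0.01)  # simulated I/O latency
        return self._docs.get(key)


class SyncStore:
    """Hypothetical synchronous facade: callers never see the event loop."""

    def __init__(self, store: FakeAsyncStore) -> None:
        self._store = store

    def get_many(self, keys: list[str]) -> dict[str, Optional[bytes]]:
        async def _fetch_all() -> dict[str, Optional[bytes]]:
            # All reads are scheduled at once and awaited together.
            values = await asyncio.gather(*(self._store.get(k) for k in keys))
            return dict(zip(keys, values))

        # asyncio.run raises RuntimeError if this thread already runs an
        # event loop (e.g. in Jupyter); a real API would need a dedicated
        # loop instead. Used here only for brevity.
        return asyncio.run(_fetch_all())
```

With something shaped like this, reading N metadata documents concurrently becomes a single blocking call such as SyncStore(store).get_many(keys), and the caller never touches the loop.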
Context
We're adding async support to xarray's open_datatree (xarray PR #10742) to enable concurrent I/O when opening multi-group zarr stores. Opening groups concurrently through zarr's async APIs gives roughly a 6x speedup on cloud data (8.5s → 1.4s).

During review, @shoyer raised concerns about xarray relying on zarr internal APIs that could break without notice.
What xarray uses from zarr
- zarr.AsyncGroup (in __all__)
- AsyncGroup.members()
- AsyncGroup.open()
- zarr.core.sync.sync() (not in __all__), used in open_datatree()

Why we need zarr.core.sync.sync()
xarray's open_datatree() is synchronous, but we want to use zarr's AsyncGroup APIs internally to open multiple groups concurrently. We use zarr.core.sync.sync() to bridge async→sync: it is the same function zarr uses internally via SyncMixin._sync() to implement its sync Group class.

Without a supported bridge, xarray would need to manage its own event loop, which is error-prone and could conflict with zarr's internal loop management.
Example usage in xarray:
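A stdlib-only sketch of the pattern, for illustration; it is not xarray's or zarr's actual code. It assumes a dedicated background event loop as the bridge, and members, open_group, open_all, and open_datatree_sketch are hypothetical stand-ins for the zarr calls listed above:

```python
import asyncio
import threading
from collections.abc import AsyncGenerator, Coroutine
from typing import Any, TypeVar

T = TypeVar("T")

# Assumed design: one long-lived event loop running in a daemon thread.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()


def sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run `coro` on the background loop and block until it finishes."""
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()


async def members(root: str) -> AsyncGenerator[tuple[str, str], None]:
    """Hypothetical child discovery: yields (name, path) pairs."""
    for name in ("a", "b", "c"):
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield name, f"{root}/{name}"


async def open_group(path: str) -> dict[str, str]:
    """Hypothetical group open; the sleep simulates a network round trip."""
    await asyncio.sleep(0.05)
    return {"path": path}


async def open_all(root: str) -> dict[str, dict[str, str]]:
    paths = [path async for _, path in members(root)]
    # All groups are opened concurrently rather than one after another.
    groups = await asyncio.gather(*(open_group(p) for p in paths))
    return {g["path"]: g for g in groups}


def open_datatree_sketch(root: str) -> dict[str, dict[str, str]]:
    """Synchronous entry point that hides the async work behind sync()."""
    return sync(open_all(root))
```

Because the coroutine runs on a separate thread's loop, the synchronous entry point also works when the caller's own thread already has a running loop, which is the case a plain asyncio.run cannot handle.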
The ask
Could zarr.core.sync.sync() be made public? It would just need to be re-exported in zarr/__init__.py and added to __all__. This seems low-cost since zarr already depends on it extensively internally.

Or is there an alternative supported mechanism for downstream libraries to run async zarr operations from sync contexts?
Can we also confirm that the AsyncGroup.members() return type (AsyncGenerator[tuple[str, AsyncArray | AsyncGroup], None]) is considered stable?
Alternatives we've considered
- Group.open_members(), returning all child groups concurrently: would eliminate the need for xarray to manage concurrency, but more work on zarr's side
References
cc @TomNicholas @shoyer @keewis @dcherian @ilan-gold @d-v-b
We'd love to hear your thoughts, suggestions, or alternative approaches. Any feedback from the zarr core team on the best path forward would be greatly appreciated!