feat: Add NpyCodec for lazy-loading numpy arrays #1331
Open: dimitri-yatsenko wants to merge 9 commits into pre/v2.0 from feature/npy-codec
+1,686
−100
Add migrate_external() and migrate_filepath() to datajoint.migrate module for safe migration of 0.x external storage columns to 2.0 JSON format.

Migration strategy:
1. Add new <column>_v2 columns with JSON type
2. Copy and convert data from old columns
3. User verifies data accessible via DataJoint 2.0
4. Finalize: rename columns (old → _v1, new → original)

This allows 0.x and 2.0 to coexist during migration and provides rollback capability if issues are discovered.

Functions:
- migrate_external(schema, dry_run=True, finalize=False)
- migrate_filepath(schema, dry_run=True, finalize=False)
- _find_external_columns(schema) - detect 0.x external columns
- _find_filepath_columns(schema) - detect 0.x filepath columns

Co-Authored-By: Claude Opus 4.5 <[email protected]>
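The two-phase strategy described in this commit can be sketched as a small statement planner. This is only an illustration: `plan_column_migration` and `migrate_column` are hypothetical names, the SQL is assumed MySQL-style syntax, and the actual `migrate_external()` may work differently. The copy-and-convert step is data-dependent and is left out.

```python
def plan_column_migration(table, column, finalize=False):
    """Return the SQL statements for one phase of the migration.

    Hypothetical helper; not the SQL that datajoint.migrate actually issues.
    """
    new_name = f"{column}_v2"
    old_name = f"{column}_v1"
    if not finalize:
        # Phase 1: add the JSON column next to the old one so that
        # 0.x and 2.0 clients can coexist. Copying data is not shown.
        return [f"ALTER TABLE `{table}` ADD COLUMN `{new_name}` JSON NULL"]
    # Phase 2 (finalize): swap names, keeping the old data as <column>_v1
    # so the change can still be rolled back.
    return [
        f"ALTER TABLE `{table}` RENAME COLUMN `{column}` TO `{old_name}`",
        f"ALTER TABLE `{table}` RENAME COLUMN `{new_name}` TO `{column}`",
    ]


def migrate_column(table, column, execute, dry_run=True, finalize=False):
    """Plan the statements and run them through `execute` unless dry_run is set."""
    statements = plan_column_migration(table, column, finalize)
    if not dry_run:
        for statement in statements:
            execute(statement)
    return statements
```

Defaulting to `dry_run=True`, as the function signatures in the commit do, means a caller sees the planned changes before anything touches the database.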
Implement the `<npy@>` codec for schema-addressed numpy array storage:
- Add SchemaCodec base class for path-addressed storage codecs
- Add NpyRef class for lazy array references with metadata
- Add NpyCodec using .npy format with shape/dtype inspection
- Refactor ObjectCodec to inherit from SchemaCodec
- Rename is_external to is_store throughout codebase
- Export SchemaCodec and NpyRef from public API
- Bump version to 2.0.0a17
Key features:
- Lazy loading: inspect shape/dtype without downloading
- NumPy integration via __array__ protocol
- Safe bulk fetch: returns NpyRef objects, not arrays
- Schema-addressed paths: {schema}/{table}/{pk}/{attr}.npy
Co-Authored-By: Claude Opus 4.5 <[email protected]>
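Lazy shape/dtype inspection is possible because a `.npy` file starts with a short, self-describing header. The sketch below reads only that header using the standard library and wraps it in a hypothetical `LazyArrayRef` class. It is not the PR's `NpyRef`; the `opener` and `loader` callables are assumptions made for illustration.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"


def read_npy_header(fileobj):
    """Return (shape, descr, fortran_order) by reading only the .npy header."""
    prefix = fileobj.read(8)  # 6 magic bytes + major + minor version
    if len(prefix) < 8 or prefix[:6] != MAGIC:
        raise ValueError("not a .npy file")
    major = prefix[6]
    # Format 1.x stores the header length in 2 bytes, later versions in 4.
    if major == 1:
        (header_len,) = struct.unpack("<H", fileobj.read(2))
    else:
        (header_len,) = struct.unpack("<I", fileobj.read(4))
    encoding = "latin1" if major < 3 else "utf-8"
    header = ast.literal_eval(fileobj.read(header_len).decode(encoding))
    return header["shape"], header["descr"], header["fortran_order"]


class LazyArrayRef:
    """Hypothetical stand-in for a lazy reference: metadata first, data on demand."""

    def __init__(self, opener, loader):
        self._opener = opener  # callable returning a binary file object
        self._loader = loader  # callable returning the full array
        self._header = None
        self._array = None

    def _meta(self):
        if self._header is None:
            with self._opener() as f:
                self._header = read_npy_header(f)
        return self._header

    @property
    def shape(self):
        return self._meta()[0]

    @property
    def dtype(self):
        return self._meta()[1]

    @property
    def is_loaded(self):
        return self._array is not None

    def load(self):
        if self._array is None:
            self._array = self._loader()
        return self._array

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this when the object is passed to numpy.asarray().
        array = self.load()
        return array if dtype is None else array.astype(dtype)
```

With NumPy installed, a reference could be built as `LazyArrayRef(lambda: open(path, "rb"), lambda: numpy.load(path))`, after which `numpy.asarray(ref)` triggers the full load only when the data is actually needed.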
8d7c92e to 08d5c6a
The SchemaCodec (used by NpyCodec and ObjectCodec) needs _schema,
_table, _field, and primary key values to construct schema-addressed
storage paths. Previously, key=None was passed, resulting in
"unknown/unknown" paths.
Now builds proper context dict from table metadata and row values,
enabling navigable paths like:
{schema}/{table}/objects/{pk_path}/{attribute}.npy
Co-Authored-By: Claude Opus 4.5 <[email protected]>
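A minimal sketch of how such a path might be assembled from the context follows. `build_store_path` is a hypothetical helper, and the `name=value` encoding of the primary key segments is an assumption; the commit does not specify how `{pk_path}` is formed. Missing context raises an error instead of falling back to an "unknown" placeholder.

```python
from urllib.parse import quote


def build_store_path(context, key, extension=".npy"):
    """Build {schema}/{table}/objects/{pk_path}/{attribute}{extension}.

    `context` carries _schema, _table and _field; `key` maps primary key
    attribute names to values. Both shapes are assumed for illustration.
    """
    missing = [name for name in ("_schema", "_table", "_field") if not context.get(name)]
    if missing:
        raise ValueError(f"incomplete codec context, missing: {missing}")
    # Assumption: one "name=value" segment per primary key attribute,
    # percent-encoded so values cannot introduce extra path separators.
    pk_path = "/".join(
        f"{name}={quote(str(value), safe='')}" for name, value in key.items()
    )
    return (
        f"{context['_schema']}/{context['_table']}/objects/"
        f"{pk_path}/{context['_field']}{extension}"
    )
```

For example, a context for schema `lab`, table `recording`, and field `waveform` with key `{"subject_id": 7}` yields `lab/recording/objects/subject_id=7/waveform.npy`.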
…to feature/npy-codec
Merge PR #1330 (blob preview display) into feature/npy-codec. Bump version from 2.0.0a17 to 2.0.0a18.
Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address reviewer feedback from PR #1330: attr should never be None since field_name comes from heading.names. Raising an error surfaces bugs immediately rather than silently returning a misleading placeholder.
Co-Authored-By: Claude Opus 4.5 <[email protected]>

Support memory-mapped loading for large arrays:
- Local filesystem stores: mmap directly, no download
- Remote stores: download to cache, then mmap
Co-Authored-By: Claude Opus 4.5 <[email protected]>
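The local/remote split described in this commit can be sketched as a path-resolution step that runs before mapping. `local_path_for_mmap` and its `fetch` callable are hypothetical and stand in for whatever the store backend provides; the PR's actual caching logic is not shown on this page.

```python
from pathlib import Path


def local_path_for_mmap(rel_path, store_root=None, cache_dir=None, fetch=None):
    """Return a local file path that can be memory-mapped.

    Local store: pass store_root and the file is used in place, no copy.
    Remote store: pass cache_dir and fetch(rel_path, destination); the file
    is downloaded once and reused on later calls.
    """
    if store_root is not None:
        return Path(store_root) / rel_path
    if cache_dir is None or fetch is None:
        raise ValueError("remote stores need both cache_dir and fetch")
    cached = Path(cache_dir) / rel_path
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        fetch(rel_path, cached)  # hypothetical download callable
    return cached
```

The resulting path can then be handed to `numpy.load(path, mmap_mode="r")`, which maps the file instead of reading it fully into memory.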
Summary
This PR introduces the `<npy@>` codec for schema-addressed numpy array storage with lazy loading capabilities within the Object-Augmented Schema (OAS).

Key Features
- Lazy loading: inspect shape/dtype without downloading
- Memory mapping via the `mmap_mode` parameter
- NumPy integration via the `__array__` protocol
- Safe bulk fetch: returns `NpyRef` objects instead of downloading all arrays
- Portable format: `.npy` files readable by NumPy, MATLAB, etc.
- Schema-addressed paths (`{schema}/{table}/{pk}/{attr}.npy`)

OAS Codec Strategy
The Object-Augmented Schema integrates relational tables with object storage as a single system. Codecs define how Python objects are stored.

Built-in codecs (no extra dependencies):
- `<blob>` / `<blob@>` - DataJoint legacy serialization (hash-addressed)
- `<npy@>` - portable numpy arrays with lazy loading (schema-addressed) ← this PR
- `<object@>` - files/folders (schema-addressed)
- `<hash@>` - raw bytes (hash-addressed)

Separate packages (optional install, additional dependencies):
- `datajoint-zarr` → `<zarr@>` (requires `zarr`)
- `datajoint-parquet` → `<parquet@>` (requires `pyarrow`)
- `datajoint-tiff` → `<tiff@>` (requires `tifffile`)

This keeps datajoint-python lean while allowing users to install codec packages as needed. The `SchemaCodec` base class introduced here enables community-contributed codecs.

OAS Addressing Schemes
- Hash-addressed: `<hash@>`, `<blob@>`, `<attach@>`
- Schema-addressed: `<object@>`, `<npy@>`

Changes
New classes:
- `SchemaCodec` - Abstract base class for schema-addressed codecs
- `NpyRef` - Lazy reference with metadata access (`shape`, `dtype`, `is_loaded`)
- `NpyCodec` - Codec implementation using `.npy` format

Refactoring:
- `ObjectCodec` now inherits from `SchemaCodec` (DRY)
- `is_external` → `is_store` throughout codebase for clarity

API:
- Export `SchemaCodec` and `NpyRef` from `datajoint` public API

Usage Example
[code example not preserved]

Memory Mapping
The `mmap_mode` parameter enables efficient random access to large arrays:
[code example not preserved]

Modes: `'r'` (read-only), `'r+'` (read-write), `'c'` (copy-on-write)

Comparison: `<npy@>` vs `<blob@>`
[comparison table with columns `<npy@>` and `<blob@>`; only the cells `mmap_mode` and `.npy` survive]

Test Plan
- `NpyRef` metadata access
- `NpyRef` mmap_mode (local and remote)
- `NpyCodec` validation

Related
- (`pre/v2.0` branch)
- `how-to/use-npy-codec.md`
- `reference/specs/npy-codec.md`
- `tutorials/ephys-with-npy.ipynb`

🤖 Generated with Claude Code