DataJoint 2.0 #1311
dimitri-yatsenko wants to merge 305 commits into master from pre/v2.0
+31,985
−21,662
Conversation
- Random hash suffix for filenames (URL-safe, filename-safe base64)
- Configurable hash_length setting (default: 8, range: 4-16)
- Upload-first transaction strategy with cleanup on failure
- Batch insert atomicity handling
- Orphaned file detection/cleanup utilities (future)
Key changes:
- Support both files and folders
- Immutability contract: insert, read, delete only
- Deterministic bidirectional path mapping from schema/table/field/PK
- Copy-first insert: copy fails → no DB insert attempted
- DB-first delete: file delete is best-effort (stale files acceptable)
- Fetch returns handle (FileRef), no automatic download
- JSON metadata includes is_folder, file_count for folders
- FileRef class with folder operations (listdir, walk)
Path changes:
- Field name now comes after all primary key attributes
- Groups related files together (all fields for same record in same dir)
Partitioning:
- partition_pattern config promotes PK attributes to path root
- Enables grouping by high-level attributes (subject, experiment)
- Example: {subject_id} moves subject to path start for data locality
- Keep = sign in paths (Hive convention, widely supported)
- Simple types used directly: integers, dates, timestamps, strings
- Conversion to path-safe strings only when necessary:
  - Path-unsafe characters (/, \) get URL-encoded
  - Long strings truncated with hash suffix
  - Binary/complex types hashed
- Orphan cleanup must run during maintenance windows
- Uses transactions/locking to avoid race conditions
- Grace period excludes recently uploaded files (in-flight inserts)
- Dry-run mode for previewing deletions
- attach@store and filepath@store maintained for backward compatibility
- Will be deprecated with migration warnings in future releases
- Eventually removed after transition period
- New pipelines should use file type exclusively
Store metadata (dj-store-meta.json):
- Located at store root with project_name, created, format_version
- Lists schemas using the store
- Created on first file operation
Client verification:
- project_name required in client settings
- Must match store metadata on connect
- Raises DataJointError on mismatch
- Ensures all clients use same configuration
Also renamed hash_length to token_length throughout spec.
- Removed schemas array from dj-store-meta.json
- 1:1 correspondence between database+project_name and store assumed
- DataJoint performs basic project_name verification on connect
- Enforcement is administrative responsibility, not DataJoint's
- Type syntax: `object` instead of `file`
- Class: ObjectRef instead of FileRef
- Module: objectref.py instead of fileref.py
- Pattern: OBJECT matching `object$`
- JSON fields: is_dir, item_count (renamed from is_folder, file_count)
- Consistent with object_storage.* settings namespace
- Aligns with objects/ directory in path structure
Staged Insert (direct write mode):
- stage_object() context manager for writing directly to storage
- StagedObject provides fs, store, full_path for Zarr/xarray
- Cleanup on failure, metadata computed on success
- Avoids copy overhead for large arrays
ObjectRef fsspec accessors:
- fs property: returns fsspec filesystem
- store property: returns FSMap for Zarr/xarray
- full_path property: returns full URI
Updated immutability contract:
- Objects immutable "after finalization"
- Two insert modes: copy (existing data) and staged (direct write)
- Use dedicated staged_insert1 method instead of co-opting insert1
- Add StagedInsert class with rec dict, store(), and open() methods
- Document rationale for separate method (explicit, backward compatible, type safe)
- Add examples for Zarr and multiple object fields
- Note that staged inserts are limited to insert1 (no multi-row)
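As a rough sketch only (the Recording table, the 'trace' field, and the exact StagedInsert signatures are illustrative assumptions, not the final API), a staged insert of a Zarr array might look like:

import numpy as np
import zarr

# staged (direct-write) insert: the array goes straight to object storage, no local copy
with Recording.staged_insert1() as staged:
    staged.rec['recording_id'] = 1                      # fill the record dict
    z = zarr.open(staged.store('trace'), mode='w',      # store() returns an FSMap-like mapping
                  shape=(10_000, 64), dtype='float32')
    z[:] = np.random.randn(10_000, 64)                  # written directly into the store
# on successful exit the row is inserted; on failure the staged files are cleaned up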
- Filename is always {field}_{token}{ext}, no user control over base name
- Extension extracted from source file (copy) or optionally provided (staged)
- Replace `original_name` with `ext` in JSON schema and ObjectRef
- Update path templates, examples, and StagedInsert interface
- Add "Filename Convention" section explaining the design
- Rename store metadata: dj-store-meta.json → datajoint_store.json
- Move objects/ directory after table name in path hierarchy
- Path is now: {schema}/{Table}/objects/{pk_attrs}/{field}_{token}{ext}
- Allows table folders to contain both tabular data and objects
- Update all path examples and JSON samples
- Hash is null by default to avoid performance overhead for large objects
- Optional hash parameter on insert: hash="sha256", "md5", or "xxhash"
- Staged inserts never compute hashes (no local copy to hash from)
- Folders get a manifest file (.manifest.json) with file list and sizes
- Manifest enables integrity verification without content hashing
- Add ObjectRef.verify() method for integrity checking
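A hedged sketch of how this could look in use (the Scan table and 'image' attribute are hypothetical; the hash keyword and verify() come from the list above):

# Opt into content hashing for one insert (hashes are off by default)
Scan.insert1({'scan_id': 1, 'image': '/data/scan_001.tif'}, hash="sha256")

# The fetched value is an ObjectRef handle; verify() checks the hash or the folder manifest
ref = (Scan & 'scan_id = 1').fetch1('image')
assert ref.verify()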
- Enables bidirectional mapping between object stores and databases
- Fields are informational only, not enforced at runtime
- Alternative: admins ensure unique project_name across namespace
- Managed platforms may handle this mapping externally
- Legacy attach@store and filepath@store use hidden ~external_* tables
- New object type stores all metadata inline in JSON column
- Benefits: simpler schema, self-contained records, easier debugging
- No reference counting complexity
- Add fsspec>=2023.1.0 as core dependency
- Add optional dependencies for cloud backends (s3fs, gcsfs, adlfs)
- Create new storage.py module with StorageBackend class
- Unified interface for file, S3, GCS, and Azure storage
- Methods: put_file, get_file, put_buffer, get_buffer, exists, remove
- Refactor ExternalTable to use StorageBackend instead of protocol-specific code
- Replace _upload_file, _download_file, etc. with storage backend calls
- Add storage property, deprecate s3 property
- Update settings.py to support GCS and Azure protocols
- Add deprecation warning to s3.py Folder class
- Module kept for backward compatibility
- Will be removed in a future version
This lays the foundation for the new object type, which will also use fsspec.
This commit adds a new `object` column type that provides managed file/folder storage with fsspec backend integration.
Key features:
- Object type declaration in declare.py (stores as JSON in MySQL)
- ObjectRef class for fetch behavior with fsspec accessors (.fs, .store, .full_path)
- Insert processing for file paths, folder paths, and (ext, stream) tuples
- staged_insert1 context manager for direct writes (Zarr/xarray compatibility)
- Path generation with partition pattern support
- Store metadata file (datajoint_store.json) verification/creation
- Folder manifest files for integrity verification
The object type stores metadata inline (no hidden tables), supports multiple storage backends via fsspec (file, S3, GCS, Azure), and provides ObjectRef handles on fetch with direct storage access.
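A minimal end-to-end sketch under the features listed above (schema name, table, attribute names, and file paths are illustrative assumptions):

import datajoint as dj

schema = dj.Schema('demo_imaging')

@schema
class Scan(dj.Manual):
    definition = """
    scan_id : int
    ---
    raw : object        # managed file/folder storage, metadata kept inline as JSON
    """

# Insert from a local file or folder; the data is copied into the configured store
Scan.insert1({'scan_id': 1, 'raw': '/data/session01/scan.zarr'})

# Fetch returns an ObjectRef handle rather than downloading content
ref = (Scan & 'scan_id = 1').fetch1('raw')
print(ref.full_path)                 # full URI in the store
mapping = ref.store                  # FSMap usable by zarr/xarray
listing = ref.fs.ls(ref.full_path)   # fsspec filesystem access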
Remove unused mimetypes imports from objectref.py and storage.py, remove unused Path import and generate_token from staged_insert.py, and fix f-string without placeholders in objectref.py.
- Create comprehensive object.md page covering configuration, insert, fetch, staged inserts, and integration with Zarr/xarray
- Update attributes.md to list object as a special DataJoint datatype
- Add object_storage configuration section to settings.md
- Add ObjectRef and array library integration section to fetch.md
- Add object attributes and staged_insert1 section to insert.md
Apply ruff formatter changes for consistent code style.
- schema_object.py: Test table definitions for object type
- test_object.py: Comprehensive tests covering:
  - Storage path generation utilities
  - Insert with file, folder, and stream
  - Fetch returning ObjectRef
  - ObjectRef methods (read, open, download, listdir, walk, verify)
  - Staged insert operations
  - Error cases
- conftest.py: Object storage fixtures for testing
Co-authored-by: Davis Bennett <[email protected]>
- Replace config.external tests with stores credential tests
- Update template test to check for stores structure instead of object_storage
- Update get_store_spec tests for new default behavior (None instead of DEFAULT_SUBFOLDING)
- Add tests for default store lookup (store=None)
- Add tests for loading per-store credentials from .secrets/
- Verify partition_pattern and token_length defaults
- Update mock_stores fixture to use config.stores instead of config.object_storage
- Update mock_object_storage fixture to configure stores.default and stores.local
- Remove project_name from object_storage_config (now embedded in location path)
- Simplify fixture by using unified stores API
- Update mock_stores_update fixture to use config.stores
- Remove project_name (now embedded in location path)
- Simplify fixture using unified stores API
- Add validation to prevent filepath paths starting with _hash/ or _schema/
- Update FilepathCodec docstring to clarify reserved sections
- Filepath gives users maximum freedom while protecting DataJoint-managed sections
- Users can organize files anywhere in the store except reserved sections
- Test that filepath rejects paths starting with _hash/
- Test that filepath rejects paths starting with _schema/
- Test that filepath allows all other user-managed paths
- Test filepath codec properties and registration
The 'secure' parameter is only valid for S3 stores, not for the file/GCS/Azure protocols. Move the default setting to the protocol-specific section to avoid validation errors when using file stores.
Allow users to configure custom prefixes for hash-addressed, schema-addressed,
and filepath storage sections per store. This enables mapping DataJoint to
existing storage layouts without restructuring.
Configuration:
- hash_prefix (default: '_hash') - Hash-addressed storage section
- schema_prefix (default: '_schema') - Schema-addressed storage section
- filepath_prefix (default: None) - Optional filepath restriction
Features:
- Validates prefixes don't overlap (mutual exclusion)
- FilepathCodec enforces dynamic reserved prefixes
- Optional filepath_prefix to restrict filepath paths
- Backwards compatible defaults
Examples:
{
  "stores": {
    "legacy": {
      "protocol": "file",
      "location": "/data/existing",
      "hash_prefix": "content_addressed",
      "schema_prefix": "structured_data",
      "filepath_prefix": "raw_files"
    }
  }
}
Changes:
- settings.py: Add prefix fields, validation logic
- builtin_codecs.py: Dynamic prefix checking in FilepathCodec
- test_settings.py: 7 new tests for prefix validation
- test_codecs.py: 2 new tests for custom prefixes
Filepath storage is NOT part of the Object-Augmented Schema - it only
provides references to externally-managed files. Allow separate default
configuration for filepath references vs integrated storage.
Configuration:
- stores.default - for integrated storage (<blob>, <object>, <npy>, <attach>)
- stores.filepath_default - for filepath references (<filepath>)
This allows:
- Integrated storage on S3 or fast filesystem
- Filepath references to acquisition files on NAS or different location
Example:
{
  "stores": {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {
      "protocol": "s3",
      "bucket": "processed-data",
      "location": "lab-project"
    },
    "raw_data": {
      "protocol": "file",
      "location": "/mnt/nas/acquisition"
    }
  }
}
Usage:
- data : <blob> # Uses stores.default (main)
- arrays : <object> # Uses stores.default (main)
- raw : <filepath> # Uses stores.filepath_default (raw_data)
- raw : <filepath@acq> # Explicitly names store (overrides default)
Changes:
- settings.py: Add use_filepath_default parameter to get_store_spec()
- builtin_codecs.py: FilepathCodec uses use_filepath_default=True
- test_settings.py: Add 3 tests for filepath_default behavior
- settings.py: Update template to include filepath_default example
Architectural rationale:
- Hash/schema storage: integrated into OAS, DataJoint manages lifecycle
- Filepath storage: references only, users manage lifecycle
- Different defaults reflect this fundamental distinction
…alidation
All 24 object storage test failures were due to test fixtures not creating the directories they configured. StorageBackend validates that file protocol locations exist, so fixtures must create them.
- conftest.py: Create test_project subdirectory in object_storage_config
- test_update1.py: Create djtest subdirectories in mock_stores_update
Test results: 520 passed, 7 skipped, 0 failures ✓
Add blank line after import statement per PEP 8 style guidelines.
- Removed hardcoded 'objects' directory level from build_object_path()
- Updated path pattern comment to reflect new structure
- Updated all test expectations to match new path format
Previous path: {schema}/{table}/objects/{key}/{file}
New path: {schema}/{table}/{key}/{file}
The 'objects' literal was a legacy remnant intended for future tabular
storage alongside objects. Removing it simplifies the path structure
and aligns with documented behavior.
Verified:
- All test_object.py tests pass (43 tests)
- All test_npy_codec.py tests pass (22 tests)
- All test_hash_storage.py tests pass (14 tests)
- Updated SchemaCodec._build_path() to accept store_name parameter
- _build_path() now retrieves partition_pattern and token_length from store spec
- ObjectCodec and NpyCodec encode methods pass store_name to _build_path
- Enables partitioning configuration like partition_pattern: '{mouse_id}/{session_date}'
This allows organizing storage by experimental structure:
- Without: {schema}/{table}/{mouse_id=X}/{session_date=Y}/...
- With: {mouse_id=X}/{session_date=Y}/{schema}/{table}/...
Partitioning makes storage browsable by subject/session and enables
selective sync/backup of individual subjects or sessions.
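For concreteness, a store entry enabling the layout above might look like the following (the store name and values are illustrative; the keys mirror the stores configuration examples elsewhere in this PR):

# Illustrative stores configuration promoting mouse_id and session_date to the path root
stores_config = {
    "main": {
        "protocol": "file",
        "location": "/data/lab",
        "partition_pattern": "{mouse_id}/{session_date}",
    }
}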
The partition_pattern was not preserving the order of attributes specified in the pattern because it was iterating over a set (unordered). This caused paths like 'neuron_id=0/mouse_id=5/session_date=2017-01-05/...' instead of the expected 'mouse_id=5/session_date=2017-01-05/neuron_id=0/...'.
Changes:
- Extract partition attributes as a list to preserve order
- Keep a set for efficient lookup when filtering remaining PK attributes
- Iterate over the ordered list when building partition path components
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
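The gist of the fix in simplified form (variable names are illustrative, not the actual implementation):

# before: iterating a set loses the order given in the pattern
# partition_attrs = {a for a in pk_attrs if a in pattern_attrs}

# after: an ordered list drives path building; a set is kept only for membership tests
partition_attrs = [a for a in pattern_attrs if a in pk_attrs]   # preserves pattern order
partition_set = set(partition_attrs)
remaining = [a for a in pk_attrs if a not in partition_set]     # PK attrs not promoted to the root

path_parts = [f"{a}={key[a]}" for a in partition_attrs]
path_parts += [f"{a}={key[a]}" for a in remaining]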
Added helper functions for safe 0.14.6 → 2.0 migration using parallel schemas.
New functions in datajoint.migrate:
- create_parallel_schema() - Create _v20 schema copy for testing
- copy_table_data() - Copy data from production to test schema
- compare_query_results() - Validate results match between schemas
- backup_schema() - Create full schema backup before cutover
- restore_schema() - Restore from backup if needed
- verify_schema_v20() - Check if schema is 2.0 compatible
These functions support the parallel schema migration approach, which:
- Keeps production untouched during testing
- Allows unlimited practice runs
- Enables side-by-side validation
- Provides easy rollback (just drop _v20 schemas)
See: datajoint-docs/src/how-to/migrate-to-v20.md
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
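For orientation only (argument names and schema names are assumptions, not the actual signatures), a practice run might chain the helpers like this:

from datajoint import migrate

# 1. Create a _v20 copy of the production schema and fill it with data
migrate.create_parallel_schema('my_pipeline')               # e.g. produces my_pipeline_v20
migrate.copy_table_data('my_pipeline', 'my_pipeline_v20')

# 2. Validate that query results agree between the two schemas
migrate.compare_query_results('my_pipeline', 'my_pipeline_v20')

# 3. Before cutover: back up, verify, and keep a rollback path
migrate.backup_schema('my_pipeline')
migrate.verify_schema_v20('my_pipeline_v20')
# rollback is simply dropping the _v20 schemas or calling migrate.restore_schema(...)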
Added helper function for migrating external storage pointers when copying
production data to _v2 schemas during git branch-based migration.
Function: migrate_external_pointers_v2()
- Converts BINARY(16) UUID → JSON metadata
- Points to existing files (no file copying required)
- Enables access to external data in _v2 test schemas
- Supports deferred external storage migration approach
Use case:
When using git branch workflow (main: 0.14.6, migrate-to-v2: 2.0), this
function allows copied production data to access external storage without
moving the actual blob files until production cutover.
Example:
migrate_external_pointers_v2(
    schema='my_pipeline_v2',
    table='recording',
    attribute='signal',
    source_store='external-raw',
    dest_store='raw',
    copy_files=False  # Keep files in place
)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Remove trailing whitespace from SQL query - Remove unused dest_spec variable - Fix blank line whitespace (auto-fixed by ruff)
Auto-formatted by ruff-format to collapse multi-line function calls
Unified stores configuration with configurable prefixes and filepath_default
Replace deprecated 'external storage' terminology with canonical terms:
- 'object storage' for general concept
- 'in-store storage' for @ modifier specifics
- 'in-table storage' for database storage
Changes:
- builtin_codecs.py: Update BlobCodec, AttachCodec, HashCodec docstrings
  * 'internal/external' → 'in-table/in-store'
  * Update examples and get_dtype() docstrings
- settings.py: Update StoresSettings docstrings
- gc.py: Update module docstring and format_stats()
- expression.py: Update to_dicts() docstring
- heading.py, codecs.py, declare.py: Update internal comments
- migrate.py: Add note explaining use of legacy terminology
Ref: TERMINOLOGY.md, DOCSTRING_TERMINOLOGY_REPORT.md
Replace deprecated SQL-derived terms with accurate DataJoint terminology:
- 'semijoin/antijoin' → 'restriction/anti-restriction'
- Clarify that A & B restricts A (does not join attributes)
Changes in source code comments:
- expression.py:1081: 'antijoin' → 'anti-restriction'
- condition.py:296: '(semijoin/antijoin)' → 'for restriction'
- condition.py:401: '(aka semijoin and antijoin)' → removed
Rationale: In relational algebra, joins combine attributes from both operands. DataJoint's A & B restricts A to matching entities—no attributes from B appear in the result. This is fundamentally restriction, not a join operation.
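To make the distinction concrete (Session and Scan are hypothetical tables that share primary key attributes):

# Restriction: the result has only Session's attributes, limited to matching entities
good_sessions = Session & Scan      # sessions that have at least one scan
empty_sessions = Session - Scan     # anti-restriction: sessions with no scans

# Join: the result combines attributes from both operands
session_scans = Session * Scan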
- List <blob> and <blob@> separately to show both inline and external modes
- List <attach> and <attach@> separately to show both modes
- Change <hash> to <hash@> (external only)
- Change <object> to <object@> (external only)
- Clarify storage mode for each codec variant
- Also corrected hash algorithm from SHA256 to MD5
This makes it clear which codecs support dual modes vs external-only.
Clarify dual-mode codecs in builtin_codecs docstring
Summary
DataJoint 2.0 is a major release that modernizes the entire codebase while maintaining backward compatibility for core functionality. This release focuses on extensibility, type safety, and developer experience.
Planning: DataJoint 2.0 Plan | Milestone 2.0
Major Features
Codec System (Extensible Types)
Replaces the adapter system with a modern, composable codec architecture:
- Built-in codecs: <blob>, <json>, <attach>, <filepath>, <object>, <hash>, <npy>
- Codecs compose (e.g., <blob> wraps <json> for external storage)
- Custom codecs register via __init_subclass__
- validate() method for type checking before insert
(a declaration sketch follows below)
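As a rough illustration (table, attribute, and store names are assumptions, and the exact declaration grammar may differ), codec types appear directly in table definitions, consistent with the store-qualified examples in the commit messages above:

import datajoint as dj

schema = dj.Schema('demo')

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int32
    ---
    meta  : <json>          # JSON codec
    trace : <blob>          # blob codec
    raw   : <blob@main>     # blob codec, stored in the 'main' store
    movie : <object@raw>    # named object storage (e.g., a Zarr folder) in the 'raw' store
    """

Semantic Matching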
Attribute lineage tracking ensures joins only match semantically compatible attributes:
- Prevents accidental matches on coincidentally named attributes such as id or name
- semantic_check=False restores the legacy permissive behavior
Primary Key Rules
Rigorous primary key propagation through all operators:
- dj.U('attr') creates ad-hoc grouping entities
AutoPopulate 2.0 (Jobs System)
Per-table job management with enhanced tracking:
- Hidden _job_timestamp and _job_duration columns
- Per-table ~table_name job table
- table.progress() returns (remaining, total)
Modern Fetch & Insert API
New fetch methods:
- to_dicts() - List of dictionaries
- to_pandas() - DataFrame with PK as index
- to_arrays(*attrs) - NumPy arrays (structured or individual)
- keys() - Primary keys only
- fetch1() - Single row
Insert improvements:
- validate() - Check rows before inserting
- chunk_size - Batch large inserts
- insert_dataframe() - DataFrame with index handling
(a usage sketch follows below)
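A minimal sketch of the new fetch and insert calls, assuming a Session table already defined on a schema (the table, attribute names, and exact keyword arguments are illustrative):

# Fetch: pick the representation you need
rows = Session.to_dicts()                      # list of dictionaries
frame = Session.to_pandas()                    # DataFrame with the primary key as index
starts, durations = Session.to_arrays('session_start', 'duration')
keys = Session.keys()                          # primary keys only
one = (Session & 'session_id = 1').fetch1()    # exactly one row

# Insert: validate first, then insert in batches
new_rows = [dict(session_id=i, duration=1.5) for i in range(10_000)]
Session.validate(new_rows)                     # check rows before inserting
Session.insert(new_rows, chunk_size=1000)      # batch large inserts
Session.insert_dataframe(frame)                # DataFrame insert with index handling

Type Aliases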
Core DataJoint types for portability:
- int8, int16, int32, int64
- uint8, uint16, uint32, uint64
- float32, float64
- bool
- uuid
Object Storage
Content-addressed and object storage types:
- <hash> - Content-addressed storage with deduplication
- <object> - Named object storage (Zarr, folders)
- <npy> - NumPy arrays as .npy files
- <filepath> - Reference to managed files
- <attach> - File attachments (uploaded on insert)
Virtual Schema Infrastructure (#1307)
New schema introspection API for exploring existing databases:
- Schema.get_table(name) - Direct table access with auto tier prefix detection
- Schema['TableName'] - Bracket notation access
- for table in schema - Iterate tables in dependency order
- 'TableName' in schema - Check table existence
- dj.virtual_schema() - Clean entry point for accessing schemas
- dj.VirtualModule() - Virtual modules with custom names
(a short example follows below)
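For illustration only (schema and table names are hypothetical, and exact signatures may differ), exploring an existing database with the introspection API might look like this:

import datajoint as dj

schema = dj.Schema('lab_pipeline')           # an existing schema on the server

has_session = 'Session' in schema            # check table existence
Session = schema['Session']                  # bracket notation access
Subject = schema.get_table('subject')        # direct access with auto tier prefix detection

for table in schema:                         # iterate tables in dependency order
    print(table)

lab = dj.virtual_schema('lab_pipeline')      # or expose the whole schema as a virtual module

CLI Improvements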
The dj command-line interface for interactive exploration:
- dj -s schema:alias - Load schemas as virtual modules
- --host, --user, --password - Connection options
- Avoids the -h conflict with --help
Settings Modernization
Pydantic-based configuration with validation:
- dj.config.override() context manager
- Per-store credentials (loaded from .secrets/)
- Environment variable support (DJ_HOST, etc.)
Migration Utilities
Helper functions for migrating from 0.14.x to 2.0:
- analyze_blob_columns() - Identify columns needing type markers
- migrate_blob_columns() - Add :<blob>: prefixes to column comments
- check_migration_status() - Verify migration readiness
- add_job_metadata_columns() - Add hidden job tracking columns
License Change
Changed from LGPL to Apache 2.0 license (#1235 (discussion)):
Breaking Changes
Removed Support
- fetch() with format parameter
- create_virtual_module (use dj.virtual_schema() or dj.VirtualModule())
- ~log table (IMPR: Deprecate and Remove the ~log Table. #1298)
Removed API Components
- dj.key - Use table.keys() instead
- dj.key_hash() - Removed (was for legacy job debugging)
- dj.schema() - Use dj.Schema() (capitalized)
- dj.ERD() - Use dj.Diagram()
- dj.Di() - Use dj.Diagram()
API Changes
- fetch() → to_dicts(), to_pandas(), to_arrays()
- fetch(format='frame') → to_pandas()
- fetch(as_dict=True) → to_dicts()
- safemode → prompt (the config['safemode'] setting remains and controls the default behavior)
Semantic Changes
Documentation
Developer Documentation (this repo)
Comprehensive updates in docs/.
User Documentation (datajoint-docs)
Full documentation site following the Diátaxis framework:
Tutorials (learning-oriented, Jupyter notebooks):
How-To Guides (task-oriented):
Reference (specifications):
Project Structure
- src/ layout for proper packaging (IMPR: src layout #1267)
Test Plan
Closes
Milestone 2.0 Issues
- ~log Table. #1298 - Deprecate and remove ~log table
- super.delete kwargs to Part.delete #1276 - Part.delete kwargs pass-through
- src layout #1267 - src layout
- dj.Top orders the preview with order_by #1242 - dj.Top orders the preview with order_by
Bug Fixes
- pyarrow (a pandas dependency) #1202 - DataJoint import error with missing pyarrow
- ValueError in DataJoint-Python 0.14.3 when using numpy 2.2.* #1201 - ValueError with numpy 2.2
- dj.Diagram() and new release of pydot==3.0.* #1169 - Error with dj.Diagram() and pydot 3.0
Improvements
Related PRs
Migration Guide
See How to Migrate from 1.x for detailed migration instructions.
🤖 Generated with Claude Code