feature/add-url-md5-sha256 #206

@bwalsh

ADR: Alternate Object Identifier Strategy for Git LFS Without Content SHA256

Status

Proposed


Use Case: Managing Remote-Only Large Files Without Content Hashing

Title

Enable Git LFS workflows for remote-only or very large files without requiring SHA256 content hashing during git add.


Primary Actor

Research data steward / data engineer using Git LFS integrated with DRS.


Scenario

A research team maintains large genomic files (e.g., BAM, CRAM, FASTQ) that are:

  • Already stored in an object store (S3 / Ceph / GCS)
  • Registered in DRS
  • Multi-GB or TB in size
  • Not always locally downloaded

The user wants to:

  • Reference these files in a Git repository
  • Track them via Git LFS
  • Maintain reproducibility and metadata linkage
  • Avoid computing SHA256 hashes locally (too slow for multi-GB or TB files, and impossible when the file has not been downloaded)

User Story

As a research data steward managing large, remote DRS-registered files,
I want to add files to a Git LFS repository without computing a full content SHA256 hash,
So that I can efficiently reference remote objects while maintaining compatibility with Git LFS and DRS workflows.


Functional Expectations

  • During git add, the clean filter:

    • Does not require downloading or hashing full file contents.
    • Generates a stable alternate object identifier.
  • During git lfs push:

    • Remote existence checks use DRS or metadata services.
  • During git checkout:

    • Files are resolved via DRS ID.
  • No additional metadata files are committed to Git.

  • Integrity and deduplication are delegated to DRS.
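
One possible shape for the "stable alternate object identifier" is sketched below. It is an illustration only, not the proposed implementation: it assumes the identifier is a SHA256 digest of a small canonical metadata string (remote URL, size, and an MD5 already known to the object store), so that the resulting pointer still carries a 64-hex "oid sha256:" value and stays readable by standard Git LFS tooling. The function names, the choice of fields, and the canonical string layout are all hypothetical.

```python
import hashlib

# Version line used by standard Git LFS pointer files.
LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def alternate_oid(url: str, size: int, md5_hex: str) -> str:
    """Derive a 64-hex identifier from metadata instead of file contents.

    Assumption: url, size and md5 are already known (for example from the
    object store or a DRS record), so no file bytes are read here.
    """
    # Hypothetical canonical form: fixed key order, lower-cased MD5.
    canonical = f"url={url}\nsize={size}\nmd5={md5_hex.lower()}\n"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lfs_pointer(oid_hex: str, size: int) -> str:
    """Render a pointer in the standard layout: version, oid, size."""
    return f"version {LFS_SPEC_VERSION}\noid sha256:{oid_hex}\nsize {size}\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    oid = alternate_oid(
        "s3://example-bucket/sample.bam",
        123456789,
        "0123456789abcdef0123456789abcdef",
    )
    print(lfs_pointer(oid, 123456789), end="")
```

Because the digest covers metadata rather than content, the same inputs always yield the same identifier, but the value no longer verifies file bytes. That matches the expectation above that integrity and deduplication are delegated to DRS.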


Acceptance Criteria

  • User can run git add on remote-managed large files without content hashing delays.
  • Repository remains fully compatible with Git LFS commands.
  • git lfs push does not attempt redundant uploads when DRS already contains the object.
  • git checkout correctly restores files via DRS resolution.
  • No extra files beyond standard LFS pointers are committed.
  • The workflow handles multi-GB files without significant local CPU or I/O overhead.
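
The "no redundant uploads" criterion could be checked along the following lines. This is a sketch under stated assumptions: the DRS server exposes the standard GA4GH path /ga4gh/drs/v1/objects/{object_id}, a 200 response means the object is already registered, and a 404 means it is not. How a pointer's OID maps to a DRS object ID is not specified in this issue, so object_id is simply passed in. The names needs_upload and get_status are hypothetical, and the HTTP call is injected so the logic can be exercised without a network.

```python
from typing import Callable
from urllib.parse import quote


def drs_object_url(base_url: str, object_id: str) -> str:
    """Build the GA4GH DRS v1 lookup URL for one object."""
    return f"{base_url.rstrip('/')}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


def needs_upload(
    object_id: str,
    base_url: str,
    get_status: Callable[[str], int],
) -> bool:
    """Return True only when DRS reports the object as missing.

    get_status is a caller-supplied function that performs the GET request
    and returns the HTTP status code (hypothetical hook for this sketch).
    """
    status = get_status(drs_object_url(base_url, object_id))
    if status == 200:
        return False  # already registered: skip the upload
    if status == 404:
        return True   # unknown to DRS: upload is required
    # Anything else (auth failure, server error) should not be guessed at.
    raise RuntimeError(f"unexpected DRS status {status} for {object_id}")


if __name__ == "__main__":
    # Stubbed status function standing in for a real HTTP client.
    print(needs_upload("example-id", "https://drs.example.org", lambda url: 200))
```

Treating unexpected status codes as errors, rather than as "missing", avoids re-uploading multi-GB objects when the metadata service is merely unavailable.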

Business / Architectural Value

  • Eliminates expensive SHA256 operations on large files.
  • Enables remote-first, metadata-addressable architecture.
  • Aligns Git workflows with DRS and Indexd object identity.
  • Supports scalable genomics and bioinformatics data management.
