Architectural Note: Operating on Git LFS Repositories with Thousands of Files
1. Context and Problem Statement
In large projects, it’s common for a Git repository to track thousands to hundreds of thousands of files via Git LFS. Typical use cases:
- A research study with many samples (VCFs, BAMs, images, etc.)
- A data lake-ish repo where each commit adds more LFS pointers
- Monorepos that aggregate multiple datasets or experiments
In these cases, standard Git LFS introspection commands become painfully slow. A concrete example:
git lfs ls-files --json
On a repo with thousands of LFS pointers, this can take several minutes. That’s a non-starter for:
- Interactive CLI tools
- Editor/IDE integrations
- CI/CD steps that run frequently
This note describes architectural patterns to avoid global enumeration and keep operations fast and predictable as your LFS population grows.
2. Why git lfs ls-files is Slow in Large Repos
Conceptually, git lfs ls-files must:
- Walk the Git index / working tree to identify LFS-tracked files.
- For each file, resolve and hydrate metadata (pointer, OID, size, etc.).
- Optionally serialize to JSON.
Even if the LFS objects are local, this is O(N) over every matching file visible to the command. When N = 10,000+, you’re essentially asking Git + Git LFS to do a full scan and re-derive information that:
- Doesn’t change very often, and
- Could be cached or maintained elsewhere.
From an architecture perspective, the problem is:
We’re using git lfs ls-files as a query engine and index, when it’s really just a dumb enumerator over the current state.
3. Design Goals
For a repository with many LFS objects, we want:
- Predictable latency: Operations that touch “all LFS files” should be rare and explicit; routine commands should be sub-second, even as the repo grows.
- Incremental updates: Avoid full scans of N files when only a handful are new or changed.
- Subset operations by default: Most tasks only need a subset (by path, tag, type, or commit range), not the full universe.
- Separation of metadata from Git internals: Use Git (and Git LFS) as the transport and integrity layer, not as a full-featured metadata store.
4. Core Architectural Pattern: External LFS Metadata Index
Instead of deriving everything on demand from git lfs ls-files, maintain a separate index of LFS metadata that is:
- Versioned alongside the repo (e.g., tracked TSV/JSON),
- Derived incrementally from Git/LFS events, and
- Fast to query (path lookup, OID lookup, tags, etc.).
4.1. Example: META/lfs_index.tsv
A simple pattern:
- Maintain a tracked file such as META/lfs_index.tsv with columns like:
  path        oid_sha256  size   tags    logical_id
  data/a.bam  1a2b3c...   12345  tumor   sample:XYZ
  data/b.bam  4d5e6f...   67890  normal  sample:ABC
- This TSV becomes your primary, fast, queryable index, not git lfs ls-files.
Pros:
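- Fast query by path via grep / awk / Python / SQL: a scan of one small text file, or a constant-time lookup once the rows are loaded into a dictionary or an indexed SQLite table.
- Easy to join with other metadata tables (specimens, assays, etc.).
- Can be regenerated in a controlled, explicit operation (like make rebuild-index).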
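One way to get constant-time lookup by path is to load the TSV into a dictionary once and query that. The sketch below assumes the five columns shown in the example; the sample rows and the load_index helper are hypothetical and only illustrate the idea.

```python
import csv
import io


def load_index(tsv_text):
    """Parse index TSV text into a dict keyed by path."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["path"]: row for row in reader}


# Hypothetical index contents mirroring the example columns.
SAMPLE = (
    "path\toid_sha256\tsize\ttags\tlogical_id\n"
    "data/a.bam\t1a2b3c\t12345\ttumor\tsample:XYZ\n"
    "data/b.bam\t4d5e6f\t67890\tnormal\tsample:ABC\n"
)

index = load_index(SAMPLE)
row = index["data/a.bam"]  # dictionary lookup, no scan over the file
print(row["oid_sha256"], row["size"], row["tags"])
```

In a real tool the text would come from reading META/lfs_index.tsv; the parsing cost is paid once per process, after which each lookup is a hash probe.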
4.2. How to Keep It Up-to-Date
You don’t want manual edits. Use automation on “add” paths, for example a pre-commit hook: for newly staged LFS pointer files, update the index before commit.
This shifts expensive work into the write path where it is amortized and expected, and keeps the read path (queries) fast.
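A minimal sketch of the write-path update, assuming the index is held in memory as a dict keyed by path. It parses the text of a Git LFS pointer (the three-line version / oid / size format) and refreshes the matching row. The helper names are hypothetical; a hook could plausibly obtain the pointer text for a staged file with git show :<path>, but that wiring is left out here.

```python
def parse_lfs_pointer(text):
    """Return (oid_sha256, size) for Git LFS pointer text, or None otherwise."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    version = fields.get("version", "")
    oid = fields.get("oid", "")
    size = fields.get("size", "")
    if not version.startswith("https://git-lfs.github.com/spec/"):
        return None
    if not oid.startswith("sha256:") or not size.isdigit():
        return None
    return oid[len("sha256:"):], int(size)


def upsert_row(index, path, pointer_text):
    """Insert or refresh the row for path, keeping any existing tags/logical_id."""
    parsed = parse_lfs_pointer(pointer_text)
    if parsed is None:
        return False  # not an LFS pointer: leave the index untouched
    oid, size = parsed
    row = index.setdefault(path, {"path": path, "tags": "", "logical_id": ""})
    row["oid_sha256"] = oid
    row["size"] = str(size)
    return True


# Hypothetical staged pointer content, for illustration only.
POINTER = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "ab" * 32 + "\n"
    "size 12345\n"
)

index = {}
upsert_row(index, "data/a.bam", POINTER)
print(index["data/a.bam"])
```

Because only the staged files are parsed, the cost per commit scales with the number of changed files rather than with the total LFS population.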
5. Avoiding git lfs ls-files in Common Operations
5.1. Don’t use ls-files as your data plane
Refactor any tools that currently:
git lfs ls-files --json | jq ...
to instead read from your external index (TSV/JSON/SQLite). For example:
# Old, slow (assumes the top-level "files" array that git lfs ls-files --json emits):
git lfs ls-files --json | jq '.files[] | select(.name | test("\\.vcf$"))'
# New, fast:
awk -F'\t' '$1 ~ /\.vcf$/ {print $0}' META/lfs_index.tsv
or in Python:
import csv

# Stream the index and act on compressed VCFs only.
with open("META/lfs_index.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["path"].endswith(".vcf.gz"):
            ...
5.2. Use ls-files only for rare “rebuild index” operations
When you first introduce the index, you may need a one-time or occasional rebuild:
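# --all walks the full history, so older versions of objects are listed too;
# drop it to index only what is visible at the current HEAD
git lfs ls-files --all --json > /tmp/lfs_files.json
# transform into META/lfs_index.tsv
This can take minutes in huge repos, and that’s fine, as long as it is rare and documented as a heavy operation (like npm install, docker build, etc.).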
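The transform step could look roughly like the sketch below. It assumes the JSON has a top-level "files" list whose entries carry "name", "oid", and "size" keys; verify this against the output of your Git LFS version. The tags and logical_id columns are left empty because ls-files knows nothing about them. The json_to_tsv name and the sample input are hypothetical.

```python
import csv
import io
import json

COLUMNS = ["path", "oid_sha256", "size", "tags", "logical_id"]


def json_to_tsv(ls_files_json):
    """Convert ls-files style JSON text into index TSV text, sorted by path."""
    data = json.loads(ls_files_json)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for entry in sorted(data["files"], key=lambda e: e["name"]):
        # tags and logical_id must be filled in from another source
        writer.writerow([entry["name"], entry["oid"], entry["size"], "", ""])
    return out.getvalue()


# Hypothetical input shaped like the assumed JSON layout.
SAMPLE_JSON = json.dumps({
    "files": [
        {"name": "data/b.bam", "oid": "4d5e6f", "size": 67890},
        {"name": "data/a.bam", "oid": "1a2b3c", "size": 12345},
    ]
})

tsv = json_to_tsv(SAMPLE_JSON)
print(tsv, end="")
```

Sorting by path keeps the regenerated file stable across rebuilds, which makes diffs of the tracked index easier to review.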
6. Subset-First Design: Operate on Paths, Tags, or Commits
If you must derive state from Git directly, design your commands to start with a subset, not the full repo.
6.1. Path-based subsets
For example, instead of:
# Scans entire repo
git lfs ls-files --json
use:
# Only data under a project or cohort
git lfs ls-files --include "data/StudyX/**" --json
and structure your tooling around the concept of project subtrees (data/studyA/, data/studyB/, etc.) so most operations are scoped.
6.2. Commit-range subsets
For incremental workflows (ETL, indexing, sync), use git to find changed files:
git diff --name-only <old-commit> <new-commit> \
  | git check-attr --stdin filter \
  | awk -F': ' '$3 == "lfs" {print $1}'  # check-attr prints "<path>: filter: <value>"; or similar
Then only examine LFS metadata for changed files, merging that into your external index.
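If the filtering is done in Python rather than awk, a sketch like the following may help. It assumes the default text output of git check-attr, one line per path in the form "<path>: filter: <value>"; the lfs_paths helper and the sample text are hypothetical. Paths with unusual characters may be quoted by Git in this format, so a production tool would probably prefer the NUL-separated -z output.

```python
def lfs_paths(check_attr_output):
    """Return the paths whose filter attribute is lfs."""
    paths = []
    for line in check_attr_output.splitlines():
        # Split from the right so a path that itself contains ": " survives.
        parts = line.rsplit(": ", 2)
        if len(parts) == 3 and parts[1] == "filter" and parts[2] == "lfs":
            paths.append(parts[0])
    return paths


# Hypothetical check-attr output for three changed files.
SAMPLE_OUTPUT = (
    "data/a.bam: filter: lfs\n"
    "README.md: filter: unspecified\n"
    "data/notes: draft.vcf.gz: filter: lfs\n"
)

changed_lfs = lfs_paths(SAMPLE_OUTPUT)
print(changed_lfs)
```

Only the returned paths need their pointer metadata re-read and merged into the external index.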
7. Caching and Incremental Computation
If you really want a view like the output of git lfs ls-files --json, you can implement your own cached snapshot:
- Keep a file like .cache/lfs_snapshot.json keyed by commit hash (HEAD).
- On invocation:
  - If HEAD has not changed, just read the cache.
  - If HEAD changed, compute the diff from the last snapshot and patch the cached JSON.
This means you only pay full-scan costs when the diff is large, and usually pay a small, incremental cost.
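The cache logic can be sketched as a pure function, assuming the snapshot is a dict of the form {"head": ..., "files": {path: metadata}}. The changes_since callback is a hypothetical hook that returns the updated entries and the removed paths between two commits; how it derives them (for example from git diff) is outside this sketch, and on a first run with no cached head it would have to do a full scan.

```python
def refresh_snapshot(cache, head, changes_since):
    """Return a snapshot valid for head, patching the cache only if HEAD moved."""
    if cache.get("head") == head:
        return cache  # cache hit: nothing to recompute
    updated, removed = changes_since(cache.get("head"), head)
    files = dict(cache.get("files", {}))  # copy so the old snapshot stays intact
    for path in removed:
        files.pop(path, None)
    files.update(updated)
    return {"head": head, "files": files}


# Hypothetical diff provider: one file added, one removed.
def fake_changes(old_head, new_head):
    return {"data/c.bam": {"oid": "777", "size": 1}}, ["data/a.bam"]


cache = {
    "head": "aaa",
    "files": {
        "data/a.bam": {"oid": "1a2b3c", "size": 12345},
        "data/b.bam": {"oid": "4d5e6f", "size": 67890},
    },
}

same = refresh_snapshot(cache, "aaa", fake_changes)
patched = refresh_snapshot(cache, "bbb", fake_changes)
print(sorted(patched["files"]))
```

Serializing the returned dict back to .cache/lfs_snapshot.json after each refresh keeps the next invocation on the fast path.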
8. CI/CD Considerations
In CI, naive patterns like:
- run: git lfs ls-files --json | jq ...
will slow your builds significantly once the LFS population grows.
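Better patterns:
- For linting or validation: read META/*.tsv and cross-check with a small sample of pointers.
- For publishing or sync steps: use git diff between the last deployed commit and the current one to identify only the LFS files that changed.
- For health checks: run git lfs ls-files in a scheduled job to verify repo consistency, rather than doing it on every push.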
9. Git + LFS as Transport, Not Primary Index
The underlying architectural theme:
- Git is an excellent tool for content addressing, branching, merging, and history.
- Git LFS is an excellent tool for large object transport and storage.
Neither is optimized as a high-level metadata query system for tens of thousands of objects.
So:
- Let Git/LFS handle integrity and distribution.
- Let a simple, explicit index (TSV/JSON/SQLite, or an external service like Indexd) handle queries, tags, and relationships.
You can always rebuild your index from Git LFS if needed, but you shouldn’t be doing that implicitly on every command.
10. Practical Recommendations / Checklist
When you notice git lfs ls-files --json taking minutes:
- Audit your tools
  - Search for any use of git lfs ls-files in scripts, CI configs, and CLIs.
  - Replace them with operations over an external index.
- Introduce a canonical LFS index
  - Add META/lfs_index.tsv (or similar) to the repo.
  - Define columns: path, oid_sha256, size, tags, logical_id, etc.
  - Commit it and treat it as the primary query surface.
- Automate index maintenance
  - Add a wrapper command around git add, or a pre-commit hook, that updates the index for newly staged files.
  - Provide a “heavy” rebuild-lfs-index command that users run explicitly when necessary.
- Scope operations by default
  - Design new commands to accept --path, --tag, --study, or --since <commit> flags.
  - Document that global “scan everything” commands are expensive and should be infrequent.
- Use CI wisely
  - Only operate on changed LFS files between commits.
  - Reserve full LFS integrity checks for scheduled jobs, not every PR.