Architectural Note: Operating on Git LFS Repositories with Thousands of Files



# Architectural Note: Operating on Git LFS Repositories with Thousands of Files

## 1. Context and Problem Statement

In large projects, it’s common for a Git repository to track **thousands to hundreds of thousands** of files via Git LFS. Typical use cases:

* A research study with many samples (VCFs, BAMs, images, etc.)
* A data lake-ish repo where each commit adds more LFS pointers
* Monorepos that aggregate multiple datasets or experiments

In these cases, standard Git LFS introspection commands become **painfully slow**. A concrete example:

```bash
git lfs ls-files --json
```

On a repo with thousands of LFS pointers, this can take **several minutes**. That’s a non-starter for:

* Interactive CLI tools
* Editor/IDE integrations
* CI/CD steps that run frequently

This note describes architectural patterns to **avoid global enumeration** and keep operations fast and predictable as your LFS population grows.

---

## 2. Why `git lfs ls-files` is Slow in Large Repos

Conceptually, `git lfs ls-files` must:

1. Walk the Git index / working tree to identify LFS-tracked files.
2. For each file, resolve and hydrate metadata (pointer, OID, size, etc.).
3. Optionally serialize to JSON.

Even if the LFS objects are local, this is **O(N)** over every matching file visible to the command. When N = 10,000+, you’re essentially asking Git + Git LFS to do a full scan and re-derive information that:

* Doesn’t change very often, and
* Could be cached or maintained elsewhere.

From an architecture perspective, the problem is:

> We’re using `git lfs ls-files` as a **query engine and index**, when it’s really just a **dumb enumerator** over the current state.

---

## 3. Design Goals

For a repository with many LFS objects, we want:

1. **Predictable latency**
   Operations that touch “all LFS files” should be rare and explicit; routine commands should be sub-second, even as the repo grows.

2. **Incremental updates**
   Avoid full scans of N files when only a handful are new or changed.

3. **Subset operations by default**
   Most tasks only need a **subset** (by path, tag, type, or commit range), not the full universe.

4. **Separation of metadata from Git internals**
   Use Git (and Git LFS) as the *transport and integrity layer*, not as a full-featured metadata store.

---

## 4. Core Architectural Pattern: External LFS Metadata Index

Instead of deriving everything on demand from `git lfs ls-files`, maintain a **separate index** of LFS metadata that is:

* **Versioned** alongside the repo (e.g., tracked TSV/JSON),
* **Derived incrementally** from Git/LFS events, and
* **Fast to query** (path lookup, OID lookup, tags, etc.).

### 4.1. Example: `META/lfs_index.tsv`

A simple pattern:

* Maintain a tracked file such as `META/lfs_index.tsv` with columns like:

  ```text
  path    oid_sha256                             size    tags    logical_id
  data/a.bam  1a2b3c...                          12345   tumor   sample:XYZ
  data/b.bam  4d5e6f...                          67890   normal  sample:ABC
  ```

* This TSV becomes your **primary, fast, queryable index**, *not* `git lfs ls-files`.

Pros:

* Constant-time query by path via grep / awk / Python / SQL.
* Easy to join with other metadata tables (specimens, assays, etc.).
* Can be regenerated in a controlled, explicit operation (like `make rebuild-index`).

### 4.2. How to Keep It Up-to-Date

You don’t want manual edits. Use **automation on “add” paths**:

*   use a **pre-commit hook**:

  * For newly staged LFS pointer files, update the index before commit.

This shifts expensive work into the **write path** where it is amortized and expected, and keeps the **read path** (queries) fast.

---

## 5. Avoiding `git lfs ls-files` in Common Operations

### 5.1. Don’t use `ls-files` as your data plane

Refactor any tools that currently:

```bash
git lfs ls-files --json | jq ...
```

to instead read from your **external index** (TSV/JSON/SQLite). For example:

```bash
# Old, slow:
git lfs ls-files --json | jq '.[] | select(.name|test("VCF$"))'

# New, fast:
awk -F'\t' '$1 ~ /\.vcf$/ {print $0}' META/lfs_index.tsv
```

or in Python:

```python
import csv

with open("META/lfs_index.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["path"].endswith(".vcf.gz"):
            ...
```

### 5.2. Use `ls-files` only for rare “rebuild index” operations

When you first introduce the index, you may need a **one-time or occasional** rebuild:

```bash
git lfs ls-files --all --json > /tmp/lfs_files.json
# transform into META/lfs_index.tsv
```

This can take minutes in huge repos—and that’s fine, *as long as it is rare* and documented as a heavy operation (like `npm install`, `docker build`, etc.).

---

## 6. Subset-First Design: Operate on Paths, Tags, or Commits

If you must derive state from Git directly, design your commands to **start with a subset**, not the full repo.

### 6.1. Path-based subsets

For example, instead of:

```bash
# Scans entire repo
git lfs ls-files --json
```

use:

```bash
# Only data under a project or cohort
git lfs ls-files --include "data/StudyX/**" --json
```

and structure your tooling around the concept of **project subtrees** (`data/studyA/`, `data/studyB/`, etc.) so most operations are scoped.

### 6.2. Commit-range subsets

For incremental workflows (ETL, indexing, sync), use git to find changed files:

```bash
git diff --name-only <old-commit> <new-commit> \
  | git check-attr --stdin filter \
  | awk '$2 == "lfs"' # or similar
```

Then only examine LFS metadata for **changed files**, merging that into your external index.

---

## 7. Caching and Incremental Computation

If you really want a “`git lfs ls-files --json`-like view,” you can implement your own **cached snapshot**:

1. Keep a file like `.cache/lfs_snapshot.json` keyed by commit hash (`HEAD`).
2. On invocation:

   * If `HEAD` has not changed, just read the cache.
   * If `HEAD` changed, compute the diff from the last snapshot and patch the cached JSON.

This means you only pay full-scan costs **when the diff is large**, and usually pay a small, incremental cost.

---

## 8. CI/CD Considerations

In CI, naive patterns like:

```yaml
- run: git lfs ls-files --json | jq ...
```

will slow your builds significantly once the LFS population grows.

Better patterns:

* For **linting** or **validation**:

  * Operate on `META/*.tsv` and cross-check with a small sample of pointers.
* For **publishing** or **sync** steps:

  * Use `git diff` between the last deployed commit and current one to identify only the LFS files that changed.
* For **health checks**:

  * Schedule a periodic “heavy” job (nightly or weekly) that runs `git lfs ls-files` to verify repo consistency, rather than doing it on every push.

---

## 9. Git + LFS as Transport, Not Primary Index

The underlying architectural theme:

* **Git** is an excellent tool for content addressing, branching, merging, and history.
* **Git LFS** is an excellent tool for large object transport and storage.

Neither is optimized as a **high-level metadata query system** for tens of thousands of objects.

So:

* Let Git/LFS handle **integrity** and **distribution**.
* Let a simple, explicit index (TSV/JSON/SQLite, or an external service like Indexd) handle **queries**, **tags**, and **relationships**.

You can always **rebuild** your index from Git LFS if needed, but you shouldn’t be doing that implicitly on every command.

---

## 10. Practical Recommendations / Checklist

When you notice `git lfs ls-files --json` taking minutes:

1. **Audit your tools**

   * Search for any use of `git lfs ls-files` in scripts, CI configs, and CLIs.
   * Replace them with operations over an **external index**.

2. **Introduce a canonical LFS index**

   * Add `META/lfs_index.tsv` (or similar) to the repo.
   * Define columns: `path`, `oid_sha256`, `size`, `tags`, `logical_id`, etc.
   * Commit it and treat it as the primary query surface.

3. **Automate index maintenance**

   * Add a wrapper command or pre-commit hook that updates the index on `git add`.
   * Provide a “heavy” `rebuild-lfs-index` command that users run explicitly when necessary.

4. **Scope operations by default**

   * Design new commands to accept `--path`, `--tag`, `--study`, or `--since <commit>` flags.
   * Document that global “scan everything” commands are expensive and should be infrequent.

5. **Use CI wisely**

   * Only operate on changed LFS files between commits.
   * Reserve full LFS integrity checks for scheduled jobs, not every PR.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architectural Note: Operating on Git LFS Repositories with Thousands of Files #85

Architectural Note: Operating on Git LFS Repositories with Thousands of Files

1. Context and Problem Statement

2. Why `git lfs ls-files` is Slow in Large Repos

3. Design Goals

4. Core Architectural Pattern: External LFS Metadata Index

4.1. Example: `META/lfs_index.tsv`

4.2. How to Keep It Up-to-Date

5. Avoiding `git lfs ls-files` in Common Operations

5.1. Don’t use `ls-files` as your data plane

5.2. Use `ls-files` only for rare “rebuild index” operations

6. Subset-First Design: Operate on Paths, Tags, or Commits

6.1. Path-based subsets

6.2. Commit-range subsets

7. Caching and Incremental Computation

8. CI/CD Considerations

9. Git + LFS as Transport, Not Primary Index

10. Practical Recommendations / Checklist

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Architectural Note: Operating on Git LFS Repositories with Thousands of Files #85

Description

Architectural Note: Operating on Git LFS Repositories with Thousands of Files

1. Context and Problem Statement

2. Why git lfs ls-files is Slow in Large Repos

3. Design Goals

4. Core Architectural Pattern: External LFS Metadata Index

4.1. Example: META/lfs_index.tsv

4.2. How to Keep It Up-to-Date

5. Avoiding git lfs ls-files in Common Operations

5.1. Don’t use ls-files as your data plane

5.2. Use ls-files only for rare “rebuild index” operations

6. Subset-First Design: Operate on Paths, Tags, or Commits

6.1. Path-based subsets

6.2. Commit-range subsets

7. Caching and Incremental Computation

8. CI/CD Considerations

9. Git + LFS as Transport, Not Primary Index

10. Practical Recommendations / Checklist

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

2. Why `git lfs ls-files` is Slow in Large Repos

4.1. Example: `META/lfs_index.tsv`

5. Avoiding `git lfs ls-files` in Common Operations

5.1. Don’t use `ls-files` as your data plane

5.2. Use `ls-files` only for rare “rebuild index” operations