Make DataLoader tuning knobs configurable via env (#138) by mihow · Pull Request #139 · RolnickLab/ami-data-companion

mihow · 2026-04-10T17:29:21Z

Summary

Implements Direction 2 from issue #138 — exposes the previously-hardcoded DataLoader and worker loop values as pydantic Settings fields, with current values as defaults so there is no behavior change on upgrade.

Also documents `AMI_ANTENNA_API_BATCH_SIZE` (the existing but undocumented knob that appears to be the biggest lever on peak ingest memory) in `.env.example` and the README, and adds a request-and-memory flow diagram to the `datasets.py` module docstring.

This PR is deliberately no behavior change by default. Every new setting defaults to its current hardcoded value. Operators who want to tune just set env vars.

New settings

Env var	Default	Previously hardcoded at
`AMI_ANTENNA_API_DATALOADER_PIN_MEMORY`	`true`	`datasets.py::get_rest_dataloader`
`AMI_ANTENNA_API_DATALOADER_PREFETCH_FACTOR`	`4`	`datasets.py::get_rest_dataloader`
`AMI_ANTENNA_API_DATALOADER_DOWNLOAD_THREADS`	`8`	`datasets.py::RESTDataset._ensure_sessions`
`AMI_ANTENNA_MAX_PENDING_POSTS`	`5`	`worker.py::MAX_PENDING_POSTS`
`AMI_ANTENNA_IDLE_SLEEP_SECONDS`	`5`	`worker.py::SLEEP_TIME_SECONDS`

The existing module-level `MAX_PENDING_POSTS` and `SLEEP_TIME_SECONDS` constants in `worker.py` are kept as-is for backward compatibility with any external imports; the runtime code now reads from settings.

Documentation changes

`.env.example`: adds commented-out entries for every new knob with tradeoff notes. Adds a memory-scaling formula and rough per-worker-RAM guidance for `AMI_ANTENNA_API_BATCH_SIZE`. Calls out explicitly that `AMI_LOCALIZATION_BATCH_SIZE` / `AMI_CLASSIFICATION_BATCH_SIZE` do not limit ingest memory — a common confusion.
`README.md`: adds a "Tuning for memory-constrained or large-image workloads" section under "Running the Antenna Worker" with the peak-RAM formula, per-RAM-size starting points, and a deployment note about the supervisor/systemd soft FD limit (the default 1024 is too low for the worker's real FD working set and causes a confusing `OSError: [Errno 24]` at startup).
`datasets.py` module docstring: adds an ASCII flow diagram showing the per-subprocess request → download → tensor → prefetch queue path and where each env var applies. Also corrects a couple of stale values in the existing quick-reference (default `antenna_api_batch_size` is 24, not 16; `prefetch_factor` is explicitly set to 4, not "PyTorch default").

What's NOT in this PR

Intentionally scoped small:

Default changes — I did not change any default value. If we want `pin_memory=False` to be the new default for HTTP-sourced workloads (which the analysis in Clarify and document tuning parameters #138 suggests is probably correct), that's a separate PR after a small-image throughput regression test.
The tuning command (Direction 3 in Clarify and document tuning parameters #138) — a separate small feature.
Probe-based auto-tune (Direction 4 in Clarify and document tuning parameters #138) — a larger refactor of `_process_job` to rebuild the DataLoader per job based on a probe of the first few tasks.
Disk-based prefetch (Direction 5 in Clarify and document tuning parameters #138) — an architectural refactor that deserves its own design discussion.

Testing

`AMI_ANTENNA_API_DATALOADER_PIN_MEMORY=false`, `AMI_ANTENNA_API_DATALOADER_PREFETCH_FACTOR=1`, and `AMI_ANTENNA_API_BATCH_SIZE=4` were applied as a hot patch on a test worker processing a job with ~914 source images at ~24 MB per JPEG. Worker ran stably at ~13 GB RSS on a 68 GB instance, ~2 s/image, across multiple consecutive jobs. Without any of those changes the same worker OOM-killed at ~68 GB in ~40 seconds.
Pre-commit hooks pass (black, isort, flake8, autoflake, trailing whitespace, EOF).
Not yet done: isolated ablation — the one deployment had all three changes at once, so we haven't proven that `AMI_ANTENNA_API_BATCH_SIZE=4` alone is sufficient. See the "what we still need to verify" section of Clarify and document tuning parameters #138. Once this PR lands, that test is just setting one env var and becomes trivial.

Base branch

This PR targets `feature/uv-migration` because that's the branch currently used in our deployment. Once `feature/uv-migration` merges to main, GitHub will auto-retarget this PR.

Closes none of #138 (this is one direction of several — the issue stays open).

Follow-up TODOs (before merge or as separate issues)

Isolated ablation to prove the config-only fix is sufficient. With this PR deployed, set AMI_ANTENNA_API_BATCH_SIZE=8 only (leaving AMI_ANTENNA_API_DATALOADER_PIN_MEMORY at its default true and PREFETCH_FACTOR at 4) on a small-RAM worker, and run a large-image job. If it completes without OOM, the minimal fix is just the one env var and we can recommend that as the universal guidance. If it OOMs, we know pin_memory=false is actually load-bearing and should probably become the new default for HTTP-sourced workloads (separate PR). Cheap to run: one env edit and a worker restart.
Small-image regression test. Run a typical ~1.5 MB JPEG job with AMI_ANTENNA_API_DATALOADER_PIN_MEMORY=false and compare throughput against the current defaults. If throughput is within noise, pin_memory=false is probably the right new default for this worker.
Memray / tracemalloc profile of a large-image job to replace the estimated numbers in the tables with measured ones. Would also catch any residual leak on top of the steady-state peak.
Once feature/uv-migration merges to main, re-target this PR.

Expose the previously-hardcoded values in get_rest_dataloader() and worker.py as pydantic Settings fields, with current values as defaults so there is no behavior change on upgrade. Adds: AMI_ANTENNA_API_DATALOADER_PIN_MEMORY (default: true) AMI_ANTENNA_API_DATALOADER_PREFETCH_FACTOR (default: 4) AMI_ANTENNA_API_DATALOADER_DOWNLOAD_THREADS (default: 8) AMI_ANTENNA_MAX_PENDING_POSTS (default: 5) AMI_ANTENNA_IDLE_SLEEP_SECONDS (default: 5) Also: - Document AMI_ANTENNA_API_BATCH_SIZE (the biggest lever on peak ingest RAM) in .env.example with the memory-scaling formula and guidance for picking a value per worker size. - Add a "Tuning for memory-constrained or large-image workloads" section to the README, plus a deployment note about raising the supervisor / systemd soft FD limit to 65536. - Add a request-and-memory flow diagram to the datasets.py module docstring so operators and future maintainers can see where peak RAM actually comes from. - Correct stale values in the existing datasets.py docstring (default batch size is 24, not 16; prefetch_factor is 4, not PyTorch default). This is Direction 2 from the discussion in #138 — a no-behavior-change refactor that makes the tuning knobs reachable without source edits and sets up the work in Direction 3 (a tuning command) and Direction 4 (probe-based per-job auto-tune). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-10T17:29:30Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7223179c-3f52-4df0-98a7-7005d0e6b90c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/tuning-knobs-138

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

This was referenced Apr 11, 2026

Update progress while queuing NATS tasks RolnickLab/antenna#1173

Open

PSv2: Async jobs hang forever when NATS tasks exhaust max_deliver without posting results RolnickLab/antenna#1168

Open

mihow mentioned this pull request Apr 29, 2026

DataLoader pipe FD leak: ~10k pipes/day accumulate, worker hits EMFILE silently after ~6 days uptime #145

Open

This was referenced May 20, 2026

DataLoader subprocess deadlock: MainThread hangs forever on queue.get after shared-memory unlink failure #140

Open

fix(dataloader): switch to forkserver context + add timeout + explicit teardown (closes #140, partial #145) #148

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make DataLoader tuning knobs configurable via env (#138)#139

Make DataLoader tuning knobs configurable via env (#138)#139
mihow wants to merge 1 commit into
feature/uv-migrationfrom
feature/tuning-knobs-138

mihow commented Apr 10, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Apr 10, 2026

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mihow commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New settings

Documentation changes

What's NOT in this PR

Testing

Base branch

Follow-up TODOs (before merge or as separate issues)

Uh oh!

coderabbitai Bot commented Apr 10, 2026

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mihow commented Apr 10, 2026 •

edited

Loading