Make DataLoader tuning knobs configurable via env (#138)#139
Draft
mihow wants to merge 1 commit into
Draft
Conversation
Expose the previously-hardcoded values in get_rest_dataloader() and worker.py as pydantic Settings fields, with current values as defaults so there is no behavior change on upgrade. Adds: AMI_ANTENNA_API_DATALOADER_PIN_MEMORY (default: true) AMI_ANTENNA_API_DATALOADER_PREFETCH_FACTOR (default: 4) AMI_ANTENNA_API_DATALOADER_DOWNLOAD_THREADS (default: 8) AMI_ANTENNA_MAX_PENDING_POSTS (default: 5) AMI_ANTENNA_IDLE_SLEEP_SECONDS (default: 5) Also: - Document AMI_ANTENNA_API_BATCH_SIZE (the biggest lever on peak ingest RAM) in .env.example with the memory-scaling formula and guidance for picking a value per worker size. - Add a "Tuning for memory-constrained or large-image workloads" section to the README, plus a deployment note about raising the supervisor / systemd soft FD limit to 65536. - Add a request-and-memory flow diagram to the datasets.py module docstring so operators and future maintainers can see where peak RAM actually comes from. - Correct stale values in the existing datasets.py docstring (default batch size is 24, not 16; prefetch_factor is 4, not PyTorch default). This is Direction 2 from the discussion in #138 — a no-behavior-change refactor that makes the tuning knobs reachable without source edits and sets up the work in Direction 3 (a tuning command) and Direction 4 (probe-based per-job auto-tune). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements Direction 2 from issue #138 — exposes the previously-hardcoded DataLoader and worker loop values as pydantic Settings fields, with current values as defaults so there is no behavior change on upgrade.
Also documents `AMI_ANTENNA_API_BATCH_SIZE` (the existing but undocumented knob that appears to be the biggest lever on peak ingest memory) in `.env.example` and the README, and adds a request-and-memory flow diagram to the `datasets.py` module docstring.
This PR is deliberately no behavior change by default. Every new setting defaults to its current hardcoded value. Operators who want to tune just set env vars.
New settings
The existing module-level `MAX_PENDING_POSTS` and `SLEEP_TIME_SECONDS` constants in `worker.py` are kept as-is for backward compatibility with any external imports; the runtime code now reads from settings.
Documentation changes
What's NOT in this PR
Intentionally scoped small:
Testing
Base branch
This PR targets `feature/uv-migration` because that's the branch currently used in our deployment. Once `feature/uv-migration` merges to main, GitHub will auto-retarget this PR.
Closes none of #138 (this is one direction of several — the issue stays open).
Follow-up TODOs (before merge or as separate issues)
AMI_ANTENNA_API_BATCH_SIZE=8only (leavingAMI_ANTENNA_API_DATALOADER_PIN_MEMORYat its defaulttrueandPREFETCH_FACTORat4) on a small-RAM worker, and run a large-image job. If it completes without OOM, the minimal fix is just the one env var and we can recommend that as the universal guidance. If it OOMs, we knowpin_memory=falseis actually load-bearing and should probably become the new default for HTTP-sourced workloads (separate PR). Cheap to run: one env edit and a worker restart.AMI_ANTENNA_API_DATALOADER_PIN_MEMORY=falseand compare throughput against the current defaults. If throughput is within noise,pin_memory=falseis probably the right new default for this worker.feature/uv-migrationmerges to main, re-target this PR.