DataLoader pipe FD leak: ~10k pipes/day accumulate, worker hits EMFILE silently after ~6 days uptime

After about a week of uptime, the ML worker silently stops processing jobs even though supervisor still reports it as `RUNNING` — the process is alive and polling on schedule, but every outbound HTTP call fails internally because it has run out of file descriptors. The cause is an anonymous-pipe leak in the DataLoader subprocess lifecycle that accumulates roughly 10k pipe FDs per day per worker, regardless of batch size or any of the existing tuning knobs.

## Summary

The `ami worker` process leaks anonymous pipe file descriptors at a steady ~10k/day per process. After ~6 days of uptime on a worker with a `nofile` rlimit of 65536, every call that needs a new FD (`os.pipe()` at DataLoader subprocess startup, `subprocess.Popen`, outbound sockets, etc.) fails with `OSError(24, 'Too many open files')`. Supervisor keeps reporting the worker as **`RUNNING`** (no FATAL, no autorestart, no crash), but every API poll fails with a `urllib3` `SSLError` wrapping the `EMFILE`, and the poll loop then logs "No jobs found".

This is distinct from #140 (DataLoader subprocess deadlock — synchronous hang) and from #138 (steady-state ingest RAM peak). Same upstream module (`trapdata/antenna/datasets.py` DataLoader lifecycle), different leaked resource. The mitigations from #138 (`AMI_ANTENNA_API_DATALOADER_PIN_MEMORY=false`, smaller batch sizes) did **not** slow this leak.

## Symptom (silent failure)

- Worker `RUNNING` in supervisor, uptime steadily climbs
- Every poll cycle logs:
  ```
  [error] Failed to fetch jobs from <api>: HTTPSConnectionPool(...): Max retries exceeded with url: /api/v2/jobs?... (Caused by SSLError(OSError(24, 'Too many open files')))
  [info ] [GPU 0] No jobs found, sleeping for 5 seconds
  ```
- The deployment's Django backend sees zero requests from the worker
- Async ML jobs queued via NATS sit at `num_pending > 0, num_ack_pending = 0, num_redelivered = 0` — same NATS signature as a stale auth token, but the worker logs show `EMFILE` not `401`
- The deployment-side `jobs_health_check` marks the job as `FAILURE` after the idle cutoff
- User-visible result: "ML job did nothing"

## Root cause (FD type breakdown)

Snapshot taken on a worker process immediately before restart, after ~6 days uptime:

```
$ sudo ls -l /proc/<PID>/fd | awk '{print $11}' | sed -E 's|/[0-9]+$||;s|\[[0-9]+\]|[N]|g' | sort | uniq -c | sort -rn | head
  65501 pipe:[N]
     19 /dev/nvidia0
      4 anon_inode:[eventfd]
      4 /dev/nvidia-uvm
      2 /dev/nvidiactl
      1 socket:[N]
      1 /dev/urandom
      1 /dev/nvidia-caps/nvidia-cap4
      1 /dev/nvidia-caps/nvidia-cap3
      1 /dev/nvidia-caps/nvidia-cap2
      1
```

Effectively all of the leaked FDs are anonymous **pipes**, not sockets, files, eventfds, or shared-memory inodes. The most plausible source of pipes at this scale is `multiprocessing` IPC channels (DataLoader subprocess control/data pipes) whose parent-side ends are not closed when the subprocess exits.

This is consistent with `AMI_NUM_WORKERS=1` (one DataLoader subprocess spawned per batch) leaking ~3 pipe FDs per batch over the worker's lifetime: ~65500 pipe FDs / ~3 per batch ≈ 21800 batches. At our observed throughput of roughly 1 batch every 25 s, that is ~6.3 days of continuous processing, which matches the observed uptime at exhaustion and works out to ~10k leaked FDs per day.
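The estimate above can be re-derived in a few lines. This is only a back-of-envelope check using the approximate figures quoted in this issue; the constant names are illustrative, and the 3-FDs-per-batch figure is an assumption inferred from the snapshot, not a measurement.

```python
# Back-of-envelope check of the leak arithmetic quoted in this issue.
# All inputs are approximate figures from the report, not new measurements.
LEAKED_PIPE_FDS = 65_500   # pipe FDs in the snapshot at exhaustion
FDS_PER_BATCH = 3          # assumed leak per batch (inferred, not measured)
SECONDS_PER_BATCH = 25     # observed throughput: ~1 batch every 25 s
SECONDS_PER_DAY = 86_400

batches = LEAKED_PIPE_FDS / FDS_PER_BATCH
days_to_exhaustion = batches * SECONDS_PER_BATCH / SECONDS_PER_DAY
fds_per_day = FDS_PER_BATCH * SECONDS_PER_DAY / SECONDS_PER_BATCH

print(f"batches until exhaustion: {batches:,.0f}")             # 21,833
print(f"days until exhaustion:    {days_to_exhaustion:.1f}")   # 6.3
print(f"leaked FDs per day:       {fds_per_day:,.0f}")         # 10,368
```

If the real per-batch leak differs from 3, the time to exhaustion scales inversely with it, so measuring the FD delta across a single batch would pin this down quickly.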

The leak is **at steady state**, not bursty. Mitigations that lower per-batch RAM (PIN_MEMORY=false, smaller batch sizes) don't help because the pipe count doesn't scale with batch size — it scales with batch *count*.

## Why this isn't covered by #140

#140 describes a synchronous deadlock where the main thread blocks forever on `multiprocessing.Queue.get()` after a `shm_unlink` failure in the subprocess. That worker is hung: its log goes silent immediately and the log file's mtime goes stale.

This bug is the opposite shape. The worker keeps polling on schedule (logs are timestamped, mtime is fresh), but every poll fails with `EMFILE`. The DataLoader subprocess in this case has long since exited cleanly; what remains are the parent's ends of its pipes, which the parent never closes.

A fix for #140 (DataLoader timeout, persistent_workers, etc.) wouldn't address this. Both bugs share an upstream cause (DataLoader / `multiprocessing` lifecycle hygiene) but the failure modes and recovery paths are different.

## Why this isn't (only) covered by #138

#138 mentions `OSError(24, 'Too many open files')` from `multiprocessing.Pipe`'s `os.pipe()` call at DataLoader startup, and concludes:

> Fixed out-of-band by raising the soft FD limit at the supervisor or systemd level. **Not an ADC bug**

The new evidence here shows the opposite: at `nofile=65536`, the legitimate steady-state working set is ~50 FDs, not 1024 or 65536. The exhaustion at 65k after 6 days is an unbounded leak, not a working-set ceiling. Raising the rlimit buys time (~6 days at 65k vs ~hours at 1024) but doesn't fix the underlying bug.

Two clean datapoints from the same incident:

| `nofile` rlimit | Time to exhaustion |
|---|---|
| 1024 (one worker box, missed provisioning step) | ~hours |
| 65536 (other worker boxes) | ~6 days |

Both eventually hit `EMFILE`. The leak rate is roughly the same in both cases: at ~10k FDs/day, a 1024 limit lasts about 2.5 hours and a 65536 limit lasts about 6 days.

## Reproduction

Not yet reproduced on demand. Observed on three production GPU workers simultaneously after ~6 days of continuous polling + intermittent ML processing. Each was running:

- `ami-data-companion` with the env vars from #138 applied (`PIN_MEMORY=false`, `BATCH_SIZE=8`/`16`, `NUM_WORKERS=1`, `LOCALIZATION_BATCH_SIZE=4`, `CLASSIFICATION_BATCH_SIZE=150`)
- torch 2.10.0+cu128 in the venv
- `python` (CPython, version per `pyproject.toml`)

To reproduce in development:

1. Start an `ami worker` against a deployment with no pending tasks (so the worker only polls, never processes)
2. Periodically snapshot `ls /proc/<PID>/fd | wc -l`
3. Compare to the `pipe:[N]` count via `ls -l /proc/<PID>/fd | grep -c 'pipe:'`

If the polling-only mode also leaks pipes, that narrows the cause further (probably points at the polling/HTTP path rather than the DataLoader path). If it doesn't, an active job is required to trigger the leak.
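Steps 2 and 3 can be automated with a small sampler. The following is a minimal sketch, assuming Linux `/proc` and permission to read the target process's FD directory (root is needed for another user's process). `fd_breakdown` and `watch` are hypothetical helper names, not part of `ami-data-companion`.

```python
import os
import time
from collections import Counter


def fd_breakdown(pid="self"):
    """Count a process's open FDs by kind, read from /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # FD was closed between listdir() and readlink()
        # "pipe:[12345]" -> "pipe", "socket:[678]" -> "socket"; plain paths stay as-is
        kinds[target.split(":[", 1)[0]] += 1
    return kinds


def watch(pid, interval_s=60.0):
    """Print total and pipe FD counts until interrupted (Ctrl-C)."""
    while True:
        kinds = fd_breakdown(pid)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        print(f"{stamp} pid={pid} total={sum(kinds.values())} pipe={kinds['pipe']}", flush=True)
        time.sleep(interval_s)


# Example (hypothetical PID): watch(12345, interval_s=300)
```

Running `watch` against a polling-only worker and against one processing jobs, then comparing the slope of the `pipe=` column, would answer the polling-versus-processing question directly.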

## Suggested fix directions (to discuss, not pre-decided)

Listed roughly in increasing effort.

1. **Track which call sites open pipes** — instrument `os.pipe` (e.g. via `unittest.mock` / a small wrapper) for one production-style test run and log every caller. Likely candidates: `multiprocessing.Pipe`, `multiprocessing.connection.Pipe`, `subprocess.Popen` with pipes, `concurrent.futures.ProcessPoolExecutor` queues.

2. **Explicit DataLoader teardown between batches** — at the end of each `_process_job` iteration:
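   ```python
   import gc  # at module top

   del loader    # drop the last reference to the per-batch DataLoader
   gc.collect()  # collect any reference cycles still holding it
   ```
   This should finalize the loader and its worker iterator, whose teardown shuts down and joins the worker subprocesses. It may be enough on its own if the leak is just GC lag, but it does nothing if another live reference (for example a stored iterator) keeps the loader's workers alive.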

3. **`persistent_workers=True` + a long-lived DataLoader per worker process** — keeps a fixed pool of DataLoader subprocesses alive, avoids the per-batch spawn/teardown churn that's leaking pipes. Has its own tradeoffs (needs `num_workers > 0`, less responsive to changing `prefetch_factor`).

4. **Audit any `subprocess.Popen` calls in the worker code path** — especially anything that doesn't pass `stdin=DEVNULL, stdout=PIPE, stderr=PIPE` and explicitly close the pipe FDs after read. `popen2`-style usages are easy to leak from.

5. **Switch DataLoader `multiprocessing_context`** from default (likely `fork`) to `spawn` or `forkserver`. Worth a try on a reproducer.

6. **Worker self-restart on RSS / FD soft cap**: defense in depth. Sample `resource.getrusage` for RSS (note that `ru_maxrss` reports the peak, not the current value) and `len(os.listdir('/proc/self/fd'))` for FDs every N batches; exit cleanly when over a fraction of the rlimit and let supervisor restart the worker. This doesn't fix the root bug, but it turns a silent dark worker into a recoverable restart.
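As an illustration of direction 6, the FD half of the check could look roughly like this. It is a sketch under assumptions: Linux `/proc`, hypothetical function names, and an arbitrary 50% threshold that is not taken from the worker code.

```python
import os
import resource


def fd_usage_fraction():
    """Fraction of the soft RLIMIT_NOFILE this process is using (Linux only)."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return 0.0
    # listdir() briefly holds one FD for the directory itself,
    # so this count is one higher than the steady-state value.
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds / soft_limit


def should_restart(max_fraction=0.5):
    """True once FD usage reaches the (hypothetical) threshold."""
    return fd_usage_fraction() >= max_fraction
```

The worker loop would call `should_restart()` every N batches and exit when it returns `True`. Whether supervisor then restarts the process depends on its `autorestart` and `exitcodes` settings, so the exit status has to be one that the deployed config treats as restartable.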

## Workaround (operational)

`sudo supervisorctl restart ami-antenna-worker:*` clears the FDs. Resumes polling immediately. Worker is good for another ~6 days at `nofile=65536`. Not a fix.

## Diagnostic recipe (for any future "worker dark, no errors" report)

```bash
# Find the ami worker PIDs
pgrep -f "ami-data-companion.*ami worker"

# Per-process FD count + rlimit
for pid in $(pgrep -f "ami-data-companion.*ami worker"); do
  echo "pid=$pid fds=$(sudo ls /proc/$pid/fd | wc -l) limit=$(sudo awk '/Max open files/ {print $4}' /proc/$pid/limits)"
done

# FD type breakdown — leak signature is dominant `pipe:[N]`
sudo ls -l /proc/<PID>/fd | awk '{print $11}' | sed -E 's|/[0-9]+$||;s|\[[0-9]+\]|[N]|g' | sort | uniq -c | sort -rn | head

# Recent worker logs — leak symptom is the urllib3 EMFILE
sudo supervisorctl tail -1000 ami-antenna-worker:ami-antenna-worker_00 stdout | grep -E "Too many open files|EMFILE" | tail
```

## Related

- #140 — DataLoader subprocess deadlock (different failure mode, same upstream module)
- #138 — Tuning parameters / RAM peak (mitigations from there don't slow this leak; FD limit was misclassified as ops-only)
- #118 (closed) — prior RSS leak

## What we still need to verify

- Is the leak triggered by polling alone, or only by active job processing? (run polling-only repro)
- Does Direction 2 (`del loader; gc.collect()`) alone stop the leak? (cheapest first try)
- Is there any non-DataLoader call site contributing? (instrument `os.pipe`)
- Does a torch upgrade past 2.10.0 already fix it upstream? (check torch changelog for `multiprocessing` cleanup fixes)
- Cleanup latency — does the leak rate change between idle and active workloads? (suggests whether it's per-batch or per-poll)

