DataLoader pipe FD leak: ~10k pipes/day accumulate, worker hits EMFILE silently after ~6 days uptime #145

@mihow

After about a week of uptime, the ML worker silently stops processing jobs even though supervisor still reports it as RUNNING — the process is alive and polling on schedule, but every outbound HTTP call fails internally because it has run out of file descriptors. The cause is an anonymous-pipe leak in the DataLoader subprocess lifecycle that accumulates roughly 10k pipe FDs per day per worker, regardless of batch size or any of the existing tuning knobs.

Summary

The ami worker process leaks anonymous pipe file descriptors at a sustained rate (~10k/day per process) at steady state. After ~6 days of uptime on a worker with nofile rlimit 65536, every os.pipe() call (DataLoader subprocess startup, subprocess.Popen, etc.) fails with OSError(24, 'Too many open files'). The worker process keeps RUNNING in supervisor — no FATAL, no autorestart, no crash — but every API poll silently fails with a urllib3 SSLError wrapping the EMFILE.

This is distinct from #140 (DataLoader subprocess deadlock — synchronous hang) and from #138 (steady-state ingest RAM peak). Same upstream module (trapdata/antenna/datasets.py DataLoader lifecycle), different leaked resource. The mitigations from #138 (AMI_ANTENNA_API_DATALOADER_PIN_MEMORY=false, smaller batch sizes) did not slow this leak.

Symptom (silent failure)

  • Worker RUNNING in supervisor, uptime steadily climbs
  • Every poll cycle logs:
    [error] Failed to fetch jobs from <api>: HTTPSConnectionPool(...): Max retries exceeded with url: /api/v2/jobs?... (Caused by SSLError(OSError(24, 'Too many open files')))
    [info ] [GPU 0] No jobs found, sleeping for 5 seconds
    
  • The deployment's Django backend sees zero requests from the worker
  • Async ML jobs queued via NATS sit at num_pending > 0, num_ack_pending = 0, num_redelivered = 0 — same NATS signature as a stale auth token, but the worker logs show EMFILE not 401
  • The deployment-side jobs_health_check marks the job as FAILURE after the idle cutoff
  • User-visible result: "ML job did nothing"

Root cause (FD type breakdown)

Snapshot taken on a worker process immediately before restart, after ~6 days uptime:

$ sudo ls -l /proc/<PID>/fd | awk '{print $11}' | sed -E 's|/[0-9]+$||;s|\[[0-9]+\]|[N]|g' | sort | uniq -c | sort -rn | head
  65501 pipe:[N]
     19 /dev/nvidia0
      4 anon_inode:[eventfd]
      4 /dev/nvidia-uvm
      2 /dev/nvidiactl
      1 socket:[N]
      1 /dev/urandom
      1 /dev/nvidia-caps/nvidia-cap4
      1 /dev/nvidia-caps/nvidia-cap3
      1 /dev/nvidia-caps/nvidia-cap2
      1

Effectively all of the leaked FDs are anonymous pipes, not sockets, files, eventfds, or shared-memory inodes. Pipes at this scale come from multiprocessing IPC channels — DataLoader subprocess control/data pipes — that are not closed when the subprocess exits.

This is consistent with AMI_NUM_WORKERS=1 (one DataLoader subprocess spawned per batch) leaking ~3 pipes per batch over the worker's lifetime: ~65500 pipes / ~3 per batch ≈ 21800 batches. At our observed throughput of roughly 1 batch every 25 s, that's ~6 days of continuous processing — which matches the observed uptime at exhaustion.
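The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the constants are the approximate figures quoted in this issue (the ~3 pipes per batch is itself an inferred value, not a measurement), and the function names are made up for illustration.

```python
# Back-of-envelope check of the leak-rate estimate.
# All constants are approximations taken from this issue, not measured values.
PIPES_PER_BATCH = 3        # assumed leak per batch (inferred, not measured)
SECONDS_PER_BATCH = 25     # rough observed throughput
SECONDS_PER_DAY = 86_400


def pipes_per_day(pipes_per_batch=PIPES_PER_BATCH, seconds_per_batch=SECONDS_PER_BATCH):
    """Leaked pipe FDs per day at a steady batch rate."""
    return SECONDS_PER_DAY / seconds_per_batch * pipes_per_batch


def days_to_exhaustion(fd_budget, pipes_per_batch=PIPES_PER_BATCH,
                       seconds_per_batch=SECONDS_PER_BATCH):
    """Days until fd_budget FDs are consumed, ignoring the ~50 FD baseline."""
    batches = fd_budget / pipes_per_batch
    return batches * seconds_per_batch / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"leak rate: {pipes_per_day():.0f} pipes/day")                    # about 10.4k
    print(f"65500 FDs: {days_to_exhaustion(65_500):.1f} days")              # about 6.3
    print(f"1024 FDs:  {days_to_exhaustion(1_024) * 24:.1f} hours")         # about 2.4
```

Under these assumptions the same leak rate accounts for both the ~6 day exhaustion at nofile=65536 and the ~hours exhaustion at nofile=1024.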

The leak is at steady state, not bursty. Mitigations that lower per-batch RAM (PIN_MEMORY=false, smaller batch sizes) don't help because the pipe count doesn't scale with batch size — it scales with batch count.

Why this isn't covered by #140

#140 describes a synchronous deadlock where the main thread blocks forever on multiprocessing.Queue.get() after a shm_unlink failure in the subprocess. That worker is hung — log goes silent immediately, last_log mtime becomes stale.

This bug has the opposite shape. The worker keeps polling on schedule (logs are timestamped, mtime is fresh), but every poll fails with EMFILE. The DataLoader subprocess in this case has long since exited cleanly, but its pipe FDs stay open in the parent, and the parent never closes them.

A fix for #140 (DataLoader timeout, persistent_workers, etc.) wouldn't address this. Both bugs share an upstream cause (DataLoader / multiprocessing lifecycle hygiene) but the failure modes and recovery paths are different.

Why this isn't (only) covered by #138

#138 mentions OSError(24, 'Too many open files') from multiprocessing.Pipe's os.pipe() call at DataLoader startup, and concludes:

"Fixed out-of-band by raising the soft FD limit at the supervisor or systemd level. Not an ADC bug"

The new evidence here shows the opposite: at nofile=65536, the legitimate steady-state working set is ~50 FDs, not 1024 or 65536. The exhaustion at 65k after 6 days is an unbounded leak, not a working-set ceiling. Raising the rlimit buys time (~6 days at 65k vs ~hours at 1024) but doesn't fix the underlying bug.

Two clean datapoints from the same incident:

| nofile rlimit | Time to exhaustion |
| --- | --- |
| 1024 (one worker box, missed provisioning step) | ~hours |
| 65536 (other worker boxes) | ~6 days |

Both eventually hit EMFILE. The leak rate is roughly the same in both cases.

Reproduction

Not yet reproduced on demand. Observed on three production GPU workers simultaneously after ~6 days of continuous polling + intermittent ML processing. Each was running:

  • ami-data-companion with the env vars from #138 (Clarify and document tuning parameters) applied: PIN_MEMORY=false, BATCH_SIZE=8/16, NUM_WORKERS=1, LOCALIZATION_BATCH_SIZE=4, CLASSIFICATION_BATCH_SIZE=150
  • torch 2.10.0+cu128 in the venv
  • python (CPython, version per pyproject.toml)

To reproduce in development:

  1. Start an ami worker against a deployment with no pending tasks (so the worker only polls, never processes)
  2. Periodically snapshot ls /proc/<PID>/fd | wc -l
  3. Compare to the pipe:[N] count via ls -l /proc/<PID>/fd | grep -c 'pipe:'

If the polling-only mode also leaks pipes, that narrows the cause further (probably points at the polling/HTTP path rather than the DataLoader path). If it doesn't, an active job is required to trigger the leak.
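For steps 2 and 3, a small Python helper can take the same snapshot as the shell pipeline without needing awk and sed. This is a minimal sketch, assuming a Linux /proc filesystem and permission to read the target's fd directory; fd_type_counts is a hypothetical name, not part of ami-data-companion.

```python
import os
import re
from collections import Counter


def fd_type_counts(pid="self"):
    """Count a process's open FDs by target type, e.g. 'pipe:[N]' or 'socket:[N]'.

    Linux-only: reads the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The FD was closed between listdir() and readlink(); skip it.
            continue
        # Collapse inode numbers so all pipes (or sockets) share one key.
        counts[re.sub(r"\[\d+\]", "[N]", target)] += 1
    return counts


if __name__ == "__main__":
    for kind, n in fd_type_counts().most_common(10):
        print(f"{n:7d} {kind}")
```

Logging fd_type_counts(pid)["pipe:[N]"] at a fixed interval during the polling-only run would show directly whether the pipe count grows without any job being processed.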

Suggested fix directions (to discuss, not pre-decided)

Listed roughly in increasing effort.

  1. Track which call sites open pipes — instrument os.pipe (e.g. via unittest.mock / a small wrapper) for one production-style test run and log every caller. Likely candidates: multiprocessing.Pipe, multiprocessing.connection.Pipe, subprocess.Popen with pipes, concurrent.futures.ProcessPoolExecutor queues.

  2. Explicit DataLoader teardown between batches — at the end of each _process_job iteration:

    del loader
    gc.collect()

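    Intended to drop the last reference and force a collection pass so that the DataLoader's worker-shutdown path runs promptly and joins its workers (in PyTorch the multiprocess iterator's __del__ is what shuts workers down, so any lingering reference to the iterator also needs to go). May be enough on its own if the leak is just GC lag.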

  3. persistent_workers=True + a long-lived DataLoader per worker process — keeps a fixed pool of DataLoader subprocesses alive, avoids the per-batch spawn/teardown churn that's leaking pipes. Has its own tradeoffs (needs num_workers > 0, less responsive to changing prefetch_factor).

  4. Audit any subprocess.Popen calls in the worker code path, especially any that open pipes (stdout=PIPE, stderr=PIPE) without explicitly closing the pipe FDs after reading, or that omit stdin=DEVNULL. popen2-style usages are easy to leak from.

  5. Switch DataLoader multiprocessing_context from default (likely fork) to spawn or forkserver. Worth a try on a reproducer.

  6. Worker self-restart on RSS / FD soft cap — defense in depth. resource.getrusage for RSS, len(os.listdir('/proc/self/fd')) for FDs, sampled every N batches; exit cleanly when over a fraction of the rlimit, let supervisor restart. Doesn't fix the root bug but turns a silent dark-worker into a recoverable restart.
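Direction 6 could look roughly like the following. This is a sketch only: the function names and the 0.5 threshold are assumptions for illustration, it covers the FD check but not the RSS check, and it relies on the Linux-only /proc/self/fd listing mentioned above.

```python
import os
import resource


def open_fd_count():
    """Number of FDs currently open in this process (Linux-only).

    The count includes the transient FD used to read the directory itself.
    """
    return len(os.listdir("/proc/self/fd"))


def fd_usage_fraction():
    """Open FDs as a fraction of the soft RLIMIT_NOFILE."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return 0.0
    return open_fd_count() / soft


def should_restart(max_fraction=0.5):
    """True once FD usage reaches max_fraction of the soft limit (assumed threshold)."""
    return fd_usage_fraction() >= max_fraction


if __name__ == "__main__":
    print(f"fds={open_fd_count()} fraction={fd_usage_fraction():.4f}")
    if should_restart():
        # Exiting lets supervisor's autorestart bring up a fresh process.
        raise SystemExit("FD usage over threshold, exiting for restart")
```

Sampling should_restart() every N batches, and also on idle poll cycles, would bound the worst case to one restart per several days instead of a silently dark worker. The check has to run while there is still FD headroom, since logging and a clean exit may themselves need descriptors.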

Workaround (operational)

sudo supervisorctl restart ami-antenna-worker:* clears the FDs. Resumes polling immediately. Worker is good for another ~6 days at nofile=65536. Not a fix.

Diagnostic recipe (for any future "worker dark, no errors" report)

# Find the ami worker PIDs
pgrep -f "ami-data-companion.*ami worker"

# Per-process FD count + rlimit
for pid in $(pgrep -f "ami-data-companion.*ami worker"); do
  echo "pid=$pid fds=$(sudo ls /proc/$pid/fd | wc -l) limit=$(sudo awk '/Max open files/ {print $4}' /proc/$pid/limits)"
done

# FD type breakdown — leak signature is dominant `pipe:[N]`
sudo ls -l /proc/<PID>/fd | awk '{print $11}' | sed -E 's|/[0-9]+$||;s|\[[0-9]+\]|[N]|g' | sort | uniq -c | sort -rn | head

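# Recent worker logs; the leak symptom is the urllib3 EMFILE.
# Note: supervisorctl tail takes a byte count, so -100000 is roughly the last 100 kB of stdout, not a line count.
sudo supervisorctl tail -100000 ami-antenna-worker:ami-antenna-worker_00 stdout | grep -E "Too many open files|EMFILE" | tail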

What we still need to verify

  • Is the leak triggered by polling alone, or only by active job processing? (run polling-only repro)
  • Does Direction 2 (del loader; gc.collect()) alone stop the leak? (cheapest first try)
  • Is there any non-DataLoader call site contributing? (instrument os.pipe)
  • Does a torch upgrade past 2.10.0 already fix it upstream? (check torch changelog for multiprocessing cleanup fixes)
  • Cleanup latency — does the leak rate change between idle and active workloads? (suggests whether it's per-batch or per-poll)

Labels: bug
