After about a week of uptime, the ML worker silently stops processing jobs even though supervisor still reports it as RUNNING — the process is alive and polling on schedule, but every outbound HTTP call fails internally because it has run out of file descriptors. The cause is an anonymous-pipe leak in the DataLoader subprocess lifecycle that accumulates roughly 10k pipe FDs per day per worker, regardless of batch size or any of the existing tuning knobs.
Summary
The ami worker process leaks anonymous pipe file descriptors at a sustained rate (~10k/day per process) at steady state. After ~6 days of uptime on a worker with nofile rlimit 65536, every os.pipe() call (DataLoader subprocess startup, subprocess.Popen, etc.) fails with OSError(24, 'Too many open files'). The worker process stays RUNNING in supervisor (no FATAL, no autorestart, no crash), but every API poll silently fails with a urllib3 SSLError wrapping the EMFILE.
This is distinct from #140 (DataLoader subprocess deadlock — synchronous hang) and from #138 (steady-state ingest RAM peak). Same upstream module (trapdata/antenna/datasets.py DataLoader lifecycle), different leaked resource. The mitigations from #138 (AMI_ANTENNA_API_DATALOADER_PIN_MEMORY=false, smaller batch sizes) did not slow this leak.
Symptom (silent failure)
Worker RUNNING in supervisor, uptime steadily climbs
Every poll cycle logs:
[error] Failed to fetch jobs from <api>: HTTPSConnectionPool(...): Max retries exceeded with url: /api/v2/jobs?... (Caused by SSLError(OSError(24, 'Too many open files')))
[info ] [GPU 0] No jobs found, sleeping for 5 seconds
The deployment's Django backend sees zero requests from the worker
Async ML jobs queued via NATS sit at num_pending > 0, num_ack_pending = 0, num_redelivered = 0 — same NATS signature as a stale auth token, but the worker logs show EMFILE not 401
The deployment-side jobs_health_check cuts the job to FAILURE after the idle cutoff
User-visible result: "ML job did nothing"
Root cause (FD type breakdown)
Snapshot taken on a worker process immediately before restart, after ~6 days uptime:
Effectively all of the leaked FDs are anonymous pipes, not sockets, files, eventfds, or shared-memory inodes. Pipes at this scale come from multiprocessing IPC channels — DataLoader subprocess control/data pipes — that are not closed when the subprocess exits.
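As a minimal illustration of the mechanism described above (not the worker's code, and not a reproduction of the DataLoader leak itself), the sketch below shows that each one-way multiprocessing pipe holds two FDs in the creating process until both ends are closed. The helper names (open_fd_count, leak_pipes, close_pipes) are hypothetical, and the FD count relies on Linux's /proc/self/fd.

```python
import multiprocessing
import os


def open_fd_count() -> int:
    """Number of FDs currently open in this process (Linux /proc only)."""
    return len(os.listdir("/proc/self/fd"))


def leak_pipes(n: int) -> list:
    """Create n one-way pipes and keep both ends referenced so nothing closes them.

    The list stands in for whatever lingering reference keeps the pipe ends
    alive in a real parent process (an assumption for illustration).
    """
    held = []
    for _ in range(n):
        # duplex=False is backed by os.pipe(); duplex=True would use a socketpair.
        reader, writer = multiprocessing.Pipe(duplex=False)
        held.append((reader, writer))
    return held


def close_pipes(held: list) -> None:
    """Explicitly close both ends of every pipe, releasing two FDs per pipe."""
    for reader, writer in held:
        reader.close()
        writer.close()
    held.clear()


if __name__ == "__main__":
    close_pipes(leak_pipes(1))  # warm-up so lazy imports do not skew the baseline
    baseline = open_fd_count()
    held = leak_pipes(100)
    print("extra FDs while held:", open_fd_count() - baseline)
    close_pipes(held)
    print("extra FDs after close:", open_fd_count() - baseline)
```

If the parent in the real worker keeps such ends referenced (or never closes them) after the subprocess exits, the count only ever grows, which is the shape seen in the snapshot.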
This is consistent with AMI_NUM_WORKERS=1 (one DataLoader subprocess spawned per batch) leaking ~3 pipes per batch over the worker's lifetime: ~65500 pipes / ~3 per batch ≈ 21800 batches. At our observed throughput of roughly 1 batch every 25 s, that's ~6 days of continuous processing — which matches the observed uptime at exhaustion.
The leak is at steady state, not bursty. Mitigations that lower per-batch RAM (PIN_MEMORY=false, smaller batch sizes) don't help because the pipe count doesn't scale with batch size — it scales with batch count.
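The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch using only the figures quoted in this issue (~3 pipes per batch, ~25 s per batch, ~50 FDs of legitimate working set); the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def leak_rate_per_day(pipes_per_batch: float = 3.0,
                      seconds_per_batch: float = 25.0) -> float:
    """Pipe FDs leaked per day at a steady batch cadence."""
    return pipes_per_batch * SECONDS_PER_DAY / seconds_per_batch


def days_to_exhaustion(nofile_limit: int,
                       baseline_fds: int = 50,
                       pipes_per_batch: float = 3.0,
                       seconds_per_batch: float = 25.0) -> float:
    """Days of continuous processing until the FD limit is reached."""
    headroom = nofile_limit - baseline_fds      # FDs available to be leaked
    batches = headroom / pipes_per_batch        # batches until EMFILE
    return batches * seconds_per_batch / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{leak_rate_per_day():.0f} pipes/day")                        # 10368 pipes/day
    print(f"{days_to_exhaustion(65536):.1f} days at nofile=65536")       # 6.3 days
    print(f"{days_to_exhaustion(1024) * 24:.1f} hours at nofile=1024")   # 2.3 hours
```

The same estimate applied to a 1024 limit gives a couple of hours, in line with the "~hours" datapoint reported for the under-provisioned box.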
Why this isn't covered by #140
#140 describes a synchronous deadlock where the main thread blocks forever on multiprocessing.Queue.get() after a shm_unlink failure in the subprocess. That worker is hung: the log goes silent immediately and last_log mtime becomes stale.
This bug is the opposite shape. The worker keeps polling on schedule (logs are timestamped, mtime is fresh), but every poll fails with EMFILE. The DataLoader subprocess in this case has long since exited cleanly — it leaves dead pipe FDs in the parent, and the parent never reaps them.
A fix for #140 (DataLoader timeout, persistent_workers, etc.) wouldn't address this. Both bugs share an upstream cause (DataLoader / multiprocessing lifecycle hygiene) but the failure modes and recovery paths are different.
Why this isn't (only) covered by #138
#138 mentions OSError(24, 'Too many open files') from multiprocessing.Pipe's os.pipe() call at DataLoader startup, and concludes:
Fixed out-of-band by raising the soft FD limit at the supervisor or systemd level. Not an ADC bug
The new evidence here shows the opposite: at nofile=65536, the legitimate steady-state working set is ~50 FDs, not 1024 or 65536. The exhaustion at 65k after 6 days is an unbounded leak, not a working-set ceiling. Raising the rlimit buys time (~6 days at 65k vs ~hours at 1024) but doesn't fix the underlying bug.
Two clean datapoints from the same incident:
| nofile rlimit | Time to exhaustion |
| --- | --- |
| 1024 (one worker box, missed provisioning step) | ~hours |
| 65536 (other worker boxes) | ~6 days |
Both eventually hit EMFILE. The leak rate is roughly the same in both cases.
Reproduction
Not yet reproduced on demand. Observed on three production GPU workers simultaneously after ~6 days of continuous polling + intermittent ML processing. Each was running:
ami-data-companion with the env vars from #138 (Clarify and document tuning parameters) applied: PIN_MEMORY=false, BATCH_SIZE=8/16, NUM_WORKERS=1, LOCALIZATION_BATCH_SIZE=4, CLASSIFICATION_BATCH_SIZE=150
torch 2.10.0+cu128 in the venv
python (CPython, version per pyproject.toml)
To reproduce in development:
Start an ami worker against a deployment with no pending tasks (so the worker only polls, never processes)
Periodically snapshot ls /proc/<PID>/fd | wc -l
Compare to the pipe:[N] count via ls -l /proc/<PID>/fd | grep -c 'pipe:'
If the polling-only mode also leaks pipes, that narrows the cause further (probably points at the polling/HTTP path rather than the DataLoader path). If it doesn't, an active job is required to trigger the leak.
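The periodic snapshot in the steps above could also be scripted. The following is a rough sketch, assuming a Linux host and permission to read /proc/<PID>/fd for the target process (same user or root); the function names and the sampling defaults are hypothetical.

```python
import collections
import os
import re
import sys
import time


def classify(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse FD type."""
    match = re.match(r"^(\w+):\[\d+\]$", target)
    if match:
        return match.group(1)          # e.g. "pipe" or "socket"
    if target.startswith("anon_inode:"):
        return target                  # e.g. "anon_inode:[eventfd]"
    return "file"                      # regular paths, devices, etc.


def fd_type_counts(pid="self") -> collections.Counter:
    """Count open FDs of a process by type."""
    fd_dir = f"/proc/{pid}/fd"
    counts = collections.Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue                   # FD was closed between listdir and readlink
        counts[classify(target)] += 1
    return counts


def watch(pid, interval_s: float = 60.0, samples: int = 10) -> None:
    """Print total and pipe FD counts a fixed number of times."""
    for _ in range(samples):
        counts = fd_type_counts(pid)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        print(f"{stamp} pid={pid} total={sum(counts.values())} pipes={counts['pipe']}")
        time.sleep(interval_s)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python fd_watch.py <PID> [samples]
    watch(sys.argv[1], samples=int(sys.argv[2]) if len(sys.argv) > 2 else 10)
```

A pipes column that climbs while the worker is only polling would point at the polling/HTTP path; a flat one would suggest an active job is needed to trigger the leak.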
Suggested fix directions (to discuss, not pre-decided)
Listed roughly in increasing effort.
Track which call sites open pipes — instrument os.pipe (e.g. via unittest.mock / a small wrapper) for one production-style test run and log every caller. Likely candidates: multiprocessing.Pipe, multiprocessing.connection.Pipe, subprocess.Popen with pipes, concurrent.futures.ProcessPoolExecutor queues.
Explicit DataLoader teardown between batches — at the end of each _process_job iteration:
del loader
gc.collect()
Forces the loader's __del__ to run and joins its workers. May be enough on its own if the leak is just GC lag.
persistent_workers=True + a long-lived DataLoader per worker process — keeps a fixed pool of DataLoader subprocesses alive, avoids the per-batch spawn/teardown churn that's leaking pipes. Has its own tradeoffs (needs num_workers > 0, less responsive to changing prefetch_factor).
Audit any subprocess.Popen calls in the worker code path — especially anything that doesn't pass stdin=DEVNULL, stdout=PIPE, stderr=PIPE and explicitly close the pipe FDs after read. popen2-style usages are easy to leak from.
Switch DataLoader multiprocessing_context from default (likely fork) to spawn or forkserver. Worth a try on a reproducer.
Worker self-restart on RSS / FD soft cap — defense in depth. resource.getrusage for RSS, len(os.listdir('/proc/self/fd')) for FDs, sampled every N batches; exit cleanly when over a fraction of the rlimit, let supervisor restart. Doesn't fix the root bug but turns a silent dark-worker into a recoverable restart.
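The FD half of the self-restart idea in the last item above might look roughly like the sketch below. It assumes Linux (/proc/self/fd) and a supervisor configuration that restarts the program after it exits with the chosen code; the function names, the 0.8 threshold, the sampling interval, and the exit code are all assumptions for illustration, and the RSS check is left out.

```python
import os
import resource
import sys


def fd_usage_fraction() -> float:
    """Fraction of the soft nofile limit currently in use by this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return 0.0                       # no limit, so nothing to exhaust
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds / soft


def should_restart(fraction: float, threshold: float = 0.8) -> bool:
    """True once FD usage reaches the configured share of the limit."""
    return fraction >= threshold


def check_fd_budget(batch_index: int,
                    every_n_batches: int = 100,
                    threshold: float = 0.8,
                    exit_code: int = 1) -> None:
    """Sample FD usage every N batches and exit when over the soft cap.

    The exit code is an assumption: whether supervisor restarts the process
    depends on its autorestart/exitcodes settings.
    """
    if batch_index % every_n_batches != 0:
        return
    fraction = fd_usage_fraction()
    if should_restart(fraction, threshold):
        print(f"FD usage at {fraction:.0%} of soft limit, exiting for restart",
              file=sys.stderr)
        sys.exit(exit_code)
```

Calling check_fd_budget(batch_index) once per processed batch would be enough; since the poll loop itself keeps running when jobs fail, the same check could also be placed there.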
Workaround (operational)
sudo supervisorctl restart ami-antenna-worker:* clears the FDs. Resumes polling immediately. Worker is good for another ~6 days at nofile=65536. Not a fix.
Diagnostic recipe (for any future "worker dark, no errors" report)
# Find the ami worker PIDs
pgrep -f "ami-data-companion.*ami worker"

# Per-process FD count + rlimit
for pid in $(pgrep -f "ami-data-companion.*ami worker"); do
  echo "pid=$pid fds=$(sudo ls /proc/$pid/fd | wc -l) limit=$(sudo awk '/Max open files/ {print $4}' /proc/$pid/limits)"
done

# FD type breakdown: the leak signature is a dominant `pipe:[N]` count
sudo ls -l /proc/<PID>/fd | awk '{print $11}'| sed -E 's|/[0-9]+$||;s|\[[0-9]+\]|[N]|g'| sort | uniq -c | sort -rn | head
# Recent worker logs — leak symptom is the urllib3 EMFILE
sudo supervisorctl tail -1000 ami-antenna-worker:ami-antenna-worker_00 stdout | grep -E "Too many open files|EMFILE"| tail
Related
What we still need to verify
[…] del loader; gc.collect()) alone stop the leak? (cheapest first try)
[…] os.pipe)
[…] multiprocessing cleanup fixes)