Symptom
Worker process appears RUNNING (supervisor, process listing) but stops making progress. The last "Batch N: ..." log line is a normal-looking processing entry: no exception, no SIGKILL, no OOM. Uptime keeps accumulating. Log file mtime goes stale.
When processing resumes on other workers (NATS redelivers after ack_wait) the hung worker remains silent until manually restarted.
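Since the process stays RUNNING, the stale log mtime is the only external signal described here. A minimal sketch of a staleness check follows; the function names and the idea of using a fixed idle threshold are assumptions for illustration, not part of the current worker.

```python
import os
import time
from typing import Optional


def seconds_since_modified(path: str, now: Optional[float] = None) -> float:
    """Return how many seconds ago `path` was last written."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_hung(path: str, max_idle_s: float, now: Optional[float] = None) -> bool:
    """True if the log file has been idle for longer than `max_idle_s`.

    Hypothetical helper: the threshold has to be longer than the slowest
    legitimate gap between two "Batch N" log lines.
    """
    return seconds_since_modified(path, now) > max_idle_s
```

A check like this only detects the stall; it says nothing about why the worker stopped.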
Root cause (evidenced by py-spy native dump)
With AMI_NUM_WORKERS=1, torch DataLoader uses a subprocess to prefetch batches. The subprocess communicates with the main process via multiprocessing.Queue. During normal operation the subprocess serializes each batch tensor with torch.multiprocessing.reductions.reduce_storage, which uses _share_fd_cpu_ / shm_unlink under the hood.
Under pressure (large images, rapid batching, or when the subprocess was previously signalled) the subprocess can die while attempting to share storage, logging to stderr:
RuntimeError: could not unlink the shared memory file /torch_<pid>_<hash>_<n> : No such file or directory (2)
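The subprocess dies, or at least stops delivering batches; which of the two has not been confirmed, and with timeout=0 DataLoader keeps re-polling for as long as it believes the worker is alive. MainThread stays blocked in _try_get_data, polling the now-abandoned multiprocessing.Queue with no overall deadline: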
poll (libc.so.6)
select (selectors.py:415)
wait (multiprocessing/connection.py:948)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
_try_get_data (torch/utils/data/dataloader.py:1310)
_get_data (torch/utils/data/dataloader.py:1483)
_next_data (torch/utils/data/dataloader.py:1524)
__next__ (torch/utils/data/dataloader.py:741)
_preload (trapdata/antenna/datasets.py:529)
__next__ (trapdata/antenna/datasets.py:552)
_process_job (trapdata/antenna/worker.py:458)
The torch DataLoader does have a timeout= parameter for exactly this situation, but the default is 0 (wait forever). We don't set it.
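As a rough sketch of what setting the timeout could look like, the helper below builds the keyword arguments for DataLoader. The helper name and the 120-second default are assumptions. It only passes timeout when worker subprocesses are in use, because as far as we know the single-process iterator expects timeout to be 0.

```python
from typing import Any, Dict


def loader_kwargs(num_workers: int, timeout_s: float = 120.0) -> Dict[str, Any]:
    """Build DataLoader keyword arguments with a stall guard.

    Hypothetical helper, not existing project code.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    if timeout_s < 0:
        raise ValueError("timeout_s must be >= 0")

    kwargs: Dict[str, Any] = {"num_workers": num_workers}
    if num_workers > 0:
        # Only meaningful with worker subprocesses; 0 means "wait forever".
        kwargs["timeout"] = timeout_s
    return kwargs
```

It would be used as DataLoader(dataset, **loader_kwargs(1)). If no batch arrives within the timeout, the iteration raises a RuntimeError in the main process instead of blocking.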
Reproduction
Not yet reproduced on demand. Observed twice in production within an 8-hour window on two different H100 MIG slices (12 GiB and 24 GiB), both running feature/tuning-knobs-138 and feature/uv-migration. Appears to correlate with large-image processing (3840×2160 and 4096×2160 mixed-resolution batches).
Workaround applied
Set AMI_NUM_WORKERS=0 on the affected workers. Eliminates the subprocess entirely — DataLoader runs inline in the main process. Worker has been stable for the remainder of the observation window. Trade-off: no data-load/GPU-compute overlap, estimated ~20–40% throughput loss on H100 MIG slices where S3 image download is a non-trivial fraction of per-batch time.
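To make the throughput estimate easier to reason about, here is an idealized model, assuming perfect overlap with one prefetching subprocess and no overlap at all when loading runs inline. The function name and the example timings are illustrative, not measurements.

```python
def overlap_loss(load_s: float, compute_s: float) -> float:
    """Fractional throughput loss of serial vs fully overlapped batches.

    Serial per-batch time:     load_s + compute_s
    Overlapped per-batch time: max(load_s, compute_s)
    Loss = 1 - overlapped / serial  (idealized; ignores IPC overhead).
    """
    serial = load_s + compute_s
    if serial <= 0:
        raise ValueError("load_s + compute_s must be positive")
    overlapped = max(load_s, compute_s)
    return 1.0 - overlapped / serial


# Illustrative numbers only: 0.25 s of loading and 0.75 s of GPU compute
# per batch would mean a 25% loss when prefetch is disabled.
example_loss = overlap_loss(0.25, 0.75)
```

Under this model, a 20–40% loss corresponds to data loading taking 20–40% of the serial per-batch time on a compute-bound worker.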
Proposed fixes (not decided — discussion welcome)
Ordered roughly from cheapest to cleanest:
- Set an explicit DataLoader(timeout=...), e.g. 120 seconds. On stall, DataLoader raises instead of hanging. The worker process can then exit, the supervisor restarts it, and NATS redelivers. Cheapest possible guard. Does not fix the unlink crash. The value has to exceed the slowest legitimate batch load, or healthy runs will raise too.
- Catch the shared-memory unlink error and restart the DataLoader: wrap the inner loop in a try/except on RuntimeError matching "could not unlink the shared memory file", tear down the DataLoader, rebuild it, resume. More complex but preserves in-process restart.
- Switch to multiprocessing_context="spawn" or "forkserver". May or may not help, depending on whether the root cause is fork-related. Worth trying on a reproducer.
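- Use persistent_workers=True with a fixed worker pool. This keeps the worker subprocess alive between successive iterations over the same DataLoader (by default the workers are shut down whenever an iterator is exhausted and started again for the next one), which may avoid the subprocess churn that triggers the race. It only helps if the DataLoader object itself is reused.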
- Upgrade torch. We're on 2.10.0+cu128 in the venv that hangs. Recent torch versions have had several fixes in torch.multiprocessing.reductions for SHM cleanup races. Worth checking the torch changelog against 2.10.0 for known fixes.
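The catch-and-rebuild option could be sketched roughly as below. Everything here is an assumption for illustration: make_loader is a hypothetical factory that builds a fresh DataLoader, and it is assumed that a rebuilt loader resumes from pending work rather than starting over (otherwise batches would repeat). It is also unverified whether the unlink error ever surfaces in the main process as a RuntimeError; so far it has only been seen on the subprocess's stderr.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

UNLINK_MARKER = "could not unlink the shared memory file"


def iterate_with_rebuild(
    make_loader: Callable[[], Iterable[T]],
    max_rebuilds: int = 3,
    retry_markers: Tuple[str, ...] = (UNLINK_MARKER,),
) -> Iterator[T]:
    """Yield batches, rebuilding the loader on matching RuntimeErrors.

    Any RuntimeError whose message contains none of `retry_markers`,
    or that occurs after `max_rebuilds` rebuilds, is re-raised.
    """
    rebuilds = 0
    while True:
        try:
            for batch in make_loader():
                yield batch
            return  # loader exhausted normally
        except RuntimeError as exc:
            message = str(exc)
            retryable = any(marker in message for marker in retry_markers)
            if not retryable or rebuilds >= max_rebuilds:
                raise
            rebuilds += 1  # drop the old loader and build a new one
```

Capping the number of rebuilds keeps a persistent failure from turning into a silent retry loop; once the cap is hit the error propagates and the supervisor can restart the worker.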
Evidence
- py-spy native dump of the hung main thread (see traceback above)
- Supervisor log shows steady Batch N progression then silence, no exit
- stderr log contains the "could not unlink the shared memory file" RuntimeError
- No dmesg OOM, no CUDA error, GPU memory allocation unchanged during the hang
What we still need to verify
- Reproducibility: whether the hang is deterministic given a specific batch content (e.g. the mixed-resolution warning immediately before the hang in two of our cases), or time-dependent
- Whether setting DataLoader(timeout=120) alone is sufficient for operational recovery, or whether the unlink error also needs to be handled
- Throughput delta at AMI_NUM_WORKERS=0 vs a fixed AMI_NUM_WORKERS=1 once stable, to confirm whether restoring subprocess prefetch is worth the complexity
- Whether a current-torch upgrade would silently fix this (check torch release notes for SHM / _share_fd_cpu_ fixes since 2.10.0)
Related
- _process_job() refactor. The hang occurs inside _process_job at the DataLoader iteration site. A refactor could introduce the timeout guard naturally.
- AMI_NUM_WORKERS is one of the parameters; its current behaviour (hangs the worker if > 0 under load) needs to be called out.