DataLoader subprocess deadlock: MainThread hangs forever on queue.get after shared-memory unlink failure #140

@mihow

Symptom

The worker process appears RUNNING (in both supervisor and the process listing) but stops making progress. The last Batch N: ... log line is a normal-looking processing entry: no exception, no SIGKILL, no OOM. Uptime keeps accumulating while the log file's mtime goes stale.

When processing resumes on other workers (NATS redelivers after ack_wait), the hung worker stays silent until it is manually restarted.

Root cause (evidenced by py-spy native dump)

With AMI_NUM_WORKERS=1, torch DataLoader uses a subprocess to prefetch batches. The subprocess communicates with the main process via multiprocessing.Queue. During normal operation the subprocess serializes each batch tensor with torch.multiprocessing.reductions.reduce_storage, which uses _share_fd_cpu_ / shm_unlink under the hood.

Under pressure (large images, rapid batching, or when the subprocess was previously signalled) the subprocess can die while attempting to share storage, logging to stderr:

RuntimeError: could not unlink the shared memory file /torch_<pid>_<hash>_<n> : No such file or directory (2)

With the subprocess gone, MainThread stays blocked in _try_get_data, polling the now-abandoned multiprocessing.Queue with no timeout:

poll (libc.so.6)
select (selectors.py:415)
wait (multiprocessing/connection.py:948)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
_try_get_data (torch/utils/data/dataloader.py:1310)
_get_data (torch/utils/data/dataloader.py:1483)
_next_data (torch/utils/data/dataloader.py:1524)
__next__ (torch/utils/data/dataloader.py:741)
_preload (trapdata/antenna/datasets.py:529)
__next__ (trapdata/antenna/datasets.py:552)
_process_job (trapdata/antenna/worker.py:458)

The torch DataLoader does have a timeout= parameter for exactly this situation, but the default is 0 (wait forever). We don't set it.
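
The failure mode can be illustrated in miniature with the standard library alone. The sketch below is not torch's internal code: _dying_producer and wait_for_batch are hypothetical names, and the "fork" start method is an assumption (POSIX host). It only shows that a consumer blocked on a multiprocessing.Queue gets no signal when the producer dies, so a timeout on get() is what turns a silent hang into an observable event.

```python
import multiprocessing as mp
import os
import queue


def _dying_producer(q):
    # Stand-in for the prefetch subprocess: exits abruptly without
    # ever putting a batch on the queue.
    os._exit(1)


def wait_for_batch(timeout_s):
    """Return 'batch' if an item arrives, 'stalled' if timeout_s elapses first."""
    ctx = mp.get_context("fork")  # assumption: POSIX host
    q = ctx.Queue()
    proc = ctx.Process(target=_dying_producer, args=(q,))
    proc.start()
    try:
        # With timeout=None this call would block forever, because the
        # parent still holds the write end of the pipe and never sees EOF.
        q.get(timeout=timeout_s)
        return "batch"
    except queue.Empty:
        return "stalled"
    finally:
        proc.join()
```

Calling wait_for_batch(0.5) returns after about half a second instead of hanging, which is the behaviour a non-zero DataLoader timeout is meant to provide.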

Reproduction

Not yet reproduced on demand. Observed twice in production within an 8-hour window on two different H100 MIG slices (12 GiB and 24 GiB), both running feature/tuning-knobs-138 and feature/uv-migration. Appears to correlate with large-image processing (3840×2160 and 4096×2160 mixed-resolution batches).

Workaround applied

Set AMI_NUM_WORKERS=0 on the affected workers. Eliminates the subprocess entirely — DataLoader runs inline in the main process. Worker has been stable for the remainder of the observation window. Trade-off: no data-load/GPU-compute overlap, estimated ~20–40% throughput loss on H100 MIG slices where S3 image download is a non-trivial fraction of per-batch time.

Proposed fixes (not decided — discussion welcome)

Ordered roughly from cheapest to cleanest:

  1. Set an explicit DataLoader(timeout=...), e.g. 120 seconds. On stall, DataLoader raises a RuntimeError instead of hanging, so the worker process can exit, supervisor restarts it, and NATS redelivers. Cheapest possible guard. Does not fix the unlink crash.

  2. Catch the shared-memory unlink error and restart the DataLoader — wrap the inner loop in a try/except on RuntimeError matching "could not unlink the shared memory file", tear down the DataLoader, rebuild it, resume. More complex but preserves in-process restart.

  3. Switch to multiprocessing_context="spawn" or "forkserver" — may or may not help, depending on whether the root cause is fork-related. Worth trying on a reproducer.

  4. Use persistent_workers=True + a fixed worker pool — keeps the subprocess alive across batches, may avoid the per-batch subprocess churn that triggers the race.

  5. Upgrade torch — we're on 2.10.0+cu128 in the venv that hangs. Recent torch versions have had several fixes in torch.multiprocessing.reductions for SHM cleanup races. Worth checking the torch changelog against 2.10.0 for known fixes.
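
A minimal sketch of how fixes 1 to 4 might fit together, offered for discussion rather than as a decided design. loader_kwargs and batches_with_restart are hypothetical helpers, not existing trapdata code. timeout, multiprocessing_context and persistent_workers are real DataLoader parameters; the "DataLoader timed out after" message fragment is an assumption about what torch raises on a stall, and the unlink fragment is copied from the stderr log above. Because that unlink error was observed on the subprocess's stderr, it may never reach the main process, which is why the timeout message is matched as well.

```python
import re

# Message fragments treated as "rebuild the loader". The first is quoted
# from the stderr log in this issue; the second is an assumption about the
# text DataLoader(timeout=...) raises when it stalls.
RESTARTABLE = re.compile(
    r"could not unlink the shared memory file|DataLoader timed out after"
)


def loader_kwargs(timeout_s=120, context=None, persistent=False):
    """Keyword arguments for torch.utils.data.DataLoader (fixes 1, 3, 4)."""
    kwargs = {"num_workers": 1, "timeout": timeout_s}  # fix 1
    if context is not None:
        # fix 3: "spawn" or "forkserver"
        kwargs["multiprocessing_context"] = context
    if persistent:
        kwargs["persistent_workers"] = True  # fix 4
    return kwargs


def batches_with_restart(make_loader, max_restarts=3):
    """Yield batches from make_loader(), rebuilding it on known stall errors (fix 2)."""
    restarts = 0
    while True:
        loader = make_loader()
        try:
            for batch in loader:
                yield batch
            return  # loader exhausted normally
        except RuntimeError as exc:
            if not RESTARTABLE.search(str(exc)) or restarts >= max_restarts:
                raise  # unknown error, or too many restarts: let the worker exit
            restarts += 1
            del loader  # drop the reference so teardown can run before the rebuild
```

Usage would look like batches_with_restart(lambda: DataLoader(dataset, **loader_kwargs())). A rebuilt loader starts its dataset from the beginning, so this only makes sense if the dataset pulls work from a redelivering source such as the NATS queue; otherwise already-processed batches would be repeated.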

Evidence

  • py-spy native dump of the hung main thread (see traceback above)
  • Supervisor log shows steady Batch N progression then silence, no exit
  • stderr log contains the could not unlink the shared memory file RuntimeError
  • No dmesg OOM, no CUDA error, GPU memory allocation unchanged during the hang

What we still need to verify

  • Reproducibility: whether the hang is deterministic given a specific batch content (e.g. the mixed-resolution warning immediately before the hang in two of our cases), or time-dependent
  • Whether setting DataLoader(timeout=120) alone is sufficient for operational recovery, or whether the unlink error also needs to be handled
  • Throughput delta at AMI_NUM_WORKERS=0 vs a fixed AMI_NUM_WORKERS=1 once stable, to confirm whether restoring subprocess prefetch is worth the complexity
  • Whether a current-torch upgrade would silently fix this (check torch release notes for SHM / _share_fd_cpu_ fixes since 2.10.0)
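
For the throughput comparison, a small timing harness along these lines could be run once per setting. This is a sketch: measure_throughput and its process hook are hypothetical, and it assumes len(batch) gives the number of images in a batch, which may not hold for dict-style batches.

```python
import itertools
import time


def measure_throughput(batches, process=lambda batch: None, max_batches=50):
    """Consume up to max_batches; return (images, seconds, images_per_second)."""
    images = 0
    start = time.perf_counter()
    for batch in itertools.islice(batches, max_batches):
        process(batch)  # hypothetical hook for the per-batch GPU step
        images += len(batch)  # assumption: a batch is a sized collection of images
    seconds = time.perf_counter() - start
    rate = images / seconds if seconds > 0 else float("inf")
    return images, seconds, rate
```

Running it against the same job with AMI_NUM_WORKERS=0 and AMI_NUM_WORKERS=1, with the real inference step passed as process, would give a measured figure to set against the estimated 20–40% loss.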

Related
