DataLoader subprocess deadlock: MainThread hangs forever on queue.get after shared-memory unlink failure

## Symptom

Worker process appears RUNNING (supervisor, process listing) but stops making progress. Last `Batch N: ...` log line is a normal-looking processing entry — no exception, no `SIGKILL`, no OOM. Uptime keeps accumulating. Log file mtime goes stale.

Processing resumes on other workers once NATS redelivers after `ack_wait`, but the hung worker stays silent until it is manually restarted.
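Since a stale log mtime is the only externally visible signal, a supervisor-side check could flag the hang. This is a minimal sketch, assuming a hypothetical helper name and threshold; neither exists in the codebase today.

```python
import os
import time


def log_is_stale(path, max_age_s=600.0, now=None):
    """Return True if the file at `path` was last modified more than max_age_s ago.

    `log_is_stale` and the 600 s default are illustrative assumptions,
    not existing project settings.
    """
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) > max_age_s
```

The threshold would need to exceed the longest legitimate gap between log lines. A worker that is idle because no jobs are queued also has a stale log, so this check alone cannot distinguish idle from hung.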

## Root cause (evidenced by py-spy native dump)

With `AMI_NUM_WORKERS=1`, torch `DataLoader` uses a subprocess to prefetch batches. The subprocess communicates with the main process via `multiprocessing.Queue`. During normal operation the subprocess serializes each batch tensor with `torch.multiprocessing.reductions.reduce_storage`, which uses `_share_fd_cpu_` / `shm_unlink` under the hood.

Under pressure (large images, rapid batching, or when the subprocess was previously signalled) the subprocess can die while attempting to share storage, logging to stderr:

```
RuntimeError: could not unlink the shared memory file /torch_<pid>_<hash>_<n> : No such file or directory (2)
```

The subprocess dies. MainThread stays blocked in `_try_get_data`, waiting on the now-abandoned `multiprocessing.Queue` with no overall deadline:

```
poll (libc.so.6)
select (selectors.py:415)
wait (multiprocessing/connection.py:948)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
_try_get_data (torch/utils/data/dataloader.py:1310)
_get_data (torch/utils/data/dataloader.py:1483)
_next_data (torch/utils/data/dataloader.py:1524)
__next__ (torch/utils/data/dataloader.py:741)
_preload (trapdata/antenna/datasets.py:529)
__next__ (trapdata/antenna/datasets.py:552)
_process_job (trapdata/antenna/worker.py:458)
```

The torch DataLoader does have a `timeout=` parameter for exactly this situation, but the default is `0` (wait forever). We don't set it.
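A minimal sketch of how such a timeout could be wired in from configuration. `AMI_DATALOADER_TIMEOUT` and the helper name are hypothetical, introduced only for illustration; `AMI_NUM_WORKERS` is the existing variable. The timeout is passed only when worker subprocesses are in use, because as far as we can tell the single-process iterator expects `timeout=0`. That assumption should be checked against the installed torch.

```python
import os


def dataloader_kwargs(env=None, default_timeout_s=120.0):
    """Build DataLoader keyword arguments from environment-style config.

    AMI_DATALOADER_TIMEOUT is a hypothetical variable used for illustration.
    """
    env = os.environ if env is None else env
    num_workers = int(env.get("AMI_NUM_WORKERS", "0"))
    kwargs = {"num_workers": num_workers}
    if num_workers > 0:
        # timeout is in seconds; 0 (the torch default) means wait indefinitely.
        kwargs["timeout"] = float(env.get("AMI_DATALOADER_TIMEOUT", default_timeout_s))
    return kwargs


# Intended use (requires torch; `dataset` is whatever the worker already builds):
#   loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs())
```

With a nonzero timeout, a stalled worker should surface as an exception at the iteration site in `_process_job` instead of blocking it indefinitely.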

## Reproduction

Not yet reproduced on demand. Observed twice in production within an 8-hour window on two different H100 MIG slices (12 GiB and 24 GiB), both running `feature/tuning-knobs-138` and `feature/uv-migration`. Appears to correlate with large-image processing (3840×2160 and 4096×2160 mixed-resolution batches).

## Workaround applied

Set `AMI_NUM_WORKERS=0` on the affected workers. This eliminates the subprocess entirely: DataLoader runs inline in the main process. The worker has been stable for the remainder of the observation window. Trade-off: no overlap between data loading and GPU compute, an estimated 20–40% throughput loss on H100 MIG slices, where S3 image download is a non-trivial fraction of per-batch time.

## Proposed fixes (not decided — discussion welcome)

Ordered roughly from cheapest to cleanest:

1. **Set an explicit `DataLoader(timeout=...)`**: e.g. 120 seconds. On a stall, DataLoader raises instead of hanging, so the worker process can exit, the supervisor restarts it, and NATS redelivers. Cheapest possible guard. Does not fix the unlink crash.

2. **Catch the shared-memory unlink error and restart the DataLoader**: wrap the inner loop in a try/except on `RuntimeError` matching `"could not unlink the shared memory file"`, tear down the DataLoader, rebuild it, and resume. More complex, but recovers in-process without a worker restart.

3. **Switch to `multiprocessing_context="spawn"` or `"forkserver"`**: may or may not help, depending on whether the root cause is fork-related. Worth trying on a reproducer.

4. **Use `persistent_workers=True` + a fixed worker pool**: keeps worker subprocesses alive when the DataLoader iterator is re-created (workers already persist across batches within a single iterator). This only helps if the race is triggered by subprocess churn from repeatedly building new iterators.

5. **Upgrade torch**: we're on 2.10.0+cu128 in the venv that hangs. If newer torch releases include fixes in `torch.multiprocessing.reductions` for SHM cleanup races, an upgrade may resolve this. Worth checking the torch changelog against 2.10.0 for known fixes.
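Option 2 could be sketched roughly as below. The function name and the `make_loader` factory are hypothetical. The sketch assumes that a freshly built loader continues where the previous one stopped (for example a queue-backed dataset); a map-style dataset would replay from the start. It also assumes the error actually reaches the main process, which the stack trace above does not confirm.

```python
def iter_with_restart(make_loader, max_restarts=3,
                      marker="could not unlink the shared memory file"):
    """Yield batches from make_loader(), rebuilding the loader on the SHM error.

    make_loader is any zero-argument callable returning a fresh iterable,
    for example a function that constructs a new DataLoader.
    """
    restarts = 0
    while True:
        loader = make_loader()
        try:
            for batch in loader:
                yield batch
            return  # loader exhausted normally
        except RuntimeError as exc:
            # Re-raise unrelated errors, and give up after max_restarts rebuilds.
            if marker not in str(exc) or restarts >= max_restarts:
                raise
            restarts += 1
            # Drop our reference so the old loader can be cleaned up
            # before the next one is built.
            del loader
```

If the error only ever appears on the subprocess's stderr, this wrapper never triggers on its own and would need to be combined with the timeout from option 1, so that the stall becomes an exception the main process can act on.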

## Evidence

- py-spy native dump of the hung main thread (see traceback above)
- Supervisor log shows steady `Batch N` progression then silence, no exit
- stderr log contains the `could not unlink the shared memory file` `RuntimeError`
- No dmesg OOM, no CUDA error, GPU memory allocation unchanged during the hang

## What we still need to verify

- Reproducibility: whether the hang is deterministic given a specific batch content (e.g. the mixed-resolution warning immediately before the hang in two of our cases), or time-dependent
- Whether setting `DataLoader(timeout=120)` alone is sufficient for operational recovery, or whether the unlink error also needs to be handled
- Throughput delta at `AMI_NUM_WORKERS=0` vs a fixed `AMI_NUM_WORKERS=1` once stable, to confirm whether restoring subprocess prefetch is worth the complexity
- Whether upgrading to a newer torch release would fix this without code changes (check torch release notes for SHM / `_share_fd_cpu_` fixes since 2.10.0)
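For the throughput comparison, a small harness like the following could be run once per `AMI_NUM_WORKERS` setting. It is a sketch only: the function name is hypothetical, and `step` stands in for whatever per-batch GPU work the worker performs, which must be included for the measurement to reflect load/compute overlap.

```python
import time


def batches_per_second(batches, step=None, max_batches=100, clock=time.perf_counter):
    """Consume up to max_batches from an iterable and return batches per second.

    `step` is an optional callable applied to each batch (e.g. the model
    forward pass); without it only data-loading speed is measured.
    """
    start = clock()
    count = 0
    for batch in batches:
        if step is not None:
            step(batch)
        count += 1
        if count >= max_batches:
            break
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("nan")
```

Running it on the same job with both settings and comparing the two rates would give the delta in question; the first few batches are best discarded, since worker startup skews them.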

## Related

- #121: `_process_job()` refactor. The hang occurs inside `_process_job` at the DataLoader iteration site. A refactor could introduce the timeout guard naturally.
- #138: tuning-parameter documentation. `AMI_NUM_WORKERS` is one of the parameters; its current behaviour (the worker can hang under load when it is > 0) needs to be called out.
