A `docker compose ls` child on one of our v2.1.2 periphery hosts hung in `futex_wait_queue` (looked like a leftover from a `context canceled` during an image manifest read). Subprocess hangs happen. What surprised us was the blast radius:
- A `DeployStack` against that host hung for 104 minutes with the UI stuck on "Deploying…". Only restarting Core ended it. `DestroyStack` as a workaround was rejected with `Resource is busy`.
- `Stats` writes for all 6 servers in the fleet stopped at the exact same instant the subprocess hung, and resumed the moment we killed the orphan PID. That's ~46 hours of frozen monitoring.
- No alert. Periphery's websocket login flag stayed green throughout.
- After we manually restarted Core to break the hung deploy, all 6 peripheries re-logged in over websocket within ~1s, but one stayed shown as "down" in the UI until we killed the orphan PID. During that window Core spammed `WARN Failed to forward Response message | No response channel found at <uuid>` every 5s for one `channel_id`: that's periphery's keepalive still firing for the wedged pre-restart request after Core's matching channel was gone. So restarting Core is not a real recovery here; the wedged tokio task on the periphery outlives Core's lifecycle.
The chain in source:
- `lib/command/src/lib.rs:160`: `cmd.output().await` has no timeout, so a hung child pins the future. `kill_on_drop(true)` is set but never fires because nothing ever drops the future (see the first sketch after this list).
- `bin/periphery/src/connection/mod.rs:182`: the request handler runs in a `tokio::select!` against an infinite `ping_in_progress` loop sending `Pending` every 5s. The keepalive stops only when `resolve_response` finishes, which it never will.
- `bin/core/src/periphery/mod.rs:138`: Core loops on `recv().with_timeout(10s)` and `continue;`s on every `Pending`. With keepalives every 5s, the effective RPC timeout is infinite.
- `bin/core/src/monitor/mod.rs:79`: `record_server_stats(ts)` is gated on `join_all(futures).await`. One never-returning future blocks `Stats` writes for every host (see the second sketch below).
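For reference, the `lib/command` option is small. A minimal sketch, not the real lib/command code (the helper name and error handling are mine); the mechanism itself is plain tokio behavior: `tokio::time::timeout` drops the `output()` future on expiry, and that drop is exactly what makes `kill_on_drop(true)` finally fire and reap the child.

```rust
use std::io;
use std::process::Output;
use std::time::Duration;
use tokio::process::Command;
use tokio::time::timeout;

/// Hypothetical helper, not the real lib/command API: run a child with
/// a wall-clock ceiling. On expiry, `timeout` drops the `output()`
/// future; with `kill_on_drop(true)` set, tokio then kills and reaps
/// the child instead of leaking an orphan PID.
async fn output_with_ceiling(
    mut cmd: Command,
    ceiling: Duration,
) -> io::Result<Output> {
    cmd.kill_on_drop(true);
    match timeout(ceiling, cmd.output()).await {
        Ok(res) => res,
        // The output() future was dropped here, so the child was killed.
        Err(_elapsed) => Err(io::Error::new(
            io::ErrorKind::TimedOut,
            "command timed out",
        )),
    }
}
```

If the ceiling were plumbed from config, that would also make the UI's `timeout_seconds` knob meaningful again.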
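Independently, the monitor blast radius can be capped without touching RPC at all. A sketch with hypothetical stand-ins (`Stats`, `collect_stats`) for whatever the real per-server future is: give each future its own timeout before `join_all`, so a wedged host fails alone instead of freezing `Stats` writes fleet-wide.

```rust
use std::time::Duration;
use futures::future::join_all;
use tokio::time::timeout;

// Placeholder for the real per-server stats payload.
struct Stats;

// Hypothetical stand-in for the per-server future built in monitor.
async fn collect_stats(_server_id: &str) -> Result<Stats, String> {
    Ok(Stats)
}

async fn record_all(server_ids: &[String]) {
    let tasks = server_ids.iter().map(|id| async move {
        // Per-host ceiling: a wedged host fails alone instead of
        // pinning join_all (and Stats writes) for the whole fleet.
        let res = match timeout(Duration::from_secs(30), collect_stats(id)).await {
            Ok(inner) => inner,
            Err(_elapsed) => Err(format!("{id}: stats collection timed out")),
        };
        (id.clone(), res)
    });
    for (_id, _res) in join_all(tasks).await {
        // The record_server_stats(ts) write happens here every cycle,
        // regardless of any single bad host.
    }
}
```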
Also: `Server.config.timeout_seconds` seems to be dead code in v2.1.2. `rg timeout_seconds` returns zero hits across `bin/`, `lib/`, and `client/`, but the UI still exposes it as a knob.
Killing the orphan `docker compose ls` PIDs on the host fixed everything in ~60s, no service restart needed.
Should the fix go at the keepalive layer (cap how many `Pending`s can reset the timeout, or add a wall-clock RPC ceiling), or in `lib/command` (timeout the subprocess directly)? Happy to PR either if you can point me at the preferred direction.
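If the keepalive layer is the preferred spot, the shape I have in mind is a deadline that `Pending` cannot push back. Sketch only; `Msg`, `recv_response`, and the 10s/300s numbers are placeholders, not the actual types around `bin/core/src/periphery/mod.rs:138`:

```rust
use std::time::Duration;
use tokio::sync::mpsc::Receiver;
use tokio::time::{timeout, Instant};

// Placeholder message type mirroring the Pending / final-response split.
enum Msg {
    Pending,          // keepalive: only proves the connection is alive
    Response(String), // the actual RPC result
}

async fn recv_response(rx: &mut Receiver<Msg>) -> Result<String, String> {
    // Wall-clock ceiling that keepalives cannot reset.
    let deadline = Instant::now() + Duration::from_secs(300);
    loop {
        // The 10s per-message timeout still catches dead connections fast.
        let msg = timeout(Duration::from_secs(10), rx.recv())
            .await
            .map_err(|_| "no message for 10s".to_string())?
            .ok_or("channel closed")?;
        match msg {
            // Pending buys time, but only up to the hard deadline.
            Msg::Pending if Instant::now() < deadline => continue,
            Msg::Pending => return Err("RPC exceeded wall-clock ceiling".into()),
            Msg::Response(r) => return Ok(r),
        }
    }
}
```

A cap on consecutive `Pending`s would work too; the deadline version just fails at a predictable wall-clock time regardless of keepalive cadence.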