Periphery Pending keepalive defeats Core's RPC timeout: hung subprocess on one host can wedge a deploy and freeze fleet-wide Stats writes #1392

Description

@vvv850

A docker compose ls child on one of our v2.1.2 periphery hosts hung in futex_wait_queue (looked like a leftover from a context cancellation during an image manifest read). Subprocess hangs happen. What surprised us was the blast radius:

  • A DeployStack against that host hung for 104 minutes with the UI stuck on "Deploying…". Only restarting Core ended it. Trying DestroyStack as a workaround was rejected with Resource is busy.
  • Stats writes for all 6 servers in the fleet stopped at the exact same instant the subprocess hung, and resumed the moment we killed the orphan PID. ~46 hours of frozen monitoring.
  • No alert. Periphery's websocket login flag stayed green throughout.
  • After we manually restarted Core to break the hung deploy, all 6 peripheries re-logged in over websocket within ~1s, but one remained "down" in the UI until we killed the orphan PID. During that window Core spammed WARN Failed to forward Response message | No response channel found at <uuid> every 5s for a single channel_id: that is periphery's keepalive still firing for the wedged pre-restart request after Core's matching channel was gone. So restarting Core is not a real recovery here; the wedged tokio task on the periphery outlives Core's lifecycle.

The chain in source:

  • lib/command/src/lib.rs:160: cmd.output().await has no timeout, so a hung child pins the future forever. kill_on_drop(true) is set but never fires, because nothing ever drops the future.
  • bin/periphery/src/connection/mod.rs:182: the request handler runs in a tokio::select! against an infinite ping_in_progress loop that sends Pending every 5s. The keepalive stops only when resolve_response finishes, which it never will here.
  • bin/core/src/periphery/mod.rs:138: Core loops on recv().with_timeout(10s) and continues on every Pending. With keepalives every 5s, the effective RPC timeout is infinite (first sketch below).
  • bin/core/src/monitor/mod.rs:79: record_server_stats(ts) is gated on join_all(futures).await, so one never-returning future blocks Stats writes for every host (second sketch below).
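
To make the timeout defeat concrete, here is a minimal, self-contained repro of the pattern (illustrative names only, not the project's actual code): a keepalive firing every 5s against a receiver whose per-message timeout is 10s means that timeout can never elapse, so the RPC deadline is effectively infinite for as long as the handler is pinned.

```rust
use tokio::{
    sync::mpsc,
    time::{self, timeout, Duration},
};

#[allow(dead_code)] // Response is never constructed: that is the bug being shown
#[derive(Debug)]
enum Msg {
    Pending,          // keepalive sent every 5s while the handler runs
    Response(String), // final result; never arrives while the child hangs
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Msg>(8);

    // "Periphery" side: the keepalive loop keeps winning because the
    // handler (a hung subprocess) never resolves.
    tokio::spawn(async move {
        let mut tick = time::interval(Duration::from_secs(5));
        loop {
            tick.tick().await;
            if tx.send(Msg::Pending).await.is_err() {
                break;
            }
        }
    });

    // "Core" side: 10s timeout per recv, but every Pending resets it.
    // This loop spins forever; Ctrl-C to exit.
    loop {
        match timeout(Duration::from_secs(10), rx.recv()).await {
            Ok(Some(Msg::Pending)) => continue, // timeout reset, hang persists
            Ok(Some(Msg::Response(r))) => {
                println!("got {r}");
                break;
            }
            Ok(None) => break, // channel closed
            Err(_) => {
                eprintln!("rpc timed out"); // never happens: 5s < 10s
                break;
            }
        }
    }
}
```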

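A similar sketch for the monitor gating (again illustrative, assuming the join_all shape described above; needs the futures crate): one per-server future that never resolves starves the whole batch, so Stats writes stop fleet-wide.

```rust
use futures::future::join_all;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    // Three per-server polls; server 1 plays the wedged host.
    let polls = (0..3).map(|i| async move {
        if i == 1 {
            std::future::pending::<()>().await; // never resolves
        } else {
            sleep(Duration::from_millis(100)).await;
            println!("server {i} stats collected");
        }
    });
    // The equivalent of record_server_stats(ts) runs only after ALL
    // polls finish, so one hung poll freezes writes for every server.
    join_all(polls).await;
    println!("stats written"); // never reached while server 1 hangs
}
```

Per-future timeouts (or FuturesUnordered with a deadline) would decouple the healthy hosts from the hung one.
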
Also: Server.config.timeout_seconds appears to be dead code in v2.1.2. rg timeout_seconds returns zero hits across bin/, lib/, and client/, yet the UI still exposes it as a knob.

Killing the orphan docker compose ls PIDs on the host fixed everything in ~60s, no service restart needed.

Should the fix go at the keepalive layer (cap how many Pendings reset the timeout, or add a wall-clock RPC ceiling), or in lib/command (timeout the subprocess directly)? Happy to PR either if you can point me at the preferred direction.
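
If the lib/command direction is preferred, here is roughly what I had in mind (a sketch under assumptions: run_with_ceiling and the exact call-site shape are hypothetical). Bounding cmd.output() with tokio::time::timeout means the output future actually gets dropped on expiry, which is what finally lets the existing kill_on_drop(true) fire and reap the child.

```rust
use std::process::Output;
use tokio::{
    process::Command,
    time::{timeout, Duration},
};

// Hypothetical helper: run a command under a hard wall-clock ceiling.
async fn run_with_ceiling(
    mut cmd: Command,
    ceiling: Duration,
) -> std::io::Result<Output> {
    cmd.kill_on_drop(true); // already set in lib/command today
    match timeout(ceiling, cmd.output()).await {
        Ok(res) => res,
        // Elapsing drops the output future, so kill_on_drop reaps the child.
        Err(_) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            format!("subprocess exceeded {ceiling:?}"),
        )),
    }
}
```

A wall-clock ceiling at the keepalive layer would compose the same way: track an Instant deadline per request and stop letting Pendings reset the timeout once it passes. Either approach would also give the currently dead Server.config.timeout_seconds something real to configure.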
