Periphery Pending keepalive defeats Core's RPC timeout: hung subprocess on one host can wedge a deploy and freeze fleet-wide Stats writes #1392

Description

@vvv850

A docker compose ls child on one of our v2.1.2 periphery hosts hung in futex_wait_queue (looked like a leftover from a context cancellation during an image manifest read). Subprocess hangs happen. What surprised us was the blast radius:

  • A DeployStack against that host hung for 104 minutes with the UI stuck on "Deploying…". Only restarting Core ended it. Trying DestroyStack as a workaround was rejected with Resource is busy.
  • Stats writes for all 6 servers in the fleet stopped at the exact same instant the subprocess hung, and resumed the moment we killed the orphan PID. ~46 hours of frozen monitoring.
  • No alert. Periphery's websocket login flag stayed green throughout.
  • After we manually restarted Core to break the hung deploy, all 6 peripheries re-logged in over websocket within ~1s, but one remained "down" in the UI until we killed the orphan PID. During that window Core spammed WARN Failed to forward Response message | No response channel found at <uuid> every 5s for a single channel_id: that is periphery's keepalive still firing for the wedged pre-restart request after Core's matching channel was gone. So restarting Core is not a real recovery here; the wedged tokio task on the periphery outlives Core's lifecycle.

The chain in source:

  • lib/command/src/lib.rs:160: cmd.output().await has no timeout, so a hung child pins the future forever. kill_on_drop(true) is set but never fires, because nothing ever drops the future.
  • bin/periphery/src/connection/mod.rs:182: the request handler runs in a tokio::select! against an infinite ping_in_progress loop that sends Pending every 5s. The keepalive stops only when resolve_response finishes, which it never will here.
  • bin/core/src/periphery/mod.rs:138: Core loops on recv().with_timeout(10s) and continues on every Pending. With keepalives every 5s, the effective RPC timeout is infinite (first sketch below).
  • bin/core/src/monitor/mod.rs:79: record_server_stats(ts) is gated on join_all(futures).await, so one never-returning future blocks Stats writes for every host (second sketch below).
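
To make the timeout defeat concrete, here is a minimal, self-contained repro of the pattern (illustrative names only, not the project's actual code): a keepalive firing every 5s against a receiver whose per-message timeout is 10s means that timeout can never elapse, so the RPC deadline is effectively infinite for as long as the handler is pinned.

```rust
use tokio::{
    sync::mpsc,
    time::{self, timeout, Duration},
};

#[allow(dead_code)] // Response is never constructed: that is the bug being shown
#[derive(Debug)]
enum Msg {
    Pending,          // keepalive sent every 5s while the handler runs
    Response(String), // final result; never arrives while the child hangs
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Msg>(8);

    // "Periphery" side: the keepalive loop keeps winning because the
    // handler (a hung subprocess) never resolves.
    tokio::spawn(async move {
        let mut tick = time::interval(Duration::from_secs(5));
        loop {
            tick.tick().await;
            if tx.send(Msg::Pending).await.is_err() {
                break;
            }
        }
    });

    // "Core" side: 10s timeout per recv, but every Pending resets it.
    // This loop spins forever; Ctrl-C to exit.
    loop {
        match timeout(Duration::from_secs(10), rx.recv()).await {
            Ok(Some(Msg::Pending)) => continue, // timeout reset, hang persists
            Ok(Some(Msg::Response(r))) => {
                println!("got {r}");
                break;
            }
            Ok(None) => break, // channel closed
            Err(_) => {
                eprintln!("rpc timed out"); // never happens: 5s < 10s
                break;
            }
        }
    }
}
```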

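A similar sketch for the monitor gating (again illustrative, assuming the join_all shape described above; needs the futures crate): one per-server future that never resolves starves the whole batch, so Stats writes stop fleet-wide.

```rust
use futures::future::join_all;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    // Three per-server polls; server 1 plays the wedged host.
    let polls = (0..3).map(|i| async move {
        if i == 1 {
            std::future::pending::<()>().await; // never resolves
        } else {
            sleep(Duration::from_millis(100)).await;
            println!("server {i} stats collected");
        }
    });
    // The equivalent of record_server_stats(ts) runs only after ALL
    // polls finish, so one hung poll freezes writes for every server.
    join_all(polls).await;
    println!("stats written"); // never reached while server 1 hangs
}
```

Per-future timeouts (or FuturesUnordered with a deadline) would decouple the healthy hosts from the hung one.
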
Also: Server.config.timeout_seconds appears to be dead code in v2.1.2. rg timeout_seconds returns zero hits across bin/, lib/, and client/, yet the UI still exposes it as a knob.

Killing the orphan docker compose ls PIDs on the host fixed everything in ~60s, no service restart needed.

Should the fix go at the keepalive layer (cap how many Pendings reset the timeout, or add a wall-clock RPC ceiling), or in lib/command (timeout the subprocess directly)? Happy to PR either if you can point me at the preferred direction.
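
If the lib/command direction is preferred, here is roughly what I had in mind (a sketch under assumptions: run_with_ceiling and the exact call-site shape are hypothetical). Bounding cmd.output() with tokio::time::timeout means the output future actually gets dropped on expiry, which is what finally lets the existing kill_on_drop(true) fire and reap the child.

```rust
use std::process::Output;
use tokio::{
    process::Command,
    time::{timeout, Duration},
};

// Hypothetical helper: run a command under a hard wall-clock ceiling.
async fn run_with_ceiling(
    mut cmd: Command,
    ceiling: Duration,
) -> std::io::Result<Output> {
    cmd.kill_on_drop(true); // already set in lib/command today
    match timeout(ceiling, cmd.output()).await {
        Ok(res) => res,
        // Elapsing drops the output future, so kill_on_drop reaps the child.
        Err(_) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            format!("subprocess exceeded {ceiling:?}"),
        )),
    }
}
```

A wall-clock ceiling at the keepalive layer would compose the same way: track an Instant deadline per request and stop letting Pendings reset the timeout once it passes. Either approach would also give the currently dead Server.config.timeout_seconds something real to configure.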
