Follow-up from #537 (Devin Review).
The bounded EOS-flush idle watchdog added in #537 abandons a stalled SVT-AV1 codec task after 30s of idle, logs an error, and lets the node finalize whatever output was already produced. However, run_encoder still emits a normal Stopped("input_closed") state event and returns Ok(()), so a caller cannot programmatically distinguish a clean flush from a truncated one. In addition, a truly stuck spawn_blocking OS thread is detached (the encoder handle is intentionally leaked), so repeated trips could leak native resources.
For the rare-flake fix in #537 this is an acceptable, logged tradeoff (completing the request with partial output beats hanging to the 300s client timeout). But the watchdog trip should ideally be propagated out of drain_codec_results / codec_forward_loop so callers and state events can mark the run as degraded/failed rather than a successful encode.
This changes the node-failure contract, so it deserves its own discussion separate from the flake fix.
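As a starting point for that discussion, here is a minimal, std-only sketch of one way the trip could be surfaced. Everything in it is hypothetical: `DrainOutcome`, `drain_results`, `finish_run`, and the `"flush_watchdog_tripped"` reason string are illustrative names, not existing APIs. The real `drain_codec_results` / `codec_forward_loop` are async and have different signatures. The idea is only that the drain step returns a typed outcome instead of logging and falling through, and that the caller maps it to a distinct stop reason and an `Err`.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome type; not part of the existing codebase.
#[derive(Debug, PartialEq, Eq)]
pub enum DrainOutcome {
    /// The codec task closed its result channel after a complete flush.
    Clean { packets: usize },
    /// The idle watchdog fired before the channel closed; output may be truncated.
    WatchdogTripped { packets: usize },
}

/// Stand-in for the drain loop: counts packets until the sender side is
/// dropped (clean flush) or no packet arrives within `idle` (watchdog trip).
pub fn drain_results(rx: &Receiver<Vec<u8>>, idle: Duration) -> DrainOutcome {
    let mut packets = 0;
    loop {
        match rx.recv_timeout(idle) {
            Ok(_packet) => packets += 1,
            Err(RecvTimeoutError::Disconnected) => return DrainOutcome::Clean { packets },
            Err(RecvTimeoutError::Timeout) => return DrainOutcome::WatchdogTripped { packets },
        }
    }
}

/// Hypothetical mapping from the drain outcome to the stop reason carried by
/// the state event and to the value the encoder run would return.
pub fn finish_run(outcome: &DrainOutcome) -> (&'static str, Result<(), String>) {
    match outcome {
        DrainOutcome::Clean { .. } => ("input_closed", Ok(())),
        DrainOutcome::WatchdogTripped { packets } => (
            "flush_watchdog_tripped",
            Err(format!(
                "EOS flush stalled after {packets} packets; output may be truncated"
            )),
        ),
    }
}
```

Whether a trip should map to a hard `Err` or to an `Ok` carrying a "degraded" marker is exactly the contract question raised above. The sketch picks `Err` only to make the distinction visible.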
Refs:
crates/nodes/src/codec_utils.rs (watchdog branch)
crates/nodes/src/video/encoder_trait.rs (run_encoder always returns Ok)