[bug] in_progress quests have no stale-timeout recovery; crashed-agent claims held indefinitely #89

@kunallanjewar

Summary

When an agent accepts a quest, quest_accept atomically claims it via a compare-and-swap (CAS) UPDATE (see internal/quest/accept.go:90), and the quest stays in_progress until an explicit quest_clear or quest_forfeit. If the agent crashes, disconnects, or otherwise terminates without releasing the quest, there is no heartbeat or timeout mechanism to recover the claim, so the quest is held indefinitely by a dead owner.

The atomic re-claim rejection (returning AlreadyClaimedError) works correctly for the concurrent-accept race, but it also blocks legitimate recovery: a new agent cannot pick up a stuck quest, and there is no operator-side surface for diagnosing how long a quest has been in flight or which owner held it last.

Affected files

  • internal/quest/accept.go
  • internal/quest/list.go (status filters / introspection)
  • internal/quest/forfeit.go (manual release path)

Acceptance

  • A configurable stale threshold (e.g. claimed_at is more than N minutes in the past) is surfaced via quest list or a new introspection verb.
  • Stale claims can be auto-released, OR explicitly reclaimed via an override flag (e.g. --force) so the choice stays with the operator.
  • The current atomic re-claim contract for non-stale claims is preserved.
  • Output surfaces claimed_at and claimed_by so an operator can decide before overriding.

Metadata

Assignees

No one assigned

    Labels

    area: quest (Quest board / task coordination), bug (Something isn't working), priority: P1 (High priority), size: M (< 200 lines)
