
Self-repair Phase 5: failure cluster store with PVC backing#374

Merged

rockfordlhotka merged 5 commits into main from feature/failure-cluster-store on May 9, 2026

Conversation

@rockfordlhotka
Member

Closes #349

Summary

  • In-process ConcurrentDictionary<ClusterKey, FailureCluster> keyed by (server, tool, errorClass) for hot reads/writes.
  • PVC-backed persistence: append-only failure-clusters.jsonl plus a periodic snapshot at failure-clusters.snapshot.json, both under /data/agent/telemetry. On startup, the snapshot is loaded, then JSONL events with at >= snapshotWrittenAt are replayed; on flush, the snapshot is written atomically and the JSONL is truncated (see the sketch after this list).
  • McpRecoveryExecutor records every post-recovery failure (chain-exhausted, Stage A retry-fail, Stage B retry-fail, Stage B fill-fail, and the no-provider/no-StageB short-circuit) with the originating session id from ToolInvokeRequest.SessionId. Auto-recovered calls do NOT record — those remain in RecoveryDiagnostics metrics only.
  • GetEscalatableAsync filters to Count >= 3 && SessionIds.Count >= 2 && (now - LastSeen) < 24h, so Phase 4 (#348, closed-loop repair tickets in DreamService) can iterate ready-to-ticket clusters without re-implementing the threshold; the same predicate appears in the sketch after this list.
  • Bounds on per-cluster state: 5 most-recent sample messages (truncated at 512 chars each), 64 distinct session ids; configurable via FailureClusterOptions.
  • Recording and the snapshot/JSONL writes share a single file lock, so concurrent records can't be silently dropped between flush capture and JSONL truncation.
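To make the store's behavior concrete, here is a minimal sketch of the load/flush cycle and the escalation predicate described above. The file names, the /data/agent/telemetry location, the bounds (5 samples at 512 chars, 64 session ids), and the threshold follow this summary; the type shapes (FailureEvent, ClusterSnapshot, a string cluster key) and the single-threaded file handling are simplifications for illustration, not the repository's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical simplification of one recorded failure event.
public sealed record FailureEvent(string Server, string Tool, string ErrorClass,
    string SessionId, string Message, DateTimeOffset At);

// Hypothetical per-cluster state; the real store bounds these collections.
public sealed class FailureCluster
{
    public int Count { get; set; }
    public HashSet<string> SessionIds { get; set; } = new();
    public List<string> SampleMessages { get; set; } = new();
    public DateTimeOffset LastSeen { get; set; }
}

public sealed class ClusterSnapshot
{
    public DateTimeOffset WrittenAt { get; set; }
    public Dictionary<string, FailureCluster> Clusters { get; set; } = new();
}

public static class ClusterPersistenceSketch
{
    const string Dir = "/data/agent/telemetry";
    static string SnapshotPath => Path.Combine(Dir, "failure-clusters.snapshot.json");
    static string JsonlPath => Path.Combine(Dir, "failure-clusters.jsonl");

    // Startup: load the snapshot, then replay JSONL events with At >= WrittenAt.
    public static ClusterSnapshot Load()
    {
        var state = File.Exists(SnapshotPath)
            ? JsonSerializer.Deserialize<ClusterSnapshot>(File.ReadAllText(SnapshotPath))!
            : new ClusterSnapshot();

        if (!File.Exists(JsonlPath)) return state;
        foreach (var line in File.ReadLines(JsonlPath))
        {
            FailureEvent? ev;
            try { ev = JsonSerializer.Deserialize<FailureEvent>(line); }
            catch (JsonException) { continue; } // corrupt-line recovery: skip bad lines
            if (ev is null || ev.At < state.WrittenAt) continue;
            Apply(state, ev);
        }
        return state;
    }

    // Hot path: append one JSONL line, then update the in-memory state.
    public static void Record(ClusterSnapshot state, FailureEvent ev)
    {
        File.AppendAllText(JsonlPath, JsonSerializer.Serialize(ev) + Environment.NewLine);
        Apply(state, ev);
    }

    // Flush: write the snapshot atomically (temp file + rename), then truncate the JSONL.
    public static void Flush(ClusterSnapshot state)
    {
        state.WrittenAt = DateTimeOffset.UtcNow;
        var tmp = SnapshotPath + ".tmp";
        File.WriteAllText(tmp, JsonSerializer.Serialize(state));
        File.Move(tmp, SnapshotPath, overwrite: true);
        File.WriteAllText(JsonlPath, string.Empty);
    }

    // Phase 4 escalation filter: Count >= 3, two distinct sessions, seen within 24h.
    public static IEnumerable<FailureCluster> GetEscalatable(
        ClusterSnapshot state, DateTimeOffset now) =>
        state.Clusters.Values
            .Where(c => c.Count >= 3 && c.SessionIds.Count >= 2
                        && now - c.LastSeen < TimeSpan.FromHours(24))
            .OrderByDescending(c => c.LastSeen);

    static void Apply(ClusterSnapshot state, FailureEvent ev)
    {
        var key = $"{ev.Server}|{ev.Tool}|{ev.ErrorClass}".ToLowerInvariant();
        if (!state.Clusters.TryGetValue(key, out var cluster))
            state.Clusters[key] = cluster = new FailureCluster();
        cluster.Count++;
        cluster.LastSeen = ev.At;
        if (cluster.SessionIds.Count < 64) cluster.SessionIds.Add(ev.SessionId);
        if (cluster.SampleMessages.Count >= 5) cluster.SampleMessages.RemoveAt(0);
        cluster.SampleMessages.Add(ev.Message.Length > 512 ? ev.Message[..512] : ev.Message);
    }
}
```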

Test plan

  • FailureClusterTests — ClusterKey lowercases server/tool, rejects blanks, equality is canonical.
  • FailureErrorClassifierTests — extracts field names from each Phase-1 pattern, falls back to unknown.
  • FileFailureClusterStoreTests (12 cases) — escalation thresholds (count, session count, recency window), bounds enforcement, snapshot+JSONL persistence round-trip, JSONL truncation, corrupt-line recovery, ordering by LastSeen desc; a predicate-level sketch of the threshold case follows this list.
  • McpRecoveryExecutorFailureClusterTests (9 cases) — auto-recovered does NOT record; each exhausted-recovery branch DOES record with the right errorClass and forwarded sessionId; null IFailureClusterStore keeps Phase-1 contract; throwing store does not break recovery.
  • Full suite: 16 test assemblies, 0 failures.
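For reference, here is a minimal xUnit-style sketch of the escalation-threshold case. It tests a pure form of the predicate rather than the actual FileFailureClusterStore API, which this description does not show; all names here are illustrative:

```csharp
using System;
using Xunit;

public class EscalationPredicateTests
{
    // Pure form of the Phase-5 threshold: Count >= 3, at least two distinct
    // sessions, and last seen within the past 24 hours.
    static bool IsEscalatable(int count, int sessionCount,
        DateTimeOffset lastSeen, DateTimeOffset now) =>
        count >= 3 && sessionCount >= 2 && now - lastSeen < TimeSpan.FromHours(24);

    [Fact]
    public void ThreeFailuresAcrossTwoSessions_InWindow_IsEscalatable()
    {
        var now = DateTimeOffset.UtcNow;
        Assert.True(IsEscalatable(3, 2, now.AddHours(-1), now));
    }

    [Theory]
    [InlineData(2, 2, -1)]   // too few failures
    [InlineData(3, 1, -1)]   // only one session
    [InlineData(3, 2, -25)]  // outside the 24h window
    public void BelowAnyThreshold_IsNotEscalatable(int count, int sessions, int hoursAgo)
    {
        var now = DateTimeOffset.UtcNow;
        Assert.False(IsEscalatable(count, sessions, now.AddHours(hoursAgo), now));
    }
}
```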

Acceptance criteria (from #349)

| Criterion | Coverage |
| --- | --- |
| 3 same-class failures across 2 sessions in 24h → one escalatable cluster | GetEscalatable_ThreeFailuresAcrossTwoSessions_InWindow_ReturnsCluster |
| Restart preserves cluster state via PVC | Persistence_SnapshotAndJsonl_RestoreClusterState |
| JSONL truncated after each snapshot | Flush_TruncatesJsonl |
| Auto-recovered calls do NOT show up | AutoRecovered_StageA_DoesNotRecord |

Out of scope

🤖 Generated with Claude Code

rockfordlhotka and others added 5 commits May 8, 2026 16:20
In-process ConcurrentDictionary keyed by (server, tool, errorClass) backed by
an append-only JSONL log plus a periodic JSON snapshot under
/data/agent/telemetry. McpRecoveryExecutor records every post-recovery
failure (chain-exhausted, Stage A retry-fail, Stage B retry-fail, Stage B
fill-fail, and no-provider/no-StageB) with the originating session id;
auto-recovered calls do not record. GetEscalatableAsync surfaces clusters
satisfying Count >= 3 && SessionIds.Count >= 2 && LastSeen within 24h, ready
for Phase 4 (#348) ticket creation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both early-exit branches in McpRecoveryExecutor previously returned without
recording — non-schema errors (auth/network/server-side) and "X is required
but X is in args" cases were invisible to the failure store. Now both record
under errorClass=unknown so DreamService can spot recurring patterns the
recovery layer can't classify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
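A minimal sketch of the change this commit describes. The IFailureClusterStore method signature is assumed for illustration, not taken from the repository:

```csharp
// Hypothetical store surface; the real interface's shape is not shown in this PR.
public interface IFailureClusterStore
{
    void Record(string server, string tool, string errorClass,
        string sessionId, string message);
}

public static class EarlyExitRecordingSketch
{
    // Both early-exit branches previously returned here without recording.
    // Now the failure lands in the store under errorClass=unknown, so
    // DreamService can spot recurring patterns recovery cannot classify.
    public static void OnUnclassifiedFailure(IFailureClusterStore? store,
        string server, string tool, string sessionId, string message)
        => store?.Record(server, tool, errorClass: "unknown",
            sessionId: sessionId, message: message);
}
```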
Other hosted services may dispose their CancellationTokenSources during
shutdown; passing the host's stopping token to SemaphoreSlim.WaitAsync after
that point throws ObjectDisposedException. Shutdown flush is critical work
and shouldn't be cancellable anyway — the timer callback already uses
CancellationToken.None.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
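A minimal sketch of the shutdown hazard and the fix this commit describes; the hosted-service shape is hypothetical, and only the SemaphoreSlim/CancellationToken behavior is the point:

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class ShutdownFlushExample
{
    private readonly SemaphoreSlim _fileLock = new(1, 1);

    // Before: if the CancellationTokenSource behind stoppingToken has already
    // been disposed by another hosted service, WaitAsync(stoppingToken) throws
    // ObjectDisposedException when it tries to register on the token.
    public async Task StopAsync_Buggy(CancellationToken stoppingToken)
    {
        await _fileLock.WaitAsync(stoppingToken);
        try { /* write snapshot, truncate JSONL */ }
        finally { _fileLock.Release(); }
    }

    // After: the shutdown flush is critical work and should not be
    // cancellable, so pass CancellationToken.None, matching what the
    // timer callback already does.
    public async Task StopAsync_Fixed(CancellationToken stoppingToken)
    {
        await _fileLock.WaitAsync(CancellationToken.None);
        try { /* write snapshot, truncate JSONL */ }
        finally { _fileLock.Release(); }
    }
}
```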
rockfordlhotka merged commit a9b29f2 into main on May 9, 2026
2 checks passed
rockfordlhotka deleted the feature/failure-cluster-store branch on May 9, 2026 at 00:01
