
Self-repair Phase 5: failure cluster store with PVC backing#374

Merged

rockfordlhotka merged 5 commits into main from feature/failure-cluster-store on May 9, 2026

Conversation

@rockfordlhotka
Member

Closes #349

Summary

  • In-process ConcurrentDictionary<ClusterKey, FailureCluster> keyed by (server, tool, errorClass) for hot reads/writes.
  • PVC-backed persistence: append-only failure-clusters.jsonl plus a periodic snapshot at failure-clusters.snapshot.json, both under /data/agent/telemetry. On startup, the snapshot is loaded, then JSONL events with at >= snapshotWrittenAt are replayed; on flush, the snapshot is written atomically and the JSONL is truncated (see the sketch after this list).
  • McpRecoveryExecutor records every post-recovery failure (chain-exhausted, Stage A retry-fail, Stage B retry-fail, Stage B fill-fail, and the no-provider/no-StageB short-circuit) with the originating session id from ToolInvokeRequest.SessionId. Auto-recovered calls do NOT record — those remain in RecoveryDiagnostics metrics only.
  • GetEscalatableAsync filters to Count >= 3 && SessionIds.Count >= 2 && (now - LastSeen) < 24h, so Phase 4 (#348, closed-loop repair tickets in DreamService) can iterate ready-to-ticket clusters without re-implementing the threshold; the same predicate appears in the sketch after this list.
  • Bounds on per-cluster state: 5 most-recent sample messages (truncated at 512 chars each), 64 distinct session ids; configurable via FailureClusterOptions.
  • Recording and the snapshot/JSONL writes share a single file lock, so concurrent records can't be silently dropped between flush capture and JSONL truncation.
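To make the store's behavior concrete, here is a minimal sketch of the load/flush cycle and the escalation predicate described above. The file names, the /data/agent/telemetry location, the bounds (5 samples at 512 chars, 64 session ids), and the threshold follow this summary; the type shapes (FailureEvent, ClusterSnapshot, a string cluster key) and the single-threaded file handling are simplifications for illustration, not the repository's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical simplification of one recorded failure event.
public sealed record FailureEvent(string Server, string Tool, string ErrorClass,
    string SessionId, string Message, DateTimeOffset At);

// Hypothetical per-cluster state; the real store bounds these collections.
public sealed class FailureCluster
{
    public int Count { get; set; }
    public HashSet<string> SessionIds { get; set; } = new();
    public List<string> SampleMessages { get; set; } = new();
    public DateTimeOffset LastSeen { get; set; }
}

public sealed class ClusterSnapshot
{
    public DateTimeOffset WrittenAt { get; set; }
    public Dictionary<string, FailureCluster> Clusters { get; set; } = new();
}

public static class ClusterPersistenceSketch
{
    const string Dir = "/data/agent/telemetry";
    static string SnapshotPath => Path.Combine(Dir, "failure-clusters.snapshot.json");
    static string JsonlPath => Path.Combine(Dir, "failure-clusters.jsonl");

    // Startup: load the snapshot, then replay JSONL events with At >= WrittenAt.
    public static ClusterSnapshot Load()
    {
        var state = File.Exists(SnapshotPath)
            ? JsonSerializer.Deserialize<ClusterSnapshot>(File.ReadAllText(SnapshotPath))!
            : new ClusterSnapshot();

        if (!File.Exists(JsonlPath)) return state;
        foreach (var line in File.ReadLines(JsonlPath))
        {
            FailureEvent? ev;
            try { ev = JsonSerializer.Deserialize<FailureEvent>(line); }
            catch (JsonException) { continue; } // corrupt-line recovery: skip bad lines
            if (ev is null || ev.At < state.WrittenAt) continue;
            Apply(state, ev);
        }
        return state;
    }

    // Hot path: append one JSONL line, then update the in-memory state.
    public static void Record(ClusterSnapshot state, FailureEvent ev)
    {
        File.AppendAllText(JsonlPath, JsonSerializer.Serialize(ev) + Environment.NewLine);
        Apply(state, ev);
    }

    // Flush: write the snapshot atomically (temp file + rename), then truncate the JSONL.
    public static void Flush(ClusterSnapshot state)
    {
        state.WrittenAt = DateTimeOffset.UtcNow;
        var tmp = SnapshotPath + ".tmp";
        File.WriteAllText(tmp, JsonSerializer.Serialize(state));
        File.Move(tmp, SnapshotPath, overwrite: true);
        File.WriteAllText(JsonlPath, string.Empty);
    }

    // Phase 4 escalation filter: Count >= 3, two distinct sessions, seen within 24h.
    public static IEnumerable<FailureCluster> GetEscalatable(
        ClusterSnapshot state, DateTimeOffset now) =>
        state.Clusters.Values
            .Where(c => c.Count >= 3 && c.SessionIds.Count >= 2
                        && now - c.LastSeen < TimeSpan.FromHours(24))
            .OrderByDescending(c => c.LastSeen);

    static void Apply(ClusterSnapshot state, FailureEvent ev)
    {
        var key = $"{ev.Server}|{ev.Tool}|{ev.ErrorClass}".ToLowerInvariant();
        if (!state.Clusters.TryGetValue(key, out var cluster))
            state.Clusters[key] = cluster = new FailureCluster();
        cluster.Count++;
        cluster.LastSeen = ev.At;
        if (cluster.SessionIds.Count < 64) cluster.SessionIds.Add(ev.SessionId);
        if (cluster.SampleMessages.Count >= 5) cluster.SampleMessages.RemoveAt(0);
        cluster.SampleMessages.Add(ev.Message.Length > 512 ? ev.Message[..512] : ev.Message);
    }
}
```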

Test plan

  • FailureClusterTests — ClusterKey lowercases server/tool, rejects blanks, equality is canonical.
  • FailureErrorClassifierTests — extracts field names from each Phase-1 pattern, falls back to unknown.
  • FileFailureClusterStoreTests (12 cases) — escalation thresholds (count, session count, recency window), bounds enforcement, snapshot+JSONL persistence round-trip, JSONL truncation, corrupt-line recovery, ordering by LastSeen desc; a predicate-level sketch of the threshold case follows this list.
  • McpRecoveryExecutorFailureClusterTests (9 cases) — auto-recovered does NOT record; each exhausted-recovery branch DOES record with the right errorClass and forwarded sessionId; null IFailureClusterStore keeps Phase-1 contract; throwing store does not break recovery.
  • Full suite: 16 test assemblies, 0 failures.
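For reference, here is a minimal xUnit-style sketch of the escalation-threshold case. It tests a pure form of the predicate rather than the actual FileFailureClusterStore API, which this description does not show; all names here are illustrative:

```csharp
using System;
using Xunit;

public class EscalationPredicateTests
{
    // Pure form of the Phase-5 threshold: Count >= 3, at least two distinct
    // sessions, and last seen within the past 24 hours.
    static bool IsEscalatable(int count, int sessionCount,
        DateTimeOffset lastSeen, DateTimeOffset now) =>
        count >= 3 && sessionCount >= 2 && now - lastSeen < TimeSpan.FromHours(24);

    [Fact]
    public void ThreeFailuresAcrossTwoSessions_InWindow_IsEscalatable()
    {
        var now = DateTimeOffset.UtcNow;
        Assert.True(IsEscalatable(3, 2, now.AddHours(-1), now));
    }

    [Theory]
    [InlineData(2, 2, -1)]   // too few failures
    [InlineData(3, 1, -1)]   // only one session
    [InlineData(3, 2, -25)]  // outside the 24h window
    public void BelowAnyThreshold_IsNotEscalatable(int count, int sessions, int hoursAgo)
    {
        var now = DateTimeOffset.UtcNow;
        Assert.False(IsEscalatable(count, sessions, now.AddHours(hoursAgo), now));
    }
}
```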

Acceptance criteria (from #349)

| Criterion | Coverage |
| --- | --- |
| 3 same-class failures across 2 sessions in 24h → one escalatable cluster | GetEscalatable_ThreeFailuresAcrossTwoSessions_InWindow_ReturnsCluster |
| Restart preserves cluster state via PVC | Persistence_SnapshotAndJsonl_RestoreClusterState |
| JSONL truncated after each snapshot | Flush_TruncatesJsonl |
| Auto-recovered calls do NOT show up | AutoRecovered_StageA_DoesNotRecord |

Out of scope

🤖 Generated with Claude Code

rockfordlhotka and others added 5 commits May 8, 2026 16:20
In-process ConcurrentDictionary keyed by (server, tool, errorClass) backed by
an append-only JSONL log plus a periodic JSON snapshot under
/data/agent/telemetry. McpRecoveryExecutor records every post-recovery
failure (chain-exhausted, Stage A retry-fail, Stage B retry-fail, Stage B
fill-fail, and no-provider/no-StageB) with the originating session id;
auto-recovered calls do not record. GetEscalatableAsync surfaces clusters
satisfying Count >= 3 && SessionIds.Count >= 2 && LastSeen within 24h, ready
for Phase 4 (#348) ticket creation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both early-exit branches in McpRecoveryExecutor previously returned without
recording — non-schema errors (auth/network/server-side) and "X is required
but X is in args" cases were invisible to the failure store. Now both record
under errorClass=unknown so DreamService can spot recurring patterns the
recovery layer can't classify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
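A minimal sketch of the change this commit describes. The IFailureClusterStore method signature is assumed for illustration, not taken from the repository:

```csharp
// Hypothetical store surface; the real interface's shape is not shown in this PR.
public interface IFailureClusterStore
{
    void Record(string server, string tool, string errorClass,
        string sessionId, string message);
}

public static class EarlyExitRecordingSketch
{
    // Both early-exit branches previously returned here without recording.
    // Now the failure lands in the store under errorClass=unknown, so
    // DreamService can spot recurring patterns recovery cannot classify.
    public static void OnUnclassifiedFailure(IFailureClusterStore? store,
        string server, string tool, string sessionId, string message)
        => store?.Record(server, tool, errorClass: "unknown",
            sessionId: sessionId, message: message);
}
```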
Other hosted services may dispose their CancellationTokenSources during
shutdown; passing the host's stopping token to SemaphoreSlim.WaitAsync after
that point throws ObjectDisposedException. Shutdown flush is critical work
and shouldn't be cancellable anyway — the timer callback already uses
CancellationToken.None.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
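A minimal sketch of the shutdown hazard and the fix this commit describes; the hosted-service shape is hypothetical, and only the SemaphoreSlim/CancellationToken behavior is the point:

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class ShutdownFlushExample
{
    private readonly SemaphoreSlim _fileLock = new(1, 1);

    // Before: if the CancellationTokenSource behind stoppingToken has already
    // been disposed by another hosted service, WaitAsync(stoppingToken) throws
    // ObjectDisposedException when it tries to register on the token.
    public async Task StopAsync_Buggy(CancellationToken stoppingToken)
    {
        await _fileLock.WaitAsync(stoppingToken);
        try { /* write snapshot, truncate JSONL */ }
        finally { _fileLock.Release(); }
    }

    // After: the shutdown flush is critical work and should not be
    // cancellable, so pass CancellationToken.None, matching what the
    // timer callback already does.
    public async Task StopAsync_Fixed(CancellationToken stoppingToken)
    {
        await _fileLock.WaitAsync(CancellationToken.None);
        try { /* write snapshot, truncate JSONL */ }
        finally { _fileLock.Release(); }
    }
}
```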
rockfordlhotka merged commit a9b29f2 into main on May 9, 2026
2 checks passed
rockfordlhotka deleted the feature/failure-cluster-store branch on May 9, 2026 at 00:01
