Skip to content

fix(embedder): re-probe unhealthy servers and resolve dimension mismatch on failover#132

Merged
aeneasr merged 2 commits intomainfrom
fix/failover-reprobe-and-dimension-mismatch
Apr 14, 2026
Merged

fix(embedder): re-probe unhealthy servers and resolve dimension mismatch on failover#132
aeneasr merged 2 commits intomainfrom
fix/failover-reprobe-and-dimension-mismatch

Conversation

@aeneasr
Copy link
Copy Markdown
Member

@aeneasr aeneasr commented Apr 14, 2026

Summary

  • Re-probe after all-unhealthy: Once initServers() found all embedding servers down, FailoverEmbedder was permanently stuck returning errors — even after the server came back up. Now re-probes every 30 seconds when no healthy server is available.
  • Dimension mismatch on failover: ModelName() and Dimensions() returned stale server-0 values before the first Embed() call. When failover selected a different server (e.g. Ollama 768-dim instead of LM Studio 3584-dim), the Indexer was created with wrong dimensions, causing sqlite-vec errors. Now both methods eagerly init to return correct values.
  • Failover exhaustion recovery: When all servers were exhausted during embed (transient 5xx errors), active wasn't reset to -1, preventing the re-probe path from triggering on subsequent calls.

Test plan

  • TestFailover_ReprobesAfterCooldown — all servers down, one recovers, re-probe after 30s cooldown succeeds
  • TestFailover_ReprobesAfterFailoverExhaustion — servers healthy on probe but 500 on embed, exhaustion resets active, re-probe works
  • TestFailover_ModelNameTriggersInitModelName() eagerly inits and returns healthy server's model
  • TestFailover_DimensionsTriggersInitDimensions() eagerly inits and returns healthy server's dims
  • Updated TestFailover_DimensionsReflectActive to match new eager init behavior
  • All 30 embedder tests pass, 0 lint issues

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Ensure active server state is cleared after failover exhaustion to allow correct recovery.
  • Improvements

    • Periodic health re-probing after a cooldown so recovered servers become available again.
    • Eager initialization so model name and embedding dimensions reflect the current healthy server sooner.
    • Added informational logging when a re-probe is triggered.
  • Tests

    • Added tests covering cooldown reprobes, failover exhaustion recovery, and eager initialization.

…tch on failover

Two related bugs in FailoverEmbedder when the primary embedding server is down:

1. Once initServers() marked all servers unhealthy, the checked flag prevented
   re-probing on subsequent Embed() calls — even after the server came back up.
   Fix: re-probe when active < 0 and a 30-second cooldown has elapsed. Also
   reset active = -1 on failover exhaustion so the re-probe path triggers.

2. ModelName() and Dimensions() returned stale server-0 values before the first
   Embed() call. When failover later selected a different server with different
   dimensions, the Indexer/Store was already created with wrong dimensions,
   causing sqlite-vec errors. Fix: trigger eager initServers() in ModelName()
   and Dimensions() so they reflect the server that will actually handle
   embeddings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: fdfe8de4-0683-4ede-aaeb-985333fc1e68

📥 Commits

Reviewing files that changed from the base of the PR and between d262436 and 9ec9a0d.

📒 Files selected for processing (1)
  • internal/embedder/failover.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/embedder/failover.go

📝 Walkthrough

Walkthrough

Added cooldown-based periodic health re-probing to FailoverEmbedder. When no active server exists, Dimensions(), ModelName(), and Embed() can trigger a re-probe after a 30s cooldown. Failover exhaustion now sets active index to -1 and initialization records the last probe time.

Changes

Cohort / File(s) Summary
Core Re-probing Logic
internal/embedder/failover.go
Added reprobeInterval (30s) and lastProbeTime. Introduced maybeReprobe() and call sites in Dimensions(), ModelName(), and Embed(). initServers() sets lastProbeTime. On failover exhaustion f.active is set to -1.
Test Coverage
internal/embedder/failover_test.go
Added newTogglableOllamaServer for runtime health toggling. Added tests: TestFailover_ReprobesAfterCooldown, TestFailover_ReprobesAfterFailoverExhaustion, TestFailover_ModelNameTriggersInit, TestFailover_DimensionsTriggersInit. Updated dimension-related assertions.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client
participant FailoverEmbedder
participant Timer
participant ServerA
participant ServerB
participant Logger

Client->>FailoverEmbedder: Call Embed()/Dimensions()/ModelName()
alt active server exists
    FailoverEmbedder->>ServerA: Forward request
    ServerA-->>FailoverEmbedder: Response
    FailoverEmbedder-->>Client: Return result
else no active server
    FailoverEmbedder->>Timer: Check lastProbeTime vs reprobeInterval
    alt cooldown not elapsed
        FailoverEmbedder-->>Client: Return error / use cached state
    else cooldown elapsed
        Logger-->>FailoverEmbedder: "re-probing embedding servers after cooldown"
        FailoverEmbedder->>ServerA: Health probe
        FailoverEmbedder->>ServerB: Health probe
        alt healthy found
            FailoverEmbedder->>ServerA: Use for embed
            ServerA-->>FailoverEmbedder: Success
            FailoverEmbedder-->>Client: Return result
        else none healthy
            FailoverEmbedder-->>FailoverEmbedder: set active = -1
            FailoverEmbedder-->>Client: Return exhausted error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hop, hop—the servers fall in line,
A cooldown ticks, and all's fine.
When health returns, we probe once more,
Failover logic at the core.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: fixing re-probing of unhealthy servers and resolving dimension mismatch issues on failover, which are the core objectives of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/failover-reprobe-and-dimension-mismatch

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/embedder/failover.go (1)

74-88: Consider extracting duplicated init/reprobe logic.

The needsInit/needsReprobe pattern is repeated in Dimensions(), ModelName(), and Embed(). This is minor but could be consolidated into a helper method called under the lock.

♻️ Optional refactor to reduce duplication
+// maybeReprobe checks if initialization or re-probing is needed and performs it.
+// Must be called with f.mu held.
+func (f *FailoverEmbedder) maybeReprobe(log bool) {
+	needsInit := !f.checked || f.serversChanged()
+	needsReprobe := f.active < 0 && time.Since(f.lastProbeTime) >= reprobeInterval
+	if needsInit || needsReprobe {
+		if log && needsReprobe && f.logger != nil {
+			f.logger.Info("re-probing embedding servers after cooldown")
+		}
+		f.initServers()
+		f.checked = true
+	}
+}
+
 func (f *FailoverEmbedder) Dimensions() int {
 	f.mu.Lock()
-	needsInit := !f.checked || f.serversChanged()
-	needsReprobe := f.active < 0 && time.Since(f.lastProbeTime) >= reprobeInterval
-	if needsInit || needsReprobe {
-		f.initServers()
-		f.checked = true
-	}
+	f.maybeReprobe(false)
 	idx := f.active
 	f.mu.Unlock()

Then apply similarly to ModelName() and Embed() (with f.maybeReprobe(true) for Embed to enable logging).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/embedder/failover.go` around lines 74 - 88, Extract the duplicated
init/reprobe logic in FailoverEmbedder into a new helper method (e.g.,
maybeReprobe) that runs under f.mu lock, checks the same conditions (uses
f.checked, f.serversChanged(), f.active, time.Since(f.lastProbeTime) vs
reprobeInterval), calls f.initServers() and sets f.checked as needed, and
returns the current f.active index; update Dimensions() and ModelName() to call
this helper, and update Embed() to call maybeReprobe(true) (or pass a boolean to
enable logging) so the logging-enabled reprobe path is used there.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/embedder/failover.go`:
- Around line 74-88: Extract the duplicated init/reprobe logic in
FailoverEmbedder into a new helper method (e.g., maybeReprobe) that runs under
f.mu lock, checks the same conditions (uses f.checked, f.serversChanged(),
f.active, time.Since(f.lastProbeTime) vs reprobeInterval), calls f.initServers()
and sets f.checked as needed, and returns the current f.active index; update
Dimensions() and ModelName() to call this helper, and update Embed() to call
maybeReprobe(true) (or pass a boolean to enable logging) so the logging-enabled
reprobe path is used there.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fc56bf9d-e88d-4674-9b26-3c62d899b670

📥 Commits

Reviewing files that changed from the base of the PR and between 39a180d and d262436.

📒 Files selected for processing (2)
  • internal/embedder/failover.go
  • internal/embedder/failover_test.go

…eprobe logic

Dimensions(), ModelName(), and Embed() all had identical init/reprobe
blocks. Extract into maybeReprobe(log bool) called under the existing
mutex, with the log parameter controlling reprobe diagnostics in Embed().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aeneasr aeneasr enabled auto-merge (squash) April 14, 2026 13:26
@aeneasr aeneasr merged commit 7979d7f into main Apr 14, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant