fix(embedder): re-probe unhealthy servers and resolve dimension mismatch on failover by aeneasr · Pull Request #132 · ory/lumen

aeneasr · 2026-04-14T12:53:50Z

Summary

Re-probe after all-unhealthy: Once initServers() found all embedding servers down, FailoverEmbedder was permanently stuck returning errors — even after the server came back up. Now re-probes every 30 seconds when no healthy server is available.
Dimension mismatch on failover: ModelName() and Dimensions() returned stale server-0 values before the first Embed() call. When failover selected a different server (e.g. Ollama 768-dim instead of LM Studio 3584-dim), the Indexer was created with wrong dimensions, causing sqlite-vec errors. Now both methods eagerly init to return correct values.
Failover exhaustion recovery: When all servers were exhausted during embed (transient 5xx errors), active wasn't reset to -1, preventing the re-probe path from triggering on subsequent calls.

Test plan

TestFailover_ReprobesAfterCooldown — all servers down, one recovers, re-probe after 30s cooldown succeeds
TestFailover_ReprobesAfterFailoverExhaustion — servers healthy on probe but 500 on embed, exhaustion resets active, re-probe works
TestFailover_ModelNameTriggersInit — ModelName() eagerly inits and returns healthy server's model
TestFailover_DimensionsTriggersInit — Dimensions() eagerly inits and returns healthy server's dims
Updated TestFailover_DimensionsReflectActive to match new eager init behavior
All 30 embedder tests pass, 0 lint issues

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Ensure active server state is cleared after failover exhaustion to allow correct recovery.
Improvements
- Periodic health re-probing after a cooldown so recovered servers become available again.
- Eager initialization so model name and embedding dimensions reflect the current healthy server sooner.
- Added informational logging when a re-probe is triggered.
Tests
- Added tests covering cooldown reprobes, failover exhaustion recovery, and eager initialization.

…tch on failover Two related bugs in FailoverEmbedder when the primary embedding server is down: 1. Once initServers() marked all servers unhealthy, the checked flag prevented re-probing on subsequent Embed() calls — even after the server came back up. Fix: re-probe when active < 0 and a 30-second cooldown has elapsed. Also reset active = -1 on failover exhaustion so the re-probe path triggers. 2. ModelName() and Dimensions() returned stale server-0 values before the first Embed() call. When failover later selected a different server with different dimensions, the Indexer/Store was already created with wrong dimensions, causing sqlite-vec errors. Fix: trigger eager initServers() in ModelName() and Dimensions() so they reflect the server that will actually handle embeddings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-04-14T12:54:09Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: fdfe8de4-0683-4ede-aaeb-985333fc1e68

📥 Commits

Reviewing files that changed from the base of the PR and between d262436 and 9ec9a0d.

📒 Files selected for processing (1)

internal/embedder/failover.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/embedder/failover.go

📝 Walkthrough

Walkthrough

Added cooldown-based periodic health re-probing to FailoverEmbedder. When no active server exists, Dimensions(), ModelName(), and Embed() can trigger a re-probe after a 30s cooldown. Failover exhaustion now sets active index to -1 and initialization records the last probe time.

Changes

Cohort / File(s)	Summary
Core Re-probing Logic `internal/embedder/failover.go`	Added `reprobeInterval` (30s) and `lastProbeTime`. Introduced `maybeReprobe()` and call sites in `Dimensions()`, `ModelName()`, and `Embed()`. `initServers()` sets `lastProbeTime`. On failover exhaustion `f.active` is set to `-1`.
Test Coverage `internal/embedder/failover_test.go`	Added `newTogglableOllamaServer` for runtime health toggling. Added tests: `TestFailover_ReprobesAfterCooldown`, `TestFailover_ReprobesAfterFailoverExhaustion`, `TestFailover_ModelNameTriggersInit`, `TestFailover_DimensionsTriggersInit`. Updated dimension-related assertions.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Client
participant FailoverEmbedder
participant Timer
participant ServerA
participant ServerB
participant Logger

Client->>FailoverEmbedder: Call Embed()/Dimensions()/ModelName()
alt active server exists
    FailoverEmbedder->>ServerA: Forward request
    ServerA-->>FailoverEmbedder: Response
    FailoverEmbedder-->>Client: Return result
else no active server
    FailoverEmbedder->>Timer: Check lastProbeTime vs reprobeInterval
    alt cooldown not elapsed
        FailoverEmbedder-->>Client: Return error / use cached state
    else cooldown elapsed
        Logger-->>FailoverEmbedder: "re-probing embedding servers after cooldown"
        FailoverEmbedder->>ServerA: Health probe
        FailoverEmbedder->>ServerB: Health probe
        alt healthy found
            FailoverEmbedder->>ServerA: Use for embed
            ServerA-->>FailoverEmbedder: Success
            FailoverEmbedder-->>Client: Return result
        else none healthy
            FailoverEmbedder-->>FailoverEmbedder: set active = -1
            FailoverEmbedder-->>Client: Return exhausted error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hop, hop—the servers fall in line,
A cooldown ticks, and all's fine.
When health returns, we probe once more,
Failover logic at the core. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main changes: fixing re-probing of unhealthy servers and resolving dimension mismatch issues on failover, which are the core objectives of the PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/failover-reprobe-and-dimension-mismatch

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

internal/embedder/failover.go (1)

74-88: Consider extracting duplicated init/reprobe logic.

The needsInit/needsReprobe pattern is repeated in Dimensions(), ModelName(), and Embed(). This is minor but could be consolidated into a helper method called under the lock.

♻️ Optional refactor to reduce duplication

+// maybeReprobe checks if initialization or re-probing is needed and performs it.
+// Must be called with f.mu held.
+func (f *FailoverEmbedder) maybeReprobe(log bool) {
+	needsInit := !f.checked || f.serversChanged()
+	needsReprobe := f.active < 0 && time.Since(f.lastProbeTime) >= reprobeInterval
+	if needsInit || needsReprobe {
+		if log && needsReprobe && f.logger != nil {
+			f.logger.Info("re-probing embedding servers after cooldown")
+		}
+		f.initServers()
+		f.checked = true
+	}
+}
+
 func (f *FailoverEmbedder) Dimensions() int {
 	f.mu.Lock()
-	needsInit := !f.checked || f.serversChanged()
-	needsReprobe := f.active < 0 && time.Since(f.lastProbeTime) >= reprobeInterval
-	if needsInit || needsReprobe {
-		f.initServers()
-		f.checked = true
-	}
+	f.maybeReprobe(false)
 	idx := f.active
 	f.mu.Unlock()

Then apply similarly to ModelName() and Embed() (with f.maybeReprobe(true) for Embed to enable logging).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/embedder/failover.go` around lines 74 - 88, Extract the duplicated
init/reprobe logic in FailoverEmbedder into a new helper method (e.g.,
maybeReprobe) that runs under f.mu lock, checks the same conditions (uses
f.checked, f.serversChanged(), f.active, time.Since(f.lastProbeTime) vs
reprobeInterval), calls f.initServers() and sets f.checked as needed, and
returns the current f.active index; update Dimensions() and ModelName() to call
this helper, and update Embed() to call maybeReprobe(true) (or pass a boolean to
enable logging) so the logging-enabled reprobe path is used there.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@internal/embedder/failover.go`:
- Around line 74-88: Extract the duplicated init/reprobe logic in
FailoverEmbedder into a new helper method (e.g., maybeReprobe) that runs under
f.mu lock, checks the same conditions (uses f.checked, f.serversChanged(),
f.active, time.Since(f.lastProbeTime) vs reprobeInterval), calls f.initServers()
and sets f.checked as needed, and returns the current f.active index; update
Dimensions() and ModelName() to call this helper, and update Embed() to call
maybeReprobe(true) (or pass a boolean to enable logging) so the logging-enabled
reprobe path is used there.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fc56bf9d-e88d-4674-9b26-3c62d899b670

📥 Commits

Reviewing files that changed from the base of the PR and between 39a180d and d262436.

📒 Files selected for processing (2)

internal/embedder/failover.go
internal/embedder/failover_test.go

…eprobe logic Dimensions(), ModelName(), and Embed() all had identical init/reprobe blocks. Extract into maybeReprobe(log bool) called under the existing mutex, with the log parameter controlling reprobe diagnostics in Embed(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot reviewed Apr 14, 2026

View reviewed changes

aeneasr enabled auto-merge (squash) April 14, 2026 13:26

aeneasr merged commit 7979d7f into main Apr 14, 2026
10 checks passed

aeneasr mentioned this pull request Apr 14, 2026

chore(main): release 0.0.36 #133

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(embedder): re-probe unhealthy servers and resolve dimension mismatch on failover#132

fix(embedder): re-probe unhealthy servers and resolve dimension mismatch on failover#132
aeneasr merged 2 commits intomainfrom
fix/failover-reprobe-and-dimension-mismatch

aeneasr commented Apr 14, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 14, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aeneasr commented Apr 14, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aeneasr commented Apr 14, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 14, 2026 •

edited

Loading