
bug: silent embedding data loss when embedding.provider=openai with batch_processing.enabled=true #64


Description


Problem Statement

When embedding.provider: openai is configured alongside the default batch_processing.enabled: true, the worker runs the indexing job to completion and marks the repository completed while leaving the embeddings_partitioned table empty. No error is surfaced, and semantic search returns zero results with no indication that anything went wrong.

Reproduced end-to-end on feat/openai-embedding-provider against a local LM Studio server: 159 chunks parsed, 0 embeddings written, and 1 stuck row in batch_job_progress (status pending_submission). The workaround batch_processing.enabled: false produced 159/159 embeddings via JobProcessor's sync path on re-index.

Why This Matters

Silent data loss on the first-run path of the feature this entire branch ships. An operator who configures the OpenAI provider — the primary value proposition of feat/openai-embedding-provider — gets a completed indexing signal with broken search and no recovery prompt.

Stakeholders / Personas

  • Operator / self-hoster: configures embedding.provider: openai (or vLLM, Ollama, Azure OpenAI, LM Studio) and triggers indexing. Sees completed, believes setup is working.
  • End-user / developer: runs /search after a successful-looking index, gets zero results, has no signal that embeddings are missing.

User Stories

  • As an operator who has configured an OpenAI-compatible embedding provider, I want the indexing job to either generate embeddings successfully or fail loudly, so that I am never left with a completed repository that silently has no embeddings.
  • As a developer running semantic search, I want results after a reported-successful index, so that I can trust the system's completion signals.

Acceptance Criteria

  • A repository indexed with embedding.provider: openai and batch_processing.enabled: true either produces populated embeddings or transitions to an explicit error/failed state with a log message identifying the incompatibility.
  • No path exists where job status is completed and embeddings_partitioned is empty for that repository.
  • The workaround configuration (batch_processing.enabled: false) continues to produce correct embeddings with the OpenAI provider (regression guard).
  • Behavior is verified by an integration test asserting embedding count > 0 after indexing with the OpenAI provider config.

Out of Scope

  • Implementing the file-based batch API for the OpenAI adapter.
  • Changes to the Gemini provider's batching path.
  • API/UI changes to expose embedding counts to callers.

Root Cause

Two separate code paths independently decide whether to use the queue-manager batch path, with different inputs — they can reach opposite conclusions.

Decision 1 — startup, cmd/worker.go:99: createEmbeddingService returns a bool asyncBatchSupported (hardcoded false for OpenAI in createOpenAIEmbeddingService, line 577). When false, the worker skips BatchSubmitter/BatchPoller and logs the "sync path" notice. Crucially, batchQueueManager is still created and injected into JobProcessor (line 411) because createBatchQueueManager runs unconditionally before asyncBatchSupported is checked.

Decision 2 — runtime, internal/application/worker/job_processor.go:1963–1976: generateEmbeddings evaluates batchDesired = (len(chunks) >= p.batchConfig.ThresholdChunks) && p.batchConfig.Enabled. When true, it calls processBatchEmbeddings. Inside processBatchEmbeddings (line 2033), if batchQueueManager != nil, it calls batchQueueManager.QueueBulkEmbeddingRequests, writes a batch_job_progress row, and returns. That queued request is orphaned, because the consumer goroutine was never started.

The divergence: asyncBatchSupported=false suppresses the consumer goroutines, but batchQueueManager is non-nil, so the queue path is taken anyway. The shouldUseBatchProcessing helper at line 397 has the correct && p.batchQueueManager != nil guard, but it is marked //nolint:unused and never called from the hot path.
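
For reference, the guard the hot path needs is the conjunction of all three conditions. A minimal sketch, with stand-in types assumed from the names in this issue (the real definitions live in internal/application/worker):

```go
package worker

// Stand-in types assumed from the names in this issue.
type CodeChunk struct{}

type BatchConfig struct {
	Enabled         bool
	ThresholdChunks int
}

type BatchQueueManager interface {
	QueueBulkEmbeddingRequests() error // params elided
}

type DefaultJobProcessor struct {
	batchConfig       BatchConfig
	batchQueueManager BatchQueueManager
}

// shouldUseBatchProcessing gates the queue path on all three conditions at
// once. The nil check on the manager is exactly what the inline batchDesired
// expression at job_processor.go:1963 is missing.
func (p *DefaultJobProcessor) shouldUseBatchProcessing(chunks []CodeChunk) bool {
	return p.batchConfig.Enabled &&
		len(chunks) >= p.batchConfig.ThresholdChunks && // unify >= vs > (see Edge Cases)
		p.batchQueueManager != nil
}
```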

Fix Options

Option A — Fail-fast config validation (recommended belt)
In internal/config/config.go:applyDefaultsAndValidate (called from line 376), reject embedding.provider=openai + batch_processing.enabled=true at startup. Same precedent as the existing OpenAI validations (ada-002 rejection, dimension check, missing API key). A sketch follows the pros/cons below.

  • Pro: zero runtime ambiguity; clear operator error before any job runs.
  • Con: must be relaxed when an OpenAI batch adapter ships.
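
A minimal sketch of the Option A check. The Config shape below is inferred from the YAML keys cited in this issue, and validateProviderBatchCompatibility is a hypothetical helper name; the real struct in internal/config/config.go will differ in detail:

```go
package config

import "fmt"

// Config is a stand-in inferred from the YAML keys in this issue.
type Config struct {
	Embedding struct {
		Provider string `yaml:"provider"`
	} `yaml:"embedding"`
	BatchProcessing struct {
		Enabled bool `yaml:"enabled"`
	} `yaml:"batch_processing"`
}

// validateProviderBatchCompatibility (hypothetical name) rejects the broken
// combination at startup. Run it after the individual provider/batch
// validations so already-invalid configs fail with their own errors first.
func validateProviderBatchCompatibility(cfg *Config) error {
	if cfg.Embedding.Provider == "openai" && cfg.BatchProcessing.Enabled {
		return fmt.Errorf("embedding.provider=openai does not support " +
			"batch_processing.enabled=true; set batch_processing.enabled=false")
	}
	return nil
}
```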

Option B — Nil-out batchQueueManager when !asyncBatchSupported (recommended suspenders)
In cmd/worker.go (~line 409), set batchQueueManager = nil before building JobProcessorBatchOptions when asyncBatchSupported == false. The existing processBatchEmbeddings nil-branch at line 2022 then falls back to processBatchResultsWithStorage (sync path). A sketch follows the pros below.

  • Pro: forward-compatible with a real OpenAI batch adapter (just flip the factory's bool).
  • Pro: removes the orphan-row possibility structurally.
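
A sketch of the Option B wiring, assuming the variable and option names cited above (asyncBatchSupported, JobProcessorBatchOptions); the stand-in types exist only to make the snippet self-contained:

```go
package worker

// Stand-ins for the real cmd/worker.go types; names follow the issue text.
type BatchQueueManager interface{}

type JobProcessorBatchOptions struct {
	BatchQueueManager BatchQueueManager
	// remaining options elided
}

// buildBatchOptions shows the Option B change: when the provider factory
// reported asyncBatchSupported == false, hand JobProcessor a nil manager so
// processBatchEmbeddings takes its existing nil-branch (the sync path) instead
// of queueing work that no consumer goroutine will ever drain.
func buildBatchOptions(mgr BatchQueueManager, asyncBatchSupported bool) JobProcessorBatchOptions {
	if !asyncBatchSupported {
		mgr = nil
	}
	return JobProcessorBatchOptions{BatchQueueManager: mgr}
}
```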

Option C — Capability check on the interface (longer term)
Add SupportsAsyncBatch() bool to outbound.BatchEmbeddingService (or a companion interface). Both startup and JobProcessor consult the same method. Cleanest, but requires an interface change.
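
A sketch of the companion-interface variant; AsyncBatchCapable is an illustrative name, not an existing type in the codebase:

```go
package outbound

// AsyncBatchCapable is the companion-interface variant of Option C. Adapters
// that implement a real asynchronous batch API opt in explicitly.
type AsyncBatchCapable interface {
	// SupportsAsyncBatch reports whether the provider implements a real
	// asynchronous batch API end to end.
	SupportsAsyncBatch() bool
}

// Both decision points then ask the same question, e.g.:
//
//	capable, ok := svc.(AsyncBatchCapable)
//	useBatch := ok && capable.SupportsAsyncBatch()
```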

Recommended: Ship A + B together. Defer C as a refactor task.

Edge Cases / Caveats

  • Existing orphan rows in batch_job_progress from prior runs need a one-time admin cleanup (move them to a failed terminal state or delete them). The retry poller in cmd/worker.go:104 may not transition pending_submission rows; verify before relying on self-healing.
  • Gemini path must keep working: createGeminiEmbeddingService returns asyncBatchSupported=true when cfg.Gemini.Batch.Enabled. Neither A nor B touches that path.
  • Cross-field validation order: in applyDefaultsAndValidate, the cross-field check must run after individual provider/batch validations to avoid misleading errors on already-invalid configs.
  • shouldUseBatchProcessing dead code: line 389 already has the right logic with //nolint:unused. Option B should delete that nolint and call it instead of duplicating logic at line 1963.
  • >= vs > threshold inconsistency: shouldUseBatchProcessing uses > (line 391) while generateEmbeddings uses >= (line 1963), so the two disagree when the chunk count equals ThresholdChunks exactly. Pick one and use it in both places.

Test Strategy

Existing coverage to review:

  • internal/application/worker/job_processor_batch_queueing_test.go — covers queue submission with a non-nil manager; does not cover the nil-manager guard or the OpenAI provider case.
  • internal/application/worker/job_processor_batch_integration_test.go — check whether it covers the batchEnabled=true && batchQueueManager=nil combination.
  • internal/config/embedding_config_test.go — already tests OpenAI validation; extend for the cross-field check.

New tests required:

  1. Config validation contract test (internal/config/embedding_config_test.go): table-driven with {provider: "openai", batch_processing.enabled: true} → non-nil error mentioning both fields; {provider: "gemini", batch_processing.enabled: true} → nil. See the sketch after this list.
  2. generateEmbeddings nil-manager guard (job_processor_batch_queueing_test.go): construct DefaultJobProcessor with BatchConfig.Enabled=true, ThresholdChunks=0, BatchQueueManager=nil. Call generateEmbeddings with N>0 chunks. Assert: no error, sync embedding path called, no QueueBulkEmbeddingRequests invocation.
  3. Worker wiring test (cmd/worker_test.go or integration): with provider=openai, assert JobProcessorBatchOptions.BatchQueueManager is nil regardless of batch_processing.enabled.
  4. Orphan-row guard (integration): after a job completes with provider=openai, batch_job_progress contains zero rows.
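
A sketch of test 1, assuming the hypothetical Config struct and validateProviderBatchCompatibility helper from the Option A sketch above:

```go
package config

import (
	"strings"
	"testing"
)

// TestProviderBatchCrossFieldValidation exercises the hypothetical cross-field
// check from the Option A sketch with the table rows described in test 1.
func TestProviderBatchCrossFieldValidation(t *testing.T) {
	cases := []struct {
		name     string
		provider string
		batch    bool
		wantErr  bool
	}{
		{"openai with batching rejected", "openai", true, true},
		{"openai without batching ok", "openai", false, false},
		{"gemini with batching ok", "gemini", true, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			var cfg Config
			cfg.Embedding.Provider = tc.provider
			cfg.BatchProcessing.Enabled = tc.batch
			err := validateProviderBatchCompatibility(&cfg)
			if tc.wantErr && err == nil {
				t.Fatal("expected an error naming both fields, got nil")
			}
			if !tc.wantErr && err != nil {
				t.Fatalf("unexpected error: %v", err)
			}
			if tc.wantErr && !strings.Contains(err.Error(), "batch_processing") {
				t.Errorf("error should mention batch_processing: %v", err)
			}
		})
	}
}
```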

Files Likely to Change

  • internal/config/config.go: add cross-field check (Option A)
  • internal/config/embedding_config_test.go: new table rows for the cross-field error case
  • cmd/worker.go (~line 409): nil out batchQueueManager when !asyncBatchSupported (Option B)
  • internal/application/worker/job_processor.go:1963: replace inline batchDesired with shouldUseBatchProcessing; remove //nolint:unused
  • internal/application/worker/job_processor_batch_queueing_test.go: nil-manager regression test

Complexity: S — the root fix is a three-line nil assignment plus a small config guard; no new interfaces, no schema changes. Orphan-row cleanup is a separate ops task.


Filed from end-to-end testing on feat/openai-embedding-provider against LM Studio (octen-embedding-8b, MRL truncate to 768). The OpenAI adapter itself worked flawlessly once batch_processing.enabled was set to false.
