Problem Statement
When `embedding.provider: openai` is configured alongside the default `batch_processing.enabled: true`, the worker completes the indexing job and marks the repository as `completed` while leaving the `embeddings_partitioned` table empty. No error is surfaced. Semantic search returns zero results with no indication anything went wrong.
Reproduced end-to-end on `feat/openai-embedding-provider` against a local LM Studio server: 159 chunks parsed, 0 embeddings written, 1 stuck row in `batch_job_progress` (status `pending_submission`). The workaround `batch_processing.enabled: false` produced 159/159 embeddings via JobProcessor's sync path on re-index.
Why This Matters
Silent data loss on the first-run path of the very feature this branch ships. An operator who configures the OpenAI provider — the primary value proposition of `feat/openai-embedding-provider` — gets a `completed` indexing signal with broken search and no recovery prompt.
Stakeholders / Personas
- Operator / self-hoster: configures `embedding.provider: openai` (or vLLM, Ollama, Azure OpenAI, LM Studio) and triggers indexing. Sees `completed`, believes setup is working.
- End-user / developer: runs `/search` after a successful-looking index, gets zero results, has no signal that embeddings are missing.
User Stories
- As an operator who has configured an OpenAI-compatible embedding provider, I want the indexing job to either generate embeddings successfully or fail loudly, so that I am never left with a `completed` repository that silently has no embeddings.
- As a developer running semantic search, I want results after a reported-successful index, so that I can trust the system's completion signals.
Acceptance Criteria
- A repository indexed with `embedding.provider: openai` and `batch_processing.enabled: true` either produces populated embeddings or transitions to an explicit error/failed state with a log message identifying the incompatibility.
- No path exists where job status is `completed` and `embeddings_partitioned` is empty for that repository.
- The workaround configuration (`batch_processing.enabled: false`) continues to produce correct embeddings with the OpenAI provider (regression guard).
- Behavior is verified by an integration test asserting embedding count > 0 after indexing with the OpenAI provider config.
Out of Scope
- Implementing the file-based batch API for the OpenAI adapter.
- Changes to the Gemini provider's batching path.
- API/UI changes to expose embedding counts to callers.
Root Cause
Two separate code paths independently decide whether to use the queue-manager batch path, with different inputs — they can reach opposite conclusions.
Decision 1 — startup, `cmd/worker.go:99`: `createEmbeddingService` returns a bool `asyncBatchSupported` (hardcoded `false` for OpenAI in `createOpenAIEmbeddingService`, line 577). When `false`, the worker skips `BatchSubmitter`/`BatchPoller` and logs the "sync path" notice. Crucially, `batchQueueManager` is still created and injected into `JobProcessor` (line 411) because `createBatchQueueManager` runs unconditionally before `asyncBatchSupported` is checked.
Decision 2 — runtime, `internal/application/worker/job_processor.go:1963–1976`: `generateEmbeddings` evaluates `batchDesired = (len(chunks) >= p.batchConfig.ThresholdChunks) && p.batchConfig.Enabled`. When true, it calls `processBatchEmbeddings`. Inside `processBatchEmbeddings` (line 2033), if `batchQueueManager != nil`, it calls `batchQueueManager.QueueBulkEmbeddingRequests`, writes a `batch_job_progress` row, and returns — orphaned because the consumer goroutine was never started.
The divergence: `asyncBatchSupported=false` suppresses the consumer goroutines, but `batchQueueManager` is non-nil, so the queue path is taken anyway. The `shouldUseBatchProcessing` helper at line 397 has the correct `&& p.batchQueueManager != nil` guard, but it is `//nolint:unused` and never called from the hot path.
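A condensed, runnable sketch of the failure pattern (simplified stand-ins, not the actual source; names follow the report):

```go
package main

import "fmt"

// Hypothetical, flattened stand-ins for the real types in cmd/worker.go and
// internal/application/worker; only the decision shape matters here.
type batchQueueManager struct{}

type jobProcessor struct {
	batchEnabled    bool
	thresholdChunks int
	queue           *batchQueueManager
}

func (p *jobProcessor) generateEmbeddings(chunkCount int) string {
	// Decision 2 (runtime): consults only chunk count and config,
	// never the provider's async-batch capability.
	batchDesired := chunkCount >= p.thresholdChunks && p.batchEnabled
	if batchDesired && p.queue != nil {
		return "queued: batch_job_progress row written, never consumed"
	}
	return "sync path: embeddings written inline"
}

func main() {
	asyncBatchSupported := false // Decision 1 (startup): hardcoded false for OpenAI

	// The manager is created unconditionally, before the capability check,
	// so it is injected even though no consumer goroutines will start.
	queue := &batchQueueManager{}
	if asyncBatchSupported {
		// BatchSubmitter/BatchPoller consumer goroutines would start here;
		// for OpenAI this branch is skipped, yet queue is still injected below.
		fmt.Println("starting batch consumers")
	}

	p := &jobProcessor{batchEnabled: true, thresholdChunks: 100, queue: queue}
	fmt.Println(p.generateEmbeddings(159)) // prints the orphaned-queue outcome
}
```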
Fix Options
Option A — Fail-fast config validation (recommended belt)
In `internal/config/config.go:applyDefaultsAndValidate` (called from line 376), reject `embedding.provider=openai` + `batch_processing.enabled=true` at startup. Same precedent as the existing OpenAI validations (ada-002 rejection, dimension check, missing API key). A sketch follows the pro/con notes below.
- Pro: zero runtime ambiguity; clear operator error before any job runs.
- Con: must be relaxed when an OpenAI batch adapter ships.
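A minimal sketch of the guard, assuming config struct shapes along these lines (field and function names are illustrative, not the actual structs in `internal/config`):

```go
package config

import "errors"

// Hypothetical shapes standing in for the real config structs.
type Config struct {
	Embedding       EmbeddingConfig
	BatchProcessing BatchProcessingConfig
}

type EmbeddingConfig struct{ Provider string }
type BatchProcessingConfig struct{ Enabled bool }

// validateBatchProviderCompatibility would be called from
// applyDefaultsAndValidate, after the per-provider checks (see the
// validation-order caveat under Edge Cases).
func validateBatchProviderCompatibility(cfg *Config) error {
	if cfg.Embedding.Provider == "openai" && cfg.BatchProcessing.Enabled {
		return errors.New("embedding.provider=openai is incompatible with " +
			"batch_processing.enabled=true: no async batch adapter exists for " +
			"the OpenAI provider; set batch_processing.enabled=false")
	}
	return nil
}
```

Naming both config keys in the error keeps it actionable for operators and directly testable by the contract test sketched under Test Strategy.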
Option B — Nil-out batchQueueManager when !asyncBatchSupported (recommended suspenders)
In `cmd/worker.go` (~line 409), set `batchQueueManager = nil` before building `JobProcessorBatchOptions` when `asyncBatchSupported == false`. The existing `processBatchEmbeddings` nil-branch at line 2022 then falls back to `processBatchResultsWithStorage` (sync path). A sketch follows the notes below.
- Pro: forward-compatible with a real OpenAI batch adapter (just flip the factory's bool).
- Pro: removes the orphan-row possibility structurally.
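A paraphrased sketch of the wiring change; the factory signatures and options struct below are stubs, not the real declarations:

```go
package main

// Stubs standing in for the real worker wiring.
type BatchQueueManager struct{}
type JobProcessorBatchOptions struct{ BatchQueueManager *BatchQueueManager }

func createBatchQueueManager() *BatchQueueManager { return &BatchQueueManager{} }

func createEmbeddingService() (svc any, asyncBatchSupported bool) {
	return nil, false // false for OpenAI, per the report
}

func buildBatchOptions() JobProcessorBatchOptions {
	mgr := createBatchQueueManager() // still created unconditionally
	_, asyncBatchSupported := createEmbeddingService()
	if !asyncBatchSupported {
		// No BatchSubmitter/BatchPoller will be started, so a non-nil manager
		// could only strand rows in batch_job_progress. Withholding it forces
		// processBatchEmbeddings into its existing nil-branch (sync path).
		mgr = nil
	}
	return JobProcessorBatchOptions{BatchQueueManager: mgr}
}

func main() { _ = buildBatchOptions() }
```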
Option C — Capability check on the interface (longer term)
Add `SupportsAsyncBatch() bool` to `outbound.BatchEmbeddingService` (or a companion interface). Both startup and JobProcessor consult the same method. Cleanest, but requires an interface change; one possible shape is sketched below.
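Using the companion-interface variant so the existing contract stays untouched (names are proposals, not existing API):

```go
package outbound

// AsyncBatchCapability is an optional capability interface an embedding
// service can implement alongside BatchEmbeddingService.
type AsyncBatchCapability interface {
	// SupportsAsyncBatch reports whether the provider has a real async batch
	// backend. Startup wiring and JobProcessor would both consult this single
	// method instead of making independent decisions.
	SupportsAsyncBatch() bool
}

// supportsAsyncBatch probes any service; absence of the interface means no.
func supportsAsyncBatch(svc any) bool {
	c, ok := svc.(AsyncBatchCapability)
	return ok && c.SupportsAsyncBatch()
}
```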
Recommended: Ship A + B together. Defer C as a refactor task.
Edge Cases / Caveats
- Existing orphan rows in `batch_job_progress` from prior runs need a one-time admin cleanup (move to `failed` terminal state or delete). The retry poller in `cmd/worker.go:104` may not transition `pending_submission`; verify before relying on self-heal.
- Gemini path must keep working: `createGeminiEmbeddingService` returns `asyncBatchSupported=true` when `cfg.Gemini.Batch.Enabled`. Neither A nor B touches that path.
- Cross-field validation order: in `applyDefaultsAndValidate`, the cross-field check must run after individual provider/batch validations to avoid misleading errors on already-invalid configs.
- `shouldUseBatchProcessing` dead code: line 389 already has the right logic behind `//nolint:unused`. Option B should delete that nolint and call it instead of duplicating logic at line 1963.
- `>=` vs `>` threshold inconsistency: `shouldUseBatchProcessing` uses `>` (line 391) while `generateEmbeddings` uses `>=` (line 1963), so the two disagree at exactly `ThresholdChunks`. Pick one; a consolidated helper is sketched after this list.
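One possible consolidated form for the hot path to call (stub types; the report says a near-identical helper already exists, this only shows the agreed predicate). `>=` is chosen here so that exactly `ThresholdChunks` chunks takes the batch path; the opposite choice is fine as long as both call sites agree:

```go
package worker

// Stubs for the real processor fields; only the predicate shape matters.
type batchConfig struct {
	Enabled         bool
	ThresholdChunks int
}

type DefaultJobProcessor struct {
	batchConfig       batchConfig
	batchQueueManager any // nil whenever async batch is unsupported (Option B)
}

// shouldUseBatchProcessing is the single decision point generateEmbeddings
// would call, replacing the inline batchDesired expression.
func (p *DefaultJobProcessor) shouldUseBatchProcessing(chunkCount int) bool {
	return p.batchConfig.Enabled &&
		p.batchQueueManager != nil && // the guard the hot path is missing today
		chunkCount >= p.batchConfig.ThresholdChunks
}
```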
Test Strategy
Existing coverage to review:
- `internal/application/worker/job_processor_batch_queueing_test.go` — covers queue submission with a non-nil manager; does not cover the nil-manager guard or the OpenAI provider case.
- `internal/application/worker/job_processor_batch_integration_test.go` — check for `batchEnabled=true && batchQueueManager=nil` coverage.
- `internal/config/embedding_config_test.go` — already tests OpenAI validation; extend for the cross-field check.
New tests required:
- Config validation contract test (`internal/config/embedding_config_test.go`): table-driven with `{provider: "openai", batch_processing.enabled: true}` → non-nil error mentioning both fields; `{provider: "gemini", batch_processing.enabled: true}` → nil. See the sketch after this list.
- `generateEmbeddings` nil-manager guard (`job_processor_batch_queueing_test.go`): construct `DefaultJobProcessor` with `BatchConfig.Enabled=true`, `ThresholdChunks=0`, `BatchQueueManager=nil`. Call `generateEmbeddings` with N>0 chunks. Assert: no error, sync embedding path called, no `QueueBulkEmbeddingRequests` invocation.
- Worker wiring test (`cmd/worker_test.go` or integration): with `provider=openai`, assert `JobProcessorBatchOptions.BatchQueueManager` is nil regardless of `batch_processing.enabled`.
- Orphan-row guard (integration): after a job completes with `provider=openai`, `batch_job_progress` contains zero rows.
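A sketch of the contract test, reusing the hypothetical `Config` shapes and `validateBatchProviderCompatibility` from the Option A sketch (all names are assumptions about `internal/config`):

```go
package config

import (
	"strings"
	"testing"
)

func TestValidateBatchProviderCompatibility(t *testing.T) {
	cases := []struct {
		name     string
		provider string
		batchOn  bool
		wantErr  bool
	}{
		{"openai with batch rejected", "openai", true, true},
		{"openai without batch ok", "openai", false, false},
		{"gemini with batch ok", "gemini", true, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			cfg := &Config{
				Embedding:       EmbeddingConfig{Provider: tc.provider},
				BatchProcessing: BatchProcessingConfig{Enabled: tc.batchOn},
			}
			err := validateBatchProviderCompatibility(cfg)
			if (err != nil) != tc.wantErr {
				t.Fatalf("got err=%v, wantErr=%v", err, tc.wantErr)
			}
			if err == nil {
				return
			}
			// The acceptance criteria ask for an error naming both fields.
			for _, field := range []string{"embedding.provider", "batch_processing.enabled"} {
				if !strings.Contains(err.Error(), field) {
					t.Errorf("error %q does not mention %q", err, field)
				}
			}
		})
	}
}
```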
Files Likely to Change
| File | Reason |
| --- | --- |
| `internal/config/config.go` | Add cross-field check (Option A) |
| `internal/config/embedding_config_test.go` | New table rows for the cross-field error case |
| `cmd/worker.go` (~line 409) | Nil out `batchQueueManager` when `!asyncBatchSupported` (Option B) |
| `internal/application/worker/job_processor.go:1963` | Replace inline `batchDesired` with `shouldUseBatchProcessing`; remove `//nolint:unused` |
| `internal/application/worker/job_processor_batch_queueing_test.go` | Nil-manager regression test |
Complexity: S — root fix is a 3-line nil assignment plus a small config guard; no new interfaces, no schema changes. Orphan-row cleanup is a separate ops task.
Filed from end-to-end testing on `feat/openai-embedding-provider` against LM Studio (octen-embedding-8b, MRL truncate to 768). The OpenAI adapter itself worked flawlessly once `batch_processing.enabled` was set to `false`.