
bug: silent embedding data loss when embedding.provider=openai with batch_processing.enabled=true #64


Description


Problem Statement

When embedding.provider: openai is configured alongside the default batch_processing.enabled: true, the worker runs the indexing job to completion and marks the repository completed while leaving the embeddings_partitioned table empty. No error is surfaced, and semantic search returns zero results with no indication that anything went wrong.

Reproduced end-to-end on feat/openai-embedding-provider against a local LM Studio server: 159 chunks parsed, 0 embeddings written, and 1 stuck row in batch_job_progress (status pending_submission). The workaround batch_processing.enabled: false produced 159/159 embeddings via JobProcessor's sync path on re-index.

Why This Matters

Silent data loss on the first-run path of the feature this entire branch ships. An operator who configures the OpenAI provider — the primary value proposition of feat/openai-embedding-provider — gets a completed indexing signal with broken search and no recovery prompt.

Stakeholders / Personas

  • Operator / self-hoster: configures embedding.provider: openai (or vLLM, Ollama, Azure OpenAI, LM Studio) and triggers indexing. Sees completed, believes setup is working.
  • End-user / developer: runs /search after a successful-looking index, gets zero results, has no signal that embeddings are missing.

User Stories

  • As an operator who has configured an OpenAI-compatible embedding provider, I want the indexing job to either generate embeddings successfully or fail loudly, so that I am never left with a completed repository that silently has no embeddings.
  • As a developer running semantic search, I want results after a reported-successful index, so that I can trust the system's completion signals.

Acceptance Criteria

  • A repository indexed with embedding.provider: openai and batch_processing.enabled: true either produces populated embeddings or transitions to an explicit error/failed state with a log message identifying the incompatibility.
  • No path exists where job status is completed and embeddings_partitioned is empty for that repository.
  • The workaround configuration (batch_processing.enabled: false) continues to produce correct embeddings with the OpenAI provider (regression guard).
  • Behavior is verified by an integration test asserting embedding count > 0 after indexing with the OpenAI provider config.

Out of Scope

  • Implementing the file-based batch API for the OpenAI adapter.
  • Changes to the Gemini provider's batching path.
  • API/UI changes to expose embedding counts to callers.

Root Cause

Two separate code paths independently decide whether to use the queue-manager batch path, with different inputs — they can reach opposite conclusions.

Decision 1 — startup, cmd/worker.go:99: createEmbeddingService returns a bool asyncBatchSupported (hardcoded false for OpenAI in createOpenAIEmbeddingService, line 577). When false, the worker skips BatchSubmitter/BatchPoller and logs the "sync path" notice. Crucially, batchQueueManager is still created and injected into JobProcessor (line 411) because createBatchQueueManager runs unconditionally before asyncBatchSupported is checked.

Decision 2 — runtime, internal/application/worker/job_processor.go:1963–1976: generateEmbeddings evaluates batchDesired = (len(chunks) >= p.batchConfig.ThresholdChunks) && p.batchConfig.Enabled. When true, it calls processBatchEmbeddings. Inside processBatchEmbeddings (line 2033), if batchQueueManager != nil, it calls batchQueueManager.QueueBulkEmbeddingRequests, writes a batch_job_progress row, and returns. That queued request is orphaned, because the consumer goroutine was never started.

The divergence: asyncBatchSupported=false suppresses the consumer goroutines, but batchQueueManager is non-nil, so the queue path is taken anyway. The shouldUseBatchProcessing helper at line 397 has the correct && p.batchQueueManager != nil guard, but it is marked //nolint:unused and never called from the hot path.
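
For reference, the guard the hot path needs is the conjunction of all three conditions. A minimal sketch, with stand-in types assumed from the names in this issue (the real definitions live in internal/application/worker):

```go
package worker

// Stand-in types assumed from the names in this issue.
type CodeChunk struct{}

type BatchConfig struct {
	Enabled         bool
	ThresholdChunks int
}

type BatchQueueManager interface {
	QueueBulkEmbeddingRequests() error // params elided
}

type DefaultJobProcessor struct {
	batchConfig       BatchConfig
	batchQueueManager BatchQueueManager
}

// shouldUseBatchProcessing gates the queue path on all three conditions at
// once. The nil check on the manager is exactly what the inline batchDesired
// expression at job_processor.go:1963 is missing.
func (p *DefaultJobProcessor) shouldUseBatchProcessing(chunks []CodeChunk) bool {
	return p.batchConfig.Enabled &&
		len(chunks) >= p.batchConfig.ThresholdChunks && // unify >= vs > (see Edge Cases)
		p.batchQueueManager != nil
}
```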

Fix Options

Option A — Fail-fast config validation (recommended belt)
In internal/config/config.go:applyDefaultsAndValidate (called from line 376), reject embedding.provider=openai + batch_processing.enabled=true at startup. Same precedent as the existing OpenAI validations (ada-002 rejection, dimension check, missing API key). A sketch follows the pros/cons below.

  • Pro: zero runtime ambiguity; clear operator error before any job runs.
  • Con: must be relaxed when an OpenAI batch adapter ships.
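
A minimal sketch of the Option A check. The Config shape below is inferred from the YAML keys cited in this issue, and validateProviderBatchCompatibility is a hypothetical helper name; the real struct in internal/config/config.go will differ in detail:

```go
package config

import "fmt"

// Config is a stand-in inferred from the YAML keys in this issue.
type Config struct {
	Embedding struct {
		Provider string `yaml:"provider"`
	} `yaml:"embedding"`
	BatchProcessing struct {
		Enabled bool `yaml:"enabled"`
	} `yaml:"batch_processing"`
}

// validateProviderBatchCompatibility (hypothetical name) rejects the broken
// combination at startup. Run it after the individual provider/batch
// validations so already-invalid configs fail with their own errors first.
func validateProviderBatchCompatibility(cfg *Config) error {
	if cfg.Embedding.Provider == "openai" && cfg.BatchProcessing.Enabled {
		return fmt.Errorf("embedding.provider=openai does not support " +
			"batch_processing.enabled=true; set batch_processing.enabled=false")
	}
	return nil
}
```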

Option B — Nil-out batchQueueManager when !asyncBatchSupported (recommended suspenders)
In cmd/worker.go (~line 409), set batchQueueManager = nil before building JobProcessorBatchOptions when asyncBatchSupported == false. The existing processBatchEmbeddings nil-branch at line 2022 then falls back to processBatchResultsWithStorage (sync path). A sketch follows the pros below.

  • Pro: forward-compatible with a real OpenAI batch adapter (just flip the factory's bool).
  • Pro: removes the orphan-row possibility structurally.
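
A sketch of the Option B wiring, assuming the variable and option names cited above (asyncBatchSupported, JobProcessorBatchOptions); the stand-in types exist only to make the snippet self-contained:

```go
package worker

// Stand-ins for the real cmd/worker.go types; names follow the issue text.
type BatchQueueManager interface{}

type JobProcessorBatchOptions struct {
	BatchQueueManager BatchQueueManager
	// remaining options elided
}

// buildBatchOptions shows the Option B change: when the provider factory
// reported asyncBatchSupported == false, hand JobProcessor a nil manager so
// processBatchEmbeddings takes its existing nil-branch (the sync path) instead
// of queueing work that no consumer goroutine will ever drain.
func buildBatchOptions(mgr BatchQueueManager, asyncBatchSupported bool) JobProcessorBatchOptions {
	if !asyncBatchSupported {
		mgr = nil
	}
	return JobProcessorBatchOptions{BatchQueueManager: mgr}
}
```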

Option C — Capability check on the interface (longer term)
Add SupportsAsyncBatch() bool to outbound.BatchEmbeddingService (or a companion interface). Both startup and JobProcessor consult the same method. Cleanest, but requires an interface change.
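
A sketch of the companion-interface variant; AsyncBatchCapable is an illustrative name, not an existing type in the codebase:

```go
package outbound

// AsyncBatchCapable is the companion-interface variant of Option C. Adapters
// that implement a real asynchronous batch API opt in explicitly.
type AsyncBatchCapable interface {
	// SupportsAsyncBatch reports whether the provider implements a real
	// asynchronous batch API end to end.
	SupportsAsyncBatch() bool
}

// Both decision points then ask the same question, e.g.:
//
//	capable, ok := svc.(AsyncBatchCapable)
//	useBatch := ok && capable.SupportsAsyncBatch()
```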

Recommended: Ship A + B together. Defer C as a refactor task.

Edge Cases / Caveats

  • Existing orphan rows in batch_job_progress from prior runs need a one-time admin cleanup (move them to a failed terminal state or delete them). The retry poller in cmd/worker.go:104 may not transition pending_submission rows; verify before relying on self-healing.
  • Gemini path must keep working: createGeminiEmbeddingService returns asyncBatchSupported=true when cfg.Gemini.Batch.Enabled. Neither A nor B touches that path.
  • Cross-field validation order: in applyDefaultsAndValidate, the cross-field check must run after individual provider/batch validations to avoid misleading errors on already-invalid configs.
  • shouldUseBatchProcessing dead code: line 389 already has the right logic with //nolint:unused. Option B should delete that nolint and call it instead of duplicating logic at line 1963.
  • >= vs > threshold inconsistency: shouldUseBatchProcessing uses > (line 391) while generateEmbeddings uses >= (line 1963), so the two disagree when the chunk count equals ThresholdChunks exactly. Pick one and use it in both places.

Test Strategy

Existing coverage to review:

  • internal/application/worker/job_processor_batch_queueing_test.go — covers queue submission with a non-nil manager; does not cover the nil-manager guard or the OpenAI provider case.
  • internal/application/worker/job_processor_batch_integration_test.go — check whether it covers the batchEnabled=true && batchQueueManager=nil combination.
  • internal/config/embedding_config_test.go — already tests OpenAI validation; extend for the cross-field check.

New tests required:

  1. Config validation contract test (internal/config/embedding_config_test.go): table-driven with {provider: "openai", batch_processing.enabled: true} → non-nil error mentioning both fields; {provider: "gemini", batch_processing.enabled: true} → nil. See the sketch after this list.
  2. generateEmbeddings nil-manager guard (job_processor_batch_queueing_test.go): construct DefaultJobProcessor with BatchConfig.Enabled=true, ThresholdChunks=0, BatchQueueManager=nil. Call generateEmbeddings with N>0 chunks. Assert: no error, sync embedding path called, no QueueBulkEmbeddingRequests invocation.
  3. Worker wiring test (cmd/worker_test.go or integration): with provider=openai, assert JobProcessorBatchOptions.BatchQueueManager is nil regardless of batch_processing.enabled.
  4. Orphan-row guard (integration): after a job completes with provider=openai, batch_job_progress contains zero rows.
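
A sketch of test 1, assuming the hypothetical Config struct and validateProviderBatchCompatibility helper from the Option A sketch above:

```go
package config

import (
	"strings"
	"testing"
)

// TestProviderBatchCrossFieldValidation exercises the hypothetical cross-field
// check from the Option A sketch with the table rows described in test 1.
func TestProviderBatchCrossFieldValidation(t *testing.T) {
	cases := []struct {
		name     string
		provider string
		batch    bool
		wantErr  bool
	}{
		{"openai with batching rejected", "openai", true, true},
		{"openai without batching ok", "openai", false, false},
		{"gemini with batching ok", "gemini", true, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			var cfg Config
			cfg.Embedding.Provider = tc.provider
			cfg.BatchProcessing.Enabled = tc.batch
			err := validateProviderBatchCompatibility(&cfg)
			if tc.wantErr && err == nil {
				t.Fatal("expected an error naming both fields, got nil")
			}
			if !tc.wantErr && err != nil {
				t.Fatalf("unexpected error: %v", err)
			}
			if tc.wantErr && !strings.Contains(err.Error(), "batch_processing") {
				t.Errorf("error should mention batch_processing: %v", err)
			}
		})
	}
}
```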

Files Likely to Change

  • internal/config/config.go: add cross-field check (Option A)
  • internal/config/embedding_config_test.go: new table rows for the cross-field error case
  • cmd/worker.go (~line 409): nil out batchQueueManager when !asyncBatchSupported (Option B)
  • internal/application/worker/job_processor.go:1963: replace inline batchDesired with shouldUseBatchProcessing; remove //nolint:unused
  • internal/application/worker/job_processor_batch_queueing_test.go: nil-manager regression test

Complexity: S — the root fix is a three-line nil assignment plus a small config guard; no new interfaces, no schema changes. Orphan-row cleanup is a separate ops task.


Filed from end-to-end testing on feat/openai-embedding-provider against LM Studio (octen-embedding-8b, MRL truncate to 768). The OpenAI adapter itself worked flawlessly once batch_processing.enabled was set to false.
