Problem
When Gemini API keys expire, batch embedding jobs are marked as permanently failed with no recovery path. After rotating to a new valid API key, failed batches cannot be retried.
Error example:
"error":"failed to upload batch input file: embedding service error (auth/invalid_api_key): Failed to create file. Ran into an error: Error 400, Message: API key expired. Please renew the API key., Status: INVALID_ARGUMENT
Proposed Solution
- Add a new auth_failed status to distinguish auth failures from permanent failures
- Create a periodic health check service that validates the API key
- When the API key becomes valid again, automatically reset auth-failed batches for retry
Implementation Plan
Files to Modify/Create
| File | Action |
| --- | --- |
| internal/domain/entity/batch_job_progress.go | Add StatusAuthFailed, MarkAuthFailed(), IsAuthFailed() |
| internal/port/outbound/batch_progress_repository.go | Add GetAuthFailedBatches(), ResetAuthFailedForRetry() |
| internal/port/outbound/api_key_health_service.go | Create new interface |
| internal/adapter/outbound/repository/batch_progress_repository.go | Implement new methods |
| internal/application/worker/batch_submitter.go | Add isAuthenticationError(), modify handleSubmissionFailure() |
| internal/application/worker/auth_recovery_service.go | Create new recovery service |
| internal/config/config.go | Add recovery config options |
| configs/config.yaml | Add config defaults |
| configs/config.dev.yaml | Add config defaults |
| cmd/worker.go | Initialize and start recovery service |
| migrations/000016_add_auth_failed_status.up.sql | Add auth_failed to status constraint |
| migrations/000016_add_auth_failed_status.down.sql | Rollback migration |
Key Changes
Domain Layer:
- Add StatusAuthFailed = "auth_failed" constant
- Add MarkAuthFailed(errorMsg string) method
- Add IsAuthFailed() bool method
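The domain-layer additions could look roughly like the sketch below. The BatchJobProgress struct here is a simplified stand-in, and its field names (Status, ErrorMessage, UpdatedAt) are assumptions for illustration; only StatusAuthFailed, MarkAuthFailed(), and IsAuthFailed() come from this proposal.

```go
package main

import "time"

// Status values. "auth_failed" is the new status proposed in this issue;
// the other two are the existing statuses it interacts with.
const (
	StatusPendingSubmission = "pending_submission"
	StatusFailed            = "failed"
	StatusAuthFailed        = "auth_failed"
)

// BatchJobProgress is a simplified stand-in for the real entity.
// Field names are assumptions made for this sketch.
type BatchJobProgress struct {
	Status       string
	ErrorMessage string
	UpdatedAt    time.Time
}

// MarkAuthFailed records an authentication failure without treating the
// batch as permanently failed, so it can be reset once the key is valid.
func (b *BatchJobProgress) MarkAuthFailed(errorMsg string) {
	b.Status = StatusAuthFailed
	b.ErrorMessage = errorMsg
	b.UpdatedAt = time.Now()
}

// IsAuthFailed reports whether the batch is waiting for a valid API key.
func (b *BatchJobProgress) IsAuthFailed() bool {
	return b.Status == StatusAuthFailed
}
```

Keeping auth_failed separate from failed means the recovery service can select exactly the batches that are safe to retry.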
Batch Submitter:
- Add isAuthenticationError(err error) bool function to detect auth errors
- Modify handleSubmissionFailure() to mark auth errors as auth_failed instead of failed
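One possible shape for the detection logic is sketched below. It matches on substrings taken from the error example in this issue; the marker list, the statusForFailure helper, and the two example errors are hypothetical. A real implementation might prefer typed errors or error codes from the embedding client if those are available.

```go
package main

import (
	"errors"
	"strings"
)

// authErrorMarkers are lowercase substrings assumed to indicate an
// authentication failure. Both are taken from the error example above.
var authErrorMarkers = []string{
	"auth/invalid_api_key",
	"api key expired",
}

// Example errors used only to demonstrate the classifier.
var (
	exampleAuthErr  = errors.New("embedding service error (auth/invalid_api_key): Error 400, Message: API key expired.")
	exampleOtherErr = errors.New("embedding service error: input file too large")
)

// isAuthenticationError reports whether err looks like an auth failure.
func isAuthenticationError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	for _, marker := range authErrorMarkers {
		if strings.Contains(msg, marker) {
			return true
		}
	}
	return false
}

// statusForFailure is a hypothetical helper showing the branch that
// handleSubmissionFailure() would take for a given error.
func statusForFailure(err error) string {
	if isAuthenticationError(err) {
		return "auth_failed"
	}
	return "failed"
}
```

String matching is brittle if the upstream message wording changes, so the marker list would need to be kept in sync with the errors the service actually returns.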
Auth Recovery Service:
- Periodic health check (default: 1 minute interval)
- When the API key becomes valid, reset all auth_failed batches to pending_submission
- Batches are then picked up by existing BatchSubmitter on next poll
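The recovery loop described above might be structured as in this sketch. The interface method signatures (IsAPIKeyValid, and ResetAuthFailedForRetry returning a count) are assumptions, since the proposal names the methods but not their parameters. The in-memory fakes and demoRecovery exist only to make the example self-contained.

```go
package main

import (
	"context"
	"time"
)

// APIKeyHealthService is an assumed shape for the new outbound port.
type APIKeyHealthService interface {
	IsAPIKeyValid(ctx context.Context) (bool, error)
}

// AuthFailedResetter is the slice of the batch progress repository that
// the recovery service needs. The return value (rows reset) is assumed.
type AuthFailedResetter interface {
	ResetAuthFailedForRetry(ctx context.Context) (int, error)
}

// AuthRecoveryService periodically checks the key and resets batches.
type AuthRecoveryService struct {
	health   APIKeyHealthService
	repo     AuthFailedResetter
	interval time.Duration
}

// checkOnce runs a single health check. Batches are reset to
// pending_submission only when the key is reported valid.
func (s *AuthRecoveryService) checkOnce(ctx context.Context) (int, error) {
	valid, err := s.health.IsAPIKeyValid(ctx)
	if err != nil || !valid {
		return 0, err
	}
	return s.repo.ResetAuthFailedForRetry(ctx)
}

// Run polls until ctx is cancelled.
func (s *AuthRecoveryService) Run(ctx context.Context) {
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = s.checkOnce(ctx) // a real service would log errors here
		}
	}
}

// fixedHealth and memoryRepo are in-memory fakes for demonstration.
type fixedHealth bool

func (h fixedHealth) IsAPIKeyValid(ctx context.Context) (bool, error) {
	return bool(h), nil
}

type memoryRepo struct{ authFailed int }

func (r *memoryRepo) ResetAuthFailedForRetry(ctx context.Context) (int, error) {
	n := r.authFailed
	r.authFailed = 0
	return n, nil
}

// demoRecovery runs one check and returns how many batches were reset.
func demoRecovery(keyValid bool, authFailed int) int {
	svc := &AuthRecoveryService{
		health:   fixedHealth(keyValid),
		repo:     &memoryRepo{authFailed: authFailed},
		interval: time.Minute,
	}
	n, _ := svc.checkOnce(context.Background())
	return n
}
```

With these fakes, demoRecovery(false, 3) resets nothing while demoRecovery(true, 3) resets all three batches, which mirrors the intended behavior before and after key rotation.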
Sequence Flow
1. Batch submission fails with "API key expired"
2. isAuthenticationError() returns true
3. batch.MarkAuthFailed() called → status = 'auth_failed'
4. User rotates API key in environment
5. AuthRecoveryService health check detects valid key
6. ResetAuthFailedForRetry() resets all auth-failed batches
7. BatchSubmitter picks up batches on next poll
8. Batches successfully submitted with new key
Configuration
batch_processing:
  auth_recovery_enabled: true
  auth_recovery_poll_interval: 1m
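On the Go side, these options could map to a struct along the following lines. The struct name, field names, and the yaml tags are assumptions, since the project's actual config loader is not shown here. parsePollInterval is a hypothetical helper illustrating how a value such as 1m can be validated with a fallback to the one-minute default.

```go
package main

import "time"

// BatchProcessingConfig mirrors the proposed batch_processing keys.
// The tag format depends on the config library in use (assumed here).
type BatchProcessingConfig struct {
	AuthRecoveryEnabled      bool          `yaml:"auth_recovery_enabled"`
	AuthRecoveryPollInterval time.Duration `yaml:"auth_recovery_poll_interval"`
}

// defaultBatchProcessingConfig returns the defaults proposed in this issue.
func defaultBatchProcessingConfig() BatchProcessingConfig {
	return BatchProcessingConfig{
		AuthRecoveryEnabled:      true,
		AuthRecoveryPollInterval: time.Minute,
	}
}

// parsePollInterval converts a duration string such as "1m" and falls
// back to the default when the value is missing, invalid, or not positive.
func parsePollInterval(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultBatchProcessingConfig().AuthRecoveryPollInterval
	}
	return d
}
```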
Acceptance Criteria
- Authentication errors result in auth_failed status, not failed