
feat: Automatic retry of auth-failed batches after API key rotation #26

@Anthony-Bible

Problem

When Gemini API keys expire, batch embedding jobs are marked as permanently failed with no recovery path. After rotating to a new valid API key, failed batches cannot be retried.

Error example:

"error":"failed to upload batch input file: embedding service error (auth/invalid_api_key): Failed to create file. Ran into an error: Error 400, Message: API key expired. Please renew the API key., Status: INVALID_ARGUMENT

Proposed Solution

  1. Add a new auth_failed status to distinguish authentication failures from permanent failures
  2. Create a periodic health check service that validates the API key
  3. When the API key becomes valid again, automatically reset auth_failed batches for retry

Implementation Plan

Files to Modify/Create

File                                                               Action
internal/domain/entity/batch_job_progress.go                       Add StatusAuthFailed, MarkAuthFailed(), IsAuthFailed()
internal/port/outbound/batch_progress_repository.go                Add GetAuthFailedBatches(), ResetAuthFailedForRetry()
internal/port/outbound/api_key_health_service.go                   Create new interface
internal/adapter/outbound/repository/batch_progress_repository.go  Implement new methods
internal/application/worker/batch_submitter.go                     Add isAuthenticationError(), modify handleSubmissionFailure()
internal/application/worker/auth_recovery_service.go               Create new recovery service
internal/config/config.go                                          Add recovery config options
configs/config.yaml                                                Add config defaults
configs/config.dev.yaml                                            Add config defaults
cmd/worker.go                                                      Initialize and start recovery service
migrations/000016_add_auth_failed_status.up.sql                    Add auth_failed to status constraint
migrations/000016_add_auth_failed_status.down.sql                  Rollback migration

Key Changes

Domain Layer:

  • Add StatusAuthFailed = "auth_failed" constant
  • Add MarkAuthFailed(errorMsg string) method
  • Add IsAuthFailed() bool method
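
A minimal sketch of the domain-layer additions listed above. The BatchJobProgress struct here is a trimmed-down stand-in, and the BatchStatus type and field names are assumptions made for illustration; only StatusAuthFailed, MarkAuthFailed() and IsAuthFailed() come from this issue.

```go
package main

import "time"

// BatchStatus is a hypothetical string type for batch lifecycle states.
type BatchStatus string

const (
	StatusPendingSubmission BatchStatus = "pending_submission"
	StatusFailed            BatchStatus = "failed"
	// StatusAuthFailed is the new status proposed in this issue.
	StatusAuthFailed BatchStatus = "auth_failed"
)

// BatchJobProgress is a simplified stand-in for the real entity.
// The field names are assumptions, not the project's actual layout.
type BatchJobProgress struct {
	status       BatchStatus
	errorMessage string
	updatedAt    time.Time
}

// MarkAuthFailed parks the batch in auth_failed and keeps the error text
// so the cause stays visible while the batch waits for a valid key.
func (b *BatchJobProgress) MarkAuthFailed(errorMsg string) {
	b.status = StatusAuthFailed
	b.errorMessage = errorMsg
	b.updatedAt = time.Now()
}

// IsAuthFailed reports whether the batch is waiting on a valid API key.
func (b *BatchJobProgress) IsAuthFailed() bool {
	return b.status == StatusAuthFailed
}
```

Keeping auth_failed separate from failed means the recovery path can select exactly the batches that a key rotation would fix, without touching batches that failed for other reasons.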

Batch Submitter:

  • Add isAuthenticationError(err error) bool function to detect auth errors
  • Modify handleSubmissionFailure() to mark auth errors as auth_failed instead of failed

Auth Recovery Service:

  • Periodic health check (default: 1 minute interval)
  • When API key becomes valid, reset all auth_failed batches to pending_submission
  • Batches are then picked up by existing BatchSubmitter on next poll

Sequence Flow

1. Batch submission fails with "API key expired"
2. isAuthenticationError() returns true
3. batch.MarkAuthFailed() called → status = 'auth_failed'
4. User rotates API key in environment
5. AuthRecoveryService health check detects valid key
6. ResetAuthFailedForRetry() resets all auth-failed batches
7. BatchSubmitter picks up batches on next poll
8. Batches successfully submitted with new key

Configuration

batch_processing:
  auth_recovery_enabled: true
  auth_recovery_poll_interval: 1m

Acceptance Criteria

  • Auth errors result in auth_failed status, not failed
  • Periodic health check validates API key
  • Auth-failed batches automatically retry when key is valid
  • All changes follow TDD approach
  • Database migration for new status
  • Unit and integration tests pass
