
Bug: Flash 2.5 transcription exhausts output tokens due to repeated [unclear], resulting in truncated JSON (recent regression) #1961

@ish-codes-magic

Description


Describe the bug
When using Flash 2.5 on Vertex AI for audio transcription via the google-genai package with batching enabled, the model repeatedly outputs the literal token [unclear]. This repetition consumes the entire max_output_tokens budget before the transcription completes, so the response is truncated and the returned JSON is invalid or incomplete.

This behavior appears to be a recent regression. The same transcription pipeline was significantly more reliable approximately 1–1.5 months ago, with far fewer [unclear] repetitions and successful completion of JSON responses.


Environment

  • Platform: Vertex AI
  • Model: Flash 2.5
  • Library: google-genai
  • Task: Audio transcription with batching
  • Response MIME type: application/json
  • Response schema: Enabled
  • Thinking mode: Disabled

Steps to reproduce

  1. Send an audio file via file_uri with a transcription prompt
  2. Enable structured JSON output using response_schema
  3. Set max_output_tokens to a value appropriate for the expected transcription length
  4. Invoke Flash 2.5 on Vertex AI with batching (a minimal, non-batched sketch of the same request shape is shown below)
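
For reference, this is a minimal, non-batched sketch of the same request shape using the google-genai client. The project, location, file URI, prompt, token budget, and response schema below are placeholders, and the thinking budget of 0 mirrors the disabled thinking mode listed in the environment above.

from google import genai
from google.genai import types

# Placeholder values -- substitute your own project, audio file, and prompt.
PROJECT = "my-project"
LOCATION = "us-central1"
AUDIO_URI = "gs://my-bucket/sample-call.wav"
PROMPT = "Transcribe the audio and return JSON matching the schema."

client = genai.Client(vertexai=True, project=PROJECT, location=LOCATION)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_uri(file_uri=AUDIO_URI, mime_type="audio/wav"),
        PROMPT,
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "OBJECT",
            "properties": {
                "segments": {"type": "ARRAY", "items": {"type": "STRING"}}
            },
        },
        temperature=0.0,
        max_output_tokens=8192,
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)

# With the regression, this text is often cut off mid-JSON after many
# repeated [unclear] segments.
print(response.text)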

Expected behavior

  • The model should avoid excessive repetition of [unclear]
  • The model should complete transcription within the token budget
  • The model should consistently return a valid JSON response conforming to the schema

Actual behavior

  • The model repeatedly emits [unclear] segments
  • Output tokens are exhausted before transcription completes
  • JSON output is truncated or malformed

Code snippet

# Parts for one batch request: the audio file reference plus the transcription prompt.
parts = [
    {
        "file_data": {
            "file_uri": uri,
            "mime_type": self._get_mime_type(file_path)
        }
    },
    {
        "text": final_prompt
    }
]

# Generation settings: structured JSON output with a per-model token budget.
generation_config = {
    "response_mime_type": "application/json",
    "temperature": transcription_config.TEMPERATURE,
    "max_output_tokens": transcription_config.get_max_output_tokens(model),
}

# Attach the response schema, either as a raw dict or derived from a Pydantic model.
schema_class = get_transcription_result_class(model, phase)
if schema_class:
    if isinstance(schema_class, dict):
        generation_config["response_schema"] = schema_class
    else:
        # Inline the $refs and drop $defs so the schema sent to the API is self-contained.
        schema_dict = schema_class.model_json_schema()
        schema_dict = self._resolve_json_schema_refs(schema_dict)
        schema_dict.pop("$defs", None)
        generation_config["response_schema"] = schema_dict

# Thinking is disabled via the configured budget (0 in this setup).
generation_config["thinking_config"] = {
    "thinking_budget": transcription_config.THINKING_BUDGET
}

# One batch instance per request.
instance = {
    "id": str(i - 1),
    "request": {
        "contents": [
            {
                "role": "user",
                "parts": parts
            }
        ],
        "generation_config": generation_config
    }
}
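
For completeness, a rough sketch of how the instances above are serialized for batch submission; the instances list name and the local file path are placeholders, and in our pipeline the resulting JSONL is uploaded to Cloud Storage and referenced as the input source of the Vertex AI batch prediction job.

import json

# Hypothetical sketch: "instances" is a list of dicts built as shown above.
with open("batch_input.jsonl", "w", encoding="utf-8") as f:
    for instance in instances:
        f.write(json.dumps(instance) + "\n")

# The JSONL file is then uploaded to Cloud Storage and used as the input
# source of the Vertex AI batch prediction job.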

Additional context

  • Increasing max_output_tokens does not reduce the issue (a rough truncation-detection workaround is sketched below).
  • The regression has been observed consistently over the last 1–1.5 months.
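
As a stopgap on our side, the sketch below shows one way the truncation could be detected and the repeated [unclear] runs collapsed before JSON parsing for a non-batched call; the function name and the collapsing heuristic are our own assumptions, not a recommended fix.

import json
import re

from google.genai import types


def handle_transcription_response(response: types.GenerateContentResponse):
    """Detect token exhaustion and collapse repeated [unclear] runs before parsing."""
    candidate = response.candidates[0]
    truncated = candidate.finish_reason == types.FinishReason.MAX_TOKENS

    text = response.text or ""
    # Heuristic: collapse runs of two or more [unclear] markers into a single one.
    text = re.sub(r"(\[unclear\]\s*){2,}", "[unclear] ", text)

    try:
        return json.loads(text), truncated
    except json.JSONDecodeError:
        # Truncated output usually fails to parse; signal the caller to retry or re-chunk.
        return None, truncated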

Questions

  • Is this a known regression in Flash 2.5 transcription behavior?
  • Are there recommended mitigations to prevent token exhaustion due to repeated [unclear] output?
  • Is Flash 2.5 currently recommended for transcription workloads on Vertex AI?

Labels

  • priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
