Description
Describe the bug
When using Flash 2.5 on Vertex AI for audio transcription via the google-genai package with batching enabled, the model repeatedly outputs the literal token [unclear]. This repetition consumes the entire max_output_tokens budget before the transcription completes, so the response is truncated and the resulting JSON is invalid or incomplete.
This behavior appears to be a recent regression. The same transcription pipeline was significantly more reliable approximately 1–1.5 months ago, with far fewer [unclear] repetitions and successful completion of JSON responses.
Environment
- Platform: Vertex AI
- Model: Flash 2.5
- Library: google-genai
- Task: Audio transcription with batching
- Response MIME type: application/json
- Response schema: Enabled
- Thinking mode: Disabled
Steps to reproduce
- Send an audio file via file_uri with a transcription prompt
- Enable structured JSON output using response_schema
- Set max_output_tokens appropriate for the expected transcription length
- Invoke Flash 2.5 on Vertex AI with batching
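For reference, a minimal non-batched call through the google-genai SDK that exercises the same settings looks roughly like the sketch below. The project, location, file URI, prompt, and toy schema are placeholders; the actual pipeline builds the batch request dict shown under "Code snippet".

from google import genai
from google.genai import types

# Placeholders: substitute a real project, location, and GCS audio URI.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_uri(file_uri="gs://my-bucket/audio.wav", mime_type="audio/wav"),
        "Transcribe this audio and return JSON matching the schema.",
    ],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        # Toy schema standing in for the real transcription schema.
        response_schema={
            "type": "object",
            "properties": {"segments": {"type": "array", "items": {"type": "string"}}},
        },
        max_output_tokens=8192,
        temperature=0.0,
        # Thinking disabled, matching the environment above.
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)
print(response.text)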
Expected behavior
- The model should avoid excessive repetition of [unclear]
- The model should complete transcription within the token budget
- The model should consistently return a valid JSON response conforming to the schema
Actual behavior
- The model repeatedly emits [unclear] segments
- Output tokens are exhausted before transcription completes
- JSON output is truncated or malformed
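On a single non-batched response the truncation can be confirmed with a check along these lines (illustrative only; response is a google-genai GenerateContentResponse, e.g. from the sketch above):

import json

candidate = response.candidates[0]
print(candidate.finish_reason)            # MAX_TOKENS when the budget is exhausted
print(response.text.count("[unclear]"))   # many repetitions in the failing runs

try:
    json.loads(response.text)
except json.JSONDecodeError as exc:
    print(f"Truncated or invalid JSON: {exc}")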
Code snippet
parts = [
    {
        "file_data": {
            "file_uri": uri,
            "mime_type": self._get_mime_type(file_path)
        }
    },
    {
        "text": final_prompt
    }
]
generation_config = {
    "response_mime_type": "application/json",
    "temperature": transcription_config.TEMPERATURE,
    "max_output_tokens": transcription_config.get_max_output_tokens(model),
}
schema_class = get_transcription_result_class(model, phase)
if schema_class:
    if isinstance(schema_class, dict):
        generation_config["response_schema"] = schema_class
    else:
        schema_dict = schema_class.model_json_schema()
        schema_dict = self._resolve_json_schema_refs(schema_dict)
        schema_dict.pop("$defs", None)
        generation_config["response_schema"] = schema_dict
generation_config["thinking_config"] = {
    "thinking_budget": transcription_config.THINKING_BUDGET
}
instance = {
    "id": str(i - 1),
    "request": {
        "contents": [
            {
                "role": "user",
                "parts": parts
            }
        ],
        "generation_config": generation_config
    }
}
Additional context
- Increasing max_output_tokens doesn't reduce the issue.
- The regression has been observed consistently over the last 1–1.5 months.
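For completeness, a crude client-side workaround would be something like the sketch below (collapse runs of [unclear] and treat unparseable responses as retry candidates). It does not address the underlying regression, hence the questions below.

import json
import re
from typing import Optional

def clean_and_parse(raw_text: str) -> Optional[dict]:
    """Collapse consecutive [unclear] markers, then try to parse the JSON.

    Returns the parsed payload, or None if the response is truncated or
    malformed and should be retried. A client-side workaround sketch only.
    """
    collapsed = re.sub(r"(?:\[unclear\]\s*){2,}", "[unclear] ", raw_text)
    try:
        return json.loads(collapsed)
    except json.JSONDecodeError:
        return None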
Questions
- Is this a known regression in Flash 2.5 transcription behavior?
- Are there recommended mitigations to prevent token exhaustion due to repeated [unclear] output?
- Is Flash 2.5 currently recommended for transcription workloads on Vertex AI?