proposed changes to PR #2962 #2981
Conversation
- Add support for speech-2.6-hd and speech-2.6-turbo models
- Add 16 new languages (total 40): Afrikaans, Bulgarian, Catalan, Danish, Persian, Filipino, Hebrew, Croatian, Hungarian, Malay, Norwegian, Nynorsk, Slovak, Slovenian, Swedish, Tamil
- Add new emotions: calm and fluent
- Add new parameters: text_normalization (renamed from english_normalization), latex_read, force_cbr, exclude_aggregated_audio, subtitle_enable, subtitle_type
- Extract trace_id from response headers for all requests
- Improve error handling for non-streaming error responses
- Add detailed extra_info logging (audio_length, audio_size, usage_characters, word_count)
- Add validation warnings for language/model compatibility
- Fix silent error issue where HTTP 200 responses with errors were ignored

BREAKING CHANGE: Renamed parameter english_normalization to text_normalization
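For orientation, a minimal configuration sketch of how these options might be passed, assuming the service class is `MiniMaxHttpTTSService` with a nested `InputParams` as in other pipecat TTS services; the exact constructor arguments and enum members shown are assumptions based on this PR's description, not a confirmed API:

```python
# Sketch only: class name, constructor arguments, and InputParams fields are
# assumptions based on this PR's description, not a confirmed public API.
import aiohttp

from pipecat.services.minimax.tts import MiniMaxHttpTTSService
from pipecat.transcriptions.language import Language


async def build_minimax_tts(
    session: aiohttp.ClientSession, api_key: str, group_id: str
) -> MiniMaxHttpTTSService:
    params = MiniMaxHttpTTSService.InputParams(
        language=Language.SV,        # Swedish, one of the 16 newly added languages
        emotion="calm",              # new emotion added in this PR
        text_normalization=True,     # renamed from english_normalization
    )
    return MiniMaxHttpTTSService(
        api_key=api_key,
        group_id=group_id,
        model="speech-2.6-turbo",    # new model added in this PR
        aiohttp_session=session,
        params=params,
    )
```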
latex_read: Enable LaTeX formula reading.
force_cbr: Enable Constant Bitrate (CBR) for audio encoding (MP3 only).
exclude_aggregated_audio: Whether to exclude aggregated audio in final chunk.
subtitle_enable: Enable subtitle generation (non-streaming only).
We should consider removing this parameter and subtitle_type, since they're for non-streaming only. This MiniMax implementation is streaming only.
The alternative would be to add a streaming arg to allow for a non-streaming mode, but I don't think that's a good idea.
There's a mismatch here. Our model can support streaming subtitles.
Perhaps "(non-streaming only)" should be omitted from the comment then?
@zhenyujia23-crypto is there a way subtitles can be added to the example 07y-interruptible-minimax.py to exercise this parameter?
| "disgusted", "surprised", "calm", "fluent"). | ||
| english_normalization: Deprecated; use `text_normalization` instead | ||
| text_normalization: Enable text normalization (Chinese/English). | ||
| latex_read: Enable LaTeX formula reading. |
Is this actually applicable? I guess LLMs can output LaTeX format, but I don't know if it makes sense for a TTS to do the same 🤷
I guess there's no harm in including it, but generally voice agents work best when they skip pronouncing complex math, code, or tables.
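If that preference matters in practice, one option outside this PR is to strip LaTeX spans client-side before the text reaches the TTS, rather than relying on the service to read formulas. A rough, hypothetical sketch:

```python
# Not part of this PR: a rough illustration of skipping LaTeX math client-side
# instead of asking the TTS to read it aloud.
import re

# Matches $$...$$, $...$, \(...\) and \[...\] style LaTeX spans.
_LATEX_SPAN = re.compile(r"\$\$.*?\$\$|\$[^$]*\$|\\\(.*?\\\)|\\\[.*?\\\]", re.DOTALL)


def strip_latex(text: str) -> str:
    """Replace LaTeX spans with a short spoken placeholder."""
    return _LATEX_SPAN.sub(" (formula omitted) ", text)


if __name__ == "__main__":
    print(strip_latex("The loss is $L = -\\sum_i y_i \\log p_i$ per batch."))
    # -> "The loss is  (formula omitted)  per batch."
```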
src/pipecat/services/minimax/tts.py (Outdated)
english_normalization: Deprecated; use `text_normalization` instead.
text_normalization: Enable text normalization (Chinese/English).
latex_read: Enable LaTeX formula reading.
force_cbr: Enable Constant Bitrate (CBR) for audio encoding (MP3 only).
Is MP3 an option? Again, this probably doesn't make sense for a streaming use case.
Will remove.
Yes, let's remove if it doesn't apply to streaming then.
src/pipecat/services/minimax/tts.py (Outdated)
if service_lang:
    self._settings["language_boost"] = service_lang

# Validate language-model compatibility
I'm not sure if we should include this type of checking. It's likely to become stale and need to be maintained. It would be better to rely on the service returning an error about unsupported languages for a given model.
Will remove it.
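For reference, relying on the service error rather than a local compatibility table might look roughly like the sketch below. The `base_resp.status_code` / `status_msg` fields are an assumption about the MiniMax response shape, not verified here:

```python
# Sketch only: assumes a failed MiniMax response carries base_resp.status_code and
# base_resp.status_msg; these field names are assumptions, not confirmed here.
from loguru import logger


def raise_on_minimax_error(data: dict) -> None:
    """Surface service-side errors (e.g. an unsupported language/model combination)."""
    base_resp = data.get("base_resp") or {}
    status_code = base_resp.get("status_code", 0)
    if status_code != 0:
        msg = base_resp.get("status_msg", "unknown error")
        logger.error(f"MiniMax TTS error {status_code}: {msg}")
        raise RuntimeError(f"MiniMax TTS error {status_code}: {msg}")
```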
if params.text_normalization is not None:
    self._settings["voice_setting"]["text_normalization"] = params.text_normalization

# Add latex_read if provided
Consider removing this if the parameter isn't included.
src/pipecat/services/minimax/tts.py (Outdated)
if params.latex_read is not None:
    self._settings["voice_setting"]["latex_read"] = params.latex_read

# Add force_cbr if provided (for MP3 format only)
Same. Consider removing this if the parameter isn't included.
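The pattern under discussion, shown in isolation, is to copy an optional value into the request settings only when the caller actually provided it; the settings keys below mirror the diff, while the helper itself is hypothetical:

```python
# Hypothetical helper showing the "only set it when provided" pattern from the diff.
# The voice_setting/audio_setting keys mirror the diff above.
from typing import Any, Optional


def apply_optional_settings(
    settings: dict[str, Any],
    text_normalization: Optional[bool] = None,
    force_cbr: Optional[bool] = None,
) -> dict[str, Any]:
    if text_normalization is not None:
        settings.setdefault("voice_setting", {})["text_normalization"] = text_normalization
    if force_cbr is not None:  # MP3-only, and likely dropped per the review comments
        settings.setdefault("audio_setting", {})["force_cbr"] = force_cbr
    return settings
```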
if params.force_cbr is not None:
    self._settings["audio_setting"]["force_cbr"] = params.force_cbr

# Add subtitle settings if provided
I think we should remove the parameter and this logic since non-streaming isn't supported.
yield TTSStartedFrame()

# Process the streaming response
logger.trace(f"Starting to read streaming response, status={response.status}")
Remove
async for chunk in response.content.iter_chunked(CHUNK_SIZE):
    chunk_count += 1
    logger.trace(f"Received chunk #{chunk_count}, size={len(chunk)} bytes")
Remove.
# Log raw buffer content for debugging
if chunk_count == 1:
    logger.trace(f"Raw buffer content: {buffer[:200]}")  # First 200 bytes
Remove.
data = json.loads(data_block[5:].decode("utf-8"))
# Skip data blocks containing extra_info
data_str = data_block[5:].decode("utf-8")
logger.trace(
Remove this and other trace logging.
But I think we still need a place to log the trace_id, because it will help us identify problems on our service side.
Is the tracing code necessary for other folks who use this class?
Perhaps a better approach is to create a git patch with all the tracing code that can be applied locally to a user's pipecat code.
I see that it is useful, but it feels like it should be a more "opt-in" feature.
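One possible middle ground, sketched below: gate trace-id logging behind an opt-in flag and only emit it on failures. The `Trace-Id` header name and the flag are assumptions, not confirmed parts of the API or this PR:

```python
# Sketch of an opt-in approach: log the service trace id only when explicitly
# enabled, and only on failures. The "Trace-Id" header name and flag are assumptions.
import aiohttp
from loguru import logger


def log_trace_id_on_error(
    response: aiohttp.ClientResponse, *, enable_trace_logging: bool = False
) -> None:
    if not enable_trace_logging or response.status == 200:
        return
    trace_id = response.headers.get("Trace-Id", "<missing>")
    logger.warning(
        f"MiniMax TTS request failed (status={response.status}, trace_id={trace_id})"
    )
```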
if not chunk_data:
    continue

# Check for subtitle file (if subtitle generation is enabled)
Remove
markbackman left a comment:
Ok, I think the biggest issue to sort out is what the scope of this service is. Pipecat is for real-time use cases, so non-streaming is not really applicable. I think we should remove the non-streaming items from this class to keep it focused and simple.
Also, remove the trace logging, as it's only applicable during development and isn't practical for real-world debugging.
Co-authored-by: Mark Backman <[email protected]>
- Remove non-essential parameters: latex_read, force_cbr, exclude_aggregated_audio
- Simplify trace_id logging (keep for errors only, remove from success logs)
- Make subtitle generation support streaming with word-level timestamps
- Remove non-streaming logic warnings and validation
- Clean up code structure and remove duplicate parameter handling
- Reduce code complexity while maintaining core functionality

Responds to PR pipecat-ai#2981 reviewer comments about keeping the TTS service focused on real-time use cases.
- Remove force_cbr parameter and complex audio encoding options
- Remove exclude_aggregated_audio streaming control
- Remove all logger.debug statements for cleaner logging
- Simplify InputParams class (11 → 7 parameters)
- Streamline subtitle support to word-level only
- Remove sentence-level subtitle functionality
- Fix geographic comment: "west of unite state" → "western United States"
- Update documentation for streaming-only word-level subtitles
- Maintain all core TTS functionality for real-time use cases
- Preserve language support (40 languages), emotions, and trace ID tracking

This addresses PR pipecat-ai#2981 feedback by simplifying the service to focus purely on streaming TTS with word-level subtitles, removing unnecessary complexity and debugging code while maintaining essential functionality.