Description
Detailed description of the requested feature
Support for quantization and deployment of Qwen3-TTS-style models within the NVIDIA optimization stack, ideally including compatibility with TensorRT-LLM or a clearly defined alternative pipeline.
Specifically, the request is for:
- Ability to quantize non-Transformer / non-text-generation models (e.g., TTS pipelines) using a unified workflow similar to the one available for LLMs
- Support for multi-component models, including:
  - a text encoder (Transformer-based)
  - an acoustic model (autoregressive / diffusion / codec-based)
  - a vocoder (CNN-based waveform generator)
- An end-to-end export pipeline: PyTorch → Quantization → ONNX → TensorRT engine(s)
- Guidance or tooling for:
  - handling models not implemented in Hugging Face Transformers
  - exporting models with custom forward passes or generation loops
- Optional: partial support for prefill/decode-style optimization where applicable (e.g., Transformer submodules)
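To make the "multi-component" shape of the request concrete, here is a toy sketch of the three-stage structure described above. All module names, layer choices, and sizes are hypothetical placeholders, not taken from Qwen3-TTS; the point is only that quantization/export tooling would need to handle each component's distinct architecture:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Hypothetical Transformer-based text encoder: token ids -> hidden states."""
    def __init__(self, vocab=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
    def forward(self, tokens):
        return self.encoder(self.embed(tokens))

class AcousticModel(nn.Module):
    """Hypothetical acoustic model: hidden states -> mel-spectrogram frames."""
    def __init__(self, dim=64, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(dim, n_mels)
    def forward(self, hidden):
        return self.proj(hidden)

class Vocoder(nn.Module):
    """Hypothetical CNN vocoder: mel frames -> waveform (256x upsampling)."""
    def __init__(self, n_mels=80, upsample=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 32, kernel_size=upsample, stride=upsample),
            nn.Conv1d(32, 1, kernel_size=1),
        )
    def forward(self, mel):                              # mel: (B, T, n_mels)
        return self.net(mel.transpose(1, 2)).squeeze(1)  # -> (B, T * upsample)

class TTSPipeline(nn.Module):
    """Three heterogeneous components chained end to end."""
    def __init__(self):
        super().__init__()
        self.text_encoder = TextEncoder()
        self.acoustic = AcousticModel()
        self.vocoder = Vocoder()
    def forward(self, tokens):
        return self.vocoder(self.acoustic(self.text_encoder(tokens)))

tts = TTSPipeline().eval()
tokens = torch.randint(0, 256, (1, 16))
with torch.no_grad():
    wav = tts(tokens)
print(wav.shape)  # torch.Size([1, 4096])
```

A per-component workflow (quantize and export each submodule separately, then chain the resulting engines) may be the practical route here, since the three parts have very different operator profiles.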
This would enable efficient deployment of modern TTS systems on NVIDIA GPUs with reduced latency and memory usage.
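One concrete pain point behind the "custom forward passes or generation loops" item: a data-dependent Python generation loop (e.g., decode until a stop token) cannot be traced into a static graph. A common workaround is to wrap a single decode step in a module and unroll a fixed number of steps for export. The sketch below assumes a hypothetical GRU-based autoregressive step, purely for illustration:

```python
import torch
import torch.nn as nn

class ARDecoderStep(nn.Module):
    """Hypothetical single autoregressive step: previous mel frame -> next frame."""
    def __init__(self, n_mels=80, hidden=64):
        super().__init__()
        self.cell = nn.GRUCell(n_mels, hidden)
        self.out = nn.Linear(hidden, n_mels)
    def forward(self, frame, state):
        state = self.cell(frame, state)
        return self.out(state), state

class FixedLengthDecoder(nn.Module):
    """Export-friendly wrapper: the data-dependent Python generation loop is
    replaced by a fixed number of steps, so the module can be traced as a
    static graph and passed to an ONNX exporter."""
    def __init__(self, step: ARDecoderStep, num_steps=32, hidden=64):
        super().__init__()
        self.step = step
        self.num_steps = num_steps
        self.hidden = hidden
    def forward(self, start_frame):
        state = start_frame.new_zeros(start_frame.size(0), self.hidden)
        frame, frames = start_frame, []
        for _ in range(self.num_steps):  # unrolled at trace time
            frame, state = self.step(frame, state)
            frames.append(frame)
        return torch.stack(frames, dim=1)  # (B, num_steps, n_mels)

decoder = FixedLengthDecoder(ARDecoderStep()).eval()
with torch.no_grad():
    mel = decoder(torch.zeros(1, 80))
print(mel.shape)  # torch.Size([1, 32, 80])
# The wrapper can then go through e.g.
# torch.onnx.export(decoder, (torch.zeros(1, 80),), "decoder.onnx")
```

The fixed-length trick trades flexibility for exportability (output length is pinned at export time); documented guidance on this trade-off, or first-class support for dynamic loops, is part of what this issue asks for.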
Describe alternatives you've considered
- torch AO (torch.ao / torchao) quantization library
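For context on this alternative: torch AO's dynamic quantization handles the Linear-heavy parts of a model with one call, but it stops at a quantized PyTorch module and does not provide a unified path to ONNX/TensorRT engines. A minimal sketch on a toy stand-in model (not an actual TTS component):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Hypothetical stand-in for one Linear-heavy TTS component.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80)).eval()

# Dynamic INT8 quantization: int8 weights, activations quantized on the fly.
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 80)
with torch.no_grad():
    y = qmodel(x)
print(y.shape)  # torch.Size([1, 80])
```

This covers inference inside PyTorch only; the convolutional vocoder and the custom generation loop still need a separate treatment, which is why a unified NVIDIA-stack workflow is requested.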
Target hardware/use case
- NVIDIA GPUs (e.g., A5000)