Fun-ASR is a large end-to-end speech recognition model released by Tongyi Lab.
Updated May 19, 2026 - Python
Audio Large Language Models
"VideoAgent: All-in-One Agentic Framework for Video Understanding, Editing, and Remaking"
[AAAI 2026 & ACL 2026] The official implementation of the DIFFA series of dLLM-based large audio language models
Open-vocabulary sound event detection
Voxtral is a state-of-the-art model developed to handle both speech transcription and audio understanding with remarkable accuracy and efficiency. This demo interface lets you run the Voxtral model on powerful GPUs to evaluate its performance and see how it can be used for transcription and deeper analysis.
Open-source AI memory engine for agents: multimodal memory, temporal knowledge graphs, graph RAG, and evidence-backed long-term context.
A compilation of resources (model profiles, benchmarks, docs) for multimodal AI models with audio understanding (esp. focused on ASR and transcription use-cases)
Empirical eval: how MP3 bitrate affects transcription accuracy across every audio-input LLM on OpenRouter (Gemini, GPT-Audio, Voxtral, MiMo). April 2026.
One voice recording for testing with TTS/cloning
Google's Gemini API provides access to state-of-the-art generative AI models for text generation, multimodal understanding, code generation, and more.
Provides Whisper-based audio transcription and translation through lightweight C++ libraries for easy integration into LLM projects.