Clinical Speech-to-Structured-Report AI - A FastAPI web application that converts a doctor's dictation into structured medical documentation using speech-to-text and Large Language Models.
- Audio Recording: Record audio directly in the browser using the microphone
- Audio File Upload: Upload existing audio files in common formats (MP3, MP4, WAV, M4A, OGG, WebM, etc.)
- Speech-to-Text: Uses Faster Whisper (local, open-source Whisper model) for accurate transcription
- Report Generation: Uses free-tier LLMs via OpenRouter to generate structured medical reports
- Python 3.8+
- Microphone access (for recording feature)
- OpenRouter API key (for free LLM access)
1. Clone this repository

2. Create a virtual environment:

   ```
   python3 -m venv venv
   ```

3. Activate the virtual environment:

   ```
   source venv/bin/activate   # macOS/Linux
   venv\Scripts\activate      # Windows
   ```

4. Install dependencies:

   ```
   pip install -r requirements.txt
   ```

5. Set up environment variables: create a `.env` file or set the variable directly:

   ```
   export OPENROUTER_API_KEY=your_openrouter_api_key_here
   ```

6. Run the application:

   ```
   uvicorn app:app --reload
   ```

7. Open your browser and go to http://127.0.0.1:8000
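If the API key is missing, report generation fails at request time, so it can help to validate it at startup. A minimal sketch of reading the key from the environment (the function name and error message are illustrative, not taken from `app.py`):

```python
import os


def get_openrouter_key() -> str:
    """Return the OpenRouter API key, failing fast with a clear message."""
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; export it or add it to a .env file"
        )
    return key
```

If you prefer a `.env` file, the `python-dotenv` package can load it into the environment before this check runs.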
1. Choose one of two options:
   - Upload Audio File: Select an existing audio file and click "Upload and Process"
   - Record Audio: Click "Start Recording", speak into your microphone, then click "Stop Recording"

2. View the transcription and generated medical report
The app uses Faster Whisper with the "medium" model by default. You can change this in `app.py` by modifying:

```python
whisper_model = WhisperModel("medium", device="cpu", compute_type="int8")
```

Available models: `tiny`, `base`, `small`, `medium`, `large-v1`, `large-v2`
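For reference, Faster Whisper's `transcribe()` returns a lazy generator of segments plus metadata, so the segment texts must be joined to get a full transcript. A sketch of the kind of call `app.py` makes (the helper name and file path are illustrative):

```python
from faster_whisper import WhisperModel

# Load once at startup; the first call downloads the model weights.
whisper_model = WhisperModel("medium", device="cpu", compute_type="int8")


def transcribe_file(path: str) -> str:
    """Run local transcription and return the joined text."""
    # transcribe() yields segments lazily; iterating runs the actual decoding
    segments, info = whisper_model.transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)
```

Smaller models load and decode faster at some cost in accuracy; on CPU, `int8` compute keeps memory use down.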
Uses OpenRouter with Google's Gemini 2.0 Flash (free tier) by default. Change the model in `app.py`:

```python
default_model = "google/gemini-2.0-flash-exp:free"
```

Other free models are available, e.g. `meta-llama/llama-3.2-3b-instruct:free`
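The report step is a standard chat-completion POST to OpenRouter's API. A sketch of building that request (the system-prompt wording and function name are assumptions, not copied from `app.py`):

```python
import json

# OpenRouter's OpenAI-compatible chat completions endpoint
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_report_request(transcription: str, api_key: str,
                         model: str = "google/gemini-2.0-flash-exp:free"):
    """Build the headers and JSON body for an OpenRouter chat-completion call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Convert the clinician's dictation into a structured medical report."},
            {"role": "user", "content": transcription},
        ],
    }
    return headers, json.dumps(body)
```

Only the transcription text goes into the request body, which is what keeps the original audio local (see the privacy notes below).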
- `GET /`: Main web interface
- `POST /process`: Process an audio file and return the transcription and report
  - Accepts: `multipart/form-data` with a `file` field
  - Returns: JSON with `transcription` and `report` fields
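For scripting against the API, the endpoint can be exercised with a short client. A sketch using the third-party `requests` library (assumes the server is running locally; the helper name is illustrative):

```python
import requests


def process_audio(path: str, base_url: str = "http://127.0.0.1:8000") -> dict:
    """Send an audio file to POST /process and return the parsed JSON."""
    with open(path, "rb") as f:
        # FastAPI reads the upload from the multipart "file" field
        response = requests.post(
            f"{base_url}/process",
            files={"file": (path, f, "audio/wav")},
        )
    response.raise_for_status()
    return response.json()  # expected keys: "transcription", "report"
```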
- The Whisper model runs locally, so audio files are not sent to external services
- Only the transcription text is sent to OpenRouter (no original audio)
- Consider HIPAA compliance and local data privacy regulations
- "Error accessing microphone": Browser permission was denied. Allow microphone access, and note that browsers only expose the microphone over HTTPS (or on localhost), so production deployments need TLS
- "Report generation failed": Check OPENROUTER_API_KEY is set correctly
- Whisper errors: Ensure sufficient RAM/storage for the model (medium model requires ~2GB RAM)
- First transcription may be slow due to model loading
- Large audio files may require more memory
- GPU acceleration is available if CUDA is installed (change to device="cuda" in the WhisperModel call)