MediVoiceScribe

Clinical Speech-to-Structured-Report AI - A FastAPI web application that converts a doctor's dictated speech into structured medical documentation using speech-to-text and large language models (LLMs).

Features

  • Audio Recording: Record audio directly in the browser using the microphone
  • Audio File Upload: Upload existing audio files in common formats (MP3, MP4, WAV, M4A, OGG, WebM, etc.)
  • Speech-to-Text: Uses Faster Whisper (local, open-source Whisper model) for accurate transcription
  • Report Generation: Uses free-tier LLMs via OpenRouter to generate structured medical reports
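
At a high level, the /process endpoint ties these pieces together: the uploaded or recorded audio is transcribed locally, and the transcription is sent to an LLM to produce the report. The sketch below shows one way such a route could be wired up; it is a simplified illustration rather than the exact code in app.py, and transcribe / generate_report are placeholder helpers sketched further down under Configuration.

    import os, tempfile
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI()

    @app.post("/process")
    async def process(file: UploadFile = File(...)):
        # Persist the upload to a temporary file so Whisper can read it from disk
        suffix = os.path.splitext(file.filename or "audio.wav")[1]
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(await file.read())
            path = tmp.name
        try:
            transcription = transcribe(path)          # local Faster Whisper (see "Whisper Model")
            report = generate_report(transcription)   # OpenRouter LLM call (see "LLM Provider")
        finally:
            os.remove(path)
        return {"transcription": transcription, "report": report}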

Requirements

  • Python 3.8+
  • Microphone access (for recording feature)
  • OpenRouter API key (for free LLM access)

Installation

  1. Clone this repository

  2. Create a virtual environment:

    python3 -m venv venv
  3. Activate the virtual environment:

    source venv/bin/activate   # On macOS/Linux
    venv\Scripts\activate      # On Windows
  4. Install dependencies:

    pip install -r requirements.txt
  5. Set up environment variables: Create a .env file or set the environment variable:

    export OPENROUTER_API_KEY=your_openrouter_api_key_here
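
For reference, a typical way for the app to pick up the key at startup is sketched below. It assumes python-dotenv is used for .env support; plain os.environ is enough if the variable is exported in the shell.

    import os
    from dotenv import load_dotenv  # assumes python-dotenv is installed

    load_dotenv()  # loads variables from a .env file in the project root, if present
    api_key = os.environ.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")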

Usage

  1. Run the application:

    uvicorn app:app --reload
  2. Open your browser and go to http://127.0.0.1:8000

  3. Choose one of two options:

    • Upload Audio File: Select an existing audio file and click "Upload and Process"
    • Record Audio: Click "Start Recording", speak into your microphone, then click "Stop Recording"
  4. View the transcription and generated medical report

Configuration

Whisper Model

The app uses Faster Whisper with the "medium" model by default. You can change this in app.py by modifying:

whisper_model = WhisperModel("medium", device="cpu", compute_type="int8")

Available models: tiny, base, small, medium, large-v1, large-v2
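
As a rough sketch, the transcription step with faster-whisper looks like the following; arguments such as beam_size are illustrative and the exact call in app.py may differ.

    from faster_whisper import WhisperModel

    whisper_model = WhisperModel("medium", device="cpu", compute_type="int8")

    def transcribe(path: str) -> str:
        # transcribe() returns a lazy generator of segments plus language/duration info
        segments, _info = whisper_model.transcribe(path, beam_size=5)
        return " ".join(segment.text.strip() for segment in segments)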

LLM Provider

Uses OpenRouter with Google's Gemini 2.0 Flash (free tier) by default. Change the model in app.py:

default_model = "google/gemini-2.0-flash-exp:free"

Other free models are also available, e.g. meta-llama/llama-3.2-3b-instruct:free
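
OpenRouter exposes an OpenAI-compatible chat completions endpoint, so the report-generation call can look roughly like the sketch below. The system prompt here is purely illustrative; the actual prompt and request code in app.py may differ.

    import os
    import requests

    OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

    def generate_report(transcription: str,
                        model: str = "google/gemini-2.0-flash-exp:free") -> str:
        response = requests.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={
                "model": model,
                "messages": [
                    {"role": "system",
                     "content": "Structure the following clinical dictation into a medical report."},
                    {"role": "user", "content": transcription},
                ],
            },
            timeout=120,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]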

API Endpoints

  • GET /: Main web interface
  • POST /process: Process audio file and return transcription + report
    • Accepts: multipart/form-data with file field
    • Returns: JSON with transcription and report fields
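
The endpoint can be exercised directly from Python, for example as below; sample.wav is a placeholder file name and the server is assumed to be running locally on port 8000.

    import requests

    with open("sample.wav", "rb") as f:
        resp = requests.post(
            "http://127.0.0.1:8000/process",
            files={"file": ("sample.wav", f, "audio/wav")},
        )
    resp.raise_for_status()
    result = resp.json()
    print(result["transcription"])
    print(result["report"])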

Security Notes

  • The Whisper model runs locally, so audio files are not sent to external services
  • Only the transcription text is sent to OpenRouter (no original audio)
  • Consider HIPAA compliance and local data privacy regulations

Troubleshooting

Common Issues:

  • "Error accessing microphone": Browser permissions denied. Allow microphone access and ensure HTTPS for production
  • "Report generation failed": Check OPENROUTER_API_KEY is set correctly
  • Whisper errors: Ensure sufficient RAM/storage for the model (medium model requires ~2GB RAM)

Performance:

  • First transcription may be slow due to model loading
  • Large audio files may require more memory
  • GPU acceleration available if CUDA installed (change device="cuda" in WhisperModel)
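
For example, a CUDA-enabled setup would typically load the model like this; the compute_type values shown are common choices for GPUs rather than a requirement of this app.

    from faster_whisper import WhisperModel

    # float16 is the usual compute type on GPU; int8_float16 uses less VRAM at some cost in accuracy
    whisper_model = WhisperModel("medium", device="cuda", compute_type="float16")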
