ChronoChat is a video-RAG platform built on top of Ollama that lets users chat with video content without requiring vision/video-language models (VLMs). It supports both YouTube and local uploads and uses retrieval-augmented generation (RAG) to answer questions from video transcripts, frames, and captions. Powered by local LLMs, ChronoChat streams responses in real time, with additional support for image and PDF uploads.
demo.mp4
> [!NOTE]
> ChronoChat is ideal for:
> - ✅ Interviews, tutorials, and educational content
> - ❌ Not suited for animations or silent videos
```shell
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
python cli.py install
```

For GPU acceleration, install the CUDA-enabled version of PyTorch:
Visit https://pytorch.org/get-started/locally/ to get the correct command for your system.
💡 If you don't have an NVIDIA GPU or don't want CUDA, skip this step.
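To confirm the CUDA-enabled build is working (assuming `torch` is installed), a quick check like this should report `True` on a machine with a visible NVIDIA GPU:

```python
import torch

# Reports whether PyTorch was built with CUDA support and can see a GPU
print("CUDA available:", torch.cuda.is_available())
```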
ChronoChat requires ffmpeg for processing video and audio.
Download from: https://ffmpeg.org/download.html
If you haven't already, install Ollama.

```shell
# Start the Ollama server
ollama serve
```

```shell
# Launch ChronoChat
python cli.py start
```

Then open your browser at: http://localhost:3000
- 🔎 Video RAG: Uses CLIP, Whisper, and BLIP embeddings for frame, audio, and caption-based retrieval.
- 🧠 LLM Planning: Models generate reasoning chains, plan actions, and adapt to single or multi-video chats.
- 🔁 Streaming Responses: Live WebSocket chat with markdown rendering and response progress updates.
- 🎥 Multi-Video Support: Search and reason across multiple videos in a single conversation.
- 📎 Attach Files: Supports uploading PDFs and images.
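Transcript-based retrieval typically works on timestamped chunks rather than raw Whisper segments. The helper below is a hypothetical illustration (`Segment` and `chunk_transcript` are not ChronoChat's actual API) of how short segments can be merged into bounded-size chunks that keep their start/end times:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str

def chunk_transcript(segments: list[Segment], max_chars: int = 200) -> list[Segment]:
    """Merge consecutive transcript segments into retrieval chunks of bounded size."""
    chunks: list[Segment] = []
    for seg in segments:
        if chunks and len(chunks[-1].text) + len(seg.text) + 1 <= max_chars:
            last = chunks[-1]
            # Extend the current chunk: keep its start, adopt the new end
            chunks[-1] = Segment(last.start, seg.end, last.text + " " + seg.text)
        else:
            chunks.append(seg)
    return chunks

segments = [
    Segment(0.0, 2.5, "Welcome to the tutorial."),
    Segment(2.5, 5.0, "Today we cover retrieval."),
    Segment(5.0, 8.0, "Each chunk keeps its timestamps."),
]
for chunk in chunk_transcript(segments, max_chars=60):
    print(f"[{chunk.start:.1f}-{chunk.end:.1f}] {chunk.text}")
```

Keeping timestamps on each chunk is what lets answers point back to the moment in the video they came from.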
```mermaid
---
config:
  look: handDrawn
  theme: neutral
---
graph TD
    subgraph "Frontend (Next.js)"
        Sidebar["🗂 Chats & Videos"]
        UploadUI["📦 Upload videos"]
        ChatUI["💬 Chat interface"]
        APIClient["🌐 REST client"]
        WSClient["🔌 WebSocket client"]
    end
    subgraph "Backend (FastAPI & Async Worker)"
        ChatRouter["🗨️ Chat router"]
        MediaRouter["🎬 Media router"]
        VideoRAG["🧠 VideoRAG engine"]
        ContextExtractor["🔎 Context extractor"]
        Retriever["📦 ChromaDB retriever"]
        LLMClient["🤖 LLM client"]
        Worker["⚙️ Ingestion worker"]
        MediaDB["🗄️ ChromaDB"]
        MediaStorage["📁 Video and metadata storage"]
        VideoQueue["📮 Processing queue"]
    end
    Sidebar --> ChatUI
    UploadUI --> APIClient
    ChatUI -- "File upload" --> APIClient
    ChatUI <-- "Text query" --> WSClient
    APIClient <--> MediaRouter
    WSClient <--> ChatRouter
    ChatRouter --> VideoRAG
    VideoRAG <-- "Video query" --> ContextExtractor
    VideoRAG <-- "Other query" --> LLMClient
    ContextExtractor <--> Retriever
    ContextExtractor <--> LLMClient
    Retriever <--> MediaDB
    MediaRouter --> MediaStorage
    MediaRouter --> VideoQueue
    VideoQueue --> Worker
    Worker --> MediaDB
```
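The VideoQueue-to-Worker hand-off in the diagram can be sketched with a plain `asyncio.Queue`. This is a minimal sketch, not ChronoChat's actual code: `ingest_worker` and `process_video` are illustrative names, and the real pipeline would extract frames and audio, embed them, and write to ChromaDB where the placeholder sleeps:

```python
import asyncio

async def process_video(video_id: str) -> str:
    # Placeholder for the real pipeline: extract frames/audio, embed, store.
    await asyncio.sleep(0)
    return f"{video_id}: embedded"

async def ingest_worker(queue: asyncio.Queue, results: list[str]) -> None:
    # Drain the queue until a None sentinel arrives.
    while (video_id := await queue.get()) is not None:
        results.append(await process_video(video_id))
        queue.task_done()

async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    worker = asyncio.create_task(ingest_worker(queue, results))
    for vid in ("intro.mp4", "lecture.mp4"):
        await queue.put(vid)   # uploads enqueued by the media router
    await queue.put(None)      # sentinel: no more uploads
    await worker
    return results

print(asyncio.run(main()))
```

Decoupling uploads from processing this way keeps the upload endpoint fast: the HTTP request returns as soon as the video is enqueued, while embedding happens in the background.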
| Layer | Tools |
|---|---|
| Frontend | Next.js, TailwindCSS, Shadcn, TypeScript |
| Backend | FastAPI, AsyncIO, SQLite, ChromaDB |
| Embeddings | CLIP (frames), Whisper (audio), BLIP (captions) |
| LLM | Ollama |
| Storage | Local files, ChromaDB vectors, SQLite |
1. Ingest Video: Extracts audio, frames, and captions from YouTube/local videos.
2. Embed Content: Computes multimodal embeddings and stores them in ChromaDB.
3. Chat Interaction: LLM receives the user query and selects a retrieval mode.
4. RAG Flow: Relevant chunks are retrieved based on video context.
5. Response Streaming: Final output is streamed to the user in real time.
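The retrieval step boils down to ranking stored vectors by similarity to the query embedding. Here is a toy sketch with plain cosine similarity; the hand-written three-dimensional vectors and chunk ids stand in for the CLIP/Whisper/BLIP embeddings that ChromaDB would actually hold:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy index: chunk id -> embedding (real embeddings come from CLIP/Whisper/BLIP)
index = {
    "frame_0012": [0.9, 0.1, 0.0],
    "caption_03": [0.2, 0.8, 0.1],
    "audio_chunk_7": [0.1, 0.2, 0.9],
}

def retrieve(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query embedding."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]

print(retrieve([1.0, 0.0, 0.1]))  # nearest to the frame-like direction
```

In practice ChromaDB performs this ranking internally; the sketch only shows the principle behind "relevant chunks are retrieved based on video context".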