This repository demonstrates a real-time camera-based image understanding system powered by Vision Language Models (VLMs) through llama.cpp server integration.
- Real-time Camera Feed: Live video streaming with instant frame capture
- AI-Powered Analysis: Natural language understanding of visual content
- Customizable Instructions: Tailor AI prompts for specific use cases
- Responsive Web Interface: Modern, beautiful UI optimized for demos
- Configurable Intervals: Adjustable request frequency for different needs
- Clone this repository:

  ```sh
  git clone https://github.com/mrgehlot/object_detection_using_vllm.git
  cd object_detection_using_vllm
  ```
- Install llama.cpp: Follow the official installation guide to set up llama.cpp on your system
- Start the llama.cpp server with the SmolVLM model:

  ```sh
  llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
  ```
- After starting the server, your API endpoint appears on the second-to-last line of the console output (as shown in the image above). Use that URL (e.g., `http://localhost:8080`) in the web app's API Endpoint field.
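llama.cpp's server exposes an OpenAI-compatible `/v1/chat/completions` endpoint that accepts images as base64 data URLs. As a rough sketch of what the web app sends there, the request body can be built like this (the helper name `buildChatPayload` and the `max_tokens` value are illustrative, not part of this repository):

```javascript
// Illustrative helper: build the request body for llama.cpp's
// OpenAI-compatible /v1/chat/completions endpoint, attaching one
// captured camera frame as a base64 data URL.
function buildChatPayload(instruction, frameDataUrl) {
  return {
    max_tokens: 100, // assumed cap; tune for your use case
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: instruction },
          { type: "image_url", image_url: { url: frameDataUrl } },
        ],
      },
    ],
  };
}

// Usage: this object would be POSTed (e.g., with fetch) to
// http://localhost:8080/v1/chat/completions
const payload = buildChatPayload(
  "What do you see?",
  "data:image/jpeg;base64,/9j/4AAQ"
);
console.log(JSON.stringify(payload));
```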
GPU Acceleration (optional):
- For NVIDIA GPUs: Add the `-ngl 99` flag
- For AMD/Intel GPUs: Add the `-ngl 99` flag
Alternative Models: Explore other multimodal models in the llama.cpp documentation
- Open the web interface: In your code editor, right-click the `index.html` file, copy its file path, and paste it into your browser's address bar
- Configure settings (optional):
  - Modify the AI instruction prompt for specific use cases
  - Adjust the request interval based on your needs
  - Customize the API endpoint if needed (paste the URL shown by the server)
- Start detection: Click "Start Detection" and watch the AI analyze your camera feed in real time
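Each reply from the server follows the OpenAI chat-completion shape, so the displayed text comes from the first choice's message. A minimal sketch of that extraction (the helper name `extractAnswer` is illustrative, not from this repository):

```javascript
// Illustrative helper: pull the assistant's text out of an
// OpenAI-style chat-completion response from llama-server.
function extractAnswer(response) {
  const choice = response.choices && response.choices[0];
  if (!choice || !choice.message) return "";
  return choice.message.content || "";
}

// Usage with the response shape llama-server returns:
const reply = {
  choices: [
    { message: { role: "assistant", content: "A person holding a mug." } },
  ],
};
console.log(extractAnswer(reply));
```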
- Instructions: Modify the AI prompt to return specific formats (e.g., JSON responses)
- Intervals: Choose request intervals from 100 ms to 2 seconds
- API Endpoint: Point to your local or remote llama.cpp server instance
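The three settings above can be thought of as one configuration object, with the completion URL derived from the configured endpoint. A sketch under the assumption that the server's OpenAI-compatible path is `/v1/chat/completions` (the `settings` object and `completionsUrl` helper are illustrative names, not this repository's code):

```javascript
// Illustrative settings object mirroring the three controls above.
const settings = {
  instruction: "List the objects you see as a JSON array.",
  intervalMs: 500, // demo range: 100–2000 ms
  apiEndpoint: "http://localhost:8080",
};

// Derive the full completion URL from the configured endpoint,
// tolerating a trailing slash in the user's input.
function completionsUrl(endpoint) {
  return endpoint.replace(/\/+$/, "") + "/v1/chat/completions";
}

console.log(completionsUrl(settings.apiEndpoint));
```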
This demo showcases the power of Vision Language Models by combining:
- MediaDevices API for real-time camera access and video streaming
- Canvas API for frame capture and base64 encoding for AI processing
- llama.cpp server integration for multimodal AI inference
- Modern web technologies for a responsive user experience
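The Canvas step above produces a data URL via `canvas.toDataURL("image/jpeg")`, which carries a `data:image/jpeg;base64,` prefix; when only the raw base64 payload is needed, the prefix must be stripped. A small sketch of that step (the browser calls appear only as comments so the helper stays runnable anywhere; `dataUrlToBase64` is an illustrative name):

```javascript
// In the browser, the capture pipeline looks roughly like:
//   ctx.drawImage(video, 0, 0);                    // MediaDevices -> Canvas
//   const dataUrl = canvas.toDataURL("image/jpeg"); // Canvas -> data URL
// This helper strips the "data:image/jpeg;base64," prefix to get
// the raw base64 payload, passing non-data-URL strings through.
function dataUrlToBase64(dataUrl) {
  const comma = dataUrl.indexOf(",");
  return comma === -1 ? dataUrl : dataUrl.slice(comma + 1);
}

console.log(dataUrlToBase64("data:image/jpeg;base64,AAAA"));
```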

