This repository demonstrates a real-time camera-based image understanding system powered by Vision Language Models (VLMs) through llama.cpp server integration.
- Real-time Camera Feed: Live video streaming with instant frame capture
- AI-Powered Analysis: Natural language understanding of visual content
- Customizable Instructions: Tailor AI prompts for specific use cases
- Responsive Web Interface: Modern, beautiful UI optimized for demos
- Configurable Intervals: Adjustable request frequency for different needs
- Clone this repository:

  ```sh
  git clone https://github.com/mrgehlot/object_detection_using_vllm.git
  cd object_detection_using_vllm
  ```
- Install llama.cpp: Follow the official installation guide to set up llama.cpp on your system
- Start the llama.cpp server with the SmolVLM model:

  ```sh
  llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF
  ```
- After starting the server, your API endpoint appears on the second-to-last line of the console output (as shown in the image above). Use that URL (e.g., `http://localhost:8080`) in the web app's API Endpoint field.
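llama.cpp's server exposes an OpenAI-compatible `/v1/chat/completions` endpoint that accepts images as base64 data URLs. As a rough sketch of what the web app sends there, the request body can be built like this (the helper name `buildChatPayload` and the `max_tokens` value are illustrative, not part of this repository):

```javascript
// Illustrative helper: build the request body for llama.cpp's
// OpenAI-compatible /v1/chat/completions endpoint, attaching one
// captured camera frame as a base64 data URL.
function buildChatPayload(instruction, frameDataUrl) {
  return {
    max_tokens: 100, // assumed cap; tune for your use case
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: instruction },
          { type: "image_url", image_url: { url: frameDataUrl } },
        ],
      },
    ],
  };
}

// Usage: this object would be POSTed (e.g., with fetch) to
// http://localhost:8080/v1/chat/completions
const payload = buildChatPayload(
  "What do you see?",
  "data:image/jpeg;base64,/9j/4AAQ"
);
console.log(JSON.stringify(payload));
```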
GPU Acceleration (optional):
- For NVIDIA GPUs: Add the `-ngl 99` flag
- For AMD/Intel GPUs: Add the `-ngl 99` flag
Alternative Models: Explore other multimodal models in the llama.cpp documentation
- Open the web interface: In your code editor, right-click the `index.html` file, copy its file path, and paste it into your browser's address bar
- Configure settings (optional):
  - Modify the AI instruction prompt for specific use cases
  - Adjust the request interval based on your needs
  - Customize the API endpoint if needed (paste the URL shown by the server)
- Start detection: Click "Start Detection" and watch the AI analyze your camera feed in real time
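Each reply from the server follows the OpenAI chat-completion shape, so the displayed text comes from the first choice's message. A minimal sketch of that extraction (the helper name `extractAnswer` is illustrative, not from this repository):

```javascript
// Illustrative helper: pull the assistant's text out of an
// OpenAI-style chat-completion response from llama-server.
function extractAnswer(response) {
  const choice = response.choices && response.choices[0];
  if (!choice || !choice.message) return "";
  return choice.message.content || "";
}

// Usage with the response shape llama-server returns:
const reply = {
  choices: [
    { message: { role: "assistant", content: "A person holding a mug." } },
  ],
};
console.log(extractAnswer(reply));
```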
- Instructions: Modify the AI prompt to return specific formats (e.g., JSON responses)
- Intervals: Choose request intervals from 100 ms to 2 seconds
- API Endpoint: Point to your local or remote llama.cpp server instance
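The three settings above can be thought of as one configuration object, with the completion URL derived from the configured endpoint. A sketch under the assumption that the server's OpenAI-compatible path is `/v1/chat/completions` (the `settings` object and `completionsUrl` helper are illustrative names, not this repository's code):

```javascript
// Illustrative settings object mirroring the three controls above.
const settings = {
  instruction: "List the objects you see as a JSON array.",
  intervalMs: 500, // demo range: 100–2000 ms
  apiEndpoint: "http://localhost:8080",
};

// Derive the full completion URL from the configured endpoint,
// tolerating a trailing slash in the user's input.
function completionsUrl(endpoint) {
  return endpoint.replace(/\/+$/, "") + "/v1/chat/completions";
}

console.log(completionsUrl(settings.apiEndpoint));
```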
This demo showcases the power of Vision Language Models by combining:
- MediaDevices API for real-time camera access and video streaming
- Canvas API for frame capture and base64 encoding for AI processing
- llama.cpp server integration for multimodal AI inference
- Modern web technologies for a responsive user experience
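The Canvas step above produces a data URL via `canvas.toDataURL("image/jpeg")`, which carries a `data:image/jpeg;base64,` prefix; when only the raw base64 payload is needed, the prefix must be stripped. A small sketch of that step (the browser calls appear only as comments so the helper stays runnable anywhere; `dataUrlToBase64` is an illustrative name):

```javascript
// In the browser, the capture pipeline looks roughly like:
//   ctx.drawImage(video, 0, 0);                    // MediaDevices -> Canvas
//   const dataUrl = canvas.toDataURL("image/jpeg"); // Canvas -> data URL
// This helper strips the "data:image/jpeg;base64," prefix to get
// the raw base64 payload, passing non-data-URL strings through.
function dataUrlToBase64(dataUrl) {
  const comma = dataUrl.indexOf(",");
  return comma === -1 ? dataUrl : dataUrl.slice(comma + 1);
}

console.log(dataUrlToBase64("data:image/jpeg;base64,AAAA"));
```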

