Vision Language Model Demo

[demo]

This repository demonstrates a real-time, camera-based image understanding system powered by Vision Language Models (VLMs) served by a local llama.cpp server.

Features

  • Real-time Camera Feed: Live video streaming with instant frame capture
  • AI-Powered Analysis: Natural language understanding of visual content
  • Customizable Instructions: Tailor AI prompts for specific use cases
  • Responsive Web Interface: Modern, beautiful UI optimized for demos
  • Configurable Intervals: Adjustable request frequency for different needs

How to Set Up

Prerequisites

  1. Clone this repository:

    git clone https://github.com/mrgehlot/object_detection_using_vllm.git
    cd object_detection_using_vllm
  2. Install llama.cpp: Follow the official installation guide to set up llama.cpp on your system.

Running the Demo

  1. Start the llama.cpp server with the SmolVLM model:

    llama-server -hf ggml-org/SmolVLM-500M-Instruct-GGUF

    [llama.cpp server output]

    • After the server starts, the API endpoint is printed on the second-to-last line of the console output (as shown in the image above). Use that URL (e.g., http://localhost:8080, the default) in the web app's API Endpoint field.

    GPU Acceleration (optional):

    • Add the -ngl 99 flag to offload all model layers to the GPU (NVIDIA, AMD, or Intel). This only takes effect if llama.cpp was built with the matching GPU backend.

    Alternative Models: Explore other multimodal models in the llama.cpp documentation

  2. Open the web interface: Open the index.html file directly in your browser (for example, right-click index.html in your code editor, copy its full file path, and paste it into the browser's address bar)

  3. Configure settings (optional):

    • Modify the AI instruction prompt for specific use cases
    • Adjust the request interval based on your needs
    • Customize the API endpoint if needed (paste the URL shown by the server)
  4. Start detection: Click "Start Detection" and watch the AI analyze your camera feed in real time. (A sketch of the request sent at each interval follows this list.)
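
To sanity-check the server before wiring up the camera, the snippet below sends a single base64-encoded image to the server's OpenAI-compatible chat endpoint. It is a minimal sketch rather than code from this repository: it assumes the default address http://localhost:8080, and base64Jpeg stands in for a real JPEG frame.

    // Minimal sketch: one multimodal request to the llama.cpp server.
    // Assumes the default endpoint http://localhost:8080; base64Jpeg is
    // a placeholder for an actual base64-encoded JPEG frame.
    const API_BASE = "http://localhost:8080";

    async function describeImage(base64Jpeg) {
      const response = await fetch(`${API_BASE}/v1/chat/completions`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          max_tokens: 100,
          messages: [
            {
              role: "user",
              content: [
                { type: "text", text: "What do you see in this image?" },
                {
                  type: "image_url",
                  image_url: { url: `data:image/jpeg;base64,${base64Jpeg}` },
                },
              ],
            },
          ],
        }),
      });
      const data = await response.json();
      return data.choices[0].message.content;
    }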

Customization

  • Instructions: Modify the AI prompt to return specific formats, such as JSON responses (see the example after this list)
  • Intervals: Choose from 100ms to 2-second intervals between requests
  • API Endpoint: Point to your local or remote llama.cpp server instance
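
For instance, an instruction along these lines (purely illustrative wording, not the repo's default prompt) nudges the model toward machine-readable output:

    // Illustrative instruction prompt (not the repo's default) asking
    // the model to answer in JSON only.
    const instruction =
      'List the objects you can see. Answer with JSON only, e.g. ' +
      '{"objects": ["person", "laptop"]}.';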

Technical Details

This demo showcases the power of Vision Language Models by combining the following (a minimal sketch of the full loop appears after the list):

  • MediaDevices API for real-time camera access and video streaming
  • Canvas API for frame capture and base64 encoding for AI processing
  • llama.cpp server integration for multimodal AI inference
  • Modern web technologies for a responsive user experience
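
The sketch below shows how these pieces fit together. It is illustrative rather than the repository's actual code: the element IDs, the 500 ms interval, and the describeImage helper (from the earlier snippet) are all assumptions.

    // Illustrative capture loop: camera -> canvas -> base64 -> VLM request.
    // Element IDs and the 500 ms interval are assumed, not taken from the
    // repo; describeImage is the helper sketched earlier in this README.
    const video = document.getElementById("camera");
    const canvas = document.getElementById("frame");

    async function startDetection() {
      // MediaDevices API: request the camera and stream it into <video>.
      const stream = await navigator.mediaDevices.getUserMedia({ video: true });
      video.srcObject = stream;
      await video.play();

      setInterval(async () => {
        // Canvas API: copy the current frame and encode it as base64 JPEG.
        canvas.width = video.videoWidth;
        canvas.height = video.videoHeight;
        canvas.getContext("2d").drawImage(video, 0, 0);
        const base64Jpeg = canvas.toDataURL("image/jpeg").split(",")[1];

        // llama.cpp server integration: one inference request per interval.
        const answer = await describeImage(base64Jpeg);
        console.log(answer);
      }, 500);
    }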