Skip to content

pratiksahu/agent-test-omniparser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

OmniParser Autonomous App - Enterprise-Grade Local UI Automation

An intelligent, privacy-first UI automation solution leveraging Microsoft's OmniParser v2 for advanced icon detection and autonomous web navigation. 100% local execution with zero external dependencies.

License: MIT Python Node.js OmniParser

🎯 Key Features

  • πŸ” Icon & UI Element Detection: Identifies buttons, icons, text, and interactable elements using YOLOv8
  • πŸ€– Autonomous Navigation: Browser automation with goal-driven actions via Puppeteer
  • πŸ“Š Screen Parsing: Analyzes UI layouts and suggests interactions
  • 🎨 Smart Action Planning: Prioritizes actions based on goals and confidence scores
  • ⚑ Real-time Analysis: Processes screenshots locally with ~1 second inference time
  • πŸ”’ 100% Private: All processing happens on your machine - no data leaves your computer
  • πŸ”„ Fallback Support: Works even without models using OpenCV detection

πŸš€ Quick Start

Prerequisites

  • Node.js 18+ and npm
  • Python 3.8+ with pip
  • Git for cloning the repository
  • 8GB RAM minimum (16GB recommended)
  • 5GB disk space for models

Automatic Setup (Recommended)

1. Clone the repository:

git clone <repository-url>
cd agent-ui

2. Run the setup script:

For macOS/Linux:

chmod +x setup.sh
./setup.sh

For Windows:

setup.bat

The setup script will:

  • βœ… Create a Python virtual environment
  • βœ… Install all Python dependencies (PyTorch, YOLOv8, transformers)
  • βœ… Download OmniParser models locally (~2-3GB)
  • βœ… Install Node.js dependencies
  • βœ… Configure the application

Manual Setup

  1. Set up Python environment:
# Create virtual environment
python3 -m venv venv

# Activate it
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python dependencies
pip install -r requirements.txt

# Download models
python python/setup_models.py
  1. Install Node.js dependencies:
npm install
  1. Configure environment:
cp .env.example .env
# Edit .env if needed (all defaults should work)

πŸƒ Running the Application

You need to run two servers:

Terminal 1 - Python ML Server:

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Start Python server
python python/omniparser_local.py

You should see:

πŸš€ Starting Flask server on port 5001...

Terminal 2 - Node.js Application:

npm start

You should see:

πŸš€ OmniParser Autonomous App is running!
πŸ“ Local: http://localhost:3000

3. Access the Application:

Open your browser and navigate to: http://localhost:3000

πŸ§ͺ Verify Installation

Test that everything is working:

# Test Python server health
curl http://localhost:5001/health

# Test Node.js API
curl http://localhost:3000/api/demo

# Run full test suite
npm test

API Endpoints

Parse Image

curl -X POST http://localhost:3000/api/parse-image \
  -F "image=@screenshot.png" \
  -F "context=login"

Start Autonomous Agent

curl -X POST http://localhost:3000/api/autonomous/start \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Set and Execute Goal

curl -X POST http://localhost:3000/api/autonomous/goal \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Find and click login button",
    "keywords": ["login", "sign in"],
    "maxSteps": 5
  }'

Autonomous Exploration

curl -X POST http://localhost:3000/api/autonomous/explore \
  -H "Content-Type: application/json" \
  -d '{"maxActions": 3, "waitTime": 2000}'

Click Specific Icon

curl -X POST http://localhost:3000/api/autonomous/click-icon \
  -H "Content-Type: application/json" \
  -d '{"iconName": "settings"}'

Usage Examples

Basic Screen Parsing

import { OmniParser } from './src/omniparser.js';

const parser = new OmniParser(process.env.HF_TOKEN);
const result = await parser.parseScreen('screenshot.png', {
  generateDescriptions: true,
  context: 'dashboard'
});

console.log(`Found ${result.summary.iconCount} icons`);
console.log(`Layout type: ${result.layout.type}`);

Autonomous Browser Control

import { AutonomousAgent } from './src/autonomous-agent.js';

const agent = new AutonomousAgent(parser);
await agent.initialize();
await agent.navigateTo('https://example.com');

// Set a goal
await agent.setGoal({
  description: 'Complete the signup process',
  keywords: ['signup', 'register', 'email', 'password'],
  maxSteps: 10
});

const result = await agent.executeGoal();
console.log(`Goal completed: ${result.success}`);

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Web Browser                          β”‚
β”‚                 http://localhost:3000                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Node.js Express Server (Port 3000)         β”‚
β”‚  β€’ Web Interface    β€’ REST API    β€’ WebSocket Support   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚                       β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  LocalOmniParser  β”‚   β”‚ AutonomousAgent  β”‚
    β”‚    (JS Client)    β”‚   β”‚   (Puppeteer)    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚     Python ML Server (Port 5001)          β”‚
    β”‚  β€’ YOLOv8 Detection  β€’ Florence-2 Captionsβ”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

  • LocalOmniParser: JavaScript client that communicates with Python server
  • AutonomousAgent: Manages Puppeteer browser automation and goal execution
  • Python ML Server: Runs YOLOv8 and Florence-2 models for inference
  • Express API: RESTful endpoints for web interface and API consumers

OmniParser v2 Capabilities

  • 39.5% accuracy on ScreenSpot Pro benchmark
  • 60% faster than v1 (0.6s/frame on A100)
  • Enhanced small icon detection
  • Interactability prediction
  • DOM-like structured output

πŸ’» System Requirements

Minimum Requirements

  • Node.js 18+ with npm
  • Python 3.8+ with pip
  • RAM: 8GB
  • Disk Space: 5GB for models + workspace
  • Browser: Chrome/Chromium (auto-installed by Puppeteer)

Recommended Specifications

  • RAM: 16GB for smooth performance
  • CPU: Multi-core processor (4+ cores)
  • Storage: SSD for faster model loading

GPU Support (Optional but Recommended)

For 3-5x faster inference:

  • NVIDIA GPU with CUDA 11.8+
  • At least 4GB VRAM
  • Install CUDA-enabled PyTorch:
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

πŸ€– Local Models

Downloaded Automatically

  • YOLOv8 (50MB): Object detection optimized for UI elements
  • Florence-2-base (700MB): Microsoft's vision-language model for captioning
  • Total Download: ~2-3GB including dependencies

Model Locations

models/
β”œβ”€β”€ yolo/
β”‚   └── yolov8m.pt          # YOLO weights
β”œβ”€β”€ florence/
β”‚   └── [model files]       # Florence-2 model
└── model_config.json       # Configuration

πŸ› οΈ Troubleshooting

Common Issues

Port Already in Use

# Error: Port 5001 is already in use
lsof -i :5001  # Find process
kill -9 <PID>  # Kill process

Python Module Not Found

source venv/bin/activate
pip install -r requirements.txt

Puppeteer Can't Launch Browser

# Reinstall Puppeteer
npm uninstall puppeteer
npm install puppeteer

Models Not Loading

# Re-download models
python python/setup_models.py

Performance Issues

Slow Inference

  • Switch to GPU (see GPU Support above)
  • Use smaller YOLO model: yolov8n.pt instead of yolov8m.pt
  • Reduce image resolution in omniparser_local.py

High Memory Usage

  • Close unnecessary browser tabs (Puppeteer)
  • Restart Python server periodically
  • Use environment variable: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

πŸ“š Documentation

For Developers

  • See CLAUDE.md for detailed development guide
  • API documentation in code comments
  • Test examples in src/test.js

Project Structure

agent-ui/
β”œβ”€β”€ src/                    # Node.js application
β”œβ”€β”€ python/                 # Python ML server
β”œβ”€β”€ models/                 # Downloaded models
β”œβ”€β”€ screenshots/            # Captured screens
β”œβ”€β”€ venv/                   # Python virtual env
└── docs/                   # Additional docs

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: npm test
  5. Submit a pull request

πŸ“„ License

MIT License - See LICENSE for details

πŸ™ Acknowledgments

  • Microsoft Research for OmniParser
  • Ultralytics for YOLOv8
  • Google for Puppeteer
  • Open source community

πŸ“ž Support

  • Issues: GitHub Issues
  • Documentation: CLAUDE.md for development details
  • Examples: See /src/test.js for usage examples

πŸš€ Professional AI Consulting Services

GeekyAnts

Transform Your Business with Enterprise AI Solutions

GeekyAnts - Your Partner in AI Innovation

🎯 Our AI Expertise

At GeekyAnts, we specialize in delivering cutting-edge AI solutions that drive real business value:

πŸ€– Autonomous AI Agents

  • Custom AI agents tailored to your business processes
  • Intelligent automation for complex workflows
  • Integration with existing enterprise systems
  • Scalable, production-ready deployments

🧠 Computer Vision & ML Solutions

  • Advanced image and video processing systems
  • Custom model training and optimization
  • Real-time inference pipelines
  • Edge deployment strategies

πŸ’Ό Enterprise AI Integration

  • LLM integration and fine-tuning
  • RAG (Retrieval-Augmented Generation) systems
  • Knowledge base development
  • AI-powered analytics dashboards

πŸ† Why Choose GeekyAnts?

  • βœ… 500+ Engineers - Large team of AI/ML specialists and full-stack developers
  • βœ… 15+ Years - Proven track record in enterprise software development
  • βœ… Global Presence - Offices in India, USA, and UK
  • βœ… Fortune 500 Clients - Trusted by leading global brands
  • βœ… End-to-End Solutions - From POC to production deployment

πŸ’‘ Our Process

  1. Discovery & Consultation - Understanding your unique challenges
  2. Solution Architecture - Designing scalable, efficient AI systems
  3. Rapid Prototyping - Quick POCs to validate approach
  4. Production Development - Enterprise-grade implementation
  5. Deployment & Support - Seamless integration and ongoing optimization

πŸ“‹ Services We Offer

Service Description
AI Strategy Consulting Roadmap development for AI adoption
Custom Model Development Training models specific to your domain
System Integration Seamless AI integration with existing infrastructure
Performance Optimization Improving inference speed and accuracy
MLOps Implementation CI/CD pipelines for ML models
24/7 Support Continuous monitoring and maintenance

🌟 Success Stories

  • Automated Document Processing - 90% reduction in manual processing time
  • Intelligent Customer Support - 60% decrease in response time
  • Predictive Maintenance - 40% reduction in equipment downtime
  • Vision-Based Quality Control - 99.9% defect detection accuracy

πŸ“¬ Get in Touch

Ready to revolutionize your business with AI? Let's discuss how we can help.

🌐 Website: https://geekyants.com πŸ“§ Email: ai-consulting@geekyants.com πŸ“± Phone: +1 (415) 890-5433 | +91 (804) 785-5522 πŸ’Ό LinkedIn: GeekyAnts

Schedule a Free Consultation

🏒 Our Offices

USA - San Francisco, CA India - Bangalore, Karnataka UK - London


Built with ❀️ by GeekyAnts Engineering Team

This open-source project demonstrates our commitment to advancing AI technology and sharing knowledge with the developer community.


Privacy Notice: This application runs 100% locally. No data is sent to external servers, ensuring complete privacy and control over your automation workflows. This aligns with our commitment to data security and privacy-first solutions at GeekyAnts.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors