An intelligent, privacy-first UI automation solution leveraging Microsoft's OmniParser v2 for advanced icon detection and autonomous web navigation. 100% local execution with zero external dependencies.
- **Icon & UI Element Detection**: Identifies buttons, icons, text, and interactable elements using YOLOv8
- **Autonomous Navigation**: Browser automation with goal-driven actions via Puppeteer
- **Screen Parsing**: Analyzes UI layouts and suggests interactions
- **Smart Action Planning**: Prioritizes actions based on goals and confidence scores
- **Real-time Analysis**: Processes screenshots locally with ~1 second inference time
- **100% Private**: All processing happens on your machine; no data leaves your computer
- **Fallback Support**: Works even without models, using OpenCV-based detection
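As a rough illustration of how confidence-plus-goal action planning can work, the sketch below ranks detected elements by detector confidence, boosted by keyword matches against the current goal. This is an illustrative sketch, not the project's actual planner; the element shape `{ label, confidence }` and the 0.5 keyword bonus are assumptions.

```javascript
// Illustrative sketch of goal-aware action scoring (not the project's
// actual planner). Elements with labels matching goal keywords are
// boosted above elements that merely have high detector confidence.
function rankActions(elements, goalKeywords) {
  return elements
    .map((el) => {
      const text = (el.label || '').toLowerCase();
      const matches = goalKeywords.filter((k) => text.includes(k.toLowerCase())).length;
      // Detector confidence is the base score; each keyword match adds a bonus.
      return { ...el, score: el.confidence + 0.5 * matches };
    })
    .sort((a, b) => b.score - a.score);
}

const ranked = rankActions(
  [
    { label: 'Search', confidence: 0.9 },
    { label: 'Login', confidence: 0.7 },
  ],
  ['login', 'sign in']
);
// 'Login' (0.7 + 0.5 = 1.2) ranks above 'Search' (0.9).
```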
- Node.js 18+ and npm
- Python 3.8+ with pip
- Git for cloning the repository
- 8GB RAM minimum (16GB recommended)
- 5GB disk space for models
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd agent-ui
   ```

2. Run the setup script:

   For macOS/Linux:

   ```bash
   chmod +x setup.sh
   ./setup.sh
   ```

   For Windows:

   ```bat
   setup.bat
   ```

The setup script will:

- Create a Python virtual environment
- Install all Python dependencies (PyTorch, YOLOv8, transformers)
- Download the OmniParser models locally (~2-3GB)
- Install Node.js dependencies
- Configure the application
- Set up the Python environment:

  ```bash
  # Create a virtual environment
  python3 -m venv venv

  # Activate it
  source venv/bin/activate  # On Windows: venv\Scripts\activate

  # Install Python dependencies
  pip install -r requirements.txt

  # Download the models
  python python/setup_models.py
  ```

- Install Node.js dependencies:

  ```bash
  npm install
  ```

- Configure the environment:

  ```bash
  cp .env.example .env
  # Edit .env if needed (all defaults should work)
  ```

You need to run two servers:
1. Start the Python ML server:

   ```bash
   # Activate the virtual environment
   source venv/bin/activate  # On Windows: venv\Scripts\activate

   # Start the Python server
   python python/omniparser_local.py
   ```

   You should see:

   ```
   Starting Flask server on port 5001...
   ```

2. In a second terminal, start the Node.js server:

   ```bash
   npm start
   ```

   You should see:

   ```
   OmniParser Autonomous App is running!
   Local: http://localhost:3000
   ```

Open your browser and navigate to: http://localhost:3000
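If you script the startup, you can wait for both processes to come up before testing. The sketch below is illustrative: the two health endpoints match the ones used elsewhere in this README, but `waitForServices` and its retry logic are not part of the project. It relies on the global `fetch` available in Node.js 18+.

```javascript
// Hypothetical helper: poll both local servers until they respond.
// The endpoint URLs come from this README; the helper itself is a sketch.
const SERVICES = [
  { name: 'python-ml', url: 'http://localhost:5001/health' },
  { name: 'node-app', url: 'http://localhost:3000/api/demo' },
];

async function waitForServices(services, { retries = 10, delayMs = 1000, fetchFn = fetch } = {}) {
  const status = {};
  for (const svc of services) {
    status[svc.name] = false;
    for (let i = 0; i < retries && !status[svc.name]; i++) {
      try {
        const res = await fetchFn(svc.url);
        status[svc.name] = res.ok; // healthy once the endpoint returns 2xx
      } catch {
        // Server not up yet; wait before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  return status;
}
```

The `fetchFn` parameter is injected only so the helper can be exercised without live servers; in normal use the default `fetch` is fine.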
Test that everything is working:

```bash
# Test the Python server health
curl http://localhost:5001/health

# Test the Node.js API
curl http://localhost:3000/api/demo

# Run the full test suite
npm test
```

Parse a screenshot:

```bash
curl -X POST http://localhost:3000/api/parse-image \
  -F "image=@screenshot.png" \
  -F "context=login"
```

Start an autonomous browsing session:

```bash
curl -X POST http://localhost:3000/api/autonomous/start \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

Set a goal for the agent:

```bash
curl -X POST http://localhost:3000/api/autonomous/goal \
  -H "Content-Type: application/json" \
  -d '{
    "description": "Find and click login button",
    "keywords": ["login", "sign in"],
    "maxSteps": 5
  }'
```

Explore the current page:

```bash
curl -X POST http://localhost:3000/api/autonomous/explore \
  -H "Content-Type: application/json" \
  -d '{"maxActions": 3, "waitTime": 2000}'
```

Click a specific icon by name:

```bash
curl -X POST http://localhost:3000/api/autonomous/click-icon \
  -H "Content-Type: application/json" \
  -d '{"iconName": "settings"}'
```

Parse a screen programmatically:

```javascript
import { OmniParser } from './src/omniparser.js';

const parser = new OmniParser(process.env.HF_TOKEN);
const result = await parser.parseScreen('screenshot.png', {
  generateDescriptions: true,
  context: 'dashboard'
});

console.log(`Found ${result.summary.iconCount} icons`);
console.log(`Layout type: ${result.layout.type}`);
```

Drive the autonomous agent:

```javascript
import { AutonomousAgent } from './src/autonomous-agent.js';

const agent = new AutonomousAgent(parser);
await agent.initialize();
await agent.navigateTo('https://example.com');

// Set a goal
await agent.setGoal({
  description: 'Complete the signup process',
  keywords: ['signup', 'register', 'email', 'password'],
  maxSteps: 10
});

const result = await agent.executeGoal();
console.log(`Goal completed: ${result.success}`);
```
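A parse result can also be post-processed locally once it is returned. The helper below is a hedged sketch of such post-processing; the element fields (`type`, `interactable`, `confidence`) are assumptions for illustration, not the exact schema returned by `parseScreen`.

```javascript
// Hedged sketch: summarize a list of detected UI elements locally.
// The field names (type, interactable, confidence) are assumptions,
// not the project's exact schema.
function summarizeElements(elements) {
  const iconCount = elements.filter((e) => e.type === 'icon').length;
  const interactable = elements.filter((e) => e.interactable).length;
  const avgConfidence = elements.length
    ? elements.reduce((sum, e) => sum + e.confidence, 0) / elements.length
    : 0;
  return { total: elements.length, iconCount, interactable, avgConfidence };
}

const summary = summarizeElements([
  { type: 'icon', confidence: 0.9, interactable: true },
  { type: 'text', confidence: 0.8, interactable: false },
  { type: 'icon', confidence: 0.7, interactable: true },
]);
// summary.iconCount === 2, summary.interactable === 2, summary.avgConfidence ≈ 0.8
```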
```
┌─────────────────────────────────────────────────┐
│                   Web Browser                   │
│              http://localhost:3000              │
└────────────────────────┬────────────────────────┘
                         │
┌────────────────────────┴────────────────────────┐
│        Node.js Express Server (Port 3000)       │
│    • Web Interface  • REST API  • WebSockets    │
└────────────┬───────────────────────┬────────────┘
             │                       │
   ┌─────────┴─────────┐   ┌─────────┴─────────┐
   │  LocalOmniParser  │   │  AutonomousAgent  │
   │    (JS Client)    │   │    (Puppeteer)    │
   └─────────┬─────────┘   └───────────────────┘
             │
   ┌─────────┴───────────────────────────────────┐
   │        Python ML Server (Port 5001)         │
   │  • YOLOv8 Detection  • Florence-2 Captions  │
   └─────────────────────────────────────────────┘
```
- LocalOmniParser: JavaScript client that communicates with Python server
- AutonomousAgent: Manages Puppeteer browser automation and goal execution
- Python ML Server: Runs YOLOv8 and Florence-2 models for inference
- Express API: RESTful endpoints for web interface and API consumers
- 39.5% accuracy on ScreenSpot Pro benchmark
- 60% faster than v1 (0.6s/frame on A100)
- Enhanced small icon detection
- Interactability prediction
- DOM-like structured output
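"DOM-like structured output" can be thought of as nesting flat detections by bounding-box containment. The sketch below shows one way to build such a tree; the `bbox` format `[x1, y1, x2, y2]`, the field names, and the algorithm itself are illustrative assumptions, not OmniParser's actual implementation.

```javascript
// Sketch: nest flat UI detections into a DOM-like tree by containment.
// bbox = [x1, y1, x2, y2]; all names here are illustrative.
function contains(outer, inner) {
  return outer[0] <= inner[0] && outer[1] <= inner[1] &&
         outer[2] >= inner[2] && outer[3] >= inner[3];
}

function buildTree(elements) {
  // Place largest boxes first so parents exist before their children.
  const sorted = [...elements].sort((a, b) =>
    (b.bbox[2] - b.bbox[0]) * (b.bbox[3] - b.bbox[1]) -
    (a.bbox[2] - a.bbox[0]) * (a.bbox[3] - a.bbox[1]));
  const roots = [];
  for (const el of sorted) {
    const node = { ...el, children: [] };
    // Walk down to the deepest already-placed node containing this box.
    let parent = null;
    const stack = [...roots];
    while (stack.length) {
      const cand = stack.pop();
      if (contains(cand.bbox, node.bbox)) {
        parent = cand;
        stack.push(...cand.children);
      }
    }
    (parent ? parent.children : roots).push(node);
  }
  return roots;
}
```

For example, a panel containing a button that contains an icon comes out as a three-level tree, which is easier for an agent to reason about than a flat detection list.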
- Node.js 18+ with npm
- Python 3.8+ with pip
- RAM: 8GB
- Disk Space: 5GB for models + workspace
- Browser: Chrome/Chromium (auto-installed by Puppeteer)
- RAM: 16GB for smooth performance
- CPU: Multi-core processor (4+ cores)
- Storage: SSD for faster model loading
For 3-5x faster inference:
- NVIDIA GPU with CUDA 11.8+
- At least 4GB VRAM
- Install CUDA-enabled PyTorch:

  ```bash
  pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
  ```
- YOLOv8 (50MB): Object detection optimized for UI elements
- Florence-2-base (700MB): Microsoft's vision-language model for captioning
- Total Download: ~2-3GB including dependencies
```
models/
├── yolo/
│   └── yolov8m.pt        # YOLO weights
├── florence/
│   └── [model files]     # Florence-2 model
└── model_config.json     # Configuration
```
**Port 5001 already in use:**

```bash
lsof -i :5001    # Find the process
kill -9 <PID>    # Kill the process
```

**Missing Python dependencies:**

```bash
source venv/bin/activate
pip install -r requirements.txt
```

**Puppeteer problems:**

```bash
# Reinstall Puppeteer
npm uninstall puppeteer
npm install puppeteer
```

**Corrupt or missing models:**

```bash
# Re-download the models
python python/setup_models.py
```

**Slow inference:**

- Switch to GPU (see GPU Support above)
- Use the smaller YOLO model: `yolov8n.pt` instead of `yolov8m.pt`
- Reduce the image resolution in `omniparser_local.py`

**High memory usage:**

- Close unnecessary browser tabs (Puppeteer)
- Restart the Python server periodically
- Limit the CUDA allocator's split size:

  ```bash
  export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
  ```
- See CLAUDE.md for a detailed development guide
- API documentation in code comments
- Test examples in `src/test.js`
```
agent-ui/
├── src/          # Node.js application
├── python/       # Python ML server
├── models/       # Downloaded models
├── screenshots/  # Captured screens
├── venv/         # Python virtual env
└── docs/         # Additional docs
```
- Fork the repository
- Create a feature branch
- Make your changes
- Run the tests: `npm test`
- Submit a pull request
MIT License - See LICENSE for details
- Microsoft Research for OmniParser
- Ultralytics for YOLOv8
- Google for Puppeteer
- Open source community
- Issues: GitHub Issues
- Documentation: CLAUDE.md for development details
- Examples: See `/src/test.js` for usage examples
At GeekyAnts, we specialize in delivering cutting-edge AI solutions that drive real business value:
- Custom AI agents tailored to your business processes
- Intelligent automation for complex workflows
- Integration with existing enterprise systems
- Scalable, production-ready deployments
- Advanced image and video processing systems
- Custom model training and optimization
- Real-time inference pipelines
- Edge deployment strategies
- LLM integration and fine-tuning
- RAG (Retrieval-Augmented Generation) systems
- Knowledge base development
- AI-powered analytics dashboards
- **500+ Engineers** - Large team of AI/ML specialists and full-stack developers
- **15+ Years** - Proven track record in enterprise software development
- **Global Presence** - Offices in India, USA, and UK
- **Fortune 500 Clients** - Trusted by leading global brands
- **End-to-End Solutions** - From POC to production deployment
- Discovery & Consultation - Understanding your unique challenges
- Solution Architecture - Designing scalable, efficient AI systems
- Rapid Prototyping - Quick POCs to validate approach
- Production Development - Enterprise-grade implementation
- Deployment & Support - Seamless integration and ongoing optimization
| Service | Description |
|---|---|
| AI Strategy Consulting | Roadmap development for AI adoption |
| Custom Model Development | Training models specific to your domain |
| System Integration | Seamless AI integration with existing infrastructure |
| Performance Optimization | Improving inference speed and accuracy |
| MLOps Implementation | CI/CD pipelines for ML models |
| 24/7 Support | Continuous monitoring and maintenance |
- Automated Document Processing - 90% reduction in manual processing time
- Intelligent Customer Support - 60% decrease in response time
- Predictive Maintenance - 40% reduction in equipment downtime
- Vision-Based Quality Control - 99.9% defect detection accuracy
Ready to revolutionize your business with AI? Let's discuss how we can help.
- Website: https://geekyants.com
- Email: ai-consulting@geekyants.com
- Phone: +1 (415) 890-5433 | +91 (804) 785-5522
- LinkedIn: GeekyAnts

Schedule a Free Consultation

Offices: USA - San Francisco, CA | India - Bangalore, Karnataka | UK - London
This open-source project demonstrates our commitment to advancing AI technology and sharing knowledge with the developer community.
Privacy Notice: This application runs 100% locally. No data is sent to external servers, ensuring complete privacy and control over your automation workflows. This aligns with our commitment to data security and privacy-first solutions at GeekyAnts.
