diff --git a/examples/showcase/vision/.agentv/config.yaml b/examples/showcase/vision/.agentv/config.yaml new file mode 100644 index 00000000..d5bffc8c --- /dev/null +++ b/examples/showcase/vision/.agentv/config.yaml @@ -0,0 +1,22 @@ +# AgentV Configuration for Vision Examples +# This configuration specifies directories and settings for vision evaluation examples + +# Directory containing evaluation YAML files +evalsDir: ./datasets + +# Directory containing evaluator definitions (LLM judges and code validators) +evaluatorsDir: ./evaluators + +# Test images directory (users should place test images here) +testImagesDir: ./test-images + +# Default settings for vision evaluations +defaults: + # Default model target (can be overridden in YAML) + target: openai-gpt4o + + # Default image detail level + imageDetail: high + + # Timeout for vision model calls (in milliseconds) + timeout: 60000 diff --git a/examples/showcase/vision/.agentv/targets.yaml b/examples/showcase/vision/.agentv/targets.yaml new file mode 100644 index 00000000..9e0faa3c --- /dev/null +++ b/examples/showcase/vision/.agentv/targets.yaml @@ -0,0 +1,39 @@ +# Target Model Configurations for Vision Examples +# Defines available vision-capable models for evaluation + +targets: + # OpenAI GPT-4o (default, recommended for vision tasks) + openai-gpt4o: + provider: openai + model: gpt-4o + apiKey: ${OPENAI_API_KEY} + supportsVision: true + costPer1kImages: + low: 42.50 # $85/1M tokens * 0.5K tokens/image + high: 102.00 # $85/1M tokens * 1.2K tokens/image + auto: 72.25 # Average + + # Anthropic Claude 3.5 Sonnet + anthropic-claude: + provider: anthropic + model: claude-3-5-sonnet-20241022 + apiKey: ${ANTHROPIC_API_KEY} + supportsVision: true + costPer1kImages: + low: 120.00 # $3/1M tokens * 40K base + 0.5K image + high: 216.00 # $3/1M tokens * 40K base + 1.2K image + auto: 168.00 # Average + + # Google Gemini 2.5 Flash + google-gemini: + provider: google + model: gemini-2.0-flash-exp + apiKey: ${GOOGLE_API_KEY} + supportsVision: true + costPer1kImages: + low: 1.88 # $0.075/1M tokens * 25K base + 0.5K image + high: 2.26 # $0.075/1M tokens * 25K base + 1.2K image + auto: 2.07 # Average (most cost-effective) + +# Default target +default: openai-gpt4o diff --git a/examples/showcase/vision/README.md b/examples/showcase/vision/README.md new file mode 100644 index 00000000..8654ceda --- /dev/null +++ b/examples/showcase/vision/README.md @@ -0,0 +1,394 @@ +# Vision Evaluation Examples + +This directory contains example evaluation files for testing AI agents with vision/image capabilities. 
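For orientation, the snippet below is a minimal sketch of the eval-case shape used throughout this directory (the id is illustrative; the image file and judge prompt are the ones shipped with these examples, and the relative paths follow the convention used from the `datasets/` directory):

```yaml
$schema: agentv-eval-v2
description: Minimal vision eval case (sketch)

target: default

evalcases:
  - id: minimal-image-description          # illustrative id
    expected_outcome: Assistant describes the main objects in the image
    input_messages:
      - role: user
        content:
          - type: text
            value: "Describe this image."
          - type: image
            value: ./test-images/sample-office.jpg   # place the file under test-images/ first
            detail: low                              # low keeps token cost down for simple tasks
    execution:
      evaluators:
        - name: content_accuracy
          type: llm_judge
          prompt: ../evaluators/llm-judges/image-description-judge.md
```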
+ +## Overview + +Vision evaluation in AgentV extends the standard eval framework to support: +- Image inputs (local files and URLs) +- Multi-image comparisons +- Vision-specific evaluators (both LLM judges and code validators) +- Structured outputs from vision tasks +- Multi-turn conversations with visual context + +## Quick Start + +### Basic Image Analysis + +```bash +# From examples/showcase/vision/ directory +agentv run datasets/basic-image-analysis.yaml + +# Or from repository root +agentv run examples/showcase/vision/datasets/basic-image-analysis.yaml +``` + +### Advanced Vision Tasks + +```bash +# From examples/showcase/vision/ directory +agentv run datasets/advanced-vision-tasks.yaml + +# Or from repository root +agentv run examples/showcase/vision/datasets/advanced-vision-tasks.yaml +``` + +## Image Input Formats + +### Local File Reference + +```yaml +- type: image + value: ./test-images/sample-office.jpg + detail: high # Options: low, high, auto +``` + +### Image URL + +```yaml +- type: image_url + value: https://example.com/image.jpg +``` + +### Data URI (Base64) + +```yaml +- type: image + value: data:image/jpeg;base64,/9j/4AAQSkZJRg... +``` + +## Evaluation Files + +### datasets/basic-image-analysis.yaml + +Demonstrates fundamental vision capabilities: +- **Simple image description** - Basic captioning +- **Object detection** - Counting and identifying objects +- **Spatial reasoning** - Understanding positions and layouts +- **Text extraction (OCR)** - Reading text from images +- **Image comparison** - Analyzing changes between images +- **Color analysis** - Identifying colors and schemes +- **URL loading** - Loading images from web URLs + +### datasets/advanced-vision-tasks.yaml + +Demonstrates complex vision scenarios: +- **Structured JSON output** - Vision data as JSON +- **Visual reasoning** - Logic applied to visual information (e.g., chess) +- **Multi-turn conversations** - Context maintained across turns +- **Image quality assessment** - Technical and aesthetic evaluation +- **Chart/graph analysis** - Data extraction from visualizations +- **Scene understanding** - Contextual inference +- **Instruction following** - Complex tasks with visual reference + +## Evaluators + +### LLM-Based Judges + +Located in `evaluators/llm-judges/`: + +1. **image-description-judge.md** + - Evaluates description accuracy and completeness + - Dimensions: Visual Accuracy (40%), Completeness (30%), Clarity (20%), Relevance (10%) + - Detects hallucinations and missing elements + +2. **activity-judge.md** + - Evaluates activity and action recognition + - Assesses people counting, pose recognition, interaction understanding + +3. **comparison-judge.md** + - Evaluates multi-image comparison quality + - Tests change detection, spatial precision, completeness + +4. **reasoning-judge.md** + - Evaluates logical reasoning with visual information + - Tests visual understanding, problem-solving, explanation quality + - Supports multiple reasoning types (spatial, logical, quantitative) + +### Code-Based Validators + +Located in `evaluators/code-validators/`: + +1. **count_validator.py** + - Validates object counts in responses + - Extracts numbers and matches against expected counts + - Usage: `uv run count_validator.py` + +2. **ocr_validator.py** + - Validates text extraction accuracy + - Uses text similarity and keyword matching + - Configurable threshold (default: 70%) + +3. 
**json_validator.py** + - Validates structured JSON outputs from vision + - Schema inference from expected output + - Checks field presence and types + +4. **chart_validator.py** + - Validates data extraction from charts/graphs + - Extracts currency values, percentages, quarters + - Tolerance-based numeric validation (default: 15%) + +## Best Practices from Research + +### Context Engineering (from Agent-Skills research) + +1. **Progressive Disclosure** + - Load image metadata first (50 tokens) + - Then descriptions (100 tokens) + - Finally full image (765-1360 tokens) + +2. **Token Budgeting** + - Small image (512x512): ~765 tokens + - Large image (2048x2048): ~1360 tokens + - Budget context at 70-80% utilization + +3. **File System State** + - Store images and analyses as files + - Pass file references in context, not image data + +### Evaluation Patterns (from Google ADK) + +1. **Multi-Sample Evaluation** + - Run evaluators 5 times for reliability + - Use vision-capable judge models (GPT-4V, Claude) + +2. **Rubric-Based Grading** + - Define clear success criteria + - Weight dimensions appropriately + - Support partial credit + +### Input Handling (from Mastra & Azure SDK) + +1. **Flexible Image Sources** + - Local files: `./images/photo.jpg` + - HTTP URLs: `https://...` + - Cloud storage: `gs://...` or `s3://...` + - Data URIs: `data:image/jpeg;base64,...` + +2. **MIME Type Specification** + - Always include for better compatibility + - Common types: `image/jpeg`, `image/png`, `image/webp` + +3. **Detail Level Control** + - `low`: Faster, cheaper, less detail + - `high`: Slower, more expensive, more detail + - `auto`: Let model decide + +## Creating Test Images + +For local testing, place test images in `test-images/` directory. See `test-images/README.md` for detailed guidance on: +- Required test images for each eval case +- Image format and size requirements +- Alternative URL-based approaches +- Sources for obtaining test images + +### Example Test Images Structure + +```bash +examples/showcase/vision/test-images/ +├── README.md (detailed instructions) +├── .gitkeep +├── sample-office.jpg +├── objects-scene.jpg +├── spatial-layout.jpg +├── text-document.jpg +├── comparison-before.jpg +├── comparison-after.jpg +├── colorful-scene.jpg +├── street-scene.jpg +├── chess-puzzle.jpg +├── activity-photo.jpg +├── quality-test.jpg +├── bar-chart.jpg +├── complex-scene.jpg +└── instruction-reference.jpg +``` + +### Image Requirements + +- **Formats**: JPEG, PNG, GIF, BMP, WEBP +- **Size limits**: + - Max: 20 MB, 16,000 x 16,000 pixels + - Min: 50 x 50 pixels +- **Best practices**: + - Use JPEG for photos + - Use PNG for screenshots, diagrams, text + - Optimize file size (aim for <5 MB) + - Ensure clear, well-lit images for OCR tasks + +## Multi-Turn Vision Conversations + +Example pattern for maintaining visual context: + +```yaml +- id: conversation-turn-1 + conversation_id: vision-convo-001 + input_messages: + - role: user + content: + - type: text + value: "What's in this image?" + - type: image + value: ./image.jpg + expected_messages: + - role: assistant + content: "Description of image..." + +- id: conversation-turn-2 + conversation_id: vision-convo-001 + input_messages: + # Include full conversation history + - role: user + content: + - type: text + value: "What's in this image?" + - type: image + value: ./image.jpg + - role: assistant + content: "Description of image..." 
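    # The follow-up turn below is text-only; the image context comes from the replayed history above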
+ - role: user + content: "Tell me more about the left side" + expected_messages: + - role: assistant + content: "Details about left side..." +``` + +## Evaluation Metrics + +### Dimension Weights (Recommended) + +Based on research from Google ADK and LangWatch: + +**Image Description**: +- Visual Accuracy: 40% +- Completeness: 30% +- Clarity: 20% +- Relevance: 10% + +**Activity Recognition**: +- Activity Identification: 35% +- Accuracy: 35% +- Detail Level: 20% +- Inference Quality: 10% + +**Visual Reasoning**: +- Logical Correctness: 40% +- Visual Understanding: 30% +- Problem-Solving Quality: 20% +- Explanation Quality: 10% + +**Image Comparison**: +- Change Detection Accuracy: 40% +- Spatial Precision: 25% +- Completeness: 20% +- Clarity: 15% + +### Scoring Thresholds + +- **0.9-1.0**: Excellent - Production ready +- **0.7-0.89**: Good - Minor improvements needed +- **0.5-0.69**: Acceptable - Significant gaps +- **0.3-0.49**: Poor - Major issues +- **0.0-0.29**: Failed - Not functional + +## Integration with AgentV Core + +### Required Model Capabilities + +Ensure your model supports vision: +- ✅ OpenAI: GPT-4o, GPT-4 Turbo with Vision +- ✅ Anthropic: Claude 3.5 Sonnet, Claude 3 Opus/Haiku +- ✅ Google: Gemini 2.5 Pro/Flash, Gemini 3 Pro +- ✅ Azure: GPT-4o via Azure OpenAI + +### Configuration + +Configure vision-capable models in `.agentv/targets.yaml`: + +```yaml +targets: + gpt4v: + provider: openai + model: gpt-4o + apiKey: ${OPENAI_API_KEY} + + claude-vision: + provider: anthropic + model: claude-3-5-sonnet-20241022 + apiKey: ${ANTHROPIC_API_KEY} + + gemini-vision: + provider: google + model: gemini-2.5-flash + apiKey: ${GOOGLE_API_KEY} +``` + +## Cost Considerations + +Vision API costs are significantly higher than text: + +| Provider | Model | Cost per Image* | Notes | +|----------|-------|----------------|-------| +| OpenAI | GPT-4o | $2.50-$5.00 / 1K images | Detail level affects cost | +| Anthropic | Claude 3.5 | $3.00-$6.00 / 1K images | Resolution-based pricing | +| Google | Gemini 2.5 Flash | $0.04-$0.15 / 1K images | Most cost-effective | + +*Estimates based on average image size and detail level + +### Cost Optimization Tips + +1. Use `detail: low` for simple tasks +2. Resize large images before sending +3. Use Gemini Flash for high-volume testing +4. Cache image descriptions for reuse +5. Use code validators when possible (free) + +## Future Enhancements + +Based on research findings, potential additions: + +1. **Computer Vision Metrics** + - SSIM (structural similarity) + - Perceptual hashing + - CLIP embeddings similarity + +2. **Specialized Evaluators** + - Face detection validation + - Logo recognition accuracy + - Medical image analysis + - Document understanding + +3. **Batch Processing** + - Parallel image evaluation + - Progress tracking + - Cost reporting + +4. **UI Integration** + - Visual diff tools + - Side-by-side comparisons + - Annotation overlays + +## References + +For detailed research findings and framework analysis, see: [Vision Evaluation Research Summary](../../openspec/changes/add-vision-evaluation/references/research-summary.md) + +Research sources consulted: + +1. **Google ADK Python** - Rubric-based evaluation, multimodal content handling +2. **Mastra** - TypeScript patterns, structured outputs, Braintrust integration +3. **Azure SDK** - Image input patterns, Computer Vision API +4. **LangWatch** - Evaluation architecture, batch processing +5. 
**Agent Skills** - Context engineering, progressive disclosure patterns + +## Support + +For issues or questions: +- Check existing eval examples +- Review evaluator documentation +- Consult AgentV core documentation +- Open GitHub issue with reproduction case + +## License + +Same as AgentV project license. diff --git a/examples/showcase/vision/datasets/advanced-vision-tasks.yaml b/examples/showcase/vision/datasets/advanced-vision-tasks.yaml new file mode 100644 index 00000000..5fc93115 --- /dev/null +++ b/examples/showcase/vision/datasets/advanced-vision-tasks.yaml @@ -0,0 +1,352 @@ +# Advanced Vision Evaluation Tasks +# Demonstrates complex multimodal scenarios and vision-language reasoning + +$schema: agentv-eval-v2 +description: Advanced vision tasks including reasoning, structured outputs, and multi-turn conversations + +target: default + +evalcases: + # ========================================== + # Example 1: Structured Output from Vision + # Tests JSON output with visual analysis + # ========================================== + - id: structured-object-detection + + expected_outcome: Assistant returns valid JSON with detected objects, positions, and confidence scores + + input_messages: + - role: system + content: |- + You are an object detection system that returns structured JSON output. + Always return valid JSON matching the requested schema. + + - role: user + content: + - type: text + value: |- + Analyze this image and return a JSON object with the following structure: + ```json + { + "objects": [ + {"name": "object_name", "count": 1, "position": "location", "confidence": 0.95} + ], + "scene": "scene_description", + "dominant_colors": ["color1", "color2"] + } + ``` + - type: image + value: ./test-images/product-shelf.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + ```json + { + "objects": [ + {"name": "bottle", "count": 5, "position": "top shelf", "confidence": 0.98}, + {"name": "can", "count": 8, "position": "middle shelf", "confidence": 0.95}, + {"name": "box", "count": 3, "position": "bottom shelf", "confidence": 0.92} + ], + "scene": "retail product shelf display", + "dominant_colors": ["white", "blue", "red"] + } + ``` + + execution: + evaluators: + - name: json_validation + type: code_judge + script: uv run json_validator.py + cwd: ../evaluators/code-validators + - name: object_accuracy + type: llm_judge + prompt: ../evaluators/llm-judges/structured-output-judge.md + + # ========================================== + # Example 2: Visual Reasoning + # Tests logical reasoning about visual content + # ========================================== + - id: visual-reasoning-problem + + conversation_id: vision-reasoning-001 + + expected_outcome: Assistant uses visual information to solve logical problem + + input_messages: + - role: system + content: You can analyze images and apply logical reasoning to solve problems. + + - role: user + content: + - type: text + value: |- + Look at this chess board position. + 1. Whose turn is it? (white or black) + 2. Is any piece under attack? + 3. What would be a good next move for white? + - type: image + value: ./test-images/chess-position.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Based on the chess position: + 1. It's White's turn (Black pieces just moved) + 2. Yes, Black's knight is attacking White's bishop on e4 + 3. 
Good move: Nf3 (knight to f3) - defends the bishop and controls center + + Alternative: Bd3 (bishop retreat) - saves piece but passive + + execution: + evaluators: + - name: reasoning_quality + type: llm_judge + prompt: ../evaluators/llm-judges/reasoning-judge.md + + # ========================================== + # Example 3: Multi-Turn Vision Conversation + # Tests maintaining context with images across turns + # ========================================== + - id: multi-turn-image-discussion-part1 + + conversation_id: vision-conversation-001 + + expected_outcome: Assistant provides initial analysis of image + + input_messages: + - role: system + content: You help analyze images in detail through multi-turn conversations. + + - role: user + content: + - type: text + value: "What are the main elements in this architecture diagram?" + - type: image + value: ./test-images/system-architecture.jpg + + expected_messages: + - role: assistant + content: |- + Main elements in the architecture: + 1. Client Layer - Web and Mobile apps + 2. API Gateway - Routing and authentication + 3. Service Layer - Microservices (User, Order, Payment) + 4. Database Layer - PostgreSQL and Redis + 5. Message Queue - RabbitMQ for async processing + + - id: multi-turn-image-discussion-part2 + + conversation_id: vision-conversation-001 + + expected_outcome: Assistant answers follow-up question referencing previous image context + + input_messages: + - role: system + content: You help analyze images in detail through multi-turn conversations. + + - role: user + content: + - type: text + value: "What are the main elements in this architecture diagram?" + - type: image + value: ./test-images/system-architecture.jpg + + - role: assistant + content: |- + Main elements in the architecture: + 1. Client Layer - Web and Mobile apps + 2. API Gateway - Routing and authentication + 3. Service Layer - Microservices (User, Order, Payment) + 4. Database Layer - PostgreSQL and Redis + 5. Message Queue - RabbitMQ for async processing + + - role: user + content: "What's the purpose of the message queue in this system?" + + expected_messages: + - role: assistant + content: |- + The RabbitMQ message queue serves several purposes: + 1. Decouples services for async communication + 2. Handles order processing without blocking API responses + 3. Manages payment notifications + 4. Provides retry mechanism for failed operations + 5. Enables horizontal scaling of service workers + + # ========================================== + # Example 4: Image Quality Assessment + # Tests aesthetic and technical quality evaluation + # ========================================== + - id: image-quality-assessment + + expected_outcome: Assistant evaluates technical and aesthetic quality of image + + input_messages: + - role: system + content: |- + You are an expert in image quality assessment. Evaluate images on: + - Technical quality (sharpness, exposure, noise) + - Composition (rule of thirds, balance, framing) + - Color grading + - Overall aesthetic appeal + + - role: user + content: + - type: text + value: "Evaluate the quality of this photograph. Rate it from 1-10 and explain your assessment." 
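          # detail: high keeps sharpness, noise, and highlight detail visible for the technical assessment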
+ - type: image + value: ./test-images/landscape-photo.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Quality Assessment: 8/10 + + Technical Quality: + - Sharpness: Excellent (9/10) - Sharp focus on foreground + - Exposure: Good (8/10) - Well balanced, slight highlight clipping + - Noise: Very good (8/10) - Minimal noise, clean image + + Composition: + - Rule of thirds: Well applied, horizon at lower third + - Balance: Good visual weight distribution + - Leading lines: Path creates depth + + Color & Aesthetic: + - Color grading: Warm golden hour tones + - Mood: Peaceful and inviting + - Overall: Professional quality landscape + + execution: + evaluators: + - name: assessment_completeness + type: llm_judge + prompt: ../evaluators/llm-judges/quality-assessment-judge.md + + # ========================================== + # Example 5: Chart/Graph Analysis + # Tests data visualization interpretation + # ========================================== + - id: chart-data-extraction + + expected_outcome: Assistant accurately extracts data and insights from chart + + input_messages: + - role: system + content: You can read and interpret charts, graphs, and data visualizations accurately. + + - role: user + content: + - type: text + value: |- + Analyze this bar chart: + 1. What are the top 3 values? + 2. What trend do you observe? + 3. What's the average value approximately? + - type: image + value: ./test-images/sales-chart.jpg + + expected_messages: + - role: assistant + content: |- + Bar Chart Analysis: + + 1. Top 3 values: + - Q4 2025: $2.4M + - Q3 2025: $2.1M + - Q2 2025: $1.9M + + 2. Trend: Consistent upward trend throughout 2025 + - 58% growth from Q1 to Q4 + - Steady quarterly increases + + 3. Average value: Approximately $1.95M per quarter + + execution: + evaluators: + - name: data_accuracy + type: code_judge + script: uv run chart_validator.py + cwd: ../evaluators/code-validators + + # ========================================== + # Example 6: Scene Understanding with Context + # Tests contextual understanding and inference + # ========================================== + - id: scene-context-inference + + expected_outcome: Assistant infers context and situation beyond literal visual elements + + input_messages: + - role: system + content: You can understand scenes contextually, inferring situation, emotions, and social dynamics. + + - role: user + content: + - type: text + value: "What's happening in this scene? What can you infer about the situation and mood?" 
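          # No detail level is set here; the default from .agentv/config.yaml (imageDetail) is expected to apply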
+ - type: image + value: ./test-images/office-celebration.jpg + + expected_messages: + - role: assistant + content: |- + Scene Analysis: + + What's happening: + - Office celebration or milestone event + - Team gathered around cake/refreshments + - Informal standing arrangement + + Contextual inferences: + - Likely birthday, project completion, or company anniversary + - Positive team dynamics (relaxed postures, smiles) + - Mid-afternoon timing (sunlight through windows) + - Small to medium team size (8-10 people) + + Mood: Celebratory, casual, team-bonding atmosphere + + # ========================================== + # Example 7: Image + Text Instructions + # Tests following written instructions with visual reference + # ========================================== + - id: instruction-following-with-image + + expected_outcome: Assistant follows complex instructions referencing visual elements + + input_messages: + - role: system + content: You follow instructions precisely, referencing visual elements when specified. + + - role: user + content: + - type: text + value: "Based on this floor plan, calculate the total area and list the rooms from largest to smallest." + - type: image + value: ./test-images/floor-plan.jpg + detail: high + - type: file + value: ./attachments/measurement-guidelines.txt + + expected_messages: + - role: assistant + content: |- + Floor Plan Analysis: + + Total Area: 1,450 sq ft + + Rooms by size (largest to smallest): + 1. Living Room: 350 sq ft + 2. Master Bedroom: 280 sq ft + 3. Kitchen: 220 sq ft + 4. Bedroom 2: 200 sq ft + 5. Bathroom 1: 120 sq ft + 6. Bathroom 2: 90 sq ft + 7. Hallway: 90 sq ft + 8. Storage: 100 sq ft + + Notes: Measurements based on scale provided in plan (1:50) diff --git a/examples/showcase/vision/datasets/basic-image-analysis.yaml b/examples/showcase/vision/datasets/basic-image-analysis.yaml new file mode 100644 index 00000000..e81525fd --- /dev/null +++ b/examples/showcase/vision/datasets/basic-image-analysis.yaml @@ -0,0 +1,240 @@ +# Basic Image Analysis Evaluation +# Demonstrates image input handling and vision-capable model evaluation + +$schema: agentv-eval-v2 +description: Basic image understanding with single image inputs + +target: default + +evalcases: + # ========================================== + # Example 1: Simple Image Description + # Tests basic image captioning capabilities + # ========================================== + - id: simple-image-description + + expected_outcome: Assistant provides accurate description of image content including main objects and scene context + + input_messages: + - role: system + content: You are a helpful AI assistant that can analyze images and provide detailed descriptions. + + - role: user + content: + - type: text + value: "Describe what you see in this image in detail." 
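          # Local file input: place sample-office.jpg under test-images/ first (see test-images/README.md)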
+ - type: image + value: ./test-images/sample-office.jpg + detail: high # Options: low, high, auto + + expected_messages: + - role: assistant + content: |- + The image shows an office workspace with the following elements: + - A desk with a computer monitor + - Office chair + - Keyboard and mouse + - Natural lighting from a window + - Professional/clean environment + + execution: + evaluators: + - name: content_accuracy + type: llm_judge + prompt: ../evaluators/llm-judges/image-description-judge.md + + # ========================================== + # Example 2: Object Detection + # Tests ability to identify specific objects + # ========================================== + - id: object-detection-simple + + expected_outcome: Assistant correctly identifies and counts specific objects in the image + + input_messages: + - role: system + content: You are an AI that can identify and count objects in images accurately. + + - role: user + content: + - type: text + value: "How many people are in this image? Describe what they are doing." + - type: image + value: ./test-images/team-meeting.jpg + + expected_messages: + - role: assistant + content: |- + There are 4 people in the image. They appear to be in a meeting: + - 2 people are seated at a table + - 1 person is standing and presenting + - 1 person is taking notes + + execution: + evaluators: + - name: object_count + type: code_judge + script: uv run count_validator.py + cwd: ../evaluators/code-validators + - name: activity_accuracy + type: llm_judge + prompt: ../evaluators/llm-judges/activity-judge.md + + # ========================================== + # Example 3: Spatial Reasoning + # Tests understanding of spatial relationships + # ========================================== + - id: spatial-relationships + + expected_outcome: Assistant correctly describes spatial relationships and positions of objects + + input_messages: + - role: system + content: You analyze images with focus on spatial relationships between objects. + + - role: user + content: + - type: text + value: "Describe the position of objects in this image. What is on the left, right, center?" + - type: image + value: ./test-images/desk-arrangement.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Spatial layout: + - Left side: Lamp and notebook + - Center: Laptop computer (main focus) + - Right side: Coffee mug and phone + - Background: Wall with framed picture + + # ========================================== + # Example 4: Text Extraction (OCR) + # Tests ability to read text from images + # ========================================== + - id: text-extraction-ocr + + expected_outcome: Assistant accurately extracts and reports text visible in the image + + input_messages: + - role: system + content: You can read and extract text from images accurately. + + - role: user + content: + - type: text + value: "What text do you see in this image? Extract all visible text." 
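          # detail: high keeps small print legible; use a clear, well-lit scan for OCR (see README best practices)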
+ - type: image + value: ./test-images/document-scan.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Extracted text: + - Title: "Project Proposal" + - Subtitle: "Q1 2026 Initiative" + - Section heading: "Executive Summary" + - Body text includes keywords: budget, timeline, deliverables + + execution: + evaluators: + - name: text_accuracy + type: code_judge + script: uv run ocr_validator.py + cwd: ../evaluators/code-validators + + # ========================================== + # Example 5: Image Comparison + # Tests ability to compare multiple images + # ========================================== + - id: multi-image-comparison + + expected_outcome: Assistant identifies similarities and differences between two images + + input_messages: + - role: system + content: You can compare multiple images and identify similarities and differences. + + - role: user + content: + - type: text + value: "Compare these two images. What changed between them?" + - type: image + value: ./test-images/before-office.jpg + - type: image + value: ./test-images/after-office.jpg + + expected_messages: + - role: assistant + content: |- + Changes between images: + - New desk lamp added in second image + - Chair moved to different position + - Additional monitor on desk + - Wall color remains the same + - Overall layout similar + + execution: + evaluators: + - name: change_detection + type: llm_judge + prompt: ../evaluators/llm-judges/comparison-judge.md + + # ========================================== + # Example 6: Color Analysis + # Tests color identification and description + # ========================================== + - id: color-identification + + expected_outcome: Assistant accurately identifies and describes colors in the image + + input_messages: + - role: system + content: You can identify and describe colors accurately in images. + + - role: user + content: + - type: text + value: "What are the dominant colors in this image? Describe the color scheme." + - type: image + value: ./test-images/color-palette.jpg + + expected_messages: + - role: assistant + content: |- + Dominant colors: + - Primary: Deep blue (#2E5090) + - Secondary: Warm orange (#FF8C42) + - Accent: Light gray (#E8E8E8) + Color scheme: Complementary (blue-orange) + Overall mood: Professional and energetic + + # ========================================== + # Example 7: Image with URL (not file path) + # Tests image loading from URLs + # ========================================== + - id: image-from-url + + expected_outcome: Assistant analyzes image loaded from URL successfully + + input_messages: + - role: system + content: You can analyze images from various sources. + + - role: user + content: + - type: text + value: "Describe this sample image." 
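          # image_url fetches the image over HTTP, so this case needs no files under test-images/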
+ - type: image_url + value: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/481px-Cat03.jpg + + expected_messages: + - role: assistant + content: |- + The image shows a cat with the following characteristics: + - Orange/ginger colored fur + - Sitting position + - Looking directly at camera + - Indoor setting diff --git a/examples/showcase/vision/evaluators/code-validators/chart_validator.py b/examples/showcase/vision/evaluators/code-validators/chart_validator.py new file mode 100644 index 00000000..184c6ac6 --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/chart_validator.py @@ -0,0 +1,211 @@ +#!/usr/bin/env python3 +""" +Chart Data Validator +Code-based evaluator for validating data extraction from charts and graphs +""" + +import sys +import json +import re +from typing import Dict, Any, List, Tuple + + +def extract_currency_values(text: str) -> List[float]: + """Extract monetary values from text (e.g., $2.4M, $1,500)""" + # Pattern for currency with K/M/B suffixes + pattern = r'\$?\s*(\d+\.?\d*)\s*([KMB])?' + + values = [] + for match in re.finditer(pattern, text, re.IGNORECASE): + value = float(match.group(1)) + suffix = match.group(2) + + if suffix: + multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000} + value *= multipliers.get(suffix.upper(), 1) + + values.append(value) + + return values + + +def extract_percentages(text: str) -> List[float]: + """Extract percentage values from text""" + pattern = r'(\d+\.?\d*)\s*%' + return [float(match.group(1)) for match in re.finditer(pattern, text)] + + +def extract_quarters(text: str) -> List[str]: + """Extract quarter references (Q1, Q2, etc.)""" + pattern = r'Q[1-4]\s+\d{4}' + return re.findall(pattern, text) + + +def validate_numeric_accuracy( + found_values: List[float], + expected_values: List[float], + tolerance: float = 0.1 +) -> Tuple[int, List[float], List[float]]: + """ + Validate numeric values with tolerance + + Returns: + (matches_count, matched_values, missing_values) + """ + matched = [] + missing = expected_values.copy() + + for expected in expected_values: + for found in found_values: + # Check if within tolerance (percentage) + if abs(found - expected) / expected <= tolerance: + matched.append(expected) + if expected in missing: + missing.remove(expected) + break + + return len(matched), matched, missing + + +def validate_chart_data( + output: str, + expected_output: str, + input_text: str = "", + tolerance: float = 0.15 +) -> Dict[str, Any]: + """ + Validate data extraction from charts/graphs + + Args: + output: AI's chart analysis + expected_output: Expected data points and insights + input_text: Original question + tolerance: Acceptable error margin (default 15%) + + Returns: + Evaluation result + """ + + # Extract values from both outputs + output_currency = extract_currency_values(output) + expected_currency = extract_currency_values(expected_output) + + output_percentages = extract_percentages(output) + expected_percentages = extract_percentages(expected_output) + + output_quarters = extract_quarters(output) + expected_quarters = extract_quarters(expected_output) + + # Validate currency values + currency_matches = 0 + if expected_currency: + currency_matches, matched_curr, missing_curr = validate_numeric_accuracy( + output_currency, expected_currency, tolerance + ) + currency_accuracy = currency_matches / len(expected_currency) + else: + currency_accuracy = 1.0 + matched_curr = [] + missing_curr = [] + + # Validate percentages + percentage_matches = 0 + if 
expected_percentages: + percentage_matches, matched_pct, missing_pct = validate_numeric_accuracy( + output_percentages, expected_percentages, tolerance + ) + percentage_accuracy = percentage_matches / len(expected_percentages) + else: + percentage_accuracy = 1.0 + matched_pct = [] + missing_pct = [] + + # Validate quarter references + if expected_quarters: + quarter_matches = len(set(output_quarters) & set(expected_quarters)) + quarter_accuracy = quarter_matches / len(expected_quarters) + else: + quarter_accuracy = 1.0 + quarter_matches = 0 + + # Calculate overall score (weighted average) + weights = { + 'currency': 0.5, + 'percentage': 0.3, + 'quarters': 0.2 + } + + overall_score = ( + currency_accuracy * weights['currency'] + + percentage_accuracy * weights['percentage'] + + quarter_accuracy * weights['quarters'] + ) + + passed = overall_score >= 0.7 # 70% threshold + + # Build detailed reasoning + reasoning_parts = [] + if expected_currency: + reasoning_parts.append( + f"Currency values: {currency_matches}/{len(expected_currency)} matched" + ) + if expected_percentages: + reasoning_parts.append( + f"Percentages: {percentage_matches}/{len(expected_percentages)} matched" + ) + if expected_quarters: + reasoning_parts.append( + f"Quarters: {quarter_matches}/{len(expected_quarters)} matched" + ) + + return { + "status": "processed", + "score": round(overall_score, 3), + "passed": passed, + "details": { + "currency_validation": { + "accuracy": round(currency_accuracy, 3), + "expected": expected_currency, + "found": output_currency, + "matched": matched_curr, + "missing": missing_curr + }, + "percentage_validation": { + "accuracy": round(percentage_accuracy, 3), + "expected": expected_percentages, + "found": output_percentages, + "matched": matched_pct, + "missing": missing_pct + }, + "quarter_validation": { + "accuracy": round(quarter_accuracy, 3), + "expected": expected_quarters, + "found": output_quarters + }, + "tolerance": tolerance, + "reasoning": "; ".join(reasoning_parts) + } + } + + +def main(): + """Main entry point for CLI usage""" + if len(sys.argv) > 1: + eval_data = json.loads(sys.argv[1]) + else: + eval_data = json.load(sys.stdin) + + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + input_text = eval_data.get("input", "") + tolerance = eval_data.get("tolerance", 0.15) + + result = validate_chart_data(output, expected_output, input_text, tolerance) + + print(json.dumps(result, indent=2)) + + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/code-validators/count_validator.py b/examples/showcase/vision/evaluators/code-validators/count_validator.py new file mode 100644 index 00000000..20257a0b --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/count_validator.py @@ -0,0 +1,106 @@ +#!/usr/bin/env python3 +""" +Object Count Validator +Code-based evaluator for validating object counts in vision responses +""" + +import sys +import json +import re +from typing import Dict, Any, List + + +def extract_numbers_from_text(text: str) -> List[int]: + """Extract all numbers from text""" + return [int(num) for num in re.findall(r'\b\d+\b', text)] + + +def extract_count_for_object(text: str, object_name: str) -> int | None: + """Extract count for a specific object from text""" + # Look for patterns like "5 bottles", "There are 3 people", etc. 
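    # Patterns are tried in order and the first match wins. object_name is
    # interpolated into the regex as-is, so pass a plain word (or re.escape it)
    # if it could contain regex metacharacters.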
+ patterns = [ + rf'(\d+)\s+{object_name}', # "5 bottles" + rf'{object_name}.*?(\d+)', # "bottles: 5" + rf'(\d+).*?{object_name}', # "5 red bottles" + ] + + for pattern in patterns: + match = re.search(pattern, text, re.IGNORECASE) + if match: + return int(match.group(1)) + + return None + + +def validate_object_count( + output: str, + expected_output: str, + input_text: str = "" +) -> Dict[str, Any]: + """ + Validate object counts in AI response + + Returns: + Evaluation result with score, passed status, and details + """ + + # Extract expected count from expected_output or input + expected_numbers = extract_numbers_from_text(expected_output) + output_numbers = extract_numbers_from_text(output) + + # Simple validation: check if any expected numbers are in output + matched_counts = [num for num in expected_numbers if num in output_numbers] + + if not expected_numbers: + return { + "status": "error", + "score": 0.0, + "passed": False, + "details": "Could not extract expected counts from expected output" + } + + # Calculate accuracy + accuracy = len(matched_counts) / len(expected_numbers) + passed = accuracy >= 0.8 # 80% threshold + + return { + "status": "processed", + "score": accuracy, + "passed": passed, + "details": { + "expected_counts": expected_numbers, + "found_counts": output_numbers, + "matched_counts": matched_counts, + "accuracy": accuracy, + "reasoning": f"Matched {len(matched_counts)} out of {len(expected_numbers)} expected counts" + } + } + + +def main(): + """Main entry point for CLI usage""" + # Read evaluation data from stdin or args + if len(sys.argv) > 1: + # Parse JSON from argument + eval_data = json.loads(sys.argv[1]) + else: + # Read from stdin + eval_data = json.load(sys.stdin) + + # Extract fields + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + input_text = eval_data.get("input", "") + + # Run validation + result = validate_object_count(output, expected_output, input_text) + + # Output JSON result + print(json.dumps(result, indent=2)) + + # Return appropriate exit code + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/code-validators/json_validator.py b/examples/showcase/vision/evaluators/code-validators/json_validator.py new file mode 100644 index 00000000..a6651f6f --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/json_validator.py @@ -0,0 +1,202 @@ +#!/usr/bin/env python3 +""" +JSON Structure Validator +Code-based evaluator for validating structured JSON outputs from vision tasks +""" + +import sys +import json +import re +from typing import Dict, Any, List +from jsonschema import validate, ValidationError, Draft7Validator + + +def extract_json_from_text(text: str) -> Dict[str, Any] | None: + """Extract JSON object from text (handles markdown code blocks)""" + # Try to find JSON in markdown code block + json_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL) + if json_match: + try: + return json.loads(json_match.group(1)) + except json.JSONDecodeError: + pass + + # Try to parse entire text as JSON + try: + return json.loads(text) + except json.JSONDecodeError: + pass + + # Try to find first JSON object in text + brace_match = re.search(r'\{.*\}', text, re.DOTALL) + if brace_match: + try: + return json.loads(brace_match.group(0)) + except json.JSONDecodeError: + pass + + return None + + +def infer_schema_from_expected(expected_json: Dict[str, Any]) -> Dict[str, Any]: + """Infer a basic 
JSON schema from expected output structure""" + def get_type(value): + if isinstance(value, bool): + return "boolean" + elif isinstance(value, int): + return "integer" + elif isinstance(value, float): + return "number" + elif isinstance(value, str): + return "string" + elif isinstance(value, list): + return "array" + elif isinstance(value, dict): + return "object" + return "string" + + schema = { + "type": "object", + "properties": {}, + "required": list(expected_json.keys()) + } + + for key, value in expected_json.items(): + value_type = get_type(value) + schema["properties"][key] = {"type": value_type} + + if value_type == "array" and len(value) > 0: + item_type = get_type(value[0]) + schema["properties"][key]["items"] = {"type": item_type} + + # If array contains objects, add properties + if item_type == "object" and isinstance(value[0], dict): + schema["properties"][key]["items"]["properties"] = { + k: {"type": get_type(v)} + for k, v in value[0].items() + } + + return schema + + +def validate_json_structure( + output: str, + expected_output: str, + schema: Dict[str, Any] | None = None +) -> Dict[str, Any]: + """ + Validate that output contains valid JSON matching expected structure + + Args: + output: AI's response (may contain JSON) + expected_output: Expected JSON structure as string + schema: Optional JSON schema for validation + + Returns: + Evaluation result with score, passed status, and details + """ + + # Extract JSON from output + output_json = extract_json_from_text(output) + + if output_json is None: + return { + "status": "processed", + "score": 0.0, + "passed": False, + "details": { + "error": "No valid JSON found in output", + "reasoning": "Could not extract JSON object from response" + } + } + + # Parse expected JSON + try: + expected_json = extract_json_from_text(expected_output) + if expected_json is None: + expected_json = json.loads(expected_output) + except (json.JSONDecodeError, ValueError) as e: + return { + "status": "error", + "score": 0.0, + "passed": False, + "details": { + "error": f"Invalid expected JSON: {str(e)}" + } + } + + # If no schema provided, infer from expected output + if schema is None: + schema = infer_schema_from_expected(expected_json) + + # Validate against schema + validator = Draft7Validator(schema) + errors = list(validator.iter_errors(output_json)) + + if errors: + error_messages = [f"{e.path}: {e.message}" for e in errors[:3]] # First 3 errors + return { + "status": "processed", + "score": 0.5, # Partial credit for valid JSON with wrong structure + "passed": False, + "details": { + "validation_errors": error_messages, + "json_valid": True, + "schema_valid": False, + "reasoning": f"Valid JSON but schema validation failed: {'; '.join(error_messages)}" + } + } + + # Calculate field match score + expected_keys = set(expected_json.keys()) + output_keys = set(output_json.keys()) + + matching_keys = expected_keys & output_keys + extra_keys = output_keys - expected_keys + missing_keys = expected_keys - output_keys + + field_score = len(matching_keys) / len(expected_keys) if expected_keys else 1.0 + + # Penalize extra keys slightly + if extra_keys: + field_score *= 0.95 + + # Full pass requires schema validation + most fields present + passed = len(errors) == 0 and field_score >= 0.8 + + return { + "status": "processed", + "score": round(field_score, 3), + "passed": passed, + "details": { + "json_valid": True, + "schema_valid": len(errors) == 0, + "field_score": round(field_score, 3), + "matching_keys": list(matching_keys), + "missing_keys": 
list(missing_keys), + "extra_keys": list(extra_keys), + "reasoning": f"Schema valid: {len(errors) == 0}, Field coverage: {field_score:.1%}" + } + } + + +def main(): + """Main entry point for CLI usage""" + if len(sys.argv) > 1: + eval_data = json.loads(sys.argv[1]) + else: + eval_data = json.load(sys.stdin) + + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + schema = eval_data.get("schema") + + result = validate_json_structure(output, expected_output, schema) + + print(json.dumps(result, indent=2)) + + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/code-validators/ocr_validator.py b/examples/showcase/vision/evaluators/code-validators/ocr_validator.py new file mode 100644 index 00000000..0604b577 --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/ocr_validator.py @@ -0,0 +1,144 @@ +#!/usr/bin/env python3 +""" +OCR Text Accuracy Validator +Code-based evaluator for validating text extraction (OCR) from images +""" + +import sys +import json +import re +from typing import Dict, Any, List, Set +from difflib import SequenceMatcher + + +def normalize_text(text: str) -> str: + """Normalize text for comparison""" + # Remove extra whitespace, convert to lowercase + return re.sub(r'\s+', ' ', text.lower().strip()) + + +def extract_keywords(text: str) -> Set[str]: + """Extract significant words from text""" + # Remove common words and extract keywords + words = set(text.lower().split()) + # Remove very short words (likely articles, prepositions) + return {w for w in words if len(w) > 2} + + +def calculate_text_similarity(text1: str, text2: str) -> float: + """Calculate similarity ratio between two texts""" + norm1 = normalize_text(text1) + norm2 = normalize_text(text2) + return SequenceMatcher(None, norm1, norm2).ratio() + + +def validate_keyword_presence(output: str, expected_keywords: List[str]) -> Dict[str, Any]: + """Validate that expected keywords are present in output""" + output_lower = output.lower() + found_keywords = [kw for kw in expected_keywords if kw.lower() in output_lower] + + accuracy = len(found_keywords) / len(expected_keywords) if expected_keywords else 0.0 + + return { + "keyword_accuracy": accuracy, + "found_keywords": found_keywords, + "missing_keywords": [kw for kw in expected_keywords if kw not in found_keywords], + "total_expected": len(expected_keywords), + "total_found": len(found_keywords) + } + + +def validate_ocr_accuracy( + output: str, + expected_output: str, + input_text: str = "", + threshold: float = 0.7 +) -> Dict[str, Any]: + """ + Validate OCR text extraction accuracy + + Args: + output: AI's extracted text + expected_output: Expected extracted text or keywords + input_text: Original user question (optional) + threshold: Minimum similarity threshold for passing + + Returns: + Evaluation result with score, passed status, and details + """ + + # Calculate overall text similarity + similarity = calculate_text_similarity(output, expected_output) + + # Extract and validate keywords + expected_keywords_line = re.search( + r'keywords?:\s*([^\n]+)', + expected_output, + re.IGNORECASE + ) + + if expected_keywords_line: + # Parse expected keywords + keywords_text = expected_keywords_line.group(1) + expected_keywords = [ + kw.strip() + for kw in re.split(r'[,;]', keywords_text) + ] + keyword_validation = validate_keyword_presence(output, expected_keywords) + else: + # Use all significant words as keywords + 
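        # extract_keywords keeps only words longer than two characters,
        # so short filler words are ignored in this fallback comparison.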
expected_words = extract_keywords(expected_output) + output_words = extract_keywords(output) + matched_words = expected_words & output_words + + keyword_validation = { + "keyword_accuracy": len(matched_words) / len(expected_words) if expected_words else 0.0, + "found_keywords": list(matched_words), + "missing_keywords": list(expected_words - matched_words), + "total_expected": len(expected_words), + "total_found": len(matched_words) + } + + # Combine metrics + # Weight: 60% overall similarity, 40% keyword accuracy + combined_score = (similarity * 0.6) + (keyword_validation["keyword_accuracy"] * 0.4) + passed = combined_score >= threshold + + return { + "status": "processed", + "score": round(combined_score, 3), + "passed": passed, + "details": { + "text_similarity": round(similarity, 3), + "keyword_validation": keyword_validation, + "threshold": threshold, + "reasoning": f"Text similarity: {similarity:.2%}, Keyword accuracy: {keyword_validation['keyword_accuracy']:.2%}" + } + } + + +def main(): + """Main entry point for CLI usage""" + # Read evaluation data from stdin or args + if len(sys.argv) > 1: + eval_data = json.loads(sys.argv[1]) + else: + eval_data = json.load(sys.stdin) + + # Extract fields + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + input_text = eval_data.get("input", "") + threshold = eval_data.get("threshold", 0.7) + + # Run validation + result = validate_ocr_accuracy(output, expected_output, input_text, threshold) + + # Output JSON result + print(json.dumps(result, indent=2)) + + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/llm-judges/activity-judge.md b/examples/showcase/vision/evaluators/llm-judges/activity-judge.md new file mode 100644 index 00000000..5c23020c --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/activity-judge.md @@ -0,0 +1,73 @@ +# Activity Recognition LLM Judge +# Evaluates accuracy of activity and action description in images + +You are evaluating an AI assistant's ability to identify and describe activities, actions, and behaviors visible in images. + +## Evaluation Criteria + +### 1. Activity Identification (35%) +- Are the activities correctly identified? +- Is the context of actions understood? +- Are interactions between people/objects recognized? + +### 2. Accuracy (35%) +- Are the number of people/objects correct? +- Are poses, positions, and movements accurate? +- Are temporal aspects (if relevant) captured? + +### 3. Detail Level (20%) +- Are actions described with appropriate detail? +- Are relevant gestures or expressions noted? +- Is the level of detail appropriate to the question? + +### 4. Inference Quality (10%) +- Are reasonable inferences made when appropriate? +- Are assumptions clearly distinguished from observations? +- Is context considered appropriately? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Response**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess how well the AI identified and described activities in the image. + +## Output Format + +```json +{ + "score": 0.88, + "passed": true, + "details": { + "activity_identification": 0.9, + "accuracy": 0.85, + "detail_level": 0.9, + "inference_quality": 0.85 + }, + "reasoning": "Correctly identified the meeting activity and participant roles. Count was accurate. 
Good detail about specific actions.", + "errors": { + "count_errors": [], + "misidentified_actions": [], + "missed_actions": ["One person checking phone"] + }, + "strengths": [ + "Accurate participant count", + "Clear description of roles", + "Good spatial awareness" + ] +} +``` + +## Special Considerations + +- **Ambiguous situations**: Give benefit of doubt if multiple interpretations are valid +- **Partial visibility**: Don't penalize for not describing what's not clearly visible +- **Cultural context**: Consider that some activities may have cultural variations +- **Safety**: Flag if response makes inappropriate assumptions about people diff --git a/examples/showcase/vision/evaluators/llm-judges/comparison-judge.md b/examples/showcase/vision/evaluators/llm-judges/comparison-judge.md new file mode 100644 index 00000000..c51b94aa --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/comparison-judge.md @@ -0,0 +1,98 @@ +# Image Comparison LLM Judge +# Evaluates quality of multi-image comparison and change detection + +You are evaluating an AI assistant's ability to compare multiple images and identify changes, similarities, and differences. + +## Evaluation Criteria + +### 1. Change Detection Accuracy (40%) +- Are all significant changes identified? +- Are changes correctly categorized (added, removed, moved, modified)? +- Is the description of changes accurate? + +### 2. Spatial Precision (25%) +- Are locations of changes accurately described? +- Are spatial relationships correctly maintained? +- Is positioning information clear and specific? + +### 3. Completeness (20%) +- Are both similarities AND differences mentioned (when relevant)? +- Are subtle changes noticed? +- Is nothing significant missed? + +### 4. Clarity (15%) +- Is the comparison structure clear and logical? +- Are changes described unambiguously? +- Is the language precise? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Response**: {{expected_output}} + +**Images**: {{image_references}} + +## Evaluation Task + +Assess the quality and accuracy of the image comparison. + +## Output Format + +```json +{ + "score": 0.82, + "passed": true, + "details": { + "change_detection_accuracy": 0.85, + "spatial_precision": 0.8, + "completeness": 0.75, + "clarity": 0.9 + }, + "reasoning": "Identified most major changes accurately. Missed one subtle change (wall color). 
Good spatial descriptions.", + "detected_changes": { + "correct": ["desk lamp added", "chair moved", "monitor added"], + "missed": ["wall calendar removed"], + "false_positives": [] + }, + "spatial_accuracy": "Good - locations correctly described", + "strengths": [ + "Clear comparison structure", + "Accurate major change detection", + "Good detail level" + ], + "improvements": [ + "Notice subtle background changes", + "More precise position descriptions" + ] +} +``` + +## Scoring Guidelines + +### High Scores (0.8+) +- All or nearly all significant changes detected +- Accurate spatial descriptions +- No false positives +- Clear, organized presentation + +### Medium Scores (0.5-0.79) +- Most major changes detected +- Some minor changes missed +- Generally accurate descriptions +- Acceptable clarity + +### Low Scores (<0.5) +- Significant changes missed +- Inaccurate descriptions +- False positives present +- Unclear or disorganized + +## Special Cases + +- **Lighting changes**: Should be noted if significantly different +- **Perspective differences**: Should account for viewing angle changes +- **Temporal information**: If images are before/after, temporal language should be used appropriately +- **Identical images**: Should recognize when images are the same or nearly identical diff --git a/examples/showcase/vision/evaluators/llm-judges/image-description-judge.md b/examples/showcase/vision/evaluators/llm-judges/image-description-judge.md new file mode 100644 index 00000000..827cfeaf --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/image-description-judge.md @@ -0,0 +1,76 @@ +# Vision-Specific LLM Judge Prompt +# Evaluates image description quality and accuracy + +You are evaluating an AI assistant's image description against the actual image content and expected description. + +## Evaluation Criteria + +Evaluate the response on these dimensions: + +### 1. Visual Accuracy (40%) +- Does the description match what's actually in the image? +- Are object identifications correct? +- Are colors, shapes, and spatial relationships accurate? +- Are there any hallucinations (describing things not present)? + +### 2. Completeness (30%) +- Are all significant visual elements mentioned? +- Is important context captured? +- Are key details included (not just high-level description)? + +### 3. Clarity (20%) +- Is the description clear and specific? +- Are spatial relationships well described? +- Is the language precise and unambiguous? + +### 4. Relevance (10%) +- Does the description focus on task-relevant elements? +- Is unnecessary information minimized? +- Does it answer the specific question asked? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Description**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +1. Compare the AI's description with the expected description +2. Identify any inaccuracies, hallucinations, or missing elements +3. Assess clarity and relevance +4. Provide an overall score from 0.0 to 1.0 + +## Output Format + +Return your evaluation as JSON: + +```json +{ + "score": 0.85, + "passed": true, + "details": { + "visual_accuracy": 0.9, + "completeness": 0.8, + "clarity": 0.85, + "relevance": 0.9 + }, + "reasoning": "The description accurately identifies the main objects and spatial layout. Minor issue: didn't mention the background elements. 
Overall strong response.", + "hallucinations": [], + "missing_elements": ["background wall art", "window on left"], + "strengths": ["Accurate object identification", "Clear spatial description"], + "improvements": ["Include background elements", "Mention lighting conditions"] +} +``` + +## Scoring Guidelines + +- **0.9-1.0**: Excellent - Accurate, complete, clear description +- **0.7-0.89**: Good - Mostly accurate with minor gaps or imprecisions +- **0.5-0.69**: Acceptable - Some inaccuracies or missing elements +- **0.3-0.49**: Poor - Significant issues or hallucinations +- **0.0-0.29**: Failed - Mostly incorrect or severely incomplete diff --git a/examples/showcase/vision/evaluators/llm-judges/quality-assessment-judge.md b/examples/showcase/vision/evaluators/llm-judges/quality-assessment-judge.md new file mode 100644 index 00000000..32f90ae4 --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/quality-assessment-judge.md @@ -0,0 +1,221 @@ +# Quality Assessment Judge for Images +# Evaluates completeness and quality of image quality assessments + +You are evaluating an AI assistant's ability to assess image quality across technical, compositional, and aesthetic dimensions. + +## Evaluation Criteria + +### 1. Technical Assessment Completeness (30%) +- Sharpness/focus evaluation present? +- Exposure/lighting assessment included? +- Noise level considered? +- Resolution/clarity mentioned? +- Technical score provided? + +### 2. Compositional Analysis (25%) +- Rule of thirds discussed (if applicable)? +- Balance and framing evaluated? +- Leading lines or depth mentioned? +- Subject placement assessed? +- Compositional principles applied? + +### 3. Aesthetic Evaluation (20%) +- Color grading/palette assessed? +- Mood and tone described? +- Visual appeal considered? +- Style and genre recognized? +- Artistic merit evaluated? + +### 4. Overall Quality Judgment (15%) +- Overall score provided? +- Score justified with reasoning? +- Strengths identified? +- Weaknesses noted? +- Constructive feedback given? + +### 5. Professional Tone (10%) +- Objective and analytical? +- Uses appropriate terminology? +- Balanced perspective? +- Actionable feedback? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Assessment**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess whether the AI provided a comprehensive, professional image quality evaluation. + +## Output Format + +```json +{ + "score": 0.85, + "passed": true, + "details": { + "technical_completeness": 0.9, + "compositional_analysis": 0.85, + "aesthetic_evaluation": 0.8, + "overall_judgment": 0.85, + "professional_tone": 0.9 + }, + "reasoning": "Comprehensive assessment covering all major dimensions. Good use of technical terminology. 
Overall score well justified.", + "covered_aspects": { + "technical": ["sharpness", "exposure", "noise"], + "compositional": ["rule of thirds", "balance"], + "aesthetic": ["color grading", "mood"], + "scoring": ["overall score", "justification"] + }, + "missing_aspects": [ + "Leading lines not mentioned", + "Could discuss depth of field" + ], + "terminology_quality": "Professional photography terms used appropriately", + "strengths": [ + "Detailed technical analysis", + "Well-structured evaluation", + "Clear rating scale", + "Actionable feedback" + ], + "improvements": [ + "Could add more compositional detail", + "Discuss target use case" + ] +} +``` + +## Assessment Components to Check + +### Technical Quality Elements +- **Sharpness**: Focus quality, blur, motion +- **Exposure**: Brightness, highlights, shadows, dynamic range +- **Noise**: Grain, artifacts, clarity +- **Color accuracy**: White balance, color cast +- **Resolution**: Detail level, pixel quality + +### Compositional Elements +- **Rule of thirds**: Key elements placement +- **Balance**: Visual weight distribution +- **Framing**: Subject positioning, borders +- **Leading lines**: Paths, guides, depth +- **Symmetry/asymmetry**: Intentional choices +- **Negative space**: Use of empty areas + +### Aesthetic Elements +- **Color palette**: Harmony, contrast, mood +- **Tone**: Warm/cool, high/low key +- **Style**: Documentary, artistic, commercial +- **Mood**: Emotion conveyed +- **Visual appeal**: Overall attractiveness + +### Quality Rating +- **Numerical score**: 1-10 or percentage +- **Justification**: Reasoning for rating +- **Comparison**: To standards or expectations +- **Context**: Purpose and use case + +## Scoring Guidelines + +**0.9-1.0: Excellent** +- All major dimensions covered +- Professional terminology +- Balanced, detailed assessment +- Clear rating with justification + +**0.7-0.89: Good** +- Most dimensions covered +- Appropriate language +- Generally complete +- Rating provided + +**0.5-0.69: Acceptable** +- Some dimensions missing +- Basic assessment +- Limited detail +- Vague or missing rating + +**0.3-0.49: Poor** +- Major gaps in assessment +- Superficial analysis +- Unprofessional or unclear +- No clear rating + +**0.0-0.29: Failed** +- Minimal or no real assessment +- Inaccurate observations +- Unprofessional + +## Professional Photography Terminology + +**Expected terms** (bonus for using appropriately): +- Sharpness, focus, depth of field +- Exposure, dynamic range, highlights/shadows +- Noise, grain, ISO artifacts +- Rule of thirds, leading lines, golden ratio +- Balance, symmetry, visual weight +- Color grading, palette, saturation +- Bokeh, vignetting, chromatic aberration +- High-key, low-key, mood, tone + +## Special Considerations + +- **Subjectivity**: Aesthetic judgments are subjective; accept varied opinions if justified +- **Context matters**: Assessment should consider apparent purpose (commercial, artistic, documentary) +- **Constructive feedback**: Good assessments identify both strengths and improvement areas +- **Calibration**: Scores should match the reasoning (don't penalize if scale differs but internal consistency maintained) + +## Example Excellent Assessment + +``` +Quality Assessment: 8/10 + +Technical Quality: +- Sharpness: Excellent (9/10) - Tack sharp on subject, pleasant bokeh in background +- Exposure: Very good (8/10) - Well balanced overall, slight highlight clipping on left edge +- Noise: Good (7/10) - Minimal noise in shadows, clean at base ISO +- Color: Excellent (9/10) - 
Accurate white balance, vibrant but not oversaturated + +Composition: +- Rule of thirds: Well applied, subject at upper right intersection +- Balance: Excellent - Visual weight properly distributed +- Leading lines: Strong - Path creates natural eye flow toward subject +- Depth: Good use of foreground/background separation + +Color & Aesthetic: +- Palette: Warm golden hour tones create inviting mood +- Grading: Professional look with subtle lift in shadows +- Mood: Peaceful, contemplative +- Style: Fine art landscape + +Strengths: +- Professional technical execution +- Strong compositional choices +- Cohesive aesthetic vision + +Areas for improvement: +- Slight highlight clipping could be recovered +- Could crop tighter for more impact +- Consider including more foreground interest + +Overall: High-quality work suitable for portfolio or publication. +``` + +## Example Poor Assessment + +``` +The image looks good. Nice colors and everything is clear. I'd give it a 7/10 because it's pretty nice but not perfect. The photo is well taken. +``` + +**Issues with poor example:** +- Too vague, no specific technical analysis +- No compositional discussion +- No aesthetic evaluation beyond "nice colors" +- Rating not justified +- Unprofessional language diff --git a/examples/showcase/vision/evaluators/llm-judges/reasoning-judge.md b/examples/showcase/vision/evaluators/llm-judges/reasoning-judge.md new file mode 100644 index 00000000..b1d2f6bc --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/reasoning-judge.md @@ -0,0 +1,135 @@ +# Visual Reasoning LLM Judge +# Evaluates logical reasoning applied to visual information + +You are evaluating an AI assistant's ability to apply logical reasoning to visual information, such as solving puzzles, analyzing diagrams, or making inferences from visual data. + +## Evaluation Criteria + +### 1. Logical Correctness (40%) +- Is the reasoning logically sound? +- Are conclusions properly supported by visual evidence? +- Are logical steps clearly connected? + +### 2. Visual Understanding (30%) +- Does the response demonstrate accurate visual perception? +- Are visual elements correctly interpreted? +- Is spatial/structural understanding correct? + +### 3. Problem-Solving Quality (20%) +- Is the problem correctly understood? +- Is the solution approach appropriate? +- Are alternative solutions considered (when relevant)? + +### 4. Explanation Quality (10%) +- Is the reasoning process clearly explained? +- Are assumptions stated explicitly? +- Is the explanation easy to follow? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Response**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess the quality of reasoning applied to the visual problem. + +## Output Format + +```json +{ + "score": 0.88, + "passed": true, + "details": { + "logical_correctness": 0.9, + "visual_understanding": 0.85, + "problem_solving_quality": 0.9, + "explanation_quality": 0.85 + }, + "reasoning": "Strong logical analysis with correct visual interpretation. Solution is sound and well-explained. 
Could have considered one alternative approach.", + "correctness": { + "visual_perception": "Accurate", + "logical_chain": "Valid", + "conclusion": "Correct", + "assumptions": "Reasonable and stated" + }, + "strengths": [ + "Clear step-by-step reasoning", + "Accurate visual analysis", + "Correct conclusion", + "Good explanation" + ], + "weaknesses": [ + "Didn't mention alternative solution", + "Could be more explicit about one assumption" + ], + "alternative_solutions": [ + "Could have suggested Bd3 as alternative to Nf3" + ] +} +``` + +## Reasoning Task Types + +### Spatial Reasoning +- Puzzles, mazes, pathfinding +- Evaluate: Path correctness, spatial understanding, optimization + +### Logical Inference +- Chess, game states, strategy +- Evaluate: Rule understanding, tactical analysis, strategic thinking + +### Pattern Recognition +- Sequences, analogies, relationships +- Evaluate: Pattern identification, extrapolation, justification + +### Quantitative Analysis +- Charts, graphs, measurements +- Evaluate: Data extraction accuracy, calculation correctness, insight quality + +### Diagram Understanding +- Architecture, flowcharts, schematics +- Evaluate: Component identification, relationship understanding, system comprehension + +## Scoring Guidelines + +**0.9-1.0: Excellent** +- Flawless reasoning +- Complete visual understanding +- Optimal or near-optimal solution +- Clear, thorough explanation + +**0.7-0.89: Good** +- Sound reasoning with minor gaps +- Accurate visual interpretation +- Correct solution (may not be optimal) +- Adequate explanation + +**0.5-0.69: Acceptable** +- Some logical issues +- Mostly correct visual understanding +- Solution has issues but shows understanding +- Explanation could be clearer + +**0.3-0.49: Poor** +- Significant logical errors +- Misinterpretation of visual elements +- Incorrect solution +- Unclear reasoning + +**0.0-0.29: Failed** +- Fundamentally flawed reasoning +- Serious misunderstanding of visual information +- Completely incorrect solution + +## Special Considerations + +- **Multiple valid solutions**: Accept any logically sound approach +- **Partial solutions**: Give partial credit for correct reasoning even if conclusion is off +- **Computational errors**: Distinguish between logical errors and arithmetic mistakes +- **Ambiguous images**: Be lenient if image quality affects interpretation diff --git a/examples/showcase/vision/evaluators/llm-judges/structured-output-judge.md b/examples/showcase/vision/evaluators/llm-judges/structured-output-judge.md new file mode 100644 index 00000000..3c7cfff3 --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/structured-output-judge.md @@ -0,0 +1,177 @@ +# Structured Output Judge for Vision Tasks +# Evaluates quality of structured JSON outputs from vision analysis + +You are evaluating an AI assistant's ability to return structured, well-formatted JSON from vision analysis tasks. + +## Evaluation Criteria + +### 1. JSON Validity (30%) +- Is the output valid, parseable JSON? +- Are there any syntax errors? +- Is the structure consistent? + +### 2. Schema Compliance (35%) +- Does it match the requested structure? +- Are all required fields present? +- Are field types correct? +- Are array structures appropriate? + +### 3. Data Accuracy (25%) +- Are the values extracted from the image accurate? +- Are counts, positions, and attributes correct? +- Are confidence scores reasonable? + +### 4. Completeness (10%) +- Are all relevant visual elements captured? +- Is the level of detail appropriate? 
+- Are optional but useful fields included? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Structure**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess the quality of the structured JSON output from vision analysis. + +## Output Format + +```json +{ + "score": 0.88, + "passed": true, + "details": { + "json_validity": 1.0, + "schema_compliance": 0.9, + "data_accuracy": 0.85, + "completeness": 0.8 + }, + "reasoning": "Valid JSON with correct schema. Object detection mostly accurate. Some optional details missing.", + "issues": { + "parsing_errors": [], + "schema_violations": ["Missing 'confidence' field in one object"], + "accuracy_issues": ["Count slightly off for 'can' objects"], + "missing_data": ["Object colors not included"] + }, + "extracted_data": { + "objects_detected": 16, + "confidence_range": [0.85, 0.98], + "categories_present": ["bottle", "can", "box"] + }, + "strengths": [ + "Perfect JSON syntax", + "Correct array structure", + "Accurate position descriptions", + "Reasonable confidence scores" + ], + "improvements": [ + "Include confidence for all objects", + "Add color information", + "Consider bounding boxes" + ] +} +``` + +## JSON Validation Checks + +### Required Structure Elements +- All specified fields present +- Correct data types (string, number, boolean, array, object) +- Proper nesting for hierarchical data +- Consistent array item structure + +### Common Issues to Check +- **Missing fields**: Required properties not included +- **Type mismatches**: String instead of number, etc. +- **Empty arrays**: When data should be present +- **Inconsistent structures**: Different objects in same array with different schemas +- **Invalid values**: Negative confidence scores, impossible counts + +### Visual Data Accuracy +- Object counts match image +- Positions/locations accurate +- Attributes (color, size) correct +- Relationships properly described +- Confidence scores calibrated + +## Scoring Guidelines + +**0.9-1.0: Excellent** +- Perfect JSON syntax +- Full schema compliance +- Accurate visual data +- Complete information + +**0.7-0.89: Good** +- Valid JSON +- Minor schema issues +- Mostly accurate data +- Key information present + +**0.5-0.69: Acceptable** +- Parseable JSON +- Some schema violations +- Several accuracy issues +- Important data missing + +**0.3-0.49: Poor** +- JSON issues or major schema violations +- Significant inaccuracies +- Incomplete data + +**0.0-0.29: Failed** +- Invalid JSON or completely wrong structure +- Grossly inaccurate data + +## Special Considerations + +- **Flexibility**: Accept reasonable variations in structure if data is complete +- **Confidence scores**: Should be between 0.0 and 1.0 (or 0-100 for percentages) +- **Positions**: Various formats acceptable (coordinates, descriptions, regions) +- **Arrays**: Empty arrays acceptable if no objects of that type present +- **Additional fields**: Extra fields are fine, don't penalize +- **Formatting**: Whitespace and formatting don't matter, focus on structure and data + +## Example Good Response + +```json +{ + "objects": [ + { + "name": "laptop", + "count": 1, + "position": "center desk", + "confidence": 0.98, + "color": "silver", + "attributes": ["open", "powered on"] + }, + { + "name": "coffee mug", + "count": 2, + "position": "desk right side", + "confidence": 0.95, + "color": "white" + } + ], + "scene": "office workspace", + "dominant_colors": ["white", "gray", "brown"], + "lighting": 
"natural, well-lit" +} +``` + +## Example Poor Response + +```json +{ + "objects": "laptop and coffee mugs", // Should be array + "scene": "office workspace", + // Missing dominant_colors field + "extra_field": null +} +``` diff --git a/examples/showcase/vision/test-images/.gitkeep b/examples/showcase/vision/test-images/.gitkeep new file mode 100644 index 00000000..14cc2b2a --- /dev/null +++ b/examples/showcase/vision/test-images/.gitkeep @@ -0,0 +1,2 @@ +# Placeholder file to ensure test-images directory is tracked by git +# Users should add their own test images here (see README.md) diff --git a/examples/showcase/vision/test-images/README.md b/examples/showcase/vision/test-images/README.md new file mode 100644 index 00000000..7fbd7ae1 --- /dev/null +++ b/examples/showcase/vision/test-images/README.md @@ -0,0 +1,67 @@ +# Vision Examples Test Images + +This directory is for placing test images used by the vision evaluation examples. + +## Required Images + +To run the vision evaluation examples, you'll need to provide the following test images: + +### Basic Image Analysis (`basic-image-analysis.yaml`) +1. **sample-office.jpg** - Office workspace scene with desk, computer, chair +2. **objects-scene.jpg** - Scene with multiple countable objects (e.g., fruits, toys) +3. **spatial-layout.jpg** - Image with clear spatial relationships between objects +4. **text-document.jpg** - Image containing readable text (receipt, sign, document) +5. **comparison-before.jpg** - "Before" image for comparison task +6. **comparison-after.jpg** - "After" image showing changes from before +7. **colorful-scene.jpg** - Image with distinct, identifiable colors + +### Advanced Vision Tasks (`advanced-vision-tasks.yaml`) +1. **street-scene.jpg** - Complex outdoor scene for structured detection +2. **chess-puzzle.jpg** - Chess board position for visual reasoning +3. **activity-photo.jpg** - People performing activities +4. **quality-test.jpg** - Image for quality assessment (any photo) +5. **bar-chart.jpg** - Bar chart or graph for data extraction +6. **complex-scene.jpg** - Rich scene for context inference +7. 
**instruction-reference.jpg** - Image referenced in instruction-following task + +## Image Requirements + +- **Formats:** JPEG, PNG, WEBP, GIF (non-animated), BMP +- **Size:** 50x50 to 16,000x16,000 pixels +- **File Size:** Maximum 20MB per image +- **Naming:** Use descriptive filenames matching the eval case expectations + +## Alternative: Using URLs + +Instead of local files, you can use publicly accessible image URLs: +- Update the YAML files to reference URLs instead of local paths +- Example: `value: https://example.com/images/sample-office.jpg` +- Ensure URLs are stable and accessible + +## Test Image Sources + +You can create or obtain test images from: +- **Your own photos** - Best for realistic testing +- **Free stock photo sites** - Unsplash, Pexels, Pixabay (check licenses) +- **Generated images** - AI image generators for specific scenarios +- **Public domain** - Wikimedia Commons, NASA image library + +## Privacy & Copyright + +⚠️ **Important:** +- Do not commit copyrighted images to git repositories +- Ensure you have rights to use any test images +- This directory contains `.gitkeep` only - images are user-provided +- Add test images to `.gitignore` if sharing repositories + +## Usage + +Place your test images in this directory, then run evaluations from the parent directory: + +```bash +# Run basic vision evals +agentv run datasets/basic-image-analysis.yaml + +# Run advanced vision evals +agentv run datasets/advanced-vision-tasks.yaml +``` diff --git a/openspec/changes/add-vision-evaluation/proposal.md b/openspec/changes/add-vision-evaluation/proposal.md new file mode 100644 index 00000000..f62ec4c8 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/proposal.md @@ -0,0 +1,375 @@ +# Proposal: Add Vision Evaluation Capabilities + +## Change ID +`add-vision-evaluation` + +## Status +🟡 **Proposed** - Awaiting approval + +## Summary +Add comprehensive image/vision evaluation capabilities to AgentV, enabling testing of AI agents with multimodal (text + image) inputs. This includes support for image inputs, vision-specific evaluators, and self-contained vision evaluation examples. + +## Motivation + +### Problem +AgentV currently only supports text-based evaluation. 
Modern AI agents increasingly work with vision-capable models (GPT-4V, Claude 3.5 Sonnet, Gemini Vision) that can analyze images, but there's no way to: +- Include images in evaluation test cases +- Evaluate the accuracy of visual analysis +- Test multimodal agent behaviors +- Compare vision performance across models + +### Impact +Without vision evaluation support: +- Cannot test image description, object detection, OCR capabilities +- No way to validate spatial reasoning or visual understanding +- Missing coverage for multimodal agent workflows +- Cannot evaluate vision-specific failure modes (hallucinations, misidentification) + +### Value Proposition +Adding vision evaluation enables: +- **Comprehensive testing**: Full coverage of multimodal agent capabilities +- **Quality assurance**: Validate visual analysis accuracy with specialized evaluators +- **Model comparison**: Compare vision performance across providers +- **Cost optimization**: Measure token costs for image processing +- **Real-world scenarios**: Test agents on tasks requiring visual understanding + +## Research Foundation + +This proposal is based on analysis of 4 leading AI agent and evaluation frameworks: +- **Google ADK-Python**: Rubric-based evaluation, multimodal content model +- **Mastra**: TypeScript patterns, structured outputs, Braintrust integration +- **Azure SDK**: Image input APIs, Computer Vision patterns, testing infrastructure +- **LangWatch**: Evaluation architecture, batch processing, flexible scoring + +Detailed research findings are documented in `references/research-summary.md`. + +## Scope + +### In Scope +1. **Image Input Support** (YAML schema extension) + - Local file paths (`./images/photo.jpg`) + - HTTP/HTTPS URLs (`https://example.com/image.jpg`) + - Base64 data URIs (`data:image/jpeg;base64,...`) + - Detail level specification (`low`, `high`, `auto`) + +2. **Vision Evaluators** + - 6 LLM-based judges (description, activity, comparison, reasoning, quality, structured output) + - 4 code-based validators (count, OCR, JSON structure, chart data) + +3. **Self-Contained Examples** + - Move vision evaluation to `examples/vision/` (self-contained folder) + - 14 example eval cases (7 basic, 7 advanced) + - Sample test images and documentation + +4. **Documentation** + - Comprehensive README + - Quick reference index + - Research summary + +### Out of Scope (Future Work) +- Computer vision metrics (SSIM, CLIP embeddings, perceptual hashing) +- Automatic image preprocessing/resizing +- Image generation evaluation +- Video input support +- Cloud storage integration (gs://, s3://) +- Progressive disclosure implementation +- Token budgeting automation +- Cost tracking per evaluation + +## Design Decisions + +### 1. YAML Schema Extension +**Decision**: Extend existing `content` array format to support image content types. + +**Rationale**: +- Consistent with existing multi-part message structure +- Follows patterns from Mastra and Azure SDK +- Allows mixing text and images naturally +- Supports multiple images per message + +**Example**: +```yaml +input_messages: + - role: user + content: + - type: text + value: "Describe this image" + - type: image + value: ./test-images/photo.jpg + detail: high +``` + +**Alternatives Considered**: +- ❌ Separate `images` field: Breaks natural message flow +- ❌ String-only with special syntax: Not extensible +- ✅ Content array with type discrimination: Flexible, extensible + +### 2. 
Evaluator Organization +**Decision**: Create `evaluators/vision/` with both LLM judges (`.md`) and code validators (`.py`). + +**Rationale**: +- LLM judges for subjective assessment (quality, completeness) +- Code validators for objective metrics (counts, structure) +- Separation of concerns +- Easy to add new evaluators + +**Categories**: +- **LLM Judges**: Description, Activity, Comparison, Reasoning, Quality Assessment, Structured Output +- **Code Validators**: Count, OCR, JSON Structure, Chart Data + +### 3. Self-Contained Structure +**Decision**: Move from `examples/features/evals/vision/` to `examples/showcase/vision/` with all assets included. + +**Rationale**: +- Follows showcase pattern for feature demonstrations +- Single folder contains: datasets, evaluators, test images, docs +- Easier to discover and understand +- Can be copied/shared as complete package + +**Structure**: +``` +examples/showcase/vision/ +├── .agentv/ +│ ├── config.yaml +│ └── targets.yaml +├── datasets/ +│ ├── basic-image-analysis.yaml +│ └── advanced-vision-tasks.yaml +├── evaluators/ +│ ├── llm-judges/ +│ │ └── *.md (6 judges) +│ └── code-validators/ +│ └── *.py (4 validators) +├── test-images/ +│ └── (sample images) +└── README.md +``` + +### 4. Detail Level Support +**Decision**: Support `detail` parameter for cost/quality trade-offs. + +**Rationale**: +- Mirrors OpenAI, Anthropic, Google APIs +- Enables cost optimization (`low` saves ~90% tokens) +- Performance tuning (high detail for complex analysis) + +**Values**: +- `low`: ~85 tokens, faster, cheaper +- `high`: ~765-1360 tokens, detailed analysis +- `auto`: Model decides based on task + +### 5. Multi-Sample Evaluation +**Decision**: Document pattern but don't automate yet. + +**Rationale**: +- Research shows 3-5 samples improves reliability +- Implementation deferred to future work +- Can be done manually for now + +## Dependencies + +### Technical Dependencies +- Existing YAML schema parser +- Evaluation execution engine +- LLM provider integrations (OpenAI, Anthropic, Google) +- `uv` for running Python validators + +### Spec Dependencies +- `yaml-schema`: Requires extension for image content types +- `evaluation`: May need updates for multimodal scoring +- `eval-execution`: Needs image loading/passing to providers + +### Example Dependencies +- Vision-capable models configured in targets +- Test images provided by users (not included in repo) + +## Risks & Mitigations + +### Risk 1: Token Cost +**Description**: Images consume 765-1360 tokens each, making evals expensive. + +**Mitigation**: +- Document cost implications clearly +- Support `detail: low` for testing (90% savings) +- Recommend Gemini Flash for development (20-30x cheaper) +- Use code validators when possible (free) + +**Severity**: Medium +**Likelihood**: High + +### Risk 2: Provider Compatibility +**Description**: Different providers have varying image input formats and capabilities. + +**Mitigation**: +- Test with all major providers (OpenAI, Anthropic, Google) +- Document provider-specific limitations +- Use common denominator approach +- Clear error messages for unsupported features + +**Severity**: Medium +**Likelihood**: Medium + +### Risk 3: Image Availability +**Description**: Local file paths and URLs may not be accessible. 
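+
+For illustration, a pre-flight check along the lines sketched below could surface unusable image sources before any model call; the helper name, limits, and messages are illustrative assumptions, not part of AgentV's current API:
+
+```python
+from pathlib import Path
+from urllib.parse import urlparse
+
+
+def check_image_source(value: str) -> str | None:
+    """Return an error message if an image source looks unusable, else None."""
+    if value.startswith("data:image/"):
+        # Base64 data URI: sanity-check the prefix only; decoding happens later.
+        return None if ";base64," in value else "data URI missing ';base64,' marker"
+
+    if urlparse(value).scheme in ("http", "https"):
+        # Remote URL: reachability can only be confirmed at fetch time.
+        return None
+
+    path = Path(value)
+    if not path.is_file():
+        return f"local image not found: {value}"
+    if path.stat().st_size > 20 * 1024 * 1024:  # 20 MB limit from the image requirements
+        return f"image exceeds 20 MB: {value}"
+    return None
+```
+
+Running such a check while the eval YAML is parsed keeps failures cheap and the resulting error messages actionable.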
+ +**Mitigation**: +- Validate file existence before execution +- Support multiple input methods (file, URL, base64) +- Clear error messages for missing images +- Document image requirements (size, format) + +**Severity**: Low +**Likelihood**: Medium + +### Risk 4: Hallucinations +**Description**: LLM judges may hallucinate when evaluating vision tasks. + +**Mitigation**: +- Use vision-capable judge models +- Multi-sample evaluation (3-5 runs) +- Combine with code validators +- Document judge limitations + +**Severity**: Medium +**Likelihood**: Medium + +## Implementation Notes + +### Phase 1: Schema & Input (Week 1) +- Extend YAML schema for image content types +- Implement image loaders (file, URL, base64) +- Add MIME type detection +- Provider integration for vision APIs + +### Phase 2: Evaluators (Week 2) +- Port LLM judge prompts +- Implement Python validator runner +- Test with real vision models +- Validate scoring accuracy + +### Phase 3: Examples & Docs (Week 3) +- Reorganize into `examples/vision/` +- Create self-contained structure +- Add comprehensive documentation +- Create quick-start guide + +### Phase 4: Validation (Week 4) +- End-to-end testing with multiple providers +- Cost analysis and optimization +- Performance benchmarking +- Documentation review + +## Success Criteria + +### Functional Requirements +- ✅ Support local files, URLs, and base64 image inputs +- ✅ Pass images to vision-capable LLM providers +- ✅ Run LLM judges with image context +- ✅ Execute code validators with Python +- ✅ Parse vision eval YAML files successfully +- ✅ Generate evaluation scores for vision tasks + +### Quality Requirements +- ✅ Evaluation accuracy >90% vs human judgment +- ✅ Object count accuracy >95% (code validators) +- ✅ OCR validation >80% accuracy +- ✅ Hallucination detection >85% accuracy +- ✅ Multi-sample consistency >90% + +### Performance Requirements +- ✅ Average eval latency <2s (excluding LLM calls) +- ✅ Support images up to 16MP / 20MB +- ✅ Handle 3+ image formats (JPEG, PNG, WEBP) + +### Documentation Requirements +- ✅ README with examples and usage guide +- ✅ Quick reference index +- ✅ Research summary document +- ✅ Provider compatibility matrix +- ✅ Cost optimization guide + +## Alternatives Considered + +### Alternative 1: External Vision API +**Description**: Use external Computer Vision APIs (Azure, Google Cloud Vision) instead of LLM vision. + +**Pros**: +- Potentially more accurate +- Specialized features (object detection, OCR) +- Lower cost per image + +**Cons**: +- Additional dependencies +- Inconsistent with agent evaluation (we test LLMs) +- More complex integration +- Not testing actual agent capabilities + +**Verdict**: ❌ Rejected - Want to test the actual LLMs agents use + +### Alternative 2: Generate Test Images +**Description**: Auto-generate test images using DALL-E/Stable Diffusion. + +**Pros**: +- No need for sample images +- Consistent test data +- Easy to create variations + +**Cons**: +- Expensive +- Generated images may not match real-world scenarios +- Additional complexity +- Slower test execution + +**Verdict**: ❌ Rejected - Out of scope, defer to future + +### Alternative 3: Video Support +**Description**: Support video inputs in addition to images. 
+ +**Pros**: +- More comprehensive multimodal coverage +- Test temporal understanding + +**Cons**: +- Significantly more complex +- Very high token costs +- Limited provider support +- Niche use case + +**Verdict**: ❌ Rejected - Out of scope, future consideration + +## Open Questions + +None - all design decisions have been made based on comprehensive research. + +## References + +### Research Documents +- `docs/updates/VISION_EVAL_RESEARCH_SUMMARY.md` - Detailed findings from 5 frameworks +- `examples/vision/README.md` - Comprehensive usage guide +- `examples/vision/INDEX.md` - Quick reference + +### External Resources +- Google ADK-Python: https://github.com/google/adk-python +- Mastra: https://github.com/mastra-ai/mastra +- Azure SDK: https://github.com/Azure/azure-sdk-for-python +- LangWatch: https://github.com/langwatch/langwatch +- Agent Skills: https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering + +### Related Specs +- `yaml-schema` - Requires extension for image content +- `evaluation` - May need multimodal scoring support +- `eval-execution` - Needs image loading capability + +## Approval + +**Proposed by**: AI Assistant +**Date**: January 2, 2026 +**Approval required from**: Project maintainers + +--- + +**Next Steps After Approval**: +1. Review and approve this proposal +2. Review `tasks.md` for implementation sequence +3. Review spec deltas in `specs/*/spec.md` +4. Begin implementation following task order diff --git a/openspec/changes/add-vision-evaluation/references/adk-python-research.md b/openspec/changes/add-vision-evaluation/references/adk-python-research.md new file mode 100644 index 00000000..10b2f03a --- /dev/null +++ b/openspec/changes/add-vision-evaluation/references/adk-python-research.md @@ -0,0 +1,644 @@ +# ADK-Python Image Evaluation Research Report + +Research Date: January 2, 2026 +Repository: google/adk-python (https://github.com/google/adk-python) + +## Executive Summary + +Google's ADK (Agent Development Kit) Python framework provides a comprehensive evaluation system for AI agents. While the framework doesn't have specific image-only evaluation examples, it demonstrates **multimodal content handling** through its agents and provides a robust evaluation infrastructure that can be adapted for vision tasks. + +## Key Findings + +### 1. 
Multimodal Content Handling + +#### Image Input Patterns + +The ADK framework supports multiple methods for handling non-text content: + +**a) Inline Image Data (Base64)** +```python +from google.genai import types +import base64 + +# Sample image data as base64 +SAMPLE_IMAGE_DATA = base64.b64decode( + "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==" +) + +# Create inline data part +types.Part( + inline_data=types.Blob( + data=SAMPLE_IMAGE_DATA, + mime_type="image/png", + display_name="sample_chart.png", + ) +) +``` + +**b) File URI References** +```python +# GCS URI (Vertex AI) +types.Part.from_uri(file_uri="gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf") + +# HTTPS URL +types.Part( + file_data=types.FileData( + file_uri="https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf", + mime_type="application/pdf", + display_name="Research Paper", + ) +) + +# Files API Upload (Gemini Developer API) +client = genai.Client() +uploaded_file = client.files.upload(file=temp_file_path) +types.Part( + file_data=types.FileData( + file_uri=uploaded_file.uri, + mime_type="text/markdown", + display_name="Contributing Guide", + ) +) +``` + +**c) Tool Return Values** +```python +def get_image(): + """Tool that returns image parts""" + return [types.Part.from_uri(file_uri="gs://replace_with_your_image_uri")] +``` + +**Key Pattern**: Images can be passed as: +- Part of static instructions (context) +- User input content +- Tool responses +- Multimodal tool results + +### 2. Evaluation Framework Architecture + +#### Core Evaluation Components + +**File: `src/google/adk/evaluation/`** + +1. **EvalCase** (`eval_case.py`) + ```python + class Invocation(EvalBaseModel): + invocation_id: str + user_content: genai_types.Content # Can contain image parts + final_response: Optional[genai_types.Content] + intermediate_data: Optional[IntermediateDataType] + rubrics: Optional[list[Rubric]] + + class EvalCase(EvalBaseModel): + eval_id: str + conversation: Optional[StaticConversation] + conversation_scenario: Optional[ConversationScenario] + rubrics: Optional[list[Rubric]] + ``` + +2. **Rubrics** (`eval_rubrics.py`) + ```python + class RubricContent(EvalBaseModel): + text_property: Optional[str] = Field( + description='The property being evaluated. Example: "The agent\'s response is grammatically correct."' + ) + + class Rubric(EvalBaseModel): + rubric_id: str + rubric_content: RubricContent + description: Optional[str] + type: Optional[str] # e.g., "TOOL_USE_QUALITY", "FINAL_RESPONSE_QUALITY" + + class RubricScore(EvalBaseModel): + rubric_id: str + rationale: Optional[str] + score: Optional[float] + ``` + +3. **Evaluation Metrics** (`eval_metrics.py`) + ```python + class PrebuiltMetrics(Enum): + TOOL_TRAJECTORY_AVG_SCORE = "tool_trajectory_avg_score" + RESPONSE_EVALUATION_SCORE = "response_evaluation_score" + RESPONSE_MATCH_SCORE = "response_match_score" + SAFETY_V1 = "safety_v1" + FINAL_RESPONSE_MATCH_V2 = "final_response_match_v2" + RUBRIC_BASED_FINAL_RESPONSE_QUALITY_V1 = "rubric_based_final_response_quality_v1" + HALLUCINATIONS_V1 = "hallucinations_v1" + RUBRIC_BASED_TOOL_USE_QUALITY_V1 = "rubric_based_tool_use_quality_v1" + + class JudgeModelOptions(EvalBaseModel): + judge_model: str = "gemini-2.5-flash" + num_samples: int = 5 # Sample multiple times for reliability + + class RubricsBasedCriterion(BaseCriterion): + judge_model_options: JudgeModelOptions + rubrics: list[Rubric] + ``` + +4. 
**Evaluation Configuration** (`eval_config.py`) + ```python + class EvalConfig(BaseModel): + criteria: dict[str, Union[Threshold, BaseCriterion]] + user_simulator_config: Optional[BaseUserSimulatorConfig] + + # Example configuration + { + "criteria": { + "tool_trajectory_avg_score": 1.0, + "response_match_score": 0.5, + "final_response_match_v2": { + "threshold": 0.5, + "judge_model_options": { + "judge_model": "gemini-2.5-flash", + "num_samples": 5 + } + } + } + } + ``` + +### 3. Multimodal Agent Examples + +#### Example 1: Static Non-Text Content +**Location**: `contributing/samples/static_non_text_content/` + +```python +def create_static_instruction_with_file_upload(): + """Create static instruction with images and files""" + + parts = [ + types.Part.from_text(text="You are an AI assistant..."), + + # Inline image data + types.Part( + inline_data=types.Blob( + data=SAMPLE_IMAGE_DATA, + mime_type="image/png", + display_name="sample_chart.png", + ) + ), + + types.Part.from_text(text="This is a sample chart..."), + ] + + # Add file references based on API variant + if api_variant == GoogleLLMVariant.VERTEX_AI: + parts.append( + types.Part(file_data=types.FileData( + file_uri="gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf", + mime_type="application/pdf", + )) + ) + + return types.Content(parts=parts) + +root_agent = Agent( + model="gemini-2.5-flash", + name="static_non_text_content_demo_agent", + static_instruction=create_static_instruction_with_file_upload(), + instruction="Please analyze the user's question..." +) +``` + +#### Example 2: Multimodal Tool Results +**Location**: `contributing/samples/multimodal_tool_results/` + +```python +def get_image(): + """Tool that returns image parts""" + return [types.Part.from_uri(file_uri="gs://replace_with_your_image_uri")] + +root_agent = LlmAgent( + name="image_describing_agent", + description="image describing agent", + instruction="Get the image using the get_image tool, and describe it.", + model="gemini-2.0-flash", + tools=[get_image], +) + +app = App( + name="multimodal_tool_results", + root_agent=root_agent, + plugins=[MultimodalToolResultsPlugin()], +) +``` + +#### Example 3: Image Generation Agent +**Location**: `contributing/samples/generate_image/` + +Shows how to generate images and handle them in the conversation flow. + +### 4. Best Practices for Image Evaluation + +Based on the framework's patterns, here are recommended approaches: + +#### A. Test Case Structure + +```python +# eval_case with image input +test_case = EvalCase( + eval_id="vision_test_001", + conversation=[ + Invocation( + invocation_id="inv_001", + user_content=genai_types.Content( + parts=[ + types.Part.from_text(text="Describe this image:"), + types.Part( + inline_data=types.Blob( + data=image_bytes, + mime_type="image/jpeg", + ) + ) + ] + ), + final_response=genai_types.Content( + parts=[types.Part.from_text(text="Expected response...")] + ), + rubrics=[ + Rubric( + rubric_id="vision_accuracy", + rubric_content=RubricContent( + text_property="The agent correctly identifies the main objects in the image" + ), + type="VISION_ACCURACY" + ), + Rubric( + rubric_id="vision_detail", + rubric_content=RubricContent( + text_property="The agent provides detailed description including colors, positions, and context" + ), + type="VISION_DETAIL" + ) + ] + ) + ] +) +``` + +#### B. 
Evaluation Configuration for Vision Tasks + +```python +eval_config = EvalConfig( + criteria={ + # Use LLM-as-judge for vision tasks + "rubric_based_final_response_quality_v1": RubricsBasedCriterion( + threshold=0.7, + judge_model_options=JudgeModelOptions( + judge_model="gemini-2.5-flash", # Vision-capable model + num_samples=5 + ), + rubrics=[ + Rubric( + rubric_id="object_detection", + rubric_content=RubricContent( + text_property="The response correctly identifies all major objects visible in the image" + ) + ), + Rubric( + rubric_id="spatial_understanding", + rubric_content=RubricContent( + text_property="The response accurately describes spatial relationships between objects" + ) + ), + Rubric( + rubric_id="detail_completeness", + rubric_content=RubricContent( + text_property="The response includes relevant details about colors, textures, and context" + ) + ) + ] + ), + + # Safety check for vision + "safety_v1": 0.9, + + # Hallucination detection + "hallucinations_v1": HallucinationsCriterion( + threshold=0.2, # Low threshold = fewer hallucinations allowed + judge_model_options=JudgeModelOptions( + judge_model="gemini-2.5-flash", + num_samples=3 + ) + ) + } +) +``` + +#### C. Tool Trajectory Evaluation for Vision Agents + +```python +# When evaluating vision agents that use tools +eval_config = EvalConfig( + criteria={ + "tool_trajectory_avg_score": ToolTrajectoryCriterion( + threshold=1.0, + match_type=ToolTrajectoryCriterion.MatchType.IN_ORDER + ), + "rubric_based_tool_use_quality_v1": RubricsBasedCriterion( + threshold=0.8, + rubrics=[ + Rubric( + rubric_id="tool_selection", + rubric_content=RubricContent( + text_property="The agent selects appropriate vision tools for the task" + ), + type="TOOL_USE_QUALITY" + ) + ] + ) + } +) +``` + +### 5. Key Architectural Patterns + +#### Pattern 1: Content Parts as Universal Container + +```python +# Content is composed of Parts +# Parts can be: text, inline_data (images), file_data (URIs), function_call, function_response +class Content: + parts: list[Part] + role: str # "user" | "model" + +# This allows mixing text and images naturally +user_input = Content( + role="user", + parts=[ + Part.from_text("What's in this image?"), + Part(inline_data=Blob(data=image_data, mime_type="image/jpeg")) + ] +) +``` + +#### Pattern 2: Static Instructions with Context + +```python +# Static instructions can include visual context that's available to all conversations +agent = Agent( + static_instruction=Content( + parts=[ + Part.from_text("You are a visual assistant..."), + Part(inline_data=Blob(...)), # Reference image + Part.from_text("Use the reference image above as context...") + ] + ), + instruction="Dynamic per-request instructions..." +) +``` + +#### Pattern 3: Multimodal Tool Results + +```python +# Tools can return multimodal content +def analyze_chart(): + return [ + Part.from_text("Chart shows upward trend"), + Part.from_uri("gs://bucket/enhanced_chart.png") + ] + +# Framework handles multimodal tool results through plugins +app = App( + root_agent=agent, + plugins=[MultimodalToolResultsPlugin()] +) +``` + +#### Pattern 4: LLM-as-Judge for Multimodal Evaluation + +```python +# Use vision-capable judge model to evaluate vision task responses +judge_evaluates = f""" +Given: +- Original image: {image_uri} +- User question: {user_question} +- Agent response: {agent_response} +- Rubric: {rubric.rubric_content.text_property} + +Evaluate if the response satisfies the rubric criterion. +Score: 0-1 +""" +``` + +### 6. 
Event Logging Structure + +The framework logs detailed event information: + +```python +{ + "invocation_id": "CFs9iCdD", + "event_id": "urXUWHfc", + "model_request": { + "model": "gemini-1.5-flash", + "contents": [/* multimodal content */], + "config": { + "system_instruction": "...", + "tools": [/* tool definitions */] + } + }, + "model_response": { + "candidates": [{ + "content": {/* response content */}, + "finish_reason": "STOP", + "safety_ratings": [/* safety scores */] + }], + "usage_metadata": { + "candidates_token_count": 16, + "prompt_token_count": 84, + "total_token_count": 100 + } + } +} +``` + +## Recommendations for AgentV Implementation + +### 1. Eval Case Structure + +```typescript +interface VisionEvalCase { + eval_id: string; + invocations: Array<{ + user_content: { + text: string; + images?: Array<{ + data: string; // base64 or URI + mime_type: string; + display_name?: string; + }>; + }; + expected_response?: string; + rubrics: Array<{ + rubric_id: string; + criterion: string; + type: "VISION_ACCURACY" | "VISION_DETAIL" | "SPATIAL_UNDERSTANDING"; + }>; + }>; +} +``` + +### 2. YAML Configuration Pattern + +```yaml +eval_cases: + - eval_id: "image_description_001" + conversation: + - invocation_id: "inv_001" + user_content: + text: "Describe the objects in this image" + images: + - uri: "file://./test_images/scene_001.jpg" + mime_type: "image/jpeg" + rubrics: + - rubric_id: "object_detection" + criterion: "Correctly identifies all major objects" + threshold: 0.8 + - rubric_id: "spatial_relations" + criterion: "Accurately describes object positions" + threshold: 0.7 + +eval_config: + criteria: + rubric_based_vision_quality: + threshold: 0.75 + judge_model: "gemini-2.5-flash" + num_samples: 5 +``` + +### 3. Rubric Types for Vision + +- **VISION_ACCURACY**: Object detection accuracy +- **VISION_DETAIL**: Level of detail in descriptions +- **SPATIAL_UNDERSTANDING**: Understanding of spatial relationships +- **COLOR_ACCURACY**: Correct identification of colors +- **CONTEXT_UNDERSTANDING**: Understanding scene context +- **OCR_ACCURACY**: Text extraction accuracy (if applicable) +- **VISUAL_REASONING**: Ability to reason about visual content + +### 4. Multi-Sample Evaluation + +Follow ADK's pattern of sampling judge model multiple times (default: 5) for reliability: + +```python +num_samples = 5 +scores = [] +for _ in range(num_samples): + score = judge_model.evaluate(image, response, rubric) + scores.append(score) +final_score = statistics.mean(scores) +``` + +### 5. Image Storage Patterns + +Support multiple image sources: +- **Inline Base64**: For small images in YAML +- **File URIs**: `file://./path/to/image.jpg` +- **HTTP/HTTPS URIs**: For external images +- **Cloud Storage**: `gs://bucket/image.jpg` (if using GCP) + +### 6. Evaluation Flow + +``` +1. Load eval case with image references +2. Resolve image data (download if URI, decode if base64) +3. Run agent with image + text input +4. Collect agent response +5. For each rubric: + a. Sample judge model N times + b. Average scores + c. Compare to threshold +6. Aggregate results +7. Generate report +``` + +## Code Examples to Reference + +### Key Files to Study + +1. **Multimodal Content Handling**: + - `contributing/samples/static_non_text_content/agent.py` + - `contributing/samples/multimodal_tool_results/agent.py` + +2. 
**Evaluation Infrastructure**: + - `src/google/adk/evaluation/eval_case.py` + - `src/google/adk/evaluation/eval_rubrics.py` + - `src/google/adk/evaluation/eval_metrics.py` + - `src/google/adk/evaluation/eval_config.py` + +3. **LLM-as-Judge Implementation**: + - `src/google/adk/evaluation/llm_as_judge.py` + - `src/google/adk/evaluation/rubric_based_evaluator.py` + +4. **Safety and Hallucination Detection**: + - `src/google/adk/evaluation/safety_evaluator.py` + - `src/google/adk/evaluation/hallucinations_v1.py` + +## Gaps and Adaptations Needed + +### What ADK Doesn't Provide + +1. **No specific vision-focused eval examples** + - Need to create vision-specific rubrics + - Need vision test datasets + +2. **No image similarity metrics** + - No CLIP score, SSIM, etc. + - Relies on LLM-as-judge for vision evaluation + +3. **No automated image annotation** + - Need to manually create expected responses + - No computer vision metrics integration + +### What to Adapt + +1. **Create vision-specific rubric library** + ```python + VISION_RUBRICS = { + "object_detection": "Identifies all major objects correctly", + "spatial_understanding": "Describes spatial relationships accurately", + "color_accuracy": "Identifies colors correctly", + # etc. + } + ``` + +2. **Image preprocessing utilities** + ```python + def prepare_image_for_eval(image_path): + # Resize, normalize, encode as base64 + pass + ``` + +3. **Vision-specific judge prompts** + ```python + VISION_JUDGE_TEMPLATE = """ + You are evaluating a vision AI agent's response. + + Image: {image_uri} + Question: {question} + Agent Response: {response} + Rubric: {rubric} + + Score the response 0-1 based on the rubric. + """ + ``` + +## Conclusion + +The ADK-Python framework provides a solid foundation for multimodal evaluation through: + +1. **Flexible content model** supporting images via inline_data and file_data +2. **Rubric-based evaluation** system adaptable to vision tasks +3. **LLM-as-judge pattern** that works with vision-capable models +4. **Multi-sample evaluation** for reliability +5. **Comprehensive event logging** for debugging + +**Key Takeaway**: While ADK doesn't have vision-specific examples, its architecture is well-suited for image evaluation. The main work needed is creating vision-specific rubrics and test cases, which can follow the existing patterns for text-based evaluation. + +## References + +- Repository: https://github.com/google/adk-python +- Static Non-Text Content Example: `contributing/samples/static_non_text_content/` +- Multimodal Tool Results: `contributing/samples/multimodal_tool_results/` +- Evaluation Module: `src/google/adk/evaluation/` diff --git a/openspec/changes/add-vision-evaluation/references/research-summary.md b/openspec/changes/add-vision-evaluation/references/research-summary.md new file mode 100644 index 00000000..ae7bb70f --- /dev/null +++ b/openspec/changes/add-vision-evaluation/references/research-summary.md @@ -0,0 +1,945 @@ +# Vision Evaluation Research Summary + +## Executive Summary + +This document summarizes research into best practices for adding image input evaluation capabilities to AgentV, based on analysis of leading AI agent and evaluation frameworks. + +**Date**: January 2, 2026 +**Repositories Analyzed**: 4 leading frameworks + +--- + +## 1. Research Methodology + +### Repositories Researched + +1. **google/adk-python** - Google's Agent Development Kit (Python) + - Focus: Rubric-based evaluation, multimodal content handling + +2. 
**mastra-ai/mastra** - TypeScript agent framework + - Focus: Production patterns, structured outputs, Braintrust integration + +3. **Azure/azure-sdk-for-python** - Microsoft Azure SDKs + - Focus: Image input APIs, Computer Vision, testing patterns + +4. **langwatch/langwatch** - LLM observability and evaluation + - Focus: Evaluation architecture, batch processing, metrics + +### Research Approach + +Each repository was systematically analyzed using GitHub CLI searches for: +- Image input handling patterns +- Multimodal evaluation examples +- Vision-specific evaluators/judges +- Testing frameworks and best practices +- Documentation and guides + +--- + +## 2. Key Findings by Framework + +### 2.1 Google ADK-Python + +**Multimodal Content Model**: +```python +Content( + parts=[ + Part.from_text("Describe this image"), + Part(inline_data=Blob(data=image_bytes, mime_type="image/jpeg")) + ] +) +``` + +**Key Patterns**: +- ✅ Unified `Content` and `Parts` model for text + images +- ✅ Three image input methods: inline base64, URIs, tool returns +- ✅ Rubric-based evaluation with vision-capable judges +- ✅ Multi-sample evaluation (5x) for reliability +- ✅ Comprehensive event logging + +**Evaluation Architecture**: +```python +Invocation( + user_content=Content(parts=[...]), + rubrics=[ + Rubric( + rubric_id="vision_accuracy", + rubric_content=RubricContent( + text_property="Correctly identifies main objects" + ), + type="VISION_ACCURACY" + ) + ] +) +``` + +**Vision-Specific Rubric Types**: +- Object detection accuracy +- Spatial understanding +- Color accuracy +- Detail completeness +- Context understanding + +**Gaps Identified**: +- ❌ No specific vision eval examples in repo +- ❌ No computer vision metrics (SSIM, CLIP) +- ❌ No automated image annotation tools + +--- + +### 2.2 Mastra (TypeScript) + +**Message Format**: +```typescript +{ + role: "user", + content: [ + { type: "text", text: "Describe the image" }, + { + type: "image", + image: "https://example.com/image.jpg", + mimeType: "image/jpeg" + } + ] +} +``` + +**Supported Image Formats**: +- URL references (HTTP/HTTPS) +- Data URIs (base64) +- Binary data (Uint8Array, Buffer) +- Cloud storage (gs://, s3://) + +**Vision Model Integration**: +- OpenAI: GPT-4o, GPT-4 Turbo +- Anthropic: Claude 3.5 Sonnet, Claude 3 Haiku/Opus +- Google: Gemini 2.5 Pro/Flash + +**Structured Output Pattern**: +```typescript +const result = await agent.generate(messages, { + output: z.object({ + bird: z.boolean(), + species: z.string(), + location: z.string() + }) +}); +``` + +**Evaluation with Braintrust**: +```typescript +Eval("Is a bird", { + data: () => [ + { input: IMAGE_URL, expected: { bird: true, species: "robin" } } + ], + task: async (input) => await analyzeImage(input), + scores: [containsScorer, hallucinationScorer] +}); +``` + +**Built-in Scorers**: +- Hallucination detection +- Faithfulness checking +- Content similarity + +**Key Strengths**: +- ✅ Production-ready TypeScript patterns +- ✅ Strong typing with Zod schemas +- ✅ Braintrust evaluation integration +- ✅ Memory persistence with images +- ✅ UI integration examples + +--- + +### 2.3 Azure SDK for Python + +**Dual Input Methods**: +```python +# Method 1: URL +result = client.analyze_from_url( + image_url="https://example.com/image.jpg", + visual_features=[VisualFeatures.CAPTION] +) + +# Method 2: Binary data +with open("image.jpg", "rb") as f: + result = client.analyze( + image_data=f.read(), + visual_features=[VisualFeatures.CAPTION] + ) +``` + +**Chat Completions with Vision** (Azure OpenAI): 
+```python +completion = client.chat.completions.create( + model="gpt-4o", + messages=[{ + "role": "user", + "content": [ + {"type": "text", "text": "What's in this image?"}, + { + "type": "image_url", + "image_url": { + "url": image_url, + "detail": "high" # low, high, auto + } + } + ] + }] +) +``` + +**Computer Vision Features**: +- Tags, captions, dense captions +- Object detection with bounding boxes +- OCR (text extraction) +- People detection +- Smart crops + +**Testing Patterns**: +```python +class ImageAnalysisTestBase(AzureRecordedTestCase): + def _do_analysis(self, image_source, visual_features): + if "http" in image_source: + return self.client.analyze_from_url(...) + else: + with open(image_source, "rb") as f: + return self.client.analyze(image_data=f.read(), ...) +``` + +**Evaluation Integration**: +```python +evaluator = ContentSafetyEvaluator( + credential=cred, + azure_ai_project=project +) +score = evaluator(conversation=multimodal_conversation) +``` + +**Key Insights**: +- ✅ Flexible input handling (URL + binary) +- ✅ Comprehensive Computer Vision API +- ✅ Structured response models +- ✅ Test infrastructure with recording/playback +- ✅ Multiple authentication methods + +**Image Format Support**: +- Formats: JPEG, PNG, GIF, BMP, WEBP, ICO, TIFF, MPO +- Size: 50x50 to 16,000x16,000 pixels, max 20 MB + +--- + +### 2.4 LangWatch + +**Finding**: No native multimodal support, but excellent general evaluation patterns. + +**Evaluator Architecture**: +```typescript +interface EvaluatorConfig { + id: string; + evaluatorType: string; + name: string; + settings: Record; + inputs: Field[]; + mappings: Record; +} +``` + +**Evaluation Result Schema**: +```typescript +{ + status: "processed" | "skipped" | "error", + passed?: boolean, + score?: number, // 0-1 + label?: string, + details?: string, + cost?: Money +} +``` + +**Batch Evaluation Pattern**: +```python +for index, row in evaluation.loop(df.iterrows()): + evaluation.submit(evaluate_fn, index, row) +evaluation.wait_for_completion() +``` + +**Key Patterns to Adopt**: +- ✅ Pluggable evaluator architecture +- ✅ Flexible result schema (score, passed, label, details) +- ✅ Dataset + runner + evaluator separation +- ✅ Parallel execution with progress tracking +- ✅ Cost tracking per evaluation +- ✅ Version tracking and reproducibility + +**Adaptable for Vision**: +- Extend input types to include images +- Add vision-specific evaluators +- Support image datasets +- Add visual comparison UI + +--- + +## 3. 
References + +### Repository Links + +- [google/adk-python](https://github.com/google/adk-python) +- [mastra-ai/mastra](https://github.com/mastra-ai/mastra) +- [Azure/azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python) +- [langwatch/langwatch](https://github.com/langwatch/langwatch) + +### Key Documentation + +- [OpenAI Vision API](https://platform.openai.com/docs/guides/vision) +- [Anthropic Claude Vision](https://docs.anthropic.com/claude/docs/vision) +- [Google Gemini Vision](https://ai.google.dev/gemini-api/docs/vision) +- [Azure Computer Vision](https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/) + +### Related Papers + +- "GPT-4V(ision) System Card" - OpenAI +- "Claude 3 Model Card" - Anthropic +- "Gemini: A Family of Highly Capable Multimodal Models" - Google + +--- + +**End of Research Summary** + +### 3.1 Image Input Format + +Based on Mastra and Azure patterns: + +```yaml +# YAML eval file format +input_messages: + - role: user + content: + - type: text + value: "Describe this image" + - type: image + value: ./test-images/photo.jpg # Local file + detail: high # Optional: low, high, auto + - type: image_url + value: https://example.com/image.jpg # URL +``` + +**Supported sources**: +- Local files: `./path/to/image.jpg` +- HTTP URLs: `https://...` +- Data URIs: `data:image/jpeg;base64,...` +- Cloud storage: `gs://bucket/image.jpg`, `s3://bucket/image.jpg` + +**MIME types to support**: +- `image/jpeg`, `image/png`, `image/gif` +- `image/webp`, `image/bmp` +- Auto-detect from file extension + +--- + +### 3.2 Evaluator Types + +#### LLM-Based Judges + +Located in `evaluators/vision/*.md`: + +1. **Image Description Judge** + ```yaml + evaluators: + - name: description_quality + type: llm_judge + prompt: evaluators/vision/image-description-judge.md + ``` + + Dimensions: + - Visual Accuracy (40%) + - Completeness (30%) + - Clarity (20%) + - Relevance (10%) + +2. **Activity Recognition Judge** + - Activity identification + - Count accuracy + - Pose/interaction recognition + +3. **Comparison Judge** + - Change detection + - Spatial precision + - Completeness + +4. **Reasoning Judge** + - Logical correctness + - Visual understanding + - Problem-solving quality + +5. **Structured Output Judge** + - JSON validity + - Schema compliance + - Data accuracy + +6. **Quality Assessment Judge** + - Technical quality + - Composition + - Aesthetic evaluation + +#### Code-Based Validators + +Located in `evaluators/vision/*.py`: + +1. **count_validator.py** + ```python + validate_object_count(output, expected_output) -> Result + ``` + +2. **ocr_validator.py** + ```python + validate_ocr_accuracy(output, expected, threshold=0.7) -> Result + ``` + +3. **json_validator.py** + ```python + validate_json_structure(output, expected, schema) -> Result + ``` + +4. **chart_validator.py** + ```python + validate_chart_data(output, expected, tolerance=0.15) -> Result + ``` + +--- + +### 3.3 Example Eval Cases + +#### Basic Image Analysis + +```yaml +- id: simple-image-description + input_messages: + - role: system + content: You can analyze images and provide detailed descriptions. + - role: user + content: + - type: text + value: "Describe what you see in this image." 
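+        # The image part below references a local file under test-images/; `detail: high`
+        # requests full-resolution analysis (more tokens per image), while `low` is cheaper.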
+ - type: image + value: ./test-images/office.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + The image shows an office workspace with: + - A desk with computer monitor + - Office chair + - Natural lighting from window + + execution: + evaluators: + - name: content_accuracy + type: llm_judge + prompt: ../../evaluators/vision/image-description-judge.md +``` + +#### Structured Output + +```yaml +- id: structured-object-detection + input_messages: + - role: user + content: + - type: text + value: |- + Return JSON with this structure: + {"objects": [{"name": "...", "count": 1, "position": "..."}]} + - type: image + value: ./test-images/shelf.jpg + + expected_messages: + - role: assistant + content: |- + ```json + { + "objects": [ + {"name": "bottle", "count": 5, "position": "top shelf"}, + {"name": "can", "count": 8, "position": "middle shelf"} + ] + } + ``` + + execution: + evaluators: + - name: json_validation + type: code_judge + script: uv run json_validator.py + cwd: ../../evaluators/vision +``` + +#### Multi-Turn Conversation + +```yaml +- id: conversation-turn-1 + conversation_id: vision-chat-001 + input_messages: + - role: user + content: + - type: text + value: "What are the main elements?" + - type: image + value: ./architecture.jpg + expected_messages: + - role: assistant + content: "Main elements: API Gateway, Services, Database..." + +- id: conversation-turn-2 + conversation_id: vision-chat-001 + input_messages: + # Full history required + - role: user + content: + - type: text + value: "What are the main elements?" + - type: image + value: ./architecture.jpg + - role: assistant + content: "Main elements: API Gateway, Services, Database..." + - role: user + content: "Explain the API Gateway's role" + expected_messages: + - role: assistant + content: "The API Gateway handles routing and authentication..." 
+``` + +--- + +### 3.4 Context Management + +**Token Budget Strategy**: + +```typescript +const IMAGE_TOKEN_COSTS = { + low: 85, // 512x512 or less + high: 765, // 512x512 to 2048x2048 + auto: 1360 // 2048x2048+ +}; + +const MAX_CONTEXT = 128000; // Model context limit +const RESERVE = 0.3; // 30% for output + safety + +const maxImages = Math.floor( + (MAX_CONTEXT * (1 - RESERVE)) / IMAGE_TOKEN_COSTS.high +); +// ≈ 117 images at high detail +``` + +**Progressive Loading**: + +```typescript +interface ImageProcessingStrategy { + // Level 1: Metadata only + getMetadata(imagePath: string): ImageMetadata; + + // Level 2: Text description + getDescription(imagePath: string): Promise; + + // Level 3: Full visual analysis + analyzeImage(imagePath: string): Promise; +} +``` + +**File System Caching**: + +```typescript +const visionCache = new Map(); + +async function processWithCache(imagePath: string) { + const cacheKey = await hashFile(imagePath); + + if (visionCache.has(cacheKey)) { + return visionCache.get(cacheKey); + } + + const analysis = await analyzeImage(imagePath); + visionCache.set(cacheKey, analysis); + + // Persist to disk + await fs.writeFile( + `./cache/vision/${cacheKey}.json`, + JSON.stringify(analysis) + ); + + return analysis; +} +``` + +--- + +### 3.5 Cost Optimization + +**Pricing Reference** (as of Jan 2026): + +| Provider | Model | Input (per 1M tokens) | Image Token Cost* | +|----------|-------|---------------------|------------------| +| OpenAI | GPT-4o | $2.50 | $1.91-$3.40 per 1K images | +| Anthropic | Claude 3.5 | $3.00 | $2.30-$4.08 per 1K images | +| Google | Gemini 2.5 Flash | $0.075 | $0.06-$0.10 per 1K images | + +*Based on average 765-1360 tokens per image + +**Cost Optimization Strategies**: + +1. **Use detail levels appropriately**: + ```yaml + - type: image + value: ./image.jpg + detail: low # For simple tasks, saves ~90% tokens + ``` + +2. **Choose cost-effective models**: + - Gemini 2.5 Flash: 20-30x cheaper than GPT-4o + - Use for high-volume testing + - Upgrade to GPT-4o/Claude for production + +3. **Cache image descriptions**: + ```typescript + // First pass: Analyze image + const description = await analyzeImage(image); + await cache.set(imageHash, description); + + // Subsequent passes: Use cached text (20 tokens vs 765) + const cachedDescription = await cache.get(imageHash); + ``` + +4. **Batch evaluation**: + ```typescript + // Process multiple evals in parallel + const results = await Promise.all( + evalCases.map(ec => evaluateWithImage(ec)) + ); + ``` + +5. **Use code validators when possible**: + - Object counting: Free + - OCR validation: Free + - JSON validation: Free + - Only use LLM judges for subjective evaluation + +--- + +## 4. 
Best Practices Summary + +### 4.1 Evaluation Design + +✅ **Multi-dimensional rubrics** +- Weight dimensions appropriately +- Visual accuracy typically 35-40% +- Completeness 25-30% +- Clarity/presentation 15-20% + +✅ **Multiple evaluator types** +- LLM judges for subjective assessment +- Code validators for objective metrics +- Combine for comprehensive evaluation + +✅ **Multi-sample evaluation** +- Run LLM judges 3-5 times +- Aggregate scores for reliability +- Report variance/confidence + +✅ **Clear scoring thresholds** +- 0.9-1.0: Production ready +- 0.7-0.89: Good, minor improvements +- 0.5-0.69: Acceptable, significant gaps +- Below 0.5: Not passing + +--- + +### 4.2 Image Input Handling + +✅ **Support multiple sources** +- Local files (primary for testing) +- HTTP URLs (public images) +- Cloud storage (enterprise) +- Data URIs (embedded) + +✅ **Specify MIME types** +- Always include for reliability +- Auto-detect from extension as fallback + +✅ **Use detail levels** +- `low`: Simple tasks, faster, cheaper +- `high`: Complex analysis, detailed +- `auto`: Let model decide + +✅ **Validate image requirements** +- Check size limits (50x50 to 16,000x16,000) +- Verify format support +- Ensure file accessibility + +--- + +### 4.3 Context Management + +✅ **Progressive disclosure** +- Load metadata first (cheap) +- Generate descriptions on demand +- Full analysis only when necessary + +✅ **Token budgeting** +- Calculate image token costs +- Reserve 30% for output +- Monitor utilization percentage + +✅ **File system caching** +- Hash images for cache keys +- Store analyses as JSON +- Pass references, not raw data + +✅ **Supervisor pattern** +- Isolate vision processing +- Separate orchestration context +- Prevent token pollution + +--- + +### 4.4 Testing Strategy + +✅ **Complexity levels** +```yaml +tests: + - simple: # Single object, clear image + complexity: 1 + - medium: # Multiple objects, some occlusion + complexity: 2 + - complex: # Scene understanding, reasoning + complexity: 3 +``` + +✅ **Coverage areas** +- Basic description +- Object detection/counting +- Spatial reasoning +- Text extraction (OCR) +- Multi-image comparison +- Quality assessment +- Logical reasoning +- Structured output + +✅ **Edge cases** +- Low quality images +- Partially occluded objects +- Ambiguous scenes +- Multiple valid interpretations +- Empty/minimal content + +--- + +## 5. Files Created + +### Evaluation Files (YAML) + +1. `basic-image-analysis.yaml` - 7 basic vision eval cases +2. `advanced-vision-tasks.yaml` - 7 advanced eval cases + +### LLM Judge Prompts (Markdown) + +3. `image-description-judge.md` +4. `activity-judge.md` +5. `comparison-judge.md` +6. `reasoning-judge.md` +7. `structured-output-judge.md` +8. `quality-assessment-judge.md` + +### Code Validators (Python) + +9. `count_validator.py` +10. `ocr_validator.py` +11. `json_validator.py` +12. `chart_validator.py` + +### Documentation + +13. `README.md` - Comprehensive guide +14. `RESEARCH_SUMMARY.md` - This document + +--- + +## 6. Next Steps + +### Phase 1: Core Implementation (Week 1-2) + +1. **Extend AgentV Schema** + - Add image content type to message schema + - Support detail levels + - Validate image paths/URLs + +2. **Image Loading** + - Implement file loader + - URL fetcher with validation + - Base64 encoder + - MIME type detection + +3. **Provider Integration** + - Update OpenAI provider for vision + - Update Anthropic provider + - Update Google provider + - Test with real models + +### Phase 2: Evaluators (Week 3) + +4. 
**LLM Judge Integration** + - Load judge prompts from MD files + - Pass image references to judges + - Parse structured evaluation results + +5. **Code Validator Runner** + - Execute Python validators with `uv run` + - Pass eval data as JSON + - Parse results + +6. **Test Evaluators** + - Create test images + - Run basic eval suite + - Validate scoring + +### Phase 3: Advanced Features (Week 4) + +7. **Context Management** + - Implement progressive disclosure + - Add token budgeting + - File system caching + +8. **Batch Processing** + - Parallel evaluation + - Progress tracking + - Cost reporting + +9. **Documentation** + - Usage guide + - API reference + - Tutorial videos + +### Phase 4: Computer Vision Metrics (Future) + +10. **Native CV Evaluators** + - SSIM (structural similarity) + - Perceptual hashing + - CLIP embeddings + - Object detection validation + +11. **Specialized Evaluators** + - Face detection + - Logo recognition + - Medical imaging + - Document understanding + +--- + +## 7. Success Metrics + +### Technical Metrics + +- ✅ Support 4+ vision-capable providers +- ✅ Handle 3+ image input formats +- ✅ Implement 6+ vision evaluators +- ✅ Achieve <2s avg eval latency +- ✅ Support images up to 16MP +- ✅ Cost tracking per evaluation + +### Quality Metrics + +- ✅ Evaluation accuracy >90% vs human judgment +- ✅ Hallucination detection >85% accuracy +- ✅ Object count accuracy >95% +- ✅ OCR validation >80% accuracy +- ✅ Multi-sample consistency >90% + +### Usability Metrics + +- ✅ Documentation completeness score >90% +- ✅ Example coverage: 10+ eval cases +- ✅ Setup time <15 minutes +- ✅ User satisfaction >4.5/5 + +--- + +## 8. References + +### Repository Links + +- [google/adk-python](https://github.com/google/adk-python) +- [mastra-ai/mastra](https://github.com/mastra-ai/mastra) +- [Azure/azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python) +- [langwatch/langwatch](https://github.com/langwatch/langwatch) + +### Key Documentation + +- [OpenAI Vision API](https://platform.openai.com/docs/guides/vision) +- [Anthropic Claude Vision](https://docs.anthropic.com/claude/docs/vision) +- [Google Gemini Vision](https://ai.google.dev/gemini-api/docs/vision) +- [Azure Computer Vision](https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/) + +### Related Papers + +- "GPT-4V(ision) System Card" - OpenAI +- "Claude 3 Model Card" - Anthropic +- "Gemini: A Family of Highly Capable Multimodal Models" - Google + +--- + +## Appendix A: Token Cost Calculator + +```typescript +function estimateImageTokens( + width: number, + height: number, + detail: 'low' | 'high' | 'auto' +): number { + if (detail === 'low') { + return 85; + } + + // High detail calculation (OpenAI algorithm) + const scaledWidth = Math.min(width, 2048); + const scaledHeight = Math.min(height, 2048); + + // Scale to fit 768px shortest side + const scale = 768 / Math.min(scaledWidth, scaledHeight); + const finalWidth = Math.ceil(scaledWidth * scale / 512) * 512; + const finalHeight = Math.ceil(scaledHeight * scale / 512) * 512; + + const tiles = (finalWidth / 512) * (finalHeight / 512); + return 170 * tiles + 85; // Base 85 + 170 per tile +} + +// Examples: +estimateImageTokens(1024, 768, 'high'); // ≈ 765 +estimateImageTokens(2048, 1536, 'high'); // ≈ 1105 +estimateImageTokens(512, 512, 'high'); // ≈ 255 +estimateImageTokens(4096, 4096, 'low'); // 85 +``` + +--- + +## Appendix B: Sample Test Dataset + +Recommended test images to include: + +1. **Office workspace** - Basic description +2. 
**Team meeting** - People counting +3. **Desk arrangement** - Spatial reasoning +4. **Document scan** - OCR testing +5. **Before/after comparison** - Change detection +6. **Color palette** - Color analysis +7. **Product shelf** - Object detection +8. **Chess position** - Logical reasoning +9. **Architecture diagram** - Understanding +10. **Landscape photo** - Quality assessment +11. **Sales chart** - Data extraction +12. **Celebration scene** - Context inference +13. **Floor plan** - Measurement +14. **Low quality image** - Error handling +15. **Ambiguous scene** - Edge case + +--- + +**End of Research Summary** diff --git a/openspec/changes/add-vision-evaluation/specs/vision-evaluation/spec.md b/openspec/changes/add-vision-evaluation/specs/vision-evaluation/spec.md new file mode 100644 index 00000000..d0d30b52 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/specs/vision-evaluation/spec.md @@ -0,0 +1,330 @@ +# vision-evaluation Specification + +## Purpose +Provide comprehensive, self-contained vision evaluation examples demonstrating best practices for testing AI agents with image inputs. Organized as a standalone package under `examples/showcase/vision/` with all necessary assets. + +## ADDED Requirements + +### Requirement: Vision Examples MUST be self-contained in examples/showcase/vision/ +All vision evaluation files SHALL be organized in a single directory structure under `examples/showcase/vision/`, making it easy to discover, understand, and use as a complete package. + +#### Scenario: Directory structure is self-contained +Given the vision examples directory +When inspecting `examples/showcase/vision/` +Then it SHALL contain: +- `.agentv/` - Configuration files +- `datasets/` - Evaluation YAML files +- `evaluators/` - LLM judges and code validators +- `test-images/` - Placeholder for user test images +- `README.md` - Comprehensive documentation +- `INDEX.md` - Quick reference guide + +--- + +### Requirement: Basic Image Analysis Examples MUST cover fundamental tasks +The examples SHALL include 7 basic vision evaluation cases covering essential image understanding capabilities. 
+ +#### Scenario: Simple image description eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `simple-image-description` that: +- Includes an image in the input +- Expects a description of the image +- Uses `image-description-judge` for evaluation + +#### Scenario: Object detection eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `object-detection-simple` that: +- Asks to count/identify objects +- Includes expected count in output +- Uses `count_validator` for verification + +#### Scenario: Spatial relationships eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `spatial-relationships` that: +- Asks about object positions +- Expects spatial descriptions +- Uses `image-description-judge` for evaluation + +#### Scenario: OCR text extraction eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `text-extraction-ocr` that: +- Shows an image with text +- Expects text extraction +- Uses `ocr_validator` for verification + +#### Scenario: Multi-image comparison eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `multi-image-comparison` that: +- Includes two images (before/after) +- Expects change identification +- Uses `comparison-judge` for evaluation + +#### Scenario: Color identification eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `color-identification` that: +- Asks about colors in image +- Expects color descriptions +- Uses `image-description-judge` for evaluation + +#### Scenario: Image from URL eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `image-from-url` that: +- References an image via HTTP URL +- Demonstrates URL loading capability +- Uses standard judge for evaluation + +--- + +### Requirement: Advanced Vision Examples MUST demonstrate complex scenarios +The examples SHALL include 7 advanced vision evaluation cases showcasing sophisticated capabilities. 
+ +#### Scenario: Structured JSON output eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `structured-object-detection` that: +- Requests JSON-formatted object detection results +- Expects specific JSON structure +- Uses `json_validator` and `structured-output-judge` + +#### Scenario: Visual reasoning eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `visual-reasoning-problem` that: +- Presents a logical puzzle with image (e.g., chess) +- Expects reasoned solution +- Uses `reasoning-judge` for evaluation + +#### Scenario: Multi-turn conversation eval cases +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain eval cases `multi-turn-image-discussion-part1` and `part2` that: +- Share the same `conversation_id` +- Maintain image context across turns +- Demonstrate contextual follow-up questions + +#### Scenario: Image quality assessment eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `image-quality-assessment` that: +- Asks for technical/aesthetic quality rating +- Expects detailed assessment +- Uses `quality-assessment-judge` + +#### Scenario: Chart data extraction eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `chart-data-extraction` that: +- Shows a chart/graph image +- Expects data extraction and analysis +- Uses `chart_validator` for verification + +#### Scenario: Scene understanding eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `scene-context-inference` that: +- Requires contextual understanding beyond literal content +- Expects inferred situation/mood +- Uses `image-description-judge` + +#### Scenario: Instruction following with image eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `instruction-following-with-image` that: +- Combines complex instructions with visual reference +- May include file attachments with instructions +- Tests multi-step task completion + +--- + +### Requirement: Comprehensive README MUST provide usage guidance +The `examples/showcase/vision/README.md` file SHALL serve as the primary documentation for vision evaluation. 
+ +#### Scenario: README covers quick start +Given `examples/showcase/vision/README.md` +When a user reads the Quick Start section +Then they SHALL find: +- How to run basic evals +- How to run advanced evals +- How to add test images + +#### Scenario: README documents image input formats +Given `examples/showcase/vision/README.md` +When a user looks up image input formats +Then they SHALL find examples for: +- Local file paths +- HTTP URLs +- Base64 data URIs +- Detail level specification + +#### Scenario: README lists all evaluators +Given `examples/showcase/vision/README.md` +When a user wants to know available evaluators +Then they SHALL find: +- Complete list of LLM judges with descriptions +- Complete list of code validators with descriptions +- Usage examples for each type + +#### Scenario: README includes best practices +Given `examples/showcase/vision/README.md` +When a user looks for best practices +Then they SHALL find guidance on: +- Context engineering (progressive disclosure) +- Token budgeting (image costs) +- Cost optimization strategies +- Provider selection + +#### Scenario: README documents success criteria +Given `examples/vision/README.md` +When a user wants to understand evaluation metrics +Then they SHALL find: +- Scoring dimension weights +- Passing thresholds +- Performance expectations + +--- + +### Requirement: Configuration Files MUST enable easy setup +The `.agentv/` directory SHALL contain configuration files for running vision evals. + +#### Scenario: Config file specifies directories +Given `examples/showcase/vision/.agentv/config.yaml` +When loaded +Then it SHALL specify: +- `evalsDir: ./evals` +- `evaluatorsDir: ./evaluators` + +#### Scenario: Targets file includes vision models +Given `examples/showcase/vision/.agentv/targets.yaml` +When loaded +Then it SHALL define targets for: +- OpenAI GPT-4o (default) +- Anthropic Claude 3.5 Sonnet +- Google Gemini 2.5 Flash +With appropriate environment variable references. + +--- + +### Requirement: Test Images Directory MUST be provided +The examples SHALL include a `test-images/` directory for users to place their own test images. + +#### Scenario: Test images directory exists +Given the vision examples structure +When checking `examples/showcase/vision/test-images/` +Then the directory SHALL exist with a `.gitkeep` file. + +#### Scenario: README documents image requirements +Given `examples/showcase/vision/README.md` +When a user wants to add test images +Then they SHALL find specifications for: +- Supported formats (JPEG, PNG, WEBP, GIF, BMP) +- Size limits (50x50 to 16,000x16,000 pixels, max 20MB) +- File naming conventions +- Which images are needed for which eval cases + +--- + +### Requirement: Research Documentation MUST be accessible +The research findings that informed the vision evaluation design SHALL be documented and referenced. + +#### Scenario: Research summary is available +Given `docs/updates/VISION_EVAL_RESEARCH_SUMMARY.md` +When a user wants to understand design rationale +Then they SHALL find: +- Analysis of 5 leading frameworks +- Key findings by framework +- Implementation recommendations +- Best practices summary +- References to source repositories + +#### Scenario: README links to research +Given `examples/showcase/vision/README.md` +When a user wants deeper context +Then they SHALL find a link to the research summary document. 
+ +--- + +## Cross-References + +**Related Capabilities:** +- `vision-input` - Provides image input support used in examples +- `vision-evaluators` - Provides evaluators used in examples +- `yaml-schema` - Examples use extended schema +- `eval-execution` - Examples are run via eval execution + +**Dependencies:** +- Requires `vision-input` and `vision-evaluators` to be implemented +- Examples serve as integration tests for those capabilities + +--- + +## Implementation Notes + +### Directory Structure +``` +examples/showcase/vision/ +├── .agentv/ +│ ├── config.yaml +│ └── targets.yaml +├── datasets/ +│ ├── basic-image-analysis.yaml (7 cases) +│ └── advanced-vision-tasks.yaml (7 cases) +├── evaluators/ +│ ├── llm-judges/ +│ │ ├── image-description-judge.md +│ │ ├── activity-judge.md +│ │ ├── comparison-judge.md +│ │ ├── reasoning-judge.md +│ │ ├── structured-output-judge.md +│ │ └── quality-assessment-judge.md +│ └── code-validators/ +│ ├── count_validator.py +│ ├── ocr_validator.py +│ ├── json_validator.py +│ └── chart_validator.py +├── test-images/ +│ └── .gitkeep +├── README.md (comprehensive guide) +└── INDEX.md (quick reference) +``` + +### Eval Case Distribution + +**Basic (7 cases):** +1. simple-image-description +2. object-detection-simple +3. spatial-relationships +4. text-extraction-ocr +5. multi-image-comparison +6. color-identification +7. image-from-url + +**Advanced (7 cases):** +1. structured-object-detection +2. visual-reasoning-problem +3. multi-turn-image-discussion-part1 +4. multi-turn-image-discussion-part2 +5. image-quality-assessment +6. chart-data-extraction +7. scene-context-inference +8. instruction-following-with-image + +### Documentation Hierarchy +1. **INDEX.md** - Quick start, table of contents +2. **README.md** - Comprehensive usage guide +3. **Research Summary** - Deep dive into design rationale + +--- + +## Future Enhancements (Out of Scope) +- Pre-included sample test images (users provide their own) +- Video tutorial or walkthrough +- Interactive web-based examples +- Automated eval case generation from templates +- Domain-specific example sets (medical, document analysis, etc.) diff --git a/openspec/changes/add-vision-evaluation/specs/vision-evaluators/spec.md b/openspec/changes/add-vision-evaluation/specs/vision-evaluators/spec.md new file mode 100644 index 00000000..9498b414 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/specs/vision-evaluators/spec.md @@ -0,0 +1,313 @@ +# vision-evaluators Specification + +## Purpose +Provide specialized evaluators for assessing the quality and accuracy of AI agent responses to vision/image-based tasks. Includes both LLM-based judges for subjective assessment and code-based validators for objective metrics. + +## ADDED Requirements + +### Requirement: LLM Judge Prompts MUST support image context +LLM judge prompts SHALL be able to reference images from the evaluation input when assessing vision-based responses. + +#### Scenario: Judge prompt includes image reference placeholder +Given an LLM judge prompt containing `{{image_reference}}` +When rendering the prompt for evaluation +Then the placeholder SHALL be replaced with a reference to the image(s) from the input. + +#### Scenario: Judge model receives image context +Given an LLM judge evaluating a vision task +When the judge model is invoked +Then the judge model SHALL be a vision-capable model (e.g., GPT-4V, Claude 3.5 Sonnet). 
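+
+As a rough illustration of the substitution step these scenarios imply, the sketch below shows how placeholders could be filled in before the judge model is called. The `renderJudgePrompt` name comes from the task list later in this change; the context shape and behavior here are assumptions rather than the final API, and in practice the images themselves would also be attached to the judge call as image content, not only as inline references.
+
+```typescript
+// Sketch only: fill judge prompt placeholders, including image references.
+interface JudgeContext {
+  input: string;
+  output: string;
+  expectedOutput: string;
+  imageReferences?: string[]; // file paths, URLs, or data URIs from the eval input
+}
+
+function renderJudgePrompt(template: string, ctx: JudgeContext): string {
+  return template
+    .replaceAll("{{input}}", ctx.input)
+    .replaceAll("{{output}}", ctx.output)
+    .replaceAll("{{expected_output}}", ctx.expectedOutput)
+    .replaceAll("{{image_reference}}", ctx.imageReferences?.[0] ?? "")
+    .replaceAll("{{image_references}}", JSON.stringify(ctx.imageReferences ?? []));
+}
+```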
+ +--- + +### Requirement: Image Description Judge MUST evaluate visual analysis quality +An LLM judge SHALL assess the accuracy, completeness, and clarity of image descriptions. + +#### Scenario: Evaluate description accuracy +Given an AI response describing an office image +When evaluated by the image-description-judge +Then the score SHALL reflect: +- Visual accuracy (40%): Are objects and details correct? +- Completeness (30%): Are all significant elements mentioned? +- Clarity (20%): Is the description clear and specific? +- Relevance (10%): Does it focus on task-relevant elements? + +#### Scenario: Detect hallucinations in image descriptions +Given an AI response claiming "three people" when image shows two +When evaluated by the image-description-judge +Then the judge SHALL identify the hallucination in its `details.hallucinations` field. + +#### Scenario: Identify missing visual elements +Given an AI response that omits significant background elements +When evaluated by the image-description-judge +Then the judge SHALL list missing elements in its `details.missing_elements` field. + +--- + +### Requirement: Activity Recognition Judge MUST evaluate action identification +An LLM judge SHALL assess the accuracy of identifying activities, actions, and behaviors visible in images. + +#### Scenario: Evaluate activity identification accuracy +Given an AI response identifying "team meeting with 4 people" +When evaluated by the activity-judge +Then the score SHALL reflect: +- Activity identification (35%): Is the activity correctly identified? +- Accuracy (35%): Are counts and poses correct? +- Detail level (20%): Is the detail appropriate? +- Inference quality (10%): Are inferences reasonable? + +--- + +### Requirement: Comparison Judge MUST evaluate multi-image analysis +An LLM judge SHALL assess the quality of comparing multiple images and detecting changes. + +#### Scenario: Evaluate change detection accuracy +Given an AI response comparing before/after images +When evaluated by the comparison-judge +Then the score SHALL reflect: +- Change detection accuracy (40%): Are changes identified? +- Spatial precision (25%): Are locations accurately described? +- Completeness (20%): Are both similarities and differences noted? +- Clarity (15%): Is the comparison structure clear? + +--- + +### Requirement: Visual Reasoning Judge MUST evaluate logic with visual information +An LLM judge SHALL assess the quality of logical reasoning applied to visual problems (e.g., chess positions, puzzles, diagrams). + +#### Scenario: Evaluate visual reasoning correctness +Given an AI response solving a chess problem from an image +When evaluated by the reasoning-judge +Then the score SHALL reflect: +- Logical correctness (40%): Is reasoning sound? +- Visual understanding (30%): Is visual perception accurate? +- Problem-solving quality (20%): Is the solution approach appropriate? +- Explanation quality (10%): Is reasoning clearly explained? + +--- + +### Requirement: Structured Output Judge MUST validate vision-based JSON +An LLM judge SHALL assess the quality of structured JSON outputs from vision analysis tasks. + +#### Scenario: Evaluate JSON structure from vision task +Given an AI response with JSON object detection results +When evaluated by the structured-output-judge +Then the score SHALL reflect: +- JSON validity (30%): Is it parseable JSON? +- Schema compliance (35%): Does it match requested structure? +- Data accuracy (25%): Are values from image accurate? 
+- Completeness (10%): Are all relevant elements captured? + +--- + +### Requirement: Quality Assessment Judge MUST evaluate image quality analysis +An LLM judge SHALL assess the completeness and accuracy of image quality assessments (technical, compositional, aesthetic). + +#### Scenario: Evaluate quality assessment completeness +Given an AI response rating an image's quality +When evaluated by the quality-assessment-judge +Then the score SHALL reflect: +- Technical completeness (30%): Sharpness, exposure, noise discussed? +- Compositional analysis (25%): Rule of thirds, balance, framing? +- Aesthetic evaluation (20%): Color, mood, style assessed? +- Overall judgment (15%): Score provided with justification? +- Professional tone (10%): Objective and uses appropriate terminology? + +--- + +### Requirement: Object Count Validator MUST verify numeric accuracy +A code-based validator SHALL extract and verify object counts from AI responses against expected values. + +#### Scenario: Validate object count accuracy +Given an AI response stating "5 bottles" and expected output "5 bottles" +When evaluated by count_validator.py +Then the score SHALL be 1.0 (100% accuracy). + +#### Scenario: Partial count matching +Given an AI response stating "5 bottles, 3 cans" and expected "5 bottles, 8 cans" +When evaluated by count_validator.py +Then the score SHALL be 0.5 (50% accuracy - one of two counts matched). + +--- + +### Requirement: OCR Validator MUST verify text extraction accuracy +A code-based validator SHALL compare extracted text from images against expected text using similarity and keyword matching. + +#### Scenario: Validate OCR text similarity +Given an AI response extracting "Project Proposal Q1 2026" and expected "Project Proposal Q1 2026" +When evaluated by ocr_validator.py +Then the score SHALL be >0.9 (high text similarity). + +#### Scenario: Validate keyword presence +Given an AI response mentioning keywords "budget, timeline, deliverables" +When evaluated by ocr_validator.py with expected keywords +Then the keyword accuracy SHALL be reflected in the score. + +--- + +### Requirement: JSON Structure Validator MUST verify structured outputs +A code-based validator SHALL validate that AI responses contain correctly structured JSON matching expected schemas. + +#### Scenario: Validate JSON structure and fields +Given an AI response with valid JSON containing expected fields +When evaluated by json_validator.py +Then the validation SHALL: +- Confirm JSON is parseable +- Verify schema compliance +- Check field presence and types +- Return score based on coverage + +#### Scenario: Detect schema violations +Given an AI response with JSON missing required fields +When evaluated by json_validator.py +Then the validation SHALL identify missing fields in `details.missing_keys`. + +--- + +### Requirement: Chart Data Validator MUST verify data extraction +A code-based validator SHALL extract and validate numeric data (currency, percentages, dates) from chart/graph descriptions. + +#### Scenario: Validate currency value extraction +Given an AI response stating "Q4: $2.4M" and expected "$2.4M" +When evaluated by chart_validator.py +Then the currency value SHALL be matched within 15% tolerance. + +#### Scenario: Validate percentage extraction +Given an AI response stating "58% growth" and expected "58%" +When evaluated by chart_validator.py +Then the percentage SHALL be matched exactly. 
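+
+To make the tolerance rule above concrete, here is a minimal sketch of the matching logic, written in TypeScript purely for illustration (the validator itself is specified as a Python script) and assuming numeric values have already been extracted from the response text.
+
+```typescript
+// Sketch only: relative-tolerance match for extracted chart values.
+// Currency values pass within ±15% by default; percentages use tolerance 0.
+function matchesWithTolerance(actual: number, expected: number, tolerance = 0.15): boolean {
+  if (expected === 0) return actual === 0;
+  return Math.abs(actual - expected) / Math.abs(expected) <= tolerance;
+}
+
+matchesWithTolerance(2.3e6, 2.4e6); // true  (~4.2% off, within 15%)
+matchesWithTolerance(52, 58, 0);    // false (percentages must match exactly)
+```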
+ +--- + +### Requirement: Code Validators MUST execute via uv run +Python code validators SHALL be executed using `uv run` command with evaluation data passed as JSON. + +#### Scenario: Execute Python validator with JSON input +Given a code validator script `count_validator.py` +When executed with eval data `{"output": "5 objects", "expected_output": "5 objects"}` +Then the validator SHALL: +- Receive data via stdin or command-line argument +- Process the data +- Return JSON result via stdout +- Exit with code 0 for passed, 1 for failed + +#### Scenario: Handle validator timeouts +Given a code validator that runs longer than 30 seconds +When executed +Then the system SHALL terminate the validator and report a timeout error. + +--- + +### Requirement: Evaluator Results MUST follow standard format +All evaluators (LLM judges and code validators) SHALL return results in a consistent format for scoring. + +#### Scenario: Standard result format +Given any evaluator completing evaluation +When the result is returned +Then it SHALL include: +```typescript +{ + status: 'processed' | 'error' | 'skipped', + score: number, // 0.0 to 1.0 + passed: boolean, + details: { + // Evaluator-specific details + } +} +``` + +--- + +## Cross-References + +**Related Capabilities:** +- `vision-input` - Provides the images to evaluate +- `evaluation` - Base evaluation framework +- `rubric-evaluator` - Similar pattern for LLM judges +- `eval-execution` - Executes evaluators during eval runs + +**Dependencies:** +- Requires `vision-input` to be implemented first +- Extends existing evaluator patterns from `rubric-evaluator` + +--- + +## Implementation Notes + +### LLM Judge File Structure +``` +evaluators/llm-judges/ +├── image-description-judge.md +├── activity-judge.md +├── comparison-judge.md +├── reasoning-judge.md +├── structured-output-judge.md +└── quality-assessment-judge.md +``` + +### Code Validator File Structure +``` +evaluators/code-validators/ +├── count_validator.py +├── ocr_validator.py +├── json_validator.py +└── chart_validator.py +``` + +### Judge Prompt Template Variables +- `{{input}}` - User's question/input +- `{{output}}` - AI's response +- `{{expected_output}}` - Expected response +- `{{image_reference}}` - Reference to image(s) +- `{{image_references}}` - Array of image references (for multi-image) + +### Code Validator Interface +```python +def validate( + output: str, + expected_output: str, + input_text: str = "", + **kwargs +) -> Dict[str, Any]: + return { + "status": "processed", + "score": 0.85, + "passed": True, + "details": {...} + } +``` + +### Scoring Dimension Weights + +**Image Description**: +- Visual Accuracy: 40% +- Completeness: 30% +- Clarity: 20% +- Relevance: 10% + +**Activity Recognition**: +- Activity Identification: 35% +- Accuracy: 35% +- Detail Level: 20% +- Inference Quality: 10% + +**Visual Reasoning**: +- Logical Correctness: 40% +- Visual Understanding: 30% +- Problem-Solving: 20% +- Explanation: 10% + +**Comparison**: +- Change Detection: 40% +- Spatial Precision: 25% +- Completeness: 20% +- Clarity: 15% + +--- + +## Future Enhancements (Out of Scope) +- Computer vision metric evaluators (SSIM, perceptual hash, CLIP similarity) +- Specialized domain evaluators (medical imaging, document understanding, face detection) +- Multi-sample evaluation automation (run judges 3-5 times, aggregate scores) +- Confidence calibration evaluators +- Adversarial image testing diff --git a/openspec/changes/add-vision-evaluation/specs/vision-input/spec.md 
b/openspec/changes/add-vision-evaluation/specs/vision-input/spec.md new file mode 100644 index 00000000..3437e09c --- /dev/null +++ b/openspec/changes/add-vision-evaluation/specs/vision-input/spec.md @@ -0,0 +1,248 @@ +# vision-input Specification + +## Purpose +Enable AgentV to accept image inputs in evaluation test cases, supporting local files, URLs, and base64 data URIs. This capability allows testing of vision-capable AI agents with multimodal (text + image) inputs. + +## ADDED Requirements + +### Requirement: Image Content Type MUST be supported in messages +The YAML schema and message structure SHALL support `type: image` content items alongside text content, allowing images to be included in evaluation input messages. + +#### Scenario: Parse image content from local file +Given an eval YAML file with: +```yaml +input_messages: + - role: user + content: + - type: text + value: "Describe this image" + - type: image + value: ./test-images/photo.jpg + detail: high +``` +When parsed by the eval loader +Then the message SHALL contain an `ImageContentItem` with `value: "./test-images/photo.jpg"` and `detail: "high"`. + +#### Scenario: Parse image content from URL +Given an eval YAML file with: +```yaml +input_messages: + - role: user + content: + - type: image_url + value: https://example.com/image.jpg +``` +When parsed by the eval loader +Then the message SHALL contain an `ImageContentItem` with `value: "https://example.com/image.jpg"`. + +#### Scenario: Parse image content from base64 data URI +Given an eval YAML file with: +```yaml +input_messages: + - role: user + content: + - type: image + value: data:image/jpeg;base64,/9j/4AAQSkZJRg... +``` +When parsed by the eval loader +Then the message SHALL contain an `ImageContentItem` with the full data URI as the value. + +--- + +### Requirement: Image Detail Level MUST be configurable +The image content item SHALL support an optional `detail` parameter to control the resolution/quality trade-off for vision models. + +#### Scenario: Specify low detail for cost optimization +Given an image content item with `detail: low` +When passed to a vision provider +Then the provider SHALL receive the `low` detail parameter, resulting in ~85 tokens per image. + +#### Scenario: Specify high detail for complex analysis +Given an image content item with `detail: high` +When passed to a vision provider +Then the provider SHALL receive the `high` detail parameter, resulting in ~765-1360 tokens per image. + +#### Scenario: Use auto detail for automatic selection +Given an image content item with `detail: auto` +When passed to a vision provider +Then the provider SHALL receive the `auto` detail parameter, allowing the model to choose based on the task. + +#### Scenario: Default to high detail when not specified +Given an image content item without a `detail` parameter +When passed to a vision provider +Then the provider SHALL use `high` detail by default. + +--- + +### Requirement: MIME Type Detection MUST be automatic with manual override +The system SHALL automatically detect image MIME types from file extensions or content, while allowing explicit specification for edge cases. + +#### Scenario: Detect MIME type from file extension +Given an image with path `./photo.jpg` +When loading the image +Then the MIME type SHALL be detected as `image/jpeg`. + +#### Scenario: Detect MIME type from data URI +Given a data URI `data:image/png;base64,...` +When parsing the URI +Then the MIME type SHALL be extracted as `image/png`. 
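+
+A minimal sketch of how these two detection paths could be combined is shown below; the helper name and lookup table are illustrative assumptions, not the final loader API.
+
+```typescript
+// Sketch only: detect a MIME type from a data URI prefix or a file extension.
+const EXTENSION_MIME: Record<string, string> = {
+  ".jpg": "image/jpeg",
+  ".jpeg": "image/jpeg",
+  ".png": "image/png",
+  ".webp": "image/webp",
+  ".gif": "image/gif",
+  ".bmp": "image/bmp",
+};
+
+function detectMimeType(source: string): string | undefined {
+  const dataUri = /^data:([^;,]+)[;,]/.exec(source);
+  if (dataUri) return dataUri[1]; // "data:image/png;base64,..." -> "image/png"
+  const dot = source.lastIndexOf(".");
+  if (dot === -1) return undefined;
+  return EXTENSION_MIME[source.slice(dot).toLowerCase()]; // "./photo.jpg" -> "image/jpeg"
+}
+```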
+ +#### Scenario: Override MIME type explicitly +Given an image content item with: +```yaml +type: image +value: ./file.img +mimeType: image/webp +``` +When loading the image +Then the MIME type SHALL be `image/webp` as specified. + +--- + +### Requirement: Image Loading MUST support multiple sources +The system SHALL load images from local file paths, HTTP/HTTPS URLs, and base64-encoded data URIs. + +#### Scenario: Load image from local file system +Given an image path `./test-images/sample.jpg` that exists +When loading the image +Then the image file SHALL be read into a Buffer successfully. + +#### Scenario: Load image from HTTP URL +Given an image URL `https://example.com/image.png` +When loading the image +Then the image SHALL be fetched via HTTP and loaded into a Buffer. + +#### Scenario: Parse base64 data URI +Given a data URI `data:image/jpeg;base64,/9j/4AAQ...` +When parsing the URI +Then the base64 data SHALL be decoded into a Buffer. + +#### Scenario: Reject invalid file paths +Given an image path `./nonexistent.jpg` that does not exist +When attempting to load the image +Then the system SHALL throw an error with message "Image file not found: ./nonexistent.jpg". + +#### Scenario: Reject invalid URLs +Given an invalid URL `https://invalid-domain-xyz/image.jpg` +When attempting to load the image +Then the system SHALL throw an error indicating the URL is unreachable. + +--- + +### Requirement: Image Validation MUST enforce size and format constraints +The system SHALL validate that images meet provider requirements for format, dimensions, and file size before attempting evaluation. + +#### Scenario: Validate supported image formats +Given an image with format JPEG, PNG, WEBP, GIF, or BMP +When validating the image +Then the image SHALL pass format validation. + +#### Scenario: Reject unsupported image formats +Given an image with format TIFF or SVG +When validating the image +Then the system SHALL throw an error "Unsupported image format: image/tiff". + +#### Scenario: Validate image dimensions +Given an image with dimensions 1920x1080 pixels +When validating the image +Then the image SHALL pass dimension validation (within 50x50 to 16,000x16,000 range). + +#### Scenario: Reject oversized images by dimensions +Given an image with dimensions 20,000x20,000 pixels +When validating the image +Then the system SHALL throw an error "Image dimensions exceed maximum: 16,000x16,000 pixels". + +#### Scenario: Reject oversized images by file size +Given an image file larger than 20MB +When validating the image +Then the system SHALL throw an error "Image file size exceeds maximum: 20MB". + +--- + +### Requirement: Multiple Images per Message MUST be supported +A single message content array SHALL support multiple image content items, allowing comparison and multi-image analysis tasks. + +#### Scenario: Include multiple images in one message +Given a message with content: +```yaml +content: + - type: text + value: "Compare these images" + - type: image + value: ./before.jpg + - type: image + value: ./after.jpg +``` +When parsed +Then the message SHALL contain 2 image content items in the correct order. + +--- + +### Requirement: Image Context MUST persist in multi-turn conversations +When an image is included in a message, it SHALL remain part of the conversation context for subsequent turns, following the `conversation_id` pattern. 
+ +#### Scenario: Maintain image context across conversation turns +Given an eval case with `conversation_id: vision-chat-001` containing an image in turn 1 +When loading turn 2 of the same conversation +Then the full conversation history including the image SHALL be available to the model. + +--- + +## Cross-References + +**Related Capabilities:** +- `yaml-schema` - Requires extension to parse image content types +- `vision-evaluators` - Depends on images being loaded and passed to evaluators +- `eval-execution` - Needs to handle image loading during eval runs +- `multiturn-messages-lm-provider` - Multi-turn conversations with images + +**Sequence:** +1. This capability (image input) must be implemented first +2. Then `vision-evaluators` can be implemented +3. Finally `vision-evaluation` examples can be used + +--- + +## Implementation Notes + +### TypeScript Type Definitions +```typescript +interface ImageContentItem { + type: 'image' | 'image_url'; + value: string; // file path, URL, or data URI + detail?: 'low' | 'high' | 'auto'; + mimeType?: string; +} + +type ContentItem = TextContentItem | ImageContentItem | FileContentItem; +``` + +### Image Loader Interface +```typescript +interface ImageLoader { + load(source: string): Promise; + detectMimeType(buffer: Buffer): string; + validate(buffer: Buffer): ValidationResult; +} +``` + +### Supported MIME Types +- `image/jpeg` +- `image/png` +- `image/webp` +- `image/gif` +- `image/bmp` + +### Size Constraints +- **Minimum**: 50x50 pixels +- **Maximum**: 16,000x16,000 pixels +- **File Size**: 20MB maximum + +--- + +## Future Enhancements (Out of Scope) +- Cloud storage URLs (gs://, s3://) +- Automatic image resizing/optimization +- Image caching to reduce redundant loads +- Progressive image loading +- Video input support diff --git a/openspec/changes/add-vision-evaluation/tasks.md b/openspec/changes/add-vision-evaluation/tasks.md new file mode 100644 index 00000000..7b6af483 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/tasks.md @@ -0,0 +1,610 @@ +# Implementation Tasks: Add Vision Evaluation + +## Overview +This document outlines the ordered tasks for implementing vision evaluation capabilities in AgentV. Tasks are organized to deliver user-visible progress incrementally while managing dependencies. + +## Task Dependency Graph +``` +Phase 1 (Foundation) +├─ T1: Reorganize files → T2, T3 +├─ T2: Schema extension → T4, T5 +└─ T3: Documentation → T14 + +Phase 2 (Core Implementation) +├─ T4: Image loaders → T5 +├─ T5: Provider integration → T6, T7 +├─ T6: LLM judges → T8, T9 +└─ T7: Code validators → T8, T9 + +Phase 3 (Testing & Validation) +├─ T8: Basic eval tests → T10 +├─ T9: Advanced eval tests → T10 +├─ T10: Provider compatibility → T11 +└─ T11: Cost analysis → T12 + +Phase 4 (Polish) +├─ T12: Performance optimization → T13 +├─ T13: Documentation review → T14 +└─ T14: Final validation +``` + +## Tasks + +### Phase 1: Foundation & Structure (Days 1-2) + +#### ✅ Task 1: Reorganize Vision Files into Self-Contained Structure +**Priority**: High +**Effort**: 1 day +**Dependencies**: None + +**Description**: Move vision evaluation files from `examples/features/evals/vision/` and `examples/features/evaluators/vision/` to a self-contained `examples/showcase/vision/` directory structure. + +**Actions**: +1. 
Create `examples/showcase/vision/` directory structure: + ``` + examples/showcase/vision/ + ├── .agentv/ + │ ├── config.yaml + │ └── targets.yaml + ├── datasets/ + │ ├── basic-image-analysis.yaml + │ └── advanced-vision-tasks.yaml + ├── evaluators/ + │ ├── llm-judges/ + │ │ ├── image-description-judge.md + │ │ ├── activity-judge.md + │ │ ├── comparison-judge.md + │ │ ├── reasoning-judge.md + │ │ ├── structured-output-judge.md + │ │ └── quality-assessment-judge.md + │ └── code-validators/ + │ ├── count_validator.py + │ ├── ocr_validator.py + │ ├── json_validator.py + │ └── chart_validator.py + ├── test-images/ + │ └── .gitkeep (users provide their own images) + └── README.md + ``` + +2. Move all existing vision files to new structure +3. Update all relative paths in YAML files to reference new evaluator locations +4. Update documentation paths +5. Delete old `examples/features/evals/vision/` and `examples/features/evaluators/vision/` directories + +**Validation**: +- [ ] All files exist in new location +- [ ] No broken relative paths in YAML files +- [ ] Documentation links updated +- [ ] Old directories removed + +**User-Visible**: Clear, self-contained vision examples directory + +--- + +#### Task 2: Extend YAML Schema for Image Content Types +**Priority**: High +**Effort**: 2 days +**Dependencies**: None +**Blocks**: T4, T5 + +**Description**: Extend the existing YAML schema and TypeScript types to support image content in messages. + +**Actions**: +1. Add `ImageContentItem` type to content union: + ```typescript + type ContentItem = TextContentItem | ImageContentItem | FileContentItem; + + interface ImageContentItem { + type: 'image'; + value: string; // path, URL, or data URI + detail?: 'low' | 'high' | 'auto'; + mimeType?: string; + } + + interface ImageURLContentItem { + type: 'image_url'; + value: string; // URL only + detail?: 'low' | 'high' | 'auto'; + } + ``` + +2. Update YAML parser to recognize `type: image` and `type: image_url` +3. Add Zod validation schema for image content +4. Update TypeScript interfaces in core package +5. Add schema documentation + +**Validation**: +- [ ] TypeScript types compile without errors +- [ ] Zod schema validates image content correctly +- [ ] YAML parser recognizes image types +- [ ] Unit tests for schema parsing pass +- [ ] Invalid image content rejected with clear errors + +**User-Visible**: Can write YAML evals with image content + +--- + +#### Task 3: Create Configuration Files +**Priority**: Medium +**Effort**: 0.5 days +**Dependencies**: T1 + +**Description**: Create `.agentv/` configuration files for the vision examples directory. + +**Actions**: +1. Create `.agentv/config.yaml`: + ```yaml + version: "1.0" + evalsDir: ./evals + evaluatorsDir: ./evaluators + ``` + +2. 
Create `.agentv/targets.yaml` with vision-capable models: + ```yaml + targets: + default: + provider: openai + model: gpt-4o + apiKey: ${OPENAI_API_KEY} + + claude-vision: + provider: anthropic + model: claude-3-5-sonnet-20241022 + apiKey: ${ANTHROPIC_API_KEY} + + gemini-vision: + provider: google + model: gemini-2.5-flash + apiKey: ${GOOGLE_GENERATIVE_AI_API_KEY} + ``` + +**Validation**: +- [ ] Config files parse successfully +- [ ] Targets reference vision-capable models +- [ ] Environment variables documented + +**User-Visible**: Easy configuration for vision models + +--- + +### Phase 2: Core Implementation (Days 3-6) + +#### Task 4: Implement Image Loaders +**Priority**: High +**Effort**: 2 days +**Dependencies**: T2 +**Blocks**: T5 + +**Description**: Implement utilities to load images from various sources and convert to appropriate formats for LLM providers. + +**Actions**: +1. Create `packages/core/src/vision/imageLoader.ts`: + - `loadImageFromFile(path: string): Promise` + - `loadImageFromURL(url: string): Promise` + - `parseDataURI(uri: string): Buffer` + - `detectMimeType(buffer: Buffer): string` + - `validateImageFormat(buffer: Buffer): boolean` + +2. Create `packages/core/src/vision/imageConverter.ts`: + - `bufferToBase64(buffer: Buffer): string` + - `createDataURI(base64: string, mimeType: string): string` + - `resizeIfNeeded(buffer: Buffer, maxDim: number): Promise` + +3. Add error handling: + - File not found + - Invalid URL + - Unsupported format + - File too large (>20MB) + - Image dimensions out of range + +4. Add unit tests for all loaders and converters + +**Validation**: +- [ ] Load local files successfully +- [ ] Load HTTP/HTTPS URLs successfully +- [ ] Parse base64 data URIs successfully +- [ ] Detect MIME types correctly (JPEG, PNG, WEBP, GIF) +- [ ] Validate image sizes and dimensions +- [ ] Error messages clear and actionable +- [ ] Unit test coverage >90% + +**User-Visible**: Reliable image loading from multiple sources + +--- + +#### Task 5: Integrate Image Support in Provider Clients +**Priority**: High +**Effort**: 3 days +**Dependencies**: T2, T4 +**Blocks**: T6, T7 + +**Description**: Update LLM provider clients (OpenAI, Anthropic, Google) to pass image content correctly. + +**Actions**: +1. Update `packages/core/src/providers/openai.ts`: + - Handle `ImageContentItem` in message content + - Convert to OpenAI's `image_url` format + - Support `detail` parameter + - Pass base64 data URIs + +2. Update `packages/core/src/providers/anthropic.ts`: + - Handle `ImageContentItem` in message content + - Convert to Anthropic's image format + - Support `source` with base64 data + +3. Update `packages/core/src/providers/google.ts`: + - Handle `ImageContentItem` in message content + - Convert to Gemini's `inlineData` format + - Support both URL and base64 + +4. Add integration tests with real models (optional, can use mocks) + +5. 
Document provider-specific limitations + +**Validation**: +- [ ] OpenAI provider accepts images correctly +- [ ] Anthropic provider accepts images correctly +- [ ] Google provider accepts images correctly +- [ ] Detail levels passed correctly +- [ ] Error handling for unsupported formats +- [ ] Integration tests pass (or mocked tests) + +**User-Visible**: Can run evals with images on all major providers + +--- + +#### Task 6: Implement LLM Judge Runner for Vision +**Priority**: High +**Effort**: 2 days +**Dependencies**: T5 +**Blocks**: T8, T9 + +**Description**: Enable LLM judges to evaluate vision tasks by passing image context to judge models. + +**Actions**: +1. Update judge prompt renderer to include image references: + ```typescript + renderJudgePrompt( + judgeTemplate: string, + input: ContentItem[], + output: string, + expected: string, + imageReferences?: string[] + ): string + ``` + +2. Modify LLM judge execution to: + - Load judge prompt from `.md` file + - Substitute placeholders (input, output, expected, image_reference) + - Call judge model with vision capability + - Parse structured JSON response + +3. Add support for multi-image judging + +4. Add unit tests for judge rendering and execution + +**Validation**: +- [ ] Judge prompts load correctly +- [ ] Image references passed to judge model +- [ ] JSON responses parsed successfully +- [ ] Scoring dimensions extracted +- [ ] Error handling for invalid judge outputs +- [ ] Unit tests pass + +**User-Visible**: LLM judges can evaluate image-based responses + +--- + +#### Task 7: Implement Code Validator Runner +**Priority**: High +**Effort**: 2 days +**Dependencies**: T5 +**Blocks**: T8, T9 + +**Description**: Create runner for Python-based code validators that perform objective evaluation. + +**Actions**: +1. Create `packages/core/src/evaluators/codeValidatorRunner.ts`: + - `runPythonValidator(scriptPath: string, evalData: EvalData): Promise` + - Use `uv run` to execute Python scripts + - Pass eval data as JSON via stdin or args + - Parse JSON result from stdout + - Handle Python errors gracefully + +2. Create standard interface for validator results: + ```typescript + interface ValidationResult { + status: 'processed' | 'error' | 'skipped'; + score: number; + passed: boolean; + details: Record; + } + ``` + +3. Add timeout handling (30s default) + +4. Add unit tests with mock Python scripts + +**Validation**: +- [ ] Python validators execute successfully +- [ ] JSON data passed correctly +- [ ] Results parsed correctly +- [ ] Timeouts handled +- [ ] Python errors reported clearly +- [ ] Unit tests pass + +**User-Visible**: Objective code validators work reliably + +--- + +### Phase 3: Testing & Validation (Days 7-10) + +#### Task 8: Test Basic Image Analysis Evals +**Priority**: High +**Effort**: 2 days +**Dependencies**: T6, T7 +**Blocks**: T10 + +**Description**: Run all 7 basic eval cases from `basic-image-analysis.yaml` and validate results. + +**Actions**: +1. Create sample test images (or use placeholder URLs) +2. Run each eval case: + - simple-image-description + - object-detection-simple + - spatial-relationships + - text-extraction-ocr + - multi-image-comparison + - color-identification + - image-from-url + +3. Verify evaluators run successfully +4. Check score outputs are reasonable +5. Document any issues or edge cases +6. 
Create test fixtures for automated testing + +**Validation**: +- [ ] All 7 eval cases execute without errors +- [ ] LLM judges return valid scores +- [ ] Code validators return valid scores +- [ ] Results documented +- [ ] Test fixtures created + +**User-Visible**: Basic vision evals work end-to-end + +--- + +#### Task 9: Test Advanced Vision Tasks Evals +**Priority**: High +**Effort**: 2 days +**Dependencies**: T6, T7 +**Blocks**: T10 + +**Description**: Run all 7 advanced eval cases from `advanced-vision-tasks.yaml` and validate results. + +**Actions**: +1. Create additional test images for complex scenarios +2. Run each eval case: + - structured-object-detection + - visual-reasoning-problem + - multi-turn-image-discussion (parts 1 & 2) + - image-quality-assessment + - chart-data-extraction + - scene-context-inference + - instruction-following-with-image + +3. Verify structured outputs +4. Test multi-turn conversations maintain context +5. Validate complex evaluators +6. Document performance and cost metrics + +**Validation**: +- [ ] All 7 eval cases execute without errors +- [ ] Structured outputs parse correctly +- [ ] Multi-turn context maintained +- [ ] Complex judges work accurately +- [ ] Performance metrics collected +- [ ] Cost estimates documented + +**User-Visible**: Advanced vision evals work end-to-end + +--- + +#### Task 10: Provider Compatibility Testing +**Priority**: High +**Effort**: 2 days +**Dependencies**: T8, T9 +**Blocks**: T11 + +**Description**: Test vision evals across all major providers to ensure compatibility. + +**Actions**: +1. Run basic evals on: + - OpenAI GPT-4o + - Anthropic Claude 3.5 Sonnet + - Google Gemini 2.5 Flash + +2. Compare results across providers +3. Document provider-specific behaviors +4. Identify and document limitations +5. Create provider compatibility matrix + +**Validation**: +- [ ] All providers execute vision evals +- [ ] Results comparable across providers +- [ ] Limitations documented +- [ ] Compatibility matrix created +- [ ] Errors handled gracefully + +**User-Visible**: Works reliably across all major providers + +--- + +#### Task 11: Cost Analysis & Optimization +**Priority**: Medium +**Effort**: 1 day +**Dependencies**: T10 +**Blocks**: T12 + +**Description**: Analyze token costs for vision evals and document optimization strategies. + +**Actions**: +1. Measure token usage for: + - Different image sizes + - Detail levels (low, high, auto) + - Different providers + +2. Calculate cost per eval case +3. Document cost optimization strategies: + - Use `detail: low` for simple tasks + - Use Gemini Flash for development + - Cache image descriptions + - Use code validators when possible + +4. Create cost estimation guide +5. Add cost warnings to documentation + +**Validation**: +- [ ] Token usage measured for various scenarios +- [ ] Cost per eval documented +- [ ] Optimization strategies validated +- [ ] Cost guide created +- [ ] Warnings added to docs + +**User-Visible**: Clear understanding of costs and how to optimize + +--- + +### Phase 4: Polish & Documentation (Days 11-14) + +#### Task 12: Performance Optimization +**Priority**: Medium +**Effort**: 2 days +**Dependencies**: T11 +**Blocks**: T13 + +**Description**: Optimize image loading, processing, and evaluation performance. + +**Actions**: +1. Profile image loading times +2. Implement caching for loaded images +3. Add image dimension limits to prevent oversized loads +4. Optimize base64 conversions +5. Parallelize independent evaluators +6. 
Add progress tracking for batch evals + +**Validation**: +- [ ] Average eval latency <2s (excluding LLM calls) +- [ ] Image loading cached appropriately +- [ ] Large images handled efficiently +- [ ] Parallel execution works correctly +- [ ] Progress reporting functional + +**User-Visible**: Fast, responsive evaluation experience + +--- + +#### Task 13: Documentation Review & Enhancement +**Priority**: High +**Effort**: 2 days +**Dependencies**: T12 +**Blocks**: T14 + +**Description**: Review and enhance all vision evaluation documentation. + +**Actions**: +1. Review and update `examples/vision/README.md`: + - Add getting started section + - Update usage examples + - Add troubleshooting section + - Include provider setup instructions + +2. Review and update `examples/vision/INDEX.md`: + - Ensure all examples listed + - Update cost estimates + - Add quick reference tables + +3. Update `docs/updates/VISION_EVAL_RESEARCH_SUMMARY.md`: + - Add implementation notes + - Update status of completed work + +4. Create migration guide if needed +5. Add inline code comments +6. Create video tutorial (optional) + +**Validation**: +- [ ] README comprehensive and accurate +- [ ] INDEX up-to-date +- [ ] Research summary reflects implementation +- [ ] Code well-commented +- [ ] No broken links or references + +**User-Visible**: Excellent documentation for vision evaluation + +--- + +#### Task 14: Final Validation & Release Prep +**Priority**: High +**Effort**: 1 day +**Dependencies**: T13 + +**Description**: Final validation before marking the change as complete. + +**Actions**: +1. Run OpenSpec validation: + ```bash + npx @fission-ai/openspec validate add-vision-evaluation --strict + ``` + +2. Run full test suite: + ```bash + bun test + ``` + +3. Run end-to-end eval tests: + ```bash + agentv run examples/showcase/vision/datasets/basic-image-analysis.yaml + agentv run examples/showcase/vision/datasets/advanced-vision-tasks.yaml + ``` + +4. Create changelog entry +5. Update version in package.json +6. Tag release (if applicable) + +**Validation**: +- [ ] OpenSpec validation passes +- [ ] All unit tests pass +- [ ] All integration tests pass +- [ ] End-to-end evals work +- [ ] Changelog updated +- [ ] Version bumped + +**User-Visible**: Production-ready vision evaluation feature + +--- + +## Summary + +**Total Estimated Effort**: 21 days (3-4 weeks with parallelization) + +**Critical Path**: T1 → T2 → T4 → T5 → T6 → T8 → T10 → T11 → T12 → T13 → T14 + +**Parallelizable Work**: +- T3 can run parallel to T2 +- T6 and T7 can run in parallel after T5 +- T8 and T9 can run in parallel +- Documentation tasks can be done incrementally + +**Key Milestones**: +1. Day 2: Schema extended, files reorganized +2. Day 6: Core implementation complete +3. Day 10: All tests passing +4. Day 14: Production ready + +**Success Metrics**: +- All 14 eval cases working +- 3+ providers supported +- Documentation complete +- >90% test coverage +- <2s avg eval latency