diff --git a/examples/showcase/vision/.agentv/config.yaml b/examples/showcase/vision/.agentv/config.yaml new file mode 100644 index 00000000..d5bffc8c --- /dev/null +++ b/examples/showcase/vision/.agentv/config.yaml @@ -0,0 +1,22 @@ +# AgentV Configuration for Vision Examples +# This configuration specifies directories and settings for vision evaluation examples + +# Directory containing evaluation YAML files +evalsDir: ./datasets + +# Directory containing evaluator definitions (LLM judges and code validators) +evaluatorsDir: ./evaluators + +# Test images directory (users should place test images here) +testImagesDir: ./test-images + +# Default settings for vision evaluations +defaults: + # Default model target (can be overridden in YAML) + target: openai-gpt4o + + # Default image detail level + imageDetail: high + + # Timeout for vision model calls (in milliseconds) + timeout: 60000 diff --git a/examples/showcase/vision/.agentv/targets.yaml b/examples/showcase/vision/.agentv/targets.yaml new file mode 100644 index 00000000..9e0faa3c --- /dev/null +++ b/examples/showcase/vision/.agentv/targets.yaml @@ -0,0 +1,39 @@ +# Target Model Configurations for Vision Examples +# Defines available vision-capable models for evaluation + +targets: + # OpenAI GPT-4o (default, recommended for vision tasks) + openai-gpt4o: + provider: openai + model: gpt-4o + apiKey: ${OPENAI_API_KEY} + supportsVision: true + costPer1kImages: + low: 42.50 # $85/1M tokens * 0.5K tokens/image + high: 102.00 # $85/1M tokens * 1.2K tokens/image + auto: 72.25 # Average + + # Anthropic Claude 3.5 Sonnet + anthropic-claude: + provider: anthropic + model: claude-3-5-sonnet-20241022 + apiKey: ${ANTHROPIC_API_KEY} + supportsVision: true + costPer1kImages: + low: 120.00 # $3/1M tokens * 40K base + 0.5K image + high: 216.00 # $3/1M tokens * 40K base + 1.2K image + auto: 168.00 # Average + + # Google Gemini 2.5 Flash + google-gemini: + provider: google + model: gemini-2.0-flash-exp + apiKey: ${GOOGLE_API_KEY} + supportsVision: true + costPer1kImages: + low: 1.88 # $0.075/1M tokens * 25K base + 0.5K image + high: 2.26 # $0.075/1M tokens * 25K base + 1.2K image + auto: 2.07 # Average (most cost-effective) + +# Default target +default: openai-gpt4o diff --git a/examples/showcase/vision/README.md b/examples/showcase/vision/README.md new file mode 100644 index 00000000..8654ceda --- /dev/null +++ b/examples/showcase/vision/README.md @@ -0,0 +1,394 @@ +# Vision Evaluation Examples + +This directory contains example evaluation files for testing AI agents with vision/image capabilities. 
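For orientation, the snippet below is a minimal sketch of the eval-case shape used throughout this directory (the id is illustrative; the image file and judge prompt are the ones shipped with these examples, and the relative paths follow the convention used from the `datasets/` directory):

```yaml
$schema: agentv-eval-v2
description: Minimal vision eval case (sketch)

target: default

evalcases:
  - id: minimal-image-description          # illustrative id
    expected_outcome: Assistant describes the main objects in the image
    input_messages:
      - role: user
        content:
          - type: text
            value: "Describe this image."
          - type: image
            value: ./test-images/sample-office.jpg   # place the file under test-images/ first
            detail: low                              # low keeps token cost down for simple tasks
    execution:
      evaluators:
        - name: content_accuracy
          type: llm_judge
          prompt: ../evaluators/llm-judges/image-description-judge.md
```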
+ +## Overview + +Vision evaluation in AgentV extends the standard eval framework to support: +- Image inputs (local files and URLs) +- Multi-image comparisons +- Vision-specific evaluators (both LLM judges and code validators) +- Structured outputs from vision tasks +- Multi-turn conversations with visual context + +## Quick Start + +### Basic Image Analysis + +```bash +# From examples/showcase/vision/ directory +agentv run datasets/basic-image-analysis.yaml + +# Or from repository root +agentv run examples/showcase/vision/datasets/basic-image-analysis.yaml +``` + +### Advanced Vision Tasks + +```bash +# From examples/showcase/vision/ directory +agentv run datasets/advanced-vision-tasks.yaml + +# Or from repository root +agentv run examples/showcase/vision/datasets/advanced-vision-tasks.yaml +``` + +## Image Input Formats + +### Local File Reference + +```yaml +- type: image + value: ./test-images/sample-office.jpg + detail: high # Options: low, high, auto +``` + +### Image URL + +```yaml +- type: image_url + value: https://example.com/image.jpg +``` + +### Data URI (Base64) + +```yaml +- type: image + value: data:image/jpeg;base64,/9j/4AAQSkZJRg... +``` + +## Evaluation Files + +### datasets/basic-image-analysis.yaml + +Demonstrates fundamental vision capabilities: +- **Simple image description** - Basic captioning +- **Object detection** - Counting and identifying objects +- **Spatial reasoning** - Understanding positions and layouts +- **Text extraction (OCR)** - Reading text from images +- **Image comparison** - Analyzing changes between images +- **Color analysis** - Identifying colors and schemes +- **URL loading** - Loading images from web URLs + +### datasets/advanced-vision-tasks.yaml + +Demonstrates complex vision scenarios: +- **Structured JSON output** - Vision data as JSON +- **Visual reasoning** - Logic applied to visual information (e.g., chess) +- **Multi-turn conversations** - Context maintained across turns +- **Image quality assessment** - Technical and aesthetic evaluation +- **Chart/graph analysis** - Data extraction from visualizations +- **Scene understanding** - Contextual inference +- **Instruction following** - Complex tasks with visual reference + +## Evaluators + +### LLM-Based Judges + +Located in `evaluators/llm-judges/`: + +1. **image-description-judge.md** + - Evaluates description accuracy and completeness + - Dimensions: Visual Accuracy (40%), Completeness (30%), Clarity (20%), Relevance (10%) + - Detects hallucinations and missing elements + +2. **activity-judge.md** + - Evaluates activity and action recognition + - Assesses people counting, pose recognition, interaction understanding + +3. **comparison-judge.md** + - Evaluates multi-image comparison quality + - Tests change detection, spatial precision, completeness + +4. **reasoning-judge.md** + - Evaluates logical reasoning with visual information + - Tests visual understanding, problem-solving, explanation quality + - Supports multiple reasoning types (spatial, logical, quantitative) + +### Code-Based Validators + +Located in `evaluators/code-validators/`: + +1. **count_validator.py** + - Validates object counts in responses + - Extracts numbers and matches against expected counts + - Usage: `uv run count_validator.py` + +2. **ocr_validator.py** + - Validates text extraction accuracy + - Uses text similarity and keyword matching + - Configurable threshold (default: 70%) + +3. 
**json_validator.py** + - Validates structured JSON outputs from vision + - Schema inference from expected output + - Checks field presence and types + +4. **chart_validator.py** + - Validates data extraction from charts/graphs + - Extracts currency values, percentages, quarters + - Tolerance-based numeric validation (default: 15%) + +## Best Practices from Research + +### Context Engineering (from Agent-Skills research) + +1. **Progressive Disclosure** + - Load image metadata first (50 tokens) + - Then descriptions (100 tokens) + - Finally full image (765-1360 tokens) + +2. **Token Budgeting** + - Small image (512x512): ~765 tokens + - Large image (2048x2048): ~1360 tokens + - Budget context at 70-80% utilization + +3. **File System State** + - Store images and analyses as files + - Pass file references in context, not image data + +### Evaluation Patterns (from Google ADK) + +1. **Multi-Sample Evaluation** + - Run evaluators 5 times for reliability + - Use vision-capable judge models (GPT-4V, Claude) + +2. **Rubric-Based Grading** + - Define clear success criteria + - Weight dimensions appropriately + - Support partial credit + +### Input Handling (from Mastra & Azure SDK) + +1. **Flexible Image Sources** + - Local files: `./images/photo.jpg` + - HTTP URLs: `https://...` + - Cloud storage: `gs://...` or `s3://...` + - Data URIs: `data:image/jpeg;base64,...` + +2. **MIME Type Specification** + - Always include for better compatibility + - Common types: `image/jpeg`, `image/png`, `image/webp` + +3. **Detail Level Control** + - `low`: Faster, cheaper, less detail + - `high`: Slower, more expensive, more detail + - `auto`: Let model decide + +## Creating Test Images + +For local testing, place test images in `test-images/` directory. See `test-images/README.md` for detailed guidance on: +- Required test images for each eval case +- Image format and size requirements +- Alternative URL-based approaches +- Sources for obtaining test images + +### Example Test Images Structure + +```bash +examples/showcase/vision/test-images/ +├── README.md (detailed instructions) +├── .gitkeep +├── sample-office.jpg +├── objects-scene.jpg +├── spatial-layout.jpg +├── text-document.jpg +├── comparison-before.jpg +├── comparison-after.jpg +├── colorful-scene.jpg +├── street-scene.jpg +├── chess-puzzle.jpg +├── activity-photo.jpg +├── quality-test.jpg +├── bar-chart.jpg +├── complex-scene.jpg +└── instruction-reference.jpg +``` + +### Image Requirements + +- **Formats**: JPEG, PNG, GIF, BMP, WEBP +- **Size limits**: + - Max: 20 MB, 16,000 x 16,000 pixels + - Min: 50 x 50 pixels +- **Best practices**: + - Use JPEG for photos + - Use PNG for screenshots, diagrams, text + - Optimize file size (aim for <5 MB) + - Ensure clear, well-lit images for OCR tasks + +## Multi-Turn Vision Conversations + +Example pattern for maintaining visual context: + +```yaml +- id: conversation-turn-1 + conversation_id: vision-convo-001 + input_messages: + - role: user + content: + - type: text + value: "What's in this image?" + - type: image + value: ./image.jpg + expected_messages: + - role: assistant + content: "Description of image..." + +- id: conversation-turn-2 + conversation_id: vision-convo-001 + input_messages: + # Include full conversation history + - role: user + content: + - type: text + value: "What's in this image?" + - type: image + value: ./image.jpg + - role: assistant + content: "Description of image..." 
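    # The follow-up turn below is text-only; the image context comes from the replayed history above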
+ - role: user + content: "Tell me more about the left side" + expected_messages: + - role: assistant + content: "Details about left side..." +``` + +## Evaluation Metrics + +### Dimension Weights (Recommended) + +Based on research from Google ADK and LangWatch: + +**Image Description**: +- Visual Accuracy: 40% +- Completeness: 30% +- Clarity: 20% +- Relevance: 10% + +**Activity Recognition**: +- Activity Identification: 35% +- Accuracy: 35% +- Detail Level: 20% +- Inference Quality: 10% + +**Visual Reasoning**: +- Logical Correctness: 40% +- Visual Understanding: 30% +- Problem-Solving Quality: 20% +- Explanation Quality: 10% + +**Image Comparison**: +- Change Detection Accuracy: 40% +- Spatial Precision: 25% +- Completeness: 20% +- Clarity: 15% + +### Scoring Thresholds + +- **0.9-1.0**: Excellent - Production ready +- **0.7-0.89**: Good - Minor improvements needed +- **0.5-0.69**: Acceptable - Significant gaps +- **0.3-0.49**: Poor - Major issues +- **0.0-0.29**: Failed - Not functional + +## Integration with AgentV Core + +### Required Model Capabilities + +Ensure your model supports vision: +- ✅ OpenAI: GPT-4o, GPT-4 Turbo with Vision +- ✅ Anthropic: Claude 3.5 Sonnet, Claude 3 Opus/Haiku +- ✅ Google: Gemini 2.5 Pro/Flash, Gemini 3 Pro +- ✅ Azure: GPT-4o via Azure OpenAI + +### Configuration + +Configure vision-capable models in `.agentv/targets.yaml`: + +```yaml +targets: + gpt4v: + provider: openai + model: gpt-4o + apiKey: ${OPENAI_API_KEY} + + claude-vision: + provider: anthropic + model: claude-3-5-sonnet-20241022 + apiKey: ${ANTHROPIC_API_KEY} + + gemini-vision: + provider: google + model: gemini-2.5-flash + apiKey: ${GOOGLE_API_KEY} +``` + +## Cost Considerations + +Vision API costs are significantly higher than text: + +| Provider | Model | Cost per Image* | Notes | +|----------|-------|----------------|-------| +| OpenAI | GPT-4o | $2.50-$5.00 / 1K images | Detail level affects cost | +| Anthropic | Claude 3.5 | $3.00-$6.00 / 1K images | Resolution-based pricing | +| Google | Gemini 2.5 Flash | $0.04-$0.15 / 1K images | Most cost-effective | + +*Estimates based on average image size and detail level + +### Cost Optimization Tips + +1. Use `detail: low` for simple tasks +2. Resize large images before sending +3. Use Gemini Flash for high-volume testing +4. Cache image descriptions for reuse +5. Use code validators when possible (free) + +## Future Enhancements + +Based on research findings, potential additions: + +1. **Computer Vision Metrics** + - SSIM (structural similarity) + - Perceptual hashing + - CLIP embeddings similarity + +2. **Specialized Evaluators** + - Face detection validation + - Logo recognition accuracy + - Medical image analysis + - Document understanding + +3. **Batch Processing** + - Parallel image evaluation + - Progress tracking + - Cost reporting + +4. **UI Integration** + - Visual diff tools + - Side-by-side comparisons + - Annotation overlays + +## References + +For detailed research findings and framework analysis, see: [Vision Evaluation Research Summary](../../openspec/changes/add-vision-evaluation/references/research-summary.md) + +Research sources consulted: + +1. **Google ADK Python** - Rubric-based evaluation, multimodal content handling +2. **Mastra** - TypeScript patterns, structured outputs, Braintrust integration +3. **Azure SDK** - Image input patterns, Computer Vision API +4. **LangWatch** - Evaluation architecture, batch processing +5. 
**Agent Skills** - Context engineering, progressive disclosure patterns + +## Support + +For issues or questions: +- Check existing eval examples +- Review evaluator documentation +- Consult AgentV core documentation +- Open GitHub issue with reproduction case + +## License + +Same as AgentV project license. diff --git a/examples/showcase/vision/datasets/advanced-vision-tasks.yaml b/examples/showcase/vision/datasets/advanced-vision-tasks.yaml new file mode 100644 index 00000000..5fc93115 --- /dev/null +++ b/examples/showcase/vision/datasets/advanced-vision-tasks.yaml @@ -0,0 +1,352 @@ +# Advanced Vision Evaluation Tasks +# Demonstrates complex multimodal scenarios and vision-language reasoning + +$schema: agentv-eval-v2 +description: Advanced vision tasks including reasoning, structured outputs, and multi-turn conversations + +target: default + +evalcases: + # ========================================== + # Example 1: Structured Output from Vision + # Tests JSON output with visual analysis + # ========================================== + - id: structured-object-detection + + expected_outcome: Assistant returns valid JSON with detected objects, positions, and confidence scores + + input_messages: + - role: system + content: |- + You are an object detection system that returns structured JSON output. + Always return valid JSON matching the requested schema. + + - role: user + content: + - type: text + value: |- + Analyze this image and return a JSON object with the following structure: + ```json + { + "objects": [ + {"name": "object_name", "count": 1, "position": "location", "confidence": 0.95} + ], + "scene": "scene_description", + "dominant_colors": ["color1", "color2"] + } + ``` + - type: image + value: ./test-images/product-shelf.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + ```json + { + "objects": [ + {"name": "bottle", "count": 5, "position": "top shelf", "confidence": 0.98}, + {"name": "can", "count": 8, "position": "middle shelf", "confidence": 0.95}, + {"name": "box", "count": 3, "position": "bottom shelf", "confidence": 0.92} + ], + "scene": "retail product shelf display", + "dominant_colors": ["white", "blue", "red"] + } + ``` + + execution: + evaluators: + - name: json_validation + type: code_judge + script: uv run json_validator.py + cwd: ../evaluators/code-validators + - name: object_accuracy + type: llm_judge + prompt: ../evaluators/llm-judges/structured-output-judge.md + + # ========================================== + # Example 2: Visual Reasoning + # Tests logical reasoning about visual content + # ========================================== + - id: visual-reasoning-problem + + conversation_id: vision-reasoning-001 + + expected_outcome: Assistant uses visual information to solve logical problem + + input_messages: + - role: system + content: You can analyze images and apply logical reasoning to solve problems. + + - role: user + content: + - type: text + value: |- + Look at this chess board position. + 1. Whose turn is it? (white or black) + 2. Is any piece under attack? + 3. What would be a good next move for white? + - type: image + value: ./test-images/chess-position.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Based on the chess position: + 1. It's White's turn (Black pieces just moved) + 2. Yes, Black's knight is attacking White's bishop on e4 + 3. 
Good move: Nf3 (knight to f3) - defends the bishop and controls center + + Alternative: Bd3 (bishop retreat) - saves piece but passive + + execution: + evaluators: + - name: reasoning_quality + type: llm_judge + prompt: ../evaluators/llm-judges/reasoning-judge.md + + # ========================================== + # Example 3: Multi-Turn Vision Conversation + # Tests maintaining context with images across turns + # ========================================== + - id: multi-turn-image-discussion-part1 + + conversation_id: vision-conversation-001 + + expected_outcome: Assistant provides initial analysis of image + + input_messages: + - role: system + content: You help analyze images in detail through multi-turn conversations. + + - role: user + content: + - type: text + value: "What are the main elements in this architecture diagram?" + - type: image + value: ./test-images/system-architecture.jpg + + expected_messages: + - role: assistant + content: |- + Main elements in the architecture: + 1. Client Layer - Web and Mobile apps + 2. API Gateway - Routing and authentication + 3. Service Layer - Microservices (User, Order, Payment) + 4. Database Layer - PostgreSQL and Redis + 5. Message Queue - RabbitMQ for async processing + + - id: multi-turn-image-discussion-part2 + + conversation_id: vision-conversation-001 + + expected_outcome: Assistant answers follow-up question referencing previous image context + + input_messages: + - role: system + content: You help analyze images in detail through multi-turn conversations. + + - role: user + content: + - type: text + value: "What are the main elements in this architecture diagram?" + - type: image + value: ./test-images/system-architecture.jpg + + - role: assistant + content: |- + Main elements in the architecture: + 1. Client Layer - Web and Mobile apps + 2. API Gateway - Routing and authentication + 3. Service Layer - Microservices (User, Order, Payment) + 4. Database Layer - PostgreSQL and Redis + 5. Message Queue - RabbitMQ for async processing + + - role: user + content: "What's the purpose of the message queue in this system?" + + expected_messages: + - role: assistant + content: |- + The RabbitMQ message queue serves several purposes: + 1. Decouples services for async communication + 2. Handles order processing without blocking API responses + 3. Manages payment notifications + 4. Provides retry mechanism for failed operations + 5. Enables horizontal scaling of service workers + + # ========================================== + # Example 4: Image Quality Assessment + # Tests aesthetic and technical quality evaluation + # ========================================== + - id: image-quality-assessment + + expected_outcome: Assistant evaluates technical and aesthetic quality of image + + input_messages: + - role: system + content: |- + You are an expert in image quality assessment. Evaluate images on: + - Technical quality (sharpness, exposure, noise) + - Composition (rule of thirds, balance, framing) + - Color grading + - Overall aesthetic appeal + + - role: user + content: + - type: text + value: "Evaluate the quality of this photograph. Rate it from 1-10 and explain your assessment." 
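          # detail: high keeps sharpness, noise, and highlight detail visible for the technical assessment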
+ - type: image + value: ./test-images/landscape-photo.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Quality Assessment: 8/10 + + Technical Quality: + - Sharpness: Excellent (9/10) - Sharp focus on foreground + - Exposure: Good (8/10) - Well balanced, slight highlight clipping + - Noise: Very good (8/10) - Minimal noise, clean image + + Composition: + - Rule of thirds: Well applied, horizon at lower third + - Balance: Good visual weight distribution + - Leading lines: Path creates depth + + Color & Aesthetic: + - Color grading: Warm golden hour tones + - Mood: Peaceful and inviting + - Overall: Professional quality landscape + + execution: + evaluators: + - name: assessment_completeness + type: llm_judge + prompt: ../evaluators/llm-judges/quality-assessment-judge.md + + # ========================================== + # Example 5: Chart/Graph Analysis + # Tests data visualization interpretation + # ========================================== + - id: chart-data-extraction + + expected_outcome: Assistant accurately extracts data and insights from chart + + input_messages: + - role: system + content: You can read and interpret charts, graphs, and data visualizations accurately. + + - role: user + content: + - type: text + value: |- + Analyze this bar chart: + 1. What are the top 3 values? + 2. What trend do you observe? + 3. What's the average value approximately? + - type: image + value: ./test-images/sales-chart.jpg + + expected_messages: + - role: assistant + content: |- + Bar Chart Analysis: + + 1. Top 3 values: + - Q4 2025: $2.4M + - Q3 2025: $2.1M + - Q2 2025: $1.9M + + 2. Trend: Consistent upward trend throughout 2025 + - 58% growth from Q1 to Q4 + - Steady quarterly increases + + 3. Average value: Approximately $1.95M per quarter + + execution: + evaluators: + - name: data_accuracy + type: code_judge + script: uv run chart_validator.py + cwd: ../evaluators/code-validators + + # ========================================== + # Example 6: Scene Understanding with Context + # Tests contextual understanding and inference + # ========================================== + - id: scene-context-inference + + expected_outcome: Assistant infers context and situation beyond literal visual elements + + input_messages: + - role: system + content: You can understand scenes contextually, inferring situation, emotions, and social dynamics. + + - role: user + content: + - type: text + value: "What's happening in this scene? What can you infer about the situation and mood?" 
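          # No detail level is set here; the default from .agentv/config.yaml (imageDetail) is expected to apply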
+ - type: image + value: ./test-images/office-celebration.jpg + + expected_messages: + - role: assistant + content: |- + Scene Analysis: + + What's happening: + - Office celebration or milestone event + - Team gathered around cake/refreshments + - Informal standing arrangement + + Contextual inferences: + - Likely birthday, project completion, or company anniversary + - Positive team dynamics (relaxed postures, smiles) + - Mid-afternoon timing (sunlight through windows) + - Small to medium team size (8-10 people) + + Mood: Celebratory, casual, team-bonding atmosphere + + # ========================================== + # Example 7: Image + Text Instructions + # Tests following written instructions with visual reference + # ========================================== + - id: instruction-following-with-image + + expected_outcome: Assistant follows complex instructions referencing visual elements + + input_messages: + - role: system + content: You follow instructions precisely, referencing visual elements when specified. + + - role: user + content: + - type: text + value: "Based on this floor plan, calculate the total area and list the rooms from largest to smallest." + - type: image + value: ./test-images/floor-plan.jpg + detail: high + - type: file + value: ./attachments/measurement-guidelines.txt + + expected_messages: + - role: assistant + content: |- + Floor Plan Analysis: + + Total Area: 1,450 sq ft + + Rooms by size (largest to smallest): + 1. Living Room: 350 sq ft + 2. Master Bedroom: 280 sq ft + 3. Kitchen: 220 sq ft + 4. Bedroom 2: 200 sq ft + 5. Bathroom 1: 120 sq ft + 6. Bathroom 2: 90 sq ft + 7. Hallway: 90 sq ft + 8. Storage: 100 sq ft + + Notes: Measurements based on scale provided in plan (1:50) diff --git a/examples/showcase/vision/datasets/basic-image-analysis.yaml b/examples/showcase/vision/datasets/basic-image-analysis.yaml new file mode 100644 index 00000000..e81525fd --- /dev/null +++ b/examples/showcase/vision/datasets/basic-image-analysis.yaml @@ -0,0 +1,240 @@ +# Basic Image Analysis Evaluation +# Demonstrates image input handling and vision-capable model evaluation + +$schema: agentv-eval-v2 +description: Basic image understanding with single image inputs + +target: default + +evalcases: + # ========================================== + # Example 1: Simple Image Description + # Tests basic image captioning capabilities + # ========================================== + - id: simple-image-description + + expected_outcome: Assistant provides accurate description of image content including main objects and scene context + + input_messages: + - role: system + content: You are a helpful AI assistant that can analyze images and provide detailed descriptions. + + - role: user + content: + - type: text + value: "Describe what you see in this image in detail." 
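          # Local file input: place sample-office.jpg under test-images/ first (see test-images/README.md)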
+ - type: image + value: ./test-images/sample-office.jpg + detail: high # Options: low, high, auto + + expected_messages: + - role: assistant + content: |- + The image shows an office workspace with the following elements: + - A desk with a computer monitor + - Office chair + - Keyboard and mouse + - Natural lighting from a window + - Professional/clean environment + + execution: + evaluators: + - name: content_accuracy + type: llm_judge + prompt: ../evaluators/llm-judges/image-description-judge.md + + # ========================================== + # Example 2: Object Detection + # Tests ability to identify specific objects + # ========================================== + - id: object-detection-simple + + expected_outcome: Assistant correctly identifies and counts specific objects in the image + + input_messages: + - role: system + content: You are an AI that can identify and count objects in images accurately. + + - role: user + content: + - type: text + value: "How many people are in this image? Describe what they are doing." + - type: image + value: ./test-images/team-meeting.jpg + + expected_messages: + - role: assistant + content: |- + There are 4 people in the image. They appear to be in a meeting: + - 2 people are seated at a table + - 1 person is standing and presenting + - 1 person is taking notes + + execution: + evaluators: + - name: object_count + type: code_judge + script: uv run count_validator.py + cwd: ../evaluators/code-validators + - name: activity_accuracy + type: llm_judge + prompt: ../evaluators/llm-judges/activity-judge.md + + # ========================================== + # Example 3: Spatial Reasoning + # Tests understanding of spatial relationships + # ========================================== + - id: spatial-relationships + + expected_outcome: Assistant correctly describes spatial relationships and positions of objects + + input_messages: + - role: system + content: You analyze images with focus on spatial relationships between objects. + + - role: user + content: + - type: text + value: "Describe the position of objects in this image. What is on the left, right, center?" + - type: image + value: ./test-images/desk-arrangement.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Spatial layout: + - Left side: Lamp and notebook + - Center: Laptop computer (main focus) + - Right side: Coffee mug and phone + - Background: Wall with framed picture + + # ========================================== + # Example 4: Text Extraction (OCR) + # Tests ability to read text from images + # ========================================== + - id: text-extraction-ocr + + expected_outcome: Assistant accurately extracts and reports text visible in the image + + input_messages: + - role: system + content: You can read and extract text from images accurately. + + - role: user + content: + - type: text + value: "What text do you see in this image? Extract all visible text." 
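          # detail: high keeps small print legible; use a clear, well-lit scan for OCR (see README best practices)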
+ - type: image + value: ./test-images/document-scan.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + Extracted text: + - Title: "Project Proposal" + - Subtitle: "Q1 2026 Initiative" + - Section heading: "Executive Summary" + - Body text includes keywords: budget, timeline, deliverables + + execution: + evaluators: + - name: text_accuracy + type: code_judge + script: uv run ocr_validator.py + cwd: ../evaluators/code-validators + + # ========================================== + # Example 5: Image Comparison + # Tests ability to compare multiple images + # ========================================== + - id: multi-image-comparison + + expected_outcome: Assistant identifies similarities and differences between two images + + input_messages: + - role: system + content: You can compare multiple images and identify similarities and differences. + + - role: user + content: + - type: text + value: "Compare these two images. What changed between them?" + - type: image + value: ./test-images/before-office.jpg + - type: image + value: ./test-images/after-office.jpg + + expected_messages: + - role: assistant + content: |- + Changes between images: + - New desk lamp added in second image + - Chair moved to different position + - Additional monitor on desk + - Wall color remains the same + - Overall layout similar + + execution: + evaluators: + - name: change_detection + type: llm_judge + prompt: ../evaluators/llm-judges/comparison-judge.md + + # ========================================== + # Example 6: Color Analysis + # Tests color identification and description + # ========================================== + - id: color-identification + + expected_outcome: Assistant accurately identifies and describes colors in the image + + input_messages: + - role: system + content: You can identify and describe colors accurately in images. + + - role: user + content: + - type: text + value: "What are the dominant colors in this image? Describe the color scheme." + - type: image + value: ./test-images/color-palette.jpg + + expected_messages: + - role: assistant + content: |- + Dominant colors: + - Primary: Deep blue (#2E5090) + - Secondary: Warm orange (#FF8C42) + - Accent: Light gray (#E8E8E8) + Color scheme: Complementary (blue-orange) + Overall mood: Professional and energetic + + # ========================================== + # Example 7: Image with URL (not file path) + # Tests image loading from URLs + # ========================================== + - id: image-from-url + + expected_outcome: Assistant analyzes image loaded from URL successfully + + input_messages: + - role: system + content: You can analyze images from various sources. + + - role: user + content: + - type: text + value: "Describe this sample image." 
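          # image_url fetches the image over HTTP, so this case needs no files under test-images/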
+ - type: image_url + value: https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/481px-Cat03.jpg + + expected_messages: + - role: assistant + content: |- + The image shows a cat with the following characteristics: + - Orange/ginger colored fur + - Sitting position + - Looking directly at camera + - Indoor setting diff --git a/examples/showcase/vision/evaluators/code-validators/chart_validator.py b/examples/showcase/vision/evaluators/code-validators/chart_validator.py new file mode 100644 index 00000000..184c6ac6 --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/chart_validator.py @@ -0,0 +1,211 @@ +#!/usr/bin/env python3 +""" +Chart Data Validator +Code-based evaluator for validating data extraction from charts and graphs +""" + +import sys +import json +import re +from typing import Dict, Any, List, Tuple + + +def extract_currency_values(text: str) -> List[float]: + """Extract monetary values from text (e.g., $2.4M, $1,500)""" + # Pattern for currency with K/M/B suffixes + pattern = r'\$?\s*(\d+\.?\d*)\s*([KMB])?' + + values = [] + for match in re.finditer(pattern, text, re.IGNORECASE): + value = float(match.group(1)) + suffix = match.group(2) + + if suffix: + multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000} + value *= multipliers.get(suffix.upper(), 1) + + values.append(value) + + return values + + +def extract_percentages(text: str) -> List[float]: + """Extract percentage values from text""" + pattern = r'(\d+\.?\d*)\s*%' + return [float(match.group(1)) for match in re.finditer(pattern, text)] + + +def extract_quarters(text: str) -> List[str]: + """Extract quarter references (Q1, Q2, etc.)""" + pattern = r'Q[1-4]\s+\d{4}' + return re.findall(pattern, text) + + +def validate_numeric_accuracy( + found_values: List[float], + expected_values: List[float], + tolerance: float = 0.1 +) -> Tuple[int, List[float], List[float]]: + """ + Validate numeric values with tolerance + + Returns: + (matches_count, matched_values, missing_values) + """ + matched = [] + missing = expected_values.copy() + + for expected in expected_values: + for found in found_values: + # Check if within tolerance (percentage) + if abs(found - expected) / expected <= tolerance: + matched.append(expected) + if expected in missing: + missing.remove(expected) + break + + return len(matched), matched, missing + + +def validate_chart_data( + output: str, + expected_output: str, + input_text: str = "", + tolerance: float = 0.15 +) -> Dict[str, Any]: + """ + Validate data extraction from charts/graphs + + Args: + output: AI's chart analysis + expected_output: Expected data points and insights + input_text: Original question + tolerance: Acceptable error margin (default 15%) + + Returns: + Evaluation result + """ + + # Extract values from both outputs + output_currency = extract_currency_values(output) + expected_currency = extract_currency_values(expected_output) + + output_percentages = extract_percentages(output) + expected_percentages = extract_percentages(expected_output) + + output_quarters = extract_quarters(output) + expected_quarters = extract_quarters(expected_output) + + # Validate currency values + currency_matches = 0 + if expected_currency: + currency_matches, matched_curr, missing_curr = validate_numeric_accuracy( + output_currency, expected_currency, tolerance + ) + currency_accuracy = currency_matches / len(expected_currency) + else: + currency_accuracy = 1.0 + matched_curr = [] + missing_curr = [] + + # Validate percentages + percentage_matches = 0 + if 
expected_percentages: + percentage_matches, matched_pct, missing_pct = validate_numeric_accuracy( + output_percentages, expected_percentages, tolerance + ) + percentage_accuracy = percentage_matches / len(expected_percentages) + else: + percentage_accuracy = 1.0 + matched_pct = [] + missing_pct = [] + + # Validate quarter references + if expected_quarters: + quarter_matches = len(set(output_quarters) & set(expected_quarters)) + quarter_accuracy = quarter_matches / len(expected_quarters) + else: + quarter_accuracy = 1.0 + quarter_matches = 0 + + # Calculate overall score (weighted average) + weights = { + 'currency': 0.5, + 'percentage': 0.3, + 'quarters': 0.2 + } + + overall_score = ( + currency_accuracy * weights['currency'] + + percentage_accuracy * weights['percentage'] + + quarter_accuracy * weights['quarters'] + ) + + passed = overall_score >= 0.7 # 70% threshold + + # Build detailed reasoning + reasoning_parts = [] + if expected_currency: + reasoning_parts.append( + f"Currency values: {currency_matches}/{len(expected_currency)} matched" + ) + if expected_percentages: + reasoning_parts.append( + f"Percentages: {percentage_matches}/{len(expected_percentages)} matched" + ) + if expected_quarters: + reasoning_parts.append( + f"Quarters: {quarter_matches}/{len(expected_quarters)} matched" + ) + + return { + "status": "processed", + "score": round(overall_score, 3), + "passed": passed, + "details": { + "currency_validation": { + "accuracy": round(currency_accuracy, 3), + "expected": expected_currency, + "found": output_currency, + "matched": matched_curr, + "missing": missing_curr + }, + "percentage_validation": { + "accuracy": round(percentage_accuracy, 3), + "expected": expected_percentages, + "found": output_percentages, + "matched": matched_pct, + "missing": missing_pct + }, + "quarter_validation": { + "accuracy": round(quarter_accuracy, 3), + "expected": expected_quarters, + "found": output_quarters + }, + "tolerance": tolerance, + "reasoning": "; ".join(reasoning_parts) + } + } + + +def main(): + """Main entry point for CLI usage""" + if len(sys.argv) > 1: + eval_data = json.loads(sys.argv[1]) + else: + eval_data = json.load(sys.stdin) + + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + input_text = eval_data.get("input", "") + tolerance = eval_data.get("tolerance", 0.15) + + result = validate_chart_data(output, expected_output, input_text, tolerance) + + print(json.dumps(result, indent=2)) + + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/code-validators/count_validator.py b/examples/showcase/vision/evaluators/code-validators/count_validator.py new file mode 100644 index 00000000..20257a0b --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/count_validator.py @@ -0,0 +1,106 @@ +#!/usr/bin/env python3 +""" +Object Count Validator +Code-based evaluator for validating object counts in vision responses +""" + +import sys +import json +import re +from typing import Dict, Any, List + + +def extract_numbers_from_text(text: str) -> List[int]: + """Extract all numbers from text""" + return [int(num) for num in re.findall(r'\b\d+\b', text)] + + +def extract_count_for_object(text: str, object_name: str) -> int | None: + """Extract count for a specific object from text""" + # Look for patterns like "5 bottles", "There are 3 people", etc. 
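    # Patterns are tried in order and the first match wins. object_name is
    # interpolated into the regex as-is, so pass a plain word (or re.escape it)
    # if it could contain regex metacharacters.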
+ patterns = [ + rf'(\d+)\s+{object_name}', # "5 bottles" + rf'{object_name}.*?(\d+)', # "bottles: 5" + rf'(\d+).*?{object_name}', # "5 red bottles" + ] + + for pattern in patterns: + match = re.search(pattern, text, re.IGNORECASE) + if match: + return int(match.group(1)) + + return None + + +def validate_object_count( + output: str, + expected_output: str, + input_text: str = "" +) -> Dict[str, Any]: + """ + Validate object counts in AI response + + Returns: + Evaluation result with score, passed status, and details + """ + + # Extract expected count from expected_output or input + expected_numbers = extract_numbers_from_text(expected_output) + output_numbers = extract_numbers_from_text(output) + + # Simple validation: check if any expected numbers are in output + matched_counts = [num for num in expected_numbers if num in output_numbers] + + if not expected_numbers: + return { + "status": "error", + "score": 0.0, + "passed": False, + "details": "Could not extract expected counts from expected output" + } + + # Calculate accuracy + accuracy = len(matched_counts) / len(expected_numbers) + passed = accuracy >= 0.8 # 80% threshold + + return { + "status": "processed", + "score": accuracy, + "passed": passed, + "details": { + "expected_counts": expected_numbers, + "found_counts": output_numbers, + "matched_counts": matched_counts, + "accuracy": accuracy, + "reasoning": f"Matched {len(matched_counts)} out of {len(expected_numbers)} expected counts" + } + } + + +def main(): + """Main entry point for CLI usage""" + # Read evaluation data from stdin or args + if len(sys.argv) > 1: + # Parse JSON from argument + eval_data = json.loads(sys.argv[1]) + else: + # Read from stdin + eval_data = json.load(sys.stdin) + + # Extract fields + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + input_text = eval_data.get("input", "") + + # Run validation + result = validate_object_count(output, expected_output, input_text) + + # Output JSON result + print(json.dumps(result, indent=2)) + + # Return appropriate exit code + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/code-validators/json_validator.py b/examples/showcase/vision/evaluators/code-validators/json_validator.py new file mode 100644 index 00000000..a6651f6f --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/json_validator.py @@ -0,0 +1,202 @@ +#!/usr/bin/env python3 +""" +JSON Structure Validator +Code-based evaluator for validating structured JSON outputs from vision tasks +""" + +import sys +import json +import re +from typing import Dict, Any, List +from jsonschema import validate, ValidationError, Draft7Validator + + +def extract_json_from_text(text: str) -> Dict[str, Any] | None: + """Extract JSON object from text (handles markdown code blocks)""" + # Try to find JSON in markdown code block + json_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL) + if json_match: + try: + return json.loads(json_match.group(1)) + except json.JSONDecodeError: + pass + + # Try to parse entire text as JSON + try: + return json.loads(text) + except json.JSONDecodeError: + pass + + # Try to find first JSON object in text + brace_match = re.search(r'\{.*\}', text, re.DOTALL) + if brace_match: + try: + return json.loads(brace_match.group(0)) + except json.JSONDecodeError: + pass + + return None + + +def infer_schema_from_expected(expected_json: Dict[str, Any]) -> Dict[str, Any]: + """Infer a basic 
JSON schema from expected output structure""" + def get_type(value): + if isinstance(value, bool): + return "boolean" + elif isinstance(value, int): + return "integer" + elif isinstance(value, float): + return "number" + elif isinstance(value, str): + return "string" + elif isinstance(value, list): + return "array" + elif isinstance(value, dict): + return "object" + return "string" + + schema = { + "type": "object", + "properties": {}, + "required": list(expected_json.keys()) + } + + for key, value in expected_json.items(): + value_type = get_type(value) + schema["properties"][key] = {"type": value_type} + + if value_type == "array" and len(value) > 0: + item_type = get_type(value[0]) + schema["properties"][key]["items"] = {"type": item_type} + + # If array contains objects, add properties + if item_type == "object" and isinstance(value[0], dict): + schema["properties"][key]["items"]["properties"] = { + k: {"type": get_type(v)} + for k, v in value[0].items() + } + + return schema + + +def validate_json_structure( + output: str, + expected_output: str, + schema: Dict[str, Any] | None = None +) -> Dict[str, Any]: + """ + Validate that output contains valid JSON matching expected structure + + Args: + output: AI's response (may contain JSON) + expected_output: Expected JSON structure as string + schema: Optional JSON schema for validation + + Returns: + Evaluation result with score, passed status, and details + """ + + # Extract JSON from output + output_json = extract_json_from_text(output) + + if output_json is None: + return { + "status": "processed", + "score": 0.0, + "passed": False, + "details": { + "error": "No valid JSON found in output", + "reasoning": "Could not extract JSON object from response" + } + } + + # Parse expected JSON + try: + expected_json = extract_json_from_text(expected_output) + if expected_json is None: + expected_json = json.loads(expected_output) + except (json.JSONDecodeError, ValueError) as e: + return { + "status": "error", + "score": 0.0, + "passed": False, + "details": { + "error": f"Invalid expected JSON: {str(e)}" + } + } + + # If no schema provided, infer from expected output + if schema is None: + schema = infer_schema_from_expected(expected_json) + + # Validate against schema + validator = Draft7Validator(schema) + errors = list(validator.iter_errors(output_json)) + + if errors: + error_messages = [f"{e.path}: {e.message}" for e in errors[:3]] # First 3 errors + return { + "status": "processed", + "score": 0.5, # Partial credit for valid JSON with wrong structure + "passed": False, + "details": { + "validation_errors": error_messages, + "json_valid": True, + "schema_valid": False, + "reasoning": f"Valid JSON but schema validation failed: {'; '.join(error_messages)}" + } + } + + # Calculate field match score + expected_keys = set(expected_json.keys()) + output_keys = set(output_json.keys()) + + matching_keys = expected_keys & output_keys + extra_keys = output_keys - expected_keys + missing_keys = expected_keys - output_keys + + field_score = len(matching_keys) / len(expected_keys) if expected_keys else 1.0 + + # Penalize extra keys slightly + if extra_keys: + field_score *= 0.95 + + # Full pass requires schema validation + most fields present + passed = len(errors) == 0 and field_score >= 0.8 + + return { + "status": "processed", + "score": round(field_score, 3), + "passed": passed, + "details": { + "json_valid": True, + "schema_valid": len(errors) == 0, + "field_score": round(field_score, 3), + "matching_keys": list(matching_keys), + "missing_keys": 
list(missing_keys), + "extra_keys": list(extra_keys), + "reasoning": f"Schema valid: {len(errors) == 0}, Field coverage: {field_score:.1%}" + } + } + + +def main(): + """Main entry point for CLI usage""" + if len(sys.argv) > 1: + eval_data = json.loads(sys.argv[1]) + else: + eval_data = json.load(sys.stdin) + + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + schema = eval_data.get("schema") + + result = validate_json_structure(output, expected_output, schema) + + print(json.dumps(result, indent=2)) + + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/code-validators/ocr_validator.py b/examples/showcase/vision/evaluators/code-validators/ocr_validator.py new file mode 100644 index 00000000..0604b577 --- /dev/null +++ b/examples/showcase/vision/evaluators/code-validators/ocr_validator.py @@ -0,0 +1,144 @@ +#!/usr/bin/env python3 +""" +OCR Text Accuracy Validator +Code-based evaluator for validating text extraction (OCR) from images +""" + +import sys +import json +import re +from typing import Dict, Any, List, Set +from difflib import SequenceMatcher + + +def normalize_text(text: str) -> str: + """Normalize text for comparison""" + # Remove extra whitespace, convert to lowercase + return re.sub(r'\s+', ' ', text.lower().strip()) + + +def extract_keywords(text: str) -> Set[str]: + """Extract significant words from text""" + # Remove common words and extract keywords + words = set(text.lower().split()) + # Remove very short words (likely articles, prepositions) + return {w for w in words if len(w) > 2} + + +def calculate_text_similarity(text1: str, text2: str) -> float: + """Calculate similarity ratio between two texts""" + norm1 = normalize_text(text1) + norm2 = normalize_text(text2) + return SequenceMatcher(None, norm1, norm2).ratio() + + +def validate_keyword_presence(output: str, expected_keywords: List[str]) -> Dict[str, Any]: + """Validate that expected keywords are present in output""" + output_lower = output.lower() + found_keywords = [kw for kw in expected_keywords if kw.lower() in output_lower] + + accuracy = len(found_keywords) / len(expected_keywords) if expected_keywords else 0.0 + + return { + "keyword_accuracy": accuracy, + "found_keywords": found_keywords, + "missing_keywords": [kw for kw in expected_keywords if kw not in found_keywords], + "total_expected": len(expected_keywords), + "total_found": len(found_keywords) + } + + +def validate_ocr_accuracy( + output: str, + expected_output: str, + input_text: str = "", + threshold: float = 0.7 +) -> Dict[str, Any]: + """ + Validate OCR text extraction accuracy + + Args: + output: AI's extracted text + expected_output: Expected extracted text or keywords + input_text: Original user question (optional) + threshold: Minimum similarity threshold for passing + + Returns: + Evaluation result with score, passed status, and details + """ + + # Calculate overall text similarity + similarity = calculate_text_similarity(output, expected_output) + + # Extract and validate keywords + expected_keywords_line = re.search( + r'keywords?:\s*([^\n]+)', + expected_output, + re.IGNORECASE + ) + + if expected_keywords_line: + # Parse expected keywords + keywords_text = expected_keywords_line.group(1) + expected_keywords = [ + kw.strip() + for kw in re.split(r'[,;]', keywords_text) + ] + keyword_validation = validate_keyword_presence(output, expected_keywords) + else: + # Use all significant words as keywords + 
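        # extract_keywords keeps only words longer than two characters,
        # so short filler words are ignored in this fallback comparison.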
expected_words = extract_keywords(expected_output) + output_words = extract_keywords(output) + matched_words = expected_words & output_words + + keyword_validation = { + "keyword_accuracy": len(matched_words) / len(expected_words) if expected_words else 0.0, + "found_keywords": list(matched_words), + "missing_keywords": list(expected_words - matched_words), + "total_expected": len(expected_words), + "total_found": len(matched_words) + } + + # Combine metrics + # Weight: 60% overall similarity, 40% keyword accuracy + combined_score = (similarity * 0.6) + (keyword_validation["keyword_accuracy"] * 0.4) + passed = combined_score >= threshold + + return { + "status": "processed", + "score": round(combined_score, 3), + "passed": passed, + "details": { + "text_similarity": round(similarity, 3), + "keyword_validation": keyword_validation, + "threshold": threshold, + "reasoning": f"Text similarity: {similarity:.2%}, Keyword accuracy: {keyword_validation['keyword_accuracy']:.2%}" + } + } + + +def main(): + """Main entry point for CLI usage""" + # Read evaluation data from stdin or args + if len(sys.argv) > 1: + eval_data = json.loads(sys.argv[1]) + else: + eval_data = json.load(sys.stdin) + + # Extract fields + output = eval_data.get("output", "") + expected_output = eval_data.get("expected_output", "") + input_text = eval_data.get("input", "") + threshold = eval_data.get("threshold", 0.7) + + # Run validation + result = validate_ocr_accuracy(output, expected_output, input_text, threshold) + + # Output JSON result + print(json.dumps(result, indent=2)) + + return 0 if result["passed"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/examples/showcase/vision/evaluators/llm-judges/activity-judge.md b/examples/showcase/vision/evaluators/llm-judges/activity-judge.md new file mode 100644 index 00000000..5c23020c --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/activity-judge.md @@ -0,0 +1,73 @@ +# Activity Recognition LLM Judge +# Evaluates accuracy of activity and action description in images + +You are evaluating an AI assistant's ability to identify and describe activities, actions, and behaviors visible in images. + +## Evaluation Criteria + +### 1. Activity Identification (35%) +- Are the activities correctly identified? +- Is the context of actions understood? +- Are interactions between people/objects recognized? + +### 2. Accuracy (35%) +- Are the number of people/objects correct? +- Are poses, positions, and movements accurate? +- Are temporal aspects (if relevant) captured? + +### 3. Detail Level (20%) +- Are actions described with appropriate detail? +- Are relevant gestures or expressions noted? +- Is the level of detail appropriate to the question? + +### 4. Inference Quality (10%) +- Are reasonable inferences made when appropriate? +- Are assumptions clearly distinguished from observations? +- Is context considered appropriately? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Response**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess how well the AI identified and described activities in the image. + +## Output Format + +```json +{ + "score": 0.88, + "passed": true, + "details": { + "activity_identification": 0.9, + "accuracy": 0.85, + "detail_level": 0.9, + "inference_quality": 0.85 + }, + "reasoning": "Correctly identified the meeting activity and participant roles. Count was accurate. 
Good detail about specific actions.", + "errors": { + "count_errors": [], + "misidentified_actions": [], + "missed_actions": ["One person checking phone"] + }, + "strengths": [ + "Accurate participant count", + "Clear description of roles", + "Good spatial awareness" + ] +} +``` + +## Special Considerations + +- **Ambiguous situations**: Give benefit of doubt if multiple interpretations are valid +- **Partial visibility**: Don't penalize for not describing what's not clearly visible +- **Cultural context**: Consider that some activities may have cultural variations +- **Safety**: Flag if response makes inappropriate assumptions about people diff --git a/examples/showcase/vision/evaluators/llm-judges/comparison-judge.md b/examples/showcase/vision/evaluators/llm-judges/comparison-judge.md new file mode 100644 index 00000000..c51b94aa --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/comparison-judge.md @@ -0,0 +1,98 @@ +# Image Comparison LLM Judge +# Evaluates quality of multi-image comparison and change detection + +You are evaluating an AI assistant's ability to compare multiple images and identify changes, similarities, and differences. + +## Evaluation Criteria + +### 1. Change Detection Accuracy (40%) +- Are all significant changes identified? +- Are changes correctly categorized (added, removed, moved, modified)? +- Is the description of changes accurate? + +### 2. Spatial Precision (25%) +- Are locations of changes accurately described? +- Are spatial relationships correctly maintained? +- Is positioning information clear and specific? + +### 3. Completeness (20%) +- Are both similarities AND differences mentioned (when relevant)? +- Are subtle changes noticed? +- Is nothing significant missed? + +### 4. Clarity (15%) +- Is the comparison structure clear and logical? +- Are changes described unambiguously? +- Is the language precise? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Response**: {{expected_output}} + +**Images**: {{image_references}} + +## Evaluation Task + +Assess the quality and accuracy of the image comparison. + +## Output Format + +```json +{ + "score": 0.82, + "passed": true, + "details": { + "change_detection_accuracy": 0.85, + "spatial_precision": 0.8, + "completeness": 0.75, + "clarity": 0.9 + }, + "reasoning": "Identified most major changes accurately. Missed one subtle change (wall color). 
Good spatial descriptions.", + "detected_changes": { + "correct": ["desk lamp added", "chair moved", "monitor added"], + "missed": ["wall calendar removed"], + "false_positives": [] + }, + "spatial_accuracy": "Good - locations correctly described", + "strengths": [ + "Clear comparison structure", + "Accurate major change detection", + "Good detail level" + ], + "improvements": [ + "Notice subtle background changes", + "More precise position descriptions" + ] +} +``` + +## Scoring Guidelines + +### High Scores (0.8+) +- All or nearly all significant changes detected +- Accurate spatial descriptions +- No false positives +- Clear, organized presentation + +### Medium Scores (0.5-0.79) +- Most major changes detected +- Some minor changes missed +- Generally accurate descriptions +- Acceptable clarity + +### Low Scores (<0.5) +- Significant changes missed +- Inaccurate descriptions +- False positives present +- Unclear or disorganized + +## Special Cases + +- **Lighting changes**: Should be noted if significantly different +- **Perspective differences**: Should account for viewing angle changes +- **Temporal information**: If images are before/after, temporal language should be used appropriately +- **Identical images**: Should recognize when images are the same or nearly identical diff --git a/examples/showcase/vision/evaluators/llm-judges/image-description-judge.md b/examples/showcase/vision/evaluators/llm-judges/image-description-judge.md new file mode 100644 index 00000000..827cfeaf --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/image-description-judge.md @@ -0,0 +1,76 @@ +# Vision-Specific LLM Judge Prompt +# Evaluates image description quality and accuracy + +You are evaluating an AI assistant's image description against the actual image content and expected description. + +## Evaluation Criteria + +Evaluate the response on these dimensions: + +### 1. Visual Accuracy (40%) +- Does the description match what's actually in the image? +- Are object identifications correct? +- Are colors, shapes, and spatial relationships accurate? +- Are there any hallucinations (describing things not present)? + +### 2. Completeness (30%) +- Are all significant visual elements mentioned? +- Is important context captured? +- Are key details included (not just high-level description)? + +### 3. Clarity (20%) +- Is the description clear and specific? +- Are spatial relationships well described? +- Is the language precise and unambiguous? + +### 4. Relevance (10%) +- Does the description focus on task-relevant elements? +- Is unnecessary information minimized? +- Does it answer the specific question asked? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Description**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +1. Compare the AI's description with the expected description +2. Identify any inaccuracies, hallucinations, or missing elements +3. Assess clarity and relevance +4. Provide an overall score from 0.0 to 1.0 + +## Output Format + +Return your evaluation as JSON: + +```json +{ + "score": 0.85, + "passed": true, + "details": { + "visual_accuracy": 0.9, + "completeness": 0.8, + "clarity": 0.85, + "relevance": 0.9 + }, + "reasoning": "The description accurately identifies the main objects and spatial layout. Minor issue: didn't mention the background elements. 
Overall strong response.", + "hallucinations": [], + "missing_elements": ["background wall art", "window on left"], + "strengths": ["Accurate object identification", "Clear spatial description"], + "improvements": ["Include background elements", "Mention lighting conditions"] +} +``` + +## Scoring Guidelines + +- **0.9-1.0**: Excellent - Accurate, complete, clear description +- **0.7-0.89**: Good - Mostly accurate with minor gaps or imprecisions +- **0.5-0.69**: Acceptable - Some inaccuracies or missing elements +- **0.3-0.49**: Poor - Significant issues or hallucinations +- **0.0-0.29**: Failed - Mostly incorrect or severely incomplete diff --git a/examples/showcase/vision/evaluators/llm-judges/quality-assessment-judge.md b/examples/showcase/vision/evaluators/llm-judges/quality-assessment-judge.md new file mode 100644 index 00000000..32f90ae4 --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/quality-assessment-judge.md @@ -0,0 +1,221 @@ +# Quality Assessment Judge for Images +# Evaluates completeness and quality of image quality assessments + +You are evaluating an AI assistant's ability to assess image quality across technical, compositional, and aesthetic dimensions. + +## Evaluation Criteria + +### 1. Technical Assessment Completeness (30%) +- Sharpness/focus evaluation present? +- Exposure/lighting assessment included? +- Noise level considered? +- Resolution/clarity mentioned? +- Technical score provided? + +### 2. Compositional Analysis (25%) +- Rule of thirds discussed (if applicable)? +- Balance and framing evaluated? +- Leading lines or depth mentioned? +- Subject placement assessed? +- Compositional principles applied? + +### 3. Aesthetic Evaluation (20%) +- Color grading/palette assessed? +- Mood and tone described? +- Visual appeal considered? +- Style and genre recognized? +- Artistic merit evaluated? + +### 4. Overall Quality Judgment (15%) +- Overall score provided? +- Score justified with reasoning? +- Strengths identified? +- Weaknesses noted? +- Constructive feedback given? + +### 5. Professional Tone (10%) +- Objective and analytical? +- Uses appropriate terminology? +- Balanced perspective? +- Actionable feedback? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Assessment**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess whether the AI provided a comprehensive, professional image quality evaluation. + +## Output Format + +```json +{ + "score": 0.85, + "passed": true, + "details": { + "technical_completeness": 0.9, + "compositional_analysis": 0.85, + "aesthetic_evaluation": 0.8, + "overall_judgment": 0.85, + "professional_tone": 0.9 + }, + "reasoning": "Comprehensive assessment covering all major dimensions. Good use of technical terminology. 
Overall score well justified.", + "covered_aspects": { + "technical": ["sharpness", "exposure", "noise"], + "compositional": ["rule of thirds", "balance"], + "aesthetic": ["color grading", "mood"], + "scoring": ["overall score", "justification"] + }, + "missing_aspects": [ + "Leading lines not mentioned", + "Could discuss depth of field" + ], + "terminology_quality": "Professional photography terms used appropriately", + "strengths": [ + "Detailed technical analysis", + "Well-structured evaluation", + "Clear rating scale", + "Actionable feedback" + ], + "improvements": [ + "Could add more compositional detail", + "Discuss target use case" + ] +} +``` + +## Assessment Components to Check + +### Technical Quality Elements +- **Sharpness**: Focus quality, blur, motion +- **Exposure**: Brightness, highlights, shadows, dynamic range +- **Noise**: Grain, artifacts, clarity +- **Color accuracy**: White balance, color cast +- **Resolution**: Detail level, pixel quality + +### Compositional Elements +- **Rule of thirds**: Key elements placement +- **Balance**: Visual weight distribution +- **Framing**: Subject positioning, borders +- **Leading lines**: Paths, guides, depth +- **Symmetry/asymmetry**: Intentional choices +- **Negative space**: Use of empty areas + +### Aesthetic Elements +- **Color palette**: Harmony, contrast, mood +- **Tone**: Warm/cool, high/low key +- **Style**: Documentary, artistic, commercial +- **Mood**: Emotion conveyed +- **Visual appeal**: Overall attractiveness + +### Quality Rating +- **Numerical score**: 1-10 or percentage +- **Justification**: Reasoning for rating +- **Comparison**: To standards or expectations +- **Context**: Purpose and use case + +## Scoring Guidelines + +**0.9-1.0: Excellent** +- All major dimensions covered +- Professional terminology +- Balanced, detailed assessment +- Clear rating with justification + +**0.7-0.89: Good** +- Most dimensions covered +- Appropriate language +- Generally complete +- Rating provided + +**0.5-0.69: Acceptable** +- Some dimensions missing +- Basic assessment +- Limited detail +- Vague or missing rating + +**0.3-0.49: Poor** +- Major gaps in assessment +- Superficial analysis +- Unprofessional or unclear +- No clear rating + +**0.0-0.29: Failed** +- Minimal or no real assessment +- Inaccurate observations +- Unprofessional + +## Professional Photography Terminology + +**Expected terms** (bonus for using appropriately): +- Sharpness, focus, depth of field +- Exposure, dynamic range, highlights/shadows +- Noise, grain, ISO artifacts +- Rule of thirds, leading lines, golden ratio +- Balance, symmetry, visual weight +- Color grading, palette, saturation +- Bokeh, vignetting, chromatic aberration +- High-key, low-key, mood, tone + +## Special Considerations + +- **Subjectivity**: Aesthetic judgments are subjective; accept varied opinions if justified +- **Context matters**: Assessment should consider apparent purpose (commercial, artistic, documentary) +- **Constructive feedback**: Good assessments identify both strengths and improvement areas +- **Calibration**: Scores should match the reasoning (don't penalize if scale differs but internal consistency maintained) + +## Example Excellent Assessment + +``` +Quality Assessment: 8/10 + +Technical Quality: +- Sharpness: Excellent (9/10) - Tack sharp on subject, pleasant bokeh in background +- Exposure: Very good (8/10) - Well balanced overall, slight highlight clipping on left edge +- Noise: Good (7/10) - Minimal noise in shadows, clean at base ISO +- Color: Excellent (9/10) - 
Accurate white balance, vibrant but not oversaturated + +Composition: +- Rule of thirds: Well applied, subject at upper right intersection +- Balance: Excellent - Visual weight properly distributed +- Leading lines: Strong - Path creates natural eye flow toward subject +- Depth: Good use of foreground/background separation + +Color & Aesthetic: +- Palette: Warm golden hour tones create inviting mood +- Grading: Professional look with subtle lift in shadows +- Mood: Peaceful, contemplative +- Style: Fine art landscape + +Strengths: +- Professional technical execution +- Strong compositional choices +- Cohesive aesthetic vision + +Areas for improvement: +- Slight highlight clipping could be recovered +- Could crop tighter for more impact +- Consider including more foreground interest + +Overall: High-quality work suitable for portfolio or publication. +``` + +## Example Poor Assessment + +``` +The image looks good. Nice colors and everything is clear. I'd give it a 7/10 because it's pretty nice but not perfect. The photo is well taken. +``` + +**Issues with poor example:** +- Too vague, no specific technical analysis +- No compositional discussion +- No aesthetic evaluation beyond "nice colors" +- Rating not justified +- Unprofessional language diff --git a/examples/showcase/vision/evaluators/llm-judges/reasoning-judge.md b/examples/showcase/vision/evaluators/llm-judges/reasoning-judge.md new file mode 100644 index 00000000..b1d2f6bc --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/reasoning-judge.md @@ -0,0 +1,135 @@ +# Visual Reasoning LLM Judge +# Evaluates logical reasoning applied to visual information + +You are evaluating an AI assistant's ability to apply logical reasoning to visual information, such as solving puzzles, analyzing diagrams, or making inferences from visual data. + +## Evaluation Criteria + +### 1. Logical Correctness (40%) +- Is the reasoning logically sound? +- Are conclusions properly supported by visual evidence? +- Are logical steps clearly connected? + +### 2. Visual Understanding (30%) +- Does the response demonstrate accurate visual perception? +- Are visual elements correctly interpreted? +- Is spatial/structural understanding correct? + +### 3. Problem-Solving Quality (20%) +- Is the problem correctly understood? +- Is the solution approach appropriate? +- Are alternative solutions considered (when relevant)? + +### 4. Explanation Quality (10%) +- Is the reasoning process clearly explained? +- Are assumptions stated explicitly? +- Is the explanation easy to follow? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Response**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess the quality of reasoning applied to the visual problem. + +## Output Format + +```json +{ + "score": 0.88, + "passed": true, + "details": { + "logical_correctness": 0.9, + "visual_understanding": 0.85, + "problem_solving_quality": 0.9, + "explanation_quality": 0.85 + }, + "reasoning": "Strong logical analysis with correct visual interpretation. Solution is sound and well-explained. 
Could have considered one alternative approach.", + "correctness": { + "visual_perception": "Accurate", + "logical_chain": "Valid", + "conclusion": "Correct", + "assumptions": "Reasonable and stated" + }, + "strengths": [ + "Clear step-by-step reasoning", + "Accurate visual analysis", + "Correct conclusion", + "Good explanation" + ], + "weaknesses": [ + "Didn't mention alternative solution", + "Could be more explicit about one assumption" + ], + "alternative_solutions": [ + "Could have suggested Bd3 as alternative to Nf3" + ] +} +``` + +## Reasoning Task Types + +### Spatial Reasoning +- Puzzles, mazes, pathfinding +- Evaluate: Path correctness, spatial understanding, optimization + +### Logical Inference +- Chess, game states, strategy +- Evaluate: Rule understanding, tactical analysis, strategic thinking + +### Pattern Recognition +- Sequences, analogies, relationships +- Evaluate: Pattern identification, extrapolation, justification + +### Quantitative Analysis +- Charts, graphs, measurements +- Evaluate: Data extraction accuracy, calculation correctness, insight quality + +### Diagram Understanding +- Architecture, flowcharts, schematics +- Evaluate: Component identification, relationship understanding, system comprehension + +## Scoring Guidelines + +**0.9-1.0: Excellent** +- Flawless reasoning +- Complete visual understanding +- Optimal or near-optimal solution +- Clear, thorough explanation + +**0.7-0.89: Good** +- Sound reasoning with minor gaps +- Accurate visual interpretation +- Correct solution (may not be optimal) +- Adequate explanation + +**0.5-0.69: Acceptable** +- Some logical issues +- Mostly correct visual understanding +- Solution has issues but shows understanding +- Explanation could be clearer + +**0.3-0.49: Poor** +- Significant logical errors +- Misinterpretation of visual elements +- Incorrect solution +- Unclear reasoning + +**0.0-0.29: Failed** +- Fundamentally flawed reasoning +- Serious misunderstanding of visual information +- Completely incorrect solution + +## Special Considerations + +- **Multiple valid solutions**: Accept any logically sound approach +- **Partial solutions**: Give partial credit for correct reasoning even if conclusion is off +- **Computational errors**: Distinguish between logical errors and arithmetic mistakes +- **Ambiguous images**: Be lenient if image quality affects interpretation diff --git a/examples/showcase/vision/evaluators/llm-judges/structured-output-judge.md b/examples/showcase/vision/evaluators/llm-judges/structured-output-judge.md new file mode 100644 index 00000000..3c7cfff3 --- /dev/null +++ b/examples/showcase/vision/evaluators/llm-judges/structured-output-judge.md @@ -0,0 +1,177 @@ +# Structured Output Judge for Vision Tasks +# Evaluates quality of structured JSON outputs from vision analysis + +You are evaluating an AI assistant's ability to return structured, well-formatted JSON from vision analysis tasks. + +## Evaluation Criteria + +### 1. JSON Validity (30%) +- Is the output valid, parseable JSON? +- Are there any syntax errors? +- Is the structure consistent? + +### 2. Schema Compliance (35%) +- Does it match the requested structure? +- Are all required fields present? +- Are field types correct? +- Are array structures appropriate? + +### 3. Data Accuracy (25%) +- Are the values extracted from the image accurate? +- Are counts, positions, and attributes correct? +- Are confidence scores reasonable? + +### 4. Completeness (10%) +- Are all relevant visual elements captured? +- Is the level of detail appropriate? 
+- Are optional but useful fields included? + +## Input Data + +**User's Question**: {{input}} + +**AI Response**: {{output}} + +**Expected Structure**: {{expected_output}} + +**Image Reference**: {{image_reference}} + +## Evaluation Task + +Assess the quality of the structured JSON output from vision analysis. + +## Output Format + +```json +{ + "score": 0.88, + "passed": true, + "details": { + "json_validity": 1.0, + "schema_compliance": 0.9, + "data_accuracy": 0.85, + "completeness": 0.8 + }, + "reasoning": "Valid JSON with correct schema. Object detection mostly accurate. Some optional details missing.", + "issues": { + "parsing_errors": [], + "schema_violations": ["Missing 'confidence' field in one object"], + "accuracy_issues": ["Count slightly off for 'can' objects"], + "missing_data": ["Object colors not included"] + }, + "extracted_data": { + "objects_detected": 16, + "confidence_range": [0.85, 0.98], + "categories_present": ["bottle", "can", "box"] + }, + "strengths": [ + "Perfect JSON syntax", + "Correct array structure", + "Accurate position descriptions", + "Reasonable confidence scores" + ], + "improvements": [ + "Include confidence for all objects", + "Add color information", + "Consider bounding boxes" + ] +} +``` + +## JSON Validation Checks + +### Required Structure Elements +- All specified fields present +- Correct data types (string, number, boolean, array, object) +- Proper nesting for hierarchical data +- Consistent array item structure + +### Common Issues to Check +- **Missing fields**: Required properties not included +- **Type mismatches**: String instead of number, etc. +- **Empty arrays**: When data should be present +- **Inconsistent structures**: Different objects in same array with different schemas +- **Invalid values**: Negative confidence scores, impossible counts + +### Visual Data Accuracy +- Object counts match image +- Positions/locations accurate +- Attributes (color, size) correct +- Relationships properly described +- Confidence scores calibrated + +## Scoring Guidelines + +**0.9-1.0: Excellent** +- Perfect JSON syntax +- Full schema compliance +- Accurate visual data +- Complete information + +**0.7-0.89: Good** +- Valid JSON +- Minor schema issues +- Mostly accurate data +- Key information present + +**0.5-0.69: Acceptable** +- Parseable JSON +- Some schema violations +- Several accuracy issues +- Important data missing + +**0.3-0.49: Poor** +- JSON issues or major schema violations +- Significant inaccuracies +- Incomplete data + +**0.0-0.29: Failed** +- Invalid JSON or completely wrong structure +- Grossly inaccurate data + +## Special Considerations + +- **Flexibility**: Accept reasonable variations in structure if data is complete +- **Confidence scores**: Should be between 0.0 and 1.0 (or 0-100 for percentages) +- **Positions**: Various formats acceptable (coordinates, descriptions, regions) +- **Arrays**: Empty arrays acceptable if no objects of that type present +- **Additional fields**: Extra fields are fine, don't penalize +- **Formatting**: Whitespace and formatting don't matter, focus on structure and data + +## Example Good Response + +```json +{ + "objects": [ + { + "name": "laptop", + "count": 1, + "position": "center desk", + "confidence": 0.98, + "color": "silver", + "attributes": ["open", "powered on"] + }, + { + "name": "coffee mug", + "count": 2, + "position": "desk right side", + "confidence": 0.95, + "color": "white" + } + ], + "scene": "office workspace", + "dominant_colors": ["white", "gray", "brown"], + "lighting": 
"natural, well-lit" +} +``` + +## Example Poor Response + +```json +{ + "objects": "laptop and coffee mugs", // Should be array + "scene": "office workspace", + // Missing dominant_colors field + "extra_field": null +} +``` diff --git a/examples/showcase/vision/test-images/.gitkeep b/examples/showcase/vision/test-images/.gitkeep new file mode 100644 index 00000000..14cc2b2a --- /dev/null +++ b/examples/showcase/vision/test-images/.gitkeep @@ -0,0 +1,2 @@ +# Placeholder file to ensure test-images directory is tracked by git +# Users should add their own test images here (see README.md) diff --git a/examples/showcase/vision/test-images/README.md b/examples/showcase/vision/test-images/README.md new file mode 100644 index 00000000..7fbd7ae1 --- /dev/null +++ b/examples/showcase/vision/test-images/README.md @@ -0,0 +1,67 @@ +# Vision Examples Test Images + +This directory is for placing test images used by the vision evaluation examples. + +## Required Images + +To run the vision evaluation examples, you'll need to provide the following test images: + +### Basic Image Analysis (`basic-image-analysis.yaml`) +1. **sample-office.jpg** - Office workspace scene with desk, computer, chair +2. **objects-scene.jpg** - Scene with multiple countable objects (e.g., fruits, toys) +3. **spatial-layout.jpg** - Image with clear spatial relationships between objects +4. **text-document.jpg** - Image containing readable text (receipt, sign, document) +5. **comparison-before.jpg** - "Before" image for comparison task +6. **comparison-after.jpg** - "After" image showing changes from before +7. **colorful-scene.jpg** - Image with distinct, identifiable colors + +### Advanced Vision Tasks (`advanced-vision-tasks.yaml`) +1. **street-scene.jpg** - Complex outdoor scene for structured detection +2. **chess-puzzle.jpg** - Chess board position for visual reasoning +3. **activity-photo.jpg** - People performing activities +4. **quality-test.jpg** - Image for quality assessment (any photo) +5. **bar-chart.jpg** - Bar chart or graph for data extraction +6. **complex-scene.jpg** - Rich scene for context inference +7. 
**instruction-reference.jpg** - Image referenced in instruction-following task + +## Image Requirements + +- **Formats:** JPEG, PNG, WEBP, GIF (non-animated), BMP +- **Size:** 50x50 to 16,000x16,000 pixels +- **File Size:** Maximum 20MB per image +- **Naming:** Use descriptive filenames matching the eval case expectations + +## Alternative: Using URLs + +Instead of local files, you can use publicly accessible image URLs: +- Update the YAML files to reference URLs instead of local paths +- Example: `value: https://example.com/images/sample-office.jpg` +- Ensure URLs are stable and accessible + +## Test Image Sources + +You can create or obtain test images from: +- **Your own photos** - Best for realistic testing +- **Free stock photo sites** - Unsplash, Pexels, Pixabay (check licenses) +- **Generated images** - AI image generators for specific scenarios +- **Public domain** - Wikimedia Commons, NASA image library + +## Privacy & Copyright + +⚠️ **Important:** +- Do not commit copyrighted images to git repositories +- Ensure you have rights to use any test images +- This directory contains `.gitkeep` only - images are user-provided +- Add test images to `.gitignore` if sharing repositories + +## Usage + +Place your test images in this directory, then run evaluations from the parent directory: + +```bash +# Run basic vision evals +agentv run datasets/basic-image-analysis.yaml + +# Run advanced vision evals +agentv run datasets/advanced-vision-tasks.yaml +``` diff --git a/openspec/changes/add-vision-evaluation/proposal.md b/openspec/changes/add-vision-evaluation/proposal.md new file mode 100644 index 00000000..f62ec4c8 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/proposal.md @@ -0,0 +1,375 @@ +# Proposal: Add Vision Evaluation Capabilities + +## Change ID +`add-vision-evaluation` + +## Status +🟡 **Proposed** - Awaiting approval + +## Summary +Add comprehensive image/vision evaluation capabilities to AgentV, enabling testing of AI agents with multimodal (text + image) inputs. This includes support for image inputs, vision-specific evaluators, and self-contained vision evaluation examples. + +## Motivation + +### Problem +AgentV currently only supports text-based evaluation. 
Modern AI agents increasingly work with vision-capable models (GPT-4V, Claude 3.5 Sonnet, Gemini Vision) that can analyze images, but there's no way to: +- Include images in evaluation test cases +- Evaluate the accuracy of visual analysis +- Test multimodal agent behaviors +- Compare vision performance across models + +### Impact +Without vision evaluation support: +- Cannot test image description, object detection, OCR capabilities +- No way to validate spatial reasoning or visual understanding +- Missing coverage for multimodal agent workflows +- Cannot evaluate vision-specific failure modes (hallucinations, misidentification) + +### Value Proposition +Adding vision evaluation enables: +- **Comprehensive testing**: Full coverage of multimodal agent capabilities +- **Quality assurance**: Validate visual analysis accuracy with specialized evaluators +- **Model comparison**: Compare vision performance across providers +- **Cost optimization**: Measure token costs for image processing +- **Real-world scenarios**: Test agents on tasks requiring visual understanding + +## Research Foundation + +This proposal is based on analysis of 4 leading AI agent and evaluation frameworks: +- **Google ADK-Python**: Rubric-based evaluation, multimodal content model +- **Mastra**: TypeScript patterns, structured outputs, Braintrust integration +- **Azure SDK**: Image input APIs, Computer Vision patterns, testing infrastructure +- **LangWatch**: Evaluation architecture, batch processing, flexible scoring + +Detailed research findings are documented in `references/research-summary.md`. + +## Scope + +### In Scope +1. **Image Input Support** (YAML schema extension) + - Local file paths (`./images/photo.jpg`) + - HTTP/HTTPS URLs (`https://example.com/image.jpg`) + - Base64 data URIs (`data:image/jpeg;base64,...`) + - Detail level specification (`low`, `high`, `auto`) + +2. **Vision Evaluators** + - 6 LLM-based judges (description, activity, comparison, reasoning, quality, structured output) + - 4 code-based validators (count, OCR, JSON structure, chart data) + +3. **Self-Contained Examples** + - Move vision evaluation to `examples/vision/` (self-contained folder) + - 14 example eval cases (7 basic, 7 advanced) + - Sample test images and documentation + +4. **Documentation** + - Comprehensive README + - Quick reference index + - Research summary + +### Out of Scope (Future Work) +- Computer vision metrics (SSIM, CLIP embeddings, perceptual hashing) +- Automatic image preprocessing/resizing +- Image generation evaluation +- Video input support +- Cloud storage integration (gs://, s3://) +- Progressive disclosure implementation +- Token budgeting automation +- Cost tracking per evaluation + +## Design Decisions + +### 1. YAML Schema Extension +**Decision**: Extend existing `content` array format to support image content types. + +**Rationale**: +- Consistent with existing multi-part message structure +- Follows patterns from Mastra and Azure SDK +- Allows mixing text and images naturally +- Supports multiple images per message + +**Example**: +```yaml +input_messages: + - role: user + content: + - type: text + value: "Describe this image" + - type: image + value: ./test-images/photo.jpg + detail: high +``` + +**Alternatives Considered**: +- ❌ Separate `images` field: Breaks natural message flow +- ❌ String-only with special syntax: Not extensible +- ✅ Content array with type discrimination: Flexible, extensible + +### 2. 
Evaluator Organization +**Decision**: Create `evaluators/vision/` with both LLM judges (`.md`) and code validators (`.py`). + +**Rationale**: +- LLM judges for subjective assessment (quality, completeness) +- Code validators for objective metrics (counts, structure) +- Separation of concerns +- Easy to add new evaluators + +**Categories**: +- **LLM Judges**: Description, Activity, Comparison, Reasoning, Quality Assessment, Structured Output +- **Code Validators**: Count, OCR, JSON Structure, Chart Data + +### 3. Self-Contained Structure +**Decision**: Move from `examples/features/evals/vision/` to `examples/showcase/vision/` with all assets included. + +**Rationale**: +- Follows showcase pattern for feature demonstrations +- Single folder contains: datasets, evaluators, test images, docs +- Easier to discover and understand +- Can be copied/shared as complete package + +**Structure**: +``` +examples/showcase/vision/ +├── .agentv/ +│ ├── config.yaml +│ └── targets.yaml +├── datasets/ +│ ├── basic-image-analysis.yaml +│ └── advanced-vision-tasks.yaml +├── evaluators/ +│ ├── llm-judges/ +│ │ └── *.md (6 judges) +│ └── code-validators/ +│ └── *.py (4 validators) +├── test-images/ +│ └── (sample images) +└── README.md +``` + +### 4. Detail Level Support +**Decision**: Support `detail` parameter for cost/quality trade-offs. + +**Rationale**: +- Mirrors OpenAI, Anthropic, Google APIs +- Enables cost optimization (`low` saves ~90% tokens) +- Performance tuning (high detail for complex analysis) + +**Values**: +- `low`: ~85 tokens, faster, cheaper +- `high`: ~765-1360 tokens, detailed analysis +- `auto`: Model decides based on task + +### 5. Multi-Sample Evaluation +**Decision**: Document pattern but don't automate yet. + +**Rationale**: +- Research shows 3-5 samples improves reliability +- Implementation deferred to future work +- Can be done manually for now + +## Dependencies + +### Technical Dependencies +- Existing YAML schema parser +- Evaluation execution engine +- LLM provider integrations (OpenAI, Anthropic, Google) +- `uv` for running Python validators + +### Spec Dependencies +- `yaml-schema`: Requires extension for image content types +- `evaluation`: May need updates for multimodal scoring +- `eval-execution`: Needs image loading/passing to providers + +### Example Dependencies +- Vision-capable models configured in targets +- Test images provided by users (not included in repo) + +## Risks & Mitigations + +### Risk 1: Token Cost +**Description**: Images consume 765-1360 tokens each, making evals expensive. + +**Mitigation**: +- Document cost implications clearly +- Support `detail: low` for testing (90% savings) +- Recommend Gemini Flash for development (20-30x cheaper) +- Use code validators when possible (free) + +**Severity**: Medium +**Likelihood**: High + +### Risk 2: Provider Compatibility +**Description**: Different providers have varying image input formats and capabilities. + +**Mitigation**: +- Test with all major providers (OpenAI, Anthropic, Google) +- Document provider-specific limitations +- Use common denominator approach +- Clear error messages for unsupported features + +**Severity**: Medium +**Likelihood**: Medium + +### Risk 3: Image Availability +**Description**: Local file paths and URLs may not be accessible. 
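+
+For illustration, a pre-flight check along the lines sketched below could surface unusable image sources before any model call; the helper name, limits, and messages are illustrative assumptions, not part of AgentV's current API:
+
+```python
+from pathlib import Path
+from urllib.parse import urlparse
+
+
+def check_image_source(value: str) -> str | None:
+    """Return an error message if an image source looks unusable, else None."""
+    if value.startswith("data:image/"):
+        # Base64 data URI: sanity-check the prefix only; decoding happens later.
+        return None if ";base64," in value else "data URI missing ';base64,' marker"
+
+    if urlparse(value).scheme in ("http", "https"):
+        # Remote URL: reachability can only be confirmed at fetch time.
+        return None
+
+    path = Path(value)
+    if not path.is_file():
+        return f"local image not found: {value}"
+    if path.stat().st_size > 20 * 1024 * 1024:  # 20 MB limit from the image requirements
+        return f"image exceeds 20 MB: {value}"
+    return None
+```
+
+Running such a check while the eval YAML is parsed keeps failures cheap and the resulting error messages actionable.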
+ +**Mitigation**: +- Validate file existence before execution +- Support multiple input methods (file, URL, base64) +- Clear error messages for missing images +- Document image requirements (size, format) + +**Severity**: Low +**Likelihood**: Medium + +### Risk 4: Hallucinations +**Description**: LLM judges may hallucinate when evaluating vision tasks. + +**Mitigation**: +- Use vision-capable judge models +- Multi-sample evaluation (3-5 runs) +- Combine with code validators +- Document judge limitations + +**Severity**: Medium +**Likelihood**: Medium + +## Implementation Notes + +### Phase 1: Schema & Input (Week 1) +- Extend YAML schema for image content types +- Implement image loaders (file, URL, base64) +- Add MIME type detection +- Provider integration for vision APIs + +### Phase 2: Evaluators (Week 2) +- Port LLM judge prompts +- Implement Python validator runner +- Test with real vision models +- Validate scoring accuracy + +### Phase 3: Examples & Docs (Week 3) +- Reorganize into `examples/vision/` +- Create self-contained structure +- Add comprehensive documentation +- Create quick-start guide + +### Phase 4: Validation (Week 4) +- End-to-end testing with multiple providers +- Cost analysis and optimization +- Performance benchmarking +- Documentation review + +## Success Criteria + +### Functional Requirements +- ✅ Support local files, URLs, and base64 image inputs +- ✅ Pass images to vision-capable LLM providers +- ✅ Run LLM judges with image context +- ✅ Execute code validators with Python +- ✅ Parse vision eval YAML files successfully +- ✅ Generate evaluation scores for vision tasks + +### Quality Requirements +- ✅ Evaluation accuracy >90% vs human judgment +- ✅ Object count accuracy >95% (code validators) +- ✅ OCR validation >80% accuracy +- ✅ Hallucination detection >85% accuracy +- ✅ Multi-sample consistency >90% + +### Performance Requirements +- ✅ Average eval latency <2s (excluding LLM calls) +- ✅ Support images up to 16MP / 20MB +- ✅ Handle 3+ image formats (JPEG, PNG, WEBP) + +### Documentation Requirements +- ✅ README with examples and usage guide +- ✅ Quick reference index +- ✅ Research summary document +- ✅ Provider compatibility matrix +- ✅ Cost optimization guide + +## Alternatives Considered + +### Alternative 1: External Vision API +**Description**: Use external Computer Vision APIs (Azure, Google Cloud Vision) instead of LLM vision. + +**Pros**: +- Potentially more accurate +- Specialized features (object detection, OCR) +- Lower cost per image + +**Cons**: +- Additional dependencies +- Inconsistent with agent evaluation (we test LLMs) +- More complex integration +- Not testing actual agent capabilities + +**Verdict**: ❌ Rejected - Want to test the actual LLMs agents use + +### Alternative 2: Generate Test Images +**Description**: Auto-generate test images using DALL-E/Stable Diffusion. + +**Pros**: +- No need for sample images +- Consistent test data +- Easy to create variations + +**Cons**: +- Expensive +- Generated images may not match real-world scenarios +- Additional complexity +- Slower test execution + +**Verdict**: ❌ Rejected - Out of scope, defer to future + +### Alternative 3: Video Support +**Description**: Support video inputs in addition to images. 
+ +**Pros**: +- More comprehensive multimodal coverage +- Test temporal understanding + +**Cons**: +- Significantly more complex +- Very high token costs +- Limited provider support +- Niche use case + +**Verdict**: ❌ Rejected - Out of scope, future consideration + +## Open Questions + +None - all design decisions have been made based on comprehensive research. + +## References + +### Research Documents +- `docs/updates/VISION_EVAL_RESEARCH_SUMMARY.md` - Detailed findings from 5 frameworks +- `examples/vision/README.md` - Comprehensive usage guide +- `examples/vision/INDEX.md` - Quick reference + +### External Resources +- Google ADK-Python: https://github.com/google/adk-python +- Mastra: https://github.com/mastra-ai/mastra +- Azure SDK: https://github.com/Azure/azure-sdk-for-python +- LangWatch: https://github.com/langwatch/langwatch +- Agent Skills: https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering + +### Related Specs +- `yaml-schema` - Requires extension for image content +- `evaluation` - May need multimodal scoring support +- `eval-execution` - Needs image loading capability + +## Approval + +**Proposed by**: AI Assistant +**Date**: January 2, 2026 +**Approval required from**: Project maintainers + +--- + +**Next Steps After Approval**: +1. Review and approve this proposal +2. Review `tasks.md` for implementation sequence +3. Review spec deltas in `specs/*/spec.md` +4. Begin implementation following task order diff --git a/openspec/changes/add-vision-evaluation/references/adk-python-research.md b/openspec/changes/add-vision-evaluation/references/adk-python-research.md new file mode 100644 index 00000000..10b2f03a --- /dev/null +++ b/openspec/changes/add-vision-evaluation/references/adk-python-research.md @@ -0,0 +1,644 @@ +# ADK-Python Image Evaluation Research Report + +Research Date: January 2, 2026 +Repository: google/adk-python (https://github.com/google/adk-python) + +## Executive Summary + +Google's ADK (Agent Development Kit) Python framework provides a comprehensive evaluation system for AI agents. While the framework doesn't have specific image-only evaluation examples, it demonstrates **multimodal content handling** through its agents and provides a robust evaluation infrastructure that can be adapted for vision tasks. + +## Key Findings + +### 1. 
Multimodal Content Handling + +#### Image Input Patterns + +The ADK framework supports multiple methods for handling non-text content: + +**a) Inline Image Data (Base64)** +```python +from google.genai import types +import base64 + +# Sample image data as base64 +SAMPLE_IMAGE_DATA = base64.b64decode( + "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==" +) + +# Create inline data part +types.Part( + inline_data=types.Blob( + data=SAMPLE_IMAGE_DATA, + mime_type="image/png", + display_name="sample_chart.png", + ) +) +``` + +**b) File URI References** +```python +# GCS URI (Vertex AI) +types.Part.from_uri(file_uri="gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf") + +# HTTPS URL +types.Part( + file_data=types.FileData( + file_uri="https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf", + mime_type="application/pdf", + display_name="Research Paper", + ) +) + +# Files API Upload (Gemini Developer API) +client = genai.Client() +uploaded_file = client.files.upload(file=temp_file_path) +types.Part( + file_data=types.FileData( + file_uri=uploaded_file.uri, + mime_type="text/markdown", + display_name="Contributing Guide", + ) +) +``` + +**c) Tool Return Values** +```python +def get_image(): + """Tool that returns image parts""" + return [types.Part.from_uri(file_uri="gs://replace_with_your_image_uri")] +``` + +**Key Pattern**: Images can be passed as: +- Part of static instructions (context) +- User input content +- Tool responses +- Multimodal tool results + +### 2. Evaluation Framework Architecture + +#### Core Evaluation Components + +**File: `src/google/adk/evaluation/`** + +1. **EvalCase** (`eval_case.py`) + ```python + class Invocation(EvalBaseModel): + invocation_id: str + user_content: genai_types.Content # Can contain image parts + final_response: Optional[genai_types.Content] + intermediate_data: Optional[IntermediateDataType] + rubrics: Optional[list[Rubric]] + + class EvalCase(EvalBaseModel): + eval_id: str + conversation: Optional[StaticConversation] + conversation_scenario: Optional[ConversationScenario] + rubrics: Optional[list[Rubric]] + ``` + +2. **Rubrics** (`eval_rubrics.py`) + ```python + class RubricContent(EvalBaseModel): + text_property: Optional[str] = Field( + description='The property being evaluated. Example: "The agent\'s response is grammatically correct."' + ) + + class Rubric(EvalBaseModel): + rubric_id: str + rubric_content: RubricContent + description: Optional[str] + type: Optional[str] # e.g., "TOOL_USE_QUALITY", "FINAL_RESPONSE_QUALITY" + + class RubricScore(EvalBaseModel): + rubric_id: str + rationale: Optional[str] + score: Optional[float] + ``` + +3. **Evaluation Metrics** (`eval_metrics.py`) + ```python + class PrebuiltMetrics(Enum): + TOOL_TRAJECTORY_AVG_SCORE = "tool_trajectory_avg_score" + RESPONSE_EVALUATION_SCORE = "response_evaluation_score" + RESPONSE_MATCH_SCORE = "response_match_score" + SAFETY_V1 = "safety_v1" + FINAL_RESPONSE_MATCH_V2 = "final_response_match_v2" + RUBRIC_BASED_FINAL_RESPONSE_QUALITY_V1 = "rubric_based_final_response_quality_v1" + HALLUCINATIONS_V1 = "hallucinations_v1" + RUBRIC_BASED_TOOL_USE_QUALITY_V1 = "rubric_based_tool_use_quality_v1" + + class JudgeModelOptions(EvalBaseModel): + judge_model: str = "gemini-2.5-flash" + num_samples: int = 5 # Sample multiple times for reliability + + class RubricsBasedCriterion(BaseCriterion): + judge_model_options: JudgeModelOptions + rubrics: list[Rubric] + ``` + +4. 
**Evaluation Configuration** (`eval_config.py`) + ```python + class EvalConfig(BaseModel): + criteria: dict[str, Union[Threshold, BaseCriterion]] + user_simulator_config: Optional[BaseUserSimulatorConfig] + + # Example configuration + { + "criteria": { + "tool_trajectory_avg_score": 1.0, + "response_match_score": 0.5, + "final_response_match_v2": { + "threshold": 0.5, + "judge_model_options": { + "judge_model": "gemini-2.5-flash", + "num_samples": 5 + } + } + } + } + ``` + +### 3. Multimodal Agent Examples + +#### Example 1: Static Non-Text Content +**Location**: `contributing/samples/static_non_text_content/` + +```python +def create_static_instruction_with_file_upload(): + """Create static instruction with images and files""" + + parts = [ + types.Part.from_text(text="You are an AI assistant..."), + + # Inline image data + types.Part( + inline_data=types.Blob( + data=SAMPLE_IMAGE_DATA, + mime_type="image/png", + display_name="sample_chart.png", + ) + ), + + types.Part.from_text(text="This is a sample chart..."), + ] + + # Add file references based on API variant + if api_variant == GoogleLLMVariant.VERTEX_AI: + parts.append( + types.Part(file_data=types.FileData( + file_uri="gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf", + mime_type="application/pdf", + )) + ) + + return types.Content(parts=parts) + +root_agent = Agent( + model="gemini-2.5-flash", + name="static_non_text_content_demo_agent", + static_instruction=create_static_instruction_with_file_upload(), + instruction="Please analyze the user's question..." +) +``` + +#### Example 2: Multimodal Tool Results +**Location**: `contributing/samples/multimodal_tool_results/` + +```python +def get_image(): + """Tool that returns image parts""" + return [types.Part.from_uri(file_uri="gs://replace_with_your_image_uri")] + +root_agent = LlmAgent( + name="image_describing_agent", + description="image describing agent", + instruction="Get the image using the get_image tool, and describe it.", + model="gemini-2.0-flash", + tools=[get_image], +) + +app = App( + name="multimodal_tool_results", + root_agent=root_agent, + plugins=[MultimodalToolResultsPlugin()], +) +``` + +#### Example 3: Image Generation Agent +**Location**: `contributing/samples/generate_image/` + +Shows how to generate images and handle them in the conversation flow. + +### 4. Best Practices for Image Evaluation + +Based on the framework's patterns, here are recommended approaches: + +#### A. Test Case Structure + +```python +# eval_case with image input +test_case = EvalCase( + eval_id="vision_test_001", + conversation=[ + Invocation( + invocation_id="inv_001", + user_content=genai_types.Content( + parts=[ + types.Part.from_text(text="Describe this image:"), + types.Part( + inline_data=types.Blob( + data=image_bytes, + mime_type="image/jpeg", + ) + ) + ] + ), + final_response=genai_types.Content( + parts=[types.Part.from_text(text="Expected response...")] + ), + rubrics=[ + Rubric( + rubric_id="vision_accuracy", + rubric_content=RubricContent( + text_property="The agent correctly identifies the main objects in the image" + ), + type="VISION_ACCURACY" + ), + Rubric( + rubric_id="vision_detail", + rubric_content=RubricContent( + text_property="The agent provides detailed description including colors, positions, and context" + ), + type="VISION_DETAIL" + ) + ] + ) + ] +) +``` + +#### B. 
Evaluation Configuration for Vision Tasks + +```python +eval_config = EvalConfig( + criteria={ + # Use LLM-as-judge for vision tasks + "rubric_based_final_response_quality_v1": RubricsBasedCriterion( + threshold=0.7, + judge_model_options=JudgeModelOptions( + judge_model="gemini-2.5-flash", # Vision-capable model + num_samples=5 + ), + rubrics=[ + Rubric( + rubric_id="object_detection", + rubric_content=RubricContent( + text_property="The response correctly identifies all major objects visible in the image" + ) + ), + Rubric( + rubric_id="spatial_understanding", + rubric_content=RubricContent( + text_property="The response accurately describes spatial relationships between objects" + ) + ), + Rubric( + rubric_id="detail_completeness", + rubric_content=RubricContent( + text_property="The response includes relevant details about colors, textures, and context" + ) + ) + ] + ), + + # Safety check for vision + "safety_v1": 0.9, + + # Hallucination detection + "hallucinations_v1": HallucinationsCriterion( + threshold=0.2, # Low threshold = fewer hallucinations allowed + judge_model_options=JudgeModelOptions( + judge_model="gemini-2.5-flash", + num_samples=3 + ) + ) + } +) +``` + +#### C. Tool Trajectory Evaluation for Vision Agents + +```python +# When evaluating vision agents that use tools +eval_config = EvalConfig( + criteria={ + "tool_trajectory_avg_score": ToolTrajectoryCriterion( + threshold=1.0, + match_type=ToolTrajectoryCriterion.MatchType.IN_ORDER + ), + "rubric_based_tool_use_quality_v1": RubricsBasedCriterion( + threshold=0.8, + rubrics=[ + Rubric( + rubric_id="tool_selection", + rubric_content=RubricContent( + text_property="The agent selects appropriate vision tools for the task" + ), + type="TOOL_USE_QUALITY" + ) + ] + ) + } +) +``` + +### 5. Key Architectural Patterns + +#### Pattern 1: Content Parts as Universal Container + +```python +# Content is composed of Parts +# Parts can be: text, inline_data (images), file_data (URIs), function_call, function_response +class Content: + parts: list[Part] + role: str # "user" | "model" + +# This allows mixing text and images naturally +user_input = Content( + role="user", + parts=[ + Part.from_text("What's in this image?"), + Part(inline_data=Blob(data=image_data, mime_type="image/jpeg")) + ] +) +``` + +#### Pattern 2: Static Instructions with Context + +```python +# Static instructions can include visual context that's available to all conversations +agent = Agent( + static_instruction=Content( + parts=[ + Part.from_text("You are a visual assistant..."), + Part(inline_data=Blob(...)), # Reference image + Part.from_text("Use the reference image above as context...") + ] + ), + instruction="Dynamic per-request instructions..." +) +``` + +#### Pattern 3: Multimodal Tool Results + +```python +# Tools can return multimodal content +def analyze_chart(): + return [ + Part.from_text("Chart shows upward trend"), + Part.from_uri("gs://bucket/enhanced_chart.png") + ] + +# Framework handles multimodal tool results through plugins +app = App( + root_agent=agent, + plugins=[MultimodalToolResultsPlugin()] +) +``` + +#### Pattern 4: LLM-as-Judge for Multimodal Evaluation + +```python +# Use vision-capable judge model to evaluate vision task responses +judge_evaluates = f""" +Given: +- Original image: {image_uri} +- User question: {user_question} +- Agent response: {agent_response} +- Rubric: {rubric.rubric_content.text_property} + +Evaluate if the response satisfies the rubric criterion. +Score: 0-1 +""" +``` + +### 6. 
Event Logging Structure + +The framework logs detailed event information: + +```python +{ + "invocation_id": "CFs9iCdD", + "event_id": "urXUWHfc", + "model_request": { + "model": "gemini-1.5-flash", + "contents": [/* multimodal content */], + "config": { + "system_instruction": "...", + "tools": [/* tool definitions */] + } + }, + "model_response": { + "candidates": [{ + "content": {/* response content */}, + "finish_reason": "STOP", + "safety_ratings": [/* safety scores */] + }], + "usage_metadata": { + "candidates_token_count": 16, + "prompt_token_count": 84, + "total_token_count": 100 + } + } +} +``` + +## Recommendations for AgentV Implementation + +### 1. Eval Case Structure + +```typescript +interface VisionEvalCase { + eval_id: string; + invocations: Array<{ + user_content: { + text: string; + images?: Array<{ + data: string; // base64 or URI + mime_type: string; + display_name?: string; + }>; + }; + expected_response?: string; + rubrics: Array<{ + rubric_id: string; + criterion: string; + type: "VISION_ACCURACY" | "VISION_DETAIL" | "SPATIAL_UNDERSTANDING"; + }>; + }>; +} +``` + +### 2. YAML Configuration Pattern + +```yaml +eval_cases: + - eval_id: "image_description_001" + conversation: + - invocation_id: "inv_001" + user_content: + text: "Describe the objects in this image" + images: + - uri: "file://./test_images/scene_001.jpg" + mime_type: "image/jpeg" + rubrics: + - rubric_id: "object_detection" + criterion: "Correctly identifies all major objects" + threshold: 0.8 + - rubric_id: "spatial_relations" + criterion: "Accurately describes object positions" + threshold: 0.7 + +eval_config: + criteria: + rubric_based_vision_quality: + threshold: 0.75 + judge_model: "gemini-2.5-flash" + num_samples: 5 +``` + +### 3. Rubric Types for Vision + +- **VISION_ACCURACY**: Object detection accuracy +- **VISION_DETAIL**: Level of detail in descriptions +- **SPATIAL_UNDERSTANDING**: Understanding of spatial relationships +- **COLOR_ACCURACY**: Correct identification of colors +- **CONTEXT_UNDERSTANDING**: Understanding scene context +- **OCR_ACCURACY**: Text extraction accuracy (if applicable) +- **VISUAL_REASONING**: Ability to reason about visual content + +### 4. Multi-Sample Evaluation + +Follow ADK's pattern of sampling judge model multiple times (default: 5) for reliability: + +```python +num_samples = 5 +scores = [] +for _ in range(num_samples): + score = judge_model.evaluate(image, response, rubric) + scores.append(score) +final_score = statistics.mean(scores) +``` + +### 5. Image Storage Patterns + +Support multiple image sources: +- **Inline Base64**: For small images in YAML +- **File URIs**: `file://./path/to/image.jpg` +- **HTTP/HTTPS URIs**: For external images +- **Cloud Storage**: `gs://bucket/image.jpg` (if using GCP) + +### 6. Evaluation Flow + +``` +1. Load eval case with image references +2. Resolve image data (download if URI, decode if base64) +3. Run agent with image + text input +4. Collect agent response +5. For each rubric: + a. Sample judge model N times + b. Average scores + c. Compare to threshold +6. Aggregate results +7. Generate report +``` + +## Code Examples to Reference + +### Key Files to Study + +1. **Multimodal Content Handling**: + - `contributing/samples/static_non_text_content/agent.py` + - `contributing/samples/multimodal_tool_results/agent.py` + +2. 
**Evaluation Infrastructure**: + - `src/google/adk/evaluation/eval_case.py` + - `src/google/adk/evaluation/eval_rubrics.py` + - `src/google/adk/evaluation/eval_metrics.py` + - `src/google/adk/evaluation/eval_config.py` + +3. **LLM-as-Judge Implementation**: + - `src/google/adk/evaluation/llm_as_judge.py` + - `src/google/adk/evaluation/rubric_based_evaluator.py` + +4. **Safety and Hallucination Detection**: + - `src/google/adk/evaluation/safety_evaluator.py` + - `src/google/adk/evaluation/hallucinations_v1.py` + +## Gaps and Adaptations Needed + +### What ADK Doesn't Provide + +1. **No specific vision-focused eval examples** + - Need to create vision-specific rubrics + - Need vision test datasets + +2. **No image similarity metrics** + - No CLIP score, SSIM, etc. + - Relies on LLM-as-judge for vision evaluation + +3. **No automated image annotation** + - Need to manually create expected responses + - No computer vision metrics integration + +### What to Adapt + +1. **Create vision-specific rubric library** + ```python + VISION_RUBRICS = { + "object_detection": "Identifies all major objects correctly", + "spatial_understanding": "Describes spatial relationships accurately", + "color_accuracy": "Identifies colors correctly", + # etc. + } + ``` + +2. **Image preprocessing utilities** + ```python + def prepare_image_for_eval(image_path): + # Resize, normalize, encode as base64 + pass + ``` + +3. **Vision-specific judge prompts** + ```python + VISION_JUDGE_TEMPLATE = """ + You are evaluating a vision AI agent's response. + + Image: {image_uri} + Question: {question} + Agent Response: {response} + Rubric: {rubric} + + Score the response 0-1 based on the rubric. + """ + ``` + +## Conclusion + +The ADK-Python framework provides a solid foundation for multimodal evaluation through: + +1. **Flexible content model** supporting images via inline_data and file_data +2. **Rubric-based evaluation** system adaptable to vision tasks +3. **LLM-as-judge pattern** that works with vision-capable models +4. **Multi-sample evaluation** for reliability +5. **Comprehensive event logging** for debugging + +**Key Takeaway**: While ADK doesn't have vision-specific examples, its architecture is well-suited for image evaluation. The main work needed is creating vision-specific rubrics and test cases, which can follow the existing patterns for text-based evaluation. + +## References + +- Repository: https://github.com/google/adk-python +- Static Non-Text Content Example: `contributing/samples/static_non_text_content/` +- Multimodal Tool Results: `contributing/samples/multimodal_tool_results/` +- Evaluation Module: `src/google/adk/evaluation/` diff --git a/openspec/changes/add-vision-evaluation/references/research-summary.md b/openspec/changes/add-vision-evaluation/references/research-summary.md new file mode 100644 index 00000000..ae7bb70f --- /dev/null +++ b/openspec/changes/add-vision-evaluation/references/research-summary.md @@ -0,0 +1,945 @@ +# Vision Evaluation Research Summary + +## Executive Summary + +This document summarizes research into best practices for adding image input evaluation capabilities to AgentV, based on analysis of leading AI agent and evaluation frameworks. + +**Date**: January 2, 2026 +**Repositories Analyzed**: 4 leading frameworks + +--- + +## 1. Research Methodology + +### Repositories Researched + +1. **google/adk-python** - Google's Agent Development Kit (Python) + - Focus: Rubric-based evaluation, multimodal content handling + +2. 
**mastra-ai/mastra** - TypeScript agent framework + - Focus: Production patterns, structured outputs, Braintrust integration + +3. **Azure/azure-sdk-for-python** - Microsoft Azure SDKs + - Focus: Image input APIs, Computer Vision, testing patterns + +4. **langwatch/langwatch** - LLM observability and evaluation + - Focus: Evaluation architecture, batch processing, metrics + +### Research Approach + +Each repository was systematically analyzed using GitHub CLI searches for: +- Image input handling patterns +- Multimodal evaluation examples +- Vision-specific evaluators/judges +- Testing frameworks and best practices +- Documentation and guides + +--- + +## 2. Key Findings by Framework + +### 2.1 Google ADK-Python + +**Multimodal Content Model**: +```python +Content( + parts=[ + Part.from_text("Describe this image"), + Part(inline_data=Blob(data=image_bytes, mime_type="image/jpeg")) + ] +) +``` + +**Key Patterns**: +- ✅ Unified `Content` and `Parts` model for text + images +- ✅ Three image input methods: inline base64, URIs, tool returns +- ✅ Rubric-based evaluation with vision-capable judges +- ✅ Multi-sample evaluation (5x) for reliability +- ✅ Comprehensive event logging + +**Evaluation Architecture**: +```python +Invocation( + user_content=Content(parts=[...]), + rubrics=[ + Rubric( + rubric_id="vision_accuracy", + rubric_content=RubricContent( + text_property="Correctly identifies main objects" + ), + type="VISION_ACCURACY" + ) + ] +) +``` + +**Vision-Specific Rubric Types**: +- Object detection accuracy +- Spatial understanding +- Color accuracy +- Detail completeness +- Context understanding + +**Gaps Identified**: +- ❌ No specific vision eval examples in repo +- ❌ No computer vision metrics (SSIM, CLIP) +- ❌ No automated image annotation tools + +--- + +### 2.2 Mastra (TypeScript) + +**Message Format**: +```typescript +{ + role: "user", + content: [ + { type: "text", text: "Describe the image" }, + { + type: "image", + image: "https://example.com/image.jpg", + mimeType: "image/jpeg" + } + ] +} +``` + +**Supported Image Formats**: +- URL references (HTTP/HTTPS) +- Data URIs (base64) +- Binary data (Uint8Array, Buffer) +- Cloud storage (gs://, s3://) + +**Vision Model Integration**: +- OpenAI: GPT-4o, GPT-4 Turbo +- Anthropic: Claude 3.5 Sonnet, Claude 3 Haiku/Opus +- Google: Gemini 2.5 Pro/Flash + +**Structured Output Pattern**: +```typescript +const result = await agent.generate(messages, { + output: z.object({ + bird: z.boolean(), + species: z.string(), + location: z.string() + }) +}); +``` + +**Evaluation with Braintrust**: +```typescript +Eval("Is a bird", { + data: () => [ + { input: IMAGE_URL, expected: { bird: true, species: "robin" } } + ], + task: async (input) => await analyzeImage(input), + scores: [containsScorer, hallucinationScorer] +}); +``` + +**Built-in Scorers**: +- Hallucination detection +- Faithfulness checking +- Content similarity + +**Key Strengths**: +- ✅ Production-ready TypeScript patterns +- ✅ Strong typing with Zod schemas +- ✅ Braintrust evaluation integration +- ✅ Memory persistence with images +- ✅ UI integration examples + +--- + +### 2.3 Azure SDK for Python + +**Dual Input Methods**: +```python +# Method 1: URL +result = client.analyze_from_url( + image_url="https://example.com/image.jpg", + visual_features=[VisualFeatures.CAPTION] +) + +# Method 2: Binary data +with open("image.jpg", "rb") as f: + result = client.analyze( + image_data=f.read(), + visual_features=[VisualFeatures.CAPTION] + ) +``` + +**Chat Completions with Vision** (Azure OpenAI): 
+```python +completion = client.chat.completions.create( + model="gpt-4o", + messages=[{ + "role": "user", + "content": [ + {"type": "text", "text": "What's in this image?"}, + { + "type": "image_url", + "image_url": { + "url": image_url, + "detail": "high" # low, high, auto + } + } + ] + }] +) +``` + +**Computer Vision Features**: +- Tags, captions, dense captions +- Object detection with bounding boxes +- OCR (text extraction) +- People detection +- Smart crops + +**Testing Patterns**: +```python +class ImageAnalysisTestBase(AzureRecordedTestCase): + def _do_analysis(self, image_source, visual_features): + if "http" in image_source: + return self.client.analyze_from_url(...) + else: + with open(image_source, "rb") as f: + return self.client.analyze(image_data=f.read(), ...) +``` + +**Evaluation Integration**: +```python +evaluator = ContentSafetyEvaluator( + credential=cred, + azure_ai_project=project +) +score = evaluator(conversation=multimodal_conversation) +``` + +**Key Insights**: +- ✅ Flexible input handling (URL + binary) +- ✅ Comprehensive Computer Vision API +- ✅ Structured response models +- ✅ Test infrastructure with recording/playback +- ✅ Multiple authentication methods + +**Image Format Support**: +- Formats: JPEG, PNG, GIF, BMP, WEBP, ICO, TIFF, MPO +- Size: 50x50 to 16,000x16,000 pixels, max 20 MB + +--- + +### 2.4 LangWatch + +**Finding**: No native multimodal support, but excellent general evaluation patterns. + +**Evaluator Architecture**: +```typescript +interface EvaluatorConfig { + id: string; + evaluatorType: string; + name: string; + settings: Record; + inputs: Field[]; + mappings: Record; +} +``` + +**Evaluation Result Schema**: +```typescript +{ + status: "processed" | "skipped" | "error", + passed?: boolean, + score?: number, // 0-1 + label?: string, + details?: string, + cost?: Money +} +``` + +**Batch Evaluation Pattern**: +```python +for index, row in evaluation.loop(df.iterrows()): + evaluation.submit(evaluate_fn, index, row) +evaluation.wait_for_completion() +``` + +**Key Patterns to Adopt**: +- ✅ Pluggable evaluator architecture +- ✅ Flexible result schema (score, passed, label, details) +- ✅ Dataset + runner + evaluator separation +- ✅ Parallel execution with progress tracking +- ✅ Cost tracking per evaluation +- ✅ Version tracking and reproducibility + +**Adaptable for Vision**: +- Extend input types to include images +- Add vision-specific evaluators +- Support image datasets +- Add visual comparison UI + +--- + +## 3. 
References + +### Repository Links + +- [google/adk-python](https://github.com/google/adk-python) +- [mastra-ai/mastra](https://github.com/mastra-ai/mastra) +- [Azure/azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python) +- [langwatch/langwatch](https://github.com/langwatch/langwatch) + +### Key Documentation + +- [OpenAI Vision API](https://platform.openai.com/docs/guides/vision) +- [Anthropic Claude Vision](https://docs.anthropic.com/claude/docs/vision) +- [Google Gemini Vision](https://ai.google.dev/gemini-api/docs/vision) +- [Azure Computer Vision](https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/) + +### Related Papers + +- "GPT-4V(ision) System Card" - OpenAI +- "Claude 3 Model Card" - Anthropic +- "Gemini: A Family of Highly Capable Multimodal Models" - Google + +--- + +**End of Research Summary** + +### 3.1 Image Input Format + +Based on Mastra and Azure patterns: + +```yaml +# YAML eval file format +input_messages: + - role: user + content: + - type: text + value: "Describe this image" + - type: image + value: ./test-images/photo.jpg # Local file + detail: high # Optional: low, high, auto + - type: image_url + value: https://example.com/image.jpg # URL +``` + +**Supported sources**: +- Local files: `./path/to/image.jpg` +- HTTP URLs: `https://...` +- Data URIs: `data:image/jpeg;base64,...` +- Cloud storage: `gs://bucket/image.jpg`, `s3://bucket/image.jpg` + +**MIME types to support**: +- `image/jpeg`, `image/png`, `image/gif` +- `image/webp`, `image/bmp` +- Auto-detect from file extension + +--- + +### 3.2 Evaluator Types + +#### LLM-Based Judges + +Located in `evaluators/vision/*.md`: + +1. **Image Description Judge** + ```yaml + evaluators: + - name: description_quality + type: llm_judge + prompt: evaluators/vision/image-description-judge.md + ``` + + Dimensions: + - Visual Accuracy (40%) + - Completeness (30%) + - Clarity (20%) + - Relevance (10%) + +2. **Activity Recognition Judge** + - Activity identification + - Count accuracy + - Pose/interaction recognition + +3. **Comparison Judge** + - Change detection + - Spatial precision + - Completeness + +4. **Reasoning Judge** + - Logical correctness + - Visual understanding + - Problem-solving quality + +5. **Structured Output Judge** + - JSON validity + - Schema compliance + - Data accuracy + +6. **Quality Assessment Judge** + - Technical quality + - Composition + - Aesthetic evaluation + +#### Code-Based Validators + +Located in `evaluators/vision/*.py`: + +1. **count_validator.py** + ```python + validate_object_count(output, expected_output) -> Result + ``` + +2. **ocr_validator.py** + ```python + validate_ocr_accuracy(output, expected, threshold=0.7) -> Result + ``` + +3. **json_validator.py** + ```python + validate_json_structure(output, expected, schema) -> Result + ``` + +4. **chart_validator.py** + ```python + validate_chart_data(output, expected, tolerance=0.15) -> Result + ``` + +--- + +### 3.3 Example Eval Cases + +#### Basic Image Analysis + +```yaml +- id: simple-image-description + input_messages: + - role: system + content: You can analyze images and provide detailed descriptions. + - role: user + content: + - type: text + value: "Describe what you see in this image." 
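+        # The image part below references a local file under test-images/; `detail: high`
+        # requests full-resolution analysis (more tokens per image), while `low` is cheaper.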
+ - type: image + value: ./test-images/office.jpg + detail: high + + expected_messages: + - role: assistant + content: |- + The image shows an office workspace with: + - A desk with computer monitor + - Office chair + - Natural lighting from window + + execution: + evaluators: + - name: content_accuracy + type: llm_judge + prompt: ../../evaluators/vision/image-description-judge.md +``` + +#### Structured Output + +```yaml +- id: structured-object-detection + input_messages: + - role: user + content: + - type: text + value: |- + Return JSON with this structure: + {"objects": [{"name": "...", "count": 1, "position": "..."}]} + - type: image + value: ./test-images/shelf.jpg + + expected_messages: + - role: assistant + content: |- + ```json + { + "objects": [ + {"name": "bottle", "count": 5, "position": "top shelf"}, + {"name": "can", "count": 8, "position": "middle shelf"} + ] + } + ``` + + execution: + evaluators: + - name: json_validation + type: code_judge + script: uv run json_validator.py + cwd: ../../evaluators/vision +``` + +#### Multi-Turn Conversation + +```yaml +- id: conversation-turn-1 + conversation_id: vision-chat-001 + input_messages: + - role: user + content: + - type: text + value: "What are the main elements?" + - type: image + value: ./architecture.jpg + expected_messages: + - role: assistant + content: "Main elements: API Gateway, Services, Database..." + +- id: conversation-turn-2 + conversation_id: vision-chat-001 + input_messages: + # Full history required + - role: user + content: + - type: text + value: "What are the main elements?" + - type: image + value: ./architecture.jpg + - role: assistant + content: "Main elements: API Gateway, Services, Database..." + - role: user + content: "Explain the API Gateway's role" + expected_messages: + - role: assistant + content: "The API Gateway handles routing and authentication..." 
+``` + +--- + +### 3.4 Context Management + +**Token Budget Strategy**: + +```typescript +const IMAGE_TOKEN_COSTS = { + low: 85, // 512x512 or less + high: 765, // 512x512 to 2048x2048 + auto: 1360 // 2048x2048+ +}; + +const MAX_CONTEXT = 128000; // Model context limit +const RESERVE = 0.3; // 30% for output + safety + +const maxImages = Math.floor( + (MAX_CONTEXT * (1 - RESERVE)) / IMAGE_TOKEN_COSTS.high +); +// ≈ 117 images at high detail +``` + +**Progressive Loading**: + +```typescript +interface ImageProcessingStrategy { + // Level 1: Metadata only + getMetadata(imagePath: string): ImageMetadata; + + // Level 2: Text description + getDescription(imagePath: string): Promise; + + // Level 3: Full visual analysis + analyzeImage(imagePath: string): Promise; +} +``` + +**File System Caching**: + +```typescript +const visionCache = new Map(); + +async function processWithCache(imagePath: string) { + const cacheKey = await hashFile(imagePath); + + if (visionCache.has(cacheKey)) { + return visionCache.get(cacheKey); + } + + const analysis = await analyzeImage(imagePath); + visionCache.set(cacheKey, analysis); + + // Persist to disk + await fs.writeFile( + `./cache/vision/${cacheKey}.json`, + JSON.stringify(analysis) + ); + + return analysis; +} +``` + +--- + +### 3.5 Cost Optimization + +**Pricing Reference** (as of Jan 2026): + +| Provider | Model | Input (per 1M tokens) | Image Token Cost* | +|----------|-------|---------------------|------------------| +| OpenAI | GPT-4o | $2.50 | $1.91-$3.40 per 1K images | +| Anthropic | Claude 3.5 | $3.00 | $2.30-$4.08 per 1K images | +| Google | Gemini 2.5 Flash | $0.075 | $0.06-$0.10 per 1K images | + +*Based on average 765-1360 tokens per image + +**Cost Optimization Strategies**: + +1. **Use detail levels appropriately**: + ```yaml + - type: image + value: ./image.jpg + detail: low # For simple tasks, saves ~90% tokens + ``` + +2. **Choose cost-effective models**: + - Gemini 2.5 Flash: 20-30x cheaper than GPT-4o + - Use for high-volume testing + - Upgrade to GPT-4o/Claude for production + +3. **Cache image descriptions**: + ```typescript + // First pass: Analyze image + const description = await analyzeImage(image); + await cache.set(imageHash, description); + + // Subsequent passes: Use cached text (20 tokens vs 765) + const cachedDescription = await cache.get(imageHash); + ``` + +4. **Batch evaluation**: + ```typescript + // Process multiple evals in parallel + const results = await Promise.all( + evalCases.map(ec => evaluateWithImage(ec)) + ); + ``` + +5. **Use code validators when possible**: + - Object counting: Free + - OCR validation: Free + - JSON validation: Free + - Only use LLM judges for subjective evaluation + +--- + +## 4. 
Best Practices Summary + +### 4.1 Evaluation Design + +✅ **Multi-dimensional rubrics** +- Weight dimensions appropriately +- Visual accuracy typically 35-40% +- Completeness 25-30% +- Clarity/presentation 15-20% + +✅ **Multiple evaluator types** +- LLM judges for subjective assessment +- Code validators for objective metrics +- Combine for comprehensive evaluation + +✅ **Multi-sample evaluation** +- Run LLM judges 3-5 times +- Aggregate scores for reliability +- Report variance/confidence + +✅ **Clear scoring thresholds** +- 0.9-1.0: Production ready +- 0.7-0.89: Good, minor improvements +- 0.5-0.69: Acceptable, significant gaps +- Below 0.5: Not passing + +--- + +### 4.2 Image Input Handling + +✅ **Support multiple sources** +- Local files (primary for testing) +- HTTP URLs (public images) +- Cloud storage (enterprise) +- Data URIs (embedded) + +✅ **Specify MIME types** +- Always include for reliability +- Auto-detect from extension as fallback + +✅ **Use detail levels** +- `low`: Simple tasks, faster, cheaper +- `high`: Complex analysis, detailed +- `auto`: Let model decide + +✅ **Validate image requirements** +- Check size limits (50x50 to 16,000x16,000) +- Verify format support +- Ensure file accessibility + +--- + +### 4.3 Context Management + +✅ **Progressive disclosure** +- Load metadata first (cheap) +- Generate descriptions on demand +- Full analysis only when necessary + +✅ **Token budgeting** +- Calculate image token costs +- Reserve 30% for output +- Monitor utilization percentage + +✅ **File system caching** +- Hash images for cache keys +- Store analyses as JSON +- Pass references, not raw data + +✅ **Supervisor pattern** +- Isolate vision processing +- Separate orchestration context +- Prevent token pollution + +--- + +### 4.4 Testing Strategy + +✅ **Complexity levels** +```yaml +tests: + - simple: # Single object, clear image + complexity: 1 + - medium: # Multiple objects, some occlusion + complexity: 2 + - complex: # Scene understanding, reasoning + complexity: 3 +``` + +✅ **Coverage areas** +- Basic description +- Object detection/counting +- Spatial reasoning +- Text extraction (OCR) +- Multi-image comparison +- Quality assessment +- Logical reasoning +- Structured output + +✅ **Edge cases** +- Low quality images +- Partially occluded objects +- Ambiguous scenes +- Multiple valid interpretations +- Empty/minimal content + +--- + +## 5. Files Created + +### Evaluation Files (YAML) + +1. `basic-image-analysis.yaml` - 7 basic vision eval cases +2. `advanced-vision-tasks.yaml` - 7 advanced eval cases + +### LLM Judge Prompts (Markdown) + +3. `image-description-judge.md` +4. `activity-judge.md` +5. `comparison-judge.md` +6. `reasoning-judge.md` +7. `structured-output-judge.md` +8. `quality-assessment-judge.md` + +### Code Validators (Python) + +9. `count_validator.py` +10. `ocr_validator.py` +11. `json_validator.py` +12. `chart_validator.py` + +### Documentation + +13. `README.md` - Comprehensive guide +14. `RESEARCH_SUMMARY.md` - This document + +--- + +## 6. Next Steps + +### Phase 1: Core Implementation (Week 1-2) + +1. **Extend AgentV Schema** + - Add image content type to message schema + - Support detail levels + - Validate image paths/URLs + +2. **Image Loading** + - Implement file loader + - URL fetcher with validation + - Base64 encoder + - MIME type detection + +3. **Provider Integration** + - Update OpenAI provider for vision + - Update Anthropic provider + - Update Google provider + - Test with real models + +### Phase 2: Evaluators (Week 3) + +4. 
**LLM Judge Integration** + - Load judge prompts from MD files + - Pass image references to judges + - Parse structured evaluation results + +5. **Code Validator Runner** + - Execute Python validators with `uv run` + - Pass eval data as JSON + - Parse results + +6. **Test Evaluators** + - Create test images + - Run basic eval suite + - Validate scoring + +### Phase 3: Advanced Features (Week 4) + +7. **Context Management** + - Implement progressive disclosure + - Add token budgeting + - File system caching + +8. **Batch Processing** + - Parallel evaluation + - Progress tracking + - Cost reporting + +9. **Documentation** + - Usage guide + - API reference + - Tutorial videos + +### Phase 4: Computer Vision Metrics (Future) + +10. **Native CV Evaluators** + - SSIM (structural similarity) + - Perceptual hashing + - CLIP embeddings + - Object detection validation + +11. **Specialized Evaluators** + - Face detection + - Logo recognition + - Medical imaging + - Document understanding + +--- + +## 7. Success Metrics + +### Technical Metrics + +- ✅ Support 4+ vision-capable providers +- ✅ Handle 3+ image input formats +- ✅ Implement 6+ vision evaluators +- ✅ Achieve <2s avg eval latency +- ✅ Support images up to 16MP +- ✅ Cost tracking per evaluation + +### Quality Metrics + +- ✅ Evaluation accuracy >90% vs human judgment +- ✅ Hallucination detection >85% accuracy +- ✅ Object count accuracy >95% +- ✅ OCR validation >80% accuracy +- ✅ Multi-sample consistency >90% + +### Usability Metrics + +- ✅ Documentation completeness score >90% +- ✅ Example coverage: 10+ eval cases +- ✅ Setup time <15 minutes +- ✅ User satisfaction >4.5/5 + +--- + +## 8. References + +### Repository Links + +- [google/adk-python](https://github.com/google/adk-python) +- [mastra-ai/mastra](https://github.com/mastra-ai/mastra) +- [Azure/azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python) +- [langwatch/langwatch](https://github.com/langwatch/langwatch) + +### Key Documentation + +- [OpenAI Vision API](https://platform.openai.com/docs/guides/vision) +- [Anthropic Claude Vision](https://docs.anthropic.com/claude/docs/vision) +- [Google Gemini Vision](https://ai.google.dev/gemini-api/docs/vision) +- [Azure Computer Vision](https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/) + +### Related Papers + +- "GPT-4V(ision) System Card" - OpenAI +- "Claude 3 Model Card" - Anthropic +- "Gemini: A Family of Highly Capable Multimodal Models" - Google + +--- + +## Appendix A: Token Cost Calculator + +```typescript +function estimateImageTokens( + width: number, + height: number, + detail: 'low' | 'high' | 'auto' +): number { + if (detail === 'low') { + return 85; + } + + // High detail calculation (OpenAI algorithm) + const scaledWidth = Math.min(width, 2048); + const scaledHeight = Math.min(height, 2048); + + // Scale to fit 768px shortest side + const scale = 768 / Math.min(scaledWidth, scaledHeight); + const finalWidth = Math.ceil(scaledWidth * scale / 512) * 512; + const finalHeight = Math.ceil(scaledHeight * scale / 512) * 512; + + const tiles = (finalWidth / 512) * (finalHeight / 512); + return 170 * tiles + 85; // Base 85 + 170 per tile +} + +// Examples: +estimateImageTokens(1024, 768, 'high'); // ≈ 765 +estimateImageTokens(2048, 1536, 'high'); // ≈ 1105 +estimateImageTokens(512, 512, 'high'); // ≈ 255 +estimateImageTokens(4096, 4096, 'low'); // 85 +``` + +--- + +## Appendix B: Sample Test Dataset + +Recommended test images to include: + +1. **Office workspace** - Basic description +2. 
**Team meeting** - People counting +3. **Desk arrangement** - Spatial reasoning +4. **Document scan** - OCR testing +5. **Before/after comparison** - Change detection +6. **Color palette** - Color analysis +7. **Product shelf** - Object detection +8. **Chess position** - Logical reasoning +9. **Architecture diagram** - Understanding +10. **Landscape photo** - Quality assessment +11. **Sales chart** - Data extraction +12. **Celebration scene** - Context inference +13. **Floor plan** - Measurement +14. **Low quality image** - Error handling +15. **Ambiguous scene** - Edge case + +--- + +**End of Research Summary** diff --git a/openspec/changes/add-vision-evaluation/specs/vision-evaluation/spec.md b/openspec/changes/add-vision-evaluation/specs/vision-evaluation/spec.md new file mode 100644 index 00000000..d0d30b52 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/specs/vision-evaluation/spec.md @@ -0,0 +1,330 @@ +# vision-evaluation Specification + +## Purpose +Provide comprehensive, self-contained vision evaluation examples demonstrating best practices for testing AI agents with image inputs. Organized as a standalone package under `examples/showcase/vision/` with all necessary assets. + +## ADDED Requirements + +### Requirement: Vision Examples MUST be self-contained in examples/showcase/vision/ +All vision evaluation files SHALL be organized in a single directory structure under `examples/showcase/vision/`, making it easy to discover, understand, and use as a complete package. + +#### Scenario: Directory structure is self-contained +Given the vision examples directory +When inspecting `examples/showcase/vision/` +Then it SHALL contain: +- `.agentv/` - Configuration files +- `datasets/` - Evaluation YAML files +- `evaluators/` - LLM judges and code validators +- `test-images/` - Placeholder for user test images +- `README.md` - Comprehensive documentation +- `INDEX.md` - Quick reference guide + +--- + +### Requirement: Basic Image Analysis Examples MUST cover fundamental tasks +The examples SHALL include 7 basic vision evaluation cases covering essential image understanding capabilities. 
+ +#### Scenario: Simple image description eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `simple-image-description` that: +- Includes an image in the input +- Expects a description of the image +- Uses `image-description-judge` for evaluation + +#### Scenario: Object detection eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `object-detection-simple` that: +- Asks to count/identify objects +- Includes expected count in output +- Uses `count_validator` for verification + +#### Scenario: Spatial relationships eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `spatial-relationships` that: +- Asks about object positions +- Expects spatial descriptions +- Uses `image-description-judge` for evaluation + +#### Scenario: OCR text extraction eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `text-extraction-ocr` that: +- Shows an image with text +- Expects text extraction +- Uses `ocr_validator` for verification + +#### Scenario: Multi-image comparison eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `multi-image-comparison` that: +- Includes two images (before/after) +- Expects change identification +- Uses `comparison-judge` for evaluation + +#### Scenario: Color identification eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `color-identification` that: +- Asks about colors in image +- Expects color descriptions +- Uses `image-description-judge` for evaluation + +#### Scenario: Image from URL eval case +Given `datasets/basic-image-analysis.yaml` +When loaded +Then it SHALL contain an eval case `image-from-url` that: +- References an image via HTTP URL +- Demonstrates URL loading capability +- Uses standard judge for evaluation + +--- + +### Requirement: Advanced Vision Examples MUST demonstrate complex scenarios +The examples SHALL include 7 advanced vision evaluation cases showcasing sophisticated capabilities. 
+ +#### Scenario: Structured JSON output eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `structured-object-detection` that: +- Requests JSON-formatted object detection results +- Expects specific JSON structure +- Uses `json_validator` and `structured-output-judge` + +#### Scenario: Visual reasoning eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `visual-reasoning-problem` that: +- Presents a logical puzzle with image (e.g., chess) +- Expects reasoned solution +- Uses `reasoning-judge` for evaluation + +#### Scenario: Multi-turn conversation eval cases +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain eval cases `multi-turn-image-discussion-part1` and `part2` that: +- Share the same `conversation_id` +- Maintain image context across turns +- Demonstrate contextual follow-up questions + +#### Scenario: Image quality assessment eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `image-quality-assessment` that: +- Asks for technical/aesthetic quality rating +- Expects detailed assessment +- Uses `quality-assessment-judge` + +#### Scenario: Chart data extraction eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `chart-data-extraction` that: +- Shows a chart/graph image +- Expects data extraction and analysis +- Uses `chart_validator` for verification + +#### Scenario: Scene understanding eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `scene-context-inference` that: +- Requires contextual understanding beyond literal content +- Expects inferred situation/mood +- Uses `image-description-judge` + +#### Scenario: Instruction following with image eval case +Given `datasets/advanced-vision-tasks.yaml` +When loaded +Then it SHALL contain an eval case `instruction-following-with-image` that: +- Combines complex instructions with visual reference +- May include file attachments with instructions +- Tests multi-step task completion + +--- + +### Requirement: Comprehensive README MUST provide usage guidance +The `examples/showcase/vision/README.md` file SHALL serve as the primary documentation for vision evaluation. 
+ +#### Scenario: README covers quick start +Given `examples/showcase/vision/README.md` +When a user reads the Quick Start section +Then they SHALL find: +- How to run basic evals +- How to run advanced evals +- How to add test images + +#### Scenario: README documents image input formats +Given `examples/showcase/vision/README.md` +When a user looks up image input formats +Then they SHALL find examples for: +- Local file paths +- HTTP URLs +- Base64 data URIs +- Detail level specification + +#### Scenario: README lists all evaluators +Given `examples/showcase/vision/README.md` +When a user wants to know available evaluators +Then they SHALL find: +- Complete list of LLM judges with descriptions +- Complete list of code validators with descriptions +- Usage examples for each type + +#### Scenario: README includes best practices +Given `examples/showcase/vision/README.md` +When a user looks for best practices +Then they SHALL find guidance on: +- Context engineering (progressive disclosure) +- Token budgeting (image costs) +- Cost optimization strategies +- Provider selection + +#### Scenario: README documents success criteria +Given `examples/vision/README.md` +When a user wants to understand evaluation metrics +Then they SHALL find: +- Scoring dimension weights +- Passing thresholds +- Performance expectations + +--- + +### Requirement: Configuration Files MUST enable easy setup +The `.agentv/` directory SHALL contain configuration files for running vision evals. + +#### Scenario: Config file specifies directories +Given `examples/showcase/vision/.agentv/config.yaml` +When loaded +Then it SHALL specify: +- `evalsDir: ./evals` +- `evaluatorsDir: ./evaluators` + +#### Scenario: Targets file includes vision models +Given `examples/showcase/vision/.agentv/targets.yaml` +When loaded +Then it SHALL define targets for: +- OpenAI GPT-4o (default) +- Anthropic Claude 3.5 Sonnet +- Google Gemini 2.5 Flash +With appropriate environment variable references. + +--- + +### Requirement: Test Images Directory MUST be provided +The examples SHALL include a `test-images/` directory for users to place their own test images. + +#### Scenario: Test images directory exists +Given the vision examples structure +When checking `examples/showcase/vision/test-images/` +Then the directory SHALL exist with a `.gitkeep` file. + +#### Scenario: README documents image requirements +Given `examples/showcase/vision/README.md` +When a user wants to add test images +Then they SHALL find specifications for: +- Supported formats (JPEG, PNG, WEBP, GIF, BMP) +- Size limits (50x50 to 16,000x16,000 pixels, max 20MB) +- File naming conventions +- Which images are needed for which eval cases + +--- + +### Requirement: Research Documentation MUST be accessible +The research findings that informed the vision evaluation design SHALL be documented and referenced. + +#### Scenario: Research summary is available +Given `docs/updates/VISION_EVAL_RESEARCH_SUMMARY.md` +When a user wants to understand design rationale +Then they SHALL find: +- Analysis of 5 leading frameworks +- Key findings by framework +- Implementation recommendations +- Best practices summary +- References to source repositories + +#### Scenario: README links to research +Given `examples/showcase/vision/README.md` +When a user wants deeper context +Then they SHALL find a link to the research summary document. 
+ +--- + +## Cross-References + +**Related Capabilities:** +- `vision-input` - Provides image input support used in examples +- `vision-evaluators` - Provides evaluators used in examples +- `yaml-schema` - Examples use extended schema +- `eval-execution` - Examples are run via eval execution + +**Dependencies:** +- Requires `vision-input` and `vision-evaluators` to be implemented +- Examples serve as integration tests for those capabilities + +--- + +## Implementation Notes + +### Directory Structure +``` +examples/showcase/vision/ +├── .agentv/ +│ ├── config.yaml +│ └── targets.yaml +├── datasets/ +│ ├── basic-image-analysis.yaml (7 cases) +│ └── advanced-vision-tasks.yaml (7 cases) +├── evaluators/ +│ ├── llm-judges/ +│ │ ├── image-description-judge.md +│ │ ├── activity-judge.md +│ │ ├── comparison-judge.md +│ │ ├── reasoning-judge.md +│ │ ├── structured-output-judge.md +│ │ └── quality-assessment-judge.md +│ └── code-validators/ +│ ├── count_validator.py +│ ├── ocr_validator.py +│ ├── json_validator.py +│ └── chart_validator.py +├── test-images/ +│ └── .gitkeep +├── README.md (comprehensive guide) +└── INDEX.md (quick reference) +``` + +### Eval Case Distribution + +**Basic (7 cases):** +1. simple-image-description +2. object-detection-simple +3. spatial-relationships +4. text-extraction-ocr +5. multi-image-comparison +6. color-identification +7. image-from-url + +**Advanced (7 cases):** +1. structured-object-detection +2. visual-reasoning-problem +3. multi-turn-image-discussion-part1 +4. multi-turn-image-discussion-part2 +5. image-quality-assessment +6. chart-data-extraction +7. scene-context-inference +8. instruction-following-with-image + +### Documentation Hierarchy +1. **INDEX.md** - Quick start, table of contents +2. **README.md** - Comprehensive usage guide +3. **Research Summary** - Deep dive into design rationale + +--- + +## Future Enhancements (Out of Scope) +- Pre-included sample test images (users provide their own) +- Video tutorial or walkthrough +- Interactive web-based examples +- Automated eval case generation from templates +- Domain-specific example sets (medical, document analysis, etc.) diff --git a/openspec/changes/add-vision-evaluation/specs/vision-evaluators/spec.md b/openspec/changes/add-vision-evaluation/specs/vision-evaluators/spec.md new file mode 100644 index 00000000..9498b414 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/specs/vision-evaluators/spec.md @@ -0,0 +1,313 @@ +# vision-evaluators Specification + +## Purpose +Provide specialized evaluators for assessing the quality and accuracy of AI agent responses to vision/image-based tasks. Includes both LLM-based judges for subjective assessment and code-based validators for objective metrics. + +## ADDED Requirements + +### Requirement: LLM Judge Prompts MUST support image context +LLM judge prompts SHALL be able to reference images from the evaluation input when assessing vision-based responses. + +#### Scenario: Judge prompt includes image reference placeholder +Given an LLM judge prompt containing `{{image_reference}}` +When rendering the prompt for evaluation +Then the placeholder SHALL be replaced with a reference to the image(s) from the input. + +#### Scenario: Judge model receives image context +Given an LLM judge evaluating a vision task +When the judge model is invoked +Then the judge model SHALL be a vision-capable model (e.g., GPT-4V, Claude 3.5 Sonnet). 
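+
+As a rough illustration of the substitution step these scenarios imply, the sketch below shows how placeholders could be filled in before the judge model is called. The `renderJudgePrompt` name comes from the task list later in this change; the context shape and behavior here are assumptions rather than the final API, and in practice the images themselves would also be attached to the judge call as image content, not only as inline references.
+
+```typescript
+// Sketch only: fill judge prompt placeholders, including image references.
+interface JudgeContext {
+  input: string;
+  output: string;
+  expectedOutput: string;
+  imageReferences?: string[]; // file paths, URLs, or data URIs from the eval input
+}
+
+function renderJudgePrompt(template: string, ctx: JudgeContext): string {
+  return template
+    .replaceAll("{{input}}", ctx.input)
+    .replaceAll("{{output}}", ctx.output)
+    .replaceAll("{{expected_output}}", ctx.expectedOutput)
+    .replaceAll("{{image_reference}}", ctx.imageReferences?.[0] ?? "")
+    .replaceAll("{{image_references}}", JSON.stringify(ctx.imageReferences ?? []));
+}
+```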
+ +--- + +### Requirement: Image Description Judge MUST evaluate visual analysis quality +An LLM judge SHALL assess the accuracy, completeness, and clarity of image descriptions. + +#### Scenario: Evaluate description accuracy +Given an AI response describing an office image +When evaluated by the image-description-judge +Then the score SHALL reflect: +- Visual accuracy (40%): Are objects and details correct? +- Completeness (30%): Are all significant elements mentioned? +- Clarity (20%): Is the description clear and specific? +- Relevance (10%): Does it focus on task-relevant elements? + +#### Scenario: Detect hallucinations in image descriptions +Given an AI response claiming "three people" when image shows two +When evaluated by the image-description-judge +Then the judge SHALL identify the hallucination in its `details.hallucinations` field. + +#### Scenario: Identify missing visual elements +Given an AI response that omits significant background elements +When evaluated by the image-description-judge +Then the judge SHALL list missing elements in its `details.missing_elements` field. + +--- + +### Requirement: Activity Recognition Judge MUST evaluate action identification +An LLM judge SHALL assess the accuracy of identifying activities, actions, and behaviors visible in images. + +#### Scenario: Evaluate activity identification accuracy +Given an AI response identifying "team meeting with 4 people" +When evaluated by the activity-judge +Then the score SHALL reflect: +- Activity identification (35%): Is the activity correctly identified? +- Accuracy (35%): Are counts and poses correct? +- Detail level (20%): Is the detail appropriate? +- Inference quality (10%): Are inferences reasonable? + +--- + +### Requirement: Comparison Judge MUST evaluate multi-image analysis +An LLM judge SHALL assess the quality of comparing multiple images and detecting changes. + +#### Scenario: Evaluate change detection accuracy +Given an AI response comparing before/after images +When evaluated by the comparison-judge +Then the score SHALL reflect: +- Change detection accuracy (40%): Are changes identified? +- Spatial precision (25%): Are locations accurately described? +- Completeness (20%): Are both similarities and differences noted? +- Clarity (15%): Is the comparison structure clear? + +--- + +### Requirement: Visual Reasoning Judge MUST evaluate logic with visual information +An LLM judge SHALL assess the quality of logical reasoning applied to visual problems (e.g., chess positions, puzzles, diagrams). + +#### Scenario: Evaluate visual reasoning correctness +Given an AI response solving a chess problem from an image +When evaluated by the reasoning-judge +Then the score SHALL reflect: +- Logical correctness (40%): Is reasoning sound? +- Visual understanding (30%): Is visual perception accurate? +- Problem-solving quality (20%): Is the solution approach appropriate? +- Explanation quality (10%): Is reasoning clearly explained? + +--- + +### Requirement: Structured Output Judge MUST validate vision-based JSON +An LLM judge SHALL assess the quality of structured JSON outputs from vision analysis tasks. + +#### Scenario: Evaluate JSON structure from vision task +Given an AI response with JSON object detection results +When evaluated by the structured-output-judge +Then the score SHALL reflect: +- JSON validity (30%): Is it parseable JSON? +- Schema compliance (35%): Does it match requested structure? +- Data accuracy (25%): Are values from image accurate? 
+- Completeness (10%): Are all relevant elements captured? + +--- + +### Requirement: Quality Assessment Judge MUST evaluate image quality analysis +An LLM judge SHALL assess the completeness and accuracy of image quality assessments (technical, compositional, aesthetic). + +#### Scenario: Evaluate quality assessment completeness +Given an AI response rating an image's quality +When evaluated by the quality-assessment-judge +Then the score SHALL reflect: +- Technical completeness (30%): Sharpness, exposure, noise discussed? +- Compositional analysis (25%): Rule of thirds, balance, framing? +- Aesthetic evaluation (20%): Color, mood, style assessed? +- Overall judgment (15%): Score provided with justification? +- Professional tone (10%): Objective and uses appropriate terminology? + +--- + +### Requirement: Object Count Validator MUST verify numeric accuracy +A code-based validator SHALL extract and verify object counts from AI responses against expected values. + +#### Scenario: Validate object count accuracy +Given an AI response stating "5 bottles" and expected output "5 bottles" +When evaluated by count_validator.py +Then the score SHALL be 1.0 (100% accuracy). + +#### Scenario: Partial count matching +Given an AI response stating "5 bottles, 3 cans" and expected "5 bottles, 8 cans" +When evaluated by count_validator.py +Then the score SHALL be 0.5 (50% accuracy - one of two counts matched). + +--- + +### Requirement: OCR Validator MUST verify text extraction accuracy +A code-based validator SHALL compare extracted text from images against expected text using similarity and keyword matching. + +#### Scenario: Validate OCR text similarity +Given an AI response extracting "Project Proposal Q1 2026" and expected "Project Proposal Q1 2026" +When evaluated by ocr_validator.py +Then the score SHALL be >0.9 (high text similarity). + +#### Scenario: Validate keyword presence +Given an AI response mentioning keywords "budget, timeline, deliverables" +When evaluated by ocr_validator.py with expected keywords +Then the keyword accuracy SHALL be reflected in the score. + +--- + +### Requirement: JSON Structure Validator MUST verify structured outputs +A code-based validator SHALL validate that AI responses contain correctly structured JSON matching expected schemas. + +#### Scenario: Validate JSON structure and fields +Given an AI response with valid JSON containing expected fields +When evaluated by json_validator.py +Then the validation SHALL: +- Confirm JSON is parseable +- Verify schema compliance +- Check field presence and types +- Return score based on coverage + +#### Scenario: Detect schema violations +Given an AI response with JSON missing required fields +When evaluated by json_validator.py +Then the validation SHALL identify missing fields in `details.missing_keys`. + +--- + +### Requirement: Chart Data Validator MUST verify data extraction +A code-based validator SHALL extract and validate numeric data (currency, percentages, dates) from chart/graph descriptions. + +#### Scenario: Validate currency value extraction +Given an AI response stating "Q4: $2.4M" and expected "$2.4M" +When evaluated by chart_validator.py +Then the currency value SHALL be matched within 15% tolerance. + +#### Scenario: Validate percentage extraction +Given an AI response stating "58% growth" and expected "58%" +When evaluated by chart_validator.py +Then the percentage SHALL be matched exactly. 
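+
+To make the tolerance rule above concrete, here is a minimal sketch of the matching logic, written in TypeScript purely for illustration (the validator itself is specified as a Python script) and assuming numeric values have already been extracted from the response text.
+
+```typescript
+// Sketch only: relative-tolerance match for extracted chart values.
+// Currency values pass within ±15% by default; percentages use tolerance 0.
+function matchesWithTolerance(actual: number, expected: number, tolerance = 0.15): boolean {
+  if (expected === 0) return actual === 0;
+  return Math.abs(actual - expected) / Math.abs(expected) <= tolerance;
+}
+
+matchesWithTolerance(2.3e6, 2.4e6); // true  (~4.2% off, within 15%)
+matchesWithTolerance(52, 58, 0);    // false (percentages must match exactly)
+```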
+ +--- + +### Requirement: Code Validators MUST execute via uv run +Python code validators SHALL be executed using `uv run` command with evaluation data passed as JSON. + +#### Scenario: Execute Python validator with JSON input +Given a code validator script `count_validator.py` +When executed with eval data `{"output": "5 objects", "expected_output": "5 objects"}` +Then the validator SHALL: +- Receive data via stdin or command-line argument +- Process the data +- Return JSON result via stdout +- Exit with code 0 for passed, 1 for failed + +#### Scenario: Handle validator timeouts +Given a code validator that runs longer than 30 seconds +When executed +Then the system SHALL terminate the validator and report a timeout error. + +--- + +### Requirement: Evaluator Results MUST follow standard format +All evaluators (LLM judges and code validators) SHALL return results in a consistent format for scoring. + +#### Scenario: Standard result format +Given any evaluator completing evaluation +When the result is returned +Then it SHALL include: +```typescript +{ + status: 'processed' | 'error' | 'skipped', + score: number, // 0.0 to 1.0 + passed: boolean, + details: { + // Evaluator-specific details + } +} +``` + +--- + +## Cross-References + +**Related Capabilities:** +- `vision-input` - Provides the images to evaluate +- `evaluation` - Base evaluation framework +- `rubric-evaluator` - Similar pattern for LLM judges +- `eval-execution` - Executes evaluators during eval runs + +**Dependencies:** +- Requires `vision-input` to be implemented first +- Extends existing evaluator patterns from `rubric-evaluator` + +--- + +## Implementation Notes + +### LLM Judge File Structure +``` +evaluators/llm-judges/ +├── image-description-judge.md +├── activity-judge.md +├── comparison-judge.md +├── reasoning-judge.md +├── structured-output-judge.md +└── quality-assessment-judge.md +``` + +### Code Validator File Structure +``` +evaluators/code-validators/ +├── count_validator.py +├── ocr_validator.py +├── json_validator.py +└── chart_validator.py +``` + +### Judge Prompt Template Variables +- `{{input}}` - User's question/input +- `{{output}}` - AI's response +- `{{expected_output}}` - Expected response +- `{{image_reference}}` - Reference to image(s) +- `{{image_references}}` - Array of image references (for multi-image) + +### Code Validator Interface +```python +def validate( + output: str, + expected_output: str, + input_text: str = "", + **kwargs +) -> Dict[str, Any]: + return { + "status": "processed", + "score": 0.85, + "passed": True, + "details": {...} + } +``` + +### Scoring Dimension Weights + +**Image Description**: +- Visual Accuracy: 40% +- Completeness: 30% +- Clarity: 20% +- Relevance: 10% + +**Activity Recognition**: +- Activity Identification: 35% +- Accuracy: 35% +- Detail Level: 20% +- Inference Quality: 10% + +**Visual Reasoning**: +- Logical Correctness: 40% +- Visual Understanding: 30% +- Problem-Solving: 20% +- Explanation: 10% + +**Comparison**: +- Change Detection: 40% +- Spatial Precision: 25% +- Completeness: 20% +- Clarity: 15% + +--- + +## Future Enhancements (Out of Scope) +- Computer vision metric evaluators (SSIM, perceptual hash, CLIP similarity) +- Specialized domain evaluators (medical imaging, document understanding, face detection) +- Multi-sample evaluation automation (run judges 3-5 times, aggregate scores) +- Confidence calibration evaluators +- Adversarial image testing diff --git a/openspec/changes/add-vision-evaluation/specs/vision-input/spec.md 
b/openspec/changes/add-vision-evaluation/specs/vision-input/spec.md new file mode 100644 index 00000000..3437e09c --- /dev/null +++ b/openspec/changes/add-vision-evaluation/specs/vision-input/spec.md @@ -0,0 +1,248 @@ +# vision-input Specification + +## Purpose +Enable AgentV to accept image inputs in evaluation test cases, supporting local files, URLs, and base64 data URIs. This capability allows testing of vision-capable AI agents with multimodal (text + image) inputs. + +## ADDED Requirements + +### Requirement: Image Content Type MUST be supported in messages +The YAML schema and message structure SHALL support `type: image` content items alongside text content, allowing images to be included in evaluation input messages. + +#### Scenario: Parse image content from local file +Given an eval YAML file with: +```yaml +input_messages: + - role: user + content: + - type: text + value: "Describe this image" + - type: image + value: ./test-images/photo.jpg + detail: high +``` +When parsed by the eval loader +Then the message SHALL contain an `ImageContentItem` with `value: "./test-images/photo.jpg"` and `detail: "high"`. + +#### Scenario: Parse image content from URL +Given an eval YAML file with: +```yaml +input_messages: + - role: user + content: + - type: image_url + value: https://example.com/image.jpg +``` +When parsed by the eval loader +Then the message SHALL contain an `ImageContentItem` with `value: "https://example.com/image.jpg"`. + +#### Scenario: Parse image content from base64 data URI +Given an eval YAML file with: +```yaml +input_messages: + - role: user + content: + - type: image + value: data:image/jpeg;base64,/9j/4AAQSkZJRg... +``` +When parsed by the eval loader +Then the message SHALL contain an `ImageContentItem` with the full data URI as the value. + +--- + +### Requirement: Image Detail Level MUST be configurable +The image content item SHALL support an optional `detail` parameter to control the resolution/quality trade-off for vision models. + +#### Scenario: Specify low detail for cost optimization +Given an image content item with `detail: low` +When passed to a vision provider +Then the provider SHALL receive the `low` detail parameter, resulting in ~85 tokens per image. + +#### Scenario: Specify high detail for complex analysis +Given an image content item with `detail: high` +When passed to a vision provider +Then the provider SHALL receive the `high` detail parameter, resulting in ~765-1360 tokens per image. + +#### Scenario: Use auto detail for automatic selection +Given an image content item with `detail: auto` +When passed to a vision provider +Then the provider SHALL receive the `auto` detail parameter, allowing the model to choose based on the task. + +#### Scenario: Default to high detail when not specified +Given an image content item without a `detail` parameter +When passed to a vision provider +Then the provider SHALL use `high` detail by default. + +--- + +### Requirement: MIME Type Detection MUST be automatic with manual override +The system SHALL automatically detect image MIME types from file extensions or content, while allowing explicit specification for edge cases. + +#### Scenario: Detect MIME type from file extension +Given an image with path `./photo.jpg` +When loading the image +Then the MIME type SHALL be detected as `image/jpeg`. + +#### Scenario: Detect MIME type from data URI +Given a data URI `data:image/png;base64,...` +When parsing the URI +Then the MIME type SHALL be extracted as `image/png`. 
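+
+A minimal sketch of how these two detection paths could be combined is shown below; the helper name and lookup table are illustrative assumptions, not the final loader API.
+
+```typescript
+// Sketch only: detect a MIME type from a data URI prefix or a file extension.
+const EXTENSION_MIME: Record<string, string> = {
+  ".jpg": "image/jpeg",
+  ".jpeg": "image/jpeg",
+  ".png": "image/png",
+  ".webp": "image/webp",
+  ".gif": "image/gif",
+  ".bmp": "image/bmp",
+};
+
+function detectMimeType(source: string): string | undefined {
+  const dataUri = /^data:([^;,]+)[;,]/.exec(source);
+  if (dataUri) return dataUri[1]; // "data:image/png;base64,..." -> "image/png"
+  const dot = source.lastIndexOf(".");
+  if (dot === -1) return undefined;
+  return EXTENSION_MIME[source.slice(dot).toLowerCase()]; // "./photo.jpg" -> "image/jpeg"
+}
+```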
+ +#### Scenario: Override MIME type explicitly +Given an image content item with: +```yaml +type: image +value: ./file.img +mimeType: image/webp +``` +When loading the image +Then the MIME type SHALL be `image/webp` as specified. + +--- + +### Requirement: Image Loading MUST support multiple sources +The system SHALL load images from local file paths, HTTP/HTTPS URLs, and base64-encoded data URIs. + +#### Scenario: Load image from local file system +Given an image path `./test-images/sample.jpg` that exists +When loading the image +Then the image file SHALL be read into a Buffer successfully. + +#### Scenario: Load image from HTTP URL +Given an image URL `https://example.com/image.png` +When loading the image +Then the image SHALL be fetched via HTTP and loaded into a Buffer. + +#### Scenario: Parse base64 data URI +Given a data URI `data:image/jpeg;base64,/9j/4AAQ...` +When parsing the URI +Then the base64 data SHALL be decoded into a Buffer. + +#### Scenario: Reject invalid file paths +Given an image path `./nonexistent.jpg` that does not exist +When attempting to load the image +Then the system SHALL throw an error with message "Image file not found: ./nonexistent.jpg". + +#### Scenario: Reject invalid URLs +Given an invalid URL `https://invalid-domain-xyz/image.jpg` +When attempting to load the image +Then the system SHALL throw an error indicating the URL is unreachable. + +--- + +### Requirement: Image Validation MUST enforce size and format constraints +The system SHALL validate that images meet provider requirements for format, dimensions, and file size before attempting evaluation. + +#### Scenario: Validate supported image formats +Given an image with format JPEG, PNG, WEBP, GIF, or BMP +When validating the image +Then the image SHALL pass format validation. + +#### Scenario: Reject unsupported image formats +Given an image with format TIFF or SVG +When validating the image +Then the system SHALL throw an error "Unsupported image format: image/tiff". + +#### Scenario: Validate image dimensions +Given an image with dimensions 1920x1080 pixels +When validating the image +Then the image SHALL pass dimension validation (within 50x50 to 16,000x16,000 range). + +#### Scenario: Reject oversized images by dimensions +Given an image with dimensions 20,000x20,000 pixels +When validating the image +Then the system SHALL throw an error "Image dimensions exceed maximum: 16,000x16,000 pixels". + +#### Scenario: Reject oversized images by file size +Given an image file larger than 20MB +When validating the image +Then the system SHALL throw an error "Image file size exceeds maximum: 20MB". + +--- + +### Requirement: Multiple Images per Message MUST be supported +A single message content array SHALL support multiple image content items, allowing comparison and multi-image analysis tasks. + +#### Scenario: Include multiple images in one message +Given a message with content: +```yaml +content: + - type: text + value: "Compare these images" + - type: image + value: ./before.jpg + - type: image + value: ./after.jpg +``` +When parsed +Then the message SHALL contain 2 image content items in the correct order. + +--- + +### Requirement: Image Context MUST persist in multi-turn conversations +When an image is included in a message, it SHALL remain part of the conversation context for subsequent turns, following the `conversation_id` pattern. 
+ +#### Scenario: Maintain image context across conversation turns +Given an eval case with `conversation_id: vision-chat-001` containing an image in turn 1 +When loading turn 2 of the same conversation +Then the full conversation history including the image SHALL be available to the model. + +--- + +## Cross-References + +**Related Capabilities:** +- `yaml-schema` - Requires extension to parse image content types +- `vision-evaluators` - Depends on images being loaded and passed to evaluators +- `eval-execution` - Needs to handle image loading during eval runs +- `multiturn-messages-lm-provider` - Multi-turn conversations with images + +**Sequence:** +1. This capability (image input) must be implemented first +2. Then `vision-evaluators` can be implemented +3. Finally `vision-evaluation` examples can be used + +--- + +## Implementation Notes + +### TypeScript Type Definitions +```typescript +interface ImageContentItem { + type: 'image' | 'image_url'; + value: string; // file path, URL, or data URI + detail?: 'low' | 'high' | 'auto'; + mimeType?: string; +} + +type ContentItem = TextContentItem | ImageContentItem | FileContentItem; +``` + +### Image Loader Interface +```typescript +interface ImageLoader { + load(source: string): Promise; + detectMimeType(buffer: Buffer): string; + validate(buffer: Buffer): ValidationResult; +} +``` + +### Supported MIME Types +- `image/jpeg` +- `image/png` +- `image/webp` +- `image/gif` +- `image/bmp` + +### Size Constraints +- **Minimum**: 50x50 pixels +- **Maximum**: 16,000x16,000 pixels +- **File Size**: 20MB maximum + +--- + +## Future Enhancements (Out of Scope) +- Cloud storage URLs (gs://, s3://) +- Automatic image resizing/optimization +- Image caching to reduce redundant loads +- Progressive image loading +- Video input support diff --git a/openspec/changes/add-vision-evaluation/tasks.md b/openspec/changes/add-vision-evaluation/tasks.md new file mode 100644 index 00000000..7b6af483 --- /dev/null +++ b/openspec/changes/add-vision-evaluation/tasks.md @@ -0,0 +1,610 @@ +# Implementation Tasks: Add Vision Evaluation + +## Overview +This document outlines the ordered tasks for implementing vision evaluation capabilities in AgentV. Tasks are organized to deliver user-visible progress incrementally while managing dependencies. + +## Task Dependency Graph +``` +Phase 1 (Foundation) +├─ T1: Reorganize files → T2, T3 +├─ T2: Schema extension → T4, T5 +└─ T3: Documentation → T14 + +Phase 2 (Core Implementation) +├─ T4: Image loaders → T5 +├─ T5: Provider integration → T6, T7 +├─ T6: LLM judges → T8, T9 +└─ T7: Code validators → T8, T9 + +Phase 3 (Testing & Validation) +├─ T8: Basic eval tests → T10 +├─ T9: Advanced eval tests → T10 +├─ T10: Provider compatibility → T11 +└─ T11: Cost analysis → T12 + +Phase 4 (Polish) +├─ T12: Performance optimization → T13 +├─ T13: Documentation review → T14 +└─ T14: Final validation +``` + +## Tasks + +### Phase 1: Foundation & Structure (Days 1-2) + +#### ✅ Task 1: Reorganize Vision Files into Self-Contained Structure +**Priority**: High +**Effort**: 1 day +**Dependencies**: None + +**Description**: Move vision evaluation files from `examples/features/evals/vision/` and `examples/features/evaluators/vision/` to a self-contained `examples/showcase/vision/` directory structure. + +**Actions**: +1. 
Create `examples/showcase/vision/` directory structure: + ``` + examples/showcase/vision/ + ├── .agentv/ + │ ├── config.yaml + │ └── targets.yaml + ├── datasets/ + │ ├── basic-image-analysis.yaml + │ └── advanced-vision-tasks.yaml + ├── evaluators/ + │ ├── llm-judges/ + │ │ ├── image-description-judge.md + │ │ ├── activity-judge.md + │ │ ├── comparison-judge.md + │ │ ├── reasoning-judge.md + │ │ ├── structured-output-judge.md + │ │ └── quality-assessment-judge.md + │ └── code-validators/ + │ ├── count_validator.py + │ ├── ocr_validator.py + │ ├── json_validator.py + │ └── chart_validator.py + ├── test-images/ + │ └── .gitkeep (users provide their own images) + └── README.md + ``` + +2. Move all existing vision files to new structure +3. Update all relative paths in YAML files to reference new evaluator locations +4. Update documentation paths +5. Delete old `examples/features/evals/vision/` and `examples/features/evaluators/vision/` directories + +**Validation**: +- [ ] All files exist in new location +- [ ] No broken relative paths in YAML files +- [ ] Documentation links updated +- [ ] Old directories removed + +**User-Visible**: Clear, self-contained vision examples directory + +--- + +#### Task 2: Extend YAML Schema for Image Content Types +**Priority**: High +**Effort**: 2 days +**Dependencies**: None +**Blocks**: T4, T5 + +**Description**: Extend the existing YAML schema and TypeScript types to support image content in messages. + +**Actions**: +1. Add `ImageContentItem` type to content union: + ```typescript + type ContentItem = TextContentItem | ImageContentItem | FileContentItem; + + interface ImageContentItem { + type: 'image'; + value: string; // path, URL, or data URI + detail?: 'low' | 'high' | 'auto'; + mimeType?: string; + } + + interface ImageURLContentItem { + type: 'image_url'; + value: string; // URL only + detail?: 'low' | 'high' | 'auto'; + } + ``` + +2. Update YAML parser to recognize `type: image` and `type: image_url` +3. Add Zod validation schema for image content +4. Update TypeScript interfaces in core package +5. Add schema documentation + +**Validation**: +- [ ] TypeScript types compile without errors +- [ ] Zod schema validates image content correctly +- [ ] YAML parser recognizes image types +- [ ] Unit tests for schema parsing pass +- [ ] Invalid image content rejected with clear errors + +**User-Visible**: Can write YAML evals with image content + +--- + +#### Task 3: Create Configuration Files +**Priority**: Medium +**Effort**: 0.5 days +**Dependencies**: T1 + +**Description**: Create `.agentv/` configuration files for the vision examples directory. + +**Actions**: +1. Create `.agentv/config.yaml`: + ```yaml + version: "1.0" + evalsDir: ./evals + evaluatorsDir: ./evaluators + ``` + +2. 
Create `.agentv/targets.yaml` with vision-capable models: + ```yaml + targets: + default: + provider: openai + model: gpt-4o + apiKey: ${OPENAI_API_KEY} + + claude-vision: + provider: anthropic + model: claude-3-5-sonnet-20241022 + apiKey: ${ANTHROPIC_API_KEY} + + gemini-vision: + provider: google + model: gemini-2.5-flash + apiKey: ${GOOGLE_GENERATIVE_AI_API_KEY} + ``` + +**Validation**: +- [ ] Config files parse successfully +- [ ] Targets reference vision-capable models +- [ ] Environment variables documented + +**User-Visible**: Easy configuration for vision models + +--- + +### Phase 2: Core Implementation (Days 3-6) + +#### Task 4: Implement Image Loaders +**Priority**: High +**Effort**: 2 days +**Dependencies**: T2 +**Blocks**: T5 + +**Description**: Implement utilities to load images from various sources and convert to appropriate formats for LLM providers. + +**Actions**: +1. Create `packages/core/src/vision/imageLoader.ts`: + - `loadImageFromFile(path: string): Promise` + - `loadImageFromURL(url: string): Promise` + - `parseDataURI(uri: string): Buffer` + - `detectMimeType(buffer: Buffer): string` + - `validateImageFormat(buffer: Buffer): boolean` + +2. Create `packages/core/src/vision/imageConverter.ts`: + - `bufferToBase64(buffer: Buffer): string` + - `createDataURI(base64: string, mimeType: string): string` + - `resizeIfNeeded(buffer: Buffer, maxDim: number): Promise` + +3. Add error handling: + - File not found + - Invalid URL + - Unsupported format + - File too large (>20MB) + - Image dimensions out of range + +4. Add unit tests for all loaders and converters + +**Validation**: +- [ ] Load local files successfully +- [ ] Load HTTP/HTTPS URLs successfully +- [ ] Parse base64 data URIs successfully +- [ ] Detect MIME types correctly (JPEG, PNG, WEBP, GIF) +- [ ] Validate image sizes and dimensions +- [ ] Error messages clear and actionable +- [ ] Unit test coverage >90% + +**User-Visible**: Reliable image loading from multiple sources + +--- + +#### Task 5: Integrate Image Support in Provider Clients +**Priority**: High +**Effort**: 3 days +**Dependencies**: T2, T4 +**Blocks**: T6, T7 + +**Description**: Update LLM provider clients (OpenAI, Anthropic, Google) to pass image content correctly. + +**Actions**: +1. Update `packages/core/src/providers/openai.ts`: + - Handle `ImageContentItem` in message content + - Convert to OpenAI's `image_url` format + - Support `detail` parameter + - Pass base64 data URIs + +2. Update `packages/core/src/providers/anthropic.ts`: + - Handle `ImageContentItem` in message content + - Convert to Anthropic's image format + - Support `source` with base64 data + +3. Update `packages/core/src/providers/google.ts`: + - Handle `ImageContentItem` in message content + - Convert to Gemini's `inlineData` format + - Support both URL and base64 + +4. Add integration tests with real models (optional, can use mocks) + +5. 
Document provider-specific limitations + +**Validation**: +- [ ] OpenAI provider accepts images correctly +- [ ] Anthropic provider accepts images correctly +- [ ] Google provider accepts images correctly +- [ ] Detail levels passed correctly +- [ ] Error handling for unsupported formats +- [ ] Integration tests pass (or mocked tests) + +**User-Visible**: Can run evals with images on all major providers + +--- + +#### Task 6: Implement LLM Judge Runner for Vision +**Priority**: High +**Effort**: 2 days +**Dependencies**: T5 +**Blocks**: T8, T9 + +**Description**: Enable LLM judges to evaluate vision tasks by passing image context to judge models. + +**Actions**: +1. Update judge prompt renderer to include image references: + ```typescript + renderJudgePrompt( + judgeTemplate: string, + input: ContentItem[], + output: string, + expected: string, + imageReferences?: string[] + ): string + ``` + +2. Modify LLM judge execution to: + - Load judge prompt from `.md` file + - Substitute placeholders (input, output, expected, image_reference) + - Call judge model with vision capability + - Parse structured JSON response + +3. Add support for multi-image judging + +4. Add unit tests for judge rendering and execution + +**Validation**: +- [ ] Judge prompts load correctly +- [ ] Image references passed to judge model +- [ ] JSON responses parsed successfully +- [ ] Scoring dimensions extracted +- [ ] Error handling for invalid judge outputs +- [ ] Unit tests pass + +**User-Visible**: LLM judges can evaluate image-based responses + +--- + +#### Task 7: Implement Code Validator Runner +**Priority**: High +**Effort**: 2 days +**Dependencies**: T5 +**Blocks**: T8, T9 + +**Description**: Create runner for Python-based code validators that perform objective evaluation. + +**Actions**: +1. Create `packages/core/src/evaluators/codeValidatorRunner.ts`: + - `runPythonValidator(scriptPath: string, evalData: EvalData): Promise` + - Use `uv run` to execute Python scripts + - Pass eval data as JSON via stdin or args + - Parse JSON result from stdout + - Handle Python errors gracefully + +2. Create standard interface for validator results: + ```typescript + interface ValidationResult { + status: 'processed' | 'error' | 'skipped'; + score: number; + passed: boolean; + details: Record; + } + ``` + +3. Add timeout handling (30s default) + +4. Add unit tests with mock Python scripts + +**Validation**: +- [ ] Python validators execute successfully +- [ ] JSON data passed correctly +- [ ] Results parsed correctly +- [ ] Timeouts handled +- [ ] Python errors reported clearly +- [ ] Unit tests pass + +**User-Visible**: Objective code validators work reliably + +--- + +### Phase 3: Testing & Validation (Days 7-10) + +#### Task 8: Test Basic Image Analysis Evals +**Priority**: High +**Effort**: 2 days +**Dependencies**: T6, T7 +**Blocks**: T10 + +**Description**: Run all 7 basic eval cases from `basic-image-analysis.yaml` and validate results. + +**Actions**: +1. Create sample test images (or use placeholder URLs) +2. Run each eval case: + - simple-image-description + - object-detection-simple + - spatial-relationships + - text-extraction-ocr + - multi-image-comparison + - color-identification + - image-from-url + +3. Verify evaluators run successfully +4. Check score outputs are reasonable +5. Document any issues or edge cases +6. 
Create test fixtures for automated testing + +**Validation**: +- [ ] All 7 eval cases execute without errors +- [ ] LLM judges return valid scores +- [ ] Code validators return valid scores +- [ ] Results documented +- [ ] Test fixtures created + +**User-Visible**: Basic vision evals work end-to-end + +--- + +#### Task 9: Test Advanced Vision Tasks Evals +**Priority**: High +**Effort**: 2 days +**Dependencies**: T6, T7 +**Blocks**: T10 + +**Description**: Run all 7 advanced eval cases from `advanced-vision-tasks.yaml` and validate results. + +**Actions**: +1. Create additional test images for complex scenarios +2. Run each eval case: + - structured-object-detection + - visual-reasoning-problem + - multi-turn-image-discussion (parts 1 & 2) + - image-quality-assessment + - chart-data-extraction + - scene-context-inference + - instruction-following-with-image + +3. Verify structured outputs +4. Test multi-turn conversations maintain context +5. Validate complex evaluators +6. Document performance and cost metrics + +**Validation**: +- [ ] All 7 eval cases execute without errors +- [ ] Structured outputs parse correctly +- [ ] Multi-turn context maintained +- [ ] Complex judges work accurately +- [ ] Performance metrics collected +- [ ] Cost estimates documented + +**User-Visible**: Advanced vision evals work end-to-end + +--- + +#### Task 10: Provider Compatibility Testing +**Priority**: High +**Effort**: 2 days +**Dependencies**: T8, T9 +**Blocks**: T11 + +**Description**: Test vision evals across all major providers to ensure compatibility. + +**Actions**: +1. Run basic evals on: + - OpenAI GPT-4o + - Anthropic Claude 3.5 Sonnet + - Google Gemini 2.5 Flash + +2. Compare results across providers +3. Document provider-specific behaviors +4. Identify and document limitations +5. Create provider compatibility matrix + +**Validation**: +- [ ] All providers execute vision evals +- [ ] Results comparable across providers +- [ ] Limitations documented +- [ ] Compatibility matrix created +- [ ] Errors handled gracefully + +**User-Visible**: Works reliably across all major providers + +--- + +#### Task 11: Cost Analysis & Optimization +**Priority**: Medium +**Effort**: 1 day +**Dependencies**: T10 +**Blocks**: T12 + +**Description**: Analyze token costs for vision evals and document optimization strategies. + +**Actions**: +1. Measure token usage for: + - Different image sizes + - Detail levels (low, high, auto) + - Different providers + +2. Calculate cost per eval case +3. Document cost optimization strategies: + - Use `detail: low` for simple tasks + - Use Gemini Flash for development + - Cache image descriptions + - Use code validators when possible + +4. Create cost estimation guide +5. Add cost warnings to documentation + +**Validation**: +- [ ] Token usage measured for various scenarios +- [ ] Cost per eval documented +- [ ] Optimization strategies validated +- [ ] Cost guide created +- [ ] Warnings added to docs + +**User-Visible**: Clear understanding of costs and how to optimize + +--- + +### Phase 4: Polish & Documentation (Days 11-14) + +#### Task 12: Performance Optimization +**Priority**: Medium +**Effort**: 2 days +**Dependencies**: T11 +**Blocks**: T13 + +**Description**: Optimize image loading, processing, and evaluation performance. + +**Actions**: +1. Profile image loading times +2. Implement caching for loaded images +3. Add image dimension limits to prevent oversized loads +4. Optimize base64 conversions +5. Parallelize independent evaluators +6. 
Add progress tracking for batch evals + +**Validation**: +- [ ] Average eval latency <2s (excluding LLM calls) +- [ ] Image loading cached appropriately +- [ ] Large images handled efficiently +- [ ] Parallel execution works correctly +- [ ] Progress reporting functional + +**User-Visible**: Fast, responsive evaluation experience + +--- + +#### Task 13: Documentation Review & Enhancement +**Priority**: High +**Effort**: 2 days +**Dependencies**: T12 +**Blocks**: T14 + +**Description**: Review and enhance all vision evaluation documentation. + +**Actions**: +1. Review and update `examples/vision/README.md`: + - Add getting started section + - Update usage examples + - Add troubleshooting section + - Include provider setup instructions + +2. Review and update `examples/vision/INDEX.md`: + - Ensure all examples listed + - Update cost estimates + - Add quick reference tables + +3. Update `docs/updates/VISION_EVAL_RESEARCH_SUMMARY.md`: + - Add implementation notes + - Update status of completed work + +4. Create migration guide if needed +5. Add inline code comments +6. Create video tutorial (optional) + +**Validation**: +- [ ] README comprehensive and accurate +- [ ] INDEX up-to-date +- [ ] Research summary reflects implementation +- [ ] Code well-commented +- [ ] No broken links or references + +**User-Visible**: Excellent documentation for vision evaluation + +--- + +#### Task 14: Final Validation & Release Prep +**Priority**: High +**Effort**: 1 day +**Dependencies**: T13 + +**Description**: Final validation before marking the change as complete. + +**Actions**: +1. Run OpenSpec validation: + ```bash + npx @fission-ai/openspec validate add-vision-evaluation --strict + ``` + +2. Run full test suite: + ```bash + bun test + ``` + +3. Run end-to-end eval tests: + ```bash + agentv run examples/showcase/vision/datasets/basic-image-analysis.yaml + agentv run examples/showcase/vision/datasets/advanced-vision-tasks.yaml + ``` + +4. Create changelog entry +5. Update version in package.json +6. Tag release (if applicable) + +**Validation**: +- [ ] OpenSpec validation passes +- [ ] All unit tests pass +- [ ] All integration tests pass +- [ ] End-to-end evals work +- [ ] Changelog updated +- [ ] Version bumped + +**User-Visible**: Production-ready vision evaluation feature + +--- + +## Summary + +**Total Estimated Effort**: 21 days (3-4 weeks with parallelization) + +**Critical Path**: T1 → T2 → T4 → T5 → T6 → T8 → T10 → T11 → T12 → T13 → T14 + +**Parallelizable Work**: +- T3 can run parallel to T2 +- T6 and T7 can run in parallel after T5 +- T8 and T9 can run in parallel +- Documentation tasks can be done incrementally + +**Key Milestones**: +1. Day 2: Schema extended, files reorganized +2. Day 6: Core implementation complete +3. Day 10: All tests passing +4. Day 14: Production ready + +**Success Metrics**: +- All 14 eval cases working +- 3+ providers supported +- Documentation complete +- >90% test coverage +- <2s avg eval latency