Conversation

@christso christso commented Jan 1, 2026

Summary

This PR introduces a comprehensive proposal for adding structured data evaluators to AgentV, enabling precise evaluation of document extraction systems (invoices, receipts, forms, etc.).

What's Included

OpenSpec Proposal (openspec/changes/add-structured-data-evaluators/)

  • proposal.md: Problem statement, industry research (Azure, Google ADK, LangWatch, Mastra), proposed solution
  • design.md: 10 architectural decisions with implementation patterns
  • tasks.md: 43-task implementation roadmap across 7 phases
  • specs/: Comprehensive requirements with 65+ test scenarios
    • structured-data-evaluators/spec.md: field_accuracy evaluator (10 requirements, 30+ scenarios)
    • geometric-evaluators/spec.md: iou_score & coordinate_distance (12 requirements, 35+ scenarios)

Functional Testing Example (examples/features/structured-data/)

  • invoice-extraction.yaml: 5 test cases demonstrating commercial invoice extraction
    • Exact matching for IDs/dates
    • Fuzzy matching for company names (Levenshtein)
    • Numeric tolerance for currency amounts (±$1)
    • Array field validation (line items)
    • Missing required field handling
  • README.md: Comprehensive usage guide, best practices, troubleshooting

Evaluator Types Proposed

  1. field_accuracy: Compare extracted structured data against expected schema

    • Match strategies: exact, fuzzy (Levenshtein/Jaro-Winkler), numeric_tolerance
    • Aggregation: weighted_average, all_or_nothing
    • Nested field paths with dot notation (e.g., invoice.line_items[0].amount); see the configuration sketch after this list
  2. iou_score: Bounding box overlap for document layout verification

    • Formats: XYXY, XYWH, polygon
    • Intersection over Union (IoU) metric
    • Threshold-based pass/fail verdicts
  3. coordinate_distance: Spatial comparison for field positioning

    • Metrics: Euclidean, Manhattan, Cosine
    • 2D/3D coordinate support
    • Precision/Recall/F1 for detection tasks
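
One way a field_accuracy configuration might look, sketched here as a TypeScript object literal. The property names (path, match, threshold, tolerance, weight, aggregation) follow the strategies listed above but are illustrative rather than the final schema:

```ts
// Hypothetical configuration shape; the actual schema is defined in the proposal's specs.
const fieldAccuracyExample = {
  type: "field_accuracy",
  aggregation: "weighted_average", // or "all_or_nothing"
  fields: [
    { path: "invoice_number", match: "exact", weight: 2 },
    { path: "supplier.name", match: "fuzzy", threshold: 0.85 },                 // Levenshtein/Jaro-Winkler
    { path: "totals.gross_total", match: "numeric_tolerance", tolerance: 1.0 }, // e.g. ±$1
    { path: "invoice.line_items[0].amount", match: "exact" },                   // dot/array path notation
  ],
};
```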

AgentV Goals Alignment

  • ✅ Declarative Definitions: YAML-based configuration with clear expected outcomes
  • ✅ Structured Evaluation: Deterministic field comparison (a primitive for rubric-based patterns)
  • ⚠️ Multi-Objective Scoring: The example focuses on correctness; placeholders are included for latency/cost/safety
  • ✅ Optimization Ready: Weighted fields enable future hyperparameter tuning

Notable Design Decisions

  • No external dependencies: Direct implementation of Levenshtein/Jaro-Winkler (lightweight core principle)
  • lodash get for nested field path resolution (already in dependencies); see the sketch after this list
  • Never throw exceptions: Return score 0.0 with descriptive errors for robustness
  • Performance targets: <10ms per field, <5ms per bbox
  • Format-agnostic IoU: Canonical XYXY conversion for all bbox formats
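
As a rough illustration of the lodash-based path resolution and the never-throw behaviour (an assumed helper, not the actual implementation):

```ts
import get from "lodash/get";

// Sketch only: unresolvable paths yield score 0.0 with a descriptive error instead of an exception.
function scoreField(output: unknown, path: string, expected: unknown): { score: number; error?: string } {
  const actual = get(output, path); // resolves e.g. "invoice.line_items[0].amount"
  if (actual === undefined) {
    return { score: 0.0, error: `field not found: ${path}` };
  }
  return { score: actual === expected ? 1.0 : 0.0 }; // exact match only, for brevity
}
```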

Validation

  • ✅ Proposal validated with openspec validate add-structured-data-evaluators --strict
  • ✅ All specs include comprehensive GIVEN-WHEN-THEN scenarios
  • ⚠️ Example uses proposed evaluators (will not run until implementation complete)

Next Steps

Implementation tracked in tasks.md with 7 phases:

  1. Spec deltas & design refinement
  2. Core implementation (FieldAccuracyEvaluator, IoUScoreEvaluator, CoordinateDistanceEvaluator)
  3. Schema validation with Zod
  4. Integration & unit tests
  5. Documentation
  6. QA & benchmarking
  7. Finalization & archive

Ready for review: the proposal is self-contained, so an AI agent can implement it without this conversation's context.

@christso christso marked this pull request as draft January 1, 2026 21:07
@christso christso force-pushed the feat/add-structured-data-evaluators branch 3 times, most recently from 56f8dbd to e66db2e on January 2, 2026 04:39
@christso christso marked this pull request as ready for review January 2, 2026 05:38
…example

- Add OpenSpec proposal for field_accuracy, iou_score, and coordinate_distance evaluators
- Include comprehensive specs with 30+ test scenarios for structured data evaluation
- Include 35+ test scenarios for geometric evaluation (bounding boxes, coordinates)
- Add design document with 10 architectural decisions and implementation patterns
- Add 43-task implementation roadmap across 7 phases
- Create functional testing example: invoice extraction with 5 test cases
- Demonstrate exact, fuzzy, and numeric tolerance matching strategies
- Include rubric-based alternative and multi-objective scoring placeholders
- Align with AgentV high-level goals: declarative, structured, optimization-ready
- Sanitize example data (demo companies, addresses, invoice numbers)

Refs: #structured-data-evaluation

- Update invoice-extraction.yaml to use file references (type: file pattern)
- Change execution target from 'default' to 'bun run mock_extractor.ts'
- Add 5 HTML fixture files simulating invoice documents
- Create mock_extractor.ts CLI that parses HTML and outputs JSON
- Add fixtures/README.md documenting test file structure
- Update README.md with directory structure and running instructions
- All test cases now reference actual fixture files (invoice-001.html, etc.)

This aligns with AgentV patterns seen in the export-screening example, where:
- input_messages use type: file for document references
- target specifies the actual CLI/script to execute
- Fixtures are versionable, readable HTML instead of binary PDFs

The previous version would always extract data perfectly matching expected values,
resulting in 100% accuracy on all tests. This defeats the purpose of evaluation.

Changes:
- Mock extractor now intentionally produces realistic OCR-like variations:
  * invoice-001: Normalizes data correctly (baseline perfect extraction)
  * invoice-002: Preserves 'Acme - Shipping' formatting to test fuzzy matching
  * invoice-003: Keeps decimal precision (1889.5 vs expected 1889) to test tolerance
  * invoice-004: Returns undefined for missing fields (already missing in HTML)

- Updated YAML outcome descriptions to show ACTUAL vs EXPECTED values
- Added 'How the Mock Extractor Works' section to README
- Clarified that each test case demonstrates a specific evaluator capability

This creates realistic test scenarios where:
- Fuzzy matching detects 'Acme - Shipping' ≈ 'Acme Shipping' (Levenshtein > 0.85)
- Numeric tolerance accepts |1889.5 - 1889| < 1.0
- Missing required fields reduce overall score
- Perfect extraction (invoice-001) establishes baseline
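
For reference, the pass conditions these variations target can be written as simple checks (illustrative only; similarity() stands in for whichever Levenshtein-based scorer is used):

```ts
// Illustrative only: the checks these fixture variations are designed to exercise.
declare function similarity(a: string, b: string): number; // assumed Levenshtein-based scorer, 0..1

const fuzzyPass = similarity("Acme - Shipping", "Acme Shipping") > 0.85; // invoice-002
const tolerancePass = Math.abs(1889.5 - 1889) < 1.0;                     // invoice-003
// invoice-004: invoice_number is absent, so the missing required field lowers the overall score
```
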
The product_code field is redundant: it is covered by the description, and the
HS code is already used for classification. Removing it simplifies the data model
and reduces noise in evaluation.

Changes:
- Remove product_code column from invoice-001.html table (7 cols → 6 cols)
- Remove product_code from all line items in invoice-extraction.yaml
- Update mock_extractor.ts interface and parsing logic (6 cells instead of 7)
- Simplifies line item structure to: description, quantity, unit_price,
  line_total, unit_type, hs_code
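
The resulting line-item shape, roughly (field names taken from the list above; the types are assumed):

```ts
// Assumed types; field names follow the simplified structure described above.
interface LineItem {
  description: string;
  quantity: number;
  unit_price: number;
  line_total: number;
  unit_type: string;
  hs_code: string; // used for classification in place of the removed product_code
}
```
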
Fixed two issues in mock_extractor.ts:
1. Code corruption from a previous edit: duplicate grossTotalEl sections with
   incomplete closing braces
2. Missing jsdom dependency causing type errors

Changes:
- Repaired netTotal/grossTotal extraction logic (removed duplicate code)
- Added jsdom and @types/jsdom as dev dependencies
- All TypeScript type errors resolved

The extractor now compiles successfully.

…ependencies

Improved organization and dependency isolation following AgentV best practices:

Structure changes:
- examples/features/structured-data/ → examples/features/invoice-extraction/
- invoice-extraction.yaml → invoice-extraction/evals/dataset.yaml
- Follows showcase pattern: {feature-name}/evals/, {feature-name}/fixtures/

Dependency isolation:
- Remove jsdom + @types/jsdom from workspace root package.json
- Add local package.json in invoice-extraction/ with jsdom as devDependency
- Add .gitignore for local node_modules and bun.lockb
- Prevents polluting production dependencies with example-only packages

Benefits:
- Each feature example is self-contained with its own dependencies
- Aligns with showcase/ directory pattern for consistency
- Clear separation: workspace deps (production) vs example deps (local)
- Easy to run: 'cd invoice-extraction && bun install && bun run extract'

Updated documentation:
- examples/features/README.md with new structure explanation
- invoice-extraction/README.md with local install instructions

Dramatically simplified the invoice extraction example by using JSON fixtures
instead of HTML, eliminating unnecessary complexity and dependencies.

Changes:
- Convert fixtures: *.html → *.json (5 files)
- Simplify mock_extractor.ts: 187 lines → 27 lines (just reads JSON)
- Remove jsdom and @types/jsdom dependencies entirely
- Delete package.json (no dependencies needed!)
- Delete bun.lock (no longer needed)
- Update dataset.yaml: reference ../fixtures/*.json instead of ./fixtures/*.html
- Update all READMEs to reflect simpler JSON-based approach

Benefits:
✅ Zero dependencies (uses only Node.js built-ins)
✅ 85% reduction in code complexity (187 → 27 lines)
✅ Clearer test data (JSON is immediately readable)
✅ Faster to understand and modify
✅ Focuses on evaluator demonstration, not document parsing
✅ No build step or installation required

The JSON fixtures still contain the same intentional variations for testing:
- invoice-002: 'Acme - Shipping' (fuzzy matching)
- invoice-003: 1889.5 vs expected 1889 (numeric tolerance)
- invoice-004: Missing invoice_number (required field)
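
A sketch of what a ~27-line JSON-passthrough extractor can look like (argument handling and exact wording are assumed, not copied from the committed script):

```ts
#!/usr/bin/env bun
// Minimal mock extractor: read a JSON fixture path from argv and echo it as the "extracted" output.
import { readFileSync } from "node:fs";

const fixturePath = process.argv[2];
if (!fixturePath) {
  console.error("usage: mock_extractor.ts <fixture.json>");
  process.exit(1);
}

const extracted = JSON.parse(readFileSync(fixturePath, "utf8"));
console.log(JSON.stringify(extracted, null, 2));
```
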
… configuration

- Rename examples/features/invoice-extraction to document-extraction (industry standard terminology)
- Add .agentv/targets.yaml with CLI provider configuration for mock_extractor
- Update mock_extractor.ts to support optional output file parameter for AgentV integration
- Update dataset.yaml to reference mock_extractor target by name instead of inline command
- Functional test confirms mock_extractor works correctly with AgentV CLI provider
- Defer geometric evaluators (IoU, coordinate distance) to code_judge plugins
- Add date match type with format normalization to field_accuracy spec
- Convert geometric-evaluators spec to plugin implementation guide with Python examples
- Update tasks.md to reflect simplified scope focused on field_accuracy
- Update example dataset.yaml to use date matching for invoice_date field

This aligns with AgentV's "lightweight core, plugin extensibility" principle
by keeping complex algorithms (polygon intersection, Hungarian matching) out
of core while providing ready-to-use code_judge scripts for users who need them.
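
For illustration, the kind of format normalization the new date match type implies might look like this (an assumed normalizer, not the spec'd algorithm; ambiguous slash dates are read as US MM/DD/YYYY):

```ts
// Assumed normalizer: reduce common date formats to YYYY-MM-DD so that
// "01/02/2026", "2026-01-02", and "January 2, 2026" compare as equal.
function normalizeDate(value: string): string | undefined {
  const iso = value.match(/^(\d{4})-(\d{2})-(\d{2})/);
  if (iso) return `${iso[1]}-${iso[2]}-${iso[3]}`;

  const us = value.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/); // ambiguous slash dates -> US MM/DD/YYYY
  if (us) return `${us[3]}-${us[1].padStart(2, "0")}-${us[2].padStart(2, "0")}`;

  const parsed = new Date(value); // fallback for formats like "January 2, 2026"
  if (Number.isNaN(parsed.getTime())) return undefined;
  const mm = String(parsed.getMonth() + 1).padStart(2, "0");
  const dd = String(parsed.getDate()).padStart(2, "0");
  return `${parsed.getFullYear()}-${mm}-${dd}`;
}
```
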
Implement the field_accuracy evaluator from OpenSpec proposal
add-structured-data-evaluators. This evaluator enables declarative
validation of extracted structured data with configurable matching
strategies.

Features:
- Exact, fuzzy (Levenshtein/Jaro-Winkler), numeric tolerance, and date matching
- Dot-notation field paths with array indexing (e.g., items[0].price)
- Weighted average and all-or-nothing aggregation strategies (see the sketch below)
- Comprehensive test suite (12 new tests)

Files changed:
- packages/core/src/evaluation/types.ts: Add FieldAccuracyEvaluatorConfig
- packages/core/src/evaluation/evaluators.ts: Implement FieldAccuracyEvaluator
- packages/core/src/evaluation/loaders/evaluator-parser.ts: Parse field_accuracy
- packages/core/src/evaluation/orchestrator.ts: Register field_accuracy handler
- packages/core/test/evaluation/evaluators.test.ts: Add evaluator tests
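
The two aggregation strategies reduce per-field scores roughly as follows (a sketch, not the exact code in evaluators.ts):

```ts
// Sketch of the two aggregation strategies described above.
type FieldScore = { score: number; weight?: number };

function aggregate(scores: FieldScore[], strategy: "weighted_average" | "all_or_nothing"): number {
  if (scores.length === 0) return 0;
  if (strategy === "all_or_nothing") {
    return scores.every((s) => s.score === 1) ? 1 : 0;
  }
  const totalWeight = scores.reduce((sum, s) => sum + (s.weight ?? 1), 0);
  return scores.reduce((sum, s) => sum + s.score * (s.weight ?? 1), 0) / totalWeight;
}
```
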
The test case was missing invoice_number in expected_messages, so the
evaluator had nothing to compare against. It now properly exercises the
'missing required field' scenario, with the score dropping to ~0.85.

@christso christso force-pushed the feat/add-structured-data-evaluators branch from 194b8ff to 58ed92f on January 2, 2026 05:46

Align with the lightweight core principle by moving fuzzy string matching
(Levenshtein, Jaro-Winkler) out of the built-in field_accuracy evaluator.

Changes:
- Remove 'fuzzy' match type from FieldMatchType
- Remove ~100 LOC of algorithm implementations from evaluators.ts
- Add fuzzy_match.ts code_judge example with both algorithms
- Add early warning for empty fields array in parser
- Document date format disambiguation behavior (defaults to US)
- Update changeset and README documentation

This is consistent with how the IoU/geometric evaluators are handled:
complex algorithms belong in plugins, not in core.
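
For reference, a normalized Levenshtein similarity of the kind fuzzy_match.ts provides fits in a few lines (illustrative; the committed script may differ in details):

```ts
// Normalized Levenshtein similarity in [0, 1]; 1 means identical strings.
function levenshteinSimilarity(a: string, b: string): number {
  const m = a.length;
  const n = b.length;
  if (m === 0 && n === 0) return 1;
  const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...new Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return 1 - dp[m][n] / Math.max(m, n);
}
```

On this definition, levenshteinSimilarity("Acme - Shipping", "Acme Shipping") is roughly 0.87, which clears the 0.85 threshold used in the invoice-002 case.
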
Replace match: fuzzy with match: exact in dataset.yaml since fuzzy
matching was removed from core. Add comments pointing to the
fuzzy_match.ts code_judge for handling OCR variations.

Demonstrates how to use code_judge for fuzzy string matching on specific
fields in structured data. invoice-002 now shows the hybrid approach:
- code_judge with Levenshtein similarity for supplier.name (OCR variations)
- field_accuracy for exact/numeric/date fields

Enables code_judge scripts to receive configuration via stdin by passing
unrecognized YAML properties as a `config` object. This allows reusable
scripts without hardcoding field paths or thresholds.

Changes:
- Add `config` property to CodeEvaluatorConfig type
- Collect unrecognized YAML properties in evaluator-parser
- Pass config to scripts via stdin payload
- Add multi_field_fuzzy.ts example demonstrating config usage
- Update dataset.yaml invoice-002 to use multi-field fuzzy matcher
- Update OpenSpec docs to reflect fuzzy matching via code_judge
- Add tests for config pass-through

Example usage:
```yaml
- type: code_judge
  script: ./multi_field_fuzzy.ts
  fields:
    - path: supplier.name
      threshold: 0.85
  algorithm: levenshtein
```
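
On the script side, the config object arrives in the stdin payload; a sketch of how a script such as multi_field_fuzzy.ts might read it (the surrounding payload shape is assumed; the config property matches the description above):

```ts
// Sketch of reading the stdin payload in a code_judge script run under bun.
const payload = JSON.parse(await Bun.stdin.text());
const config = payload.config ?? {};                                      // unrecognized YAML properties land here
const fields: Array<{ path: string; threshold: number }> = config.fields ?? [];
const algorithm: string = config.algorithm ?? "levenshtein";
```
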
Rename package and update documentation to reflect the directory rename
from invoice-extraction to document-extraction. The example content still
covers invoice extraction as a specific use case of document extraction.

Add two new evaluator types for validating execution metrics:
- `latency`: Check execution duration against a threshold (uses traceSummary.durationMs)
- `cost`: Check execution cost against a budget (uses traceSummary.costUsd)

Both evaluators leverage the existing TraceSummary infrastructure that
already collects these metrics from providers.
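
Conceptually, both reduce to a threshold check over the trace summary (a sketch of the check, not the shipped evaluator code; the config property names are assumed):

```ts
// Sketch: pass/fail threshold checks over the existing TraceSummary metrics.
interface TraceSummaryLike {
  durationMs?: number;
  costUsd?: number;
}

function checkLatency(trace: TraceSummaryLike, maxDurationMs: number): { score: number; pass: boolean } {
  const pass = trace.durationMs !== undefined && trace.durationMs <= maxDurationMs;
  return { score: pass ? 1 : 0, pass };
}

function checkCost(trace: TraceSummaryLike, budgetUsd: number): { score: number; pass: boolean } {
  const pass = trace.costUsd !== undefined && trace.costUsd <= budgetUsd;
  return { score: pass ? 1 : 0, pass };
}
```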