Conversation

@christso christso commented Jan 1, 2026

Summary

This PR introduces a comprehensive proposal for adding structured data evaluators to AgentV, enabling precise evaluation of document extraction systems (invoices, receipts, forms, etc.).

What's Included

OpenSpec Proposal (openspec/changes/add-structured-data-evaluators/)

  • proposal.md: Problem statement, industry research (Azure, Google ADK, LangWatch, Mastra), proposed solution
  • design.md: 10 architectural decisions with implementation patterns
  • tasks.md: 43-task implementation roadmap across 7 phases
  • specs/: Comprehensive requirements with 65+ test scenarios
    • structured-data-evaluators/spec.md: field_accuracy evaluator (10 requirements, 30+ scenarios)
    • geometric-evaluators/spec.md: iou_score & coordinate_distance (12 requirements, 35+ scenarios)

Functional Testing Example (examples/features/structured-data/)

  • invoice-extraction.yaml: 5 test cases demonstrating commercial invoice extraction
    • Exact matching for IDs/dates
    • Fuzzy matching for company names (Levenshtein)
    • Numeric tolerance for currency amounts (±$1)
    • Array field validation (line items)
    • Missing required field handling
  • README.md: Comprehensive usage guide, best practices, troubleshooting

Evaluator Types Proposed

  1. field_accuracy: Compare extracted structured data against expected schema

    • Match strategies: exact, fuzzy (Levenshtein/Jaro-Winkler), numeric_tolerance
    • Aggregation: weighted_average, all_or_nothing
    • Nested field paths with dot notation (e.g., invoice.line_items[0].amount); see the configuration sketch after this list
  2. iou_score: Bounding box overlap for document layout verification

    • Formats: XYXY, XYWH, polygon
    • Intersection over Union (IoU) metric
    • Threshold-based pass/fail verdicts
  3. coordinate_distance: Spatial comparison for field positioning

    • Metrics: Euclidean, Manhattan, Cosine
    • 2D/3D coordinate support
    • Precision/Recall/F1 for detection tasks
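
One way a field_accuracy configuration might look, sketched here as a TypeScript object literal. The property names (path, match, threshold, tolerance, weight, aggregation) follow the strategies listed above but are illustrative rather than the final schema:

```ts
// Hypothetical configuration shape; the actual schema is defined in the proposal's specs.
const fieldAccuracyExample = {
  type: "field_accuracy",
  aggregation: "weighted_average", // or "all_or_nothing"
  fields: [
    { path: "invoice_number", match: "exact", weight: 2 },
    { path: "supplier.name", match: "fuzzy", threshold: 0.85 },                 // Levenshtein/Jaro-Winkler
    { path: "totals.gross_total", match: "numeric_tolerance", tolerance: 1.0 }, // e.g. ±$1
    { path: "invoice.line_items[0].amount", match: "exact" },                   // dot/array path notation
  ],
};
```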

AgentV Goals Alignment

  • ✅ Declarative Definitions: YAML-based configuration with clear expected outcomes
  • ✅ Structured Evaluation: Deterministic field comparison (a primitive for rubric-based patterns)
  • ⚠️ Multi-Objective Scoring: The example focuses on correctness; placeholders are included for latency/cost/safety
  • ✅ Optimization Ready: Weighted fields enable future hyperparameter tuning

Notable Design Decisions

  • No external dependencies: Direct implementation of Levenshtein/Jaro-Winkler (lightweight core principle)
  • lodash get for nested field path resolution (already in dependencies); see the sketch after this list
  • Never throw exceptions: Return score 0.0 with descriptive errors for robustness
  • Performance targets: <10ms per field, <5ms per bbox
  • Format-agnostic IoU: Canonical XYXY conversion for all bbox formats
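
As a rough illustration of the lodash-based path resolution and the never-throw behaviour (an assumed helper, not the actual implementation):

```ts
import get from "lodash/get";

// Sketch only: unresolvable paths yield score 0.0 with a descriptive error instead of an exception.
function scoreField(output: unknown, path: string, expected: unknown): { score: number; error?: string } {
  const actual = get(output, path); // resolves e.g. "invoice.line_items[0].amount"
  if (actual === undefined) {
    return { score: 0.0, error: `field not found: ${path}` };
  }
  return { score: actual === expected ? 1.0 : 0.0 }; // exact match only, for brevity
}
```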

Validation

  • ✅ Proposal validated with openspec validate add-structured-data-evaluators --strict
  • ✅ All specs include comprehensive GIVEN-WHEN-THEN scenarios
  • ⚠️ Example uses proposed evaluators (will not run until implementation complete)

Next Steps

Implementation tracked in tasks.md with 7 phases:

  1. Spec deltas & design refinement
  2. Core implementation (FieldAccuracyEvaluator, IoUScoreEvaluator, CoordinateDistanceEvaluator)
  3. Schema validation with Zod
  4. Integration & unit tests
  5. Documentation
  6. QA & benchmarking
  7. Finalization & archive

Ready for review: the proposal is self-contained, so an AI agent can implement it without this conversation's context.

@christso christso marked this pull request as draft January 1, 2026 21:07
@christso christso force-pushed the feat/add-structured-data-evaluators branch 3 times, most recently from 56f8dbd to e66db2e on January 2, 2026 04:39
@christso christso marked this pull request as ready for review January 2, 2026 05:38
…example

- Add OpenSpec proposal for field_accuracy, iou_score, and coordinate_distance evaluators
- Include comprehensive specs with 30+ test scenarios for structured data evaluation
- Include 35+ test scenarios for geometric evaluation (bounding boxes, coordinates)
- Add design document with 10 architectural decisions and implementation patterns
- Add 43-task implementation roadmap across 7 phases
- Create functional testing example: invoice extraction with 5 test cases
- Demonstrate exact, fuzzy, and numeric tolerance matching strategies
- Include rubric-based alternative and multi-objective scoring placeholders
- Align with AgentV high-level goals: declarative, structured, optimization-ready
- Sanitize example data (demo companies, addresses, invoice numbers)

Refs: #structured-data-evaluation

- Update invoice-extraction.yaml to use file references (type: file pattern)
- Change execution target from 'default' to 'bun run mock_extractor.ts'
- Add 5 HTML fixture files simulating invoice documents
- Create mock_extractor.ts CLI that parses HTML and outputs JSON
- Add fixtures/README.md documenting test file structure
- Update README.md with directory structure and running instructions
- All test cases now reference actual fixture files (invoice-001.html, etc.)

This aligns with AgentV patterns seen in the export-screening example, where:
- input_messages use type: file for document references
- target specifies the actual CLI/script to execute
- Fixtures are versionable, readable HTML instead of binary PDFs

The previous version would always extract data perfectly matching expected values,
resulting in 100% accuracy on all tests. This defeats the purpose of evaluation.

Changes:
- Mock extractor now intentionally produces realistic OCR-like variations:
  * invoice-001: Normalizes data correctly (baseline perfect extraction)
  * invoice-002: Preserves 'Acme - Shipping' formatting to test fuzzy matching
  * invoice-003: Keeps decimal precision (1889.5 vs expected 1889) to test tolerance
  * invoice-004: Returns undefined for missing fields (already missing in HTML)

- Updated YAML outcome descriptions to show ACTUAL vs EXPECTED values
- Added 'How the Mock Extractor Works' section to README
- Clarified that each test case demonstrates a specific evaluator capability

This creates realistic test scenarios where:
- Fuzzy matching detects 'Acme - Shipping' ≈ 'Acme Shipping' (Levenshtein > 0.85)
- Numeric tolerance accepts |1889.5 - 1889| < 1.0
- Missing required fields reduce overall score
- Perfect extraction (invoice-001) establishes baseline
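
For reference, the pass conditions these variations target can be written as simple checks (illustrative only; similarity() stands in for whichever Levenshtein-based scorer is used):

```ts
// Illustrative only: the checks these fixture variations are designed to exercise.
declare function similarity(a: string, b: string): number; // assumed Levenshtein-based scorer, 0..1

const fuzzyPass = similarity("Acme - Shipping", "Acme Shipping") > 0.85; // invoice-002
const tolerancePass = Math.abs(1889.5 - 1889) < 1.0;                     // invoice-003
// invoice-004: invoice_number is absent, so the missing required field lowers the overall score
```
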
The product_code field is redundant: it is covered by the description, and the
HS code is already used for classification. Removing it simplifies the data model
and reduces noise in evaluation.

Changes:
- Remove product_code column from invoice-001.html table (7 cols → 6 cols)
- Remove product_code from all line items in invoice-extraction.yaml
- Update mock_extractor.ts interface and parsing logic (6 cells instead of 7)
- Simplifies line item structure to: description, quantity, unit_price,
  line_total, unit_type, hs_code
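
The resulting line-item shape, roughly (field names taken from the list above; the types are assumed):

```ts
// Assumed types; field names follow the simplified structure described above.
interface LineItem {
  description: string;
  quantity: number;
  unit_price: number;
  line_total: number;
  unit_type: string;
  hs_code: string; // used for classification in place of the removed product_code
}
```
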
Fixed two issues in mock_extractor.ts:
1. Code corruption from a previous edit: duplicate grossTotalEl sections with
   incomplete closing braces
2. Missing jsdom dependency causing type errors

Changes:
- Repaired netTotal/grossTotal extraction logic (removed duplicate code)
- Added jsdom and @types/jsdom as dev dependencies
- All TypeScript type errors resolved

The extractor now compiles successfully.

…ependencies

Improved organization and dependency isolation following AgentV best practices:

Structure changes:
- examples/features/structured-data/ → examples/features/invoice-extraction/
- invoice-extraction.yaml → invoice-extraction/evals/dataset.yaml
- Follows showcase pattern: {feature-name}/evals/, {feature-name}/fixtures/

Dependency isolation:
- Remove jsdom + @types/jsdom from workspace root package.json
- Add local package.json in invoice-extraction/ with jsdom as devDependency
- Add .gitignore for local node_modules and bun.lockb
- Prevents polluting production dependencies with example-only packages

Benefits:
- Each feature example is self-contained with its own dependencies
- Aligns with showcase/ directory pattern for consistency
- Clear separation: workspace deps (production) vs example deps (local)
- Easy to run: 'cd invoice-extraction && bun install && bun run extract'

Updated documentation:
- examples/features/README.md with new structure explanation
- invoice-extraction/README.md with local install instructions

Dramatically simplified the invoice extraction example by using JSON fixtures
instead of HTML, eliminating unnecessary complexity and dependencies.

Changes:
- Convert fixtures: *.html → *.json (5 files)
- Simplify mock_extractor.ts: 187 lines → 27 lines (just reads JSON)
- Remove jsdom and @types/jsdom dependencies entirely
- Delete package.json (no dependencies needed!)
- Delete bun.lock (no longer needed)
- Update dataset.yaml: reference ../fixtures/*.json instead of ./fixtures/*.html
- Update all READMEs to reflect simpler JSON-based approach

Benefits:
✅ Zero dependencies (uses only Node.js built-ins)
✅ 85% reduction in code complexity (187 → 27 lines)
✅ Clearer test data (JSON is immediately readable)
✅ Faster to understand and modify
✅ Focuses on evaluator demonstration, not document parsing
✅ No build step or installation required

The JSON fixtures still contain the same intentional variations for testing:
- invoice-002: 'Acme - Shipping' (fuzzy matching)
- invoice-003: 1889.5 vs expected 1889 (numeric tolerance)
- invoice-004: Missing invoice_number (required field)
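
A sketch of what a ~27-line JSON-passthrough extractor can look like (argument handling and exact wording are assumed, not copied from the committed script):

```ts
#!/usr/bin/env bun
// Minimal mock extractor: read a JSON fixture path from argv and echo it as the "extracted" output.
import { readFileSync } from "node:fs";

const fixturePath = process.argv[2];
if (!fixturePath) {
  console.error("usage: mock_extractor.ts <fixture.json>");
  process.exit(1);
}

const extracted = JSON.parse(readFileSync(fixturePath, "utf8"));
console.log(JSON.stringify(extracted, null, 2));
```
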
… configuration

- Rename examples/features/invoice-extraction to document-extraction (industry standard terminology)
- Add .agentv/targets.yaml with CLI provider configuration for mock_extractor
- Update mock_extractor.ts to support optional output file parameter for AgentV integration
- Update dataset.yaml to reference mock_extractor target by name instead of inline command
- Functional test confirms mock_extractor works correctly with AgentV CLI provider
- Defer geometric evaluators (IoU, coordinate distance) to code_judge plugins
- Add date match type with format normalization to field_accuracy spec
- Convert geometric-evaluators spec to plugin implementation guide with Python examples
- Update tasks.md to reflect simplified scope focused on field_accuracy
- Update example dataset.yaml to use date matching for invoice_date field

This aligns with AgentV's "lightweight core, plugin extensibility" principle
by keeping complex algorithms (polygon intersection, Hungarian matching) out
of core while providing ready-to-use code_judge scripts for users who need them.
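
For illustration, the kind of format normalization the new date match type implies might look like this (an assumed normalizer, not the spec'd algorithm; ambiguous slash dates are read as US MM/DD/YYYY):

```ts
// Assumed normalizer: reduce common date formats to YYYY-MM-DD so that
// "01/02/2026", "2026-01-02", and "January 2, 2026" compare as equal.
function normalizeDate(value: string): string | undefined {
  const iso = value.match(/^(\d{4})-(\d{2})-(\d{2})/);
  if (iso) return `${iso[1]}-${iso[2]}-${iso[3]}`;

  const us = value.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/); // ambiguous slash dates -> US MM/DD/YYYY
  if (us) return `${us[3]}-${us[1].padStart(2, "0")}-${us[2].padStart(2, "0")}`;

  const parsed = new Date(value); // fallback for formats like "January 2, 2026"
  if (Number.isNaN(parsed.getTime())) return undefined;
  const mm = String(parsed.getMonth() + 1).padStart(2, "0");
  const dd = String(parsed.getDate()).padStart(2, "0");
  return `${parsed.getFullYear()}-${mm}-${dd}`;
}
```
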
Implement the field_accuracy evaluator from OpenSpec proposal
add-structured-data-evaluators. This evaluator enables declarative
validation of extracted structured data with configurable matching
strategies.

Features:
- Exact, fuzzy (Levenshtein/Jaro-Winkler), numeric tolerance, and date matching
- Dot-notation field paths with array indexing (e.g., items[0].price)
- Weighted average and all-or-nothing aggregation strategies (see the sketch below)
- Comprehensive test suite (12 new tests)

Files changed:
- packages/core/src/evaluation/types.ts: Add FieldAccuracyEvaluatorConfig
- packages/core/src/evaluation/evaluators.ts: Implement FieldAccuracyEvaluator
- packages/core/src/evaluation/loaders/evaluator-parser.ts: Parse field_accuracy
- packages/core/src/evaluation/orchestrator.ts: Register field_accuracy handler
- packages/core/test/evaluation/evaluators.test.ts: Add evaluator tests
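
The two aggregation strategies reduce per-field scores roughly as follows (a sketch, not the exact code in evaluators.ts):

```ts
// Sketch of the two aggregation strategies described above.
type FieldScore = { score: number; weight?: number };

function aggregate(scores: FieldScore[], strategy: "weighted_average" | "all_or_nothing"): number {
  if (scores.length === 0) return 0;
  if (strategy === "all_or_nothing") {
    return scores.every((s) => s.score === 1) ? 1 : 0;
  }
  const totalWeight = scores.reduce((sum, s) => sum + (s.weight ?? 1), 0);
  return scores.reduce((sum, s) => sum + s.score * (s.weight ?? 1), 0) / totalWeight;
}
```
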
The test case was missing invoice_number in expected_messages, so the
evaluator had nothing to compare against. It now properly exercises the
'missing required field' scenario, with the score dropping to ~0.85.

@christso christso force-pushed the feat/add-structured-data-evaluators branch from 194b8ff to 58ed92f on January 2, 2026 05:46

Align with the lightweight core principle by moving fuzzy string matching
(Levenshtein, Jaro-Winkler) out of the built-in field_accuracy evaluator.

Changes:
- Remove 'fuzzy' match type from FieldMatchType
- Remove ~100 LOC of algorithm implementations from evaluators.ts
- Add fuzzy_match.ts code_judge example with both algorithms
- Add early warning for empty fields array in parser
- Document date format disambiguation behavior (defaults to US)
- Update changeset and README documentation

This is consistent with how the IoU/geometric evaluators are handled:
complex algorithms belong in plugins, not in core.
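
For reference, a normalized Levenshtein similarity of the kind fuzzy_match.ts provides fits in a few lines (illustrative; the committed script may differ in details):

```ts
// Normalized Levenshtein similarity in [0, 1]; 1 means identical strings.
function levenshteinSimilarity(a: string, b: string): number {
  const m = a.length;
  const n = b.length;
  if (m === 0 && n === 0) return 1;
  const dp = Array.from({ length: m + 1 }, (_, i) => [i, ...new Array(n).fill(0)]);
  for (let j = 0; j <= n; j++) dp[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost);
    }
  }
  return 1 - dp[m][n] / Math.max(m, n);
}
```

On this definition, levenshteinSimilarity("Acme - Shipping", "Acme Shipping") is roughly 0.87, which clears the 0.85 threshold used in the invoice-002 case.
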
Replace match: fuzzy with match: exact in dataset.yaml since fuzzy
matching was removed from core. Add comments pointing to the
fuzzy_match.ts code_judge for handling OCR variations.

Demonstrates how to use code_judge for fuzzy string matching on specific
fields in structured data. invoice-002 now shows the hybrid approach:
- code_judge with Levenshtein similarity for supplier.name (OCR variations)
- field_accuracy for exact/numeric/date fields

Enables code_judge scripts to receive configuration via stdin by passing
unrecognized YAML properties as a `config` object. This allows reusable
scripts without hardcoding field paths or thresholds.

Changes:
- Add `config` property to CodeEvaluatorConfig type
- Collect unrecognized YAML properties in evaluator-parser
- Pass config to scripts via stdin payload
- Add multi_field_fuzzy.ts example demonstrating config usage
- Update dataset.yaml invoice-002 to use multi-field fuzzy matcher
- Update OpenSpec docs to reflect fuzzy matching via code_judge
- Add tests for config pass-through

Example usage:
```yaml
- type: code_judge
  script: ./multi_field_fuzzy.ts
  fields:
    - path: supplier.name
      threshold: 0.85
  algorithm: levenshtein
```
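
On the script side, the config object arrives in the stdin payload; a sketch of how a script such as multi_field_fuzzy.ts might read it (the surrounding payload shape is assumed; the config property matches the description above):

```ts
// Sketch of reading the stdin payload in a code_judge script run under bun.
const payload = JSON.parse(await Bun.stdin.text());
const config = payload.config ?? {};                                      // unrecognized YAML properties land here
const fields: Array<{ path: string; threshold: number }> = config.fields ?? [];
const algorithm: string = config.algorithm ?? "levenshtein";
```
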
Rename package and update documentation to reflect the directory rename
from invoice-extraction to document-extraction. The example content still
covers invoice extraction as a specific use case of document extraction.

Add two new evaluator types for validating execution metrics:
- `latency`: Check execution duration against a threshold (uses traceSummary.durationMs)
- `cost`: Check execution cost against a budget (uses traceSummary.costUsd)

Both evaluators leverage the existing TraceSummary infrastructure that
already collects these metrics from providers.
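
Conceptually, both reduce to a threshold check over the trace summary (a sketch of the check, not the shipped evaluator code; the config property names are assumed):

```ts
// Sketch: pass/fail threshold checks over the existing TraceSummary metrics.
interface TraceSummaryLike {
  durationMs?: number;
  costUsd?: number;
}

function checkLatency(trace: TraceSummaryLike, maxDurationMs: number): { score: number; pass: boolean } {
  const pass = trace.durationMs !== undefined && trace.durationMs <= maxDurationMs;
  return { score: pass ? 1 : 0, pass };
}

function checkCost(trace: TraceSummaryLike, budgetUsd: number): { score: number; pass: boolean } {
  const pass = trace.costUsd !== undefined && trace.costUsd <= budgetUsd;
  return { score: pass ? 1 : 0, pass };
}
```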