feat: Add structured data evaluators for document extraction evaluation #94
Open
christso wants to merge 32 commits into main from feat/add-structured-data-evaluators
Conversation
…example

- Add OpenSpec proposal for field_accuracy, iou_score, and coordinate_distance evaluators
- Include comprehensive specs with 30+ test scenarios for structured data evaluation
- Include 35+ test scenarios for geometric evaluation (bounding boxes, coordinates)
- Add design document with 10 architectural decisions and implementation patterns
- Add 43-task implementation roadmap across 7 phases
- Create functional testing example: invoice extraction with 5 test cases
- Demonstrate exact, fuzzy, and numeric tolerance matching strategies
- Include rubric-based alternative and multi-objective scoring placeholders
- Align with AgentV high-level goals: declarative, structured, optimization-ready
- Sanitize example data (demo companies, addresses, invoice numbers)

Refs: #structured-data-evaluation
- Update invoice-extraction.yaml to use file references (type: file pattern)
- Change execution target from 'default' to 'bun run mock_extractor.ts'
- Add 5 HTML fixture files simulating invoice documents
- Create mock_extractor.ts CLI that parses HTML and outputs JSON
- Add fixtures/README.md documenting test file structure
- Update README.md with directory structure and running instructions
- All test cases now reference actual fixture files (invoice-001.html, etc.)

This aligns with AgentV patterns seen in the export-screening example, where:
- input_messages use type: file for document references
- target specifies the actual CLI/script to execute
- Fixtures are versionable, readable HTML instead of binary PDFs
The previous version would always extract data perfectly matching expected values, resulting in 100% accuracy on all tests. This defeats the purpose of evaluation.

Changes:
- Mock extractor now intentionally produces realistic OCR-like variations:
  * invoice-001: Normalizes data correctly (baseline perfect extraction)
  * invoice-002: Preserves 'Acme - Shipping' formatting to test fuzzy matching
  * invoice-003: Keeps decimal precision (1889.5 vs expected 1889) to test tolerance
  * invoice-004: Returns undefined for missing fields (already missing in HTML)
- Updated YAML outcome descriptions to show ACTUAL vs EXPECTED values
- Added 'How the Mock Extractor Works' section to README
- Clarified that each test case shows a specific evaluator capability

This creates realistic test scenarios where:
- Fuzzy matching detects 'Acme - Shipping' ≈ 'Acme Shipping' (Levenshtein > 0.85)
- Numeric tolerance accepts |1889.5 - 1889| < 1.0
- Missing required fields reduce the overall score
- Perfect extraction (invoice-001) establishes the baseline
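As a minimal sketch of the checks these fixture variations exercise, the helpers below show numeric tolerance and missing-field handling; the function names and exact scoring are illustrative assumptions, not the actual evaluator code.

```ts
// Hypothetical helpers illustrating the checks the fixtures above exercise;
// not the actual AgentV evaluator implementation.

// invoice-003: numeric tolerance accepts |1889.5 - 1889| < 1.0.
function withinTolerance(actual: number, expected: number, tolerance = 1.0): boolean {
  return Math.abs(actual - expected) < tolerance;
}

// invoice-004: a missing required field contributes 0 to the aggregate score.
function scoreRequiredField(value: unknown, matched: boolean): number {
  return value === undefined || value === null ? 0 : matched ? 1 : 0;
}

console.log(withinTolerance(1889.5, 1889)); // true
console.log(scoreRequiredField(undefined, false)); // 0
```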
Product code is redundant: it is covered by the description, and the HS code is already used for classification. This simplifies the data model and reduces noise in evaluation.

Changes:
- Remove product_code column from invoice-001.html table (7 cols → 6 cols)
- Remove product_code from all line items in invoice-extraction.yaml
- Update mock_extractor.ts interface and parsing logic (6 cells instead of 7)
- Simplify the line item structure to: description, quantity, unit_price, line_total, unit_type, hs_code
Fixed two issues in mock_extractor.ts:
1. Code corruption from a previous edit - duplicate grossTotalEl sections with incomplete closing braces
2. Missing jsdom dependency causing type errors

Changes:
- Repaired netTotal/grossTotal extraction logic (removed duplicate code)
- Added jsdom and @types/jsdom as dev dependencies
- All TypeScript type errors resolved

The extractor now compiles successfully.
…ependencies
Improved organization and dependency isolation following AgentV best practices:
Structure changes:
- examples/features/structured-data/ → examples/features/invoice-extraction/
- invoice-extraction.yaml → invoice-extraction/evals/dataset.yaml
- Follows showcase pattern: {feature-name}/evals/, {feature-name}/fixtures/
Dependency isolation:
- Remove jsdom + @types/jsdom from workspace root package.json
- Add local package.json in invoice-extraction/ with jsdom as devDependency
- Add .gitignore for local node_modules and bun.lockb
- Prevents polluting production dependencies with example-only packages
Benefits:
- Each feature example is self-contained with its own dependencies
- Aligns with showcase/ directory pattern for consistency
- Clear separation: workspace deps (production) vs example deps (local)
- Easy to run: 'cd invoice-extraction && bun install && bun run extract'
Updated documentation:
- examples/features/README.md with new structure explanation
- invoice-extraction/README.md with local install instructions
Dramatically simplified the invoice extraction example by using JSON fixtures instead of HTML, eliminating unnecessary complexity and dependencies.

Changes:
- Convert fixtures: *.html → *.json (5 files)
- Simplify mock_extractor.ts: 187 lines → 27 lines (just reads JSON)
- Remove jsdom and @types/jsdom dependencies entirely
- Delete package.json (no dependencies needed!)
- Delete bun.lock (no longer needed)
- Update dataset.yaml: reference ../fixtures/*.json instead of ./fixtures/*.html
- Update all READMEs to reflect the simpler JSON-based approach

Benefits:
✅ Zero dependencies (uses only Node.js built-ins)
✅ 85% reduction in code complexity (187 → 27 lines)
✅ Clearer test data (JSON is immediately readable)
✅ Faster to understand and modify
✅ Focuses on evaluator demonstration, not document parsing
✅ No build step or installation required

The JSON fixtures still contain the same intentional variations for testing:
- invoice-002: 'Acme - Shipping' (fuzzy matching)
- invoice-003: 1889.5 vs expected 1889 (numeric tolerance)
- invoice-004: Missing invoice_number (required field)
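For context, a fixture-reading extractor of roughly that size could look like the sketch below; this is an assumption about the shape of the simplified script, not the actual contents of mock_extractor.ts.

```ts
// Hypothetical sketch of a "just reads JSON" mock extractor;
// the real mock_extractor.ts may differ in details.
import { readFileSync } from "node:fs";

const fixturePath = process.argv[2];
if (!fixturePath) {
  console.error("Usage: bun run mock_extractor.ts <fixture.json>");
  process.exit(1);
}

// The fixture already contains the "extracted" structured data,
// including the intentional OCR-like variations described above.
const extracted = JSON.parse(readFileSync(fixturePath, "utf8"));
console.log(JSON.stringify(extracted, null, 2));
```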
… configuration

- Rename examples/features/invoice-extraction to document-extraction (industry standard terminology)
- Add .agentv/targets.yaml with CLI provider configuration for mock_extractor
- Update mock_extractor.ts to support an optional output file parameter for AgentV integration
- Update dataset.yaml to reference the mock_extractor target by name instead of an inline command
- Functional test confirms mock_extractor works correctly with the AgentV CLI provider
- Defer geometric evaluators (IoU, coordinate distance) to code_judge plugins
- Add date match type with format normalization to field_accuracy spec
- Convert geometric-evaluators spec to a plugin implementation guide with Python examples
- Update tasks.md to reflect the simplified scope focused on field_accuracy
- Update example dataset.yaml to use date matching for the invoice_date field

This aligns with AgentV's "lightweight core, plugin extensibility" principle by keeping complex algorithms (polygon intersection, Hungarian matching) out of core while providing ready-to-use code_judge scripts for users who need them.
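For illustration, date matching with format normalization (treating ambiguous numeric dates as US-style, per the documented default) might look like the sketch below; the helper names and regexes are assumptions, not the spec's wording.

```ts
// Hypothetical sketch of date normalization for comparison; the spec's actual
// normalization rules are not reproduced here. Ambiguous numeric dates are
// treated as US-style (MM/DD/YYYY), matching the documented default.
function normalizeDate(value: string): string {
  const iso = value.match(/^(\d{4})-(\d{2})-(\d{2})$/);
  if (iso) return `${iso[1]}-${iso[2]}-${iso[3]}`;
  const us = value.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
  if (us) return `${us[3]}-${us[1].padStart(2, "0")}-${us[2].padStart(2, "0")}`;
  return value; // unknown format: compare as-is
}

function dateMatch(actual: string, expected: string): boolean {
  return normalizeDate(actual) === normalizeDate(expected);
}

console.log(dateMatch("01/15/2024", "2024-01-15")); // true
```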
Implement the field_accuracy evaluator from OpenSpec proposal add-structured-data-evaluators. This evaluator enables declarative validation of extracted structured data with configurable matching strategies.

Features:
- Exact, fuzzy (Levenshtein/Jaro-Winkler), numeric tolerance, and date matching
- Dot-notation field paths with array indexing (e.g., items[0].price)
- Weighted average and all-or-nothing aggregation strategies
- Comprehensive test suite (12 new tests)

Files changed:
- packages/core/src/evaluation/types.ts: Add FieldAccuracyEvaluatorConfig
- packages/core/src/evaluation/evaluators.ts: Implement FieldAccuracyEvaluator
- packages/core/src/evaluation/loaders/evaluator-parser.ts: Parse field_accuracy
- packages/core/src/evaluation/orchestrator.ts: Register field_accuracy handler
- packages/core/test/evaluation/evaluators.test.ts: Add evaluator tests
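A minimal sketch of two of the mechanics listed above, dot-notation path resolution with array indexing and weighted-average aggregation; the function names and shapes are assumptions, not the actual FieldAccuracyEvaluator implementation.

```ts
// Hypothetical sketch, not the actual FieldAccuracyEvaluator implementation.

// Resolve a dot-notation path with array indexing, e.g. "items[0].price".
function resolvePath(obj: unknown, path: string): unknown {
  const segments = path.split(".").flatMap((seg) => {
    const match = seg.match(/^([^[\]]+)((\[\d+\])*)$/);
    if (!match) return [seg];
    const indices = [...match[2].matchAll(/\[(\d+)\]/g)].map((m) => Number(m[1]));
    return [match[1], ...indices];
  });
  return segments.reduce<any>((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

// Weighted-average aggregation over per-field scores in [0, 1].
function weightedAverage(fields: { score: number; weight?: number }[]): number {
  const totalWeight = fields.reduce((sum, f) => sum + (f.weight ?? 1), 0);
  if (totalWeight === 0) return 0;
  return fields.reduce((sum, f) => sum + f.score * (f.weight ?? 1), 0) / totalWeight;
}

const extracted = { items: [{ price: 42.5 }] };
console.log(resolvePath(extracted, "items[0].price")); // 42.5
console.log(weightedAverage([{ score: 1 }, { score: 0.5, weight: 2 }])); // ≈ 0.667
```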
The test case was missing invoice_number in expected_messages, so the evaluator had nothing to compare against. Now properly tests the 'missing required field' scenario with score dropping to ~0.85.
Align with the lightweight core principle by moving fuzzy string matching (Levenshtein, Jaro-Winkler) out of the built-in field_accuracy evaluator.

Changes:
- Remove 'fuzzy' match type from FieldMatchType
- Remove ~100 LOC of algorithm implementations from evaluators.ts
- Add fuzzy_match.ts code_judge example with both algorithms
- Add early warning for empty fields array in parser
- Document date format disambiguation behavior (defaults to US)
- Update changeset and README documentation

This is consistent with how IoU/geometric evaluators are handled - complex algorithms belong in plugins, not core.
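As a rough reference for what a fuzzy_match.ts code_judge script has to compute, here is a standard Levenshtein-based similarity ratio; the actual script's structure and I/O contract are not reproduced here.

```ts
// Standard Levenshtein distance via dynamic programming; illustrative of the
// algorithm moved out of core, not the actual fuzzy_match.ts contents.
function levenshtein(a: string, b: string): number {
  const rows = a.length + 1;
  const cols = b.length + 1;
  const d: number[][] = Array.from({ length: rows }, (_, i) =>
    Array.from({ length: cols }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return d[rows - 1][cols - 1];
}

// Normalized similarity in [0, 1]; 1.0 means identical strings.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

console.log(similarity("Acme - Shipping", "Acme Shipping")); // ≈ 0.87, above the 0.85 threshold
```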
Replace match: fuzzy with match: exact in dataset.yaml since fuzzy matching was removed from core. Added comments pointing to fuzzy_match.ts code_judge for OCR variation handling.
Demonstrates how to use code_judge for fuzzy string matching on specific fields in structured data.

invoice-002 now shows the hybrid approach:
- code_judge with Levenshtein similarity for supplier.name (OCR variations)
- field_accuracy for exact/numeric/date fields
Enables code_judge scripts to receive configuration via stdin by passing
unrecognized YAML properties as a `config` object. This allows reusable
scripts without hardcoding field paths or thresholds.
Changes:
- Add `config` property to CodeEvaluatorConfig type
- Collect unrecognized YAML properties in evaluator-parser
- Pass config to scripts via stdin payload
- Add multi_field_fuzzy.ts example demonstrating config usage
- Update dataset.yaml invoice-002 to use multi-field fuzzy matcher
- Update OpenSpec docs to reflect fuzzy matching via code_judge
- Add tests for config pass-through
Example usage:
```yaml
- type: code_judge
  script: ./multi_field_fuzzy.ts
  fields:
    - path: supplier.name
      threshold: 0.85
      algorithm: levenshtein
```
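On the script side, a code_judge consumer of that config could read the stdin payload roughly as sketched below; everything beyond the `config` property (overall payload shape, output contract) is an assumption for illustration.

```ts
// Hypothetical sketch of a code_judge script consuming config from stdin.
// Only the `config` property is described above; other payload fields and
// the output format are assumptions.
interface FuzzyFieldConfig {
  path: string;
  threshold: number;
  algorithm: "levenshtein" | "jaro_winkler";
}

let raw = "";
process.stdin.setEncoding("utf8");
process.stdin.on("data", (chunk) => (raw += chunk));
process.stdin.on("end", () => {
  const payload = JSON.parse(raw);
  // Unrecognized YAML properties arrive under `config`.
  const fields: FuzzyFieldConfig[] = payload.config?.fields ?? [];
  for (const field of fields) {
    console.error(`would compare ${field.path} with ${field.algorithm} >= ${field.threshold}`);
  }
  // A real script would compute and emit a score; this placeholder is illustrative only.
  console.log(JSON.stringify({ score: 1.0 }));
});
```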
Rename package and update documentation to reflect the directory rename from invoice-extraction to document-extraction. The example content still covers invoice extraction as a specific use case of document extraction.
Add two new evaluator types for validating execution metrics:
- `latency`: Check execution duration against a threshold (uses traceSummary.durationMs)
- `cost`: Check execution cost against a budget (uses traceSummary.costUsd)

Both evaluators leverage the existing TraceSummary infrastructure that already collects these metrics from providers.
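Conceptually, both checks reduce to a threshold comparison over the trace summary, as in the sketch below; durationMs and costUsd come from the commit message, while the threshold names (maxMs, maxUsd) and result shape are assumptions.

```ts
// Hypothetical sketch of latency/cost threshold checks; not the shipped evaluators.
interface TraceSummaryLike {
  durationMs: number;
  costUsd: number;
}

function evaluateLatency(trace: TraceSummaryLike, maxMs: number): { score: number } {
  return { score: trace.durationMs <= maxMs ? 1 : 0 };
}

function evaluateCost(trace: TraceSummaryLike, maxUsd: number): { score: number } {
  return { score: trace.costUsd <= maxUsd ? 1 : 0 };
}

const trace = { durationMs: 1250, costUsd: 0.004 };
console.log(evaluateLatency(trace, 2000)); // { score: 1 }
console.log(evaluateCost(trace, 0.001)); // { score: 0 }
```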
Summary
This PR introduces a comprehensive proposal for adding structured data evaluators to AgentV, enabling precise evaluation of document extraction systems (invoices, receipts, forms, etc.).
What's Included
OpenSpec Proposal (openspec/changes/add-structured-data-evaluators/)
- structured-data-evaluators/spec.md: field_accuracy evaluator (10 requirements, 30+ scenarios)
- geometric-evaluators/spec.md: iou_score & coordinate_distance (12 requirements, 35+ scenarios)

Functional Testing Example (examples/features/structured-data/)

Evaluator Types Proposed
- field_accuracy: Compare extracted structured data against an expected schema, with nested field paths (e.g., invoice.line_items[0].amount)
- iou_score: Bounding box overlap for document layout verification (sketched below)
- coordinate_distance: Spatial comparison for field positioning
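For reference, intersection-over-union for two axis-aligned bounding boxes is the standard computation sketched below; the {x, y, width, height} box shape is an assumption, not the proposal's confirmed schema.

```ts
// Standard IoU for axis-aligned bounding boxes; the {x, y, width, height}
// representation is assumed for illustration.
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

function iou(a: BBox, b: BBox): number {
  const left = Math.max(a.x, b.x);
  const top = Math.max(a.y, b.y);
  const right = Math.min(a.x + a.width, b.x + b.width);
  const bottom = Math.min(a.y + a.height, b.y + b.height);
  const interArea = Math.max(0, right - left) * Math.max(0, bottom - top);
  const unionArea = a.width * a.height + b.width * b.height - interArea;
  return unionArea === 0 ? 0 : interArea / unionArea;
}

// Identical boxes score 1.0; disjoint boxes score 0.0.
console.log(iou({ x: 0, y: 0, width: 10, height: 10 }, { x: 5, y: 5, width: 10, height: 10 })); // ≈ 0.143
```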
AgentV Goals Alignment
✅ Declarative Definitions: YAML-based configuration with clear expected outcomes
⚠️ Multi-Objective Scoring: Example focuses on correctness; includes placeholders for latency/cost/safety
✅ Structured Evaluation: Deterministic field comparison (primitive for rubric-based patterns)
✅ Optimization Ready: Weighted fields enable future hyperparameter tuning
Notable Design Decisions
- `get` for nested field path resolution (already in dependencies)

Validation
`openspec validate add-structured-data-evaluators --strict`

Next Steps
Implementation is tracked in tasks.md across 7 phases.
Ready for review: Proposal is self-contained for AI agent implementation without conversation context.