Merged
Changes from all commits
42 commits
7bd0aa9
feat: Add structured data evaluators proposal and invoice extraction …
christso Jan 1, 2026
4ba1e07
feat: Add file references and mock extractor to invoice example
christso Jan 1, 2026
3c1c897
fix: Make mock extractor produce realistic variations for testing
christso Jan 1, 2026
a2605ad
refactor: Remove redundant product_code field from invoice data
christso Jan 1, 2026
43a81a5
fix: Repair corrupted code and add jsdom dependencies
christso Jan 1, 2026
ec39c17
refactor: Restructure examples to match showcase pattern with local d…
christso Jan 1, 2026
05f7e47
refactor: Simplify fixtures from HTML to JSON, remove jsdom dependency
christso Jan 2, 2026
4cdfc77
Rename invoice-extraction to document-extraction and add targets.yaml…
christso Jan 2, 2026
ff89b2c
refactor: Simplify structured data evaluators proposal
christso Jan 2, 2026
89de1a1
feat: Add field_accuracy evaluator for structured data comparison
christso Jan 2, 2026
e2ce716
fix: Correct invoice-004 test case to include expected invoice_number
christso Jan 2, 2026
501f650
refactor: Remove fuzzy matching from core, provide as code_judge example
christso Jan 2, 2026
e1ac6e5
fix: Update dataset to remove fuzzy match type references
christso Jan 2, 2026
0bb68d8
feat: Add supplier_name_fuzzy.ts example for field-level fuzzy matching
christso Jan 2, 2026
97a6d10
feat: Add config pass-through for code_judge evaluators
christso Jan 2, 2026
a8dc46d
docs: Update references from invoice-extraction to document-extraction
christso Jan 2, 2026
f1616dd
feat: Add latency and cost evaluators for execution metrics
christso Jan 2, 2026
256c990
docs: Update multi-objective scoring status to implemented
christso Jan 2, 2026
55af03e
chore: Remove unnecessary $schema from eval file
christso Jan 2, 2026
eb4712c
chore: Remove $schema from eval-schema.json and examples
christso Jan 2, 2026
b4bd016
Fix composite evaluators trace context
christso Jan 2, 2026
1fb6f8c
Trim document-extraction README and update eval-builder skill
christso Jan 2, 2026
bd4d075
Move structured evaluator guidance into skill reference
christso Jan 2, 2026
e7f896f
Remove redundant supplier_name_fuzzy example
christso Jan 2, 2026
5c53fb2
Trim verbose dataset header comment
christso Jan 2, 2026
9b25eb6
Fix invoice-003 outcome text
christso Jan 2, 2026
3e6e142
Clarify invoice-003 outcome wording
christso Jan 2, 2026
9a34028
Clarify mock output wording in invoice-003 outcome
christso Jan 2, 2026
4c8820c
Avoid embedding file contents for cli providers
christso Jan 2, 2026
d1e7f33
Refactor CLI prompt formatting check
christso Jan 2, 2026
8fda553
Fix invoice-002 outcome wording
christso Jan 2, 2026
122803a
Add token_usage evaluator
christso Jan 2, 2026
543e26b
Update code_judge scripts to argv with legacy fallback
christso Jan 3, 2026
1e38371
Fix CLI healthcheck cwd fallback for local-cli
christso Jan 3, 2026
2313998
Fix feature example targets and trace handling
christso Jan 3, 2026
4808d79
Normalize example evaluators to snake_case
christso Jan 3, 2026
b757cff
Remove obsolete batch-cli eval targets
christso Jan 3, 2026
470f070
Rename SnakeTraceSummary to TraceSummary
christso Jan 3, 2026
4080135
Add example eval baselines and baseline checks
christso Jan 3, 2026
909cfbe
Require eval_id in compare results
christso Jan 3, 2026
986591d
Skip content-filtered eval in CI baselines
christso Jan 4, 2026
d1d9442
Run all eval baselines in CI
christso Jan 4, 2026
12 changes: 12 additions & 0 deletions .changeset/add-field-accuracy-evaluator.md
@@ -0,0 +1,12 @@
---
"@agentv/core": minor
"agentv": minor
---

Add `field_accuracy`, `latency`, and `cost` evaluators

- `field_accuracy`: Compare structured data fields with exact, numeric_tolerance, or date matching
- `latency`: Check execution duration against a threshold (uses `traceSummary.durationMs`)
- `cost`: Check execution cost against a budget (uses `traceSummary.costUsd`)

See `examples/features/document-extraction/README.md` for usage examples.
6 changes: 6 additions & 0 deletions .changeset/add-token-usage-evaluator.md
@@ -0,0 +1,6 @@
---
"@agentv/core": minor
"agentv": minor
---

Add `token_usage` evaluator to gate on provider-reported token budgets.
5 changes: 5 additions & 0 deletions .changeset/fix-composite-trace-context.md
@@ -0,0 +1,5 @@
---
"@agentv/core": patch
---

Fix composite evaluators to pass through trace and output message context so trace-dependent evaluators (e.g. latency/cost/tool_trajectory) work when nested.
1 change: 1 addition & 0 deletions .claude/skills/agentv-eval-builder/SKILL.md
@@ -14,6 +14,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI agents
- Rubrics: `references/rubric-evaluator.md` - Structured criteria-based evaluation
- Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
- Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
- Structured Data + Metrics: `references/structured-data-evaluators.md` - `field_accuracy`, `latency`, `cost`
- Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
- Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
- Compare: `references/compare-command.md` - Compare evaluation results between runs
16 changes: 10 additions & 6 deletions .claude/skills/agentv-eval-builder/references/eval-schema.json
@@ -4,11 +4,6 @@
"description": "Schema for YAML evaluation files with conversation flows, multiple evaluators, and execution configuration",
"type": "object",
"properties": {
"$schema": {
"type": "string",
"description": "Schema identifier",
"enum": ["agentv-eval-v2"]
},
"description": {
"type": "string",
"description": "Description of what this eval suite covers"
@@ -37,7 +32,16 @@
},
"type": {
"type": "string",
"enum": ["code", "llm_judge"],
"enum": [
"code",
"llm_judge",
"composite",
"tool_trajectory",
"field_accuracy",
"latency",
"cost",
"token_usage"
],
"description": "Evaluator type: 'code' for scripts/regex/keywords, 'llm_judge' for LLM-based evaluation"
},
"script": {
121 changes: 121 additions & 0 deletions .claude/skills/agentv-eval-builder/references/structured-data-evaluators.md
@@ -0,0 +1,121 @@
# Structured Data + Metrics Evaluators

This reference covers the built-in evaluators used for grading structured outputs and gating on execution metrics:

- `field_accuracy`
- `latency`
- `cost`
- `token_usage`

## Ground Truth (`expected_messages`)

Put the expected structured output in the evalcase `expected_messages` (typically as the last `assistant` message with `content` as an object). Evaluators read expected values from there.

```yaml
evalcases:
- id: invoice-001
expected_messages:
- role: assistant
content:
invoice_number: "INV-2025-001234"
net_total: 1889
```

## `field_accuracy`

Use `field_accuracy` to compare fields in the candidate JSON against the ground-truth object in `expected_messages`.

```yaml
execution:
evaluators:
- name: invoice_fields
type: field_accuracy
aggregation: weighted_average
fields:
- path: invoice_number
match: exact
required: true
weight: 2.0
- path: invoice_date
match: date
formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
- path: net_total
match: numeric_tolerance
tolerance: 1.0
```

### Match types

- `exact`: strict equality
- `date`: compares dates after parsing; optionally provide `formats`
- `numeric_tolerance`: numeric comparison within `tolerance`; set `relative: true` to treat the tolerance as relative (see the sketch below)
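
As a concrete illustration of the presumed `numeric_tolerance` semantics (a sketch, not taken from the implementation):

```ts
// Sketch: absolute tolerance by default, relative when `relative: true` (assumed behaviour).
function numericMatches(
  actual: number,
  expected: number,
  tolerance: number,
  relative = false,
): boolean {
  const allowed = relative ? tolerance * Math.abs(expected) : tolerance;
  return Math.abs(actual - expected) <= allowed;
}

numericMatches(1888.5, 1889, 1.0);        // true  - within 1.0 absolute
numericMatches(1900, 1889, 0.005, true);  // false - outside 0.5% relative
```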

For fuzzy string matching, use a `code_judge` evaluator (e.g. Levenshtein) instead of adding a fuzzy mode to `field_accuracy`.
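
A minimal sketch of such a `code_judge` script, assuming the candidate and expected objects arrive as JSON arguments on `argv` (per the argv-based contract mentioned in the commit history) and the result is printed as JSON with a `score`; the exact I/O contract, and the `supplier_name` field used here, are illustrative assumptions to check against `references/custom-evaluators.md`:

```ts
// Hypothetical fuzzy-match code_judge: scores supplier_name by normalized Levenshtein distance.
// Assumption: argv[2] = candidate JSON, argv[3] = expected JSON; result is printed as JSON on stdout.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array<number>(b.length).fill(0)]);
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

const candidate = JSON.parse(process.argv[2] ?? "{}");
const expected = JSON.parse(process.argv[3] ?? "{}");
const actual = String(candidate.supplier_name ?? "").toLowerCase();
const target = String(expected.supplier_name ?? "").toLowerCase();
const score = target.length === 0 ? 0 : Math.max(0, 1 - levenshtein(actual, target) / target.length);
console.log(JSON.stringify({ score }));
```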

### Aggregation

- `weighted_average` (default): weighted mean of field scores
- `all_or_nothing`: score 1.0 only if all graded fields pass
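
For illustration, the `weighted_average` arithmetic presumably reduces to a weighted mean like the following sketch (treating a missing `weight` as 1.0 is an assumption):

```ts
// Sketch: weighted mean of per-field scores.
type FieldScore = { score: number; weight?: number };

function weightedAverage(fields: FieldScore[]): number {
  const totalWeight = fields.reduce((sum, f) => sum + (f.weight ?? 1), 0);
  if (totalWeight === 0) return 0;
  return fields.reduce((sum, f) => sum + f.score * (f.weight ?? 1), 0) / totalWeight;
}

// With the config above: invoice_number (weight 2.0) correct, invoice_date correct, net_total wrong:
weightedAverage([{ score: 1, weight: 2.0 }, { score: 1 }, { score: 0 }]); // (2 + 1 + 0) / 4 = 0.75
```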

## `latency` and `cost`

These evaluators gate on execution metrics reported by the provider (via `traceSummary`).

```yaml
execution:
evaluators:
- name: performance
type: latency
threshold: 2000
- name: budget
type: cost
budget: 0.10
```
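
The changeset notes that `latency` reads `traceSummary.durationMs` and `cost` reads `traceSummary.costUsd`. One plausible reading of the gating behaviour, sketched under the assumption that scoring is binary and a missing metric fails the gate:

```ts
// Sketch: metric gates against provider-reported trace data (binary scoring is an assumption).
interface TraceSummary {
  durationMs?: number;
  costUsd?: number;
}

function latencyScore(trace: TraceSummary, thresholdMs: number): number {
  return trace.durationMs !== undefined && trace.durationMs <= thresholdMs ? 1 : 0;
}

function costScore(trace: TraceSummary, budgetUsd: number): number {
  return trace.costUsd !== undefined && trace.costUsd <= budgetUsd ? 1 : 0;
}
```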

## `token_usage`

Gate on provider-reported token usage (useful when cost is unavailable or model pricing differs).

```yaml
execution:
evaluators:
- name: token-budget
type: token_usage
max_total: 10000
# or:
# max_input: 8000
# max_output: 2000
```

## Common pattern: combine correctness + gates

Use a `composite` evaluator if you want a single “release gate” score/verdict from multiple checks:

```yaml
execution:
evaluators:
- name: release_gate
type: composite
evaluators:
- name: correctness
type: field_accuracy
fields:
- path: invoice_number
match: exact
- name: latency
type: latency
threshold: 2000
- name: cost
type: cost
budget: 0.10
- name: tokens
type: token_usage
max_total: 10000
aggregator:
type: weighted_average
weights:
correctness: 0.8
latency: 0.1
cost: 0.05
tokens: 0.05
```
5 changes: 1 addition & 4 deletions apps/cli/src/commands/compare/index.ts
@@ -52,10 +52,7 @@ export function loadJsonlResults(filePath: string): EvalResult[] {
.filter((line) => line.trim());

return lines.map((line) => {
const record = JSON.parse(line) as {
eval_id?: string;
score?: number;
};
const record = JSON.parse(line) as { eval_id?: string; score?: number };
if (typeof record.eval_id !== 'string') {
throw new Error(`Missing eval_id in result: ${line}`);
}
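For reference, a `results.jsonl` line accepted by this parser needs a string `eval_id`; `score` is optional. The values below are illustrative:

```ts
// Building one accepted results.jsonl record.
const line = JSON.stringify({ eval_id: "invoice-001", score: 0.92 });
// -> {"eval_id":"invoice-001","score":0.92}
```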
2 changes: 1 addition & 1 deletion apps/cli/test/commands/compare/compare.test.ts
@@ -22,7 +22,7 @@ describe('compare command', () => {
});

describe('loadJsonlResults', () => {
it('should load valid JSONL file with snake_case eval results', () => {
it('should load valid JSONL file with eval_id results', () => {
const filePath = path.join(tempDir, 'results.jsonl');
writeFileSync(
filePath,
4 changes: 2 additions & 2 deletions bun.lock
@@ -21,7 +21,7 @@
},
"apps/cli": {
"name": "agentv",
"version": "1.2.0",
"version": "1.6.0",
"bin": {
"agentv": "./dist/cli.js",
},
@@ -39,7 +39,7 @@
},
"packages/core": {
"name": "@agentv/core",
"version": "1.2.0",
"version": "1.5.0",
"dependencies": {
"@ai-sdk/anthropic": "^2.0.53",
"@ai-sdk/azure": "^2.0.78",