Add Execution Metrics Evaluator
Summary
Introduce a built-in execution_metrics evaluator type for threshold-based checks on execution metrics (token usage, cost, tool call count, duration).
Motivation
Core already computes TraceSummary with execution metrics (tokenUsage, costUsd, durationMs, toolCallsByName). Users commonly want to check these against thresholds:
- "Fail if cost exceeds $0.05"
- "Warn if more than 10 tool calls"
- "Check token usage stays under budget"
Currently, users must write a code_judge script to perform these simple threshold checks. This is:
- Boilerplate-heavy for common use cases
- Error-prone - must manually define interfaces that match core's JSON contract
- Language-dependent - TypeScript scripts duplicate core interfaces
A built-in evaluator would cover 80% of use cases with simple configuration.
Proposed Changes
1. New Evaluator Type: execution_metrics
    evaluators:
      - name: efficiency
        type: execution_metrics
        max_tool_calls: 10
        max_tokens: 5000
        max_cost_usd: 0.05
        max_duration_ms: 30000
        target_exploration_ratio: 0.6 # optional

2. Configuration Options
| Option | Type | Description |
|---|---|---|
| `max_tool_calls` | number | Maximum total tool calls allowed |
| `max_tokens` | number | Maximum total tokens (input + output) |
| `max_cost_usd` | number | Maximum cost in USD |
| `max_duration_ms` | number | Maximum execution duration (ms) |
| `target_exploration_ratio` | number | Target ratio of exploration tools (0-1) |
| `exploration_tolerance` | number | Tolerance for exploration ratio (default: 0.2) |
All options are optional. Only specified thresholds are checked.
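
As a rough illustration of "only specified thresholds are checked", the options could be modelled as an all-optional config type. This is a sketch only: `ExecutionMetricsConfig`, `activeChecks`, and `resolveTolerance` are hypothetical names for this issue, not existing core types or functions.

```typescript
// Hypothetical config shape mirroring the option table above (not an existing core type).
interface ExecutionMetricsConfig {
  max_tool_calls?: number;
  max_tokens?: number;
  max_cost_usd?: number;
  max_duration_ms?: number;
  target_exploration_ratio?: number;
  exploration_tolerance?: number;
}

// Default taken from the option table.
const DEFAULT_EXPLORATION_TOLERANCE = 0.2;

// Returns the names of the thresholds the user actually configured.
// exploration_tolerance is a modifier, not a check, so it is not listed.
function activeChecks(config: ExecutionMetricsConfig): string[] {
  const checkKeys: (keyof ExecutionMetricsConfig)[] = [
    "max_tool_calls",
    "max_tokens",
    "max_cost_usd",
    "max_duration_ms",
    "target_exploration_ratio",
  ];
  return checkKeys.filter((key) => config[key] !== undefined);
}

// Falls back to the documented default when no tolerance is given.
function resolveTolerance(config: ExecutionMetricsConfig): number {
  return config.exploration_tolerance ?? DEFAULT_EXPLORATION_TOLERANCE;
}
```

With this shape, a config that sets only `max_cost_usd` would produce a single check, and an empty config would produce none.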
3. Scoring
- Each threshold check contributes equally to the final score
- Pass = 1.0, Fail = 0.0 (or proportional penalty for soft limits)
- `hits` lists passed checks, `misses` lists failed checks
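
The scoring rules above could be sketched roughly as follows, for hard limits only (the proportional penalty for soft limits is not shown). `MetricsSnapshot`, `Limits`, and `scoreThresholds` are hypothetical names, and returning 1.0 when no thresholds are configured is an assumption rather than something this proposal specifies.

```typescript
// Hypothetical, simplified view of the metrics; the real TraceSummary may differ.
interface MetricsSnapshot {
  toolCalls: number;
  totalTokens: number;
  costUsd: number;
  durationMs: number;
}

interface Limits {
  max_tool_calls?: number;
  max_tokens?: number;
  max_cost_usd?: number;
  max_duration_ms?: number;
}

interface ScoreResult {
  score: number;
  hits: string[];
  misses: string[];
}

function scoreThresholds(metrics: MetricsSnapshot, limits: Limits): ScoreResult {
  // Each row: option name, observed value, configured limit (if any).
  const checks: Array<[string, number, number | undefined]> = [
    ["max_tool_calls", metrics.toolCalls, limits.max_tool_calls],
    ["max_tokens", metrics.totalTokens, limits.max_tokens],
    ["max_cost_usd", metrics.costUsd, limits.max_cost_usd],
    ["max_duration_ms", metrics.durationMs, limits.max_duration_ms],
  ];
  const hits: string[] = [];
  const misses: string[] = [];
  for (const [name, actual, limit] of checks) {
    if (limit === undefined) continue; // unspecified thresholds are skipped
    if (actual <= limit) {
      hits.push(`${name}: ${actual} <= ${limit}`);
    } else {
      misses.push(`${name}: ${actual} > ${limit}`);
    }
  }
  const total = hits.length + misses.length;
  // Assumption: with nothing to check, the evaluator passes trivially.
  const score = total === 0 ? 1 : hits.length / total;
  return { score, hits, misses };
}
```

Because every configured check carries equal weight, one failure out of two configured thresholds yields a score of 0.5.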
4. Implementation
Leverage existing core utilities:
- `TraceSummary` interface (already computed)
- `explorationRatio()` helper function
- `tokensPerTool()` helper function
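
For the exploration-ratio check, one possible shape is sketched below. It does not call core's `explorationRatio()` helper, whose signature is not given in this issue; `sketchExplorationRatio` and `withinTolerance` are hypothetical stand-ins, and both the list of exploration tool names and the per-tool call map (in the style of `toolCallsByName`) are assumed inputs.

```typescript
// Hypothetical stand-in for the core helper: share of tool calls that are exploration calls.
// Returns undefined when there were no tool calls, so the ratio is never NaN.
function sketchExplorationRatio(
  toolCallsByName: Record<string, number>,
  explorationTools: string[],
): number | undefined {
  const total = Object.values(toolCallsByName).reduce((sum, n) => sum + n, 0);
  if (total === 0) return undefined;
  const exploration = explorationTools.reduce(
    (sum, name) => sum + (toolCallsByName[name] ?? 0),
    0,
  );
  return exploration / total;
}

// Passes when the observed ratio is within `tolerance` of the target.
// The default of 0.2 matches exploration_tolerance in the option table.
function withinTolerance(ratio: number, target: number, tolerance = 0.2): boolean {
  return Math.abs(ratio - target) <= tolerance;
}
```

A run with 6 exploration calls out of 10 has a ratio of 0.6, which passes against `target_exploration_ratio: 0.6`. How a run with zero tool calls should be scored is left open here.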
Migration Strategy
- Additive change - existing evaluators unchanged
- Showcase `efficiency-scorer.ts` remains as an advanced example for custom formulas/heuristics
- Document the built-in as the preferred approach for simple threshold checks
Relationship to Existing Features
- Complements `tool_trajectory` (sequence checking) with metric checking
- Uses the same `TraceSummary` data already passed to evaluators
- Simpler alternative to `code_judge` for threshold-based checks