feat: add built-in execution_metrics evaluator #103

@christso

Description

Add Execution Metrics Evaluator

Summary

Introduce a built-in execution_metrics evaluator type for threshold-based checks on execution metrics (token usage, cost, tool call count, duration).

Motivation

Core already computes TraceSummary with execution metrics (tokenUsage, costUsd, durationMs, toolCallsByName). Users commonly want to check these against thresholds:

  • "Fail if cost exceeds $0.05"
  • "Warn if more than 10 tool calls"
  • "Check token usage stays under budget"

Currently, users must write a code_judge script to perform these simple threshold checks. This is:

  1. Boilerplate-heavy for common use cases
  2. Error-prone - must manually define interfaces that match core's JSON contract
  3. Language-dependent - TypeScript scripts duplicate core interfaces

A built-in evaluator would cover most of these use cases (an estimated 80%) with simple configuration.

Proposed Changes

1. New Evaluator Type: execution_metrics

evaluators:
  - name: efficiency
    type: execution_metrics
    max_tool_calls: 10
    max_tokens: 5000
    max_cost_usd: 0.05
    max_duration_ms: 30000
    target_exploration_ratio: 0.6  # optional

2. Configuration Options

Option                     Type     Description
max_tool_calls             number   Maximum total tool calls allowed
max_tokens                 number   Maximum total tokens (input + output)
max_cost_usd               number   Maximum cost in USD
max_duration_ms            number   Maximum execution duration in milliseconds
target_exploration_ratio   number   Target ratio of exploration tools (0-1)
exploration_tolerance      number   Tolerance for exploration ratio (default: 0.2)

All options are optional. Only specified thresholds are checked.
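
As a rough illustration of "only specified thresholds are checked", the options could be modeled as an all-optional TypeScript type. This is a sketch, not core's actual schema: the type name `ExecutionMetricsConfig` and the helpers `activeChecks` and `resolveTolerance` are hypothetical; only the field names and the 0.2 default come from the table above.

```typescript
// Hypothetical config shape mirroring the options table (not core's real type).
interface ExecutionMetricsConfig {
  max_tool_calls?: number;
  max_tokens?: number;
  max_cost_usd?: number;
  max_duration_ms?: number;
  target_exploration_ratio?: number;
  exploration_tolerance?: number;
}

// Default taken from the options table.
const DEFAULT_EXPLORATION_TOLERANCE = 0.2;

// Names of the checks that would run: one per threshold the user actually set.
// exploration_tolerance is a modifier, not a check, so it is not listed here.
function activeChecks(config: ExecutionMetricsConfig): string[] {
  const keys: (keyof ExecutionMetricsConfig)[] = [
    "max_tool_calls",
    "max_tokens",
    "max_cost_usd",
    "max_duration_ms",
    "target_exploration_ratio",
  ];
  return keys.filter((key) => config[key] !== undefined);
}

// Falls back to the documented default when no tolerance is configured.
function resolveTolerance(config: ExecutionMetricsConfig): number {
  return config.exploration_tolerance ?? DEFAULT_EXPLORATION_TOLERANCE;
}
```

With this shape, a config that sets only `max_cost_usd` yields a single active check, and an empty config yields none.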

3. Scoring

  • Each threshold check contributes equally to the final score
  • Pass = 1.0, Fail = 0.0 (or proportional penalty for soft limits)
  • hits lists passed checks, misses lists failed checks
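
The equal-weight scoring rule could look roughly like the sketch below. It covers hard limits only (pass = 1.0, fail = 0.0); the proportional penalty for soft limits is not specified in this issue and is left out. The `Metrics` shape, the `scoreThresholds` name, the hit/miss message format, and the choice to return 1.0 when no thresholds are set are all assumptions made for illustration.

```typescript
// Hypothetical inputs: thresholds from config, metrics already flattened to numbers.
interface Thresholds {
  max_tool_calls?: number;
  max_tokens?: number;
  max_cost_usd?: number;
  max_duration_ms?: number;
}

interface Metrics {
  toolCalls: number;
  tokens: number; // input + output
  costUsd: number;
  durationMs: number;
}

interface ThresholdResult {
  score: number;
  hits: string[];
  misses: string[];
}

function scoreThresholds(thresholds: Thresholds, metrics: Metrics): ThresholdResult {
  const checks: Array<[string, number | undefined, number]> = [
    ["max_tool_calls", thresholds.max_tool_calls, metrics.toolCalls],
    ["max_tokens", thresholds.max_tokens, metrics.tokens],
    ["max_cost_usd", thresholds.max_cost_usd, metrics.costUsd],
    ["max_duration_ms", thresholds.max_duration_ms, metrics.durationMs],
  ];
  const hits: string[] = [];
  const misses: string[] = [];
  for (const [name, limit, actual] of checks) {
    if (limit === undefined) continue; // unspecified thresholds are skipped
    const message = `${name}: ${actual} (limit ${limit})`;
    // A check fails only when the metric exceeds its limit.
    if (actual <= limit) hits.push(message);
    else misses.push(message);
  }
  const total = hits.length + misses.length;
  // Assumption: with no thresholds configured there is nothing to fail, so score 1.0.
  const score = total === 0 ? 1 : hits.length / total;
  return { score, hits, misses };
}
```

For example, with `max_tool_calls: 10` and `max_cost_usd: 0.05` configured, a run that made 12 tool calls at a cost of $0.01 passes one of two checks and scores 0.5.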

4. Implementation

Leverage existing core utilities:

  • TraceSummary interface (already computed)
  • explorationRatio() helper function
  • tokensPerTool() helper function
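
The exploration check could combine `target_exploration_ratio` and `exploration_tolerance` along these lines. The signature of core's `explorationRatio()` is not given in this issue, so the sketch uses a local stand-in, `explorationShare`, which takes the list of exploration tool names as an explicit parameter. Both that stand-in and `withinTolerance` are hypothetical and may differ from core's helpers.

```typescript
// Stand-in for illustration only, NOT core's explorationRatio():
// the share of all tool calls made to tools the caller labels as exploration.
function explorationShare(
  toolCallsByName: Record<string, number>,
  explorationTools: string[],
): number | undefined {
  const total = Object.keys(toolCallsByName).reduce(
    (sum, name) => sum + toolCallsByName[name],
    0,
  );
  if (total === 0) return undefined; // no tool calls, so the ratio is undefined
  const exploring = explorationTools.reduce(
    (sum, name) => sum + (toolCallsByName[name] ?? 0),
    0,
  );
  return exploring / total;
}

// Passes when the observed ratio is within +/- tolerance of the target.
// The 0.2 default matches the documented exploration_tolerance default.
function withinTolerance(ratio: number, target: number, tolerance = 0.2): boolean {
  return Math.abs(ratio - target) <= tolerance;
}
```

Returning `undefined` for a trace with no tool calls leaves it to the evaluator to decide whether that counts as a hit, a miss, or a skipped check.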

Migration Strategy

  • Additive change: existing evaluators are unchanged
  • The showcase efficiency-scorer.ts remains as an advanced example for custom formulas and heuristics
  • Document the built-in evaluator as the preferred approach for simple threshold checks

Relationship to Existing Features

  • Complements tool_trajectory (sequence checking) with metric checking
  • Uses same TraceSummary data already passed to evaluators
  • Simpler alternative to code_judge for threshold-based checks
