feat: add built-in execution_metrics evaluator

# Add Execution Metrics Evaluator

## Summary

Introduce a built-in `execution_metrics` evaluator type for threshold-based checks on execution metrics (token usage, cost, tool call count, duration).

## Motivation

Core already computes `TraceSummary` with execution metrics (`tokenUsage`, `costUsd`, `durationMs`, `toolCallsByName`). Users commonly want to check these against thresholds:

- "Fail if cost exceeds $0.05"
- "Warn if more than 10 tool calls"
- "Check token usage stays under budget"

Currently, users must write a `code_judge` script to perform even these simple threshold checks. That approach is:
1. **Boilerplate-heavy** for common use cases
2. **Error-prone**: each script must manually define interfaces that match core's JSON contract
3. **Language-dependent**: TypeScript scripts duplicate core interfaces

A built-in evaluator would cover 80% of use cases with simple configuration.

## Proposed Changes

### 1. New Evaluator Type: `execution_metrics`

```yaml
evaluators:
  - name: efficiency
    type: execution_metrics
    max_tool_calls: 10
    max_tokens: 5000
    max_cost_usd: 0.05
    max_duration_ms: 30000
    target_exploration_ratio: 0.6  # optional, like every other threshold
```

### 2. Configuration Options

| Option | Type | Description |
|--------|------|-------------|
| `max_tool_calls` | number | Maximum total tool calls allowed |
| `max_tokens` | number | Maximum total tokens (input + output) |
| `max_cost_usd` | number | Maximum cost in USD |
| `max_duration_ms` | number | Maximum execution duration in milliseconds |
| `target_exploration_ratio` | number | Target ratio of exploration tool calls to total tool calls (0-1) |
| `exploration_tolerance` | number | Allowed deviation from `target_exploration_ratio` (default: 0.2) |

All options are optional. Only specified thresholds are checked.
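As a rough illustration of "only specified thresholds are checked", the options could be modeled as a type with all-optional fields. This is a sketch, not the core implementation: the `ExecutionMetricsConfig` interface and the `activeChecks` helper are hypothetical names introduced here, and treating `exploration_tolerance` as a modifier rather than a check of its own is an assumption.

```typescript
// Hypothetical config shape mirroring the YAML options in the table above.
// Every field is optional; an omitted field means "do not check this metric".
interface ExecutionMetricsConfig {
  max_tool_calls?: number;
  max_tokens?: number;
  max_cost_usd?: number;
  max_duration_ms?: number;
  target_exploration_ratio?: number;
  exploration_tolerance?: number; // documented default: 0.2
}

// Keys that each produce one check. `exploration_tolerance` is assumed to
// only adjust the exploration-ratio check, so it is not listed here.
const CHECK_KEYS: (keyof ExecutionMetricsConfig)[] = [
  "max_tool_calls",
  "max_tokens",
  "max_cost_usd",
  "max_duration_ms",
  "target_exploration_ratio",
];

// Returns the names of the thresholds that were actually configured.
function activeChecks(config: ExecutionMetricsConfig): string[] {
  return CHECK_KEYS.filter((key) => config[key] !== undefined);
}
```

With this shape, a config that sets only `max_cost_usd` yields a single check, and an empty config yields none.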

### 3. Scoring

- Each threshold check contributes equally to the final score, so the final score is the mean of the per-check scores
- Each check scores 1.0 on pass and 0.0 on fail (or a proportional penalty for soft limits)
- `hits` lists passed checks, `misses` lists failed checks
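The scoring rules above could be sketched roughly as follows for the four hard limits. Several things here are assumptions rather than part of the proposal: the `TraceSummaryLike` field shapes (only the field names `tokenUsage`, `costUsd`, `durationMs`, and `toolCallsByName` come from core), the function name `evaluateExecutionMetrics`, counting an unavailable metric as a miss, and returning 1.0 when no thresholds are configured. Soft limits and proportional penalties are left out.

```typescript
// Assumed shape; only the field names are taken from core's TraceSummary.
interface TraceSummaryLike {
  tokenUsage?: { input: number; output: number };
  costUsd?: number;
  durationMs?: number;
  toolCallsByName: Record<string, number>;
}

// Hypothetical subset of the evaluator config (hard limits only).
interface HardLimits {
  max_tool_calls?: number;
  max_tokens?: number;
  max_cost_usd?: number;
  max_duration_ms?: number;
}

interface MetricsResult {
  score: number;
  hits: string[];
  misses: string[];
}

function evaluateExecutionMetrics(config: HardLimits, summary: TraceSummaryLike): MetricsResult {
  const hits: string[] = [];
  const misses: string[] = [];

  // Runs one check; unspecified thresholds are skipped entirely.
  const check = (label: string, actual: number | undefined, max: number | undefined): void => {
    if (max === undefined) return;
    if (actual === undefined) {
      misses.push(`${label}: metric unavailable`); // assumption: missing data fails the check
    } else if (actual <= max) {
      hits.push(`${label}: ${actual} <= ${max}`);
    } else {
      misses.push(`${label}: ${actual} > ${max}`);
    }
  };

  const calls = summary.toolCallsByName;
  const totalToolCalls = Object.keys(calls).reduce((sum, name) => sum + calls[name], 0);
  const totalTokens = summary.tokenUsage
    ? summary.tokenUsage.input + summary.tokenUsage.output
    : undefined;

  check("max_tool_calls", totalToolCalls, config.max_tool_calls);
  check("max_tokens", totalTokens, config.max_tokens);
  check("max_cost_usd", summary.costUsd, config.max_cost_usd);
  check("max_duration_ms", summary.durationMs, config.max_duration_ms);

  // Equal weighting: the score is the fraction of configured checks that passed.
  const total = hits.length + misses.length;
  const score = total === 0 ? 1 : hits.length / total; // assumption: no checks means a full score
  return { score, hits, misses };
}
```

For example, a run that stays within the tool call, token, and duration limits but exceeds `max_cost_usd` would pass three of four checks and score 0.75.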

### 4. Implementation

Leverage existing core utilities:
- `TraceSummary` interface (already computed)
- `explorationRatio()` helper function
- `tokensPerTool()` helper function
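The exploration-ratio check could look roughly like the sketch below. It does not call core's `explorationRatio()` helper, whose signature is not shown in this issue; `explorationRatioSketch` is a local stand-in, and both the caller-supplied list of exploration tool names and the reading of `exploration_tolerance` as an absolute deviation are assumptions.

```typescript
// Local stand-in for core's explorationRatio() helper (real signature not shown here).
// Returns the share of tool calls made with "exploration" tools, or undefined
// when there were no tool calls at all.
function explorationRatioSketch(
  toolCallsByName: Record<string, number>,
  explorationTools: string[], // assumption: the caller decides which tools count as exploration
): number | undefined {
  let total = 0;
  let exploration = 0;
  for (const name of Object.keys(toolCallsByName)) {
    total += toolCallsByName[name];
    if (explorationTools.indexOf(name) !== -1) {
      exploration += toolCallsByName[name];
    }
  }
  return total === 0 ? undefined : exploration / total;
}

// Passes when the observed ratio is within `tolerance` of the target.
// The default of 0.2 matches the documented `exploration_tolerance` default.
function withinTolerance(ratio: number, target: number, tolerance: number = 0.2): boolean {
  return Math.abs(ratio - target) <= tolerance;
}
```

A trace with six exploration calls out of ten has a ratio of 0.6, which meets a target of 0.6 exactly; a ratio of 0.25 against the same target falls outside the default tolerance.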

## Migration Strategy

- Additive change: existing evaluators are unchanged
- The showcase `efficiency-scorer.ts` remains as an advanced example for custom formulas/heuristics
- Document the built-in evaluator as the preferred approach for simple threshold checks

## Relationship to Existing Features

- Complements `tool_trajectory` (sequence checking) with metric checking
- Uses same `TraceSummary` data already passed to evaluators
- Simpler alternative to `code_judge` for threshold-based checks

feat: add built-in execution_metrics evaluator #103

Add Execution Metrics Evaluator

Summary

Motivation

Proposed Changes

1. New Evaluator Type: `execution_metrics`

2. Configuration Options

3. Scoring

4. Implementation

Migration Strategy

Relationship to Existing Features

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Option	Type	Description
`max_tool_calls`	number	Maximum total tool calls allowed
`max_tokens`	number	Maximum total tokens (input + output)
`max_cost_usd`	number	Maximum cost in USD
`max_duration_ms`	number	Maximum execution duration
`target_exploration_ratio`	number	Target ratio of exploration tools (0-1)
`exploration_tolerance`	number	Tolerance for exploration ratio (default: 0.2)

feat: add built-in execution_metrics evaluator #103

Description

Add Execution Metrics Evaluator

Summary

Motivation

Proposed Changes

1. New Evaluator Type: execution_metrics

2. Configuration Options

3. Scoring

4. Implementation

Migration Strategy

Relationship to Existing Features

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

1. New Evaluator Type: `execution_metrics`