Commit b99ebc4

Integrate ae-agent as long-running agent for ArtEvalBench
1 parent 0995b00 commit b99ebc4

11 files changed: 1277 additions, 0 deletions

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# Run ArtEval Benchmark with AE Agent

This directory contains `arteval_tasks.jsonl` and other benchmark task definitions. To run the benchmark with **ae_agent**, start from the **benchmark root** (`benchmarks/arteval_bench/`).

## Run from benchmark root

```bash
cd benchmarks/arteval_bench

# Use ae_agent with data/benchmark/arteval_tasks.jsonl as input
python src/main.py \
  -i ./data/benchmark/arteval_tasks.jsonl \
  -a ae_agent \
  -m claude-sonnet-4-5-20250929 \
  -o ./outputs/ae_agent_$(date +%Y-%m-%d_%H-%M-%S)
```

Or, if `run.sh` supports passing an agent argument:

```bash
cd benchmarks/arteval_bench
./run.sh claude-sonnet-4-5-20250929 ae_agent
```

## Environment

- Set `ANTHROPIC_API_KEY` or `ANTHROPIC_FOUNDRY_API_KEY`.
- Optional: `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY=1`.
- The ae_agent implementation lives under `src/agents/ae_agent/`, synced with the standalone ae-agent repo (runner, install, utils, interactive_runner).
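
The key requirement above can be checked before starting a long run. The following is a minimal sketch, not part of the benchmark: the helper name `check_credentials` is hypothetical, and it only encodes the rule that at least one of the two API key variables must be set.

```python
import os
from typing import Mapping, Optional


def check_credentials(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the name of the first API key variable that is set and non-empty.

    Hypothetical preflight helper; raises RuntimeError if neither key is present.
    """
    env = os.environ if env is None else env
    for name in ("ANTHROPIC_API_KEY", "ANTHROPIC_FOUNDRY_API_KEY"):
        if env.get(name):
            return name
    raise RuntimeError(
        "Set ANTHROPIC_API_KEY or ANTHROPIC_FOUNDRY_API_KEY before running."
    )
```

Calling `check_credentials()` with no argument inspects the current process environment; passing a dict makes the check easy to exercise without touching real credentials.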

## Task format

Each line of `arteval_tasks.jsonl` is one JSON object, including at least:

- `artifact_id`, `artifact_dir`, `artifact_readme`, `artifact_url`
- `evaluator`: evaluation command (e.g. `cd /repo && python3 _agent_eval/main.py`)
- `docker_env`: Docker image
- `run_on_host`: when `true`, run on the host instead of Docker
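
To make the format concrete, here is a small sketch that parses JSONL text and reports which of the listed fields a task lacks. The function names `parse_tasks` and `missing_fields` are hypothetical, and the example record uses made-up placeholder values rather than a real task from `arteval_tasks.jsonl`.

```python
import json
from typing import Any, Dict, List

# Field names listed in the task format section above.
LISTED_FIELDS = (
    "artifact_id", "artifact_dir", "artifact_readme", "artifact_url",
    "evaluator", "docker_env", "run_on_host",
)


def parse_tasks(jsonl_text: str) -> List[Dict[str, Any]]:
    """Parse JSONL text: one JSON object per non-empty line."""
    tasks = []
    for line in jsonl_text.splitlines():
        if line.strip():
            tasks.append(json.loads(line))
    return tasks


def missing_fields(task: Dict[str, Any]) -> List[str]:
    """Return the listed field names that are absent from a task."""
    return [name for name in LISTED_FIELDS if name not in task]


# Made-up example record; the values are placeholders, not a real task.
example = json.dumps({
    "artifact_id": "example",
    "artifact_dir": "example_dir",
    "artifact_readme": "README.md",
    "artifact_url": "https://example.com/artifact",
    "evaluator": "cd /repo && python3 _agent_eval/main.py",
    "docker_env": "example-image:latest",
    "run_on_host": False,
})
tasks = parse_tasks(example + "\n")
print(missing_fields(tasks[0]))
```

Running a check like this over the whole file before a benchmark run surfaces malformed lines early, since `json.loads` raises on the first line that is not valid JSON.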
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
#!/bin/bash
# Run ArtEval benchmark with ae_agent. The script changes to the benchmark root itself, so it can be invoked from any directory.
# Usage: ./run_ae_agent.sh [optional: model name, default claude-sonnet-4-5-20250929]

set -e
BENCH_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
MODEL_NAME="${1:-claude-sonnet-4-5-20250929}"
cd "$BENCH_ROOT"
echo "==> ArtEval benchmark root: $BENCH_ROOT"
echo "==> Model: $MODEL_NAME"
echo "==> Agent: ae_agent"
python src/main.py \
  -i ./data/benchmark/arteval_tasks.jsonl \
  -a ae_agent \
  -m "$MODEL_NAME" \
  -o "./outputs/ae_agent_${MODEL_NAME//\//_}_$(date +%Y-%m-%d_%H-%M-%S)"
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# AE Agent (ArtEval sub-agent)

This agent is the **ae-agent** logic integrated as a sub-agent of the system-intelligence-benchmark ArtEval benchmark. It uses the Claude Agent SDK to run artifact evaluation tasks inside the benchmark container. Code is synced from the standalone [ae-agent](https://github.com/Couen/ae-agent) repo.

## Files (synced from ae-agent)

- **install.sh**: Installs `claude-agent-sdk==0.1.24` and writes `~/.claude/settings.json` with a 48h Bash timeout.
- **runner.sh**: Entry point, invoked as `runner.sh <model> <task_or_path>`. Forwards to `runner.py`. Uses `/agent/current_task.txt` when the benchmark passes the task via a file.
- **runner.py**: Runs the task with the Claude Agent SDK; retries on rate limits (HTTP 429) and uses `message_formatter` when available. The second argument can be either the task text or a path to a file containing it.
- **utils.py**: Defines `DEFAULT_TIMEOUT_MS` for the runner.
- **interactive_runner.py**: Interactive multi-turn session inside the container (e.g. `docker exec -it <cid> python3 /agent/interactive_runner.py <model>`).
- **__init__.py**: Package marker.
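
Because `runner.py` accepts either task text or a path as its second argument, some resolution step is needed. The sketch below shows one way this could work; it is not the code in `runner.py`, the name `resolve_task` is hypothetical, and treating any argument that names an existing file as a path is an assumption.

```python
import os
import tempfile


def resolve_task(task_or_path: str) -> str:
    """Return the task text, reading it from disk when the argument is a file path."""
    if os.path.isfile(task_or_path):
        with open(task_or_path, encoding="utf-8") as f:
            return f.read()
    return task_or_path


# Usage: write a task to a temporary file and resolve it both ways.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Evaluate the artifact at /repo.")
task_file = tmp.name

from_file = resolve_task(task_file)
from_text = resolve_task("Evaluate the artifact at /repo.")
os.remove(task_file)
```

Both calls yield the same task text, so the rest of a runner written this way would not need to know how the task arrived.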

## Usage from the benchmark

From the benchmark root (`benchmarks/arteval_bench/`):

```bash
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_run
```

Or use the helper script from `data/benchmark/`:

```bash
./data/benchmark/run_ae_agent.sh [model_name]
```

The benchmark will:

1. Upload the agent to `/agent` in the container.
2. For ae_agent: upload the task to `/agent/current_task.txt`, then run `runner.sh "$model" /agent/current_task.txt` (this avoids shell-quoting problems with large tasks).
3. Apply long-running, live-log behavior: a 48h timeout, live log streaming, removal of `_agent_eval` before the run and re-upload before evaluation, and keeping the container for debugging.
4. Pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY` when set.
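
The reason step 2 hands the task over as a file can be seen in a short sketch that builds both command lines. The helper names are hypothetical and the benchmark's actual upload mechanism is not shown; only the `runner.sh <model> <path>` shape and the `/agent/current_task.txt` path come from the steps above.

```python
import shlex

TASK_PATH = "/agent/current_task.txt"  # path named in step 2


def runner_command_inline(model: str, task: str) -> str:
    """Command line with the whole task embedded; every special character must be quoted."""
    return f"runner.sh {shlex.quote(model)} {shlex.quote(task)}"


def runner_command_via_file(model: str, task_path: str = TASK_PATH) -> str:
    """Command line that only references the task file, so its length stays constant."""
    return f"runner.sh {shlex.quote(model)} {shlex.quote(task_path)}"


# A made-up task containing backticks, quotes, and newlines, repeated to make it large.
task = "Run `make test`; check that \"results.csv\" isn't empty.\n" * 100
inline = runner_command_inline("claude-sonnet-4-5-20250929", task)
via_file = runner_command_via_file("claude-sonnet-4-5-20250929")
print(via_file)
print(len(inline) > len(via_file))
```

The inline form grows with the task and depends on correct quoting at every shell layer it passes through, while the file form stays a short, fixed command regardless of task content.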

## Dependencies

- Python 3 with `claude-agent-sdk` (installed by `install.sh`).
- Optional: `message_formatter` for prettier output (if present in the environment).

## Relation to standalone ae-agent repo

The standalone ae-agent repo provides a full CLI (`main.py`, `run_eval.py`, `utils.py`) and host/Docker orchestration. This sub-agent is the in-container runner only; the benchmark’s `run_eval_in_env.py` handles orchestration, task file upload, and Foundry env vars.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
"""AE Agent for ArtEvalBench - Claude Agent SDK runner for artifact evaluation tasks.

Contract: artifact at /repo, this agent at /agent; task passed as CLI arg or path to file (/agent/current_task.txt).
"""
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
#!/bin/bash
# Setup AE Agent environment inside benchmark container.
# Ensures claude-agent-sdk is available so runner.py can run.
set -e
if ! python3 -c "import claude_agent_sdk" 2>/dev/null; then
  echo "Installing claude-agent-sdk..."
  pip3 install claude-agent-sdk==0.1.24 || pip3 install --break-system-packages claude-agent-sdk==0.1.24 || true
  if ! python3 -c "import claude_agent_sdk"; then
    echo "WARNING: claude_agent_sdk still not importable; runner may fail."
  fi
fi
# 48h Bash timeout for long-running artifact tasks
mkdir -p ~/.claude
cat > ~/.claude/settings.json << 'EOF'
{
  "env": {
    "BASH_MAX_TIMEOUT_MS": "172800000",
    "BASH_DEFAULT_TIMEOUT_MS": "172800000"
  }
}
EOF
echo "AE Agent environment ready (~/.claude/settings.json configured)."
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
"""Interactive runner for AE Agent - runs inside container after main task.

Used when interactive=True: docker exec -it <container_id> python3 /agent/interactive_runner.py <model>
Artifact at /repo; API keys from container env.
"""

import asyncio
import os
import sys

sys.path.insert(0, '/agent')

try:
    from utils import DEFAULT_TIMEOUT_MS
except ImportError:
    DEFAULT_TIMEOUT_MS = 172_800_000

try:
    from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient
except ImportError as e:
    print(f"ERROR: claude_agent_sdk not available: {e}", file=sys.stderr)
    sys.exit(1)


def _build_system_prompt() -> str:
    # Resolve the Bash timeout from the environment, falling back to the default.
    try:
        timeout_ms_env = os.environ.get("BASH_MAX_TIMEOUT_MS")
        timeout_ms = int(timeout_ms_env) if timeout_ms_env else DEFAULT_TIMEOUT_MS
    except ValueError:
        timeout_ms = DEFAULT_TIMEOUT_MS

    return f"""You are an experienced software engineer in an interactive session.

ENVIRONMENT:
- You are inside a Docker container with root permissions.
- The artifact repository is at /repo. Change to it: cd /repo
- You have access to Read, Write, and Bash tools.

TIMEOUT: Long-running commands can take hours (Bash timeout limit: {timeout_ms} ms); do not set short timeouts.

You will receive follow-up instructions from the user. Complete each one and respond.
If the user asks to stop or says 'quit'/'exit', acknowledge and they will end the session."""


def _display_message(msg) -> None:
    # Print only the text blocks of a message; other block types are skipped.
    if hasattr(msg, 'content'):
        for block in msg.content:
            if hasattr(block, 'text'):
                print(block.text, end='', flush=True)
        print(flush=True)


async def _interactive_loop(model_name: str) -> int:
    options = ClaudeAgentOptions(
        model=model_name,  # model chosen via CLI argument or AE_AGENT_MODEL
        system_prompt=_build_system_prompt(),
        allowed_tools=["Read", "Write", "Bash"],
        setting_sources=["user"],
    )

    print("\n" + "=" * 60, flush=True)
    print("Interactive mode - Agent ready. Type your instructions (or 'quit'/'exit' to end).", flush=True)
    print("=" * 60 + "\n", flush=True)

    async with ClaudeSDKClient(options=options) as client:
        await client.query(
            "Please confirm you are in /repo and ready for the user's follow-up instructions. Reply briefly that you are ready."
        )
        async for msg in client.receive_response():
            _display_message(msg)

        while True:
            try:
                user_input = input("\n>>> ").strip()
            except (EOFError, KeyboardInterrupt):
                print("\nExiting interactive mode.", flush=True)
                return 0

            if not user_input:
                continue
            if user_input.lower() in ('quit', 'exit', 'q'):
                print("Exiting interactive mode.", flush=True)
                return 0

            await client.query(user_input)
            async for msg in client.receive_response():
                _display_message(msg)

    return 0


def main() -> int:
    # A CLI argument overrides AE_AGENT_MODEL, which overrides the built-in default.
    model_name = os.environ.get("AE_AGENT_MODEL", "claude-sonnet-4-5-20250929")
    if len(sys.argv) >= 2:
        model_name = sys.argv[1]

    if not os.environ.get('ANTHROPIC_API_KEY') and not os.environ.get('ANTHROPIC_FOUNDRY_API_KEY'):
        print("ERROR: ANTHROPIC_API_KEY or ANTHROPIC_FOUNDRY_API_KEY must be set.", file=sys.stderr)
        return 1

    return asyncio.run(_interactive_loop(model_name))


if __name__ == "__main__":
    sys.exit(main())
