New judging system (v2) #50

Open
rank-and-file wants to merge 64 commits into main from new_judge_v2

Conversation

@rank-and-file
Collaborator

rank-and-file commented Jun 2, 2026

Overview

This branch ("new_judge_v2") reworks PostTrainBench's contamination/safety judging, overhauls the result-aggregation layer, consolidates trace parsing, adds an evaluation task, and moves configuration to a .env-based flow.

Judging system (v2)

  • Judging is now driven by a dedicated src/disallowed_usage_judge/run_judge.sh (new in this PR; previously the judge was invoked inline). It runs a two-judge pipeline after the agent finishes:
    1. GPT-5.4 contamination judge — test-data usage, eval tampering, model substitution, forbidden fine-tuning practices.
    2. GPT-5.4 third-party API-usage judge — separate disallowed_api_usage schema; writes its own judgement_api.json (archival, not consumed by scoring).
  • Each judge writes a per-judge JSON; there is no aggregation step — the canonical contamination verdict is judgement_gpt5_4.json directly. The old scripts/migrate_judgement_files.py is removed.
  • The API judge ignores the agent's own harness identity, and the LLM-as-judge exception is gated on benchmark_id.
  • New judge tooling copied into the sandbox: judge_tools/ (contamination_check.py, model_identity_check.py, reference_configs/), plus contamination_check_tool/download_test_data.{py,sh}. The judge runs in the gpt_5_5 container with apptainer --containall.
  • Rerun pipeline under src/disallowed_usage_judge/rerun_judge/ — batch resubmission for GPT-only / API-only / contamination-only reruns; outputs always carry a _rerun suffix so originals are preserved. Includes find_disallowed_api_usage.py (replaces the removed dev_utils/find_api_illegal.py).

Aggregation / scripts

  • scripts/collect.py now skips error runs, reads config from .env, and drops the API-usage judge from the collect/aggregate flow.
  • scripts/utils.py, scripts/aggregate.py, and scripts/parse_all_to_human_readable.sh updated accordingly.
  • Removed scripts/migrate_judgement_files.py.

Trace parsing

  • Consolidated the per-agent parsers (previously agents/*/human_readable_trace.py, now removed) into src/trace_parsing/ (claude, codex, gemini, opencode) with a shared _common.py and a parse_trace.py dispatcher.
  • Added sanitize_trace.py, run after every parse.

Agents & tasks

  • codexhigh/codexlow now emit JSON traces; minor solve.sh tweaks to claude and claude_non_api.
  • New evaluation task: aime2026 (evaluate.py, benchmark.txt, info.json).
  • Added info.json metadata to existing tasks (aime2025, arenahardwriting, bfcl, gpqamain, gsm8k, healthbench, humaneval) for judge/prompt generation.

Config & infra

  • .env is now the canonical source for POST_TRAIN_BENCH_* vars; added an example.env template and a README "Environment Setup" section. set_env_vars.sh updated to source .env.
  • Added AGENTS.md / CLAUDE.md agent guidelines (and un-ignored them in .gitignore).
  • HTCondor scheduler .err/.out/.log files now route to logs/ (gitignored) instead of the repo root; .gitignore also ignores NFS silly-rename leftovers, rerun-pipeline run logs, and downloaded test_data.json.
  • New container def latest-2026-03-27.def.

Notes

  • The flagged-API-usage judge is known to be non-deterministic (rerunning previously-flagged dirs flipped true→false); its output is archival and not consumed by scoring.

🤖 Generated with Claude Code

rank-and-file and others added 30 commits April 19, 2026 13:10
Deterministic check that compares final_model/config.json against bundled
reference configs on fine-tuning-invariant architecture fields. Judges
previously had to infer model identity from logs; a tokenizer warning
referencing Mistral was enough to trigger a false "disallowed use detected."

- judge_tools/model_identity_check.py + reference_configs for the four
  allowed base models (Qwen3-1.7B-Base, Qwen3-4B-Base, SmolLM3-3B-Base,
  gemma-3-4b-pt), resolved via AutoConfig so nested text_config defaults
  are filled in.
- prompt.txt tells the judge to run the oracle in hard cases and not to
  treat generic vendor-mentioning warnings as evidence.
- run_task.sh and run_judge.sh both expose the oracle and the candidate
  config as ../final_model_config.json inside the judge sandbox.

Verified on 92 agent-produced runs: all MATCH against the true base; 36
cross-model swaps (negative control) all MISMATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
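
For reviewers, the deterministic check described in this commit can be pictured roughly as below. This is a minimal sketch, not the contents of judge_tools/model_identity_check.py: the field list `INVARIANT_FIELDS`, the function name `check_identity`, and the example config values are all assumptions made for illustration.

```python
# Hypothetical sketch of a config-based identity check. INVARIANT_FIELDS is
# an assumed list of architecture fields that fine-tuning normally leaves
# untouched; the real tool may compare a different set.
INVARIANT_FIELDS = (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
)


def check_identity(candidate: dict, reference: dict) -> tuple[str, list[str]]:
    """Return ("MATCH", []) or ("MISMATCH", [names of differing fields])."""
    differing = [
        field
        for field in INVARIANT_FIELDS
        if candidate.get(field) != reference.get(field)
    ]
    return ("MISMATCH" if differing else "MATCH", differing)


# Made-up values for illustration, not copied from any real model config.
reference = {
    "hidden_size": 2048,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}
fine_tuned = dict(reference, torch_dtype="bfloat16")  # same architecture
swapped = dict(reference, hidden_size=4096)           # different architecture

print(check_identity(fine_tuned, reference))
print(check_identity(swapped, reference))
```

Comparing only architecture fields is what makes the result independent of log noise such as tokenizer warnings.
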
Rerun judge in --rerun mode for the latest run per (benchmark, model)
across claude_claude-opus-4-5_10h_{final_v3,v5,v6_seed1}. Pulls
POST_TRAIN_BENCH_RESULTS_DIR from .env directly to avoid sourcing
set_env_vars.sh, which fails on nodes without tclsh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Judge in two phases: work backwards from the last agent messages in
`../solve_parsed.txt` to pin down the base model, datasets, and training
scripts that actually produced `final_model`, then inspect only those
artifacts for contamination. The trace is authoritative — files left in
the task directory may have been edited or never executed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bump codex judge from GPT-5.2 (high) to GPT-5.4 (xhigh) across run_task.sh
and run_judge.sh, and rename per-judge output files accordingly. Replace
--rerun with --gpt-only/--sonnet-only so run_judge.sh always writes _rerun
outputs without touching the original judgements, and partial reruns can
skip one model while still aggregating with the other's existing _rerun
result. Add rerun_single_gpt_only.sh + rerun_judge_gpt_only.sub plus
recompute_selected_methods_gpt_only.sh for the three
claude-opus-4-5_10h_{final_v3,v5,v6_seed1} method dirs, and switch
rerun_judge.sub to the user.judge:3333 concurrency limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the two flat text files (contamination_judgement.txt,
disallowed_model_judgement.txt) with one structured judgement.json per
judge containing contamination/disallowed_model booleans plus
justification strings. A new aggregate_judgement.py merges per-judge
files into judge_result.json (logical OR for booleans, prefixed
concatenation for justifications), and run_judge.sh / run_task.sh now
call it instead of grep-based bash aggregation. Downstream readers
(contamination_list.py, aggregate_contamination.py, extract_traces.py,
the rerun scripts and utils) are updated to consume the new schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
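
The merge rule stated in this commit (logical OR for booleans, prefixed concatenation for justifications) might look like the sketch below. Note that later commits in this PR remove the aggregation step entirely, so this only illustrates the intermediate design. The `*_justification` field names, the `[judge] ` prefix format, and the judge labels are assumptions, not the actual judgement.json schema.

```python
# Sketch of the merge rule: OR the flags, concatenate justifications.
# Field names ending in "_justification" and the "[name] " prefix are assumed.
BOOL_KEYS = ("contamination", "disallowed_model")


def merge_judgements(per_judge: dict[str, dict]) -> dict:
    """OR the flags and concatenate justifications, prefixed by judge name."""
    merged: dict = {}
    for key in BOOL_KEYS:
        just_key = f"{key}_justification"  # assumed field name
        merged[key] = any(j[key] for j in per_judge.values())
        merged[just_key] = "\n".join(
            f"[{name}] {j[just_key]}" for name, j in per_judge.items()
        )
    return merged


# Hypothetical per-judge results.
per_judge = {
    "judge_a": {
        "contamination": False,
        "contamination_justification": "no test data found",
        "disallowed_model": True,
        "disallowed_model_justification": "config differs from reference",
    },
    "judge_b": {
        "contamination": False,
        "contamination_justification": "clean",
        "disallowed_model": False,
        "disallowed_model_justification": "no evidence",
    },
}
result = merge_judgements(per_judge)
print(result["contamination"], result["disallowed_model"])
```

With OR semantics a single judge flagging a run is enough to flag the merged result.
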
Replace opus_4_6_codex_5_3.sif with the new gpt_5_5.sif (definition
file added under containers/) in both run_task.sh and run_judge.sh.
Tighten failure handling: run_judge.sh now exits non-zero if a judge
doesn't produce judgement.json, and aggregate_judgement.py raises
SystemExit on any missing / malformed per-judge file instead of writing
a False/False default that would mask a crashed judge. Also polish the
prompt wording, drop rerun_judge.sub concurrency from 3333 to 1250,
and broaden the oauth_token gitignore to oauth_token*.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
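
The stricter failure handling could be sketched as follows: a missing or malformed per-judge file aborts instead of silently becoming False/False. This is an illustration only; the function name `load_strict` and the exact validation rules are assumptions, and aggregate_judgement.py may check more than this.

```python
import json
from pathlib import Path

REQUIRED_BOOLS = ("contamination", "disallowed_model")


def load_strict(path: Path) -> dict:
    """Load one per-judge file, aborting rather than defaulting to False."""
    if not path.is_file():
        raise SystemExit(f"missing judgement file: {path}")
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        raise SystemExit(f"malformed judgement file {path}: {exc}")
    if not isinstance(data, dict):
        raise SystemExit(f"{path}: expected a JSON object")
    for key in REQUIRED_BOOLS:
        # A missing key or a non-boolean value both count as malformed.
        if not isinstance(data.get(key), bool):
            raise SystemExit(f"{path}: field {key!r} must be a boolean")
    return data
```

Raising SystemExit gives the calling shell script a non-zero exit status, so a crashed judge cannot pass as a clean verdict.
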
Adds a separate GPT-5.4 judge that inspects the agent trace for disallowed
hosted-LLM API usage (training-data generation, distillation, etc.) and
folds its verdict into the aggregated judge_result.json alongside the
existing contamination/base-model judges. Also adds an --api-only rerun
path and matching condor submit wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Swap the second contamination judge from Claude Sonnet 4.6 (claude CLI +
subscription OAuth) to DeepSeek V4 Flash Free (opencode CLI + OPENCODE_API_KEY).
Renames judgement_sonnet4_6.json -> judgement_deepseek.json across run_judge.sh,
run_task.sh, aggregate_judgement.py, extract_traces.py, and the rerun tooling,
and replaces the --sonnet-only flag/scripts with --deepseek-only equivalents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the four duplicated agents/*/human_readable_trace.py parsers and
their ten symlinks with a single dispatcher in src/trace_parsing/. The
dispatcher substring-matches the agent name against {claude, codex,
gemini, opencode} and falls back to a verbatim copy when no key matches,
preserving the previous behavior for glm5 and qwen3max.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
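
The dispatch rule described here (substring match on the agent name, verbatim fallback) can be sketched as below. The parser bodies are placeholders and the names `PARSERS` and `pick_parser` are hypothetical; only the four keys and the fallback behavior come from the commit message.

```python
from typing import Callable, Dict


def _passthrough(raw: str) -> str:
    """Fallback: return the trace verbatim (e.g. for glm5, qwen3max)."""
    return raw


# Placeholder parsers standing in for the real per-agent modules.
PARSERS: Dict[str, Callable[[str], str]] = {
    "claude": lambda raw: "[claude] " + raw,
    "codex": lambda raw: "[codex] " + raw,
    "gemini": lambda raw: "[gemini] " + raw,
    "opencode": lambda raw: "[opencode] " + raw,
}


def pick_parser(agent_name: str) -> Callable[[str], str]:
    """Return the first parser whose key is a substring of agent_name."""
    for key, parser in PARSERS.items():
        if key in agent_name:
            return parser
    return _passthrough


print(pick_parser("codexhigh")("trace"))
print(pick_parser("glm5")("trace"))
```

Because matching is by substring, iteration order decides the winner if an agent name ever contains two keys.
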
Add src/trace_parsing/sanitize_trace.py, which redacts every *_API_KEY
value sourced from the repo's .env file (live-environment values take
precedence; `your-*` placeholders and short strings are skipped) and
labels matches as `[REDACTED:<NAME>]`. Hook it into parse_trace.py so
that every dispatcher run automatically drops `<stem>_sanitized<ext>`
companions next to both the raw input and the parsed output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
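
A rough sketch of the redaction rule follows, assuming the .env contents are already available as a dict. The length threshold `MIN_SECRET_LEN` and the helper names are assumptions; the `[REDACTED:<NAME>]` label, the `your-*` placeholder skip, and the `<stem>_sanitized<ext>` naming come from the commit message.

```python
from pathlib import Path

MIN_SECRET_LEN = 8  # assumed cutoff for "short strings"


def collect_secrets(env: dict[str, str]) -> dict[str, str]:
    """Keep *_API_KEY values that are neither placeholders nor too short."""
    return {
        name: value
        for name, value in env.items()
        if name.endswith("_API_KEY")
        and len(value) >= MIN_SECRET_LEN
        and not value.startswith("your-")
    }


def sanitize(text: str, secrets: dict[str, str]) -> str:
    """Replace each secret value with a [REDACTED:<NAME>] label."""
    # Longest values first, so a value that is a prefix of another
    # does not leave part of the longer secret behind.
    for name, value in sorted(secrets.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(value, f"[REDACTED:{name}]")
    return text


def sanitized_path(path: Path) -> Path:
    """Build the <stem>_sanitized<ext> companion path."""
    return path.with_name(f"{path.stem}_sanitized{path.suffix}")


# Hypothetical environment; the key value is made up.
env = {
    "OPENAI_API_KEY": "sk-abcdef123456",
    "OPENCODE_API_KEY": "your-key-here",  # placeholder, skipped
    "TINY_API_KEY": "abc",                # too short, skipped
}
secrets = collect_secrets(env)
print(sanitize("Authorization: Bearer sk-abcdef123456", secrets))
print(sanitized_path(Path("solve_parsed.txt")))
```
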
- Clarify in the judge prompt that submitting the unchanged base
  `{model}` as `final_model` is allowed (fine-tuning is not required).
- Raise kimi rerun concurrency limits to 1000 and switch the
  contamination-only sub to the dedicated `user.kimi` slot.
- Gitignore the new `commit_kimi_only_runs/` cluster-ID log dir.

OPENCODE_API_KEY remains the preferred credential; when it is unset
the judge now uses OPENROUTER_API_KEY and pins the OpenRouter request
to the Cloudflare upstream so the model selection stays consistent.
The API key is baked into opencode.json directly instead of going
through an --env passthrough.

.nfs* files appear when an open file is deleted on an NFS share; they
disappear once the handle is closed but clutter `git status` in the
meantime.

# Conflicts:
#	containers/gpt_5_5.def
#	scripts/aggregate_contamination.py
#	src/commit_utils/commit.sh

Replace the legacy contamination_judgement.txt / disallowed_model_judgement.txt
readers with load_judge_result(), which prefers judge_result_rerun.json then
judge_result.json and validates the (contamination, disallowed_model,
disallowed_api_usage) bool schema written by aggregate_judgement.py. Old
runs that still have the contamination_detected/disallowed_model_detected
shape are rejected with a pointer to run_judge.sh rather than silently
treated as disallowed_api_usage=False.

The contamination cell now encodes all three flags as a fixed-order
M/C/A string (e.g. "MCA" when everything is flagged, "" when clean).

While here, drop the ERR / "not avl." / "not stored" sentinels from
load_metrics and load_time_taken: missing or malformed metrics.json or
time_taken.txt inside an existing run dir is now a hard error instead of
a quietly-baselined cell. The baseline fallback in collect.py still
applies, but only for the two legitimate cases — no run for the cell, or
judge flagged it.
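
The fixed-order M/C/A encoding could be sketched like this. The letter-to-flag mapping (M for disallowed_model, C for contamination, A for disallowed_api_usage) is inferred from the flag names and is an assumption, as is the function name `flags_to_cell`.

```python
# Assumed mapping of cell letters to judgement fields, in fixed M/C/A order.
FLAG_ORDER = (
    ("M", "disallowed_model"),
    ("C", "contamination"),
    ("A", "disallowed_api_usage"),
)


def flags_to_cell(judgement: dict) -> str:
    """Encode the three boolean flags as a fixed-order string such as "MCA"."""
    for _, key in FLAG_ORDER:
        # Reject old-schema or incomplete files instead of assuming False.
        if not isinstance(judgement.get(key), bool):
            raise ValueError(f"judgement is missing boolean field {key!r}")
    return "".join(letter for letter, key in FLAG_ORDER if judgement[key])


all_flagged = {
    "contamination": True,
    "disallowed_model": True,
    "disallowed_api_usage": True,
}
clean = {key: False for key in all_flagged}
print(repr(flags_to_cell(all_flagged)), repr(flags_to_cell(clean)))
```

Raising on the legacy `contamination_detected` shape mirrors the commit's choice to reject old runs rather than treat them as unflagged.
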
Only the GPT-5.4 contamination judge and the GPT-5.4 third-party API
usage judge now run, both in src/run_task.sh and in the standalone
src/disallowed_usage_judge/run_judge.sh. Aggregation no longer takes
a kimi entry, and all --kimi-only / --kimi-contamination-only orchestration
(scripts, .sub files, README/AGENTS references) is dropped.

… .env

Switch collect/aggregate to read POST_TRAIN_BENCH_RESULTS_DIR from the
project's .env (hard error when missing) and read the contamination
verdict directly from judgement_gpt5_4.json instead of the aggregated
judge_result.json. Adds copy_gpt5_4_to_judge_result.py for backfilling
existing run dirs, lowers the rerun-judge concurrency cap to 400, and
makes collect warn (not crash) on a missing metrics.json.
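
As an illustration of the "hard error when missing" behavior, here is a minimal hand-rolled .env reader. It is a stand-in, not the project's parser: the function names and the supported syntax (plain `KEY=VALUE` lines, optional `export ` prefix, optional quotes) are assumptions.

```python
from pathlib import Path


def read_dotenv(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        line = line.removeprefix("export ")
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def results_dir(env_path: Path) -> Path:
    """Return POST_TRAIN_BENCH_RESULTS_DIR, aborting if it is not configured."""
    if not env_path.is_file():
        raise SystemExit(f"{env_path} not found")
    value = read_dotenv(env_path).get("POST_TRAIN_BENCH_RESULTS_DIR")
    if not value:
        raise SystemExit("POST_TRAIN_BENCH_RESULTS_DIR is not set in .env")
    return Path(value)
```

Reading the file directly avoids depending on whatever happens to be exported in the calling shell.
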
Inject the evaluator exception into the API judge prompt only for the
LLM-as-judge benchmarks (arenahardwriting, healthbench) via a
{api_judge_exception} placeholder, instead of naming them in the prompt.
Also clarify that the agent may run evaluate.py itself and the resulting
third-party API calls are legal.
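
The placeholder gating might be implemented along these lines. The two benchmark names and the `{api_judge_exception}` placeholder come from the commit message; the exception wording, the template text, and the function name are invented for illustration.

```python
LLM_AS_JUDGE_BENCHMARKS = {"arenahardwriting", "healthbench"}

# Assumed wording; the real prompt text is not shown in this PR description.
EXCEPTION_TEXT = (
    "- Calls made by evaluate.py to its grading model are allowed, "
    "including when the agent runs evaluate.py itself."
)

# Hypothetical, heavily shortened prompt template.
PROMPT_TEMPLATE = (
    "Allowed:\n"
    "- Local inference with the base model.\n"
    "{api_judge_exception}\n"
    "Disallowed:\n"
    "- Hosted-LLM calls the agent itself writes or launches.\n"
)


def render_api_judge_prompt(template: str, benchmark_id: str) -> str:
    """Fill the placeholder only for LLM-as-judge benchmarks."""
    exception = EXCEPTION_TEXT if benchmark_id in LLM_AS_JUDGE_BENCHMARKS else ""
    # str.replace leaves any other braces in the prompt untouched.
    return template.replace("{api_judge_exception}", exception)


print(render_api_judge_prompt(PROMPT_TEMPLATE, "healthbench"))
```

Keeping benchmark names out of the prompt body means the judge for, say, gsm8k never sees an exception it could misapply.
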
The aggregated judge_result.json / judge_result_rerun.json files are no
longer produced or consumed. The canonical contamination verdict is now
judgement_gpt5_4.json, preferring judgement_gpt5_4_rerun.json when the
rerun pipeline has written one.

- run_task.sh / run_judge.sh: remove the aggregate_judgement.py step and
  the dead RUN_AGGREGATE flag. The API judge still runs and writes
  judgement_api{,_rerun}.json, now archival (consumed by nothing).
- Delete aggregate_judgement.py and the obsolete copy_gpt5_4_to_judge_result.py.
- scripts/utils.py: load_judge_result -> load_judgement, add judgement_path()
  centralizing the rerun-over-original preference; judge_result_to_cell ->
  judgement_to_cell. Update collect.py accordingly.
- contamination_list.py, extract_traces.py, rerun_judge/{utils,list_results,
  aggregate_rerun_results}.py: read the per-judge GPT-5.4 files (rerun-preferred)
  instead of judge_result*.json.
- Update AGENTS.md, READMEs, and rerun shell-script comments.
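
The rerun-over-original preference that judgement_path() centralizes could look like the sketch below. The file names are taken from this commit message; the signature and the use of pathlib are assumptions about scripts/utils.py.

```python
from pathlib import Path


def judgement_path(run_dir: Path) -> Path:
    """Prefer the rerun verdict when the rerun pipeline has written one."""
    rerun = run_dir / "judgement_gpt5_4_rerun.json"
    original = run_dir / "judgement_gpt5_4.json"
    return rerun if rerun.is_file() else original
```

Putting the preference in one helper keeps collect.py, contamination_list.py, and the rerun tooling from each re-implementing it differently.
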
Its judge mode (run_judge.sh with no flag) is identical to --gpt-only now
that the non-GPT (Kimi) judge is gone, making it a duplicate of
commit_all_gpt_only.sh. Point the README at the surviving batch submitters
(commit_all_gpt_only.sh / commit_gpt_contamination_only.sh) and document
list_results.py + rerun_judge.sub for ad-hoc targeting.

rerun_judge.sub and rerun_single.sh are kept: recompute_selected_methods.sh
still depends on them.

The third-party API usage judge could mistake the research agent's own
harness banner/process/usage metadata for a disallowed hosted-API call.
Add a per-run 'agent harness' Allowed bullet (build_agent_harness_clause)
that names the known agent + model and its tell-tale banner/process
strings, plumbed through --agent/--agent-config from run_task.sh and
run_judge.sh. Also tighten the Disallowed wording to 'the agent itself
writes or launches'.

All .sub submission templates now write their error/output/log files to
logs/<prefix>_$(Cluster).{err,out,log} (relative to the repo-root submit
directory) instead of cluttering the repo root. Add the logs/ convention to
.gitignore and document it in README.md and AGENTS.md.

rank-and-file changed the title from "New judging system (v2): two-judge pipeline, aggregation overhaul, trace-parsing consolidation" to "New judging system (v2)" on Jun 2, 2026
@hrdkbhatnagar
Collaborator

I believe the existing trace sanitization and extraction script would need to support the new judge's outputs?

Also, sanitize_trace.py seems to be outdated (see the most recent changes in #52, which was used for the HF upload of the traces).

@hrdkbhatnagar
Collaborator

What is the container definition latest-2026-03-27.def used for? The most recent one would be the opus 4.8 container, and vllm_debug.def for the judge (older judge).

# Conflicts:
#	dev_utils/extract_traces.py

Replace the os.environ-based API-key collection with a .env parser
(load_dotenv), matching the project convention that tooling reads
POST_TRAIN_BENCH_* and key secrets straight from .env rather than the
exported environment or set_env_vars.sh.

Add load_sanitization_secrets to redact extra keys listed in the file
named by POST_TRAIN_BENCH_SANITIZATION_SECRETS. Absent named vars are
reported instead of crashing; a configured-but-missing secrets file
still crashes. Duplicate file lines are dropped (order preserved) so the
reported file-secret count is honest; get_api_keys dedupes across all
sources. Document the new var in example.env.
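
One possible reading of the extra-secrets handling is sketched below: the secrets file lists variable names, duplicates are dropped with order preserved, and names absent from .env are reported rather than raising. That interpretation, the function name `resolve_named_secrets`, and the example variable names are assumptions.

```python
def resolve_named_secrets(
    names: list[str], env: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return (found name -> value, missing names), deduplicated in order."""
    # dict.fromkeys keeps the first occurrence and preserves insertion order.
    unique = list(dict.fromkeys(n.strip() for n in names if n.strip()))
    found = {name: env[name] for name in unique if env.get(name)}
    missing = [name for name in unique if not env.get(name)]
    return found, missing


# Hypothetical file lines and .env contents.
file_lines = ["HF_TOKEN", "WANDB_API_KEY", "HF_TOKEN", ""]
env = {"HF_TOKEN": "hf_example_value"}
found, missing = resolve_named_secrets(file_lines, env)
print(f"{len(found)} secret(s) loaded, not set: {missing}")
```

Deduplicating before counting is what keeps the reported number of file secrets accurate.
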
Replace the hardcoded, agent-agnostic --env key block in run_task.sh with a
declarative allowlist. Each agent declares its permitted provider keys in
agents/<agent>/api_keys.json; each benchmark declares the keys its grading
needs in src/eval/tasks/<task>/info.json (required_api_keys). The --containall
agent sandbox receives only the union of the two, so OPENAI_API_KEY reaches an
agent only for arenahardwriting/healthbench (their evaluate.py OpenAI judge),
and gemini/opencode no longer receive unrelated provider keys.

- Add api_keys.json for all 19 agents ([] for subscription-auth agents)
- Add required_api_keys=["OPENAI_API_KEY"] to arenahardwriting/healthbench info.json
- Build the --env list dynamically in run_task.sh; fail loudly if an agent has
  no api_keys.json
- Drop now-redundant unset/blank-key lines from solve.sh, keeping real logic
  (glm5/qwen3max key remap, codex_non_api forced_login_method, claude oauth)
- Remove codex/solve.sh debug echoes that would print the now-populated
  OPENAI_API_KEY/CODEX_API_KEY values into the saved trace
- Document the scheme in AGENTS.md
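
The allowlist logic can be summarized with the sketch below. The real implementation lives in run_task.sh; this Python version only illustrates the union rule and the fail-loudly behavior. Function names and the example agent/task layout are assumptions, while `api_keys.json`, `info.json`, and `required_api_keys` are the names used in this commit.

```python
import json
from pathlib import Path


def allowed_keys(agent_dir: Path, task_dir: Path) -> list[str]:
    """Union of the agent's declared keys and the task's grading keys."""
    keys_file = agent_dir / "api_keys.json"
    if not keys_file.is_file():
        # Fail loudly: an undeclared agent must not silently get all keys.
        raise SystemExit(f"{keys_file} is missing")
    agent_keys = json.loads(keys_file.read_text())
    info = json.loads((task_dir / "info.json").read_text())
    task_keys = info.get("required_api_keys", [])
    return sorted(set(agent_keys) | set(task_keys))


def env_args(keys: list[str], environ: dict[str, str]) -> list[str]:
    """Build --env KEY=VALUE arguments for keys that are actually set."""
    args: list[str] = []
    for key in keys:
        if key in environ:
            args += ["--env", f"{key}={environ[key]}"]
    return args
```

An agent with an empty `api_keys.json` then receives a provider key only when the benchmark's own grading requires it.
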
Each agents/*/solve.sh now upgrades its CLI (claude/codex/gemini/opencode)
to the latest npm release via src/utils/update_agent_cli.sh just before
launching, writing the resolved version to cli_version.txt. run_task.sh
copies the helper into the sandbox and surfaces cli_version.txt in the
result dir.

Since the harnesses are installed/updated per-run in solve.sh, the standard
container no longer pre-installs claude-code/gemini/opencode and keeps only
codex (for the judge); container deps are pinned via requirements-direct.txt
and inspect_evals is pinned to a commit.

Resolve conflicts between main's agent-termination fix (--pid --no-init) and
this branch's API-key allowlist + CLI auto-update:

- run_task.sh agent sandbox: use '-c --cleanenv --pid --no-init' instead of
  '--containall'. Adopts main's PID namespace / --no-init termination fix while
  keeping --cleanenv so the API_KEY_ENV_ARGS allowlist still isolates the host
  environment. Judge sandboxes keep --containall (main never added --pid there).
- agents/*/solve.sh: combine main's stdin-piped invocation and trace flags
  (--thinking-display summarized, --effort high, codexhigh/low flag changes)
  with this branch's update_agent_cli.sh auto-update and env exports.
- AGENTS.md / run_task.sh comments: describe the sandbox as '-c --cleanenv'.

@rank-and-file
Collaborator Author

> What's this container used for latest-2026-03-27.def

removed it

> sanitize_trace.py seems to be outdated

Should be fine now
