New judging system (v2) by rank-and-file · Pull Request #50 · aisa-group/PostTrainBench

rank-and-file · 2026-06-02T10:11:23Z

Overview

This branch ("new_judge_v2") reworks PostTrainBench's contamination/safety judging, overhauls the result-aggregation layer, consolidates trace parsing, adds an evaluation task, and moves configuration to a .env-based flow.

Judging system (v2)

Judging is now driven by a dedicated src/disallowed_usage_judge/run_judge.sh (new in this PR; previously the judge was invoked inline). It runs a two-judge pipeline after the agent finishes:
1. GPT-5.4 contamination judge — test-data usage, eval tampering, model substitution, forbidden fine-tuning practices.
2. GPT-5.4 third-party API-usage judge — separate disallowed_api_usage schema; writes its own judgement_api.json (archival, not consumed by scoring).
Each judge writes a per-judge JSON; there is no aggregation step — the canonical contamination verdict is judgement_gpt5_4.json directly. The old scripts/migrate_judgement_files.py is removed.
The API judge ignores the agent's own harness identity, and the LLM-as-judge exception is gated on benchmark_id.
New judge tooling copied into the sandbox: judge_tools/ (contamination_check.py, model_identity_check.py, reference_configs/), plus contamination_check_tool/download_test_data.{py,sh}. The judge runs in the gpt_5_5 container with apptainer --containall.
Rerun pipeline under src/disallowed_usage_judge/rerun_judge/ — batch resubmission for GPT-only / API-only / contamination-only reruns; outputs always carry a _rerun suffix so originals are preserved. Includes find_disallowed_api_usage.py (replaces the removed dev_utils/find_api_illegal.py).

Aggregation / scripts

scripts/collect.py now skips error runs, reads config from .env, and drops the API-usage judge from the collect/aggregate flow.
scripts/utils.py, scripts/aggregate.py, and scripts/parse_all_to_human_readable.sh updated accordingly.
Removed scripts/migrate_judgement_files.py.

Trace parsing

Consolidated the per-agent parsers (previously agents/*/human_readable_trace.py, now removed) into src/trace_parsing/ (claude, codex, gemini, opencode) with a shared _common.py and a parse_trace.py dispatcher.
Added sanitize_trace.py, run after every parse.

Agents & tasks

codexhigh/codexlow now emit JSON traces; minor solve.sh tweaks to claude and claude_non_api.
New evaluation task: aime2026 (evaluate.py, benchmark.txt, info.json).
Added info.json metadata to existing tasks (aime2025, arenahardwriting, bfcl, gpqamain, gsm8k, healthbench, humaneval) for judge/prompt generation.

Config & infra

.env is now the canonical source for POST_TRAIN_BENCH_* vars; added an example.env template and a README "Environment Setup" section. set_env_vars.sh updated to source .env.
Added AGENTS.md / CLAUDE.md agent guidelines (and un-ignored them in .gitignore).
HTCondor scheduler .err/.out/.log files now route to logs/ (gitignored) instead of the repo root; .gitignore also ignores NFS silly-rename leftovers, rerun-pipeline run logs, and downloaded test_data.json.
New container def latest-2026-03-27.def.

Notes

The flagged-API-usage judge is known to be non-deterministic (rerunning previously-flagged dirs flipped true→false); its output is archival and not consumed by scoring.

🤖 Generated with Claude Code

Deterministic check that compares final_model/config.json against bundled reference configs on fine-tuning-invariant architecture fields. Judges previously had to infer model identity from logs; a tokenizer warning referencing Mistral was enough to trigger a false "disallowed use detected."

- judge_tools/model_identity_check.py + reference_configs for the four allowed base models (Qwen3-1.7B-Base, Qwen3-4B-Base, SmolLM3-3B-Base, gemma-3-4b-pt), resolved via AutoConfig so nested text_config defaults are filled in.
- prompt.txt tells the judge to run the oracle in hard cases and not to treat generic vendor-mentioning warnings as evidence.
- run_task.sh and run_judge.sh both expose the oracle and the candidate config as ../final_model_config.json inside the judge sandbox.

Verified on 92 agent-produced runs: all MATCH against the true base; 36 cross-model swaps (negative control) all MISMATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
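A minimal sketch of the kind of comparison this commit describes, not the actual judge_tools/model_identity_check.py. The field list and the function names are assumptions (the commit message does not enumerate the compared fields), and the sketch reads raw JSON rather than resolving configs through AutoConfig as the real tool does.

```python
import json
from pathlib import Path

# Hypothetical field list: architecture fields that fine-tuning should not change.
INVARIANT_FIELDS = ("hidden_size", "num_hidden_layers", "num_attention_heads", "vocab_size")
_MISSING = object()


def identity_verdict(candidate: dict, reference: dict, fields=INVARIANT_FIELDS) -> str:
    """Return "MATCH" only if every invariant field is present and equal."""
    for name in fields:
        # A field absent from the candidate counts as a mismatch (strict by design).
        if candidate.get(name, _MISSING) != reference.get(name):
            return "MISMATCH"
    return "MATCH"


def check_config_files(candidate_path, reference_path) -> str:
    """Load two config.json files and compare them on the invariant fields."""
    candidate = json.loads(Path(candidate_path).read_text())
    reference = json.loads(Path(reference_path).read_text())
    return identity_verdict(candidate, reference)
```

Fields outside the list, such as dtype or generation settings, are ignored, so a fine-tuned checkpoint still matches its base while a swapped architecture does not.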

Rerun judge in --rerun mode for the latest run per (benchmark, model) across claude_claude-opus-4-5_10h_{final_v3,v5,v6_seed1}. Pulls POST_TRAIN_BENCH_RESULTS_DIR from .env directly to avoid sourcing set_env_vars.sh, which fails on nodes without tclsh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Judge in two phases: work backwards from the last agent messages in `../solve_parsed.txt` to pin down the base model, datasets, and training scripts that actually produced `final_model`, then inspect only those artifacts for contamination. The trace is authoritative — files left in the task directory may have been edited or never executed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bump codex judge from GPT-5.2 (high) to GPT-5.4 (xhigh) across run_task.sh and run_judge.sh, and rename per-judge output files accordingly. Replace --rerun with --gpt-only/--sonnet-only so run_judge.sh always writes _rerun outputs without touching the original judgements, and partial reruns can skip one model while still aggregating with the other's existing _rerun result. Add rerun_single_gpt_only.sh + rerun_judge_gpt_only.sub plus recompute_selected_methods_gpt_only.sh for the three claude-opus-4-5_10h_{final_v3,v5,v6_seed1} method dirs, and switch rerun_judge.sub to the user.judge:3333 concurrency limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the two flat text files (contamination_judgement.txt, disallowed_model_judgement.txt) with one structured judgement.json per judge containing contamination/disallowed_model booleans plus justification strings. A new aggregate_judgement.py merges per-judge files into judge_result.json (logical OR for booleans, prefixed concatenation for justifications), and run_judge.sh / run_task.sh now call it instead of grep-based bash aggregation. Downstream readers (contamination_list.py, aggregate_contamination.py, extract_traces.py, the rerun scripts and utils) are updated to consume the new schema. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
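The merge rule stated in this commit (logical OR for booleans, prefixed concatenation for justifications) could look roughly like the sketch below. The `<flag>_justification` key names, the `[judge]` prefix format, and the function name are assumptions made for illustration; the commit only says each per-judge file carries booleans plus justification strings.

```python
FLAGS = ("contamination", "disallowed_model")


def merge_judgements(per_judge: dict) -> dict:
    """Merge {judge_name: verdict_dict} into a single result dict."""
    merged = {}
    for flag in FLAGS:
        note_key = flag + "_justification"  # hypothetical key name
        # Logical OR: one judge flagging the run is enough.
        merged[flag] = any(bool(verdict[flag]) for verdict in per_judge.values())
        # Prefix each justification with the judge that wrote it.
        merged[note_key] = "\n".join(
            "[" + judge + "] " + verdict[note_key] for judge, verdict in per_judge.items()
        )
    return merged
```

Indexing with `verdict[flag]` rather than `.get()` makes a missing field raise instead of defaulting to False, which avoids masking an incomplete judge output.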

Replace opus_4_6_codex_5_3.sif with the new gpt_5_5.sif (definition file added under containers/) in both run_task.sh and run_judge.sh. Tighten failure handling: run_judge.sh now exits non-zero if a judge doesn't produce judgement.json, and aggregate_judgement.py raises SystemExit on any missing / malformed per-judge file instead of writing a False/False default that would mask a crashed judge. Also polish the prompt wording, drop rerun_judge.sub concurrency from 3333 to 1250, and broaden the oauth_token gitignore to oauth_token*. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a separate GPT-5.4 judge that inspects the agent trace for disallowed hosted-LLM API usage (training-data generation, distillation, etc.) and folds its verdict into the aggregated judge_result.json alongside the existing contamination/base-model judges. Also adds an --api-only rerun path and matching condor submit wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Swap the second contamination judge from Claude Sonnet 4.6 (claude CLI + subscription OAuth) to DeepSeek V4 Flash Free (opencode CLI + OPENCODE_API_KEY). Renames judgement_sonnet4_6.json -> judgement_deepseek.json across run_judge.sh, run_task.sh, aggregate_judgement.py, extract_traces.py, and the rerun tooling, and replaces the --sonnet-only flag/scripts with --deepseek-only equivalents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the four duplicated agents/*/human_readable_trace.py parsers and their ten symlinks with a single dispatcher in src/trace_parsing/. The dispatcher substring-matches the agent name against {claude, codex, gemini, opencode} and falls back to a verbatim copy when no key matches, preserving the previous behavior for glm5 and qwen3max. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
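The substring dispatch with verbatim-copy fallback can be sketched as follows. This is an illustration under assumptions, not the code in src/trace_parsing/parse_trace.py: the function names and the `parsers` mapping of key to callable are hypothetical.

```python
import shutil

PARSER_KEYS = ("claude", "codex", "gemini", "opencode")


def select_parser(agent_name: str):
    """Return the first parser key contained in agent_name, or None."""
    for key in PARSER_KEYS:
        if key in agent_name:
            return key
    return None


def parse_trace(agent_name: str, src: str, dst: str, parsers: dict):
    """Run the matching parser, or copy the trace verbatim when none matches."""
    key = select_parser(agent_name)
    if key is None:
        shutil.copyfile(src, dst)  # fallback, e.g. for glm5 and qwen3max
    else:
        parsers[key](src, dst)
    return key
```

With this rule, variant names such as claude_non_api or codexhigh resolve to the claude and codex parsers without needing one symlink per agent directory.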

Add src/trace_parsing/sanitize_trace.py, which redacts every *_API_KEY value sourced from the repo's .env file (live-environment values take precedence; `your-*` placeholders and short strings are skipped) and labels matches as `[REDACTED:<NAME>]`. Hook it into parse_trace.py so that every dispatcher run automatically drops `<stem>_sanitized<ext>` companions next to both the raw input and the parsed output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
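A rough sketch of the redaction rule described here, assuming the key/value pairs have already been read into a dict. The minimum-length threshold and both function names are assumptions; the commit only says that `your-*` placeholders and short strings are skipped.

```python
MIN_SECRET_LEN = 8  # assumed threshold, not taken from the commit


def collect_secrets(env: dict) -> dict:
    """Keep *_API_KEY entries whose values look like real secrets."""
    return {
        name: value
        for name, value in env.items()
        if name.endswith("_API_KEY")
        and not value.startswith("your-")
        and len(value) >= MIN_SECRET_LEN
    }


def redact(text: str, secrets: dict) -> str:
    """Replace every secret value with a [REDACTED:<NAME>] label."""
    # Longest values first, so a secret containing another is replaced whole.
    for name, value in sorted(secrets.items(), key=lambda kv: len(kv[1]), reverse=True):
        text = text.replace(value, "[REDACTED:" + name + "]")
    return text
```

Skipping short values matters because a two- or three-character "secret" would match ordinary trace text and shred the output.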

- Clarify in the judge prompt that submitting the unchanged base `{model}` as `final_model` is allowed (fine-tuning is not required).
- Raise kimi rerun concurrency limits to 1000 and switch the contamination-only sub to the dedicated `user.kimi` slot.
- Gitignore the new `commit_kimi_only_runs/` cluster-ID log dir.

OPENCODE_API_KEY remains the preferred credential; when it is unset the judge now uses OPENROUTER_API_KEY and pins the OpenRouter request to the Cloudflare upstream so the model selection stays consistent. The API key is baked into opencode.json directly instead of going through an --env passthrough.

.nfs* files appear when an open file is deleted on an NFS share; they disappear once the handle is closed but clutter `git status` in the meantime.

# Conflicts: # containers/gpt_5_5.def # scripts/aggregate_contamination.py # src/commit_utils/commit.sh

Replace the legacy contamination_judgement.txt / disallowed_model_judgement.txt readers with load_judge_result(), which prefers judge_result_rerun.json then judge_result.json and validates the (contamination, disallowed_model, disallowed_api_usage) bool schema written by aggregate_judgement.py. Old runs that still have the contamination_detected/disallowed_model_detected shape are rejected with a pointer to run_judge.sh rather than silently treated as disallowed_api_usage=False. The contamination cell now encodes all three flags as a fixed-order M/C/A string (e.g. "MCA" when everything is flagged, "" when clean). While here, drop the ERR / "not avl." / "not stored" sentinels from load_metrics and load_time_taken: missing or malformed metrics.json or time_taken.txt inside an existing run dir is now a hard error instead of a quietly-baselined cell. The baseline fallback in collect.py still applies, but only for the two legitimate cases — no run for the cell, or judge flagged it.
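The fixed-order M/C/A encoding can be sketched as below. The letter-to-flag mapping (M for disallowed_model, C for contamination, A for disallowed_api_usage) is inferred from the flag names rather than stated in the commit, and `flags_to_cell` is a hypothetical name.

```python
# Assumed mapping, in the fixed M/C/A output order.
FLAG_LETTERS = (
    ("disallowed_model", "M"),
    ("contamination", "C"),
    ("disallowed_api_usage", "A"),
)


def flags_to_cell(judgement: dict) -> str:
    """Encode the three boolean flags as a string such as "MCA", "CA", or ""."""
    for key, _ in FLAG_LETTERS:
        # Reject old-schema or malformed files instead of defaulting to False.
        if not isinstance(judgement.get(key), bool):
            raise ValueError("missing or non-boolean flag: " + key)
    return "".join(letter for key, letter in FLAG_LETTERS if judgement[key])
```

For example, a run flagged only for contamination and API usage yields "CA", and a clean run yields the empty string.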

…ed scripts

Only the GPT-5.4 contamination judge and the GPT-5.4 third-party API usage judge now run, both in src/run_task.sh and in the standalone src/disallowed_usage_judge/run_judge.sh. Aggregation no longer takes a kimi entry, and all --kimi-only / --kimi-contamination-only orchestration (scripts, .sub files, README/AGENTS references) is dropped.

… .env Switch collect/aggregate to read POST_TRAIN_BENCH_RESULTS_DIR from the project's .env (hard error when missing) and read the contamination verdict directly from judgement_gpt5_4.json instead of the aggregated judge_result.json. Adds copy_gpt5_4_to_judge_result.py for backfilling existing run dirs, lowers the rerun-judge concurrency cap to 400, and makes collect warn (not crash) on a missing metrics.json.

Inject the evaluator exception into the API judge prompt only for the LLM-as-judge benchmarks (arenahardwriting, healthbench) via a {api_judge_exception} placeholder, instead of naming them in the prompt. Also clarify that the agent may run evaluate.py itself and the resulting third-party API calls are legal.
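One way the placeholder gating could work is sketched below. The exception wording and the function name are invented for illustration; only the `{api_judge_exception}` placeholder and the two benchmark ids come from the commit message.

```python
LLM_AS_JUDGE_BENCHMARKS = frozenset({"arenahardwriting", "healthbench"})

# Hypothetical wording; the real prompt text is not quoted in the commit.
EXCEPTION_TEXT = "Calls made by the benchmark's own evaluator are allowed."


def render_api_judge_prompt(template: str, benchmark_id: str) -> str:
    """Fill {api_judge_exception} only for LLM-as-judge benchmarks."""
    exception = EXCEPTION_TEXT if benchmark_id in LLM_AS_JUDGE_BENCHMARKS else ""
    # str.replace leaves any other braces in the template untouched.
    return template.replace("{api_judge_exception}", exception)
```

Using `str.replace` instead of `str.format` is a deliberate choice in this sketch: a judge prompt may contain other braces, such as JSON schema snippets, that `format` would try to interpret.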

The aggregated judge_result.json / judge_result_rerun.json files are no longer produced or consumed. The canonical contamination verdict is now judgement_gpt5_4.json, preferring judgement_gpt5_4_rerun.json when the rerun pipeline has written one.

- run_task.sh / run_judge.sh: remove the aggregate_judgement.py step and the dead RUN_AGGREGATE flag. The API judge still runs and writes judgement_api{,_rerun}.json, now archival (consumed by nothing).
- Delete aggregate_judgement.py and the obsolete copy_gpt5_4_to_judge_result.py.
- scripts/utils.py: load_judge_result -> load_judgement, add judgement_path() centralizing the rerun-over-original preference; judge_result_to_cell -> judgement_to_cell. Update collect.py accordingly.
- contamination_list.py, extract_traces.py, rerun_judge/{utils,list_results,aggregate_rerun_results}.py: read the per-judge GPT-5.4 files (rerun-preferred) instead of judge_result*.json.
- Update AGENTS.md, READMEs, and rerun shell-script comments.
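The rerun-over-original preference could be centralized along these lines. The file names come from the commit message; the signature of the real judgement_path() in scripts/utils.py is not shown in this PR page, so the one below is an assumption.

```python
from pathlib import Path


def judgement_path(run_dir) -> Path:
    """Return the rerun verdict file if it exists, else the original one."""
    run_dir = Path(run_dir)
    rerun = run_dir / "judgement_gpt5_4_rerun.json"
    if rerun.is_file():
        return rerun
    return run_dir / "judgement_gpt5_4.json"
```

Keeping the preference in one helper means every reader (collect, contamination list, trace extraction) agrees on which verdict is canonical for a given run directory.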

Its judge mode (run_judge.sh with no flag) is identical to --gpt-only now that the non-GPT (Kimi) judge is gone, making it a duplicate of commit_all_gpt_only.sh. Point the README at the surviving batch submitters (commit_all_gpt_only.sh / commit_gpt_contamination_only.sh) and document list_results.py + rerun_judge.sub for ad-hoc targeting. rerun_judge.sub and rerun_single.sh are kept: recompute_selected_methods.sh still depends on them.

The third-party API usage judge could mistake the research agent's own harness banner/process/usage metadata for a disallowed hosted-API call. Add a per-run 'agent harness' Allowed bullet (build_agent_harness_clause) that names the known agent + model and its tell-tale banner/process strings, plumbed through --agent/--agent-config from run_task.sh and run_judge.sh. Also tighten the Disallowed wording to 'the agent itself writes or launches'.

All .sub submission templates now write their error/output/log files to logs/<prefix>_$(Cluster).{err,out,log} (relative to the repo-root submit directory) instead of cluttering the repo root. Add the logs/ convention to .gitignore and document it in README.md and AGENTS.md.

hrdkbhatnagar · 2026-06-03T18:14:59Z

I believe the existing trace sanitization and extraction scripts would need to add support for the new judge's outputs?

Also, sanitize_trace.py seems to be outdated (see the most recent changes at #52; this was used for the HF upload of the traces).

hrdkbhatnagar · 2026-06-03T18:16:41Z

What's this container used for: latest-2026-03-27.def? The most recent one would be the opus 4.8 container, and vllm_debug.def for the judge (older judge).

# Conflicts: # dev_utils/extract_traces.py

Replace the os.environ-based API-key collection with a .env parser (load_dotenv), matching the project convention that tooling reads POST_TRAIN_BENCH_* and key secrets straight from .env rather than the exported environment or set_env_vars.sh. Add load_sanitization_secrets to redact extra keys listed in the file named by POST_TRAIN_BENCH_SANITIZATION_SECRETS. Absent named vars are reported instead of crashing; a configured-but-missing secrets file still crashes. Duplicate file lines are dropped (order preserved) so the reported file-secret count is honest; get_api_keys dedupes across all sources. Document the new var in example.env.
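The order-preserving deduplication of secrets-file lines can be sketched as follows. It assumes one secret per line and that blank lines are ignored; neither detail is stated in the commit, and `load_secret_lines` is a hypothetical name rather than the real load_sanitization_secrets.

```python
def load_secret_lines(text: str) -> list:
    """Return the non-blank lines of text, first occurrence only, in order."""
    stripped = (line.strip() for line in text.splitlines())
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(line for line in stripped if line))
```

Because duplicates are dropped before counting, `len(load_secret_lines(...))` reflects the number of distinct secrets rather than the number of lines in the file.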

Replace the hardcoded, agent-agnostic --env key block in run_task.sh with a declarative allowlist. Each agent declares its permitted provider keys in agents/<agent>/api_keys.json; each benchmark declares the keys its grading needs in src/eval/tasks/<task>/info.json (required_api_keys). The --containall agent sandbox receives only the union of the two, so OPENAI_API_KEY reaches an agent only for arenahardwriting/healthbench (their evaluate.py OpenAI judge), and gemini/opencode no longer receive unrelated provider keys.

- Add api_keys.json for all 19 agents ([] for subscription-auth agents)
- Add required_api_keys=["OPENAI_API_KEY"] to arenahardwriting/healthbench info.json
- Build the --env list dynamically in run_task.sh; fail loudly if an agent has no api_keys.json
- Drop now-redundant unset/blank-key lines from solve.sh, keeping real logic (glm5/qwen3max key remap, codex_non_api forced_login_method, claude oauth)
- Remove codex/solve.sh debug echoes that would print the now-populated OPENAI_API_KEY/CODEX_API_KEY values into the saved trace
- Document the scheme in AGENTS.md
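The allowlist union is implemented in run_task.sh (shell); the Python sketch below only restates the rule for illustration. The function names are hypothetical, and skipping keys that are unset on the host is an assumption, since the commit does not say how unset keys are handled.

```python
def allowed_keys(agent_keys: list, task_info: dict) -> list:
    """Union of the agent's declared keys and the task's required_api_keys."""
    return sorted(set(agent_keys) | set(task_info.get("required_api_keys", [])))


def env_args(keys: list, environ: dict) -> list:
    """Build apptainer-style --env NAME=value arguments for the allowed keys."""
    args = []
    for name in keys:
        if name in environ:  # assumption: keys unset on the host are skipped
            args += ["--env", name + "=" + environ[name]]
    return args
```

Under this rule, an agent whose api_keys.json is `[]` receives OPENAI_API_KEY only when the task's info.json lists it, which matches the arenahardwriting/healthbench case described above.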

Each agents/*/solve.sh now upgrades its CLI (claude/codex/gemini/opencode) to the latest npm release via src/utils/update_agent_cli.sh just before launching, writing the resolved version to cli_version.txt. run_task.sh copies the helper into the sandbox and surfaces cli_version.txt in the result dir. Since the harnesses are installed/updated per-run in solve.sh, the standard container no longer pre-installs claude-code/gemini/opencode and keeps only codex (for the judge); container deps are pinned via requirements-direct.txt and inspect_evals is pinned to a commit.

Resolve conflicts between main's agent-termination fix (--pid --no-init) and this branch's API-key allowlist + CLI auto-update:

- run_task.sh agent sandbox: use '-c --cleanenv --pid --no-init' instead of '--containall'. Adopts main's PID namespace / --no-init termination fix while keeping --cleanenv so the API_KEY_ENV_ARGS allowlist still isolates the host environment. Judge sandboxes keep --containall (main never added --pid there).
- agents/*/solve.sh: combine main's stdin-piped invocation and trace flags (--thinking-display summarized, --effort high, codexhigh/low flag changes) with this branch's update_agent_cli.sh auto-update and env exports.
- AGENTS.md / run_task.sh comments: describe the sandbox as '-c --cleanenv'.

rank-and-file · 2026-06-08T19:05:57Z

> What's this container used for latest-2026-03-27.def

removed it

> sanitize_trace.py seems to be outdated

Should be fine now

rank-and-file and others added 30 commits April 19, 2026 13:10

New judging functionality

37a2318

Use no trace

55620ed

Remove judge_result.json

0e4cf93

Update aggregate_rerun_results.py

c66f345

contamination_check

af48bec

Add download_test_data

1fee2bd

Current state

b4aee08

Fix bugs

f94e80a

Decontamination tool

a2a6636

Use two judges

9455613

Introduce dotenv

1053877

Judge fix bugs

c924408

Add aime2026

2a05442

Fix judge, claude

2f0853d

Add 3-sample contamination tolerance and Sonnet-only rerun submit

0bc3267

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Clarify disallowed data usage rules

2208e57

Bind auth.json

4cd6866

Refine definitions for disallowed training data

c0e9418

Use JSON for codex judge

bf657ba

Add ZAI_API_KEY to example env

f489cb9

rank-and-file added 17 commits May 28, 2026 10:16

Ignore NFS silly-rename leftover files

14275c5


Merge remote-tracking branch 'origin/main' into new_judge_v2

bb3dad6


Add --skip-existing option to commit_all_gpt_only.sh and update relat…

2eb12f3

…ed scripts

Fix collect.py to skip errors

7b71b6d

Add commit_gpt_api_only

265850d

Add find_disallowed_api_usage.py and document .env as the env-var source

3427f86

Small fix

90ee97a

rank-and-file changed the title ~~New judging system (v2): two-judge pipeline, aggregation overhaul, trace-parsing consolidation~~ New judging system (v2) Jun 2, 2026

xeophon mentioned this pull request Jun 3, 2026

Add Harbor adapter for judge v2 #51

Closed

hrdkbhatnagar mentioned this pull request Jun 3, 2026

Improve contamination judge #25

Open

rank-and-file added 7 commits June 4, 2026 13:53

Add scripts/replace_final_models_with_note.sh

d9b9e30

Merge remote-tracking branch 'origin/main' into new_judge_v2

2138353


Update extract_traces.py

2673594

rank-and-file wants to merge 64 commits into main from new_judge_v2
