New judging system (v2) #50
Open
rank-and-file wants to merge 64 commits into
Deterministic check that compares final_model/config.json against bundled reference configs on fine-tuning-invariant architecture fields. Judges previously had to infer model identity from logs; a tokenizer warning referencing Mistral was enough to trigger a false "disallowed use detected."
- judge_tools/model_identity_check.py + reference_configs for the four allowed base models (Qwen3-1.7B-Base, Qwen3-4B-Base, SmolLM3-3B-Base, gemma-3-4b-pt), resolved via AutoConfig so nested text_config defaults are filled in.
- prompt.txt tells the judge to run the oracle in hard cases and not to treat generic vendor-mentioning warnings as evidence.
- run_task.sh and run_judge.sh both expose the oracle and the candidate config as ../final_model_config.json inside the judge sandbox.
Verified on 92 agent-produced runs: all MATCH against the true base; 36 cross-model swaps (negative control) all MISMATCH.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
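The comparison this commit describes can be sketched roughly as follows. This is a minimal illustration, not the contents of judge_tools/model_identity_check.py: the field list, the function names, and the example config values are all assumptions made for the sketch.

```python
from typing import List

# Hypothetical field list; the fields actually compared by
# judge_tools/model_identity_check.py may differ.
INVARIANT_FIELDS = (
    "model_type", "hidden_size", "num_hidden_layers",
    "num_attention_heads", "vocab_size",
)

def differing_fields(candidate: dict, reference: dict) -> List[str]:
    """Return the invariant fields on which the two configs disagree."""
    return [f for f in INVARIANT_FIELDS if candidate.get(f) != reference.get(f)]

def check_identity(candidate: dict, reference: dict) -> str:
    """MATCH when every invariant field agrees, MISMATCH otherwise."""
    return "MISMATCH" if differing_fields(candidate, reference) else "MATCH"

# Illustrative values only, not copied from a real reference config.
base = {"model_type": "qwen3", "hidden_size": 2048, "num_hidden_layers": 28,
        "num_attention_heads": 16, "vocab_size": 151936}
fine_tuned = dict(base, torch_dtype="bfloat16")   # non-architectural change
swapped = dict(base, model_type="mistral", hidden_size=4096)

print(check_identity(fine_tuned, base))  # fields unrelated to architecture are ignored
print(check_identity(swapped, base))
```

Restricting the comparison to architecture fields is what lets a fine-tuned checkpoint still match its base while a swapped-in model does not.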
Rerun judge in --rerun mode for the latest run per (benchmark, model)
across claude_claude-opus-4-5_10h_{final_v3,v5,v6_seed1}. Pulls
POST_TRAIN_BENCH_RESULTS_DIR from .env directly to avoid sourcing
set_env_vars.sh, which fails on nodes without tclsh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Judge in two phases: work backwards from the last agent messages in `../solve_parsed.txt` to pin down the base model, datasets, and training scripts that actually produced `final_model`, then inspect only those artifacts for contamination. The trace is authoritative — files left in the task directory may have been edited or never executed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bump codex judge from GPT-5.2 (high) to GPT-5.4 (xhigh) across run_task.sh
and run_judge.sh, and rename per-judge output files accordingly. Replace
--rerun with --gpt-only/--sonnet-only so run_judge.sh always writes _rerun
outputs without touching the original judgements, and partial reruns can
skip one model while still aggregating with the other's existing _rerun
result. Add rerun_single_gpt_only.sh + rerun_judge_gpt_only.sub plus
recompute_selected_methods_gpt_only.sh for the three
claude-opus-4-5_10h_{final_v3,v5,v6_seed1} method dirs, and switch
rerun_judge.sub to the user.judge:3333 concurrency limit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the two flat text files (contamination_judgement.txt, disallowed_model_judgement.txt) with one structured judgement.json per judge containing contamination/disallowed_model booleans plus justification strings. A new aggregate_judgement.py merges per-judge files into judge_result.json (logical OR for booleans, prefixed concatenation for justifications), and run_judge.sh / run_task.sh now call it instead of grep-based bash aggregation. Downstream readers (contamination_list.py, aggregate_contamination.py, extract_traces.py, the rerun scripts and utils) are updated to consume the new schema. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
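A minimal sketch of the merge rule described above (logical OR for booleans, prefixed concatenation for justifications). The `*_justification` key names, the `[judge]` prefix format, and the function name are assumptions; the real schema is whatever aggregate_judgement.py writes.

```python
from typing import Dict

FLAGS = ("contamination", "disallowed_model")

def merge_judgements(per_judge: Dict[str, dict]) -> dict:
    """Merge per-judge verdicts: a flag is set if any judge set it."""
    merged = {}
    for flag in FLAGS:
        merged[flag] = any(j[flag] for j in per_judge.values())
        # Prefix each justification with the judge name so the source stays visible.
        merged[flag + "_justification"] = "\n".join(
            f"[{name}] {j[flag + '_justification']}"
            for name, j in per_judge.items()
        )
    return merged

example = merge_judgements({
    "judge_a": {"contamination": False, "contamination_justification": "clean",
                "disallowed_model": False, "disallowed_model_justification": "base ok"},
    "judge_b": {"contamination": True, "contamination_justification": "test set seen",
                "disallowed_model": False, "disallowed_model_justification": "base ok"},
})
```

With OR semantics a single positive verdict is enough to flag the run, which errs on the side of flagging.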
Replace opus_4_6_codex_5_3.sif with the new gpt_5_5.sif (definition file added under containers/) in both run_task.sh and run_judge.sh. Tighten failure handling: run_judge.sh now exits non-zero if a judge doesn't produce judgement.json, and aggregate_judgement.py raises SystemExit on any missing / malformed per-judge file instead of writing a False/False default that would mask a crashed judge. Also polish the prompt wording, drop rerun_judge.sub concurrency from 3333 to 1250, and broaden the oauth_token gitignore to oauth_token*. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
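The stricter failure handling might look roughly like this. It is a sketch under assumptions: the function name and the exact validation rules are hypothetical, and only the behaviour stated above (SystemExit on a missing or malformed per-judge file, no False/False default) comes from the commit message.

```python
import json
from pathlib import Path

REQUIRED_FLAGS = ("contamination", "disallowed_model")

def load_judgement_strict(path: Path) -> dict:
    """Load one per-judge file, aborting instead of defaulting to False/False."""
    if not path.is_file():
        raise SystemExit(f"missing judgement file: {path}")
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        raise SystemExit(f"malformed judgement file {path}: {exc}")
    if not isinstance(data, dict):
        raise SystemExit(f"judgement file {path} is not a JSON object")
    for flag in REQUIRED_FLAGS:
        # A crashed judge must not be read as a clean verdict.
        if not isinstance(data.get(flag), bool):
            raise SystemExit(f"{path}: field {flag!r} is missing or not a bool")
    return data
```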
Adds a separate GPT-5.4 judge that inspects the agent trace for disallowed hosted-LLM API usage (training-data generation, distillation, etc.) and folds its verdict into the aggregated judge_result.json alongside the existing contamination/base-model judges. Also adds an --api-only rerun path and matching condor submit wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Swap the second contamination judge from Claude Sonnet 4.6 (claude CLI + subscription OAuth) to DeepSeek V4 Flash Free (opencode CLI + OPENCODE_API_KEY). Renames judgement_sonnet4_6.json -> judgement_deepseek.json across run_judge.sh, run_task.sh, aggregate_judgement.py, extract_traces.py, and the rerun tooling, and replaces the --sonnet-only flag/scripts with --deepseek-only equivalents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the four duplicated agents/*/human_readable_trace.py parsers and
their ten symlinks with a single dispatcher in src/trace_parsing/. The
dispatcher substring-matches the agent name against {claude, codex,
gemini, opencode} and falls back to a verbatim copy when no key matches,
preserving the previous behavior for glm5 and qwen3max.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
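The dispatch rule can be illustrated with a short sketch. The function name and return convention are assumptions; only the four keys and the substring-match-with-fallback behaviour come from the commit message.

```python
from typing import Optional

PARSER_KEYS = ("claude", "codex", "gemini", "opencode")

def select_parser(agent_name: str) -> Optional[str]:
    """Return the first parser key contained in the agent name.

    None means no key matched, in which case the caller would copy the
    trace verbatim instead of parsing it.
    """
    for key in PARSER_KEYS:
        if key in agent_name:
            return key
    return None

print(select_parser("claude_non_api"))
print(select_parser("glm5"))
```

Because this is a substring match, an agent name that contained two keys would resolve to whichever key comes first in the tuple.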
Add src/trace_parsing/sanitize_trace.py, which redacts every *_API_KEY value sourced from the repo's .env file (live-environment values take precedence; `your-*` placeholders and short strings are skipped) and labels matches as `[REDACTED:<NAME>]`. Hook it into parse_trace.py so that every dispatcher run automatically drops `<stem>_sanitized<ext>` companions next to both the raw input and the parsed output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
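A rough sketch of the redaction and companion-file naming described above. The function names and the minimum-length threshold are assumptions; sanitize_trace.py may implement the skipping rules differently.

```python
from pathlib import Path
from typing import Dict

MIN_SECRET_LEN = 8  # hypothetical cutoff for "short strings"

def redact(text: str, secrets: Dict[str, str]) -> str:
    """Replace every *_API_KEY value in text with a labelled marker."""
    # Longest values first, so a key that contains another key is replaced whole.
    for name, value in sorted(secrets.items(), key=lambda kv: len(kv[1]), reverse=True):
        if not name.endswith("_API_KEY"):
            continue
        if len(value) < MIN_SECRET_LEN or value.startswith("your-"):
            continue  # placeholder or too short to redact safely
        text = text.replace(value, f"[REDACTED:{name}]")
    return text

def sanitized_path(path: Path) -> Path:
    """Map <stem><ext> to the <stem>_sanitized<ext> companion next to it."""
    return path.with_name(f"{path.stem}_sanitized{path.suffix}")

secrets = {"OPENAI_API_KEY": "sk-abcdef1234567890", "DEMO_API_KEY": "your-key-here"}
clean = redact("key=sk-abcdef1234567890 placeholder=your-key-here", secrets)
```

Skipping placeholders and short values avoids redacting ordinary words that happen to equal a dummy key.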
- Clarify in the judge prompt that submitting the unchanged base
`{model}` as `final_model` is allowed (fine-tuning is not required).
- Raise kimi rerun concurrency limits to 1000 and switch the
contamination-only sub to the dedicated `user.kimi` slot.
- Gitignore the new `commit_kimi_only_runs/` cluster-ID log dir.
OPENCODE_API_KEY remains the preferred credential; when it is unset the judge now uses OPENROUTER_API_KEY and pins the OpenRouter request to the Cloudflare upstream so the model selection stays consistent. The API key is baked into opencode.json directly instead of going through an --env passthrough.
.nfs* files appear when an open file is deleted on an NFS share; they disappear once the handle is closed but clutter `git status` in the meantime.
# Conflicts:
#	containers/gpt_5_5.def
#	scripts/aggregate_contamination.py
#	src/commit_utils/commit.sh
Replace the legacy contamination_judgement.txt / disallowed_model_judgement.txt readers with load_judge_result(), which prefers judge_result_rerun.json then judge_result.json and validates the (contamination, disallowed_model, disallowed_api_usage) bool schema written by aggregate_judgement.py. Old runs that still have the contamination_detected/disallowed_model_detected shape are rejected with a pointer to run_judge.sh rather than silently treated as disallowed_api_usage=False. The contamination cell now encodes all three flags as a fixed-order M/C/A string (e.g. "MCA" when everything is flagged, "" when clean). While here, drop the ERR / "not avl." / "not stored" sentinels from load_metrics and load_time_taken: missing or malformed metrics.json or time_taken.txt inside an existing run dir is now a hard error instead of a quietly-baselined cell. The baseline fallback in collect.py still applies, but only for the two legitimate cases — no run for the cell, or judge flagged it.
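The fixed-order cell encoding can be sketched as below. The letter-to-flag mapping (M for disallowed_model, C for contamination, A for disallowed_api_usage) is inferred from the flag names and is an assumption, as is the function name.

```python
# Assumed mapping, in the fixed M/C/A order used for the cell string.
CELL_ORDER = (
    ("disallowed_model", "M"),
    ("contamination", "C"),
    ("disallowed_api_usage", "A"),
)

def flags_to_cell(judgement: dict) -> str:
    """Encode the three boolean flags as a fixed-order string."""
    return "".join(letter for key, letter in CELL_ORDER if judgement[key])

all_flagged = {"contamination": True, "disallowed_model": True, "disallowed_api_usage": True}
clean = {"contamination": False, "disallowed_model": False, "disallowed_api_usage": False}
print(repr(flags_to_cell(all_flagged)), repr(flags_to_cell(clean)))
```

Indexing with `judgement[key]` rather than `.get()` means an old-schema file raises a KeyError instead of being read as unflagged.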
Only the GPT-5.4 contamination judge and the GPT-5.4 third-party API usage judge now run, both in src/run_task.sh and in the standalone src/disallowed_usage_judge/run_judge.sh. Aggregation no longer takes a kimi entry, and all --kimi-only / --kimi-contamination-only orchestration (scripts, .sub files, README/AGENTS references) is dropped.
… .env Switch collect/aggregate to read POST_TRAIN_BENCH_RESULTS_DIR from the project's .env (hard error when missing) and read the contamination verdict directly from judgement_gpt5_4.json instead of the aggregated judge_result.json. Adds copy_gpt5_4_to_judge_result.py for backfilling existing run dirs, lowers the rerun-judge concurrency cap to 400, and makes collect warn (not crash) on a missing metrics.json.
Inject the evaluator exception into the API judge prompt only for the
LLM-as-judge benchmarks (arenahardwriting, healthbench) via a
{api_judge_exception} placeholder, instead of naming them in the prompt.
Also clarify that the agent may run evaluate.py itself and the resulting
third-party API calls are legal.
The aggregated judge_result.json / judge_result_rerun.json files are no
longer produced or consumed. The canonical contamination verdict is now
judgement_gpt5_4.json, preferring judgement_gpt5_4_rerun.json when the
rerun pipeline has written one.
- run_task.sh / run_judge.sh: remove the aggregate_judgement.py step and
the dead RUN_AGGREGATE flag. The API judge still runs and writes
judgement_api{,_rerun}.json, now archival (consumed by nothing).
- Delete aggregate_judgement.py and the obsolete copy_gpt5_4_to_judge_result.py.
- scripts/utils.py: load_judge_result -> load_judgement, add judgement_path()
centralizing the rerun-over-original preference; judge_result_to_cell ->
judgement_to_cell. Update collect.py accordingly.
- contamination_list.py, extract_traces.py, rerun_judge/{utils,list_results,
aggregate_rerun_results}.py: read the per-judge GPT-5.4 files (rerun-preferred)
instead of judge_result*.json.
- Update AGENTS.md, READMEs, and rerun shell-script comments.
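The rerun-over-original preference that judgement_path() centralizes could look roughly like this. The signature shown is an assumption; only the two file names and the preference order come from the commit message.

```python
from pathlib import Path

def judgement_path(run_dir: Path) -> Path:
    """Prefer the rerun verdict when the rerun pipeline has written one."""
    rerun = run_dir / "judgement_gpt5_4_rerun.json"
    if rerun.is_file():
        return rerun
    return run_dir / "judgement_gpt5_4.json"
```

Keeping this choice in one helper means every reader (collect, contamination listing, trace extraction) resolves the same file for a given run directory.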
Its judge mode (run_judge.sh with no flag) is identical to --gpt-only now that the non-GPT (Kimi) judge is gone, making it a duplicate of commit_all_gpt_only.sh. Point the README at the surviving batch submitters (commit_all_gpt_only.sh / commit_gpt_contamination_only.sh) and document list_results.py + rerun_judge.sub for ad-hoc targeting. rerun_judge.sub and rerun_single.sh are kept: recompute_selected_methods.sh still depends on them.
The third-party API usage judge could mistake the research agent's own harness banner/process/usage metadata for a disallowed hosted-API call. Add a per-run 'agent harness' Allowed bullet (build_agent_harness_clause) that names the known agent + model and its tell-tale banner/process strings, plumbed through --agent/--agent-config from run_task.sh and run_judge.sh. Also tighten the Disallowed wording to 'the agent itself writes or launches'.
All .sub submission templates now write their error/output/log files to
logs/<prefix>_$(Cluster).{err,out,log} (relative to the repo-root submit
directory) instead of cluttering the repo root. Add the logs/ convention to
.gitignore and document it in README.md and AGENTS.md.
Collaborator
I believe the existing trace sanitization and extraction scripts would need to add support for the new judge's outputs? Also the
Collaborator
What's this container used for?
# Conflicts:
#	dev_utils/extract_traces.py
Replace the os.environ-based API-key collection with a .env parser (load_dotenv), matching the project convention that tooling reads POST_TRAIN_BENCH_* and key secrets straight from .env rather than the exported environment or set_env_vars.sh. Add load_sanitization_secrets to redact extra keys listed in the file named by POST_TRAIN_BENCH_SANITIZATION_SECRETS. Absent named vars are reported instead of crashing; a configured-but-missing secrets file still crashes. Duplicate file lines are dropped (order preserved) so the reported file-secret count is honest; get_api_keys dedupes across all sources. Document the new var in example.env.
Replace the hardcoded, agent-agnostic --env key block in run_task.sh with a declarative allowlist. Each agent declares its permitted provider keys in agents/<agent>/api_keys.json; each benchmark declares the keys its grading needs in src/eval/tasks/<task>/info.json (required_api_keys). The --containall agent sandbox receives only the union of the two, so OPENAI_API_KEY reaches an agent only for arenahardwriting/healthbench (their evaluate.py OpenAI judge), and gemini/opencode no longer receive unrelated provider keys.
- Add api_keys.json for all 19 agents ([] for subscription-auth agents)
- Add required_api_keys=["OPENAI_API_KEY"] to arenahardwriting/healthbench info.json
- Build the --env list dynamically in run_task.sh; fail loudly if an agent has no api_keys.json
- Drop now-redundant unset/blank-key lines from solve.sh, keeping real logic (glm5/qwen3max key remap, codex_non_api forced_login_method, claude oauth)
- Remove codex/solve.sh debug echoes that would print the now-populated OPENAI_API_KEY/CODEX_API_KEY values into the saved trace
- Document the scheme in AGENTS.md
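The allowlist logic amounts to a union of two declared key lists. The sketch below is in Python for illustration only (the actual logic lives in run_task.sh); the function names, the example agent key, and the choice to skip keys that are unset on the host are assumptions.

```python
from typing import Dict, List

def allowed_keys(agent_keys: List[str], task_keys: List[str]) -> List[str]:
    """Union of agent-declared and task-required keys, order-preserving."""
    return list(dict.fromkeys([*agent_keys, *task_keys]))

def build_env_args(keys: List[str], host_env: Dict[str, str]) -> List[str]:
    """Turn the allowlist into --env KEY=VALUE arguments for the sandbox."""
    args: List[str] = []
    for key in keys:
        if key in host_env:  # assumption: unset keys are simply not passed
            args += ["--env", f"{key}={host_env[key]}"]
    return args

# Hypothetical contents of api_keys.json and info.json["required_api_keys"].
agent = ["ANTHROPIC_API_KEY"]
task = ["OPENAI_API_KEY"]
host = {"ANTHROPIC_API_KEY": "a-secret", "OPENAI_API_KEY": "o-secret",
        "GEMINI_API_KEY": "g-secret"}
print(build_env_args(allowed_keys(agent, task), host))
```

In this example GEMINI_API_KEY is present on the host but never reaches the sandbox, because neither the agent nor the task declares it.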
Each agents/*/solve.sh now upgrades its CLI (claude/codex/gemini/opencode) to the latest npm release via src/utils/update_agent_cli.sh just before launching, writing the resolved version to cli_version.txt. run_task.sh copies the helper into the sandbox and surfaces cli_version.txt in the result dir. Since the harnesses are installed/updated per-run in solve.sh, the standard container no longer pre-installs claude-code/gemini/opencode and keeps only codex (for the judge); container deps are pinned via requirements-direct.txt and inspect_evals is pinned to a commit.
Resolve conflicts between main's agent-termination fix (--pid --no-init) and this branch's API-key allowlist + CLI auto-update:
- run_task.sh agent sandbox: use '-c --cleanenv --pid --no-init' instead of '--containall'. Adopts main's PID namespace / --no-init termination fix while keeping --cleanenv so the API_KEY_ENV_ARGS allowlist still isolates the host environment. Judge sandboxes keep --containall (main never added --pid there).
- agents/*/solve.sh: combine main's stdin-piped invocation and trace flags (--thinking-display summarized, --effort high, codexhigh/low flag changes) with this branch's update_agent_cli.sh auto-update and env exports.
- AGENTS.md / run_task.sh comments: describe the sandbox as '-c --cleanenv'.
Collaborator
Author
removed it
Should be fine now
Overview
This branch ("new_judge_v2") reworks PostTrainBench's contamination/safety judging, overhauls the result-aggregation layer, consolidates trace parsing, adds an evaluation task, and moves configuration to a `.env`-based flow.

Judging system (v2)
- `src/disallowed_usage_judge/run_judge.sh` (new in this PR; previously the judge was invoked inline). It runs a two-judge pipeline after the agent finishes:
  - … `disallowed_api_usage` schema; writes its own `judgement_api.json` (archival, not consumed by scoring).
- … `judgement_gpt5_4.json` directly. The old `scripts/migrate_judgement_files.py` is removed.
- … `benchmark_id`.
- … `judge_tools/` (`contamination_check.py`, `model_identity_check.py`, `reference_configs/`), plus `contamination_check_tool/download_test_data.{py,sh}`. The judge runs in the `gpt_5_5` container with apptainer `--containall`.
- `src/disallowed_usage_judge/rerun_judge/`: batch resubmission for GPT-only / API-only / contamination-only reruns; outputs always carry a `_rerun` suffix so originals are preserved. Includes `find_disallowed_api_usage.py` (replaces the removed `dev_utils/find_api_illegal.py`).

Aggregation / scripts
- `scripts/collect.py` now skips error runs, reads config from `.env`, and drops the API-usage judge from the collect/aggregate flow.
- `scripts/utils.py`, `scripts/aggregate.py`, and `scripts/parse_all_to_human_readable.sh` updated accordingly.
- … `scripts/migrate_judgement_files.py`.

Trace parsing
- … `agents/*/human_readable_trace.py`, now removed) into `src/trace_parsing/` (`claude`, `codex`, `gemini`, `opencode`) with a shared `_common.py` and a `parse_trace.py` dispatcher.
- … `sanitize_trace.py`, run after every parse.

Agents & tasks
- `codexhigh` / `codexlow` now emit JSON traces; minor `solve.sh` tweaks to `claude` and `claude_non_api`.
- … `evaluate.py`, `benchmark.txt`, `info.json`).
- … `info.json` metadata to existing tasks (aime2025, arenahardwriting, bfcl, gpqamain, gsm8k, healthbench, humaneval) for judge/prompt generation.

Config & infra
- `.env` is now the canonical source for `POST_TRAIN_BENCH_*` vars; added an `example.env` template and a README "Environment Setup" section. `set_env_vars.sh` updated to source `.env`.
- … `AGENTS.md` / `CLAUDE.md` agent guidelines (and un-ignored them in `.gitignore`).
- … `.err` / `.out` / `.log` files now route to `logs/` (gitignored) instead of the repo root; `.gitignore` also ignores NFS silly-rename leftovers, rerun-pipeline run logs, and downloaded `test_data.json`.
- … `latest-2026-03-27.def`.

Notes
🤖 Generated with Claude Code