Skip to content

fix(reviewer): surface author rationale to reviewer agents#25

Open
AbirAbbas wants to merge 86 commits into
mainfrom
fix/reviewer-sees-author-intent
Open

fix(reviewer): surface author rationale to reviewer agents#25
AbirAbbas wants to merge 86 commits into
mainfrom
fix/reviewer-sees-author-intent

Conversation

@AbirAbbas
Copy link
Copy Markdown
Contributor

Summary

Reviewer agents were producing findings that contradicted explicit design rationale stated in PR descriptions, because the rationale never reached them. Three structural causes:

  1. Intake/anatomy truncated the description aggressively. Intake gate kept the first 500 chars, intake fallback 1000, anatomy 500. A 1100-char PR body with the rationale in the second half was literally cut off before any model saw it.
  2. The reviewer agent never saw the raw description. It received pr_narrative, risk_surfaces, and intake_summary — all digested summaries. The author's voice was laundered through two summarization layers before the reviewer looked at code.
  3. The anatomy prompt explicitly tells the model to discount the description ("what the CODE says, not what the PR description says"). That framing is correct for diff-vs-description divergence detection, but it leaves the reviewer with no channel for author intent.

Changes

  • intake_phase (gate + fallback) and anatomy_phase: bump description truncation from 500/1000/500 → 4000 chars uniformly. Catches typical thoughtful PR bodies without unbounded prompt growth.
  • review_dimension: new pr_description parameter, plus a dedicated "Author's Stated Intent" section near the top of the assembled prompt. The framing tells the reviewer not to defer to the description — but if a finding contradicts a design choice the author explicitly justified, the reviewer must rebut the stated rationale on its merits, not flag as if the rationale wasn't given. Findings on points the description is silent about are unaffected.
  • ReviewOrchestrator._run_parallel_review: wire self.pr_data.description through to review_dimension.

Background

Investigated against a real review where the reviewer flagged "add idempotency check before POST" and "silent failure on 403" — both contradicting explicitly-stated rationale ("POST is additive on purpose", "fail-soft is intentional") sitting in the PR body the reviewer never saw. The same root cause produces a class of bad findings on any PR whose author bothers to explain why.

Test plan

  • Existing tests pass — 25 pass, 1 pre-existing unrelated cost_tracker failure on main baseline (not introduced by this PR)
  • Imports clean (from pr_af.orchestrator import ReviewOrchestrator etc.)
  • Manual: re-run a review against a PR with substantive design rationale in the body and confirm findings either get dropped or get rebutted on the merits when they contradict that rationale

🤖 Generated with Claude Code

santoshkumarradha and others added 30 commits March 10, 2026 08:21
- Add recursive sub-review spawning with depth control (max_review_depth param)
- Fix GitHub comment posting: path normalization, diff line range filtering
- Add REQUEST_CHANGES → COMMENT fallback for own-PR token case
- Fix CP async execute: wrap params under 'input' key
- Add SubReviewRequest schema and _SubReviewRequest harness schema
- Wire max_review_depth through ReviewInput → BudgetConfig → orchestrator
- All 8 reasoners tracked in CP with VC generation (20 VCs per run)
- Successfully posted 4 inline comments to pysktb PR #10
- Rewrite _format_comment_body: multiline evidence as blockquotes,
  suggestion mode 'comment' renders as plain text (not confusing
  commented-out code), simplified metadata line
- Rewrite _format_summary: overall rating/grade at top, AgentField
  branding at top, PR overview in collapsible section, key findings
  with blocking/non-blocking split, all findings by severity
  (collapsible), review process details (collapsible), pipeline
  stats (collapsible), normal-sized footer
- Add _compute_rating, _build_key_findings, _build_review_details,
  _build_pipeline_stats helpers
- Lower min_severity default to 'nitpick', reduce confidence
  thresholds, raise max_duration_seconds to 1800s
- Add severity calibration guidance to reviewer harness prompt
- Add suggestion_mode toggle ('comment' default vs 'code')
- Fix PR branch checkout for new files in diff
- Docker: shared workspace volume, XDG_DATA_HOME persistence
…LLM-as-judge evaluation

- Rewrite all 7 harness prompts for architecture-level code review depth
  (anatomy, planning, review_dimension, cross_ref, adversary, intake, coverage)
- Fix repo clone for large repos: shallow clone (--depth 1 --no-tags --no-checkout),
  600s timeout, shallow PR ref fetch with 300s timeout
- Improve coverage gap prompt in orchestrator
- Add benchmark/truenas-middleware-18291/ with full dry run results:
  pr-af-result.json (25 findings, 8 critical), claude-code data (4 findings),
  EVALUATION.md comparing PR-AF vs Claude Code as LLM-as-a-judge
…parallel adversary batching

Replace single planning phase with 3 parallel meta-selectors (Semantic,
Mechanical, Systemic) that generate review dimensions from independent
analytical lenses. Add 3-level false-positive reduction: Level 1 in
review_dimension prompt (reachability/evidence/confidence gates), Level 2
via cross-meta deduplication, Level 3 via parallel adversary batching.

- Add MetaDimensionResult + MetaSelectorConfig schemas
- Add meta_semantic, meta_mechanical, meta_systemic reasoners
- Refactor orchestrator: _run_meta_selectors, _dedup_cross_meta,
  _run_parallel_adversary, reorder adversary before cross_ref
- Update phase budgets (meta_selectors: 0.30, adversary: 0.40)
…LM-as-judge evaluation

Rewrites EVALUATION.md as a model comparison on TrueNAS middleware PR #18291.
Sonnet scores highest (0.828) with near-perfect precision; Kimi wins breadth (0.727).
Adds separate result files for each model run.
… benchmark

- Enrich all pipeline phases with full context (diff patches, PR narrative,
  risk surfaces, blast radius, intake summary, dimension dedup awareness)
- Add _write_context_file() for large context → file fallback (.harness reads)
- Add _build_file_patches(), _escalate_depth(), _cleanup_context_dir() to orchestrator
- Upgrade meta-selector prompts: Investigation Protocol instructs harness to
  explore repo (browse files, trace callers, check dependencies) not just
  analyze pre-digested JSON blobs
- Add Quality Gate: blocks dimensions generated solely from diff text
- Upgrade review_dimension: verify claims against actual code before reporting
- Upgrade adversary: must read actual files to verify reviewer claims
- Fix Python 3.11 compat: replace nested f-strings with backslashes
- Add adaptive depth escalation (quick→standard→deep) based on risk signals
- Kimi k2.5 enriched run on PR #254: 25 findings, 8 critical, 16 adversary
  confirmed, 1994s duration
…ification harness

- Add evidence.py: parallel programmatic code extraction (zero-cost context feed)
  - EvidencePackage model with primary_code, caller_snippets, cross_ref_snippets
  - Async extraction with semaphore(10), Python 3.11 compatible
- Add evidence_verifier harness: independently investigates critical/important findings
  - Meta-prompted to browse repo, verify claims against actual code
  - Returns verified/falsified status with revised severity/confidence
- Update adversary_phase: receives ground truth evidence + verification notes
  - Compares reviewer claims vs extracted code before challenging
- Update cross_ref_phase: receives import context and caller snippets per finding
- Update orchestrator: wire extraction → verification → adversary → cross_ref
  - Falsified findings demoted before adversary phase
  - Coverage loop also extracts evidence for gap findings
- 25 findings, 0% FP rate (down from 10% in Sonnet baseline)
- 4 findings falsified by verification harness before adversary phase
- 17 true positives, 5 valid suggestions, 2 duplicates
- errors.Is finding confirmed as TRUE POSITIVE (line 5186 uses string comparison)
- ~43min runtime (+7% overhead vs enriched-only run)
- Implemented Parallel Compound Analysis to discover novel attack chains
- Replaced old  logic entirely
- Added a dedup phase for the newly discovered compound findings
- Adjusted Docker health checks to prevent timeouts under parallel load
- Added EVALUATION.md documenting PR-AF improvements vs Claude Code baseline
- Added viral/developer-facing formatting and badges
- Highlighted the single-agent ceiling vs multi-agent DAG approach
- Documented the 8-phase architecture
- Showcased the compound analysis differentiator using the PR 254 benchmark
- Removed mechanical 8-phase pipeline table
- Emphasized 0% false positive rate via Evidence Grounding
- Highlighted Compound Attack Chain synthesis
- Added focus on Adversarial Tension and Dynamic Meta-prompts
- Strengthened viral/powerful messaging against single-agent baselines
- Added ci_runner.py to handle async execution polling
- Updated README with drop-in GitHub Actions workflow template
- Configured to use native GITHUB_TOKEN for zero-config deployment
- Set workflow trigger to act on 'pr-af' label
- Appended 'Reviewed by AgentField PR-AF' to all inline comments
- Added stylish AgentField shield badge to main PR summary
- Pointed internal links to the new Agent-Field/pr-af repo
- Removed overly verbose architectural terminology and DAG references
- Added Ecosystem Comparison table featuring Claude Code and Commercial SaaS (Codex/CodeRabbit)
- Made the tone more objective, pointed, and respectful of other tools in the space
- Focused explicitly on the core value props: Evidence Grounding, Compound Analysis, Adversarial Red Teaming
…chitecture

- Removed references to 'Red Team agent' and simple 'agents' to avoid sounding trivial
- Replaced with 'Falsifiability Gates' and 'auto-invalidation process'
- Enhanced 'Compound Vulnerability Synthesis' to focus on cross-correlated risks and systemic coalescence
- Framed the architecture as a 'massively parallel cognitive pipeline'
- Added a 'Dynamic Pipeline Architecture' section detailing how PR-AF morphs its execution graph based on PR topology
- Authored a complex, highly technical Mermaid diagram showing the flow from Intake -> Lenses -> Verification -> Compound Synthesis
- Elevated language to emphasize large-scale cognitive orchestration rather than static scripts
- Add /webhook/github endpoint for @mention-triggered PR reviews
- Add GitHub App authentication (JWT + installation tokens) to GitHubClient
- Reviews post as pr-af[bot] when GitHub App is configured, falls back to PAT
- Add railway.toml build config with healthcheck
- Add PyJWT[crypto] dependency for GitHub App JWT signing
- Configure Railway env vars with dynamic service references
- Remove .env from tracking (contains sensitive keys)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI was failing because `ruff check .` picked up 1375 errors from the
bundled .docker-sdk/agentfield/ directory which is third-party code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Railway deployment + GitHub App webhook
- Add diptych hero image comparing traditional single-LLM review vs
  PR-AF's dynamic plan→review→challenge pipeline
- Add vertical funnel architecture diagram showing 7-phase adaptive
  pipeline with noise reduction metrics
- Update README header to "Open-Source Agentic PR Reviewer"
- Wrap mermaid diagram in collapsible details block
- Add architecture image inline with link to full docs
…field.ai

- Remove .ai()/.harness()/meta-prompting from architecture footer
- Replace with: dynamic planning · parallel review · adversarial verification · streaming pipeline
- Fix footer domain to agentfield.ai on both images
Replace /reasoner/pr-af/review with /api/v1/execute/async/pr-af.review
across all DX examples. Wrap JSON bodies in {"input": {...}} envelope
to match the canonical AgentField control plane API.
Standardize DX pattern — show minimal curl command and example
review finding with evidence grounding near top of README.
- Standardize Quick Start to 3-step pattern (clone, env, docker compose)
- Delete Dockerfile.local, docker-compose.local.yml, docker-compose.hub.yml
- Delete vendored .docker-sdk/ directory (use PyPI agentfield instead)
- Update Dockerfile to install agentfield from PyPI
Railpack requires main.py or app.py at the project root to detect
the start command. Adds a thin wrapper that sets PYTHONPATH for the
src layout and delegates to pr_af.app:main.
AbirAbbas and others added 29 commits March 13, 2026 18:44
fix: provider-aware model routing for claude-code harness
…ider

intake_gate and coverage_gate use .ai() (OpenRouter), not .harness(), so
they need OpenRouter model IDs even when provider is claude-code. Split
tier resolution so .ai() fields always use OpenRouter defaults while
.harness() fields use the provider-aware map.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: resolve .ai() fields with OpenRouter tier map
The provider_env() passthrough list was missing CLAUDE_CODE_OAUTH_TOKEN,
so the Claude Code SDK subprocess never received the OAuth credentials.
Combined with ANTHROPIC_API_KEY being set to "UNUSED" on Railway, this
caused every .harness() call to silently fail authentication.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs a quick version check and auth test when provider is claude-code
so we can see the actual error in Railway logs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add hasCompletedOnboarding to Dockerfile for headless Claude CLI
- Filter out placeholder values (UNUSED) from provider env passthrough
  so ANTHROPIC_API_KEY=UNUSED doesn't override the valid OAuth token

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Claude CLI checks ANTHROPIC_API_KEY before CLAUDE_CODE_OAUTH_TOKEN.
If ANTHROPIC_API_KEY is set to a placeholder value like "UNUSED", the
CLI rejects it and never falls through to the OAuth token. Remove
invalid keys from os.environ at startup so the OAuth flow works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow callers to pass `provider` ("opencode", "claude-code", etc.) on
each review request instead of requiring a redeploy to switch harnesses.
The env var PR_AF_PROVIDER remains the default when no provider is
specified in the request.

Threads the provider from the API surface through ReviewConfig,
orchestrator, and into every .harness() call where agentfield's
per-call provider override takes effect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rride

feat: per-request provider override for harness selection
Railway deployment investigation revealed 7 GB RSS and 2.7 GB disk waste
from accumulated Claude Code session data, V8 JIT .so files, and Python
heap fragmentation after reviews.

Memory fixes:
- Set MALLOC_TRIM_THRESHOLD_=0 so glibc returns freed pages to OS
- Call malloc_trim(0) via ctypes after each review completes
- Clear evidence_map and verification_map after use in review layer
- Only extract evidence for new gap findings in coverage loop (was
  re-extracting for ALL accumulated findings each iteration)
- Explicit del of gap_evidence between coverage iterations

Disk fixes:
- Snapshot Claude session dirs before review, delete only new ones after
  (safe for concurrent reviews — each cleans only its own sessions)
- Clean empty pyright-* temp dirs from /tmp after reviews
- Remove orphaned V8 .so files older than 60s from /tmp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: reduce memory and disk usage from harness subprocess artifacts
The review reasoner had hard-coded 300s wall-clock and $2 cost caps that
fire as `BudgetExhaustedError("Budget exhausted before meta-selectors")`
on real-world PRs (the docs say a 500-line PR takes 35–50min — 300s
never had a chance). Callers like github-buddy intentionally don't
thread budget overrides through `app.call`, so deployments had no way
to retune without a code change.

Read the defaults from env (PR_AF_NO_BUDGET, PR_AF_MAX_DURATION_SECONDS,
PR_AF_MAX_COST_USD) at module load. Hard-coded fallbacks preserve the
prior behavior for callers that don't set the env. Setting
PR_AF_NO_BUDGET=true on a deployment disables enforcement entirely
(the existing `no_budget` short-circuit in BudgetConfig).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(budget): env-driven defaults for review reasoner caps
The `_default_tier_map("opencode")` function previously hardcoded the
budget tier to gemini-2.5-flash, ignoring PR_AF_MODEL even though the
mid and premium tiers honored it. This caused the intake_gate,
coverage_gate, and dedup_gate to silently run on gemini when callers
expected their PR_AF_MODEL setting to apply everywhere. Budget now
falls back to ai_model like the other tiers; PR_AF_MODEL_BUDGET
remains available for callers who want to keep gates on a cheaper
model.

Also updates the stale fallback defaults (minimax-m2.5 / gemini-2.5-flash)
to openrouter/moonshotai/kimi-k2.5, matching the docker-compose env
override and what deployments actually use. .env.example documents
the new defaults and notes that PR_AF_MODEL covers all tiers unless
individually overridden.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(config): respect PR_AF_MODEL across all tiers; default to kimi-k2.5
…dimension fan-out

Production data on a recent PR showed 8 review_dimensions running ~25
min each but throttled to a concurrency=3 semaphore — wall-clock time
was ~3× per-dimension cost (≥75 min for the review phase alone) even
though we were paying nothing during the wait. This was the single
biggest reason `review` runs were hitting the 7200s caller timeout in
github-buddy.

Two changes:
1. Default raised 3 → 10. With 6–8 dimensions per typical PR, all of
   them can now run in parallel and the phase is bounded by the
   slowest single dimension instead of the semaphore.
2. Surfaced the value as `PR_AF_MAX_CONCURRENT_REVIEWERS` so a
   deployment can dial it back if its provider hits rate limits. The
   stagger_delay_seconds (2s) is unchanged and remains the first line
   of defense against burst-rate-limit hits.

The previous comment on this field cited "OpenRouter or other
rate-limited providers" — at 10 concurrent on Kimi K2.5 we're well
within OpenRouter's per-key limits, and any provider that can't sustain
that for a single review session would need configuration anyway.

Tests: test_budget_config_defaults updated to pin the new default and
to clear PR_AF_MAX_CONCURRENT_REVIEWERS so the assertion reflects the
in-code default rather than the runtime env override. Full suite
green except for one pre-existing test_cost_tracker failure on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iewers

perf(review): bump max_concurrent_reviewers 3 → 10 to unblock review_dimension fan-out
…the prompt

The meta_semantic / meta_mechanical / meta_systemic phases were burning
15–40 minutes apiece in production, even though Kimi K2.6 is fast
per-turn. Profiling traced this back to the prompts themselves: each
phase had an "Investigation Protocol" section that *mandated* browsing
the actual source files, and a "Quality Gate" that *rejected*
diff-only answers ("If your rationale says 'visible in the diff' or
'based on the patches', you have not investigated enough").

That language turned every phase into a 20–40 turn opencode session
of file reads, when the diff patches plus the intake/anatomy summary
already contain almost everything these phases need to produce
dimensions. The model wasn't slow — it was being instructed to do
expensive exploration before it was allowed to answer.

Replaced the mandate with neutral framing:

  - "Working with Context": diff + intake/anatomy is usually enough.
    Repo access is available; use it sparingly, only when the diff
    genuinely doesn't show what you need.
  - "Quality Bar": dimensions must be specific (named files, line
    ranges). The diff already carries those; a dimension grounded in
    the diff is good. Vague dimensions are still rejected.

Each lens keeps its specific bullets (semantic = logic/contracts,
mechanical = signatures/imports, systemic = patterns/coverage) and
its target_files requirement. The output schema is unchanged. What
changes is the model is no longer penalized for producing a correct
answer from the diff in 2 turns instead of 30.

Also bumped the in-code default model openrouter/moonshotai/kimi-k2.5
→ k2.6 to match what's actually deployed in production. Local dev was
silently running on an older model than prod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hen-to-browse

perf(meta-phases): let the model decide when to browse, not the prompt
Picks up the cooperative cancellation feature in the AgentField SDK
(per-execution cancel callback at /_internal/executions/<id>/cancel,
AbortController/asyncio-task wiring in the worker runtime). With this
floor, a control-plane cancel-tree on a stuck pr-af run will surface
as CancelledError inside the reasoner phase that's currently running,
short-circuiting the in-flight LLM call rather than letting it run to
completion and discarding the output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore(deps): bump agentfield floor to >=0.1.77
Adds the two-env-var knob (OPENCODE_ENABLE_EXA=1 + EXA_API_KEY) to
.env.example and a brief subsection in the Quick Start of README.

When both vars are set, opencode's built-in websearch / webfetch tools
become available to every review reasoner running through the open
runtime — model decides per-task whether the lookup is worth doing
(verify API contracts, check CVE/deprecation status, confirm library
version behavior). No PR-AF wiring is required beyond setting the vars
on the deployment; opencode inherits parent env from agentfield's
run_cli, and tool awareness is communicated to the model via the
standard tool-definition layer (verified: model invokes websearch
autonomously when the task warrants it).

No prompt-level guardrails added. Review reasoners are short / single-
shot / analytic — no loop-risk where unrestrained search could
rabbit-hole. SWE-AF's run_coder is the exception (long iterative
coding loops); that gets a restraint snippet in its own PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: document opencode web-search opt-in for review reasoners
The baked-in opencode.json hardcoded both model and small_model to a
literal kimi-k2.5 string, so OpenCode's small_model path (used for
session titles and prompt compaction) silently bypassed whatever
HARNESS_MODEL the deployer set on Railway. Per-call -m on `opencode run`
already pins the main model from HARNESS_MODEL via the Python harness
provider; this aligns the small_model path the same way.

OpenCode supports {env:VAR} interpolation in string fields. Switch
model and small_model to {env:HARNESS_MODEL} so they track the env var
the rest of the pr-af stack already reads. The Dockerfile already sets
HARNESS_MODEL=openrouter/moonshotai/kimi-k2.5 as the image default, so
fresh containers without overrides still have a value to interpolate.

Also drops the now-redundant `models` registration block — OpenCode
auto-handles unknown model names from a configured provider, and pinning
a specific name there broke the moment HARNESS_MODEL changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…model

fix(opencode): respect HARNESS_MODEL in baked-in opencode.json (small_model path)
Reviewers produced findings that contradicted explicit design rationale
from the PR description because the rationale never reached them.

- Intake gate / intake fallback / anatomy were truncating the description
  to 500 / 1000 / 500 chars. Thoughtful PR bodies with rationale past those
  cutoffs were silently chopped before any model saw them.
- review_dimension only received digested summaries (pr_narrative,
  risk_surfaces, intake_summary) — never the raw description, so the
  author's voice was laundered through two summarization layers before
  the reviewer looked at code.
- The anatomy prompt also tells the model to discount the description
  ("what the CODE says, not what the PR description says"), which is
  correct for divergence detection but leaves the reviewer with no
  channel for author intent.

Bump description truncation in intake/anatomy to 4000 chars uniformly,
add a pr_description param to review_dimension with a dedicated
"Author's Stated Intent" section near the top of the prompt, and wire
self.pr_data.description through _run_parallel_review.

The new section is explicitly NOT "trust the author" — it tells the
reviewer to verify the code regardless, but to rebut the author's
stated reasoning on its merits when a finding contradicts an
explicitly-justified design choice, rather than flagging as if the
rationale wasn't given. Findings on points the description is silent
about are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants