Require evidence in roasted PR reviews by neubig · Pull Request #103 · OpenHands/extensions

neubig · 2026-03-08T15:40:50Z

Summary

add an optional require-evidence input to the PR review action
pass the setting through the review agent prompt so reviewers can require an Evidence section in PR descriptions
tighten the requirement so test output alone does not count; evidence must come from running the real code path end-to-end
update codereview-roasted guidance and prompt tests to match the stricter rule

Fixes #102.

Testing

uv run --group test pytest

Evidence

Agent conversation: https://app.all-hands.dev/conversations/6960ab2f3b8a4ee792c365e546644b72
End-to-end runtime evidence for the changed code path (agent_script.py reading REQUIRE_EVIDENCE=true and passing it into format_prompt()):

$ uv run --with openhands-sdk --with openhands-tools --with lmnr python - <<'PY'
import importlib.util
import os
import sys
from pathlib import Path

repo = Path.cwd()
scripts_dir = repo / 'plugins' / 'pr-review' / 'scripts'
sys.path.insert(0, str(scripts_dir))

spec = importlib.util.spec_from_file_location('pr_review_agent_script', scripts_dir / 'agent_script.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

os.environ.update({
    'LLM_API_KEY': 'demo-key',
    'GITHUB_TOKEN': 'demo-token',
    'PR_NUMBER': '102',
    'PR_TITLE': 'Require evidence in roasted PR reviews',
    'PR_BODY': '## Summary\nAdds an evidence requirement.',
    'PR_BASE_BRANCH': 'main',
    'PR_HEAD_BRANCH': 'openhands/issue-102-require-evidence',
    'REPO_NAME': 'OpenHands/extensions',
    'REVIEW_STYLE': 'roasted',
    'REQUIRE_EVIDENCE': 'true',
})

config = module.validate_environment()
prompt = module.format_prompt(
    skill_trigger='/codereview-roasted',
    title=config['pr_info']['title'],
    body=config['pr_info']['body'],
    repo_name=config['pr_info']['repo_name'],
    base_branch=config['pr_info']['base_branch'],
    head_branch=config['pr_info']['head_branch'],
    pr_number=config['pr_info']['number'],
    commit_id='abc123',
    diff='diff --git a/plugins/pr-review/scripts/prompt.py b/plugins/pr-review/scripts/prompt.py',
    review_context='',
    require_evidence=config['require_evidence'],
)

print('require_evidence:', config['require_evidence'])
print('review_style:', config['review_style'])
print('contains Evidence section:', '## PR Description Evidence Requirement' in prompt)
print('contains test-output rejection:', 'Tests alone do **not** count as evidence.' in prompt)
print('contains end-to-end wording:', 'real code path end-to-end' in prompt)
print('contains conversation link guidance:', 'https://app.all-hands.dev/conversations/{conversation_id}' in prompt)
PY
require_evidence: True
review_style: roasted
contains Evidence section: True
contains test-output rejection: True
contains end-to-end wording: True
contains conversation link guidance: True

Extract from the generated prompt section:

## PR Description Evidence Requirement

Active review instruction: require the PR description to include an `Evidence` section (or similarly labeled section) showing that the code actually works.

When checking the PR description:
- For frontend or UI changes, require a screenshot or video that demonstrates the implemented behavior in the actual product.
- For backend, API, CLI, or script changes, require the command(s) used to run or exercise the real code path end-to-end and the resulting output.
- Tests alone do **not** count as evidence. Do not accept `pytest`, unit test output, or similar test runs as the only proof that the change works.
- If the change appears to come from an agent conversation or AI-assisted workflow, prefer a conversation link such as `https://app.all-hands.dev/conversations/{conversation_id}` so reviewers can trace the work.
- Do not accept vague claims like "tested locally" without concrete runtime artifacts, commands, or output.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-03-08T15:45:59Z

Updated the implementation to make the rule stricter: test output alone no longer counts as PR evidence.

I also ran the changed code path directly and captured runtime evidence for this PR.

Command run:

uv run --with openhands-sdk --with openhands-tools --with lmnr python - <<'PY'
import importlib.util
import os
import sys
from pathlib import Path

repo = Path.cwd()
scripts_dir = repo / 'plugins' / 'pr-review' / 'scripts'
sys.path.insert(0, str(scripts_dir))

spec = importlib.util.spec_from_file_location('pr_review_agent_script', scripts_dir / 'agent_script.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

os.environ.update({
    'LLM_API_KEY': 'demo-key',
    'GITHUB_TOKEN': 'demo-token',
    'PR_NUMBER': '102',
    'PR_TITLE': 'Require evidence in roasted PR reviews',
    'PR_BODY': '## Summary\nAdds an evidence requirement.',
    'PR_BASE_BRANCH': 'main',
    'PR_HEAD_BRANCH': 'openhands/issue-102-require-evidence',
    'REPO_NAME': 'OpenHands/extensions',
    'REVIEW_STYLE': 'roasted',
    'REQUIRE_EVIDENCE': 'true',
})

config = module.validate_environment()
prompt = module.format_prompt(
    skill_trigger='/codereview-roasted',
    title=config['pr_info']['title'],
    body=config['pr_info']['body'],
    repo_name=config['pr_info']['repo_name'],
    base_branch=config['pr_info']['base_branch'],
    head_branch=config['pr_info']['head_branch'],
    pr_number=config['pr_info']['number'],
    commit_id='abc123',
    diff='diff --git a/plugins/pr-review/scripts/prompt.py b/plugins/pr-review/scripts/prompt.py',
    review_context='',
    require_evidence=config['require_evidence'],
)

print('require_evidence:', config['require_evidence'])
print('review_style:', config['review_style'])
print('contains Evidence section:', '## PR Description Evidence Requirement' in prompt)
print('contains test-output rejection:', 'Tests alone do **not** count as evidence.' in prompt)
print('contains end-to-end wording:', 'real code path end-to-end' in prompt)
print('contains conversation link guidance:', 'https://app.all-hands.dev/conversations/{conversation_id}' in prompt)
PY

Observed output:

require_evidence: True
review_style: roasted
contains Evidence section: True
contains test-output rejection: True
contains end-to-end wording: True
contains conversation link guidance: True

This exercises the actual setting flow through repository code:

agent_script.py reads REQUIRE_EVIDENCE=true
validate_environment() returns require_evidence=True
format_prompt() injects the PR Description Evidence Requirement section
the generated prompt explicitly rejects pytest / test-only evidence

all-hands-bot

🟢 Good taste - Clean, pragmatic implementation that solves a real problem.

VERDICT: ✅ Merge it

KEY INSIGHT: This is exactly the kind of straightforward, non-breaking feature addition that demonstrates good engineering judgment - optional flag, sensible default, clear data flow from env → config → prompt, and thorough evidence in the PR description proving it actually works.

plugins/pr-review/scripts/prompt.py

enyst · 2026-03-08T16:41:19Z

skills/codereview-roasted/SKILL.md

+- For frontend/UI changes: a screenshot or video demonstrating the implemented behavior in the real product
+- For backend, API, CLI, or script changes: the exact command(s) used to run the real code path end-to-end and the resulting output
+- Tests alone do not count as evidence; reject `pytest`, unit test output, or similar test runs when they are the only proof provided
+- For agent-generated work when available: a link back to the originating conversation, e.g. `https://app.all-hands.dev/conversations/{conversation_id}`


Can we maybe leave this out? I'm not sure the agent knows where it originated from 😆

On a side note, maybe it would be (really) nice to give the link of the public share, now that the cloud has public share. But I don't know how the agent finds that out or if we can create public share without the user intently creating it... 🤔

e.g. https://app.all-hands.dev/shared/conversations/bfca083e57ce4d748a489ff4fb703637

enyst

Thank you for this! I always do this manually, asking the agent to "live test" or run examples itself, and sometimes upload its results in .pr/ directory for human inspection.

(by the way, maybe we could mention examples/ directory... or not in the main extensions content, in the repo instructions I assume)

Co-authored-by: Engel Nyst <engel.nyst@gmail.com>

skills/codereview-roasted/SKILL.md

neubig · 2026-03-08T16:47:58Z

@OpenHands check and fix failing tests (see the conversation history on this issue for reference as necessary)

openhands-ai · 2026-03-08T16:48:13Z

I'm on it! neubig can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-03-08T16:50:30Z

Final update:

Fixed the failing test by making the prompt wording in plugins/pr-review/scripts/prompt.py match the stricter expectation:
- from Unit tests alone do **not** count as evidence.
- to Tests alone do **not** count as evidence.

Checklist:

Reproduced the failure from tests/test_pr_review_prompt.py
Applied a minimal, targeted fix
Kept changes concise; no extraneous file changes were kept
Verified the targeted test and full suite pass (14 passed)
Committed and pushed the update to the PR branch
Followed instructions not to leave a PR comment

Git:

Commit: 850d586 — fix: align evidence prompt wording

So the request has been fully addressed, the change was minimal, and the branch has already been updated remotely.

Co-authored-by: openhands <openhands@all-hands.dev>

Add optional PR evidence requirement

ba1ba47

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot mentioned this pull request Mar 8, 2026

Require evidence that the code works for PR review #102

Closed

Require end-to-end PR evidence

6855ed0

Co-authored-by: openhands <openhands@all-hands.dev>

neubig marked this pull request as ready for review March 8, 2026 16:09

neubig requested a review from enyst March 8, 2026 16:10

all-hands-bot approved these changes Mar 8, 2026

View reviewed changes

enyst reviewed Mar 8, 2026

View reviewed changes

plugins/pr-review/scripts/prompt.py Outdated Show resolved Hide resolved

enyst reviewed Mar 8, 2026

View reviewed changes

plugins/pr-review/scripts/prompt.py Outdated Show resolved Hide resolved

enyst reviewed Mar 8, 2026

View reviewed changes

enyst approved these changes Mar 8, 2026

View reviewed changes

Apply suggestions from code review

e88fb98

Co-authored-by: Engel Nyst <engel.nyst@gmail.com>

neubig commented Mar 8, 2026

View reviewed changes

skills/codereview-roasted/SKILL.md Show resolved Hide resolved

Update README.md

7e2e3f0

fix: align evidence prompt wording

850d586

Co-authored-by: openhands <openhands@all-hands.dev>

test: simplify prompt evidence assertions

817b1f2

Co-authored-by: openhands <openhands@all-hands.dev>

neubig merged commit e1c0496 into main Mar 8, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Require evidence in roasted PR reviews#103

Require evidence in roasted PR reviews#103
neubig merged 6 commits intomainfrom
openhands/issue-102-require-evidence

neubig commented Mar 8, 2026 •

edited

Loading

Uh oh!

neubig commented Mar 8, 2026

Uh oh!

all-hands-bot left a comment

Uh oh!

Uh oh!

Uh oh!

enyst Mar 8, 2026

Uh oh!

enyst left a comment

Uh oh!

Uh oh!

neubig commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

neubig commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Evidence

Uh oh!

neubig commented Mar 8, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

enyst Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

enyst left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

neubig commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

openhands-ai bot commented Mar 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

neubig commented Mar 8, 2026 •

edited

Loading