Skip to content

Require evidence in roasted PR reviews#103

Merged
neubig merged 6 commits intomainfrom
openhands/issue-102-require-evidence
Mar 8, 2026
Merged

Require evidence in roasted PR reviews#103
neubig merged 6 commits intomainfrom
openhands/issue-102-require-evidence

Conversation

@neubig
Copy link
Contributor

@neubig neubig commented Mar 8, 2026

Summary

  • add an optional require-evidence input to the PR review action
  • pass the setting through the review agent prompt so reviewers can require an Evidence section in PR descriptions
  • tighten the requirement so test output alone does not count; evidence must come from running the real code path end-to-end
  • update codereview-roasted guidance and prompt tests to match the stricter rule

Fixes #102.

Testing

  • uv run --group test pytest

Evidence

$ uv run --with openhands-sdk --with openhands-tools --with lmnr python - <<'PY'
import importlib.util
import os
import sys
from pathlib import Path

repo = Path.cwd()
scripts_dir = repo / 'plugins' / 'pr-review' / 'scripts'
sys.path.insert(0, str(scripts_dir))

spec = importlib.util.spec_from_file_location('pr_review_agent_script', scripts_dir / 'agent_script.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

os.environ.update({
    'LLM_API_KEY': 'demo-key',
    'GITHUB_TOKEN': 'demo-token',
    'PR_NUMBER': '102',
    'PR_TITLE': 'Require evidence in roasted PR reviews',
    'PR_BODY': '## Summary\nAdds an evidence requirement.',
    'PR_BASE_BRANCH': 'main',
    'PR_HEAD_BRANCH': 'openhands/issue-102-require-evidence',
    'REPO_NAME': 'OpenHands/extensions',
    'REVIEW_STYLE': 'roasted',
    'REQUIRE_EVIDENCE': 'true',
})

config = module.validate_environment()
prompt = module.format_prompt(
    skill_trigger='/codereview-roasted',
    title=config['pr_info']['title'],
    body=config['pr_info']['body'],
    repo_name=config['pr_info']['repo_name'],
    base_branch=config['pr_info']['base_branch'],
    head_branch=config['pr_info']['head_branch'],
    pr_number=config['pr_info']['number'],
    commit_id='abc123',
    diff='diff --git a/plugins/pr-review/scripts/prompt.py b/plugins/pr-review/scripts/prompt.py',
    review_context='',
    require_evidence=config['require_evidence'],
)

print('require_evidence:', config['require_evidence'])
print('review_style:', config['review_style'])
print('contains Evidence section:', '## PR Description Evidence Requirement' in prompt)
print('contains test-output rejection:', 'Tests alone do **not** count as evidence.' in prompt)
print('contains end-to-end wording:', 'real code path end-to-end' in prompt)
print('contains conversation link guidance:', 'https://app.all-hands.dev/conversations/{conversation_id}' in prompt)
PY
require_evidence: True
review_style: roasted
contains Evidence section: True
contains test-output rejection: True
contains end-to-end wording: True
contains conversation link guidance: True
  • Extract from the generated prompt section:
## PR Description Evidence Requirement

Active review instruction: require the PR description to include an `Evidence` section (or similarly labeled section) showing that the code actually works.

When checking the PR description:
- For frontend or UI changes, require a screenshot or video that demonstrates the implemented behavior in the actual product.
- For backend, API, CLI, or script changes, require the command(s) used to run or exercise the real code path end-to-end and the resulting output.
- Tests alone do **not** count as evidence. Do not accept `pytest`, unit test output, or similar test runs as the only proof that the change works.
- If the change appears to come from an agent conversation or AI-assisted workflow, prefer a conversation link such as `https://app.all-hands.dev/conversations/{conversation_id}` so reviewers can trace the work.
- Do not accept vague claims like "tested locally" without concrete runtime artifacts, commands, or output.

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Contributor Author

neubig commented Mar 8, 2026

Updated the implementation to make the rule stricter: test output alone no longer counts as PR evidence.

I also ran the changed code path directly and captured runtime evidence for this PR.

Command run:

uv run --with openhands-sdk --with openhands-tools --with lmnr python - <<'PY'
import importlib.util
import os
import sys
from pathlib import Path

repo = Path.cwd()
scripts_dir = repo / 'plugins' / 'pr-review' / 'scripts'
sys.path.insert(0, str(scripts_dir))

spec = importlib.util.spec_from_file_location('pr_review_agent_script', scripts_dir / 'agent_script.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

os.environ.update({
    'LLM_API_KEY': 'demo-key',
    'GITHUB_TOKEN': 'demo-token',
    'PR_NUMBER': '102',
    'PR_TITLE': 'Require evidence in roasted PR reviews',
    'PR_BODY': '## Summary\nAdds an evidence requirement.',
    'PR_BASE_BRANCH': 'main',
    'PR_HEAD_BRANCH': 'openhands/issue-102-require-evidence',
    'REPO_NAME': 'OpenHands/extensions',
    'REVIEW_STYLE': 'roasted',
    'REQUIRE_EVIDENCE': 'true',
})

config = module.validate_environment()
prompt = module.format_prompt(
    skill_trigger='/codereview-roasted',
    title=config['pr_info']['title'],
    body=config['pr_info']['body'],
    repo_name=config['pr_info']['repo_name'],
    base_branch=config['pr_info']['base_branch'],
    head_branch=config['pr_info']['head_branch'],
    pr_number=config['pr_info']['number'],
    commit_id='abc123',
    diff='diff --git a/plugins/pr-review/scripts/prompt.py b/plugins/pr-review/scripts/prompt.py',
    review_context='',
    require_evidence=config['require_evidence'],
)

print('require_evidence:', config['require_evidence'])
print('review_style:', config['review_style'])
print('contains Evidence section:', '## PR Description Evidence Requirement' in prompt)
print('contains test-output rejection:', 'Tests alone do **not** count as evidence.' in prompt)
print('contains end-to-end wording:', 'real code path end-to-end' in prompt)
print('contains conversation link guidance:', 'https://app.all-hands.dev/conversations/{conversation_id}' in prompt)
PY

Observed output:

require_evidence: True
review_style: roasted
contains Evidence section: True
contains test-output rejection: True
contains end-to-end wording: True
contains conversation link guidance: True

This exercises the actual setting flow through repository code:

  • agent_script.py reads REQUIRE_EVIDENCE=true
  • validate_environment() returns require_evidence=True
  • format_prompt() injects the PR Description Evidence Requirement section
  • the generated prompt explicitly rejects pytest / test-only evidence

@neubig neubig marked this pull request as ready for review March 8, 2026 16:09
@neubig neubig requested a review from enyst March 8, 2026 16:10
Copy link

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Clean, pragmatic implementation that solves a real problem.

VERDICT:Merge it

KEY INSIGHT: This is exactly the kind of straightforward, non-breaking feature addition that demonstrates good engineering judgment - optional flag, sensible default, clear data flow from env → config → prompt, and thorough evidence in the PR description proving it actually works.

- For frontend/UI changes: a screenshot or video demonstrating the implemented behavior in the real product
- For backend, API, CLI, or script changes: the exact command(s) used to run the real code path end-to-end and the resulting output
- Tests alone do not count as evidence; reject `pytest`, unit test output, or similar test runs when they are the only proof provided
- For agent-generated work when available: a link back to the originating conversation, e.g. `https://app.all-hands.dev/conversations/{conversation_id}`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe leave this out? I'm not sure the agent knows where it originated from 😆

On a side note, maybe it would be (really) nice to give the link of the public share, now that the cloud has public share. But I don't know how the agent finds that out or if we can create public share without the user intently creating it... 🤔

e.g. https://app.all-hands.dev/shared/conversations/bfca083e57ce4d748a489ff4fb703637

Copy link
Collaborator

@enyst enyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this! I always do this manually, asking the agent to "live test" or run examples itself, and sometimes upload its results in .pr/ directory for human inspection.

(by the way, maybe we could mention examples/ directory... or not in the main extensions content, in the repo instructions I assume)

Co-authored-by: Engel Nyst <engel.nyst@gmail.com>
@neubig
Copy link
Contributor Author

neubig commented Mar 8, 2026

@OpenHands check and fix failing tests (see the conversation history on this issue for reference as necessary)

@openhands-ai
Copy link

openhands-ai bot commented Mar 8, 2026

I'm on it! neubig can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link

openhands-ai bot commented Mar 8, 2026

Final update:

  • Fixed the failing test by making the prompt wording in plugins/pr-review/scripts/prompt.py match the stricter expectation:
    • from Unit tests alone do **not** count as evidence.
    • to Tests alone do **not** count as evidence.

Checklist:

  • Reproduced the failure from tests/test_pr_review_prompt.py
  • Applied a minimal, targeted fix
  • Kept changes concise; no extraneous file changes were kept
  • Verified the targeted test and full suite pass (14 passed)
  • Committed and pushed the update to the PR branch
  • Followed instructions not to leave a PR comment

Git:

  • Commit: 850d586fix: align evidence prompt wording

So the request has been fully addressed, the change was minimal, and the branch has already been updated remotely.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig merged commit e1c0496 into main Mar 8, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Require evidence that the code works for PR review

4 participants