
Conversation

@geemi725 (Collaborator) commented Jun 6, 2025

This PR refactors:

  • grading + eval scripts to work with the new field evaluation mode
  • unit tests
  • pre-commit / linting
  • README
  • renames run_zeroshot_evals.py -> generate_zeroshot_evals.py

@geemi725 geemi725 marked this pull request as ready for review June 11, 2025 22:41
@geemi725 geemi725 requested review from jonlaurent and ludomitch June 11, 2025 22:41
@geemi725 geemi725 self-assigned this Jun 11, 2025
@geemi725 geemi725 changed the title from "WIP: Update eval scripts with work with 'answer and grading modes'" to "Refactors BB dataset, eval + grading scripts" Jun 12, 2025
@jonlaurent (Collaborator)

I can't comment much on the code, but it looks good to me. Is the README still a TODO?

@geemi725 (Collaborator, Author)

@jonlaurent README is now updated.

unsure=unsure,
partial_match=False,
llm_match=False,
)
@ludomitch (Collaborator)

Do we need to update the open_ended_prompt_template to ensure it outputs things that can be evaluated by the exact-match grade_range_verifier? A bit scared the LLM's output may be correct but will still fail the grade_verifier method.
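
To illustrate the worry (a minimal sketch, not the repo's implementation — grade_range_verifier's real signature isn't shown here, and normalize_range is an invented helper):

import re

def grade_range_verifier(answer: str, target: str) -> bool:
    # Assumed behavior: exact string match against a canonical "low-high" range.
    return answer.strip() == target.strip()

def normalize_range(text: str) -> str:
    # Invented helper: map "1 to 5", "1 - 5", "between 1 and 5" onto "1-5".
    nums = re.findall(r"\d+(?:\.\d+)?", text)
    return "-".join(nums[:2]) if len(nums) >= 2 else text.strip()

# A numerically correct but differently formatted LLM answer fails exact match...
assert grade_range_verifier("1 to 5", "1-5") is False
# ...while normalizing first recovers the match:
assert grade_range_verifier(normalize_range("1 to 5"), "1-5") is True

Tightening the prompt template to request the canonical format, or normalizing answers before comparison, would be two ways to address this.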

@geemi725 (Collaborator, Author)

@ludomitch yes, feel free to suggest changes here

@geemi725 geemi725 merged commit 5c54e35 into main Jun 12, 2025
2 checks passed