Commit 5c54e35

Refactors BB dataset, eval + grading scripts (#30)
* draft: updated with evaluation_mode
* black
* refactoring: grading
* fixes: pre-commit
* ruff RUF035 -> S704
* fix: tests
* update: linting to be consistent with pre-commit
* minor fixes: yml files
* rm codespell from linting
* rm validate-pyproject
* fix: test asyncio
* updated readme
* pre-commit + lint
* rm redundant linting
1 parent 390524a commit 5c54e35

22 files changed: +815 −494 lines

.github/workflows/tests.yml

Lines changed: 7 additions & 21 deletions
@@ -6,33 +6,19 @@ on:
   pull_request:
 
 jobs:
-  lint:
+  pre-commit:
     runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: ["3.12"]
-
     steps:
       - name: Check out Git repository
        uses: actions/checkout@v4
-      - name: Set up Python ${{ matrix.python-version }}
+
+      - name: Set up Python
        uses: actions/setup-python@v5
        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install ruff pytest numpy setuptools>=66 wheel>=0.36 build
-          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
-          if [ -f pyproject.toml ]; then pip install -e .; fi
-
-      - name: Run Lint
-        run: |
-          # Check for linting issues
-          ruff check .
-          # Check for formatting issues (will fail if code needs formatting)
-          ruff format --check .
+          python-version: "3.12"
 
+      - name: Run pre-commit
+        uses: pre-commit/[email protected]
   test:
     runs-on: ubuntu-latest
     strategy:
@@ -50,7 +36,7 @@ jobs:
       - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
-          pip install ruff pytest numpy setuptools>=66 wheel>=0.36 build
+          pip install ruff pytest pytest-asyncio numpy setuptools>=66 wheel>=0.36 build
          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          if [ -f pyproject.toml ]; then pip install -e .; fi
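The newly installed `pytest-asyncio` pairs with the commit's "fix: test asyncio" change: it lets the suite define tests as coroutines. A minimal sketch of such a test; the toy coroutine is illustrative, not code from this repository:

```python
import asyncio

import pytest


@pytest.mark.asyncio
async def test_gather_collects_all_results():
    # Illustrative coroutine only; the repo's real async tests
    # exercise its own rollout and grading code instead.
    async def double(x: int) -> int:
        await asyncio.sleep(0)
        return 2 * x

    results = await asyncio.gather(*[double(i) for i in range(3)])
    assert results == [0, 2, 4]
```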

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -42,6 +42,7 @@ repos:
     hooks:
       - id: codespell
         additional_dependencies: [".[toml]"]
+        args: [--skip=".git"]
   - repo: https://github.com/henryiii/validate-pyproject-schema-store
     rev: 2025.02.24
     hooks:
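The added hook argument is passed straight through to the codespell CLI, so the `.git` directory is excluded from spell-checking. Outside of pre-commit, the equivalent direct invocation looks roughly like this sketch (assumes `codespell` is installed and on PATH):

```python
import subprocess

# Mirrors the hook configuration above: spell-check the current tree
# while skipping the .git directory.
subprocess.run(["codespell", "--skip=.git", "."], check=False)
```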

README.md

Lines changed: 18 additions & 1 deletion
@@ -140,7 +140,7 @@ The script will save the evaluation dataframe as a CSV file in the `bixbench_res
 
 ## Zero-shot Evaluations & Grading
 
-You can run zero-shot evaluations using the `run_zeroshot_evals.py` script and then automatically grade the responses using the `grade_outputs.py` script. This code:
+You can run zero-shot evaluations using the `generate_zeroshot_evals.py` script and then grade the responses using the `grade_outputs.py` script. These two scripts:
 
 1. Loads the BixBench dataset from Hugging Face
 2. Evaluates the LLM on the dataset, outputting a CSV file with the results
@@ -149,6 +149,23 @@ You can run zero-shot evaluations using the `run_zeroshot_evals.py` script and t
 
 The scripts can be configured to run with open-ended questions, multiple-choice questions (with or without a refusal option), different models, and different temperatures. To explore the different options, run the scripts with the `--help` flag.
 
+**Example: Generate zero-shot answers in MCQ setting with the "refusal option" (in addition to the original distractors)**
+
+```bash
+python generate_zeroshot_evals.py \
+    --answer-mode "mcq" \
+    --model "gpt-4o" \
+    --with-refusal
+```
+
+**Example: Grade the zero-shot answers from the previous step**
+
+```bash
+python grade_outputs.py \
+    --input-file path/to/zeroshot.csv \
+    --answer-mode "mcq"
+```
+
 ## Replicating the BixBench Paper Results
 
 To replicate the BixBench paper results for agentic evaluations, you can download the raw data from 2,120 trajectories and its respective postprocessed evaluation dataframe:
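Step 1 in the README's list loads the benchmark from Hugging Face. A minimal sketch of what that step looks like with the `datasets` library; the dataset identifier and split here are assumptions rather than values taken from this diff:

```python
from datasets import load_dataset

# "futurehouse/BixBench" and the "train" split are assumed identifiers;
# check the repository's configuration for the exact values.
dataset = load_dataset("futurehouse/BixBench", split="train")
print(f"{len(dataset)} examples; first record keys: {list(dataset[0].keys())}")
```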

bixbench/__init__.py

Lines changed: 14 additions & 6 deletions
@@ -1,23 +1,31 @@
-from .graders import compute_metrics, grade_mcq_answer, grade_open_ended_answer
+from .graders import GradeAnswer, MCQGrader, OpenEndedGrader
 from .prompts import (
     MCQ_PROMPT_TEMPLATE_WITH_REFUSAL,
     MCQ_PROMPT_TEMPLATE_WITHOUT_REFUSAL,
     OPEN_ENDED_PROMPT_TEMPLATE,
 )
-from .utils import AgentInput, EvalMode, LLMConfig, parse_response, randomize_choices
+from .utils import (
+    AnswerMode,
+    LLMConfig,
+    Query,
+    compute_metrics,
+    parse_response,
+    randomize_choices,
+)
 from .zero_shot import ZeroshotBaseline
 
 __all__ = [
     "MCQ_PROMPT_TEMPLATE_WITHOUT_REFUSAL",
     "MCQ_PROMPT_TEMPLATE_WITH_REFUSAL",
     "OPEN_ENDED_PROMPT_TEMPLATE",
-    "AgentInput",
-    "EvalMode",
+    "AnswerMode",
+    "GradeAnswer",
     "LLMConfig",
+    "MCQGrader",
+    "OpenEndedGrader",
+    "Query",
     "ZeroshotBaseline",
     "compute_metrics",
-    "grade_mcq_answer",
-    "grade_open_ended_answer",
     "parse_response",
     "randomize_choices",
 ]
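For downstream users, the practical effect of this diff is a renamed public API. A sketch of the import migration, using only names visible in the diff above; the old-to-new correspondences in the comments are inferred from the renames, not stated in the diff:

```python
# Before this commit (names removed from bixbench/__init__.py):
# from bixbench import AgentInput, EvalMode, grade_mcq_answer, grade_open_ended_answer

# After this commit, per the updated __all__ above:
from bixbench import (
    AnswerMode,       # appears to replace EvalMode
    GradeAnswer,
    LLMConfig,
    MCQGrader,        # class-based successor to grade_mcq_answer
    OpenEndedGrader,  # class-based successor to grade_open_ended_answer
    Query,            # appears to replace AgentInput
    ZeroshotBaseline,
    compute_metrics,  # now re-exported from .utils rather than .graders
)
```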

bixbench/generate_trajectories.py

Lines changed: 14 additions & 9 deletions
@@ -240,9 +240,9 @@ def environment_factory(self, capsule: dict[str, Any]) -> DataAnalysisEnv:
             load_mcq(i, open_question=True, question_id=i["id"]) for i in raw_questions
         ]
         problem = self.config.base_prompt.format(
-            questions="\n-------\n".join([
-                i.question_prompt for i in processed_questions
-            ])
+            questions="\n-------\n".join(
+                [i.question_prompt for i in processed_questions]
+            )
         )
         answer = {i.question_id: i.ideal_answer for i in processed_questions}
         work_dir = (self.config.local_workspace_dir / capsule["uuid"]).absolute()
@@ -369,9 +369,12 @@ async def batch_rollout(
         agent = self.config.agent_config.construct_agent()
         rollout_manager = getattr(self, f"{self.config.rollout.rollout_type}_rollout")
 
-        return await asyncio.gather(*[
-            rollout_manager(agent, environment) for environment in list_of_environments
-        ])
+        return await asyncio.gather(
+            *[
+                rollout_manager(agent, environment)
+                for environment in list_of_environments
+            ]
+        )
 
     async def run(self) -> None:
         """Run the full trajectory generation pipeline."""
@@ -393,9 +396,11 @@ async def run(self) -> None:
 
             # Update progress bar
             pbar.update(len(batch))
-            pbar.set_postfix({
-                "batch": f"{i // self.config.rollout.batch_size + 1}/{total_batches}"
-            })
+            pbar.set_postfix(
+                {
+                    "batch": f"{i // self.config.rollout.batch_size + 1}/{total_batches}"
+                }
+            )
 
 
 if __name__ == "__main__":
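The `batch_rollout` change is purely a formatting one, but the underlying pattern is worth seeing on its own: unpacking a list comprehension of coroutines into `asyncio.gather` runs one rollout per environment concurrently. A self-contained sketch with stand-in names; `fake_rollout` is illustrative, not the repository's API:

```python
import asyncio


async def fake_rollout(agent: str, environment: str) -> str:
    # Stand-in for rollout_manager(agent, environment) in the diff above.
    await asyncio.sleep(0.01)
    return f"{agent} finished {environment}"


async def main() -> None:
    environments = ["env_a", "env_b", "env_c"]
    # Same shape as the refactored call: one coroutine per environment,
    # unpacked into a single gather so they all run concurrently.
    results = await asyncio.gather(
        *[fake_rollout("agent", env) for env in environments]
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```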
