DR-OPIC (Domain-Routed On-Policy Iterative Correction) is a runnable Python framework for experiments with coding small language models (SLMs).
It does three concrete things:
- Runs student coding attempts against executable tests.
- Builds verified repair/delta/preference training records from real failures.
- Reports the metrics that matter for a coding SLM: greedy@1, coverage@K, selected@K, selector_gap, and repair@1.
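This README does not spell out the metric definitions. The following is a minimal sketch that assumes common conventions: greedy@1 is the fraction of tasks whose greedy attempt passes, coverage@K is the fraction where any of K samples passes, selected@K is the fraction where the selector's pick passes, selector_gap is coverage@K minus selected@K, and repair@1 is the fraction of greedy failures fixed by one repair attempt. The TaskResult record and coding_metrics function are hypothetical and are not the dr_opic.maths API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    greedy_passed: bool
    sample_passed: Sequence[bool]         # outcomes of K sampled attempts
    selected_index: int                   # index of the sample the selector picked
    repair_passed: Optional[bool] = None  # one repair attempt after a greedy failure


def coding_metrics(results: Sequence[TaskResult]) -> dict:
    """Aggregate per-task outcomes into the five metrics (assumed definitions)."""
    n = len(results)
    if n == 0:
        raise ValueError("no results")
    greedy = sum(r.greedy_passed for r in results) / n
    coverage = sum(any(r.sample_passed) for r in results) / n
    selected = sum(r.sample_passed[r.selected_index] for r in results) / n
    # repair@1 only counts tasks where greedy failed and a repair was attempted.
    repairs = [r.repair_passed for r in results
               if not r.greedy_passed and r.repair_passed is not None]
    repair = sum(repairs) / len(repairs) if repairs else float("nan")
    return {
        "greedy@1": greedy,
        "coverage@K": coverage,
        "selected@K": selected,
        "selector_gap": coverage - selected,  # passes the selector left on the table
        "repair@1": repair,
    }


# Three made-up tasks with K = 3 samples each.
results = [
    TaskResult(True, [True, False, True], 0),
    TaskResult(False, [False, True, False], 0, repair_passed=True),
    TaskResult(False, [False, False, False], 2, repair_passed=False),
]
metrics = coding_metrics(results)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A positive selector_gap means at least one passing sample existed but the selector chose a failing one, which is why it is reported separately from coverage@K.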
No private datasets, PDFs, model weights, or Kaggle outputs are included in this repo.
From the repo root:

```
python -m pip install -e ".[dev]"
```

Run the test suite:

```
python -m pytest -q
```

Expected result:

```
7 passed
```
The demo uses one toy Python task. The first student answer fails, the second passes, a repair candidate is verified, and DR-OPIC emits rollout, repair, ZPD, advantage, winner, and delta-span records.
```
python -m dr_opic.cli forge-demo
```

You can also use the installed console command:

```
dr-opic forge-demo
```

Write a stable artifact bundle:

```
python -m dr_opic.cli --output outputs\demo forge-demo
```

This creates:

- round_summary.json
- student_rollouts.jsonl
- verified_repairs.jsonl
- learnable_winner.json
- delta_spans.json
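The delta_spans.json file records where a failed attempt and its fix differ. Purely as an illustration, Python's standard difflib.SequenceMatcher can recover line-level delta spans. The span granularity, the field names, the function name line_delta_spans, and the two example programs are all assumptions; this is not the format or algorithm of dr_opic.delta.

```python
import difflib


def line_delta_spans(failed: str, fixed: str) -> list[dict]:
    """Return the line ranges (0-based, end-exclusive) that differ between two versions."""
    a = failed.splitlines()
    b = fixed.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines are context, not delta
        spans.append({
            "op": tag,                 # "replace", "delete" or "insert"
            "failed_lines": [i1, i2],
            "fixed_lines": [j1, j2],
            "failed_text": a[i1:i2],
            "fixed_text": b[j1:j2],
        })
    return spans


# Invented failed/fixed pair for the reverse_words task.
failed = "def reverse_words(s):\n    return s[::-1]\n"
fixed = "def reverse_words(s):\n    return ' '.join(reversed(s.split()))\n"
for span in line_delta_spans(failed, fixed):
    print(span)
```

Restricting training signal to the changed span (here, only the return line) is the intuition behind delta-span records: the unchanged function header carries no information about the fix.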
Create a task JSON:

```json
{
  "prompt": "Implement reverse_words(s) returning words in reverse order.",
  "entrypoint": "reverse_words",
  "tests": "assert reverse_words('one two three') == 'three two one'\nassert reverse_words('solo') == 'solo'",
  "task_id": "reverse_words_demo"
}
```

Run verification against code embedded in the JSON or from a file:

```
python -m dr_opic.cli verify-python examples/python_task.json --code examples/reverse_words_good.py
```

Expected output includes:
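```json
{
  "passed": true,
  "observation": "passed"
}
```

Compute the smoothed pass rate and ZPD weight from a pass count and a sample count:

```
python -m dr_opic.cli zpd --passes 2 --samples 5
```

This prints the Jeffreys-smoothed pass rate and the ZPD weight:

```
p_tilde = (passes + 0.5) / (samples + 1)
w_zpd = 4 * p_tilde * (1 - p_tilde)
```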
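The two formulas can be checked by hand. For passes = 2 and samples = 5, p_tilde = 2.5 / 6 = 5/12 and w_zpd = 4 · (5/12) · (7/12) = 35/36 ≈ 0.972. The smoothing is the posterior mean under a Jeffreys Beta(1/2, 1/2) prior, so p_tilde never reaches exactly 0 or 1, and w_zpd peaks at 1 when p_tilde = 0.5. A minimal sketch of the same arithmetic follows; the function name zpd_weight is illustrative and not necessarily what dr_opic.maths exposes.

```python
def zpd_weight(passes: int, samples: int) -> tuple[float, float]:
    """Return (p_tilde, w_zpd) using the two formulas above."""
    if not 0 <= passes <= samples:
        raise ValueError("need 0 <= passes <= samples")
    # Jeffreys-smoothed pass rate: posterior mean under a Beta(1/2, 1/2) prior.
    p_tilde = (passes + 0.5) / (samples + 1)
    # Equals 1.0 at p_tilde == 0.5; small for tasks that are nearly always passed or failed.
    w_zpd = 4 * p_tilde * (1 - p_tilde)
    return p_tilde, w_zpd


# Sweep every possible pass count for 5 samples.
for passes in range(6):
    p, w = zpd_weight(passes, 5)
    print(f"passes={passes}/5  p_tilde={p:.3f}  w_zpd={w:.3f}")
```

The sweep shows why the weight suits curriculum selection: tasks with intermediate pass rates get the most weight, while trivially easy or hopeless tasks are down-weighted but never zeroed out.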
Route a prompt through the domain and safety checks:

```
python -m dr_opic.cli route "Fix this Python traceback and add pytest coverage"
```

Estimate dense model memory and per-token compute:

```
python -m dr_opic.cli estimate-model --params 3.09e9
```

Build a counterfactual delta-span record:

```
python -m dr_opic.cli delta --task-id reverse_words --failed examples\reverse_words_bad.py --fixed examples\reverse_words_good.py
```

Run the verifier-ZPD scheduler demo:

```
python -m dr_opic.cli schedule-demo
```

The repo does not ship training data. To audit an SFT JSONL file stored outside this repo:
```
python -m dr_opic.cli audit-jsonl C:\datasets\slm\sft.jsonl --schema sft
```

Expected SFT schema:

```json
{"prompt": "...", "response": "..."}
```

For preference rows:

```
python -m dr_opic.cli audit-jsonl C:\datasets\slm\preferences.jsonl --schema preference
```

Required preference fields:

```json
{"prompt": "...", "chosen": "...", "rejected": "..."}
```

The package is organized into these modules:

- dr_opic.maths: ZPD, rewards, advantages, coverage metrics, cost estimates.
- dr_opic.verifier: Python code extraction, static checks, test execution.
- dr_opic.forge: student-first rollout, repair, and artifact construction.
- dr_opic.selectors: verified learnable-winner selection.
- dr_opic.delta: token/line delta spans between failed and fixed code.
- dr_opic.scheduler: verifier-ZPD curriculum buckets and train-mix weights.
- dr_opic.preference: scalar helpers for verified DPO/ORPO-style pairs.
- dr_opic.datasets: JSONL schema and quality audit helpers.
- dr_opic.replay: deterministic replay certification.
- dr_opic.routing: domain routing and abstention helper.
- dr_opic.safety: simple coding-safety acceptance helper.
- dr_opic.compression: memory/compute estimates and retention gates.
- dr_opic.losses: optional PyTorch losses for SFT, delta, DPO, and RLVR.
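The SFT and preference schemas shown earlier can be checked with a small standalone script. This sketch assumes an audit means two things: every non-blank line parses as a JSON object, and every required field is present as a non-empty string. dr_opic.datasets may check more or different quality signals; the function name audit_jsonl and the report keys are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

# Required fields per schema, as listed in this README.
REQUIRED_FIELDS = {
    "sft": ("prompt", "response"),
    "preference": ("prompt", "chosen", "rejected"),
}


def audit_jsonl(path: str, schema: str) -> dict:
    """Count unparseable rows and rows with missing/empty required fields."""
    required = REQUIRED_FIELDS[schema]  # KeyError for an unknown schema name
    report = {"rows": 0, "bad_json": 0, "missing_or_empty": 0, "bad_lines": []}
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            report["rows"] += 1
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                report["bad_json"] += 1
                report["bad_lines"].append(line_no)
                continue
            if not isinstance(row, dict) or any(
                not isinstance(row.get(field), str) or not row[field].strip()
                for field in required
            ):
                report["missing_or_empty"] += 1
                report["bad_lines"].append(line_no)
    return report


# Tiny throwaway file: one good row, one missing "response", one non-JSON line.
rows = ['{"prompt": "p", "response": "r"}', '{"prompt": "p"}', "not json"]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(rows) + "\n")
report = audit_jsonl(tmp.name, "sft")
os.remove(tmp.name)
print(report)
```

Reporting line numbers rather than just counts makes it easy to inspect and fix the offending rows in a large external dataset.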
More detail:
verify-python executes candidate code in a temporary subprocess with a timeout.
That is useful for local research, but it is not a security sandbox. Run untrusted
model code inside a container or VM.
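This sketch illustrates the general pattern the note describes. It writes the candidate code and its tests to a temporary directory, runs them in a fresh Python subprocess, and treats a timeout as a failure. It is not dr_opic.verifier's implementation: run_candidate and every observation string other than "passed" are assumptions. As noted above, a subprocess with a timeout is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, tests: str, timeout_s: float = 5.0) -> dict:
    """Run candidate code plus assert-style tests in a separate Python process."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_test.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I: isolated mode, ignores PYTHON* env vars and user site-packages.
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed if it runs too long
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "observation": "timeout"}
    if proc.returncode == 0:
        return {"passed": True, "observation": "passed"}
    # Keep the last stderr line (usually the exception) as a short observation.
    lines = proc.stderr.strip().splitlines()
    return {"passed": False,
            "observation": lines[-1] if lines else f"exit code {proc.returncode}"}


tests = ("assert reverse_words('one two three') == 'three two one'\n"
         "assert reverse_words('solo') == 'solo'")
good = "def reverse_words(s):\n    return ' '.join(reversed(s.split()))"
bad = "def reverse_words(s):\n    return s[::-1]"
print(run_candidate(good, tests))  # {'passed': True, 'observation': 'passed'}
print(run_candidate(bad, tests))
```

Returning the last traceback line as the observation gives a repair step a compact failure signal without the full stderr dump.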