evals: harness for the xb-* skills (first cut) #351
Open
salmanmkc wants to merge 14 commits into
python harness, 6 seed tasks, worktree shim, scorer + summarizer. smoke test confirmed: golden-as-agent scores 1.0 across the board, empty-as-agent scores 0.0. still manual on the agent-invocation side. CI-pass scoring and llm-as-judge planned as follow-ups.
automated runner that invokes gemini-cli headless against a task worktree. injects current main SKILL.md files for the with-skill run (many task bases predate the skill files), strips them for the without-skill run. wraps the scorer. verified on task google#335: gemini reaches recall 0.5 line_sim 0.63 in both modes, surfacing that current skills do not address the sample-vs-framework-layer choice that task needed.
runner now installs every skills/xb-* dir via gemini skills install --consent at user scope for with-skill, uninstalls at end. cleans pre-existing leftovers at start so subsequent runs do not leak state. matches actual user workflow rather than relying on auto-discovery from the worktree.
uses gemini-2.5-pro via google-genai api directly, skill content in system prompt, no filesystem access. mirrors the canvas gem deployment rather than gemini-cli. four prototyping tasks (netblocks, hands, ui, ai) score 1.0 with the matching skill and 0.50-0.81 without it. biggest gap on xb-netblocks (most bespoke api surface, no gemini priors). FINDINGS.md tracks the design pivots: contributor-pr-replay (wrong audience), cli-prototyping (ceiling effect), api-canvas-faithful (the one that worked).
judge.py runs gemini-2.5-pro over each agent output with the full SKILL.md as ground truth. with-skill consistently 5/5/yes, without-skill 1/1/no. judge prose names the specific hallucinations (xr-room, <xr-hands>, xrb-button, etc).
smoke.py serves the workspace via http and loads index.html in headless chromium via playwright. caught a real bug in the xb-netblocks skill: its import examples use a path that 404s on the build-branch cdn.
ablate.py drops one skill section at a time from the system prompt. all six variants still scored 1.0, so binary api_match is overdetermined; finer ablations need the judge in the loop.
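the one-section-at-a-time ablation can be sketched roughly as below. this is an illustration only, not the code in ablate.py: it assumes SKILL.md sections are delimited by `## ` headings, and the helper names `split_sections` and `ablation_variants` are hypothetical.

```python
import re


def split_sections(skill_md: str) -> list[str]:
    """Split markdown into chunks, each starting at a '## ' heading.

    Assumption: sections are delimited by level-2 headings. Text before
    the first heading is kept as its own leading chunk.
    """
    parts = re.split(r"(?m)^(?=## )", skill_md)
    return [p for p in parts if p.strip()]


def ablation_variants(skill_md: str) -> dict[str, str]:
    """Return {dropped_section_title: skill text with that section removed}.

    Note: sections with identical titles would overwrite each other here.
    """
    sections = split_sections(skill_md)
    variants = {}
    for i, section in enumerate(sections):
        title = section.splitlines()[0].lstrip("# ").strip()
        variants[title] = "".join(sections[:i] + sections[i + 1:])
    return variants
```

each variant would then be used as the system prompt for one run, and its score compared against the full-skill run.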
removes v1 pr-replay scaffolding (dead weight). adds depth-occlusion, gestures-thumbs-up, modelviewer-gltf, physics-falling-cube, sound-spatial-audio, world-plane-detection tasks. adds run_all.sh orchestrator (with --judge), summarize_proto.py, plot.py (composite + per-metric + judge charts). rewrites evals/README.md for the actual v3 design. n=10 sweep result: +0.27 average composite gap with vs without skill. holds across all 10 skills. judge agrees 10/10 on would_merge. biggest gaps on bespoke api surfaces (netblocks, physics, modelviewer); smallest on webxr-standard surfaces (hands, depth).
adds 10 canvas-* tasks phrased like real canvas-gem prompts (imaginative outcome descriptions, no api jargon) to test alongside the engineer-spec ones. namespaces results and workspaces by model so pro and flash can co-exist. swaps the judge would_merge dim for hallucination_severity (none/minor/major), a focused measure of the failure mode the eval actually catches. full pro+flash sweep result: with-skill avg 0.97, without 0.71, delta +0.26 to +0.31 across cells. canvas-faithful prompts widen the gap rather than narrow it.
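a small sketch of how the `hallucination_severity` labels could be tallied per model and arm. the record layout (keys `model`, `arm`, `hallucination_severity`) and the function name `tally_severity` are assumptions for illustration; the judge's actual output schema may differ.

```python
from collections import Counter, defaultdict

SEVERITIES = ("none", "minor", "major")


def tally_severity(verdicts):
    """Count hallucination_severity labels per (model, arm) cell.

    `verdicts` is an iterable of dicts with the hypothetical keys
    'model', 'arm', and 'hallucination_severity'.
    """
    cells = defaultdict(Counter)
    for v in verdicts:
        label = v["hallucination_severity"]
        if label not in SEVERITIES:
            raise ValueError(f"unexpected severity label: {label!r}")
        cells[(v["model"], v["arm"])][label] += 1
    # Emit every severity for every cell so absent labels show up as 0.
    return {
        cell: {s: counts[s] for s in SEVERITIES}
        for cell, counts in cells.items()
    }
```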
…rouped task charts
import_match was defaulting to 1.0 when expected_imports was empty (16/20 tasks), and forbidden_clean / parse_ok are mostly invariant, so the original composite mean inflated by ~0.07 over the without-skill arm. recomputes composite to skip import_match when expected_imports is empty (kept the per-metric values intact for inspection).
adds a headline api_match_breakdown.png since api_match is the only dimension that actually moves between arms. per-task charts now group engineer-spec on the left, canvas-faithful on the right, with a vertical separator. metric grid uses better short labels. prompt-style chart moves delta annotations above bars instead of overlapping the x-axis.
headline numbers after fix: api_match goes from 0.03-0.07 without to 0.85-1.00 with, delta +0.78 to +0.97 across cells.
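the composite fix can be sketched as follows. this assumes the composite is an unweighted mean of the per-metric values (the commit only says "composite mean"); the function name `composite` is hypothetical and the real logic lives in the scorer.

```python
from statistics import mean


def composite(metrics: dict[str, float], expected_imports: list[str]) -> float:
    """Unweighted mean of the metrics that apply to this task.

    import_match is skipped when the task lists no expected imports,
    because a default of 1.0 there is vacuous and inflates the mean.
    The input dict is left untouched so per-metric values stay inspectable.
    """
    applicable = dict(metrics)
    if not expected_imports:
        applicable.pop("import_match", None)
    return mean(applicable.values())
```

for an output with api_match 0.0 and the other three metrics at 1.0, the mean over all four is 0.75; with the vacuous import_match skipped it is 2/3.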
the dev log narrative was useful while iterating but is not the right artifact for a public PR. moves the content out of the repo. saved separately for the author.
moves all chart legends outside the plot so they stop overlapping bars or delta annotations. lifts the engineer-spec / canvas-faithful group labels higher and bumps the title pad so they no longer overlap. ignores evals/results in prettier (regenerable, lots of files).
investigated three with-skill "gaps" surfaced by the eval and found they were spec-side bugs, not skill defects:
1. modelviewer-gltf required setupPlatform, but xb.ModelViewer auto-attaches a platform by default. Dropped the expectation.
2. depth-occlusion required xrDepthMeshOptions, but enableDepth applies that preset internally. Dropped the expectation.
3. canvas-dandelion and canvas-occluded-statue had the same issues, fixed in parallel.
re-scored all 80 workspaces against the corrected specs. headline api_match numbers improved:
- pro engineer-spec: 0.92 to 1.00
- flash engineer-spec: 0.85 to 0.93
canvas-faithful cells unchanged at 1.00 / 0.97. without-skill arm unchanged. judge re-run with new schema and ground truth. charts regenerated.
Collaborator
I am deeply impressed! As mentioned in https://arxiv.org/abs/2603.24591, we also have an evaluation pipeline, but it hasn't been open sourced since I don't think it is rigorous enough; it was also designed for traditional one-shot prompting instead of agentic loops. Thanks for the contribution, I'll take a deeper look and think about how we should open source what we have together with what you contribute here! I appreciate your proactivity and treat you as one of our key contributors here :)
first cut at a benchmark for the xrblocks `xb-*` skills, calls gemini via the api the way the canvas-gem deploys it. skill content baked into the system prompt, no filesystem access. the model has to produce a complete `main.js` from scratch.
20 tasks across the 13 `xb-*` skills, half phrased as engineer specs ("modify `main.js` to enable WebXR depth sensing"), half as the imaginative requests users actually type into the gem ("make my room feel like there's a giant statue standing across it"). same scoring rubric on both: import / api / parse / forbidden patterns + an llm-judge pass (gemini-2.5-pro again, full SKILL.md as ground truth, scoring `accomplishes_task`, `idiomatic_xrblocks`, `hallucination_severity`).
ran the full sweep on both `gemini-2.5-pro` and `gemini-2.5-flash`.
headline: api_match
did the agent call the APIs the skill defines? this is the cleanest single-number signal, since parse_ok / forbidden_clean / import_match are mostly invariant (gemini writes valid js either way, etc).
without the skill, both models call the right xrblocks APIs ~3-7% of the time. with the skill, 85-100%.
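a minimal sketch of what an api_match check could look like, assuming the metric is the fraction of a task's expected API identifiers that appear in the generated `main.js`. the parameter name `expected_apis` and the whole-identifier matching rule are assumptions for illustration; the scorer in the PR may differ.

```python
import re


def api_match(source: str, expected_apis: list[str]) -> float:
    """Fraction of expected API identifiers that occur in the generated source.

    Each name is matched as a whole identifier, so 'enableDepth' does not
    match inside 'enableDepthMesh'. Dotted names such as 'xb.ModelViewer'
    are matched literally.
    """
    if not expected_apis:
        # Refuse rather than default to 1.0, which would be a vacuous score.
        raise ValueError("expected_apis must be non-empty")
    hits = 0
    for name in expected_apis:
        pattern = r"(?<![\w$])" + re.escape(name) + r"(?![\w$])"
        if re.search(pattern, source):
            hits += 1
    return hits / len(expected_apis)
```

a plain substring test would be simpler, but would credit an output for a longer, possibly invented name that merely contains the expected one.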
composite score (multi-dimensional)
unexpected: canvas-faithful prompts widen the skill gap, not narrow it. vague prompts force the model to lean on its system-prompt priors, which is exactly what the skill content is.
hallucination breakdown
llm-judge classified each output as `none` / `minor` / `major` hallucination using the full SKILL.md as ground truth:
without the skill, flash produces a `major` hallucination on every task. consistent patterns: jsx-like `xr-scene` / `xr-room` for netblocks, fake `<xr-hands>` elements for hands, invented A-Frame `xrb-button` for ui, made-up `xr.ai.multimodal` for ai. several without-skill outputs imported from `xr-blocks@0.4.1` and similar packages that have never been published.
with the skill, most outputs are rated `none`. the remaining `major` cases are concentrated on `xb-modelviewer`, which has a documentation gap the eval surfaced.
per-metric breakdown:
skill gaps surfaced
beyond the headline numbers, the eval flagged concrete improvements to chase:
- … `Loader` class that does not exist. 4-5 `major` hallucinations per model. the skill description mentions `loadGLTFModel` and `setupPlatform` but the quick-start skips the actual loader call, so the model fills in something plausible-looking. fix: add a minimal `ModelViewer` loader example to the skill.
- … `xrblocks/addons/<addon>/src`) that 404s on the `@build` cdn because browser importmaps do not have node-style index resolution. fixed in #349 ("netblocks: fix import path in skill + readme examples") and #350 ("lipsync: add skill files + fix import path in readme").
- … `enableDepth` but misses adjacent expected setup. skill could spell out the full recipe.
what's in the PR
- `evals/prototypes/runners/run_gem_api.py` - agent runner (gemini api directly)
- `evals/prototypes/score_proto.py` - scorer (skips vacuous import_match when expected_imports is empty)
- `evals/prototypes/judge.py` - llm-as-judge
- `evals/prototypes/smoke.py` - playwright + headless chromium (caught real CDN-resolution bugs in the netblocks and lipsync skills; fixed in #349, "netblocks: fix import path in skill + readme examples", and #350, "lipsync: add skill files + fix import path in readme")
- `evals/prototypes/ablate.py` - drop one skill section at a time (binary scorer is overdetermined for this; would benefit from judge in the loop)
- `evals/run_all.sh` + `summarize_proto.py` + `plot.py` - orchestrator, summary, matplotlib charts
- `evals/prototypes/tasks/` - 20 tasks (10 engineer-spec + 10 canvas-faithful)
what's not in the PR
first attempt, posting for design feedback before scaling further.