evals: harness for the xb-* skills (first cut) by salmanmkc · Pull Request #351 · google/xrblocks

salmanmkc · 2026-06-06T05:44:58Z

first cut at a benchmark for the xrblocks xb-* skills. it calls gemini via the api the way the canvas gem deploys it: skill content baked into the system prompt, no filesystem access. the model has to produce a complete main.js from scratch.

20 tasks across the 13 xb-* skills, half phrased as engineer specs ("modify main.js to enable WebXR depth sensing"), half as the imaginative requests users actually type into the gem ("make my room feel like there's a giant statue standing across it"). same scoring rubric on both: import / api / parse / forbidden patterns + an llm-judge pass (gemini-2.5-pro again, full SKILL.md as ground truth, scoring accomplishes_task, idiomatic_xrblocks, hallucination_severity).

ran the full sweep on both gemini-2.5-pro and gemini-2.5-flash.

headline: api_match

did the agent call the APIs the skill defines? this is the cleanest single-number signal, since parse_ok / forbidden_clean / import_match are mostly invariant (gemini writes valid js either way, etc).

model	prompt style	with skill	without skill	Δ
pro	engineer-spec	1.00	0.07	+0.93
pro	canvas-faithful	1.00	0.03	+0.97
flash	engineer-spec	0.93	0.07	+0.87
flash	canvas-faithful	0.97	0.03	+0.93

without the skill, both models call the right xrblocks APIs ~3-7% of the time. with the skill, 93-100%.
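a minimal sketch of how an api_match score like the one in the table could be computed. this is an assumption about the shape of the check, not a copy of score_proto.py: it assumes each task lists the API names it expects (`expected_apis` is a hypothetical name) and counts how many appear as whole identifiers in the generated main.js. the sample JS string is illustrative only.

```python
import re
from typing import Optional


def api_match(source: str, expected_apis: list[str]) -> Optional[float]:
    """Fraction of expected API names found as whole identifiers in source.

    Returns None when nothing is expected, so the caller can skip the
    metric instead of counting a vacuous 1.0.
    """
    if not expected_apis:
        return None
    hits = 0
    for name in expected_apis:
        # Whole-token match: `enableDepth` must not match `enableDepthX`.
        pattern = r"(?<![\w$])" + re.escape(name) + r"(?![\w$])"
        if re.search(pattern, source):
            hits += 1
    return hits / len(expected_apis)


# Illustrative output only; the call shapes are hypothetical.
generated = "options.enableDepth();\nviewer.loadGLTFModel(url);"
print(api_match(generated, ["enableDepth", "loadGLTFModel"]))
print(api_match(generated, ["enableDepth", "setupPlatform"]))
```

a plain identifier match cannot tell a correct call from a name that only shows up in a comment, which is one reason to pair it with the judge pass.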

composite score (multi-dimensional)

model	prompt style	with skill	without skill	Δ
pro	engineer-spec	1.00	0.65	+0.35
pro	canvas-faithful	1.00	0.65	+0.36
flash	engineer-spec	0.98	0.65	+0.33
flash	canvas-faithful	0.99	0.61	+0.38

unexpected: canvas-faithful prompts widen the skill gap, not narrow it. vague prompts force the model to lean on its system-prompt priors, which is exactly what the skill content is.
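a sketch of a composite along the lines described in this PR: a mean over the rule-based metrics that leaves out import_match when the task has no expected imports. the metric names come from the rubric above; the equal weighting and the choice to keep judge scores out of the mean are assumptions, and the real scorer may combine things differently.

```python
from statistics import mean


def composite(metrics: dict[str, float], expected_imports: list[str]) -> float:
    """Unweighted mean of the rule-based metrics, skipping vacuous import_match."""
    used = dict(metrics)
    if not expected_imports:
        # With nothing to match, import_match would sit at 1.0 and inflate the mean.
        used.pop("import_match", None)
    return mean(used.values())


# Made-up metric values for illustration, not results from the sweep.
scores = {
    "import_match": 1.0,
    "api_match": 0.0,
    "parse_ok": 1.0,
    "forbidden_clean": 1.0,
}
print(composite(scores, expected_imports=[]))            # mean of three metrics
print(composite(scores, expected_imports=["xrblocks"]))  # mean of all four
```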

hallucination breakdown

llm-judge classified each output as none / minor / major hallucination using the full SKILL.md as ground truth:

model	with skill (none / minor / major)	without skill (none / minor / major)
pro	15 / 1 / 4	3 / 0 / 17
flash	13 / 2 / 5	0 / 0 / 20

without the skill, flash produces a major hallucination on every task. consistent patterns: jsx-like xr-scene/xr-room for netblocks, fake <xr-hands> elements for hands, invented A-Frame xrb-button for ui, made-up xr.ai.multimodal for ai. several without-skill outputs imported from xr-blocks@0.4.1 and similar packages that have never been published.

with the skill, most outputs are rated none. the remaining major cases are concentrated on xb-modelviewer, which has a documentation gap the eval surfaced.
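for reference, a small sketch of how per-task judge labels could be rolled up into the none / minor / major cells in the table above. the label list here is reconstructed from the pro with-skill counts purely to exercise the function; the judge's actual output format is not shown in this PR description, so the input shape is an assumption.

```python
from collections import Counter

SEVERITIES = ("none", "minor", "major")


def tally(labels: list[str]) -> str:
    """Format severity labels as a 'none / minor / major' count string."""
    counts = Counter(labels)
    unknown = set(counts) - set(SEVERITIES)
    if unknown:
        # Fail loudly rather than silently dropping an unexpected judge label.
        raise ValueError(f"unexpected severity labels: {sorted(unknown)}")
    return " / ".join(str(counts[s]) for s in SEVERITIES)


labels = ["none"] * 15 + ["minor"] * 1 + ["major"] * 4
print(tally(labels))
```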

per-metric breakdown: (chart not reproduced here)

skill gaps surfaced

beyond the headline numbers, the eval flagged concrete improvements to chase:

xb-modelviewer: even with the skill loaded, gemini invents a Loader class that does not exist. 4-5 major hallucinations per model. the skill description mentions loadGLTFModel and setupPlatform but the quick-start skips the actual loader call, so the model fills in something plausible-looking. fix: add a minimal ModelViewer loader example to the skill.
xb-netblocks + xb-lipsync import paths: the skills used a directory import (xrblocks/addons/<addon>/src) that 404s on the @build cdn because browser importmaps do not have node-style index resolution. fixed in #349 (netblocks: fix import path in skill + readme examples) and #350 (lipsync: add skill files + fix import path in readme).
xb-depth: with-skill composite drops to 0.88 because the agent uses enableDepth but misses adjacent expected setup. skill could spell out the full recipe.
xb-hands (flash only): composite 0.83 for the same shape of issue. depth and hands skills could both use a "minimum viable setup" snippet at the top.
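the import-path gap above is the kind of thing a static check can catch before the browser smoke test does. browsers resolve an import specifier to a URL and fetch exactly that URL, so a subpath that names a directory never falls back to an index file. a rough sketch, assuming a regex scan is good enough (it will also match import-looking text inside strings or comments); the `example` addon and `Example.js` file are hypothetical names used only for illustration:

```python
import re
from pathlib import PurePosixPath

# Captures the specifier in `from '...'`, `import '...'` and `import('...')`.
IMPORT_RE = re.compile(r"""(?:from|import)\s*\(?\s*['"]([^'"]+)['"]""")


def directory_style_imports(js_source: str) -> list[str]:
    """Return import specifiers with a subpath whose last segment has no extension."""
    flagged = []
    for spec in IMPORT_RE.findall(js_source):
        if "://" in spec:
            continue  # full URLs are left alone in this sketch
        parts = spec.split("/")
        package_len = 2 if spec.startswith("@") else 1  # scoped packages use two segments
        is_relative = spec.startswith((".", "/"))
        has_subpath = is_relative or len(parts) > package_len
        if has_subpath and PurePosixPath(spec).suffix == "":
            flagged.append(spec)
    return flagged


sample = (
    "import * as xb from 'xrblocks';\n"
    "import * as addon from 'xrblocks/addons/example/src';\n"
    "import {Example} from 'xrblocks/addons/example/src/Example.js';\n"
)
print(directory_style_imports(sample))
```

a check like this only flags candidates; whether a given path actually 404s still depends on the importmap and what the cdn serves.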

what's in the PR

evals/prototypes/runners/run_gem_api.py - agent runner (gemini api directly)
evals/prototypes/score_proto.py - scorer (skips vacuous import_match when expected_imports is empty)
evals/prototypes/judge.py - llm-as-judge
evals/prototypes/smoke.py - playwright + headless chromium (caught real CDN-resolution bugs in the netblocks and lipsync skills; fixed in #349 and #350)
evals/prototypes/ablate.py - drop one skill section at a time (binary scorer is overdetermined for this; would benefit from judge in the loop)
evals/run_all.sh + summarize_proto.py + plot.py - orchestrator, summary, matplotlib charts
evals/prototypes/tasks/ - 20 tasks (10 engineer-spec + 10 canvas-faithful)
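to make the ablation idea in the list above concrete, here is a sketch of "drop one skill section at a time". it assumes the skill file is markdown with `## ` section headings and treats anything before the first heading as a preamble that is always kept; ablate.py may slice the file differently.

```python
import re
from typing import Iterator


def split_sections(skill_md: str) -> list[str]:
    """Split markdown into a preamble plus one chunk per level-2 heading."""
    # Zero-width split just before each line that starts with "## ".
    chunks = re.split(r"(?m)^(?=## )", skill_md)
    return [chunk for chunk in chunks if chunk]


def ablations(skill_md: str) -> Iterator[tuple[str, str]]:
    """Yield (dropped section title, skill text without that section)."""
    sections = split_sections(skill_md)
    for i, section in enumerate(sections):
        if not section.startswith("## "):
            continue  # keep the preamble in every variant
        title = section.splitlines()[0].lstrip("# ").strip()
        yield title, "".join(sections[:i] + sections[i + 1:])


# Tiny stand-in for a skill file, not real SKILL.md content.
doc = "# skill\nintro\n## setup\na\n## usage\nb\n"
for title, variant in ablations(doc):
    print(f"dropped {title!r}: {len(variant)} chars left")
```

each variant would then go into the system prompt in place of the full skill, with the judge scoring the output, since the binary scorer alone does not separate the variants.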

what's not in the PR

multi-trial cells (each cell ran once, no IQR)
cross-model judge (judge is gemini-pro, same family as the agent)
runtime correctness beyond load-time smoke
multi-skill composition tasks
skill-edit regression CI

first attempt, posting for design feedback before scaling further.

python harness, 6 seed tasks, worktree shim, scorer + summarizer. smoke test confirmed: golden-as-agent scores 1.0 across the board, empty-as-agent scores 0.0. still manual on the agent-invocation side. CI-pass scoring and llm-as-judge planned as follow-ups.

automated runner that invokes gemini-cli headless against a task worktree. injects current main SKILL.md files for the with-skill run (many task bases predate the skill files), strips them for the without-skill run. wraps the scorer. verified on task google#335: gemini reaches recall 0.5 line_sim 0.63 in both modes, surfacing that current skills do not address the sample-vs-framework-layer choice that task needed.

runner now installs every skills/xb-* dir via gemini skills install --consent at user scope for with-skill, uninstalls at end. cleans pre-existing leftovers at start so subsequent runs do not leak state. matches actual user workflow rather than relying on auto-discovery from the worktree.

uses gemini-2.5-pro via google-genai api directly, skill content in system prompt, no filesystem access. mirrors the canvas gem deployment rather than gemini-cli. four prototyping tasks (netblocks, hands, ui, ai) score 1.0 with the matching skill and 0.50-0.81 without it. biggest gap on xb-netblocks (most bespoke api surface, no gemini priors). FINDINGS.md tracks the design pivots: contributor-pr-replay (wrong audience), cli-prototyping (ceiling effect), api-canvas-faithful (the one that worked).

judge.py runs gemini-2.5-pro over each agent output with the full SKILL.md as ground truth. with-skill consistently 5/5/yes, without-skill 1/1/no. judge prose names the specific hallucinations (xr-room, <xr-hands>, xrb-button, etc). smoke.py serves the workspace via http and loads index.html in headless chromium via playwright. caught a real bug in the xb-netblocks skill: its import examples use a path that 404s on the build-branch cdn. ablate.py drops one skill section at a time from the system prompt. all six variants still scored 1.0, so binary api_match is overdetermined; finer ablations need the judge in the loop.

removes v1 pr-replay scaffolding (dead weight). adds depth-occlusion, gestures-thumbs-up, modelviewer-gltf, physics-falling-cube, sound-spatial-audio, world-plane-detection tasks. adds run_all.sh orchestrator (with --judge), summarize_proto.py, plot.py (composite + per-metric + judge charts). rewrites evals/README.md for the actual v3 design. n=10 sweep result: +0.27 average composite gap with vs without skill. holds across all 10 skills. judge agrees 10/10 on would_merge. biggest gaps on bespoke api surfaces (netblocks, physics, modelviewer); smallest on webxr-standard surfaces (hands, depth).

adds 10 canvas-* tasks phrased like real canvas-gem prompts (imaginative outcome descriptions, no api jargon) to test alongside the engineer-spec ones. namespaces results and workspaces by model so pro and flash can co-exist. swaps the judge would_merge dim for hallucination_severity (none/minor/major), a focused measure of the failure mode the eval actually catches. full pro+flash sweep result: with-skill avg 0.97, without 0.71, delta +0.26 to +0.31 across cells. canvas-faithful prompts widen the gap rather than narrow it.

…rouped task charts import_match was defaulting to 1.0 when expected_imports was empty (16/20 tasks), and forbidden_clean / parse_ok are mostly invariant, so the original composite mean inflated by ~0.07 over the without-skill arm. recomputes composite to skip import_match when expected_imports is empty (kept the per-metric values intact for inspection). adds a headline api_match_breakdown.png since api_match is the only dimension that actually moves between arms. per-task charts now group engineer-spec on the left, canvas-faithful on the right, with a vertical separator. metric grid uses better short labels. prompt-style chart moves delta annotations above bars instead of overlapping the x-axis. headline numbers after fix: api_match goes from 0.03-0.07 without to 0.85-1.00 with, delta +0.78 to +0.97 across cells.

the dev log narrative was useful while iterating but is not the right artifact for a public PR. moves the content out of the repo. saved separately for the author.

moves all chart legends outside the plot so they stop overlapping bars or delta annotations. lifts the engineer-spec / canvas-faithful group labels higher and bumps the title pad so they no longer overlap. ignores evals/results in prettier (regenerable, lots of files).

investigated three with-skill "gaps" surfaced by the eval and found they were spec-side bugs, not skill defects:
1. modelviewer-gltf required setupPlatform, but xb.ModelViewer auto-attaches a platform by default. Dropped the expectation.
2. depth-occlusion required xrDepthMeshOptions, but enableDepth applies that preset internally. Dropped the expectation.
3. canvas-dandelion and canvas-occluded-statue had the same issues, fixed in parallel.

re-scored all 80 workspaces against the corrected specs. headline api_match numbers improved:
- pro engineer-spec: 0.92 to 1.00
- flash engineer-spec: 0.85 to 0.93

canvas-faithful cells unchanged at 1.00 / 0.97. without-skill arm unchanged. judge re-run with new schema and ground truth. charts regenerated.

ruofeidu · 2026-06-06T23:14:38Z

I am deeply impressed!

As mentioned in https://arxiv.org/abs/2603.24591, we also have an evaluation pipeline, but it hasn't been open-sourced since I don't think it is rigorous enough. It was also designed for traditional one-shot prompting instead of agentic loops.

Thanks for the contribution. I'll take a deeper look and think about how we should open-source what we have alongside what you contribute here! I appreciate your proactivity and consider you one of our key contributors here :)

salmanmkc added 11 commits June 6, 2026 08:59

evals: move legend outside the plot so it stops overlapping delta ann…

b3c9bb5

…otations

evals: drop FINDINGS dev log from the PR

9dc4ab4

salmanmkc marked this pull request as ready for review June 6, 2026 05:48

salmanmkc and others added 3 commits June 6, 2026 14:51

evals: prettier-format updated spec.json files

423d4ff

Merge branch 'main' into feat/skill-eval-harness

555e987

ruofeidu requested a review from dli7319 June 6, 2026 23:14

salmanmkc wants to merge 14 commits into google:main from salmanmkc:feat/skill-eval-harness