evals: harness for the xb-* skills (first cut) #351

Open
salmanmkc wants to merge 14 commits into google:main from salmanmkc:feat/skill-eval-harness

@salmanmkc (Contributor) commented Jun 6, 2026

first cut at a benchmark for the xrblocks xb-* skills, calls gemini via the api the way the canvas-gem deploys it. skill content baked into the system prompt, no filesystem access. the model has to produce a complete main.js from scratch.

20 tasks across the 13 xb-* skills, half phrased as engineer specs ("modify main.js to enable WebXR depth sensing"), half as the imaginative requests users actually type into the gem ("make my room feel like there's a giant statue standing across it"). same scoring rubric on both: import / api / parse / forbidden patterns + an llm-judge pass (gemini-2.5-pro again, full SKILL.md as ground truth, scoring accomplishes_task, idiomatic_xrblocks, hallucination_severity).
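
a rough sketch of how the binary checks in that rubric could work, assuming plain substring and regex matching over the generated main.js. `TaskSpec` and its field names are made up for illustration and are not the schema used by score_proto.py. parse_ok is left out because it needs a real JS parser.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskSpec:
    """Hypothetical task spec; the field names are illustrative only."""
    expected_imports: List[str] = field(default_factory=list)
    expected_apis: List[str] = field(default_factory=list)
    forbidden_patterns: List[str] = field(default_factory=list)


def fraction_present(needles: List[str], haystack: str) -> Optional[float]:
    """Share of needles found verbatim; None when there is nothing to check."""
    if not needles:
        return None
    return sum(needle in haystack for needle in needles) / len(needles)


def score(source: str, spec: TaskSpec) -> Dict[str, Optional[float]]:
    """Score one generated main.js against a task spec."""
    has_forbidden = any(re.search(p, source) for p in spec.forbidden_patterns)
    return {
        "import_match": fraction_present(spec.expected_imports, source),
        "api_match": fraction_present(spec.expected_apis, source),
        "forbidden_clean": 0.0 if has_forbidden else 1.0,
    }
```

returning `None` for an empty expectation list, instead of 1.0, keeps a vacuous check from looking like a pass further down the pipeline.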

ran the full sweep on both gemini-2.5-pro and gemini-2.5-flash.

headline: api_match

did the agent call the APIs the skill defines? this is the cleanest single-number signal, since parse_ok / forbidden_clean / import_match are mostly invariant (gemini writes valid js either way, etc).

| model | prompt style | with skill | without skill | Δ |
| --- | --- | --- | --- | --- |
| pro | engineer-spec | 1.00 | 0.07 | +0.93 |
| pro | canvas-faithful | 1.00 | 0.03 | +0.97 |
| flash | engineer-spec | 0.93 | 0.07 | +0.87 |
| flash | canvas-faithful | 0.97 | 0.03 | +0.93 |

[chart: api_match by model and prompt style]

without the skill, both models call the right xrblocks APIs ~3-7% of the time. with the skill, 93-100%.

composite score (multi-dimensional)

| model | prompt style | with skill | without skill | Δ |
| --- | --- | --- | --- | --- |
| pro | engineer-spec | 1.00 | 0.65 | +0.35 |
| pro | canvas-faithful | 1.00 | 0.65 | +0.36 |
| flash | engineer-spec | 0.98 | 0.65 | +0.33 |
| flash | canvas-faithful | 0.99 | 0.61 | +0.38 |

[chart: skill effect by model and prompt style]

unexpected: canvas-faithful prompts widen the skill gap, not narrow it. vague prompts force the model to lean on its system-prompt priors, which is exactly what the skill content is.

hallucination breakdown

llm-judge classified each output as none / minor / major hallucination using the full SKILL.md as ground truth:

| model | with skill (none / minor / major) | without skill (none / minor / major) |
| --- | --- | --- |
| pro | 15 / 1 / 4 | 3 / 0 / 17 |
| flash | 13 / 2 / 5 | 0 / 0 / 20 |

without the skill, flash produces a major hallucination on every task. consistent patterns: jsx-like xr-scene/xr-room for netblocks, fake <xr-hands> elements for hands, invented A-Frame xrb-button for ui, made-up xr.ai.multimodal for ai. several without-skill outputs imported from xr-blocks@0.4.1 and similar packages that have never been published.

with the skill, most outputs are rated none. the remaining major cases are concentrated on xb-modelviewer, which has a documentation gap the eval surfaced.
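
for reference, a minimal sketch of how per-output judge labels could be rolled up into the none / minor / major counts above. the `model` and `arm` keys are assumed record fields, not necessarily what judge.py writes; only `hallucination_severity` is named in this PR.

```python
from collections import Counter, defaultdict

SEVERITIES = ("none", "minor", "major")


def tally(records):
    """Count severities per (model, arm) cell and format them as 'n / m / M'."""
    counts = defaultdict(Counter)
    for record in records:
        severity = record["hallucination_severity"]
        if severity not in SEVERITIES:
            raise ValueError(f"unexpected severity: {severity!r}")
        counts[(record["model"], record["arm"])][severity] += 1
    return {
        cell: " / ".join(str(counter[s]) for s in SEVERITIES)
        for cell, counter in counts.items()
    }
```

rejecting unknown labels up front matters here because a judge that drifts off the three-level scale would otherwise silently shrink the totals.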

[chart: llm judge, idiomatic xrblocks rating per task]

[chart: composite per task across both models]

per-metric breakdown:

[chart: pro metric grid]

[chart: flash metric grid]

skill gaps surfaced

beyond the headline numbers, the eval flagged concrete improvements to chase:

  1. xb-modelviewer: even with the skill loaded, gemini invents a Loader class that does not exist. 4-5 major hallucinations per model. the skill description mentions loadGLTFModel and setupPlatform but the quick-start skips the actual loader call, so the model fills in something plausible-looking. fix: add a minimal ModelViewer loader example to the skill.
  2. xb-netblocks + xb-lipsync import paths: the skills used a directory import (xrblocks/addons/<addon>/src) that 404s on the @build cdn because browser importmaps do not have node-style index resolution. fixed in #349 (netblocks: fix import path in skill + readme examples) and #350 (lipsync: add skill files + fix import path in readme).
  3. xb-depth: with-skill composite drops to 0.88 because the agent uses enableDepth but misses adjacent expected setup. skill could spell out the full recipe.
  4. xb-hands (flash only): composite 0.83 for the same shape of issue. depth and hands skills could both use a "minimum viable setup" snippet at the top.
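
the directory-import problem in item 2 can also be caught statically, before any browser runs. a heuristic sketch, assuming that a multi-segment specifier whose last segment has no file extension points at a directory. the `example-lib` paths in the usage note are placeholders, not real xrblocks paths.

```python
import re

# Matches `import ... from "x"`, `import "x"` and `import("x")`.
IMPORT_RE = re.compile(r"""(?:\bfrom|\bimport)\s*\(?\s*["']([^"']+)["']""")


def is_directory_style(spec: str) -> bool:
    """Heuristic: a subpath specifier whose last segment has no extension."""
    if spec.startswith(("http://", "https://")):
        return False  # full URLs are left to the load-time smoke test
    parts = spec.rstrip("/").split("/")
    package_parts = 2 if spec.startswith("@") else 1
    if not spec.startswith(".") and len(parts) <= package_parts:
        return False  # bare package name, resolved by an exact importmap entry
    return "." not in parts[-1]


def directory_imports(js_source: str) -> list:
    """Return the import specifiers that look like directory imports."""
    return [s for s in IMPORT_RE.findall(js_source) if is_directory_style(s)]
```

on a file that imports from `example-lib/addons/demo/src`, `example-lib/addons/demo/src/index.js` and `three`, only the first is flagged. this is a lint-level check and does not replace loading the page for real.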

what's in the PR

  • evals/prototypes/runners/run_gem_api.py - agent runner (gemini api directly)
  • evals/prototypes/score_proto.py - scorer (skips vacuous import_match when expected_imports is empty)
  • evals/prototypes/judge.py - llm-as-judge
  • evals/prototypes/smoke.py - playwright + headless chromium (caught real CDN-resolution bugs in the netblocks and lipsync skills; fixed in #349 and #350)
  • evals/prototypes/ablate.py - drop one skill section at a time (binary scorer is overdetermined for this; would benefit from judge in the loop)
  • evals/run_all.sh + summarize_proto.py + plot.py - orchestrator, summary, matplotlib charts
  • evals/prototypes/tasks/ - 20 tasks (10 engineer-spec + 10 canvas-faithful)

what's not in the PR

  • multi-trial cells (each cell ran once, no IQR)
  • cross-model judge (judge is gemini-pro, same family as the agent)
  • runtime correctness beyond load-time smoke
  • multi-skill composition tasks
  • skill-edit regression CI

first attempt, posting for design feedback before scaling further.

salmanmkc added 11 commits June 6, 2026 08:59
python harness, 6 seed tasks, worktree shim, scorer + summarizer. smoke test confirmed: golden-as-agent scores 1.0 across the board, empty-as-agent scores 0.0.

still manual on the agent-invocation side. CI-pass scoring and llm-as-judge planned as follow-ups.
automated runner that invokes gemini-cli headless against a task worktree. injects current main SKILL.md files for the with-skill run (many task bases predate the skill files), strips them for the without-skill run. wraps the scorer.

verified on task google#335: gemini reaches recall 0.5 line_sim 0.63 in both modes, surfacing that current skills do not address the sample-vs-framework-layer choice that task needed.
runner now installs every skills/xb-* dir via gemini skills install --consent at user scope for with-skill, uninstalls at end. cleans pre-existing leftovers at start so subsequent runs do not leak state. matches actual user workflow rather than relying on auto-discovery from the worktree.
uses gemini-2.5-pro via google-genai api directly, skill content in system prompt, no filesystem access. mirrors the canvas gem deployment rather than gemini-cli.

four prototyping tasks (netblocks, hands, ui, ai) score 1.0 with the matching skill and 0.50-0.81 without it. biggest gap on xb-netblocks (most bespoke api surface, no gemini priors).

FINDINGS.md tracks the design pivots: contributor-pr-replay (wrong audience), cli-prototyping (ceiling effect), api-canvas-faithful (the one that worked).
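
a minimal sketch of the two-arm setup described above, assuming the skill text is simply appended to a fixed base instruction. `BASE_INSTRUCTIONS`, `build_system_prompt` and `run_arms` are hypothetical names, and the model call is passed in as a plain callable so the sketch does not depend on any particular client library.

```python
from typing import Callable, Dict

# Hypothetical base instruction; the PR does not show the real system prompt.
BASE_INSTRUCTIONS = "You write a complete main.js from scratch. Reply with code only."


def build_system_prompt(skills: Dict[str, str], with_skill: bool) -> str:
    """Bake the skill content into the system prompt, or leave it out."""
    if not with_skill:
        return BASE_INSTRUCTIONS
    blocks = [f"# skill: {name}\n{body.strip()}" for name, body in sorted(skills.items())]
    return "\n\n".join([BASE_INSTRUCTIONS, *blocks])


def run_arms(
    task_prompt: str,
    skills: Dict[str, str],
    generate: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run the same task with and without skill content.

    `generate(system_prompt, user_prompt)` stands in for the real API call.
    """
    return {
        arm: generate(build_system_prompt(skills, arm == "with_skill"), task_prompt)
        for arm in ("with_skill", "without_skill")
    }
```

keeping the model call behind a callable also makes the golden-as-agent and empty-as-agent smoke checks easy to reproduce with stub functions.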
judge.py runs gemini-2.5-pro over each agent output with the full SKILL.md as ground truth. with-skill consistently 5/5/yes, without-skill 1/1/no. judge prose names the specific hallucinations (xr-room, <xr-hands>, xrb-button, etc).

smoke.py serves the workspace via http and loads index.html in headless chromium via playwright. caught a real bug in the xb-netblocks skill: its import examples use a path that 404s on the build-branch cdn.
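
a sketch of that kind of load-time smoke test, assuming the workspace has an index.html at its root. `serve` and `failed_loads` are illustrative names, not the functions in smoke.py. playwright is a third-party dependency and is imported lazily so the http half can be used on its own.

```python
import functools
import http.server
import threading


def serve(directory: str) -> http.server.ThreadingHTTPServer:
    """Serve a workspace on a free localhost port in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def failed_loads(url: str) -> list:
    """Load the page in headless chromium; collect HTTP errors and page errors."""
    from playwright.sync_api import sync_playwright  # third-party, assumed installed

    failures = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Any 4xx/5xx response (e.g. a CDN import that 404s) counts as a failure.
        page.on(
            "response",
            lambda r: failures.append(f"{r.status} {r.url}") if r.status >= 400 else None,
        )
        # Uncaught exceptions raised by the page's own scripts.
        page.on("pageerror", lambda err: failures.append(str(err)))
        page.goto(url, wait_until="networkidle")
        browser.close()
    return failures
```

usage would be along the lines of `server = serve(workspace)` followed by `failed_loads(f"http://127.0.0.1:{server.server_address[1]}/index.html")`, treating an empty list as a pass.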

ablate.py drops one skill section at a time from the system prompt. all six variants still scored 1.0, so binary api_match is overdetermined; finer ablations need the judge in the loop.
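
the drop-one-section loop can be sketched like this, assuming sections are delimited by level-2 markdown headings. that delimiter is a guess, and ablate.py may split SKILL.md differently.

```python
import re


def split_sections(skill_md: str) -> list:
    """Split at level-2 headings; text before the first one is its own chunk."""
    parts = re.split(r"^(?=## )", skill_md, flags=re.MULTILINE)
    return [part for part in parts if part]


def ablations(skill_md: str):
    """Yield (dropped_heading, remaining_text) with one section removed each time."""
    sections = split_sections(skill_md)
    for index, dropped in enumerate(sections):
        heading = dropped.splitlines()[0]
        remaining = "".join(sections[:index] + sections[index + 1:])
        yield heading, remaining
```

each `remaining` text would be used as the skill content for one run, so a skill with six sections gives six ablated variants plus the full-skill baseline.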
removes v1 pr-replay scaffolding (dead weight). adds depth-occlusion, gestures-thumbs-up, modelviewer-gltf, physics-falling-cube, sound-spatial-audio, world-plane-detection tasks. adds run_all.sh orchestrator (with --judge), summarize_proto.py, plot.py (composite + per-metric + judge charts). rewrites evals/README.md for the actual v3 design.

n=10 sweep result: +0.27 average composite gap with vs without skill. holds across all 10 skills. judge agrees 10/10 on would_merge. biggest gaps on bespoke api surfaces (netblocks, physics, modelviewer); smallest on webxr-standard surfaces (hands, depth).
adds 10 canvas-* tasks phrased like real canvas-gem prompts (imaginative outcome descriptions, no api jargon) to test alongside the engineer-spec ones. namespaces results and workspaces by model so pro and flash can co-exist. swaps the judge would_merge dim for hallucination_severity (none/minor/major), a focused measure of the failure mode the eval actually catches.

full pro+flash sweep result: with-skill avg 0.97, without 0.71, delta +0.26 to +0.31 across cells. canvas-faithful prompts widen the gap rather than narrow it.
…rouped task charts

import_match was defaulting to 1.0 when expected_imports was empty (16/20 tasks), and forbidden_clean / parse_ok are mostly invariant, so the original composite mean inflated by ~0.07 over the without-skill arm.

recomputes composite to skip import_match when expected_imports is empty (kept the per-metric values intact for inspection). adds a headline api_match_breakdown.png since api_match is the only dimension that actually moves between arms.
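
the effect of that change can be sketched as follows, assuming the composite is an equal-weight mean of the per-metric scores. the weighting is an assumption, and the metric values in the usage note are toy numbers, not results from the sweep.

```python
from statistics import mean
from typing import Dict, List, Optional


def composite(metrics: Dict[str, Optional[float]], expected_imports: List[str]) -> float:
    """Mean of the metrics, leaving out import_match when it is vacuous."""
    kept = [
        value
        for name, value in metrics.items()
        if value is not None
        and not (name == "import_match" and not expected_imports)
    ]
    return mean(kept)
```

with `import_match=1.0, api_match=0.0, parse_ok=1.0, forbidden_clean=1.0` and no expected imports, the plain mean would be 0.75 while this version gives 2/3, which is the kind of inflation the fix removes.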

per-task charts now group engineer-spec on the left, canvas-faithful on the right, with a vertical separator. metric grid uses better short labels. prompt-style chart moves delta annotations above bars instead of overlapping the x-axis.

headline numbers after fix: api_match goes from 0.03-0.07 without to 0.85-1.00 with, delta +0.78 to +0.97 across cells.
the dev log narrative was useful while iterating but is not the right artifact for a public PR. moves the content out of the repo. saved separately for the author.
moves all chart legends outside the plot so they stop overlapping bars or delta annotations. lifts the engineer-spec / canvas-faithful group labels higher and bumps the title pad so they no longer overlap. ignores evals/results in prettier (regenerable, lots of files).
@salmanmkc salmanmkc marked this pull request as ready for review June 6, 2026 05:48
salmanmkc and others added 3 commits June 6, 2026 14:51
investigated three with-skill "gaps" surfaced by the eval and found they were spec-side bugs, not skill defects:

1. modelviewer-gltf required setupPlatform, but xb.ModelViewer auto-attaches a platform by default. Dropped the expectation.
2. depth-occlusion required xrDepthMeshOptions, but enableDepth applies that preset internally. Dropped the expectation.
3. canvas-dandelion and canvas-occluded-statue had the same issues, fixed in parallel.

re-scored all 80 workspaces against the corrected specs. headline api_match numbers improved:
- pro engineer-spec: 0.92 to 1.00
- flash engineer-spec: 0.85 to 0.93

canvas-faithful cells unchanged at 1.00 / 0.97. without-skill arm unchanged. judge re-run with new schema and ground truth.

charts regenerated.

@ruofeidu (Collaborator) commented Jun 6, 2026

I am deeply impressed!

As said in https://arxiv.org/abs/2603.24591, we also have an evaluation pipeline, but it hasn't been open sourced since I don't think it is rigorous enough. It was also designed for traditional one-shot prompting instead of agentic loops.

Thanks for the contribution, I'll take a deeper look and think about how we should open source what we have together with what you contribute here! I appreciate your proactivity and treat you as one of our key contributors here :)

@ruofeidu ruofeidu requested a review from dli7319 June 6, 2026 23:14