Skill testing harness by Myestery · Pull Request #2932 · golemcloud/golem

Myestery · 2026-03-05T02:58:03Z

Summary

Merges the skill testing harness infrastructure into main. This includes the full agent testing harness for validating AI coding agents can discover and follow Golem skill files, CI workflow, and bootstrap scenarios.

Issues

Closes #2872
Closes #2873
Closes #2874
Closes #2875
Closes #2876
Closes #2878
Closes #2879
Closes #2880
Closes #2881
Closes #2882
Closes #2898
Closes #2906
Closes #2907
Closes #2887
Closes #2895
Closes #2897
Closes #2889
Closes #2890
Closes #2893
Closes #2894
Closes #2912

* Add agent testing harness for skill verification Implements the skill testing harness (#2872-#2882, #2898) that validates AI coding agents can discover, load, and follow Golem skill files to produce correct build artifacts. - CLI entrypoint with arg parsing and scenario filtering - YAML scenario loader with Zod schema validation - AgentDriver interface with Claude Code and Gemini (stub) drivers - SkillWatcher with inotifywait (Linux), fswatch (macOS), atime and presence-based fallback detection - Skill activation assertion engine (default, strict, allowedExtras) - Build verification with golem.yaml directory discovery - JSON report generation per scenario run - Bootstrap scenario: golem-new-project-ts - golem-new-project skill (SKILL.md) - CI workflow for unit + integration tests on Ubuntu - 21 unit tests (watcher, executor assertions, loader validation) * Fix CI test glob pattern for Linux compatibility Change **/*.test.js to *.test.js since /bin/sh (dash) on Linux does not support globstar. All test files are in a flat directory so ** is unnecessary. * Fix CI: golem binary, health check, and exit code on failure - Download golem v1.4.2 binary from golemcloud/golem releases - Use golem server run with /healthcheck readiness loop - Fix executor health check to use /healthcheck endpoint - Exit with code 1 when any scenario fails

github-actions · 2026-03-05T02:58:14Z

✅ All contributors have signed the CLA.
_{Posted by the CLA Assistant Lite bot.}

…oke, shell/sleep/trigger, create/delete agent, opencode driver, aggregated report - Add assertion engine with exit_code, stdout, body, status, and result_json checks (#2895) - Add scenario-level settings, prerequisites, step timeout, and continue_session (#2887) - Add deploy verification with implicit build (#2889) - Add invoke check with expect assertions (#2890) - Add shell, sleep, and trigger step actions (#2893) - Add create_agent and delete_agent step actions (#2894) - Add OpenCode driver stub with `opencode run` (#2897) - Add aggregated summary report with summary.json output (#2912)

…cies scenario, update paths

- Add --approval-mode yolo to Gemini CLI to enable all tools (run_shell_command, activate_skill, etc.) - Symlink skills to .gemini/skills/ so Gemini can discover them - Watch all agent skill dirs (.claude, .gemini, .agents) for activation - Fix macOS APFS relatime: reset atime before mtime so reads trigger updates - Add fswatch -a (access) and -L (follow-links) flags for macOS - Remove presence-check fallback, log full paths for detected skills - Move skills dir default from golem/skills to skills/

- Add opencode (opencode-ai) to matrix agents - Replace Gemini CLI placeholder with actual npm install - Update path triggers from golem/skills/ to skills/

Each driver now declares a skillDirs array instead of duplicating the symlink loop. Removes ~60 lines of repeated code.

Adds CodexAgentDriver using codex exec with session resume support. Makes --scenarios default to ./scenarios so --scenario can be used alone.

Adds version check fallback between golem and golem-cli binaries and clarifies that golem should not be built from scratch.

- Pre-create ~/.gemini/ dir to prevent ENOENT on projects.json - Use GEMINI_API_KEY secret directly - Add codex agent to CI matrix with OPENAI_API_KEY

OpenCode expects GOOGLE_GENERATIVE_AI_API_KEY, not GEMINI_API_KEY.

Codex CLI requires explicit login rather than reading OPENAI_API_KEY directly from the environment.

Copies harness and skills to /tmp/harness-run/ with a fresh git init so agents cannot crawl up into the golem repo.

vigoo

Good start, added some comments.

When those are addressed, let's remove the ticket links from this PR that are not solved yet (drivers that are only stubs etc) so we can merge this and follow-up PRs can target main directly.

vigoo · 2026-03-16T08:22:17Z

golem-skills/skills/golem-new-project/SKILL.md

@@ -0,0 +1,53 @@
+---


Organization:

let's move the whole thing into a subdirectory of the repo:

golem-skills/skills/... golem-skills/tests/harness/...

so it is not interleaved with the server code
(similar to how we have sdks/ etc)

Done: moved everything under golem-skills/skills/ and golem-skills/tests/harness/. Updated CI workflow paths, .gitignore, and AGENTS.md

vigoo · 2026-03-16T08:25:54Z

tests/harness/src/run.ts

+Usage:
+  npx tsx src/run.ts --agent <name> --language <ts|rust> --scenarios <dir> [options]
+
+Required:


I see how this is used in the CI with a matrix job, but we should make local runs more convenient. What I have in mind, let's say I write a new skill for all languages, I want to run it (locally) with all languages and coding agents to see if it's good enough before I submit a PR.

It makes sense to make all defaults to target all options.

"Implemented : agent and --language now default to all. Running npx tsx src/run.ts with no flags runs every agent × language combination. Report filenames include {agent}-{language}-{scenario}.json to avoid collisions

vigoo · 2026-03-16T08:27:46Z

golem-skills/tests/harness/src/run.ts

+  const skillsDir = path.resolve(process.cwd(), skillsDirRel!);
+  const scenariosDir = path.resolve(process.cwd(), scenarios!);
+  const resultsDir = path.resolve(process.cwd(), output!);
+  const globalTimeoutSeconds = timeout ? Number.parseInt(timeout, 10) : undefined;


Is a 10 second timeout meaningful for tests like this? Reading the skill, actually using golem to do stuff, verifying. Isn't it taking minutes?

I agree the default timeout isn't very meaningful. I will increase it to 120 cos that was what I was explicitly passing during testing.

Updated to 300 now

vigoo · 2026-03-16T08:31:56Z

golem-skills/tests/harness/src/watcher.ts

+  timestamp: number;
+}
+
+export class SkillWatcher {


So did the filesystem watching proved to be the best way to detect if an agent uses skills? Did you investigate other ways as the design document proposed? Please write a summary of your findings

I checked out ACP and also checked out some agents to see if there's an automated way to see skill usage from their output but it didn't work as planned.
The other options I had was adding a "print skill usage" prompt to the agents and regex parsing their output.

vigoo · 2026-03-16T08:35:22Z

tests/harness/src/run.ts

+    if (scenarioFilter && spec.name !== scenarioFilter) continue;
+
+    console.log(chalk.blue(`Running scenario: ${spec.name}`));
+    const workspace = path.join(process.cwd(), 'workspaces', spec.name.replace(/\s+/g, '-').toLowerCase());


Let's make sure this is .gitignore-d so it's not annoying when we run it locally

Yes this is already ignored

vigoo · 2026-03-16T08:37:56Z

tests/harness/src/executor.ts

+  steps: z.array(StepSpecSchema).min(1, 'Scenario must have at least one step'),
+});
+
+export interface StepSpec {


The action a step describes (invoke/shell/trigger/prompt etc) should be a discriminated union - we can't have more than one in one step and this type currently does not reflect that

vigoo · 2026-03-16T08:39:15Z

golem-skills/tests/harness/src/driver/claude.ts

@@ -0,0 +1,58 @@
+import { BaseAgentDriver, AgentResult } from './base.js';
+
+export class ClaudeAgentDriver extends BaseAgentDriver {


Did you consider using ACP? Please write a summary of why not use it

ACP is nice and well suited for creating interactive applications/editors that use multiple AI agents but I think its an overkill/mismatch for this use case.
Ideally what we want here is to be able to send prompts to AI agents, know when they are done, see the skill files they accessed and judge their output.
We also want to use it in a way that's closest to how the ideal user will use it.

Other things to consider

ACP is made for editors

ACP has no concept of skills use reporting

ACP adds protocol overhead with close to zero benefit

vigoo · 2026-03-16T08:41:30Z

tests/harness/scenarios/golem-new-project-ts.yaml

+    expectedSkills:
+      - "golem-new-project"
+    verify:
+      build: true


We should verify that it also deploys to have an end-to-end example, and also have as many of the steps and assertions as possible (even if not necessary for the final "new project skill") so we can see the test harness working.

Added the deploy step

vigoo · 2026-03-16T08:43:34Z

tests/harness/src/run.ts

+    console.log(`${allPassed ? 'PASS' : 'FAIL'} ${spec.name} steps=${results.length}/${spec.steps.length}`);
+  }
+
+  // Aggregated summary report (#2912)


We should generate reports in CTRF format so we can nicely integrate it in our CI flow
(this can be a separate ticket)

tracking in #2997 now

Restructure to group skill definitions and the testing harness under a single top-level golem-skills/ directory. Update CI workflow paths, .gitignore, and AGENTS.md accordingly.

Move issue tracking to PR description. Update OpenCode driver comment to clarify session continuity status.

Add .refine() to StepSpecSchema ensuring exactly one action field per step. Define StepSpec as a union type for better type narrowing. Add negative tests for zero and multiple actions per step.

Extract createDriver() function, add SUPPORTED_AGENTS and SUPPORTED_LANGUAGES constants, wrap scenario loop in agent/language matrix. Update report filenames to include agent-language prefix. Show default timeout (300s) in help text.

Add file existence check via shell step and deploy verification to demonstrate more harness capabilities.

Share the default timeout value between executor and run.ts help text via an exported constant.

The scaffolded project has no components, so deploy produces an empty diff that the CLI misreads as a concurrent modification error.

Myestery marked this pull request as draft March 9, 2026 08:20

Myestery and others added 5 commits March 9, 2026 22:21

Move golem-new-project skill to golem/skills/, remove adding-dependen…

d7cb9b1

…cies scenario, update paths

Merge branch 'main' into skills

822ee01

Add opencode agent to CI, uncomment Gemini install, update paths

9959988

- Add opencode (opencode-ai) to matrix agents - Replace Gemini CLI placeholder with actual npm install - Update path triggers from golem/skills/ to skills/

Myestery marked this pull request as ready for review March 10, 2026 05:37

Myestery and others added 11 commits March 11, 2026 07:16

Merge branch 'main' into skills

1da3c87

Add --help flag to harness run.ts (#2871)

9125b2e

Extract skill symlink logic into BaseAgentDriver.linkSkills()

3c516ec

Each driver now declares a skillDirs array instead of duplicating the symlink loop. Removes ~60 lines of repeated code.

Add Codex agent driver and default --scenarios flag

f35aa63

Adds CodexAgentDriver using codex exec with session resume support. Makes --scenarios default to ./scenarios so --scenario can be used alone.

Update golem-new-project skill to check golem/golem-cli availability

ade2e95

Adds version check fallback between golem and golem-cli binaries and clarifies that golem should not be built from scratch.

Fix Gemini CI failure and add Codex to CI matrix

ab1beb0

- Pre-create ~/.gemini/ dir to prevent ENOENT on projects.json - Use GEMINI_API_KEY secret directly - Add codex agent to CI matrix with OPENAI_API_KEY

Add GOOGLE_GENERATIVE_AI_API_KEY for OpenCode in CI

81e6fb4

OpenCode expects GOOGLE_GENERATIVE_AI_API_KEY, not GEMINI_API_KEY.

Add codex login step in CI using --with-api-key

e1ec14b

Codex CLI requires explicit login rather than reading OPENAI_API_KEY directly from the environment.

Run harness tests in isolated directory outside the repo

9c05312

Copies harness and skills to /tmp/harness-run/ with a fresh git init so agents cannot crawl up into the golem repo.

Fix git init in CI isolated dir: set user identity and branch name

5ff705c

Fix isolated dir: copy dirs individually, use global tsx

e037b22

Myestery linked an issue Mar 16, 2026 that may be closed by this pull request

Scaffold harness project (package.json, tsconfig, directory layout) #2871

Open

vigoo requested changes Mar 16, 2026

View reviewed changes

Myestery added 6 commits March 16, 2026 15:57

Move skills/ and tests/harness/ into golem-skills/ directory

689bcc7

Restructure to group skill definitions and the testing harness under a single top-level golem-skills/ directory. Update CI workflow paths, .gitignore, and AGENTS.md accordingly.

Strip inline ticket references from code comments

aaabacc

Move issue tracking to PR description. Update OpenCode driver comment to clarify session continuity status.

Add discriminated union validation for StepSpec

cb61348

Add .refine() to StepSpecSchema ensuring exactly one action field per step. Define StepSpec as a union type for better type narrowing. Add negative tests for zero and multiple actions per step.

Make --agent and --language optional, default to all

00b9e31

Extract createDriver() function, add SUPPORTED_AGENTS and SUPPORTED_LANGUAGES constants, wrap scenario loop in agent/language matrix. Update report filenames to include agent-language prefix. Show default timeout (300s) in help text.

Expand golem-new-project-ts scenario with deploy and shell steps

275a002

Add file existence check via shell step and deploy verification to demonstrate more harness capabilities.

Extract DEFAULT_STEP_TIMEOUT_SECONDS constant

0fd1472

Share the default timeout value between executor and run.ts help text via an exported constant.

Myestery mentioned this pull request Mar 16, 2026

Add CTRF report format to skill testing harness #2997

Open

Format code with prettier

58b9e96

Myestery force-pushed the skills branch from 349228f to 75c0060 Compare March 17, 2026 15:28

Remove deploy step from golem-new-project-ts scenario

f632aae

The scaffolded project has no components, so deploy produces an empty diff that the CLI misreads as a concurrent modification error.

Myestery force-pushed the skills branch from 75c0060 to f632aae Compare March 17, 2026 15:31

		@@ -0,0 +1,58 @@
		import { BaseAgentDriver, AgentResult } from './base.js';

		export class ClaudeAgentDriver extends BaseAgentDriver {

Conversation

Myestery commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Issues

Uh oh!

github-actions bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vigoo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Myestery commented Mar 5, 2026 •

edited

Loading

github-actions bot commented Mar 5, 2026 •

edited

Loading