Skip to content

feat(azure-compute): add VM creation workflow with approval gate before deploy#2297

Open
deankroker wants to merge 13 commits into
microsoft:mainfrom
deankroker:feat/azure-compute-skill
Open

feat(azure-compute): add VM creation workflow with approval gate before deploy#2297
deankroker wants to merge 13 commits into
microsoft:mainfrom
deankroker:feat/azure-compute-skill

Conversation

@deankroker
Copy link
Copy Markdown
Member

@deankroker deankroker commented May 18, 2026

What this does

Before this PR, asking Copilot for Azure to "create a VM" jumped straight to generating Bicep or running az vm create. The user had no chance to review what was about to be deployed, and for VM scale sets the model often missed the intent entirely.

This PR adds a guided workflow so the same request now:

  1. Asks 2–3 short questions to figure out what the user actually needs (region, size, OS, network exposure)
  2. Shows a Plan Card — a 14–21 row summary of every parameter the deployment will use, with each value labeled by where it came from (user input, default, recommendation)
  3. Waits for the user to approve, edit, or change format — in a single picker, not three sequential prompts
  4. Only then generates Bicep / Terraform / az CLI / MCP calls

It also recognizes VMSS intent without the user having to say "VMSS" (e.g. "fleet that autoscales" → Flexible orchestration scale set), and refuses to deploy templates that would leave SSH open to the internet.

What's in the PR

Workflow + supporting docs

  • 57789bd — new vm-creator workflow under plugin/skills/azure-compute/, with depth-probe references (beginner / cost / networking / security / spec) so the question set adapts to user expertise. Output adapters for az CLI, Bicep, Terraform, and Azure MCP. Trims vm-recommender.md and pulls verbose web_fetch rules into a reference file.
  • a7c7954 — Plan Card approval is now one batched picker (approve / edit / change format) instead of three sequential questions.

Plumbing

  • e729b2f — switches the Azure MCP CLI to --mode single, collapsing ~26 per-service tool schemas (~30K tokens) behind one routing tool (~3K tokens). Same APIs, same auth — the schemas just don't load until the tool is called. This is what makes the workflow fit in context without truncating.

Eval coverage

  • 4592b1fevals/azure-compute/eval.yaml with 16 stimuli covering router accuracy, depth-probe branches, VMSS recognition, Plan Card gate, and output adapter coverage. Smoke tier (5 × 3) was 3/5 flaky before; this PR brings it to 5/5 passing 3-for-3.

Follow-ups during review

  • 710e3f7 — fixes Skill Structure CI: a few cross-file references pointed at paths that didn't exist after the workflow split.
  • a590be1 — rewrites the skill frontmatter description: from a >- folded scalar to an inline double-quoted string. skills.sh only accepts the inline form.
  • ef08203 — removes the '*' default on sshSourceAddressPrefix (Bicep) and ssh_source_address_prefix (Terraform). The templates now require the caller to pass their own IP or a trusted CIDR; opening port 22 to the entire internet has to be an explicit, accepted choice. Resolves the open Copilot security suggestions.

Why it matters

The visible change: a user asking for a VM gets to see and edit what's about to be deployed, instead of finding out after the fact. The eval suite locks the behavior in so it doesn't regress.

The invisible change (e729b2f): the whole workflow used to run out of context room around turn 3 because the Azure MCP server was loading every service's tool schema up front. With --mode single, schemas load on demand and the workflow holds end-to-end.

Test plan

  • vally run evals/azure-compute/eval.yaml --tier smoke — 5/5 stimuli × 3 trials
  • Manual: "I need a small Ubuntu VM to test something" → beginner depth probe → 2 questions → Plan Card → batched picker. PASS (5 turns, $0.42)
  • Manual: "set up an autoscaling web tier for ~500 rps" → VMSS recognized without user saying VMSS → Flexible orchestration recommended. PASS (4 turns, $0.39)
  • Manual: "apply via MCP" end-to-end — verify the single MCP routing tool stays responsive past turn 3. PASS through 16 turns (quota validation caught zero D2s_v5 quota, model auto-switched to B1s). Apply step was cancelled at the permission prompt, which is orthogonal to the schema-cost fix being tested.
  • No regressions in vm-recommender standalone (recommend-only, no hand-off to creator). PASS (1 turn, $0.30)
  • All 11 CI checks pass on the latest commit (ef08203)

Manual items run via the claude-copilot proxy on claude-sonnet-4.6 with the plugin loaded from a working tree clone.

deankroker and others added 4 commits May 17, 2026 09:39
…der token limits

Adds the vm-creator workflow (Plan-Card-gated VM/VMSS provisioning with
Azure MCP, az CLI, Bicep, and Terraform output adapters) and refactors
the existing azure-compute skill files to stay well under CI token limits:

  - SKILL.md          1,840 -> 382 tokens
  - vm-recommender.md 3,006 -> 1,573 tokens (extracted handoff-to-creator
                                              and web-fetch-policy refs)
  - vm-creator.md     new; 1,676 tokens

vm-creator splits its larger references into sub-folders so every file
sits under the 1,000-token soft limit:

  - references/depth-probe/        (6 files, max 755t)
  - references/output-adapters/    (5 files, max 760t)
  - references/delivery-options/   (4 files, max 817t)
  - references/{plan-card,validation-gates,mcp-tools}.md (3 standalone)

Bicep and Terraform templates moved to examples/ as code files so they
don't count against markdown token budgets. The skill now consumes the
four new compute discovery tools shipping in the paired mcp PR
(vm list-skus, vm list-images, vm check-quota, vm recommend-region) as
its Step 4 validation gates.

Zero hard-limit violations and zero soft warnings on any file in this PR.
The bundled Azure MCP server registers one tool per service namespace
by default (~26 tools: compute, storage, keyvault, monitor, sql, ...).
Combined with the azure-compute skill's loaded reference files, the
total context routinely crosses Claude Code's autocompact threshold
within 3 turns, causing thrash errors during vm-creator flows.

Passing --mode single collapses all namespaces into one routing tool
(per the upstream @azure/mcp CLI). Schema cost drops from ~30K to
~3K tokens with zero functionality loss — service routing happens
inside the single tool via its command arg.

Verified: echo "create a vm" | claude --plugin-dir output --print
- before: autocompact thrash after 3 turns
- after: 24.9s, 145K input tokens, full Plan Card rendered
… action picker

The old approve → output-format → delivery flow was 3 sequential popups for
what's effectively one decision ("how do you want this delivered?"). This
trims plan-card.md to specify a single AskUserQuestion with 4 likely
combinations, an explicit "Edit a row" branch, and an early-skip when the
user's original prompt already implied the delivery choice.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…SS, plan card, adapters

16 stimuli across the azure-compute skill surface area:
- Router accuracy (vm-creator vs vm-recommender vs azure-prepare handoff)
- Depth-probe branch selection (beginner / spec / cost / networking / security)
- VMSS recognition without the user saying "VMSS" (agent pools, runner fleets)
- Plan Card gate (no artifact emitted before approval, even under pressure)
- Output adapter coverage (az CLI / Bicep / Terraform / MCP apply)
- Security posture (no open SSH/RDP, no plaintext passwords, managed identity preferred)

Smoke tier (5 stimuli, --tag tier=smoke) currently passes 15/15 trials with
zero flakes against claude-sonnet-4.6.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 22:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands the azure-compute skill from recommendation/troubleshooting routing into a fuller VM/VMSS creation workflow, adds MCP single-tool mode context optimization, and introduces an eval suite to guard routing and Plan Card behavior.

Changes:

  • Adds a new vm-creator workflow with depth probes, validation gates, Plan Card approval, output adapters, delivery options, and example Bicep/Terraform artifacts.
  • Updates the existing VM recommender and main skill router to hand off provisioning requests to vm-creator.
  • Adds Azure Compute Vally eval coverage and switches Azure MCP startup to --mode single.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
plugin/skills/azure-compute/SKILL.md Updates router description and adds vm-creator routing.
plugin/.mcp.json Starts Azure MCP in single-tool mode.
evals/azure-compute/eval.yaml Adds azure-compute routing/workflow eval suite.
plugin/skills/azure-compute/workflows/vm-recommender/vm-recommender.md Trims recommender workflow and adds creator handoff.
plugin/skills/azure-compute/workflows/vm-recommender/references/web-fetch-policy.md Adds live-doc verification fallback policy.
plugin/skills/azure-compute/workflows/vm-recommender/references/handoff-to-creator.md Defines recommender-to-creator handoff rules.
plugin/skills/azure-compute/workflows/vm-creator/vm-creator.md Adds top-level VM/VMSS creation workflow.
plugin/skills/azure-compute/workflows/vm-creator/references/validation-gates.md Documents required pre-Plan Card validation checks.
plugin/skills/azure-compute/workflows/vm-creator/references/plan-card.md Defines Plan Card schema and approval picker.
plugin/skills/azure-compute/workflows/vm-creator/references/mcp-tools.md Documents MCP commands and CLI fallbacks.
plugin/skills/azure-compute/workflows/vm-creator/references/depth-probe/index.md Adds depth-probe branch selection.
plugin/skills/azure-compute/workflows/vm-creator/references/depth-probe/beginner.md Adds beginner fast-path defaults.
plugin/skills/azure-compute/workflows/vm-creator/references/depth-probe/cost-deep.md Adds cost-focused questioning guidance.
plugin/skills/azure-compute/workflows/vm-creator/references/depth-probe/networking-deep.md Adds networking-focused questioning guidance.
plugin/skills/azure-compute/workflows/vm-creator/references/depth-probe/security-deep.md Adds security-focused questioning guidance.
plugin/skills/azure-compute/workflows/vm-creator/references/depth-probe/spec-deep.md Adds SKU/spec-focused questioning guidance.
plugin/skills/azure-compute/workflows/vm-creator/references/output-adapters/index.md Adds shared output adapter routing and field mapping.
plugin/skills/azure-compute/workflows/vm-creator/references/output-adapters/az-cli.md Adds az CLI output adapter.
plugin/skills/azure-compute/workflows/vm-creator/references/output-adapters/bicep.md Adds Bicep output adapter.
plugin/skills/azure-compute/workflows/vm-creator/references/output-adapters/terraform.md Adds Terraform output adapter.
plugin/skills/azure-compute/workflows/vm-creator/references/output-adapters/mcp-apply.md Adds live Azure MCP apply adapter.
plugin/skills/azure-compute/workflows/vm-creator/references/delivery-options/index.md Adds delivery mode decision logic.
plugin/skills/azure-compute/workflows/vm-creator/references/delivery-options/print.md Adds print-in-chat delivery mode.
plugin/skills/azure-compute/workflows/vm-creator/references/delivery-options/save-local.md Adds local file save delivery mode.
plugin/skills/azure-compute/workflows/vm-creator/references/delivery-options/github-pr.md Adds GitHub PR delivery mode.
plugin/skills/azure-compute/workflows/vm-creator/examples/bicep/main.bicep Adds deployable Bicep VM example.
plugin/skills/azure-compute/workflows/vm-creator/examples/bicep/README.md Adds Bicep example usage documentation.
plugin/skills/azure-compute/workflows/vm-creator/examples/terraform/main.tf Adds deployable Terraform VM example.
plugin/skills/azure-compute/workflows/vm-creator/examples/terraform/variables.tf Adds Terraform input variables.
plugin/skills/azure-compute/workflows/vm-creator/examples/terraform/outputs.tf Adds Terraform outputs.
plugin/skills/azure-compute/workflows/vm-creator/examples/terraform/README.md Adds Terraform example usage documentation.

Comment thread plugin/skills/azure-compute/SKILL.md Outdated
Comment thread plugin/skills/azure-compute/workflows/vm-creator/references/plan-card.md Outdated
Comment thread plugin/skills/azure-compute/workflows/vm-creator/examples/bicep/main.bicep Outdated
Comment thread plugin/skills/azure-compute/workflows/vm-creator/examples/terraform/main.tf Outdated
Comment thread plugin/skills/azure-compute/workflows/vm-creator/examples/terraform/README.md Outdated
deankroker and others added 3 commits May 18, 2026 16:12
- Fix mcp-tool-references test by rephrasing line 38 of mcp-tools.md to
  avoid the literal `azure__compute_vm_check-quota` regex match (the
  prose was being parsed as a tool reference even though it was just
  illustrating sub-command vs tool distinction)
- Update triggers.test.ts to reflect the broader azure-compute scope:
  remove prompts from the no-trigger list that now legitimately trigger
  due to generic compute keywords ("create"/"deploy"/"server") in the
  expanded description; add comment explaining the router's
  disambiguation rule forwards non-VM intents to the right skill
- Regenerate triggers snapshot for the expanded keyword set
- Address 7 Copilot review comments across bicep/terraform examples,
  depth-probe, plan-card, vm-creator, vm-recommender, and adapter docs
…h format rule

The frontmatter validator (description-format rule) rejects >- folded scalars
because skills.sh in the runtime is incompatible with them. Convert the
azure-compute SKILL.md description from a >- folded multi-line scalar to an
inline double-quoted string. Parsed text is identical, so trigger snapshots
and keyword extraction are unchanged.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…form examples

Both vm-creator templates previously defaulted the SSH allow-rule source to
'*', leaving port 22 open to the entire internet on any deployment that
didn't override the parameter. An exposed SSH endpoint gets credential-
stuffed within minutes of going public.

Make the parameter required (no default) and document the security tradeoff
on the description so callers must pass either their own /32 or a trusted
CIDR before the template will deploy. Quickstart now derives MY_IP from
ifconfig.me and passes it explicitly; cleanup command updated to match.

Addresses Copilot review comments on PR microsoft#2297.
@deankroker
Copy link
Copy Markdown
Member Author

Pushed ef08203 to address the open Copilot review suggestions. Audited all 6; 3 had real underlying problems that needed code changes, 3 were already correct as written.

Fixed (ef08203):

File Change
examples/bicep/main.bicep sshSourceAddressPrefix no longer defaults to '*' — now required; description explains the tradeoff
examples/terraform/variables.tf ssh_source_address_prefix no longer defaults to "*" — now required; description explains the tradeoff
examples/terraform/README.md Quickstart derives MY_IP=$(curl -s ifconfig.me)/32 and passes it explicitly to plan/apply; variables table marks the var required; Notes call out why; Cleanup command updated to match

Templates now refuse to deploy without an explicit source prefix — the open-to-internet path requires the user to pass "*" and accept the risk.

Audited, no change needed:

Comment Why no change
plan-card.md "4 options" vs 6 listed File already says "6 options" on the relevant line; suggestion was based on stale context
mcp-apply.md familyPrefixfamily The doc uses familyPrefix for compute_vm_list-skus (correct param for that command) and family for compute_vm_check-quota (correct param for that command) — confirmed against references/mcp-tools.md
depth-probe/beginner.md SSH/RDP defaults to * Beginner path already defaults to the user's current public IP via curl -s ifconfig.me; * is only the fallback if detection fails, and is always flagged with ⚠

Watching CI; will flag if anything regresses.

@deankroker deankroker changed the title feat(azure-compute): vm-creator workflow + MCP context fix + eval suite feat(azure-compute): add VM creation workflow with approval gate before deploy May 19, 2026
Comment thread evals/azure-compute/eval.yaml Outdated
Comment thread evals/azure-compute/eval.yaml Outdated
Comment thread evals/azure-compute/eval.yaml Outdated
Comment thread plugin/.mcp.json Outdated
deankroker and others added 6 commits May 20, 2026 12:25
Azure MCP uses AzureCliCredential (not DefaultAzureCredential) to pick
up the `az login` session.

Addresses review comment from JasonYeMSFT on PR microsoft#2297.
Plugin-wide change should not ride on a single skill's PR. Splitting
the context-savings work to a follow-up PR scoped to plugin config.

Addresses review comment from JasonYeMSFT on PR microsoft#2297.
Per review feedback, omit the brittle `AzureCliCredential` reference from
the MCP apply prerequisite. State the user-facing requirement (be signed
in to Azure, e.g. `az login`) and treat the exact credential mechanism
as an MCP server implementation detail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per JasonYeMSFT review on PR microsoft#2297 (thread 3270086905, eval.yaml:757):
when the user combines an explicit deliverable ("give me the Bicep") with
an explicit refusal of dialog ("no questions", "skip planning"), the
vm-creator workflow should honor that intent rather than overrule it with
a mandatory Plan Card approval popup.

vm-creator Step 5 now has two paths:
- Default: render the full Plan Card table + Approve/Edit AskUserQuestion
- Explicit-override fast path: emit a one-line preview of high-signal
  decisions + the requested artifact, no popup. Step 4 validation gates
  (SKU / image / quota / region) still run.

The plan-card-gate-001 eval stimulus is renamed to plan-card-override-001.
Prompt is unchanged (same explicit override), graders flipped to verify
the new contract: Bicep must emit directly, must name the SKU, must NOT
render the full Plan Card markdown table, must NOT block on the
Approve/Edit AskUserQuestion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…muli

Per JasonYeMSFT review on PR microsoft#2297 (thread 3270067015, eval.yaml:91):
output-text regex was a fragile proxy for routing validation. Migrate the
four router smoke stimuli to the skill-invocation grader, which checks
the actual trajectory's skill_activation events rather than scraping
response text.

Changes:

- route-vm-creator-001: required=[azure-compute], disallowed=[azure-prepare]
- route-vmss-001: required=[azure-compute]
- route-vm-recommender-001: required=[azure-compute]
- route-away-azure-prepare-001: disallowed=[azure-compute] (the
  anti-pattern is azure-compute silently provisioning a VM; both
  azure-prepare and azure-deploy are legitimate handoff targets)

Content-checking graders (Ubuntu mention, VMSS substring, SKU patterns,
pricing references, no-VM-artifact anti-pattern) are kept on output-text
regex — those are behavioral assertions about response content, distinct
from routing.

Adds skill-invocation: 1.5 to scoring weights (mirrors
azure-enterprise-infra-planner) so routing-miss failures are weighted
appropriately.

Verified end-to-end: smoke tier (5 stimuli, 1 run each) passes at 100%
against claude-sonnet-4.6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…from route-vmss-001

Per review feedback, output-not-matches graders are prone to false-positive
failures because agent output is stochastic. Drop the open-management-port
not-match grader from the route-vmss-001 case Jason flagged. The same
risky-NSG behavior is covered elsewhere in the workflow via skill content
and Plan Card rendering, so the eval signal is preserved without the
brittle negative pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants