samples/python: Add evaluation learning path for hosted agents by aprilk-ms · Pull Request #726 · microsoft-foundry/foundry-samples

aprilk-ms · 2026-05-22T10:52:03Z

What

Adds an evaluation learning path to the Python hosted-agent samples
(samples/python/hosted-agents/), per the three deliverables requested:

1. General evaluation learning path at
   agent-framework/responses/14-evaluation/: a tiny demo agent (main.py,
   agent.yaml) + 8 eval scripts + 4 seed datasets + a beginner-focused
   nav-hub README.md.
2. Multi-turn evaluation co-located in 01-basic/: two scripts
   (simulation + traces) + seed datasets + a new "Evaluating multi-turn
   conversations" section in 01-basic/README.md so the multi-turn learning
   path can stand on its own.
3. "Related: Evaluate this agent" links added to 31 hosted-agent
   READMEs (parent index, agent-framework table, every responses/01..13
   sample, invocations/01-basic, every bring-your-own/** sample).

The 8 scripts in `14-evaluation/`

| Script | What it teaches |
| --- | --- |
| `evaluate_basic.py` | Easiest first run: built-in evaluators (`task_adherence`, `fluency`, `relevance`) against 4 inline questions |
| `evaluate_custom_rubric.py` ⭐ | Generates a 5-7 dimension rubric tailored to your agent from a short prompt; the script you'd actually use on a real project |
| `evaluate_multiturn_simulation.py` | Foundry simulates full multi-turn conversations from seed scenarios |
| `evaluate_multiturn_traces.py` | Same multi-turn evaluators over real traced conversations |
| `generate_dataset_from_traces.py` | Materialize recent traces into a reusable registered dataset |
| `generate_dataset_synthetic.py` | Bootstrap a dataset from topic seeds when you have no traffic |
| `evaluate_scheduled.py` | Score every new agent response continuously |
| `evaluate_redteam.py` | Content-safety / red-team eval (violence, self-harm, hate, sexual; 0-7 severity) |

Shared helpers live in eval_common.py:

- API version pinned (2025-11-15-preview) in one place
- target_agent() default + env-var overrides (EVAL_AGENT_NAME, EVAL_AGENT_VERSION)
- print_friendly_output(): Pydantic-safe per-row summary (single-turn / multi-turn / dataset-row shapes), with EVAL_DEBUG=1 to also dump the raw payload
- build_testing_criteria(model, response_source=): "sample" for live agent runs ({{sample.output_text}}), "item" for dataset rows that already contain a response ({{item.response}})
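
As a rough illustration of how the pinned version, the env-var overrides, and the response_source switch listed above might fit together, here is a minimal sketch. It is not the PR's actual eval_common.py. The default agent name and version, the return shape of target_agent(), the {{item.query}} mapping, and the layout of each criterion dict are all assumptions made for this example; only the env-var names, the API version string, the three evaluator names, and the two response placeholders come from the description.

```python
import os

# Single bump point for the preview API version (value from the PR description).
API_VERSION = "2025-11-15-preview"

# Hypothetical defaults; the real defaults are not shown in the PR text.
DEFAULT_AGENT_NAME = "demo-agent"
DEFAULT_AGENT_VERSION = "1"

# Built-in evaluators the PR lists for evaluate_basic.py.
BUILTIN_EVALUATORS = ("task_adherence", "fluency", "relevance")


def target_agent() -> dict:
    """Return the agent to evaluate, letting env vars override the defaults."""
    return {
        "name": os.environ.get("EVAL_AGENT_NAME", DEFAULT_AGENT_NAME),
        "version": os.environ.get("EVAL_AGENT_VERSION", DEFAULT_AGENT_VERSION),
    }


def build_testing_criteria(model: str, response_source: str = "sample") -> list:
    """Build one criterion per built-in evaluator.

    response_source="sample" grades the live agent output.
    response_source="item" grades a response already stored in the dataset row.
    """
    templates = {
        "sample": "{{sample.output_text}}",
        "item": "{{item.response}}",
    }
    if response_source not in templates:
        raise ValueError(f"response_source must be one of {sorted(templates)}")

    return [
        {
            # Assumed dict layout; check the service docs for the real schema.
            "type": "azure_ai_evaluator",
            "name": name,
            "evaluator_name": f"builtin.{name}",
            "initialization_parameters": {"deployment_name": model},
            "data_mapping": {
                "query": "{{item.query}}",  # assumed dataset column name
                "response": templates[response_source],
            },
        }
        for name in BUILTIN_EVALUATORS
    ]
```

Keeping the placeholder choice behind one argument means the same criteria can grade either a live run or a pre-recorded dataset without duplicating the evaluator list.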

The demo agent ships with ENABLE_INSTRUMENTATION=true and
ENABLE_SENSITIVE_DATA=true so trace-based + continuous scripts work
out of the box. The README has an explicit privacy callout about this.
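
Returning to the helper list, the print_friendly_output() behavior (accept Pydantic objects or plain dicts, print a short per-row summary, dump the raw payload only when EVAL_DEBUG=1) could look roughly like the sketch below. The field names results, name, and score, and the helper names to_plain and summarize_row, are hypothetical; the real payload shape returned by the eval service is not shown in this PR description.

```python
import os
from pprint import pprint
from typing import Any


def to_plain(obj: Any) -> Any:
    """Recursively convert Pydantic-like objects into dicts and lists."""
    if hasattr(obj, "model_dump"):  # Pydantic v2 models expose model_dump()
        return to_plain(obj.model_dump())
    if isinstance(obj, dict):
        return {key: to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(value) for value in obj]
    return obj


def summarize_row(row: Any) -> str:
    """Format one output row as 'evaluator=score' pairs, in result order."""
    data = to_plain(row)
    if not isinstance(data, dict):
        return repr(data)
    parts = [
        f"{result.get('name', '?')}={result.get('score', 'n/a')}"
        for result in data.get("results", [])  # assumed field names
    ]
    return ", ".join(parts) if parts else "(no results)"


def print_friendly_output(rows: list) -> None:
    """Print a per-row summary; dump the raw payload only when EVAL_DEBUG=1."""
    for index, row in enumerate(rows, start=1):
        print(f"row {index}: {summarize_row(row)}")
    if os.environ.get("EVAL_DEBUG") == "1":
        pprint(to_plain(rows))
```

Normalizing to plain dicts first keeps the summary logic in one code path, which is also what makes it easy to smoke-test with stand-in objects instead of real eval runs.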

Beginner-friendliness passes

Two rubber-duck rounds (correctness + score scales) and one voice-and-verbosity
sweep were applied before this PR:

- Score-shape table distinguishes quality (1-5, higher better),
  agent-task (Pass/Fail + optional score), safety (0-7 severity, higher
  worse), attack-detection (boolean), custom rubric (weighted 1-5)
- builtin.indirect_attack documented as the prompt-injection
  evaluator (the non-existent builtin.jailbreak was removed in an
  earlier pass)
- evaluate_custom_rubric.py now polls to completion + prints
  friendly output, with CHANGE-THIS-FIRST markers on the rubric prompt
  and CHANGE-THIS-TOO on the inline eval questions
- generate_dataset_synthetic.py defaults to running the generated
  questions through the deployed agent; EVAL_AGAINST_DATASET_ONLY=true
  opts out
- "If a score is low, what next?", "Cost and data usage", and
  tracing-privacy sections added
- LLM-ism scrub: "small judges that emit", "the intuition is always",
  "the real recommended", "almost always", "small LLM judges", "emit a
  severity" all removed
- "The scripts" section compressed from 65 lines to 40 by removing
  overlap with the "Pick the right flow" table immediately above it
  (while keeping every load-bearing callout: schedule keeps running,
  redteam writes adversarial prompts to traces, etc.)
- One-line description of the demo agent added so readers don't
  need to open agent.yaml to figure out what it is
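
Because the score shapes in the first bullet point in different directions, a single "higher is better" check would misread safety results. The sketch below shows one way to turn each shape into a pass/fail verdict. The shape labels, the function names, every threshold, the reading of a true attack-detection flag as a failure, and the use of a weighted mean for the rubric score are hypothetical choices for illustration, not values taken from the README or the Foundry service.

```python
from typing import Mapping, Union

Score = Union[int, float, bool, str]

# Hypothetical thresholds, chosen only for this example.
QUALITY_MIN = 3  # quality and rubric scores: 1-5, higher is better
SAFETY_MAX = 2   # safety severity: 0-7, higher is worse


def is_pass(shape: str, value: Score) -> bool:
    """Interpret a raw evaluator result according to its score shape."""
    if shape in ("quality", "custom_rubric"):
        return float(value) >= QUALITY_MIN
    if shape == "agent_task":
        # Pass/Fail label; any optional numeric score is ignored here.
        return str(value).strip().lower() == "pass"
    if shape == "safety":
        return float(value) <= SAFETY_MAX
    if shape == "attack_detection":
        # Assumed meaning: True says an attack was detected, so the row fails.
        return not bool(value)
    raise ValueError(f"unknown score shape: {shape!r}")


def weighted_rubric_score(scores: Mapping[str, float],
                          weights: Mapping[str, float]) -> float:
    """Weighted mean of per-dimension 1-5 scores."""
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight
```

For example, `is_pass("safety", 5)` is false under these thresholds even though 5 would be a top score on the quality scale.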

Verification

- All 12 new Python files AST-parse clean
- print_friendly_output() smoke-tested against Pydantic-like objects +
  single-turn / multi-turn / dataset-row shapes
- All 31 added "Related" link targets resolve to existing relative paths
- No stale references to the removed CLI command azd ai agent deploy or
  to the non-existent builtin.jailbreak evaluator anywhere in the diff
- Deploy flow matches the parent README's canonical pattern
  (mkdir foo && cd foo && azd ai agent init -m <manifest> && azd up)
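
Two of the checks above (AST-parse and relative link resolution) are easy to reproduce offline. The following is a minimal sketch of how they could be scripted with the standard library only; it is not the tooling actually used for this PR, and the function names and the link regex are assumptions. The regex only covers inline Markdown links of the form [text](target).

```python
import ast
import re
from pathlib import Path

# Matches inline Markdown links and captures the target, e.g. [text](../x/README.md).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def syntax_errors(root: Path) -> list:
    """Return one message per .py file under root that fails to parse."""
    failures = []
    for path in sorted(root.rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            failures.append(f"{path}: line {exc.lineno}: {exc.msg}")
    return failures


def broken_relative_links(readme: Path) -> list:
    """Return relative link targets in readme that do not exist on disk."""
    missing = []
    for target in LINK_RE.findall(readme.read_text(encoding="utf-8")):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip absolute URLs and in-page anchors
        relative = target.split("#", 1)[0]  # drop any fragment
        if not (readme.parent / relative).exists():
            missing.append(target)
    return missing
```

Run from the repository root, `syntax_errors(Path("samples/python/hosted-agents"))` returning an empty list would correspond to the "AST-parse clean" claim. Parsing only proves the files are syntactically valid, which is why the end-to-end run is still listed separately as unverified.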

Not yet verified

End-to-end run against a live Foundry project (no Azure creds in the
authoring environment). The preview API surfaces
(evaluator_generation_jobs, data_generation_jobs, scheduled eval)
are pinned to API version 2025-11-15-preview in eval_common.py
as the single bump point.

File count

- 60 files in the first commit (8 scripts + helpers + manifest +
  Dockerfile + 4 seeds + 31 README updates)
- 3 files in the voice-pass commit (-25 net lines)

Three things:

1. New 14-evaluation sample under agent-framework/responses/, with a beginner-focused README (What is evaluation?, score-shape table, first-run quickstart with expected console output, "if a score is low" checklist, cost/data-usage table, privacy callout for ENABLE_SENSITIVE_DATA=true) and 8 evaluate_*/generate_dataset_* scripts:
   - evaluate_basic.py (single-turn built-in evaluators)
   - evaluate_custom_rubric.py (Custom Rubric Evaluator, generate + inspect + HITL regenerate + eval)
   - evaluate_multiturn_simulation.py
   - evaluate_multiturn_traces.py (agent_filter / conversation_id / trace_id variants)
   - generate_dataset_from_traces.py
   - generate_dataset_synthetic.py (default runs the deployed agent against generated queries; EVAL_AGAINST_DATASET_ONLY=true falls back to grading the rows)
   - evaluate_scheduled.py (event-triggered + interval modes)
   - evaluate_redteam.py (content-safety evaluators on the 0-7 severity scale; indirect_attack is an opt-in commented entry)

   Shared helpers live in eval_common.py, including a friendly print_friendly_output() formatter that normalizes both Pydantic eval-run objects and plain dicts, and gates the raw pprint behind EVAL_DEBUG=1. Tracing is enabled by default (ENABLE_INSTRUMENTATION + ENABLE_SENSITIVE_DATA) so the trace-based and scheduled scripts work out of the box.
2. Multi-turn evaluation co-located in 01-basic: evaluate_multiturn_simulation.py + evaluate_multiturn_traces.py defaulted to the 01-basic agent name, with small seed JSONL files under data/, a requirements-eval.txt for the eval-only deps, and a new README section that explains what evaluation is, points back to 14-evaluation, flags the tracing prereq + redeploy step, and adds a privacy heads-up on ENABLE_SENSITIVE_DATA.
3. "Related: Evaluate this agent" links added to every hosted-agent sample README (responses/01..13, invocations/01-basic, all bring-your-own/**) pointing at 14-evaluation, plus a row in the main hosted-agents README learning path and the agent-framework Responses API samples table.

Scripts pin API version 2025-11-15-preview in eval_common.py for preview LROs (evaluator generation, data generation, scheduled eval) and use openai_client.evals for the GA-ish eval-run surface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rubber-duck pass focused on voice and length dimensions:

* Drop LLM-isms: `small judges that emit`, `the intuition is always`, `the *real* recommended`, `the words 'why' are almost always`, `emit a severity 0-7`, `small LLM judges`.
* Compress `The scripts` section from 65 lines to ~40 by cutting redundancy with `Pick the right flow`; each entry now one paragraph with the same callouts (schedule keeps running / red-team writes to traces / generate_dataset_synthetic default).
* Add a one-line description of what the demo agent actually does so readers don't have to open agent.yaml to find out.
* Tighten the EVAL_DEBUG note and the pass-threshold paragraph.

Net: README 362 -> 337 lines (-25), 20.6KB -> 19.0KB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-05-22T10:58:25Z

👋 Thanks for your interest in contributing, @aprilk-ms!

This repository does not accept pull requests directly. If you'd like to report a bug, suggest an improvement, or propose a new sample, please open an issue instead.

See CONTRIBUTING.md for more details.

aprilk-ms and others added 2 commits May 22, 2026 03:27

github-actions Bot closed this May 22, 2026

aprilk-ms wants to merge 2 commits into microsoft-foundry:main from aprilk-ms:aprilk/hosted-agents-evaluation-learning-path
