samples/python: Add evaluation learning path for hosted agents #726
Closed
aprilk-ms wants to merge 2 commits into
Three things:
1. New 14-evaluation sample under agent-framework/responses/, with a
beginner-focused README (What is evaluation?, score-shape table,
first-run quickstart with expected console output, "if a score is
low" checklist, cost/data-usage table, privacy callout for
ENABLE_SENSITIVE_DATA=true) and 8 evaluate_*/generate_dataset_*
scripts:
- evaluate_basic.py (single-turn built-in evaluators)
- evaluate_custom_rubric.py (Custom Rubric Evaluator, generate +
  inspect + HITL regenerate + eval)
- evaluate_multiturn_simulation.py
- evaluate_multiturn_traces.py (agent_filter / conversation_id /
  trace_id variants)
- generate_dataset_from_traces.py
- generate_dataset_synthetic.py (default runs the deployed agent
  against generated queries; EVAL_AGAINST_DATASET_ONLY=true falls
  back to grading the rows)
- evaluate_scheduled.py (event-triggered + interval modes)
- evaluate_redteam.py (content-safety evaluators on the 0-7 severity
  scale; indirect_attack is an opt-in commented entry)
Shared helpers live in eval_common.py, including a friendly
print_friendly_output() formatter that normalizes both Pydantic
eval-run objects and plain dicts, and gates the raw pprint behind
EVAL_DEBUG=1. Tracing is enabled by default
(ENABLE_INSTRUMENTATION + ENABLE_SENSITIVE_DATA) so the trace-based
and scheduled scripts work out of the box.
2. Multi-turn evaluation co-located in 01-basic:
evaluate_multiturn_simulation.py + evaluate_multiturn_traces.py
defaulted to the 01-basic agent name, with small seed JSONL files
under data/, a requirements-eval.txt for the eval-only deps, and a
new README section that explains what evaluation is, points back to
14-evaluation, flags the tracing prereq + redeploy step, and adds
a privacy heads-up on ENABLE_SENSITIVE_DATA.
3. "Related: Evaluate this agent" links added to every hosted-agent
sample README (responses/01..13, invocations/01-basic, all
bring-your-own/**) pointing at 14-evaluation, plus a row in the
main hosted-agents README learning path and the agent-framework
Responses API samples table.
Scripts pin API version 2025-11-15-preview in eval_common.py for
preview LROs (evaluator generation, data generation, scheduled
eval) and use openai_client.evals for the GA-ish eval-run surface.
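
To illustrate the print_friendly_output() formatter described in item 1, here is a
minimal sketch of the normalize-then-print idea. It is not the code from
eval_common.py: the row layout (a "results" list with "name" and "passed" keys) and
the helper names to_plain() and friendly_lines() are assumptions made for the
example. Only the EVAL_DEBUG=1 gate and the Pydantic-or-dict input come from the
description above.

```python
import os
from pprint import pprint
from typing import Any


def to_plain(obj: Any) -> Any:
    """Recursively convert Pydantic-style objects into plain dicts and lists."""
    if hasattr(obj, "model_dump"):  # Pydantic v2 models expose model_dump()
        return to_plain(obj.model_dump())
    if isinstance(obj, dict):
        return {key: to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    return obj


def friendly_lines(rows: Any) -> list[str]:
    """Build one summary line per row (the row layout is hypothetical)."""
    lines = []
    for index, row in enumerate(to_plain(rows), start=1):
        summary = ", ".join(
            f"{result.get('name', '?')}={'pass' if result.get('passed') else 'fail'}"
            for result in row.get("results", [])
        )
        lines.append(f"row {index}: {summary or 'no results'}")
    return lines


def print_friendly_output(rows: Any) -> None:
    """Print the short summary; dump the raw payload only when EVAL_DEBUG=1."""
    for line in friendly_lines(rows):
        print(line)
    if os.environ.get("EVAL_DEBUG") == "1":
        pprint(to_plain(rows))
```

Converting everything to plain data first keeps the summary logic in one code path,
whichever shape the eval-run call returns.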
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rubber-duck pass focused on voice and length dimensions:
* Drop LLM-isms: `small judges that emit`, `the intuition is always`, `the *real* recommended`, `the words 'why' are almost always`, `emit a severity 0-7`, `small LLM judges`.
* Compress `The scripts` section from 65 lines to ~40 by cutting redundancy with `Pick the right flow`; each entry now one paragraph with the same callouts (schedule keeps running / red-team writes to traces / generate_dataset_synthetic default).
* Add a one-line description of what the demo agent actually does so readers don't have to open agent.yaml to find out.
* Tighten the EVAL_DEBUG note and the pass-threshold paragraph.
Net: README 362 -> 337 lines (-25), 20.6KB -> 19.0KB.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
👋 Thanks for your interest in contributing, @aprilk-ms! This repository does not accept pull requests directly. If you'd like to report a bug, suggest an improvement, or propose a new sample, please open an issue instead. See CONTRIBUTING.md for more details.
What
Adds an evaluation learning path to the Python hosted-agent samples
(samples/python/hosted-agents/), per the three deliverables requested:

1. agent-framework/responses/14-evaluation/: a tiny demo agent (main.py, agent.yaml) + 8 eval scripts + 4 seed datasets + a beginner-focused nav-hub README.md.
2. 01-basic/: two scripts (simulation + traces) + seed datasets + a new "Evaluating multi-turn conversations" section in 01-basic/README.md so the multi-turn learning path can stand on its own.
3. READMEs (parent index, agent-framework table, every responses/01..13 sample, invocations/01-basic, every bring-your-own/** sample).

The 8 scripts in 14-evaluation/

- evaluate_basic.py: single-turn built-in evaluators (task_adherence, fluency, relevance) against 4 inline questions
- evaluate_custom_rubric.py ⭐
- evaluate_multiturn_simulation.py
- evaluate_multiturn_traces.py
- generate_dataset_from_traces.py
- generate_dataset_synthetic.py
- evaluate_scheduled.py
- evaluate_redteam.py

Shared helpers live in eval_common.py:

- API version (2025-11-15-preview) in one place
- target_agent() default + env-var overrides (EVAL_AGENT_NAME, EVAL_AGENT_VERSION)
- print_friendly_output(): Pydantic-safe per-row summary (single-turn / multi-turn / dataset-row shapes), with EVAL_DEBUG=1 to also dump the raw payload
- build_testing_criteria(model, response_source=): "sample" for live agent runs ({{sample.output_text}}), "item" for dataset rows that already contain a response ({{item.response}})

The demo agent ships with ENABLE_INSTRUMENTATION=true and ENABLE_SENSITIVE_DATA=true so trace-based + continuous scripts work out of the box. The README has an explicit privacy callout about this.
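
The target_agent() and build_testing_criteria() helpers listed above could look
roughly like the sketch below. It is a sketch under stated assumptions, not the
contents of eval_common.py: the criteria dict keys, the "builtin." prefix on these
three evaluator names, the {{item.query}} mapping, and the default agent name
"my-demo-agent" are all illustrative. Only the function names, the env-var names,
and the two response templates come from this PR.

```python
import os

# Evaluator names from evaluate_basic.py; the "builtin." prefix is assumed
# by analogy with builtin.indirect_attack.
EVALUATORS = ("task_adherence", "fluency", "relevance")

# Where the graded response comes from, keyed by response_source.
RESPONSE_TEMPLATES = {
    "sample": "{{sample.output_text}}",  # live agent run
    "item": "{{item.response}}",         # dataset row that already has a response
}


def target_agent(default_name: str = "my-demo-agent") -> dict:
    """Resolve the agent under test; env vars override the (hypothetical) default."""
    target = {"name": os.environ.get("EVAL_AGENT_NAME", default_name)}
    version = os.environ.get("EVAL_AGENT_VERSION")
    if version:
        target["version"] = version
    return target


def build_testing_criteria(model: str, response_source: str = "sample") -> list[dict]:
    """Return one criterion per evaluator, mapped to the chosen response source."""
    if response_source not in RESPONSE_TEMPLATES:
        raise ValueError(f"response_source must be one of {sorted(RESPONSE_TEMPLATES)}")
    response = RESPONSE_TEMPLATES[response_source]
    return [
        {
            # All keys below are assumptions about the criteria schema.
            "type": "azure_ai_evaluator",
            "name": name,
            "evaluator_name": f"builtin.{name}",
            "initialization_parameters": {"deployment_name": model},
            "data_mapping": {"query": "{{item.query}}", "response": response},
        }
        for name in EVALUATORS
    ]
```

Keeping the template choice behind a single response_source argument means a script
can switch between grading live runs and grading stored rows without touching the
evaluator list.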
Beginner-friendliness passes
Two rubber-duck rounds (correctness + score scales) and one voice-and-verbosity
sweep were applied before this PR:

- agent-task (Pass/Fail + optional score), safety (0-7 severity, higher worse), attack-detection (boolean), custom rubric (weighted 1-5)
- builtin.indirect_attack documented as the prompt-injection evaluator (the non-existent builtin.jailbreak was removed in an earlier pass)
- evaluate_custom_rubric.py now polls to completion + prints friendly output, with CHANGE-THIS-FIRST markers on the rubric prompt and CHANGE-THIS-TOO on the inline eval questions
- generate_dataset_synthetic.py defaults to running the generated questions through the deployed agent; EVAL_AGAINST_DATASET_ONLY=true opts out
- tracing-privacy sections added
- "the real recommended", "almost always", "small LLM judges", "emit a severity" all removed
- overlap with the "Pick the right flow" table immediately above it (while keeping every load-bearing callout: schedule keeps running, redteam writes adversarial prompts to traces, etc.)
- need to open agent.yaml to figure out what it is

Verification

- print_friendly_output() smoke-tested against Pydantic-like objects + single-turn / multi-turn / dataset-row shapes
- azd ai agent deploy, builtin.jailbreak) anywhere in the diff
- (mkdir foo && cd foo && azd ai agent init -m <manifest> && azd up)

Not yet verified

authoring environment). The preview API surfaces (evaluator_generation_jobs, data_generation_jobs, scheduled eval) are pinned to API version 2025-11-15-preview in eval_common.py as the single bump point.
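
The "single bump point" idea can be sketched as one module-level constant that every
preview call reads. This is only an illustration: the helper name preview_url(), the
api-version query-parameter name, and the example endpoint and path are assumptions,
not taken from eval_common.py.

```python
from urllib.parse import urlencode

# The single bump point: change this one constant when the preview API moves.
PREVIEW_API_VERSION = "2025-11-15-preview"


def preview_url(endpoint: str, path: str) -> str:
    """Build a URL for a preview operation using the pinned API version."""
    query = urlencode({"api-version": PREVIEW_API_VERSION})
    return f"{endpoint.rstrip('/')}/{path.lstrip('/')}?{query}"


# Hypothetical endpoint and path, for illustration only.
print(preview_url("https://example.invalid/", "/evaluator_generation_jobs"))
```

With the version held in one constant, moving to a newer preview is a one-line change
rather than an edit in each script.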
File count
Dockerfile + 4 seeds + 31 README updates)