samples/python: Add evaluation learning path for hosted agents #726

Closed
aprilk-ms wants to merge 2 commits into microsoft-foundry:main from aprilk-ms:aprilk/hosted-agents-evaluation-learning-path

@aprilk-ms

What

Adds an evaluation learning path to the Python hosted-agent samples
(samples/python/hosted-agents/), per the three deliverables requested:

  1. General evaluation learning path at
    agent-framework/responses/14-evaluation/ — a tiny demo agent (main.py,
    agent.yaml) + 8 eval scripts + 4 seed datasets + a beginner-focused
    nav-hub README.md.
  2. Multi-turn evaluation co-located in 01-basic/ — two scripts
    (simulation + traces) + seed datasets + a new "Evaluating multi-turn
    conversations" section in 01-basic/README.md so the multi-turn learning
    path can stand on its own.
  3. "Related: Evaluate this agent" links added to 31 hosted-agent
    READMEs (parent index, agent-framework table, every responses/01..13
    sample, invocations/01-basic, every bring-your-own/** sample).

The 8 scripts in 14-evaluation/

| Script | What it teaches |
| --- | --- |
| evaluate_basic.py | Easiest first run: built-in evaluators (task_adherence, fluency, relevance) against 4 inline questions |
| evaluate_custom_rubric.py | Generates a 5-7 dimension rubric tailored to your agent from a short prompt; the script you'd actually use on a real project |
| evaluate_multiturn_simulation.py | Foundry simulates full multi-turn conversations from seed scenarios |
| evaluate_multiturn_traces.py | Same multi-turn evaluators over real traced conversations |
| generate_dataset_from_traces.py | Materialize recent traces into a reusable registered dataset |
| generate_dataset_synthetic.py | Bootstrap a dataset from topic seeds when you have no traffic |
| evaluate_scheduled.py | Score every new agent response continuously |
| evaluate_redteam.py | Content-safety / red-team eval (violence, self-harm, hate, sexual; 0-7 severity) |

Shared helpers live in eval_common.py:

  • API version pinned (2025-11-15-preview) in one place
  • target_agent() default + env-var overrides (EVAL_AGENT_NAME, EVAL_AGENT_VERSION)
  • print_friendly_output() — Pydantic-safe per-row summary (single-turn / multi-turn / dataset-row shapes), with EVAL_DEBUG=1 to also dump the raw payload
  • build_testing_criteria(model, response_source=): "sample" for live agent runs ({{sample.output_text}}), "item" for dataset rows that already contain a response ({{item.response}})
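
As a rough illustration of the helper list above, target_agent() and build_testing_criteria() could be shaped like the sketch below. This is not the PR's actual eval_common.py. The default agent name, the returned dict shapes, the criteria field names (type, evaluator_name, initialization_parameters, data_mapping), and the {{item.query}} placeholder are assumptions made for the example. Only the environment variable names, the API version string, the three evaluator names, and the two response placeholders come from the description.

```python
import os

# Pinned in one place so a preview-API bump touches a single line.
API_VERSION = "2025-11-15-preview"

# Hypothetical default; the sample's real agent name is not given in this PR text.
DEFAULT_AGENT_NAME = "demo-agent"

# Where the evaluated response comes from, keyed by response_source.
_RESPONSE_TEMPLATES = {
    "sample": "{{sample.output_text}}",  # live agent run
    "item": "{{item.response}}",         # dataset row that already has a response
}


def target_agent() -> dict:
    """Return the agent to evaluate, letting env vars override the defaults."""
    agent = {"name": os.environ.get("EVAL_AGENT_NAME", DEFAULT_AGENT_NAME)}
    version = os.environ.get("EVAL_AGENT_VERSION")
    if version:  # leave the key out when no version is pinned
        agent["version"] = version
    return agent


def build_testing_criteria(model: str, response_source: str = "sample") -> list:
    """Build one criterion per built-in evaluator (field names are assumed)."""
    if response_source not in _RESPONSE_TEMPLATES:
        raise ValueError(
            f"response_source must be one of {sorted(_RESPONSE_TEMPLATES)}"
        )
    response = _RESPONSE_TEMPLATES[response_source]
    return [
        {
            "type": "azure_ai_evaluator",
            "name": name,
            "evaluator_name": f"builtin.{name}",
            "initialization_parameters": {"deployment_name": model},
            "data_mapping": {"query": "{{item.query}}", "response": response},
        }
        for name in ("task_adherence", "fluency", "relevance")
    ]
```

Keeping the placeholder choice behind one argument means a script that grades pre-recorded rows differs from a live-run script only in passing response_source="item".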

The demo agent ships with ENABLE_INSTRUMENTATION=true and
ENABLE_SENSITIVE_DATA=true so trace-based + continuous scripts work
out of the box. The README has an explicit privacy callout about this.
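
For the print_friendly_output() helper listed among the shared helpers, a minimal sketch of the "Pydantic-safe summary plus EVAL_DEBUG gate" idea follows. The row keys used here (messages, query, scores) and the summarize_row() helper are hypothetical; the PR text does not show the real payload shapes, so treat this as an illustration of the normalization pattern only.

```python
import os
from pprint import pprint
from typing import Any


def _to_plain(obj: Any) -> Any:
    """Convert a Pydantic-like object (anything with model_dump) to plain data."""
    if hasattr(obj, "model_dump"):
        return obj.model_dump()
    return obj


def summarize_row(row: Any) -> str:
    """One-line summary for multi-turn, single-turn, or dataset-row shapes."""
    data = _to_plain(row)
    if "messages" in data:        # multi-turn conversation (assumed key)
        label = f"{len(data['messages'])} messages"
    elif "query" in data:         # single-turn or dataset row (assumed key)
        label = str(data["query"])[:60]
    else:
        label = "<unknown shape>"
    scores = data.get("scores", {})
    text = ", ".join(f"{name}={value}" for name, value in sorted(scores.items()))
    return f"{label} -> {text or 'no scores'}"


def print_friendly_output(rows: list) -> None:
    """Print one summary line per row; dump raw payloads only when EVAL_DEBUG=1."""
    for row in rows:
        print(summarize_row(row))
    if os.environ.get("EVAL_DEBUG") == "1":
        pprint([_to_plain(row) for row in rows])
```

Converting to plain data first keeps the summary code independent of whether the SDK hands back model objects or dicts.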

Beginner-friendliness passes

Two rubber-duck rounds (correctness + score scales) and one voice-and-verbosity
sweep were applied before this PR:

  • Score-shape table distinguishes quality (1-5, higher better),
    agent-task (Pass/Fail + optional score), safety (0-7 severity, higher
    worse), attack-detection (boolean), custom rubric (weighted 1-5)
  • builtin.indirect_attack documented as the prompt-injection
    evaluator (the non-existent builtin.jailbreak was removed in an
    earlier pass)
  • evaluate_custom_rubric.py now polls to completion + prints
    friendly output, with CHANGE-THIS-FIRST markers on the rubric prompt
    and CHANGE-THIS-TOO on the inline eval questions
  • generate_dataset_synthetic.py defaults to running the generated
    questions through the deployed agent; EVAL_AGAINST_DATASET_ONLY=true
    opts out
  • "If a score is low, what next?", "Cost and data usage", and
    tracing-privacy sections added
  • LLM-ism scrub: "small judges that emit", "the intuition is always",
    "the real recommended", "almost always", "small LLM judges", "emit a
    severity" all removed
  • "The scripts" section compressed from 65 lines to 40 by removing
    overlap with the "Pick the right flow" table immediately above it
    (while keeping every load-bearing callout: schedule keeps running,
    redteam writes adversarial prompts to traces, etc.)
  • One-line description of the demo agent added so readers don't
    need to open agent.yaml to figure out what it is

Verification

  • All 12 new Python files AST-parse clean
  • print_friendly_output() smoke-tested against Pydantic-like objects +
    single-turn / multi-turn / dataset-row shapes
  • All 31 added "Related" link targets resolve to existing relative paths
  • No stale references to the removed CLI command azd ai agent deploy
    or the non-existent builtin.jailbreak evaluator anywhere in the diff
  • Deploy flow matches the parent README's canonical pattern
    (mkdir foo && cd foo && azd ai agent init -m <manifest> && azd up)
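
The AST-parse and link-target checks in this list can be reproduced with the standard library alone. The sketch below is one possible way to run them, not the script used for this PR; the function names and the link regex are hypothetical, and the regex deliberately skips pure #anchor links and links that carry a title.

```python
import ast
import re
from pathlib import Path

# Matches markdown links like [text](target) or [text](target#anchor).
LINK_RE = re.compile(r"\]\(([^)#\s]+)(?:#[^)]*)?\)")


def unparsable_python_files(root: Path) -> list:
    """Return every .py file under root that fails ast.parse."""
    bad = []
    for path in sorted(root.rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError:
            bad.append(path)
    return bad


def broken_relative_links(readme: Path) -> list:
    """Return relative link targets in readme that do not exist on disk."""
    broken = []
    for target in LINK_RE.findall(readme.read_text(encoding="utf-8")):
        if "://" in target or target.startswith("mailto:"):
            continue  # only relative paths are checked
        if not (readme.parent / target).exists():
            broken.append(target)
    return broken
```

Running unparsable_python_files() over the samples directory and broken_relative_links() over each edited README gives empty lists when both checks pass. Neither check executes any sample code, so they say nothing about runtime behavior.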

Not yet verified

  • End-to-end run against a live Foundry project (no Azure creds in the
    authoring environment). The preview API surfaces
    (evaluator_generation_jobs, data_generation_jobs, scheduled eval)
    are pinned to API version 2025-11-15-preview in eval_common.py
    as the single bump point.

File count

  • 60 files in the first commit (8 scripts + helpers + manifest +
    Dockerfile + 4 seeds + 31 README updates)
  • 3 files in the voice-pass commit (-25 net lines)

aprilk-ms and others added 2 commits May 22, 2026 03:27
Three things:

1. New 14-evaluation sample under agent-framework/responses/, with a
   beginner-focused README (What is evaluation?, score-shape table,
   first-run quickstart with expected console output, "if a score is
   low" checklist, cost/data-usage table, privacy callout for
   ENABLE_SENSITIVE_DATA=true) and 8 evaluate_*/generate_dataset_*
   scripts:

   - evaluate_basic.py            (single-turn built-in evaluators)
   - evaluate_custom_rubric.py    (Custom Rubric Evaluator, generate +
                                   inspect + HITL regenerate + eval)
   - evaluate_multiturn_simulation.py
   - evaluate_multiturn_traces.py (agent_filter / conversation_id /
                                   trace_id variants)
   - generate_dataset_from_traces.py
   - generate_dataset_synthetic.py (default runs the deployed agent
                                    against generated queries;
                                    EVAL_AGAINST_DATASET_ONLY=true
                                    falls back to grading the rows)
   - evaluate_scheduled.py        (event-triggered + interval modes)
   - evaluate_redteam.py          (content-safety evaluators on the
                                   0-7 severity scale; indirect_attack
                                   is an opt-in commented entry)

   Shared helpers live in eval_common.py, including a friendly
   print_friendly_output() formatter that normalizes both Pydantic
   eval-run objects and plain dicts, and gates the raw pprint behind
   EVAL_DEBUG=1. Tracing is enabled by default
   (ENABLE_INSTRUMENTATION + ENABLE_SENSITIVE_DATA) so the trace-based
   and scheduled scripts work out of the box.

2. Multi-turn evaluation co-located in 01-basic:
   evaluate_multiturn_simulation.py + evaluate_multiturn_traces.py
   defaulted to the 01-basic agent name, with small seed JSONL files
   under data/, a requirements-eval.txt for the eval-only deps, and a
   new README section that explains what evaluation is, points back to
   14-evaluation, flags the tracing prereq + redeploy step, and adds
   a privacy heads-up on ENABLE_SENSITIVE_DATA.

3. "Related: Evaluate this agent" links added to every hosted-agent
   sample README (responses/01..13, invocations/01-basic, all
   bring-your-own/**) pointing at 14-evaluation, plus a row in the
   main hosted-agents README learning path and the agent-framework
   Responses API samples table.

Scripts pin API version 2025-11-15-preview in eval_common.py for
preview LROs (evaluator generation, data generation, scheduled
eval) and use openai_client.evals for the GA-ish eval-run surface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rubber-duck pass focused on voice and length dimensions:

* Drop LLM-isms: `small judges that emit`, `the intuition is always`,
  `the *real* recommended`, `the words 'why' are almost always`,
  `emit a severity 0-7`, `small LLM judges`.
* Compress `The scripts` section from 65 lines to ~40 by cutting
  redundancy with `Pick the right flow`; each entry now one paragraph
  with the same callouts (schedule keeps running / red-team writes to
  traces / generate_dataset_synthetic default).
* Add a one-line description of what the demo agent actually does so
  readers don't have to open agent.yaml to find out.
* Tighten the EVAL_DEBUG note and the pass-threshold paragraph.

Net: README 362 -> 337 lines (-25), 20.6KB -> 19.0KB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

👋 Thanks for your interest in contributing, @aprilk-ms!

This repository does not accept pull requests directly. If you'd like to report a bug, suggest an improvement, or propose a new sample, please open an issue instead.

See CONTRIBUTING.md for more details.

github-actions bot closed this May 22, 2026