microsoft-foundry · aprilk-ms · May 22, 2026 · May 22, 2026
diff --git a/samples/python/hosted-agents/README.md b/samples/python/hosted-agents/README.md
@@ -120,6 +120,7 @@ The platform manages conversation history, streaming lifecycle, and background e
 8. **[Observability](./agent-framework/responses/08-observability/)** — Add logging, metrics, and distributed tracing to your agent and visualize them in Foundry.
 9. **[Declarative Workflows](./agent-framework/responses/09-declarative-customer-support/)** — A multi-turn customer-support triage workflow defined entirely in YAML and hosted as an agent, demonstrating declarative workflow authoring with `InvokeAzureAgent` calls to specialist Foundry-hosted agents and conversation-aware routing.
 10. **[Downstream Azure services](./agent-framework/responses/09-downstream-azure/)** — Call Azure Blob Storage and Service Bus from the agent using its per-agent Microsoft Entra identity (no connection strings).
+11. **[Evaluation](./agent-framework/responses/14-evaluation/)** — Evaluate any deployed hosted agent with the **Custom Rubric Evaluator** ⭐, built-in single- and multi-turn evaluators, dataset generation (traces / synthetic), scheduled / continuous evaluation, and red-team / safety evaluation. Start here when you want to measure agent quality.
 
 ### Invocations protocol
 

diff --git a/samples/python/hosted-agents/agent-framework/README.md b/samples/python/hosted-agents/agent-framework/README.md
@@ -21,6 +21,7 @@ This directory contains samples that demonstrate how to use the [Agent Framework
 | 11  | [Azure AI Search RAG](responses/11-azure-search-rag/)                      | An agent with Retrieval Augmented Generation (RAG) capabilities backed by Azure AI Search, grounding answers in documents indexed in a pre-provisioned search index.                                                                          |
 | 12  | [Foundry Skills](responses/12-foundry-skills/)                             | An agent that uploads `SKILL.md` files to the Foundry Skills REST API and downloads them at startup, decoupling tone/policy guidelines from agent code.                                                                                       |
 | 13  | [Foundry Memory](responses/13-foundry-memory/)                             | An agent with persistent semantic memory backed by an Azure AI Foundry Memory Store, using `FoundryMemoryProvider` to remember user facts across sessions.                                                                                    |
+| 14  | [Evaluation](responses/14-evaluation/)                                     | Evaluate a deployed hosted agent end-to-end: **Custom Rubric Evaluator** ⭐, built-in single- and multi-turn evaluators, dataset generation from traces / synthetic seeds, scheduled / continuous evaluation, and red-team / safety evaluation. |
 
 ### Invocations API
 

diff --git a/samples/python/hosted-agents/agent-framework/invocations/01-basic/README.md b/samples/python/hosted-agents/agent-framework/invocations/01-basic/README.md
@@ -79,3 +79,8 @@ To host the agent on Foundry, follow the instructions in the [Deploying the Agen
    - Pick a **CPU and Memory** size.
    - Click **Deploy**. Fields are validated inline, and the extension handles the build/upload, agent version creation, and RBAC role assignment.
 5. After deployment, invoke the agent in the Agent Playground and stream live logs from the **Logs** tab.
+
+
+## Related
+
+**Evaluate this agent:** see [`../../responses/14-evaluation/`](../../responses/14-evaluation/) for the general evaluation flow — **Custom Rubric Evaluator** ⭐, built-in single- and multi-turn evaluators, dataset generation, scheduled / continuous evaluation, and red-team / safety. The patterns work for Invocations agents too; only the target changes, and you select it by agent name and version (`EVAL_AGENT_NAME` / `EVAL_AGENT_VERSION`). For multi-turn evaluation scripts you can run end-to-end, see [`../../responses/01-basic/`](../../responses/01-basic/#evaluating-multi-turn-conversations).
diff --git a/samples/python/hosted-agents/agent-framework/responses/01-basic/README.md b/samples/python/hosted-agents/agent-framework/responses/01-basic/README.md
@@ -62,3 +62,72 @@ To host the agent on Foundry, follow the instructions in the [Deploying the Agen
    - Pick a **CPU and Memory** size.
    - Click **Deploy**. Fields are validated inline, and the extension handles the build/upload, agent version creation, and RBAC role assignment.
 5. After deployment, invoke the agent in the Agent Playground and stream live logs from the **Logs** tab.
+
+## Evaluating multi-turn conversations
+
+After your agent is deployed and you've tried it in the Playground, the
+next question is *"is it actually any good at multi-turn conversations?"*
+**Evaluation** answers that — you run the agent against test conversations
+and let built-in evaluators (automated scorers, themselves LLM-backed)
+grade each conversation on things like *task completion*, *coherence*,
+and *groundedness*. New to evaluation? Skim the **What is evaluation?**
+section in [`../14-evaluation/README.md`](../14-evaluation/README.md)
+first — this section assumes you've seen it.
+
+Two scripts in this folder let you evaluate multi-turn behavior end-to-end
+without leaving the `01-basic` sample:
+
+* **[`evaluate_multiturn_simulation.py`](./evaluate_multiturn_simulation.py)** —
+  drives the deployed agent through simulated multi-turn conversations
+  seeded from [`data/test-scenarios.jsonl`](./data/test-scenarios.jsonl)
+  and scores them with the 4 built-in conversation-level evaluators
+  (`customer_satisfaction`, `groundedness`, `coherence`, `task_completion`).
+  No traces required — pick this if you haven't enabled tracing yet.
+* **[`evaluate_multiturn_traces.py`](./evaluate_multiturn_traces.py)** —
+  same 4 evaluators, but scored against **real conversations captured as
+  traces**. Use this once your agent is receiving real traffic.
+
+> **Tracing prerequisite for `evaluate_multiturn_traces.py`** — this sample
+> does **not** enable tracing by default. Before running the trace-based
+> script, copy `ENABLE_INSTRUMENTATION=true` and `ENABLE_SENSITIVE_DATA=true`
+> from [`../08-observability/agent.yaml`](../08-observability/agent.yaml) into
+> the `agent.yaml` for your `01-basic` deployment, then **redeploy the agent**
+> (changes to `agent.yaml` take effect only on the next `azd up`).
+> To skip tracing, use `evaluate_multiturn_simulation.py` instead.
+
+> ⚠ **About `ENABLE_SENSITIVE_DATA=true`** — that flag means user inputs
+> and model outputs (including any PII) are written verbatim to your
+> Application Insights workspace, so trace-based evaluation can score the
+> content. Fine for dev / demos; for anything customer-facing, decide
+> deliberately and treat the trace workspace as customer data.
+
+### Run a script
+
+```bash
+pip install -r requirements-eval.txt
+az login
+# Required:
+export FOUNDRY_PROJECT_ENDPOINT="https://<account>.services.ai.azure.com/api/projects/<project>"
+export AZURE_AI_MODEL_DEPLOYMENT_NAME="gpt-4.1-mini"  # evaluate_multiturn_simulation.py falls back to this value if unset
+# Optional overrides (defaults shown):
+export EVAL_AGENT_NAME="agent-framework-agent-basic-responses"
+export EVAL_AGENT_VERSION="1"
+
+python evaluate_multiturn_simulation.py
+# or
+python evaluate_multiturn_traces.py
+```
+
+> Windows / PowerShell? Replace `export FOO=bar` with `$env:FOO = "bar"`.
+
+Each script prints the eval ID, run ID, a `result_counts` summary, and a
+**Foundry portal report URL** — open the URL to drill into per-row scores
+and rationales.
+
+### See also
+
+These scripts are co-located here for the **multi-turn learning path**. For
+the broader evaluation story — **Custom Rubric Evaluator** ⭐, built-in
+single-turn evaluators, dataset generation (traces / synthetic), scheduled /
+continuous evaluation, and red-team / safety evaluation — see
+[`../14-evaluation/`](../14-evaluation/).
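
The settings listed under "Run a script" can be checked before a run starts. The sketch below mirrors how `evaluate_multiturn_simulation.py` in this PR reads its environment: it refuses to continue without `FOUNDRY_PROJECT_ENDPOINT` and falls back to defaults for the other three variables. The helper name `resolve_settings` and the `DEFAULTS` table are illustrative and not part of the sample; `evaluate_multiturn_traces.py` is not shown in this diff, so its behavior may differ.

```python
import os
from collections.abc import Mapping

# Defaults copied from evaluate_multiturn_simulation.py in this PR.
DEFAULTS = {
    "AZURE_AI_MODEL_DEPLOYMENT_NAME": "gpt-4.1-mini",
    "EVAL_AGENT_NAME": "agent-framework-agent-basic-responses",
    "EVAL_AGENT_VERSION": "1",
}


def resolve_settings(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the effective settings, raising if the endpoint is missing.

    Hypothetical helper for illustration; the sample does not define it.
    """
    endpoint = environ.get("FOUNDRY_PROJECT_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("FOUNDRY_PROJECT_ENDPOINT is not set (see .env.example).")
    # The script strips a trailing slash from the endpoint before using it.
    settings = {"FOUNDRY_PROJECT_ENDPOINT": endpoint.rstrip("/")}
    for name, default in DEFAULTS.items():
        settings[name] = environ.get(name, default)
    return settings


if __name__ == "__main__":
    for key, value in resolve_settings().items():
        print(f"{key}={value}")
```

Running the module directly prints the values a script would pick up, which makes a wrong agent name or version visible before an eval run is created.
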
diff --git a/...thon/hosted-agents/agent-framework/responses/01-basic/data/multi-turn-conversations.jsonl b/...thon/hosted-agents/agent-framework/responses/01-basic/data/multi-turn-conversations.jsonl
@@ -0,0 +1,4 @@
+{"messages": [{"role": "user", "content": "Hi there!"}, {"role": "assistant", "content": "Hello! How can I help you today?"}, {"role": "user", "content": "Just wanted to say hi."}, {"role": "assistant", "content": "That's kind of you. Anything else I can do?"}]}
+{"messages": [{"role": "user", "content": "What's the weather like in Seattle today?"}, {"role": "assistant", "content": "It's currently 58°F and overcast in Seattle, with a high of 64°F expected. Light rain in the afternoon."}, {"role": "user", "content": "Do I need an umbrella?"}, {"role": "assistant", "content": "Yes, the rain probability after 2 PM is about 70%, so an umbrella is a good idea."}, {"role": "user", "content": "Thanks!"}, {"role": "assistant", "content": "You're welcome — have a good day!"}]}
+{"messages": [{"role": "user", "content": "Is the downtown store still open?"}, {"role": "assistant", "content": "The downtown store closes at 9 PM tonight. It's currently 7:45 PM, so you have about 75 minutes."}, {"role": "user", "content": "What if I want curbside pickup?"}, {"role": "assistant", "content": "Curbside pickup stops 30 minutes before close, so 8:30 PM tonight. Want me to start an order?"}, {"role": "user", "content": "Yes please, one bag of dark roast coffee beans."}, {"role": "assistant", "content": "Order placed: 1x dark roast coffee beans for curbside pickup at the downtown store. Confirmation #DT-48291. Ready in ~15 minutes."}]}
+{"messages": [{"role": "user", "content": "What's 23 times 47?"}, {"role": "assistant", "content": "23 × 47 = 1081."}, {"role": "user", "content": "Now divide that by 2 and add 19."}, {"role": "assistant", "content": "1081 / 2 = 540.5. Then 540.5 + 19 = 559.5."}, {"role": "user", "content": "Square root of that?"}, {"role": "assistant", "content": "√559.5 ≈ 23.65."}]}
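
Each row in `multi-turn-conversations.jsonl` is one JSON object with a `messages` array of `role` / `content` pairs. A minimal sketch of a shape check for such rows follows. The rules it enforces (strict user/assistant alternation, starting with `user` and ending with `assistant`) are assumptions read off the four rows above, not a documented requirement of the evaluators, and `validate_conversation` is a hypothetical helper.

```python
import json


def validate_conversation(record: dict) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: expected an object")
            continue
        # Assumed convention: even positions are user turns, odd are assistant.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(
                f"message {index}: expected role {expected_role!r}, "
                f"got {message.get('role')!r}"
            )
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: 'content' must be a non-empty string")

    # With strict alternation, an odd message count means a dangling user turn.
    if len(messages) % 2 != 0:
        problems.append("conversation should end with an assistant message")
    return problems


def validate_jsonl(text: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping clean and blank lines."""
    report = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        problems = validate_conversation(json.loads(line))
        if problems:
            report[line_number] = problems
    return report
```

Passing the file's text to `validate_jsonl` returns an empty dict when every row follows the assumed shape.
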
diff --git a/samples/python/hosted-agents/agent-framework/responses/01-basic/data/test-scenarios.jsonl b/samples/python/hosted-agents/agent-framework/responses/01-basic/data/test-scenarios.jsonl
@@ -0,0 +1,4 @@
+{"id": "scenario-1-greeting", "test_case_description": "User starts with a casual greeting; agent should respond warmly and offer help.", "desired_num_turns": 3}
+{"id": "scenario-2-weather-followup", "test_case_description": "User asks about the weather in a major city, then asks a follow-up about whether to bring an umbrella.", "desired_num_turns": 4}
+{"id": "scenario-3-store-hours", "test_case_description": "User asks if a store is open, then progressively narrows down to curbside-pickup hours, then places an order.", "desired_num_turns": 5}
+{"id": "scenario-4-math-chained", "test_case_description": "User asks a multi-step arithmetic question across multiple turns, building on prior results.", "desired_num_turns": 4}
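
The scenarios above request between 3 and 5 turns via `desired_num_turns`, while `evaluate_multiturn_simulation.py` in this PR passes `"max_turns": 4` to the run. How the service reconciles the two values is not stated in this diff, so the sketch below only flags scenarios whose requested turn count exceeds the cap. `scenarios_over_cap` is a hypothetical helper, and `MAX_TURNS` simply mirrors the value hard-coded in the script.

```python
import json

MAX_TURNS = 4  # mirrors "max_turns" in evaluate_multiturn_simulation.py


def scenarios_over_cap(jsonl_text: str, max_turns: int = MAX_TURNS) -> list[str]:
    """Return the ids of scenarios that ask for more turns than max_turns."""
    over = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, as load_scenarios() does
        scenario = json.loads(line)
        if scenario["desired_num_turns"] > max_turns:
            over.append(scenario["id"])
    return over
```

Applied to the four rows in this file, the only id with `desired_num_turns` above 4 is `scenario-3-store-hours`, which may be worth confirming against the intended behavior of the run.
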
diff --git a/.../python/hosted-agents/agent-framework/responses/01-basic/evaluate_multiturn_simulation.py b/.../python/hosted-agents/agent-framework/responses/01-basic/evaluate_multiturn_simulation.py
@@ -0,0 +1,175 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+"""Multi-turn evaluation driven by simulated conversations.
+
+DESCRIPTION:
+    Companion to ``main.py``. Once you've deployed the basic agent (``azd up``
+    or via the Foundry Toolkit), this script runs an *eval run* that simulates
+    multi-turn conversations against the deployed agent and scores them with
+    the 4 built-in conversation-level evaluators
+    (``customer_satisfaction``, ``groundedness``, ``coherence``,
+    ``task_completion``).
+
+    For the full evaluation learning path — Custom Rubric Evaluator, dataset
+    generation, scheduled / continuous eval, red-team — see
+    ``../14-evaluation/``.
+
+USAGE:
+    python evaluate_multiturn_simulation.py
+
+    Prerequisites:
+      * Deployed ``01-basic`` agent (``EVAL_AGENT_NAME``,
+        ``EVAL_AGENT_VERSION`` default to ``agent-framework-agent-basic-responses:1``).
+      * ``FOUNDRY_PROJECT_ENDPOINT`` set (see ``.env.example``).
+"""
+
+import json
+import os
+import time
+from pathlib import Path
+from pprint import pprint
+from typing import Union
+
+from azure.ai.projects import AIProjectClient
+from azure.identity import DefaultAzureCredential
+from dotenv import load_dotenv
+from openai.types.evals.run_create_response import RunCreateResponse
+from openai.types.evals.run_retrieve_response import RunRetrieveResponse
+
+API_VERSION = "2025-11-15-preview"
+SCENARIOS_FILE = Path(__file__).parent / "data" / "test-scenarios.jsonl"
+
+
+def _env(name: str, default: str | None = None) -> str:
+    value = os.environ.get(name, default)
+    if not value:
+        raise RuntimeError(f"{name} is not set (see .env.example).")
+    return value
+
+
+def target_agent() -> dict[str, str]:
+    return {
+        "type": "azure_ai_agent",
+        "name": os.environ.get("EVAL_AGENT_NAME", "agent-framework-agent-basic-responses"),
+        "version": os.environ.get("EVAL_AGENT_VERSION", "1"),
+    }
+
+
+def build_conversation_evaluators(model: str) -> list[dict]:
+    common = {
+        "type": "azure_ai_evaluator",
+        "initialization_parameters": {"deployment_name": model},
+        "data_mapping": {"messages": "{{item.messages}}"},
+    }
+    return [
+        {**common, "name": "customer_satisfaction",
+         "evaluator_name": "builtin.customer_satisfaction"},
+        {**common, "name": "groundedness",
+         "evaluator_name": "builtin.groundedness"},
+        {**common, "name": "coherence",
+         "evaluator_name": "builtin.coherence"},
+        {**common, "name": "task_completion",
+         "evaluator_name": "builtin.task_completion"},
+    ]
+
+
+def load_scenarios() -> list[dict]:
+    with SCENARIOS_FILE.open(encoding="utf-8") as fp:
+        return [json.loads(line) for line in fp if line.strip()]
+
+
+def main() -> None:
+    load_dotenv()
+    endpoint = _env("FOUNDRY_PROJECT_ENDPOINT").rstrip("/")
+    model = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "gpt-4.1-mini")
+    print(f"Using API version: {API_VERSION}")
+    print(f"Project: {endpoint}")
+    print(f"Target agent: {target_agent()}\n")
+
+    scenarios = load_scenarios()
+    print(f"Loaded {len(scenarios)} seed scenarios from {SCENARIOS_FILE.name}")
+
+    with (
+        DefaultAzureCredential() as credential,
+        AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
+        project_client.get_openai_client() as openai_client,
+    ):
+        eval_object = openai_client.evals.create(
+            name="basic-agent-multiturn-sim",
+            data_source_config={
+                "type": "custom",
+                "item_schema": {
+                    "type": "object",
+                    "properties": {"messages": {"type": "array"}},
+                    "required": ["messages"],
+                },
+                "include_sample_schema": False,
+            },
+            testing_criteria=build_conversation_evaluators(model),
+        )
+        print(f"Eval created: {eval_object.id}")
+
+        run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
+            eval_id=eval_object.id,
+            name=f"sim-{target_agent()['name']}",
+            evaluation_level="conversation",
+            data_source={
+                "type": "azure_ai_target_completions",
+                "source": {
+                    "type": "file_content",
+                    "content": [{"item": s} for s in scenarios],
+                },
+                "target": target_agent(),
+                "input_messages": {
+                    "type": "template",
+                    "template": [
+                        {
+                            "type": "message",
+                            "role": "user",
+                            "content": {
+                                "type": "input_text",
+                                "text": "{{item.test_case_description}}",
+                            },
+                        }
+                    ],
+                },
+                "item_generation_params": {
+                    "type": "conversation_gen_preview",
+                    "model": model,
+                    "num_conversations": 1,
+                    "max_turns": 4,
+                    "sampling_params": {
+                        "temperature": 0.7,
+                        "top_p": 1.0,
+                        "max_completion_tokens": 800,
+                    },
+                    "data_mapping": {
+                        "test_case_description": "test_case_description",
+                        "id": "id",
+                        "desired_num_turns": "desired_num_turns",
+                    },
+                },
+            },
+        )
+        print(f"Eval run created: {run.id}")
+        print("Simulation runs can take several minutes per conversation …")
+
+        while run.status not in {"completed", "failed", "canceled"}:
+            time.sleep(10)
+            run = openai_client.evals.runs.retrieve(run_id=run.id, eval_id=eval_object.id)
+            print(f"  status: {run.status}")
+
+        print(f"\nFinal status: {run.status}")
+        print(f"Result counts: {run.result_counts}")
+        print(f"Report URL: {run.report_url}")
+        if run.status == "completed":
+            items = list(
+                openai_client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id)
+            )
+            print(f"\nFirst of {len(items)} output items:")
+            if items:
+                pprint(items[0])
+
+
+if __name__ == "__main__":
+    main()
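
The polling loop in `main()` waits until the run reaches `completed`, `failed`, or `canceled`, with no upper bound on how long it waits. A minimal sketch of the same loop with a deadline is shown below. `wait_for_terminal_status`, its parameters, and the 30-minute default are illustrative assumptions, not part of the sample or of any SDK; the status fetcher is passed in as a callable so the helper carries no SDK dependency.

```python
import time
from typing import Callable

# Terminal statuses, copied from the loop in evaluate_multiturn_simulation.py.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    timeout_seconds: float = 1800.0,
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status() until it returns a terminal status or time runs out.

    Raises TimeoutError if no terminal status is seen before the deadline.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"run still {status!r} after {timeout_seconds:.0f} seconds"
            )
        sleep(poll_seconds)
```

In the script, the callable could wrap the existing retrieve call, for example `lambda: openai_client.evals.runs.retrieve(run_id=run.id, eval_id=eval_object.id).status`. A monotonic clock is used so that wall-clock adjustments do not shorten or extend the wait.
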