1 change: 1 addition & 0 deletions samples/python/hosted-agents/README.md
@@ -120,6 +120,7 @@ The platform manages conversation history, streaming lifecycle, and background e
8. **[Observability](./agent-framework/responses/08-observability/)** — Add logging, metrics, and distributed tracing to your agent and visualize them in Foundry.
9. **[Declarative Workflows](./agent-framework/responses/09-declarative-customer-support/)** — A multi-turn customer-support triage workflow defined entirely in YAML and hosted as an agent, demonstrating declarative workflow authoring with `InvokeAzureAgent` calls to specialist Foundry-hosted agents and conversation-aware routing.
10. **[Downstream Azure services](./agent-framework/responses/09-downstream-azure/)** — Call Azure Blob Storage and Service Bus from the agent using its per-agent Microsoft Entra identity (no connection strings).
11. **[Evaluation](./agent-framework/responses/14-evaluation/)** — Evaluate any deployed hosted agent. Custom Rubric Evaluator ⭐, built-in single- and multi-turn evaluators, dataset generation (traces / synthetic), scheduled / continuous evaluation, and red-team / safety evaluation. Start here when you want to measure agent quality.

### Invocations protocol

1 change: 1 addition & 0 deletions samples/python/hosted-agents/agent-framework/README.md
@@ -21,6 +21,7 @@ This directory contains samples that demonstrate how to use the [Agent Framework
| 11 | [Azure AI Search RAG](responses/11-azure-search-rag/) | An agent with Retrieval Augmented Generation (RAG) capabilities backed by Azure AI Search, grounding answers in documents indexed in a pre-provisioned search index. |
| 12 | [Foundry Skills](responses/12-foundry-skills/) | An agent that uploads `SKILL.md` files to the Foundry Skills REST API and downloads them at startup, decoupling tone/policy guidelines from agent code. |
| 13 | [Foundry Memory](responses/13-foundry-memory/) | An agent with persistent semantic memory backed by an Azure AI Foundry Memory Store, using `FoundryMemoryProvider` to remember user facts across sessions. |
| 14 | [Evaluation](responses/14-evaluation/) | Evaluate a deployed hosted agent end-to-end: **Custom Rubric Evaluator** ⭐, built-in single- and multi-turn evaluators, dataset generation from traces / synthetic seeds, scheduled / continuous evaluation, and red-team / safety evaluation. |

### Invocations API

@@ -79,3 +79,8 @@ To host the agent on Foundry, follow the instructions in the [Deploying the Agen
- Pick a **CPU and Memory** size.
- Click **Deploy**. Fields are validated inline, and the extension handles the build/upload, agent version creation, and RBAC role assignment.
5. After deployment, invoke the agent in the Agent Playground and stream live logs from the **Logs** tab.


## Related

**Evaluate this agent:** see [`../../responses/14-evaluation/`](../../responses/14-evaluation/) for the general evaluation flow: **Custom Rubric Evaluator** ⭐, built-in single- and multi-turn evaluators, dataset generation, scheduled / continuous evaluation, and red-team / safety evaluation. The patterns work for Invocations agents too; only the target shape changes from the agent name and version (`EVAL_AGENT_NAME` / `EVAL_AGENT_VERSION`) used there. For multi-turn evaluation scripts you can run end-to-end, see [`../../responses/01-basic/`](../../responses/01-basic/#evaluating-multi-turn-conversations).
@@ -62,3 +62,72 @@ To host the agent on Foundry, follow the instructions in the [Deploying the Agen
- Pick a **CPU and Memory** size.
- Click **Deploy**. Fields are validated inline, and the extension handles the build/upload, agent version creation, and RBAC role assignment.
5. After deployment, invoke the agent in the Agent Playground and stream live logs from the **Logs** tab.

## Evaluating multi-turn conversations

After your agent is deployed and you've tried it in the Playground, the
next question is *"is it actually any good at multi-turn conversations?"*
**Evaluation** answers that — you run the agent against test conversations
and let built-in evaluators (automated scorers, themselves LLM-backed)
grade each conversation on things like *task completion*, *coherence*,
and *groundedness*. New to evaluation? Skim the **What is evaluation?**
section in [`../14-evaluation/README.md`](../14-evaluation/README.md)
first — this section assumes you've seen it.

Two scripts in this folder let you evaluate multi-turn behavior end-to-end
without leaving the `01-basic` sample:

* **[`evaluate_multiturn_simulation.py`](./evaluate_multiturn_simulation.py)** —
drives the deployed agent through simulated multi-turn conversations
seeded from [`data/test-scenarios.jsonl`](./data/test-scenarios.jsonl)
and scores them with the 4 built-in conversation-level evaluators
(`customer_satisfaction`, `groundedness`, `coherence`, `task_completion`).
No traces required — pick this if you haven't enabled tracing yet.
* **[`evaluate_multiturn_traces.py`](./evaluate_multiturn_traces.py)** —
same 4 evaluators, but scored against **real conversations captured as
traces**. Use this once your agent is receiving real traffic.
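
As a rough illustration of the seed format the simulation script reads, the sketch below parses scenario lines and checks for the three fields that appear in `data/test-scenarios.jsonl` (`id`, `test_case_description`, `desired_num_turns`). The `parse_scenarios` helper and its validation rules are assumptions made for illustration; they are not part of the sample.

```python
import json

# Fields observed in data/test-scenarios.jsonl, with the types this sketch expects.
REQUIRED_FIELDS = {"id": str, "test_case_description": str, "desired_num_turns": int}


def parse_scenarios(lines):
    """Parse JSONL scenario lines, skipping blank lines and rejecting malformed rows."""
    scenarios = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(row.get(field), expected_type):
                raise ValueError(
                    f"line {number}: {field!r} is missing or not {expected_type.__name__}"
                )
        if row["desired_num_turns"] < 1:
            raise ValueError(f"line {number}: desired_num_turns must be at least 1")
        scenarios.append(row)
    return scenarios
```

Running a check like this before starting an eval run can surface a malformed seed file locally instead of partway through a simulation.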

> **Tracing prerequisite for `evaluate_multiturn_traces.py`** — this sample
> does **not** enable tracing by default. Before running the trace-based
> script, copy `ENABLE_INSTRUMENTATION=true` and `ENABLE_SENSITIVE_DATA=true`
> from [`../08-observability/agent.yaml`](../08-observability/agent.yaml) into
> the `agent.yaml` of your `01-basic` deployment, **then redeploy the agent**
> (changes to `agent.yaml` don't take effect until the next `azd up`).
> If you'd rather not enable tracing, use `evaluate_multiturn_simulation.py` instead.

> ⚠ **About `ENABLE_SENSITIVE_DATA=true`** — that flag means user inputs
> and model outputs (including any PII) are written verbatim to your
> Application Insights workspace, so trace-based evaluation can score the
> content. Fine for dev / demos; for anything customer-facing, decide
> deliberately and treat the trace workspace as customer data.

### Run a script

```bash
pip install -r requirements-eval.txt
az login
# Required:
export FOUNDRY_PROJECT_ENDPOINT="https://<account>.services.ai.azure.com/api/projects/<project>"
export AZURE_AI_MODEL_DEPLOYMENT_NAME="gpt-4.1-mini"
# Optional overrides (defaults shown):
export EVAL_AGENT_NAME="agent-framework-agent-basic-responses"
export EVAL_AGENT_VERSION="1"

python evaluate_multiturn_simulation.py
# or
python evaluate_multiturn_traces.py
```

> Windows / PowerShell? Replace `export FOO=bar` with `$env:FOO = "bar"`.
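
Before launching either script, it can help to confirm that the environment variables are set and that the placeholder endpoint has been replaced. The following is a minimal sketch under that assumption; `check_eval_env` is a hypothetical helper, and its rules (HTTPS endpoint, no `<...>` placeholders) are illustrative rather than something the sample enforces.

```python
from urllib.parse import urlparse


def check_eval_env(env):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    endpoint = env.get("FOUNDRY_PROJECT_ENDPOINT", "")
    if not endpoint:
        problems.append("FOUNDRY_PROJECT_ENDPOINT is not set")
    elif "<" in endpoint or ">" in endpoint:
        # The README example uses <account> and <project> as placeholders.
        problems.append("FOUNDRY_PROJECT_ENDPOINT still contains a <placeholder>")
    else:
        parsed = urlparse(endpoint)
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append("FOUNDRY_PROJECT_ENDPOINT should be an https:// URL")
    if not env.get("AZURE_AI_MODEL_DEPLOYMENT_NAME"):
        problems.append("AZURE_AI_MODEL_DEPLOYMENT_NAME is not set")
    return problems
```

Calling `check_eval_env(os.environ)` (after `import os`) and printing each returned string gives a quick pre-flight report without contacting Azure.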

Each script prints the eval ID, run ID, a `result_counts` summary, and a
**Foundry portal report URL** — open the URL to drill into per-row scores
and rationales.
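
To make the `result_counts` summary easier to act on (for example, in a CI gate), one option is to reduce it to a pass rate. The sketch below assumes the summary has been converted to a plain dict with `total`, `passed`, `failed`, and `errored` keys, as in the OpenAI Evals run object; `summarize_counts` is a hypothetical helper, not part of the scripts.

```python
def summarize_counts(counts):
    """Reduce a result_counts-style mapping to a pass rate over scored rows."""
    passed = counts.get("passed", 0)
    failed = counts.get("failed", 0)
    errored = counts.get("errored", 0)
    scored = passed + failed  # errored rows produced no score, so they are excluded
    return {
        "total": counts.get("total", scored + errored),
        "scored": scored,
        "errored": errored,
        "pass_rate": passed / scored if scored else None,
    }
```

Errored rows are reported separately rather than counted as failures, so a flaky evaluator call does not silently lower the pass rate.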

### See also

These scripts are co-located here for the **multi-turn learning path**. For
the broader evaluation story — **Custom Rubric Evaluator** ⭐, built-in
single-turn evaluators, dataset generation (traces / synthetic), scheduled /
continuous evaluation, and red-team / safety evaluation — see
[`../14-evaluation/`](../14-evaluation/).
@@ -0,0 +1,4 @@
{"messages": [{"role": "user", "content": "Hi there!"}, {"role": "assistant", "content": "Hello! How can I help you today?"}, {"role": "user", "content": "Just wanted to say hi."}, {"role": "assistant", "content": "That's kind of you. Anything else I can do?"}]}
{"messages": [{"role": "user", "content": "What's the weather like in Seattle today?"}, {"role": "assistant", "content": "It's currently 58°F and overcast in Seattle, with a high of 64°F expected. Light rain in the afternoon."}, {"role": "user", "content": "Do I need an umbrella?"}, {"role": "assistant", "content": "Yes, the rain probability after 2 PM is about 70%, so an umbrella is a good idea."}, {"role": "user", "content": "Thanks!"}, {"role": "assistant", "content": "You're welcome — have a good day!"}]}
{"messages": [{"role": "user", "content": "Is the downtown store still open?"}, {"role": "assistant", "content": "The downtown store closes at 9 PM tonight. It's currently 7:45 PM, so you have about 75 minutes."}, {"role": "user", "content": "What if I want curbside pickup?"}, {"role": "assistant", "content": "Curbside pickup stops 30 minutes before close, so 8:30 PM tonight. Want me to start an order?"}, {"role": "user", "content": "Yes please, one bag of dark roast coffee beans."}, {"role": "assistant", "content": "Order placed: 1x dark roast coffee beans for curbside pickup at the downtown store. Confirmation #DT-48291. Ready in ~15 minutes."}]}
{"messages": [{"role": "user", "content": "What's 23 times 47?"}, {"role": "assistant", "content": "23 × 47 = 1081."}, {"role": "user", "content": "Now divide that by 2 and add 19."}, {"role": "assistant", "content": "1081 / 2 = 540.5. Then 540.5 + 19 = 559.5."}, {"role": "user", "content": "Square root of that?"}, {"role": "assistant", "content": "√559.5 ≈ 23.65."}]}
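
The conversation rows above all follow the same shape: a `messages` array that starts with a `user` turn, alternates roles, and ends on an `assistant` turn. A small check along those lines might look like the sketch below; `has_alternating_turns` is a hypothetical helper, and the alternation rule is inferred from these four rows rather than from any documented requirement of the evaluators.

```python
def has_alternating_turns(messages):
    """Return True if roles alternate user/assistant, start with user, and end with assistant."""
    expected = "user"
    for message in messages:
        if message.get("role") != expected or not isinstance(message.get("content"), str):
            return False
        expected = "assistant" if expected == "user" else "user"
    # After a trailing assistant turn the next expected role is "user" again.
    return bool(messages) and expected == "user"
```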
@@ -0,0 +1,4 @@
{"id": "scenario-1-greeting", "test_case_description": "User starts with a casual greeting; agent should respond warmly and offer help.", "desired_num_turns": 3}
{"id": "scenario-2-weather-followup", "test_case_description": "User asks about the weather in a major city, then asks a follow-up about whether to bring an umbrella.", "desired_num_turns": 4}
{"id": "scenario-3-store-hours", "test_case_description": "User asks if a store is open, then progressively narrows down to curbside-pickup hours, then places an order.", "desired_num_turns": 5}
{"id": "scenario-4-math-chained", "test_case_description": "User asks a multi-step arithmetic question across multiple turns, building on prior results.", "desired_num_turns": 4}
@@ -0,0 +1,175 @@
# Copyright (c) Microsoft. All rights reserved.

"""Multi-turn evaluation driven by simulated conversations.

DESCRIPTION:
    Companion to ``main.py``. Once you've deployed the basic agent (``azd up``
    or via the Foundry Toolkit), this script runs an *eval run* that simulates
    multi-turn conversations against the deployed agent and scores them with
    the 4 built-in conversation-level evaluators
    (``customer_satisfaction``, ``groundedness``, ``coherence``,
    ``task_completion``).

    For the full evaluation learning path (Custom Rubric Evaluator, dataset
    generation, scheduled / continuous eval, red-team), see
    [``14-evaluation/``](../14-evaluation/).

USAGE:
    python evaluate_multiturn_simulation.py

    Prerequisites:
    * Deployed ``01-basic`` agent (``EVAL_AGENT_NAME``,
      ``EVAL_AGENT_VERSION`` default to ``agent-framework-agent-basic-responses:1``).
    * ``FOUNDRY_PROJECT_ENDPOINT`` set (see ``.env.example``).
"""

import json
import os
import time
from pathlib import Path
from pprint import pprint
from typing import Union

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv
from openai.types.evals.run_create_response import RunCreateResponse
from openai.types.evals.run_retrieve_response import RunRetrieveResponse

API_VERSION = "2025-11-15-preview"
SCENARIOS_FILE = Path(__file__).parent / "data" / "test-scenarios.jsonl"


def _env(name: str, default: str | None = None) -> str:
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"{name} is not set (see .env.example).")
    return value


def target_agent() -> dict[str, str]:
    return {
        "type": "azure_ai_agent",
        "name": os.environ.get("EVAL_AGENT_NAME", "agent-framework-agent-basic-responses"),
        "version": os.environ.get("EVAL_AGENT_VERSION", "1"),
    }


def build_conversation_evaluators(model: str) -> list[dict]:
    common = {
        "type": "azure_ai_evaluator",
        "initialization_parameters": {"deployment_name": model},
        "data_mapping": {"messages": "{{item.messages}}"},
    }
    return [
        {**common, "name": "customer_satisfaction",
         "evaluator_name": "builtin.customer_satisfaction"},
        {**common, "name": "groundedness",
         "evaluator_name": "builtin.groundedness"},
        {**common, "name": "coherence",
         "evaluator_name": "builtin.coherence"},
        {**common, "name": "task_completion",
         "evaluator_name": "builtin.task_completion"},
    ]


def load_scenarios() -> list[dict]:
    with SCENARIOS_FILE.open(encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


def main() -> None:
    load_dotenv()
    endpoint = _env("FOUNDRY_PROJECT_ENDPOINT").rstrip("/")
    model = os.environ.get("AZURE_AI_MODEL_DEPLOYMENT_NAME", "gpt-4.1-mini")
    print(f"Using API version: {API_VERSION}")
    print(f"Project: {endpoint}")
    print(f"Target agent: {target_agent()}\n")

    scenarios = load_scenarios()
    print(f"Loaded {len(scenarios)} seed scenarios from {SCENARIOS_FILE.name}")

    with (
        DefaultAzureCredential() as credential,
        AIProjectClient(endpoint=endpoint, credential=credential) as project_client,
        project_client.get_openai_client() as openai_client,
    ):
        eval_object = openai_client.evals.create(
            name="basic-agent-multiturn-sim",
            data_source_config={
                "type": "custom",
                "item_schema": {
                    "type": "object",
                    "properties": {"messages": {"type": "array"}},
                    "required": ["messages"],
                },
                "include_sample_schema": False,
            },
            testing_criteria=build_conversation_evaluators(model),
        )
        print(f"Eval created: {eval_object.id}")

        run: Union[RunCreateResponse, RunRetrieveResponse] = openai_client.evals.runs.create(
            eval_id=eval_object.id,
            name=f"sim-{target_agent()['name']}",
            evaluation_level="conversation",
            data_source={
                "type": "azure_ai_target_completions",
                "source": {
                    "type": "file_content",
                    "content": [{"item": s} for s in scenarios],
                },
                "target": target_agent(),
                "input_messages": {
                    "type": "template",
                    "template": [
                        {
                            "type": "message",
                            "role": "user",
                            "content": {
                                "type": "input_text",
                                "text": "{{item.test_case_description}}",
                            },
                        }
                    ],
                },
                "item_generation_params": {
                    "type": "conversation_gen_preview",
                    "model": model,
                    "num_conversations": 1,
                    "max_turns": 4,
                    "sampling_params": {
                        "temperature": 0.7,
                        "top_p": 1.0,
                        "max_completion_tokens": 800,
                    },
                    "data_mapping": {
                        "test_case_description": "test_case_description",
                        "id": "id",
                        "desired_num_turns": "desired_num_turns",
                    },
                },
            },
        )
        print(f"Eval run created: {run.id}")
        print("Simulation runs can take several minutes per conversation …")

        while run.status not in {"completed", "failed", "canceled"}:
            time.sleep(10)
            run = openai_client.evals.runs.retrieve(run_id=run.id, eval_id=eval_object.id)
            print(f"  status: {run.status}")

        print(f"\nFinal status: {run.status}")
        print(f"Result counts: {run.result_counts}")
        print(f"Report URL: {run.report_url}")
        if run.status == "completed":
            items = list(
                openai_client.evals.runs.output_items.list(run_id=run.id, eval_id=eval_object.id)
            )
            print(f"\nFirst of {len(items)} output items:")
            if items:
                pprint(items[0])


if __name__ == "__main__":
    main()
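
The polling loop in `main()` waits until the run reaches a terminal status, with no upper bound on how long that takes. If a bounded wait is preferable (for example, in CI), one possible variant is sketched below. `wait_for_terminal` and its parameters are hypothetical; it takes any zero-argument callable that returns the current status string, so it makes no assumptions about the SDK beyond the three terminal statuses the script already checks.

```python
import time

# Terminal statuses checked by the polling loop in main().
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_terminal(fetch_status, timeout_s=1800.0, interval_s=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"eval run still {status!r} after {timeout_s:.0f}s")
        sleep(interval_s)
```

Inside the script, `fetch_status` could be a small lambda that calls `openai_client.evals.runs.retrieve(...)` and returns its `.status`; injecting `sleep` and `clock` keeps the helper testable without real delays.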