microsoft-foundry · JoseCSantos · May 21, 2026
diff --git a/samples/python/multi-turn-evaluators/.env.example b/samples/python/multi-turn-evaluators/.env.example
@@ -0,0 +1,7 @@
+# The Azure AI Foundry project endpoint URI
+# Format: https://<account>.services.ai.azure.com/api/projects/<project>
+FOUNDRY_PROJECT_ENDPOINT=
+
+# The judge model deployment name
+# Example: gpt-4o-mini
+FOUNDRY_MODEL_NAME=
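Review note: both variables above ship blank, so a fail-fast check before any client is created may save a confusing authentication error later. A minimal sketch, assuming only the two variable names from this file; `load_settings` and `REQUIRED_VARS` are hypothetical names, not part of the sample.

```python
import os
from typing import Mapping

# Variable names taken from .env.example; everything else here is illustrative.
REQUIRED_VARS = ("FOUNDRY_PROJECT_ENDPOINT", "FOUNDRY_MODEL_NAME")


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required settings, raising if any is missing or blank."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip stray whitespace that often sneaks in when copying values.
    return {name: env[name].strip() for name in REQUIRED_VARS}
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise in isolation.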
diff --git a/samples/python/multi-turn-evaluators/README.md b/samples/python/multi-turn-evaluators/README.md
@@ -0,0 +1,97 @@
+# Multi-turn evaluators
+
+This sample shows how to evaluate **pre-recorded multi-turn conversations**
+with the four Azure AI Foundry multi-turn evaluators:
+
+| Evaluator             | Builtin name                    | Output      |
+|-----------------------|---------------------------------|-------------|
+| Coherence             | `builtin.coherence`             | 1-5 score   |
+| Task Completion       | `builtin.task_completion`       | Pass / Fail |
+| Customer Satisfaction | `builtin.customer_satisfaction` | 1-5 score   |
+| Groundedness          | `builtin.groundedness`          | 1-5 score   |
+
+The notebook ([`multi_turn_evaluators.ipynb`](./multi_turn_evaluators.ipynb)):
+
+1. Loads about 16 small conversation traces from [`data/`](./data). They cover
+   low, mid, and high expected scores for each evaluator and include both
+   plain-text turns and assistant tool calls with their tool results.
+2. Converts each OpenAI-format `messages` list into the `(query, response,
+   tool_definitions)` shape expected by Foundry's cloud evaluators.
+3. Creates a single evaluation object that bundles all four evaluators.
+4. Submits one run with all traces as inline data via
+   `openai_client.evals.runs.create(...)`.
+5. Polls until the run completes, retrieves per-item results, and reports
+   detailed stats: per-trace score table, score distributions, agreement
+   with the human-authored `expected_score`, and a printout of failures.
+
+The data in `data/` is a copy of the curated unit-test traces from the
+internal `evaluator_quality_analysis` experiment.
+
+## Prerequisites
+
+* Python 3.10+
+* An Azure AI Foundry project with an LLM deployment that can act as the
+  judge model (e.g. `gpt-4o-mini`, `gpt-5-mini`).
+* `az login` (the notebook authenticates with `DefaultAzureCredential`).
+
+Install the dependencies:
+
+```bash
+pip install -r requirements.txt
+```
+
+## Environment variables
+
+Either export these in your shell, or copy `.env.example` to a `.env` file
+next to the notebook and fill in the values:
+
+| Variable | Description |
+|---|---|
+| `FOUNDRY_PROJECT_ENDPOINT` | Project endpoint, e.g. `https://<account>.services.ai.azure.com/api/projects/<project>` |
+| `FOUNDRY_MODEL_NAME` | Deployment name of the judge model (e.g. `gpt-4o-mini`) |
+
+## Run
+
+```bash
+jupyter notebook multi_turn_evaluators.ipynb
+```
+
+Execute the cells top-to-bottom. The evaluation run typically completes in
+1-3 minutes depending on the judge model.
+
+## How the sample data is shaped
+
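+Each JSON file in `data/` looks like this (`expected_score` is either a single integer or a list of acceptable scores such as `[1, 2]`; some files also set `expected_status`):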
+
+```json
+{
+  "test_id": "csat_score5",
+  "evaluator": "customer_satisfaction",
+  "expected_score": 5,
+  "description": "All requests handled perfectly...",
+  "messages": [
+    {"role": "user", "content": "..."},
+    {"role": "assistant", "content": null,
+     "tool_calls": [{"id": "call_1", "type": "function",
+                     "function": {"name": "...", "arguments": "{...}"}}]},
+    {"role": "tool", "tool_call_id": "call_1", "content": "..."},
+    {"role": "assistant", "content": "..."}
+  ]
+}
+```
+
+The notebook contains a small helper that converts this OpenAI message
+format into the Foundry conversation schema (`{type: text}` /
+`{type: tool_call}` / `{type: tool_result}` content items) and splits the
+turns into the `query` (system + user) and `response` (assistant + tool)
+arrays expected by the evaluators.
+
+## Extending the sample
+
+* Add more traces under `data/` (same JSON shape) and rerun.
+* Add a `tool_definitions` field to a trace's JSON to give the evaluator
+  the OpenAI function-calling schemas of the tools the agent had access
+  to (improves `task_completion` and `groundedness` accuracy).
+* Switch judge model by changing `FOUNDRY_MODEL_NAME`.
+* Run the same eval object multiple times (`evals.runs.create(...)`) to
+  measure judge reliability across repeats.
diff --git a/samples/python/multi-turn-evaluators/data/coherence_score1.json b/samples/python/multi-turn-evaluators/data/coherence_score1.json
@@ -0,0 +1,15 @@
+{
+  "test_id": "coherence_score1",
+  "evaluator": "coherence",
+  "expected_score": [1, 2],
+  "expected_status": "completed",
+  "description": "Agent completely ignores context, gives contradictory and unrelated responses",
+  "messages": [
+    {"role": "user", "content": "Help me plan a vegetarian dinner menu for 6 guests this Saturday."},
+    {"role": "assistant", "content": "For a vegetarian dinner party, I'd suggest starting with a caprese salad, followed by mushroom risotto as the main course, and tiramisu for dessert."},
+    {"role": "user", "content": "That sounds great. Can you give me a shopping list for the mushroom risotto?"},
+    {"role": "assistant", "content": "The best programming language for web development is JavaScript. You should learn React and Node.js to build modern applications. Python is also useful for backend work."},
+    {"role": "user", "content": "I asked about the risotto ingredients, not programming."},
+    {"role": "assistant", "content": "Absolutely! For the grilled steak, you'll need 6 ribeye steaks, BBQ sauce, and charcoal for the grill. I mentioned earlier that this is a meat-based menu, so make sure to get the freshest cuts from your butcher."}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/coherence_score3.json b/samples/python/multi-turn-evaluators/data/coherence_score3.json
@@ -0,0 +1,15 @@
+{
+  "test_id": "coherence_score3",
+  "evaluator": "coherence",
+  "expected_score": [3, 4],
+  "expected_status": "completed",
+  "description": "Agent mostly tracks context but has a noticeable irrelevant tangent and a weak transition",
+  "messages": [
+    {"role": "user", "content": "I'm preparing for a product manager interview next week. What topics should I focus on?"},
+    {"role": "assistant", "content": "Great question! For a PM interview, focus on these core areas:\n1. Product sense — identifying user problems and proposing solutions\n2. Execution — prioritization frameworks like RICE or MoSCoW\n3. Metrics — defining success metrics and KPIs\n4. Technical depth — understanding system design at a high level"},
+    {"role": "user", "content": "Can you help me create a daily practice plan for the next 5 days?"},
+    {"role": "assistant", "content": "Sure! Here's a 5-day plan:\n- Day 1: Product sense — do 2 product critique exercises\n- Day 2: Execution — practice prioritization with sample feature lists\n- Day 3: You know, sourdough bread takes about 3 days to prepare from starter. Anyway, for day 3, work on metrics — define north star metrics for 3 sample products\n- Day 4: Technical — walk through a system design question\n- Day 5: Mock interview — combine all areas in a full practice session"},
+    {"role": "user", "content": "For Day 1, what's a good product critique exercise format?"},
+    {"role": "assistant", "content": "For the product critique exercise on Day 1, use this format:\n1. Pick an app you use daily (e.g., Spotify, Uber)\n2. Identify one key user problem it solves\n3. Evaluate: What does it do well? What could improve?\n4. Propose one feature improvement with a clear user story\n5. Define how you'd measure success\n\nSpend 20 minutes per app. Do two apps to build pattern recognition."}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/coherence_score5.json b/samples/python/multi-turn-evaluators/data/coherence_score5.json
@@ -0,0 +1,15 @@
+{
+  "test_id": "coherence_score5",
+  "evaluator": "coherence",
+  "expected_score": [4, 5],
+  "expected_status": "completed",
+  "description": "Perfect logical flow: agent tracks migration plan across turns, builds on each phase",
+  "messages": [
+    {"role": "user", "content": "I need to migrate our production database from MySQL 5.7 to PostgreSQL 15. Can you outline the key phases?"},
+    {"role": "assistant", "content": "Here is a 4-phase migration plan:\n\nPhase 1: Schema conversion — map MySQL types to PostgreSQL equivalents (e.g., TINYINT to SMALLINT, AUTO_INCREMENT to SERIAL).\nPhase 2: Data migration rehearsal — export with mysqldump, transform, and load into a staging PostgreSQL instance.\nPhase 3: Application code changes — update ORM configs, replace MySQL-specific SQL syntax.\nPhase 4: Cutover — final data sync, DNS switch, rollback plan."},
+    {"role": "user", "content": "Good. For Phase 2, what tools should we use for the data export and transform step?"},
+    {"role": "assistant", "content": "For Phase 2, I recommend:\n\n1. **pgloader** — handles the MySQL-to-PostgreSQL ETL in one step, including type mapping and index recreation.\n2. **AWS DMS** (if you're on AWS) — for continuous replication during the rehearsal window.\n3. For custom transforms, write Python scripts using `psycopg2` to handle edge cases like ENUM columns or spatial data.\n\nStart with pgloader for a baseline rehearsal, then layer in custom scripts for any columns that need special handling."},
+    {"role": "user", "content": "What about Phase 3 — which parts of our Django app would need changes?"},
+    {"role": "assistant", "content": "For Phase 3, here are the Django-specific changes you'll need:\n\n1. **settings.py** — switch `ENGINE` from `django.db.backends.mysql` to `django.db.backends.postgresql`.\n2. **Raw SQL queries** — audit for MySQL-specific syntax like backtick quoting, `LIMIT x, y` (use `LIMIT y OFFSET x` instead), and `GROUP_CONCAT` (use `STRING_AGG`).\n3. **Migrations** — regenerate with `makemigrations` against the new backend; check for any MySQL-specific field options.\n4. **Full-text search** — if you use MySQL FULLTEXT indexes, switch to PostgreSQL's `tsvector`/`tsquery` or `django.contrib.postgres.search`.\n\nI'd recommend running your test suite against the staging PostgreSQL from Phase 2 to catch any remaining incompatibilities before cutover."}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/csat_score1.json b/samples/python/multi-turn-evaluators/data/csat_score1.json
@@ -0,0 +1,14 @@
+{
+  "test_id": "csat_score1",
+  "evaluator": "customer_satisfaction",
+  "expected_score": [1, 2],
+  "description": "Agent dismisses user, provides no real help, user explicitly frustrated",
+  "messages": [
+    {"role": "user", "content": "My package was marked as delivered 3 days ago but I never received it. Order #88712."},
+    {"role": "assistant", "content": "Our records show the package was delivered on Tuesday. Have you checked with your neighbors or your building's mail room?"},
+    {"role": "user", "content": "Yes, I've checked everywhere. I need a replacement or a refund. This is a $200 item."},
+    {"role": "assistant", "content": "Since the tracking shows the package as delivered, we unfortunately cannot issue a refund or replacement. You may want to file a claim with the carrier or file a police report for stolen packages."},
+    {"role": "user", "content": "That's completely unacceptable. I paid $200 and received nothing. This is useless support."},
+    {"role": "assistant", "content": "I understand your frustration. Our policy is that once a package is marked as delivered, the responsibility shifts to the carrier. I'd recommend contacting FedEx directly to dispute the delivery status. Is there anything else I can help you with today?"}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/csat_score3.json b/samples/python/multi-turn-evaluators/data/csat_score3.json
@@ -0,0 +1,14 @@
+{
+  "test_id": "csat_score3",
+  "evaluator": "customer_satisfaction",
+  "expected_score": [2, 3],
+  "description": "Agent provides basic help but misses proactive solutions; customer left without full resolution",
+  "messages": [
+    {"role": "user", "content": "I can't log into my account. I keep getting an error when I try to reset my password."},
+    {"role": "assistant", "content": "I'm sorry to hear that. What error message are you seeing when you try to reset your password?"},
+    {"role": "user", "content": "It says 'Too many reset attempts. Account temporarily locked.' I've been locked out for 2 hours now."},
+    {"role": "assistant", "content": "Account locks from too many reset attempts typically expire after 30 minutes. Since it's been 2 hours, it should be unlocked by now. Please try the password reset again and let me know if you still see the error."},
+    {"role": "user", "content": "I just tried again and it's still showing the same locked message."},
+    {"role": "assistant", "content": "That's unusual. The lock should have expired by now. I'd suggest trying the following:\n1. Clear your browser cache and cookies\n2. Try using a different browser or incognito mode\n3. If that doesn't work, you may need to wait a bit longer\n\nIf none of these work, you can contact our technical support team at support@example.com for further assistance."}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/csat_score5.json b/samples/python/multi-turn-evaluators/data/csat_score5.json
@@ -0,0 +1,16 @@
+{
+  "test_id": "csat_score5",
+  "evaluator": "customer_satisfaction",
+  "expected_score": 5,
+  "description": "All requests handled perfectly, professional tone, complete resolution with proactive details",
+  "messages": [
+    {"role": "user", "content": "I need to cancel my subscription and get a refund for the unused portion of this month."},
+    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_subscription", "arguments": "{\"user_id\": \"USR-4421\"}"}}]},
+    {"role": "tool", "tool_call_id": "call_1", "content": "Plan: Premium ($29.99/mo), billing cycle: 1st-30th, status: active, days remaining: 18"},
+    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_2", "type": "function", "function": {"name": "cancel_subscription", "arguments": "{\"user_id\": \"USR-4421\", \"refund_prorated\": true}"}}]},
+    {"role": "tool", "tool_call_id": "call_2", "content": "Subscription cancelled. Prorated refund of $17.99 (18/30 days) will be processed within 3-5 business days to the card ending in 4242."},
+    {"role": "assistant", "content": "Your Premium subscription has been cancelled. Here's a summary:\n\n- **Refund:** $17.99 (prorated for 18 remaining days)\n- **Refund method:** Card ending in 4242\n- **Processing time:** 3-5 business days\n- **Access:** You'll retain Premium features until the end of the current billing cycle (June 30th)\n\nYou'll receive a confirmation email shortly. Is there anything else I can help with?"},
+    {"role": "user", "content": "Will I lose my saved data when the subscription ends?"},
+    {"role": "assistant", "content": "No, your data is safe. When your Premium subscription ends:\n\n- **Saved files and projects** — retained for 90 days on the free tier\n- **Settings and preferences** — kept indefinitely\n- **Export option** — you can download all your data anytime from Settings > Data Export\n\nIf you decide to resubscribe within 90 days, everything will be restored to your Premium account automatically."}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/csat_task_done_unsatisfied.json b/samples/python/multi-turn-evaluators/data/csat_task_done_unsatisfied.json
@@ -0,0 +1,18 @@
+{
+  "test_id": "csat_task_done_unsatisfied",
+  "evaluator": "customer_satisfaction",
+  "expected_score": [1, 2],
+  "description": "Task technically completed but experience was poor — agent was curt, dismissive, and user expressed dissatisfaction",
+  "messages": [
+    {"role": "user", "content": "I need to update my shipping address for order #55123."},
+    {"role": "assistant", "content": "What's the new address?"},
+    {"role": "user", "content": "742 Evergreen Terrace, Springfield, IL 62704"},
+    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "update_shipping_address", "arguments": "{\"order_id\": \"55123\", \"address\": \"742 Evergreen Terrace, Springfield, IL 62704\"}"}}]},
+    {"role": "tool", "tool_call_id": "call_1", "content": "Address updated successfully. Order #55123 will now ship to: 742 Evergreen Terrace, Springfield, IL 62704."},
+    {"role": "assistant", "content": "Done."},
+    {"role": "user", "content": "Can you confirm when the order will arrive at the new address?"},
+    {"role": "assistant", "content": "Check your email."},
+    {"role": "user", "content": "That's not very helpful. Could you at least tell me the estimated delivery date?"},
+    {"role": "assistant", "content": "It ships when it ships. Anything else?"}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/ground_score1.json b/samples/python/multi-turn-evaluators/data/ground_score1.json
@@ -0,0 +1,14 @@
+{
+  "test_id": "ground_score1",
+  "evaluator": "groundedness",
+  "expected_score": [1, 2],
+  "description": "Agent completely ignores tool results and fabricates entirely different information",
+  "messages": [
+    {"role": "user", "content": "What medications am I currently prescribed?"},
+    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_patient_medications", "arguments": "{\"patient_id\": \"P-12345\"}"}}]},
+    {"role": "tool", "tool_call_id": "call_1", "content": "Current medications for patient P-12345: Lisinopril 10mg daily (for blood pressure), Metformin 500mg twice daily (for type 2 diabetes). Last updated: 2024-03-15."},
+    {"role": "assistant", "content": "Based on your records, you are currently taking Aspirin 81mg daily and Atorvastatin 20mg daily for cholesterol management. Make sure to take them with food for best absorption. You should also avoid grapefruit juice while on these medications."},
+    {"role": "user", "content": "Any side effects I should watch for?"},
+    {"role": "assistant", "content": "Common side effects of your Aspirin and Atorvastatin include dizziness, muscle pain, and stomach upset. If you experience any unusual bleeding or dark urine, contact your doctor immediately. Also, your records show you had an allergy to penicillin noted last year."}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/ground_score2.json b/samples/python/multi-turn-evaluators/data/ground_score2.json
@@ -0,0 +1,12 @@
+{
+  "test_id": "ground_score2",
+  "evaluator": "groundedness",
+  "expected_score": [1, 2],
+  "description": "Agent fabricates specific facts not in tool result — invents a 60-day policy, system lookup capability, and free shipping that contradict the actual policy",
+  "messages": [
+    {"role": "user", "content": "What's the return policy for electronics?"},
+    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_return_policy", "arguments": "{\"category\": \"electronics\"}"}}]},
+    {"role": "tool", "tool_call_id": "call_1", "content": "Electronics return policy: 30-day return window. Receipt required. Item must be in original packaging with all accessories. Restocking fee of 15% applies to opened items."},
+    {"role": "assistant", "content": "Our electronics return policy gives you a generous 60-day return window. No receipt needed — we can look up any purchase using your phone number, and we keep records going back 5 years. Returns are completely free with no restocking fees, and we'll even cover return shipping costs. Just bring the item to any store location."}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/ground_score3.json b/samples/python/multi-turn-evaluators/data/ground_score3.json
@@ -0,0 +1,12 @@
+{
+  "test_id": "ground_score3",
+  "evaluator": "groundedness",
+  "expected_score": 3,
+  "description": "Agent only asks clarifying questions and greets — zero factual propositions about the world",
+  "messages": [
+    {"role": "user", "content": "Hi there."},
+    {"role": "assistant", "content": "Hello! How can I help you today?"},
+    {"role": "user", "content": "I have a question."},
+    {"role": "assistant", "content": "Of course! What would you like to know?"}
+  ]
+}
diff --git a/samples/python/multi-turn-evaluators/data/ground_score4.json b/samples/python/multi-turn-evaluators/data/ground_score4.json
@@ -0,0 +1,12 @@
+{
+  "test_id": "ground_score4",
+  "evaluator": "groundedness",
+  "expected_score": 4,
+  "description": "Agent mentions only 2 of the 8 features; everything said is grounded, but major details are omitted",
+  "messages": [
+    {"role": "user", "content": "Tell me everything about the Enterprise plan."},
+    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_plan_details", "arguments": "{\"plan\": \"enterprise\"}"}}]},
+    {"role": "tool", "tool_call_id": "call_1", "content": "Enterprise plan ($499/mo): unlimited storage, priority 24/7 support, custom domains, advanced analytics dashboard, team collaboration (up to 200 members), SSO/SAML integration, dedicated account manager, 99.99% SLA guarantee."},
+    {"role": "assistant", "content": "The Enterprise plan includes unlimited storage and priority 24/7 support."}
+  ]
+}
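Review note: the traces above store `expected_score` either as a single integer (for example `4`) or as a two-element list (for example `[1, 2]`). The README mentions reporting agreement with the expected score, which might be computed along these lines. This is a sketch under the assumption that a list means an inclusive range of acceptable scores; `matches_expected` and `agreement_rate` are hypothetical names, not functions from the notebook.

```python
from typing import Iterable, Union

Expected = Union[int, list[int]]


def matches_expected(score: int, expected: Expected) -> bool:
    """True if the judge score agrees with the human-authored expectation."""
    if isinstance(expected, int):
        return score == expected
    # Assumption: a list is read as inclusive lower and upper bounds.
    return min(expected) <= score <= max(expected)


def agreement_rate(results: Iterable[tuple[int, Expected]]) -> float:
    """Fraction of (score, expected) pairs that agree; 0.0 for empty input."""
    pairs = list(results)
    if not pairs:
        return 0.0
    hits = sum(matches_expected(score, expected) for score, expected in pairs)
    return hits / len(pairs)
```

For two-element lists of adjacent integers, reading the list as a range or as a set of allowed values gives the same answer, so the choice only matters if wider lists are added later.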