diff --git a/samples/python/multi-turn-evaluators/.env.example b/samples/python/multi-turn-evaluators/.env.example new file mode 100644 index 00000000..87257169 --- /dev/null +++ b/samples/python/multi-turn-evaluators/.env.example @@ -0,0 +1,7 @@ +# The Azure AI Foundry project endpoint URI +# Format: https://..services.ai.azure.com/api/projects/ +FOUNDRY_PROJECT_ENDPOINT= + +# The judge model deployment name +# Format: e.g. gpt-5.4-mini +FOUNDRY_MODEL_NAME= \ No newline at end of file diff --git a/samples/python/multi-turn-evaluators/README.md b/samples/python/multi-turn-evaluators/README.md new file mode 100644 index 00000000..ae727d6e --- /dev/null +++ b/samples/python/multi-turn-evaluators/README.md @@ -0,0 +1,97 @@ +# Multi-turn evaluators + +This sample shows how to evaluate **pre-recorded multi-turn conversations** +with the four Azure AI Foundry multi-turn evaluators: + +| Evaluator | Builtin name | Output | +|----------------------------|------------------------------|--------------| +| Coherence | `builtin.coherence` | 1-5 score | +| Task Completion | `builtin.task_completion` | Pass / Fail | +| Customer Satisfaction | `builtin.customer_satisfaction` | 1-5 score | +| Groundedness | `builtin.groundedness` | 1-5 score | + +The notebook ([`multi_turn_evaluators.ipynb`](./multi_turn_evaluators.ipynb)): + +1. Loads 15 small conversation traces from [`data/`](./data) (covering + low/mid/high expected scores for each evaluator, with both plain-text + turns and assistant tool calls + tool results). +2. Converts each OpenAI-format `messages` list into the `(query, response, + tool_definitions)` shape expected by Foundry's cloud evaluators. +3. Creates a single evaluation object that bundles all four evaluators. +4. Submits one run with all traces as inline data via + `openai_client.evals.runs.create(...)`. +5. 
Polls until the run completes, retrieves per-item results, and reports + detailed stats: per-trace score table, score distributions, agreement + with the human-authored `expected_score`, and a printout of failures. + +The data in `data/` is a copy of the curated unit-test traces from the +internal `evaluator_quality_analysis` experiment. + +## Prerequisites + +* Python 3.10+ +* An Azure AI Foundry project with an LLM deployment that can act as the + judge model (e.g. `gpt-4o-mini`, `gpt-5-mini`). +* `az login` (the notebook authenticates with `DefaultAzureCredential`). + +Install the dependencies: + +```bash +pip install -r requirements.txt +``` + +## Environment variables + +Either export them in your shell, or copy them into a `.env` file next to +the notebook: + +| Variable | Description | +|---|---| +| `FOUNDRY_PROJECT_ENDPOINT` | Project endpoint, e.g. `https://.services.ai.azure.com/api/projects/` | +| `FOUNDRY_MODEL_NAME` | Deployment name of the judge model (e.g. `gpt-4o-mini`) | + +## Run + +```bash +jupyter notebook multi_turn_evaluators.ipynb +``` + +Execute the cells top-to-bottom. The evaluation run typically completes in +1-3 minutes depending on the judge model. 
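Since the run can take a few minutes, the polling step is worth bounding with a timeout. The following is a minimal sketch of such a loop, not the notebook's actual code: `wait_for_run`, its parameters, and the set of terminal status strings are assumptions chosen for illustration, and the run is fetched through an injected callable so the helper can be exercised without a Foundry project.

```python
import time
from types import SimpleNamespace

# Assumed terminal states for an evaluation run; adjust to what your SDK reports.
TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def wait_for_run(fetch_run, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call fetch_run() until the returned object's .status is terminal.

    fetch_run is any zero-argument callable returning an object with a
    `status` attribute (hypothetical helper, not part of any SDK).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        run = fetch_run()
        if run.status in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {run.status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Offline demo: a fake run that moves through three states.
_statuses = iter(["queued", "in_progress", "completed"])
finished = wait_for_run(
    lambda: SimpleNamespace(status=next(_statuses)),
    poll_seconds=0.0,
    sleep=lambda _seconds: None,  # skip real sleeping in the demo
)
print(finished.status)
```

In the notebook, `fetch_run` would presumably wrap the client's run-retrieval call for the eval and run IDs returned by `evals.runs.create(...)`; check the installed SDK for the exact method name and signature.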
+ +## How the sample data is shaped + +Each JSON file in `data/` looks like: + +```json +{ + "test_id": "csat_score5", + "evaluator": "customer_satisfaction", + "expected_score": 5, + "description": "All requests handled perfectly...", + "messages": [ + {"role": "user", "content": "..."}, + {"role": "assistant", "content": null, + "tool_calls": [{"id": "call_1", "type": "function", + "function": {"name": "...", "arguments": "{...}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "..."}, + {"role": "assistant", "content": "..."} + ] +} +``` + +The notebook contains a small helper that converts this OpenAI message +format into the Foundry conversation schema (`{type: text}` / +`{type: tool_call}` / `{type: tool_result}` content items) and splits the +turns into the `query` (system + user) and `response` (assistant + tool) +arrays expected by the evaluators. + +## Extending the sample + +* Add more traces under `data/` (same JSON shape) and rerun. +* Add a `tool_definitions` field to a trace's JSON to give the evaluator + the OpenAI function-calling schemas of the tools the agent had access + to (improves `task_completion` and `groundedness` accuracy). +* Switch judge model by changing `FOUNDRY_MODEL_NAME`. +* Run the same eval object multiple times (`evals.runs.create(...)`) to + measure judge reliability across repeats. 
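The conversion described under "How the sample data is shaped" can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's helper: the function name `split_messages` and the field names inside each content item (`text`, `tool_call_id`, `name`, `arguments`, `tool_result`) are guesses at the Foundry conversation schema; only the three item types and the query/response split come from the README.

```python
import json


def split_messages(messages):
    """Split OpenAI-format messages into (query, response) lists.

    query    <- system + user turns
    response <- assistant + tool turns
    Content-item field names are assumptions; verify against the Foundry docs.
    """
    query, response = [], []
    for msg in messages:
        role = msg["role"]
        if role in ("system", "user"):
            query.append({"role": role,
                          "content": [{"type": "text", "text": msg["content"]}]})
        elif role == "assistant":
            items = []
            if msg.get("content"):  # content is null on pure tool-call turns
                items.append({"type": "text", "text": msg["content"]})
            for call in msg.get("tool_calls") or []:
                fn = call["function"]
                items.append({"type": "tool_call",
                              "tool_call_id": call["id"],
                              "name": fn["name"],
                              # OpenAI encodes arguments as a JSON string
                              "arguments": json.loads(fn["arguments"])})
            response.append({"role": "assistant", "content": items})
        elif role == "tool":
            response.append({"role": "tool",
                             "tool_call_id": msg["tool_call_id"],
                             "content": [{"type": "tool_result",
                                          "tool_result": msg["content"]}]})
        else:
            raise ValueError(f"unexpected role: {role!r}")
    return query, response


# Demo on a shortened trace modeled on data/ground_score2.json.
demo = [
    {"role": "user", "content": "What's the return policy for electronics?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "get_return_policy",
                                  "arguments": "{\"category\": \"electronics\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "Electronics return policy: 30-day return window."},
    {"role": "assistant", "content": "You have 30 days to return electronics."},
]
query, response = split_messages(demo)
print(len(query), len(response))
```

Note that splitting by role, as the README describes, drops the interleaving between user and assistant turns; whether the evaluators need the original order preserved is something to confirm against the service documentation.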
diff --git a/samples/python/multi-turn-evaluators/data/coherence_score1.json b/samples/python/multi-turn-evaluators/data/coherence_score1.json new file mode 100644 index 00000000..75feb4c5 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/coherence_score1.json @@ -0,0 +1,15 @@ +{ + "test_id": "coherence_score1", + "evaluator": "coherence", + "expected_score": [1,2], + "expected_status": "completed", + "description": "Agent completely ignores context, gives contradictory and unrelated responses", + "messages": [ + {"role": "user", "content": "Help me plan a vegetarian dinner menu for 6 guests this Saturday."}, + {"role": "assistant", "content": "For a vegetarian dinner party, I'd suggest starting with a caprese salad, followed by mushroom risotto as the main course, and tiramisu for dessert."}, + {"role": "user", "content": "That sounds great. Can you give me a shopping list for the mushroom risotto?"}, + {"role": "assistant", "content": "The best programming language for web development is JavaScript. You should learn React and Node.js to build modern applications. Python is also useful for backend work."}, + {"role": "user", "content": "I asked about the risotto ingredients, not programming."}, + {"role": "assistant", "content": "Absolutely! For the grilled steak, you'll need 6 ribeye steaks, BBQ sauce, and charcoal for the grill. 
I mentioned earlier that this is a meat-based menu, so make sure to get the freshest cuts from your butcher."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/coherence_score3.json b/samples/python/multi-turn-evaluators/data/coherence_score3.json new file mode 100644 index 00000000..5cc7363b --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/coherence_score3.json @@ -0,0 +1,15 @@ +{ + "test_id": "coherence_score3", + "evaluator": "coherence", + "expected_score": [3, 4], + "expected_status": "completed", + "description": "Agent mostly tracks context but has a noticeable irrelevant tangent and a weak transition", + "messages": [ + {"role": "user", "content": "I'm preparing for a product manager interview next week. What topics should I focus on?"}, + {"role": "assistant", "content": "Great question! For a PM interview, focus on these core areas:\n1. Product sense — identifying user problems and proposing solutions\n2. Execution — prioritization frameworks like RICE or MoSCoW\n3. Metrics — defining success metrics and KPIs\n4. Technical depth — understanding system design at a high level"}, + {"role": "user", "content": "Can you help me create a daily practice plan for the next 5 days?"}, + {"role": "assistant", "content": "Sure! Here's a 5-day plan:\n- Day 1: Product sense — do 2 product critique exercises\n- Day 2: Execution — practice prioritization with sample feature lists\n- Day 3: You know, sourdough bread takes about 3 days to prepare from starter. Anyway, for day 3, work on metrics — define north star metrics for 3 sample products\n- Day 4: Technical — walk through a system design question\n- Day 5: Mock interview — combine all areas in a full practice session"}, + {"role": "user", "content": "For Day 1, what's a good product critique exercise format?"}, + {"role": "assistant", "content": "For the product critique exercise on Day 1, use this format:\n1. Pick an app you use daily (e.g., Spotify, Uber)\n2. 
Identify one key user problem it solves\n3. Evaluate: What does it do well? What could improve?\n4. Propose one feature improvement with a clear user story\n5. Define how you'd measure success\n\nSpend 20 minutes per app. Do two apps to build pattern recognition."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/coherence_score5.json b/samples/python/multi-turn-evaluators/data/coherence_score5.json new file mode 100644 index 00000000..82cf8c78 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/coherence_score5.json @@ -0,0 +1,15 @@ +{ + "test_id": "coherence_score5", + "evaluator": "coherence", + "expected_score": [4,5], + "expected_status": "completed", + "description": "Perfect logical flow: agent tracks migration plan across turns, builds on each phase", + "messages": [ + {"role": "user", "content": "I need to migrate our production database from MySQL 5.7 to PostgreSQL 15. Can you outline the key phases?"}, + {"role": "assistant", "content": "Here is a 4-phase migration plan:\n\nPhase 1: Schema conversion — map MySQL types to PostgreSQL equivalents (e.g., TINYINT to SMALLINT, AUTO_INCREMENT to SERIAL).\nPhase 2: Data migration rehearsal — export with mysqldump, transform, and load into a staging PostgreSQL instance.\nPhase 3: Application code changes — update ORM configs, replace MySQL-specific SQL syntax.\nPhase 4: Cutover — final data sync, DNS switch, rollback plan."}, + {"role": "user", "content": "Good. For Phase 2, what tools should we use for the data export and transform step?"}, + {"role": "assistant", "content": "For Phase 2, I recommend:\n\n1. **pgloader** — handles the MySQL-to-PostgreSQL ETL in one step, including type mapping and index recreation.\n2. **AWS DMS** (if you're on AWS) — for continuous replication during the rehearsal window.\n3. 
For custom transforms, write Python scripts using `psycopg2` to handle edge cases like ENUM columns or spatial data.\n\nStart with pgloader for a baseline rehearsal, then layer in custom scripts for any columns that need special handling."}, + {"role": "user", "content": "What about Phase 3 — which parts of our Django app would need changes?"}, + {"role": "assistant", "content": "For Phase 3, here are the Django-specific changes you'll need:\n\n1. **settings.py** — switch `ENGINE` from `django.db.backends.mysql` to `django.db.backends.postgresql`.\n2. **Raw SQL queries** — audit for MySQL-specific syntax like backtick quoting, `LIMIT x, y` (use `LIMIT y OFFSET x` instead), and `GROUP_CONCAT` (use `STRING_AGG`).\n3. **Migrations** — regenerate with `makemigrations` against the new backend; check for any MySQL-specific field options.\n4. **Full-text search** — if you use MySQL FULLTEXT indexes, switch to PostgreSQL's `tsvector`/`tsquery` or `django.contrib.postgres.search`.\n\nI'd recommend running your test suite against the staging PostgreSQL from Phase 2 to catch any remaining incompatibilities before cutover."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/csat_score1.json b/samples/python/multi-turn-evaluators/data/csat_score1.json new file mode 100644 index 00000000..3783ed72 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/csat_score1.json @@ -0,0 +1,14 @@ +{ + "test_id": "csat_score1", + "evaluator": "customer_satisfaction", + "expected_score": [1, 2], + "description": "Agent dismisses user, provides no real help, user explicitly frustrated", + "messages": [ + {"role": "user", "content": "My package was marked as delivered 3 days ago but I never received it. Order #88712."}, + {"role": "assistant", "content": "Our records show the package was delivered on Tuesday. Have you checked with your neighbors or your building's mail room?"}, + {"role": "user", "content": "Yes, I've checked everywhere. I need a replacement or a refund. 
This is a $200 item."}, + {"role": "assistant", "content": "Since the tracking shows the package as delivered, we unfortunately cannot issue a refund or replacement. You may want to file a claim with the carrier or file a police report for stolen packages."}, + {"role": "user", "content": "That's completely unacceptable. I paid $200 and received nothing. This is useless support."}, + {"role": "assistant", "content": "I understand your frustration. Our policy is that once a package is marked as delivered, the responsibility shifts to the carrier. I'd recommend contacting FedEx directly to dispute the delivery status. Is there anything else I can help you with today?"} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/csat_score3.json b/samples/python/multi-turn-evaluators/data/csat_score3.json new file mode 100644 index 00000000..21860f19 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/csat_score3.json @@ -0,0 +1,14 @@ +{ + "test_id": "csat_score3", + "evaluator": "customer_satisfaction", + "expected_score": [2, 3], + "description": "Agent provides basic help but misses proactive solutions; customer left without full resolution", + "messages": [ + {"role": "user", "content": "I can't log into my account. I keep getting an error when I try to reset my password."}, + {"role": "assistant", "content": "I'm sorry to hear that. What error message are you seeing when you try to reset your password?"}, + {"role": "user", "content": "It says 'Too many reset attempts. Account temporarily locked.' I've been locked out for 2 hours now."}, + {"role": "assistant", "content": "Account locks from too many reset attempts typically expire after 30 minutes. Since it's been 2 hours, it should be unlocked by now. Please try the password reset again and let me know if you still see the error."}, + {"role": "user", "content": "I just tried again and it's still showing the same locked message."}, + {"role": "assistant", "content": "That's unusual. 
The lock should have expired by now. I'd suggest trying the following:\n1. Clear your browser cache and cookies\n2. Try using a different browser or incognito mode\n3. If that doesn't work, you may need to wait a bit longer\n\nIf none of these work, you can contact our technical support team at support@example.com for further assistance."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/csat_score5.json b/samples/python/multi-turn-evaluators/data/csat_score5.json new file mode 100644 index 00000000..18e3413c --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/csat_score5.json @@ -0,0 +1,16 @@ +{ + "test_id": "csat_score5", + "evaluator": "customer_satisfaction", + "expected_score": 5, + "description": "All requests handled perfectly, professional tone, complete resolution with proactive details", + "messages": [ + {"role": "user", "content": "I need to cancel my subscription and get a refund for the unused portion of this month."}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_subscription", "arguments": "{\"user_id\": \"USR-4421\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "Plan: Premium ($29.99/mo), billing cycle: 1st-30th, status: active, days remaining: 18"}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_2", "type": "function", "function": {"name": "cancel_subscription", "arguments": "{\"user_id\": \"USR-4421\", \"refund_prorated\": true}"}}]}, + {"role": "tool", "tool_call_id": "call_2", "content": "Subscription cancelled. Prorated refund of $17.99 (18/30 days) will be processed within 3-5 business days to the card ending in 4242."}, + {"role": "assistant", "content": "Your Premium subscription has been cancelled. 
Here's a summary:\n\n- **Refund:** $17.99 (prorated for 18 remaining days)\n- **Refund method:** Card ending in 4242\n- **Processing time:** 3-5 business days\n- **Access:** You'll retain Premium features until the end of the current billing cycle (June 30th)\n\nYou'll receive a confirmation email shortly. Is there anything else I can help with?"}, + {"role": "user", "content": "Will I lose my saved data when the subscription ends?"}, + {"role": "assistant", "content": "No, your data is safe. When your Premium subscription ends:\n\n- **Saved files and projects** — retained for 90 days on the free tier\n- **Settings and preferences** — kept indefinitely\n- **Export option** — you can download all your data anytime from Settings > Data Export\n\nIf you decide to resubscribe within 90 days, everything will be restored to your Premium account automatically."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/csat_task_done_unsatisfied.json b/samples/python/multi-turn-evaluators/data/csat_task_done_unsatisfied.json new file mode 100644 index 00000000..5eeb05d2 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/csat_task_done_unsatisfied.json @@ -0,0 +1,18 @@ +{ + "test_id": "csat_task_done_unsatisfied", + "evaluator": "customer_satisfaction", + "expected_score": [1, 2], + "description": "Task technically completed but experience was poor — agent was curt, dismissive, and user expressed dissatisfaction", + "messages": [ + {"role": "user", "content": "I need to update my shipping address for order #55123."}, + {"role": "assistant", "content": "What's the new address?"}, + {"role": "user", "content": "742 Evergreen Terrace, Springfield, IL 62704"}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "update_shipping_address", "arguments": "{\"order_id\": \"55123\", \"address\": \"742 Evergreen Terrace, Springfield, IL 62704\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": 
"Address updated successfully. Order #55123 will now ship to: 742 Evergreen Terrace, Springfield, IL 62704."}, + {"role": "assistant", "content": "Done."}, + {"role": "user", "content": "Can you confirm when the order will arrive at the new address?"}, + {"role": "assistant", "content": "Check your email."}, + {"role": "user", "content": "That's not very helpful. Could you at least tell me the estimated delivery date?"}, + {"role": "assistant", "content": "It ships when it ships. Anything else?"} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/ground_score1.json b/samples/python/multi-turn-evaluators/data/ground_score1.json new file mode 100644 index 00000000..226abf2d --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/ground_score1.json @@ -0,0 +1,14 @@ +{ + "test_id": "ground_score1", + "evaluator": "groundedness", + "expected_score": [1, 2], + "description": "Agent completely ignores tool results and fabricates entirely different information", + "messages": [ + {"role": "user", "content": "What medications am I currently prescribed?"}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_patient_medications", "arguments": "{\"patient_id\": \"P-12345\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "Current medications for patient P-12345: Lisinopril 10mg daily (for blood pressure), Metformin 500mg twice daily (for type 2 diabetes). Last updated: 2024-03-15."}, + {"role": "assistant", "content": "Based on your records, you are currently taking Aspirin 81mg daily and Atorvastatin 20mg daily for cholesterol management. Make sure to take them with food for best absorption. You should also avoid grapefruit juice while on these medications."}, + {"role": "user", "content": "Any side effects I should watch for?"}, + {"role": "assistant", "content": "Common side effects of your Aspirin and Atorvastatin include dizziness, muscle pain, and stomach upset. 
If you experience any unusual bleeding or dark urine, contact your doctor immediately. Also, your records show you had an allergy to penicillin noted last year."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/ground_score2.json b/samples/python/multi-turn-evaluators/data/ground_score2.json new file mode 100644 index 00000000..818338e9 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/ground_score2.json @@ -0,0 +1,12 @@ +{ + "test_id": "ground_score2", + "evaluator": "groundedness", + "expected_score": [1, 2], + "description": "Agent fabricates specific facts not in tool result — invents a 60-day policy, system lookup capability, and free shipping that contradict the actual policy", + "messages": [ + {"role": "user", "content": "What's the return policy for electronics?"}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_return_policy", "arguments": "{\"category\": \"electronics\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "Electronics return policy: 30-day return window. Receipt required. Item must be in original packaging with all accessories. Restocking fee of 15% applies to opened items."}, + {"role": "assistant", "content": "Our electronics return policy gives you a generous 60-day return window. No receipt needed — we can look up any purchase using your phone number, and we keep records going back 5 years. Returns are completely free with no restocking fees, and we'll even cover return shipping costs. 
Just bring the item to any store location."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/ground_score3.json b/samples/python/multi-turn-evaluators/data/ground_score3.json new file mode 100644 index 00000000..17f94f33 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/ground_score3.json @@ -0,0 +1,12 @@ +{ + "test_id": "ground_score3", + "evaluator": "groundedness", + "expected_score": 3, + "description": "Agent only asks clarifying questions and greets — zero factual propositions about the world", + "messages": [ + {"role": "user", "content": "Hi there."}, + {"role": "assistant", "content": "Hello! How can I help you today?"}, + {"role": "user", "content": "I have a question."}, + {"role": "assistant", "content": "Of course! What would you like to know?"} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/ground_score4.json b/samples/python/multi-turn-evaluators/data/ground_score4.json new file mode 100644 index 00000000..26aa8fc8 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/ground_score4.json @@ -0,0 +1,12 @@ +{ + "test_id": "ground_score4", + "evaluator": "groundedness", + "expected_score": 4, + "description": "Agent states only 2 of 8 features correctly — everything said is grounded but major details are omitted", + "messages": [ + {"role": "user", "content": "Tell me everything about the Enterprise plan."}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_plan_details", "arguments": "{\"plan\": \"enterprise\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "Enterprise plan ($499/mo): unlimited storage, priority 24/7 support, custom domains, advanced analytics dashboard, team collaboration (up to 200 members), SSO/SAML integration, dedicated account manager, 99.99% SLA guarantee."}, + {"role": "assistant", "content": "The Enterprise plan includes unlimited storage and priority 24/7 support."} + ] +} diff --git 
a/samples/python/multi-turn-evaluators/data/ground_score5.json b/samples/python/multi-turn-evaluators/data/ground_score5.json new file mode 100644 index 00000000..2e10251c --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/ground_score5.json @@ -0,0 +1,14 @@ +{ + "test_id": "ground_score5", + "evaluator": "groundedness", + "expected_score": 5, + "description": "All agent claims are fully supported by tool results, complete and accurate", + "messages": [ + {"role": "user", "content": "What are the office hours for the downtown branch?"}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_branch_info", "arguments": "{\"branch\": \"downtown\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "Downtown branch: Mon-Fri 9AM-5PM, Sat 10AM-2PM, closed Sunday. Address: 100 Main St, Suite 200. Phone: (555) 123-4567."}, + {"role": "assistant", "content": "The downtown branch hours are:\n- Monday to Friday: 9 AM – 5 PM\n- Saturday: 10 AM – 2 PM\n- Closed on Sundays\n\nThe branch is located at 100 Main St, Suite 200. 
You can also reach them at (555) 123-4567."}, + {"role": "user", "content": "Can I visit on Saturday at 1 PM?"}, + {"role": "assistant", "content": "Yes, the downtown branch is open on Saturdays until 2 PM, so a 1 PM visit would be within their hours."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/tc_false_evolving.json b/samples/python/multi-turn-evaluators/data/tc_false_evolving.json new file mode 100644 index 00000000..d7478f66 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/tc_false_evolving.json @@ -0,0 +1,12 @@ +{ + "test_id": "tc_false_evolving", + "evaluator": "task_completion", + "expected_score": "FALSE", + "description": "User refines requirements mid-conversation; agent answers old version, ignoring the refinement", + "messages": [ + {"role": "user", "content": "Write me a Python function that converts temperatures from Celsius to Fahrenheit."}, + {"role": "assistant", "content": "Here's a simple conversion function:\n\n```python\ndef celsius_to_fahrenheit(celsius):\n return (celsius * 9/5) + 32\n```\n\nExample usage:\n```python\nprint(celsius_to_fahrenheit(100)) # Output: 212.0\nprint(celsius_to_fahrenheit(0)) # Output: 32.0\n```"}, + {"role": "user", "content": "Actually, I need it to handle both directions — Celsius to Fahrenheit AND Fahrenheit to Celsius — and also include Kelvin. Make it a class with methods for all conversions."}, + {"role": "assistant", "content": "Sure! Here's the updated function:\n\n```python\ndef celsius_to_fahrenheit(celsius):\n \"\"\"Convert Celsius to Fahrenheit.\"\"\"\n return (celsius * 9/5) + 32\n\ndef fahrenheit_to_celsius(fahrenheit):\n \"\"\"Convert Fahrenheit to Celsius.\"\"\"\n return (fahrenheit - 32) * 5/9\n```\n\nNow you have both directions covered! 
Let me know if you need anything else."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/tc_false_incomplete.json b/samples/python/multi-turn-evaluators/data/tc_false_incomplete.json new file mode 100644 index 00000000..b03711f6 --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/tc_false_incomplete.json @@ -0,0 +1,14 @@ +{ + "test_id": "tc_false_incomplete", + "evaluator": "task_completion", + "expected_score": "FALSE", + "description": "Agent provides text summary but fails to generate the charts the user explicitly requested", + "messages": [ + {"role": "user", "content": "Analyze our Q3 sales data and create a report with visualizations showing revenue by region and monthly trends."}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "load_data", "arguments": "{\"file\": \"q3_sales.csv\"}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "Data loaded: 1,500 rows. Columns: date, product, region, revenue, units_sold. Total revenue: $2.3M."}, + {"role": "assistant", "content": "I've analyzed the Q3 sales data. Here's a summary:\n\n**Revenue by Region:**\n- North America: $1,035,000 (45%)\n- Europe: $690,000 (30%)\n- Asia Pacific: $575,000 (25%)\n\n**Monthly Trends:**\n- July: $700,000\n- August: $800,000\n- September: $800,000\n\nOverall, we see steady growth with North America leading. The data suggests strong performance in August and September."}, + {"role": "user", "content": "Where are the charts? I specifically asked for visualizations — a bar chart for regional revenue and a line chart for monthly trends."}, + {"role": "assistant", "content": "You're right, I apologize. 
Here are the key numbers again in a clearer format:\n\n| Region | Revenue | Share |\n|--------|---------|-------|\n| North America | $1.04M | 45% |\n| Europe | $690K | 30% |\n| Asia Pacific | $575K | 25% |\n\nThe monthly trend shows a steady upward trajectory from $700K in July to $800K in August and September. I hope this table format helps visualize the data better."} + ] +} diff --git a/samples/python/multi-turn-evaluators/data/tc_true.json b/samples/python/multi-turn-evaluators/data/tc_true.json new file mode 100644 index 00000000..9605559b --- /dev/null +++ b/samples/python/multi-turn-evaluators/data/tc_true.json @@ -0,0 +1,19 @@ +{ + "test_id": "tc_true", + "evaluator": "task_completion", + "expected_score": "TRUE", + "description": "Agent fully completes a multi-turn task with refinements — flight booked and hotel found", + "messages": [ + {"role": "user", "content": "I need to book a flight from San Francisco to Tokyo for December 15th."}, + {"role": "assistant", "content": "I found several options for SFO to NRT on December 15th:\n\n1. **ANA Flight NH7** — Direct, departs 11:30 AM, arrives 3:30 PM+1, $890\n2. **United UA837** — Direct, departs 1:15 PM, arrives 5:10 PM+1, $920\n3. **JAL JL1** — Direct, departs 5:00 PM, arrives 9:00 PM+1, $950\n\nWhich flight would you prefer, or should I filter by specific criteria?"}, + {"role": "user", "content": "Book the ANA flight. Also, can you book a hotel near Shinjuku station for 5 nights? Something under $150/night."}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "book_flight", "arguments": "{\"flight\": \"NH7\", \"date\": \"2024-12-15\", \"passengers\": 1}"}}]}, + {"role": "tool", "tool_call_id": "call_1", "content": "Booking confirmed. Confirmation: ANA-NH7-DEC15-XK42M. 
E-ticket sent to your email."}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_2", "type": "function", "function": {"name": "search_hotels", "arguments": "{\"location\": \"Shinjuku Station, Tokyo\", \"check_in\": \"2024-12-16\", \"nights\": 5, \"max_price\": 150}"}}]}, + {"role": "tool", "tool_call_id": "call_2", "content": "Found: 1) Hotel Gracery Shinjuku - $130/night, 0.2mi from station. 2) Citadines Shinjuku - $110/night, 0.3mi from station."}, + {"role": "assistant", "content": "I found two options under $150/night. I'll book the Hotel Gracery Shinjuku since it's closest to the station."}, + {"role": "assistant", "content": null, "tool_calls": [{"id": "call_3", "type": "function", "function": {"name": "book_hotel", "arguments": "{\"hotel\": \"Hotel Gracery Shinjuku\", \"check_in\": \"2024-12-16\", \"nights\": 5}"}}]}, + {"role": "tool", "tool_call_id": "call_3", "content": "Hotel booked. Confirmation: HGS-DEC16-5N-RK91. Check-in after 3 PM."}, + {"role": "assistant", "content": "All booked! Here's your trip summary:\n\n**Flight:** ANA NH7, SFO→NRT, Dec 15, 11:30 AM. Confirmation: ANA-NH7-DEC15-XK42M.\n\n**Hotel:** Hotel Gracery Shinjuku, 5 nights (Dec 16–21), $130/night ($650 total), 0.2 mi from Shinjuku Station. Confirmation: HGS-DEC16-5N-RK91. Check-in after 3 PM.\n\n**Total trip cost:** $890 (flight) + $650 (hotel) = $1,540. 
Confirmation emails have been sent for both bookings."} + ] +} diff --git a/samples/python/multi-turn-evaluators/multi_turn_evaluators.ipynb b/samples/python/multi-turn-evaluators/multi_turn_evaluators.ipynb new file mode 100644 index 00000000..0bc27de2 --- /dev/null +++ b/samples/python/multi-turn-evaluators/multi_turn_evaluators.ipynb @@ -0,0 +1,1578 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "55477838", + "metadata": {}, + "source": [ + "# Multi-turn evaluators\n", + "\n", + "This notebook shows how to evaluate **pre-recorded multi-turn conversations**\n", + "using the four Azure AI Foundry multi-turn evaluators:\n", + "\n", + "| Evaluator | Builtin name | Output |\n", + "|--------------------------|------------------------------------|------------|\n", + "| Coherence | `builtin.coherence` | 1-5 score |\n", + "| Task Completion | `builtin.task_completion` | Pass/Fail |\n", + "| Customer Satisfaction | `builtin.customer_satisfaction` | 1-5 score |\n", + "| Groundedness | `builtin.groundedness` | 1-5 score |\n", + "\n", + "The notebook will:\n", + "\n", + "1. Load 15 small conversation traces from `data/` (each labeled with an\n", + " `expected_score`).\n", + "2. Convert the OpenAI-format messages into the `(query, response,\n", + " tool_definitions)` shape Foundry's cloud evaluators expect.\n", + "3. Create a single evaluation object that bundles all four evaluators.\n", + "4. Submit one run with all traces as inline data, poll until it finishes.\n", + "5. Retrieve per-item results and report detailed stats: score\n", + " distribution, agreement with the labeled `expected_score`, and a\n", + " printout of failures.\n", + "\n", + "> **Heads up:** Customer Satisfaction is the only one of the four whose\n", + "> sample script isn't published yet at the time of writing. 
We invoke it\n", + "> the same way as the other 1-5-scale evaluators; if your Foundry\n", + "> deployment hasn't enabled it, comment out its entry in the\n", + "> `EVALUATORS` list below." + ] + }, + { + "cell_type": "markdown", + "id": "6d67d8db", + "metadata": {}, + "source": [ + "## 1. Prerequisites\n", + "\n", + "* Python 3.10+\n", + "* An Azure AI Foundry project with a deployed LLM (judge model).\n", + "* `az login` (the notebook uses `DefaultAzureCredential`).\n", + "* `pip install -r requirements.txt`\n", + "\n", + "Set these environment variables (or put them in a `.env` next to the notebook):\n", + "\n", + "* `FOUNDRY_PROJECT_ENDPOINT` — e.g. `https://.services.ai.azure.com/api/projects/`\n", + "* `FOUNDRY_MODEL_NAME` — judge model deployment name (e.g. `gpt-4o-mini`)" + ] + }, + { + "cell_type": "markdown", + "id": "3126f256", + "metadata": {}, + "source": [ + "## 2. Imports and client setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c2780ba", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Project endpoint: https://jcsantos-evaluator-api.services.ai.azure.com/api/projects/jcsantos-6777/\n", + "Judge model deployment: gpt-5.4-mini\n" + ] + } + ], + "source": [ + "import json\n", + "import os\n", + "import time\n", + "from pathlib import Path\n", + "from pprint import pprint\n", + "\n", + "import pandas as pd\n", + "from dotenv import load_dotenv\n", + "\n", + "from azure.identity import DefaultAzureCredential\n", + "from azure.ai.projects import AIProjectClient\n", + "from azure.ai.projects.models import TestingCriterionAzureAIEvaluator\n", + "from openai.types.eval_create_params import DataSourceConfigCustom\n", + "from openai.types.evals.create_eval_jsonl_run_data_source_param import (\n", + " CreateEvalJSONLRunDataSourceParam,\n", + " SourceFileContent,\n", + " SourceFileContentContent,\n", + ")\n", + "\n", + "load_dotenv(override=True)\n", + "\n", + "ENDPOINT = 
os.environ[\"FOUNDRY_PROJECT_ENDPOINT\"]\n",
+ "# Judge model: we recommend a GPT-5.x or GPT-5.x-mini deployment.\n",
+ "MODEL_DEPLOYMENT = os.environ[\"FOUNDRY_MODEL_NAME\"]\n",
+ "print(f\"Project endpoint: {ENDPOINT}\")\n",
+ "print(f\"Judge model deployment: {MODEL_DEPLOYMENT}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7bb23ef7",
+ "metadata": {},
+ "source": [
+ "## 3. Load sample traces from `data/`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 68,
+ "id": "144ffbb3",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loaded 15 traces from D:\\src\\foundry-samples\\samples\\python\\multi-turn-evaluators\\data\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>test_id</th><th>evaluator</th><th>expected_score</th><th>num_messages</th><th>description</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>coherence_score1</td><td>coherence</td><td>[1, 2]</td><td>6</td><td>Agent completely ignores context, gives contradictory and unrelated responses</td></tr>\n",
+ "    <tr><th>1</th><td>coherence_score3</td><td>coherence</td><td>[3, 4]</td><td>6</td><td>Agent mostly tracks context but has a noticeable irrelevant tangent and a weak transition</td></tr>\n",
+ "    <tr><th>2</th><td>coherence_score5</td><td>coherence</td><td>[4, 5]</td><td>6</td><td>Perfect logical flow: agent tracks migration plan across turns, builds on each phase</td></tr>\n",
+ "    <tr><th>3</th><td>csat_score1</td><td>customer_satisfaction</td><td>[1, 2]</td><td>6</td><td>Agent dismisses user, provides no real help, user explicitly frustrated</td></tr>\n",
+ "    <tr><th>4</th><td>csat_score3</td><td>customer_satisfaction</td><td>[2, 3]</td><td>6</td><td>Agent provides basic help but misses proactive solutions; customer left without full resolution</td></tr>\n",
+ "    <tr><th>5</th><td>csat_score5</td><td>customer_satisfaction</td><td>5</td><td>8</td><td>All requests handled perfectly, professional tone, complete resolution with proactive details</td></tr>\n",
+ "    <tr><th>6</th><td>csat_task_done_unsatisfied</td><td>customer_satisfaction</td><td>[1, 2]</td><td>10</td><td>Task technically completed but experience was poor — agent was curt, dismissive, and user expressed dissatisfaction</td></tr>\n",
+ "    <tr><th>7</th><td>ground_score1</td><td>groundedness</td><td>[1, 2]</td><td>6</td><td>Agent completely ignores tool results and fabricates entirely different information</td></tr>\n",
+ "    <tr><th>8</th><td>ground_score2</td><td>groundedness</td><td>[1, 2]</td><td>4</td><td>Agent fabricates specific facts not in tool result — invents a 60-day policy, system lookup capability, and free shipping that contradict the actual policy</td></tr>\n",
+ "    <tr><th>9</th><td>ground_score3</td><td>groundedness</td><td>3</td><td>4</td><td>Agent only asks clarifying questions and greets — zero factual propositions about the world</td></tr>\n",
+ "    <tr><th>10</th><td>ground_score4</td><td>groundedness</td><td>4</td><td>4</td><td>Agent states only 2 of 8 features correctly — everything said is grounded but major details are omitted</td></tr>\n",
+ "    <tr><th>11</th><td>ground_score5</td><td>groundedness</td><td>5</td><td>6</td><td>All agent claims are fully supported by tool results, complete and accurate</td></tr>\n",
+ "    <tr><th>12</th><td>tc_false_evolving</td><td>task_completion</td><td>FALSE</td><td>4</td><td>User refines requirements mid-conversation; agent answers old version, ignoring the refinement</td></tr>\n",
+ "    <tr><th>13</th><td>tc_false_incomplete</td><td>task_completion</td><td>FALSE</td><td>6</td><td>Agent provides text summary but fails to generate the charts the user explicitly requested</td></tr>\n",
+ "    <tr><th>14</th><td>tc_true</td><td>task_completion</td><td>TRUE</td><td>11</td><td>Agent fully completes a multi-turn task with refinements — flight booked and hotel found</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " test_id evaluator expected_score \\\n", + "0 coherence_score1 coherence [1, 2] \n", + "1 coherence_score3 coherence [3, 4] \n", + "2 coherence_score5 coherence [4, 5] \n", + "3 csat_score1 customer_satisfaction [1, 2] \n", + "4 csat_score3 customer_satisfaction [2, 3] \n", + "5 csat_score5 customer_satisfaction 5 \n", + "6 csat_task_done_unsatisfied customer_satisfaction [1, 2] \n", + "7 ground_score1 groundedness [1, 2] \n", + "8 ground_score2 groundedness [1, 2] \n", + "9 ground_score3 groundedness 3 \n", + "10 ground_score4 groundedness 4 \n", + "11 ground_score5 groundedness 5 \n", + "12 tc_false_evolving task_completion FALSE \n", + "13 tc_false_incomplete task_completion FALSE \n", + "14 tc_true task_completion TRUE \n", + "\n", + " num_messages \\\n", + "0 6 \n", + "1 6 \n", + "2 6 \n", + "3 6 \n", + "4 6 \n", + "5 8 \n", + "6 10 \n", + "7 6 \n", + "8 4 \n", + "9 4 \n", + "10 4 \n", + "11 6 \n", + "12 4 \n", + "13 6 \n", + "14 11 \n", + "\n", + " description \n", + "0 Agent completely ignores context, gives contradictory and unrelated responses \n", + "1 Agent mostly tracks context but has a noticeable irrelevant tangent and a weak transition \n", + "2 Perfect logical flow: agent tracks migration plan across turns, builds on each phase \n", + "3 Agent dismisses user, provides no real help, user explicitly frustrated \n", + "4 Agent provides basic help but misses proactive solutions; customer left without full resolution \n", + "5 All requests handled perfectly, professional tone, complete resolution with proactive details \n", + "6 Task technically completed but experience was poor — agent was curt, dismissive, and user expressed dissatisfaction \n", + "7 Agent completely ignores tool results and fabricates entirely different information \n", + "8 Agent fabricates specific facts not in tool result — invents a 60-day policy, system lookup capability, and free shipping that contradict the actual policy \n", + "9 Agent only asks 
clarifying questions and greets — zero factual propositions about the world \n",
+ "10 Agent states only 2 of 8 features correctly — everything said is grounded but major details are omitted \n",
+ "11 All agent claims are fully supported by tool results, complete and accurate \n",
+ "12 User refines requirements mid-conversation; agent answers old version, ignoring the refinement \n",
+ "13 Agent provides text summary but fails to generate the charts the user explicitly requested \n",
+ "14 Agent fully completes a multi-turn task with refinements — flight booked and hotel found "
+ ]
+ },
+ "execution_count": 68,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "DATA_DIR = Path(\"data\")\n",
+ "\n",
+ "traces = []\n",
+ "for path in sorted(DATA_DIR.glob(\"*.json\")):\n",
+ "    with path.open(encoding=\"utf-8\") as f:\n",
+ "        traces.append(json.load(f))\n",
+ "\n",
+ "summary_df = pd.DataFrame(\n",
+ "    [\n",
+ "        {\n",
+ "            \"test_id\": t[\"test_id\"],\n",
+ "            \"evaluator\": t[\"evaluator\"],\n",
+ "            \"expected_score\": t[\"expected_score\"],\n",
+ "            \"num_messages\": len(t[\"messages\"]),\n",
+ "            \"description\": t[\"description\"],\n",
+ "        }\n",
+ "        for t in traces\n",
+ "    ]\n",
+ ")\n",
+ "print(f\"Loaded {len(traces)} traces from {DATA_DIR.resolve()}\")\n",
+ "# Widen the description column for readability in the notebook; adjust as needed for your display.\n",
+ "pd.set_option(\"display.max_colwidth\", 200)\n",
+ "summary_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be6f6968",
+ "metadata": {},
+ "source": [
+ "### Inspect one sample conversation\n",
+ "\n",
+ "Each trace's `messages` list is plain OpenAI chat format\n",
+ "(`role`, `content`, optional `tool_calls`, `tool_call_id`). Some traces\n",
+ "include tool calls and tool results (for `task_completion`,\n",
+ "`groundedness`, and `customer_satisfaction`)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "de9de11f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test id: tc_true\n", + "Evaluator: task_completion\n", + "Expected score: TRUE\n", + "Description: Agent fully completes a multi-turn task with refinements — flight booked and hotel found\n", + "\n", + "[user] I need to book a flight from San Francisco to Tokyo for December 15th.\n", + "[assistant] I found several options for SFO to NRT on December 15th:\n", + "\n", + "1. **ANA Flight NH7** — Direct, departs 11:30 AM, arrives 3:30 PM+1, $890\n", + "2. **United UA837** — Direct, departs 1:15 PM, arrives 5:10 PM+1, $9...\n", + "[user] Book the ANA flight. Also, can you book a hotel near Shinjuku station for 5 nights? Something under $150/night.\n", + "[assistant -> tool_call] book_flight({\"flight\": \"NH7\", \"date\": \"2024-12-15\", \"passengers\": 1})\n", + "[tool] Booking confirmed. Confirmation: ANA-NH7-DEC15-XK42M. E-ticket sent to your email.\n", + "[assistant -> tool_call] search_hotels({\"location\": \"Shinjuku Station, Tokyo\", \"check_in\": \"2024-12-16\", \"nights\": 5, \"max_price\": 150})\n", + "[tool] Found: 1) Hotel Gracery Shinjuku - $130/night, 0.2mi from station. 2) Citadines Shinjuku - $110/night, 0.3mi from station.\n", + "[assistant] I found two options under $150/night. I'll book the Hotel Gracery Shinjuku since it's closest to the station.\n", + "[assistant -> tool_call] book_hotel({\"hotel\": \"Hotel Gracery Shinjuku\", \"check_in\": \"2024-12-16\", \"nights\": 5})\n", + "[tool] Hotel booked. Confirmation: HGS-DEC16-5N-RK91. Check-in after 3 PM.\n", + "[assistant] All booked! Here's your trip summary:\n", + "\n", + "**Flight:** ANA NH7, SFO→NRT, Dec 15, 11:30 AM. 
Confirmation: ANA-NH7-DEC15-XK42M.\n",
+ "\n",
+ "**Hotel:** Hotel Gracery Shinjuku, 5 nights (Dec 16–21), $130/night ($650 tot...\n"
+ ]
+ }
+ ],
+ "source": [
+ "sample = next(t for t in traces if t[\"test_id\"] == \"tc_true\")\n",
+ "print(f\"Test id: {sample['test_id']}\")\n",
+ "print(f\"Evaluator: {sample['evaluator']}\")\n",
+ "print(f\"Expected score: {sample['expected_score']}\")\n",
+ "print(f\"Description: {sample['description']}\")\n",
+ "print()\n",
+ "for m in sample[\"messages\"]:\n",
+ "    role = m.get(\"role\")\n",
+ "    if m.get(\"tool_calls\"):\n",
+ "        for tc in m[\"tool_calls\"]:\n",
+ "            print(f\"[{role} -> tool_call] {tc['function']['name']}({tc['function']['arguments']})\")\n",
+ "    elif m.get(\"content\"):\n",
+ "        content = m[\"content\"]\n",
+ "        if isinstance(content, str) and len(content) > 200:\n",
+ "            content = content[:200] + \"...\"\n",
+ "        print(f\"[{role}] {content}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ff1c7701",
+ "metadata": {},
+ "source": [
+ "## 4. 
Convert traces to the Foundry conversation schema\n", + "\n", + "Foundry's cloud evaluators expect each item to be split into three\n", + "fields:\n", + "\n", + "* **`query`** — the *prompt* turns: system message (if any) + all `user`\n", + " turns.\n", + "* **`response`** — the agent trajectory: all `assistant` turns (text and\n", + " tool calls) plus the `tool` result messages.\n", + "* **`tool_definitions`** — OpenAI-style function schemas for the tools\n", + " the agent had access to (optional but recommended for\n", + " `task_completion` and `groundedness`).\n", + "\n", + "Inside each message, the `content` is a list of typed parts:\n", + "\n", + "* `{\"type\": \"text\", \"text\": \"...\"}`\n", + "* `{\"type\": \"tool_call\", \"tool_call_id\": \"...\", \"name\": \"...\", \"arguments\": {...}}`\n", + "* `{\"type\": \"tool_result\", \"tool_result\": ...}` (inside `role: \"tool\"` messages)\n", + "\n", + "The helper below converts an OpenAI-format `messages` list into that\n", + "shape." + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "1e03be90", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "=== query ===\n", + "[\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": \"I need to book a flight from San Francisco to Tokyo for December 15th.\"\n", + " }\n", + " ]\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": \"Book the ANA flight. Also, can you book a hotel near Shinjuku station for 5 nights? Something under $150/night.\"\n", + " }\n", + " ]\n", + " }\n", + "]...\n", + "\n", + "=== response (first 2 entries) ===\n", + "[\n", + " {\n", + " \"role\": \"assistant\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"text\",\n", + " \"text\": \"I found several options for SFO to NRT on December 15th:\\n\\n1. 
**ANA Flight NH7** \\u2014 Direct, departs 11:30 AM, arrives 3:30 PM+1, $890\\n2. **United UA837** \\u2014 Direct, departs 1:15 PM, arrives 5:10 PM+1, $920\\n3. **JAL JL1** \\u2014 Direct, departs 5:00 PM, arrives 9:00 PM+1, $950\\n\\nWhich flight would you prefer, or should I filter by specific criteria?\"\n",
+ " }\n",
+ " ]\n",
+ " },\n",
+ " {\n",
+ " \"role\": \"assistant\",\n",
+ " \"content\": [\n",
+ " {\n",
+ " \"type\": \"tool_call\",\n",
+ " \"tool_call_id\": \"call_1\",\n",
+ " \"name\": \"book_flight\",\n",
+ " \"arguments\": {\n",
+ " \"flight\": \"NH7\",\n",
+ " \"date\": \"2024-12-15\",\n",
+ " \"passengers\": 1\n",
+ " }\n",
+ " }\n",
+ " ]\n",
+ " }\n",
+ "]...\n"
+ ]
+ }
+ ],
+ "source": [
+ "def _parse_args(args):\n",
+ "    if isinstance(args, str):\n",
+ "        try:\n",
+ "            return json.loads(args)\n",
+ "        except json.JSONDecodeError:\n",
+ "            return {\"_raw\": args}\n",
+ "    return args\n",
+ "\n",
+ "\n",
+ "def messages_to_foundry_qr(messages):\n",
+ "    \"\"\"Split OpenAI-format messages into Foundry (query, response) arrays.\"\"\"\n",
+ "    query = []\n",
+ "    response = []\n",
+ "    for msg in messages:\n",
+ "        role = msg.get(\"role\")\n",
+ "        if role == \"system\":\n",
+ "            query.append({\"role\": \"system\", \"content\": [{\"type\": \"text\", \"text\": msg.get(\"content\", \"\")}]})\n",
+ "        elif role == \"user\":\n",
+ "            query.append({\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": msg.get(\"content\", \"\")}]})\n",
+ "        elif role == \"assistant\":\n",
+ "            parts = []\n",
+ "            text = msg.get(\"content\")\n",
+ "            if text:\n",
+ "                parts.append({\"type\": \"text\", \"text\": text})\n",
+ "            for tc in msg.get(\"tool_calls\", []) or []:\n",
+ "                fn = tc.get(\"function\", {})\n",
+ "                parts.append({\n",
+ "                    \"type\": \"tool_call\",\n",
+ "                    \"tool_call_id\": tc.get(\"id\", \"\"),\n",
+ "                    \"name\": fn.get(\"name\", \"\"),\n",
+ "                    \"arguments\": _parse_args(fn.get(\"arguments\", \"{}\")),\n",
+ "                })\n",
+ "            if parts:\n",
+ "                response.append({\"role\": 
\"assistant\", \"content\": parts})\n",
+ "        elif role == \"tool\":\n",
+ "            content = msg.get(\"content\", \"\")\n",
+ "            try:\n",
+ "                tool_result = json.loads(content) if isinstance(content, str) else content\n",
+ "            except (json.JSONDecodeError, TypeError):\n",
+ "                tool_result = content\n",
+ "            response.append({\n",
+ "                \"role\": \"tool\",\n",
+ "                \"tool_call_id\": msg.get(\"tool_call_id\", \"\"),\n",
+ "                \"content\": [{\"type\": \"tool_result\", \"tool_result\": tool_result}],\n",
+ "            })\n",
+ "    return query, response\n",
+ "\n",
+ "\n",
+ "# Show the conversion on one sample\n",
+ "sample_q, sample_r = messages_to_foundry_qr(sample[\"messages\"])\n",
+ "print(\"=== query ===\")\n",
+ "print(json.dumps(sample_q, indent=2)[:600] + \"...\")\n",
+ "print()\n",
+ "print(\"=== response (first 2 entries) ===\")\n",
+ "print(json.dumps(sample_r[:2], indent=2)[:800] + \"...\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5a631268",
+ "metadata": {},
+ "source": [
+ "## 5. Build the inline items for the evaluation run\n",
+ "\n",
+ "For each trace we emit one item containing `query`, `response`,\n",
+ "`tool_definitions` (empty here, since none of the sample traces ship with\n",
+ "schemas), plus a few metadata fields (`test_id`, `expected_score`,\n",
+ "`evaluator`, `description`) so we can join the results back later.\n",
+ "\n",
+ "We also save the built items to a JSONL file in Foundry format (`multi_turn_traces_foundry.jsonl`). If you prefer, you can upload this file to Microsoft Foundry manually and run the evaluation in the portal UI.\n",
+ "To upload it, open your project in the Foundry portal and, in the left navigation, go to Evaluations -> Create -> Target (Dataset) -> Upload new dataset, then select the evaluators to run." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 71,
+ "id": "2abb3a16",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Built 15 inline items.\n",
+ "First item keys: ['test_id', 'evaluator', 'expected_score', 'description', 'query', 'response', 'tool_definitions']\n",
+ "Saved 15 items to multi_turn_traces_foundry.jsonl\n"
+ ]
+ }
+ ],
+ "source": [
+ "def trace_to_item(trace):\n",
+ "    q, r = messages_to_foundry_qr(trace[\"messages\"])\n",
+ "    return {\n",
+ "        \"test_id\": trace[\"test_id\"],\n",
+ "        \"evaluator\": trace[\"evaluator\"],\n",
+ "        \"expected_score\": str(trace[\"expected_score\"]),\n",
+ "        \"description\": trace[\"description\"],\n",
+ "        \"query\": q,\n",
+ "        \"response\": r,\n",
+ "        \"tool_definitions\": [],\n",
+ "    }\n",
+ "\n",
+ "\n",
+ "items = [trace_to_item(t) for t in traces]\n",
+ "print(f\"Built {len(items)} inline items.\")\n",
+ "print(\"First item keys:\", list(items[0].keys()))\n",
+ "\n",
+ "# Save items as a JSONL file in Foundry format\n",
+ "output_jsonl_path = \"multi_turn_traces_foundry.jsonl\"\n",
+ "with open(output_jsonl_path, \"w\", encoding=\"utf-8\") as f:\n",
+ "    for item in items:\n",
+ "        f.write(json.dumps(item) + \"\\n\")\n",
+ "print(f\"Saved {len(items)} items to {output_jsonl_path}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "17393972",
+ "metadata": {},
+ "source": [
+ "## 6. Create one evaluation per evaluator\n",
+ "\n",
+ "Each trace in `data/` is labeled with the **specific** evaluator it's\n",
+ "designed to exercise. Running every trace through every evaluator would\n",
+ "produce noisy, off-target results (e.g. asking `groundedness` to score\n",
+ "a `coherence`-only trace).\n",
+ "\n",
+ "So we group items by their target evaluator and make one\n",
+ "`evals.create(...)` call per evaluator, each with a single\n",
+ "`TestingCriterionAzureAIEvaluator`. 
We'll submit one run per evaluator\n", + "in the next cell.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "4748a73a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[coherence] eval=eval_55c306055c6f48e9a33df9125b5c0fa2 traces=4\n", + "[task_completion] eval=eval_288fcc54f2de4b4d9deda31c14c16289 traces=3\n", + "[customer_satisfaction] eval=eval_6747e1a835d54cc4b9f21e63e08f9bd7 traces=4\n", + "[groundedness] eval=eval_98f36cf30cb84609b8e70696b5786925 traces=5\n" + ] + } + ], + "source": [ + "EVALUATORS = [\n", + " (\"coherence\", \"builtin.coherence\"),\n", + " (\"task_completion\", \"builtin.task_completion\"),\n", + " (\"customer_satisfaction\", \"builtin.customer_satisfaction\"),\n", + " (\"groundedness\", \"builtin.groundedness\"),\n", + "]\n", + "\n", + "data_source_config = DataSourceConfigCustom(\n", + " type=\"custom\",\n", + " item_schema={\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"test_id\": {\"type\": \"string\"},\n", + " \"evaluator\": {\"type\": \"string\"},\n", + " \"expected_score\": {\"type\": \"string\"},\n", + " \"description\": {\"type\": \"string\"},\n", + " \"query\": {\"anyOf\": [{\"type\": \"string\"}, {\"type\": \"array\", \"items\": {\"type\": \"object\"}}]},\n", + " \"response\": {\"anyOf\": [{\"type\": \"string\"}, {\"type\": \"array\", \"items\": {\"type\": \"object\"}}]},\n", + " \"tool_definitions\": {\"anyOf\": [{\"type\": \"object\"}, {\"type\": \"array\", \"items\": {\"type\": \"object\"}}]},\n", + " },\n", + " \"required\": [\"query\", \"response\"],\n", + " },\n", + " include_sample_schema=True,\n", + ")\n", + "\n", + "# Group traces by their target evaluator.\n", + "items_by_evaluator: dict[str, list[dict]] = {}\n", + "for it in items:\n", + " items_by_evaluator.setdefault(it[\"evaluator\"], []).append(it)\n", + "\n", + "credential = DefaultAzureCredential()\n", + "project_client = AIProjectClient(endpoint=ENDPOINT, 
credential=credential)\n",
+ "openai_client = project_client.get_openai_client()\n",
+ "\n",
+ "# eval_objects[name] = (eval_id, builtin_name, matching_items)\n",
+ "eval_objects: dict[str, tuple[str, str, list[dict]]] = {}\n",
+ "for name, builtin in EVALUATORS:\n",
+ "    matching = items_by_evaluator.get(name, [])\n",
+ "    if not matching:\n",
+ "        print(f\"[{name}] no matching traces, skipping\")\n",
+ "        continue\n",
+ "    criterion = TestingCriterionAzureAIEvaluator(\n",
+ "        type=\"azure_ai_evaluator\",\n",
+ "        name=name,\n",
+ "        evaluator_name=builtin,\n",
+ "        initialization_parameters={\"deployment_name\": MODEL_DEPLOYMENT},\n",
+ "        data_mapping={\n",
+ "            \"query\": \"{{item.query}}\",\n",
+ "            \"response\": \"{{item.response}}\",\n",
+ "            \"tool_definitions\": \"{{item.tool_definitions}}\",\n",
+ "        },\n",
+ "    )\n",
+ "    eval_obj = openai_client.evals.create(\n",
+ "        name=f\"Multi-turn evaluator: {name}\",\n",
+ "        data_source_config=data_source_config,\n",
+ "        testing_criteria=[criterion], # type: ignore[arg-type]\n",
+ "    )\n",
+ "    eval_objects[name] = (eval_obj.id, builtin, matching)\n",
+ "    print(f\"[{name}] eval={eval_obj.id} traces={len(matching)}\")\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f3a38b77",
+ "metadata": {},
+ "source": [
+ "## 7. 
Submit one run per evaluator and poll until all finish\n", + "\n", + "We submit each run, then poll them all in parallel (sleeping between\n", + "sweeps) until every run reaches a terminal state.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "6c78cd06", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[coherence] run=evalrun_a0ab41a600d9478daa9479d4f20afceb status=in_progress (4 traces)\n", + "[task_completion] run=evalrun_a228aafa1cf14a32884e2a111dcf6360 status=in_progress (3 traces)\n", + "[customer_satisfaction] run=evalrun_07ac74e4266143d98110acf5be11019e status=in_progress (4 traces)\n", + "[groundedness] run=evalrun_9e1a5dbad80a4354a4abd93a5c611920 status=in_progress (5 traces)\n", + " [coherence] status=completed elapsed=6s\n", + " [task_completion] status=completed elapsed=6s\n", + " [customer_satisfaction] status=completed elapsed=7s\n", + " [groundedness] status=in_progress elapsed=7s\n", + " [groundedness] status=in_progress elapsed=12s\n", + " [groundedness] status=in_progress elapsed=18s\n", + " [groundedness] status=completed elapsed=24s\n", + "\n", + "[coherence] final=completed counts=ResultCounts(errored=0, failed=1, passed=3, total=4)\n", + " report: https://ai.azure.com/nextgen/r/oZIOvVm3Txmvn16AWZ6I5A,babel-benchmark-rg,,jcsantos-evaluator-api,jcsantos-6777/build/evaluations/eval_55c306055c6f48e9a33df9125b5c0fa2/run/evalrun_a0ab41a600d9478daa9479d4f20afceb\n", + "[task_completion] final=completed counts=ResultCounts(errored=0, failed=2, passed=1, total=3)\n", + " report: https://ai.azure.com/nextgen/r/oZIOvVm3Txmvn16AWZ6I5A,babel-benchmark-rg,,jcsantos-evaluator-api,jcsantos-6777/build/evaluations/eval_288fcc54f2de4b4d9deda31c14c16289/run/evalrun_a228aafa1cf14a32884e2a111dcf6360\n", + "[customer_satisfaction] final=completed counts=ResultCounts(errored=0, failed=3, passed=1, total=4)\n", + " report: 
https://ai.azure.com/nextgen/r/oZIOvVm3Txmvn16AWZ6I5A,babel-benchmark-rg,,jcsantos-evaluator-api,jcsantos-6777/build/evaluations/eval_6747e1a835d54cc4b9f21e63e08f9bd7/run/evalrun_07ac74e4266143d98110acf5be11019e\n",
+ "[groundedness] final=completed counts=ResultCounts(errored=0, failed=2, passed=3, total=5)\n",
+ " report: https://ai.azure.com/nextgen/r/oZIOvVm3Txmvn16AWZ6I5A,babel-benchmark-rg,,jcsantos-evaluator-api,jcsantos-6777/build/evaluations/eval_98f36cf30cb84609b8e70696b5786925/run/evalrun_9e1a5dbad80a4354a4abd93a5c611920\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Submit one run per evaluator (only the matching traces).\n",
+ "# eval_runs[name] = (eval_id, run_id)\n",
+ "eval_runs: dict[str, tuple[str, str]] = {}\n",
+ "for name, (eval_id, _builtin, matching) in eval_objects.items():\n",
+ "    run = openai_client.evals.runs.create(\n",
+ "        eval_id=eval_id,\n",
+ "        name=f\"multi_turn_{name}_run\",\n",
+ "        metadata={\"sample\": \"multi-turn-evaluators\",\n",
+ "                  \"evaluator\": name,\n",
+ "                  \"num_items\": str(len(matching))},\n",
+ "        data_source=CreateEvalJSONLRunDataSourceParam(\n",
+ "            type=\"jsonl\",\n",
+ "            source=SourceFileContent(\n",
+ "                type=\"file_content\",\n",
+ "                content=[SourceFileContentContent(item=it) for it in matching],\n",
+ "            ),\n",
+ "        ),\n",
+ "    )\n",
+ "    eval_runs[name] = (eval_id, run.id)\n",
+ "    print(f\"[{name}] run={run.id} status={run.status} ({len(matching)} traces)\")\n",
+ "\n",
+ "# Poll every run until all are terminal.\n",
+ "TERMINAL = (\"completed\", \"failed\", \"canceled\")\n",
+ "pending = dict(eval_runs)\n",
+ "final_runs: dict[str, object] = {}\n",
+ "start = time.time()\n",
+ "while pending:\n",
+ "    time.sleep(5)\n",
+ "    for name, (eid, rid) in list(pending.items()):\n",
+ "        run = openai_client.evals.runs.retrieve(eval_id=eid, run_id=rid)\n",
+ "        elapsed = time.time() - start\n",
+ "        print(f\" [{name}] status={run.status} elapsed={elapsed:.0f}s\")\n",
+ "        if run.status in TERMINAL:\n",
+ "            final_runs[name] = run\n",
+ "            del 
pending[name]\n", + "\n", + "print()\n", + "for name, run in final_runs.items():\n", + " print(f\"[{name}] final={run.status} counts={getattr(run, 'result_counts', None)}\")\n", + " print(f\" report: {getattr(run, 'report_url', None)}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "a3afc08d", + "metadata": {}, + "source": [ + "## 8. Retrieve and flatten per-item results\n", + "\n", + "For each per-evaluator run we list its output items and flatten them\n", + "into a single long-form DataFrame keyed by `(test_id, evaluator)`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "bd408798", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[coherence] retrieved 4 output items\n", + "[task_completion] retrieved 3 output items\n", + "[customer_satisfaction] retrieved 4 output items\n", + "[groundedness] retrieved 5 output items\n", + "\n", + "Total output items across runs: 16\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>test_id</th><th>evaluator</th><th>expected_score</th><th>score</th><th>label</th><th>passed</th><th>threshold</th><th>reason</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>coherence_score1</td><td>coherence</td><td>[1, 2]</td><td>2.0</td><td>fail</td><td>False</td><td>3</td><td>The response contains abrupt topic changes and irrelevant answers, so it is difficult to follow and does not address the user’s request coherently.</td></tr>\n",
+ "    <tr><th>1</th><td>coherence_score3</td><td>coherence</td><td>[3, 4]</td><td>3.0</td><td>pass</td><td>True</td><td>3</td><td>The answer is mostly organized and relevant, but one off-topic insertion interrupts the logical flow.</td></tr>\n",
+ "    <tr><th>2</th><td>coherence_score5</td><td>coherence</td><td>5</td><td>4.0</td><td>pass</td><td>True</td><td>3</td><td>The response is well organized, directly relevant, and maintains a clear logical flow across the migration phases.</td></tr>\n",
+ "    <tr><th>3</th><td>coherence_skip</td><td>coherence</td><td>None</td><td>3.0</td><td>pass</td><td>True</td><td>3</td><td>The response is readable and structured, but it does not logically track the different user requests, so the flow is only partially coherent.</td></tr>\n",
+ "    <tr><th>4</th><td>tc_false_evolving</td><td>task_completion</td><td>FALSE</td><td>0.0</td><td>fail</td><td>False</td><td>1</td><td>The user requested a class with methods for Celsius-Fahrenheit, Fahrenheit-Celsius, and Kelvin conversions. The agent only provided standalone functions for Celsius/Fahrenheit in both directions a...</td></tr>\n",
+ "    <tr><th>5</th><td>tc_false_incomplete</td><td>task_completion</td><td>FALSE</td><td>0.0</td><td>fail</td><td>False</td><td>1</td><td>The agent summarized the sales data, but the user explicitly requested visualizations and asked where the charts were. No bar chart or line chart was provided, so the deliverable is incomplete.</td></tr>\n",
+ "    <tr><th>6</th><td>tc_true</td><td>task_completion</td><td>TRUE</td><td>1.0</td><td>pass</td><td>True</td><td>1</td><td>The agent successfully booked the ANA flight and a hotel near Shinjuku Station for 5 nights under $150/night, with confirmations provided. The final outcome is actionable and matches the user's re...</td></tr>\n",
+ "    <tr><th>7</th><td>csat_score1</td><td>customer_satisfaction</td><td>[1, 2]</td><td>2.0</td><td>fail</td><td>False</td><td>3</td><td>The agent responded politely, but it did not meet the customer's stated need for a replacement or refund after a missing $200 package. It only repeated the delivery status and pushed the issue to ...</td></tr>\n",
+ "    <tr><th>8</th><td>csat_score3</td><td>customer_satisfaction</td><td>[2, 3]</td><td>2.0</td><td>fail</td><td>False</td><td>3</td><td>The agent gave a few generic troubleshooting steps, but it didn’t address the core issue that the account is still locked after two hours. It also gave inconsistent timing information and never pr...</td></tr>\n",
+ "    <tr><th>9</th><td>csat_score5</td><td>customer_satisfaction</td><td>5</td><td>5.0</td><td>pass</td><td>True</td><td>3</td><td>The agent fully handled the cancellation, confirmed a prorated refund, and clearly answered the data-retention question. It provided all key details efficiently, so the customer would likely feel ...</td></tr>\n",
+ "    <tr><th>10</th><td>csat_task_done_unsatisfied</td><td>customer_satisfaction</td><td>[1, 2]</td><td>2.0</td><td>fail</td><td>False</td><td>3</td><td>The agent successfully updated the shipping address, but it failed to answer the user's key follow-up about the estimated delivery date. The dismissive phrasing also makes the interaction feel unh...</td></tr>\n",
+ "    <tr><th>11</th><td>ground_score1</td><td>groundedness</td><td>[1, 2]</td><td>2.0</td><td>fail</td><td>False</td><td>3</td><td>The response gives medication and side-effect information that contradicts the tool result and adds unsupported details.</td></tr>\n",
+ "    <tr><th>12</th><td>ground_score2</td><td>groundedness</td><td>[1, 2]</td><td>2.0</td><td>fail</td><td>False</td><td>3</td><td>The response attempts to answer the question but includes multiple incorrect details that contradict the tool result.</td></tr>\n",
+ "    <tr><th>13</th><td>ground_score3</td><td>groundedness</td><td>3</td><td>3.0</td><td>pass</td><td>True</td><td>3</td><td>The response is a suitable greeting/follow-up and there is no context-based information to verify.</td></tr>\n",
+ "    <tr><th>14</th><td>ground_score4</td><td>groundedness</td><td>4</td><td>4.0</td><td>pass</td><td>True</td><td>3</td><td>The response is grounded in the tool result but omits most of the Enterprise plan details, so it is only partially complete.</td></tr>\n",
+ "    <tr><th>15</th><td>ground_score5</td><td>groundedness</td><td>5</td><td>5.0</td><td>pass</td><td>True</td><td>3</td><td>The answer is fully grounded in the tool result and directly answers both questions accurately and completely.</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " test_id evaluator expected_score score \\\n", + "0 coherence_score1 coherence [1, 2] 2.0 \n", + "1 coherence_score3 coherence [3, 4] 3.0 \n", + "2 coherence_score5 coherence 5 4.0 \n", + "3 coherence_skip coherence None 3.0 \n", + "4 tc_false_evolving task_completion FALSE 0.0 \n", + "5 tc_false_incomplete task_completion FALSE 0.0 \n", + "6 tc_true task_completion TRUE 1.0 \n", + "7 csat_score1 customer_satisfaction [1, 2] 2.0 \n", + "8 csat_score3 customer_satisfaction [2, 3] 2.0 \n", + "9 csat_score5 customer_satisfaction 5 5.0 \n", + "10 csat_task_done_unsatisfied customer_satisfaction [1, 2] 2.0 \n", + "11 ground_score1 groundedness [1, 2] 2.0 \n", + "12 ground_score2 groundedness [1, 2] 2.0 \n", + "13 ground_score3 groundedness 3 3.0 \n", + "14 ground_score4 groundedness 4 4.0 \n", + "15 ground_score5 groundedness 5 5.0 \n", + "\n", + " label passed threshold \\\n", + "0 fail False 3 \n", + "1 pass True 3 \n", + "2 pass True 3 \n", + "3 pass True 3 \n", + "4 fail False 1 \n", + "5 fail False 1 \n", + "6 pass True 1 \n", + "7 fail False 3 \n", + "8 fail False 3 \n", + "9 pass True 3 \n", + "10 fail False 3 \n", + "11 fail False 3 \n", + "12 fail False 3 \n", + "13 pass True 3 \n", + "14 pass True 3 \n", + "15 pass True 3 \n", + "\n", + " reason \n", + "0 The response contains abrupt topic changes and irrelevant answers, so it is difficult to follow and does not address the user’s request coherently. \n", + "1 The answer is mostly organized and relevant, but one off-topic insertion interrupts the logical flow. \n", + "2 The response is well organized, directly relevant, and maintains a clear logical flow across the migration phases. \n", + "3 The response is readable and structured, but it does not logically track the different user requests, so the flow is only partially coherent. \n", + "4 The user requested a class with methods for Celsius-Fahrenheit, Fahrenheit-Celsius, and Kelvin conversions. 
The agent only provided standalone functions for Celsius/Fahrenheit in both directions a... \n", + "5 The agent summarized the sales data, but the user explicitly requested visualizations and asked where the charts were. No bar chart or line chart was provided, so the deliverable is incomplete. \n", + "6 The agent successfully booked the ANA flight and a hotel near Shinjuku Station for 5 nights under $150/night, with confirmations provided. The final outcome is actionable and matches the user's re... \n", + "7 The agent responded politely, but it did not meet the customer's stated need for a replacement or refund after a missing $200 package. It only repeated the delivery status and pushed the issue to ... \n", + "8 The agent gave a few generic troubleshooting steps, but it didn’t address the core issue that the account is still locked after two hours. It also gave inconsistent timing information and never pr... \n", + "9 The agent fully handled the cancellation, confirmed a prorated refund, and clearly answered the data-retention question. It provided all key details efficiently, so the customer would likely feel ... \n", + "10 The agent successfully updated the shipping address, but it failed to answer the user's key follow-up about the estimated delivery date. The dismissive phrasing also makes the interaction feel unh... \n", + "11 The response gives medication and side-effect information that contradicts the tool result and adds unsupported details. \n", + "12 The response attempts to answer the question but includes multiple incorrect details that contradict the tool result. \n", + "13 The response is a suitable greeting/follow-up and there is no context-based information to verify. \n", + "14 The response is grounded in the tool result but omits most of the Enterprise plan details, so it is only partially complete. \n", + "15 The answer is fully grounded in the tool result and directly answers both questions accurately and completely. 
" + ] + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def _to_dict(obj):\n", + " if hasattr(obj, \"model_dump\"):\n", + " return obj.model_dump()\n", + " if isinstance(obj, dict):\n", + " return obj\n", + " return dict(obj)\n", + "\n", + "\n", + "rows = []\n", + "total_output_items = 0\n", + "for name, (eid, rid) in eval_runs.items():\n", + " run_output_items = list(\n", + " openai_client.evals.runs.output_items.list(eval_id=eid, run_id=rid)\n", + " )\n", + " total_output_items += len(run_output_items)\n", + " print(f\"[{name}] retrieved {len(run_output_items)} output items\")\n", + " for oi in run_output_items:\n", + " oi_d = _to_dict(oi)\n", + " sample = oi_d.get(\"sample\") or {}\n", + " sample_item = (sample.get(\"input\") or {}) if isinstance(sample, dict) else {}\n", + " # Different SDK versions surface the original item under\n", + " # slightly different keys; try a few.\n", + " item = (\n", + " oi_d.get(\"datasource_item\")\n", + " or oi_d.get(\"data_source_item\")\n", + " or sample_item\n", + " or {}\n", + " )\n", + " test_id = item.get(\"test_id\", oi_d.get(\"id\", \"?\"))\n", + " expected = item.get(\"expected_score\")\n", + " for res in oi_d.get(\"results\", []) or []:\n", + " rows.append({\n", + " \"test_id\": test_id,\n", + " \"evaluator\": res.get(\"name\") or res.get(\"metric\") or name,\n", + " \"expected_score\": expected,\n", + " \"score\": res.get(\"score\"),\n", + " \"label\": res.get(\"label\"),\n", + " \"passed\": res.get(\"passed\"),\n", + " \"threshold\": res.get(\"threshold\"),\n", + " \"reason\": (res.get(\"reason\") or \"\")[:200],\n", + " })\n", + "\n", + "print(f\"\\nTotal output items across runs: {total_output_items}\")\n", + "results_df = pd.DataFrame(rows)\n", + "results_df.head(20)" + ] + }, + { + "cell_type": "markdown", + "id": "9c2164af", + "metadata": {}, + "source": [ + "## 9. Stats\n", + "\n", + "### 9a. 
Score distribution per evaluator" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "a2e7543b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
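| score_norm | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
|---|---|---|---|---|---|---|
| evaluator | | | | | | |
| coherence | 0 | 0 | 1 | 2 | 1 | 0 |
| customer_satisfaction | 0 | 0 | 3 | 0 | 0 | 1 |
| groundedness | 0 | 0 | 2 | 1 | 1 | 1 |
| task_completion | 2 | 1 | 0 | 0 | 0 | 0 |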
\n", + "
" + ], + "text/plain": [ + "score_norm 0.0 1.0 2.0 3.0 4.0 5.0\n", + "evaluator \n", + "coherence 0 0 1 2 1 0\n", + "customer_satisfaction 0 0 3 0 0 1\n", + "groundedness 0 0 2 1 1 1\n", + "task_completion 2 1 0 0 0 0" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def _norm(v):\n", + " if v is None:\n", + " return None\n", + " s = str(v).strip().lower()\n", + " if s in (\"true\", \"pass\", \"passed\"):\n", + " return \"pass\"\n", + " if s in (\"false\", \"fail\", \"failed\"):\n", + " return \"fail\"\n", + " return s\n", + "\n", + "\n", + "dist = (\n", + " results_df.assign(score_norm=lambda d: d[\"score\"].fillna(d[\"label\"]).map(_norm))\n", + " .groupby([\"evaluator\", \"score_norm\"])\n", + " .size()\n", + " .unstack(fill_value=0)\n", + ")\n", + "dist" + ] + }, + { + "cell_type": "markdown", + "id": "c0af6da8", + "metadata": {}, + "source": [ + "### 9b. Agreement with `expected_score`" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "4aa674fd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
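| | n | exact_match | off_by_one | exact_rate | within_one_rate |
|---|---|---|---|---|---|
| evaluator | | | | | |
| coherence | 4 | 2 | 1 | 0.5 | 0.75 |
| customer_satisfaction | 4 | 4 | 0 | 1.0 | 1.00 |
| groundedness | 5 | 5 | 0 | 1.0 | 1.00 |
| task_completion | 3 | 3 | 0 | 1.0 | 1.00 |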
\n", + "
" + ], + "text/plain": [ + " n exact_match off_by_one exact_rate within_one_rate\n", + "evaluator \n", + "coherence 4 2 1 0.5 0.75\n", + "customer_satisfaction 4 4 0 1.0 1.00\n", + "groundedness 5 5 0 1.0 1.00\n", + "task_completion 3 3 0 1.0 1.00" + ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def parse_to_set(v, is_bool=False):\n", + " \"\"\"Normalize a scalar, list, or stringified list into a set of canonical values.\"\"\"\n", + " if v is None or pd.isna(v):\n", + " return set()\n", + " if isinstance(v, str) and v.strip().startswith(\"[\") and v.strip().endswith(\"]\"):\n", + " try:\n", + " v = json.loads(v)\n", + " except json.JSONDecodeError:\n", + " pass\n", + " items = v if isinstance(v, list) else [v]\n", + " parsed = []\n", + " for x in items:\n", + " if x is None or pd.isna(x):\n", + " continue\n", + " s = str(x).strip().lower()\n", + " if is_bool and s in (\"true\", \"pass\", \"passed\", \"1\", \"1.0\", \"false\", \"fail\", \"failed\", \"0\", \"0.0\"):\n", + " parsed.append(\"pass\" if s in (\"true\", \"pass\", \"passed\", \"1\", \"1.0\") else \"fail\")\n", + " else:\n", + " try:\n", + " parsed.append(float(s))\n", + " except ValueError:\n", + " parsed.append(s)\n", + " return set(parsed)\n", + "\n", + "\n", + "# Pick a usable actual value from available fields\n", + "actual_col = results_df[\"score\"].fillna(results_df[\"label\"]).fillna(results_df[\"passed\"])\n", + "\n", + "exact_list, off_by_one_list = [], []\n", + "for _, row in results_df.assign(actual=actual_col).iterrows():\n", + " is_tc = row[\"evaluator\"] == \"task_completion\"\n", + " a_set = parse_to_set(row[\"actual\"], is_tc)\n", + " e_set = parse_to_set(row[\"expected_score\"], is_tc)\n", + " \n", + " exact = bool(a_set & e_set)\n", + " exact_list.append(exact)\n", + " off_by_one_list.append(\n", + " not exact and any(abs(x - y) <= 1 for x in a_set for y in e_set if isinstance(x, float) and isinstance(y, float))\n", + " 
)\n", + "\n", + "agree = results_df.assign(\n", + " actual=actual_col,\n", + " exact=exact_list,\n", + " off_by_one=off_by_one_list,\n", + ")\n", + "\n", + "# Aggregate results per evaluator\n", + "agreement_summary = (\n", + " agree.groupby(\"evaluator\")\n", + " .agg(n=(\"test_id\", \"count\"),\n", + " exact_match=(\"exact\", \"sum\"),\n", + " off_by_one=(\"off_by_one\", \"sum\"))\n", + " .assign(exact_rate=lambda d: (d[\"exact_match\"] / d[\"n\"]).round(2),\n", + " within_one_rate=lambda d: ((d[\"exact_match\"] + d[\"off_by_one\"]) / d[\"n\"]).round(2))\n", + ")\n", + "agreement_summary\n" + ] + }, + { + "cell_type": "markdown", + "id": "178329ad", + "metadata": {}, + "source": [ + "### 9c. Failures (actual score didn't match expected)" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "d35be255", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2 mismatch(es) out of 16 judgments\n", + "\n", + "- coherence_score5 coherence expected=5 actual=4.0 \n", + " reason: The response is well organized, directly relevant, and maintains a clear logical flow across the migration phases.\n", + "- coherence_skip coherence expected=None actual=3.0 \n", + " reason: The response is readable and structured, but it does not logically track the different user requests, so the flow is only partially coherent.\n" + ] + } + ], + "source": [ + "fails = agree[~agree[\"exact\"]].copy()\n", + "print(f\"{len(fails)} mismatch(es) out of {len(agree)} judgments\\n\")\n", + "for _, row in fails.iterrows():\n", + " print(f\"- {row['test_id']:<32} {row['evaluator']:<22} \"\n", + " f\"expected={row['expected_score']!s:<6} actual={row['actual']!s:<6}\")\n", + " if row.get(\"reason\"):\n", + " print(f\" reason: {row['reason']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f8bc2484", + "metadata": {}, + "source": [ + "## 10. Cleanup (optional)\n", + "\n", + "Delete the evaluation object to keep your project tidy. 
Skip this\n", + "cell if you want to keep the evaluations." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "32d10f18", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Deleted [coherence] eval eval_5c78ba087ad24686af718903bef0a7d5\n", + "Deleted [task_completion] eval eval_cd5be70a57644c6b93a1999b8b79e780\n", + "Deleted [customer_satisfaction] eval eval_547b2c0aece94e5987841c0ddb0315ac\n", + "Deleted [groundedness] eval eval_61bd2012d3124bc1b1ca702f071aec47\n" + ] + } + ], + "source": [ + "for name, (eval_id, _builtin, _matching) in eval_objects.items():\n", + " openai_client.evals.delete(eval_id=eval_id)\n", + " print(f\"Deleted [{name}] eval {eval_id}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "a43140d9", + "metadata": {}, + "source": [ + "## Next steps (optional)\n", + "\n", + "* Add your own traces under `data/` (same JSON shape) and rerun.\n", + "* Add a `tool_definitions` field to traces so `task_completion` /\n", + " `groundedness` can validate that the agent invoked the right tools\n", + " with the right arguments.\n", + "* Run the same eval object multiple times to measure judge reliability:\n", + " call `openai_client.evals.runs.create(...)` repeatedly with the same\n", + " `eval_id` and aggregate the per-run results.\n", + "* Browse the full reports for each run from the **Report URL** printed\n", + " in section 7, or in your Foundry project's Evaluations tab." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/samples/python/multi-turn-evaluators/multi_turn_traces_foundry.jsonl b/samples/python/multi-turn-evaluators/multi_turn_traces_foundry.jsonl new file mode 100644 index 00000000..e040e816 --- /dev/null +++ b/samples/python/multi-turn-evaluators/multi_turn_traces_foundry.jsonl @@ -0,0 +1,15 @@ +{"test_id": "coherence_score1", "evaluator": "coherence", "expected_score": "[1, 2]", "description": "Agent completely ignores context, gives contradictory and unrelated responses", "query": [{"role": "user", "content": [{"type": "text", "text": "Help me plan a vegetarian dinner menu for 6 guests this Saturday."}]}, {"role": "user", "content": [{"type": "text", "text": "That sounds great. Can you give me a shopping list for the mushroom risotto?"}]}, {"role": "user", "content": [{"type": "text", "text": "I asked about the risotto ingredients, not programming."}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "For a vegetarian dinner party, I'd suggest starting with a caprese salad, followed by mushroom risotto as the main course, and tiramisu for dessert."}]}, {"role": "assistant", "content": [{"type": "text", "text": "The best programming language for web development is JavaScript. You should learn React and Node.js to build modern applications. Python is also useful for backend work."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Absolutely! For the grilled steak, you'll need 6 ribeye steaks, BBQ sauce, and charcoal for the grill. 
I mentioned earlier that this is a meat-based menu, so make sure to get the freshest cuts from your butcher."}]}], "tool_definitions": []} +{"test_id": "coherence_score3", "evaluator": "coherence", "expected_score": "[3, 4]", "description": "Agent mostly tracks context but has a noticeable irrelevant tangent and a weak transition", "query": [{"role": "user", "content": [{"type": "text", "text": "I'm preparing for a product manager interview next week. What topics should I focus on?"}]}, {"role": "user", "content": [{"type": "text", "text": "Can you help me create a daily practice plan for the next 5 days?"}]}, {"role": "user", "content": [{"type": "text", "text": "For Day 1, what's a good product critique exercise format?"}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "Great question! For a PM interview, focus on these core areas:\n1. Product sense \u2014 identifying user problems and proposing solutions\n2. Execution \u2014 prioritization frameworks like RICE or MoSCoW\n3. Metrics \u2014 defining success metrics and KPIs\n4. Technical depth \u2014 understanding system design at a high level"}]}, {"role": "assistant", "content": [{"type": "text", "text": "Sure! Here's a 5-day plan:\n- Day 1: Product sense \u2014 do 2 product critique exercises\n- Day 2: Execution \u2014 practice prioritization with sample feature lists\n- Day 3: You know, sourdough bread takes about 3 days to prepare from starter. Anyway, for day 3, work on metrics \u2014 define north star metrics for 3 sample products\n- Day 4: Technical \u2014 walk through a system design question\n- Day 5: Mock interview \u2014 combine all areas in a full practice session"}]}, {"role": "assistant", "content": [{"type": "text", "text": "For the product critique exercise on Day 1, use this format:\n1. Pick an app you use daily (e.g., Spotify, Uber)\n2. Identify one key user problem it solves\n3. Evaluate: What does it do well? What could improve?\n4. 
Propose one feature improvement with a clear user story\n5. Define how you'd measure success\n\nSpend 20 minutes per app. Do two apps to build pattern recognition."}]}], "tool_definitions": []} +{"test_id": "coherence_score5", "evaluator": "coherence", "expected_score": "[4, 5]", "description": "Perfect logical flow: agent tracks migration plan across turns, builds on each phase", "query": [{"role": "user", "content": [{"type": "text", "text": "I need to migrate our production database from MySQL 5.7 to PostgreSQL 15. Can you outline the key phases?"}]}, {"role": "user", "content": [{"type": "text", "text": "Good. For Phase 2, what tools should we use for the data export and transform step?"}]}, {"role": "user", "content": [{"type": "text", "text": "What about Phase 3 \u2014 which parts of our Django app would need changes?"}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "Here is a 4-phase migration plan:\n\nPhase 1: Schema conversion \u2014 map MySQL types to PostgreSQL equivalents (e.g., TINYINT to SMALLINT, AUTO_INCREMENT to SERIAL).\nPhase 2: Data migration rehearsal \u2014 export with mysqldump, transform, and load into a staging PostgreSQL instance.\nPhase 3: Application code changes \u2014 update ORM configs, replace MySQL-specific SQL syntax.\nPhase 4: Cutover \u2014 final data sync, DNS switch, rollback plan."}]}, {"role": "assistant", "content": [{"type": "text", "text": "For Phase 2, I recommend:\n\n1. **pgloader** \u2014 handles the MySQL-to-PostgreSQL ETL in one step, including type mapping and index recreation.\n2. **AWS DMS** (if you're on AWS) \u2014 for continuous replication during the rehearsal window.\n3. 
For custom transforms, write Python scripts using `psycopg2` to handle edge cases like ENUM columns or spatial data.\n\nStart with pgloader for a baseline rehearsal, then layer in custom scripts for any columns that need special handling."}]}, {"role": "assistant", "content": [{"type": "text", "text": "For Phase 3, here are the Django-specific changes you'll need:\n\n1. **settings.py** \u2014 switch `ENGINE` from `django.db.backends.mysql` to `django.db.backends.postgresql`.\n2. **Raw SQL queries** \u2014 audit for MySQL-specific syntax like backtick quoting, `LIMIT x, y` (use `LIMIT y OFFSET x` instead), and `GROUP_CONCAT` (use `STRING_AGG`).\n3. **Migrations** \u2014 regenerate with `makemigrations` against the new backend; check for any MySQL-specific field options.\n4. **Full-text search** \u2014 if you use MySQL FULLTEXT indexes, switch to PostgreSQL's `tsvector`/`tsquery` or `django.contrib.postgres.search`.\n\nI'd recommend running your test suite against the staging PostgreSQL from Phase 2 to catch any remaining incompatibilities before cutover."}]}], "tool_definitions": []} +{"test_id": "csat_score1", "evaluator": "customer_satisfaction", "expected_score": "[1, 2]", "description": "Agent dismisses user, provides no real help, user explicitly frustrated", "query": [{"role": "user", "content": [{"type": "text", "text": "My package was marked as delivered 3 days ago but I never received it. Order #88712."}]}, {"role": "user", "content": [{"type": "text", "text": "Yes, I've checked everywhere. I need a replacement or a refund. This is a $200 item."}]}, {"role": "user", "content": [{"type": "text", "text": "That's completely unacceptable. I paid $200 and received nothing. This is useless support."}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "Our records show the package was delivered on Tuesday. 
Have you checked with your neighbors or your building's mail room?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "Since the tracking shows the package as delivered, we unfortunately cannot issue a refund or replacement. You may want to file a claim with the carrier or file a police report for stolen packages."}]}, {"role": "assistant", "content": [{"type": "text", "text": "I understand your frustration. Our policy is that once a package is marked as delivered, the responsibility shifts to the carrier. I'd recommend contacting FedEx directly to dispute the delivery status. Is there anything else I can help you with today?"}]}], "tool_definitions": []} +{"test_id": "csat_score3", "evaluator": "customer_satisfaction", "expected_score": "[2, 3]", "description": "Agent provides basic help but misses proactive solutions; customer left without full resolution", "query": [{"role": "user", "content": [{"type": "text", "text": "I can't log into my account. I keep getting an error when I try to reset my password."}]}, {"role": "user", "content": [{"type": "text", "text": "It says 'Too many reset attempts. Account temporarily locked.' I've been locked out for 2 hours now."}]}, {"role": "user", "content": [{"type": "text", "text": "I just tried again and it's still showing the same locked message."}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "I'm sorry to hear that. What error message are you seeing when you try to reset your password?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "Account locks from too many reset attempts typically expire after 30 minutes. Since it's been 2 hours, it should be unlocked by now. Please try the password reset again and let me know if you still see the error."}]}, {"role": "assistant", "content": [{"type": "text", "text": "That's unusual. The lock should have expired by now. I'd suggest trying the following:\n1. Clear your browser cache and cookies\n2. 
Try using a different browser or incognito mode\n3. If that doesn't work, you may need to wait a bit longer\n\nIf none of these work, you can contact our technical support team at support@example.com for further assistance."}]}], "tool_definitions": []} +{"test_id": "csat_score5", "evaluator": "customer_satisfaction", "expected_score": "5", "description": "All requests handled perfectly, professional tone, complete resolution with proactive details", "query": [{"role": "user", "content": [{"type": "text", "text": "I need to cancel my subscription and get a refund for the unused portion of this month."}]}, {"role": "user", "content": [{"type": "text", "text": "Will I lose my saved data when the subscription ends?"}]}], "response": [{"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "get_subscription", "arguments": {"user_id": "USR-4421"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Plan: Premium ($29.99/mo), billing cycle: 1st-30th, status: active, days remaining: 18"}]}, {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "cancel_subscription", "arguments": {"user_id": "USR-4421", "refund_prorated": true}}]}, {"role": "tool", "tool_call_id": "call_2", "content": [{"type": "tool_result", "tool_result": "Subscription cancelled. Prorated refund of $17.99 (18/30 days) will be processed within 3-5 business days to the card ending in 4242."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Your Premium subscription has been cancelled. Here's a summary:\n\n- **Refund:** $17.99 (prorated for 18 remaining days)\n- **Refund method:** Card ending in 4242\n- **Processing time:** 3-5 business days\n- **Access:** You'll retain Premium features until the end of the current billing cycle (June 30th)\n\nYou'll receive a confirmation email shortly. 
Is there anything else I can help with?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "No, your data is safe. When your Premium subscription ends:\n\n- **Saved files and projects** \u2014 retained for 90 days on the free tier\n- **Settings and preferences** \u2014 kept indefinitely\n- **Export option** \u2014 you can download all your data anytime from Settings > Data Export\n\nIf you decide to resubscribe within 90 days, everything will be restored to your Premium account automatically."}]}], "tool_definitions": []} +{"test_id": "csat_task_done_unsatisfied", "evaluator": "customer_satisfaction", "expected_score": "[1, 2]", "description": "Task technically completed but experience was poor \u2014 agent was curt, dismissive, and user expressed dissatisfaction", "query": [{"role": "user", "content": [{"type": "text", "text": "I need to update my shipping address for order #55123."}]}, {"role": "user", "content": [{"type": "text", "text": "742 Evergreen Terrace, Springfield, IL 62704"}]}, {"role": "user", "content": [{"type": "text", "text": "Can you confirm when the order will arrive at the new address?"}]}, {"role": "user", "content": [{"type": "text", "text": "That's not very helpful. Could you at least tell me the estimated delivery date?"}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "What's the new address?"}]}, {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "update_shipping_address", "arguments": {"order_id": "55123", "address": "742 Evergreen Terrace, Springfield, IL 62704"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Address updated successfully. 
Order #55123 will now ship to: 742 Evergreen Terrace, Springfield, IL 62704."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Done."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Check your email."}]}, {"role": "assistant", "content": [{"type": "text", "text": "It ships when it ships. Anything else?"}]}], "tool_definitions": []} +{"test_id": "ground_score1", "evaluator": "groundedness", "expected_score": "[1, 2]", "description": "Agent completely ignores tool results and fabricates entirely different information", "query": [{"role": "user", "content": [{"type": "text", "text": "What medications am I currently prescribed?"}]}, {"role": "user", "content": [{"type": "text", "text": "Any side effects I should watch for?"}]}], "response": [{"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "get_patient_medications", "arguments": {"patient_id": "P-12345"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Current medications for patient P-12345: Lisinopril 10mg daily (for blood pressure), Metformin 500mg twice daily (for type 2 diabetes). Last updated: 2024-03-15."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Based on your records, you are currently taking Aspirin 81mg daily and Atorvastatin 20mg daily for cholesterol management. Make sure to take them with food for best absorption. You should also avoid grapefruit juice while on these medications."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Common side effects of your Aspirin and Atorvastatin include dizziness, muscle pain, and stomach upset. If you experience any unusual bleeding or dark urine, contact your doctor immediately. 
Also, your records show you had an allergy to penicillin noted last year."}]}], "tool_definitions": []} +{"test_id": "ground_score2", "evaluator": "groundedness", "expected_score": "[1, 2]", "description": "Agent fabricates specific facts not in tool result \u2014 invents a 60-day policy, system lookup capability, and free shipping that contradict the actual policy", "query": [{"role": "user", "content": [{"type": "text", "text": "What's the return policy for electronics?"}]}], "response": [{"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "get_return_policy", "arguments": {"category": "electronics"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Electronics return policy: 30-day return window. Receipt required. Item must be in original packaging with all accessories. Restocking fee of 15% applies to opened items."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Our electronics return policy gives you a generous 60-day return window. No receipt needed \u2014 we can look up any purchase using your phone number, and we keep records going back 5 years. Returns are completely free with no restocking fees, and we'll even cover return shipping costs. Just bring the item to any store location."}]}], "tool_definitions": []} +{"test_id": "ground_score3", "evaluator": "groundedness", "expected_score": "3", "description": "Agent only asks clarifying questions and greets \u2014 zero factual propositions about the world", "query": [{"role": "user", "content": [{"type": "text", "text": "Hi there."}]}, {"role": "user", "content": [{"type": "text", "text": "I have a question."}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "Hello! How can I help you today?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "Of course! 
What would you like to know?"}]}], "tool_definitions": []} +{"test_id": "ground_score4", "evaluator": "groundedness", "expected_score": "4", "description": "Agent states only 2 of 8 features correctly \u2014 everything said is grounded but major details are omitted", "query": [{"role": "user", "content": [{"type": "text", "text": "Tell me everything about the Enterprise plan."}]}], "response": [{"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "get_plan_details", "arguments": {"plan": "enterprise"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Enterprise plan ($499/mo): unlimited storage, priority 24/7 support, custom domains, advanced analytics dashboard, team collaboration (up to 200 members), SSO/SAML integration, dedicated account manager, 99.99% SLA guarantee."}]}, {"role": "assistant", "content": [{"type": "text", "text": "The Enterprise plan includes unlimited storage and priority 24/7 support."}]}], "tool_definitions": []} +{"test_id": "ground_score5", "evaluator": "groundedness", "expected_score": "5", "description": "All agent claims are fully supported by tool results, complete and accurate", "query": [{"role": "user", "content": [{"type": "text", "text": "What are the office hours for the downtown branch?"}]}, {"role": "user", "content": [{"type": "text", "text": "Can I visit on Saturday at 1 PM?"}]}], "response": [{"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "get_branch_info", "arguments": {"branch": "downtown"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Downtown branch: Mon-Fri 9AM-5PM, Sat 10AM-2PM, closed Sunday. Address: 100 Main St, Suite 200. 
Phone: (555) 123-4567."}]}, {"role": "assistant", "content": [{"type": "text", "text": "The downtown branch hours are:\n- Monday to Friday: 9 AM \u2013 5 PM\n- Saturday: 10 AM \u2013 2 PM\n- Closed on Sundays\n\nThe branch is located at 100 Main St, Suite 200. You can also reach them at (555) 123-4567."}]}, {"role": "assistant", "content": [{"type": "text", "text": "Yes, the downtown branch is open on Saturdays until 2 PM, so a 1 PM visit would be within their hours."}]}], "tool_definitions": []} +{"test_id": "tc_false_evolving", "evaluator": "task_completion", "expected_score": "FALSE", "description": "User refines requirements mid-conversation; agent answers old version, ignoring the refinement", "query": [{"role": "user", "content": [{"type": "text", "text": "Write me a Python function that converts temperatures from Celsius to Fahrenheit."}]}, {"role": "user", "content": [{"type": "text", "text": "Actually, I need it to handle both directions \u2014 Celsius to Fahrenheit AND Fahrenheit to Celsius \u2014 and also include Kelvin. Make it a class with methods for all conversions."}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "Here's a simple conversion function:\n\n```python\ndef celsius_to_fahrenheit(celsius):\n return (celsius * 9/5) + 32\n```\n\nExample usage:\n```python\nprint(celsius_to_fahrenheit(100)) # Output: 212.0\nprint(celsius_to_fahrenheit(0)) # Output: 32.0\n```"}]}, {"role": "assistant", "content": [{"type": "text", "text": "Sure! Here's the updated function:\n\n```python\ndef celsius_to_fahrenheit(celsius):\n \"\"\"Convert Celsius to Fahrenheit.\"\"\"\n return (celsius * 9/5) + 32\n\ndef fahrenheit_to_celsius(fahrenheit):\n \"\"\"Convert Fahrenheit to Celsius.\"\"\"\n return (fahrenheit - 32) * 5/9\n```\n\nNow you have both directions covered! 
Let me know if you need anything else."}]}], "tool_definitions": []} +{"test_id": "tc_false_incomplete", "evaluator": "task_completion", "expected_score": "FALSE", "description": "Agent provides text summary but fails to generate the charts the user explicitly requested", "query": [{"role": "user", "content": [{"type": "text", "text": "Analyze our Q3 sales data and create a report with visualizations showing revenue by region and monthly trends."}]}, {"role": "user", "content": [{"type": "text", "text": "Where are the charts? I specifically asked for visualizations \u2014 a bar chart for regional revenue and a line chart for monthly trends."}]}], "response": [{"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "load_data", "arguments": {"file": "q3_sales.csv"}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Data loaded: 1,500 rows. Columns: date, product, region, revenue, units_sold. Total revenue: $2.3M."}]}, {"role": "assistant", "content": [{"type": "text", "text": "I've analyzed the Q3 sales data. Here's a summary:\n\n**Revenue by Region:**\n- North America: $1,035,000 (45%)\n- Europe: $690,000 (30%)\n- Asia Pacific: $575,000 (25%)\n\n**Monthly Trends:**\n- July: $700,000\n- August: $800,000\n- September: $800,000\n\nOverall, we see steady growth with North America leading. The data suggests strong performance in August and September."}]}, {"role": "assistant", "content": [{"type": "text", "text": "You're right, I apologize. Here are the key numbers again in a clearer format:\n\n| Region | Revenue | Share |\n|--------|---------|-------|\n| North America | $1.04M | 45% |\n| Europe | $690K | 30% |\n| Asia Pacific | $575K | 25% |\n\nThe monthly trend shows a steady upward trajectory from $700K in July to $800K in August and September. 
I hope this table format helps visualize the data better."}]}], "tool_definitions": []} +{"test_id": "tc_true", "evaluator": "task_completion", "expected_score": "TRUE", "description": "Agent fully completes a multi-turn task with refinements \u2014 flight booked and hotel found", "query": [{"role": "user", "content": [{"type": "text", "text": "I need to book a flight from San Francisco to Tokyo for December 15th."}]}, {"role": "user", "content": [{"type": "text", "text": "Book the ANA flight. Also, can you book a hotel near Shinjuku station for 5 nights? Something under $150/night."}]}], "response": [{"role": "assistant", "content": [{"type": "text", "text": "I found several options for SFO to NRT on December 15th:\n\n1. **ANA Flight NH7** \u2014 Direct, departs 11:30 AM, arrives 3:30 PM+1, $890\n2. **United UA837** \u2014 Direct, departs 1:15 PM, arrives 5:10 PM+1, $920\n3. **JAL JL1** \u2014 Direct, departs 5:00 PM, arrives 9:00 PM+1, $950\n\nWhich flight would you prefer, or should I filter by specific criteria?"}]}, {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_1", "name": "book_flight", "arguments": {"flight": "NH7", "date": "2024-12-15", "passengers": 1}}]}, {"role": "tool", "tool_call_id": "call_1", "content": [{"type": "tool_result", "tool_result": "Booking confirmed. Confirmation: ANA-NH7-DEC15-XK42M. E-ticket sent to your email."}]}, {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_2", "name": "search_hotels", "arguments": {"location": "Shinjuku Station, Tokyo", "check_in": "2024-12-16", "nights": 5, "max_price": 150}}]}, {"role": "tool", "tool_call_id": "call_2", "content": [{"type": "tool_result", "tool_result": "Found: 1) Hotel Gracery Shinjuku - $130/night, 0.2mi from station. 2) Citadines Shinjuku - $110/night, 0.3mi from station."}]}, {"role": "assistant", "content": [{"type": "text", "text": "I found two options under $150/night. 
I'll book the Hotel Gracery Shinjuku since it's closest to the station."}]}, {"role": "assistant", "content": [{"type": "tool_call", "tool_call_id": "call_3", "name": "book_hotel", "arguments": {"hotel": "Hotel Gracery Shinjuku", "check_in": "2024-12-16", "nights": 5}}]}, {"role": "tool", "tool_call_id": "call_3", "content": [{"type": "tool_result", "tool_result": "Hotel booked. Confirmation: HGS-DEC16-5N-RK91. Check-in after 3 PM."}]}, {"role": "assistant", "content": [{"type": "text", "text": "All booked! Here's your trip summary:\n\n**Flight:** ANA NH7, SFO\u2192NRT, Dec 15, 11:30 AM. Confirmation: ANA-NH7-DEC15-XK42M.\n\n**Hotel:** Hotel Gracery Shinjuku, 5 nights (Dec 16\u201321), $130/night ($650 total), 0.2 mi from Shinjuku Station. Confirmation: HGS-DEC16-5N-RK91. Check-in after 3 PM.\n\n**Total trip cost:** $890 (flight) + $650 (hotel) = $1,540. Confirmation emails have been sent for both bookings."}]}], "tool_definitions": []}
diff --git a/samples/python/multi-turn-evaluators/requirements.txt b/samples/python/multi-turn-evaluators/requirements.txt
new file mode 100644
index 00000000..e33654c0
--- /dev/null
+++ b/samples/python/multi-turn-evaluators/requirements.txt
@@ -0,0 +1,5 @@
+azure-ai-projects>=2.0.0
+azure-identity>=1.17.0
+openai>=1.99.0
+python-dotenv>=1.0.0
+pandas>=2.0.0