feat!: version 13 - dataset evaluators #9642

mikeldking · 2025-09-25T20:54:32Z

This is the feature branch for the upcoming version 13.

pkg-pr-new · 2025-09-25T20:57:43Z

npm i https://pkg.pr.new/Arize-ai/phoenix/@arizeai/phoenix-client@9642

npm i https://pkg.pr.new/Arize-ai/phoenix/@arizeai/phoenix-mcp@9642

commit: 7a0c5f7

RogerHYang

blocking feature branch

…0075)

* add minimal evaluators menu
* handle selection
* styling
* add menu footer
* set menu max width
* replace footer button with link

* feat: Add name field to evaluators form
* Reorganize choices and their default state
* Disable prompt save, tools, response format
* Move input mapping field from select to combobox
* Update arrow icon
* Persist input mapping fields across labels
* Do not render response format or tools if they are saved as provider default

* Add dummy evaluation payloads to single playground run
* Implement for chat mutations and subscription over dataset
* Ruff 🐶 and update graphql schema
* compile relay
* frontend
* Add dataset example id and repetition number
* Address feedback
* Update input typing
* Update relay
* Load and display real global evaluators

---------

Co-authored-by: Alexander Song <[email protected]>
Co-authored-by: Tony Powell <[email protected]>

* Add filter and sort capabilities to evaluators
* Improve clarity of allowed sort columns

* add EvaluatorSelect to dataset page
* stub out evaluator config dialog and rework data fetching
* add readonly prompt messages to eval config modal
* add output config to modal
* add dataset example preview and input mapping section to modal
* wire up add evaluator mutation
* add suspense boundaries
* Refactor promptVersionToInstance to depend on inline fragment
* remove unnecessary type annotations:

---------

Co-authored-by: Tony Powell <[email protected]>

…0152)

* output config resolver
* clean

* evaluator crud
* clean
* patch mutation
* update
* types
* Revert "types" This reverts commit 25579b5.
* type ignore
* plural delete
* clean
* decorator
* fix metadata
* clean
* clean
* already exists
* test
* simplify
* test
* simplify

* add annotation name to eval select
* address feedback

…s useful (#10187)

* feat(evaluators): provide a useful correctness pre-built evaluator
* feat(evaluators): provide a useful correctness pre-built evaluator
* simplify

* evaluator prompt validation
* cursor tests
* clean
* condense
* test
* clean
* clean
* test
* parse pydantic errors
* clean
* validate mutations
* fix tests
* validate choices
* test with form
* test
* type check
* clean

…10253)

* include only dataset-specific evaluators in playground eval selector
* fix dataset page tab selection
* add aria label to dialog
* add annotation names to playground select
* handle long annotation names
* separate components for DatasetEvaluatorSelect and PlaygroundEvaluatorSelect
* remove extra opacity css var
* updates to Menu
* updates to evaluator menus
* fix menu item flicker
* wip: enable mapping evaluator from playground
* formatting

…10292)

* add eval outputs to playground output cell
* add evaluation details popover & trace link
* include evals in output for non-streaming playground runs
* fix unnecessary truncation of eval name
* handle evaluations on error
* fix evaluation name
* rerun CI
* prevent losing example data when handling tool chunk

---------

Co-authored-by: Alexander Song <[email protected]>

* feat: Create distinct slideovers for evaluator use cases
* fix: manually update updated_at when creating llm_evaluator
* fix: global change to combobox, opens submenu on enter

---------

Co-authored-by: Rick Steele <[email protected]>

* Spike out builtin evaluator interfaces
* Get builtin evaluator if it exists
* Refine data model
* Simplify models
* Implement literal/path mapping logic
* Wire up builtin evaluators
* Persist single-run evaluations as SpanAnnotations
* Update gql schema and run relay compiler
* Fix evaluation over playground dataset run
* ruff
* Fix queries w.r.t BuiltInEvaluator
* Add built in evaluators to dataset evaluators query
* Add xfail to dataset evaluator test
* Ignore missing type stubs
* fix evaluators over single chat
* fix ts ci

---------

Co-authored-by: Tony Powell <[email protected]>
Co-authored-by: Alexander Song <[email protected]>

* wip: enable unassigning a dataset evaluator
* update cached evaluator data upon assignment/unassignment
* add confirmation dialog
* wire up evaluator unlink with optional delete
* remove row selectability
* add comment
* use alert banner instead of toast for errors
* explicitly close dialog on successful delete/unlink

* fix evaluator config dialog header overflow
* fix dataset select overflow
* styling
* dataset select styling

* feat: Add builtin evaluator support to crosswalk table
* Fix migration and update gql schema
* Fix relationship definition
* feat: Add prebuilt evaluators to template submenu
* Tweak language
* feat: Support input mapping code evaluators
* Improve dataset messaging in evaluator form
* update default evaluator template
* Add DatasetExampleSelect component; also makes combobox and dataset select more responsive
* Allow users to edit evaluator input preview
* Fix db constraints for input mapping
* Wire up input mapping end to end
* Fix ruff
* use fastapi instead of starlette import
* Remove xfail and clean up input_config handling
* Ruff
* Verify evaluator id existence for type checker
* Build gql schema and run relay compiler
* Pull output from input-mapped inputs
* Ensure input config is stored as JSON
* Add minWidth prop to Select
* Fix evaluator config dialog header truncation
* Use both unique constraint and partial index
* Add builtin evaluators to dataloader
* Call lower() after str conversion
* Rename evaluator for simplicity
* Remove explicit constraint name
* Update variable name
* Address PR feedback
* Change column name from input_config to input_mapping
* Update tests and other input_config references
* Make mypy happy

---------

Co-authored-by: Dustin Ngo <[email protected]>

mikeldking requested review from a team as code owners September 25, 2025 20:54

github-project-automation bot added this to phoenix Sep 25, 2025

github-project-automation bot moved this to 📘 Todo in phoenix Sep 25, 2025

dosubot bot added the size:XXL label (This PR changes 1000+ lines, ignoring generated files) Sep 25, 2025

mikeldking changed the base branch from main to feat/version-12 September 25, 2025 20:56

dosubot bot added the size:XS label (This PR changes 0-9 lines, ignoring generated files) and removed the size:XXL label (This PR changes 1000+ lines, ignoring generated files) Sep 25, 2025

mikeldking changed the title ~~version 13~~ feat!: version 13 - dataset evaluators Sep 25, 2025

mikeldking added the feature branch label (a feature branch that consolidates multiple features into a single commit on main) Sep 25, 2025

mikeldking marked this pull request as draft September 25, 2025 21:22

RogerHYang requested changes Sep 25, 2025

github-project-automation bot moved this from 📘 Todo to 🔍. Needs Review in phoenix Sep 25, 2025

axiomofjoy force-pushed the feat/version-12 branch from 001f109 to b65ea42 Compare September 29, 2025 17:18

Base automatically changed from feat/version-12 to main September 29, 2025 18:13

An error occurred while trying to automatically change base from feat/version-12 to main September 29, 2025 18:13

mikeldking force-pushed the version-13 branch from 3ed12e6 to 4ad9dc8 Compare October 2, 2025 08:01

RogerHYang force-pushed the version-13 branch from 4ad9dc8 to 4da307b Compare October 6, 2025 17:24

RogerHYang removed this from phoenix Oct 6, 2025

RogerHYang force-pushed the version-13 branch from 5afdccb to e0e6709 Compare October 8, 2025 06:19

RogerHYang force-pushed the version-13 branch 4 times, most recently from fc49ed1 to 2b19c56 Compare October 24, 2025 15:47

RogerHYang force-pushed the version-13 branch 3 times, most recently from f4ab1f0 to 2ef94fd Compare October 29, 2025 15:31

mikeldking closed this Nov 4, 2025

mikeldking reopened this Nov 4, 2025

cephalization and others added 25 commits November 20, 2025 08:41

feat: Collect all json path segments when flattening example keys (#1…

2013b79

…0075)

feat(evaluators): add evaluator select (#10063)

a8ae458

* add minimal evaluators menu
* handle selection
* styling
* add menu footer
* set menu max width
* replace footer button with link

feat: Add examples route with examples table (#10123)

7147b15

* Add filter and sort capabilities to evaluators
* Improve clarity of allowed sort columns

feat: Add optional description field to new evaluator creation (#10132)

44df74a

feat: Improve rendering of dataset evals on playground (#10136)

97c3cbe

feat: add metadata to evaluator db table (#10139)

9c22f28

fix(evaluators): return annotation name in output config resolver (#1…

88edc13

…0152)

* output config resolver
* clean

feat(evaluators): add annotation name to eval menu (#10156)

573004f

* add annotation name to eval select
* address feedback

feat: Add evaluators table to dataset evaluators page (#10157)

cc9addd

fix: Fix import error on evaluator page (#10185)

9d63e13

feat(evaluators): load in a default template for the evaluator that i…

8f14f74

…s useful (#10187)

* feat(evaluators): provide a useful correctness pre-built evaluator
* feat(evaluators): provide a useful correctness pre-built evaluator
* simplify

fix: eslint errors

4fb7b3e

ci: add ci for 12 (#10196)

7a1eff4

feat: persist tools with eval (#10220)

f19783b

feat: Refactor evaluator form for usage in create and edit workflows (#…

8f7ce68

…10253)

only include dataset-specific evaluators in playground eval select (#…

7f74cce

…10292)

fix(evaluators): clean up evaluators rebase

c011874

axiomofjoy force-pushed the version-13 branch from 289b75c to c011874 Compare November 20, 2025 17:44

anticorrelator and others added 4 commits November 20, 2025 15:16

fix: fix evaluator config dialog layout (#10366)

905764c

* fix evaluator config dialog header overflow
* fix dataset select overflow
* styling
* dataset select styling


7 participants
