Commit 0c409ad

feat: add evaluation helpers to easily pull rag spans (#10341)
Co-authored-by: Dustin Ngo <[email protected]>
1 parent a7e03e3 · commit 0c409ad

File tree

8 files changed: +1095 -1233 lines changed


js/examples/notebooks/tracing_openai_sessions_tutorial.ipynb

Lines changed: 0 additions & 5 deletions
@@ -133,11 +133,6 @@
   }
  ],
 "metadata": {
- "kernelspec": {
-  "display_name": "Deno",
-  "language": "typescript",
-  "name": "deno"
- },
 "language_info": {
  "name": "typescript"
 }

packages/phoenix-client/README.md

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ Phoenix Client provides an interface for interacting with the Phoenix platform vi
 - **Experiments** - Run evaluations and track experiment results
 - **Spans** - Query and analyze traces with powerful filtering
 - **Annotations** - Add human feedback and automated evaluations
+- **Evaluation Helpers** - Extract span data in formats optimized for RAG evaluation workflows
 
 ## Installation
 
packages/phoenix-client/docs/source/index.md

Lines changed: 96 additions & 0 deletions
@@ -229,6 +229,102 @@ df = pd.DataFrame({
 client.spans.log_span_annotations_dataframe(dataframe=df)
 ```
 
+### Evaluation Helpers
+
+The Phoenix Client provides helper functions to extract span data in formats optimized for RAG evaluation workflows. These helpers streamline the process of preparing data for evaluation with `phoenix.evals`.
+
+#### RAG Retrieval Evaluation
+
+Extract retrieved documents from retriever spans for relevance evaluation:
+
+```python
+from phoenix.client import Client
+from phoenix.client.helpers.spans import get_retrieved_documents
+
+client = Client()
+
+# Extract retrieved documents for evaluation
+retrieved_docs_df = get_retrieved_documents(
+    client,
+    project_name="my-rag-app"
+)
+
+# Each row is a retrieved document with its metadata
+print(retrieved_docs_df.head())
+# Index: context.span_id, document_position
+# Columns: context.trace_id, input, document, document_score, document_metadata
+
+# Use with phoenix.evals for relevance evaluation
+from phoenix.evals import LLM, async_evaluate_dataframe
+from phoenix.evals.metrics import DocumentRelevanceEvaluator
+
+llm = LLM(model="gpt-4o", provider="openai")
+relevance_evaluator = DocumentRelevanceEvaluator(llm=llm)
+
+relevance_results = await async_evaluate_dataframe(
+    dataframe=retrieved_docs_df,
+    evaluators=[relevance_evaluator],
+    concurrency=10,
+    exit_on_error=True,
+)
+relevance_results.head()
+```
+
+#### RAG Q&A Evaluation
+
+Extract Q&A pairs with reference context for hallucination evaluation:
+
+```python
+from phoenix.client.helpers.spans import get_input_output_context
+from phoenix.evals.metrics import HallucinationEvaluator
+
+# Extract Q&A with context documents
+qa_df = get_input_output_context(
+    client,
+    project_name="my-rag-app"
+)
+
+# Each row combines a Q&A pair with concatenated retrieval documents
+# Index: context.span_id
+# Columns: context.trace_id, input, output, context, metadata
+if qa_df is not None:
+    print(qa_df.head())
+
+# Run hallucination evaluations
+hallucination_evaluator = HallucinationEvaluator(llm=llm)
+
+hallucination_results = await async_evaluate_dataframe(
+    dataframe=qa_df,
+    evaluators=[hallucination_evaluator],
+    concurrency=10,
+    exit_on_error=True,
+)
+hallucination_results.head()
+```
+
+#### Time-Filtered RAG Spans
+
+Filter spans by time range for evaluation:
+
+```python
+from datetime import datetime, timedelta
+
+# Get documents from last 24 hours
+recent_docs = get_retrieved_documents(
+    client,
+    project_name="my-rag-app",
+    start_time=datetime.now() - timedelta(hours=24),
+    end_time=datetime.now()
+)
+
+# Get Q&A from last week
+weekly_qa = get_input_output_context(
+    client,
+    project_name="my-rag-app",
+    start_time=datetime.now() - timedelta(days=7)
+)
+```
+
 ### Datasets
 
 Manage evaluation datasets and examples for experiments and testing:
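
Editor's note: the new docs compute evaluation results but stop before writing them back to Phoenix. The sketch below is a hedged follow-up; the `log_span_annotations_dataframe` call is taken from the surrounding docs, while the `annotation_name`/`annotator_kind` columns and the assumption that the results frame keeps its `context.span_id` index are illustrative guesses, not something this commit specifies.

```python
# Hedged sketch: attach the hallucination eval results to their source spans.
# Assumes hallucination_results (from the docs above) still carries the
# context.span_id index, and that the annotation columns below match what
# log_span_annotations_dataframe expects -- both are assumptions.
annotations_df = hallucination_results.copy()
annotations_df["annotation_name"] = "hallucination"  # assumed column name
annotations_df["annotator_kind"] = "LLM"  # assumed column name

client.spans.log_span_annotations_dataframe(dataframe=annotations_df)
```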

packages/phoenix-client/src/phoenix/client/helpers/spans/__init__.py

Lines changed: 16 additions & 1 deletion
@@ -6,12 +6,27 @@
 
 from phoenix.client.__generated__ import v1
 
+from .rag import (
+    async_get_input_output_context,
+    async_get_retrieved_documents,
+    get_input_output_context,
+    get_retrieved_documents,
+)
+
 Span = v1.Span
 
 if TYPE_CHECKING:
     import pandas as pd
 
-__all__ = ["uniquify_spans", "uniquify_spans_dataframe", "dataframe_to_spans"]
+__all__ = [
+    "uniquify_spans",
+    "uniquify_spans_dataframe",
+    "dataframe_to_spans",
+    "get_input_output_context",
+    "get_retrieved_documents",
+    "async_get_input_output_context",
+    "async_get_retrieved_documents",
+]
 
 # Source implementation:opentelemetry.sdk.trace.id_generator.RandomIdGenerator
 
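Editor's note: the export list above adds async variants (`async_get_retrieved_documents`, `async_get_input_output_context`) that none of the touched docs demonstrate. A minimal usage sketch follows, assuming the async helpers mirror the sync signatures shown in index.md and take the client's async counterpart; `AsyncClient` here is an assumption, not confirmed by this diff.

```python
import asyncio

from phoenix.client import AsyncClient  # assumed async counterpart of Client
from phoenix.client.helpers.spans import (
    async_get_input_output_context,
    async_get_retrieved_documents,
)


async def main() -> None:
    client = AsyncClient()
    # Assumption: same parameters as the sync helpers documented above.
    docs_df = await async_get_retrieved_documents(client, project_name="my-rag-app")
    qa_df = await async_get_input_output_context(client, project_name="my-rag-app")
    print(docs_df.head())
    if qa_df is not None:
        print(qa_df.head())


asyncio.run(main())
```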