1 change: 1 addition & 0 deletions CHANGELOG.MD
@@ -12,6 +12,7 @@ _released 04--2026

### Added
- **AI Evaluation Template Support**: Support for uploading test results to TestRail's AI Evaluation Template with multi-dimensional quality ratings. See README "AI Evaluation Template Support" section for complete examples.
- **Multi-Step AI Evaluation Workflows**: Support for combining step-level execution tracking (`testrail_result_step`) with overall quality ratings in AI Evaluation tests. See README "Multi-Step AI Evaluation Workflows" section.
- **Global Quality Rating via `--result-fields`**: Added support for applying quality ratings to all test results using `--result-fields quality_rating:'{"category": value}'`. Test-specific quality ratings in XML/JSON properties take precedence over CLI global ratings.

## [1.14.1]
73 changes: 73 additions & 0 deletions README.md
@@ -690,6 +690,79 @@ trcli parse_robot \
--suite-id 100
```

### Multi-Step AI Evaluation Workflows

For complex AI systems with multiple pipeline stages (such as RAG, multi-agent systems, or sequential AI workflows), you can combine **step-level execution tracking** with an **overall quality assessment** in your AI Evaluation tests. The `quality_rating` result field can also be added to test cases that use the Test Case (Steps) template.

#### How It Works

**Step-Level Tracking:**
- Each step has its own **status** (passed, failed, skipped, untested)
- See exactly where in the pipeline the failure occurred

**Overall Quality Rating:**
- One **quality_rating** applies to the entire test result
- Assess the final output quality across multiple dimensions

#### JUnit XML Example

```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="RAG Pipeline Tests" tests="1" failures="1" time="10.5">
<testsuite name="Document QA" tests="1" failures="1" time="10.5">

<testcase classname="ai.rag.DocumentQA" name="C1000_test_rag_pipeline" time="10.5">
<properties>
<property name="test_id" value="C1000"/>

<!-- Step-Level Execution Tracking -->
<property name="testrail_result_step" value="passed:Step 1 Query Understanding"/>
<property name="testrail_result_step" value="passed:Step 2 Document Retrieval"/>
<property name="testrail_result_step" value="failed:Step 3 Answer Generation"/>
<property name="testrail_result_step" value="untested:Step 4 Response Validation"/>

<!-- Overall Quality Rating -->
<property name="quality_rating" value='{"factual_accuracy": 2, "coherence": 3, "completeness": 1}'/>

        <!-- AI Context Fields (not applicable to Test Case (Steps)) -->
<property name="testrail_result_field" value="custom_ai_input:What programming language is used for machine learning?"/>
<property name="testrail_result_field" value="custom_ai_output:JavaScript is the primary language for machine learning."/>
<property name="testrail_result_field" value="custom_ai_traces:https://logs.example.com/trace/rag-001"/>
<property name="testrail_result_field" value="custom_ai_latency:10.5 seconds"/>
</properties>
<failure message="Answer generation produced factually incorrect response"/>
</testcase>

</testsuite>
</testsuites>
```

**Upload Command:**
```bash
trcli parse_junit \
-f rag_pipeline_results.xml \
--project-id 1 \
--suite-id 100
```
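Because a malformed `quality_rating` JSON string or an unrecognized step status only surfaces at upload time, it can help to sanity-check the report file first. The following is a minimal sketch of a hypothetical pre-upload validator (not part of trcli) that checks the two property formats shown above:

```python
import json
import xml.etree.ElementTree as ET

# Step statuses documented for testrail_result_step
VALID_STEP_STATUSES = {"passed", "untested", "skipped", "failed"}

def validate_junit_properties(xml_text: str) -> list[str]:
    """Return validation errors for AI Evaluation properties in a JUnit report."""
    errors = []
    root = ET.fromstring(xml_text)
    for case in root.iter("testcase"):
        for prop in case.iter("property"):
            name, value = prop.get("name"), prop.get("value", "")
            if name == "quality_rating":
                # quality_rating must be a JSON object of integer ratings
                try:
                    ratings = json.loads(value)
                    if not all(isinstance(v, int) for v in ratings.values()):
                        errors.append(f"{case.get('name')}: non-integer rating in {value}")
                except json.JSONDecodeError:
                    errors.append(f"{case.get('name')}: quality_rating is not valid JSON")
            elif name == "testrail_result_step":
                # steps use the status:description format
                status = value.split(":", 1)[0]
                if status not in VALID_STEP_STATUSES:
                    errors.append(f"{case.get('name')}: unknown step status '{status}'")
    return errors
```

Running this over the example file above before calling `trcli parse_junit` would catch typos such as `pased:Step 1` or a quality rating written with single quotes (which is not valid JSON).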

#### Important Notes

1. **Quality Rating Scope**: The `quality_rating` applies to the **entire test result**, not individual steps. It represents the overall quality of the AI system's final output.

2. **Step Status Format**: Use `status:description` format for step-level tracking:
- `passed:Step 1 Query Understanding`
- `failed:Step 3 Answer Generation`
- `skipped:Optional Enhancement`
- `untested:Step 4 Response Validation`

3. **Available Step Statuses**:
- `passed` (status_id: 1) - Step completed successfully
- `untested` (status_id: 3) - Step not executed
- `skipped` (status_id: 4) - Step intentionally skipped
- `failed` (status_id: 5) - Step failed

4. **Test Status Aggregation**: The overall test status follows **fail-fast** logic - if any step fails, the entire test fails.
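The status-to-ID mapping and the fail-fast aggregation described in the notes above can be sketched as follows. This is an illustrative helper, not trcli's actual implementation; in particular, how all-`untested` or mixed `skipped` runs aggregate is an assumption here, while the fail-fast rule itself comes from the documentation above:

```python
# TestRail status IDs for step statuses (per the list above)
STEP_STATUS_IDS = {"passed": 1, "untested": 3, "skipped": 4, "failed": 5}

def aggregate_test_status(step_values: list[str]) -> str:
    """Derive an overall test status from 'status:description' step strings.

    Fail-fast: any failed step fails the whole test. Otherwise the test
    passes unless no step ran at all (assumed behavior for illustration).
    """
    statuses = [value.split(":", 1)[0] for value in step_values]
    if "failed" in statuses:
        return "failed"
    if statuses and all(s == "untested" for s in statuses):
        return "untested"
    return "passed"

steps = [
    "passed:Step 1 Query Understanding",
    "passed:Step 2 Document Retrieval",
    "failed:Step 3 Answer Generation",
    "untested:Step 4 Response Validation",
]
print(aggregate_test_status(steps))  # -> failed
```

This mirrors the JUnit example above: the Step 3 failure makes the whole test fail even though Steps 1 and 2 passed.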

## Behavior-Driven Development (BDD) Support

The TestRail CLI provides comprehensive support for Behavior-Driven Development workflows using Gherkin syntax. The BDD features enable you to manage test cases written in Gherkin format, execute BDD tests with various frameworks (Cucumber, Behave, pytest-bdd, etc.), and seamlessly upload results to TestRail.
90 changes: 90 additions & 0 deletions tests/test_data/XML/sample_ai_eval_multistep_workflow.xml
@@ -0,0 +1,90 @@
<?xml version="1.0" encoding="UTF-8"?>
<testsuites name="RAG Pipeline - AI Evaluation" tests="3" failures="2" errors="0" time="35.2">

<!-- Suite 1: Document QA Tests -->
<testsuite name="Document QA RAG Pipeline" tests="3" failures="2" errors="0" time="35.2">

<!-- Test 1: Successful RAG Pipeline (All Steps Pass) -->
<testcase classname="ai.rag.DocumentQA" name="C1000_test_rag_pipeline_success" time="12.5">
<properties>
<property name="test_id" value="C1000"/>

<!-- Step-Level Execution Tracking -->
<property name="testrail_result_step" value="passed:Step 1 Query Understanding"/>
<property name="testrail_result_step" value="passed:Step 2 Document Retrieval"/>
<property name="testrail_result_step" value="passed:Step 3 Answer Generation"/>
<property name="testrail_result_step" value="passed:Step 4 Response Validation"/>

<!-- Overall Quality Rating (High Quality) -->
<property name="quality_rating" value='{"factual_accuracy": 5, "coherence": 5, "completeness": 4, "relevance": 5}'/>

<!-- AI Context Fields -->
<property name="testrail_result_field" value="custom_ai_input:What is the capital of France?"/>
<property name="testrail_result_field" value="custom_ai_output:The capital of France is Paris. Paris is the largest city in France and has been the capital since 987 AD."/>
<property name="testrail_result_field" value="custom_ai_traces:https://logs.example.com/trace/rag-success-001"/>
<property name="testrail_result_field" value="custom_ai_latency:12.5 seconds"/>
</properties>
</testcase>

<!-- Test 2: Failed Answer Generation (Step 3 Fails) -->
<testcase classname="ai.rag.DocumentQA" name="C1001_test_rag_pipeline_factual_error" time="10.5">
<properties>
<property name="test_id" value="C1001"/>

<!-- Step-Level Execution Tracking -->
<property name="testrail_result_step" value="passed:Step 1 Query Understanding"/>
<property name="testrail_result_step" value="passed:Step 2 Document Retrieval"/>
<property name="testrail_result_step" value="failed:Step 3 Answer Generation"/>
<property name="testrail_result_step" value="untested:Step 4 Response Validation"/>

<!-- Overall Quality Rating (Low Due to Factual Error) -->
<property name="quality_rating" value='{"factual_accuracy": 1, "coherence": 3, "completeness": 2, "relevance": 2}'/>

<!-- AI Context Fields -->
<property name="testrail_result_field" value="custom_ai_input:What programming language is primarily used for machine learning?"/>
<property name="testrail_result_field" value="custom_ai_output:JavaScript is the primary language for machine learning, widely used in neural networks and deep learning."/>
<property name="testrail_result_field" value="custom_ai_traces:https://logs.example.com/trace/rag-failure-001"/>
<property name="testrail_result_field" value="custom_ai_latency:10.5 seconds"/>
</properties>
<failure message="Answer generation produced factually incorrect response">
Expected: Python is the primary language for machine learning
Actual: JavaScript is the primary language for machine learning

Issue: Model hallucinated incorrect information despite correct document retrieval
Impact: Users receive misleading information that could affect decision-making
</failure>
</testcase>

<!-- Test 3: Document Retrieval Failure (Step 2 Fails) -->
<testcase classname="ai.rag.DocumentQA" name="C1002_test_rag_pipeline_retrieval_failure" time="12.2">
<properties>
<property name="test_id" value="C1002"/>

<!-- Step-Level Execution Tracking -->
<property name="testrail_result_step" value="passed:Step 1 Query Understanding"/>
<property name="testrail_result_step" value="failed:Step 2 Document Retrieval"/>
<property name="testrail_result_step" value="untested:Step 3 Answer Generation"/>
<property name="testrail_result_step" value="untested:Step 4 Response Validation"/>

<!-- Overall Quality Rating (Low Due to No Relevant Documents) -->
<property name="quality_rating" value='{"factual_accuracy": 0, "coherence": 1, "completeness": 0, "relevance": 1}'/>

<!-- AI Context Fields -->
<property name="testrail_result_field" value="custom_ai_input:Explain the Heisenberg uncertainty principle in quantum mechanics"/>
<property name="testrail_result_field" value="custom_ai_output:I don't have enough information to answer your question about the Heisenberg uncertainty principle."/>
<property name="testrail_result_field" value="custom_ai_traces:https://logs.example.com/trace/rag-retrieval-failure-001"/>
<property name="testrail_result_field" value="custom_ai_latency:12.2 seconds"/>
</properties>
<failure message="Document retrieval failed to find relevant sources">
Expected: Retrieved at least 3 relevant documents about quantum mechanics
Actual: Retrieved 0 relevant documents (only found documents about classical physics)

Issue: Vector search embeddings failed to capture semantic meaning of quantum mechanics query
Impact: System cannot provide accurate answers for domain-specific questions
Recommendation: Retrain embedding model with physics-domain knowledge or use specialized vector database
</failure>
</testcase>

</testsuite>

</testsuites>