961 changes: 842 additions & 119 deletions .test/README.md

Large diffs are not rendered by default.

353 changes: 353 additions & 0 deletions .test/notebooks/gepa_skill_optimization_demo.ipynb
@@ -0,0 +1,353 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GEPA Skill Optimization Demo\n",
"\n",
"This notebook demonstrates how the skill-test framework uses [GEPA](https://github.com/gepa-ai/gepa) to automatically optimize Databricks SKILL.md files for **quality** and **token efficiency**.\n",
"\n",
"SKILL.md files teach AI agents (like Claude Code) Databricks patterns. Every token in a skill consumes agent context window budget, so skills should be as concise and high-quality as possible.\n",
"\n",
"**What GEPA does:**\n",
"1. Scores the current SKILL.md against deterministic scorers (syntax, patterns, APIs, facts)\n",
"2. Reflects on failures and proposes mutations to improve the skill\n",
"3. Selects the best candidate via Pareto frontier optimization\n",
"4. Repeats until quality converges or budget is exhausted"
]
},
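{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above can be sketched in a few lines of plain Python. This is an illustrative toy, not the real GEPA API: `score`, `mutate`, and single-best selection are placeholder assumptions (GEPA actually selects over a Pareto frontier of per-task scores)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy score -> reflect/mutate -> select loop (illustration only, not the GEPA API)\n",
"def optimize(seed, score, mutate, budget=15):\n",
"    pool = {seed: score(seed)}  # candidate -> score\n",
"    best = seed\n",
"    for _ in range(budget):\n",
"        candidate = mutate(best)        # stand-in for the reflection step\n",
"        pool[candidate] = score(candidate)\n",
"        best = max(pool, key=pool.get)  # stand-in for Pareto selection\n",
"    return best, pool[best]\n",
"\n",
"# Demo with stand-in functions: 'quality' is string length, capped at 10\n",
"best, s = optimize('seed', score=lambda c: min(len(c), 10), mutate=lambda c: c + '+')\n",
"print(best, s)"
]
},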
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add skill-test to path\n",
"repo_root = Path(\".\").resolve()\n",
"while not (repo_root / \".test\" / \"src\").exists() and repo_root != repo_root.parent:\n",
" repo_root = repo_root.parent\n",
"sys.path.insert(0, str(repo_root / \".test\" / \"src\"))\n",
"\n",
"print(f\"Repo root: {repo_root}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import os\n\n# Configure the reflection model -- pick ONE:\n\n# Option A: Databricks Model Serving (default, recommended)\n# IMPORTANT: DATABRICKS_API_BASE must end with /serving-endpoints\n# os.environ[\"DATABRICKS_API_KEY\"] = \"dapi...\" \n# os.environ[\"DATABRICKS_API_BASE\"] = \"https://<workspace>.cloud.databricks.com/serving-endpoints\"\n# os.environ[\"GEPA_REFLECTION_LM\"] = \"databricks/databricks-gpt-5-2\"\n\n# Option B: OpenAI\n# os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n# os.environ[\"GEPA_REFLECTION_LM\"] = \"openai/gpt-4o\"\n\nprint(f\"Reflection LM: {os.environ.get('GEPA_REFLECTION_LM', 'databricks/databricks-gpt-5-2 (default)')}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Inspect the Skill\n",
"\n",
"Let's look at the `databricks-model-serving` skill -- its current size, test cases, and baseline score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SKILL_NAME = \"databricks-model-serving\"\n",
"\n",
"from skill_test.optimize.evaluator import _find_skill_md, count_tokens\n",
"from skill_test.optimize.splitter import create_gepa_datasets\n",
"\n",
"# Load skill\n",
"skill_path = _find_skill_md(SKILL_NAME)\n",
"original_content = skill_path.read_text()\n",
"original_tokens = count_tokens(original_content)\n",
"\n",
"# Load test cases\n",
"train, val = create_gepa_datasets(SKILL_NAME)\n",
"\n",
"print(f\"Skill: {SKILL_NAME}\")\n",
"print(f\"Path: {skill_path}\")\n",
"print(f\"Lines: {len(original_content.splitlines())}\")\n",
"print(f\"Tokens: {original_tokens:,}\")\n",
"print(f\"Train cases: {len(train)}\")\n",
"print(f\"Val cases: {len(val) if val else 'None'}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show first few test cases\n",
"for t in train[:3]:\n",
" print(f\"\\n--- {t['id']} ---\")\n",
" print(f\"Prompt: {t['input'][:100]}...\")\n",
" if t.get('answer'):\n",
" print(f\"Answer: {t['answer'][:100]}...\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Evaluate Current Quality (Baseline)\n",
"\n",
"Before optimizing, measure the current skill quality using the scorer pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "from skill_test.optimize.evaluator import create_skill_evaluator, SKILL_KEY\nfrom skill_test.optimize.splitter import to_gepa_instances\n\nevaluator = create_skill_evaluator(SKILL_NAME)\nseed_candidate = {SKILL_KEY: original_content}\n\n# Evaluate on all train tasks\ngepa_instances = to_gepa_instances(train)\n\nprint(f\"{'Task ID':<35} {'Score':>8}\")\nprint(\"-\" * 45)\nfor i, inst in enumerate(gepa_instances):\n score, side_info = evaluator(seed_candidate, inst)\n task_id = train[i]['id']\n status = 'PASS' if score >= 0.5 else 'FAIL'\n print(f\"{task_id:<35} {score:>7.3f} {status}\")\n\n# Quick baseline\nscores = [evaluator(seed_candidate, inst)[0] for inst in gepa_instances]\nbaseline_score = sum(scores) / len(scores)\nprint(f\"\\nBaseline Score: {baseline_score:.3f}\")\nprint(f\"Token Count: {original_tokens:,}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Run GEPA Optimization\n",
"\n",
"Now run the optimization. GEPA will:\n",
"- Use the current SKILL.md as the seed candidate\n",
"- Run scorers against each test case\n",
"- Reflect on failures to propose mutations\n",
"- Select the best candidate via Pareto frontier\n",
"- Penalize token bloat (80% quality, 20% efficiency weighting)"
]
},
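{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plausible shape for the 80/20 quality/efficiency weighting is sketched below. The real formula lives in `skill_test.optimize`; this version (linear token ratio, clamped at 1.0) is an assumption for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumed combined-score shape -- the actual implementation may differ\n",
"def combined_score(quality, tokens, baseline_tokens,\n",
"                   w_quality=0.8, w_efficiency=0.2):\n",
"    # Efficiency is 1.0 at or below the baseline token count and\n",
"    # degrades proportionally for bloated candidates\n",
"    efficiency = min(1.0, baseline_tokens / tokens)\n",
"    return w_quality * quality + w_efficiency * efficiency\n",
"\n",
"print(combined_score(0.9, 2000, 4000))  # 0.8*0.9 + 0.2*1.0 = 0.92\n",
"print(combined_score(0.9, 8000, 4000))  # 0.8*0.9 + 0.2*0.5 = 0.82"
]
},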
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from skill_test.optimize.runner import optimize_skill\n",
"\n",
"result = optimize_skill(\n",
" skill_name=SKILL_NAME,\n",
" mode=\"static\",\n",
" preset=\"quick\", # 15 iterations -- increase to \"standard\" (50) or \"thorough\" (150) for better results\n",
")\n",
"\n",
"print(f\"Optimization complete!\")\n",
"print(f\"GEPA metric calls: {result.gepa_result.total_metric_calls}\")\n",
"print(f\"Candidates explored: {result.gepa_result.num_candidates}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Results Comparison\n",
"\n",
"Compare the original vs. optimized skill across quality and token efficiency."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"=\" * 60)\n",
"print(f\" OPTIMIZATION RESULTS: {SKILL_NAME}\")\n",
"print(\"=\" * 60)\n",
"print()\n",
"\n",
"# Quality comparison\n",
"quality_delta = result.improvement\n",
"quality_pct = (quality_delta / result.original_score * 100) if result.original_score > 0 else 0\n",
"print(f\" Quality Score\")\n",
"print(f\" Before: {result.original_score:.3f}\")\n",
"print(f\" After: {result.optimized_score:.3f}\")\n",
"print(f\" Delta: {quality_delta:+.3f} ({quality_pct:+.1f}%)\")\n",
"print()\n",
"\n",
"# Token comparison \n",
"token_delta = result.original_token_count - result.optimized_token_count\n",
"print(f\" Token Count\")\n",
"print(f\" Before: {result.original_token_count:,}\")\n",
"print(f\" After: {result.optimized_token_count:,}\")\n",
"print(f\" Saved: {token_delta:,} tokens ({result.token_reduction_pct:.1f}% reduction)\")\n",
"print()\n",
"\n",
"# Line count comparison\n",
"orig_lines = len(result.original_content.splitlines())\n",
"opt_lines = len(result.optimized_content.splitlines())\n",
"print(f\" Lines\")\n",
"print(f\" Before: {orig_lines}\")\n",
"print(f\" After: {opt_lines}\")\n",
"print(f\" Saved: {orig_lines - opt_lines} lines\")\n",
"print()\n",
"\n",
"# Validation scores\n",
"if result.val_scores:\n",
" avg_val = sum(result.val_scores.values()) / len(result.val_scores)\n",
" print(f\" Validation (held-out test cases)\")\n",
" for tid, score in result.val_scores.items():\n",
" print(f\" {tid}: {score:.3f}\")\n",
" print(f\" Average: {avg_val:.3f}\")\n",
"\n",
"print()\n",
"print(\"=\" * 60)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visual comparison bar chart\n",
"try:\n",
" import matplotlib.pyplot as plt\n",
" import matplotlib\n",
" matplotlib.rcParams['font.family'] = 'monospace'\n",
"\n",
" fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n",
"\n",
" # Quality scores\n",
" ax = axes[0]\n",
" bars = ax.bar(\n",
" ['Before', 'After'],\n",
" [result.original_score, result.optimized_score],\n",
" color=['#d4534b', '#4a9c5d'],\n",
" width=0.5\n",
" )\n",
" ax.set_ylim(0, 1.1)\n",
" ax.set_ylabel('Quality Score')\n",
" ax.set_title(f'Quality: {result.original_score:.3f} → {result.optimized_score:.3f}')\n",
" for bar, val in zip(bars, [result.original_score, result.optimized_score]):\n",
" ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,\n",
" f'{val:.3f}', ha='center', fontweight='bold')\n",
"\n",
" # Token counts\n",
" ax = axes[1]\n",
" bars = ax.bar(\n",
" ['Before', 'After'],\n",
" [result.original_token_count, result.optimized_token_count],\n",
" color=['#d4534b', '#4a9c5d'],\n",
" width=0.5\n",
" )\n",
" ax.set_ylabel('Token Count')\n",
" ax.set_title(f'Tokens: {result.original_token_count:,} → {result.optimized_token_count:,} ({result.token_reduction_pct:.0f}% reduction)')\n",
" for bar, val in zip(bars, [result.original_token_count, result.optimized_token_count]):\n",
" ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 50,\n",
" f'{val:,}', ha='center', fontweight='bold')\n",
"\n",
" fig.suptitle(f'GEPA Optimization: {SKILL_NAME}', fontsize=14, fontweight='bold')\n",
" plt.tight_layout()\n",
" plt.show()\n",
"except ImportError:\n",
" print(\"(matplotlib not installed -- skipping chart)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Review the Diff\n",
"\n",
"Inspect what GEPA changed in the SKILL.md."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from skill_test.optimize.review import review_optimization\n",
"\n",
"review_optimization(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Apply (Optional)\n",
"\n",
"If the results look good, apply the optimized SKILL.md. Uncomment the cell below to write it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment to apply:\n",
"# from skill_test.optimize.review import apply_optimization\n",
"# apply_optimization(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Multi-Component Optimization: Skills + Tools\n\nGEPA supports optimizing multiple text components simultaneously. You can optimize SKILL.md files **alongside** MCP tool descriptions in a single run.\n\nGEPA's `RoundRobinReflectionComponentSelector` cycles through components one at a time, so each gets dedicated reflection and mutation."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Inspect available MCP tools\nfrom skill_test.optimize.tools import get_tool_stats, extract_tool_descriptions, tools_to_gepa_components\n\nstats = get_tool_stats()\nprint(f\"MCP Tool Modules: {stats['modules']}\")\nprint(f\"Total Tools: {stats['total_tools']}\")\nprint(f\"Total Chars: {stats['total_description_chars']:,}\")\nprint()\nfor mod, info in stats[\"per_module\"].items():\n print(f\" {mod:<20} {info['tools']:>2} tools {info['chars']:>6,} chars\")\n\n# Show what GEPA components look like for selected modules\ntool_map = extract_tool_descriptions(modules=[\"serving\", \"sql\"])\ncomponents = tools_to_gepa_components(tool_map, per_module=True)\nprint(f\"\\nGEPA components for serving + sql: {list(components.keys())}\")\nfor name, text in components.items():\n from skill_test.optimize.evaluator import count_tokens\n print(f\" {name}: {count_tokens(text):,} tokens\")"
},
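{
"cell_type": "markdown",
"metadata": {},
"source": [
"The round-robin behavior can be pictured as cycling through the component names so each reflection step mutates exactly one component in turn. The component names below are hypothetical; the real selector is GEPA's `RoundRobinReflectionComponentSelector`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itertools import cycle\n",
"\n",
"# Hypothetical component names -- real names come from tools_to_gepa_components\n",
"components = ['skill_md', 'tools_serving', 'tools_sql']\n",
"selector = cycle(components)\n",
"\n",
"# Each reflection step targets one component, in turn\n",
"for step in range(5):\n",
"    print(f'step {step}: reflect on {next(selector)}')"
]
},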
{
"cell_type": "code",
"source": "## Changing the Reflection Model\n\nBy default, GEPA uses `databricks/databricks-gpt-5-2` via Databricks Model Serving.\nOverride per-call or via environment variable:\n\n```python\n# Per-call\nresult = optimize_skill(\"my-skill\", reflection_lm=\"openai/gpt-4o\")\n\n# Environment variable (persistent)\nos.environ[\"GEPA_REFLECTION_LM\"] = \"databricks/databricks-gpt-5-2\"\n```\n\nSee README.md for full model configuration options.",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"The GEPA optimization pipeline:\n",
"\n",
"| Metric | Before | After | Change |\n",
"|--------|--------|-------|--------|\n",
"| Quality Score | `result.original_score` | `result.optimized_score` | `result.improvement` |\n",
"| Token Count | `result.original_token_count` | `result.optimized_token_count` | `result.token_reduction_pct`% |\n",
"\n",
"Key points:\n",
"- **Quality gate**: Existing scorers (syntax, patterns, APIs, facts) are reused as-is\n",
"- **Token efficiency**: 80/20 quality/efficiency weighting penalizes bloated skills\n",
"- **Validation split**: Held-out test cases detect overfitting\n",
"- **Reflection LM**: Configurable via `--reflection-lm` flag or `GEPA_REFLECTION_LM` env var\n",
"- **Default model**: `databricks/databricks-gpt-5-2` via Databricks Model Serving"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
3 changes: 2 additions & 1 deletion .test/pyproject.toml
@@ -17,7 +17,8 @@ dependencies = [
[project.optional-dependencies]
databricks = ["databricks-sdk>=0.20.0"]
dev = ["pytest>=8.0", "pytest-asyncio>=0.23"]
all = ["skill-test[databricks,dev]"]
optimize = ["gepa>=0.1.0", "tiktoken>=0.7.0"]
all = ["skill-test[databricks,dev,optimize]"]

[project.scripts]
skill-test = "skill_test.cli:main"