961 changes: 842 additions & 119 deletions .test/README.md

Large diffs are not rendered by default.

353 changes: 353 additions & 0 deletions .test/notebooks/gepa_skill_optimization_demo.ipynb
@@ -0,0 +1,353 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GEPA Skill Optimization Demo\n",
"\n",
"This notebook demonstrates how the skill-test framework uses [GEPA](https://github.com/gepa-ai/gepa) to automatically optimize Databricks SKILL.md files for **quality** and **token efficiency**.\n",
"\n",
"SKILL.md files teach AI agents (like Claude Code) Databricks patterns. Every token in a skill consumes agent context window budget, so skills should be as concise and high-quality as possible.\n",
"\n",
"**What GEPA does:**\n",
"1. Scores the current SKILL.md against deterministic scorers (syntax, patterns, APIs, facts)\n",
"2. Reflects on failures and proposes mutations to improve the skill\n",
"3. Selects the best candidate via Pareto frontier optimization\n",
"4. Repeats until quality converges or budget is exhausted"
]
},
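{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above can be sketched in a few lines of plain Python. This is an illustrative toy, not the real GEPA API: `score`, `mutate`, and single-best selection are placeholder assumptions (GEPA actually selects over a Pareto frontier of per-task scores)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy score -> reflect/mutate -> select loop (illustration only, not the GEPA API)\n",
"def optimize(seed, score, mutate, budget=15):\n",
"    pool = {seed: score(seed)}  # candidate -> score\n",
"    best = seed\n",
"    for _ in range(budget):\n",
"        candidate = mutate(best)        # stand-in for the reflection step\n",
"        pool[candidate] = score(candidate)\n",
"        best = max(pool, key=pool.get)  # stand-in for Pareto selection\n",
"    return best, pool[best]\n",
"\n",
"# Demo with stand-in functions: 'quality' is string length, capped at 10\n",
"best, s = optimize('seed', score=lambda c: min(len(c), 10), mutate=lambda c: c + '+')\n",
"print(best, s)"
]
},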
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"from pathlib import Path\n",
"\n",
"# Add skill-test to path\n",
"repo_root = Path(\".\").resolve()\n",
"while not (repo_root / \".test\" / \"src\").exists() and repo_root != repo_root.parent:\n",
" repo_root = repo_root.parent\n",
"sys.path.insert(0, str(repo_root / \".test\" / \"src\"))\n",
"\n",
"print(f\"Repo root: {repo_root}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "import os\n\n# Configure the reflection model -- pick ONE:\n\n# Option A: Databricks Model Serving (default, recommended)\n# IMPORTANT: DATABRICKS_API_BASE must end with /serving-endpoints\n# os.environ[\"DATABRICKS_API_KEY\"] = \"dapi...\" \n# os.environ[\"DATABRICKS_API_BASE\"] = \"https://<workspace>.cloud.databricks.com/serving-endpoints\"\n# os.environ[\"GEPA_REFLECTION_LM\"] = \"databricks/databricks-gpt-5-2\"\n\n# Option B: OpenAI\n# os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n# os.environ[\"GEPA_REFLECTION_LM\"] = \"openai/gpt-4o\"\n\nprint(f\"Reflection LM: {os.environ.get('GEPA_REFLECTION_LM', 'databricks/databricks-gpt-5-2 (default)')}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Inspect the Skill\n",
"\n",
"Let's look at the `databricks-model-serving` skill -- its current size, test cases, and baseline score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"SKILL_NAME = \"databricks-model-serving\"\n",
"\n",
"from skill_test.optimize.evaluator import _find_skill_md, count_tokens\n",
"from skill_test.optimize.splitter import create_gepa_datasets\n",
"\n",
"# Load skill\n",
"skill_path = _find_skill_md(SKILL_NAME)\n",
"original_content = skill_path.read_text()\n",
"original_tokens = count_tokens(original_content)\n",
"\n",
"# Load test cases\n",
"train, val = create_gepa_datasets(SKILL_NAME)\n",
"\n",
"print(f\"Skill: {SKILL_NAME}\")\n",
"print(f\"Path: {skill_path}\")\n",
"print(f\"Lines: {len(original_content.splitlines())}\")\n",
"print(f\"Tokens: {original_tokens:,}\")\n",
"print(f\"Train cases: {len(train)}\")\n",
"print(f\"Val cases: {len(val) if val else 'None'}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show first few test cases\n",
"for t in train[:3]:\n",
" print(f\"\\n--- {t['id']} ---\")\n",
" print(f\"Prompt: {t['input'][:100]}...\")\n",
" if t.get('answer'):\n",
" print(f\"Answer: {t['answer'][:100]}...\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Evaluate Current Quality (Baseline)\n",
"\n",
"Before optimizing, measure the current skill quality using the scorer pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "from skill_test.optimize.evaluator import create_skill_evaluator, SKILL_KEY\nfrom skill_test.optimize.splitter import to_gepa_instances\n\nevaluator = create_skill_evaluator(SKILL_NAME)\nseed_candidate = {SKILL_KEY: original_content}\n\n# Evaluate on all train tasks\ngepa_instances = to_gepa_instances(train)\n\nprint(f\"{'Task ID':<35} {'Score':>8}\")\nprint(\"-\" * 45)\nfor i, inst in enumerate(gepa_instances):\n score, side_info = evaluator(seed_candidate, inst)\n task_id = train[i]['id']\n status = 'PASS' if score >= 0.5 else 'FAIL'\n print(f\"{task_id:<35} {score:>7.3f} {status}\")\n\n# Quick baseline\nscores = [evaluator(seed_candidate, inst)[0] for inst in gepa_instances]\nbaseline_score = sum(scores) / len(scores)\nprint(f\"\\nBaseline Score: {baseline_score:.3f}\")\nprint(f\"Token Count: {original_tokens:,}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Run GEPA Optimization\n",
"\n",
"Now run the optimization. GEPA will:\n",
"- Use the current SKILL.md as the seed candidate\n",
"- Run scorers against each test case\n",
"- Reflect on failures to propose mutations\n",
"- Select the best candidate via Pareto frontier\n",
"- Penalize token bloat (80% quality, 20% efficiency weighting)"
]
},
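{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plausible shape for the 80/20 quality/efficiency weighting is sketched below. The real formula lives in `skill_test.optimize`; this version (linear token ratio, clamped at 1.0) is an assumption for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumed combined-score shape -- the actual implementation may differ\n",
"def combined_score(quality, tokens, baseline_tokens,\n",
"                   w_quality=0.8, w_efficiency=0.2):\n",
"    # Efficiency is 1.0 at or below the baseline token count and\n",
"    # degrades proportionally for bloated candidates\n",
"    efficiency = min(1.0, baseline_tokens / tokens)\n",
"    return w_quality * quality + w_efficiency * efficiency\n",
"\n",
"print(combined_score(0.9, 2000, 4000))  # 0.8*0.9 + 0.2*1.0 = 0.92\n",
"print(combined_score(0.9, 8000, 4000))  # 0.8*0.9 + 0.2*0.5 = 0.82"
]
},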
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from skill_test.optimize.runner import optimize_skill\n",
"\n",
"result = optimize_skill(\n",
" skill_name=SKILL_NAME,\n",
" mode=\"static\",\n",
" preset=\"quick\", # 15 iterations -- increase to \"standard\" (50) or \"thorough\" (150) for better results\n",
")\n",
"\n",
"print(f\"Optimization complete!\")\n",
"print(f\"GEPA metric calls: {result.gepa_result.total_metric_calls}\")\n",
"print(f\"Candidates explored: {result.gepa_result.num_candidates}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Results Comparison\n",
"\n",
"Compare the original vs. optimized skill across quality and token efficiency."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"=\" * 60)\n",
"print(f\" OPTIMIZATION RESULTS: {SKILL_NAME}\")\n",
"print(\"=\" * 60)\n",
"print()\n",
"\n",
"# Quality comparison\n",
"quality_delta = result.improvement\n",
"quality_pct = (quality_delta / result.original_score * 100) if result.original_score > 0 else 0\n",
"print(f\" Quality Score\")\n",
"print(f\" Before: {result.original_score:.3f}\")\n",
"print(f\" After: {result.optimized_score:.3f}\")\n",
"print(f\" Delta: {quality_delta:+.3f} ({quality_pct:+.1f}%)\")\n",
"print()\n",
"\n",
"# Token comparison \n",
"token_delta = result.original_token_count - result.optimized_token_count\n",
"print(f\" Token Count\")\n",
"print(f\" Before: {result.original_token_count:,}\")\n",
"print(f\" After: {result.optimized_token_count:,}\")\n",
"print(f\" Saved: {token_delta:,} tokens ({result.token_reduction_pct:.1f}% reduction)\")\n",
"print()\n",
"\n",
"# Line count comparison\n",
"orig_lines = len(result.original_content.splitlines())\n",
"opt_lines = len(result.optimized_content.splitlines())\n",
"print(f\" Lines\")\n",
"print(f\" Before: {orig_lines}\")\n",
"print(f\" After: {opt_lines}\")\n",
"print(f\" Saved: {orig_lines - opt_lines} lines\")\n",
"print()\n",
"\n",
"# Validation scores\n",
"if result.val_scores:\n",
" avg_val = sum(result.val_scores.values()) / len(result.val_scores)\n",
" print(f\" Validation (held-out test cases)\")\n",
" for tid, score in result.val_scores.items():\n",
" print(f\" {tid}: {score:.3f}\")\n",
" print(f\" Average: {avg_val:.3f}\")\n",
"\n",
"print()\n",
"print(\"=\" * 60)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visual comparison bar chart\n",
"try:\n",
" import matplotlib.pyplot as plt\n",
" import matplotlib\n",
" matplotlib.rcParams['font.family'] = 'monospace'\n",
"\n",
" fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n",
"\n",
" # Quality scores\n",
" ax = axes[0]\n",
" bars = ax.bar(\n",
" ['Before', 'After'],\n",
" [result.original_score, result.optimized_score],\n",
" color=['#d4534b', '#4a9c5d'],\n",
" width=0.5\n",
" )\n",
" ax.set_ylim(0, 1.1)\n",
" ax.set_ylabel('Quality Score')\n",
" ax.set_title(f'Quality: {result.original_score:.3f} → {result.optimized_score:.3f}')\n",
" for bar, val in zip(bars, [result.original_score, result.optimized_score]):\n",
" ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,\n",
" f'{val:.3f}', ha='center', fontweight='bold')\n",
"\n",
" # Token counts\n",
" ax = axes[1]\n",
" bars = ax.bar(\n",
" ['Before', 'After'],\n",
" [result.original_token_count, result.optimized_token_count],\n",
" color=['#d4534b', '#4a9c5d'],\n",
" width=0.5\n",
" )\n",
" ax.set_ylabel('Token Count')\n",
" ax.set_title(f'Tokens: {result.original_token_count:,} → {result.optimized_token_count:,} ({result.token_reduction_pct:.0f}% reduction)')\n",
" for bar, val in zip(bars, [result.original_token_count, result.optimized_token_count]):\n",
" ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 50,\n",
" f'{val:,}', ha='center', fontweight='bold')\n",
"\n",
" fig.suptitle(f'GEPA Optimization: {SKILL_NAME}', fontsize=14, fontweight='bold')\n",
" plt.tight_layout()\n",
" plt.show()\n",
"except ImportError:\n",
" print(\"(matplotlib not installed -- skipping chart)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Review the Diff\n",
"\n",
"Inspect what GEPA changed in the SKILL.md."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from skill_test.optimize.review import review_optimization\n",
"\n",
"review_optimization(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Apply (Optional)\n",
"\n",
"If the results look good, apply the optimized SKILL.md. Uncomment the cell below to write it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment to apply:\n",
"# from skill_test.optimize.review import apply_optimization\n",
"# apply_optimization(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Multi-Component Optimization: Skills + Tools\n\nGEPA supports optimizing multiple text components simultaneously. You can optimize SKILL.md files **alongside** MCP tool descriptions in a single run.\n\nGEPA's `RoundRobinReflectionComponentSelector` cycles through components one at a time, so each gets dedicated reflection and mutation."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Inspect available MCP tools\nfrom skill_test.optimize.tools import get_tool_stats, extract_tool_descriptions, tools_to_gepa_components\n\nstats = get_tool_stats()\nprint(f\"MCP Tool Modules: {stats['modules']}\")\nprint(f\"Total Tools: {stats['total_tools']}\")\nprint(f\"Total Chars: {stats['total_description_chars']:,}\")\nprint()\nfor mod, info in stats[\"per_module\"].items():\n print(f\" {mod:<20} {info['tools']:>2} tools {info['chars']:>6,} chars\")\n\n# Show what GEPA components look like for selected modules\ntool_map = extract_tool_descriptions(modules=[\"serving\", \"sql\"])\ncomponents = tools_to_gepa_components(tool_map, per_module=True)\nprint(f\"\\nGEPA components for serving + sql: {list(components.keys())}\")\nfor name, text in components.items():\n from skill_test.optimize.evaluator import count_tokens\n print(f\" {name}: {count_tokens(text):,} tokens\")"
},
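{
"cell_type": "markdown",
"metadata": {},
"source": [
"The round-robin behavior can be pictured as cycling through the component names so each reflection step mutates exactly one component in turn. The component names below are hypothetical; the real selector is GEPA's `RoundRobinReflectionComponentSelector`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itertools import cycle\n",
"\n",
"# Hypothetical component names -- real names come from tools_to_gepa_components\n",
"components = ['skill_md', 'tools_serving', 'tools_sql']\n",
"selector = cycle(components)\n",
"\n",
"# Each reflection step targets one component, in turn\n",
"for step in range(5):\n",
"    print(f'step {step}: reflect on {next(selector)}')"
]
},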
{
"cell_type": "code",
"source": "## Changing the Reflection Model\n\nBy default, GEPA uses `databricks/databricks-gpt-5-2` via Databricks Model Serving.\nOverride per-call or via environment variable:\n\n```python\n# Per-call\nresult = optimize_skill(\"my-skill\", reflection_lm=\"openai/gpt-4o\")\n\n# Environment variable (persistent)\nos.environ[\"GEPA_REFLECTION_LM\"] = \"databricks/databricks-gpt-5-2\"\n```\n\nSee README.md for full model configuration options.",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"The GEPA optimization pipeline:\n",
"\n",
"| Metric | Before | After | Change |\n",
"|--------|--------|-------|--------|\n",
"| Quality Score | `result.original_score` | `result.optimized_score` | `result.improvement` |\n",
"| Token Count | `result.original_token_count` | `result.optimized_token_count` | `result.token_reduction_pct`% |\n",
"\n",
"Key points:\n",
"- **Quality gate**: Existing scorers (syntax, patterns, APIs, facts) are reused as-is\n",
"- **Token efficiency**: 80/20 quality/efficiency weighting penalizes bloated skills\n",
"- **Validation split**: Held-out test cases detect overfitting\n",
"- **Reflection LM**: Configurable via `--reflection-lm` flag or `GEPA_REFLECTION_LM` env var\n",
"- **Default model**: `databricks/databricks-gpt-5-2` via Databricks Model Serving"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
3 changes: 2 additions & 1 deletion .test/pyproject.toml
@@ -17,7 +17,8 @@ dependencies = [
[project.optional-dependencies]
databricks = ["databricks-sdk>=0.20.0"]
dev = ["pytest>=8.0", "pytest-asyncio>=0.23"]
all = ["skill-test[databricks,dev]"]
optimize = ["gepa>=0.1.0", "tiktoken>=0.7.0"]
all = ["skill-test[databricks,dev,optimize]"]

[project.scripts]
skill-test = "skill_test.cli:main"