diff --git a/BACKLOG.md b/BACKLOG.md index ce6c590..fb98f71 100644 --- a/BACKLOG.md +++ b/BACKLOG.md @@ -33,14 +33,43 @@ Goal: a working, tested CBC design generator and MNL estimator on synthetic data. By the end of this phase, we can hand-run estimation against fake respondents and verify the math is correct. -### 1.2 Design diagnostics (D-efficiency, level balance) - -**WHAT:** Implement `kai.design.design_diagnostics.diagnose_cbc_design()` -returning a `DesignReport` with D-efficiency, level frequencies, dominated -alternative count, pass/fail vs gates. - -**WHY:** Quality gate before any human sees the questionnaire. Bad designs -waste your time and corrupt estimation. +### 1.1.5 Within-task overlap minimization (swap-based D-efficiency) + +**WHAT:** Extend `kai.design.cbc_generator.generate_cbc_design()` for +`method="balanced_overlap"` to perform swap-based D-efficiency optimization +on top of the level-balanced sampling 1.1 already does. + +Algorithm sketch: starting from the 1.1 level-balanced design, iterate up +to N times. In each iteration: pick a random pair of alternatives (within +the same task or across tasks), try swapping a single attribute's level +between them, accept the swap iff D-efficiency improves AND the swap +doesn't break level balance per attribute. Stop when no improving swap +is found in M consecutive attempts. + +Must remain deterministic given the same seed. + +**WHY:** Phase 1.2 calibration showed our 1.1 generator produces D-eff +~0.38 at production scale, statistically indistinguishable from pure +random sampling. The 0.85 quality gate is calibrated against full +Sawtooth-style balanced overlap which includes this swap step. Until +1.1.5 ships, the production design fails its own quality gate. + +**TARGET:** Production D-eff >= 0.85 on `config/taxonomy.yaml` at +20 tasks x 4 alts. 
When the target is met, the sentinel test in +`tests/unit/test_design_diagnostics.py` +(`test_production_design_intentionally_fails_d_eff_gate`) flips its +assertion to expect `passes_gates=True`. + +**PRIORITY:** Highest remaining Phase 1 work. Should ship before 1.4 +(MNL estimator) so we estimate against a design that earns its quality +gate. + +**OPEN QUESTIONS:** +- Greedy vs simulated annealing? Greedy is simpler and likely sufficient + at our scale; SA only matters if greedy gets stuck in local minima. + Default greedy unless calibration shows otherwise. +- Iteration cap: probably 1000-10000 swaps; tune based on D-eff + stability across seeds. ### 1.4 MNL estimator (MLE + bootstrap CIs) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0e04a93..ba26bf9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,54 +10,72 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html). **Tags:** Feature -First Phase 1 work: the CBC design generator graduates from stub to -implementation. Closes BACKLOG items 1.1 and 1.3. +Phase 1.2 ships: design diagnostics. Closes BACKLOG 1.2. Surfaces a new +follow-up captured as BACKLOG 1.1.5. ### Added (code) -- `kai.design.cbc_generator.generate_cbc_design()` — Phase 1.1, balanced- - overlap method. Pure-numpy implementation; level-balanced per attribute - via a single seeded `numpy.random.default_rng(seed)` consumed in - attribute-id-sorted order. Returns `CBCDesign(d_efficiency=None)` — - diagnostics fill that field in Phase 1.2. -- `tests/unit/test_cbc_generator.py` — output contract, level balance - (perfect when divisible, near-balanced otherwise with deterministic - alphabetical-first remainder distribution), argument validation, - determinism (the Phase 1.3 ADR-005 byte-identical-pickle contract, - verified across multiple seeds and against a different taxonomy), - and a production-size smoke test against `config/taxonomy.yaml` at - 20 tasks × 4 alts × 8 attributes. 
+ +- `kai.design.design_diagnostics.diagnose_cbc_design()` - computes + D-efficiency, per-attribute level balance, max level imbalance, and + duplicate-alternative count; checks against shared.py quality gates + and returns a `DesignReport`. +- D-efficiency uses the standard MNL relative formulation under uniform + priors: `det(I)^(1/p) / N` where `I = sum_t X_t' M X_t` and M is the + J x J within-task centering matrix. Catches task-degenerate designs + (an attribute constant across all alts in a task contributes zero + information from that task). +- Effects coding: deviation/sum-to-zero, alphabetically-first level as + reference. Numerical stability via `numpy.linalg.slogdet`; singular + information matrices report d_efficiency=0.0 with descriptive message. +- `kai.shared.QUALITY_GATE_MAX_LEVEL_IMBALANCE = 0.15` - promoted from + literal in design_params.yaml to cross-module constant per Tenet 1. +- `tests/unit/test_design_diagnostics.py` - 25+ tests covering output + contract, level balance, D-efficiency (hand-verifiable orthogonal + case = exactly 1.0; singular = 0.0), duplicate detection, gate logic, + Tenet 1 enforcement of defaults, and a production-config sentinel. ### Changed (code) -- `cbc_generator.py` no longer raises `NotImplementedError` for - `method="balanced_overlap"`. `method="orthogonal"` and - `method="random"` remain unimplemented and raise `NotImplementedError` - with a message indicating they are recognized but pending Phase 2+. - Unknown methods raise `ValueError`. + +- `DesignReport.n_dominated_alternatives` renamed to + `n_duplicate_alternatives`. Strict dominance requires preference- + direction metadata not currently in the taxonomy schema; we count + duplicate alternatives within a task instead, which is well-defined + and a real pathology. Real dominance detection captured in BACKLOG. ### Decisions -- **Balanced-overlap v1 = level-balanced random assignment**, no within- - task overlap minimization. 
Production-size smoke test shows worst-case - level imbalance of 2.5% — well under the 15% gate from - `design_params.yaml`. Adding overlap-minimization speculation would - violate Tenets 2 and 5; the call to add it (or not) is gated on a - measured D-efficiency result from Phase 1.2. -- **Remainder distribution is deterministic by alphabetical id sort** - when `n_slots` is not divisible by `n_levels`. Documented in the - generator's algorithm docstring. -- **Generator is a pure function returning `d_efficiency=None`.** Any - reject-and-regenerate loop based on a quality threshold lives above - the generator (in a session-creation orchestrator), not inside it — - matches the stub's "Filled by diagnostics" comment and keeps the - generator independently testable. + +- **Production D-efficiency lands at ~0.38**, well below the 0.85 gate. + Calibration over 20 seeds shows our generator is statistically + indistinguishable from pure random sampling. The 0.85 gate is + calibrated against full Sawtooth balanced overlap (swap-based D-eff + optimization on top of level balancing). Captured as new BACKLOG + item 1.1.5; preferred over weakening the gate. Estimation still works + at D-eff 0.38 (wider CIs, not wrong answers), so the intermediate + state is survivable. +- **Sentinel test for 1.1.5**: + `test_production_design_intentionally_fails_d_eff_gate` asserts the + current failure with a docstring explaining the path forward. When + 1.1.5 ships, the assertion flips. ### Verified -- Byte-identical pickle output across separate Python processes. -- Byte-identical pickle output across `PYTHONHASHSEED` variation - (hash-order dependence ruled out). + +- D-efficiency matches hand-computed value (1.0 exactly) on orthogonal + 2-attribute 2-level case. +- Singular designs correctly report d_efficiency=0.0 and fail gates. 
+- Production calibration: our generator vs pure random vs random-with- + no-duplicates over 20 seeds each gave 0.384 / 0.371 / 0.372 mean + D-eff respectively (statistically indistinguishable). ### Closes (BACKLOG) -- 1.1 CBC design generator (balanced overlap method) -- 1.3 Determinism test (covered by `TestDeterminism` in the new test file) + +- 1.2 Design diagnostics (D-efficiency, level balance) + +### New BACKLOG item + +- 1.1.5 Within-task overlap minimization (swap-based D-efficiency + optimization). Target: production D-eff >= 0.85. Highest priority + remaining Phase 1 work; should land before 1.4 (MNL estimator). + ## [0.2.1] — 2026-04-26 — Resilience + observability foundations diff --git a/PROJECT_KNOWLEDGE.txt b/PROJECT_KNOWLEDGE.txt index 5298c23..a3cf672 100644 --- a/PROJECT_KNOWLEDGE.txt +++ b/PROJECT_KNOWLEDGE.txt @@ -61,9 +61,9 @@ Purpose: │ │ │ ├── design/ │ │ ├── plugin.py DesignPlugin lifecycle - │ │ ├── cbc_generator.py CBC design generation (stub) + │ │ ├── cbc_generator.py CBC design generation │ │ ├── maxdiff_generator.py MaxDiff BIBD generation (stub) - │ │ └── design_diagnostics.py D-efficiency, level balance (stub) + │ │ └── design_diagnostics.py D-efficiency, level balance │ │ │ ├── estimation/ │ │ ├── plugin.py EstimationPlugin lifecycle @@ -146,6 +146,7 @@ Purpose: THRESHOLDS MIN_OBS_PARAMS_RATIO QUALITY_GATE_MIN_D_EFFICIENCY + QUALITY_GATE_MAX_LEVEL_IMBALANCE ERROR TYPES KaiError (base) diff --git a/src/kai/design/design_diagnostics.py b/src/kai/design/design_diagnostics.py index b522653..32d6e3f 100644 --- a/src/kai/design/design_diagnostics.py +++ b/src/kai/design/design_diagnostics.py @@ -1,34 +1,69 @@ """ -Design diagnostics — quality metrics on generated experimental designs. +Design diagnostics - quality metrics on generated experimental designs. Run AFTER design generation to verify the design will actually let us estimate unbiased part-worths. Better to fail here than after collecting data on a degenerate design. 
Metrics: - - D-efficiency: how close to optimal information matrix - - Level balance: each level appears with similar frequency - - Pair balance: each level pair co-occurs with similar frequency - - Dominated alternatives: alternatives that strictly dominate others - (these should be rare; they leak no preference information) + - D-efficiency: how close to optimal information matrix (uniform-prior + approximation; see docstring of `diagnose_cbc_design` for details). + - Level balance: each level appears with similar frequency across the + whole design. + - Duplicate alternatives: alternatives that are identical to another + alternative WITHIN the same task (these leak no preference info). + +Note on "dominance": strict dominance (A is at least as good as B on every +attribute, strictly better on at least one) requires preference direction +metadata not currently in the taxonomy schema. We report duplicates instead, +which is computable and a real pathology. Real dominance detection is +captured as a BACKLOG follow-up. """ from __future__ import annotations +from collections import Counter from dataclasses import dataclass +import numpy as np + from kai.design.cbc_generator import CBCDesign +from kai.shared import ( + QUALITY_GATE_MAX_LEVEL_IMBALANCE, + QUALITY_GATE_MIN_D_EFFICIENCY, +) from kai.taxonomy.schema import Taxonomy @dataclass(frozen=True) class DesignReport: - """Quality assessment of a CBC design.""" + """Quality assessment of a CBC design. + + Fields: + d_efficiency: Relative D-efficiency under uniform-prior assumption. + Range roughly [0, 1]; orthogonal balanced design ~= 1.0. + level_balance: Per-attribute frequency of each level, expressed as + a fraction of total slots for that attribute. + level_balance[attr_id][level_id] = count / n_slots. + max_level_imbalance: Worst case max(|freq * n_levels - 1|) over all + (attr, level) pairs. 
0.0 means every level appears exactly the + uniform expected count; 1.0 means a level is missing entirely + or appears at twice its expected rate. + n_duplicate_alternatives: Total count of alternatives that share + their full level vector with at least one other alternative + in the SAME task. (See module docstring on why this is + duplicates-not-dominance.) + passes_gates: True iff d_efficiency and max_level_imbalance both + satisfy their thresholds. Duplicate count is reported but not + gated (per Phase 1.2 design decision). + failed_gates: Human-readable descriptions of any failed gates. + Empty list iff passes_gates is True. + """ d_efficiency: float - level_balance: dict[str, dict[str, float]] # attr -> level -> frequency + level_balance: dict[str, dict[str, float]] max_level_imbalance: float - n_dominated_alternatives: int + n_duplicate_alternatives: int passes_gates: bool failed_gates: list[str] @@ -36,11 +71,194 @@ class DesignReport: def diagnose_cbc_design( design: CBCDesign, taxonomy: Taxonomy, - min_d_efficiency: float = 0.85, - max_level_imbalance: float = 0.15, + min_d_efficiency: float = QUALITY_GATE_MIN_D_EFFICIENCY, + max_level_imbalance: float = QUALITY_GATE_MAX_LEVEL_IMBALANCE, ) -> DesignReport: """Compute design quality metrics and check against gates. - NOT YET IMPLEMENTED — scaffolding stub. + Args: + design: A CBCDesign to evaluate. + taxonomy: The taxonomy the design was generated against. Caller is + responsible for matching versions; we don't re-check here. + min_d_efficiency: Pass threshold for D-efficiency. Defaults to the + shared cross-module constant. + max_level_imbalance: Pass threshold for level balance. Defaults to + the shared cross-module constant. + + Returns: + DesignReport with all fields populated. 
+ + D-efficiency formulation: + We use the standard relative D-efficiency for multinomial logit + under the uniform-prior assumption: + + D-eff = (det(I))^(1/p) / N + + where I is the MNL information matrix: + + I = sum_t (X_t' M X_t) + + with X_t the (J, p) effects-coded submatrix for task t, J the + number of alternatives per task, and M = I_J - (1/J) * 1 * 1' + the J x J within-task centering matrix. + + Equivalently: I = X_c' X_c where X_c is the full design matrix + with each task's rows centered to zero column-means within the + task. The centering reflects that MNL's likelihood depends on + differences within a task, not absolute level values. This + means task-degenerate designs (where an attribute is constant + across all alternatives in a task) correctly contribute zero + information about that attribute from that task. + + Effects coding: deviation/sum-to-zero. For each attribute, the + alphabetically-first level is the reference (-1 in all K-1 + columns); other levels are +1 in their own column, 0 elsewhere. + - N = n_tasks * n_alts_per_task (one row per alternative) + - p = sum(n_levels - 1) across attributes (estimable params) + + Range is approximately [0, 1]; an orthogonal balanced design + with no within-task degeneracy hits ~1.0. + + Caveat: The MNL information matrix actually depends on assumed + prior part-worths through the choice probabilities. We use the + uniform-prior approximation (equiprobable choices), which + simplifies the formula above and makes the metric a pure + function of the design. Part-worth-aware D-efficiency is a + future-work candidate. + + Numerical stability: + Computed via numpy.linalg.slogdet to avoid overflow on |X_c' X_c| + for designs with many parameters. If the information matrix is + singular (sign <= 0 from slogdet), d_efficiency is reported + as 0.0 and the gate fails, with a descriptive failed_gates + message.
""" - raise NotImplementedError("Design diagnostics pending") + # ---- Sort attributes / levels deterministically ------------------------ + sorted_attrs = sorted(taxonomy.attributes, key=lambda a: a.id) + + # ---- Level balance ----------------------------------------------------- + n_alts_per_task = len(design.tasks[0].alternatives) if design.tasks else 0 + n_slots = len(design.tasks) * n_alts_per_task + + level_balance: dict[str, dict[str, float]] = {} + worst_imbalance = 0.0 + worst_attr_id = "" + worst_level_id = "" + + for attr in sorted_attrs: + sorted_level_ids = sorted(lvl.id for lvl in attr.levels) + n_levels = len(sorted_level_ids) + counts: Counter[str] = Counter() + for task in design.tasks: + for alt in task.alternatives: + counts[alt.levels[attr.id]] += 1 + + # Frequencies in sorted level-id order so the dict has stable + # iteration order (Python 3.7+ insertion-order-preserving dicts). + attr_balance: dict[str, float] = {} + for level_id in sorted_level_ids: + freq = counts.get(level_id, 0) / n_slots if n_slots else 0.0 + attr_balance[level_id] = freq + # Imbalance metric: |freq * n_levels - 1|. 0 = perfectly uniform. + imbalance = abs(freq * n_levels - 1.0) + if imbalance > worst_imbalance: + worst_imbalance = imbalance + worst_attr_id = attr.id + worst_level_id = level_id + level_balance[attr.id] = attr_balance + + # ---- Duplicate alternatives within a task ----------------------------- + # An alternative "duplicates" another within the same task iff their + # full level vectors are identical. We count each alternative that + # has at least one duplicate within its task. (So if a task has 3 + # identical alts, that contributes 3 to the count, not 2 or 1.) + n_duplicate_alternatives = 0 + for task in design.tasks: + # Convert each alt's levels dict to a hashable signature. + # Sort by attr_id so the signature is order-independent. 
+ signatures = [tuple(sorted(alt.levels.items())) for alt in task.alternatives] + sig_counts = Counter(signatures) + for sig in signatures: + if sig_counts[sig] > 1: + n_duplicate_alternatives += 1 + + # ---- D-efficiency ------------------------------------------------------ + # Build effects-coded design matrix X. + # For each attribute with K levels, contribute K-1 columns. The + # alphabetically-first level (in sorted_level_ids[0]) is the reference. + # Encoding for level k: + # - reference level: -1 in every column for this attribute + # - non-reference level k (1 <= k <= K-1): +1 in column k, 0 elsewhere + p = sum(len(a.levels) - 1 for a in sorted_attrs) # estimable params + n_rows = n_slots + + if p == 0 or n_rows == 0: + # Degenerate input: no estimable params (every attr has 1 level) + # or empty design. Either way, D-efficiency is undefined. Report 0. + d_efficiency = 0.0 + else: + X = np.zeros((n_rows, p), dtype=np.float64) # noqa: N806 — design matrix, statistical convention + col_offsets: dict[str, int] = {} # attr_id -> starting column + col = 0 + for attr in sorted_attrs: + col_offsets[attr.id] = col + col += len(attr.levels) - 1 + + row = 0 + for task in design.tasks: + for alt in task.alternatives: + for attr in sorted_attrs: + sorted_level_ids = sorted(lvl.id for lvl in attr.levels) + ref = sorted_level_ids[0] + chosen = alt.levels[attr.id] + base = col_offsets[attr.id] + if chosen == ref: + # All non-reference columns get -1 + for j in range(len(attr.levels) - 1): + X[row, base + j] = -1.0 + else: + # Find the index of `chosen` among non-reference + # levels (sorted_level_ids[1:]). + non_ref = sorted_level_ids[1:] + idx = non_ref.index(chosen) + X[row, base + idx] = 1.0 + row += 1 + + # MNL information matrix uses task-centered design. + # Reshape X to (n_tasks, n_alts_per_task, p) so we can subtract + # each task's column means in one vectorized step. 
+ X_blocks = X.reshape(len(design.tasks), n_alts_per_task, p) # noqa: N806 + # axis=1 means: average over alternatives within each task. + # keepdims=True so the subtraction broadcasts cleanly. + task_means = X_blocks.mean(axis=1, keepdims=True) + X_centered = (X_blocks - task_means).reshape(n_rows, p) # noqa: N806 + + info_matrix = X_centered.T @ X_centered + sign, log_abs_det = np.linalg.slogdet(info_matrix) + if sign <= 0: # noqa: SIM108 — branch is clearer than a 90-char ternary + # Singular or non-positive-definite information matrix: + # the design cannot identify all parameters (typically due + # to within-task degeneracy on at least one attribute). + d_efficiency = 0.0 + else: + d_efficiency = float(np.exp(log_abs_det / p) / n_rows) + + # ---- Apply gates ------------------------------------------------------- + failed_gates: list[str] = [] + if d_efficiency < min_d_efficiency: + failed_gates.append(f"d_efficiency {d_efficiency:.4f} < {min_d_efficiency:.4f} minimum") + if worst_imbalance > max_level_imbalance: + failed_gates.append( + f"level imbalance {worst_imbalance:.4f} > " + f"{max_level_imbalance:.4f} maximum " + f"(worst: attr={worst_attr_id!r} level={worst_level_id!r})" + ) + + return DesignReport( + d_efficiency=d_efficiency, + level_balance=level_balance, + max_level_imbalance=worst_imbalance, + n_duplicate_alternatives=n_duplicate_alternatives, + passes_gates=len(failed_gates) == 0, + failed_gates=failed_gates, + ) diff --git a/src/kai/shared.py b/src/kai/shared.py index 927d770..304c512 100644 --- a/src/kai/shared.py +++ b/src/kai/shared.py @@ -85,6 +85,13 @@ QUALITY_GATE_MIN_D_EFFICIENCY: float = 0.85 """Minimum D-efficiency for a CBC design to pass quality gates.""" +QUALITY_GATE_MAX_LEVEL_IMBALANCE: float = 0.15 +"""Maximum permitted per-level imbalance in a CBC design, expressed as +abs(observed_freq * n_levels - 1). 0.0 means perfectly uniform; 1.0 +means a level is missing entirely or appears at twice its expected +rate. 
Mirrors the value in design_params.yaml; promoted here for +cross-module access (diagnostics, future orchestrator) per Tenet 1.""" + DEFAULT_LOG_MAX_BYTES: int = 5 * 1024 * 1024 # 5 MB per file DEFAULT_LOG_BACKUP_COUNT: int = 3 """Log rotation defaults — per the tenet: 'Disk-fill is a real failure mode @@ -472,6 +479,7 @@ def setup_rotating_logger( # Thresholds + log defaults "MIN_OBS_PARAMS_RATIO", "QUALITY_GATE_MIN_D_EFFICIENCY", + "QUALITY_GATE_MAX_LEVEL_IMBALANCE", "DEFAULT_LOG_MAX_BYTES", "DEFAULT_LOG_BACKUP_COUNT", # Errors diff --git a/tests/unit/test_design_diagnostics.py b/tests/unit/test_design_diagnostics.py new file mode 100644 index 0000000..9fbd633 --- /dev/null +++ b/tests/unit/test_design_diagnostics.py @@ -0,0 +1,455 @@ +""" +Tests for the CBC design diagnostics. + +Coverage: + - Output contract (DesignReport with all fields populated) + - Level balance (dict shape, frequencies, max-imbalance metric) + - D-efficiency (hand-verifiable orthogonal case, singular design) + - Duplicate alternative detection + - Gate logic (pass/fail, failed_gates messages) + - Default thresholds pulled from shared.py (Tenet 1) + - Determinism (pure function of input) + - Production-config sentinel test for BACKLOG 1.1.5 (see test docstring) +""" + +from __future__ import annotations + +import inspect + +import pytest + +from kai.design.cbc_generator import ( + Alternative, + CBCDesign, + ChoiceTask, + generate_cbc_design, +) +from kai.design.design_diagnostics import DesignReport, diagnose_cbc_design +from kai.shared import ( + QUALITY_GATE_MAX_LEVEL_IMBALANCE, + QUALITY_GATE_MIN_D_EFFICIENCY, + REPO_ROOT, +) +from kai.taxonomy.schema import Attribute, Level, Taxonomy, Tenet + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +def _two_attribute_taxonomy() -> Taxonomy: + """Small taxonomy: one 3-level attr, one 2-level attr.""" + return Taxonomy( + 
version="test-1.0", + tenets=[ + Tenet(id="quality", name="Q", user_definition="QC"), + Tenet(id="speed", name="S", user_definition="ship"), + ], + attributes=[ + Attribute( + id="coverage", + name="Coverage", + description="Test coverage", + related_tenets=["quality"], + levels=[ + Level(id="low", display="60%"), + Level(id="med", display="80%"), + Level(id="high", display="95%"), + ], + ), + Attribute( + id="timeline", + name="Timeline", + description="Time to ship", + related_tenets=["speed"], + levels=[ + Level(id="fast", display="2 days"), + Level(id="slow", display="3 weeks"), + ], + ), + ], + ) + + +def _orthogonal_2x2_taxonomy() -> Taxonomy: + """Two 2-level attrs. With levels named 'aa'/'bb', 'aa' is the + alphabetically-first (reference) level under our effects coding.""" + return Taxonomy( + version="orth-2x2", + tenets=[Tenet(id="t", name="T", user_definition="T")], + attributes=[ + Attribute( + id="x", + name="X", + description="x", + related_tenets=["t"], + levels=[Level(id="aa", display="A"), Level(id="bb", display="B")], + ), + Attribute( + id="y", + name="Y", + description="y", + related_tenets=["t"], + levels=[Level(id="aa", display="A"), Level(id="bb", display="B")], + ), + ], + ) + + +def _make_design(tasks_specs: list[list[dict[str, str]]]) -> CBCDesign: + """Build a CBCDesign from a list-of-lists of level dicts.""" + tasks = [ + ChoiceTask( + task_id=i, + alternatives=[Alternative(levels=alt) for alt in task_alts], + ) + for i, task_alts in enumerate(tasks_specs) + ] + return CBCDesign(tasks=tasks, method="balanced_overlap", seed=0, d_efficiency=None) + + +# --------------------------------------------------------------------------- +# Output contract +# --------------------------------------------------------------------------- + + +class TestOutputContract: + def test_returns_design_report(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + report =
diagnose_cbc_design(design, tax) + assert isinstance(report, DesignReport) + + def test_all_fields_populated(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + report = diagnose_cbc_design(design, tax) + assert isinstance(report.d_efficiency, float) + assert isinstance(report.level_balance, dict) + assert isinstance(report.max_level_imbalance, float) + assert isinstance(report.n_duplicate_alternatives, int) + assert isinstance(report.passes_gates, bool) + assert isinstance(report.failed_gates, list) + + def test_level_balance_has_all_attributes(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=5, n_alts_per_task=3, seed=1) + report = diagnose_cbc_design(design, tax) + assert set(report.level_balance.keys()) == {"coverage", "timeline"} + + def test_level_balance_has_all_levels_per_attribute(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=5, n_alts_per_task=3, seed=1) + report = diagnose_cbc_design(design, tax) + assert set(report.level_balance["coverage"].keys()) == {"low", "med", "high"} + assert set(report.level_balance["timeline"].keys()) == {"fast", "slow"} + + +# --------------------------------------------------------------------------- +# Level balance +# --------------------------------------------------------------------------- + + +class TestLevelBalance: + def test_frequencies_sum_to_one_per_attribute(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + report = diagnose_cbc_design(design, tax) + for attr_id, freqs in report.level_balance.items(): + assert ( + abs(sum(freqs.values()) - 1.0) < 1e-10 + ), f"Attribute {attr_id!r} frequencies sum to {sum(freqs.values())}" + + def test_two_level_attribute_perfectly_balanced(self) -> None: + # 10 tasks * 4 alts = 40 slots / 2 levels = 20 each => freq 0.5 + tax = 
_two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + report = diagnose_cbc_design(design, tax) + assert report.level_balance["timeline"]["fast"] == 0.5 + assert report.level_balance["timeline"]["slow"] == 0.5 + + def test_imbalance_zero_when_perfectly_uniform(self) -> None: + # Build by hand: attr 'x' (2 levels) with exactly 50/50 split. + tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [{"x": "aa", "y": "aa"}, {"x": "bb", "y": "bb"}], + [{"x": "aa", "y": "bb"}, {"x": "bb", "y": "aa"}], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.max_level_imbalance == 0.0 + + def test_imbalance_correct_for_skewed_design(self) -> None: + # 3 of 4 alts have x="aa". freq("aa") = 0.75, n_levels=2. + # imbalance = |0.75 * 2 - 1| = 0.5 + tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [{"x": "aa", "y": "aa"}, {"x": "aa", "y": "bb"}], + [{"x": "aa", "y": "aa"}, {"x": "bb", "y": "bb"}], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.max_level_imbalance == pytest.approx(0.5) + + +# --------------------------------------------------------------------------- +# D-efficiency +# --------------------------------------------------------------------------- + + +class TestDEfficiency: + def test_orthogonal_2x2_design_d_eff_is_one(self) -> None: + """Hand-verifiable case: 1 task with 4 alts forming an orthogonal + 2-attribute, 2-level design. Effects coding gives X with rows + (+1,+1), (+1,-1), (-1,+1), (-1,-1). Within-task means are zero, + so centered X = X. X'X = diag(4,4); det=16; p=2; N=4. 
+ D-eff = 16^(1/2) / 4 = 1.0.""" + tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [ + {"x": "bb", "y": "bb"}, + {"x": "bb", "y": "aa"}, + {"x": "aa", "y": "bb"}, + {"x": "aa", "y": "aa"}, + ], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.d_efficiency == pytest.approx(1.0, abs=1e-10) + + def test_singular_design_d_eff_is_zero(self) -> None: + """Design where attr 'x' is constant across all alts in every task. + The MNL information matrix becomes singular (centered column for + x is zero), so D-eff is reported as 0.0 and the gate fails with + a descriptive message.""" + tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [ + {"x": "aa", "y": "aa"}, + {"x": "aa", "y": "bb"}, + {"x": "aa", "y": "aa"}, + {"x": "aa", "y": "bb"}, + ], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.d_efficiency == 0.0 + assert not report.passes_gates + assert any("d_efficiency" in msg for msg in report.failed_gates) + + def test_d_eff_is_pure_function_of_design(self) -> None: + """Determinism: same input always produces same D-efficiency. 
+ (Diagnostics aren't randomized - this is a sanity check, not a + contract obligation like ADR-005's seed determinism.)""" + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + r1 = diagnose_cbc_design(design, tax) + r2 = diagnose_cbc_design(design, tax) + assert r1.d_efficiency == r2.d_efficiency + assert r1.level_balance == r2.level_balance + assert r1.max_level_imbalance == r2.max_level_imbalance + + +# --------------------------------------------------------------------------- +# Duplicate alternatives +# --------------------------------------------------------------------------- + + +class TestDuplicateDetection: + def test_no_duplicates_detected_when_all_distinct(self) -> None: + tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [ + {"x": "aa", "y": "aa"}, + {"x": "aa", "y": "bb"}, + {"x": "bb", "y": "aa"}, + {"x": "bb", "y": "bb"}, + ], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.n_duplicate_alternatives == 0 + + def test_all_three_alts_identical_counts_three(self) -> None: + tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [ + {"x": "aa", "y": "aa"}, + {"x": "aa", "y": "aa"}, + {"x": "aa", "y": "aa"}, + ], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.n_duplicate_alternatives == 3 + + def test_one_pair_duplicates_counts_two(self) -> None: + # 3 alts: A, A, B -> the two A's are duplicates; B is not. + tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [ + {"x": "aa", "y": "bb"}, + {"x": "aa", "y": "bb"}, + {"x": "bb", "y": "aa"}, + ], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.n_duplicate_alternatives == 2 + + def test_duplicates_summed_across_tasks(self) -> None: + # Task 0: 3 dups. Task 1: 0 dups. Task 2: 2 dups (one pair). 
+ tax = _orthogonal_2x2_taxonomy() + design = _make_design( + [ + [ + {"x": "aa", "y": "aa"}, + {"x": "aa", "y": "aa"}, + {"x": "aa", "y": "aa"}, + ], + [ + {"x": "aa", "y": "aa"}, + {"x": "bb", "y": "aa"}, + {"x": "aa", "y": "bb"}, + ], + [ + {"x": "aa", "y": "bb"}, + {"x": "aa", "y": "bb"}, + {"x": "bb", "y": "aa"}, + ], + ] + ) + report = diagnose_cbc_design(design, tax) + assert report.n_duplicate_alternatives == 5 + + +# --------------------------------------------------------------------------- +# Gate logic +# --------------------------------------------------------------------------- + + +class TestGateLogic: + def test_strict_d_eff_threshold_triggers_failure(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + # Push threshold above any realistic D-eff. + report = diagnose_cbc_design(design, tax, min_d_efficiency=0.999) + assert not report.passes_gates + assert any("d_efficiency" in msg for msg in report.failed_gates) + + def test_strict_imbalance_threshold_triggers_failure(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + # Threshold of 0 means no imbalance tolerated at all. + report = diagnose_cbc_design(design, tax, max_level_imbalance=0.0) + assert not report.passes_gates + assert any("imbalance" in msg for msg in report.failed_gates) + + def test_failed_gate_messages_are_descriptive(self) -> None: + tax = _two_attribute_taxonomy() + design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1) + report = diagnose_cbc_design(design, tax, min_d_efficiency=0.999) + msg = report.failed_gates[0] + # Must mention what was measured and what the threshold was. 
+        assert "d_efficiency" in msg
+        assert any(c.isdigit() for c in msg)
+        assert "<" in msg or ">" in msg or "minimum" in msg
+
+    def test_passes_gates_iff_failed_gates_empty(self) -> None:
+        tax = _two_attribute_taxonomy()
+        design = generate_cbc_design(tax, n_tasks=10, n_alts_per_task=4, seed=1)
+        report = diagnose_cbc_design(design, tax)
+        assert report.passes_gates == (len(report.failed_gates) == 0)
+
+
+# ---------------------------------------------------------------------------
+# Default thresholds (Tenet 1)
+# ---------------------------------------------------------------------------
+
+
+class TestDefaultThresholds:
+    """Defaults must come from shared.py constants, not literals.
+    See ARCHITECTURE_TENETS Tenet 1: cross-cutting thresholds belong in
+    shared.py and every other module imports from there."""
+
+    def test_default_min_d_efficiency_matches_shared(self) -> None:
+        sig = inspect.signature(diagnose_cbc_design)
+        assert sig.parameters["min_d_efficiency"].default == QUALITY_GATE_MIN_D_EFFICIENCY
+
+    def test_default_max_level_imbalance_matches_shared(self) -> None:
+        sig = inspect.signature(diagnose_cbc_design)
+        assert sig.parameters["max_level_imbalance"].default == QUALITY_GATE_MAX_LEVEL_IMBALANCE
+
+
+# ---------------------------------------------------------------------------
+# Production-config sentinel (BACKLOG 1.1.5)
+# ---------------------------------------------------------------------------
+
+
+class TestProductionConfigSentinel:
+    """Phase 1.2 ships diagnostics correctness; the gate failure on the
+    Phase 1.1 generator's output is itself the result of running them.
+
+    The Phase 1.1 generator's level-balanced independent shuffles produce
+    D-efficiency ~0.38 at production scale, statistically indistinguishable
+    from pure random sampling. This is well below the 0.85 gate, which is
+    calibrated against full Sawtooth-style balanced overlap with swap-based
+    D-eff optimization.
+
+    BACKLOG item 1.1.5 will add swap-based optimization to the generator.
+    Once it lands, this assertion flips: prod design will pass the gate.
+
+    Until then, this test is a deliberate sentinel on the gate result:
+    - Gate KEEPS failing (test passes): 1.1.5 hasn't shipped yet (expected).
+    - Gate starts PASSING (test fails): either 1.1.5 shipped (flip the
+      assertion) or someone weakened the gate threshold (investigate).
+    """
+
+    def test_production_design_intentionally_fails_d_eff_gate(self) -> None:
+        from kai.taxonomy.loader import load_taxonomy
+
+        real_taxonomy = REPO_ROOT / "config" / "taxonomy.yaml"
+        if not real_taxonomy.exists():
+            pytest.skip(f"Real taxonomy not present at {real_taxonomy}")
+
+        tax = load_taxonomy(real_taxonomy)
+        design = generate_cbc_design(tax, n_tasks=20, n_alts_per_task=4, seed=42)
+        report = diagnose_cbc_design(design, tax)
+
+        # Sentinel: gate must FAIL until 1.1.5 ships.
+        assert not report.passes_gates, (
+            f"Production design unexpectedly PASSES gates with d_eff="
+            f"{report.d_efficiency:.4f}. Either BACKLOG 1.1.5 (swap-based "
+            f"D-eff optimization) has shipped - in which case flip this "
+            f"assertion to expect passes_gates=True - or the threshold "
+            f"has been weakened. Investigate."
+        )
+        assert any("d_efficiency" in m for m in report.failed_gates)
+        # Bound the expected range so a wildly-off measurement (e.g. ~0.0
+        # from a regression that breaks the formula) still trips the test.
+        assert 0.30 <= report.d_efficiency <= 0.50, (
+            f"Production D-eff {report.d_efficiency:.4f} outside expected "
+            f"[0.30, 0.50] range for level-balanced indep shuffles. "
+            f"Either the generator changed or the metric did."
+        )
+
+    def test_production_design_passes_imbalance_gate(self) -> None:
+        """Level-balance gate IS expected to pass - that's what 1.1's
+        algorithm actually optimizes for."""
+        from kai.taxonomy.loader import load_taxonomy
+
+        real_taxonomy = REPO_ROOT / "config" / "taxonomy.yaml"
+        if not real_taxonomy.exists():
+            pytest.skip(f"Real taxonomy not present at {real_taxonomy}")
+
+        tax = load_taxonomy(real_taxonomy)
+        design = generate_cbc_design(tax, n_tasks=20, n_alts_per_task=4, seed=42)
+        report = diagnose_cbc_design(design, tax)
+        assert report.max_level_imbalance <= QUALITY_GATE_MAX_LEVEL_IMBALANCE