Gimpleberry · Apr 28, 2026
diff --git a/BACKLOG.md b/BACKLOG.md
@@ -33,14 +33,43 @@ Goal: a working, tested CBC design generator and MNL estimator on synthetic
 data. By the end of this phase, we can hand-run estimation against fake
 respondents and verify the math is correct.
 
-### 1.2 Design diagnostics (D-efficiency, level balance)
-
-**WHAT:** Implement `kai.design.design_diagnostics.diagnose_cbc_design()`
-returning a `DesignReport` with D-efficiency, level frequencies, dominated
-alternative count, pass/fail vs gates.
-
-**WHY:** Quality gate before any human sees the questionnaire. Bad designs
-waste your time and corrupt estimation.
+### 1.1.5 Within-task overlap minimization (swap-based D-efficiency)
+
+**WHAT:** Extend `kai.design.cbc_generator.generate_cbc_design()` for
+`method="balanced_overlap"` to perform swap-based D-efficiency optimization
+on top of the level-balanced sampling 1.1 already does.
+
+Algorithm sketch: starting from the 1.1 level-balanced design, iterate up
+to N times. In each iteration: pick a random pair of alternatives (within
+the same task or across tasks), try swapping a single attribute's level
+between them, accept the swap iff D-efficiency improves AND the swap
+doesn't break level balance per attribute. Stop when no improving swap
+is found in M consecutive attempts.
+
+Must remain deterministic given the same seed.
+
+**WHY:** Phase 1.2 calibration showed our 1.1 generator produces D-eff
+~0.38 at production scale, statistically indistinguishable from pure
+random sampling. The 0.85 quality gate is calibrated against full
+Sawtooth-style balanced overlap which includes this swap step. Until
+1.1.5 ships, the production design fails its own quality gate.
+
+**TARGET:** Production D-eff >= 0.85 on `config/taxonomy.yaml` at
+20 tasks x 4 alts. When the target is met, the sentinel test in
+`tests/unit/test_design_diagnostics.py`
+(`test_production_design_intentionally_fails_d_eff_gate`) flips its
+assertion to expect `passes_gates=True`.
+
+**PRIORITY:** Highest remaining Phase 1 work. Should ship before 1.4
+(MNL estimator) so we estimate against a design that earns its quality
+gate.
+
+**OPEN QUESTIONS:**
+- Greedy vs simulated annealing? Greedy is simpler and likely sufficient
+  at our scale; SA only matters if greedy gets stuck in local optima.
+  Default to greedy unless calibration shows otherwise.
+- Iteration cap: probably 1000-10000 swaps; tune based on D-eff
+  stability across seeds.
 
 ### 1.4 MNL estimator (MLE + bootstrap CIs)
 

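The 1.1.5 algorithm sketch in BACKLOG.md could look roughly like the following greedy loop. This is a minimal illustration, not the `kai` implementation: the function names (`d_eff`, `balanced_start`, `greedy_swap`), the array layout (tasks x alts x attributes of integer level indices), the parameters `max_iters` and `patience` (standing in for the N and M of the sketch), and the normalization N = tasks x alts are all assumptions. Exchanging one attribute's level between two alternatives leaves that attribute's level counts unchanged by construction, so this sketch carries no separate balance check.

```python
import numpy as np


def d_eff(design, n_levels):
    """Relative D-efficiency under uniform priors (assumed N = tasks * alts)."""
    n_tasks, n_alts, _ = design.shape
    blocks = []
    for a, k in enumerate(n_levels):
        # Effects coding: level 0 (reference) -> all -1, level i -> unit vector.
        table = np.vstack([-np.ones(k - 1), np.eye(k - 1)])
        blocks.append(table[design[:, :, a]])
    x = np.concatenate(blocks, axis=2)        # shape (tasks, alts, p)
    xc = x - x.mean(axis=1, keepdims=True)    # within-task centering
    info = np.einsum("tjp,tjq->pq", xc, xc)   # sum_t X_t' M X_t
    sign, logdet = np.linalg.slogdet(info)
    if sign <= 0:
        return 0.0                            # singular information matrix
    return float(np.exp(logdet / info.shape[0]) / (n_tasks * n_alts))


def balanced_start(n_tasks, n_alts, n_levels, seed):
    """Hypothetical stand-in for the 1.1 level-balanced starting design."""
    rng = np.random.default_rng(seed)
    slots = n_tasks * n_alts
    cols = [rng.permutation(np.arange(slots) % k).reshape(n_tasks, n_alts)
            for k in n_levels]
    return np.stack(cols, axis=2)


def greedy_swap(design, n_levels, seed, max_iters=2000, patience=500):
    """Accept a single-attribute swap only if D-efficiency strictly improves."""
    rng = np.random.default_rng(seed)         # one seeded stream -> deterministic
    best = design.copy()
    best_score = d_eff(best, n_levels)
    n_tasks, n_alts, n_attrs = best.shape
    stale = 0                                 # consecutive non-improving attempts
    for _ in range(max_iters):
        if stale >= patience:
            break
        a = rng.integers(n_attrs)
        t1, t2 = rng.integers(n_tasks, size=2)
        j1, j2 = rng.integers(n_alts, size=2)
        if best[t1, j1, a] == best[t2, j2, a]:
            stale += 1                        # swap would be a no-op
            continue
        cand = best.copy()
        cand[t1, j1, a], cand[t2, j2, a] = best[t2, j2, a], best[t1, j1, a]
        score = d_eff(cand, n_levels)
        if score > best_score:
            best, best_score, stale = cand, score, 0
        else:
            stale += 1
    return best, best_score


start = balanced_start(12, 3, [3, 2, 4], seed=7)
improved, score = greedy_swap(start, [3, 2, 4], seed=7)
```

Recomputing the full determinant on every attempt keeps the sketch short; a production version would likely want an incremental update, which is outside what the BACKLOG entry specifies.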
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,54 +10,72 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 **Tags:** Feature
 
-First Phase 1 work: the CBC design generator graduates from stub to
-implementation. Closes BACKLOG items 1.1 and 1.3.
+Phase 1.2 ships: design diagnostics. Closes BACKLOG 1.2. Surfaces a new
+follow-up captured as BACKLOG 1.1.5.
 
 ### Added (code)
-- `kai.design.cbc_generator.generate_cbc_design()` — Phase 1.1, balanced-
-  overlap method. Pure-numpy implementation; level-balanced per attribute
-  via a single seeded `numpy.random.default_rng(seed)` consumed in
-  attribute-id-sorted order. Returns `CBCDesign(d_efficiency=None)` —
-  diagnostics fill that field in Phase 1.2.
-- `tests/unit/test_cbc_generator.py` — output contract, level balance
-  (perfect when divisible, near-balanced otherwise with deterministic
-  alphabetical-first remainder distribution), argument validation,
-  determinism (the Phase 1.3 ADR-005 byte-identical-pickle contract,
-  verified across multiple seeds and against a different taxonomy),
-  and a production-size smoke test against `config/taxonomy.yaml` at
-  20 tasks × 4 alts × 8 attributes.
+
+- `kai.design.design_diagnostics.diagnose_cbc_design()` - computes
+  D-efficiency, per-attribute level balance, max level imbalance, and
+  duplicate-alternative count; checks against `shared.py` quality gates
+  and returns a `DesignReport`.
+- D-efficiency uses the standard MNL relative formulation under uniform
+  priors: `det(I)^(1/p) / N`, where `I = sum_t X_t' M X_t` is p x p (p =
+  number of coded parameters) and M is the J x J within-task centering
+  matrix. Catches task-degenerate designs (an attribute constant across
+  all alts in a task contributes zero information from that task).
+- Effects coding: deviation/sum-to-zero, alphabetically-first level as
+  reference. Numerical stability via `numpy.linalg.slogdet`; singular
+  information matrices report `d_efficiency=0.0` with a descriptive message.
+- `kai.shared.QUALITY_GATE_MAX_LEVEL_IMBALANCE = 0.15` - promoted from
+  a literal in `design_params.yaml` to a cross-module constant per Tenet 1.
+- `tests/unit/test_design_diagnostics.py` - 25+ tests covering output
+  contract, level balance, D-efficiency (hand-verifiable orthogonal
+  case = exactly 1.0; singular = 0.0), duplicate detection, gate logic,
+  Tenet 1 enforcement of defaults, and a production-config sentinel.
 
 ### Changed (code)
-- `cbc_generator.py` no longer raises `NotImplementedError` for
-  `method="balanced_overlap"`. `method="orthogonal"` and
-  `method="random"` remain unimplemented and raise `NotImplementedError`
-  with a message indicating they are recognized but pending Phase 2+.
-  Unknown methods raise `ValueError`.
+
+- `DesignReport.n_dominated_alternatives` renamed to
+  `n_duplicate_alternatives`. Strict dominance requires preference-
+  direction metadata not currently in the taxonomy schema; we count
+  duplicate alternatives within a task instead, which is well-defined
+  and a real pathology. Real dominance detection captured in BACKLOG.
 
 ### Decisions
-- **Balanced-overlap v1 = level-balanced random assignment**, no within-
-  task overlap minimization. Production-size smoke test shows worst-case
-  level imbalance of 2.5% — well under the 15% gate from
-  `design_params.yaml`. Adding overlap-minimization speculation would
-  violate Tenets 2 and 5; the call to add it (or not) is gated on a
-  measured D-efficiency result from Phase 1.2.
-- **Remainder distribution is deterministic by alphabetical id sort**
-  when `n_slots` is not divisible by `n_levels`. Documented in the
-  generator's algorithm docstring.
-- **Generator is a pure function returning `d_efficiency=None`.** Any
-  reject-and-regenerate loop based on a quality threshold lives above
-  the generator (in a session-creation orchestrator), not inside it —
-  matches the stub's "Filled by diagnostics" comment and keeps the
-  generator independently testable.
+
+- **Production D-efficiency lands at ~0.38**, well below the 0.85 gate.
+  Calibration over 20 seeds shows our generator is statistically
+  indistinguishable from pure random sampling. The 0.85 gate is
+  calibrated against full Sawtooth balanced overlap (swap-based D-eff
+  optimization on top of level balancing). Captured as new BACKLOG
+  item 1.1.5; preferred over weakening the gate. Estimation still works
+  at D-eff 0.38 (wider CIs, not wrong answers), so the intermediate
+  state is survivable.
+- **Sentinel test for 1.1.5**:
+  `test_production_design_intentionally_fails_d_eff_gate` asserts the
+  current failure with a docstring explaining the path forward. When
+  1.1.5 ships, the assertion flips.
 
 ### Verified
-- Byte-identical pickle output across separate Python processes.
-- Byte-identical pickle output across `PYTHONHASHSEED` variation
-  (hash-order dependence ruled out).
+
+- D-efficiency matches hand-computed value (1.0 exactly) on orthogonal
+  2-attribute 2-level case.
+- Singular designs correctly report `d_efficiency=0.0` and fail gates.
+- Production calibration: our generator vs pure random vs random-with-
+  no-duplicates over 20 seeds each gave 0.384 / 0.371 / 0.372 mean
+  D-eff respectively (statistically indistinguishable).
 
 ### Closes (BACKLOG)
-- 1.1 CBC design generator (balanced overlap method)
-- 1.3 Determinism test (covered by `TestDeterminism` in the new test file)
+
+- 1.2 Design diagnostics (D-efficiency, level balance)
+
+### New BACKLOG item
+
+- 1.1.5 Within-task overlap minimization (swap-based D-efficiency
+  optimization). Target: production D-eff >= 0.85. Highest priority
+  remaining Phase 1 work; should land before 1.4 (MNL estimator).
+
 
 
 ## [0.2.1] — 2026-04-26 — Resilience + observability foundations

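For readers checking the CHANGELOG formula by hand, the computation `det(I)^(1/p) / N` with `I = sum_t X_t' M X_t` might be sketched as below. This is an illustration under assumptions, not the code in `design_diagnostics.py`: `effects_code` and `d_efficiency` are hypothetical names, level index 0 stands in for the alphabetically-first reference level, and N is taken to be the total number of alternative rows (tasks x alts), which is the normalization under which the orthogonal 2-attribute 2-level case comes out at 1.0.

```python
import numpy as np


def effects_code(levels, n_levels):
    """Sum-to-zero coding: reference level 0 -> all -1, level i -> +1 in column i-1."""
    table = np.vstack([-np.ones(n_levels - 1), np.eye(n_levels - 1)])
    return table[np.asarray(levels)]


def d_efficiency(design, n_levels):
    """design: int array (tasks, alts, attrs); n_levels: level count per attribute."""
    n_tasks, n_alts, n_attrs = design.shape
    # J x J within-task centering matrix M = I_J - (1/J) 11'
    center = np.eye(n_alts) - np.ones((n_alts, n_alts)) / n_alts
    p = sum(k - 1 for k in n_levels)
    info = np.zeros((p, p))
    for t in range(n_tasks):
        x_t = np.hstack([effects_code(design[t, :, a], n_levels[a])
                         for a in range(n_attrs)])
        info += x_t.T @ center @ x_t
    # slogdet avoids overflow/underflow of the raw determinant
    sign, logdet = np.linalg.slogdet(info)
    if sign <= 0:
        return 0.0  # singular information matrix
    return float(np.exp(logdet / p) / (n_tasks * n_alts))


# Orthogonal 2-attribute, 2-level design: two tasks of two alternatives each.
orthogonal = np.array([[[1, 1], [0, 0]],
                       [[1, 0], [0, 1]]])
# Task-degenerate design: attribute 0 is constant within every task.
degenerate = np.array([[[1, 1], [1, 0]],
                       [[0, 0], [0, 1]]])

eff_orthogonal = d_efficiency(orthogonal, [2, 2])
eff_degenerate = d_efficiency(degenerate, [2, 2])
```

In the degenerate case the centered column for attribute 0 is all zeros in every task, so its row and column of the information matrix vanish and the sketch returns 0.0.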
diff --git a/PROJECT_KNOWLEDGE.txt b/PROJECT_KNOWLEDGE.txt
@@ -61,9 +61,9 @@ Purpose:
   │   │
   │   ├── design/
   │   │   ├── plugin.py             DesignPlugin lifecycle
-  │   │   ├── cbc_generator.py      CBC design generation (stub)
+  │   │   ├── cbc_generator.py      CBC design generation
   │   │   ├── maxdiff_generator.py  MaxDiff BIBD generation (stub)
-  │   │   └── design_diagnostics.py D-efficiency, level balance (stub)
+  │   │   └── design_diagnostics.py D-efficiency, level balance
   │   │
   │   ├── estimation/
   │   │   ├── plugin.py             EstimationPlugin lifecycle
@@ -146,6 +146,7 @@ Purpose:
   THRESHOLDS
     MIN_OBS_PARAMS_RATIO
     QUALITY_GATE_MIN_D_EFFICIENCY
+    QUALITY_GATE_MAX_LEVEL_IMBALANCE
 
   ERROR TYPES
     KaiError (base)
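The diff does not spell out how level imbalance is measured against `QUALITY_GATE_MAX_LEVEL_IMBALANCE`. One plausible reading, shown here purely as an assumption, is the largest relative deviation of any level's count from the ideal (equal-share) count. The helper name `max_level_imbalance`, the 3-level example attribute, and the `<=` comparison are hypothetical.

```python
import numpy as np

QUALITY_GATE_MAX_LEVEL_IMBALANCE = 0.15  # value stated in the CHANGELOG diff


def max_level_imbalance(levels, n_levels):
    """Largest relative deviation of a level count from the ideal count (assumed definition)."""
    counts = np.bincount(np.asarray(levels).ravel(), minlength=n_levels)
    ideal = counts.sum() / n_levels
    return float(np.max(np.abs(counts - ideal)) / ideal)


# 80 slots (20 tasks x 4 alts) spread over a hypothetical 3-level attribute:
# the counts come out as 27, 27, 26 against an ideal of 80 / 3.
column = np.arange(80) % 3
imbalance = max_level_imbalance(column, 3)
passes = imbalance <= QUALITY_GATE_MAX_LEVEL_IMBALANCE
```

Under this definition the example attribute sits at 2.5% imbalance, comfortably inside a 15% gate.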