Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 37 additions & 8 deletions BACKLOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,14 +33,43 @@ Goal: a working, tested CBC design generator and MNL estimator on synthetic
data. By the end of this phase, we can hand-run estimation against fake
respondents and verify the math is correct.

### 1.2 Design diagnostics (D-efficiency, level balance)

**WHAT:** Implement `kai.design.design_diagnostics.diagnose_cbc_design()`
returning a `DesignReport` with D-efficiency, level frequencies, dominated
alternative count, pass/fail vs gates.

**WHY:** Quality gate before any human sees the questionnaire. Bad designs
waste your time and corrupt estimation.
### 1.1.5 Within-task overlap minimization (swap-based D-efficiency)

**WHAT:** Extend `kai.design.cbc_generator.generate_cbc_design()` for
`method="balanced_overlap"` to perform swap-based D-efficiency optimization
on top of the level-balanced sampling 1.1 already does.

Algorithm sketch: starting from the 1.1 level-balanced design, iterate up
to N times. In each iteration: pick a random pair of alternatives (within
the same task or across tasks), try swapping a single attribute's level
between them, accept the swap iff D-efficiency improves AND the swap
doesn't break level balance per attribute. Stop when no improving swap
is found in M consecutive attempts.

Must remain deterministic given the same seed.

**WHY:** Phase 1.2 calibration showed our 1.1 generator produces D-eff
~0.38 at production scale, statistically indistinguishable from pure
random sampling. The 0.85 quality gate is calibrated against full
Sawtooth-style balanced overlap which includes this swap step. Until
1.1.5 ships, the production design fails its own quality gate.

**TARGET:** Production D-eff >= 0.85 on `config/taxonomy.yaml` at
20 tasks x 4 alts. When the target is met, the sentinel test in
`tests/unit/test_design_diagnostics.py`
(`test_production_design_intentionally_fails_d_eff_gate`) flips its
assertion to expect `passes_gates=True`.

**PRIORITY:** Highest remaining Phase 1 work. Should ship before 1.4
(MNL estimator) so we estimate against a design that earns its quality
gate.

**OPEN QUESTIONS:**
- Greedy vs simulated annealing? Greedy is simpler and likely sufficient
at our scale; SA only matters if greedy gets stuck in local minima.
Default greedy unless calibration shows otherwise.
- Iteration cap: probably 1000-10000 swaps; tune based on D-eff
stability across seeds.

### 1.4 MNL estimator (MLE + bootstrap CIs)

Expand Down
94 changes: 56 additions & 38 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,54 +10,72 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

**Tags:** Feature

First Phase 1 work: the CBC design generator graduates from stub to
implementation. Closes BACKLOG items 1.1 and 1.3.
Phase 1.2 ships: design diagnostics. Closes BACKLOG 1.2. Surfaces a new
follow-up captured as BACKLOG 1.1.5.

### Added (code)
- `kai.design.cbc_generator.generate_cbc_design()` — Phase 1.1, balanced-
overlap method. Pure-numpy implementation; level-balanced per attribute
via a single seeded `numpy.random.default_rng(seed)` consumed in
attribute-id-sorted order. Returns `CBCDesign(d_efficiency=None)` —
diagnostics fill that field in Phase 1.2.
- `tests/unit/test_cbc_generator.py` — output contract, level balance
(perfect when divisible, near-balanced otherwise with deterministic
alphabetical-first remainder distribution), argument validation,
determinism (the Phase 1.3 ADR-005 byte-identical-pickle contract,
verified across multiple seeds and against a different taxonomy),
and a production-size smoke test against `config/taxonomy.yaml` at
20 tasks × 4 alts × 8 attributes.

- `kai.design.design_diagnostics.diagnose_cbc_design()` - computes
D-efficiency, per-attribute level balance, max level imbalance, and
duplicate-alternative count; checks against shared.py quality gates
and returns a `DesignReport`.
- D-efficiency uses the standard MNL relative formulation under uniform
priors: `det(I)^(1/p) / N` where `I = sum_t X_t' M X_t` and M is the
J x J within-task centering matrix. Catches task-degenerate designs
(an attribute constant across all alts in a task contributes zero
information from that task).
- Effects coding: deviation/sum-to-zero, alphabetically-first level as
reference. Numerical stability via `numpy.linalg.slogdet`; singular
information matrices report d_efficiency=0.0 with descriptive message.
- `kai.shared.QUALITY_GATE_MAX_LEVEL_IMBALANCE = 0.15` - promoted from
literal in design_params.yaml to cross-module constant per Tenet 1.
- `tests/unit/test_design_diagnostics.py` - 25+ tests covering output
contract, level balance, D-efficiency (hand-verifiable orthogonal
case = exactly 1.0; singular = 0.0), duplicate detection, gate logic,
Tenet 1 enforcement of defaults, and a production-config sentinel.

### Changed (code)
- `cbc_generator.py` no longer raises `NotImplementedError` for
`method="balanced_overlap"`. `method="orthogonal"` and
`method="random"` remain unimplemented and raise `NotImplementedError`
with a message indicating they are recognized but pending Phase 2+.
Unknown methods raise `ValueError`.

- `DesignReport.n_dominated_alternatives` renamed to
`n_duplicate_alternatives`. Strict dominance requires preference-
direction metadata not currently in the taxonomy schema; we count
duplicate alternatives within a task instead, which is well-defined
and a real pathology. Real dominance detection captured in BACKLOG.

### Decisions
- **Balanced-overlap v1 = level-balanced random assignment**, no within-
task overlap minimization. Production-size smoke test shows worst-case
level imbalance of 2.5% — well under the 15% gate from
`design_params.yaml`. Adding overlap-minimization speculation would
violate Tenets 2 and 5; the call to add it (or not) is gated on a
measured D-efficiency result from Phase 1.2.
- **Remainder distribution is deterministic by alphabetical id sort**
when `n_slots` is not divisible by `n_levels`. Documented in the
generator's algorithm docstring.
- **Generator is a pure function returning `d_efficiency=None`.** Any
reject-and-regenerate loop based on a quality threshold lives above
the generator (in a session-creation orchestrator), not inside it —
matches the stub's "Filled by diagnostics" comment and keeps the
generator independently testable.

- **Production D-efficiency lands at ~0.38**, well below the 0.85 gate.
Calibration over 20 seeds shows our generator is statistically
indistinguishable from pure random sampling. The 0.85 gate is
calibrated against full Sawtooth balanced overlap (swap-based D-eff
optimization on top of level balancing). Captured as new BACKLOG
item 1.1.5; preferred over weakening the gate. Estimation still works
at D-eff 0.38 (wider CIs, not wrong answers), so the intermediate
state is survivable.
- **Sentinel test for 1.1.5**:
`test_production_design_intentionally_fails_d_eff_gate` asserts the
current failure with a docstring explaining the path forward. When
1.1.5 ships, the assertion flips.

### Verified
- Byte-identical pickle output across separate Python processes.
- Byte-identical pickle output across `PYTHONHASHSEED` variation
(hash-order dependence ruled out).

- D-efficiency matches hand-computed value (1.0 exactly) on orthogonal
2-attribute 2-level case.
- Singular designs correctly report d_efficiency=0.0 and fail gates.
- Production calibration: our generator vs pure random vs random-with-
no-duplicates over 20 seeds each gave 0.384 / 0.371 / 0.372 mean
D-eff respectively (statistically indistinguishable).

### Closes (BACKLOG)
- 1.1 CBC design generator (balanced overlap method)
- 1.3 Determinism test (covered by `TestDeterminism` in the new test file)

- 1.2 Design diagnostics (D-efficiency, level balance)

### New BACKLOG item

- 1.1.5 Within-task overlap minimization (swap-based D-efficiency
optimization). Target: production D-eff >= 0.85. Highest priority
remaining Phase 1 work; should land before 1.4 (MNL estimator).



## [0.2.1] — 2026-04-26 — Resilience + observability foundations
Expand Down
5 changes: 3 additions & 2 deletions PROJECT_KNOWLEDGE.txt
Original file line number Diff line number Diff line change
Expand Up @@ -61,9 +61,9 @@ Purpose:
│ │
│ ├── design/
│ │ ├── plugin.py DesignPlugin lifecycle
│ │ ├── cbc_generator.py CBC design generation (stub)
│ │ ├── cbc_generator.py CBC design generation
│ │ ├── maxdiff_generator.py MaxDiff BIBD generation (stub)
│ │ └── design_diagnostics.py D-efficiency, level balance (stub)
│ │ └── design_diagnostics.py D-efficiency, level balance
│ │
│ ├── estimation/
│ │ ├── plugin.py EstimationPlugin lifecycle
Expand Down Expand Up @@ -146,6 +146,7 @@ Purpose:
THRESHOLDS
MIN_OBS_PARAMS_RATIO
QUALITY_GATE_MIN_D_EFFICIENCY
QUALITY_GATE_MAX_LEVEL_IMBALANCE

ERROR TYPES
KaiError (base)
Expand Down
Loading
Loading