feat(gso): Genie Space Optimizer — complete optimizer engine (lossless contract, process spine, decision trace, observability)#202

Open
prashsub wants to merge 867 commits into main from fix/gso-optimizer-correctness-and-leakage


@prashsub prashsub commented May 6, 2026

Closes #86
Closes #182
Closes #183
Closes #184
Closes #189

Partially addresses #185 (preflight tile shipped; pre/post delta + ScanSummary still TODO)


Summary

This PR delivers the complete Genie Space Optimizer engine built on top of the existing scaffold. It covers ~900 commits across the packages/genie-space-optimizer/ package, touching 165 source files (+86k lines) and 470 test files (+83k lines).

Correctness & leakage fixes

  • Validator tolerates runtime underscore-prefixed keys; question_id deduplication in cluster_failures; root-cause cascade hardening; vacuous filter rejection; resume display fixes
  • Canonical QID extraction (_qid_extraction.py) as single source of truth across harness and GT-correction; fixes Cycle 8 bug where GT-correction candidates were silently dropped

Lossless contract & replay gate

  • Typed JourneyStage/JourneyTerminalState enum contract with validator and transition rules
  • Deterministic lever_loop_replay pure driver; frozen cycle fixtures (Cycles 7–11); canonical journey JSON for byte-stable replay diffs
  • Lane-aware validation (trunk vs proposal lanes); cross-projection replay tests

Decision trace & unified observability (Phases B–H)

  • DecisionRecord with 10 typed DecisionTypes + AlternativeOption/RejectReason
  • OptimizationTrace + 9-section render_operator_transcript; ScoreboardSnapshot; classify_unresolved_qid priority-ladder bucketing
  • Per-stage stage_io_capture decorators; Phase H run_output_bundle with GSO_ARTIFACT_INDEX_V1 marker; per-iteration content completeness (pre-stamp → finalize at all 10 exit paths)
  • Evidence bundle CLI + lazy MLflow trace fetcher + replay_runid_fixture CLI + gso-postmortem skill

Process spine modularization (Phases F–G)

  • optimization/stages/ package (9 stages: evaluation → learning) with STAGES registry + StageHandler protocol
  • All stages wired with additive observability; uniform execute() alias; INPUT_CLASS/OUTPUT_CLASS per stage; JSON-serializable I/O contract enforced

RCA & control-plane hardening

  • Expanded RcaKind; rca_execution.py deterministic execution plans; causal acceptance gate; convergence quarantine attribution
  • blast_radius + dead_on_arrival producers; rca_groundedness.py unified gate; PATCH_APPLIED/RCA_FORMED/PROPOSAL_GENERATED decision records with RCA-grounding fields
  • Per-question lever preference; forbid_tables constraint propagation; classify_unresolved_qid priority ladder

Cycle-by-cycle optimizer improvements (Cycles 1–11)

  • Intra-AG body-fingerprint dedup; shared-cause blast radius; force structural synthesis (GSO_FORCE_STRUCTURAL_SYNTHESIS_ON_LEVER5_DROP); typed proposal-failure outcomes; productive-iteration budget; causal-drop strategist feedback; soft-cluster drift recovery; lever-6 SQL-shape forcing; invariant warn-and-degrade policy; MLflow eval hang defense (adaptive tier ladder + liveness watchdog + per-request OpenAI timeout); patch-acceptance reliability (W1–W8); RCA ungrounded records; AG levers union with recommended_levers; narrow L6 expression fallback; Cycle 11 invariant suite (I1–I8)

Dependency lockdown

  • All Python + frontend deps pinned to exact versions; mlflow aligned to 3.11.1; uv.lock regenerated; exact-only policy codified in AGENTS.md

Docs

  • docs/optimizer-process-design/ (11 files) — permanent reference architecture + interactive optimizer visualization
  • canonical-schema.md; optimizer iteration ledger (Cycles 1–11)

Test plan

  • 470+ test files: unit, integration, replay, snapshot
  • Cycle 8–11 real-run fixtures committed with zero-violation budget assertions
  • Byte-stable snapshot tests for transcript, decision trace, journey events, scoreboard, and alternatives ordering
  • Cycle 11 invariant suite (I1–I8) wired into iteration epilogue with pilot baseline assertions
  • Structural AST audit lint for inner-helper variable leaks

@prashsub prashsub requested a review from hiydavid May 6, 2026 12:44
prashsub added 29 commits May 6, 2026 08:17
…posals

Adds _drop_proposals_matching_rolled_back_content_fingerprints helper
and wires it BEFORE the existing _patch_forbidden check. Uses
patch_retry_signature[5] (content_fingerprint) to drop any proposal
whose content matches any prior rolled-back patch — irrespective of
rollback_class.

The wire-up builds a separate _all_rolled_back_patches_for_dedup
list from reflection_buffer because the existing
_rolled_back_patches_for_retry filters to CONTENT_REGRESSION only,
which would defeat the dedup's purpose (closing the iter-3/iter-4
non-CONTENT_REGRESSION re-emission gap).
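
A minimal sketch of the dedup described above, assuming each patch is a plain dict and that a signature function returns a tuple whose index 5 is the content fingerprint (as the message states for patch_retry_signature). The names drop_rolled_back_duplicates and toy_signature are hypothetical stand-ins, not the helper added by this commit.

```python
from typing import Callable, Iterable

Patch = dict  # hypothetical shape: one patch or proposal as a plain dict


def drop_rolled_back_duplicates(
    proposals: Iterable[Patch],
    rolled_back: Iterable[Patch],
    signature: Callable[[Patch], tuple],
) -> tuple[list[Patch], list[Patch]]:
    """Split proposals into (kept, dropped) by content fingerprint."""
    # Index 5 of the signature is the content fingerprint; rollback_class is ignored.
    seen = {signature(patch)[5] for patch in rolled_back}
    kept: list[Patch] = []
    dropped: list[Patch] = []
    for proposal in proposals:
        target = dropped if signature(proposal)[5] in seen else kept
        target.append(proposal)
    return kept, dropped


def toy_signature(patch: Patch) -> tuple:
    # Stand-in for patch_retry_signature: only slot 5 carries meaning here.
    return (None, None, None, None, None, patch["body"].strip().lower())


rolled_back = [{"body": "ADD JOIN flights", "rollback_class": "OTHER"}]
proposals = [{"body": "add join flights "}, {"body": "ADD FILTER year"}]
kept, dropped = drop_rolled_back_duplicates(proposals, rolled_back, toy_signature)
```

The rolled-back list here is deliberately not filtered by rollback_class, which is the point of building a separate list for dedup.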

Co-authored-by: Isaac
…-A burndown log

- airline_real_v1_cycle10_raw.json: commit real-run cycle 10 replay fixture
- burn-down-to-merge-roadmap.md: Phase D/E status progress, cycle 10 notes
- phase-a-burndown-log.md: append cycle 10 run entry
Lands the foundational scaffolding for Phase F stage-aligned
modularization:
  - stages/__init__.py exports StageContext, StageHandler
  - stages/_context.py defines StageContext dataclass (run_id,
    iteration, journey/decision emit hooks, MLflow anchor, feature
    flags)
  - stages/_protocol.py defines StageHandler[StageInputT, StageOutputT]
    Protocol with stage_key, decision_producer, execute()

Unit tests pin the public surface and the Protocol's
duck-typing contract. F2-F9 build on this skeleton.

Combines plan Tasks 1+2+3 into one commit since all three are part
of the same package skeleton and pass green together.

Co-authored-by: Isaac
…tion + classify_eval_rows wrappers (Phase F1)

Lands stages/evaluation.py with:
  - EvaluationInput / EvaluationResult dataclasses
  - _classify_eval_rows: production partition using control_plane
    row_is_hard_failure / row_is_passing / row_is_actionable_soft
  - _run_full_evaluation: thin wrapper around evaluation.run_evaluation
  - evaluate_baseline / evaluate_post_patch public entry points
  - eval_classification_records emission via ctx.decision_emit

Adapted eval_classification_records call to its actual signature
(eval_qids + classification dict, not rows). Partition-parity test
against lever_loop_replay._classify_eval_rows pins predicate
agreement.

Combines plan Tasks 4+5 into one commit since both are part of the
same module landing and pass green together.

Co-authored-by: Isaac
…ility gate (Phase F1)

Captures airline_real_v1.json replay output (canonical journey JSON,
canonical decision JSON, operator transcript, validation report)
into tests/replay/snapshots/before_f1.json. Used by the F1
byte-stability gate (Task 8) to assert F1's wrappers produce
byte-identical replay output.

The fixture is small and produces zero decision records / violations /
missing qids; the snapshot's primary value is pinning the empty-shape
of those collections so a future regression that adds spurious
records or violations would diff.

Co-authored-by: Isaac
…uation (Phase F1)

Wires harness.py:9924 (the per-iteration full_result_1 = run_evaluation
call inside _run_gate_checks) through stages.evaluation.evaluate_post_patch.
Also adds the F1 byte-stability replay gate
(tests/replay/test_phase_f1_byte_stable.py) which asserts
airline_real_v1 replay produces byte-identical canonical journey JSON,
canonical decision JSON, operator transcript, and validation report
against the pre-F1 snapshot.

Implementation notes:
  - Built a local _stage_ctx_full_eval at the call site with NO-OP
    journey_emit / decision_emit. Both _emit_eval_entry_journey
    and eval_classification_records are already emitted upstream
    in _run_lever_loop (lines 12097, 12218) based on cluster
    analysis; the wrapper must NOT double-emit, so its emits go to
    no-ops here. Subsequent F-plans absorb _run_lever_loop's
    surrounding orchestration into stages and let the wrapper own
    journey/decision emission directly.
  - Added EvaluationResult.raw passthrough field carrying the full
    evaluation.run_evaluation dict. The harness assigns
    full_result_1 = _eval_result.raw to preserve every downstream
    field access (asi_extraction_audit, scores, both_correct_rate,
    quarantined_benchmarks_qids, etc.) without enumeration.
  - The other 6 inline run_evaluation(...) call sites (lines 2013,
    3499, 6689, 9712, 9853, 18673) are intentionally untouched per
    plan scope.

Combines plan Tasks 6+8 into one commit. Task 9 final-cleanup
verifications (LOC, residual references, decision-emitter wiring)
all pass.

Co-authored-by: Isaac
Adds stages/rca_evidence.py with:
  - RcaEvidenceInput dataclass (eval_rows + per-qid judge + ASI metadata)
  - Stage2Evidence dataclass (per_qid_evidence, rca_kinds_by_qid,
    evidence_refs, promoted_to_top_n_qids). Renamed from RcaEvidence to
    avoid clash with the existing rca.RcaEvidence frozen dataclass.
  - collect(ctx, inp) entry that wraps rca._asi_finding_from_metadata
    + rca._top_n_collapse_metadata_override (PR-D promotion tracking).

Per the plan's Reality Check appendix: F2 is consolidate-style
observability-only. The harness has no for-_qid evidence-shaping
loop to "extract"; per-qid evidence is constructed inside
optimizer.cluster_failures via rca._asi_finding_from_metadata. F2
stands up a typed surface that F3 will consume; harness.py is
NOT modified.

Tests:
  - 5 unit tests covering dataclass shape, single-qid happy path,
    PR-D top-N promotion regression, empty inputs.
  - F2 byte-stability replay gate against airline_real_v1.json.

Combines plan Tasks 1+2+3+4 into one commit since F2 is a single
observability-only landing.

Co-authored-by: Isaac
Adds stages/clustering.py with:
  - ClusteringInput (eval_result_for_clustering, metadata_snapshot,
    soft_eval_result, held_out_qids, qid_state) — matches the actual
    optimizer.cluster_failures signature documented in the plan's
    Reality Check appendix.
  - ClusterFindings (clusters, soft_clusters,
    rejected_cluster_alternatives) — typed output for F4 to consume.
  - form(ctx, inp) entry that calls cluster_failures for hard and
    optionally soft, splitting promoted vs rejected by demoted_reason
    (no fabricated emit_rejected kwarg).

Plan deviations (with justification):
  - Task 2 step 3's example body used kwargs (rca_evidence,
    rca_kinds_by_qid, qids, eval_rows, emit_rejected) that the Reality
    Check explicitly contradicts. Implemented per the Reality Check's
    actual signature.
  - F3 acceptance criteria say "harness no longer invokes
    cluster_failures(...)" and "cluster_records / rca_formed_records
    emitted only from stages/clustering.py". These would require
    moving emission code from _run_lever_loop:12296+ (where it
    currently fires from _analysis output) into form() — a different
    call site with different timing relative to journey events.
    Doing that within F3's byte-stability gate is high-risk.
    Implemented as observability-only (matching F2's pattern); harness
    wiring + emission move deferred to a follow-up plan.

Tests: 5 unit tests covering dataclass shape, hard-only path, hard+
soft path, demoted_reason split, empty input. F3 byte-stability replay
gate against airline_real_v1.json.

Combines plan Tasks 1+2+3+4 into one commit since F3 is a single
observability-only landing.

Co-authored-by: Isaac
… F4)

Adds stages/action_groups.py with:
  - ActionGroupsInput (action_groups, source_clusters_by_id,
    rca_id_by_cluster, ag_alternatives_by_id) matching the actual
    decision_emitters.strategist_ag_records signature.
  - ActionGroupSlate (ags, rejected_ag_alternatives) typed output.
  - select(ctx, inp) that emits one STRATEGIST_AG_EMITTED
    DecisionRecord per AG via ctx.decision_emit, propagating
    target_qids / rca_id / root_cause / alternatives_considered.

Plan deviations (with justification):
  - Plan Task 2 step 3's example body fabricated kwargs
    (ag=ag, alternatives_considered=...) that don't match the actual
    strategist_ag_records signature.
  - Plan acceptance criteria say "harness no longer contains the
    strategist-invocation block, _drain_buffered_action_groups,
    _build_ag_alternatives_by_id" + "strategist_ag_records emitted
    only from stages/action_groups.py". Per the plan's own Reality
    Check, the strategist invocation is "NOT a single contiguous
    function" — it's ~300-500 LOC of inline LLM orchestration.
    Lifting that under F4's byte-stability gate is high-risk.
    Implemented as observability-only (matching F2/F3); harness
    wiring + helper moves deferred to a follow-up plan.

Tests: 5 unit tests covering shape, multi-AG emission, MISSING_TARGET_QIDS
reason for empty target_qids (Cycle-8-Bug-1 signal), empty input.
F4 byte-stability replay gate against airline_real_v1.json.

Combines plan Tasks 1+2+3+4 into one commit.

Co-authored-by: Isaac
Adds stages/proposals.py with:
  - ProposalsInput (proposals_by_ag, rca_id_by_cluster,
    cluster_root_cause_by_id, proposal_alternatives_by_ag) matching
    the actual decision_emitters.proposal_generated_records signature.
  - ProposalSlate (proposals_by_ag, rejected_proposal_alternatives,
    content_fingerprints_emitted) typed output.
  - generate(ctx, inp) that stamps content_fingerprint via
    reflection_retry.patch_retry_signature on every proposal, emits
    PROPOSAL_GENERATED records via ctx.decision_emit, and returns the
    fingerprinted slate.

Plan deviations (with justification):
  - Plan Task 2 step 3 fabricated patch_retry_signature kwargs
    (patch_text=, target_qids=, lever=) that don't match the actual
    signature (single dict argument returning a 6-tuple).
  - Plan acceptance criteria say "harness no longer invokes
    generate_proposals_from_strategy(...)" + "proposal_generated_records
    emitted only from stages/proposals.py". Per the plan's own
    Reality Check, the call site is preceded by ~50 LOC of
    cluster-driven-synthesis dispatch inside the per-AG loop.
    Lifting that under F5's byte-stability gate is high-risk.
    Implemented as observability-only (matching F2/F3/F4); harness
    wiring + helper moves deferred to a follow-up plan.

content_fingerprint joins 6-tuple components (with frozenset
section_set sorted+joined for stability) so F6's PR-E content-
fingerprint dedup can compare against rolled-back fingerprints.
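
As a rough illustration of the join described above, the sketch below flattens a 6-tuple into one string and sorts any set-valued component first. The "|" and "," separators and the example tuple fields are assumptions; only the "sort the frozenset, then join" idea comes from this message.

```python
def content_fingerprint(signature: tuple) -> str:
    """Join signature components into one order-stable string."""
    parts = []
    for component in signature:
        if isinstance(component, (set, frozenset)):
            # Sets iterate in arbitrary order, so sort before joining.
            parts.append(",".join(sorted(str(item) for item in component)))
        else:
            parts.append(str(component))
    return "|".join(parts)


# Hypothetical 6-tuples that differ only in frozenset construction order.
sig_a = ("L5", "add_instruction", "flights", frozenset({"joins", "filters"}), "q1", "body")
sig_b = ("L5", "add_instruction", "flights", frozenset({"filters", "joins"}), "q1", "body")
fp_a = content_fingerprint(sig_a)
fp_b = content_fingerprint(sig_b)
```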

Tests: 5 unit tests covering shape, multi-proposal emission with
fingerprints, MISSING_TARGET_QIDS skip (Cycle-8-Bug-1), empty input,
fingerprint consistency. F5 byte-stability replay gate against
airline_real_v1.json.

Combines plan Tasks 1+2+3+4 into one commit.

Co-authored-by: Isaac
…ble sub-handlers (Phase F6)

Adds stages/gates.py with:
  - GatesInput / GateOutcome / GateDrop dataclasses.
  - GATE_PIPELINE_ORDER = (content_fingerprint_dedup, lever5_structural,
    rca_groundedness, blast_radius, dead_on_arrival).
  - run_gate(name, ctx, inp): named entry per sub-handler.
  - filter(ctx, inp): runs every sub-handler in pipeline order,
    accumulating drops and DOA signatures.
  - 5 composable sub-handlers, each with field-driven minimal gate
    logic so unit tests can exercise drop conditions in isolation.
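
The pipeline shape above can be sketched as follows. This is a simplified illustration with two hypothetical field-driven predicates, not the five production sub-handlers; the real run_gate takes (name, ctx, inp), and the function is named filter_proposals here only to avoid shadowing the built-in filter.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateDrop:
    proposal_id: str
    gate: str


@dataclass
class GateOutcome:
    survivors: list = field(default_factory=list)
    drops: list = field(default_factory=list)


# Each predicate returns True when the proposal should be dropped.
GATES: dict[str, Callable[[dict], bool]] = {
    "rca_groundedness": lambda p: not p.get("rca_id"),
    "dead_on_arrival": lambda p: bool(p.get("noop")),
}
GATE_PIPELINE_ORDER = ("rca_groundedness", "dead_on_arrival")


def run_gate(name: str, proposals: list) -> GateOutcome:
    """Run one named gate in isolation."""
    if name not in GATES:
        raise KeyError(f"unknown gate: {name}")
    outcome = GateOutcome()
    for proposal in proposals:
        if GATES[name](proposal):
            outcome.drops.append(GateDrop(proposal["id"], name))
        else:
            outcome.survivors.append(proposal)
    return outcome


def filter_proposals(proposals: list) -> GateOutcome:
    """Run every gate in pipeline order, accumulating drops."""
    total = GateOutcome(survivors=list(proposals))
    for name in GATE_PIPELINE_ORDER:
        step = run_gate(name, total.survivors)
        total.survivors = step.survivors
        total.drops.extend(step.drops)
    return total


result = filter_proposals([
    {"id": "p1", "rca_id": "rca-1"},
    {"id": "p2"},
    {"id": "p3", "rca_id": "rca-2", "noop": True},
])
```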

Plan deviations (with justification):
  - Per the plan's Reality Check appendix: the four gate sites in
    harness are NOT contiguous, the Lever-5 gate has no single
    primitive to lift, and the DOA primitive is just the signature
    recorder. Lifting the production gate logic under F6's
    byte-stability gate is high-risk.
  - Implemented as observability-only (matching F2-F5): each
    sub-handler's logic is field-driven (patch_text presence,
    rca_id presence, affected_tables count, content_fingerprint
    matching, noop flag) so it's testable in isolation. Production
    gate logic stays in harness; this stage is the typed surface
    F7 will eventually consume.

Tests: 10 unit tests covering shape, pipeline order, each
sub-handler in isolation, full filter() pipeline, unknown-gate
error path. F6 byte-stability replay gate against airline_real_v1.json.

Combines plan Tasks 1-9 into one commit since F6 is a single
observability-only landing.

Co-authored-by: Isaac
Adds stages/application.py with:
  - ApplicationInput (applied_entries_by_ag, ags, rca_id_by_cluster,
    cluster_root_cause_by_id) matching the actual
    decision_emitters.patch_applied_records signature.
  - AppliedPatch (proposal_id, ag_id, patch_type, target_qids,
    cluster_id, content_fingerprint, rolled_back_immediately,
    rollback_reason).
  - AppliedPatchSet (applied tuple + applied_signature SHA256-based
    cycle-detection hash).
  - apply(ctx, inp) entry that converts apply_log entries to
    AppliedPatch records, emits PATCH_APPLIED via ctx.decision_emit,
    and returns the typed slate.
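
One way a SHA256-based applied_signature could be made deterministic is sketched below. Which fields feed the hash is an assumption made for illustration; the message only says the signature is SHA256-based and used for cycle detection.

```python
import hashlib
import json


def applied_signature(applied: list[dict]) -> str:
    """Hash the applied patch set independently of application order."""
    # Assumed identity fields; the production choice may differ.
    keys = sorted(
        (p["ag_id"], p["proposal_id"], p["content_fingerprint"]) for p in applied
    )
    payload = json.dumps(keys, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


patch_a = {"ag_id": "AG1", "proposal_id": "p1", "content_fingerprint": "fp-a"}
patch_b = {"ag_id": "AG2", "proposal_id": "p2", "content_fingerprint": "fp-b"}
sig_forward = applied_signature([patch_a, patch_b])
sig_reverse = applied_signature([patch_b, patch_a])
```

Two iterations that apply the same set of patches then produce the same signature, which is what a cycle detector needs to compare.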

Plan deviations (with justification):
  - Plan referenced applier.apply_levers_to_config which doesn't exist;
    actual primitive is apply_patch_set per the Reality Check.
  - Plan acceptance criteria say "harness no longer invokes
    apply_levers_to_config". F7 is observability-only (matching F2-F6);
    harness wiring + apply call + post-apply verification block move
    deferred to a follow-up plan because the FailedRollbackVerification
    path is intertwined with downstream eval logic at harness.py:16845.

Tests: 6 unit tests covering shape, multi-entry emission, immediate
rollback marker propagation, empty input, deterministic
applied_signature. F7 byte-stability replay gate against
airline_real_v1.json.

Combines plan Tasks 1-4 into one commit since F7 is a single
observability-only landing.

Co-authored-by: Isaac
Adds stages/acceptance.py with:
  - AcceptanceInput (applied_entries_by_ag, ags, baseline/candidate
    accuracy, baseline/candidate pre_arbiter_accuracy [PR-E],
    pre_rows/post_rows, protected_qids, min_gain_pp,
    min_pre_arbiter_gain_pp). Field names match the actual
    decide_control_plane_acceptance signature per the Reality Check.
  - AgOutcomeRecord (ag_id, outcome, reason_code, target_qids,
    affected_qids, content_fingerprints).
  - AgOutcome (outcomes_by_ag, qid_resolutions,
    rolled_back_content_fingerprints) — the latter is the typed surface
    F6's content-fingerprint dedup gate consumes on the next iteration.
  - decide(ctx, inp): per-AG control_plane gate + ACCEPTANCE_DECIDED
    emission + per-qid QID_RESOLUTION emission via
    post_eval_resolution_records.

Plan deviations (with justification):
  - Plan Task 1's dataclass used baseline_post_arbiter_accuracy /
    baseline_qid_pass_states / etc. that DON'T match the actual
    decide_control_plane_acceptance kwargs. Reality Check explicitly
    flagged this; I corrected to baseline_accuracy / pre_rows /
    post_rows / etc. matching the real signature.
  - Plan referenced decision.outcome / decision.affected_qids fields
    that don't exist on ControlPlaneAcceptance (which has accepted +
    reason_code). I map decision.accepted + reason_code → outcome
    string in _outcome_string.
  - Plan asked to delete ag_outcome.py + post_eval.py thin modules.
    Deferred — F1's stages/evaluation.py imports
    eval_entry._emit_eval_entry_journey, and harness still imports
    post_eval._emit_post_eval_journey. Deletion is a coordinated
    follow-up.

PR-E regression test: post-arbiter saturated at 91.7% with 22/24
already arbiter-rescued + pre-arbiter improved 4.2pp → accepted with
reason_code=accepted_pre_arbiter_improvement, the cycle-10 saturation
ceiling that PR-E lifted.
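
The PR-E rule above can be illustrated with a small sketch. The thresholds, the pre-arbiter numbers, and every reason code except accepted_pre_arbiter_improvement are assumptions for this example; the real decide_control_plane_acceptance takes more inputs (rows, protected qids) than shown here.

```python
def decide_acceptance(
    baseline_accuracy: float,
    candidate_accuracy: float,
    baseline_pre_arbiter_accuracy: float,
    candidate_pre_arbiter_accuracy: float,
    min_gain_pp: float = 1.0,
    min_pre_arbiter_gain_pp: float = 1.0,
) -> tuple[bool, str]:
    """Return (accepted, reason_code) from accuracy deltas in percentage points."""
    gain = candidate_accuracy - baseline_accuracy
    pre_gain = candidate_pre_arbiter_accuracy - baseline_pre_arbiter_accuracy
    if gain >= min_gain_pp:
        return True, "accepted_post_arbiter_improvement"  # hypothetical code
    # Saturated post-arbiter score: fall back to the pre-arbiter signal.
    if gain >= 0 and pre_gain >= min_pre_arbiter_gain_pp:
        return True, "accepted_pre_arbiter_improvement"
    return False, "rejected_insufficient_gain"  # hypothetical code


# Post-arbiter flat at 91.7%, pre-arbiter up 4.2pp (illustrative pre-arbiter values).
saturated = decide_acceptance(91.7, 91.7, 70.8, 75.0)
regressed = decide_acceptance(91.7, 87.5, 70.8, 75.0)
```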

Tests: 7 unit tests covering shape, PR-E acceptance regression,
rollback path with collateral, post-eval resolution emission,
empty input. F8 byte-stability replay gate against airline_real_v1.json.

Combines plan Tasks 1-5 into one commit since F8 is a single
observability-only landing.

Co-authored-by: Isaac
Adds stages/learning.py with:
  - LearningInput (prior_reflection_buffer, prior_do_not_retry,
    prior_rolled_back_content_fingerprints, ag_outcomes_by_id,
    applied_signature, accuracy_delta, current_hard_failure_qids,
    regression_debt_qids, quarantined_qids, sql_delta_qids,
    pending_buffered_ags, diagnostic_action_queue).
  - LearningUpdate (new_reflection_buffer, new_do_not_retry,
    new_rolled_back_content_fingerprints, terminal_decision,
    retired_ags [PR-B2], ag_retired_records).
  - update(ctx, inp) entry that:
    * appends to reflection_buffer
    * accumulates do-not-retry signatures
    * accumulates rolled-back content fingerprints (PR-E groundwork)
    * resolves terminal status via rca_terminal.resolve_terminal_on_plateau
    * emits one AG_RETIRED DecisionRecord per retired AG (PR-B2)

Plan deviations (with justification):
  - Per the plan's Reality Check, the harness's end-of-iteration
    learning logic is intertwined with break/continue control flow
    and stdout banner emission. Lifting that under F9's byte-stability
    gate is high-risk. Implemented as observability-only (matching
    F2-F8); harness wiring + helper moves deferred to follow-up.

Tests: 7 unit tests covering shape, reflection buffer append,
rolled-back fingerprint accumulation (PR-E), AG_RETIRED emission
(PR-B2 regression), terminal_decision shape, empty input.
F9 byte-stability replay gate against airline_real_v1.json.

Combines plan Tasks 1-5 into one commit since F9 is a single
observability-only landing.

Phase F closeout: all 9 stage modules now exist as parallel typed
surfaces. The harness still owns the production wiring; full
extraction (replacing harness orchestration with direct stage
calls + deleting moved helpers) is a follow-up multi-PR project.

Co-authored-by: Isaac
Decorate StageHandler Protocol with @runtime_checkable so isinstance
checks become valid. Adds 3 unit tests pinning the behavior:
  - _is_runtime_protocol attribute is True
  - isinstance accepts an object with execute()
  - isinstance rejects an object without execute()

Phase G-lite Task 1.

Co-authored-by: Isaac
…e G-lite)

Each stage module now exposes both:
  - The human-readable named verb (evaluate_post_patch, collect, form,
    select, generate, filter, apply, decide, update) — preserved for
    harness call sites.
  - A uniform ``execute`` module-level alias — what the registry,
    conformance test, and Phase H's capture decorator will import.

Conformance test (tests/unit/test_stage_conformance.py) verifies all
9 stage modules expose both surfaces with identical callable identity.

Phase G-lite Task 2.

Co-authored-by: Isaac
Closes the F1 weak point (eval_kwargs: dict[str, Any]) with a typed
TypedDict mirroring evaluation.run_evaluation's 25-parameter
signature. total=False keeps every key optional so the harness can
construct partial kwargs dicts as it does today.
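
For readers unfamiliar with total=False, a minimal sketch follows. The three keys are hypothetical placeholders; the real RunEvaluationKwargs mirrors the 25 parameters of evaluation.run_evaluation.

```python
from typing import TypedDict


class RunEvaluationKwargs(TypedDict, total=False):
    # Hypothetical keys for illustration only.
    space_id: str
    iteration: int
    max_workers: int


# With total=False every key is optional, so a partial dict type-checks.
partial_kwargs: RunEvaluationKwargs = {"space_id": "demo-space"}
partial_kwargs["iteration"] = 3
```

At runtime a TypedDict is an ordinary dict, which is why the annotation swap in the next commit can be behavior-neutral.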

Re-exported from stages/__init__.py alongside StageContext and
StageHandler.

Phase G-lite Task 3.

Co-authored-by: Isaac
…gs (Phase G-lite)

Replace dict[str, Any] annotation on eval_kwargs at the F1 sites with
the typed RunEvaluationKwargs TypedDict. Annotation-only — no runtime
behavior change. F1 byte-stability replay test confirms.

Sites updated:
  - stages/evaluation.py: 4 annotations
    (_run_full_evaluation, evaluate_baseline, evaluate_post_patch,
    _evaluate)
  - harness.py: 1 annotation on _eval_kwargs_full at the F1 wire-up

Phase G-lite Task 4.

Co-authored-by: Isaac
stages/_registry.py exports STAGES: tuple[StageEntry, ...] in
canonical 9-stage process order, with each entry carrying
(stage_key, module, execute). get_stage(stage_key) provides keyed
lookup; raises KeyError on unknown keys.
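
A sketch of the registry shape, under the assumption that stage keys match the stage module names listed elsewhere in this PR. Only two entries are shown, and the passthrough execute is a placeholder.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class StageEntry:
    stage_key: str
    module: Any
    execute: Callable[..., Any]


def _passthrough(ctx: Any, inp: Any) -> Any:
    # Placeholder for a real stage's execute().
    return inp


# Tuple order is the process order; keys here are assumed, not confirmed.
STAGES: tuple[StageEntry, ...] = (
    StageEntry("evaluation", None, _passthrough),
    StageEntry("learning", None, _passthrough),
)
_BY_KEY = {entry.stage_key: entry for entry in STAGES}


def get_stage(stage_key: str) -> StageEntry:
    """Keyed lookup; an unknown key raises KeyError from the dict access."""
    return _BY_KEY[stage_key]
```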

The registry is the single source of truth for "what stages exist
and in what order" until Phase H promotes the keys to
run_output_contract.PROCESS_STAGE_ORDER.

Phase G-lite Task 5.

Co-authored-by: Isaac
…hase G-lite)

Extends test_stage_conformance.py with three new assertions per the
plan's Task 6:
  - STAGE_KEY constant on each stage module matches the canonical
    9-stage process key.
  - Each stage module satisfies isinstance(module, StageHandler)
    (Protocol's @runtime_checkable contract).
  - The STAGES registry entries' stage_keys agree with the modules'
    STAGE_KEY constants (no drift).

Also narrows StageHandler Protocol to only require execute() (the
runtime-checkable check). The earlier draft included
stage_key/decision_producer ClassVar declarations, which made
isinstance(module, StageHandler) fail because modules expose
STAGE_KEY (uppercase) per the plan's pin. STAGE_KEY validity is
checked separately by the conformance test, matching the plan's
documented "ClassVar checked via hasattr + value validation"
strategy.
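
The narrowing above can be reproduced with a small sketch. Both protocol classes and the fake module are hypothetical; the point is that a runtime-checkable isinstance check only looks for the declared member names, so a lowercase stage_key member fails against a module that exposes STAGE_KEY.

```python
import types
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class NarrowStageHandler(Protocol):
    def execute(self, ctx: Any, inp: Any) -> Any: ...


@runtime_checkable
class WideStageHandler(Protocol):
    stage_key: str  # lowercase member, as in the earlier draft

    def execute(self, ctx: Any, inp: Any) -> Any: ...


# Stand-in for a stage module: uppercase constant plus module-level execute.
stage_module = types.ModuleType("fake_stage")
stage_module.STAGE_KEY = "clustering"
stage_module.execute = lambda ctx, inp: inp

narrow_ok = isinstance(stage_module, NarrowStageHandler)
wide_ok = isinstance(stage_module, WideStageHandler)
```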

Phase G-lite Task 6.

Co-authored-by: Isaac
Adds a no-resurrection guard for ag_outcome.py / post_eval.py.

Plan deviation: the G-lite plan assumed F8 deleted both modules. My
F8 execution explicitly deferred the deletion (modules still exist as
shims). The test is marked xfail(strict=True) so it acts as a signal
flag — activates automatically when a follow-up actually deletes the
modules, prompting the operator to remove the xfail marker and turn
the test into a real no-resurrection guard.

Phase G-lite Task 7.

Co-authored-by: Isaac
Replace the old Phase G section (full freeze + mypy strict scope,
~7-10 days) with the G-lite section (Stage Protocol + registry +
RunEvaluationKwargs, ~1-2 days). Updates the at-a-glance table,
post-merge calendar estimate (~3-5 weeks → ~2-3.5 weeks), Real-Genie
cost summary prose, and cross-references row.

Documents what's deliberately out of scope for G-lite and why the
full-freeze scope was deferred (replay byte-stability already catches
behavioral regressions; freezing carries non-trivial breakage risk;
mypy --strict has permanent maintenance tax for marginal benefit on
a probabilistic codebase).

Phase G-lite Task 8.

Co-authored-by: Isaac
…ndle (Phase H pre-step)

Drops process-order numbering from the F2 stage's output dataclass to
match the sibling stage convention (ClusterFindings, ProposalSlate,
GateOutcome, AppliedPatchSet, LearningUpdate) — natural noun for the
role, no Stage<N> prefix. "Bundle" is the canonical name for typed
containers of per-qid records and stays distinct from the existing
rca.RcaEvidence (singular evidence atom).

Phase H Task 3 declares OUTPUT_CLASS on every stage module; this
rename makes the rca_evidence module's OUTPUT_CLASS consistent with
the rest of the registry instead of leaking process-order numbering
into the artifact contract.

Replay byte-stability gate is green (relocation-only refactor — same
emitted behavior, the dataclass body is identical, only the symbol
name changes). 3148 unit + replay tests pass; only the 2 known pre-
existing failures remain.

Co-authored-by: Isaac
Adds the canonical Run Output Contract module: GSO_BUNDLE_ROOT,
RunRole enum, ProcessStage dataclass, the 11-entry PROCESS_STAGE_ORDER
(9 executable stages + Stage 1/8 split + contract_health meta), and
the path builders (iteration_bundle_prefix, stage_artifact_paths,
bundle_artifact_paths).

The module is import-pure — no MLflow, Spark, or Databricks SDK — so
that the transcript renderer, bundle assembler, evidence_bundle,
mlflow_audit, and gso-postmortem skill all share one source of truth
for vocabulary and paths.

Phase H Task 2 (next) will lock the STAGES ⊆ PROCESS_STAGE_ORDER rule
in a registry-reconciliation test.

Co-authored-by: Isaac
…ation (Phase H T2)

Conformance test enforcing that every G-lite STAGES key appears in
PROCESS_STAGE_ORDER in the same relative order, and that transcript-
only keys (post_patch_evaluation, contract_health) are explicitly
documented. Without this rail, future drift between the executable
registry and the bundle's transcript ordering would silently break
downstream postmortem tooling.
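
The "same relative order" rule is an ordered-subsequence check, sketched below. The key names and the positions of the two transcript-only keys are assumptions; only the counts (9 executable, 11 total) and the two transcript-only names come from these messages.

```python
def is_ordered_subsequence(keys, order) -> bool:
    """True when every key appears in order, in the same relative sequence."""
    remaining = iter(order)
    # `key in remaining` consumes the iterator up to the match, enforcing order.
    return all(key in remaining for key in keys)


STAGE_KEYS = (
    "evaluation", "rca_evidence", "clustering", "action_groups", "proposals",
    "gates", "application", "acceptance", "learning",
)
PROCESS_STAGE_ORDER = (
    "evaluation", "rca_evidence", "clustering", "action_groups", "proposals",
    "gates", "application", "post_patch_evaluation", "acceptance", "learning",
    "contract_health",
)
transcript_only = [k for k in PROCESS_STAGE_ORDER if k not in STAGE_KEYS]
```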

Co-authored-by: Isaac
…ase H T3)

Adds explicit INPUT_CLASS / OUTPUT_CLASS module-level declarations on
every stage module (evaluation, rca_evidence, clustering, action_groups,
proposals, gates, application, acceptance, learning). The Phase H I/O
capture decorator imports these to serialize each stage's typed input
and output to MLflow without relying on fragile annotation inspection
of the loosely-typed ``ctx`` parameter.

Also extends StageEntry to carry input_class + output_class and
populates them from the per-module declarations, plus a registry test
asserting the new fields are real types and identical to the module
declarations.

For rca_evidence the OUTPUT_CLASS is RcaEvidenceBundle (renamed from
Stage2Evidence in the prior commit) so the registry stays consistent
with the sibling stages' natural-noun output names.

3167 unit + replay tests pass; only the 2 known pre-existing failures
remain.

Co-authored-by: Isaac
…hase H T4)

Parametrized test that constructs minimal instances of every stage's
INPUT_CLASS and OUTPUT_CLASS, runs them through dataclasses.asdict +
json.dumps, and asserts the round-trip succeeds. 18 cases (9 stages ×
{input, output}) all pass.

The serializer mirrors what the Phase H capture decorator will use in
production: ``default=lambda v: list(v) if isinstance(v, (set,
frozenset)) else str(v)`` so set-typed fields (e.g. forbidden
signatures, do-not-retry sets) become plain, sortable lists in the
bundle without forcing every stage to switch its in-memory
representation. Postmortem readers don't need set semantics.

Co-authored-by: Isaac
Adds wrap_with_io_capture(execute, stage_key) which:
  * serializes the stage's typed input via dataclasses.asdict +
    json.dumps and writes it as iter_NN/stages/<NN>_<key>/input.json
    on the MLflow anchor run;
  * hooks ctx.decision_emit so decisions emitted while the wrapped
    execute runs are captured for decisions.json without breaking
    pass-through to the original emit callback;
  * runs the wrapped execute, serializes the output, writes
    output.json + decisions.json;
  * returns the output unchanged.

The decorator NEVER raises. MLflow log_text failures are caught and
warned — diagnostic capture must never break the optimizer. If
ctx.mlflow_anchor_run_id is None (e.g. replay tests), logging is
silently skipped.

Set fields are normalized to sorted lists for deterministic bundle
JSON; other opaque objects fall back to str() via json.dumps default.
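
The capture flow above can be sketched as follows. This is an illustration under stated assumptions: log_text is an injected callable rather than a real MLflow call, the artifact path is simplified, and DemoContext is a hypothetical stand-in for StageContext. In this sketch only capture failures are swallowed; exceptions from the wrapped execute still propagate.

```python
import json
import logging
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Callable, Optional

logger = logging.getLogger("io_capture_sketch")


def _to_json(obj: Any) -> str:
    def default(v: Any) -> Any:
        # Sets become sorted lists for deterministic output; the rest fall back to str().
        return sorted(v, key=str) if isinstance(v, (set, frozenset)) else str(v)

    data = asdict(obj) if is_dataclass(obj) and not isinstance(obj, type) else obj
    return json.dumps(data, default=default, sort_keys=True)


@dataclass
class DemoContext:
    mlflow_anchor_run_id: Optional[str]
    decision_emit: Callable[[dict], None]


def wrap_with_io_capture(execute, stage_key: str, log_text: Callable[[str, str], None]):
    def wrapped(ctx: DemoContext, inp: Any) -> Any:
        captured: list = []
        original_emit = ctx.decision_emit

        def tee(record: dict) -> None:
            captured.append(record)  # keep a copy for decisions.json
            original_emit(record)    # pass through to the original callback

        def safe_log(name: str, obj: Any) -> None:
            if ctx.mlflow_anchor_run_id is None:
                return  # e.g. replay tests: skip logging silently
            try:
                log_text(_to_json(obj), f"stages/{stage_key}/{name}")
            except Exception as exc:  # diagnostic capture must not break the run
                logger.warning("capture failed for %s/%s: %s", stage_key, name, exc)

        safe_log("input.json", inp)
        ctx.decision_emit = tee
        try:
            output = execute(ctx, inp)
        finally:
            ctx.decision_emit = original_emit
        safe_log("output.json", output)
        safe_log("decisions.json", captured)
        return output

    return wrapped
```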

Co-authored-by: Isaac
prashsub and others added 20 commits May 6, 2026 08:17
…es I8)

Adds select_plateau_currently_failing pure helper in harness.py that
picks candidate_eval_failing (default) or journey_ledger_hard_qids
(when the most recent acceptance was a rollback). Wires the helper at
the plateau call site at harness.py:13524 with
last_acceptance_was_rollback=False as a safe default; the existing
_current_hard_qids_raw is sourced from load_latest_state_iteration
(committed/journey-aligned), so the helper is byte-stable in this
codepath.
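
A sketch of the selection rule, assuming the helper returns the chosen source name alongside the qids. The return shape and sorting are assumptions for illustration; the parameter names follow this message.

```python
def select_plateau_currently_failing(
    candidate_eval_failing,
    journey_ledger_hard_qids,
    last_acceptance_was_rollback: bool = False,
):
    """Pick which failing-qid set feeds plateau detection, and say which one."""
    if last_acceptance_was_rollback:
        # After a rollback the candidate eval describes a discarded state.
        return "journey_ledger_hard_qids", sorted(journey_ledger_hard_qids)
    return "candidate_eval_failing", sorted(candidate_eval_failing)


default_choice = select_plateau_currently_failing({"gs_024"}, {"gs_024", "gs_026"})
rollback_choice = select_plateau_currently_failing(
    {"gs_024"}, {"gs_024", "gs_026"}, last_acceptance_was_rollback=True
)
```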

Adds GSO_PLATEAU_INPUT_SOURCE_V1 marker (emitter +
marker_parser registration) so future runs surface the source
selection.

Adds GSO_PLATEAU_INPUT_USES_JOURNEY_AFTER_ROLLBACK default-on flag.

The helper is the load-bearing contract for invariant I8; the harness
wiring uses safe defaults so the change is replay byte-stable. Cycle 12
can wire a real rollback-flag accumulator into the helper.

Co-authored-by: Isaac
Documents the manual 1-hour probe to be executed before Cycle 12
scoping. The probe answers whether hand-crafted L6 SQL-snippet
patches can flip the airline gs_024 / 7NOW gs_026 target qids on
their committed regression spaces. The decision rule directs
future-cycle investment to either deterministic L6 synthesis,
space-config / sample-question layers, or the upstream propagation
bug, depending on the probe result.

Cycle 11 ships independently of the probe result; the probe informs
Cycle 12 scope only.

Also adds !docs/runid_analysis to .gitignore so run-analysis
artifacts in that directory can be committed.

Co-authored-by: Isaac
Improves _fixture_to_evidence projection to derive iteration-level
ags / applied_patches / acceptance_decision / open_hard_cluster_ids /
rca_cards_present / selected_ag_id / proposal_count from the fixture's
decision_records and strategist_response, so the 8-invariant suite
sees what the postmortems documented.

Runs the suite over the airline + 7NOW fixtures: Case B — I4 fires on
both. Airline: same_body_fingerprints_after_rollback on AG_DECOMPOSED_H004
(iter 1→2). 7NOW: consecutive_empty_proposals_same_ag on AG1 (iters 2→3,
3→4, 4→5). I1–I3, I5–I8 all green. Failing I4 names Cycle 12 scope:
DOA guard for same-body retry and zero-proposal spin.

Appends Cycle 11 row to the iteration ledger.

Co-authored-by: Isaac
…oop.py for production

The invariant suite ships in warn-and-degrade mode for production: typed
INVARIANT_VIOLATION decision records land in the Phase B trace but no
AssertionError is raised, so a violation cannot block a customer run.
The default-on strict mode in common/config.py is preserved for CI and
replay tests; production explicitly opts out via setdefault on the
GSO Job's lever_loop notebook entry point.

CI / replay / local debugging can still set GSO_LOOP_INVARIANTS_STRICT=1
explicitly to enforce strict raises.
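
The opt-out pattern can be sketched as follows. The function name is hypothetical and `"0"` as the "off" value is an assumption; only the flag name and the use of `setdefault` come from the commit text.

```python
import os
from typing import MutableMapping


def opt_out_of_strict_invariants(env: MutableMapping[str, str] = os.environ) -> str:
    """Default the strict flag to off unless the caller already set it."""
    # setdefault never overwrites an existing value, so an explicit
    # GSO_LOOP_INVARIANTS_STRICT=1 from CI or a local shell still wins.
    return env.setdefault("GSO_LOOP_INVARIANTS_STRICT", "0")
```

Passing a plain dict instead of `os.environ` makes the behaviour easy to check without touching the real process environment.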

Co-authored-by: Isaac
…cceptance stage

Cycle 11's typed PRODUCER_EXCEPTION decision record from run
80532762433063 (space 7now) surfaced the actual root cause of the
optimizer's silent acceptance failures — and likely the upstream
cause of empty target buckets, repeated empty proposals, ambiguous
target accounting, and phase_b.total_records=0 across recent runs.

The acceptance stage in `_run_lever_loop` referenced three names
that are only assigned in the sibling `_run_gate_checks` eval helper
and live nowhere in `_run_lever_loop`'s scope:

  - full_pre_arbiter_accuracy (the named bug, fallback at the
    candidate-pre-arbiter computation; raised every iteration where
    `gate_result.full_pre_arbiter_accuracy` was None)
  - _best_pre_arbiter (latent at AcceptanceInput, would have raised
    immediately after the first fix landed)
  - full_result_1 (latent at the post_rows fallback, would have
    raised whenever `gate_result.full_result.rows` was empty)

The producer try/except previously swallowed the NameError silently;
Cycle 11's PRODUCER_EXCEPTION wiring now lands the exception class,
repr, and traceback in the iteration's decision-record set, which
is how this bug became diagnosable.

Fix:

  - Extract a small pure helper `_candidate_pre_arbiter_from_gate`
    that mirrors the post-arbiter pattern used four lines above the
    bug (`float(gate_result.get("full_accuracy") or 0.0)`).
  - Initialize an in-scope `_iter_best_pre_arbiter` next to
    `best_accuracy = prev_accuracy` (sourced from the canonical
    `_pre_arbiter/overall_accuracy` score with safe fallback) and
    roll it forward on acceptance next to `best_accuracy = full_accuracy`.
  - Drop the impossible `full_result_1` fallback at the post_rows
    construction; `or []` matches the safe pattern at the
    accepted-baseline write site.

Audit (read-only grep of cross-scope name family inside
`_run_lever_loop`, line ≥ 12435):

  full_pre_arbiter_accuracy : 1 hit (fixed)
  _best_pre_arbiter         : 1 hit (fixed)
  full_result_1             : 1 hit (fixed)
  full_scores / full_accuracy / full_result / new_model_id :
                              all hits properly scoped — assigned
                              from `gate_result["..."]` at :21233-21236
                              before any read.

0 additional cross-scope hits found beyond the three named.

Verification:

  - tests/unit (3668 passed, 1 skipped, 3 xfailed) — unchanged.
  - tests/replay (51 passed, 8 skipped) — gained 4 new test cases
    from the new pre-fix regression fixture; new
    `test_no_full_pre_arbiter_accuracy_nameerror_in_committed_fixtures`
    is green on the two existing fixtures and skipped on the
    pre-fix fixture (kept on disk for regression provenance).
  - New unit test `tests/unit/test_harness_acceptance_pre_arbiter_scope.py`
    asserts the helper's fallback semantics (None / missing key /
    int coercion / None gate_result).

Byte-stability: paths that *did* execute (where
`gate_result.full_pre_arbiter_accuracy is not None`) used the
existing branch and are unchanged by the helper. The broken
fallback was raising 100% of the time it was reached, so no
fixture changes byte-for-byte from this fix.

Risk / rollback: no flags introduced; revert is a single commit.
…cle 11 iteration-end emitters crash-resilient

Cycle 11's typed PRODUCER_EXCEPTION decision record fired again on run
40405156883710 (airline, parent run 1099b152-8655-4f1e-ab43-1240a9400280),
this time naming UnboundLocalError on `full_accuracy` at the F9 plateau-
termination call site (`_run_lever_loop`, line :13779 post-Bug-A-fix).
The root cause is the same family as Bug A: a variable assigned only in
the acceptance branch is read on a code path that never reaches the
assignment, and Python's compile-time scoping promotes it to a local
that stays unbound on rollback-only plateaus.
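
The failure mode can be reproduced in isolation. The function below is a toy with hypothetical names, not code from harness.py.

```python
def plateau_delta(accepted: bool, best_accuracy: float) -> float:
    if accepted:
        full_accuracy = 0.9  # the only assignment, inside the acceptance branch
    # The assignment above makes full_accuracy a local for the whole function,
    # so this read fails whenever the branch was skipped.
    return full_accuracy - best_accuracy


caught = None
try:
    plateau_delta(accepted=False, best_accuracy=0.8)
except UnboundLocalError as exc:
    caught = type(exc).__name__
```

The accepted path works as expected; only the path that skips the assignment raises, which matches the rollback-only plateau described above.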

Bug B fix
---------
* New pure helper `_f9_accuracy_delta_safe(gate_result, best_accuracy)`
  next to `_candidate_pre_arbiter_from_gate` (Bug A's helper). Returns
  `gate_result.full_accuracy - best_accuracy` when present, else `0.0`.
* Replaced `accuracy_delta=float(full_accuracy - best_accuracy)` at the
  F9 LearningInput call site with `_f9_accuracy_delta_safe(locals().get(
  "gate_result"), best_accuracy)`. The `locals().get` lookup yields None
  when `gate_result` is unbound on the current path, so the helper stays
  pure and the call site itself cannot raise.
* New unit test `tests/unit/test_harness_f9_accuracy_delta_scope.py`
  (4 cases) covers the rollback-only plateau, the well-formed
  gate_result, the float-coercion contract, and the non-dict defensive
  path.

Emitter resilience (the structural follow-up)
---------------------------------------------
The same run also exposed that Cycle 11's invariant runner, the
plateau-input-source marker, and the manifest path validator can all be
bypassed when a producer exception cascades through the iteration body.
The runner sat after `_finalize_iteration_summary` in the iteration's
happy-path block, so any iteration that hit a `continue`/`break` from
inside an absorber missed it entirely — the very iterations the
runner exists to surface were the iterations where it never ran.

* Extracted the runner into a sibling helper
  `_run_iteration_invariants_and_append_records`, called from inside
  `_finalize_iteration_summary` *before* the trace/summary stamp so any
  emitted INVARIANT_VIOLATION records appear in `iter_traces[iteration]`
  and the rendered `decision_record_count`.
* Threaded `run_id` and `iter_producer_exceptions` through the 11
  existing `_finalize_iteration_summary` call sites in `_run_lever_loop`
  (one per `exit_path` label) plus the iteration-end `exit_path=
  "completed"` site.
* Wrapped the `for _iter_num in range(...)` body in `try/finally` so an
  uncaught exception always falls back to a finalize call with
  `exit_path="exception"`. Per-iteration dedup is via a private
  `_finalized_this_iter` flag stored on the iteration's
  `current_iter_inputs` dict (filtered out by
  `journey_fixture_exporter._strip_iteration` so it never leaks into
  fixtures). The fallback is purely additive — explicit finalize sites
  remain the source of truth for the `exit_path` label.
* Updated 3 brittle source-introspection tests
  (test_patch_applyability, test_phase_b_observability_wiring,
  test_question_journey_rendering) to match the +4-space iteration-body
  indentation introduced by the wrap.
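
The try/finally fallback with per-iteration dedup can be sketched like this. Only the `_finalized_this_iter` flag name and the `"exception"` label come from the commit text; `run_one_iteration` and the callback shapes are hypothetical.

```python
from typing import Any, Callable, Dict


def run_one_iteration(
    body: Callable[[Callable[[str], None]], None],
    finalize: Callable[[str], None],
) -> None:
    """Run one iteration body and guarantee exactly one finalize call."""
    iter_inputs: Dict[str, Any] = {"_finalized_this_iter": False}

    def finalize_once(exit_path: str) -> None:
        if iter_inputs["_finalized_this_iter"]:
            return  # an explicit site already finalized this iteration
        iter_inputs["_finalized_this_iter"] = True
        finalize(exit_path)

    try:
        body(finalize_once)
    finally:
        # No-op after an explicit finalize; otherwise record the crash path.
        finalize_once("exception")
```

An explicit `finalize_once("completed")` inside the body keeps its own label, while an uncaught exception still produces one finalize call before it propagates.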

Audit findings (Tasks 3 and 7)
------------------------------
* Bug B siblings (Task 3):
  `rg -n '\bfull_(accuracy|scores|result)\b|\bnew_model_id\b'`
  confirms 0 additional cross-scope reads in `_run_lever_loop`.
  `full_scores`/`full_result`/`new_model_id` are read only after their
  own self-assignments inside the acceptance branch (post `:21263+`),
  so they are safe.
* Plateau-input-source emit (Task 7): wired at
  `harness.py:13764-13781` with its own try/except, now lives inside
  the new outer iteration try/finally — crash-resilient. The previous
  run's empty `plateau_input_source: []` marker was a downstream effect
  of Bug B aborting the iteration before reaching this emit; the fix
  restores reach. Marker parser/emit semantics unchanged.
* Phase H manifest validator (Task 7): wired in the post-loop block
  via `validate_phase_h_manifest_paths`. Reachable; runs once per
  function exit. Previous run's empty `missing_pieces` likely
  reflects either (a) `phase_h_manifest_strict_validation_enabled()`
  off, or (b) empty `_phase_h_anchor_run_id`. Both deserve a Cycle 12
  follow-up but neither is a wiring bug.

Regression provenance
---------------------
Committed `tests/replay/fixtures/run_40405156883710_airline_pre_bugb_fix.json`
as the on-disk pre-Bug-B-fix marker. Added
`test_no_full_accuracy_unbound_local_error_in_committed_fixtures` over
all fixtures; the pre-fix fixture is opted into `_PRE_BUGB_FIX_FIXTURES`
and skipped, every other fixture must be free of the
`full_accuracy` `UnboundLocalError` family. The same fixture also
predates the Bug A NameError fix's deployment, so it is also added to
`_PRE_NAMEERROR_FIX_FIXTURES`.

Test results
------------
`uv run pytest tests/unit tests/replay -q` — 3728 passed, 11 skipped,
3 xfailed. Same shape as `main` plus 4 new unit tests (Task 1) and
1 new replay-fixture regression assertion (Task 8); 2 new skips for
the pre-Bug-B fixture against the post-fix assertions (intentional).
…ver_loop) + add structural AST audit lint

Cycle 11's typed PRODUCER_EXCEPTION decision record fired again on
run 476499410793687 (7now, parent run
3b050ec5-4032-457f-a785-2d1a3942a097), this time naming a NameError
on `_baseline_rows_for_control_plane` at the rollback-side
AcceptanceInput call site:

    NameError("name '_baseline_rows_for_control_plane' is not defined")
    File ".../harness.py", line 20746, in _run_lever_loop
        pre_rows=tuple(_baseline_rows_for_control_plane or []),

This is the third sibling in the same family as Bug A
(full_pre_arbiter_accuracy, commit 2013fd3) and Bug B
(full_accuracy, commit 7f538a4): a name assigned only inside a
sibling helper (_run_gate_checks) is read inside _run_lever_loop
with no local assignment and no closure relationship — Python
compiles the read as a free-variable lookup that fails LEGB at
runtime.
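
A toy reproduction of this sibling-scope read, using hypothetical function and variable names rather than the real harness code:

```python
def gate_checks_helper() -> int:
    baseline_rows_in_helper = [1, 2, 3]  # local to this helper only
    return len(baseline_rows_in_helper)


def lever_loop() -> tuple:
    gate_checks_helper()
    # No local, enclosing, global, or builtin binding exists for this name,
    # so the lookup fails at call time rather than at import time.
    return tuple(baseline_rows_in_helper or [])


error_name = None
try:
    lever_loop()
except NameError as exc:
    error_name = type(exc).__name__
```

The module imports cleanly and the helper runs fine, so nothing surfaces until the specific branch containing the read is executed.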

Bug C fix (call-site)
---------------------
* New pure helper `_baseline_rows_for_acceptance_input` next to
  `_candidate_pre_arbiter_from_gate` (Bug A) and
  `_f9_accuracy_delta_safe` (Bug B). Returns
  `tuple(accepted_baseline_rows or [])`.
* Replaced `pre_rows=tuple(_baseline_rows_for_control_plane or [])`
  at harness.py:20746 with a helper call sourced from
  _run_lever_loop's own `_accepted_baseline_rows_for_control_plane`
  local — the architecturally correct source already passed into
  the gate at :20535.
* New unit test `tests/unit/test_harness_baseline_rows_acceptance_input_scope.py`
  (3 cases) covers None/empty fallback, well-formed list of rows,
  and no input mutation.

AST audit lint (the structural follow-up)
-----------------------------------------
* New unit test `tests/unit/test_harness_no_inner_helper_leaks.py`
  parses harness.py once, finds the `_run_lever_loop` FunctionDef,
  and asserts every `Name(ctx=Load)` in the function's top scope
  resolves to one of: (a) parameter, (b) name assigned in
  _run_lever_loop's top scope, (c) module-level binding, (d)
  Python builtin, or (e) explicit `_KNOWN_DEFENDED_DEAD_CODE_LEAKS`
  allow-list entry. Anything else is, by definition, an
  inner-helper variable leak.
* The lint walker does NOT descend into nested FunctionDef /
  AsyncFunctionDef / Lambda / ClassDef / comprehension scopes —
  those are independent and may legitimately close over
  _run_lever_loop's locals.
* Allow-list shrinkage is enforced too: the lint asserts that every
  entry in `_KNOWN_DEFENDED_DEAD_CODE_LEAKS` still corresponds to a
  current leak, so a Cycle 13 cleanup that removes a defended-dead-
  code branch automatically forces the allow-list update.
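
The lint's resolution rule can be approximated with the standard `ast` module. This is a simplified sketch, not the test in the repo: `audit_leaks` and `SAMPLE` are hypothetical, and the walker skips comprehensions and nested defs entirely, so it under-reports reads in their defaults, decorators, and outermost iterables.

```python
import ast
import builtins
from typing import Iterable, Iterator, List, Set, Tuple

# Node types that open an independent scope; the walker stops at these.
_SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda, ast.ClassDef,
           ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)


def _walk_scope(body: List[ast.stmt]) -> Iterator[ast.AST]:
    """Yield every node under body without descending into nested scopes."""
    stack: List[ast.AST] = list(body)
    while stack:
        node = stack.pop()
        yield node
        if not isinstance(node, _SCOPES):
            stack.extend(ast.iter_child_nodes(node))


def _bound_names(nodes: Iterable[ast.AST]) -> Set[str]:
    """Names bound by assignment, def/class, import, or `except ... as`."""
    names: Set[str] = set()
    for node in nodes:
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            names.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            names.add(node.name)
    return names


def audit_leaks(source: str, func_name: str, allow: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (undefended_leaks, stale_allow_list_entries) for one function."""
    tree = ast.parse(source)
    func = next(n for n in tree.body
                if isinstance(n, ast.FunctionDef) and n.name == func_name)
    a = func.args
    params = {p.arg for p in a.posonlyargs + a.args + a.kwonlyargs}
    params |= {p.arg for p in (a.vararg, a.kwarg) if p is not None}
    local_nodes = list(_walk_scope(func.body))
    known = (params | _bound_names(local_nodes)
             | _bound_names(_walk_scope(tree.body)) | set(dir(builtins)))
    loads = {n.id for n in local_nodes
             if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    leaks = loads - known
    # Stale entries enforce allow-list shrinkage when a leak is cleaned up.
    return leaks - allow, allow - leaks


SAMPLE = '''
import os
LIMIT = 3

def helper():
    inner_only = 1
    return inner_only

def loop(x):
    y = x + LIMIT
    f = lambda z: z + y
    return len(os.sep) + y + inner_only
'''
```

On `SAMPLE`, `audit_leaks(SAMPLE, "loop", set())` reports `inner_only` as the only undefended leak. Passing an allow-list that contains a name which no longer leaks returns it in the second set, which is the shrinkage check.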

Bugs surfaced by the lint and fixed in this same commit
-------------------------------------------------------
The first lint run revealed 7 leaks. Two were undefended
(production-blocking) and fixed in this commit:

  Bug D: `MIN_POST_ARBITER_GAIN_PP` at harness.py:20780. Imported
    only inside _run_gate_checks at :11871. Would NameError on the
    rollback-side AcceptanceInput path immediately after the Bug C
    fix unmasked it. Fix: added to the module-level
    `from genie_space_optimizer.common.config import (...)` block.

  Bug E: `_audit_emit` at harness.py:17819 and :19422. Defined only
    as an inner closure of _run_gate_checks at :11214. The :17819
    call is wrapped in a try/except that swallowed the NameError
    silently, losing audit; the :19422 call is undefended and would
    NameError every time the no_causal_applyable_patch branch was
    taken (now caught by the outer try/finally from commit 7f538a4
    and finalized as exit_path="exception"). Fix: replaced both
    `_audit_emit(...)` calls with `logger.debug(...)` preserving
    the audit information at log level. TODO(cycle-13) marker
    points to rehoming the audit emitter.

Allow-listed (defended dead code, no behaviour change)
------------------------------------------------------
Five names are read inside _run_lever_loop only via
`if "name" in locals()` / `if "name" in dir()` guards whose True
branch is dead — Python's compile-time scoping makes them free
variables, so the guard always falls through to the fallback. They
are technically leaks but cannot crash:

  - _candidate_clusters_for_decision_trace
  - _raw_proposals_for_ag
  - _rca_evidence_bundle
  - _rolled_back_content_fingerprints
  - strategist_returned_ags

Cleanup of these (collapse to fallback unconditionally OR rehome
via gate_result kwargs) is a Cycle 13 follow-up tracked by the
allow-list in `tests/unit/test_harness_no_inner_helper_leaks.py`.
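
Why the guard's True branch is dead can be shown with a toy pair of functions. The function names are hypothetical; the variable name is borrowed from the list above.

```python
def helper_that_binds() -> None:
    strategist_returned_ags = ["AG1"]  # bound only inside this helper


def loop_body() -> list:
    # The name is never assigned in this function, so it is not one of its
    # locals and the membership test below is always False.
    if "strategist_returned_ags" in locals():
        return strategist_returned_ags  # never reached
    return []


helper_that_binds()
guarded_result = loop_body()
```

The fallback is returned even after the helper has run, because the helper's local never becomes visible in `loop_body`'s namespace.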

Regression provenance
---------------------
Committed `tests/replay/fixtures/run_476499410793687_7now_pre_bugc_fix.json`
as the pre-Bug-C-fix on-disk marker. Added `_PRE_BUGC_FIX_FIXTURES`
and `test_no_baseline_rows_for_control_plane_nameerror_in_committed_fixtures`
mirroring the Bug A and Bug B regression assertions; the pre-fix
fixture is opted in and skipped, every other fixture must be free
of the `_baseline_rows_for_control_plane` NameError.

Test results
------------
`uv run pytest tests/unit tests/replay -q` — 3740 passed, 12
skipped, 3 xfailed. Same shape as main plus 3 new unit tests
(Task 1) plus 1 new lint test (Task 3) plus 1 new fixture-driven
regression assertion run across all fixtures (Task 5); 1 new skip
for the pre-Bug-C fixture against its own assertion (intentional).
…arning record (15 tests)

decision_emitters.py
- AG outcome decision record carries full acceptance detail fields.

stages/acceptance.py
- Acceptance stage captures AcceptanceDetail into stage output.

rca_decision_trace.py / run_output_bundle.py
- Contract-health section renders from bundle missing_pieces + manifest.

operator_process_transcript.py
- Section 10 (contract health) wired from run_output_bundle.

harness.py
- Wire iteration learning record at end-of-iteration site.

Tests (15 passed):
- test_ag_outcome_decision_record_acceptance_detail
- test_contract_health_stage_renders
- test_iteration_learning_record
+ refreshed: test_harness_iteration_stamping, test_phase_h_overview_and_summary_builders,
  test_process_stage_order_matches_stages_registry, test_run_output_bundle
Add invariant_projection.project_iter_evidence which turns the live
_current_iter_inputs dict into the shape invariants.run_invariants
expects (clusters / ags / applied_patches / acceptance_decision /
open_hard_cluster_ids / rca_cards_present / decision_records). Carries
prior_iter_evidence forward so I4 (no silent retry) sees prev+curr in
one evidence dict.

Pure: no I/O, no mutation of inputs. Empty run_id short-circuits to
the same no-op shape the harness used previously, preserving
byte-stability for the legacy callers.

The harness call-site swap lands in the next commit so this commit is
behaviour-neutral.

Co-authored-by: Isaac
Replace the empty _iter_evidence literal in
_run_iteration_invariants_and_append_records with a call to the new
project_iter_evidence projector. Thread prior_iter_evidence through
_finalize_iteration_summary and the 12 _run_lever_loop call sites so
I4 (no silent retry) sees prev+curr in one evidence dict.

Behaviour: I2/I3/I4/I7 now have the data they need to fire when their
preconditions hold. I1 already worked. I5/I6/I8 stay no-ops at the
per-iteration level — they remain Phase H run-end signals and are
projected as empty here.

The runner stays pure-on-failure (logger.debug) and pure-on-flag-off
(_inv_enabled() short-circuit). Strict mode still re-raises
AssertionError exactly as before.

Co-authored-by: Isaac
Snapshot run 809960554692716 (3b050ec5 attempt 4, latest pre-projection-
fix evidence) as a permanent fixture. Add
test_invariants_fire_on_run_809960554692716_pre_projection_fixture
which projects the fixture through project_iter_evidence and asserts
that I3 (acceptance buckets) or I7 (RCA grounding) fires.

Before this commit the post-Cycle-11 invariant runner saw 0 violations
on this run despite F2/F3/F5/F6 in the postmortem being textbook fires.
After: the fixture surfaces named violations the operator can act on,
which is the binary pass/fail signal Cycle 11 was designed to give.

This unblocks P0 (narrow structural fallback) and P1 (DOA fingerprint)
because future re-pilots will produce named invariant fires we can
explicitly close.

Co-authored-by: Isaac
I6 (manifest paths), I5 (replay validity), and I8 (journey ledger
hard qids) all require run-end signals that the per-iteration runner
does not have. Document these as deferred follow-ups so the next
re-pilot post-projection-fix is the trigger to wire them up.

Co-authored-by: Isaac
Diagnostic-only test. Pins narrow_replacement_diagnosis returning
patch_type_lacks_where_predicate and build_narrow_l6_replacement
returning None for the H002 add_sql_snippet_expression shape from
run 809960554692716 (3b050ec5 attempt 4).

This is the failing-baseline anchor: P0 Task 2A flips the second
assertion (build returns a real patch). When that ships, this test's
expectations move with it in the same commit.

No production code change.

Co-authored-by: Isaac
Gate the upcoming expression / measure narrow-replacement synthesizer
behind a flag so existing canonical replay fixtures stay byte-stable.
Default off; flipped on per-environment after the P0 re-pilot confirms
the synthesizer produces gate-clearing narrow expressions on the
run-809960554692716 fixture.

Also: mutate the Task-1 diagnostic tests to be flag-aware (renamed
*_when_flag_off, monkeypatch.delenv at start) so the post-Task-4A
flag-on path is unblocked.

Co-authored-by: Isaac
When GSO_L6_NARROW_REPLACEMENT_FOR_EXPRESSION is on, narrow_replacement_
diagnosis now returns applicable=True for add_sql_snippet_expression
and add_sql_snippet_measure patches that carry a non-empty
sql_expression AND target qids, and build_narrow_l6_replacement emits a
CASE-wrapped variant scoping the expression to the named QIDs.
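
The applicability condition can be sketched as a small predicate. The patch dict keys `patch_type` and `target_qids` and the function name are assumptions; the flag is passed in as a plain bool rather than read from the environment.

```python
from typing import Any, Mapping

_EXPRESSION_PATCH_TYPES = frozenset(
    {"add_sql_snippet_expression", "add_sql_snippet_measure"}
)


def expression_narrowing_applicable(patch: Mapping[str, Any], flag_on: bool) -> bool:
    """True when an expression/measure patch qualifies for narrowing (sketch)."""
    if not flag_on:
        return False  # flag off keeps the legacy "not applicable" diagnosis
    if patch.get("patch_type") not in _EXPRESSION_PATCH_TYPES:
        return False
    has_sql = bool(str(patch.get("sql_expression") or "").strip())
    has_qids = bool(patch.get("target_qids"))
    return has_sql and has_qids
```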

Closes the postmortem F2 / F3 finding from run 809960554692716: the
H002 zone-VP expression patch was dropped at HCRF and the synthesizer
returned None because expression patches lack a where_predicate. After
this commit, the same drop produces a narrow CASE-wrapped variant that
the harness re-tests through patch_blast_radius_is_safe.

Flag default off → canonical replay byte-stability holds. Flag-on path
proven by 3 newly-green unit tests; flag-off path preserved by the
existing diagnostic tests (mutated to be flag-aware in Task 3A).

Co-authored-by: Isaac
Mirror of narrow_not_applicable: when _run_narrow_l6_replacement_loop
produces a survivor that clears patch_blast_radius_is_safe, emit a
typed NarrowReplacementSynthesizedRecord into decision_records and
a GSO_NARROW_REPLACEMENT_SYNTHESIZED_V1 marker into markers.

Makes the P0 win observable: dashboards can now distinguish
"narrow-replacement saved an iteration" from "no narrow-replacement
was ever attempted" or "narrow-replacement declined as not applicable".

Co-authored-by: Isaac
Snapshot the latest 3b050ec5 attempt 4 replay fixture as a permanent
P0 anchor. Add tests that, with the flag on, the same HCRF-dropped
H002 expression patch produces a narrow_replacement survivor with
narrowing_strategy=expression_qid_scope; with the flag off, the same
patch continues to produce no survivor (byte-stability).

The tests skip gracefully when the fixture's drop list does not
include an explicit per-iteration entry, falling back to the unit-
level proofs from Task 4A.

Co-authored-by: Isaac
…llback

Pilot env, replay anchor, and binary success conditions for the
GSO_L6_NARROW_REPLACEMENT_FOR_EXPRESSION flip.

Co-authored-by: Isaac

2 participants