feat(gso): Genie Space Optimizer — complete optimizer engine (lossless contract, process spine, decision trace, observability) by prashsub · Pull Request #202 · databricks-solutions/databricks-genie-workbench

prashsub · 2026-05-06T12:44:30Z

Closes #86
Closes #182
Closes #183
Closes #184
Closes #189

Partially addresses #185 (preflight tile shipped; pre/post delta + ScanSummary still TODO)

Summary

This PR delivers the complete Genie Space Optimizer engine built on top of the existing scaffold. It covers ~900 commits across the packages/genie-space-optimizer/ package, touching 165 source files (+86k lines) and 470 test files (+83k lines).

Correctness & leakage fixes

Validator tolerates runtime underscore-prefixed keys; question_id deduplication in cluster_failures; root-cause cascade hardening; vacuous filter rejection; resume display fixes
Canonical QID extraction (_qid_extraction.py) as single source of truth across harness and GT-correction; fixes Cycle 8 bug where GT-correction candidates were silently dropped

Lossless contract & replay gate

Typed JourneyStage/JourneyTerminalState enum contract with validator and transition rules
Deterministic lever_loop_replay pure driver; frozen cycle fixtures (Cycles 7–11); canonical journey JSON for byte-stable replay diffs
Lane-aware validation (trunk vs proposal lanes); cross-projection replay tests

Decision trace & unified observability (Phases B–H)

DecisionRecord with 10 typed DecisionTypes + AlternativeOption/RejectReason
OptimizationTrace + 9-section render_operator_transcript; ScoreboardSnapshot; classify_unresolved_qid priority-ladder bucketing
Per-stage stage_io_capture decorators; Phase H run_output_bundle with GSO_ARTIFACT_INDEX_V1 marker; per-iteration content completeness (pre-stamp → finalize at all 10 exit paths)
Evidence bundle CLI + lazy MLflow trace fetcher + replay_runid_fixture CLI + gso-postmortem skill

Process spine modularization (Phases F–G)

optimization/stages/ package (9 stages: evaluation → learning) with STAGES registry + StageHandler protocol
All stages wired with additive observability; uniform execute() alias; INPUT_CLASS/OUTPUT_CLASS per stage; JSON-serializable I/O contract enforced

RCA & control-plane hardening

Expanded RcaKind; rca_execution.py deterministic execution plans; causal acceptance gate; convergence quarantine attribution
blast_radius + dead_on_arrival producers; rca_groundedness.py unified gate; PATCH_APPLIED/RCA_FORMED/PROPOSAL_GENERATED decision records with RCA-grounding fields
Per-question lever preference; forbid_tables constraint propagation; classify_unresolved_qid priority ladder

Cycle-by-cycle optimizer improvements (Cycles 1–11)

Intra-AG body-fingerprint dedup; shared-cause blast radius; force structural synthesis (GSO_FORCE_STRUCTURAL_SYNTHESIS_ON_LEVER5_DROP); typed proposal-failure outcomes; productive-iteration budget; causal-drop strategist feedback; soft-cluster drift recovery; lever-6 SQL-shape forcing; invariant warn-and-degrade policy; MLflow eval hang defense (adaptive tier ladder + liveness watchdog + per-request OpenAI timeout); patch-acceptance reliability (W1–W8); RCA ungrounded records; AG levers union with recommended_levers; narrow L6 expression fallback; Cycle 11 invariant suite (I1–I8)

Dependency lockdown

All Python + frontend deps pinned to exact versions; mlflow aligned to 3.11.1; uv.lock regenerated; exact-only policy codified in AGENTS.md

Docs

docs/optimizer-process-design/ (11 files) — permanent reference architecture + interactive optimizer visualization
canonical-schema.md; optimizer iteration ledger (Cycles 1–11)

Test plan

470+ test files: unit, integration, replay, snapshot
Cycle 8–11 real-run fixtures committed with zero-violation budget assertions
Byte-stable snapshot tests for transcript, decision trace, journey events, scoreboard, and alternatives ordering
Cycle 11 invariant suite (I1–I8) wired into iteration epilogue with pilot baseline assertions
Structural AST audit lint for inner-helper variable leaks

…posals Adds _drop_proposals_matching_rolled_back_content_fingerprints helper and wires it BEFORE the existing _patch_forbidden check. Uses patch_retry_signature[5] (content_fingerprint) to drop any proposal whose content matches any prior rolled-back patch — irrespective of rollback_class. The wire-up builds a separate _all_rolled_back_patches_for_dedup list from reflection_buffer because the existing _rolled_back_patches_for_retry filters to CONTENT_REGRESSION only, which would defeat the dedup's purpose (closing the iter-3/iter-4 non-CONTENT_REGRESSION re-emission gap). Co-authored-by: Isaac
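
The dedup described in this commit can be sketched roughly as below. This is a minimal illustration, not the harness code: `toy_signature` is a hypothetical stand-in for `patch_retry_signature`, and only the position of the content fingerprint (index 5 of a 6-tuple) is taken from the commit message. The other tuple slots, the dict keys, and the `"OTHER"` rollback class are assumptions.

```python
import hashlib
from typing import Any, Callable, Iterable


def toy_signature(patch: dict[str, Any]) -> tuple:
    """Hypothetical stand-in for patch_retry_signature.

    Returns a 6-tuple; only index 5 (content fingerprint) matters here.
    The first five slots are placeholders chosen for illustration.
    """
    body = str(patch.get("patch_text", ""))
    fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return (
        patch.get("patch_type"),
        patch.get("target"),
        patch.get("lever"),
        patch.get("cluster_id"),
        frozenset(patch.get("sections", ())),
        fingerprint,
    )


def drop_matching_rolled_back(
    proposals: Iterable[dict[str, Any]],
    rolled_back: Iterable[dict[str, Any]],
    signature: Callable[[dict[str, Any]], tuple] = toy_signature,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split proposals into (kept, dropped) by content fingerprint.

    Every rolled-back patch counts, whatever its rollback_class.
    """
    seen = {signature(patch)[5] for patch in rolled_back}
    kept: list[dict[str, Any]] = []
    dropped: list[dict[str, Any]] = []
    for proposal in proposals:
        target = dropped if signature(proposal)[5] in seen else kept
        target.append(proposal)
    return kept, dropped


rolled_back = [
    {"patch_text": "ADD JOIN hint", "rollback_class": "CONTENT_REGRESSION"},
    {"patch_text": "RENAME column alias", "rollback_class": "OTHER"},
]
proposals = [
    {"id": "p1", "patch_text": "RENAME column alias"},
    {"id": "p2", "patch_text": "ADD sample question"},
]
kept, dropped = drop_matching_rolled_back(proposals, rolled_back)
```

Here `p1` is dropped even though the matching rollback was not a CONTENT_REGRESSION, which is the gap the commit says a CONTENT_REGRESSION-only list would leave open.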

…-A burndown log
- airline_real_v1_cycle10_raw.json: commit real-run cycle 10 replay fixture
- burn-down-to-merge-roadmap.md: Phase D/E status progress, cycle 10 notes
- phase-a-burndown-log.md: append cycle 10 run entry

Lands the foundational scaffolding for Phase F stage-aligned modularization: - stages/__init__.py exports StageContext, StageHandler - stages/_context.py defines StageContext dataclass (run_id, iteration, journey/decision emit hooks, MLflow anchor, feature flags) - stages/_protocol.py defines StageHandler[StageInputT, StageOutputT] Protocol with stage_key, decision_producer, execute() Unit tests pin the public surface and the Protocol's duck-typing contract. F2-F9 build on this skeleton. Combines plan Tasks 1+2+3 into one commit since all three are part of the same package skeleton and pass green together. Co-authored-by: Isaac

…tion + classify_eval_rows wrappers (Phase F1) Lands stages/evaluation.py with: - EvaluationInput / EvaluationResult dataclasses - _classify_eval_rows: production partition using control_plane row_is_hard_failure / row_is_passing / row_is_actionable_soft - _run_full_evaluation: thin wrapper around evaluation.run_evaluation - evaluate_baseline / evaluate_post_patch public entry points - eval_classification_records emission via ctx.decision_emit Adapted eval_classification_records call to its actual signature (eval_qids + classification dict, not rows). Partition-parity test against lever_loop_replay._classify_eval_rows pins predicate agreement. Combines plan Tasks 4+5 into one commit since both are part of the same module landing and pass green together. Co-authored-by: Isaac

…ility gate (Phase F1) Captures airline_real_v1.json replay output (canonical journey JSON, canonical decision JSON, operator transcript, validation report) into tests/replay/snapshots/before_f1.json. Used by the F1 byte-stability gate (Task 8) to assert F1's wrappers produce byte-identical replay output. The fixture is small and produces zero decision records / violations / missing qids; the snapshot's primary value is pinning the empty-shape of those collections so a future regression that adds spurious records or violations would diff. Co-authored-by: Isaac

…uation (Phase F1) Wires harness.py:9924 (the per-iteration full_result_1 = run_evaluation call inside _run_gate_checks) through stages.evaluation.evaluate_post_patch. Also adds the F1 byte-stability replay gate (tests/replay/test_phase_f1_byte_stable.py), which asserts that the airline_real_v1 replay produces byte-identical canonical journey JSON, canonical decision JSON, operator transcript, and validation report against the pre-F1 snapshot.
Implementation notes:
- Built a local _stage_ctx_full_eval at the call site with NO-OP journey_emit / decision_emit. Both _emit_eval_entry_journey and eval_classification_records are already emitted upstream in _run_lever_loop (lines 12097, 12218) based on cluster analysis; the wrapper must NOT double-emit, so its emits go to no-ops here. Subsequent F-plans absorb _run_lever_loop's surrounding orchestration into stages and let the wrapper own journey/decision emission directly.
- Added EvaluationResult.raw passthrough field carrying the full evaluation.run_evaluation dict. The harness assigns full_result_1 = _eval_result.raw to preserve every downstream field access (asi_extraction_audit, scores, both_correct_rate, quarantined_benchmarks_qids, etc.) without enumeration.
- The other inline run_evaluation(...) call sites (lines 2013, 3499, 6689, 9712, 9853, 18673) are intentionally untouched per plan scope.
Combines plan Tasks 6+8 into one commit. Task 9 final-cleanup verifications (LOC, residual references, decision-emitter wiring) all pass.
Co-authored-by: Isaac

Adds stages/rca_evidence.py with: - RcaEvidenceInput dataclass (eval_rows + per-qid judge + ASI metadata) - Stage2Evidence dataclass (per_qid_evidence, rca_kinds_by_qid, evidence_refs, promoted_to_top_n_qids). Renamed from RcaEvidence to avoid clash with the existing rca.RcaEvidence frozen dataclass. - collect(ctx, inp) entry that wraps rca._asi_finding_from_metadata + rca._top_n_collapse_metadata_override (PR-D promotion tracking). Per the plan's Reality Check appendix: F2 is consolidate-style observability-only. The harness has no for-_qid evidence-shaping loop to "extract"; per-qid evidence is constructed inside optimizer.cluster_failures via rca._asi_finding_from_metadata. F2 stands up a typed surface that F3 will consume; harness.py is NOT modified. Tests: - 5 unit tests covering dataclass shape, single-qid happy path, PR-D top-N promotion regression, empty inputs. - F2 byte-stability replay gate against airline_real_v1.json. Combines plan Tasks 1+2+3+4 into one commit since F2 is a single observability-only landing. Co-authored-by: Isaac

Adds stages/clustering.py with:
- ClusteringInput (eval_result_for_clustering, metadata_snapshot, soft_eval_result, held_out_qids, qid_state), matching the actual optimizer.cluster_failures signature documented in the plan's Reality Check appendix.
- ClusterFindings (clusters, soft_clusters, rejected_cluster_alternatives): typed output for F4 to consume.
- form(ctx, inp) entry that calls cluster_failures for hard and optionally soft, splitting promoted vs rejected by demoted_reason (no fabricated emit_rejected kwarg).
Plan deviations (with justification):
- Task 2 step 3's example body used kwargs (rca_evidence, rca_kinds_by_qid, qids, eval_rows, emit_rejected) that the Reality Check explicitly contradicts. Implemented per the Reality Check's actual signature.
- F3 acceptance criteria say "harness no longer invokes cluster_failures(...)" and "cluster_records / rca_formed_records emitted only from stages/clustering.py". These would require moving emission code from _run_lever_loop:12296+ (where it currently fires from _analysis output) into form(), a different call site with different timing relative to journey events. Doing that within F3's byte-stability gate is high-risk. Implemented as observability-only (matching F2's pattern); harness wiring + emission move deferred to a follow-up plan.
Tests: 5 unit tests covering dataclass shape, hard-only path, hard + soft path, demoted_reason split, empty input. F3 byte-stability replay gate against airline_real_v1.json.
Combines plan Tasks 1+2+3+4 into one commit since F3 is a single observability-only landing.
Co-authored-by: Isaac

… F4) Adds stages/action_groups.py with: - ActionGroupsInput (action_groups, source_clusters_by_id, rca_id_by_cluster, ag_alternatives_by_id) matching the actual decision_emitters.strategist_ag_records signature. - ActionGroupSlate (ags, rejected_ag_alternatives) typed output. - select(ctx, inp) that emits one STRATEGIST_AG_EMITTED DecisionRecord per AG via ctx.decision_emit, propagating target_qids / rca_id / root_cause / alternatives_considered. Plan deviations (with justification): - Plan Task 2 step 3's example body fabricated kwargs (ag=ag, alternatives_considered=...) that don't match the actual strategist_ag_records signature. - Plan acceptance criteria say "harness no longer contains the strategist-invocation block, _drain_buffered_action_groups, _build_ag_alternatives_by_id" + "strategist_ag_records emitted only from stages/action_groups.py". Per the plan's own Reality Check, the strategist invocation is "NOT a single contiguous function" — it's ~300-500 LOC of inline LLM orchestration. Lifting that under F4's byte-stability gate is high-risk. Implemented as observability-only (matching F2/F3); harness wiring + helper moves deferred to a follow-up plan. Tests: 5 unit tests covering shape, multi-AG emission, MISSING_TARGET_QIDS reason for empty target_qids (Cycle-8-Bug-1 signal), empty input. F4 byte-stability replay gate against airline_real_v1.json. Combines plan Tasks 1+2+3+4 into one commit. Co-authored-by: Isaac

Adds stages/proposals.py with:
- ProposalsInput (proposals_by_ag, rca_id_by_cluster, cluster_root_cause_by_id, proposal_alternatives_by_ag) matching the actual decision_emitters.proposal_generated_records signature.
- ProposalSlate (proposals_by_ag, rejected_proposal_alternatives, content_fingerprints_emitted) typed output.
- generate(ctx, inp) that stamps content_fingerprint via reflection_retry.patch_retry_signature on every proposal, emits PROPOSAL_GENERATED records via ctx.decision_emit, and returns the fingerprinted slate.
Plan deviations (with justification):
- Plan Task 2 step 3 fabricated patch_retry_signature kwargs (patch_text=, target_qids=, lever=) that don't match the actual signature (single dict argument returning a 6-tuple).
- Plan acceptance criteria say "harness no longer invokes generate_proposals_from_strategy(...)" + "proposal_generated_records emitted only from stages/proposals.py". Per the plan's own Reality Check, the call site is preceded by ~50 LOC of cluster-driven-synthesis dispatch inside the per-AG loop. Lifting that under F5's byte-stability gate is high-risk. Implemented as observability-only (matching F2/F3/F4); harness wiring + helper moves deferred to a follow-up plan.
content_fingerprint joins 6-tuple components (with frozenset section_set sorted+joined for stability) so F6's PR-E content-fingerprint dedup can compare against rolled-back fingerprints.
Tests: 5 unit tests covering shape, multi-proposal emission with fingerprints, MISSING_TARGET_QIDS skip (Cycle-8-Bug-1), empty input, fingerprint consistency. F5 byte-stability replay gate against airline_real_v1.json.
Combines plan Tasks 1+2+3+4 into one commit.
Co-authored-by: Isaac

…ble sub-handlers (Phase F6) Adds stages/gates.py with: - GatesInput / GateOutcome / GateDrop dataclasses. - GATE_PIPELINE_ORDER = (content_fingerprint_dedup, lever5_structural, rca_groundedness, blast_radius, dead_on_arrival). - run_gate(name, ctx, inp): named entry per sub-handler. - filter(ctx, inp): runs every sub-handler in pipeline order, accumulating drops and DOA signatures. - 5 composable sub-handlers, each with field-driven minimal gate logic so unit tests can exercise drop conditions in isolation. Plan deviations (with justification): - Per the plan's Reality Check appendix: the four gate sites in harness are NOT contiguous, the Lever-5 gate has no single primitive to lift, and the DOA primitive is just the signature recorder. Lifting the production gate logic under F6's byte-stability gate is high-risk. - Implemented as observability-only (matching F2-F5): each sub-handler's logic is field-driven (patch_text presence, rca_id presence, affected_tables count, content_fingerprint matching, noop flag) so it's testable in isolation. Production gate logic stays in harness; this stage is the typed surface F7 will eventually consume. Tests: 10 unit tests covering shape, pipeline order, each sub-handler in isolation, full filter() pipeline, unknown-gate error path. F6 byte-stability replay gate against airline_real_v1.json. Combines plan Tasks 1-9 into one commit since F6 is a single observability-only landing. Co-authored-by: Isaac
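
A rough sketch of the ordered gate pipeline follows. The pipeline order and the fields each gate inspects come from the commit message. The predicates themselves, the dict-based proposal shape, the `max_affected_tables` threshold, the `ValueError` for unknown gates, and the name `run_pipeline` (used instead of `filter` to avoid shadowing the builtin) are assumptions for illustration only. Production gate logic stays in the harness.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Proposal = dict[str, Any]

GATE_PIPELINE_ORDER = (
    "content_fingerprint_dedup",
    "lever5_structural",
    "rca_groundedness",
    "blast_radius",
    "dead_on_arrival",
)

# Illustrative field-driven predicates: True means "drop this proposal".
_GATES: dict[str, Callable[[Proposal, dict[str, Any]], bool]] = {
    "content_fingerprint_dedup": lambda p, inp: (
        p.get("content_fingerprint") in inp.get("rolled_back_fingerprints", ())
    ),
    "lever5_structural": lambda p, inp: p.get("lever") == 5 and not p.get("patch_text"),
    "rca_groundedness": lambda p, inp: not p.get("rca_id"),
    "blast_radius": lambda p, inp: (
        len(p.get("affected_tables", ())) > inp.get("max_affected_tables", 3)
    ),
    "dead_on_arrival": lambda p, inp: bool(p.get("noop")),
}


@dataclass
class GateOutcome:
    survivors: list[Proposal]
    drops: list[tuple[str, str]] = field(default_factory=list)  # (proposal id, gate)


def run_gate(name: str, proposals: list[Proposal], inp: dict[str, Any]) -> GateOutcome:
    """Run a single named gate over the proposals."""
    if name not in _GATES:
        raise ValueError(f"unknown gate: {name}")
    should_drop = _GATES[name]
    outcome = GateOutcome(survivors=[])
    for proposal in proposals:
        if should_drop(proposal, inp):
            outcome.drops.append((proposal["id"], name))
        else:
            outcome.survivors.append(proposal)
    return outcome


def run_pipeline(proposals: list[Proposal], inp: dict[str, Any]) -> GateOutcome:
    """Run every gate in pipeline order, accumulating drops."""
    survivors = list(proposals)
    drops: list[tuple[str, str]] = []
    for name in GATE_PIPELINE_ORDER:
        outcome = run_gate(name, survivors, inp)
        survivors = outcome.survivors
        drops.extend(outcome.drops)
    return GateOutcome(survivors=survivors, drops=drops)


proposals = [
    {"id": "p1", "content_fingerprint": "fp-old", "rca_id": "r1", "patch_text": "x"},
    {"id": "p2", "content_fingerprint": "fp-new", "rca_id": None, "patch_text": "y"},
    {"id": "p3", "content_fingerprint": "fp-3", "rca_id": "r3", "patch_text": "z", "noop": True},
    {"id": "p4", "content_fingerprint": "fp-4", "rca_id": "r4", "patch_text": "w"},
]
result = run_pipeline(proposals, {"rolled_back_fingerprints": {"fp-old"}})
```

Because each gate only sees the survivors of the previous one, a proposal is attributed to the first gate that rejects it, which keeps drop accounting unambiguous.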

Adds stages/application.py with: - ApplicationInput (applied_entries_by_ag, ags, rca_id_by_cluster, cluster_root_cause_by_id) matching the actual decision_emitters.patch_applied_records signature. - AppliedPatch (proposal_id, ag_id, patch_type, target_qids, cluster_id, content_fingerprint, rolled_back_immediately, rollback_reason). - AppliedPatchSet (applied tuple + applied_signature SHA256-based cycle-detection hash). - apply(ctx, inp) entry that converts apply_log entries to AppliedPatch records, emits PATCH_APPLIED via ctx.decision_emit, and returns the typed slate. Plan deviations (with justification): - Plan referenced applier.apply_levers_to_config which doesn't exist; actual primitive is apply_patch_set per the Reality Check. - Plan acceptance criteria say "harness no longer invokes apply_levers_to_config". F7 is observability-only (matching F2-F6); harness wiring + apply call + post-apply verification block move deferred to a follow-up plan because the FailedRollbackVerification path is intertwined with downstream eval logic at harness.py:16845. Tests: 6 unit tests covering shape, multi-entry emission, immediate rollback marker propagation, empty input, deterministic applied_signature. F7 byte-stability replay gate against airline_real_v1.json. Combines plan Tasks 1-4 into one commit since F7 is a single observability-only landing. Co-authored-by: Isaac

Adds stages/acceptance.py with: - AcceptanceInput (applied_entries_by_ag, ags, baseline/candidate accuracy, baseline/candidate pre_arbiter_accuracy [PR-E], pre_rows/post_rows, protected_qids, min_gain_pp, min_pre_arbiter_gain_pp). Field names match the actual decide_control_plane_acceptance signature per the Reality Check. - AgOutcomeRecord (ag_id, outcome, reason_code, target_qids, affected_qids, content_fingerprints). - AgOutcome (outcomes_by_ag, qid_resolutions, rolled_back_content_fingerprints) — the latter is the typed surface F6's content-fingerprint dedup gate consumes on the next iteration. - decide(ctx, inp): per-AG control_plane gate + ACCEPTANCE_DECIDED emission + per-qid QID_RESOLUTION emission via post_eval_resolution_records. Plan deviations (with justification): - Plan Task 1's dataclass used baseline_post_arbiter_accuracy / baseline_qid_pass_states / etc. that DON'T match the actual decide_control_plane_acceptance kwargs. Reality Check explicitly flagged this; I corrected to baseline_accuracy / pre_rows / post_rows / etc. matching the real signature. - Plan referenced decision.outcome / decision.affected_qids fields that don't exist on ControlPlaneAcceptance (which has accepted + reason_code). I map decision.accepted + reason_code → outcome string in _outcome_string. - Plan asked to delete ag_outcome.py + post_eval.py thin modules. Deferred — F1's stages/evaluation.py imports eval_entry._emit_eval_entry_journey, and harness still imports post_eval._emit_post_eval_journey. Deletion is a coordinated follow-up. PR-E regression test: post-arbiter saturated at 91.7% with 22/24 already arbiter-rescued + pre-arbiter improved 4.2pp → accepted with reason_code=accepted_pre_arbiter_improvement, the cycle-10 saturation ceiling that PR-E lifted. Tests: 7 unit tests covering shape, PR-E acceptance regression, rollback path with collateral, post-eval resolution emission, empty input. F8 byte-stability replay gate against airline_real_v1.json. 
Combines plan Tasks 1-5 into one commit since F8 is a single observability-only landing. Co-authored-by: Isaac

Adds stages/learning.py with: - LearningInput (prior_reflection_buffer, prior_do_not_retry, prior_rolled_back_content_fingerprints, ag_outcomes_by_id, applied_signature, accuracy_delta, current_hard_failure_qids, regression_debt_qids, quarantined_qids, sql_delta_qids, pending_buffered_ags, diagnostic_action_queue). - LearningUpdate (new_reflection_buffer, new_do_not_retry, new_rolled_back_content_fingerprints, terminal_decision, retired_ags [PR-B2], ag_retired_records). - update(ctx, inp) entry that: * appends to reflection_buffer * accumulates do-not-retry signatures * accumulates rolled-back content fingerprints (PR-E groundwork) * resolves terminal status via rca_terminal.resolve_terminal_on_plateau * emits one AG_RETIRED DecisionRecord per retired AG (PR-B2) Plan deviations (with justification): - Per the plan's Reality Check, the harness's end-of-iteration learning logic is intertwined with break/continue control flow and stdout banner emission. Lifting that under F9's byte-stability gate is high-risk. Implemented as observability-only (matching F2-F8); harness wiring + helper moves deferred to follow-up. Tests: 7 unit tests covering shape, reflection buffer append, rolled-back fingerprint accumulation (PR-E), AG_RETIRED emission (PR-B2 regression), terminal_decision shape, empty input. F9 byte-stability replay gate against airline_real_v1.json. Combines plan Tasks 1-5 into one commit since F9 is a single observability-only landing. Phase F closeout: all 9 stage modules now exist as parallel typed surfaces. The harness still owns the production wiring; full extraction (replacing harness orchestration with direct stage calls + deleting moved helpers) is a follow-up multi-PR project. Co-authored-by: Isaac

… update

Decorate StageHandler Protocol with @runtime_checkable so isinstance checks become valid. Adds 3 unit tests pinning the behavior: - _is_runtime_protocol attribute is True - isinstance accepts an object with execute() - isinstance rejects an object without execute() Phase G-lite Task 1. Co-authored-by: Isaac
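
The effect of `@runtime_checkable` can be illustrated with a simplified, non-generic Protocol. This is a sketch, not the project's `stages/_protocol.py`; the `execute(ctx, inp)` parameter names and the two toy classes are assumptions.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class StageHandler(Protocol):
    """Simplified stand-in: the only required member is execute()."""

    def execute(self, ctx: Any, inp: Any) -> Any: ...


class WithExecute:
    def execute(self, ctx: Any, inp: Any) -> Any:
        return inp


class WithoutExecute:
    pass


# isinstance only checks that the member exists, not its signature.
accepts = isinstance(WithExecute(), StageHandler)
rejects = isinstance(WithoutExecute(), StageHandler)
```

Without the decorator, both `isinstance` calls would raise `TypeError` instead of returning a boolean.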

…e G-lite) Each stage module now exposes both: - The human-readable named verb (evaluate_post_patch, collect, form, select, generate, filter, apply, decide, update) — preserved for harness call sites. - A uniform ``execute`` module-level alias — what the registry, conformance test, and Phase H's capture decorator will import. Conformance test (tests/unit/test_stage_conformance.py) verifies all 9 stage modules expose both surfaces with identical callable identity. Phase G-lite Task 2. Co-authored-by: Isaac

Closes the F1 weak point (eval_kwargs: dict[str, Any]) with a typed TypedDict mirroring evaluation.run_evaluation's 25-parameter signature. total=False keeps every key optional so the harness can construct partial kwargs dicts as it does today. Re-exported from stages/__init__.py alongside StageContext and StageHandler. Phase G-lite Task 3. Co-authored-by: Isaac
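
A minimal sketch of the `total=False` pattern is shown below. The three keys and the echoing `run_evaluation` stand-in are hypothetical; per the commit message, the real TypedDict mirrors all 25 parameters of `evaluation.run_evaluation`.

```python
from typing import Any, TypedDict


class RunEvaluationKwargs(TypedDict, total=False):
    # Hypothetical subset of keys, for illustration only.
    space_id: str
    benchmark_ids: list[str]
    max_workers: int


def run_evaluation(**kwargs: Any) -> dict[str, Any]:
    """Stand-in for evaluation.run_evaluation; echoes the received keys."""
    return {"received": sorted(kwargs)}


# total=False makes every key optional, so a partial dict type-checks
# and can be filled in incrementally before being splatted.
partial: RunEvaluationKwargs = {"space_id": "demo-space"}
partial["max_workers"] = 4
result = run_evaluation(**partial)
```

At runtime a TypedDict is a plain `dict`, so the annotation change carries no behavioral risk; the benefit is that a type checker can flag misspelled or mistyped keys.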

…gs (Phase G-lite) Replace dict[str, Any] annotation on eval_kwargs at the F1 sites with the typed RunEvaluationKwargs TypedDict. Annotation-only — no runtime behavior change. F1 byte-stability replay test confirms. Sites updated: - stages/evaluation.py: 4 annotations (_run_full_evaluation, evaluate_baseline, evaluate_post_patch, _evaluate) - harness.py: 1 annotation on _eval_kwargs_full at the F1 wire-up Phase G-lite Task 4. Co-authored-by: Isaac

stages/_registry.py exports STAGES: tuple[StageEntry, ...] in canonical 9-stage process order, with each entry carrying (stage_key, module, execute). get_stage(stage_key) provides keyed lookup; raises KeyError on unknown keys. The registry is the single source of truth for "what stages exist and in what order" until Phase H promotes the keys to run_output_contract.PROCESS_STAGE_ORDER. Phase G-lite Task 5. Co-authored-by: Isaac
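
The registry shape might look roughly like this. The stage keys below reuse the stage module names listed elsewhere in this PR, which is an assumption about the actual key strings. The `module` field is a dotted name here rather than the module object, and `_noop_execute` is a placeholder for each module's real `execute`.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class StageEntry:
    stage_key: str
    module: str  # dotted name here; the real entry carries the module itself
    execute: Callable[..., Any]


def _noop_execute(ctx: Any, inp: Any) -> Any:
    """Placeholder for a stage module's execute alias."""
    return inp


# Assumed keys, in the 9-stage process order (evaluation -> learning).
_STAGE_MODULES = (
    "evaluation",
    "rca_evidence",
    "clustering",
    "action_groups",
    "proposals",
    "gates",
    "application",
    "acceptance",
    "learning",
)

STAGES: tuple[StageEntry, ...] = tuple(
    StageEntry(stage_key=name, module=f"stages.{name}", execute=_noop_execute)
    for name in _STAGE_MODULES
)

_BY_KEY = {entry.stage_key: entry for entry in STAGES}


def get_stage(stage_key: str) -> StageEntry:
    """Keyed lookup; raises KeyError on unknown keys."""
    return _BY_KEY[stage_key]
```

Keeping `STAGES` as an ordered tuple and deriving the lookup dict from it means order and membership cannot drift apart.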

…hase G-lite) Extends test_stage_conformance.py with three new assertions per the plan's Task 6: - STAGE_KEY constant on each stage module matches the canonical 9-stage process key. - Each stage module satisfies isinstance(module, StageHandler) (Protocol's @runtime_checkable contract). - The STAGES registry entries' stage_keys agree with the modules' STAGE_KEY constants (no drift). Also narrows StageHandler Protocol to only require execute() (the runtime-checkable check). The earlier draft included stage_key/decision_producer ClassVar declarations, which made isinstance(module, StageHandler) fail because modules expose STAGE_KEY (uppercase) per the plan's pin. STAGE_KEY validity is checked separately by the conformance test, matching the plan's documented "ClassVar checked via hasattr + value validation" strategy. Phase G-lite Task 6. Co-authored-by: Isaac

Adds a no-resurrection guard for ag_outcome.py / post_eval.py. Plan deviation: the G-lite plan assumed F8 deleted both modules. My F8 execution explicitly deferred the deletion (modules still exist as shims). The test is marked xfail(strict=True) so it acts as a signal flag — activates automatically when a follow-up actually deletes the modules, prompting the operator to remove the xfail marker and turn the test into a real no-resurrection guard. Phase G-lite Task 7. Co-authored-by: Isaac

Replace the old Phase G section (full freeze + mypy strict scope, ~7-10 days) with the G-lite section (Stage Protocol + registry + RunEvaluationKwargs, ~1-2 days). Updates the at-a-glance table, post-merge calendar estimate (~3-5 weeks → ~2-3.5 weeks), Real-Genie cost summary prose, and cross-references row. Documents what's deliberately out of scope for G-lite and why the full-freeze scope was deferred (replay byte-stability already catches behavioral regressions; freezing carries non-trivial breakage risk; mypy --strict has permanent maintenance tax for marginal benefit on a probabilistic codebase). Phase G-lite Task 8. Co-authored-by: Isaac

…ndle (Phase H pre-step) Drops process-order numbering from the F2 stage's output dataclass to match the sibling stage convention (ClusterFindings, ProposalSlate, GateOutcome, AppliedPatchSet, LearningUpdate): natural noun for the role, no Stage<N> prefix. "Bundle" is the canonical name for typed containers of per-qid records and stays distinct from the existing rca.RcaEvidence (singular evidence atom). Phase H Task 3 declares OUTPUT_CLASS on every stage module; this rename makes the rca_evidence module's OUTPUT_CLASS consistent with the rest of the registry instead of leaking process-order numbering into the artifact contract. Replay byte-stability gate is green (relocation-only refactor: same emitted behavior, the dataclass body is identical, only the symbol name changes). 3148 unit + replay tests pass; only the 2 known pre-existing failures remain. Co-authored-by: Isaac

Adds the canonical Run Output Contract module: GSO_BUNDLE_ROOT, RunRole enum, ProcessStage dataclass, the 11-entry PROCESS_STAGE_ORDER (9 executable stages + Stage 1/8 split + contract_health meta), and the path builders (iteration_bundle_prefix, stage_artifact_paths, bundle_artifact_paths). The module is import-pure — no MLflow, Spark, or Databricks SDK — so that the transcript renderer, bundle assembler, evidence_bundle, mlflow_audit, and gso-postmortem skill all share one source of truth for vocabulary and paths. Phase H Task 2 (next) will lock the STAGES ⊆ PROCESS_STAGE_ORDER rule in a registry-reconciliation test. Co-authored-by: Isaac

…ation (Phase H T2) Conformance test enforcing that every G-lite STAGES key appears in PROCESS_STAGE_ORDER in the same relative order, and that transcript-only keys (post_patch_evaluation, contract_health) are explicitly documented. Without this rail, future drift between the executable registry and the bundle's transcript ordering would silently break downstream postmortem tooling. Co-authored-by: Isaac
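
The reconciliation rule amounts to an ordered-subsequence check. The sketch below uses shortened, illustrative key tuples. Only `post_patch_evaluation` and `contract_health` are named in the commit message as transcript-only keys, and their positions here are assumed.

```python
from typing import Iterable, Sequence


def is_ordered_subsequence(keys: Iterable[str], order: Sequence[str]) -> bool:
    """True if every key appears in `order`, in the same relative order."""
    remaining = iter(order)
    # `key in remaining` advances the iterator, so a later key can only
    # match positions after the previous match.
    return all(key in remaining for key in keys)


# Illustrative, shortened key sets (positions are assumptions).
stage_keys = ("evaluation", "clustering", "learning")
process_stage_order = (
    "evaluation",
    "post_patch_evaluation",
    "clustering",
    "learning",
    "contract_health",
)

in_sync = is_ordered_subsequence(stage_keys, process_stage_order)
reordered = is_ordered_subsequence(("clustering", "evaluation"), process_stage_order)
transcript_only = [k for k in process_stage_order if k not in stage_keys]
```

The `transcript_only` list is what a test could compare against an explicitly documented allow-list, so that a new transcript-only key cannot appear unnoticed.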

…ase H T3) Adds explicit INPUT_CLASS / OUTPUT_CLASS module-level declarations on every stage module (evaluation, rca_evidence, clustering, action_groups, proposals, gates, application, acceptance, learning). The Phase H I/O capture decorator imports these to serialize each stage's typed input and output to MLflow without relying on fragile annotation inspection of the loosely-typed ``ctx`` parameter. Also extends StageEntry to carry input_class + output_class and populates them from the per-module declarations, plus a registry test asserting the new fields are real types and identical to the module declarations. For rca_evidence the OUTPUT_CLASS is RcaEvidenceBundle (renamed from Stage2Evidence in the prior commit) so the registry stays consistent with the sibling stages' natural-noun output names. 3167 unit + replay tests pass; only the 2 known pre-existing failures remain. Co-authored-by: Isaac

…hase H T4) Parametrized test that constructs minimal instances of every stage's INPUT_CLASS and OUTPUT_CLASS, runs them through dataclasses.asdict + json.dumps, and asserts the round-trip succeeds. 18 cases (9 stages × {input, output}) all pass. The serializer mirrors what the Phase H capture decorator will use in production: ``default=lambda v: list(v) if isinstance(v, (set, frozenset)) else str(v)`` so set-typed fields (e.g. forbidden signatures, do-not-retry sets) become sorted-friendly lists in the bundle without forcing every stage to switch its in-memory representation. Postmortem readers don't need set semantics. Co-authored-by: Isaac
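
The round-trip described above can be reproduced with a toy dataclass. `ToyLearningInput` and its fields are hypothetical. The `default=` callable is the one quoted in the commit message.

```python
import dataclasses
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyLearningInput:
    """Hypothetical stand-in for a stage INPUT_CLASS."""

    iteration: int = 0
    do_not_retry: set = field(default_factory=set)
    bundle_dir: Path = Path("iter_01")


def _default(v):
    # Sets become lists; anything else json cannot encode falls back to str().
    return list(v) if isinstance(v, (set, frozenset)) else str(v)


instance = ToyLearningInput(iteration=3, do_not_retry={"sig-a"})
payload = json.dumps(dataclasses.asdict(instance), default=_default, sort_keys=True)
decoded = json.loads(payload)
```

Note that `list(v)` preserves whatever iteration order the set happens to have, so multi-element sets would need an explicit sort if byte-stable output is required.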

Adds wrap_with_io_capture(execute, stage_key) which: * serializes the stage's typed input via dataclasses.asdict + json.dumps and writes it as iter_NN/stages/<NN>_<key>/input.json on the MLflow anchor run; * hooks ctx.decision_emit so decisions emitted while the wrapped execute runs are captured for decisions.json without breaking pass-through to the original emit callback; * runs the wrapped execute, serializes the output, writes output.json + decisions.json; * returns the output unchanged. The decorator NEVER raises. MLflow log_text failures are caught and warned — diagnostic capture must never break the optimizer. If ctx.mlflow_anchor_run_id is None (e.g. replay tests), logging is silently skipped. Set fields are normalized to sorted lists for deterministic bundle JSON; other opaque objects fall back to str() via json.dumps default. Co-authored-by: Isaac
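
The wrapper's behavior can be sketched without MLflow by injecting the logging callable. Everything below is an assumption-laden illustration: `ToyContext`, the extra `log_text` parameter, and the simplified `stages/<key>/...` artifact paths are hypothetical (the commit describes `iter_NN/stages/<NN>_<key>/...` on the anchor run). The sketch also reads "never raises" as applying to capture failures only, so errors from the wrapped `execute` still propagate.

```python
import dataclasses
import json
import warnings
from dataclasses import dataclass
from typing import Any, Callable, Optional


def _to_json(obj: Any) -> str:
    def default(v: Any) -> Any:
        # Sets become sorted lists for deterministic output; else str().
        return sorted(v, key=str) if isinstance(v, (set, frozenset)) else str(v)

    data = dataclasses.asdict(obj) if dataclasses.is_dataclass(obj) else obj
    return json.dumps(data, default=default, sort_keys=True)


@dataclass
class ToyContext:
    decision_emit: Callable[[dict], None]
    mlflow_anchor_run_id: Optional[str] = None


def wrap_with_io_capture(execute, stage_key, log_text):
    """log_text(run_id, text, path) is an injected stand-in for MLflow logging."""

    def wrapped(ctx, inp):
        captured = []
        original_emit = ctx.decision_emit

        def tee(record):
            captured.append(record)
            original_emit(record)  # pass-through to the original callback

        def safe_log(name, render):
            if ctx.mlflow_anchor_run_id is None:
                return  # e.g. replay tests: skip silently
            try:
                log_text(ctx.mlflow_anchor_run_id, render(), f"stages/{stage_key}/{name}")
            except Exception as exc:  # capture must never break the optimizer
                warnings.warn(f"io capture failed for {stage_key}/{name}: {exc}")

        safe_log("input.json", lambda: _to_json(inp))
        ctx.decision_emit = tee
        try:
            out = execute(ctx, inp)
        finally:
            ctx.decision_emit = original_emit
        safe_log("output.json", lambda: _to_json(out))
        safe_log("decisions.json", lambda: _to_json(captured))
        return out

    return wrapped


@dataclass
class ToyInput:
    qids: set


def toy_execute(ctx, inp):
    ctx.decision_emit({"type": "DEMO"})
    return {"count": len(inp.qids)}


emitted: list = []
artifacts: dict = {}
ctx = ToyContext(decision_emit=emitted.append, mlflow_anchor_run_id="run-1")
wrapped = wrap_with_io_capture(
    toy_execute, "toy", lambda run_id, text, path: artifacts.__setitem__(path, text)
)
output = wrapped(ctx, ToyInput(qids={"b", "a"}))
```

Rendering the JSON inside `safe_log` means serialization errors are swallowed along with logging errors, so an unserializable field degrades the bundle rather than the run.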

…es I8) Adds select_plateau_currently_failing pure helper in harness.py that picks candidate_eval_failing (default) or journey_ledger_hard_qids (when the most recent acceptance was a rollback). Wires the helper at the plateau call site at harness.py:13524 with last_acceptance_was_rollback=False as a safe default; the existing _current_hard_qids_raw is sourced from load_latest_state_iteration (committed/journey-aligned), so the helper is byte-stable in this codepath. Adds GSO_PLATEAU_INPUT_SOURCE_V1 marker (emitter + marker_parser registration) so future runs surface the source selection. Adds GSO_PLATEAU_INPUT_USES_JOURNEY_AFTER_ROLLBACK default-on flag. The helper is the load-bearing contract for invariant I8; the harness wiring uses safe defaults so the change is replay byte-stable. Cycle 12 can wire a real rollback-flag accumulator into the helper. Co-authored-by: Isaac
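
A pure helper of this kind might look like the sketch below. The function name, the two source names, and the rollback flag come from the commit message. The return shape (source label plus sorted qids) and the `uses_journey_after_rollback` parameter standing in for the `GSO_PLATEAU_INPUT_USES_JOURNEY_AFTER_ROLLBACK` flag are assumptions.

```python
from typing import Iterable


def select_plateau_currently_failing(
    candidate_eval_failing: Iterable[str],
    journey_ledger_hard_qids: Iterable[str],
    *,
    last_acceptance_was_rollback: bool = False,
    uses_journey_after_rollback: bool = True,
) -> tuple[str, list[str]]:
    """Pick which qid set feeds the plateau check, and report the source.

    After a rollback the candidate eval reflects a discarded state, so the
    journey ledger is preferred when the flag is on.
    """
    if last_acceptance_was_rollback and uses_journey_after_rollback:
        return "journey_ledger_hard_qids", sorted(journey_ledger_hard_qids)
    return "candidate_eval_failing", sorted(candidate_eval_failing)


default_source, default_qids = select_plateau_currently_failing(
    {"gs_024", "gs_003"}, {"gs_024"}
)
rollback_source, rollback_qids = select_plateau_currently_failing(
    {"gs_024", "gs_003"}, {"gs_024"}, last_acceptance_was_rollback=True
)
```

Returning the source label alongside the qids gives a marker emitter something to log without re-deriving the decision.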

Documents the manual 1-hour probe to be executed before Cycle 12 scoping. The probe answers whether hand-crafted L6 SQL-snippet patches can flip the airline gs_024 / 7NOW gs_026 target qids on their committed regression spaces. The decision rule directs future-cycle investment to either deterministic L6 synthesis, space-config / sample-question layers, or the upstream propagation bug, depending on the probe result. Cycle 11 ships independently of the probe result; the probe informs Cycle 12 scope only. Also adds !docs/runid_analysis to .gitignore so run-analysis artifacts in that directory can be committed. Co-authored-by: Isaac

Improves _fixture_to_evidence projection to derive iteration-level ags / applied_patches / acceptance_decision / open_hard_cluster_ids / rca_cards_present / selected_ag_id / proposal_count from the fixture's decision_records and strategist_response, so the 8-invariant suite sees what the postmortems documented. Runs the suite over the airline + 7NOW fixtures: Case B — I4 fires on both. Airline: same_body_fingerprints_after_rollback on AG_DECOMPOSED_H004 (iter 1→2). 7NOW: consecutive_empty_proposals_same_ag on AG1 (iters 2→3, 3→4, 4→5). I1–I3, I5–I8 all green. Failing I4 names Cycle 12 scope: DOA guard for same-body retry and zero-proposal spin. Appends Cycle 11 row to the iteration ledger. Co-authored-by: Isaac

…oop.py for production The invariant suite ships in warn-and-degrade mode for production: typed INVARIANT_VIOLATION decision records land in the Phase B trace but no AssertionError is raised, so a violation cannot block a customer run. The default-on strict mode in common/config.py is preserved for CI and replay tests; production explicitly opts out via setdefault on the GSO Job's lever_loop notebook entry point. CI / replay / local debugging can still set GSO_LOOP_INVARIANTS_STRICT=1 explicitly to enforce strict raises. Co-authored-by: Isaac
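
The opt-out relies on `setdefault` leaving an explicitly set value alone. The sketch below uses plain dicts in place of `os.environ` so it has no side effects. The `"0"`/`"1"` encoding and the `invariants_strict` reader are assumptions; only the flag name and the default-on behavior come from the commit message.

```python
from typing import MutableMapping

FLAG = "GSO_LOOP_INVARIANTS_STRICT"


def invariants_strict(environ: MutableMapping[str, str]) -> bool:
    """Hypothetical reader: strict mode is default-on when the flag is unset."""
    return environ.get(FLAG, "1") == "1"


def production_entry_point(environ: MutableMapping[str, str]) -> None:
    """Opt out of strict raises unless the operator already chose a value."""
    environ.setdefault(FLAG, "0")


ci_env: dict = {}                       # CI / replay: flag unset
prod_env: dict = {}                     # production job: entry point runs
debug_env: dict = {FLAG: "1"}           # operator forces strict mode

production_entry_point(prod_env)
production_entry_point(debug_env)       # setdefault leaves "1" untouched

modes = {
    "ci": invariants_strict(ci_env),
    "prod": invariants_strict(prod_env),
    "debug": invariants_strict(debug_env),
}
```

In the real entry point the same call would be made on `os.environ` before the optimizer reads its configuration.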

…cceptance stage

Cycle 11's typed PRODUCER_EXCEPTION decision record from run 80532762433063 (space 7now) surfaced the actual root cause of the optimizer's silent acceptance failures, and likely the upstream cause of empty target buckets, repeated empty proposals, ambiguous target accounting, and phase_b.total_records=0 across recent runs.

The acceptance stage in `_run_lever_loop` referenced three names that are only assigned in the sibling `_run_gate_checks` eval helper and live nowhere in `_run_lever_loop`'s scope:

- full_pre_arbiter_accuracy (the named bug, fallback at the candidate-pre-arbiter computation; raised every iteration where `gate_result.full_pre_arbiter_accuracy` was None)
- _best_pre_arbiter (latent at AcceptanceInput, would have raised immediately after the first fix landed)
- full_result_1 (latent at the post_rows fallback, would have raised whenever `gate_result.full_result.rows` was empty)

The producer try/except previously swallowed the NameError silently; Cycle 11's PRODUCER_EXCEPTION wiring now lands the exception class, repr, and traceback in the iteration's decision-record set, which is how this bug became diagnosable.

Fix:

- Extract a small pure helper `_candidate_pre_arbiter_from_gate` that mirrors the post-arbiter pattern used four lines above the bug (`float(gate_result.get("full_accuracy") or 0.0)`).
- Initialize an in-scope `_iter_best_pre_arbiter` next to `best_accuracy = prev_accuracy` (sourced from the canonical `_pre_arbiter/overall_accuracy` score with safe fallback) and roll it forward on acceptance next to `best_accuracy = full_accuracy`.
- Drop the impossible `full_result_1` fallback at the post_rows construction; `or []` matches the safe pattern at the accepted-baseline write site.

Audit (read-only grep of cross-scope name family inside `_run_lever_loop`, line ≥ 12435):

- full_pre_arbiter_accuracy: 1 hit (fixed)
- _best_pre_arbiter: 1 hit (fixed)
- full_result_1: 1 hit (fixed)
- full_scores / full_accuracy / full_result / new_model_id: all hits properly scoped, assigned from `gate_result["..."]` at :21233-21236 before any read.

0 additional cross-scope hits found beyond the three named.

Verification:

- tests/unit (3668 passed, 1 skipped, 3 xfailed): unchanged.
- tests/replay (51 passed, 8 skipped): gained 4 new test cases from the new pre-fix regression fixture; new `test_no_full_pre_arbiter_accuracy_nameerror_in_committed_fixtures` is green on the two existing fixtures and skipped on the pre-fix fixture (kept on disk for regression provenance).
- New unit test `tests/unit/test_harness_acceptance_pre_arbiter_scope.py` asserts the helper's fallback semantics (None / missing key / int coercion / None gate_result).

Byte-stability: paths that *did* execute (where `gate_result.full_pre_arbiter_accuracy is not None`) used the existing branch and are unchanged by the helper. The broken fallback was raising 100% of the time it was reached, so no fixture changes byte-for-byte from this fix.

Risk / rollback: no flags introduced; revert is a single commit.
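
To make the failure mode concrete, here is a minimal reproduction of the cross-scope read next to a helper with the fallback semantics the unit test describes (None, missing key, int coercion, None gate_result). The bodies are a sketch written for this note; `gate_checks_sketch` and `broken_acceptance` are hypothetical stand-ins, and the real `_candidate_pre_arbiter_from_gate` may differ.

```python
from typing import Any, Mapping, Optional


def gate_checks_sketch() -> dict:
    # This name is local to gate_checks_sketch and invisible to other functions.
    full_pre_arbiter_accuracy = 0.75
    return {"full_accuracy": 0.8, "full_pre_arbiter_accuracy": None}


def broken_acceptance(gate_result: Mapping[str, Any]) -> float:
    value = gate_result.get("full_pre_arbiter_accuracy")
    if value is None:
        # Never assigned in this function or at module level: NameError at runtime.
        value = full_pre_arbiter_accuracy  # noqa: F821
    return float(value)


def candidate_pre_arbiter_from_gate(gate_result: Optional[Mapping[str, Any]]) -> float:
    """Safe replacement: always returns a float, falling back to 0.0."""
    if not isinstance(gate_result, Mapping):
        return 0.0
    return float(gate_result.get("full_pre_arbiter_accuracy") or 0.0)


try:
    broken_acceptance(gate_checks_sketch())
    caught = None
except NameError as exc:
    caught = type(exc).__name__

safe_value = candidate_pre_arbiter_from_gate(gate_checks_sketch())
```

The broken version only fails when the fallback branch is reached, which is consistent with the commit's observation that paths where the gate result carried a value were unaffected.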

…cle 11 iteration-end emitters crash-resilient

Cycle 11's typed PRODUCER_EXCEPTION decision record fired again on run 40405156883710 (airline, parent run 1099b152-8655-4f1e-ab43-1240a9400280), this time naming UnboundLocalError on `full_accuracy` at the F9 plateau-termination call site (`_run_lever_loop`, line :13779 post-Bug-A-fix). The root cause is the same family as Bug A: a variable assigned only in the acceptance branch is read on a code path that never reaches the assignment, and Python's compile-time scoping promotes it to a local that stays unbound on rollback-only plateaus.

Bug B fix
---------

* New pure helper `_f9_accuracy_delta_safe(gate_result, best_accuracy)` next to `_candidate_pre_arbiter_from_gate` (Bug A's helper). Returns `gate_result.full_accuracy - best_accuracy` when present, else `0.0`.
* Replaced `accuracy_delta=float(full_accuracy - best_accuracy)` at the F9 LearningInput call site with `_f9_accuracy_delta_safe(locals().get("gate_result"), best_accuracy)`. The `locals().get` keeps the helper pure across iteration boundaries.
* New unit test `tests/unit/test_harness_f9_accuracy_delta_scope.py` (4 cases) covers the rollback-only plateau, the well-formed gate_result, the float-coercion contract, and the non-dict defensive path.

Emitter resilience (the structural follow-up)
---------------------------------------------

The same run also exposed that Cycle 11's invariant runner, the plateau-input-source marker, and the manifest path validator can all be bypassed when a producer exception cascades through the iteration body. The runner sat after `_finalize_iteration_summary` in the iteration's happy-path block, so any iteration that hit a `continue`/`break` from inside an absorber missed it entirely: the very iterations the runner exists to surface were the iterations where it never ran.

* Extracted the runner into a sibling helper `_run_iteration_invariants_and_append_records`, called from inside `_finalize_iteration_summary` *before* the trace/summary stamp so any emitted INVARIANT_VIOLATION records appear in `iter_traces[iteration]` and the rendered `decision_record_count`.
* Threaded `run_id` and `iter_producer_exceptions` through the 11 existing `_finalize_iteration_summary` call sites in `_run_lever_loop` (one per `exit_path` label) plus the iteration-end `exit_path="completed"` site.
* Wrapped the `for _iter_num in range(...)` body in `try/finally` so an uncaught exception always falls back to a finalize call with `exit_path="exception"`. Per-iteration dedup is via a private `_finalized_this_iter` flag stored on the iteration's `current_iter_inputs` dict (filtered out by `journey_fixture_exporter._strip_iteration` so it never leaks into fixtures). The fallback is purely additive: explicit finalize sites remain the source of truth for the `exit_path` label.
* Updated 3 brittle source-introspection tests (test_patch_applyability, test_phase_b_observability_wiring, test_question_journey_rendering) to match the +4-space iteration-body indentation introduced by the wrap.

Audit findings (Tasks 3 and 7)
------------------------------

* Bug B siblings (Task 3): `rg -n '\\bfull_(accuracy|scores|result)\\b|\\bnew_model_id\\b'` confirms 0 additional cross-scope reads in `_run_lever_loop`. `full_scores`/`full_result`/`new_model_id` are read only after their own self-assignments inside the acceptance branch (post `:21263+`), so they are safe.
* Plateau-input-source emit (Task 7): wired at `harness.py:13764-13781` with its own try/except, now lives inside the new outer iteration try/finally, so it is crash-resilient. The previous run's empty `plateau_input_source: []` marker was a downstream effect of Bug B aborting the iteration before reaching this emit; the fix restores reach. Marker parser/emit semantics unchanged.
* Phase H manifest validator (Task 7): wired in the post-loop block via `validate_phase_h_manifest_paths`. Reachable; runs once per function exit. Previous run's empty `missing_pieces` likely reflects either (a) `phase_h_manifest_strict_validation_enabled()` off, or (b) empty `_phase_h_anchor_run_id`. Both deserve a Cycle 12 follow-up but neither is a wiring bug.

Regression provenance
---------------------

Committed `tests/replay/fixtures/run_40405156883710_airline_pre_bugb_fix.json` as the on-disk pre-Bug-B-fix marker. Added `test_no_full_accuracy_unbound_local_error_in_committed_fixtures` over all fixtures; the pre-fix fixture is opted into `_PRE_BUGB_FIX_FIXTURES` and skipped, every other fixture must be free of the `full_accuracy` `UnboundLocalError` family. The same fixture also predates the Bug A NameError fix's deployment, so it is also added to `_PRE_NAMEERROR_FIX_FIXTURES`.

Test results
------------

`uv run pytest tests/unit tests/replay -q`: 3728 passed, 11 skipped, 3 xfailed. Same shape as `main` plus 4 new unit tests (Task 1) and 1 new replay-fixture regression assertion (Task 8); 2 new skips for the pre-Bug-B fixture against the post-fix assertions (intentional).
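
The try/finally fallback with per-iteration dedup can be sketched in isolation. This is a toy loop, not the harness: `finalize_iteration`, `run_loop`, and the `"skipped"` label are hypothetical, and only the `_finalized_this_iter` flag and the `"completed"` / `"exception"` labels are taken from the commit text.

```python
from typing import Callable, List, Tuple

Summary = Tuple[int, str]


def finalize_iteration(iter_inputs: dict, summaries: List[Summary], exit_path: str) -> None:
    """Record one summary per iteration; later calls for the same iteration are no-ops."""
    if iter_inputs.get("_finalized_this_iter"):
        return
    iter_inputs["_finalized_this_iter"] = True
    summaries.append((iter_inputs["iteration"], exit_path))


def run_loop(bodies: List[Callable[[], str]], summaries: List[Summary]) -> None:
    for iteration, body in enumerate(bodies, start=1):
        iter_inputs = {"iteration": iteration}
        try:
            outcome = body()
            if outcome == "skip":
                # Explicit finalize sites keep ownership of the exit_path label.
                finalize_iteration(iter_inputs, summaries, "skipped")
                continue
            finalize_iteration(iter_inputs, summaries, "completed")
        finally:
            # Additive fallback: only takes effect if nothing finalized this iteration.
            finalize_iteration(iter_inputs, summaries, "exception")


def _boom() -> str:
    raise RuntimeError("producer exception")


summaries: List[Summary] = []
try:
    run_loop([lambda: "ok", lambda: "skip", _boom], summaries)
except RuntimeError:
    pass  # the exception still propagates; the summary was recorded first
```

The `finally` clause also runs on `continue`, which is why the dedup flag is needed: without it, every early-exit iteration would be finalized twice.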

…ver_loop) + add structural AST audit lint

Cycle 11's typed PRODUCER_EXCEPTION decision record fired again on run 476499410793687 (7now, parent run 3b050ec5-4032-457f-a785-2d1a3942a097), this time naming a NameError on `_baseline_rows_for_control_plane` at the rollback-side AcceptanceInput call site:

    NameError("name '_baseline_rows_for_control_plane' is not defined")
      File ".../harness.py", line 20746, in _run_lever_loop
        pre_rows=tuple(_baseline_rows_for_control_plane or []),

This is the third sibling in the same family as Bug A (full_pre_arbiter_accuracy, commit 2013fd3) and Bug B (full_accuracy, commit 7f538a4): a name assigned only inside a sibling helper (_run_gate_checks) is read inside _run_lever_loop with no local assignment and no closure relationship. Python compiles the read as a free-variable lookup that fails LEGB at runtime.

Bug C fix (call-site)
---------------------

* New pure helper `_baseline_rows_for_acceptance_input` next to `_candidate_pre_arbiter_from_gate` (Bug A) and `_f9_accuracy_delta_safe` (Bug B). Returns `tuple(accepted_baseline_rows or [])`.
* Replaced `pre_rows=tuple(_baseline_rows_for_control_plane or [])` at harness.py:20746 with a helper call sourced from _run_lever_loop's own `_accepted_baseline_rows_for_control_plane` local, the architecturally correct source already passed into the gate at :20535.
* New unit test `tests/unit/test_harness_baseline_rows_acceptance_input_scope.py` (3 cases) covers None/empty fallback, well-formed list of rows, and no input mutation.

AST audit lint (the structural follow-up)
-----------------------------------------

* New unit test `tests/unit/test_harness_no_inner_helper_leaks.py` parses harness.py once, finds the `_run_lever_loop` FunctionDef, and asserts every `Name(ctx=Load)` in the function's top scope resolves to one of: (a) parameter, (b) name assigned in _run_lever_loop's top scope, (c) module-level binding, (d) Python builtin, or (e) explicit `_KNOWN_DEFENDED_DEAD_CODE_LEAKS` allow-list entry. Anything else is, by definition, an inner-helper variable leak.
* The lint walker does NOT descend into nested FunctionDef / AsyncFunctionDef / Lambda / ClassDef / comprehension scopes; those are independent and may legitimately close over _run_lever_loop's locals.
* Allow-list shrinkage is enforced too: the lint asserts that every entry in `_KNOWN_DEFENDED_DEAD_CODE_LEAKS` still corresponds to a current leak, so a Cycle 13 cleanup that removes a defended-dead-code branch automatically forces the allow-list update.

Bugs surfaced by the lint and fixed in this same commit
-------------------------------------------------------

The first lint run revealed 7 leaks. Two were undefended (production-blocking) and fixed in this commit:

Bug D: `MIN_POST_ARBITER_GAIN_PP` at harness.py:20780. Imported only inside _run_gate_checks at :11871. Would NameError on the rollback-side AcceptanceInput path immediately after the Bug C fix unmasked it. Fix: added to the module-level `from genie_space_optimizer.common.config import (...)` block.

Bug E: `_audit_emit` at harness.py:17819 and :19422. Defined only as an inner closure of _run_gate_checks at :11214. The :17819 call is wrapped in a try/except that swallowed the NameError silently, losing audit; the :19422 call is undefended and would NameError every time the no_causal_applyable_patch branch was taken (now caught by the outer try/finally from commit 7f538a4 and finalized as exit_path="exception"). Fix: replaced both `_audit_emit(...)` calls with `logger.debug(...)` preserving the audit information at log level. TODO(cycle-13) marker points to rehoming the audit emitter.

Allow-listed (defended dead code, no behaviour change)
------------------------------------------------------

Five names are read inside _run_lever_loop only via `if "name" in locals()` / `if "name" in dir()` guards whose True branch is dead. Python's compile-time scoping makes them free variables, so the guard always falls through to the fallback. They are technically leaks but cannot crash:

- _candidate_clusters_for_decision_trace
- _raw_proposals_for_ag
- _rca_evidence_bundle
- _rolled_back_content_fingerprints
- strategist_returned_ags

Cleanup of these (collapse to fallback unconditionally OR rehome via gate_result kwargs) is a Cycle 13 follow-up tracked by the allow-list in `tests/unit/test_harness_no_inner_helper_leaks.py`.

Regression provenance
---------------------

Committed `tests/replay/fixtures/run_476499410793687_7now_pre_bugc_fix.json` as the pre-Bug-C-fix on-disk marker. Added `_PRE_BUGC_FIX_FIXTURES` and `test_no_baseline_rows_for_control_plane_nameerror_in_committed_fixtures` mirroring the Bug A and Bug B regression assertions; the pre-fix fixture is opted in and skipped, every other fixture must be free of the `_baseline_rows_for_control_plane` NameError.

Test results
------------

`uv run pytest tests/unit tests/replay -q`: 3740 passed, 12 skipped, 3 xfailed. Same shape as main plus 3 new unit tests (Task 1) plus 1 new lint test (Task 3) plus 1 new fixture-driven regression assertion run across all fixtures (Task 5); 1 new skip for the pre-Bug-C fixture against its own assertion (intentional).
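
A lint of this kind can be approximated with the standard `ast` module. The sketch below is a simplified version written for illustration, not the test in `tests/unit/test_harness_no_inner_helper_leaks.py`: the function names are hypothetical, and it ignores edge cases such as `global` declarations, walrus targets inside comprehensions, and decorators or default values of nested functions.

```python
import ast
import builtins

# Nodes that open a new scope; the walker yields them but does not descend.
_NESTED_SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda, ast.ClassDef,
                  ast.ListComp, ast.SetComp, ast.DictComp, ast.GeneratorExp)


def _walk_top_scope(nodes):
    for node in nodes:
        yield node
        if not isinstance(node, _NESTED_SCOPES):
            yield from _walk_top_scope(ast.iter_child_nodes(node))


def _bound_names(body):
    """Names bound directly in this scope: assignments, defs, imports, except-as."""
    names = set()
    for node in _walk_top_scope(body):
        if isinstance(node, ast.Name) and isinstance(node.ctx, (ast.Store, ast.Del)):
            names.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            names.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            names.add(node.name)
    return names


def find_leaks(source, func_name, allow=frozenset()):
    """Return sorted names read in func_name's top scope that resolve nowhere."""
    module = ast.parse(source)
    func = next(n for n in module.body
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) and n.name == func_name)
    a = func.args
    params = {arg.arg for arg in a.posonlyargs + a.args + a.kwonlyargs}
    params |= {x.arg for x in (a.vararg, a.kwarg) if x is not None}
    known = (params | _bound_names(func.body) | _bound_names(module.body)
             | set(dir(builtins)) | set(allow))
    return sorted({n.id for n in _walk_top_scope(func.body)
                   if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
                   and n.id not in known})


def stale_allow_entries(source, func_name, allow):
    """Allow-list shrinkage check: entries that no longer correspond to a leak."""
    return sorted(set(allow) - set(find_leaks(source, func_name)))


SOURCE = '''
import os
LIMIT = 3

def helper():
    hidden = 1
    return hidden

def loop(items):
    total = 0
    for item in items:
        total += item
    return total + hidden + LIMIT + len(items) + os.getpid()
'''

leaks = find_leaks(SOURCE, "loop")
```

In the toy source, `hidden` is assigned only inside `helper`, so reading it from `loop` is reported, while `LIMIT`, `len`, and `os` resolve through module-level bindings and builtins.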

…arning record (15 tests)

decision_emitters.py
- AG outcome decision record carries full acceptance detail fields.

stages/acceptance.py
- Acceptance stage captures AcceptanceDetail into stage output.

rca_decision_trace.py / run_output_bundle.py
- Contract-health section renders from bundle missing_pieces + manifest.

operator_process_transcript.py
- Section 10 (contract health) wired from run_output_bundle.

harness.py
- Wire iteration learning record at end-of-iteration site.

Tests (15 passed):
- test_ag_outcome_decision_record_acceptance_detail
- test_contract_health_stage_renders
- test_iteration_learning_record
+ refreshed: test_harness_iteration_stamping, test_phase_h_overview_and_summary_builders, test_process_stage_order_matches_stages_registry, test_run_output_bundle

Add invariant_projection.project_iter_evidence which turns the live _current_iter_inputs dict into the shape invariants.run_invariants expects (clusters / ags / applied_patches / acceptance_decision / open_hard_cluster_ids / rca_cards_present / decision_records). Carries prior_iter_evidence forward so I4 (no silent retry) sees prev+curr in one evidence dict. Pure: no I/O, no mutation of inputs. Empty run_id short-circuits to the same no-op shape the harness used previously, preserving byte-stability for the legacy callers. The harness call-site swap lands in the next commit so this commit is behaviour-neutral. Co-authored-by: Isaac
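
The projector's contract (pure, no input mutation, prior evidence carried forward, empty run_id short-circuits) might look roughly like this. The evidence keys are those listed in the commit; the signature, the empty-dict no-op shape, and the `prior_iter_evidence` output key are assumptions made for this sketch.

```python
import copy
from typing import Any, Dict, Mapping, Optional

EVIDENCE_KEYS = (
    "clusters", "ags", "applied_patches", "acceptance_decision",
    "open_hard_cluster_ids", "rca_cards_present", "decision_records",
)


def project_iter_evidence(
    iter_inputs: Mapping[str, Any],
    *,
    run_id: str,
    prior_iter_evidence: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Project live iteration inputs into an invariant-evidence dict."""
    if not run_id:
        return {}  # assumed no-op shape for legacy callers
    # Deep copies keep the projector pure: callers can mutate the result freely.
    evidence = {key: copy.deepcopy(iter_inputs.get(key)) for key in EVIDENCE_KEYS}
    prior = None
    if prior_iter_evidence:
        # Carry only one step of history so the chain does not grow per iteration.
        prior = {k: copy.deepcopy(v) for k, v in prior_iter_evidence.items()
                 if k != "prior_iter_evidence"}
    evidence["prior_iter_evidence"] = prior
    return evidence


inputs = {"clusters": [{"id": "H001"}], "ags": ["AG1"]}
first = project_iter_evidence(inputs, run_id="r1")
second = project_iter_evidence(inputs, run_id="r1", prior_iter_evidence=first)
```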

Replace the empty _iter_evidence literal in _run_iteration_invariants_and_append_records with a call to the new project_iter_evidence projector. Thread prior_iter_evidence through _finalize_iteration_summary and the 12 _run_lever_loop call sites so I4 (no silent retry) sees prev+curr in one evidence dict. Behaviour: I2/I3/I4/I7 now have the data they need to fire when their preconditions hold. I1 already worked. I5/I6/I8 stay no-ops at the per-iteration level — they remain Phase H run-end signals and are projected as empty here. The runner stays pure-on-failure (logger.debug) and pure-on-flag-off (_inv_enabled() short-circuit). Strict mode still re-raises AssertionError exactly as before. Co-authored-by: Isaac

Snapshot run 809960554692716 (3b050ec5 attempt 4, latest pre-projection-fix evidence) as a permanent fixture. Add test_invariants_fire_on_run_809960554692716_pre_projection_fixture which projects the fixture through project_iter_evidence and asserts that I3 (acceptance buckets) or I7 (RCA grounding) fires. Before this commit the post-Cycle-11 invariant runner saw 0 violations on this run despite F2/F3/F5/F6 in the postmortem being textbook fires. After: the fixture surfaces named violations the operator can act on, which is the binary pass/fail signal Cycle 11 was designed to give. This unblocks P0 (narrow structural fallback) and P1 (DOA fingerprint) because future re-pilots will produce named invariant fires we can explicitly close. Co-authored-by: Isaac

I6 (manifest paths), I5 (replay validity), and I8 (journey ledger hard qids) all require run-end signals that the per-iteration runner does not have. Document these as deferred follow-ups so the next re-pilot post-projection-fix is the trigger to wire them up. Co-authored-by: Isaac

Diagnostic-only test. Pins narrow_replacement_diagnosis returning patch_type_lacks_where_predicate and build_narrow_l6_replacement returning None for the H002 add_sql_snippet_expression shape from run 809960554692716 (3b050ec5 attempt 4). This is the failing-baseline anchor: P0 Task 2A flips the second assertion (build returns a real patch). When that ships, this test's expectations move with it in the same commit. No production code change. Co-authored-by: Isaac

Gate the upcoming expression / measure narrow-replacement synthesizer behind a flag so existing canonical replay fixtures stay byte-stable. Default off; flipped on per-environment after the P0 re-pilot confirms the synthesizer produces gate-clearing narrow expressions on the run-809960554692716 fixture. Also: mutate the Task-1 diagnostic tests to be flag-aware (renamed *_when_flag_off, monkeypatch.delenv at start) so the post-Task-4A flag-on path is unblocked. Co-authored-by: Isaac

When GSO_L6_NARROW_REPLACEMENT_FOR_EXPRESSION is on, narrow_replacement_diagnosis now returns applicable=True for add_sql_snippet_expression and add_sql_snippet_measure patches that carry a non-empty sql_expression AND target qids, and build_narrow_l6_replacement emits a CASE-wrapped variant scoping the expression to the named QIDs. Closes the postmortem F2 / F3 finding from run 809960554692716: the H002 zone-VP expression patch was dropped at HCRF and the synthesizer returned None because expression patches lack a where_predicate. After this commit, the same drop produces a narrow CASE-wrapped variant that the harness re-tests through patch_blast_radius_is_safe. Flag default off → canonical replay byte-stability holds. Flag-on path proven by 3 newly-green unit tests; flag-off path preserved by the existing diagnostic tests (mutated to be flag-aware in Task 3A). Co-authored-by: Isaac

Mirror of narrow_not_applicable: when _run_narrow_l6_replacement_loop produces a survivor that clears patch_blast_radius_is_safe, emit a typed NarrowReplacementSynthesizedRecord into decision_records and a GSO_NARROW_REPLACEMENT_SYNTHESIZED_V1 marker into markers. Makes the P0 win observable: dashboards can now distinguish "narrow-replacement saved an iteration" from "no narrow-replacement was ever attempted" or "narrow-replacement declined as not applicable". Co-authored-by: Isaac

Snapshot the latest 3b050ec5 attempt 4 replay fixture as a permanent P0 anchor. Add tests that, with the flag on, the same HCRF-dropped H002 expression patch produces a narrow_replacement survivor with narrowing_strategy=expression_qid_scope; with the flag off, the same patch continues to produce no survivor (byte-stability). The tests skip gracefully when the fixture's drop list does not include an explicit per-iteration entry, falling back to the unit-level proofs from Task 4A. Co-authored-by: Isaac

…llback Pilot env, replay anchor, and binary success conditions for the GSO_L6_NARROW_REPLACEMENT_FOR_EXPRESSION flip. Co-authored-by: Isaac

prashsub requested a review from hiydavid May 6, 2026 12:44

prashsub added 29 commits May 6, 2026 08:17

docs(gso): burn-down roadmap — Phase F stages modularization progress…

1a8861f

… update

prashsub and others added 20 commits May 6, 2026 08:17

P0 Branch A: document re-pilot exit criteria for narrow expression fa…

059a4a3


Include rolled_back in optimization iteration queries

030c251

Fix IQ scanner warning thresholds

6a8ce8e

hiydavid force-pushed the fix/gso-optimizer-correctness-and-leakage branch from 7e6f475 to 6a8ce8e Compare May 6, 2026 16:45

prashsub wants to merge 867 commits into main from fix/gso-optimizer-correctness-and-leakage

prashsub commented May 6, 2026 • edited by hiydavid
