feat: override-learned trust calibration (#88, v1) #122
Open
harrymunro wants to merge 1 commit into
Replaces the binary admiralty-action-required model with an override-learning trust calibration store. After each mission, admiralty decisions (approved / modified / rejected) are aggregated per (task_type, ship_class) bucket; at plan-approved time, stderr advisories surface tasks whose history meets the sample threshold. Advisory-only — no station_tier mutation, all schema additions optional and backwards compatible. Adds the admiralty-decision and trust-report subcommands, threads an optional --task-type through the task command, and wires the new memory store into stand-down (incremental) and index (rebuild).
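The per-bucket aggregation described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function name `aggregate_decisions` and the dictionary event shape are assumptions, and only the field names `decision_type`, `task_type`, and `ship_class` come from the description.

```python
from collections import Counter, defaultdict


def aggregate_decisions(events):
    """Count admiralty decisions per (task_type, ship_class) bucket.

    Hypothetical helper: the real store layout may differ.
    """
    buckets = defaultdict(Counter)
    for event in events:
        task_type = event.get("task_type")
        if task_type is None:
            # Missions recorded without --task-type contribute nothing,
            # so the store stays empty for them.
            continue
        key = (task_type, event.get("ship_class"))
        buckets[key][event["decision_type"]] += 1
    # Convert to plain dicts so the result is easy to serialise or compare.
    return {key: dict(counts) for key, counts in buckets.items()}


# Example events (made-up data for illustration only).
events = [
    {"task_type": "auth_refactor", "ship_class": "frigate", "decision_type": "modified"},
    {"task_type": "auth_refactor", "ship_class": "frigate", "decision_type": "approved"},
    {"task_type": None, "ship_class": "frigate", "decision_type": "rejected"},
]
print(aggregate_decisions(events))
```

Keying on the tuple keeps the precise bucket separate from any coarser rollup, which could be derived from the same events by grouping on `task_type` alone.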
Summary
- `(task_type, ship_class)` trust calibration learned from admiralty decisions (approved / modified / rejected) rather than self-reported agent confidence.
- `nelson_data_calibration.py` module with incremental update at stand-down, full rebuild from `cmd_index`, and stderr advisory at `plan-approved` when the historical sample meets `MIN_DECISIONS_FOR_ADVISORY = 3`. Falls back from the bucket to the `by_task_type` rollup when the precise bucket is under-sampled.
- `nelson-data.py admiralty-decision` (writes an `admiralty_action_completed` event with `decision_type`, `task_type`, `ship_class`) and `nelson-data.py trust-report` (text + `--json`).
- `--task-type` flag on the existing `task` subcommand; backwards compatible: existing missions keep working with `task_type=null` and the calibration store stays empty.
- Advisory-only: no `station_tier` mutation, no significance gating. Auto-elevation, FM-synthesized prose, and Fisher's exact gating are explicitly out of scope (follow-up for issue #88, "Confidence-weighted dynamic trust calibration").

Test plan
- `pytest skills/nelson/scripts/test_nelson_data_calibration.py -v`: 17 new tests covering admiralty-decision CLI, store aggregation/idempotency/edge cases, plan-approved advisory threshold + rollup fallback, and trust-report text/JSON/filter.
- `pytest skills/nelson/scripts/`: full suite, 370 tests, all green.
- `--task-type auth_refactor` + `admiralty-decision --decision-type modified` populate the bucket; the 4th mission's `plan-approved` prints `Trust advisory: task 1 (auth_refactor on frigate) — historical override rate 100% (n=3). Consider raising station_tier.` on stderr.
- `index --rebuild` reconstructs the calibration store from existing missions.
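For reviewers, the advisory threshold and rollup fallback from the Summary could look roughly like the sketch below. It is written under assumptions rather than copied from the module: the function `advisory_stats`, the `"buckets"` key, and the store layout are hypothetical, and counting both `modified` and `rejected` as overrides is a guess consistent with the smoke test (three `modified` decisions giving a 100% rate). Only `MIN_DECISIONS_FOR_ADVISORY = 3` and the `by_task_type` name come from this PR.

```python
MIN_DECISIONS_FOR_ADVISORY = 3  # threshold value stated in the Summary


def advisory_stats(store, task_type, ship_class):
    """Return (source, override_rate, n) or None if history is too thin.

    Hypothetical helper: tries the precise (task_type, ship_class) bucket
    first, then falls back to the by_task_type rollup.
    """
    bucket = store.get("buckets", {}).get((task_type, ship_class), {})
    rollup = store.get("by_task_type", {}).get(task_type, {})
    for source, counts in (("bucket", bucket), ("by_task_type", rollup)):
        n = sum(counts.values())
        if n >= MIN_DECISIONS_FOR_ADVISORY:
            # Assumption: any non-approved decision counts as an override.
            overridden = counts.get("modified", 0) + counts.get("rejected", 0)
            return source, overridden / n, n
    return None


# Made-up store contents for illustration only.
sampled = {
    "buckets": {("auth_refactor", "frigate"): {"modified": 3}},
    "by_task_type": {"auth_refactor": {"modified": 3}},
}
sparse = {
    "buckets": {("auth_refactor", "frigate"): {"modified": 1}},
    "by_task_type": {"auth_refactor": {"approved": 2, "modified": 1, "rejected": 1}},
}
print(advisory_stats(sampled, "auth_refactor", "frigate"))
print(advisory_stats(sparse, "auth_refactor", "frigate"))
```

Returning `None` when neither level meets the threshold keeps the caller advisory-only: with no result there is nothing to print and nothing to mutate.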