feat: override-learned trust calibration (#88, v1) #122

Open
harrymunro wants to merge 1 commit into main from worktree-tidy-churning-swan

@harrymunro
Collaborator

Summary

  • Implements the override-learning alternative from issue #88 (Confidence-weighted dynamic trust calibration): per-(task_type, ship_class) trust calibration learned from admiralty decisions (approved / modified / rejected) rather than self-reported agent confidence.
  • New nelson_data_calibration.py module with incremental update at stand-down, full rebuild from cmd_index, and stderr advisory at plan-approved when the historical sample meets MIN_DECISIONS_FOR_ADVISORY = 3. Falls back from the bucket to the by_task_type rollup when the precise bucket is under-sampled.
  • New CLI: nelson-data.py admiralty-decision (writes admiralty_action_completed event with decision_type, task_type, ship_class) and nelson-data.py trust-report (text + --json).
  • Optional --task-type flag on the existing task subcommand; backwards compatible — existing missions keep working with task_type=null and the calibration store stays empty.
  • v1 is advisory-only: no station_tier mutation, no significance gating. Auto-elevation, FM-synthesized prose, and Fisher's exact gating are explicitly out of scope and left to a follow-up on issue #88 (Confidence-weighted dynamic trust calibration).
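
For reviewers, the bucket-then-rollup lookup described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in nelson_data_calibration.py: the store layout (`buckets`, `by_task_type`), the function names, and the assumption that both modified and rejected decisions count as overrides are hypothetical. Only MIN_DECISIONS_FOR_ADVISORY = 3 and the fallback order come from the description above.

```python
from typing import Dict, Optional, Tuple

MIN_DECISIONS_FOR_ADVISORY = 3  # threshold named in the PR summary

# Hypothetical shape: decision_type -> count, e.g. {"approved": 1, "modified": 2}
Counts = Dict[str, int]


def override_rate(counts: Counts) -> Tuple[float, int]:
    """Return (override rate, sample size).

    Assumption: "modified" and "rejected" both count as overrides.
    """
    n = sum(counts.values())
    overrides = counts.get("modified", 0) + counts.get("rejected", 0)
    return (overrides / n if n else 0.0), n


def lookup_advisory(
    store: dict, task_type: Optional[str], ship_class: str
) -> Optional[Tuple[str, float, int]]:
    """Return (source, rate, n) when enough history exists, else None."""
    if task_type is None:
        # Missions created without --task-type never produce an advisory.
        return None

    # 1. Try the precise (task_type, ship_class) bucket first.
    bucket = store.get("buckets", {}).get((task_type, ship_class))
    if bucket is not None:
        rate, n = override_rate(bucket)
        if n >= MIN_DECISIONS_FOR_ADVISORY:
            return ("bucket", rate, n)

    # 2. Fall back to the by_task_type rollup when the bucket is under-sampled.
    rollup = store.get("by_task_type", {}).get(task_type)
    if rollup is not None:
        rate, n = override_rate(rollup)
        if n >= MIN_DECISIONS_FOR_ADVISORY:
            return ("by_task_type", rate, n)

    return None
```

With three modified decisions recorded for (auth_refactor, frigate), this sketch returns `("bucket", 1.0, 3)`, which matches the 100% (n=3) figure in the smoke test below.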

Test plan

  • pytest skills/nelson/scripts/test_nelson_data_calibration.py -v — 17 new tests covering admiralty-decision CLI, store aggregation/idempotency/edge cases, plan-approved advisory threshold + rollup fallback, and trust-report text/JSON/filter.
  • pytest skills/nelson/scripts/ — full suite, 370 tests, all green.
  • End-to-end smoke: 3 missions with --task-type auth_refactor + admiralty-decision --decision-type modified populate the bucket; the 4th mission's plan-approved prints Trust advisory: task 1 (auth_refactor on frigate) — historical override rate 100% (n=3). Consider raising station_tier. on stderr.
  • index --rebuild reconstructs the calibration store from existing missions.
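
The idempotency and rebuild checks in the test plan amount to the property that applying missions one at a time and rebuilding from scratch give the same store. A minimal sketch of that property follows. The `applied_missions` list, the `type` key on events, and both function names are assumptions made for illustration; the PR does not say how idempotency is implemented.

```python
from typing import Dict, List


def empty_store() -> dict:
    # Hypothetical layout; "applied_missions" is an assumed idempotency guard.
    return {"buckets": {}, "by_task_type": {}, "applied_missions": []}


def apply_mission(store: dict, mission_id: str, events: List[dict]) -> bool:
    """Fold one mission's admiralty decisions into the store.

    Returns False (and changes nothing) if the mission was already applied.
    """
    if mission_id in store["applied_missions"]:
        return False
    for event in events:
        if event.get("type") != "admiralty_action_completed":
            continue
        task_type = event.get("task_type")
        if task_type is None:
            continue  # legacy tasks with task_type=null contribute nothing
        decision = event["decision_type"]
        bucket = store["buckets"].setdefault((task_type, event["ship_class"]), {})
        bucket[decision] = bucket.get(decision, 0) + 1
        rollup = store["by_task_type"].setdefault(task_type, {})
        rollup[decision] = rollup.get(decision, 0) + 1
    store["applied_missions"].append(mission_id)
    return True


def rebuild(missions: Dict[str, List[dict]]) -> dict:
    """Reconstruct the store from every known mission, in a stable order."""
    store = empty_store()
    for mission_id in sorted(missions):
        apply_mission(store, mission_id, missions[mission_id])
    return store
```

Under these assumptions, running stand-down twice for the same mission leaves the counts unchanged, and a rebuild over the same missions yields an identical store.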

Replaces the binary admiralty-action-required model with an
override-learning trust calibration store. After each mission,
admiralty decisions (approved / modified / rejected) are aggregated
per (task_type, ship_class) bucket; at plan-approved time, stderr
advisories surface tasks whose history meets the sample threshold.
Advisory-only — no station_tier mutation, all schema additions
optional and backwards compatible.

Adds the admiralty-decision and trust-report subcommands, threads
an optional --task-type through the task command, and wires the
new memory store into stand-down (incremental) and index (rebuild).
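
As a rough picture of what the admiralty-decision subcommand records, the event could be assembled along these lines. The event name and the three fields (decision_type, task_type, ship_class) come from the description; the `type` key, the function name, and the validation behaviour are assumptions, so treat this as a sketch rather than the actual payload schema.

```python
from typing import Optional

# The three decision types named in the PR description.
VALID_DECISIONS = ("approved", "modified", "rejected")


def build_admiralty_event(
    decision_type: str, ship_class: str, task_type: Optional[str] = None
) -> dict:
    """Build a hypothetical admiralty_action_completed event record."""
    if decision_type not in VALID_DECISIONS:
        raise ValueError(
            f"decision_type must be one of {VALID_DECISIONS}, got {decision_type!r}"
        )
    return {
        "type": "admiralty_action_completed",  # assumed key for the event name
        "decision_type": decision_type,
        "task_type": task_type,  # stays None for missions without --task-type
        "ship_class": ship_class,
    }
```

Keeping task_type optional mirrors the backwards-compatibility note above: an event with task_type=None is still valid, it simply never feeds a calibration bucket.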