Feat calibration diagnostic #207
Tests pass locally but will not run remotely due to new restrictions on GitHub Actions. I will take a look and fix this separately.
1. `winnow train` – Performs confidence calibration on a dataset of annotated PSMs, outputting the fitted model checkpoint.
2. `winnow compute-features` – Computes and outputs the feature set for a dataset of PSMs.
3. `winnow predict` – Performs confidence calibration using a fitted model checkpoint (defaults to a pretrained general model from Hugging Face), estimates and controls FDR using the calibrated confidence scores.
4. `winnow diagnose-calibration` – On a labelled holdout, estimates tail calibration error (sTECE/TECE) at the FDR operating threshold and writes a reliability diagram.
Explain the terms sTECE/TECE here, or link to an explanation of them.
    iso=iso,
)
stece_empirical = signed_tail_ece_empirical(
    np.asarray(scores, dtype=float), labels_f, conf_cutoff
`run_calibration_diagnostic` passes the full-length `scores` together with the tail-filtered `labels_f` into `signed_tail_ece_empirical`, so `labels[mask]` raises `IndexError` whenever any PSM falls below `conf_cutoff`.
This affects any realistic input where `conf_cutoff` actually filters something. For example, with 1000 PSMs of which 200 are above the cutoff, `mask = scores >= conf_cutoff` inside `signed_tail_ece_empirical` has length 1000 but `labels_f` has length 200, so `labels_f[mask]` raises `IndexError: boolean index did not match indexed array along axis 0; size of axis is 200 but size of corresponding boolean axis is 1000`. The unit tests miss this because their fixtures (`uniform(conf_cutoff, 1.0)`) keep every score above the cutoff, so the lengths happen to match.
To reproduce:
import numpy as np
from winnow.calibration.diagnostics import run_calibration_diagnostic

# Realistic: scores spread above and below the cutoff
scores = np.linspace(0.1, 1.0, 1000)
labels = (np.random.default_rng(0).uniform(size=1000) < scores).astype(bool)
run_calibration_diagnostic(
    scores=scores, labels=labels, conf_cutoff=0.6,
    nominal_fdr=0.05, tolerance=0.005,
    label_source="sequence", label_column="correct",
    min_tail_psms=50,
)
# IndexError: boolean index did not match indexed array along axis 0;
# size of axis is 445 but size of corresponding boolean axis is 1000

Fix: pass the already-filtered tail arrays, not the full ones.
-     np.asarray(scores, dtype=float), labels_f, conf_cutoff
+     tail_scores, labels_f, conf_cutoff
And add a regression test where len(tail) < len(scores).
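A regression test of that shape might look like the sketch below. It does not import Winnow: `tail_gap` is a hypothetical stand-in that only reproduces the masking pattern described in this comment (one full-length mask applied to both arrays), and the test name is likewise an assumption. The real test would call `run_calibration_diagnostic` instead.

```python
import numpy as np


def tail_gap(scores, labels, conf_cutoff):
    """Hypothetical stand-in for signed_tail_ece_empirical.

    It builds a mask from `scores` and applies it to both arrays, which is
    the pattern that fails when the two arrays have different lengths.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = scores >= conf_cutoff
    return float(np.mean(scores[mask]) - np.mean(labels[mask]))


def test_scores_below_cutoff():
    # Scores spread above and below the cutoff, so the tail is a strict subset.
    scores = np.linspace(0.1, 1.0, 1000)
    labels = (np.random.default_rng(0).uniform(size=1000) < scores).astype(float)
    conf_cutoff = 0.6

    tail = scores >= conf_cutoff
    tail_scores, labels_f = scores[tail], labels[tail]
    assert len(tail_scores) < len(scores)

    # Buggy call: full-length scores with tail-filtered labels.
    try:
        tail_gap(scores, labels_f, conf_cutoff)
    except IndexError:
        raised = True
    else:
        raised = False
    assert raised

    # Fixed call: both arrays already restricted to the tail.
    gap = tail_gap(tail_scores, labels_f, conf_cutoff)
    assert np.isfinite(gap)
```

The key property is the `len(tail_scores) < len(scores)` precondition: a fixture that keeps every score above the cutoff cannot exercise the mismatch.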
if len(df) > 0 and isinstance(df["prediction"].iloc[0], str):
    df["prediction"] = df["prediction"].apply(metrics._split_peptide)

def _row_correct(row: pd.Series) -> bool:
For data_loader=mztab with diagnostics.label_source=sequence, Polars list columns reach pandas as numpy.ndarray values, but _row_correct() only accepts Python list. As a result, fully matching MZTab sequence/prediction pairs are marked incorrect and the calibration diagnostic is computed against all-false labels.
To reproduce:
import numpy as np
import pandas as pd
from winnow.calibration.diagnostics import compute_correct_from_sequence

masses = {"A": 71.037114, "G": 57.021464}
meta = pd.DataFrame({
    "sequence": [np.array(["A", "G"], dtype=object)],
    "prediction": [np.array(["A", "G"], dtype=object)],
})
print(compute_correct_from_sequence(meta, masses).tolist())

Current result:
[False]

Expected:
[True]

Suggested fix: normalize sequence-like values before the type check, or accept both list and np.ndarray.
def _as_token_list(value: object) -> list[str] | None:
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, list):
        return value
    return None

def _row_correct(row: pd.Series) -> bool:
    sequence = _as_token_list(row["sequence"])
    prediction = _as_token_list(row["prediction"])
    if sequence is None or prediction is None:
        return False
    num_matches = metrics._novor_match(sequence, prediction)
    return num_matches == len(sequence) == len(prediction)

And add a regression test with np.array(["A", "G"], dtype=object) for both sequence and prediction, asserting [True].
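The requested regression test could be sketched roughly as follows. To stay self-contained it does not import Winnow or pandas: `as_token_list` mirrors the normalizer suggested above, and `exact_match` is a hypothetical stand-in that uses plain token equality in place of the mass-aware `metrics._novor_match` comparison.

```python
import numpy as np


def as_token_list(value):
    """Return a Python list for ndarray or list input, otherwise None."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, list):
        return value
    return None


def exact_match(sequence, prediction):
    """Hypothetical stand-in for the row check: plain token equality."""
    seq = as_token_list(sequence)
    pred = as_token_list(prediction)
    if seq is None or pred is None:
        return False
    return seq == pred


tokens = np.array(["A", "G"], dtype=object)

# Root cause: an ndarray is not a list, so a list-only check rejects it.
assert not isinstance(tokens, list)

# After normalization, matching ndarray pairs are marked correct.
assert exact_match(tokens, tokens.copy())
# Mixed list/ndarray inputs compare the same way.
assert exact_match(["A", "G"], tokens)
# Genuine mismatches and unsupported types are still rejected.
assert not exact_match(tokens, np.array(["G", "A"], dtype=object))
assert not exact_match("AG", tokens)
```

In the real test the same `np.array(["A", "G"], dtype=object)` values would go into the `sequence` and `prediction` columns passed to `compute_correct_from_sequence`.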
Description
This PR adds a tail calibration diagnostic workflow for non-parametric FDR (`winnow diagnose-calibration`), and documentation. On a labelled holdout, Winnow calibrates scores, derives the operating confidence cutoff at the nominal FDR target, estimates sTECE and TECE on the tail via isotonic regression, writes a JSON report, and optionally saves a tail-only reliability diagram.
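To make the first two steps of that workflow concrete, here is a minimal sketch under stated assumptions. It is not Winnow's implementation. It assumes the FDR of an accepted set is estimated as the mean of `1 - confidence`, and it uses a crude signed gap (mean confidence minus observed accuracy on the tail) in place of the isotonic-regression estimate of sTECE. The function names `fdr_cutoff` and `signed_tail_gap` are hypothetical.

```python
def fdr_cutoff(confidences, nominal_fdr):
    """Lowest cutoff whose accepted set has estimated FDR <= nominal_fdr.

    Assumption: estimated FDR of the accepted set is mean(1 - confidence).
    Returns None if no cutoff satisfies the target.
    """
    ranked = sorted(confidences, reverse=True)
    cutoff = None
    running_error = 0.0
    for i, conf in enumerate(ranked):
        running_error += 1.0 - conf
        # Only evaluate at the end of a run of tied scores, because a
        # cutoff at `conf` accepts every PSM with that score.
        last_of_ties = i + 1 == len(ranked) or ranked[i + 1] != conf
        if last_of_ties and running_error / (i + 1) <= nominal_fdr:
            cutoff = conf
    return cutoff


def signed_tail_gap(confidences, labels, cutoff):
    """Mean confidence minus observed accuracy for PSMs at or above cutoff.

    Positive values mean the tail is overconfident under this crude measure.
    """
    tail = [(c, y) for c, y in zip(confidences, labels) if c >= cutoff]
    if not tail:
        raise ValueError("no PSMs at or above the cutoff")
    mean_conf = sum(c for c, _ in tail) / len(tail)
    accuracy = sum(1 for _, y in tail if y) / len(tail)
    return mean_conf - accuracy


confidences = [0.99, 0.98, 0.90, 0.60]
labels = [True, True, False, False]
cutoff = fdr_cutoff(confidences, nominal_fdr=0.05)
gap = signed_tail_gap(confidences, labels, cutoff)
```

Both functions filter scores and labels through the same pairing, so the tail arrays always have matching lengths.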