Feat calibration diagnostic #207

Open
JemmaLDaniel wants to merge 4 commits into main from feat-calibration-diagnostic

@JemmaLDaniel
Collaborator

Description

This PR adds a tail calibration diagnostic workflow for non-parametric FDR (winnow diagnose-calibration), together with its documentation. On a labelled holdout, Winnow calibrates scores, derives the operating confidence cutoff at the nominal FDR target, estimates sTECE and TECE on the tail via isotonic regression, writes a JSON report, and optionally saves a tail-only reliability diagram.
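
As an illustration of the "operating confidence cutoff at the nominal FDR target" step, here is a minimal sketch. It assumes that the non-parametric FDR estimate for an accepted set is the mean of (1 - calibrated confidence) over that set. The function name `cutoff_at_fdr` is hypothetical and is not part of Winnow's API; the PR's actual implementation may differ.

```python
from typing import Optional

import numpy as np


def cutoff_at_fdr(confidences: np.ndarray, nominal_fdr: float) -> Optional[float]:
    """Lowest confidence whose accepted set keeps the estimated FDR <= nominal_fdr.

    Assumption: estimated FDR of the top-k PSMs is mean(1 - confidence) over them.
    """
    # Sort from most to least confident, so each prefix is a candidate accepted set.
    conf = np.sort(np.asarray(confidences, dtype=float))[::-1]
    # Running mean of the assumed error probabilities (non-decreasing by construction).
    running_fdr = np.cumsum(1.0 - conf) / np.arange(1, conf.size + 1)
    ok = np.nonzero(running_fdr <= nominal_fdr)[0]
    if ok.size == 0:
        return None  # no prefix meets the target
    return float(conf[ok[-1]])
```

PSMs with confidence at or above the returned value form the tail that the diagnostic then examines.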

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek May 24, 2026 15:31
@JemmaLDaniel JemmaLDaniel self-assigned this May 24, 2026
@JemmaLDaniel JemmaLDaniel added the enhancement New feature or request label May 24, 2026
@JemmaLDaniel
Collaborator Author

JemmaLDaniel commented May 24, 2026

Tests pass locally but will not run remotely due to new restrictions on GitHub Actions. I will take a look and fix this separately.

Comment thread README.md
1. `winnow train` – Performs confidence calibration on a dataset of annotated PSMs, outputting the fitted model checkpoint.
2. `winnow compute-features` – Computes and outputs the feature set for a dataset of PSMs.
3. `winnow predict` – Performs confidence calibration using a fitted model checkpoint (defaults to a pretrained general model from Hugging Face), estimates and controls FDR using the calibrated confidence scores.
4. `winnow diagnose-calibration` – On a labelled holdout, estimates tail calibration error (sTECE/TECE) at the FDR operating threshold and writes a reliability diagram.
Contributor

Please explain the terms sTECE and TECE here, or link to an explanation of them.
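
For context, one plausible reading of the two terms is sketched below. This is an assumption for illustration only: it takes sTECE to be a signed tail calibration error (mean confidence minus observed accuracy over PSMs at or above the cutoff, so positive means overconfident) and TECE to be its unsigned, binned counterpart. The function names are hypothetical, and the PR's own estimators use isotonic regression and may be defined differently.

```python
import numpy as np


def signed_tail_ece(scores, labels, conf_cutoff):
    """Assumed sTECE: mean confidence minus accuracy over the tail (score >= cutoff)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = scores >= conf_cutoff
    if not mask.any():
        return float("nan")
    return float(scores[mask].mean() - labels[mask].mean())


def tail_ece_binned(scores, labels, conf_cutoff, n_bins=5):
    """Assumed TECE: count-weighted mean |confidence - accuracy| over equal-width tail bins."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = scores >= conf_cutoff
    s, y = scores[mask], labels[mask]
    if s.size == 0:
        return float("nan")
    edges = np.linspace(conf_cutoff, 1.0, n_bins + 1)
    # Digitizing against the inner edges yields bin ids in 0..n_bins-1.
    bin_ids = np.digitize(s, edges[1:-1])
    total = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(s[in_bin].mean() - y[in_bin].mean())
            total += in_bin.sum() / s.size * gap
    return float(total)
```

Under these assumed definitions the unsigned value is never smaller than the absolute signed value, since over- and under-confident bins cannot cancel.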

iso=iso,
)
stece_empirical = signed_tail_ece_empirical(
np.asarray(scores, dtype=float), labels_f, conf_cutoff
Contributor

run_calibration_diagnostic passes full-length scores together with tail-filtered labels_f into signed_tail_ece_empirical, so labels[mask] raises IndexError whenever any PSM falls below conf_cutoff.

This affects any realistic input where conf_cutoff actually filters something. For example, with 1000 PSMs of which 200 are above the cutoff, mask = scores >= conf_cutoff inside signed_tail_ece_empirical has length 1000, but labels_f has length 200, so labels_f[mask] raises IndexError: boolean index did not match indexed array along axis 0; size of axis is 200 but size of corresponding boolean axis is 1000. The unit tests miss this because their fixtures (uniform(conf_cutoff, 1.0)) keep every score above the cutoff, so the lengths happen to match.

To reproduce:

import numpy as np
from winnow.calibration.diagnostics import run_calibration_diagnostic

# Realistic: scores spread above and below the cutoff
scores = np.linspace(0.1, 1.0, 1000)
labels = (np.random.default_rng(0).uniform(size=1000) < scores).astype(bool)

run_calibration_diagnostic(
    scores=scores, labels=labels, conf_cutoff=0.6,
    nominal_fdr=0.05, tolerance=0.005,
    label_source="sequence", label_column="correct",
    min_tail_psms=50,
)
# IndexError: boolean index did not match indexed array along axis 0;
# size of axis is 445 but size of corresponding boolean axis is 1000

Fix: pass the already-filtered tail arrays, not the full ones.

Suggested change

- np.asarray(scores, dtype=float), labels_f, conf_cutoff
+ tail_scores, labels_f, conf_cutoff

And add a regression test where len(tail) < len(scores).
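
A rough sketch of what such a regression test could look like. Since the real signature of signed_tail_ece_empirical is not shown here, the test uses a hypothetical stand-in (`signed_tail_ece_stub`) that masks scores and labels the same way the comment above describes; swap in the real function when adapting it.

```python
import numpy as np


def signed_tail_ece_stub(scores, labels, conf_cutoff):
    """Hypothetical stand-in: applies one boolean mask to both scores and labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = scores >= conf_cutoff
    return float(scores[mask].mean() - labels[mask].mean())


def test_tail_shorter_than_scores():
    rng = np.random.default_rng(0)
    scores = np.linspace(0.1, 1.0, 1000)
    labels = (rng.uniform(size=1000) < scores).astype(float)
    conf_cutoff = 0.6

    tail_mask = scores >= conf_cutoff
    tail_scores, labels_f = scores[tail_mask], labels[tail_mask]
    # The case the existing fixtures never exercise: some PSMs fall below the cutoff.
    assert len(tail_scores) < len(scores)

    # Buggy call shape: full-length scores with tail-only labels.
    try:
        signed_tail_ece_stub(scores, labels_f, conf_cutoff)
    except IndexError:
        pass
    else:
        raise AssertionError("expected IndexError for mismatched lengths")

    # Fixed call shape: both arrays already restricted to the tail.
    value = signed_tail_ece_stub(tail_scores, labels_f, conf_cutoff)
    assert np.isfinite(value)
```

The key property is that the fixture spreads scores on both sides of conf_cutoff, so a length mismatch cannot pass by accident.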

if len(df) > 0 and isinstance(df["prediction"].iloc[0], str):
df["prediction"] = df["prediction"].apply(metrics._split_peptide)

def _row_correct(row: pd.Series) -> bool:
Contributor

For data_loader=mztab with diagnostics.label_source=sequence, Polars list columns reach pandas as numpy.ndarray values, but _row_correct() only accepts Python list. As a result, fully matching MZTab sequence/prediction pairs are marked incorrect and the calibration diagnostic is computed against all-false labels.

To reproduce:

import numpy as np
import pandas as pd
from winnow.calibration.diagnostics import compute_correct_from_sequence

masses = {"A": 71.037114, "G": 57.021464}

meta = pd.DataFrame({
    "sequence": [np.array(["A", "G"], dtype=object)],
    "prediction": [np.array(["A", "G"], dtype=object)],
})

print(compute_correct_from_sequence(meta, masses).tolist())

Current result:

[False]

Expected:

[True]

Suggested fix: normalize sequence-like values before the type check, or accept both list and np.ndarray.

def _as_token_list(value: object) -> list[str] | None:
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, list):
        return value
    return None

def _row_correct(row: pd.Series) -> bool:
    sequence = _as_token_list(row["sequence"])
    prediction = _as_token_list(row["prediction"])
    if sequence is None or prediction is None:
        return False

    num_matches = metrics._novor_match(sequence, prediction)
    return num_matches == len(sequence) == len(prediction)

And add a regression test with np.array(["A", "G"], dtype=object) for both sequence and prediction, asserting [True].
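
A self-contained sketch of that regression test follows. It does not import Winnow: `as_token_list` mirrors the helper suggested above, and `row_correct` is a hypothetical stand-in that uses plain token equality in place of metrics._novor_match, so it only checks the type handling, not the mass-based matching.

```python
import numpy as np


def as_token_list(value):
    """Normalize list-like token containers to a Python list, else None."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, list):
        return value
    return None


def row_correct(sequence, prediction):
    """Stand-in for _row_correct: exact token equality (assumption, not _novor_match)."""
    seq = as_token_list(sequence)
    pred = as_token_list(prediction)
    if seq is None or pred is None:
        return False
    return seq == pred


def test_ndarray_tokens_marked_correct():
    arr = np.array(["A", "G"], dtype=object)
    # ndarray values are not lists, which is why a list-only check rejects them.
    assert not isinstance(arr, list)
    assert row_correct(arr, arr.copy()) is True
    # Mixed list / ndarray inputs should also compare equal.
    assert row_correct(["A", "G"], arr) is True
    # A genuine mismatch must still be reported as incorrect.
    assert row_correct(arr, np.array(["A", "A"], dtype=object)) is False
    # Unsplit strings are not token lists and stay incorrect.
    assert row_correct("AG", arr) is False
```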
