feat: Enable prediction-map generation for final linear classifier test runs#14
Conversation
Use the `label` and `fold` columns produced by the upstream k-fold split instead of deriving labels from coverage columns and randomly splitting val. Memory-mapped via HuggingFace datasets so the full embedding parquet no longer has to fit in numpy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ddings Datamodule downloads embeddings + kfold artifacts from MLflow, joins on (slide_id, x, y) via pyarrow, applies class mapping, tissue/class coverage filters, and exposes per-fold splits via set_val_fold(). Training script loops folds in a single run and logs per-fold + aggregate metrics. Probe adds per-class F1, confusion matrix figures, optional input L2-norm and class weights. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The experiment file was declaring /class_mapping as a fresh default while configs/ml/linear_probe.yaml already had one, which Hydra rejects as a duplicate. Mark it as an override so the experiment replaces the base default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ml/train.py uses @with_cli_args(["+ml=linear_probe"]), so the decorator already injects that arg. Passing it again on the command line caused Hydra to load configs/ml/linear_probe.yaml twice and reject duplicate defaults. Rely on the decorator and pass only +experiment=... Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng refs
Two interpolation problems prevented Hydra from resolving the linear-probe
config:
1. configs/ml.yaml uses ${random_seed:} and configs/ml/linear_probe.yaml
uses ${len:...}, but neither resolver is registered anywhere. Register
both at module import time in ml/train.py.
2. The class_mapping yamls use # @Package _global_, so class_mapping,
class_indices, and class_names land at the config root. The references
in linear_probe.yaml were doubly nested (e.g. class_mapping.class_mapping).
Drop the prefix.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The filtered tiles parquet collapses ROI columns at tiling time, so
kfold writes canonical names ("Epithelium", etc.) directly into `label`.
The raw→canonical lookup built from the BB-suffixed YAML lists matched
none of these and dropped the entire 1.1M-tile dataset under
drop_unmapped=True.
Extend _raw_to_canonical with identity entries for every canonical class
so modern parquets pass through while legacy un-collapsed labels still
collapse correctly. "background" stays unmapped → dropped, as intended.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add EmbeddingsDataModule.compute_class_weights("balanced"|"inverse")
using sklearn-style weights from the current train fold.
- train.py resolves class_weights="balanced"/"inverse" via the
datamodule and passes the resulting list to LinearProbe at instantiate
time (per-fold, since splits change).
- Bump class_coverage_min from 0.0 to 0.5 to drop mosaic tiles.
- Drop the redundant /class_mapping default from configs/ml/linear_probe.yaml;
experiment files now own the choice.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extract derive_labels logic to shared preprocessing/_labels.py, then use it in both split/kfold_split.py and the new embedding_dataset pipeline. The new pipeline joins k-fold (train) / filter_tiles (test) tile metadata with precomputed embeddings after applying tissue + per-dominant-class ROI thresholds, and emits a SlidesTilesLoader-compatible Parquet dataset as an MLflow artifact. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Joining 1M+ rows of list<double> embeddings was either OOMing on to_pandas() or hitting int32 list-offset overflow inside take(). The fix: - read embeddings into Arrow only and cast each chunk to large_list so take() concatenation uses int64 offsets; - run the join on keys plus a synthetic row index because Acero refuses list columns in non-key fields, then pull embeddings via take(); - combine_chunks() before take() for an O(N) single-pass copy; - write the parquet straight from Arrow, never materialising the embedding column in pandas. Also bumps the kube job memory to 64Gi to give the combined-chunks + take() peak some headroom, and trims the verbose [timing] prints down to one progress line per split. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this guard a malformed train artifact would crash deep inside apply_thresholds with a confusing KeyError. Surface a clear error that points at the expected upstream artifact instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
setup(stage="fit") replaces criterion with class-weighted CrossEntropyLoss, adding a criterion.weight buffer that gets saved to checkpoints. At test, Lightning restores the checkpoint before setup() runs, so the model still has the unweighted criterion from __init__ and strict load fails with "Unexpected key(s) in state_dict: criterion.weight". Affected both adamw and lbfgs test runs. Initialize criterion with a placeholder ones-weight sized num_classes so the criterion.weight key always exists; setup(fit) still overrides it with the real class-balanced weights. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
configs/ml/data/final_embedding_tiles.yaml (1)
3-8: 💤 Low valueConsider adding
eval_batch_sizefor evaluation-phase batching.According to the review stack context, the
DataModulenow accepts an optionaleval_batch_sizeparameter for separate evaluation-phase batching. While the code will default to usingtrain_batch_sizefor evaluation, specifying a largereval_batch_sizecould improve memory efficiency during validation and testing since gradients are not computed.💡 Suggested addition
data: train_batch_size: 1024 + eval_batch_size: 2048 num_workers: 4 train_shuffle: true train_drop_last: false🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@configs/ml/data/final_embedding_tiles.yaml` around lines 3 - 8, Add an eval_batch_size field under the top-level data mapping to allow the DataModule to use a separate, larger batch size during evaluation (instead of defaulting to train_batch_size); update the YAML key "eval_batch_size" (alongside "train_batch_size", "num_workers", etc.) and set a value appropriate for validation/testing (e.g., a larger number to improve memory efficiency when gradients are off).configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml (1)
10-13: ⚡ Quick winVerify that
weight_decay: 0.0is intentional for AdamW.AdamW's primary feature is decoupled weight decay for better regularization. Setting
weight_decayto0.0disables this, which may be suboptimal for generalization. Other AdamW configs in this PR useweight_decay: 1.0e-3.If this is intentional for testing an unregularized baseline, consider documenting it. Otherwise, consider using a non-zero weight decay value.
💡 Suggested fix if weight decay is desired
model: optimizer: adamw learning_rate: 1.0e-4 - weight_decay: 0.0 + weight_decay: 1.0e-3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml` around lines 10 - 13, The YAML sets model.optimizer to "adamw" but model.weight_decay is 0.0 which disables AdamW's decoupled regularization; either change model.weight_decay to a non-zero value (e.g., 1.0e-3) to match other AdamW configs or add a comment/field documenting that 0.0 is intentional for an unregularized baseline; update the config entry for model.weight_decay and/or add a nearby comment explaining the rationale so reviewers know this is deliberate.ml/data/datasets/embedding_tiles.py (2)
200-201: 💤 Low valueGuard against empty first chunk when computing embedding dimension.
If the first chunk in
emb_colhappens to be empty (possible with certain partitioned parquet files), this division will raiseZeroDivisionError. Consider finding the first non-empty chunk or using a defensive check.🛡️ Suggested defensive handling
- first_chunk = emb_col.chunks[0] - embedding_dim = len(first_chunk.values) // len(first_chunk) + for chunk in emb_col.chunks: + if len(chunk) > 0: + embedding_dim = len(chunk.values) // len(chunk) + break + else: + raise RuntimeError("embedding column has no data")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml/data/datasets/embedding_tiles.py` around lines 200 - 201, The code assumes emb_col.chunks[0] is non-empty when computing embedding_dim, which can raise ZeroDivisionError; modify the logic in embedding_tiles.py to locate the first non-empty chunk (iterate emb_col.chunks until len(chunk) > 0) or check len(first_chunk) before dividing, and if no non-empty chunk exists raise a clear ValueError or return a sensible default; update the computation of embedding_dim (using the found non-empty chunk) so embedding_dim = len(chunk.values) // len(chunk) is only executed on a non-empty chunk.
252-258: ⚡ Quick winUnreachable column validation due to
read_parquetbehavior.The check at lines 255-258 is dead code. When
tissue_columndoesn't exist in the parquet file,pd.read_parquet(..., columns=[..., tissue_column])raises an error before reaching your custom validation. Users will see a less descriptive pyarrow error instead of your helpful message.♻️ Suggested fix: read all columns first, then validate
def _filter_metadata( metadata_uri: str | Path, tissue_column: str, tissue_min: float, ) -> pd.DataFrame: local = _resolve_uri(metadata_uri) - columns = ["slide_id", "x", "y", tissue_column] - df = pd.read_parquet(local, columns=columns) + required_cols = ["slide_id", "x", "y"] + df = pd.read_parquet(local) if tissue_column not in df.columns: raise ValueError( f"metadata parquet has no {tissue_column!r} column; cannot filter" ) - df = df.loc[df[tissue_column] > tissue_min, ["slide_id", "x", "y"]] + df = df.loc[df[tissue_column] > tissue_min, required_cols]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml/data/datasets/embedding_tiles.py` around lines 252 - 258, The explicit check for tissue_column is unreachable because pd.read_parquet(..., columns=[..., tissue_column]) will raise beforehand; update embedding_tiles.py to first load the parquet without a columns projection (or catch the pyarrow error), validate that tissue_column exists on the resulting df (or inspect df.columns), then select/assign columns = ["slide_id","x","y", tissue_column] and subset df accordingly; refer to _resolve_uri(metadata_uri), tissue_column, and the df variable when making the change so the ValueError with your custom message is raised from your own validation rather than from pyarrow.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ml/data/data_module.py`:
- Line 29: Replace the truthy fallback with an explicit None check for the eval
batch-size: in the data module where self.eval_batch_size is set (use the
constructor parameter names eval_batch_size and train_batch_size), change the
expression self.eval_batch_size = eval_batch_size or train_batch_size to use a
conditional that only falls back when eval_batch_size is None (e.g.,
self.eval_batch_size = train_batch_size if eval_batch_size is None else
eval_batch_size) so falsey but invalid values like 0 are not silently masked.
---
Nitpick comments:
In `@configs/experiment/ml/linear_classifier_adamw_stratified_kfold.yaml`:
- Around line 10-13: The YAML sets model.optimizer to "adamw" but
model.weight_decay is 0.0 which disables AdamW's decoupled regularization;
either change model.weight_decay to a non-zero value (e.g., 1.0e-3) to match
other AdamW configs or add a comment/field documenting that 0.0 is intentional
for an unregularized baseline; update the config entry for model.weight_decay
and/or add a nearby comment explaining the rationale so reviewers know this is
deliberate.
In `@configs/ml/data/final_embedding_tiles.yaml`:
- Around line 3-8: Add an eval_batch_size field under the top-level data mapping
to allow the DataModule to use a separate, larger batch size during evaluation
(instead of defaulting to train_batch_size); update the YAML key
"eval_batch_size" (alongside "train_batch_size", "num_workers", etc.) and set a
value appropriate for validation/testing (e.g., a larger number to improve
memory efficiency when gradients are off).
In `@ml/data/datasets/embedding_tiles.py`:
- Around line 200-201: The code assumes emb_col.chunks[0] is non-empty when
computing embedding_dim, which can raise ZeroDivisionError; modify the logic in
embedding_tiles.py to locate the first non-empty chunk (iterate emb_col.chunks
until len(chunk) > 0) or check len(first_chunk) before dividing, and if no
non-empty chunk exists raise a clear ValueError or return a sensible default;
update the computation of embedding_dim (using the found non-empty chunk) so
embedding_dim = len(chunk.values) // len(chunk) is only executed on a non-empty
chunk.
- Around line 252-258: The explicit check for tissue_column is unreachable
because pd.read_parquet(..., columns=[..., tissue_column]) will raise
beforehand; update embedding_tiles.py to first load the parquet without a
columns projection (or catch the pyarrow error), validate that tissue_column
exists on the resulting df (or inspect df.columns), then select/assign columns =
["slide_id","x","y", tissue_column] and subset df accordingly; refer to
_resolve_uri(metadata_uri), tissue_column, and the df variable when making the
change so the ValueError with your custom message is raised from your own
validation rather than from pyarrow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 22daf217-348c-4bd3-aa28-fa9c2b3c3a8b
📒 Files selected for processing (19)
configs/experiment/ml/linear_classifier_adamw_stratified_group_kfold.yamlconfigs/experiment/ml/linear_classifier_adamw_stratified_kfold.yamlconfigs/experiment/ml/linear_classifier_final_lbfgs.yamlconfigs/experiment/ml/linear_classifier_lbfgs_stratified_group_kfold.yamlconfigs/experiment/ml/linear_classifier_lbfgs_stratified_kfold.yamlconfigs/experiment/ml/linear_classifier_predict_tissue_tiles.yamlconfigs/experiment/ml/linear_classifier_test_adamw.yamlconfigs/experiment/ml/linear_classifier_test_lbfgs.yamlconfigs/ml.yamlconfigs/ml/data/final_embedding_tiles.yamlconfigs/ml/data/kfold_embedding_tiles.yamlconfigs/ml/task/final_linear_classifier.yamlconfigs/ml/task/kfold_linear_classifier.yamlml/__main__.pyml/callbacks/tiff_prediction_map_writer.pyml/data/data_module.pyml/data/datasets/embedding_tiles.pyml/meta_arch.pyscripts/submit_train_linear_probe.py
✅ Files skipped from review due to trivial changes (4)
- configs/ml/data/kfold_embedding_tiles.yaml
- configs/ml.yaml
- configs/experiment/ml/linear_classifier_lbfgs_stratified_group_kfold.yaml
- configs/experiment/ml/linear_classifier_adamw_stratified_group_kfold.yaml
🚧 Files skipped from review as they are similar to previous changes (7)
- configs/experiment/ml/linear_classifier_final_lbfgs.yaml
- configs/experiment/ml/linear_classifier_test_lbfgs.yaml
- configs/experiment/ml/linear_classifier_predict_tissue_tiles.yaml
- configs/ml/task/final_linear_classifier.yaml
- configs/experiment/ml/linear_classifier_test_adamw.yaml
- ml/main.py
- ml/callbacks/tiff_prediction_map_writer.py
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
scripts/submit_test_linear.py (1)
12-12: ⚡ Quick winAvoid hardcoding a feature branch in submit script.
Line 12 ties runtime to
feature/ml-test-mode; if that branch is deleted/renamed, submissions fail. Prefer a stable ref (e.g.,master) or parameterize the ref.Suggested change
- "git clone --branch feature/ml-test-mode https://github.com/RationAI/tissue-classification.git workdir", + "git clone --branch master https://github.com/RationAI/tissue-classification.git workdir",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/submit_test_linear.py` at line 12, The git clone command in scripts/submit_test_linear.py hardcodes the feature branch "feature/ml-test-mode", causing failures if that branch is removed; change it to use a stable ref or a parameterized variable (e.g., use a default BRANCH="master" or accept BRANCH from env/args) and interpolate that variable into the clone string instead of the literal "feature/ml-test-mode" so the clone command becomes dynamic and resilient.submit_report.py (1)
17-17: ⚡ Quick winPin the report dependency to an immutable revision.
Installing from a mutable feature branch makes the job non-reproducible and fragile. Please pin this to a tag or commit SHA so reruns keep using the same reporter code.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@submit_report.py` at line 17, The dependency string currently installs the report package from a mutable feature branch ("git+ssh://git@gitlab.ics.muni.cz/.../report.git@feature/force-wsi-service-protocol"); change that to pin to an immutable revision by replacing the branch ref with a tag or commit SHA (e.g., use ".../report.git@vX.Y.Z" or ".../report.git@<commit-sha>") in the list where the string appears in submit_report.py so reruns use the exact same reporter code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ml/meta_arch.py`:
- Around line 368-370: The artifact_file passed to mlflow.log_table is using a
.parquet extension which violates mlflow.log_table's JSON contract; update the
call to mlflow.log_table (the mlflow.log_table invocation that writes
"per_slide/test_tile_accuracy.parquet") to use
"per_slide/test_tile_accuracy.json" instead, and if you actually need Parquet
output, write the DataFrame to a .parquet via DataFrame.to_parquet and use
mlflow.log_artifact(s) to upload it rather than mlflow.log_table.
In `@submit_report.py`:
- Around line 6-21: The module currently calls submit_job(...) at import time
(submit_job with job_name, username, cpu, memory, gpu, public, script, storage),
causing side effects; wrap that logic into a function (e.g., def main() or
submit_report()) and move the submit_job(...) invocation inside it, then call
that function only under a main guard (if __name__ == "__main__": main()),
keeping the same submit_job parameters and behavior so imports no longer trigger
job submission.
---
Nitpick comments:
In `@scripts/submit_test_linear.py`:
- Line 12: The git clone command in scripts/submit_test_linear.py hardcodes the
feature branch "feature/ml-test-mode", causing failures if that branch is
removed; change it to use a stable ref or a parameterized variable (e.g., use a
default BRANCH="master" or accept BRANCH from env/args) and interpolate that
variable into the clone string instead of the literal "feature/ml-test-mode" so
the clone command becomes dynamic and resilient.
In `@submit_report.py`:
- Line 17: The dependency string currently installs the report package from a
mutable feature branch
("git+ssh://git@gitlab.ics.muni.cz/.../report.git@feature/force-wsi-service-protocol");
change that to pin to an immutable revision by replacing the branch ref with a
tag or commit SHA (e.g., use ".../report.git@vX.Y.Z" or
".../report.git@<commit-sha>") in the list where the string appears in
submit_report.py so reruns use the exact same reporter code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9a00defe-f089-426a-b475-4e2610e87941
📒 Files selected for processing (3)
ml/meta_arch.pyscripts/submit_test_linear.pysubmit_report.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/submit_test_linear.py`:
- Line 6: The submission script uses a hardcoded username ("vcifka") instead of
the repository convention of using the Ellipsis placeholder; update the
parameter assignment for username in scripts/submit_test_linear.py (the
username= entry) to use the placeholder username=... so it matches the existing
pattern (e.g., +experiment=...) used elsewhere.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5c16cc67-14d5-42a5-b99b-e46cbfe76b84
📒 Files selected for processing (3)
ml/callbacks/tiff_prediction_map_writer.pyml/meta_arch.pyscripts/submit_test_linear.py
🚧 Files skipped from review as they are similar to previous changes (2)
- ml/meta_arch.py
- ml/callbacks/tiff_prediction_map_writer.py
Clear the batch buffer only on rank!=0 or after a successful write so the on_test_end fallback no longer hits an always-empty buffer. Add diagnostic prints to the silent early-return guards and an idempotency flag so the two write hooks cooperate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This PR enables TIFF prediction-map generation for the final linear classifier test configs and cleans up the related ML config structure.
Changes include:
prediction_maps_tiff/pred/
prediction_maps_tiff/prob//
Summary by CodeRabbit
New Features
Enhancements